Reproducible Research with R Markdown

R
Research
Authoring
Reproducibility
Tutorial
Author

Simisani Ndaba

Published

July 9, 2022

meetup flyer

Meetup Description

Research is considered reproducible when its exact results can be recreated given access to the original data, software, and code. Reproducible research is sometimes known as reproducibility, reproducible statistical analysis, reproducible data analysis, reproducible reporting, or literate programming.

Note

This meetup does not have a recording. The tutorial can be followed using the available links at the bottom of the page.

About the Speaker


Jenine Harris is a Professor teaching biostatistics courses in the public health program at the George Warren Brown School of Social Work at Washington University in St. Louis in Missouri, United States. Her recent research interests focus on increasing diversity in data science and improving the quality of research in public health by using reproducible research practices throughout the research process. Jenine’s 2020 book, Statistics with R: Solving Problems Using Real-World Data, is an introductory statistics textbook published by Sage. The book combines statistical concepts with R coding and includes representation of women throughout as both characters and as authors of resources. Each chapter addresses a different social problem using a real-world data set.

She is also the co-founder and current organizer of R-Ladies St. Louis, a local chapter of the R-Ladies Global organization, whose mission is to increase gender diversity in the R community.

Contact Speaker


Reproducible Research with R Markdown

The first thing to understand is reproducibility.


According to Harris et al. (2018), analyses are reproducible when analyzing the same data with the same methods produces the same results.

  • This is different from:

    • Repeatability, which is the ability to conduct the same analysis on the same data (regardless of results), and

    • Replicability, which is the ability to collect new data, use the same methods, and get the same results.

You’re preparing your research for publication, and the temptation may be to focus on the results and discussion sections of your paper; after all, that’s what will make the biggest splash! But consider how to use publication to make your work reproducible, so that other researchers can successfully recreate your results using your data, code, and methods. (Reproducing the results of a study is a bit different from replicating a study, where another researcher uses your methods and your code but collects or generates a new data set. Both replication and reproduction are ways another researcher may try to verify the results of a published study. For more on reproducibility versus replicability, see “A Statistical Definition for Reproducibility and Replicability” by Patil et al. (2016).)

By making your work reproducible, you:

  • Increase the usefulness of your research by enabling others to easily build on your results, and re-use your research materials

  • Ensure validity and trust in your results, and help to support the validity of future studies that are based on your work

  • Increase accuracy, trust, and confidence in your field broadly.

Publishing studies that can be reproduced or replicated may seem like a no-brainer. But it’s not an inevitable outcome of every publication. In 2012, cancer researchers Begley and Ellis published a comment in the journal Nature, called “Drug development: Raise standards for preclinical cancer research.” The article describes a crisis in the quality of scientific literature in cancer research. Working over a period of 10 years, Begley and his team at Amgen labs attempted to replicate the results of 53 known “landmark” studies in the field, but were only able to confirm results in 6 of those studies (11%).

Some of these non-replicable studies had resulted in hundreds of secondary publications, building on unconfirmed results and likely leading to the development and eventually the testing of ineffective drugs in cancer patients. Certainly, drug development is a complex problem, with models and technologies that are challenging to work with. But the intense pressure to publish early and often can result in the submission of studies without the level of documentation that allows for either reproduction or replication of results, and doesn’t tell the full story of the research. A glance at the website Retraction Watch, a project of the Center for Scientific Integrity, shows that the problem of publishing unverifiable results isn’t confined to oncology research. For one perspective on how this plays out in different fields, see Roger Peng’s blog post “A Simple Explanation for the Replication Crisis in Science”.

In their comment, Begley and Ellis call for more rigorous documentation practices, such as the inclusion of all experimental methods and data from all trials of a given drug in a published paper about that drug, not just the few trials that succeeded. A truly reproducible study should contain a complete narrative of the research and include well-documented methods, code, and data.

There are a number of tools and practices that can help you tell a coherent research story, without gaps or fuzzy areas. See biostatistician Karl Broman’s terrific tutorial, “Initial Steps Toward Reproducible Research” for more on how you might get started. Another great resource is anthropologist Ben Marwick’s presentation “Reproducible Research: A View from the Social Sciences.” As mentioned in the introduction to this section, you don’t have to adopt every best practice in reproducibility at once! Find the ones that seem most promising for your work, and give them a try.

How big is the reproducibility problem?


Tables were mislabeled in three of six reproduced public health studies

Why does it matter?


  • Errors and omissions can threaten the foundation of research that practitioners and policy makers use to make decisions
  • Poor-quality research can also threaten human health: over 400,000 people have been enrolled, and over 70,000 treated, in medical studies that were later retracted
  • Poor quality and retracted research continue to be cited, influencing additional science

How can I make my research reproducible?


  • Two things are needed to reproduce research:

    • The data

    • The statistical code or very detailed methods instructions

Literate programming as a reproducibility tool


  • Literate programming integrates text with code and results in a single document

R Markdown is one literate programming option. It can:

  • integrate code, text, and results in a single document

  • automate references

  • embed the data itself in some output formats

  • produce highly flexible output formats
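As a sketch of what this looks like in practice, here is a minimal R Markdown document (the title, chunk name, and chunk contents are illustrative, not from the tutorial) that interleaves narrative text with an executable code chunk; knitting it produces a report with the code’s output embedded inline:

````markdown
---
title: "A Minimal Reproducible Report"
output: html_document
---

Narrative text goes here, explaining the analysis in plain language.

```{r summary-stats}
# This chunk runs when the document is knitted, and its output
# appears in the rendered report alongside the text above.
data <- mtcars   # a data set built into R, used here as a stand-in
summary(data$mpg)
```
````

Because the text, code, and results live in one source file, re-knitting the document after the data change regenerates every number and figure consistently.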

An example of a document developed with R Markdown uses the janitor package.

Instructions on how to install the R Markdown packages and how to download and save the materials can be found here.
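If you are setting up from scratch, installation is typically a single call in the R console (the package names below are the ones this tutorial mentions; knitr is assumed as the engine rmarkdown uses to run code chunks):

```r
# Install the packages used in this tutorial from CRAN
install.packages(c("rmarkdown", "knitr", "janitor"))
```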

Renv package

renv package

The renv package helps you create reproducible environments for your R projects. The renv package vignette introduces you to the basic nouns and verbs of renv, like the user and project libraries, and key functions such as renv::init(), renv::snapshot(), and renv::restore().

You’ll also learn about some of the infrastructure that makes renv tick, some problems that renv doesn’t help with, and how to uninstall it if you no longer want to use it.
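The basic workflow built from those three functions can be sketched as follows (a minimal outline under typical usage, not a substitute for the vignette):

```r
# Minimal renv workflow for a project
renv::init()        # create a project-local library and an renv.lock file

install.packages("janitor")   # install packages as usual; renv tracks them
                              # in the project library

renv::snapshot()    # record the exact package versions in renv.lock

# Later, on another machine or in a fresh copy of the project:
renv::restore()     # reinstall exactly the recorded versions
```

Committing renv.lock alongside your code lets collaborators recreate the same package environment you used for the analysis.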

More documentation on renv can be found here.

Read Posit’s blog post on the release of renv 1.0.0 here.

Resources


  • Follow a tutorial on R Markdown here. Analyze. Share. Reproduce. Your data tells a story. Tell it with R Markdown. Turn your analyses into high-quality documents, reports, presentations, and dashboards.

  • Follow the R Markdown cheat sheet to easily get functions to create a reproducible document.

  • Longer workshop on reproducible R Markdown by Mine Cetinkaya-Rundel

References

Harris JK, Johnson KJ, Carothers BJ, Combs TB, Luke DA, Wang X (2018) Use of reproducible research practices in public health: A survey of public health analysts. PLoS ONE 13(9): e0202447. https://doi.org/10.1371/journal.pone.0202447

Patil, P., Peng, R. D., & Leek, J. T. (2016). A statistical definition for reproducibility and replicability. BioRxiv, 066803.