Using R as an Epidemiologist Assessment of real-world data

R
Author

Simisani Ndaba

Published

March 12, 2022

meetup flyer

Meetup Details

R-Ladies Dammam and R-Ladies Gaborone co-host this talk describing how Brianna Lindsay PhD uses and has used R as an epidemiologist in industry and academia. The focus of the talk will review how she approaches a dataset, important lessons learned about ‘dirty’ real-world data and the basics of data exploration using R and R Studio.

About Speaker


Brianna Lindsay MPH PhD, is an infectious disease epidemiologist based full-time in Zambia. She serves as primary epidemiologic personnel for the CDC/PEPFAR-funded CIRKUITS HIV treatment and prevention service delivery programs, supporting data collection, database design, and analysis. She also assists globally through programs in the affiliated Center for International Health, Education, and Biosecurity (Ciheb) at the University of Maryland Baltimore, supporting eight countries across sub-Saharan Africa. Dr. Lindsay applies quantitative data analysis to inform epidemiological investigations and practical research in infectious disease prevention and treatment in real-world settings. She has published extensively in burden of disease estimation using various study designs has expertise in quantitative methods as well as analysis of high-throughput and advanced molecular detection techniques. She has planned and organized a number of global health data workshops in Africa, Asia and the U.S.

Contact Speaker


Brianna Lindsay on Linkedin

Using R as an Epidemiologist

According to Frerot et al., 2017,Epidemiology is the study of the distribution and determinants of health-related states of events in specified populations. Brianna shows us Epidemiology analysis with;

Note

the here, readr, janitor, summarytools and tidyverse packages need to be installed before executing the code.

  1. Grouping aggregate operation
library(here)
library(readr)
library(janitor)
library(summarytools)
library(tidyverse)


aggregate <- read_csv(here ("data", "AggregateData.csv"))

#Tidy-up using Janitor package
aggregate<-clean_names(aggregate)
aggregate = remove_empty(aggregate, which = c("rows","cols"))

#A little exploration (periodname, city, age, career, data)
freq(aggregate$periodname)
freq(aggregate$city)
freq(aggregate$age)
freq(aggregate$career)

#Remove epidemiologists from the analysis
aggregate<- filter(aggregate, career %in% c("astronaut","farmer","pilot","surgeon"))

#Check out an overall summary of data
overall<-aggregate %>%
  dplyr::summarise(sumsurvey = sum(surveyed, na.rm = TRUE),
                   sumtestpos = sum(tested_positive, na.rm = TRUE),
                   sumtestneg = sum(tested_negative, na.rm = TRUE),
                   sumeligible = sum(treatment_eligible, na.rm = TRUE),
                   sumtreated = sum(treatment, na.rm = TRUE),) %>%
  mutate(treatprop = (sumtreated/sumtestpos)*100,
         treatup = (sumtreated/sumeligible)*100)

#Group by career
career<-aggregate %>%
  dplyr::group_by(career) %>%
  dplyr::summarise(sumsurvey = sum(surveyed, na.rm = TRUE),
                   sumtestpos = sum(tested_positive, na.rm = TRUE),
                   sumtestneg = sum(tested_negative, na.rm = TRUE),
                   sumeligible = sum(treatment_eligible, na.rm = TRUE),
                   sumtreated = sum(treatment, na.rm = TRUE),) %>%
  mutate(treatprop = (sumtreated/sumtestpos)*100,
         treatup = (sumtreated/sumeligible)*100)

#Group by age
age<-aggregate %>%
  dplyr::group_by(age) %>%
  dplyr::summarise(sumsurvey = sum(surveyed, na.rm = TRUE),
                   sumtestpos = sum(tested_positive, na.rm = TRUE),
                   sumtestneg = sum(tested_negative, na.rm = TRUE),
                   sumeligible = sum(treatment_eligible, na.rm = TRUE),
                   sumtreated = sum(treatment, na.rm = TRUE),) %>%
  mutate(treatprop = (sumtreated/sumtestpos)*100,
         treatup = (sumtreated/sumeligible)*100)

#Group by career and age
careerage<-aggregate %>%
  dplyr::group_by(career,age) %>%
  dplyr::summarise(sumsurvey = sum(surveyed, na.rm = TRUE),
                   sumtestpos = sum(tested_positive, na.rm = TRUE),
                   sumtestneg = sum(tested_negative, na.rm = TRUE),
                   sumeligible = sum(treatment_eligible, na.rm = TRUE),
                   sumtreated = sum(treatment, na.rm = TRUE),) %>%
  mutate(treatprop = (sumtreated/sumtestpos)*100,
         treatup = (sumtreated/sumeligible)*100)

#Recode period monthly data to quarters
freq(aggregate$periodname)
aggregate<-aggregate %>% mutate(quarter = ifelse(periodname %in% c("Oct-20","Nov-20","Dec-20"),"Q1",
                                                    ifelse(periodname %in% c("Jan-21","Feb-21","Mar-21"),"Q2", 
                                                      ifelse(periodname %in% c("Apr-21","May-21","Jun-21"),"Q3",
                                                        ifelse(periodname %in% c("Jul-21","Aug-21","Sep-21"),"Q4",NA)))))
freq(aggregate$quarter)


##Any trends over quarters?
quarter<-aggregate %>%
  dplyr::group_by(quarter) %>%
  dplyr::summarise(sumsurvey = sum(surveyed, na.rm = TRUE),
                   sumtestpos = sum(tested_positive, na.rm = TRUE),
                   sumtestneg = sum(tested_negative, na.rm = TRUE),
                   sumeligible = sum(treatment_eligible, na.rm = TRUE),
                   sumtreated = sum(treatment, na.rm = TRUE),) %>%
  mutate(treatprop = (sumtreated/sumtestpos)*100,
         treatup = (sumtreated/sumeligible)*100)
         
  1. Pre-Post analysis

    The Pre-post analysis shows whether an antibiotic treatment efficacy improved overtime using an app. The figure below shows on the far left, the survey in app form to collect antibiotic treatment information from different demographics. The table of contects shows the symptons collected from the survey until the Pre-post app is developed on the far right.

pre post analysis
library(here)
library(readr)
library(janitor)
library(tidyverse)
library(Hmisc)
library(summarytools)
library(table1)

##Objective: Test for knowledge retained before and after an educational intervention

prepostAssess <- read_csv(here ("data", "prepostAssessment.csv"))

#Clean variable names
prepostAssess<-clean_names(prepostAssess)

#Shorten answers for easier analysis
prepostAssess[,8:12] <- lapply(prepostAssess[,8:12],
                               function(x) substr(x,1,1))
prepostAssess[,18:22] <- lapply(prepostAssess[,18:22],
                               function(x) substr(x,1,1))
#Code either correct or incorrect
prepostAssess<-mutate(prepostAssess, preq1_1 = ifelse(pre_q1 == "b", 1, 0))
freq(prepostAssess$pre_q1)
freq(prepostAssess$preq1_1)

prepostAssess<-mutate(prepostAssess, preq2_1 = ifelse(pre_q2 == "c", 1, 0))
prepostAssess<-mutate(prepostAssess, preq3_1 = ifelse(pre_q3 == "d", 1, 0))
prepostAssess<-mutate(prepostAssess, preq4_1 = ifelse(pre_q4 == "c", 1, 0))
prepostAssess<-mutate(prepostAssess, preq5_1 = ifelse(pre_q5 == "b", 1, 0))

prepostAssess<-mutate(prepostAssess, postq1_1 = ifelse(post_q1 == "b", 1, 0))
freq(prepostAssess$post_q1)
freq(prepostAssess$postq1_1)

prepostAssess<-mutate(prepostAssess, postq2_1 = ifelse(post_q2 == "c", 1, 0))
prepostAssess<-mutate(prepostAssess, postq3_1 = ifelse(post_q3 == "d", 1, 0))
prepostAssess<-mutate(prepostAssess, postq4_1 = ifelse(post_q4 == "c", 1, 0))
prepostAssess<-mutate(prepostAssess, postq5_1 = ifelse(post_q5 == "b", 1, 0))

#Calculate total score of correct answers
prepostAssess<-prepostAssess %>% mutate(preSum = rowSums(.[23:27]))
prepostAssess<-prepostAssess %>% mutate(postSum = rowSums(.[28:32]))
prepostAssess<-prepostAssess %>% mutate(Diff = postSum-preSum)

#Identify the individuals who had answered both the pre and post survey
abx_prepostComplete <- prepostAssess[ which(prepostAssess$both_complete == 1), ]

#Compare the pre-test score to post-test score
mean(abx_prepostComplete$preSum)
mean(abx_prepostComplete$postSum)
mean(abx_prepostComplete$Diff)

median(abx_prepostComplete$preSum)
median(abx_prepostComplete$postSum)

#Statistical test for difference between pre-score and post-score
t.test(abx_prepostComplete$postSum, abx_prepostComplete$preSum, paired = TRUE, alternative = "two.sided")

#Table1
freq(prepostAssess$subspecialty)
freq(prepostAssess$year)
table1(~ subspecialty + year + preSum + postSum, data=prepostAssess, overall="Total")

freq(abx_prepostComplete$subspecialty)
freq(abx_prepostComplete$year)
freq(abx_prepostComplete$post_please_select_your_gender)
table1(~ subspecialty + year + post_please_select_your_gender+ preSum + postSum, data=abx_prepostComplete, overall="Total")

#Table2
lapply(prepostAssess[,23:27],
       function(x) freq(x))
mean(prepostAssess$preSum)

lapply(abx_prepostComplete[,23:27],
       function(x) freq(x))

lapply(abx_prepostComplete[,28:32],
       function(x) freq(x))

table1(~ factor(preq1_1) + factor(preq2_1) + factor(preq3_1) + factor(preq4_1) + factor(preq5_1) + factor(postq1_1) + factor(postq2_1) + factor(postq3_1) + factor(postq4_1) + factor(postq5_1), data=abx_prepostComplete, overall="Total")

mean(prepostAssess$preSum)
mean(abx_prepostComplete$preSum)
mean(abx_prepostComplete$postSum)

##paired Samples Test: Is there a difference in the pre-test score vs. the post-test score
test <- mcnemar.test(table(abx_prepostComplete$preq1_1, abx_prepostComplete$postq1_1))
test

test <- mcnemar.test(table(abx_prepostComplete$preq2_1, abx_prepostComplete$postq2_1))
test

test <- mcnemar.test(table(abx_prepostComplete$preq3_1, abx_prepostComplete$postq3_1))
test

test <- mcnemar.test(table(abx_prepostComplete$preq4_1, abx_prepostComplete$postq4_1))
test

test <- mcnemar.test(table(abx_prepostComplete$preq5_1, abx_prepostComplete$postq5_1))
test
  1. Power analysis
#power analysis
library(pwr)
library(epiR)

#https://cran.r-project.org/web/packages/pwr/vignettes/pwr-vignette.html

                          #effect size
#Test                       small   medium  large
#chi-square tests (chisq)     0.1   0.3 0.5

#Chi square
pwr.chisq.test(w =0.21, N = , df = 1, sig.level =0.05, power = 0.9)
pwr.chisq.test(w =0.1, N = , df = 1, sig.level =0.05, power = 0.8)
pwr.chisq.test(w =0.2, N = , df = 1, sig.level =0.05, power = 0.9)
pwr.chisq.test(w =0.2, N = , df = 1, sig.level =0.05, power = 0.8)
pwr.chisq.test(w =0.3, N = , df = 1, sig.level =0.05, power = 0.8)
pwr.chisq.test(w =0.3, N = , df = 1, sig.level =0.05, power = 0.9)
pwr.chisq.test(w =0.5, N = , df = 1, sig.level =0.05, power = 0.8)
pwr.chisq.test(w = , N = 225, df = 1, sig.level =0.05, power = 0.9)
pwr.chisq.test(w = , N = 297, df = 1, sig.level =0.05, power = 0.8)
pwr.chisq.test(w = , N = 276, df = 1, sig.level =0.05, power = 0.8)


#T-Test
#Small
pwr.t.test(d=0.2, sig.level=0.05, power=0.80, type="two.sample", alternative="two.sided")

#Medium
pwr.t.test(d=0.5, sig.level=0.05, power=0.80, type="two.sample", alternative="two.sided")

#Large
pwr.t.test(d=0.8, sig.level=0.05, power=0.80, type="two.sample", alternative="two.sided")

#Small
pwr.t.test(d=0.2, sig.level=0.05, power=0.90, type="two.sample", alternative="two.sided")

#Medium
pwr.t.test(d=0.5, sig.level=0.05, power=0.90, type="two.sample", alternative="two.sided")

#Large
pwr.t.test(d=0.8, sig.level=0.05, power=0.90, type="two.sample", alternative="two.sided")

#Back-calculation
pwr.t.test(n=130, d=, sig.level=0.05, power=0.90, type="two.sample", alternative="two.sided")

pwr.t.test(n=130, d=, sig.level=0.05, power=0.80, type="two.sample", alternative="two.sided")


power.prop.test(p1=.30, p2=.391304, power=.95) 
#these values for p1 and p2 give OR of 1.50

pwr.2p.test(h = ES.h(p1=.50, p2= 0.05), n = 225, sig.level =0.05, power = )
pwr.2p.test(h = ES.h(p1=.50, p2= 0.10), n = 225, sig.level =0.05, power = )
pwr.2p.test(h = ES.h(p1=.50, p2= 0.10), n = 225, sig.level =0.05, power = )
pwr.2p.test(h = ES.h(p1=.50, p2= 0.10), n = 225, sig.level =0.05, power = )
pwr.2p.test(h = ES.h(p1=.50, p2= 0.10), n = 225, sig.level =0.05, power = )
pwr.2p.test(h = ES.h(p1=.50, p2= 0.10), n = 225, sig.level =0.05, power = )

power.prop.test(p1=.50, p2=.25, power=.80) 
pwr.2p.test(h = ES.h(p1=0.10, p2= 0.25), n = , sig.level =0.05, power = 0.8)

pwr.2p.test(h = ES.h(p1=0.1819, p2= 0.10), n = , sig.level =0.05, power = 0.8)
power.prop.test(p1=.1819, p2=.10, power=.80)


library(epiR)
pi.sscc(OR = 2.0, p1 = NA, p0 = 0.30, n = NA, power = 0.80, 
        r = 1, phi.coef = 0, design = 1, sided.test = 2, conf.level = 0.95, 
        method = "unmatched", nfractional = FALSE, fleiss = FALSE)$n.total

Resources


. The Epidemiologist R Handbook - R for applied epidemiology

. https://rladiessydney.org/courses/ryouwithme/01-basicbasics-0/

.https://riffomonas.org/generalR/

. Mutate and ifelse

. CDC Managing Data Workbook

. RECON Learn

Reference

Frerot, M., Lefebvre, A., Aho, S., Callier, P., Astruc, K., & Aho Glele, L. S. (2018). What is epidemiology? Changing definitions of epidemiology 1978-2017. PloS one,13(12), e0208442.