Using R as an Epidemiologist Assessment of real-world data
Meetup Details
R-Ladies Dammam and R-Ladies Gaborone co-host this talk describing how Brianna Lindsay PhD uses and has used R as an epidemiologist in industry and academia. The focus of the talk will review how she approaches a dataset, important lessons learned about ‘dirty’ real-world data and the basics of data exploration using R and R Studio.
About Speaker
Brianna Lindsay MPH PhD, is an infectious disease epidemiologist based full-time in Zambia. She serves as primary epidemiologic personnel for the CDC/PEPFAR-funded CIRKUITS HIV treatment and prevention service delivery programs, supporting data collection, database design, and analysis. She also assists globally through programs in the affiliated Center for International Health, Education, and Biosecurity (Ciheb) at the University of Maryland Baltimore, supporting eight countries across sub-Saharan Africa. Dr. Lindsay applies quantitative data analysis to inform epidemiological investigations and practical research in infectious disease prevention and treatment in real-world settings. She has published extensively in burden of disease estimation using various study designs has expertise in quantitative methods as well as analysis of high-throughput and advanced molecular detection techniques. She has planned and organized a number of global health data workshops in Africa, Asia and the U.S.
Contact Speaker
Using R as an Epidemiologist
According to Frerot et al., 2017,Epidemiology is the study of the distribution and determinants of health-related states of events in specified populations. Brianna shows us Epidemiology analysis with;
grouping aggregate data using summarytools package
pre-post analysis with tidyverse, and
power analysis with pwr package
the here, readr, janitor, summarytools and tidyverse packages need to be installed before executing the code.
- Grouping aggregate operation
library(here)
library(readr)
library(janitor)
library(summarytools)
library(tidyverse)
<- read_csv(here ("data", "AggregateData.csv"))
aggregate
#Tidy-up using Janitor package
<-clean_names(aggregate)
aggregate= remove_empty(aggregate, which = c("rows","cols"))
aggregate
#A little exploration (periodname, city, age, career, data)
freq(aggregate$periodname)
freq(aggregate$city)
freq(aggregate$age)
freq(aggregate$career)
#Remove epidemiologists from the analysis
<- filter(aggregate, career %in% c("astronaut","farmer","pilot","surgeon"))
aggregate
#Check out an overall summary of data
<-aggregate %>%
overall::summarise(sumsurvey = sum(surveyed, na.rm = TRUE),
dplyrsumtestpos = sum(tested_positive, na.rm = TRUE),
sumtestneg = sum(tested_negative, na.rm = TRUE),
sumeligible = sum(treatment_eligible, na.rm = TRUE),
sumtreated = sum(treatment, na.rm = TRUE),) %>%
mutate(treatprop = (sumtreated/sumtestpos)*100,
treatup = (sumtreated/sumeligible)*100)
#Group by career
<-aggregate %>%
career::group_by(career) %>%
dplyr::summarise(sumsurvey = sum(surveyed, na.rm = TRUE),
dplyrsumtestpos = sum(tested_positive, na.rm = TRUE),
sumtestneg = sum(tested_negative, na.rm = TRUE),
sumeligible = sum(treatment_eligible, na.rm = TRUE),
sumtreated = sum(treatment, na.rm = TRUE),) %>%
mutate(treatprop = (sumtreated/sumtestpos)*100,
treatup = (sumtreated/sumeligible)*100)
#Group by age
<-aggregate %>%
age::group_by(age) %>%
dplyr::summarise(sumsurvey = sum(surveyed, na.rm = TRUE),
dplyrsumtestpos = sum(tested_positive, na.rm = TRUE),
sumtestneg = sum(tested_negative, na.rm = TRUE),
sumeligible = sum(treatment_eligible, na.rm = TRUE),
sumtreated = sum(treatment, na.rm = TRUE),) %>%
mutate(treatprop = (sumtreated/sumtestpos)*100,
treatup = (sumtreated/sumeligible)*100)
#Group by career and age
<-aggregate %>%
careerage::group_by(career,age) %>%
dplyr::summarise(sumsurvey = sum(surveyed, na.rm = TRUE),
dplyrsumtestpos = sum(tested_positive, na.rm = TRUE),
sumtestneg = sum(tested_negative, na.rm = TRUE),
sumeligible = sum(treatment_eligible, na.rm = TRUE),
sumtreated = sum(treatment, na.rm = TRUE),) %>%
mutate(treatprop = (sumtreated/sumtestpos)*100,
treatup = (sumtreated/sumeligible)*100)
#Recode period monthly data to quarters
freq(aggregate$periodname)
<-aggregate %>% mutate(quarter = ifelse(periodname %in% c("Oct-20","Nov-20","Dec-20"),"Q1",
aggregateifelse(periodname %in% c("Jan-21","Feb-21","Mar-21"),"Q2",
ifelse(periodname %in% c("Apr-21","May-21","Jun-21"),"Q3",
ifelse(periodname %in% c("Jul-21","Aug-21","Sep-21"),"Q4",NA)))))
freq(aggregate$quarter)
##Any trends over quarters?
<-aggregate %>%
quarter::group_by(quarter) %>%
dplyr::summarise(sumsurvey = sum(surveyed, na.rm = TRUE),
dplyrsumtestpos = sum(tested_positive, na.rm = TRUE),
sumtestneg = sum(tested_negative, na.rm = TRUE),
sumeligible = sum(treatment_eligible, na.rm = TRUE),
sumtreated = sum(treatment, na.rm = TRUE),) %>%
mutate(treatprop = (sumtreated/sumtestpos)*100,
treatup = (sumtreated/sumeligible)*100)
Pre-Post analysis
The Pre-post analysis shows whether an antibiotic treatment efficacy improved overtime using an app. The figure below shows on the far left, the survey in app form to collect antibiotic treatment information from different demographics. The table of contects shows the symptons collected from the survey until the Pre-post app is developed on the far right.
library(here)
library(readr)
library(janitor)
library(tidyverse)
library(Hmisc)
library(summarytools)
library(table1)
##Objective: Test for knowledge retained before and after an educational intervention
<- read_csv(here ("data", "prepostAssessment.csv"))
prepostAssess
#Clean variable names
<-clean_names(prepostAssess)
prepostAssess
#Shorten answers for easier analysis
8:12] <- lapply(prepostAssess[,8:12],
prepostAssess[,function(x) substr(x,1,1))
18:22] <- lapply(prepostAssess[,18:22],
prepostAssess[,function(x) substr(x,1,1))
#Code either correct or incorrect
<-mutate(prepostAssess, preq1_1 = ifelse(pre_q1 == "b", 1, 0))
prepostAssessfreq(prepostAssess$pre_q1)
freq(prepostAssess$preq1_1)
<-mutate(prepostAssess, preq2_1 = ifelse(pre_q2 == "c", 1, 0))
prepostAssess<-mutate(prepostAssess, preq3_1 = ifelse(pre_q3 == "d", 1, 0))
prepostAssess<-mutate(prepostAssess, preq4_1 = ifelse(pre_q4 == "c", 1, 0))
prepostAssess<-mutate(prepostAssess, preq5_1 = ifelse(pre_q5 == "b", 1, 0))
prepostAssess
<-mutate(prepostAssess, postq1_1 = ifelse(post_q1 == "b", 1, 0))
prepostAssessfreq(prepostAssess$post_q1)
freq(prepostAssess$postq1_1)
<-mutate(prepostAssess, postq2_1 = ifelse(post_q2 == "c", 1, 0))
prepostAssess<-mutate(prepostAssess, postq3_1 = ifelse(post_q3 == "d", 1, 0))
prepostAssess<-mutate(prepostAssess, postq4_1 = ifelse(post_q4 == "c", 1, 0))
prepostAssess<-mutate(prepostAssess, postq5_1 = ifelse(post_q5 == "b", 1, 0))
prepostAssess
#Calculate total score of correct answers
<-prepostAssess %>% mutate(preSum = rowSums(.[23:27]))
prepostAssess<-prepostAssess %>% mutate(postSum = rowSums(.[28:32]))
prepostAssess<-prepostAssess %>% mutate(Diff = postSum-preSum)
prepostAssess
#Identify the individuals who had answered both the pre and post survey
<- prepostAssess[ which(prepostAssess$both_complete == 1), ]
abx_prepostComplete
#Compare the pre-test score to post-test score
mean(abx_prepostComplete$preSum)
mean(abx_prepostComplete$postSum)
mean(abx_prepostComplete$Diff)
median(abx_prepostComplete$preSum)
median(abx_prepostComplete$postSum)
#Statistical test for difference between pre-score and post-score
t.test(abx_prepostComplete$postSum, abx_prepostComplete$preSum, paired = TRUE, alternative = "two.sided")
#Table1
freq(prepostAssess$subspecialty)
freq(prepostAssess$year)
table1(~ subspecialty + year + preSum + postSum, data=prepostAssess, overall="Total")
freq(abx_prepostComplete$subspecialty)
freq(abx_prepostComplete$year)
freq(abx_prepostComplete$post_please_select_your_gender)
table1(~ subspecialty + year + post_please_select_your_gender+ preSum + postSum, data=abx_prepostComplete, overall="Total")
#Table2
lapply(prepostAssess[,23:27],
function(x) freq(x))
mean(prepostAssess$preSum)
lapply(abx_prepostComplete[,23:27],
function(x) freq(x))
lapply(abx_prepostComplete[,28:32],
function(x) freq(x))
table1(~ factor(preq1_1) + factor(preq2_1) + factor(preq3_1) + factor(preq4_1) + factor(preq5_1) + factor(postq1_1) + factor(postq2_1) + factor(postq3_1) + factor(postq4_1) + factor(postq5_1), data=abx_prepostComplete, overall="Total")
mean(prepostAssess$preSum)
mean(abx_prepostComplete$preSum)
mean(abx_prepostComplete$postSum)
##paired Samples Test: Is there a difference in the pre-test score vs. the post-test score
<- mcnemar.test(table(abx_prepostComplete$preq1_1, abx_prepostComplete$postq1_1))
test
test
<- mcnemar.test(table(abx_prepostComplete$preq2_1, abx_prepostComplete$postq2_1))
test
test
<- mcnemar.test(table(abx_prepostComplete$preq3_1, abx_prepostComplete$postq3_1))
test
test
<- mcnemar.test(table(abx_prepostComplete$preq4_1, abx_prepostComplete$postq4_1))
test
test
<- mcnemar.test(table(abx_prepostComplete$preq5_1, abx_prepostComplete$postq5_1))
test test
- Power analysis
#power analysis
library(pwr)
library(epiR)
#https://cran.r-project.org/web/packages/pwr/vignettes/pwr-vignette.html
#effect size
#Test small medium large
#chi-square tests (chisq) 0.1 0.3 0.5
#Chi square
pwr.chisq.test(w =0.21, N = , df = 1, sig.level =0.05, power = 0.9)
pwr.chisq.test(w =0.1, N = , df = 1, sig.level =0.05, power = 0.8)
pwr.chisq.test(w =0.2, N = , df = 1, sig.level =0.05, power = 0.9)
pwr.chisq.test(w =0.2, N = , df = 1, sig.level =0.05, power = 0.8)
pwr.chisq.test(w =0.3, N = , df = 1, sig.level =0.05, power = 0.8)
pwr.chisq.test(w =0.3, N = , df = 1, sig.level =0.05, power = 0.9)
pwr.chisq.test(w =0.5, N = , df = 1, sig.level =0.05, power = 0.8)
pwr.chisq.test(w = , N = 225, df = 1, sig.level =0.05, power = 0.9)
pwr.chisq.test(w = , N = 297, df = 1, sig.level =0.05, power = 0.8)
pwr.chisq.test(w = , N = 276, df = 1, sig.level =0.05, power = 0.8)
#T-Test
#Small
pwr.t.test(d=0.2, sig.level=0.05, power=0.80, type="two.sample", alternative="two.sided")
#Medium
pwr.t.test(d=0.5, sig.level=0.05, power=0.80, type="two.sample", alternative="two.sided")
#Large
pwr.t.test(d=0.8, sig.level=0.05, power=0.80, type="two.sample", alternative="two.sided")
#Small
pwr.t.test(d=0.2, sig.level=0.05, power=0.90, type="two.sample", alternative="two.sided")
#Medium
pwr.t.test(d=0.5, sig.level=0.05, power=0.90, type="two.sample", alternative="two.sided")
#Large
pwr.t.test(d=0.8, sig.level=0.05, power=0.90, type="two.sample", alternative="two.sided")
#Back-calculation
pwr.t.test(n=130, d=, sig.level=0.05, power=0.90, type="two.sample", alternative="two.sided")
pwr.t.test(n=130, d=, sig.level=0.05, power=0.80, type="two.sample", alternative="two.sided")
power.prop.test(p1=.30, p2=.391304, power=.95)
#these values for p1 and p2 give OR of 1.50
pwr.2p.test(h = ES.h(p1=.50, p2= 0.05), n = 225, sig.level =0.05, power = )
pwr.2p.test(h = ES.h(p1=.50, p2= 0.10), n = 225, sig.level =0.05, power = )
pwr.2p.test(h = ES.h(p1=.50, p2= 0.10), n = 225, sig.level =0.05, power = )
pwr.2p.test(h = ES.h(p1=.50, p2= 0.10), n = 225, sig.level =0.05, power = )
pwr.2p.test(h = ES.h(p1=.50, p2= 0.10), n = 225, sig.level =0.05, power = )
pwr.2p.test(h = ES.h(p1=.50, p2= 0.10), n = 225, sig.level =0.05, power = )
power.prop.test(p1=.50, p2=.25, power=.80)
pwr.2p.test(h = ES.h(p1=0.10, p2= 0.25), n = , sig.level =0.05, power = 0.8)
pwr.2p.test(h = ES.h(p1=0.1819, p2= 0.10), n = , sig.level =0.05, power = 0.8)
power.prop.test(p1=.1819, p2=.10, power=.80)
library(epiR)
pi.sscc(OR = 2.0, p1 = NA, p0 = 0.30, n = NA, power = 0.80,
r = 1, phi.coef = 0, design = 1, sided.test = 2, conf.level = 0.95,
method = "unmatched", nfractional = FALSE, fleiss = FALSE)$n.total
Resources
. The Epidemiologist R Handbook - R for applied epidemiology
. https://rladiessydney.org/courses/ryouwithme/01-basicbasics-0/
Reference
Frerot, M., Lefebvre, A., Aho, S., Callier, P., Astruc, K., & Aho Glele, L. S. (2018). What is epidemiology? Changing definitions of epidemiology 1978-2017. PloS one,13(12), e0208442.
YouTube link
Watch the recording of the meetup session and subscribe to the R-Ladies Gaborone channel and get notifications to new videos uploaded.