R Data Analysis in Medical Science Practical Demonstration in RStudio
Meetup Details
PhD candidate Mouneem Essabbar continues with this talk on data analysis and some of its applications in medical sciences with a practical demonstration in RStudio.
About Speaker
Abdelmounim ESSABBAR, PhD, works as a research engineer at the Cancer Research Center in Toulouse, France. He earned his PhD in Life and Health Sciences in the Bioinformatics department of the Medical and Pharmacy School in Rabat, Morocco. He teaches bioinformatics and is a member of the Moroccan COVID19 Genomic Surveillance.
Contact Speaker
R Data Analysis in Medical Science Practical
The plan for the session was to demonstrate the analysis of biomedical data, with applications to stroke prediction and COVID-19. Slides, code and presentation can be found here.
The following R Presentation contains the slides and the data analysis.
Copy the following code and paste it into RStudio to display the presentation and see the analysis results.
MEDICAL DATA ANALYSIS========================================================
author: Abdelmounim - ESSABBAR
date: 2022-02-10
autosize: true
Medical Data Analysis====================================
- Epidemiological data
- Genomic data
Load data====================================
```r
#setwd('~/medicaldataanalysis')
data = read.csv('https://raw.githubusercontent.com/rladies/meetup_presentations_gaborone/main/R%20Data%20Analysis%20in%20Medical%20Science/strokes.csv')
head(data)
str(data)
```
Structure and variables====================================
```r
str(data)
```
Data Visualization====================================
### What is the gender distribution ?
```r
typeof(data$gender)
```
### the variable gender is a character
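Knowing the type is only the first step; the distribution itself can also be read as counts and proportions before plotting. A minimal sketch in base R, using a hypothetical toy vector in place of `data$gender` (the values are made up for illustration):

```r
# Hypothetical toy vector standing in for data$gender
gender <- c("Female", "Male", "Female", "Other", "Female", "Male")

# Absolute counts per category
gender_counts <- table(gender)
print(gender_counts)

# Relative frequencies (proportions that sum to 1)
gender_props <- prop.table(gender_counts)
print(round(gender_props, 2))
```

The bar chart on the next slide shows the same counts graphically.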
Data Visualization====================================
### Load the ggplot2 library
```r
library('ggplot2')
ggplot(data , aes(x = gender) )+
geom_bar()
```
Data Visualization====================================
### color (fill) by gender
```r
library('ggplot2')
ggplot(data , aes(x = gender, fill = gender) )+
geom_bar()
```
Data Visualization====================================
### What is the age distribution ?
```r
typeof(data$age)
```
### the variable 'age' is a double (Number)
Data Visualization====================================
### Visualize
```r
ggplot(data , aes(x = age) )+
geom_density(fill = 'lightblue')
```
Data Visualization====================================
### Distribution of age and gender ?
```r
ggplot(data , aes(x = age) )+
geom_density(fill = 'lightblue')
```
Data Visualization====================================
### Distribution of age and gender ?
x-axis: age & color: gender
```r
ggplot(data , aes(x = age) )+
geom_density(aes(fill = gender), alpha = .25)
```
Data Visualization====================================
# Does gender relate to strokes?
Data Visualization====================================
# Does gender relate to strokes?
Compute frequencies:
```r
tabletable_gender_strk = table(data$stroke , data$gender)
tabletable_gender_strk
```
```r
df_gender_strk = as.data.frame(tabletable_gender_strk)
head(df_gender_strk)
```
Data Visualization====================================
# Does gender relate to strokes?
```r
library(viridis) # load library for colors
ggplot(data = df_gender_strk, mapping = aes(x = Var1, y = Var2, fill = Freq)) +
geom_tile( ) + scale_fill_viridis() + theme_bw()
```
### Imbalanced data: the large majority of patients did not have a stroke!
Data Visualization====================================
# Does age relate to strokes?
Data Visualization====================================
# Does age relate to strokes?
```r
data$stroke = as.factor(data$stroke) #convert stroke (var) to factor
ggplot(data , aes(x = age) )+
geom_density(aes(fill = stroke), alpha = .25)
```
Data Visualization====================================
# How do age and gender relate to strokes?
Data Visualization====================================
# How do age and gender relate to strokes?
fill: stroke / color: gender
```r
ggplot(data , aes(x = age, color = gender, fill = stroke) )+
geom_density(alpha = .25)
```
Data Visualization====================================
# How do age and gender relate to strokes?
Solution 2: facet wrap
```r
ggplot(data , aes(x = age) )+
geom_density(aes(fill = stroke), alpha = .25)+
facet_wrap(.~gender)
```
Data Visualization====================================
# How do age and gender relate to strokes?
Solution 3: facet grid
```r
ggplot(data , aes(x = age) )+
geom_density(aes(fill = gender), alpha = .25)+
facet_grid(stroke~gender)
```
====================================
# What about the other parameters ?
```r
str(data)
```
====================================
# Correlation between marriage and strokes?
```r
table_married_strk = as.data.frame(table(data$stroke , data$ever_married))
head(table_married_strk)
```
```r
ggplot(data = table_married_strk, mapping = aes(x = Var1, y = Var2, fill = Freq)) +
geom_tile() + labs(x = 'have stroke', y = 'is married') +
scale_fill_viridis() + theme_bw()
```
# Imbalanced data
====================================
# Correlation between marriage and strokes?
## Normalize data by stroke
```r
table_married_strk = table(data$stroke , data$ever_married)
# prop.table() with margin = 1 divides each entry by its row total,
# so the frequencies within each stroke group sum to 1
df = prop.table(table_married_strk, margin = 1)
```
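Row-wise normalization is easiest to see on a small table. A minimal sketch with hypothetical counts (the numbers are invented, not taken from the stroke data): each row is divided by its own total, so a small group becomes comparable to a large one.

```r
# Hypothetical counts: rows = stroke status, columns = ever_married
counts <- as.table(matrix(
  c(90, 10,    # column "No":  stroke 0, stroke 1
    60, 40),   # column "Yes": stroke 0, stroke 1
  nrow = 2,
  dimnames = list(stroke = c("0", "1"), ever_married = c("No", "Yes"))
))
print(counts)

# margin = 1 normalizes within each row (each stroke group)
row_props <- prop.table(counts, margin = 1)
print(row_props)

# Every row now sums to 1
print(rowSums(row_props))
```

In this toy table, 40% of the stroke-0 group and 80% of the stroke-1 group are married, a difference that the raw counts hide behind the much larger stroke-0 group.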
```r
table_married_strk = as.data.frame(df)
ggplot(data = table_married_strk, mapping = aes(x = Var1, y = Var2, fill = Freq)) +
geom_tile() + labs(x = 'have stroke', y = 'is married') +
scale_fill_viridis() + theme_bw()
```
====================================
# Impact of weight ? BMI ?
## Distribution by gender :
```r
data$bmi = as.numeric(data$bmi)
ggplot(data , aes(x = bmi, fill = gender) )+
geom_density(alpha = .2)
```
Data visualization====================================
```r
ggplot(data , aes(x = stroke, y = bmi) )+
geom_point()
```
## Problem: overlapping points
Data visualization====================================
```r
ggplot(data , aes(x = stroke, y = bmi) )+
geom_jitter ()
```
### More tuning?
Data visualization====================================
```r
ggplot(data , aes(x = stroke, y = bmi) )+
geom_jitter (alpha = .5) +
geom_boxplot(outlier.shape = NA)
```
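The thick line in each box is the group median, which can also be computed directly. A minimal sketch on a hypothetical toy data frame (`toy` and its values are made up for illustration):

```r
# Hypothetical toy data: stroke status and BMI, with one missing BMI
toy <- data.frame(
  stroke = factor(c(0, 0, 0, 1, 1, 1)),
  bmi    = c(22, 28, NA, 30, 27, 33)
)

# Median BMI per stroke group, ignoring missing values
medians <- tapply(toy$bmi, toy$stroke, median, na.rm = TRUE)
print(medians)
```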
Data visualization====================================
# How does early obesity affect the likelihood of having a stroke?
```r
str(data)
```
Data visualization====================================
# How does early obesity affect the likelihood of having a stroke?
```r
ggplot(data , aes(x = age, y = bmi, color = stroke) )+
geom_point()+ facet_wrap(.~stroke)
```
Data visualization====================================
# How does early obesity affect the likelihood of having a stroke?
```r
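# Smoothed trend per group. With no method given, geom_smooth() uses LOESS
# for groups with fewer than 1,000 observations and a GAM otherwise
ggplot(data , aes(x = age, y = bmi, color = stroke) )+
geom_point(size = .1)+ facet_wrap(.~stroke) +
geom_smooth(se = FALSE)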
```
Data visualization====================================
# How does early obesity affect the likelihood of having a stroke?
```r
# Same smoothed trends, both groups overlaid in a single panel
ggplot(data , aes(x = age, y = bmi, color = stroke) )+
geom_point(size = .1)+ geom_smooth(se = FALSE)
```
Data visualization====================================
### *What about the other parameters ?*
```r
data[sapply(data, is.character)] <- lapply(data[sapply(data, is.character)], as.factor) ## The conversion of chr to factors
pairs(data) # produces a matrix of scatterplots
```
#### not very helpful :/
====================================
### *What about the other parameters ?*
## SOLUTION: Machine learning
![ML-MODEL](model.png)
Machine learning model====================================
```r
library(tidyverse)
#setwd('~/medicaldataanalysis')
data = read.csv('https://raw.githubusercontent.com/rladies/meetup_presentations_gaborone/main/R%20Data%20Analysis%20in%20Medical%20Science/strokes.csv')
data$gender <- as.factor(data$gender)
data$ever_married <- as.factor(data$ever_married)
data$work_type <- as.factor(data$work_type)
data$Residence_type <- as.factor(data$Residence_type)
data$smoking_status <- as.factor(data$smoking_status)
data$stroke <- factor(data$stroke, levels = c(0,1), labels = c("No","Yes"))
data$heart_disease <- factor(data$heart_disease, levels = c(0,1), labels = c("No", "Yes"))
data$hypertension <- factor(data$hypertension, levels = c(0,1), labels = c("No", "Yes"))
data$bmi <- as.numeric(data$bmi)
str(data)
```
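The `factor()` calls above turn 0/1 codes into labelled categories. This matters for the model later on: `randomForest()` fits a classification forest when the response is a factor and a regression forest when it is numeric. A minimal sketch on a hypothetical toy vector:

```r
# Hypothetical 0/1 codes, as stored in the raw CSV
hypertension <- c(0, 1, 0, 0, 1)

# Map the codes to readable labels: 0 -> "No", 1 -> "Yes"
hypertension_f <- factor(hypertension, levels = c(0, 1), labels = c("No", "Yes"))

print(levels(hypertension_f))
print(table(hypertension_f))
```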
Machine learning: Data Preprocessing / Cleaning====================================
### Data cleaning: how many "N/A" values are in my dataset per column?
```r
# Plot the number of missing values per variable
library(naniar)
gg_miss_var(data)
```
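The same question can be answered numerically without an extra package. A minimal sketch in base R on a hypothetical toy data frame (`toy` and its values are made up for illustration):

```r
# Hypothetical toy data frame; in the stroke data, "N/A" strings in bmi
# become NA once the column is converted with as.numeric()
toy <- data.frame(
  age    = c(67, 61, NA, 49),
  bmi    = c(36.6, NA, NA, 34.4),
  gender = c("Male", "Female", "Male", "Female")
)

# Number of missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Share of missing values per column, as a percentage
na_percent <- round(100 * colMeans(is.na(toy)), 1)
print(na_percent)
```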
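Machine learning: Data Preprocessing / Cleaning====================================
### Data cleaning: replace missing BMI values with the average BMI of each gender
```r
# Average BMI per gender, ignoring missing values
avgbmi <- data %>% group_by(gender) %>% summarise(avg_bmi = mean(bmi,na.rm = TRUE))
avgbmi
# Replace each missing BMI with the average of that patient's gender;
# match() looks up the matching row of avgbmi for every patient
data$bmi <- ifelse(is.na(data$bmi), avgbmi$avg_bmi[match(data$gender, avgbmi$gender)], data$bmi)
```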
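Per-group mean imputation can be checked by hand on a few rows. A minimal sketch in base R with `ave()`, on a hypothetical toy data frame (names and values are made up for illustration):

```r
# Hypothetical toy data with two missing BMI values
toy <- data.frame(
  gender = c("Female", "Female", "Male", "Male", "Female"),
  bmi    = c(20, NA, 30, NA, 24)
)

# Mean BMI of each gender, repeated on every row of that gender
group_mean <- ave(toy$bmi, toy$gender, FUN = function(x) mean(x, na.rm = TRUE))

# Replace only the missing entries with the mean of their own group
toy$bmi_imputed <- ifelse(is.na(toy$bmi), group_mean, toy$bmi)
print(toy)
```

The missing Female value becomes 22 (the mean of 20 and 24) and the missing Male value becomes 30, while observed values stay unchanged.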
Machine learning: Data Preprocessing / Cleaning====================================
### Partition training and testing data
```r
#Partition training and testing data
set.seed(7)
#train: 80% - test 20%
sample_index <- sample(nrow(data),nrow(data)*0.8)
data_train <- data[sample_index,]
data_test <- data[-sample_index,]
```
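Because stroke cases are rare, a purely random split can leave very few of them in the test set. One possible alternative, not used in the talk, is to sample 80% within each class separately. A minimal sketch in base R on hypothetical toy data (`toy`, its size, and its class ratio are assumptions for illustration):

```r
set.seed(7)

# Hypothetical imbalanced outcome: 95 "No" and 5 "Yes"
toy <- data.frame(
  id     = 1:100,
  stroke = factor(rep(c("No", "Yes"), times = c(95, 5)))
)

# Row indices grouped by class
rows_by_class <- split(seq_len(nrow(toy)), toy$stroke)

# Draw 80% of the rows from each class separately
train_index <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), round(0.8 * length(idx)))]
}), use.names = FALSE)

toy_train <- toy[train_index, ]
toy_test  <- toy[-train_index, ]

# Both sets keep the same 95:5 class ratio
print(table(toy_train$stroke))
print(table(toy_test$stroke))
```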
Machine learning: Random Forest====================================
### Initial random forest model
```r
library(randomForest)
forest1 <- randomForest(stroke~.-id,data = data_train,ntree = 1000,mtry = 5)
forest1
```
Machine learning: Optimal Random Forest====================================
### Finding the optimal number of variables to use
```r
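errorvalues <- vector()
# Try mtry (number of variables sampled at each split) from 3 to 10
for (i in 3:10){
temprf <- randomForest(stroke~.-id,data = data_train,ntree = 1000,mtry = i)
# OOB error rate (column 1 of err.rate) after the last tree
errorvalues[i] <- temprf$err.rate[nrow(temprf$err.rate),1]
}
# Positions 1 and 2 stay NA, so the x-axis index equals mtry
plot(errorvalues)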
```
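Instead of reading the best value off the plot, it can be picked programmatically. A minimal sketch with hypothetical error rates (the numbers are invented, not results from the stroke data):

```r
# Hypothetical OOB error rates; positions 1 and 2 are NA because the loop
# starts at mtry = 3, so position i holds the error for mtry = i
errorvalues <- c(NA, NA, 0.049, 0.050, 0.050, 0.051, 0.053, 0.053, 0.054, 0.055)

# which.min() discards NA entries and returns the position of the smallest
# value, which here is also the mtry value
best_mtry <- which.min(errorvalues)
print(best_mtry)
print(errorvalues[best_mtry])
```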
Machine learning: Random Forest====================================
### Creating a new rf model with the optimal number of variables
```r
library(randomForest)
forest2 <- randomForest(stroke~.-id,data = data_train,ntree = 1000,mtry = 3)
forest2
```
Prediction====================================
```r
data_test[5, ] # select one patient (row 5 of the test set)
predict(forest2, newdata=data_test[5,-12]) # drop column 12, assumed to be the stroke outcome
```
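A single prediction says little about overall quality. With the real objects, the predictions for the whole test set would come from `predict(forest2, newdata = data_test)` and be compared with `data_test$stroke`. The sketch below shows that comparison on hypothetical label vectors (made up for illustration), and why accuracy alone can mislead on imbalanced data:

```r
# Hypothetical true labels and predictions for 10 test patients
lv <- c("No", "Yes")
actual    <- factor(c("No", "No", "No", "No", "No", "No", "No", "No", "Yes", "Yes"), levels = lv)
predicted <- factor(c("No", "No", "No", "No", "No", "No", "No", "No", "No",  "Yes"), levels = lv)

# Confusion matrix: rows = predicted class, columns = actual class
conf_mat <- table(predicted = predicted, actual = actual)
print(conf_mat)

# Accuracy: share of all patients classified correctly
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Sensitivity: share of actual stroke cases that were detected
sensitivity <- conf_mat["Yes", "Yes"] / sum(conf_mat[, "Yes"])

print(accuracy)
print(sensitivity)
```

In this toy case accuracy is 0.9 while sensitivity is only 0.5: half of the stroke cases are missed even though most patients are classified correctly.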
YouTube Link
Watch the recording of the meetup session, and subscribe to the R-Ladies Gaborone channel to get notified when new videos are uploaded.