R Data Analysis in Medical Science: Practical Demonstration in RStudio

R
Author

Simisani Ndaba

Published

February 12, 2022


Meetup Details

PhD candidate Mouneem Essabbar continues with this talk on data analysis and some of its applications in medical sciences with a practical demonstration in RStudio.

About Speaker


Abdelmounim ESSABBAR, PhD, works as a research engineer at the Cancer Research Center in Toulouse, France. He earned his PhD in Life and Health Sciences at the Bioinformatics department of the Medical and Pharmacy School in Rabat, Morocco. He teaches Bioinformatics and is a member of the Moroccan COVID-19 Genomic Surveillance.

Contact Speaker


Mouneem on Twitter

Mouneem on LinkedIn

R Data Analysis in Medical Science Practical

The plan for the session was to demonstrate the analysis of biomedical data, applied to stroke prediction and COVID-19. Slides, code and presentation can be found here.

The following Rpresentation shows the presentation and data analysis.

Tip

Copy the following code and paste it into RStudio to display the presentation and see the analysis results.

The 2nd part is available here

MEDICAL DATA ANALYSIS
========================================================
author: Abdelmounim - ESSABBAR
date: 2022-02-10
autosize: true


Medical Data Analysis
====================================
- Epidemiological data 
- Genomic data

Load data
====================================
```r
#setwd('~/medicaldataanalysis')
data = read.csv('https://raw.githubusercontent.com/rladies/meetup_presentations_gaborone/main/R%20Data%20Analysis%20in%20Medical%20Science/strokes.csv')
head(data)
str(data)
```

Structure and variables
====================================
```r
str(data)
```

Data Visualization
====================================
### What is the gender distribution ?
```r
typeof(data$gender)
```
### the variable gender is a character

Data Visualization
====================================
### load GGPLOT2 library
```r
library('ggplot2')
ggplot(data , aes(x = gender) )+
  geom_bar()
```

Data Visualization
====================================
### color (fill) by gender 
```r
library('ggplot2')
ggplot(data , aes(x = gender, fill = gender) )+
  geom_bar()
```



Data Visualization
====================================
### What is the age distribution ?
```r
typeof(data$age)
```
### the variable 'age' is a double (Number)
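As a minimal sketch of what `typeof()` reports, the example below uses made-up vectors (the names `toy_gender`, `toy_age` and `toy_gender_factor` are illustrative and are not columns loaded from the CSV). It also shows that a factor is stored as integer codes, which matters once character columns are converted with `as.factor()`.

```r
# Made-up values standing in for columns of the stroke data
toy_gender <- c("Male", "Female", "Female")
toy_age    <- c(67, 61.5, 80)
toy_gender_factor <- as.factor(toy_gender)

typeof(toy_gender)         # "character"
typeof(toy_age)            # "double"
typeof(toy_gender_factor)  # "integer": factors are stored as integer codes
class(toy_gender_factor)   # "factor"
levels(toy_gender_factor)  # "Female" "Male" (sorted alphabetically)
```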


Data Visualization
====================================
### Visualize
```r
ggplot(data , aes(x = age) )+
  geom_density(fill = 'lightblue')
```


Data Visualization
====================================
### Distribution of age and gender ?
```r
ggplot(data , aes(x = age) )+
  geom_density(fill = 'lightblue')
```

Data Visualization
====================================
### Distribution of age and gender ?
x-axis: age & fill: gender
```r
ggplot(data , aes(x = age) )+
  geom_density(aes(fill = gender), alpha = .25)
```


Data Visualization
====================================
# Does gender relate to strokes?


Data Visualization
====================================
# Does gender relate to strokes?
Compute frequencies:
```r
table_gender_strk = table(data$stroke , data$gender)
table_gender_strk
```
```r
df_gender_strk = as.data.frame(table_gender_strk)
head(df_gender_strk)
```
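To make the shape of the converted table concrete, here is a small sketch on made-up data (the `toy` data frame is an assumption for illustration, not the stroke dataset). When `table()` is called on `df$column` expressions, `as.data.frame()` names the resulting columns `Var1`, `Var2` and `Freq`, with one row per combination of levels.

```r
# Made-up data: six patients, stroke coded 0/1
toy <- data.frame(
  stroke = c(0, 0, 1, 0, 1, 0),
  gender = c("Female", "Male", "Female", "Female", "Male", "Male")
)

toy_table <- table(toy$stroke, toy$gender)  # 2 x 2 contingency table
toy_df    <- as.data.frame(toy_table)       # long format: one row per cell

names(toy_df)  # "Var1" "Var2" "Freq"
toy_df
```

The long format is what `ggplot()` expects: one column per aesthetic, so `Var1` and `Var2` can be mapped to the axes and `Freq` to the fill.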


Data Visualization
====================================
# Does gender relate to strokes?
```r
library(viridis) # load library for colors
ggplot(data =  df_gender_strk, mapping = aes(x = Var1, y = Var2, fill = Freq)) +
  geom_tile( ) +   scale_fill_viridis() +  theme_bw()
```
### Imbalanced data: a high proportion of patients have not had a stroke!
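One way to put a number on the imbalance is to compare class counts and shares. The sketch below uses a made-up outcome vector (`toy_stroke`, with 95 zeros and 5 ones); the proportions are illustrative and are not taken from the stroke dataset.

```r
# Made-up outcome: 1 = stroke, 0 = no stroke
toy_stroke <- c(rep(0, 95), rep(1, 5))

class_counts <- table(toy_stroke)        # absolute counts per class
class_shares <- prop.table(class_counts) # share of each class

class_counts
round(class_shares, 2)

# How many majority-class patients there are per minority-class patient
imbalance_ratio <- max(class_counts) / min(class_counts)
imbalance_ratio
```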


Data Visualization
====================================
# Does age relate to strokes?

Data Visualization
====================================
# Does age relate to strokes?
```r
data$stroke = as.factor(data$stroke) #convert stroke (var) to factor
ggplot(data , aes(x = age) )+
  geom_density(aes(fill = stroke), alpha = .25)
```


Data Visualization
====================================
# How do age + gender relate to strokes?

Data Visualization
====================================
# How do age + gender relate to strokes?
Solution 1: fill = stroke, color = gender
```r
ggplot(data , aes(x = age, color = gender, fill = stroke) )+
  geom_density(alpha = .25)
```

Data Visualization
====================================
# How do age + gender relate to strokes?
Solution 2: facet wrap
```r
ggplot(data , aes(x = age) )+
  geom_density(aes(fill = stroke), alpha = .25)+
  facet_wrap(.~gender)
```

Data Visualization
====================================
# How do age + gender relate to strokes?
Solution 3: facet grid
```r
ggplot(data , aes(x = age) )+
  geom_density(aes(fill = gender), alpha = .25)+
  facet_grid(stroke~gender)
```



====================================
# What about the other parameters ?
```r
str(data)
```

====================================
# Correlation between marriage and strokes?
```r
table_married_strk = as.data.frame(table(data$stroke , data$ever_married))
head(table_married_strk)
```
```r
ggplot(data =  table_married_strk, mapping = aes(x = Var1, y = Var2, fill = Freq)) +
  geom_tile() + labs(x = 'have stroke', y = 'is married') + 
  scale_fill_viridis() +   theme_bw()
```
### Imbalanced data

====================================
# Correlation between marriage and strokes?
## Normalize data by stroke
```r
table_married_strk = table(data$stroke , data$ever_married)
df = prop.table(table_married_strk, margin = 1) # margin = 1 divides each row (stroke class) by its row total, so each row sums to 1
```
```r
table_married_strk = as.data.frame(df)
ggplot(data =  table_married_strk, mapping = aes(x = Var1, y = Var2, fill = Freq)) +
  geom_tile() + labs(x = 'have stroke', y = 'is married') + 
  scale_fill_viridis() +   theme_bw()
``` 
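The effect of the `margin` argument is easiest to see on a tiny table. This sketch uses made-up counts (the matrix `toy_counts` is an assumption, not the real marriage/stroke frequencies): `margin = 1` normalizes within each row, `margin = 2` within each column.

```r
# Made-up counts: rows = stroke (0/1), columns = ever_married (No/Yes)
toy_counts <- matrix(
  c(80, 2, 120, 8), nrow = 2,
  dimnames = list(stroke = c("0", "1"), ever_married = c("No", "Yes"))
)

by_stroke  <- prop.table(toy_counts, margin = 1)  # each row sums to 1
by_married <- prop.table(toy_counts, margin = 2)  # each column sums to 1

by_stroke   # share of married / unmarried within each stroke class
by_married  # share of stroke / no stroke within each marital status
```

Normalizing by row removes the effect of the very different class sizes, so the two stroke classes can be compared on the same 0 to 1 scale.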

====================================
# Impact of weight ? BMI ?
## Distribution by gender :
```r
data$bmi = as.numeric(data$bmi) # bmi is read as text; non-numeric entries such as "N/A" become NA (with a warning)
ggplot(data , aes(x = bmi, fill = gender) )+ 
  geom_density(alpha = .2)
```
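Why a numeric column can arrive as text, and what `as.numeric()` does about it, can be sketched with a few lines of made-up CSV (the string `csv_text` is an assumption for illustration). The example assumes R 4.0 or later, where `read.csv()` keeps strings as character vectors. Declaring the missing-value marker with `na.strings` is an alternative to converting afterwards.

```r
# Made-up CSV text with an "N/A" string in the bmi column
csv_text <- "id,bmi\n1,36.6\n2,N/A\n3,28.1"

# Default import: "N/A" is not a recognized NA marker, so bmi is read as text
raw_import <- read.csv(text = csv_text)
class(raw_import$bmi)

# Converting afterwards turns "N/A" into NA (the warning is silenced here)
bmi_converted <- suppressWarnings(as.numeric(raw_import$bmi))
bmi_converted

# Alternative: tell read.csv() which string marks a missing value
clean_import <- read.csv(text = csv_text, na.strings = "N/A")
class(clean_import$bmi)       # numeric column straight away
sum(is.na(clean_import$bmi))  # number of missing bmi values
```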

Data visualization
====================================
```r
ggplot(data , aes(x = stroke,  y = bmi) )+
  geom_point()
```
## Problem: overlapping points 

Data visualization
====================================
```r
ggplot(data , aes(x = stroke,  y = bmi) )+
  geom_jitter ()
```
### More tuning?

Data visualization
====================================
```r
ggplot(data , aes(x = stroke,  y = bmi) )+
  geom_jitter(alpha = .5) +
  geom_boxplot(outlier.shape = NA)
```

Data visualization
====================================
# How does early obesity affect the likelihood of having a stroke?
```r
str(data)
```

Data visualization
====================================
# How does early obesity affect the likelihood of having a stroke?
```r
ggplot(data , aes(x = age,  y = bmi, color = stroke) )+
  geom_point()+  facet_wrap(.~stroke)
```

Data visualization
====================================
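# How does early obesity affect the likelihood of having a stroke?
```r
# LOESS smoother, requested explicitly: by default geom_smooth() switches to a GAM
# when the largest group has 1,000 or more observations
ggplot(data , aes(x = age,  y = bmi, color = stroke) )+
  geom_point(size = .1)+ facet_wrap(.~stroke) +
  geom_smooth(method = 'loess', se = FALSE)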
```

Data visualization
====================================
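# How does early obesity affect the likelihood of having a stroke?
```r
# LOESS smoother, requested explicitly: by default geom_smooth() switches to a GAM
# when the largest group has 1,000 or more observations
ggplot(data , aes(x = age,  y = bmi, color = stroke) )+
  geom_point(size = .1)+   geom_smooth(method = 'loess', se = FALSE)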
```

Data visualization
====================================
### *What about the other parameters ?*
```r
data[sapply(data, is.character)] <- lapply(data[sapply(data, is.character)],   as.factor) ## The conversion of chr to factors
pairs(data) # produces a matrix of scatterplots
```
#### not very helpful :/

====================================
### *What about the other parameters ?*
## SOLUTION: Machine learning
![ML-MODEL](model.png)

Machine learning model
====================================
```r

library(tidyverse) 

#setwd('~/medicaldataanalysis')
data = read.csv('https://raw.githubusercontent.com/rladies/meetup_presentations_gaborone/main/R%20Data%20Analysis%20in%20Medical%20Science/strokes.csv')

data$gender <- as.factor(data$gender)
data$ever_married <- as.factor(data$ever_married)
data$work_type <- as.factor(data$work_type)
data$Residence_type <- as.factor(data$Residence_type)
data$smoking_status <- as.factor(data$smoking_status)
data$stroke <- factor(data$stroke, levels = c(0,1), labels = c("No","Yes"))
data$heart_disease <- factor(data$heart_disease, levels = c(0,1), labels = c("No", "Yes"))
data$hypertension <- factor(data$hypertension, levels = c(0,1), labels = c("No", "Yes"))
data$bmi <- as.numeric(data$bmi)

str(data)
```

Machine learning: Data Preprocessing / Cleaning
====================================
### Data cleaning: how many "N/A" values are in my dataset per column?
```r
#Plotting variables with null values
library(naniar)
gg_miss_var(data)

```
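The same question can be answered numerically in base R, without a plotting package. A minimal sketch on a made-up data frame (`toy` and its values are assumptions, not the stroke data):

```r
# Made-up data frame with a few missing values
toy <- data.frame(
  age    = c(67, NA, 80, 49),
  bmi    = c(36.6, NA, NA, 34.4),
  gender = c("Male", "Female", "Male", "Female")
)

# is.na() gives a TRUE/FALSE matrix; column sums count the TRUEs
missing_per_column <- colSums(is.na(toy))
missing_per_column

# Share of missing values per column
round(colMeans(is.na(toy)), 2)
```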


Machine learning: Data Preprocessing / Cleaning
====================================
### Data cleaning: replace missing BMI values with the mean BMI of the patient's gender
```r
# mean BMI per gender, ignoring missing values
avgbmi <- data %>% group_by(gender) %>% summarise(avg_bmi = mean(bmi,na.rm = TRUE))
avgbmi

# match() looks up each patient's own gender mean; it is used only where bmi is missing
data$bmi <- ifelse(is.na(data$bmi), avgbmi$avg_bmi[match(data$gender, avgbmi$gender)], data$bmi)
```
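Group-mean imputation depends on every row receiving the mean of its own group. The sketch below checks this on made-up data (`toy`, `group_mean` and `bmi_imputed` are illustrative names), using a base R lookup by group name as one possible way to do it.

```r
# Made-up data: one missing bmi value in the "Female" group
toy <- data.frame(
  gender = c("Female", "Male", "Female", "Male", "Female"),
  bmi    = c(20, 30, NA, 40, 24)
)

# Mean bmi per gender, ignoring missing values (a named vector)
group_mean <- tapply(toy$bmi, toy$gender, mean, na.rm = TRUE)
group_mean

# Look up each row's group mean by name, then fill only the missing entries
row_mean <- unname(group_mean[as.character(toy$gender)])
toy$bmi_imputed <- ifelse(is.na(toy$bmi), row_mean, toy$bmi)
toy
```

The missing value in row 3 receives the mean of the other "Female" rows and is not influenced by the "Male" rows.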


Machine learning: Data Preprocessing / Cleaning
====================================
### Partition training and testing data
```r
#Partition training and testing data
set.seed(7)
#train: 80% - test 20%
sample_index <- sample(nrow(data),nrow(data)*0.8)
data_train <- data[sample_index,]
data_test <- data[-sample_index,]
```
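With a strongly imbalanced outcome, a purely random split can leave very few stroke cases in the test set. A stratified split, sketched below on made-up data, samples 80% within each class so both sets keep the same class proportions. This is an alternative technique shown for illustration, not the split used in the talk; `toy`, `train_index` and the 90/10 class sizes are assumptions.

```r
set.seed(7)

# Made-up data: 90 "No" and 10 "Yes" outcomes
toy <- data.frame(
  id     = 1:100,
  stroke = factor(c(rep("No", 90), rep("Yes", 10)))
)

# Row indices grouped by class, then 80% sampled inside each group
rows_by_class <- split(seq_len(nrow(toy)), toy$stroke)
train_index <- unlist(
  lapply(rows_by_class, function(idx) {
    idx[sample.int(length(idx), floor(0.8 * length(idx)))]
  }),
  use.names = FALSE
)

toy_train <- toy[train_index, ]
toy_test  <- toy[-train_index, ]

table(toy_train$stroke)  # class counts in the training set
table(toy_test$stroke)   # class counts in the test set
```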

Machine learning: Random Forest
====================================
### Initial random forest model
```r
library(randomForest)
forest1 <- randomForest(stroke~.-id,data = data_train,ntree = 1000,mtry = 5)
forest1
```


Machine learning: Optimal Random Forest
====================================
### Finding the optimal number of variables to use
```r
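# out-of-bag (OOB) error for each candidate mtry; positions 1 and 2 stay NA
errorvalues <- rep(NA_real_, 10)
for (i in 3:10){
  temprf <- randomForest(stroke~.-id,data = data_train,ntree = 1000,mtry = i)
  # last row = error after all trees were grown; column 1 = OOB error
  errorvalues[i] <- temprf$err.rate[nrow(temprf$err.rate),1]
}

plot(errorvalues, xlab = 'mtry', ylab = 'OOB error')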
```



Machine learning: Random Forest
====================================
### Creating a new rf model with the optimal number of variables
```r
library(randomForest)
forest2 <- randomForest(stroke~.-id,data = data_train,ntree = 1000,mtry = 3)
forest2
```

Prediction
====================================
```r
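data_test[5, ] # select a patient: row 5 of the test set (data_test[5] would select column 5 instead)
predict(forest2, newdata = data_test[5, -12]) # predict for that patient, leaving out column 12 (expected to be the stroke outcome)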
```
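A single prediction says little about model quality, and on imbalanced data overall accuracy can look good while most stroke cases are missed. As a hedged sketch of how predictions for a whole test set could be summarized, the example below builds a confusion matrix from made-up labels (`actual` and `predicted` are invented for illustration and are not output of `forest2`).

```r
# Made-up labels for ten test patients
actual <- factor(
  c("No", "No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes"),
  levels = c("No", "Yes")
)
predicted <- factor(
  c("No", "No", "No", "No", "No", "No", "Yes", "No", "No", "Yes"),
  levels = c("No", "Yes")
)

# Rows = predicted class, columns = actual class
confusion <- table(predicted = predicted, actual = actual)
confusion

accuracy    <- sum(diag(confusion)) / sum(confusion)             # share of correct predictions
sensitivity <- confusion["Yes", "Yes"] / sum(confusion[, "Yes"]) # share of actual strokes detected
specificity <- confusion["No", "No"] / sum(confusion[, "No"])    # share of non-strokes correctly cleared

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```

In this toy case the accuracy is 0.7 while only one of the three stroke patients is detected, which is why sensitivity is worth reporting alongside accuracy.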