knitr::opts_chunk$set(echo=TRUE, root.dir = "../")
source("Rsource/Hub_microbiome_workshop_source.R")
Increasingly, microbiome researchers are looking for ways to use machine learning to filter through OTUs and associate them with phenotypic metadata. Doing that while also accounting for the metadata covariates is a complicated task.
One recent approach, SIAMCAT, abstracts away much of the machine learning and streamlines the interface, so that most of the work left to you is interpreting the (in my opinion) somewhat complex output figures.
This is a condensed version of the SIAMCAT vignette.
https://bioconductor.org/packages/release/bioc/vignettes/SIAMCAT/inst/doc/SIAMCAT_vignette.html
We will use the example dataset that is part of the SIAMCAT package itself.
Fundamentally, SIAMCAT needs two inputs: a feature table (OTU or species abundances per sample) and a table of sample metadata.
A label is created for the main outcome variable in the metadata, and the remaining variables are used as covariates in the downstream models.
library(SIAMCAT)
data("feat_crc_zeller", package="SIAMCAT")
data("meta_crc_zeller", package="SIAMCAT")
#2 objects in memory now, feat.crc.zeller and meta.crc.zeller
#create label, Case is the affected category (Control is the baseline)
label.crc.zeller <- create.label(meta=meta.crc.zeller,
label='Group', case='CRC')
Here we load SIAMCAT in the simplest case, where each of the inputs is a data frame. Another standard approach is to convert phyloseq objects, which bundle the OTU table and metadata into one object. See the second vignette, on inputs, for the full breadth of options.
https://bioconductor.org/packages/release/bioc/vignettes/SIAMCAT/inst/doc/SIAMCAT_read-in.html
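As a hedged sketch of the phyloseq route (assuming a phyloseq object named `ps` whose sample data contains a 'Group' column; check the read-in vignette above for the exact constructor arguments, which may differ by SIAMCAT version):

```r
library(SIAMCAT)
# sketch only: `ps` is a hypothetical phyloseq object with OTU table
# and sample_data() already combined; SIAMCAT builds the label from
# the named metadata column, with 'CRC' as the case class
sc <- siamcat(phyloseq = ps, label = 'Group', case = 'CRC')
```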
The basic logic of SIAMCAT is that after you create the initial siamcat object, you add downstream results to that same object. They are stored in separate S4 slots, so you can retrieve the original data as well as the downstream results.
siamcat <- siamcat(feat=feat.crc.zeller,
label=label.crc.zeller,
meta=meta.crc.zeller)
## + starting validate.data
## +++ checking overlap between labels and features
## + Keeping labels of 141 sample(s).
## +++ checking sample number per class
## +++ checking overlap between samples and metadata
## + finished validate.data in 0.04 s
#do some quick filtering to remove low-abundance OTUs
siamcat <- filter.features(siamcat,
filter.method = 'abundance',
cutoff = 0.001)
## Features successfully filtered
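The idea behind the 'abundance' filter is to keep only features whose maximum relative abundance across samples exceeds the cutoff. A minimal base-R sketch of that rule (the toy matrix and OTU names are illustrative, not SIAMCAT's actual implementation):

```r
# toy feature matrix: rows = OTUs, columns = samples, values = relative abundances
feat <- rbind(
  otu_common = c(0.20, 0.30, 0.25),
  otu_rare   = c(0.0005, 0.0002, 0.0008)
)
cutoff <- 0.001
# keep OTUs whose maximum abundance in any sample exceeds the cutoff
keep <- apply(feat, 1, max) > cutoff
feat.filtered <- feat[keep, , drop = FALSE]
rownames(feat.filtered)  # only otu_common survives
```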
The next steps normalize the data, create cross-validation splits, and plot the association between the outcome variable and individual OTUs.
#start with some data normalization
# log normalize the OTUs
siamcat <- normalize.features(
siamcat,
norm.method = "log.unit",
norm.param = list(
log.n0 = 1e-06,
n.p = 2,
norm.margin = 1
)
)
## Features normalized successfully.
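Roughly, the 'log.unit' method log10-transforms the abundances after adding the pseudocount log.n0 (so zeros are defined), then scales by a p-norm with n.p = 2 (Euclidean) along norm.margin = 1 (feature-wise). A base-R sketch of that transform under those assumptions, not SIAMCAT's exact code:

```r
# toy matrix: rows = OTUs, columns = samples
feat <- matrix(c(0.2, 0.0, 0.05, 0.3, 0.01, 0.44), nrow = 2,
               dimnames = list(c("otu1", "otu2"), c("s1", "s2", "s3")))
log.n0 <- 1e-06
# log-transform with a pseudocount so zero abundances stay finite
feat.log <- log10(feat + log.n0)
# scale each row (feature) by its Euclidean (n.p = 2) norm
row.norms <- apply(feat.log, 1, function(x) sqrt(sum(x^2)))
feat.norm <- feat.log / row.norms
# every feature row now has unit length
apply(feat.norm, 1, function(x) sqrt(sum(x^2)))
```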
#split the data into training and testing partitions
#the train dataset is used for training/tuning the model
#predictive performance is then assessed against the held-out test splits
siamcat <- create.data.split(
siamcat,
num.folds = 5,
num.resample = 2
)
## Features splitted for cross-validation successfully.
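Conceptually, num.folds = 5 with num.resample = 2 means the samples are shuffled into five folds twice over, so every sample is held out exactly once per resampling round. A base-R sketch of one such split (illustrative only, not SIAMCAT's internals):

```r
set.seed(42)
n <- 141                    # samples in the example dataset
num.folds <- 5
num.resample <- 2
splits <- lapply(seq_len(num.resample), function(r) {
  # randomly assign each sample to one of five folds
  fold <- sample(rep(seq_len(num.folds), length.out = n))
  # for each fold, the held-out samples form that fold's test set
  lapply(seq_len(num.folds), function(k) which(fold == k))
})
# within one resampling, the five test folds partition all 141 samples
length(unlist(splits[[1]]))
```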
#test for and plot associations between OTUs and the label
siamcat <- check.associations(
siamcat,
sort.by = 'fc',
alpha = 0.05,
mult.corr = "fdr",
detect.lim = 10 ^-6,
plot.type = "quantile.box",
panels = c("fc", "prevalence", "auroc"))
## ### WARNING: Not plotting to a pdf-file.
## ### The plot is optimized for landscape DIN-A4 (or similar) layout.
## ### Please make sure that your plotting region is large enough!!!
## ### Use at your own risk...
## Are you sure that you want to continue? (Yes/no/cancel)
SIAMCAT provides a simplified interface to machine learning, as compared to caret/scikit-learn, where you have more control over the options but also need to know more about the hyperparameters, etc.
Here, you mostly just select one of a few popular model types, each with its own built-in approach to feature selection.
The full list of models is: method = c("lasso", "enet", "ridge", "lasso_ll", "ridge_ll", "randomForest")
#train model
siamcat <- train.model(
siamcat,
method = "lasso"
)
## Trained lasso models successfully.
#make predictions
siamcat <- make.predictions(siamcat)
## Made predictions successfully.
pred_matrix <- pred_matrix(siamcat)
There are two final plots. evaluate.predictions() tells us how well the model performed, using ROC curves (Charley can explain maybe?).
#plot the predictions as an ROC curve
siamcat <- evaluate.predictions(siamcat)
## Evaluated predictions successfully.
model.evaluation.plot(siamcat)
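The AUROC summarized by this plot equals the probability that a randomly chosen case receives a higher prediction score than a randomly chosen control. A from-scratch base-R illustration with made-up scores (not SIAMCAT output):

```r
# toy predicted scores: higher = more CRC-like
scores.case    <- c(0.9, 0.8, 0.25)
scores.control <- c(0.3, 0.2, 0.1)
# AUROC = fraction of case/control pairs ranked correctly (ties count half)
pairs <- outer(scores.case, scores.control, FUN = "-")
auroc <- mean((pairs > 0) + 0.5 * (pairs == 0))
auroc  # 8 of 9 pairs ranked correctly
```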
The most complex plot output is the interpretation plot; there is just a lot of data to associate.
There are four basic panels.
#if you want to print out to a pdf, use
#fn.plot as an option
#eg fn.plot="interpretation.pdf"
#actually, this is so big I think we need to print to a pdf
model.interpretation.plot(
siamcat,
consens.thres = 0.5,
limits = c(-3, 3),
heatmap.type = 'zscore',
prompt=FALSE,
fn.plot="Rfigs/siamcat_interpret.pdf"
)
## Successfully plotted model interpretation plot to: Rfigs/siamcat_interpret.pdf
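The consens.thres = 0.5 option keeps only features that received a non-zero weight in at least half of the trained models (10 here: 5 folds x 2 resamplings). A base-R sketch of that selection rule (toy weight matrix and OTU names, not actual model output):

```r
# toy weight matrix: rows = OTUs, columns = the 10 cross-validation models
weights <- rbind(
  otu_robust   = c(rep(0.4, 8), 0, 0),   # non-zero in 8/10 models
  otu_unstable = c(0.2, 0.1, rep(0, 8))  # non-zero in 2/10 models
)
consens.thres <- 0.5
# fraction of models in which each OTU got a non-zero weight
sel.frac <- rowMeans(weights != 0)
names(sel.frac)[sel.frac >= consens.thres]  # only otu_robust passes
```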