knitr::opts_chunk$set(echo=TRUE, root.dir = "../")
source("Rsource/Hub_microbiome_workshop_source.R")

Increasingly, microbiome researchers are looking at ways to use machine learning in order to filter through OTUs and associate them with phenotypic metadata. Doing that while also accounting for the metadata covariates is a complicated task.

One recent approach, SIAMCAT, blackboxes much of the machine learning and streamlines the interface so that you mostly have to deal with interpreting the somewhat complex output figures (in my opinion).

This is a condensed version of the SIAMCAT vignette.

https://bioconductor.org/packages/release/bioc/vignettes/SIAMCAT/inst/doc/SIAMCAT_vignette.html

We will use the example dataset that is part of the SIAMCAT package itself.

initial loading of data

Fundamentally, SIAMCAT needs two datasets.

  1. an OTU table (converted to relative abundance)
  2. a table of metadata to associate the OTUs with

A label is created for the main outcome variable in the metadata, and the rest are used as covariates in the downstream models.

library(SIAMCAT)

data("feat_crc_zeller", package="SIAMCAT")
data("meta_crc_zeller", package="SIAMCAT") 
#2 objects in memory now, feat.crc.zeller and meta.crc.zeller

#create label, Case is the affected category (Control is the baseline)
label.crc.zeller <- create.label(meta=meta.crc.zeller,
    label='Group', case='CRC')

create siamcat object

Here we load SIAMCAT using the simplest case where each of the inputs are dataframes. Another standard approach would be to convert phyloseq objects, which have the otu/metadata integrated into one object. See the second vignette about inputs for the full breadth of options.

https://bioconductor.org/packages/release/bioc/vignettes/SIAMCAT/inst/doc/SIAMCAT_read-in.html

siamcat object

The basic logic to siamcat is that after you create the initial siamcat, you then add downstream results to the original object. They are placed in separate S6 slots, you can retrieve the original data as well as the downstream results.

siamcat <- siamcat(feat=feat.crc.zeller,
    label=label.crc.zeller,
    meta=meta.crc.zeller)
## + starting validate.data
## +++ checking overlap between labels and features
## + Keeping labels of 141 sample(s).
## +++ checking sample number per class
## +++ checking overlap between samples and metadata
## + finished validate.data in 0.04 s
#do some quick filtering to remove low-abundance OTUs

siamcat <- filter.features(siamcat,
    filter.method = 'abundance',
    cutoff = 0.001)
## Features successfully filtered

Check associations between OTUs and metadata

This plot shows the association between the outcome variable and OTUs.

#start with some data normalization
# log normalize the OTUs

siamcat <- normalize.features(
    siamcat,
    norm.method = "log.unit",
    norm.param = list(
        log.n0 = 1e-06,
        n.p = 2,
        norm.margin = 1
    )
)
## Features normalized successfully.
#split the data into training and testing partitions
#the train dataset is used for training/tuning the model
#how well it predicts it used by comparing it with the test splits

siamcat <-  create.data.split(
    siamcat,
    num.folds = 5,
    num.resample = 2
)
## Features splitted for cross-validation successfully.
#plot the model
siamcat <- check.associations(
    siamcat,
    sort.by = 'fc',
    alpha = 0.05,
    mult.corr = "fdr",
    detect.lim = 10 ^-6,
    plot.type = "quantile.box",
    panels = c("fc", "prevalence", "auroc"))
## ### WARNING: Not plotting to a pdf-file.
## ### The plot is optimized for landscape DIN-A4 (or similar) layout.
## ### Please make sure that your plotting region is large enough!!!
## ### Use at your own risk...
## Are you sure that you want to continue? (Yes/no/cancel)