knitr::opts_chunk$set(echo=TRUE, root.dir = "../")
#knitr::opts_knit$set(root.dir = "../")
source("Rsource/Hub_microbiome_workshop_source.R")

So far, we’ve characterized several aspects of community-composition and made comparisons between samples. These descriptive studies are an important first step of microbiome data-analyses; they stop short of delivering meaningful functional data however, such as the biological pathways active in these communities, the products of which may be actual drivers of our phenotypes of interest (e.g., obese vs. lean).

16S data is inherently limited for assessing function, but we can extrapolate some information using phylogenetic tools. In our next session, we’ll be using our output differential abundance table (“Routput/differential_taxa_abundance.csv”) to extrapolate functional data.

*** For logistical purposes (because the software required for this part is linux-based, not R), we’ve run PICRUSt2 on our output data already for you. We’ve provided the commands we used below. We can advise you in more detail should you need to run these steps for your own work.***

Introduction for analyzing Picrust2 in a statistical framework

Basics of this are sort of light in the official Aldex tutorial, which frames in the more general RNAseq/whatever. Which, according to their philosphy, should work the same way.

Basically, the Aldex way says that everything counts should be scaled relative (even rnaseq), and then you can use some abundance methods they developed to do comparative testing.

The fast and easy ways are essentially t-tests for 2 factor designs, etc.

The full on, fully modeled way is using a glm, which is apparently quite slow, but lets the variable tested be multivariate. It tests in a one-against-all fashion if I remember correctly (link to docs)

The code to confirm the logic of hooking up the picrust2 outputs (and which) to the aldex come from

https://github.com/gavinmdouglas/picrust2_manuscript/blob/master/scripts/analyses/hmp2/hmp2_aldex2_disease_state_and_enriched.R

take the normalized pathway abundance file (seqtab_norm.tsv?)

to run:

conda activate picrust2

#the default output uses less memory, but outputs in a long format
#we would have to spread the data to get it back to wide (like an otu table)
#time picrust2_pipeline.py -s seqs.fna -i table.biom -o picrust_ex -p 10 --per_sequence_contrib --stratified --coverage
#add --wide-table so you don't have to spread from the data
time picrust2_pipeline.py -s seqs.fna -i table.biom -o picrust_ex -p 10 --per_sequence_contrib --stratified --coverage --wide_table

Note! the stratified tables, which are what we want, will increase the run time and outputfile size considerably (since they are the full ASV matrices)

https://github.com/chrismitbiz/metagenomepredictionanalysis/blob/master/16smetagenomepredanalysis.Rmd

see here to get to the RDS used later on by the picrust2 MS https://github.com/gavinmdouglas/picrust2_manuscript/blob/4f6eaf92333d2f4e3982501f41eaa2f900f91d28/scripts/analyses/hmp2/hmp2_prep_input.R

also:

https://rpubs.com/chrisLaTrobe/Picrust_PCA_KEGG

though the aldex code results are elided

preprocessing data for aldex

The main things to do are to drop out the pathway/sequence columns, merge them into rownames. You only need to do this with the stratified data (which splits out the effect of pathway X otu).

Unstratified looks at a whole-pathway level, so is more summarised.

General note, I prefer using readr to read in data, but we need to convert all the data from tibbles to dataframes to properly use the rownames. They have a habit of getting dropped if the tibble is modified.

library(ALDEx2)
library(readr)
library(dplyr)
#wide file?
#this is taken from the prep_input code
df=read_tsv("Rdata/picrust_ex/pathways_out/path_abun_strat.tsv")
dfs=as.data.frame(df)
rownames(dfs)=paste(df$pathway,df$sequence,sep="|")
dfs=dfs[,-which(colnames(dfs) %in% c("pathway","sequence"))]
dfout=as.data.frame(df)
dfus=read_tsv("Rdata/picrust_ex/pathways_out/path_abun_unstrat.tsv")
dfus=as.data.frame(dfus)
rownames(dfus)=dfus$pathway
dfus=dfus[,-which(colnames(dfus)=="pathway")]
#nothing in the pathway, look in the kegg?
dfk=read_tsv("Rdata/picrust_ex/KO_metagenome_out/pred_metagenome_unstrat.tsv")
dfk=as.data.frame(dfk)
rownames(dfk)=dfk$`function`
dfk=dfk[,-which(colnames(dfk)=="function")]
rownames(dfout)=paste(dfout$pathway,dfout$sequence,sep="|")
dfout=dfout[,-which(colnames(dfout) %in% c("pathway","sequence"))]
df_tmp= df[, -which(colnames(df) == "sequence")]
df_abun=aggregate(.~pathway,data=df_tmp,FUN=sum)
rownames(df_abun)=df_abun$pathway
df_abun=df_abun[,-which(colnames(df_abun)=="pathway")]
#convert to relative abundance
df_rabun=data.frame(sweep(df_abun,2,colSums(df_abun),"/"),check.names=F)
# asin transformation
asinT=function(x) asin(sqrt(x))
df_rabun2=df_rabun %>% mutate_all(asinT)
#finally, round them so they
df_rrabun=round(df_rabun)

exploratory analyses with phyloseq/pca

Before we go into the real-deal modeling, lets quickly check if the kegg/pathways cluster by strain or facility. We get to see the null results in real time.

The picrust2 notes that the example data is “greatly simplified” for the picrust2 demo, presumably mainly on the amount of metadata.

The preprint for the paper states what we will shortly show. https://peerj.com/articles/5494/

A lot of this code is directly from https://rpubs.com/chrisLaTrobe/Picrust_PCA_KEGG

Thanks Chris LaTrobe!

library(phyloseq)
library(microbiome)
mdata=read_tsv("Rdata/metadata.tsv")
kotu=otu_table(dfk,taxa_are_rows=TRUE)
md=as.data.frame(mdata)
rownames(md)=mdata$SampleID
md=md[,-which(colnames(md)=="SampleID")]
md=sample_data(md)
kegg=phyloseq(kotu,md)
pwo=otu_table(dfout,taxa_are_rows=TRUE)
pw=phyloseq(pwo,md)
#kegg is the kegg table
#pw is the pathway table (much higher level)
pw_clr=microbiome::transform(pw,"clr")
kegg_clr=microbiome::transform(kegg,"clr")

basic ordination of kegg/pathway data

Ordination / dimension reduction is basically the same as when done with regular phyloseq and “regular” OTU tables. Only thing different is that you aren’t working with a taxa table.

#rda is one of many ordination methods
ord_pw=phyloseq::ordinate(pw_clr,"RDA")
ord_kegg=phyloseq::ordinate(kegg_clr,"RDA")
#get the top eigenvalues of the first few PC axes
head(ord_pw$CA$eig)
##        PC1        PC2        PC3        PC4        PC5        PC6 
## 10248.7200  1073.2851   477.5336   354.4748   259.2237   182.5990
#transform to percent
sapply(ord_pw$CA$eig[1:8], function(x) x / sum(ord_pw$CA$eig))
##         PC1         PC2         PC3         PC4         PC5         PC6 
## 0.777307832 0.081402648 0.036218239 0.026884921 0.019660661 0.013849106 
##         PC7         PC8 
## 0.010101279 0.007011489
sapply(ord_kegg$CA$eig[1:8], function(x) x / sum(ord_kegg$CA$eig))
##        PC1        PC2        PC3        PC4        PC5        PC6        PC7 
## 0.57121844 0.14500097 0.09936569 0.06640698 0.04023323 0.02199517 0.01960665 
##        PC8 
## 0.01415656
#ok, so most variation is in the first 2 axes for pathway
# 3-4 axes for kegg
p=plot_ordination(pw,ord_pw,type="samples",color="Facility",shape="Genotype")
p=p+geom_polygon(aes(fill=Genotype))
#much larger separation by Facility, genotype seems to overlap

fancy-pants statistical modeling with Aldex2

Aldex2 provides a way to model compositional (count) data, using a Welch test or Kruskal-Wallis for simple comparisons (two types), and a full blown general linear model (glm) for multiple variables at the same time (though I ran into issues getting the effect-size and plotting).

The other key thing to note is that picrust2 outputs fractional counts, so some rounding needs to be done or most of these programs will break.

If you have continuous metadata, you can use the aldex.corr function. For categorical data, use aldex

The aldex function performs 3 steps:

1. some MC to get a sense of the Dirichlet distribution and then convert using a center-log transform.
2. aldex.ttest (or aldex.kw, aldex.glm) to test for differences.
3. aldex.effect to get effect sizes for the rows

In the first plots, you’ll see nothing is significant (no red dots).

When we redo the plots by Facility, lots of things are significant. Looking back at the paper, there’s an inherent Facility/Strain batch-effect that isn’t possible to unravel (all samples of a strain (KO and WT) were sequenced at the same facility).

A reminder that its probably worth it to talk to someone about your design before you actually sequence/do the experiment.

conds=mdata$Genotype
#kegg
x.kf=aldex(round(dfk),conds,effects=T,denom="all")
## aldex.clr: generating Monte-Carlo instances and clr values
## operating in serial mode
## computing center with all features
## aldex.ttest: doing t-test
## aldex.effect: calculating effect sizes
#unstratified pathway
x.pwu=aldex(round(dfus),conds,effects=TRUE,denom="all")
## aldex.clr: generating Monte-Carlo instances and clr values
## operating in serial mode
## computing center with all features
## aldex.ttest: doing t-test
## aldex.effect: calculating effect sizes
#stratified pathway
x.pws=aldex(round(dfs),conds,effects=TRUE)
## aldex.clr: generating Monte-Carlo instances and clr values
## operating in serial mode
## computing center with all features
## aldex.ttest: doing t-test
## aldex.effect: calculating effect sizes
#x.all=aldex(round(dfs),conds,effects=TRUE)
#x.all <- aldex(selex.sub,conds,mc.samples=16,test="t",effect=TRUE,include.samples.summary=FALSE,demon="all",verbose=FALSE)
par(mfrow=(c(3,2)))
aldex.plot(x.kf,type="MA",test="welch",xlab="Log-ratio abundance",
           ylab="Difference")
aldex.plot(x.kf,type="MW",test="welch",xlab="Dispersion",
           ylab="Difference")
aldex.plot(x.pwu,type="MA",test="welch",xlab="Log-ratio abundance",
           ylab="Difference")
aldex.plot(x.pwu,type="MW",test="welch",xlab="Dispersion",
           ylab="Difference")
aldex.plot(x.pwu,type="MA",test="welch",xlab="Log-ratio abundance",
           ylab="Difference")
aldex.plot(x.pwu,type="MW",test="welch",xlab="Dispersion",
           ylab="Difference")

#nothing is significant
#redo with conds=facility
conds=mdata$Facility
x.kf=aldex(round(dfk),conds,effects=T)
## aldex.clr: generating Monte-Carlo instances and clr values
## operating in serial mode
## computing center with all features
## aldex.ttest: doing t-test
## aldex.effect: calculating effect sizes
x.pwu=aldex(round(dfus),conds,effects=T)
## aldex.clr: generating Monte-Carlo instances and clr values
## operating in serial mode
## computing center with all features
## aldex.ttest: doing t-test
## aldex.effect: calculating effect sizes
x.pws=aldex(round(dfs),conds,effects=T)
## aldex.clr: generating Monte-Carlo instances and clr values
## operating in serial mode
## computing center with all features
## aldex.ttest: doing t-test
## aldex.effect: calculating effect sizes
par(mfrow=(c(3,2)))
aldex.plot(x.kf,type="MA",test="welch",xlab="Log-ratio abundance",
           ylab="Difference")
aldex.plot(x.kf,type="MW",test="welch",xlab="Dispersion",
           ylab="Difference")
aldex.plot(x.pwu,type="MA",test="welch",xlab="Log-ratio abundance",
           ylab="Difference")
aldex.plot(x.pwu,type="MW",test="welch",xlab="Dispersion",
           ylab="Difference")
aldex.plot(x.pwu,type="MA",test="welch",xlab="Log-ratio abundance",
           ylab="Difference")
aldex.plot(x.pwu,type="MW",test="welch",xlab="Dispersion",
           ylab="Difference")

visualizing the most significant pathways

Now, let’s imagine that the Facility test was something meaningful, for purposes of examples.

Let’s see how to pull out the top significant pathways for further exploration.

The pathways come from biocyc/metacyc, so you can search for the actual names there.

xsig=x.pwu[x.pwu$we.eBH<0.05,]
xsig$path=rownames(xsig)
xsig2=arrange(xsig,desc(abs(effect)))
xsig3=xsig2[1:10,] #just the first 10



print(nrow(xsig)/nrow(x.pwu)) #percent significant pathways
## [1] 0.7230769
p=ggplot(xsig3,aes(x=path,y=effect))+geom_bar(stat="identity")+coord_flip()+theme_classic()
p

glm model for multivariate (not quite working)

Here is the general framework for doing analyses with a glm. Note that it takes much longer than the other tests, (<1 minute for 2 variable vs 10+ minutes for glm).

We can pull out the most significant features, but we don’t get effect size, that’s only calculated right now for conditions that have 2 groups, which can be run through the t-test module.

Can run this if you want, it’s only 3 minutes with a reduced number of Monte Carlo samples.

There’s a similar plot to pull out the significant genotype-specific pathways, based on the genotype p value in the model, and plotting the top pathways based on the effect sizes from that.

If you have complex interactions, this is probably the way to go.

dm=model.matrix(~Genotype+Facility,mdata)
#can do complex models with multiple terms with glm
#it is SLOWWWWW
#for this smallish dataset, it took 10+ minutes
library(tictoc)
tic()
x=aldex.clr(round(dfus),dm,denom="all",mc.samples=32)
## checking for condition length disabled!
## using all features for denominator
## operating in serial mode
## computing center with all features
#NOT RUN, aldex.glm
x.glm=aldex.glm(x,dm)
## running tests for each MC instance:
## |------------(25%)----------(50%)----------(75%)----------|
toc()
## 8.237 sec elapsed
#filtering on the model.genotypeWT, but hard to select
#based on how 
glmsig=x.glm[x.glm[,15]<.05,]
glmsig$path=rownames(glmsig)
colnames(glmsig)[5]="geno_est"
glmsig2=arrange(glmsig,desc(abs(geno_est)))

glmsig3=glmsig2[1:10,]
p=ggplot(glmsig3,aes(x=path,y=geno_est))+geom_bar(stat="identity")+coord_flip()+theme_classic()
print(p)

#x.eff=aldex.effect(x)
#par(mfrow=(c(1,2)))
#plot(x.glm[,"model.GenotypeWT Pr(>|t|).BH"], x.all$we.eBH, log="xy",xlab="glm model Genotype", ylab="Welch's t-test")
#plot(glm.test[,"model.FacilityPaloAlto Pr(>|t|).BH"], x.all$we.eBH, log="xy",xlab="glm model Facility", ylab="Welch's t-test")