Pipeline Overview

GOAL: The purpose of this analysis is to obtain an Amplicon Sequence Variant (ASV) table for all of our example microbiome samples.

The present analysis is based on the dada2 package in R. For more details, please follow the tutorial at https://benjjneb.github.io/dada2/tutorial.html

INPUT DATA: We will start with demultiplexed fastq files for all samples. This analysis is for paired-end data. Thus, for each sample, there will be two files, named according to Illumina platform conventions:

  1. Forward-reads, named *_R1_001.fastq
  2. Reverse-reads, named *_R2_001.fastq

Project Organization

Before we begin, let’s take a moment to get organized. Documentation and good record-keeping are essential to producing high-quality and reproducible computational analyses, just as they are at the bench!

We recommend you keep your analyses organized by project (just as we organized this example). Looking around:

  • Rdata: this folder contains our input .fastq.gz files and our input database of 16S-sequences that we’ll use to identify taxa present in our samples.

  • Ranalysis: this folder contains any scripts we create to analyze our data, like this R-Markdown (.Rmd) document.

  • Routput: we will direct any output data-files from our analyses to this folder.

  • Rfigs: we will direct any figures we generate from our analyses to this folder.

  • Rsource: this folder contains any R source-scripts we create to set up our environment for our analyses: custom functions, which packages to load, and so on. You don’t need to worry about this one since we made it for you.

    You can think of the files in Rsource as set-up scripts: just source them at the beginning of your session and forget about them.

dada2 Pipeline

LOADING AND CLEANING DATA

This code calls a source script from the Rsource folder that installs the required packages.

source("Rsource/source.R")

Let’s load the appropriate R packages (dada2 and a few others) for the analysis. All packages are already installed.

Important: make sure to note the package-version of each package you’re using!

packageVersion("dada2")
## [1] '1.16.0'
packageVersion("GUniFrac")
## [1] '1.1'
packageVersion("simEd")
## [1] '2.0.0'
packageVersion("tictoc")
## [1] '1.0'
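
Helpful tip: to record the versions of every loaded package in one shot, you can also write the full session information to a file (the file name below is just a suggestion):

# Save the R version and all loaded package versions for your records:
writeLines(capture.output(sessionInfo()), "Routput/session_info.txt")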

Now, we separate our forward and reverse reads

# assign the path to the directory containing all the fastq files to "demo_microbiome_fasqfiles"
demo_microbiome_fasqfiles <- "Rdata/fastq"

# Check that the fastq directory contains all eight of our fastq files
list.files(demo_microbiome_fasqfiles)
## [1] "demo1_R1_001.fastq" "demo1_R2_001.fastq" "demo2_R1_001.fastq"
## [4] "demo2_R2_001.fastq" "demo3_R1_001.fastq" "demo3_R2_001.fastq"
## [7] "demo4_R1_001.fastq" "demo4_R2_001.fastq"
# Now make two variables (demo_F, demo_R) to separate and store the forward and reverse reads
demo_F <- sort(list.files(demo_microbiome_fasqfiles,
                          pattern="_R1_001.fastq", # gets the file names iw/ this expression
                          full.names = TRUE))      # outputs the full directory, not just the file

demo_R <- sort(list.files(demo_microbiome_fasqfiles,
                          pattern="_R2_001.fastq",
                          full.names = TRUE))

# Extract filenames of all samples for future steps in the analysis
demo_samplenames <- sapply(strsplit(basename(demo_F), # basename() grabs filename of the path
                                    "_"),'[', 1) # separates string by "_" and takes characters before the first "_"

JO NOTE: We’ll be moving at a fast pace and you’ll be encountering lots of new functions. Remember you can type ?function_name() into the console at any time to pull up documentation explaining what a function does and its available options/parameters. The information will be dense, but helpful!

EVALUATING AND FILTERING DATA

Let’s check our data-quality by making plots and viewing them directly in RStudio. Your plots will appear in the RStudio “Plots” pane to the lower-right.

# Plot a visual summary of the distribution of quality scores of our forward-reads:
plotQualityProfile(demo_F,  # file path to forward fastq files
                   n=1e+06) # number of records we want to sample

# Plot a visual summary of the distribution of quality scores of our reverse-reads.
plotQualityProfile(demo_R,
                   n=1e+06)
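
When reading these plots, remember that a Phred quality score Q corresponds to a per-base error probability of 10^(-Q/10). A tiny helper makes the conversion concrete (phred_to_p is our own illustrative function, not part of dada2):

# Convert Phred quality scores into per-base error probabilities:
phred_to_p <- function(q) 10^(-q / 10)
phred_to_p(c(20, 30, 40)) # 0.01 0.001 1e-04, i.e. 1 error in 100 / 1,000 / 10,000 bases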

Let’s also output the plots as .pdf files so we can view them later. They’ll be saved in the Rfigs directory.

Helpful tip: save to file any data or figures you generate in R that you want to keep; they are not saved automatically when you quit RStudio, and you’ll otherwise have to regenerate them!

## save data-quality plot for forward-reads:
pdf("Rfigs/demo_F_quality.pdf",
    width = 12,
    height = 8,
    pointsize = 8)
plotQualityProfile(demo_F[1:4],
                   n=1e+06)
dev.off()
## quartz_off_screen 
##                 2
# save the data-quality plot for our reverse-reads:
pdf("Rfigs/demo_R_quality.pdf",
    width = 12,
    height = 8,
    pointsize = 8)
plotQualityProfile(demo_R[1:4],
                   n=1e+06)
dev.off()
## quartz_off_screen 
##                 2

The next step is to filter the sequences appropriately; the right parameters will depend on your data.

Conceptually, we will discard the bad reads, trim the ends of the good reads, and then save the trimmed good reads to a new directory.

# The first step here is to specify the path and name the output-files to which the good sequences will be written. 
  # the directory and output-files we specify here will be created in the next step (filterAndTrim).
demo_goodF <- file.path("Routput/demo_good_filtered",
                        paste0(demo_samplenames,
                               "F_good.fastq.gz"))
demo_goodR <- file.path("Routput/demo_good_filtered",
                        paste0(demo_samplenames,
                               "R_good.fastq.gz"))
names(demo_goodF) <- demo_samplenames
names(demo_goodR) <- demo_samplenames

Now we perform the very important step of filtering and trimming each fastq file. We also want to know how long the most important steps take, so we’ll time them using the R package ‘tictoc’.

These parameters are flexible and should depend on your data!

tic(msg = NULL, quiet = TRUE) #  Starts the timer and stores the start time and message.

demo_good_proper <- filterAndTrim(demo_F,      # paths to forward fastq files
                                  demo_goodF,  # paths to output forward filtered files
                                  demo_R,      # paths to reverse fastq files
                                  demo_goodR,  # paths to output reverse filtered files
                                  trimLeft = c(17, 21), # number of nucleotides to remove from the start of each read
                                  truncLen = c(145, 135), # truncate reads to this length; shorter reads are discarded
                                  rm.phix = TRUE, # discard reads that match against the phiX genome
                                  truncQ = 2, # truncates reads at the first instance of a quality score less than or equal to '2'
                                  maxN = 0, # after truncation, sequences with more than '0' Ns (ambiguous bases) will be discarded
                                  minQ = 1, # after truncation, reads containing a quality score < '1' will be discarded
                                  maxEE = c(2, 4), # after truncation, reads with higher than maxEE "expected errors" will be discarded
                                  n = 1e+5, # number of reads to read in and filter at any one time
                                  compress = TRUE, # output fastq file(s) are gzipped
                                  verbose = TRUE) # whether to output status messages
## Creating output directory: Routput/demo_good_filtered
## Read in 264588 paired-sequences, output 264212 (99.9%) filtered paired-sequences.
## Read in 121238 paired-sequences, output 121038 (99.8%) filtered paired-sequences.
## Read in 405982 paired-sequences, output 405684 (99.9%) filtered paired-sequences.
## Read in 138673 paired-sequences, output 138191 (99.7%) filtered paired-sequences.
# save the output of previous step (a summary table indicating how many reads there were for each sample before and after quality-filtering):
write.table(demo_good_proper,
            file = "Routput/demo_filteredout.txt",
            sep = "\t",
            quote = FALSE)


datacleantime <- toc(log = FALSE, quiet = TRUE) #  Computes elapsed time since the matching call to tic()

# So, the total time to clean the data is
time_clean <- datacleantime$toc - datacleantime$tic
time_clean 
## elapsed 
##  58.904
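
A note on maxEE: the “expected errors” of a read is the sum of its per-base error probabilities, EE = sum(10^(-Q/10)) over the read’s quality scores. A toy calculation (the quality values below are invented for illustration):

# Toy expected-errors calculation; these Phred scores are made up:
q <- c(38, 37, 35, 32, 30, 28, 25, 20)
sum(10^(-q / 10)) # the read would be discarded if this exceeded maxEE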

DEREPLICATION

Dereplicating the data collapses together reads that encode the same sequence; this saves computational time in later stages (see section 4 of the dada2 introduction: https://bioconductor.org/packages/devel/bioc/vignettes/dada2/inst/doc/dada2-intro.html).
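
A toy illustration of the idea in base R (not dada2):

# Identical reads collapse into one unique sequence plus an abundance count:
reads <- c("ACGT", "ACGT", "ACGT", "ACGG")
table(reads) # ACGG appears once, ACGT three times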

derep_demo_F <- derepFastq(demo_goodF, # file paths to fastq files
                           n = 1e+06, # maximum number of reads to parse and dereplicate at any one time.
                           verbose = TRUE) # outputs final status of the dereplication
## Dereplicating sequence entries in Fastq file: Routput/demo_good_filtered/demo1F_good.fastq.gz
## Encountered 24817 unique sequences from 264212 total sequences read.
## Dereplicating sequence entries in Fastq file: Routput/demo_good_filtered/demo2F_good.fastq.gz
## Encountered 14663 unique sequences from 121038 total sequences read.
## Dereplicating sequence entries in Fastq file: Routput/demo_good_filtered/demo3F_good.fastq.gz
## Encountered 26526 unique sequences from 405684 total sequences read.
## Dereplicating sequence entries in Fastq file: Routput/demo_good_filtered/demo4F_good.fastq.gz
## Encountered 14443 unique sequences from 138191 total sequences read.
derep_demo_R <- derepFastq(demo_goodR,
                           n = 1e+06,
                           verbose = TRUE)
## Dereplicating sequence entries in Fastq file: Routput/demo_good_filtered/demo1R_good.fastq.gz
## Encountered 40285 unique sequences from 264212 total sequences read.
## Dereplicating sequence entries in Fastq file: Routput/demo_good_filtered/demo2R_good.fastq.gz
## Encountered 15484 unique sequences from 121038 total sequences read.
## Dereplicating sequence entries in Fastq file: Routput/demo_good_filtered/demo3R_good.fastq.gz
## Encountered 44962 unique sequences from 405684 total sequences read.
## Dereplicating sequence entries in Fastq file: Routput/demo_good_filtered/demo4R_good.fastq.gz
## Encountered 18387 unique sequences from 138191 total sequences read.
names(derep_demo_F) <- demo_samplenames
names(derep_demo_R) <- demo_samplenames

LEARN THE ERROR RATES

Now let’s calculate the error-rates (see below) for the forward and reverse sequences, plot them directly in RStudio and save to .pdf.

From section 5 of the dada2 vignette:

"The dada algorithm uses a parametric model of the errors introduced by PCR amplification and sequencing. Those error parameters typically vary between sequencing runs and PCR protocols, so our method provides a way to estimate those parameters from the data itself."

# In order to make the results reproducible when carried out multiple times, we use the set.seed function.
set.seed(56456)
tic(msg = NULL, quiet = TRUE)

# Forward read error rates
demo_error_F <- learnErrors(derep_demo_F, # path to fastq files or list of derep-class objects
                            nbases = 1e+07, # min number of total bases to use for error rate learning
                            randomize = TRUE, # samples are picked at random from those provided
                            MAX_CONSIST = 12, # max number of times to step through the self-consistency loop
                            multithread = TRUE, # multithreading is enabled and the number of available threads is automatically determined
                            verbose = TRUE) # prints verbose text output
## 17688448 total bases in 138191 reads from 1 samples will be used for learning the error rates.
## Initializing error rates to maximum possible estimate.
## selfConsist step 1 .
##    selfConsist step 2
##    selfConsist step 3
##    selfConsist step 4
##    selfConsist step 5
## Convergence after  5  rounds.
plotErrors(demo_error_F, 
           obs = TRUE,      #  the observed error rates are plotted as points
           nominalQ = TRUE) # plot the expected error rates

# Reverse read error rates
demo_error_R <- learnErrors(derep_demo_R,
                            nbases = 1e+07,
                            randomize = TRUE,
                            MAX_CONSIST = 12,
                            multithread = TRUE,
                            verbose = TRUE)
## 30120168 total bases in 264212 reads from 1 samples will be used for learning the error rates.
## Initializing error rates to maximum possible estimate.
## selfConsist step 1 .
##    selfConsist step 2
##    selfConsist step 3
##    selfConsist step 4
##    selfConsist step 5
## Convergence after  5  rounds.
plotErrors(demo_error_R, obs = TRUE,
           nominalQ = TRUE)

errmodel <- toc(log = FALSE, quiet = TRUE)

# So, the total time to get the error models from the data is
time_errormodel <- errmodel$toc - errmodel$tic
time_errormodel
## elapsed 
##  17.476
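
If you’d like to inspect the fitted error rates numerically rather than graphically, dada2’s getErrors() extracts the error matrix (rows are nucleotide transitions such as A2C; columns are quality scores):

# Peek at the fitted error-rate matrix from the forward-read model:
head(getErrors(demo_error_F))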

We’ll also save both plots to .pdf in our Rfigs directory for our records:

pdf("Rfigs/demo_error_F_plot.pdf",
    width = 10,
    height = 10,
    pointsize = 8)
plotErrors(demo_error_F,
           obs = TRUE,
           nominalQ = TRUE)
dev.off()
## quartz_off_screen 
##                 2
pdf("Rfigs/demo_error_R_plot.pdf",
    width = 10,
    height = 10,
    pointsize = 8)
plotErrors(demo_error_R,
           obs = TRUE,
           nominalQ = TRUE)
dev.off()
## quartz_off_screen 
##                 2

CALCULATING ASVs

Now it is time to run the actual dada2 algorithm to determine the ASVs in the dataset.

Important: this step is run separately for the forward and reverse sets of paired-end reads.

tic(msg = NULL, quiet = TRUE) 

# Forward reads
demo_dada_F <- dada(derep_demo_F, # list of derep-class objects
                    err=demo_error_F, # the matrix of estimated rates for each possible nucleotide transition
                    pool = TRUE, # the algorithm will pool together all samples prior to sample inference
                    multithread = TRUE) # On Windows, set multithread=FALSE
## 4 samples were pooled: 929125 reads in 77269 unique sequences.
# Reverse reads
demo_dada_R <- dada(derep_demo_R,
                    err=demo_error_R,
                    pool = TRUE,
                    multithread = TRUE)
## 4 samples were pooled: 929125 reads in 115520 unique sequences.
asvtime <- toc(log = FALSE, quiet = TRUE)

# total time to generate the ASVs from the data 
time_asv_generation <- asvtime$toc - asvtime$tic


# Let's see how many sequence variants we got in the forward set
demo_dada_F[[1]]
## dada-class: object describing DADA2 denoising results
## 293 sequence variants were inferred from 24817 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
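
To look at the inferred sequence variants themselves, getSequences() pulls them out of a dada-class object:

# The first few inferred sequence variants for the first sample:
head(getSequences(demo_dada_F[[1]]))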

Next we will merge the forward and reverse sets (the paired-end reads for all our samples), and output a sequence-table of all ASVs.

tic(msg = NULL, quiet = TRUE)
demo_merged <- mergePairs(demo_dada_F, # object(s) generated by denoising the forward reads
                          derep_demo_F, # object(s) used as input to the dada function when denoising the forward reads
                          demo_dada_R, # object(s) generated by denoising the reverse reads
                          derep_demo_R, # object(s) used as input to the dada function when denoising the reverse reads
                          minOverlap = 20, # min length of overlap required for merging
                          maxMismatch = 0, # max mismatches allowed in the overlap region
                          verbose = TRUE) # a summary of the function results are printed to standard output
## 244045 paired-reads (in 893 unique pairings) successfully merged out of 261131 (in 1977 pairings) input.
## 117182 paired-reads (in 475 unique pairings) successfully merged out of 119634 (in 1084 pairings) input.
## 392688 paired-reads (in 602 unique pairings) successfully merged out of 402610 (in 1563 pairings) input.
## 133781 paired-reads (in 422 unique pairings) successfully merged out of 136348 (in 1126 pairings) input.
mergetime <- toc(log = FALSE, quiet = TRUE)
# total time to merge the paired-end reads
time_merge_reads <- mergetime$toc - mergetime$tic


# Let's make a sequence table of all the ASVs
demo_sequence_table <- makeSequenceTable(demo_merged, # list of the samples to include in the sequence table.
                                         orderBy = "abundance") #  Specifies how the sequences (columns) of the returned table should be ordered (decreasing)

# We can check the distribution of the ASVs by length
table(nchar(getSequences(demo_sequence_table)))
## 
##  131  153  171  176  179  188  214  215  216 
##    1    1    1    2    1    1  481 1344    8
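
Most of the merged sequences fall in the 214–216 bp range; the scattered much-shorter lengths likely reflect non-specific amplification. If you wanted to keep only sequences in the expected window, here is a sketch (the 210–220 window is our assumption based on the table above; the pipeline below continues with the full table):

# Optional: retain only merged sequences within the expected length window.
demo_sequence_table_trimmed <-
  demo_sequence_table[, nchar(colnames(demo_sequence_table)) %in% 210:220]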

REMOVE CHIMERAS

Chimeras are another important source of spurious sequences in amplicon sequencing. Chimeras are formed during PCR amplification: when one sequence is incompletely amplified, the incomplete amplicon primes the next amplification step, yielding a spurious hybrid amplicon. The result is a sequence read that is half of one sample sequence and half of another.

tic(msg = NULL, quiet = TRUE)
# Remove chimeric sequences
demo_nochim <- removeBimeraDenovo(demo_sequence_table, # sequence table
                                  method = "consensus", # samples in a sequence table are independently checked for bimeras, and a consensus decision on each sequence variant is made
                                  minFoldParentOverAbundance = 1, # only sequences greater than this-fold more abundant than a sequence can be its "parents"
                                  verbose = TRUE,
                                  multithread = TRUE)
## Identified 1697 bimeras out of 1840 input sequences.
chimeratime <- toc(log = FALSE, quiet = TRUE)

time_chimera_removal <- chimeratime$toc - chimeratime$tic

# Let's see how many ASVs remain after filtering chimeric sequences:
dim(demo_nochim)
## [1]   4 143
# Although most unique ASVs were flagged as chimeric, chimeras tend to be low-abundance; let's check the proportion of total reads we retained:
sum(demo_nochim)/sum(demo_sequence_table)
## [1] 0.885267

SEQUENCES RETAINED

Now we have completed all the filtering, trimming, and cleanup to arrive at our final data-set. Here we should check and record how many sequences we retained after each step.

Important: this check matters for future troubleshooting, should you need to come back to your data-cleaning steps.

# Create a function to calculate reads retained
fetch_numbers <- function(a) sum(getUniques(a))

# Then apply this function to the output of each step in our pipeline to generate a counts-table of reads remaining after each step
demo_track_steps <- cbind(demo_good_proper,
                          sapply(demo_dada_F,
                                 fetch_numbers),
                          sapply(demo_dada_R,
                                 fetch_numbers),
                          sapply(demo_merged,
                                 fetch_numbers),
                          rowSums(demo_nochim))

colnames(demo_track_steps) <- c("input",
                                "filtered",
                                "denoisedF",
                                "denoisedR",
                                "merged",
                                "nochim")
rownames(demo_track_steps) <- demo_samplenames

# And save the output to a new file for our records:
write.table(demo_track_steps,
            file = "Routput/demo_filtering_steps_track.txt",
            sep = "\t",
            quote = FALSE)
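
It’s also worth printing the tracking table and scanning it by eye: counts should decline gradually across the columns, and a large drop at any single step points back to that step’s parameters.

# Print the tracking table; look for any step with a disproportionate loss:
demo_track_steps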

ASSIGNING TAXONOMY TO ALL ASVs

We will next determine taxa present in our samples using the Silva database v.132.

dada2 helpfully maintains specially formatted training sets for three of the most popular 16S reference databases: Silva, Greengenes, and RDP (plus UNITE for ITS).

We will be using the dada2 Silva database: https://zenodo.org/record/1172783#.XcClW9VOnb1

We’ve provided this file in your Rdata directory (Rdata/Silva_db/silva_nr_v132_train_set.fa).

tic(msg = NULL, quiet = TRUE)
demo_taxonomy <- assignTaxonomy(demo_nochim, # character vector of the sequences to be assigned
                                "Rdata/Silva_db/silva_nr_v132_train_set.fa", # the path to the reference fasta file
                                minBoot = 80, # min bootstrap confidence for assigning a taxonomic level
                                verbose = TRUE,
                                multithread = TRUE)
## Finished processing reference fasta.
write.table(demo_taxonomy,file = "Routput/demo_taxaout.txt",
            sep = "\t",
            quote = FALSE)

taxotime <- toc(log = FALSE, quiet = TRUE)

# total time assigning taxonomy
time_taxonomy <- taxotime$toc - taxotime$tic

Now we will generate output-files critical for further analyses and data-visualization. These include:

  1. a table summarizing ASVs by taxa

  2. a fasta-file of all ASVs

  3. a table of ASV-counts per sample (the OTU-table)

# Let's create a table by replacing the ASV sequences with ids (ASV_1, ASV_2 etc.) and their corresponding classifications
demo_taxa_summary <- demo_taxonomy
row.names(demo_taxa_summary) <- NULL
head(demo_taxa_summary)
##      Kingdom    Phylum           Class                 Order              
## [1,] "Bacteria" "Proteobacteria" "Gammaproteobacteria" "Enterobacteriales"
## [2,] "Bacteria" "Bacteroidetes"  "Bacteroidia"         "Bacteroidales"    
## [3,] "Bacteria" "Actinobacteria" "Actinobacteria"      "Bifidobacteriales"
## [4,] "Bacteria" "Firmicutes"     "Clostridia"          "Clostridiales"    
## [5,] "Bacteria" "Actinobacteria" "Actinobacteria"      "Bifidobacteriales"
## [6,] "Bacteria" "Bacteroidetes"  "Bacteroidia"         "Bacteroidales"    
##      Family               Genus                 
## [1,] "Enterobacteriaceae" "Escherichia/Shigella"
## [2,] "Bacteroidaceae"     "Bacteroides"         
## [3,] "Bifidobacteriaceae" "Bifidobacterium"     
## [4,] "Lachnospiraceae"    "Lachnoclostridium"   
## [5,] "Bifidobacteriaceae" "Bifidobacterium"     
## [6,] "Bacteroidaceae"     "Bacteroides"
# Let's make a file listing all the ASVs and their sequences in fasta format
demo_asv_seqs <- colnames(demo_nochim)
demo_asv_headers <- vector(dim(demo_nochim)[2],
                           mode = "character")
for (i in 1:dim(demo_nochim)[2]) {
  demo_asv_headers[i] <- paste(">ASV", i, sep = "_")
}
demo_asv.fasta <- c(rbind(demo_asv_headers,
                          demo_asv_seqs))
demo_asv.fasta <- c(rbind(demo_asv_headers,
                          demo_asv_seqs))

write(demo_asv.fasta,
      file = "Routput/demo_out_asv.fasta")

# At this step, we need to make a table of ASV counts for each sample (which is going to be most important for all statistical analyses)
demo_asv_tab <- t(demo_nochim)
row.names(demo_asv_tab) <- sub(">",
                               "",
                               demo_asv_headers)
write.table(demo_asv_tab,
            file = "Routput/demo_asv_counts.tsv",
            sep = "\t",
            quote=FALSE,
            col.names = NA)

# Finally, let's make a table with the taxonomy of all the ASVs
demo_asv_taxa <- demo_taxonomy
row.names(demo_asv_taxa) <- sub(">",
                                "",
                                demo_asv_headers)

write.table(demo_asv_taxa,file = "Routput/demo_asvs_taxonomy.tsv",
            sep = "\t",
            quote=FALSE,
            col.names = NA)
dim(demo_asv_taxa)
## [1] 143   6
# Perform rarefaction to subsample an equal number of reads from each sample, reducing depth bias

demo_asv_transpose <- t(demo_asv_tab)

demo_rarefied <- Rarefy(demo_asv_transpose, depth = min(rowSums(demo_asv_transpose)))

write.table(demo_rarefied$otu.tab.rff, "Routput/demo_rar_table.txt", sep = "\t")

write.table(demo_rarefied$discard, "Routput/demo_rar_discard.txt", sep = "\t")

# Remove ASVs whose total count across the dataset is zero (some ASVs can drop to zero after rarefaction)

demo_final <- demo_rarefied$otu.tab.rff[, which(colSums(demo_rarefied$otu.tab.rff) != 0)]

write.table(demo_final, "Routput/demo_final_count_table.txt", sep = "\t")
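
One caveat: Rarefy() subsamples reads at random, so the rarefied table can differ from run to run. For a fully reproducible table, set a seed immediately before the call (a small sketch; any fixed seed works):

# Reproducible rarefaction: fix the RNG seed before subsampling.
set.seed(56456)
demo_rarefied <- Rarefy(demo_asv_transpose,
                        depth = min(rowSums(demo_asv_transpose)))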

Let’s see how much time the most important steps took:

cat("\nThe total time for data clean:", time_clean , "seconds\n")
## 
## The total time for data clean: 58.904 seconds
cat("The total time for error models:", time_errormodel , "seconds\n")
## The total time for error models: 17.476 seconds
cat("The total time for generating ASVs:", time_asv_generation , "seconds\n")
## The total time for generating ASVs: 24.896 seconds
cat("The total time for merging paired-end reads:", time_merge_reads , "seconds\n")
## The total time for merging paired-end reads: 2.904 seconds
cat("The total time to remove chimera:", time_chimera_removal , "seconds\n")
## The total time to remove chimera: 1.291 seconds
cat("The total time to assign taxonomies to ASVs:", time_taxonomy , "seconds\n")
## The total time to assign taxonomies to ASVs: 59.485 seconds

We’ll also need to make fake sample BMI data for tomorrow’s visualization exercises (phyloseq)

bmi <- c('obese',
         'obese',
         'lean',
         'lean')

demo_fake_sample_data <- data.frame(bmi_group=bmi)
rownames(demo_fake_sample_data) <- c("demo1",
                                     "demo2",
                                     "demo3",
                                     "demo4")
write.table(demo_fake_sample_data,
            file = "Routput/made_up_sample_data.tsv",
            sep="\t",
            quote=FALSE)