Microbiome Analysis Workshop
GOAL
In addition to understanding the importance of experimental design, we will walk through turning raw sequence data into useful counts-data that we can use to visualize microbiome sample-composition. We will then discuss methods to extrapolate function from these abundance-data (and limitations thereof) to ultimately arrive at biological insight.
Table of Contents
Pre-course Materials
⬣ Best Practices for Analyzing Microbiomes - This article discusses how all stages of conducting a microbiome study, from designing the experiment to collecting and storing the samples to obtaining insight from graphical displays of the sequence data, can substantially impact the result.
⬣ If you do not have R and RStudio installed already, you can follow these instructions for both Mac and Windows. The R versions listed in the instructions might be outdated, but the links are correct. If you already have RStudio, make sure you’re using R version 4.0.5 by clicking on ‘Global Options…’ in the Tools tab. The version is also stated in the first line of the console when you first open RStudio. If you are not using the most recent version, follow the previous installation instructions and restart R.
If you use Windows, an easy way to update your R version and packages is to run the following code in the RStudio console:
install.packages("installr")
library(installr)
updateR()
You should also use the latest version of RStudio. You can check this within RStudio by going to the Help tab and clicking ‘Check for Updates’.
Finally, update your packages by clicking ‘Check for Package Updates…’ in Tools.
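If you prefer to confirm your R version from the console instead of the menus, a minimal sketch is shown here. It treats 4.0.5 (the version named above) as a minimum, which is an assumption; `required` and `version_ok` are illustrative names only.

```r
# Version named in the instructions above, treated here as a minimum (assumption).
required <- "4.0.5"

# getRversion() returns the running R version as a comparable object.
current <- getRversion()
version_ok <- current >= required

# Report the result in the console.
message("Running R ", as.character(current),
        if (version_ok) " (OK)" else paste0(" (older than ", required, ")"))
```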
⬣ We know that programming can be very intimidating at first, so we created this introductory R course to help researchers such as you start your programming journey. If you are a bit familiar with R, please still check out this resource as it covers how the workshop tutorials will be set up. We’ll be moving quickly through basic concepts in R to get to the actual data-analysis. We strongly recommend reviewing the R tutorial to get you started/help you keep up.
Day One
Presentation Slides
Day one of this workshop focuses on an overview of available methods and best-practice considerations for experimental design in microbiome studies. You can download the individual presentations for each topic in the agenda below. The full agenda can be downloaded here.
NOTE: Speakers may change with each workshop event. Presentation slides from every workshop are still listed here.
AGENDA | INSTRUCTOR |
---|---|
Intro to Microbiome Studies | |
USF Genomics Introduction and Workshop Overview | Dr. Jenna Oberstaller |
USF Genomics Equipment Core | Dr. Min Zhang |
Introduction to Microbiome Data Analysis | Dr. Anujit Sarkar |
Best Practices for Microbiome Sample-handling and Nucleic Acid-processing | Swamy Rakesh Adapa, MS |
Statistical Considerations for Microbiome Studies | Dr. Ryan McMinds |
Experimental Design | Swamy Rakesh Adapa, MS |
Overview of Microbiome Data-Visualization | Dr. Justin Gibbons |
Functional Profiling with PICRUSt2 | Dr. Thomas Keller |
Day Two
Presentation Slides
Introduction to R and Plotting Data (Dr. Charley Wang)
Taxonomic Analysis (Dr. Anujit Sarkar)
R Hands-on Practice
Download Charley’s tutorial. Follow along below!
Initial ASV Analysis
Download the zip file and extract it to get started on Charley’s initial ASV analysis tutorial. Open the .Rmd file in RStudio by going to the File tab and clicking ‘Open.’ Your directory path will depend on where you extracted the folder on your computer. More on working directories can be found here. We will need to change the paths in the first chunk of code in this R script, which loads the text files we need to run the tutorial. The path is the first argument of the read.table function, enclosed in single quotation marks.
Follow along with Charley’s tutorial here!
NOTE: We will be using an R Project format for the rest of our tutorials, so we will not need to change paths for the remainder of the workshop. It is still important to understand how file paths and directories work when loading data from your local computer, since you will most likely be doing it a lot.
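To make the idea of a file path concrete, here is a small sketch of loading a tab-separated text file with `read.table`. The file `example_counts.txt` and its contents are made up for illustration and written to a temporary folder so the code runs anywhere; in the tutorial you would instead point the first argument at the text files you extracted.

```r
# Hypothetical stand-in for a workshop text file, written to a temporary
# folder so that this sketch does not depend on your computer's layout.
data_dir <- tempdir()
counts_path <- file.path(data_dir, "example_counts.txt")
writeLines(c("ASV\tsample1\tsample2",
             "ASV_1\t10\t0",
             "ASV_2\t3\t7"), counts_path)

# The working directory is where R resolves relative paths from.
getwd()

# The path is the first argument of read.table(); header and sep describe
# the file layout, and row.names = 1 uses the first column as row names.
counts <- read.table(counts_path, header = TRUE, sep = "\t", row.names = 1)
counts
```

Building paths with `file.path()` rather than typing separators by hand keeps the same code working on both Mac and Windows.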
DADA2 Pipeline
Overview
Goal: The purpose of this analysis is to obtain an Amplicon Sequence Variant (ASV) table for all of our microbiome-sample example-data.
Input data: We will start with demultiplexed fastq files for all samples. This analysis is for paired-end data. Thus, for each sample, there will be two files, named according to Illumina platform conventions:
- Forward-reads, named *_R1_001.fastq
- Reverse-reads, named *_R2_001.fastq
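As a rough illustration of how the paired files can be matched up in R, the sketch below lists forward and reverse files by their suffixes and derives one sample name per pair. The folder `fastq_demo`, the placeholder file names, and the rule that the sample name is everything before the first underscore are all assumptions made for this example, not necessarily what the tutorial script does.

```r
# Hypothetical demo folder with empty placeholder files; the real reads
# are the ones provided with the workshop download.
fastq_dir <- file.path(tempdir(), "fastq_demo")
dir.create(fastq_dir, showWarnings = FALSE)
demo_files <- c("sampleA_S1_L001_R1_001.fastq", "sampleA_S1_L001_R2_001.fastq",
                "sampleB_S2_L001_R1_001.fastq", "sampleB_S2_L001_R2_001.fastq")
invisible(file.create(file.path(fastq_dir, demo_files)))

# Sorting keeps forward and reverse files in the same sample order.
# The pattern is not anchored at the end, so .fastq.gz files also match.
fwd <- sort(list.files(fastq_dir, pattern = "_R1_001\\.fastq", full.names = TRUE))
rev_reads <- sort(list.files(fastq_dir, pattern = "_R2_001\\.fastq", full.names = TRUE))

# Assumed naming rule: the sample name is the text before the first underscore.
sample_names <- sub("_.*$", "", basename(fwd))

# Every forward file should have a reverse partner.
stopifnot(length(fwd) == length(rev_reads))
data.frame(sample = sample_names,
           forward = basename(fwd),
           reverse = basename(rev_reads))
```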
Creating the Project
1. Follow this link and download a zipped copy of this folder by clicking the icon in the top right corner and then clicking “Download”. If you are not logged in to a Box account, the button will simply say “Download”. You do not need a Box account to download this folder.
2. Extract the downloaded zip file to where you want it.
3. Open RStudio and click on New Project in the File tab.
4. Create the new project by choosing ‘Existing Directory’
5. Browse to the directory where you extracted the zip file and make sure ‘Day2’ is the base name in your project directory file path.
You should see the folders (Ranalysis, Rdata, etc.) when you open the Day2 folder.
6. Click ‘Create Project’. You should now see a Day2.Rproj file in the lower right files pane. Double-click it to make sure you are within the project. If you are not already within your R project, you will be asked to open it. You can tell you are in your R Project if you see the name of your R Project at the top of your RStudio window.
For more info behind the logic of creating RStudio projects and adhering to an organizational directory-structure as you build your data-analysis skills, see this post on reproducible scientific data-analyses from Software Carpentry. We don’t use exactly the same structure they do, but the concepts are the same: structured analyses make sharing and reproducing analyses much easier!
Tutorial Structure
Before we begin, let’s take a moment to get organized. Documentation and good record-keeping are essential to producing high-quality and reproducible computational analyses, just as they are at the bench!
We recommend you keep your analyses organized by project (just as we organized this example).
Looking around in the file browser tab of the lower right section, you should find the following folders if you set the project directory correctly:
Rdata: this folder contains our input .fastq.gz files and our input database of 16S-sequences that we’ll use to identify taxa present in our samples.
Ranalysis: this folder contains any scripts we create to analyze our data, like this R-Markdown (.Rmd) document.
Routput: we will direct any output data-files from our analyses to this folder.
Rfigs: we will direct any figures we generate from our analyses to this folder.
Rsource: this folder contains any R source-scripts we create to set up our environment for our analyses: custom functions, which packages to load, etc. You don’t need to worry about this one since we made it for you.
You can think of any files in Rsource as set-up scripts: just load them at the beginning of your session and forget about them.
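If you want to confirm from the console that you are in the right place, a minimal sketch like this checks for the five folders listed above relative to the current working directory. The names `expected`, `present`, and `missing_dirs` are illustrative; the check only reports what it finds and changes nothing.

```r
# Folder names taken from the project layout described above.
expected <- c("Rdata", "Ranalysis", "Routput", "Rfigs", "Rsource")

# dir.exists() is vectorized; paths are relative to the working directory,
# which is the project root when the .Rproj file is open.
present <- dir.exists(expected)
names(present) <- expected
present

# List anything that was not found so the project directory can be rechecked.
missing_dirs <- expected[!present]
if (length(missing_dirs) > 0) {
  message("Not found in ", getwd(), ": ", paste(missing_dirs, collapse = ", "))
}
```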
Setting up the Environment
Now that we are familiar with the project, we can set up the environment!
1. Go to the Ranalysis folder in the lower right files pane and open the .Rmd file
2. Make sure your Knit Directory is set to ‘Project Directory’.
3. Run only the second chunk of code, beginning at line 48, by clicking the green arrow in the upper right corner of the chunk. Running this code calls a source script from the Rsource folder that installs all of the packages needed to run the tutorial.
This pipeline is written in R Markdown, a file format for making dynamic documents with R. An R Markdown document is written in markdown (an easy-to-write plain text format) and contains chunks of embedded R code. We rendered this R Markdown script into an HTML file, linked below, that shows the results of the code so you can follow along.
Let’s begin the day2 tutorial!
Day Three
Presentation Slides
Taxonomic Analysis: Visualization Part I (Dr. Justin Gibbons)
Taxonomic Analysis: Visualization Part II (Dr. Thomas Keller)
Taxonomic Analysis: Machine Learning (Dr. Thomas Keller)
Microbiome Data Visualization
Pipeline Overview
Goal: First, we will plot our results from the small dataset we analyzed yesterday. Do not worry if you were unable to generate the results; they are included in Day 3’s download. The important aspect is understanding the workflow. Then we will visualize other microbiome data from a different, larger dataset, previously processed in the same way as yesterday’s small test dataset (from raw sequence data to OTU or ASV tables).
Input data:
Taxonomic Analysis: Visualization Part I
Our outputs from Day 2 (we’ve provided them in the Day 3 download)
- demo_asv_counts.tsv
- demo_asvs_taxonomy.tsv
- made_up_sample_data.tsv
Biom file and mapping files that will be converted to phyloseq-class
- otu_table.biom
- meta_data.csv
Taxonomic Analysis: Visualization Part II
- demo_asv_counts.tsv
- pathways_out -> pathway .tsv files
- metagenome_out -> kegg .tsv files
- metadata.tsv
Taxonomic Analysis: Machine Learning
- an OTU table (converted to relative abundance)
- a table of metadata to associate with the OTUs
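For readers unfamiliar with the term, converting an OTU table to relative abundance means dividing each count by its sample total. A minimal sketch with made-up counts is shown here; it assumes OTUs are in rows and samples in columns, which may differ from the orientation of the table used in the tutorial.

```r
# Made-up counts: rows are OTUs, columns are samples (assumed orientation).
otu_counts <- matrix(c(10, 30, 60,
                        5,  5, 10),
                     nrow = 3,
                     dimnames = list(c("OTU_1", "OTU_2", "OTU_3"),
                                     c("sample1", "sample2")))

# Divide each column by its total so that every sample sums to 1.
rel_abund <- sweep(otu_counts, 2, colSums(otu_counts), "/")
rel_abund

# Sanity check: the column sums should all be 1.
colSums(rel_abund)
```

If your table has samples in rows instead, the same idea applies with the margin set to 1 and `rowSums()` in place of `colSums()`.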
Creating the Project
We’ll follow the same steps to create the Day3 project in RStudio as we did to create the Day2 project.
1. Follow this link and download a zipped copy of this folder by clicking the icon in the top right corner and then clicking “Download”. If you are not logged in to a Box account, the button will simply say “Download”. You do not need a Box account to download this folder.
2. Extract the downloaded zip file to where you want it.
3. Open RStudio and click on New Project in the File tab.
4. Create the new project by choosing ‘Existing Directory’
5. Browse to the directory where you extracted the zip file and make sure ‘day3’ is the base name in your project directory file path.
You should see the folders (Ranalysis, Rdata, etc.) when you open the day3 folder.
6. Click ‘Create Project’. You should now see a day3.Rproj file in the lower right files pane. Double-click it to make sure you are within the project. If you are not already within your R project, you will be asked to open it. You can tell you are in your R Project if you see the name of your R Project at the top of your RStudio window.
Tutorial Structure
Day 3 follows the same directory-structure as Day 2 above.
Setting up the Environment
If you successfully ran the Day 2 tutorial, then you should already have the packages needed for Day 3.
If not, the same setup instructions given for Day 2 apply to Day 3.
As you run the tutorial chunk by chunk, you can follow along with the output document linked below that includes the R code and its output.
Let’s begin the day3: visualization part I tutorial!
Let’s begin the day3: visualization part II tutorial!
Let’s begin the day3: machine learning tutorial!
Resources
Here are some resources mentioned in this workshop and some extra information that you might find helpful in your microbiome research.
Journal-articles Dr. Ji referenced as examples for different study-designs in his session (Statistical Considerations for Microbiome Studies)
- Socioeconomic Status and the Gut Microbiome: A TwinsUK Cohort Study
- Meta-analysis of gut microbiome studies identifies disease-specific and shared responses
Microbiome R Packages
Microbiome Software
- FastQC: A quality control tool for high throughput sequence data.
- QIIME2: an open-source bioinformatics pipeline for performing microbiome analysis from raw DNA sequencing data.