Monday, October 21, 2013

Open-Source Analysis of My Raw American Gut Data

Just like I used open-source tools to re-analyze my 23andMe data, I wanted to see what I could learn from analyzing the raw 16S rRNA reads from the American Gut Project.  The American Gut team is very well organized, and you can currently access your raw data in public databases.

I've provided a tutorial based entirely on web-based analysis, so you don't need to know any programming to follow these steps with your own data.  You can also skip the tutorial links to just see my own results.

Step #1: Get Your Data
Step #2a: Analyze Your Data in MG-RAST (preferred, but time-consuming)
Step #2b: Analyze Your Data using RDP-Classifier (quick, but less functionality)

Here are some charts that I could quickly create in Excel using data from RDP-Classifier:


If you compare these distributions to the average gut distribution from the American Gut preliminary report, you can tell that my distribution is very different from that of the average participant.  Clearly, I have more Proteobacteria than the average participant (and fewer Firmicutes and Bacteroidetes).  However, it is also important to note that there was a large amount of variation between participants.

I don't know how my class distributions compare to other samples, but it seems like I can at least infer that there is more variety in my oral sample than my fecal sample:


I also found it useful to see which specific genera were most abundant.  According to RDP-Classifier, these are the genera with more than 1000 reads:


Family              Genus                 Fecal   Oral
Streptococcaceae    Streptococcus             0  10614
Neisseriaceae       Neisseria                 0   8824
Actinomycetaceae    Actinomyces               0   4129
Lachnospiraceae     Oribacterium              0   2319
Veillonellaceae     Veillonella               0   2124
Bacteroidaceae      Bacteroides            1800      5
Enterobacteriaceae  Escherichia/Shigella   8039      5
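
If you would rather not do this filtering by hand in Excel, here is a minimal Python sketch of the same idea.  It is only an illustration: the rows are typed in from the table above rather than read from an actual RDP-Classifier output file, and the names `GENUS_COUNTS` and `abundant_genera` are my own placeholders.

```python
# Rows copied from the table above: (family, genus, fecal reads, oral reads).
GENUS_COUNTS = [
    ("Streptococcaceae", "Streptococcus", 0, 10614),
    ("Neisseriaceae", "Neisseria", 0, 8824),
    ("Actinomycetaceae", "Actinomyces", 0, 4129),
    ("Lachnospiraceae", "Oribacterium", 0, 2319),
    ("Veillonellaceae", "Veillonella", 0, 2124),
    ("Bacteroidaceae", "Bacteroides", 1800, 5),
    ("Enterobacteriaceae", "Escherichia/Shigella", 8039, 5),
]


def abundant_genera(rows, min_reads=1000):
    """Return (genus, sample, reads) for each genus above min_reads in a sample."""
    hits = []
    for _family, genus, fecal, oral in rows:
        for sample, reads in (("fecal", fecal), ("oral", oral)):
            if reads > min_reads:
                hits.append((genus, sample, reads))
    # Most abundant first.
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


if __name__ == "__main__":
    for genus, sample, reads in abundant_genera(GENUS_COUNTS):
        print(f"{sample}\t{genus}\t{reads}")
```

Raising `min_reads` (for example to 5000) narrows the list to the few dominant genera in each sample.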

I was able to find reports that Actinomyces was associated with transformation of lymphocytes in patients with periodontal disease (Baker et al. 1976) and Streptococcus mutans plays a role in human dental decay (Loesche 1986), although I don't know if I had this specific strain (based upon the MG-RAST report, it looks like I don't).  Likewise, I showed this list to my dentist, and she recognized these two genera as being associated with dental problems.  Although I don't know how common these are overall, it seems to make sense that they could be found in an oral sample.

Obviously, I recognized Escherichia/Shigella.  The American Gut report points out that the phylum containing this genus is not highly abundant in an average participant.  At one point, I had to be hospitalized with ulcerative colitis (with a strain of E. coli producing shiga toxin), so perhaps this is related to the high abundance of this genus (although that was several years ago).

If you are patient enough to wait for your MG-RAST results, then you can make similar (but slightly cooler-looking) pie charts and tables automatically.  For example, you can take a look at the corresponding plots from MG-RAST for my fecal and oral samples.  You can also create plots to compare species in multiple samples (red bars are for my oral sample, green bars are for my fecal sample):





Perhaps most importantly, MG-RAST will provide annotations down to the species level (and strain level, when possible).  The species counts aren't perfectly correlated with the genera counts (predicted from the classifier), but the most interesting genera appeared in both lists.


metagenome  strain                            abundance
oral        uncultured bacterium                  19715
fecal       Escherichia coli ED1a                 11316
oral        Abiotrophia para-adiacens              6185
oral        Actinomyces odontolyticus              4154
oral        Veillonella dispar                     1913
oral        Butyrivibrio fibrisolvens              1431
oral        Blautia sp. Ser8                       1075
fecal       Bacteroides stercoris                   736
oral        Syntrophococcus sucromutans             661
fecal       Pseudomonas fluorescens                 660
fecal       Bacteroides caccae                      559
fecal       Bacteroides stercoris ATCC 43183        558
fecal       Bacteroides vulgatus                    510
oral        Streptococcus                           499
oral        Rothia mucilaginosa                     450
fecal       uncultured bacterium                    370
oral        Haemophilus haemolyticus                295
fecal       Prevotella buccalis                     287
oral        Ruminococcus gauvreauii                 282
oral        Streptococcus sanguinis                 275
oral        Ruminococcus torques L2-14              272
fecal       Escherichia coli                        232
oral        Abiotrophia defectiva                   213
oral        Gemella morbillorum                     204
fecal       Dialister propionicifaciens             184
oral        Veillonella parvula                     175
oral        Butyrivibrio hungatei                   142
oral        Atopobium minutum                       137
oral        Parvimonas micra                        126
oral        Leptotrichia shahii                     121
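
To see how far the MG-RAST species counts drift from the RDP-Classifier genus counts, one option is to roll the species rows up to the genus level and put the two numbers side by side.  The sketch below does this for a handful of rows copied from the two tables in this post.  Treating the first word of the strain name as the genus is a simplifying assumption of mine (it would mislabel rows such as "uncultured bacterium"), and `genus_totals` is a hypothetical helper, not part of either tool.

```python
from collections import defaultdict

# A few (metagenome, strain, abundance) rows copied from the MG-RAST table above.
MG_RAST_ROWS = [
    ("fecal", "Escherichia coli ED1a", 11316),
    ("oral", "Actinomyces odontolyticus", 4154),
    ("oral", "Veillonella dispar", 1913),
    ("fecal", "Bacteroides stercoris", 736),
    ("fecal", "Bacteroides caccae", 559),
    ("fecal", "Bacteroides stercoris ATCC 43183", 558),
    ("fecal", "Bacteroides vulgatus", 510),
    ("oral", "Streptococcus", 499),
    ("oral", "Streptococcus sanguinis", 275),
    ("fecal", "Escherichia coli", 232),
    ("oral", "Veillonella parvula", 175),
]

# RDP-Classifier genus counts copied from the first table in this post.
RDP_COUNTS = {
    ("oral", "Streptococcus"): 10614,
    ("oral", "Actinomyces"): 4129,
    ("oral", "Veillonella"): 2124,
    ("fecal", "Bacteroides"): 1800,
    ("fecal", "Escherichia"): 8039,  # reported by RDP as Escherichia/Shigella
}


def genus_totals(rows):
    """Sum abundances per (sample, genus), taking the genus as the first word."""
    totals = defaultdict(int)
    for sample, strain, abundance in rows:
        genus = strain.split()[0]
        totals[(sample, genus)] += abundance
    return dict(totals)


if __name__ == "__main__":
    totals = genus_totals(MG_RAST_ROWS)
    for key, rdp_reads in sorted(RDP_COUNTS.items()):
        sample, genus = key
        print(f"{sample}\t{genus}\tMG-RAST={totals.get(key, 0)}\tRDP={rdp_reads}")
```

With these rows, Actinomyces and Veillonella land close to their RDP-Classifier counts, while Streptococcus is far lower in the MG-RAST species table, which matches the observation that the two lists agree only loosely.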

This information allowed me to conduct more effective literature searches.  For example, my understanding is that the ED1a strain has not been shown to be associated with ulcerative colitis.  On the other hand, the species information allowed me to find a paper for the discovery of my specific strain of Actinomyces, which was harvested from 450 tooth cavities (Batty 2005).  Likewise, I could confirm that Streptococcus sanguinis was also pathogenic (Xu et al. 2007).

FYI, Galaxy also has some metagenomic tools.  However, running BLAST on Galaxy will take a long time.  If you are comfortable with running BLAST locally, it should be easier (but this requires some comfort using the computer).  You can also analyze your data locally using QIIME and upload the results to PICRUSt for functional enrichment, if you don't need the convenience of the web-based tools listed above (MG-RAST can produce QIIME reports, but I think it is better to use a tab-delimited text file to avoid formatting problems).
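
If you do go the tab-delimited route, a few lines of Python are enough to load and sort the table.  This is a rough sketch under stated assumptions: the column names (`metagenome`, `strain`, `abundance`) simply mirror the table headers in this post, and a real MG-RAST download may use different or additional columns, so adjust the names to match your file.  The example text stands in for a downloaded file and uses three rows copied from the table above.

```python
import csv
import io

# Stand-in for a downloaded tab-delimited file; the three rows are copied from
# the MG-RAST table in this post. The header names are an assumption.
EXAMPLE_TSV = (
    "metagenome\tstrain\tabundance\n"
    "oral\tActinomyces odontolyticus\t4154\n"
    "fecal\tEscherichia coli ED1a\t11316\n"
    "oral\tStreptococcus sanguinis\t275\n"
)


def read_abundance_table(handle):
    """Parse tab-delimited rows into dicts, converting abundance to int."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for record in reader:
        record["abundance"] = int(record["abundance"])
        rows.append(record)
    return rows


def top_hits(rows, sample, n=10):
    """Return the n most abundant rows for one sample (e.g. "oral")."""
    matching = [row for row in rows if row["metagenome"] == sample]
    return sorted(matching, key=lambda row: row["abundance"], reverse=True)[:n]


if __name__ == "__main__":
    table = read_abundance_table(io.StringIO(EXAMPLE_TSV))
    for row in top_hits(table, "oral"):
        print(row["strain"], row["abundance"])
```

To use a real file, replace the `io.StringIO(...)` call with an opened file handle, for example `open("my_table.tsv", newline="")`, where the file name is whatever you saved your download as.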

I am still interested in seeing what my official individual report will look like: although I have general experience with bioinformatics analysis, the folks at American Gut have looked at a lot more metagenomic data than I have.  Likewise, I am interested in seeing how my profiles change at different time points: once I eventually get my uBiome results, I will put together another post to compare the results.

7 comments:

  1. Charles, thank you for posting this information. Now that I have my Taxa Summary I'm hoping that MG-RAST will provide me with species information so I can try to better understand what's in me.

    I loaded my data this afternoon so now I just need to hurry up and wait.

    Do you plan to post your Taxa Summary information?

  2. The first table in the post is from RDP-Classifier and the second table is from MG-RAST. The pie charts also come from the higher levels of the taxonomy.

    I also posted my official American Gut report here:

    http://cdwscience.blogspot.com/2013/11/my-american-gut-individual-report.html

    Are you looking for something else?

    1. Yes, I'm looking for your American Gut Taxa Summary. Mine was made available last week. It lists the detailed make up of the samples down to the Genus with their percentage.

    2. Ok - the top 5 most abundant genera are in the PDF version of the report, but you can view my entire list here:

      https://drive.google.com/file/d/0B1xpw6_kQMKuUmJnTGZMc0NtSm8/edit?usp=sharing

      One thing that this report made more obvious is the lack of WAL_1855D (the most abundant taxon) anywhere in my MG-RAST results. So far, I've contacted the American Gut and MG-RAST tech support, and it looks like this may be because MG-RAST uses an older version of the Greengenes 16S reference database. Because all samples need to be processed the same way, updating the reference databases takes some time in MG-RAST, but they did say this is something they are working on. I'll add a note to the individual report post once I can confirm this is the case.

      Thank you for your interest!

  3. How did you generate the "metagenome strain abundance" table? I have tried lots of MG-RAST settings but I have found none that produce a similar table. I'm very interested in the species level detail.

    1. I think it might be best to post a question like this for the MG-RAST report post that includes the detailed instructions. That way, I think other users with similar questions will be more likely to see the results:

      http://cdwscience.blogspot.com/2013/10/analyze-your-16s-rrna-data-using-mg-rast.html

      It takes some time to process your sample (especially if you don't immediately make it publicly available).

      Picking up from the end of that tutorial (selecting datasets, reference database, etc), here is what you should do:

      1) Click radio button for "table"

      2) Click "generate" on the right-hand side of the radio buttons

      3) Wait for table to be produced. If the annotations are not specific enough, use the pull-down for "group table by" to select the specificity of the report (in your case, you should select "species")

      4) Click the "change" button immediately to the right of the pull-down option.

      5) Optionally, there is a "download" button if you want to use a text file to browse your results

    2. Thanks, that did it! Off to do more research...


 
Creative Commons License
My Biomedical Informatics Blog by Charles Warden is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 United States License.