Wednesday, December 3, 2014

American Gut and uBiome Compared

In July this year, I conducted an experiment. I sent identical fecal samples to American Gut and uBiome to see how they would compare. There have been several discussions on this subject that seem to show the two rival microbiome test labs are flawed in their methods. An explanation from uBiome.

One argument was that even on an individual turd, there would be differences in the microbes found on its surface. Makes sense.

Another argument was that the normal sampling method, i.e., wiping a cotton swab on used toilet paper, would also lead to different microbes being detected. Sounds right.

I did things a bit differently. I did my "#2" in a new, food grade, plastic bag. Kneaded it thoroughly. Then I touched the exact same spot lightly with the swabs provided by the two companies.

I now have the results!

Unfortunately, the two companies provide completely different looking reports, so it is impossible to hold them up side-by-side and look for inconsistencies.  What I did instead, was go line by line on each report and any time I found an exact same genus listed by name, I wrote down the results to compare. But first here is a bar chart I crafted to give an overall impression:

I am pretty happy with the phyla numbers.  Better than the other comparisons I saw.

Here are the genus level comparisons:


Major discrepancies in yellow

Strange.  Some really big discrepancies and also some eerie similarities. I'd say overall they did a pretty good job, but when they were off...they were WAY off.

Overall, I'm impressed with the comparison.  It's possible that the major discrepancies were due to sampling differences. And anyway, this is all just for fun.  I am glad that at least it seems that both companies found pretty much the same microbes, and in the same relative abundances.  I feel good recommending either company, but would caution against head-to-head comparisons.   

Obviously it makes a direct comparison between AmGut and uBiome, like I did here, pointless.  And if anyone was looking at this chart to make grand statements, you were wrong and may want to correct them.

Gut Microbe (Genus Level) | Real Food (uBiome) My Results | Average Results | Potato Starch Added (AmGut) My Results
F. prausnitzii | not shown at this level

If there are any biotech geeks out there, the uBiome raw data can be found here:  uBiome.txt

And the AmGut Genus list can be found here:  AmGut Genus.xls

If anyone knows how to get the American Gut raw data, please drop me a comment or email.  I found a link that says it's available through EBI.  Has anyone tried to figure it out yet?

The raw data can be fetched from the European Bioinformatics Institute. EBI is part of The International Nucleotide Sequence Database Collaboration and is a public warehouse for sequence data. The deposited American Gut Project accessions so far are:
  1. ERP003819
  2. ERP003822
  3. ERP003820
  4. ERP003821
  5. ERP005367
  6. ERP005366
  7. ERP005361
  8. ERP005362
Processed sequence data and open-access descriptions of the bioinformatic processing can be found at our Github repository.
Sequencing of American Gut samples is an on-going project, as are the bioinformatic analyses. These resources will be updated as more information is added and as more open-access descriptions are finalized.


  1. Tim,
    This is brilliant work. It is hugely important because it lets us know that much of the speculation that is based on these tests is so far probably pretty worthless. First, I wonder how many samples being reported (and used in “studies”) were homogenized in the way you did – I suspect a minuscule number – and therefore the results may be entirely random and bear no relationship to the actual proportions of the different phyla or genera. Secondly, the substantive differences between these two sets of results based on the single sample (which you highlighted in yellow) mean that at least one of these analyses is wrong (and perhaps both are). Finally, even if homogeneous samples were being used, and the analysis was accurate, we have no idea if it is the relative proportions of these bacteria in our microbiome that matter – we have seen a huge amount of speculation based on the percentage of one phylum compared to another, or one genus to another, when it might be the absolute quantities that are significant. Those of us who have experimented with supplementation with different fibres, resistant starches or other polysaccharides, are very aware that stool volume can increase magnificently. The total size of our microbiome may be much more important than the relative proportions of individual bacteria. I feel that as yet we know next to nothing for certain about any of this – we are only just getting a tiny sense of how important it might all be. Please keep up the good work.

  2. I was just reading a great article that Gemma sent today on some of the problems with microbiota science. three-voice debate about gut microbiota research

    One issue I concur with is the overuse of animal models for human microbe studies, and another is that there seems to be a rush on to get gut articles published, possibly leading to shoddy conclusions.

    I think it's up to us to really get this all figured out. We need to keep trying different things and reporting what we are finding. I'm in this for the long run!

  3. What are your thoughts on Dr BG's comments regarding RS2 suppressing bifido based on the numbers you are seeing with your results? I know you eat a wide range of RS fiber, but it would seem your numbers are still quite good when using supplemental PS.

  4. Just finishing up a post comparing a high PS diet and zero PS diet and two AmGut samples. Basically, what I see, is that potato starch creates huge growth of Bifido, of course, Dr. BG will counter with "it's the wrong type of Bifido," but I can see no basis for that.

    Hopefully will be up in just a little bit, just need to check spelling and format.


  6. Sorry, copy and paste fail there. What I meant to say was:

    "Kneaded it thoroughly." - Love this blog. I was thinking of doing something similar myself, though I really don't buy the claim that these two companies should be so far apart. I suspect one of them - though it could easily be both - are doing something wrong.

    1. Hey Bud - You know, I'm amazed that they can do as good as they do. I was pretty impressed. Just getting the taxonomy correct is a miracle, then to establish the amount as a percentage? Damn.

      This is from AmGut's FAQ on how bacteria are sequenced:

      "The primary software package we use for processing 16S sequence data is called Quantitative Insights into Microbial Ecology (QIIME; Caporaso et al. 2010). Using this package, we are able to start with raw sequence data and process it to so that we end up be able to explore the relationships within and between samples using a variety of statistical methods and metrics. To help in the process, we leverage a standard and comprehensive (to date) reference database called Greengenes (McDonald et al. 2011; DeSantis et al. 2006) that includes information on a few hundred thousand Bacteria and Archaea (it is likely that there are millions or more species of bacteria). Due to the molecular limitations of our approach, and the lack of a complete reference database (because the total diversity of microbes on Earth is still unknown), our ability to determine whether a specific organism is present has a margin of error on the order of millions of years, which limits our ability to assess specific strains or even species using this inexpensive technique (more expensive techniques, such as some of the higher-level perks, can provide this information)."

    2. If you ever get ambitious, take a class on Bioinformatics. It's an amazing science. Here is a link to a lecture we had on predicting gene sequences:

      Lecture Slides

      What you will find is that the sequences do not come out of the computer program with a name, only a genomic sequence. This sequence must then be matched against hundreds of thousands of others to determine a match and get a name.

      If you want to play along, here is a practice quiz we had based on the lecture slides. Practice Quiz

      You'll see in the first question that matching genomes in the database rarely gives one result, it gives dozens. A set of parameters needs to be used to determine the best match. It's possible that AmGut uses a "longest frame" method and uBiome uses a "lowest e value" or some other parameter to determine the match.
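A minimal sketch of how two selection rules can name the same sequence differently. The hit list, taxon names, and scores below are invented for illustration, and "longest alignment" stands in for whatever rule either lab actually uses, which is not known here.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One candidate database match for a single query sequence (hypothetical data)."""
    taxon: str
    align_len: int   # number of aligned bases
    e_value: float   # expected number of chance matches; lower is better


# Invented candidate matches for one 16S read.
hits = [
    Hit("g__Roseburia", 148, 1e-60),
    Hit("g__Lachnospira", 150, 1e-58),
    Hit("g__Blautia", 140, 1e-55),
]


def pick_longest(candidates):
    """Choose the hit with the longest alignment."""
    return max(candidates, key=lambda h: h.align_len)


def pick_lowest_e(candidates):
    """Choose the hit with the smallest e-value."""
    return min(candidates, key=lambda h: h.e_value)


print(pick_longest(hits).taxon)   # g__Lachnospira
print(pick_lowest_e(hits).taxon)  # g__Roseburia
```

With the same read and the same database, the two rules return different genera, which is one way two reports on an identical sample could disagree.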

      I am just totally amazed that these tools are on the internet, free to use and the gene depositories all share info freely.

      Truly amazing times we live in!

    3. The technology is cool and I love the fact we have access to these tests. They have different methods and different databases etc., and it's new territory, so it'll take time to iron everything out. Also the taxonomy is moving all the time too - same problem that genetic genealogy companies have faced for years, in that new findings are made all the time which move things along and things get moved from one group to another (this might well be important when comparing results over time and perhaps between companies too). I'm sure both companies are rapidly getting better and better at it all. All that said though, a difference of more than 10 fold on some of the major taxa like Roseburia...I don't know, that seems quite a difference even accounting for all the factors above. I'm sure it'll get figured out though.

    4. I was very disappointed by the Roseburia! Most of the microbes in the bottom 2/3 of the report, were just reported as like .01% - .04%, those didn't really concern me, and both reports had some unique hits.

  7. Wouldn't surprise me in the least that if you submitted the same sample to the same lab under two different aliases you would probably get back varied results. Would be an interesting test of the capabilities of the lab & procedures.

    1. I had considered that! Wish I was filthy rich. I'm sure that they run duplicate samples as a check on their equipment, it's an industry standard tactic to check accuracy.

      Also, probably important to note. AmGut and uBiome (or anyone) don't really identify these microbes...they 'predict' what the microbe is by matching it to huge databases, and the databases have errors in them. It is only through huge projects like AmGut and uBiome that these databases get bigger and better.

      The prediction of species by examining genes is still relatively young, but impressively accurate.

  8. Is there a way to interpret the results on your own? Say, you have too much "bad" bacteria showing up in the report and need to rebalance? I really believe my gut issues led to Hashimoto's Thyroiditis, but I've never had my gut checked.

    1. uBiome has a really nice set-up that allows you to compare your results with all the others they have processed. It's helpful in seeing if you have some strange pathogens or lack of beneficial types.

  9. Hey Tim,

    Great blog! I’m glad I found it! I have results from 3 uBiome samples and 5 American Gut samples. I’ve been writing Python scripts to try to reproduce their results, just for fun and education. I’m also in the Coursera Bioinformatics class, but more for reference than to do the homework (I work full time too). So far, I have found that uBiome gives us about 20 times as many sequences per sample as American Gut (900,000 vs 40,000), but at first glance, the American Gut samples look more consistent.

    I got my uBiome raw data right off the web site. I don’t know if everyone can do that, since I chose a donation level that included data analysis, I think.

    I followed the American Gut project notebook, as you did, to look for my raw sample data (a tedious process!), and I found only one of 5 samples. But when I sent them an email, they sent me links to the other 4 samples. They are very helpful, so you might just ask them. I’m using the Greengenes taxonomy data from May 2013, as the notebook says, but I have questions about that. Each OTU (e.g. specific genus or species) has up to 64,000 sequences in the fasta file, and I want only 1 “truth” sequence, to compare my sequences against. Here’s a sample of my output, showing that Greengenes gives us 2900 unique OTUs:
    2900 unique OTUs written to COUNT PCT
    p__Actinobacteria; c__Actinobacteria; o__Actinomycetales; f__Corynebacteriaceae; g__Corynebacterium; s__ 64385 5.098
    p__Firmicutes; c__Bacilli; o__Lactobacillales; f__Streptococcaceae; g__Streptococcus; s__ 59282 4.694

    I’m interested in Akkermansia muciniphila because my level is much higher than most people's. It’s the bacterium that has been shown to possibly contribute to obesity when your levels are too low. Greengenes gives 1535 sequences for this species, and I copy/pasted a couple of them to BLAST, comparing to their known Akkermansia muciniphila genome. The longest sequence matched the entire 16s rRNA gene sequence perfectly! So I’m hoping I can just use the longest sequence from each of the 2900 unique OTUs in Greengenes as my “truth” catalog to identify all my sequences.
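The "longest sequence per OTU" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the FASTA records and the id-to-taxonomy mapping below are made up, and the function names are hypothetical, not part of Greengenes or any library.

```python
def parse_fasta(lines):
    """Yield (seq_id, sequence) pairs from FASTA-formatted lines."""
    seq_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            # Keep only the first token of the header as the id.
            seq_id, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if seq_id is not None:
        yield seq_id, "".join(chunks)


def longest_per_taxon(fasta_lines, taxonomy):
    """Return {taxonomy string: (seq_id, sequence)} keeping the longest sequence."""
    best = {}
    for seq_id, seq in parse_fasta(fasta_lines):
        taxon = taxonomy.get(seq_id)
        if taxon is None:
            continue  # sequence has no taxonomy entry; skip it
        if taxon not in best or len(seq) > len(best[taxon][1]):
            best[taxon] = (seq_id, seq)
    return best


# Invented toy data standing in for the real fasta and taxonomy files.
fasta = """>101
ACGTACGT
>102
ACGTACGT
ACGT
>103
GGCC
""".splitlines()

taxonomy = {
    "101": "g__Akkermansia; s__muciniphila",
    "102": "g__Akkermansia; s__muciniphila",
    "103": "g__Roseburia; s__",
}

catalog = longest_per_taxon(fasta, taxonomy)
for taxon, (seq_id, seq) in catalog.items():
    print(taxon, seq_id, len(seq))
```

One caveat on the design: the longest sequence is not necessarily the most representative one for an OTU, so a catalog built this way is a convenience, not a guarantee of correct matches.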

    By the way, I was shocked to find that my two earliest American Gut samples (April and June 2013) have completely different results now on the web page than they originally did. My guess is that they weren’t using the May 2013 Greengenes catalog yet in their original processing.

    My latest experiment is to compare a vegan sample to a meat/dairy sample from American Gut. But it will probably be months before they finish processing. I haven’t even taken the meat/dairy sample yet. By the way, American Gut has completely moved to UCSD.

  10. Ha! A fellow gut-geek!

    Amazing how many tools are available to the average schmuck willing to learn to use them. BLAST is just the tip of the iceberg. Send me an email and I'll send you copies of the lecture slides from last semester with links to about a dozen or more such tools used for comparative analysis.

    I had fun plugging all the raw data into MG-rast and playing with all the tools there.

    My problem is finding the time required to fully get into it. You really need to keep on top of what you send in as some things take several days to compute.

    I love that AmGut and uBiome are so cheap and, from what I see, accurate. I think we are closing in on identifying some key species and markers for gut health. With all of the results they are collecting, they can develop micro-arrays and an app to decipher, bringing us one step closer to a real-time, home-use tool for examining our gut flora.

    Hey, if you ever feel like writing up your experiments, I can give you author rights to post them as a blog-post here. It might be a nice reference for future gut sleuths, and give you a place to store your data for easy reference.

    Thanks for the note!

  11. Tim,

    Did you end up getting all your AmGut raw data? And your uBiome data? Do you know why uBiome uses 20 times as many sequences? Have you ever done or seen a study that shows the bacteria types that are most similar, i.e. the ones whose 16s rRNA sequences get confused for each other most easily? Maybe that would explain the uBiome/AmGut discrepancies you show above. I'm going to do my own comparison, just for fun.

    By the way, AmGut told me I wouldn't be able to directly compare their sequences to uBiome because they look at a different piece of the 16s gene. I found that they overlap by 85% for Akkermansia muciniphila (23 nucleotides apart out of 150). I don't think we need to directly compare them anyway, to get meaningful results. But maybe the discrepancies you see are due to where they are looking on the gene. I might start with yours and see what I find, e.g. I'll try to see where AmGut and uBiome were looking on the Roseburia 16s gene (will they also be 85% overlapping?) and look for other OTUs that are similar (esp to AmGut, since yours are "missing"). Do you know of tools that already do this? Either way, I want to build my own. Also by the way, my Roseburia ranged from .9% to 1.8% (two uBiome samples), with two AmGut at 1.2% and 1.4%, so it looks like my hypotheses will fail (i.e. they compare fine for me).
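The 85% figure follows from simple arithmetic: two 150-nucleotide windows whose start positions are 23 nucleotides apart share 127 positions. A small sketch, assuming both reads have the same length (the function name is hypothetical):

```python
def overlap_fraction(start_a, start_b, read_len):
    """Fraction of a read shared by two equal-length windows on the same gene."""
    offset = abs(start_a - start_b)
    shared = max(0, read_len - offset)  # no overlap once the offset exceeds the read
    return shared / read_len


frac = overlap_fraction(0, 23, 150)
print(f"{frac:.1%}")  # 84.7%
```

The same function can be reused for any other gene once the two start positions are known, e.g. for the Roseburia 16s gene mentioned above.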

    1. Dave - it really gets to be a mess when you start looking so closely, doesn't it? I have tons of raw data files. Have you figured out yet how to get your AmGut FASTA files from the Euro repository? Just do a search for your 9 digit AmGut kit number (e.g. 000016449) and then scroll down until you see the AmGut entry, click on it, then look for 'file 1' under 'fastq files' in the table. Save this as a zip and use however you like.

      I have 15-20 files loaded in MG-rast if you'd like to see, both AmGut and uBiome. And I know another guy who also has a bunch of files loaded that would probably love to play with some data with you (right, Barney?)

      But yeah, it's apples and oranges unless you are using the exact same databases and sequencing methods. The only thing I like about all this is that as these databases expand and information is shared between them we get closer and closer to being able to sequence species found in the human gut.

    2. Peeks his head out sheepishly ... yes. But not as many files as you Tim. Run the MG-Rast analysis yet Dave? There are some tweaks there that may help. You are much more advanced than I am if you are able to do that kind of genetic analysis and I would be riding your coattails rather than providing you with any meaningful help. I did notice the same thing you did about AmGut vs. UBiome samples on Tim's tests. Still debating about the 2 for 1 deal at UBiome.

  12. I have an MG-Rast account but haven't run any data yet. This poster has some negative comments on it. I've been told to use QIIME instead. I'm only learning now that identification of sequences is complicated and imperfect. I'll have to work hard in that next Bioinformatics class to do more on my own without falling into too many traps.

    1. That 'conquistador platypus' page confused me. They were looking for platypus RNA within 16s rRNA samples? Surely you see the problem there.

      I have not played with QIIME. Yep, we have to know that this is all imperfect, in a way. But, we are at a point now where the database is so big we don't have to discover many new species, all we have to do is match them.

      Microarrays are the latest 'big thing' in microbe matching. Really, what is the sense in looking at every single microbe in a human? In the old days, they had to culture the stool samples, unfortunately, only a very small percentage of microbes are culturable.

      It's kind of funny, the bioinformatic text books are still describing exactly how these processes work and the formulas used, but all anyone really needs to know is how to run the programs to get the best answer.

      It was like that taking computer classes in the early '90's. They wanted you to know how a computer worked, how the 1's and 0's were managed inside the chipsets, and how to write in DOS.

      I'll bet it really pisses the old-timer biologists off that we can just plug our zip files into a free online tool and get the results we need, without even knowing what rRNA stands for.

  15. Many thanks for your blog, Tim, you help so many people with your information!
    The quantity of bacteria (count) in the uBiome test: can we relate this number to the full number of bacteria (100 trillion) that we should have in the gut?

    1. I don't see how. The bacteria will increase/decrease in numbers based on food available. The total number of bacteria is not static. After a 12-hour fast, the bacteria are decimated in numbers. After a big feast, they grow as much as they need to consume any food you give them.

      Your poop will show what was produced to consume your last meal.

      Interesting thought, though. It would be cool to follow all of the species as they grow and die in real time.

    2. Thank you for your answer, Tim! I noticed differences in the uBiome test results of my friend (55000 - she took antibiotics for half a year, a year ago), me (1365000 - I have been taking high-dosage RS, prebiotics and probiotics for 10 months) and my husband (365000 - normal western diet).
      I am curious whether my next uBiome tests will show increasing bacteria. I am treating candida, and I assume that decreasing candida makes room in the gut and allows the bacteria to increase.

    3. I think the results are always given as a percentage, not an absolute number of cells for each type of bacteria. I took my raw data (from uBiome and AmGut) and recomputed the percentage for a few species, and my results generally agreed with theirs so far. You could take their percentages and estimate total cells of each type, assuming 100 trillion total cells. Where did you get the raw numbers (55000, 1365000, 365000) from uBiome? Are you looking at the raw data? You should be able to correlate those directly to the percentages they gave you.

  16. Thank you for your answer. Richard Sprague writes that "count" is the actual number of organisms found in the sample, and "count_norm" means a "normalized" version of the count, which we can think of as a percentage.
    It looks to me that "count" is a number, and "count_norm" is, when divided by 10000, a percentage.
    When I sum all the "count" numbers, my result is 1365000.
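A short sketch of the arithmetic discussed in this thread. It assumes, based only on the "divide by 10000 to get a percentage" observation above, that "count_norm" is the count scaled to parts per million; uBiome's actual normalization is not confirmed here. The taxa and counts are invented, and the 100 trillion total is the rough figure quoted in the comments, not a measurement.

```python
# Invented read counts for one sample, all at the same taxonomic rank.
counts = {"Bacteroides": 750, "Roseburia": 200, "Akkermansia": 50}

TOTAL_GUT_CELLS = 100_000_000_000_000  # "100 trillion", as assumed in the comments


def normalize(raw_counts):
    """Scale counts to parts per million (assumed meaning of count_norm)."""
    total = sum(raw_counts.values())
    return {taxon: round(c * 1_000_000 / total) for taxon, c in raw_counts.items()}


def to_percent(count_norm):
    """Convert a parts-per-million value to a percentage."""
    return count_norm / 10_000


def estimated_cells(count_norm, total_cells=TOTAL_GUT_CELLS):
    """Rough cell estimate for one taxon, assuming a fixed total population."""
    return count_norm * total_cells // 1_000_000


count_norm = normalize(counts)
for taxon, norm in count_norm.items():
    print(taxon, norm, f"{to_percent(norm):.1f}%", estimated_cells(norm))
```

Two cautions apply. The raw "count" total reflects how many sequences the lab read from the swab, so it is not obviously a measure of how many bacteria are in the gut. And if a file lists the same reads at several taxonomic ranks, a total should be taken within one rank only, otherwise reads are counted more than once.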