I have had the 2014 paper “Waste Not, Want Not: Why Rarefying Microbiome Data Is Inadmissible” by McMurdie and Holmes sitting on my desk for a while now. Yesterday I finally got around to reading it and was immediately a little skeptical as a result of the hyperbole with which they criticized the common practice of subsampling* libraries of 16S rRNA gene reads during microbial community analysis. The logic of that practice proceeds like this:

To develop 16S rRNA gene data (or any other marker gene data) describing a microbial community we generally collect an environmental sample, extract the DNA, amplify it, and sequence the amplified material. The extract might contain the DNA from 1 billion microbial cells present in the environment. Only a tiny fraction (<< 1 %) of these DNA molecules actually get sequenced; a “good” sequence run might contain only tens of thousands of sequences per sample. After quality control and normalizing for multiple copies of the 16S rRNA gene it will contain far fewer. Thus the final dataset contains a very small random sample of the original population of DNA molecules.

Most sequence-based microbial ecology studies involve some kind of comparison between samples or experimental treatments. It makes no sense to compare the raw abundance of taxa between two datasets of different sizes, as the dissimilarity between the datasets will appear much greater than it actually is.

One solution is to normalize by dividing the abundance of each taxon by the total reads in the dataset to get its relative abundance. In theory this works great, but it has the disadvantage that it does not take into account that a larger dataset has sampled the original population of DNA molecules more deeply. Thus more rare taxa might be represented in the larger dataset.

A common practice is to reduce the amount of information present in the larger dataset by subsampling it to the size of the smaller dataset. This attempts to approximate the case where both datasets have undergone the same level of random sampling. McMurdie and Holmes argue that this approach is indefensible for two common types of analysis: identifying differences in community structure between multiple samples or treatments, and identifying differences in abundance for specific taxa between samples and treatments. I think the authors do make a reasonable argument for the latter analysis; however, at worst the use of subsampling and/or normalization simply reduces the sensitivity of the analysis. I suspect that dissimilarity calculations between treatments or samples using realistic datasets are much less sensitive to reasonable subsampling than the authors suggest.

I confess that I am (clearly) very far from being a statistician and there is a lot in McMurdie and Holmes, 2014 that I’m still trying to digest. I hope that our colleagues in statistics and applied mathematics continue optimizing these (and other) methods so that microbial ecology can improve as a quantitative science. There’s no need, however, to get everyone spun up without a good reason. To try and understand if there is a good reason, at least with respect to my data and analysis goals, I undertook some further exploration. I would strongly welcome any suggestions, observations, and criticisms to this post!

My read of McMurdie and Holmes is that the authors object to subsampling because it disregards data that is present and that could be used by more sophisticated (i.e. parametric) methods to estimate the true abundance of taxa. This is true; data discarded from larger datasets does have information that can be used to estimate the true abundance of taxa among samples.
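To make the two approaches discussed above concrete, here is a minimal sketch in Python of relative-abundance normalization and of subsampling a larger library down to the size of a smaller one. Everything in it is an assumption for illustration only: the toy counts, the function names, and the choice of Bray-Curtis as the dissimilarity measure are not taken from McMurdie and Holmes or from any particular analysis package.

```python
import random
from collections import Counter


def relative_abundance(counts):
    """Divide each taxon's read count by the total reads in the library."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def subsample(counts, depth, seed=None):
    """Randomly draw `depth` reads without replacement from a library."""
    if depth > sum(counts.values()):
        raise ValueError("depth exceeds the number of reads in the library")
    rng = random.Random(seed)
    # Expand the count table into one entry per read, then sample reads.
    reads = [taxon for taxon, n in counts.items() for _ in range(n)]
    return Counter(rng.sample(reads, depth))


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance tables."""
    taxa = set(a) | set(b)
    shared = sum(min(a.get(t, 0), b.get(t, 0)) for t in taxa)
    return 1 - 2 * shared / (sum(a.values()) + sum(b.values()))


# Hypothetical libraries: same community, very different sequencing depth.
small = {"A": 50, "B": 30, "C": 20}
large = {"A": 5000, "B": 3000, "C": 1990, "D": 10}

raw = bray_curtis(small, large)
normalized = bray_curtis(relative_abundance(small), relative_abundance(large))
rarefied = bray_curtis(small, subsample(large, sum(small.values()), seed=1))

print(f"raw counts:         {raw:.3f}")
print(f"relative abundance: {normalized:.3f}")
print(f"subsampled:         {rarefied:.3f}")
```

With these toy numbers the raw comparison gives a dissimilarity of about 0.98 even though the two libraries describe nearly the same community, while the relative-abundance comparison gives about 0.001. The subsampled value depends on the random draw, and the rare taxon "D" will usually be lost, which is exactly the discarded information the authors object to.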