Optimising high-throughput sequencing data analysis, from gene database
selection to the analysis of compositional data: A case study on
tropical soil nematodes
Abstract
High-throughput sequencing (HTS) provides an efficient and
cost-effective way to generate large amounts of sequence data. However,
marker-based methods and the resulting datasets come with a range of
challenges and disputes, including incomplete reference databases,
controversial sequence similarity thresholds for delineating taxa, and
downstream compositional data analysis. Here, we use HTS data from a
soil nematode biodiversity experiment to address the following
questions: (1) how the choice of reference database affects HTS data
analysis, (2) whether the same ecological patterns are detected with
amplicon sequence variants (ASVs; 100% similarity) versus classical
operational taxonomic units (OTUs; 97% similarity), and (3) how
different data normalization methods affect the recovery of beta
diversity patterns and identification of differentially abundant taxa.
At the time of analysis, the SILVA database outperformed PR2, assigning
more reads to the family level and providing higher phylogenetic
resolution.
ASV- and OTU-based alpha and beta diversity of nematodes correlated
closely, indicating that OTU-based studies represent useful reference
points. For downstream data analyses, our results indicate that
rarefaction-based methods are more prone to missing real effects, while
methods based on the centered log-ratio (clr) transformation may
overestimate the tested effects. ANCOM-BC retains all data and accounts
for sample-specific sampling fractions, suggesting that it is currently
the optimal method for analyzing compositional data. Overall, our study
highlights the importance
of comparing and selecting taxonomic reference databases before data
analyses, and provides solid evidence for the similarity and
comparability of OTU- and ASV-based nematode studies. Further, the
results highlight the potential weaknesses of rarefaction-based and
clr-transformation-based methods. We recommend that future studies use
ASVs and that both taxonomic reference databases and normalization
strategies be carefully tested and selected before data analysis.
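
To make the two normalization families contrasted above concrete, the following minimal Python sketch shows rarefaction (subsampling reads to a common depth) and the clr transformation on a single count vector. It is an illustration only and not the pipeline used in this study: the function names, the example counts, the fixed random seed, and the pseudocount of 0.5 used to handle zeros are all assumptions made for the example.

```python
import math
import random


def rarefy(counts, depth, seed=0):
    """Subsample a count vector to a fixed depth without replacement."""
    total = sum(counts)
    if depth > total:
        raise ValueError("depth exceeds the sample's total read count")
    # Expand counts into one taxon index per read, then draw `depth` reads.
    reads = [taxon for taxon, n in enumerate(counts) for _ in range(n)]
    drawn = random.Random(seed).sample(reads, depth)
    rarefied = [0] * len(counts)
    for taxon in drawn:
        rarefied[taxon] += 1
    return rarefied


def clr(counts, pseudocount=0.5):
    """Centered log-ratio: log of each part minus the mean of the logs.

    The pseudocount is an assumption of this sketch; it avoids log(0).
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical read counts for four taxa in two samples of unequal depth.
sample_a = [120, 30, 0, 50]
sample_b = [12, 3, 1, 4]

# Rarefaction discards reads from the deeper sample to match the shallower one.
rarefied_a = rarefy(sample_a, depth=sum(sample_b))

# The clr transformation keeps all reads and re-expresses them as log-ratios.
clr_a = clr(sample_a)
```

In this sketch, rarefaction reduces sample_a from 200 reads to 20, which illustrates how information is discarded, whereas the clr values retain all reads but depend on how zeros are handled.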