Optimising High-throughput sequencing data analysis, from gene database selection to the analysis of compositional data: A case study on tropical soil nematodes

Simin Wang; Dominik Schneider; Tamara Hartke; Johannes Ballauff; Carina Moura; Garvin Schulz; Zhipeng Li; Andrea Polle; Rolf Daniel; Oliver Gailing; Bambang Irawan; Stefan Scheu; Valentyna Krashevska

doi:10.22541/au.170670978.85305089/v1

Abstract

High-throughput sequencing (HTS) provides an efficient and cost-effective way to generate large amounts of sequence data. However, marker-based methods and the resulting datasets come with a range of challenges and disputes, including incomplete reference databases, controversial sequence similarity thresholds for delineating taxa, and the compositional nature of the data in downstream analysis. Here, we use HTS data from a soil nematode biodiversity experiment to address the following questions: (1) how the choice of reference database affects HTS data analysis, (2) whether the same ecological patterns are detected with amplicon sequence variants (ASVs, 100% similarity) versus classical operational taxonomic units (OTUs, 97% similarity), and (3) how different data normalization methods affect the recovery of beta diversity patterns and the identification of differentially abundant taxa. At this time, the SILVA database performed better than PR2, assigning more reads to family level and providing higher phylogenetic resolution. ASV- and OTU-based alpha and beta diversity of nematodes correlated closely, indicating that OTU-based studies represent useful reference points. For downstream data analyses, our results indicate that rarefaction-based methods are more vulnerable to missed findings, while methods based on the centered log-ratio (clr) transformation may overestimate tested effects. ANCOM-BC retains all data and accounts for uneven sampling fractions across samples, suggesting that it is currently the optimal method for analyzing compositional data. Overall, our study highlights the importance of comparing and selecting taxonomic reference databases before data analysis, and provides solid evidence for the similarity and comparability of OTU- and ASV-based nematode studies. Further, the results highlight the potential weaknesses of rarefaction-based and clr-transformation-based methods. We recommend that future studies use ASVs and that both taxonomic reference databases and normalization strategies be carefully tested and selected before the data are analyzed.
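
To make the contrast between the two normalization families named in the abstract concrete, the following is a minimal sketch in Python, not the authors' pipeline. It assumes a samples-by-taxa count matrix; the function names, the example counts, the pseudocount of 0.5, and the rarefaction depth of 100 are all hypothetical choices made for illustration. ANCOM-BC is not implemented here.

```python
import numpy as np


def clr_transform(counts, pseudocount=0.5):
    """Centered log-ratio transform of a samples x taxa count matrix.

    The pseudocount is an assumption of this sketch, added so that
    zero counts do not produce log(0).
    """
    x = np.asarray(counts, dtype=float) + pseudocount
    log_x = np.log(x)
    # Subtracting the per-sample mean of the logs is equivalent to
    # dividing each count by the sample's geometric mean before the log.
    return log_x - log_x.mean(axis=1, keepdims=True)


def rarefy(counts, depth, seed=0):
    """Subsample every sample to a common depth without replacement.

    Samples with fewer than `depth` reads are dropped, which is one way
    rarefaction discards data. Returns the rarefied rows and a boolean
    mask marking which input samples were kept.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=np.int64)
    n_taxa = counts.shape[1]
    kept = counts.sum(axis=1) >= depth
    rows = []
    for row in counts[kept]:
        # Expand the counts into one taxon label per read, then draw reads.
        reads = np.repeat(np.arange(n_taxa), row)
        drawn = rng.choice(reads, size=depth, replace=False)
        rows.append(np.bincount(drawn, minlength=n_taxa))
    rarefied = np.array(rows, dtype=np.int64).reshape(-1, n_taxa)
    return rarefied, kept


# Hypothetical counts: 3 samples x 4 taxa, with very uneven sequencing depth.
example_counts = np.array([
    [120, 30, 0, 50],
    [10, 5, 2, 3],
    [400, 100, 60, 40],
])

clr_values = clr_transform(example_counts)
rarefied_counts, kept_mask = rarefy(example_counts, depth=100)
```

In this sketch the clr transform keeps all three samples but depends on the chosen pseudocount, whereas rarefaction drops the shallow second sample and discards reads from the deeper ones.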