3.1 | Exploring CRABS-generated reference databases through incorporated visualizations
By downloading sequencing data from the EMBL, MitoFish, and NCBI online repositories, the CRABS-generated MiFish-E/U reference database incorporated 28,350 sequences covering 16,906 species. The ‘--method diversity’ visualization (Figure 2.a) shows that the largest share of sequences belongs to the class Actinopteri (36.2%), followed by Mammalia (19.7%), Amphibia (16.9%), and Lepidosauria (10.6%). The ‘--method amplicon_length’ visualization displays an average amplicon length of ~180 bp, with a slightly larger average amplicon size for birds and amphibians than for the target taxonomic group of fish (Supplement 4.a). Additionally, the ‘--method phylo’ visualization indicated that species-level taxonomic resolution might not always be obtainable for this amplicon region (Supplement 4.b).
The Taberlet c/h primer set is designed to target land plants, and the corresponding CRABS reference database was built using the NCBI and EMBL online repositories. The curated CRABS reference database consisted of 71,031 sequences covering 51,366 species. Based on the ‘--method diversity’ visualization output, the majority of sequences belong to the classes Magnoliopsida (90.0%), Bryopsida (5.0%), and Jungermanniopsida (3.8%) within the phylum Streptophyta (99.9%; Supplement 4.c). Average amplicon length varied widely within the phylum Streptophyta, with amplicon size ranging from <100 bp to ~180 bp (visualization method: ‘--method amplicon_length’; Supplement 4.d). Despite the large variation in amplicon sizes, the ‘--method primer_efficiency’ visualization revealed only two positions in the primer-binding regions with a significant proportion of mismatches for species within the phylum Streptophyta (Figure 2.b).
For the mlCOIintF/jgHC02198 primer set, which is designed to target eukaryotes, the CRABS reference database was built using the BOLD, EMBL, MitoFish, and NCBI online repositories. The reference database included 590,228 sequences covering 109,545 species, with the phyla Arthropoda (72.1%), Chordata (17.4%), and Mollusca (4.8%) most abundantly represented (visualization method: ‘--method diversity’; Supplement 4.e). The ‘--method amplicon_length’ visualization displays an average amplicon length of ~313 bp, with high consistency between taxonomic groups (Supplement 4.f). Furthermore, the ‘--method phylo’ visualization revealed that intraspecific variation is present in the amplicon region for a majority of taxa (Figure 2.c; example: genus Apteryx [kiwi]).
For the gITS7/ITS4 primer set, which is designed to target fungi, CRABS generated a reference database containing 339,286 sequences covering 42,961 species by incorporating sequencing data from NCBI and EMBL. According to the ‘--method diversity’ visualization, the majority of sequences belong to the phyla Ascomycota (67.9%), Basidiomycota (28.9%), and Mucoromycota (3.0%; Supplement 4.g). Amplicon size was restricted to between 100 bp and 500 bp during database curation (Supplement 2). However, the ‘--method amplicon_length’ visualization indicates that the upper limit of this range could be too restrictive (Figure 2.d). The ‘--method primer_efficiency’ visualization showed that >80% of taxa within the phylum Ascomycota had no diversity at the degenerate base positions in the primer-binding regions, while also showing a larger variation in base composition at the 3’ end of the reverse primer (Supplement 4.h).
3.2 | Comparing incorporated diversity between reference databases
For each of the four primer sets tested in this study, the reference databases generated by CRABS contain the largest number of sequences and species compared with ecoPCR, MetaCurator, and RESCRIPt (Figure 3), with one exception: the mlCOIintF/jgHC02198 reference database generated by MetaCurator contains more species (Figure 3c). Reference databases generated by MetaCurator contain the second largest number of sequences and species, followed by RESCRIPt and ecoPCR (Figure 3). RESCRIPt was unable to generate reference databases for the mlCOIintF/jgHC02198 and gITS7/ITS4 primer sets. Both primer sets amplify a wide variety of taxonomic groups and contain degenerate bases in the primer-binding region, which made it difficult to create local alignments and to implement the in silico PCR step in the standard QIIME toolkit (QIIME extract_reads).
A significant overlap in incorporated species was observed between reference databases, with 80.6% of species incorporated in more than one reference database for the MiFish-E/U primer set, 52.9% for Taberlet c/h, 56.3% for mlCOIintF/jgHC02198, and 54.8% for the gITS7/ITS4 primer set (Figure 4). Similarly, a significant overlap in sequence IDs was observed between reference databases, with 55.3% of sequence IDs incorporated in more than one reference database for the MiFish-E/U primer set, 37.1% for Taberlet c/h, 26.8% for mlCOIintF/jgHC02198, and 27.6% for the gITS7/ITS4 primer set (Supplement 5). Interestingly, the amplicon region retrieved by the software packages was identical only among CRABS, ecoPCR, and RESCRIPt, because MetaCurator failed to recover the full amplicon region for all primer sets except gITS7/ITS4 (Figure 4).
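The overlap percentages reported here can be illustrated with a minimal Python sketch. It assumes that the denominator is the union of all species (or sequence IDs) across the compared reference databases, which the text does not state explicitly; the function name `shared_fraction` and the toy species sets are hypothetical and are not taken from the study.

```python
from collections import Counter


def shared_fraction(databases):
    """Fraction of items (species names or sequence IDs) found in more than
    one database, relative to the union of items over all databases.

    `databases` maps a database label to an iterable of items.
    """
    counts = Counter()
    for items in databases.values():
        counts.update(set(items))  # count each item at most once per database
    if not counts:
        return 0.0
    shared = sum(1 for n in counts.values() if n > 1)
    return shared / len(counts)


# Hypothetical toy input, for illustration only (not data from this study).
toy = {
    "CRABS": {"Salmo salar", "Gadus morhua", "Thunnus thynnus", "Anguilla anguilla"},
    "ecoPCR": {"Salmo salar", "Gadus morhua"},
    "MetaCurator": {"Salmo salar", "Thunnus thynnus", "Esox lucius"},
}
print(f"{shared_fraction(toy):.1%}")  # 3 of 5 species are shared -> 60.0%
```

The same function applies to sequence IDs by passing sets of accession numbers instead of species names.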
3.3 | Taxonomic assignment differences between reference databases
Reference database choice did not significantly impact the number of OTUs assigned to a specific taxonomic rank (Figure 5). On average, 10.6% ± 0.9% of OTUs could not be assigned a taxonomy for the MiFish-E/U sequencing data, 5.1% ± 0.9% of OTUs for the Taberlet c/h sequencing data, 73.8% ± 3.6% of OTUs for the mlCOIintF/jgHC02198 sequencing data, and 16.8% ± 0.6% of OTUs for the gITS7/ITS4 sequencing data. Additionally, the reference databases were highly similar in which OTUs could be assigned a taxonomy (Supplement 6).
In contrast, the taxonomic assignments of OTUs showed limited overlap between reference databases, with 39.5% identical taxonomy assignments for the MiFish-E/U sequencing data, 25.0% for the Taberlet c/h sequencing data, 28.3% for the mlCOIintF/jgHC02198 sequencing data, and 30.0% for the gITS7/ITS4 sequencing data (Figure 6). The limited overlap in taxonomy assignment resulted from differences in taxonomic resolution for a specific OTU, rather than from the assignment of OTUs to different taxonomic lineages. No consistent pattern was observed in which reference database achieved higher taxonomic resolution across OTUs within each sequencing data set (Supplement 6).
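The distinction between a difference in taxonomic resolution and an assignment to a different lineage can be made concrete with a small Python sketch. This is an illustrative assumption about how such a comparison could be scripted, not the procedure used in the study: each assignment is represented as a tuple of rank names ordered from highest to lowest rank and truncated at the deepest assigned rank, and the function name `compare_assignments` is hypothetical.

```python
def compare_assignments(lineages):
    """Classify the assignments one OTU received from several databases.

    `lineages` is a list of tuples, highest rank first, each truncated at
    the deepest rank that was assigned. Returns one of:
      "identical"  - all databases gave the same lineage
      "resolution" - lineages agree but stop at different depths
      "conflict"   - at least two lineages diverge at a shared rank
    """
    unique = set(lineages)
    if len(unique) == 1:
        return "identical"
    longest = max(unique, key=len)
    # If every lineage is a prefix of the deepest one, the databases agree
    # on the lineage and differ only in how far down they resolve it.
    if all(longest[:len(lin)] == lin for lin in unique):
        return "resolution"
    return "conflict"


# Hypothetical example: one database resolves to species, another to genus.
species_level = ("Chordata", "Actinopteri", "Salmonidae", "Salmo", "Salmo salar")
genus_level = ("Chordata", "Actinopteri", "Salmonidae", "Salmo")
print(compare_assignments([species_level, genus_level]))  # resolution
```

Under this representation, an unassigned OTU (an empty tuple) counts as a resolution difference rather than a conflict, which is a design choice of the sketch.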