3.1 | Exploring CRABS-generated reference databases through
incorporated visualizations
By downloading sequencing data from EMBL, MitoFish, and NCBI online
repositories, the CRABS-generated MiFish-E/U reference database
incorporated 28,350 sequences, covering 16,906 species. The
‘--method diversity’ visualization (Figure 2.a) shows
that the majority of sequences belong to the class Actinopteri (36.2%),
followed by Mammalia (19.7%), Amphibia (16.9%), and Lepidosauria
(10.6%). The ‘--method amplicon_length’ visualization
displays an average amplicon length of ~180 bp, with a
slightly larger average amplicon size for birds and amphibians compared
to the target taxonomic group of fish (Supplement 4.a).
Additionally, the ‘--method phylo’ visualization indicated that
species-level taxonomic resolution might not always be obtainable for
this amplicon region (Supplement 4.b).
The Taberlet c/h primer set is designed to target land plants, and the
CRABS reference database was built using the NCBI and EMBL online
repositories. The curated CRABS reference database consisted of 71,031
sequences, covering 51,366 species. Based on the ‘--method diversity’
visualization output, the majority of sequences belong to
the classes Magnoliopsida (90.0%), Bryopsida (5.0%), and
Jungermanniopsida (3.8%) within the phylum Streptophyta (99.9%;
Supplement 4.c). Average amplicon length showed large
variation within the phylum Streptophyta, with amplicon size ranging
from <100 bp to ~180 bp (visualization method:
‘--method amplicon_length’; Supplement 4.d). Despite
the large variation in amplicon sizes, the
‘--method primer_efficiency’ visualization revealed only two positions in the
primer-binding regions with a significant proportion of
mismatches for species within the phylum Streptophyta (Figure
2.b).
For the mlCOIintF/jgHC02198 primer set, which is designed to target
eukaryotes, the CRABS reference database was built using the BOLD, EMBL,
MitoFish, and NCBI online repositories. The reference database included
590,228 sequences covering 109,545 species, with the phyla Arthropoda
(72.1%), Chordata (17.4%), and Mollusca (4.8%) most abundantly
present (visualization method: ‘--method diversity’;
Supplement 4.e). The ‘--method amplicon_length’
visualization displays an average amplicon length of
~313 bp, with high consistency between taxonomic groups
(Supplement 4.f). Furthermore, the ‘--method phylo’ visualization
revealed that intraspecific variation is present in the amplicon region for a
majority of taxa (Figure 2.c; example: genus Apteryx
[Kiwi]).
CRABS generated a reference database containing 339,286 sequences and
covering 42,961 species for the gITS7/ITS4 primer set, which is designed
to target fungi, by incorporating sequencing data from NCBI and EMBL.
According to the ‘--method diversity’ visualization, the
majority of sequences belong to the phyla Ascomycota (67.9%),
Basidiomycota (28.9%), and Mucoromycota (3.0%; Supplement
4.g). Amplicon size was restricted to between 100 bp and 500 bp during
database curation (Supplement 2). However, the
‘--method amplicon_length’ visualization indicates this range
could be too restrictive at the upper length limit (Figure
2.d). The ‘--method primer_efficiency’ visualization
showed that >80% of taxa within the phylum Ascomycota
had no diversity at the degenerate base locations in the
primer-binding regions, while also showing larger variation in base
composition at the 3’ end of the reverse primer
(Supplement 4.h).
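The 100 to 500 bp length restriction described above can be illustrated with a minimal Python sketch. This is not the CRABS implementation; the function name `filter_by_length`, the inclusive bounds, and the toy sequences are assumptions made for illustration. Counting the sequences discarded at each end of the range is one simple way to check whether a cutoff may be too restrictive.

```python
def filter_by_length(records, min_len=100, max_len=500):
    """Split sequences into kept, too short, and too long.

    `records` maps a sequence ID to its amplicon sequence.
    Bounds are treated as inclusive (an assumption for this sketch).
    """
    kept, too_short, too_long = {}, [], []
    for seq_id, seq in records.items():
        n = len(seq)
        if n < min_len:
            too_short.append(seq_id)
        elif n > max_len:
            too_long.append(seq_id)
        else:
            kept[seq_id] = seq
    return kept, too_short, too_long


# Hypothetical toy amplicons of 80, 240, and 520 bp.
toy_records = {
    "seq_1": "A" * 80,
    "seq_2": "ACGT" * 60,
    "seq_3": "A" * 520,
}

kept, too_short, too_long = filter_by_length(toy_records)
print("kept:", sorted(kept))
print("discarded as too short:", too_short)
print("discarded as too long:", too_long)
```

A large share of discarded sequences just above `max_len` would be consistent with the upper limit being too strict.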
3.2 | Comparing incorporated diversity between
reference databases
For each of the four primer sets tested in this study, reference
databases generated by CRABS contain more sequences and species than
those generated by ecoPCR, MetaCurator, and RESCRIPt (Figure
3). The one exception is the mlCOIintF/jgHC02198 reference database
generated by MetaCurator, which contains more species
(Figure 3c). Reference databases generated by MetaCurator
contain the second largest number of sequences and species, followed by
RESCRIPt and ecoPCR (Figure 3). RESCRIPt was unable to generate
reference databases for the mlCOIintF/jgHC02198 and gITS7/ITS4 primer
sets. Both primer sets amplify a wide variety of taxonomic groups and
contain degenerate bases in the primer-binding region, which made it
difficult to create local alignments and to run the in
silico PCR step in the standard QIIME toolkit (QIIME extract_reads).
A significant overlap in incorporated species was observed between
reference databases, with 80.6% of species incorporated in more than
one reference database for the MiFish-E/U primer set, 52.9% for
Taberlet c/h, 56.3% for mlCOIintF/jgHC02198, and 54.8% for the
gITS7/ITS4 primer set (Figure 4). Similarly, a significant
overlap in sequence IDs was observed between reference databases, with
55.3% of sequence IDs incorporated in more than one reference database
for the MiFish-E/U primer set, 37.1% for Taberlet c/h, 26.8% for
mlCOIintF/jgHC02198, and 27.6% for the gITS7/ITS4 primer set
(Supplement 5). Interestingly, the amplicon region retrieved by
the software packages was identical only between CRABS, ecoPCR, and
RESCRIPt, as MetaCurator failed to recover the full amplicon region
for all primer sets except gITS7/ITS4 (Figure 4).
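The overlap percentages reported above can be thought of as the share of items (species or sequence IDs) that occur in more than one reference database. The sketch below shows one way to compute such a value in Python. It assumes the denominator is the union of items across all databases, which the text does not state explicitly; the function name `shared_fraction` and the toy species labels are hypothetical.

```python
from collections import Counter


def shared_fraction(databases):
    """Fraction of items found in more than one database.

    `databases` maps a database name to an iterable of items
    (species names or sequence IDs). The denominator is the union
    of items across all databases (an assumption of this sketch).
    """
    counts = Counter()
    for items in databases.values():
        counts.update(set(items))  # count each item once per database
    if not counts:
        return 0.0
    shared = sum(1 for n in counts.values() if n > 1)
    return shared / len(counts)


# Hypothetical toy content, not data from the study.
toy = {
    "CRABS": {"sp_a", "sp_b", "sp_c", "sp_d"},
    "ecoPCR": {"sp_a", "sp_b"},
    "MetaCurator": {"sp_a", "sp_c", "sp_e"},
    "RESCRIPt": {"sp_a"},
}

print(f"{shared_fraction(toy):.1%} of species occur in more than one database")
```

In the toy data, three of the five species (`sp_a`, `sp_b`, `sp_c`) occur in at least two databases.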
3.3 | Taxonomic assignment differences between
reference databases
Reference database choice did not significantly impact the number of
OTUs assigned to a specific taxonomic rank (Figure 5). On
average, 10.6% ± 0.9% of OTUs failed to be assigned a taxonomy for the
MiFish-E/U sequencing data, 5.1% ± 0.9% of OTUs for the Taberlet c/h
sequencing data, 73.8% ± 3.6% of OTUs for the mlCOIintF/jgHC02198
sequencing data, and 16.8% ± 0.6% of OTUs for the gITS7/ITS4
sequencing data. Additionally, reference databases were highly similar
in which OTUs could be assigned a taxonomy
(Supplement 6).
In contrast, the taxonomic assignments of OTUs showed limited overlap
between reference databases, with 39.5% identical
taxonomy assignments for the MiFish-E/U sequencing data, 25.0% for the
Taberlet c/h sequencing data, 28.3% for the mlCOIintF/jgHC02198
sequencing data, and 30.0% for the gITS7/ITS4 sequencing data
(Figure 6). The limited overlap in taxonomy assignment resulted
from differences in taxonomic resolution for a specific OTU, rather than
the assignment of OTUs to different taxonomic lineages. No reference
database consistently achieved higher taxonomic
resolution across OTUs within each sequencing data set
(Supplement 6).
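The distinction drawn above, between assignments that differ only in taxonomic resolution and assignments that place an OTU in different lineages, can be made concrete with a short Python sketch. It is not the analysis code used in this study. It assumes each assignment is a tuple of taxon names ordered from highest to lowest rank, and the function name `compare_lineages` and the toy OTUs are hypothetical.

```python
from collections import Counter


def compare_lineages(a, b):
    """Classify two lineage assignments for the same OTU.

    Returns "identical" when both lineages match exactly,
    "resolution" when one lineage is a prefix of the other
    (same lineage, assigned to a different depth), and
    "conflict" when the lineages diverge at a shared rank.
    An empty tuple (no assignment) is a prefix of any lineage.
    """
    if a == b:
        return "identical"
    shorter = min(len(a), len(b))
    if a[:shorter] == b[:shorter]:
        return "resolution"
    return "conflict"


# Hypothetical assignments of three OTUs by two reference databases.
db_one = {
    "otu_1": ("Chordata", "Actinopteri", "Gadidae", "Gadus"),
    "otu_2": ("Chordata", "Actinopteri", "Gadidae"),
    "otu_3": ("Chordata", "Actinopteri"),
}
db_two = {
    "otu_1": ("Chordata", "Actinopteri", "Gadidae", "Gadus"),
    "otu_2": ("Chordata", "Actinopteri", "Gadidae", "Gadus"),
    "otu_3": ("Chordata", "Mammalia"),
}

summary = Counter(compare_lineages(db_one[otu], db_two[otu]) for otu in db_one)
print(dict(summary))
```

Under this classification, a low share of "identical" assignments together with few "conflict" cases would match the pattern described above, where most differences come from resolution.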