Taxonomic assignment
Separately from the sample-based decontamination procedure described
above, taxonomic assignment for each metabarcoding sequence required
evaluating the full set of BLAST hits for each ASV using a custom R
script (R Core Team, 2019). The goal of the R script was to obtain the
highest taxonomic resolution for each sequence while accounting for all
BLAST hits above the 96% minimum identity required by the blastn query.
Species-level identification was accepted only if the ASV sequence
matched the database reference sequence at >98% identity
(as in Alberdi et al., 2018) and only if no BLAST hit within 2%
identity of the top hit matched a different species. When BLAST results
for a given ASV violated either of these rules, the next taxonomic level
(i.e., common ancestor) was tested using the same criteria and so on
until a consensus taxonomic rank was obtained within the top 2%
identity of matches. For example, when an ASV had a significant hit to
only a single species, that species was assigned unless the sequence match
was <98% identity, in which case the ASV was assigned
to genus level. However, when an ASV had significant hits to
multiple taxa, the common ancestor for BLAST hits within the top 2%
identity determined whether that sequence could be attributed to a
species, genus, or family, or whether the sequence provided too little
informative variation for a high-resolution assignment (code available
on GitHub).
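The decision rule described above can be illustrated with a minimal Python sketch. This is not the authors' R script; the function name, the dictionary layout of a BLAST hit, and the restriction to three ranks are assumptions made for illustration. The sketch also assumes that hits have already been filtered at the 96% minimum identity, that the >98% threshold applies to the species rank only, and that the 2% window includes its lower boundary.

```python
# Illustrative sketch only; names and data layout are hypothetical.
RANKS = ("species", "genus", "family")  # finest to coarsest


def assign_taxon(hits, species_min=98.0, window=2.0):
    """Return (rank, name) for the finest rank with a consensus, else None.

    hits: list of dicts with a percent 'identity' and one key per rank,
    assumed to be pre-filtered at the 96% minimum identity.
    """
    if not hits:
        return None
    top = max(h["identity"] for h in hits)
    # Only hits within `window` percentage points of the top hit are compared
    near_top = [h for h in hits if h["identity"] >= top - window]
    for rank in RANKS:
        # Assumption: the >98% identity rule applies to the species rank only
        if rank == "species" and top <= species_min:
            continue
        names = {h[rank] for h in near_top}
        if len(names) == 1:
            return rank, names.pop()
    return None  # no consensus at family level or below


def hit(identity, species, genus, family):
    """Convenience constructor for an example hit."""
    return {"identity": identity, "species": species,
            "genus": genus, "family": family}


single_high = [hit(99.5, "Clupea harengus", "Clupea", "Clupeidae")]
single_low = [hit(97.0, "Clupea harengus", "Clupea", "Clupeidae")]
two_species = [hit(99.0, "Clupea harengus", "Clupea", "Clupeidae"),
               hit(98.0, "Clupea pallasii", "Clupea", "Clupeidae")]
two_genera = [hit(99.0, "Clupea harengus", "Clupea", "Clupeidae"),
              hit(98.5, "Sardina pilchardus", "Sardina", "Clupeidae")]
```

Under these assumptions, `single_high` resolves to species, `single_low` and `two_species` fall back to the genus Clupea, and `two_genera` falls back to the family Clupeidae.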
Decontaminated ASV and read count data were merged with taxonomic
information from ranked and filtered BLAST hits. Multiple ASVs within a
locus that matched the same taxon at more than one taxonomic level
(e.g., one ASV identifies the family Clupeidae and another matches the
genus Clupea) were merged to retain the highest-resolution
assignment (in this example, the genus Clupea) for each taxon
within each replicate/locus. We reasoned that both sequences would
likely come from the same fish, and therefore retained the
higher-resolution assignment.
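One way to express this merging step is sketched below in Python. The row layout, the function names, and the locus label "16S" are hypothetical. Two further assumptions go beyond what the text states: read counts of the merged ASVs are summed, and a coarse assignment is left unmerged when more than one higher-resolution candidate exists in the same replicate/locus.

```python
# Illustrative sketch only; names and data layout are hypothetical.
from collections import defaultdict


def depth(lineage):
    """Number of resolved ranks; None marks an unresolved rank."""
    return sum(name is not None for name in lineage)


def is_ancestor(coarse, fine):
    """True if `fine` resolves more ranks and agrees wherever `coarse` is resolved."""
    return depth(fine) > depth(coarse) and all(
        c is None or c == f for c, f in zip(coarse, fine)
    )


def merge_assignments(rows):
    """rows: (replicate, locus, lineage, reads); lineage = (family, genus, species)."""
    totals = defaultdict(int)
    for replicate, locus, lineage, reads in rows:
        totals[(replicate, locus, lineage)] += reads

    merged = defaultdict(int)
    for (replicate, locus, lineage), reads in totals.items():
        # Higher-resolution assignments of the same taxon in this replicate/locus
        finer = [other for (r, l, other) in totals
                 if (r, l) == (replicate, locus) and is_ancestor(lineage, other)]
        # Keep only the most resolved candidates
        tips = [f for f in finer if not any(is_ancestor(f, g) for g in finer)]
        # Assumption: merge only when the finer assignment is unambiguous
        target = tips[0] if len(tips) == 1 else lineage
        merged[(replicate, locus, target)] += reads  # assumption: reads are summed
    return dict(merged)


rows = [
    ("rep1", "16S", ("Clupeidae", None, None), 40),
    ("rep1", "16S", ("Clupeidae", "Clupea", None), 60),
    ("rep2", "16S", ("Clupeidae", None, None), 10),
]
result = merge_assignments(rows)
```

In this example the family-level reads in `rep1` are folded into the genus Clupea, while the family-level assignment in `rep2` is kept because no finer assignment exists in that replicate.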
Finally, taxonomic assignments were used to compare the performance of
the individual markers and metabarcoding loci for recovering species
added to the vouchered reference and full reference DNA pools, and
to determine the optimal combination of markers to maximize
identification of reference taxa to species level.
The portfolio of complementary markers was identified by ranking markers
using an accumulation curve to identify which recovered the greatest
number of species from the FR, followed by the greatest number of
additional species, and so on until the curve saturated (Fig. 1). The
minimal panel of primer pairs that captured the full species diversity
in the DNA pools was used to analyse the experimental feeds and examine
quantitative relationships between relative tissue abundance and
sequencing read proportions in heterogeneous mixtures.
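The marker-ranking procedure can be read as a greedy accumulation, sketched below in Python. The function name, marker labels, and species labels are hypothetical, and ties are broken by input order, which is an assumption. A greedy ranking of this kind is not guaranteed to find the smallest possible panel in every case, so the sketch should be taken as one plausible reading of the procedure rather than the authors' implementation.

```python
# Illustrative sketch only; names and data are hypothetical.
def rank_markers(detections):
    """Greedily rank markers by the number of additional species recovered.

    detections: dict mapping marker name -> set of species identified
    to species level by that marker.
    Returns a list of (marker, new_species, cumulative_species).
    """
    remaining = dict(detections)
    covered = set()
    order = []
    while remaining:
        # Marker contributing the most species not yet covered
        best = max(remaining, key=lambda m: len(remaining[m] - covered))
        gain = len(remaining[best] - covered)
        if gain == 0:
            break  # the accumulation curve has saturated
        covered |= remaining.pop(best)
        order.append((best, gain, len(covered)))
    return order


detections = {
    "marker_A": {"sp1", "sp2", "sp3"},
    "marker_B": {"sp3", "sp4"},
    "marker_C": {"sp1", "sp2"},
    "marker_D": {"sp4", "sp5"},
}
panel = rank_markers(detections)
```

Here `marker_A` is selected first with three species, `marker_D` adds two more, and the remaining markers contribute nothing new, so the panel stops at two markers.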