Bioinformatics
Sequencing reads were assigned to samples using the unique dual indexes
and to markers based on the primer sequence. Sequencing adapters and
locus-specific markers were removed from paired-end reads using the
linked-adapter function in cutadapt (Martin, 2011). Trimmed
reads were imported into qiime2 (Bolyen et al., 2019) and data
were then denoised, in a process including estimating sequencing error
rates from the data, correcting errors, dereplicating sequences,
removing chimeras, and then merging overlapping paired-end reads using
the dada2 denoise-paired function with default
parameters (Callahan et al., 2016). Reads were then truncated at +/-
10% of the expected amplicon-size for each locus to remove low quality
bases at the trailing ends of sequencing reads (Callahan et al., 2016).
The denoising algorithm derives an error model from the sequencing data
and aims to correct errors in order to minimize single-nucleotide
sequencing errors and therefore, preserve true species-level variation
as amplicon sequencing variants (ASV; Callahan, McMurdie, & Holmes,
2017). Because of the short fragment length of the minibarcodes, often
just a small number of base-pair differences distinguish species within
the same genus (Edgar, 2018), and thus, ASVs have the potential to
provide better species-level taxonomic resolution than other analysis
methods (e.g., operational taxonomic units, OTU; Callahan, McMurdie, &
Holmes, 2017).
Despite similar DNA inputs to PCR reactions, sequencing read depth
varied several orders of magnitude among markers for reference DNA pool
samples (SI, Fig. S2). To better evaluate marker performance, we
subsampled 100,000 reads per marker per sample for each occurrence that
included >100,000 reads (two markers, 16Svar and
L2513/H2714 had <100,000 reads). Subsampling was done on the
fastq files using the sample function from seqtk
(https://github.com/lh3/seqtk) and then output (subsampled) data
were reprocessed in qiime2 and dada2, as described
above.
Data for each marker was analysed separately, and the resulting tables
included read counts for every ASV sequence occurring in each sample.
ASV sequence data were exported from qiime2 in FASTA format and
then queried against a custom metazoan database derived from the NCBI
Nucleotide sequence database. Separate databases for each gene (12S,
16S, COI, 18S, and 28S) were created by querying NCBI using Entrez. The
search was performed using the gene name and a filter that included only
animals and sequences matching the NCBI taxon ID for Metazoa (33208).
Gene names were offered to Entrez in many variants (e.g. “COI”,
“COX1”, “cytochrome oxidase subunit I”, “cytochrome oxidase subunit
1”) to account for variation in naming scheme (additional details and
code on GitHub). For mitochondrial genes (12S, 16S, and COI), queries
were also restricted to mitochondrial sequences and filtered by length
to exclude sequences substantially shorter than any of the targeted
amplicons (<50 bp) while including whole mitochondria of all
sizes (filter <100,000 bp; largest known metazoan
mitochondrial genome is 80,923 bp; Stampar et al., 2019). For nuclear
genes (18S and 28S), the same length filters were used to exclude short
sequences (<50 bp) and long genomic scaffolds
(<100,000 bp). Search terms “scaffold” and “shotgun” were
excluded from search queries. These filtered Entrez queries were then
downloaded separately in FASTA format. By using only NCBI nucleotide
data for genes that corresponded to our markers, and only searching the
custom database that corresponded to the appropriate gene for each
primer set, we were able to reduce overall computation time and
potential off-target BLAST hits. Downloaded sequences were further
filtered to remove environmental, unverified, uncultured, and protein
database sequences and then configured into a BLAST database with
makeblastdb from the BLAST+ suite (Camacho et al., 2009; code available
on GitHub).
FASTA output from qiime2 was queried against the reference
databases using BLAST (Altschul, Gish, Miller, Myers, & Lipman, 1990)
command line with the blastn program using 96% minimum identity, 95%
query coverage; e-value = 10, reward = 2 and penalty = -3, penalty to
open a gap = 5 and to extend a gap = 2, and with no limit to the number
of hits per sequence (no culling limit). The BLAST query might have
included some liberal matches, but all BLAST hits were further filtered
and refined in subsequent steps (the highest e-value for data passing
filter = 1.54 e-38). GenBank accession numbers were
used to access the taxonomic lineage for each hit using a custom bash
script which utilizes NCBI’s accession2taxid files and TaxonKit (Shen &
Xiong, 2019) to extract seven taxonomic ranks (i.e., domain, phylum,
class, order, family, genus, species) for each BLAST hit (code on
GitHub).
GNU Parallel was used throughout the bioinformatic pipeline to run
multiple jobs simultaneously while maintaining the optimal number of
threads for each program (Tange, 2011).