2 | METHODS
2.1 | CRABS workflow
We present CRABS as a versatile software program to generate curated
reference databases for metagenomic analyses. The CRABS workflow
consists of five modules (Figure 1), including: (i)sequence retrieval : download and import data from online
repositories into CRABS; (ii) amplicon extraction : extract
amplicon regions through in silico PCR analysis and pairwise
global alignments; (iii) database curation : clean up the database
through dereplication, sequence parameters, and metadata information;
(iv) database format : export the database into various formats to
use CRABS-generated reference databases into most taxonomic assignment
software packages; and (v) database exploration : generate a
summary overview of the final reference database through multiple
post-processing functions and visualizations. A brief explanation for
each of the five modules is provided below. Additional information for
each function with example code can be found on GitHub
(https://github.com/gjeunen/reference_database_creator).
2.2 | Module 1: sequence retrieval
CRABS supports the download of sequencing data from four major online
repositories using the ‘db_download ’ function, including Barcode
of Life Data System (BOLD), National Center for Biotechnology
Information (NCBI), European Molecular Biology Laboratory (EMBL), and
Mitochondrial Genome Database of Fish (MitoFish). Upon downloading,
sequencing data will be processed to CRABS format, a simple two-line
fasta format with NCBI accession numbers as header information. When
accession numbers are not available, CRABS will generate unique sequence
IDs using the following format:
‘CRABS_ [num]:species_name ’. Sequence data can be
merged using the ‘db_merge ’ function when data from multiple
sources are downloaded. Additionally, in-house generated data, in fasta
format, can be imported into CRABS format using the ‘db_import ’
function.
2.3 | Module 2: amplicon extraction
Once sequences have been downloaded and imported into CRABS, the
amplicon region can be extracted. For amplicon-based sequencing, we
recommend a two-step approach, whereby (i) amplicons are found by
locating the forward and reverse primer-binding regions through in
silico PCR analysis using the ‘insilico_pcr ’ function and (ii)
using the amplicons retrieved by in silico PCR as seed sequences
for a pairwise global alignment analysis (function: ‘pga ’) to
retrieve amplicon regions with missing info on the primer-binding
regions.
CRABS utilizes cutadapt v3.5 (Martin, 2011) to accomplish the in
silico PCR analysis. Forward and reverse primer sequences that were
used to generate the sequencing library can be specified through the
‘–fwd ’ and ‘–rev ’ parameters, respectively and a
maximum number of errors in the primer-binding sites can be specified
using the ‘–error ’ parameter. Sequences for which
primer-binding regions could not be located are reverse complemented
using VSEARCH v2.21.1 (Rognes et al., 2016). Reverse complemented
sequences are subjected once more to the in silico PCR analysis,
thereby targeting sequences that were deposited on the negative strand
in online repositories.
The in silico PCR analysis will fail to retrieve amplicons when
one or both primer-binding regions are missing from the deposited
sequence. This can occur when a partial gene sequence is deposited or
when the deposited sequence has been generated using one or both primers
used in the metabarcoding analysis, as these primer-binding regions are
removed prior to online archiving. To recover these amplicon regions,
CRABS implements a pairwise global alignment (function: ‘pga ’)
using the ‘–usearch_ global’ algorithm in VSEARCH, whereby
amplicon regions retrieved through in silico PCR analysis are
used as seed sequences (parameter: ‘–database ’). The
‘pga ’ function can be restricted to only include alignments
starting or ending within the length of the forward or reverse
primer-binding region (parameter: ‘–filter_method strict ’),
to minimize the risk of including erroneous sequences. Additionally, the
similarity and query coverage thresholds within the ‘pga ’
function can be specified using the ‘–percid ’ and
‘–coverage ’ parameters, respectively.
2.4 | Module 3: database curation
Taxonomic lineages are based on the NCBI taxonomy information, which
must be downloaded first using the ‘db_download –source
taxonomy ’ function. Afterwards, taxonomic lineages can be created using
the ‘assign_tax ’ function. The taxonomic lineage consists of
seven levels by default, including: (i) kingdom; (ii) phylum; (iii)
class; (iv) order; (v) family; (vi) genus; and (vii) species. However,
users can specify any taxonomic lineage through the parameter
‘–ranks ’. From this point forward, CRABS will output a
tab-delimited file, whereby all information for each sequence record
(including the sequence) is contained on a single line.
The reference database can be dereplicated in three ways using the
‘–method ’ parameter within the ‘dereplicate ’ function,
including: (i) strict : unique sequences will be kept,
irrespective of taxonomy; (ii) single_species : a single sequence
is retained for each species in the database; and (iii)uniq_species : all unique sequences are retained for each species
in the database.
The reference database can be further curated, using the
‘seq_cleanup ’ function, by filtering sequences on (i) minimum
and maximum sequence length (consistent with amplicon length range),
(ii) number of ambiguous bases, (iii) environmental sequences, (iv)
unspecified species names, and (v) missing taxonomic information.
Lastly, the reference database can be curated using the
‘geo_cleanup ’ function, by excluding species not occurring
inside the area of interest. Recent research has provided evidence for
increased taxonomic assignment accuracy when geographical species
locations are considered (Gold et al., 2021; Murali et al., 2018). Users
can, therefore, provide a text file containing species names using the
‘–species ’ parameter and CRABS will make a subset from the
reference database by only including sequences assigned to these species
and providing information on which species are present and absent in the
database.
2.5 | Module 4: database format
Once the database is curated and cleaned, CRABS can output the database
in six different formats (function: ‘tax_format ’) to allow the
reference database to be implemented in most taxonomic assignment
software programs. Implemented formats are: (i) SINTAX, incorporated in
VSEARCH (Rognes et al., 2016) and USEARCH (Edgar, 2010, 2016); (ii) RDP,
incorporated in the RDP classifier (Qiong et al., 2007); (iii) QIIf
(flat-file format), incorporated in QIIME and QIIME2 (Bolyen et al.,
2019; Caporaso et al., 2010); (iv) DAD, incorporated in DADA2 (Callahan
et al., 2016); (v) DADS, incorporated in DADA2; and (vi) IDT,
incorporated in IDTaxa (Murali et al., 2018).
2.6 | Module 5: database exploration
Once the reference database has been curated to the user’s
specifications, CRABS can provide five outputs to provide information
about the generated reference database using the ‘visualization ’
function. The ‘diversity ’ method will generate a horizontal bar
plot displaying number of species and number of sequences for each
taxonomic group at a user-specified taxonomic level. The
‘amplicon_length ’ method will generate a line graph displaying
the amplicon length distribution for each taxonomic group at a
user-specified taxonomic level. Included in the legend are the total
number of sequences assigned to each taxonomic group. The
‘db_completeness ’ method generates a table containing the number
of closely related species that are present in the reference database
for a user-provided list of species. The ‘phylo ’ method will
generate phylogenetic trees for a user-provided list of species to
determine the taxonomic resolution of the amplicon region for each
species at a user-specified taxonomic level. Finally, the
‘primer_efficiency ’ method will generate a graph for the forward
and reverse primers, displaying the proportion of base pair occurrence
in the primer-binding region for a user-specified taxonomic group.
2.7 | Evaluation of CRABS for multiple common
metabarcoding primer sets
To test the performance of CRABS, reference databases were generated for
four widely-used primer sets in metabarcoding research
(Supplement 1), including MiFish-E/U (Miya et al., 2015),
mlCOIintF/jgHC02198 (Leray et al., 2013), Taberlet c/h (Taberlet et al.,
1991, 2007), and gITS7/ITS4 (Ihrmark et al., 2012; White et al., 1990).
CRABS-generated reference databases were benchmarked for included
diversity against databases created by ecoPCR (Ficetola et al., 2010),
MetaCurator (Richardson et al., 2020), and the QIIME RESCRIPt plugin
(Robeson et al., 2020). Tutorial workflows for each of the four software
packages were followed to create the reference databases, except for the
final database curation step, which was conducted by CRABS, to enable a
true comparison between software programs. For RESCRIPt, due to
alignment difficulties in the extraction step, the standard QIIME
‘extract_ reads’ tool was used instead. Missing functionality
within the reference database creation workflow in MetaCurator, ecoPCR,
and RESCRIPt was also covered by CRABS (Supplement 2).
Additionally, the quality of all reference databases was determined by
comparing taxonomy assignments of OTUs (Operational Taxonomic Units)
created from publicly available sequencing data: MiFish-E/U sequencing
data was used from Jeunen et al. (2020); mlCOIintF/jgHC02198 sequencing
data was used from Jeunen et al. (2019); Taberlet c/h sequencing data
was used from (Cross et al., unpubl. ); and gITS7/ITS4 sequencing
data was used from Cross et al. (2017). A detailed comparison of the
features implemented in CRABS, ecoPCR (Ficetola et al., 2010),
MetaCurator (Richardson et al., 2020), and QIIME RESCRIPt (Robeson et
al., 2020) is presented in Table 1, including database download
options, amplicon extraction methodology, database clean-up parameters,
and database output options. Bioinformatic scripts and reference
databases for each assay and software program are provided in
Supplement 2 and Supplement 3, respectively.