1 | INTRODUCTION
Investigating, classifying, and understanding Earth’s biodiversity is a
fundamental component of many disciplines, including ecology, evolution,
taxonomy, and paleobiology (Soulé, 1985). While essential, a thorough
understanding of diversity patterns has traditionally been hard to
achieve, due to the complexity of biological systems (Chave, 2013;
Kovalenko et al., 2012).
In the last two decades, the advent of high-throughput DNA sequencing
has enabled researchers to understand biodiversity at an unprecedented
scope (Hagelberg et al., 2015; Hugenholtz & Tyson, 2008). Untargeted
(i.e., metagenomic) sequencing approaches, such as Illumina shotgun
sequencing, take full advantage of the gigabytes of data generated in a
single sequencing run to uncover the complete complexity of biodiversity
present within a sample (Bovo et al., 2020; Cowart et al., 2018; Key et
al., 2017). However, given the lack of reference genomes (Scott et al.,
2021) and the immense barcoding effort already undertaken (Hebert &
Gregory, 2005), targeted sequencing strategies (i.e., metabarcoding),
which enrich the sequencing library for one or several barcoding gene
regions through PCR amplification (Jeunen et al., 2019; Seersholm et al.,
2018) or capture-enrichment (Ávila-Arcos et al., 2011; Seeber et al.,
2019), are frequently used to increase the percentage of reads that can
be taxonomically assigned (Stat et al., 2017).
Additionally, because the library is enriched for a select few gene
regions, metabarcoding is more cost-effective than metagenomic
sequencing: it reduces the required sequencing depth per sample, as well
as the computational time and effort (Stat et al., 2017; Taberlet et al.,
2012).
The popularity of metabarcoding analyses has led to the development of
multiple tools aiming to improve the accuracy of taxonomic assignment for
the biological community present in samples. Most taxonomy assignment
programs fall into one of four categories (Hleap et al., 2021): sequence
similarity methods (BLAST [Altschul et al., 1990] and Kraken2 [Wood &
Salzberg, 2014]), sequence composition methods (RDP [Qiong et al., 2007]
and IDTaxa [Murali et al., 2018]), phylogenetic methods (EPA [Barbera et
al., 2019] and pplacer [Matsen et al., 2010]), and probabilistic methods
(Protax [Somervuo et al., 2016]).
Comparative studies have revealed that, irrespective of the taxonomic
assignment method used, comprehensive, curated, and well-annotated
reference databases are critical for accurate taxonomy assignment (Gold
et al., 2021; Hleap et al., 2021; Leray et al., 2022). The early
adoption of metabarcoding in microbial research, as well as a focus on
the 16S rRNA gene for bacterial species identification (Johnson et al.,
2019), has led to the creation of curated reference databases that are
used to assign taxonomy in the majority of microbiome studies, e.g., RDP
(Qiong et al., 2007). Metabarcoding research exploring eukaryotic diversity, on
the other hand, employs a wide variety of primer sets targeting a broad
range of gene regions (Zhang et al., 2020), including cytochrome c
oxidase subunit I (COI), 16S ribosomal RNA (16S rRNA), 18S
ribosomal RNA (18S rRNA), and nuclear ribosomal internal transcribed
spacer regions (nrITS). Hence, the majority of eukaryotic metabarcoding
research utilizes global databases to assign taxonomy, such as NCBI
(Johnson et al., 2008) and EMBL (Kanz et al., 2005). However, the lack
of curation in global databases allows for the entry of sequences with
missing species IDs (e.g., from environmental studies), erroneous
identifications, and duplicates, factors that reduce the accuracy of
taxonomic assignment algorithms (Gold et al., 2021; Hleap et al., 2021;
Leray et al., 2018).
Multiple curated eukaryotic reference databases, as well as pipelines to
generate them, have, therefore, been published in recent years,
including BOLD (Ratnasingham & Hebert, 2007), UNITE (Kõljalg et al.,
2005), PLANiTS (Banchi et al., 2020), MitoFish (Iwasaki et al., 2013),
MARES (Arranz et al., 2020), Meta-Fish-Lib (Collins et al., 2021), and
MIDORI2 (Leray et al., 2022). However, the lack of built-in flexibility
in these databases, and the large number of them needed to cover all
target gene regions and taxonomic groups, favour the development and use
of software programs able to generate curated reference databases that
are customized through user-specified parameters. Existing software packages
are able to extract amplicon regions through in silico PCR
(ecoPCR/OBITools; hereafter named “ecoPCR”; Boyer et al., 2016; Ficetola
et al., 2010), local alignments (RESCRIPt; Robeson et al., 2020), and
profile hidden Markov models (MetaCurator; Richardson et al., 2020). An
easy-to-use software program able to complete the full reference
database creation workflow from start to finish on a personal computer
with true flexibility, limited storage requirements, and fast results
would further aid in increasing taxonomic assignment accuracy for
user-specific experimental designs.
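
To make the in silico PCR step mentioned above concrete, the following is a minimal sketch of the general idea: locate the forward primer and the reverse complement of the reverse primer in a reference sequence, allowing a fixed number of mismatches, and keep the region between them. It is an illustration only and not the implementation used by ecoPCR or any other tool named here; the function names, the `max_mismatch` parameter, the brute-force scan, and the toy sequence are all assumptions introduced for this example.

```python
# Illustrative in silico PCR sketch. All names here are hypothetical and
# the brute-force scan is a stand-in for the optimized pattern matching
# that dedicated tools use.

# IUPAC ambiguity codes: each primer symbol maps to the bases it accepts.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase IUPAC DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_sites(primer, seq, max_mismatch=0):
    """Return start positions where the primer binds with <= max_mismatch mismatches."""
    hits = []
    for i in range(len(seq) - len(primer) + 1):
        window = seq[i:i + len(primer)]
        mismatches = sum(base not in IUPAC[p] for p, base in zip(primer, window))
        if mismatches <= max_mismatch:
            hits.append(i)
    return hits


def in_silico_pcr(seq, forward, reverse, max_mismatch=0):
    """Return the first region flanked by both primers (primers excluded), or None."""
    seq, forward, reverse = seq.upper(), forward.upper(), reverse.upper()
    reverse_sites = primer_sites(reverse_complement(reverse), seq, max_mismatch)
    for f in primer_sites(forward, seq, max_mismatch):
        start = f + len(forward)
        for r in reverse_sites:
            if r >= start:
                return seq[start:r]
    return None


# Toy reference sequence and primers, invented for this example.
reference = "TTTACGTAGGGCCCAAATTCGATT"
amplicon = in_silico_pcr(reference, forward="ACGTR", reverse="TCGAA")
```

Under these assumptions, a reference sequence for which `in_silico_pcr` returns `None` would be excluded from the amplicon database, and raising `max_mismatch` trades specificity for the recovery of sequences with imperfect primer-binding sites.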
Here, we introduce CRABS (Creating Reference databases for
Amplicon-Based Sequencing), a software package to generate curated
reference databases and assess the incorporated diversity and taxonomic
resolution. To determine the flexibility and efficiency of CRABS, we
compare reference databases generated by CRABS to those generated by
ecoPCR, MetaCurator, and RESCRIPt for four widely used primer sets in
metabarcoding research:
MiFish-E/U (Chondrichthyes/Actinopterygii; Miya et al., 2015),
mlCOIintF/jgHC02198 (Eukaryota; Leray et al., 2013), Taberlet c/h
(Plantae; Taberlet et al., 1991, 2007), and gITS7/ITS4 (Fungi; Ihrmark
et al., 2012; White et al., 1990). Additionally, we assess the quality
of the generated reference databases through taxonomic assignment of
published sequencing data. We show that the reference databases
generated by CRABS are equivalent to, or outperform, those generated by
available tools in terms of incorporated diversity. Additionally, CRABS
is feature-rich, highly
versatile in its implementation, and requires relatively limited
computational resources.