1 | INTRODUCTION
Investigating, classifying, and understanding Earth’s biodiversity is a fundamental component of many disciplines, including ecology, evolution, taxonomy, and paleobiology (Soulé, 1985). While essential, a thorough understanding of diversity patterns has traditionally been hard to achieve, due to the complexity of biological systems (Chave, 2013; Kovalenko et al., 2012).
In the last two decades, the advent of high-throughput DNA sequencing has enabled researchers to understand biodiversity at an unprecedented scope (Hagelberg et al., 2015; Hugenholtz & Tyson, 2008). Untargeted (i.e., metagenomic) sequencing approaches, such as Illumina shotgun sequencing, take full advantage of the gigabytes of data generated in a single sequencing run to uncover the complete complexity of biodiversity present within a sample (Bovo et al., 2020; Cowart et al., 2018; Key et al., 2017). However, given the lack of reference genomes (Scott et al., 2021) and the immense barcoding effort already undertaken (Hebert & Gregory, 2005), targeted sequencing strategies (i.e., metabarcoding), which focus on one or several gene regions through PCR amplification (Jeunen et al., 2019; Seersholm et al., 2018) or capture-enrichment (Ávila-Arcos et al., 2011; Seeber et al., 2019), are frequently used to increase the percentage of reads that can be taxonomically assigned by enriching the sequencing library for barcoding gene regions (Stat et al., 2017). Additionally, because the library is enriched for a select few gene regions, metabarcoding is a more cost-effective alternative to metagenomic sequencing, reducing the required sequencing depth per sample as well as the computational time and effort (Stat et al., 2017; Taberlet et al., 2012).
The popularity of metabarcoding analyses has led to the development of multiple tools aiming to improve the taxonomic assignment accuracy of the biological community present in samples. Most taxonomy assignment programs can be grouped into four categories (Hleap et al., 2021): sequence similarity methods (BLAST [Altschul et al., 1990] and Kraken2 [Wood & Salzberg, 2014]), sequence composition methods (RDP [Qiong et al., 2007] and IDTaxa [Murali et al., 2018]), phylogenetic methods (EPA [Barbera et al., 2019] and pplacer [Matsen et al., 2010]), and probabilistic methods (Protax [Somervuo et al., 2016]).
Comparative studies have revealed that, irrespective of the taxonomic assignment method used, comprehensive, curated, and well-annotated reference databases are critical for accurate taxonomy assignment (Gold et al., 2021; Hleap et al., 2021; Leray et al., 2022). The early adoption of metabarcoding in microbial research, as well as a focus on the 16S rRNA gene for bacterial species identification (Johnson et al., 2019), has led to the creation of curated reference databases used to assign taxonomy in the majority of microbiome studies, e.g., RDP (Qiong et al., 2007). Metabarcoding research exploring eukaryotic diversity, on the other hand, employs a wide variety of primer sets targeting a broad range of gene regions (Zhang et al., 2020), including cytochrome c oxidase subunit I (COI), 16S ribosomal RNA (16S rRNA), 18S ribosomal RNA (18S rRNA), and nuclear ribosomal internal transcribed spacer regions (nrITS). Hence, the majority of eukaryotic metabarcoding research utilizes global databases to assign taxonomy, such as NCBI (Johnson et al., 2008) and EMBL (Kanz et al., 2005). However, the lack of curation in global databases allows for the entry of sequences with missing species ID (environmental studies), erroneous identification, and duplication, factors which contribute to a reduced accuracy of taxonomic assignment algorithms (Gold et al., 2021; Hleap et al., 2021; Leray et al., 2018).
Multiple curated eukaryotic reference databases, as well as pipelines to generate them, have therefore been published in recent years, including BOLD (Ratnasingham & Hebert, 2007), UNITE (Kõljalg et al., 2005), PLANiTS (Banchi et al., 2020), MitoFish (Iwasaki et al., 2013), MARES (Arranz et al., 2020), Meta-Fish-Lib (Collins et al., 2021), and MIDORI2 (Leray et al., 2022). The lack of built-in flexibility in these databases and the large number of reference databases needed to cover all target gene regions and taxonomic groups, however, favour the development and use of software programs able to generate curated reference databases that are customized through user-specified parameters. Existing software packages are able to extract amplicon regions through in silico PCR (ecoPCR/OBITools; hereafter named “ecoPCR”; Boyer et al., 2016; Ficetola et al., 2010), local alignments (RESCRIPt; Robeson et al., 2020), and profile hidden Markov models (MetaCurator; Richardson et al., 2020). An easy-to-use software program able to complete the full reference database creation workflow from start to finish on a personal computer with true flexibility, limited storage requirements, and fast results would further aid in increasing taxonomic assignment accuracy for user-specific experimental designs.
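To illustrate what amplicon extraction through in silico PCR involves, the following minimal sketch locates a forward primer and the reverse complement of a reverse primer in a reference sequence and returns the region between them. It is not the implementation used by ecoPCR or any other tool named above: the function names and toy sequences are hypothetical, and the sketch assumes exact primer matches on a single strand, whereas practical tools typically tolerate mismatches and degenerate bases.

```python
# Illustrative in silico PCR on toy data (hypothetical names and sequences).
# Assumes exact, non-degenerate primer matches on the forward strand only.
from typing import Optional

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an unambiguous DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def in_silico_pcr(reference: str, forward: str, reverse: str) -> Optional[str]:
    """Return the region between the two primer sites, or None if either is absent."""
    fwd_pos = reference.find(forward)
    if fwd_pos == -1:
        return None
    insert_start = fwd_pos + len(forward)
    # The reverse primer anneals to the opposite strand, so on the forward
    # strand we search for its reverse complement downstream of the forward primer.
    rev_pos = reference.find(reverse_complement(reverse), insert_start)
    if rev_pos == -1:
        return None
    return reference[insert_start:rev_pos]


if __name__ == "__main__":
    toy_reference = "TTTACGTACGGGCCCAAAGGTCAATT"
    print(in_silico_pcr(toy_reference, "ACGTAC", "TTGACC"))
```

In this sketch the primer-binding sites are trimmed from the returned amplicon, and sequences lacking either primer site are skipped; both are design choices that a full workflow may handle differently.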
Here, we introduce CRABS (Creating Reference databases for Amplicon-Based Sequencing), a software package to generate curated reference databases and assess the incorporated diversity and taxonomic resolution. To determine the flexibility and efficiency of CRABS, we compare reference databases generated by CRABS to those generated by ecoPCR, MetaCurator, and RESCRIPt for four widely used primer sets in metabarcoding research: MiFish-E/U (Chondrichthyes/Actinopterygii; Miya et al., 2015), mlCOIintF/jgHC02198 (Eukaryota; Leray et al., 2013), Taberlet c/h (Plantae; Taberlet et al., 1991, 2007), and gITS7/ITS4 (Fungi; Ihrmark et al., 2012; White et al., 1990). Additionally, we assess the quality of the generated reference databases through taxonomic assignment of published sequencing data. We show that the reference databases generated by CRABS are equivalent to or outperform those generated by available tools based on incorporated diversity. Additionally, CRABS is feature-rich, highly versatile in its implementation, and requires relatively limited computational resources.