2 | METHODS
2.1 | CRABS workflow
We present CRABS as a versatile software program to generate curated reference databases for metagenomic analyses. The CRABS workflow consists of five modules (Figure 1), including: (i)sequence retrieval : download and import data from online repositories into CRABS; (ii) amplicon extraction : extract amplicon regions through in silico PCR analysis and pairwise global alignments; (iii) database curation : clean up the database through dereplication, sequence parameters, and metadata information; (iv) database format : export the database into various formats to use CRABS-generated reference databases into most taxonomic assignment software packages; and (v) database exploration : generate a summary overview of the final reference database through multiple post-processing functions and visualizations. A brief explanation for each of the five modules is provided below. Additional information for each function with example code can be found on GitHub (https://github.com/gjeunen/reference_database_creator).
2.2 | Module 1: sequence retrieval
CRABS supports the download of sequencing data from four major online repositories using the ‘db_download ’ function, including Barcode of Life Data System (BOLD), National Center for Biotechnology Information (NCBI), European Molecular Biology Laboratory (EMBL), and Mitochondrial Genome Database of Fish (MitoFish). Upon downloading, sequencing data will be processed to CRABS format, a simple two-line fasta format with NCBI accession numbers as header information. When accession numbers are not available, CRABS will generate unique sequence IDs using the following format: ‘CRABS_ [num]:species_name ’. Sequence data can be merged using the ‘db_merge ’ function when data from multiple sources are downloaded. Additionally, in-house generated data, in fasta format, can be imported into CRABS format using the ‘db_import ’ function.
2.3 | Module 2: amplicon extraction
Once sequences have been downloaded and imported into CRABS, the amplicon region can be extracted. For amplicon-based sequencing, we recommend a two-step approach, whereby (i) amplicons are found by locating the forward and reverse primer-binding regions through in silico PCR analysis using the ‘insilico_pcr ’ function and (ii) using the amplicons retrieved by in silico PCR as seed sequences for a pairwise global alignment analysis (function: ‘pga ’) to retrieve amplicon regions with missing info on the primer-binding regions.
CRABS utilizes cutadapt v3.5 (Martin, 2011) to accomplish the in silico PCR analysis. Forward and reverse primer sequences that were used to generate the sequencing library can be specified through the ‘–fwd ’ and ‘–rev ’ parameters, respectively and a maximum number of errors in the primer-binding sites can be specified using the ‘–error ’ parameter. Sequences for which primer-binding regions could not be located are reverse complemented using VSEARCH v2.21.1 (Rognes et al., 2016). Reverse complemented sequences are subjected once more to the in silico PCR analysis, thereby targeting sequences that were deposited on the negative strand in online repositories.
The in silico PCR analysis will fail to retrieve amplicons when one or both primer-binding regions are missing from the deposited sequence. This can occur when a partial gene sequence is deposited or when the deposited sequence has been generated using one or both primers used in the metabarcoding analysis, as these primer-binding regions are removed prior to online archiving. To recover these amplicon regions, CRABS implements a pairwise global alignment (function: ‘pga ’) using the ‘–usearch_ global’ algorithm in VSEARCH, whereby amplicon regions retrieved through in silico PCR analysis are used as seed sequences (parameter: ‘–database ’). The ‘pga ’ function can be restricted to only include alignments starting or ending within the length of the forward or reverse primer-binding region (parameter: ‘–filter_method strict ’), to minimize the risk of including erroneous sequences. Additionally, the similarity and query coverage thresholds within the ‘pga ’ function can be specified using the ‘–percid ’ and ‘–coverage ’ parameters, respectively.
2.4 | Module 3: database curation
Taxonomic lineages are based on the NCBI taxonomy information, which must be downloaded first using the ‘db_download –source taxonomy ’ function. Afterwards, taxonomic lineages can be created using the ‘assign_tax ’ function. The taxonomic lineage consists of seven levels by default, including: (i) kingdom; (ii) phylum; (iii) class; (iv) order; (v) family; (vi) genus; and (vii) species. However, users can specify any taxonomic lineage through the parameter ‘–ranks ’. From this point forward, CRABS will output a tab-delimited file, whereby all information for each sequence record (including the sequence) is contained on a single line.
The reference database can be dereplicated in three ways using the ‘–method ’ parameter within the ‘dereplicate ’ function, including: (i) strict : unique sequences will be kept, irrespective of taxonomy; (ii) single_species : a single sequence is retained for each species in the database; and (iii)uniq_species : all unique sequences are retained for each species in the database.
The reference database can be further curated, using the ‘seq_cleanup ’ function, by filtering sequences on (i) minimum and maximum sequence length (consistent with amplicon length range), (ii) number of ambiguous bases, (iii) environmental sequences, (iv) unspecified species names, and (v) missing taxonomic information.
Lastly, the reference database can be curated using the ‘geo_cleanup ’ function, by excluding species not occurring inside the area of interest. Recent research has provided evidence for increased taxonomic assignment accuracy when geographical species locations are considered (Gold et al., 2021; Murali et al., 2018). Users can, therefore, provide a text file containing species names using the ‘–species ’ parameter and CRABS will make a subset from the reference database by only including sequences assigned to these species and providing information on which species are present and absent in the database.
2.5 | Module 4: database format
Once the database is curated and cleaned, CRABS can output the database in six different formats (function: ‘tax_format ’) to allow the reference database to be implemented in most taxonomic assignment software programs. Implemented formats are: (i) SINTAX, incorporated in VSEARCH (Rognes et al., 2016) and USEARCH (Edgar, 2010, 2016); (ii) RDP, incorporated in the RDP classifier (Qiong et al., 2007); (iii) QIIf (flat-file format), incorporated in QIIME and QIIME2 (Bolyen et al., 2019; Caporaso et al., 2010); (iv) DAD, incorporated in DADA2 (Callahan et al., 2016); (v) DADS, incorporated in DADA2; and (vi) IDT, incorporated in IDTaxa (Murali et al., 2018).
2.6 | Module 5: database exploration
Once the reference database has been curated to the user’s specifications, CRABS can provide five outputs to provide information about the generated reference database using the ‘visualization ’ function. The ‘diversity ’ method will generate a horizontal bar plot displaying number of species and number of sequences for each taxonomic group at a user-specified taxonomic level. The ‘amplicon_length ’ method will generate a line graph displaying the amplicon length distribution for each taxonomic group at a user-specified taxonomic level. Included in the legend are the total number of sequences assigned to each taxonomic group. The ‘db_completeness ’ method generates a table containing the number of closely related species that are present in the reference database for a user-provided list of species. The ‘phylo ’ method will generate phylogenetic trees for a user-provided list of species to determine the taxonomic resolution of the amplicon region for each species at a user-specified taxonomic level. Finally, the ‘primer_efficiency ’ method will generate a graph for the forward and reverse primers, displaying the proportion of base pair occurrence in the primer-binding region for a user-specified taxonomic group.
2.7 | Evaluation of CRABS for multiple common metabarcoding primer sets
To test the performance of CRABS, reference databases were generated for four widely-used primer sets in metabarcoding research (Supplement 1), including MiFish-E/U (Miya et al., 2015), mlCOIintF/jgHC02198 (Leray et al., 2013), Taberlet c/h (Taberlet et al., 1991, 2007), and gITS7/ITS4 (Ihrmark et al., 2012; White et al., 1990). CRABS-generated reference databases were benchmarked for included diversity against databases created by ecoPCR (Ficetola et al., 2010), MetaCurator (Richardson et al., 2020), and the QIIME RESCRIPt plugin (Robeson et al., 2020). Tutorial workflows for each of the four software packages were followed to create the reference databases, except for the final database curation step, which was conducted by CRABS, to enable a true comparison between software programs. For RESCRIPt, due to alignment difficulties in the extraction step, the standard QIIME ‘extract_ reads’ tool was used instead. Missing functionality within the reference database creation workflow in MetaCurator, ecoPCR, and RESCRIPt was also covered by CRABS (Supplement 2). Additionally, the quality of all reference databases was determined by comparing taxonomy assignments of OTUs (Operational Taxonomic Units) created from publicly available sequencing data: MiFish-E/U sequencing data was used from Jeunen et al. (2020); mlCOIintF/jgHC02198 sequencing data was used from Jeunen et al. (2019); Taberlet c/h sequencing data was used from (Cross et al., unpubl. ); and gITS7/ITS4 sequencing data was used from Cross et al. (2017). A detailed comparison of the features implemented in CRABS, ecoPCR (Ficetola et al., 2010), MetaCurator (Richardson et al., 2020), and QIIME RESCRIPt (Robeson et al., 2020) is presented in Table 1, including database download options, amplicon extraction methodology, database clean-up parameters, and database output options. Bioinformatic scripts and reference databases for each assay and software program are provided in Supplement 2 and Supplement 3, respectively.