4 | DISCUSSION
The necessity to generate and curate reference databases to increase taxonomic assignment accuracy and resolution in eukaryotic metabarcoding research has recently come to light (Gold et al., 2021; Hleap et al., 2021). Since metabarcoding research targets a broad range of taxonomic groups and gene regions through a vast number of primer sets, flexibility is required from software packages to suit user-specific needs. Here, we present CRABS, an easy-to-use software program with a full suite of features to generate, curate, and explore reference databases.
4.1 | Sequence retrieval
The increased diversity contained within CRABS-generated reference databases compared to the other software programs tested can partially be explained by CRABS’ ability to access multiple online sequencing repositories, including BOLD, EMBL, NCBI, and MitoFish. As sequence data only partially overlaps between online repositories (Arranz et al., 2020; Meiklejohn et al., 2019; Porter & Hajibabaei, 2018), CRABS facilitates the generation of reference databases using the largest proportion of available sequences, thereby increasing the diversity included in the final curated reference database. MetaCurator was the second-best performing software package in our comparison, but does not incorporate a function to download sequencing data. By using CRABS to download sequencing data from multiple online repositories in the MetaCurator pipeline, we could have influenced the output diversity achieved from that program.
While CRABS can access multiple online repositories, downloaded file sizes and time requirements are kept to a minimum by solely downloading the gene region or taxonomic group of interest using the ‘db_download --query’ parameter. Additionally, sequence length restrictions can be specified to exclude genome sequences, further reducing file sizes and speeding up the process. EcoPCR, on the other hand, recommends the download of the entire EMBL database, taking up >2 TB of storage and a significant amount of time. While time- and size-inefficient, ecoPCR-generated reference databases included several species that were missed by CRABS due to the initial sequence length exceeding the restriction parameter. Another benefit of utilizing the full online repository is the identification of issues around co-amplification from unintended taxonomic groups (Banos et al., 2018). For example, ecoPCR identified the co-amplification of plants for the gITS7/ITS4 (fungal) primer set, taxa not incorporated in the CRABS reference database for which initial sequencing data was restricted to fungal ITS sequences. To avoid the need to download complete online repositories, we recommend using primer-specificity testing software, such as Primer-BLAST (Ye et al., 2012), to determine which taxonomic groups need to be included in the initial sequence download by CRABS.
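The effect of a sequence length restriction can be illustrated with a minimal Python sketch. It is not the CRABS implementation; the function names, the toy records, and the length thresholds are assumptions chosen for illustration only.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header, sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records: Iterable[Record], min_len: int, max_len: int) -> List[Record]:
    """Keep only records whose sequence length lies within [min_len, max_len]."""
    return [(h, s) for h, s in records if min_len <= len(s) <= max_len]


# Toy input (hypothetical): a fragment, a barcode-sized record, and a long record
# standing in for a genome-scale sequence.
fasta_text = ">short\nACGT\n>target\nACGTACGTACGT\n>genome\n" + "A" * 50
records = list(parse_fasta(fasta_text.splitlines()))
kept = filter_by_length(records, min_len=8, max_len=20)
print(kept)
```

The long record is discarded even though it could contain the target region, which mirrors the trade-off described above: a length restriction reduces download size and runtime, but species represented only by long sequences are lost.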
4.2 | Amplicon extraction
The extraction of the amplicon region from sequences deposited in online repositories is a crucial step in the creation of curated reference databases. While the different methodologies implemented in software packages can be effective, CRABS’ combined implementation of in silico PCR analysis and pairwise global alignments resulted in the most complete reference databases for each of the four primer sets. In particular, using amplicon regions extracted from the in silico PCR analysis as seed sequences for pairwise global alignments substantially increased the diversity included in the final reference database, thereby outperforming an “in silico PCR-only” approach. The proportion of additional barcodes retrieved by the pairwise global alignment step will be heavily influenced by the chosen primer set, with lower success for metabarcoding primers located within the traditional barcoding region. Caution is warranted when using relaxed parameter settings in the ‘pga’ function, as they may increase the inclusion of false-positive hits in the reference database. We therefore recommend the use of ‘pga --filter_method strict’ to reduce the chance of including erroneous sequences.
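The in silico PCR step can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the method used by CRABS: it requires exact matches to the primers (no mismatches allowed), expands IUPAC degenerate bases into regular-expression character classes, and the primers and template shown are invented toy sequences. The pairwise global alignment step is not covered.

```python
import re
from typing import Optional

# IUPAC nucleotide codes expanded to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement table that also covers the degenerate codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a (possibly degenerate) sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_to_regex(primer: str) -> str:
    """Translate a primer into a regex that matches every base it encodes."""
    return "".join(IUPAC[base] for base in primer.upper())


def extract_amplicon(sequence: str, fwd: str, rev: str) -> Optional[str]:
    """Return the region between both primer-binding sites, or None if absent.

    The reverse primer is given 5'-3' and is reverse-complemented so that it
    can be located on the same strand as the forward primer.
    """
    pattern = re.compile(
        primer_to_regex(fwd) + "(.+?)" + primer_to_regex(reverse_complement(rev))
    )
    match = pattern.search(sequence.upper())
    return match.group(1) if match else None


# Toy example (hypothetical primers and template).
template = "GGACGATTTTCCCCGGAATT"
amplicon = extract_amplicon(template, fwd="ACGR", rev="TTYC")
print(amplicon)
```

Sequences lacking one or both primer-binding sites return no amplicon in this sketch; these are the records for which a subsequent alignment against already extracted amplicons can recover the barcode region.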
MetaCurator, which employs profile hidden Markov models to extract amplicon regions, is the only software package in our test that does not take into account any information about the primer-binding regions. Instead, MetaCurator imports up to 10 user-provided seed sequences that are trimmed to the exact marker of interest. Without information about primer-binding regions, MetaCurator often failed to recover several base pairs at the beginning and end of the target amplicon region. While no effect on taxonomy assignment was observed in our comparison, further studies are required to determine the impact of partial reference sequences on taxonomy assignment. Additionally, the amplicon extraction step implemented in MetaCurator came at a great computational cost, taking hundreds more CPU hours than any other method. Furthermore, MetaCurator was unable to handle the high number of sequences included in the mlCOIintF/jgHC02198 and gITS7/ITS4 database trials. Once the number of query sequences reached approximately half a million, MetaCurator could not run to completion (the program did not finish within the two-week runtime of our trial). To circumvent the issue, larger files were split into subsets, which were run separately, after which the results were combined, further adding to the computational cost and potentially introducing other issues associated with running subsets separately.
Even greater computational problems were encountered for the QIIME RESCRIPt plug-in. When following the authors’ guidelines to create a database using RESCRIPt, we were unable to proceed past the alignment step. Due to extreme sequence divergence across taxonomic groups and/or excessive indels, alignments around a seed group of taxa contained an average gap percentage of >70%. From these poor alignments – even at the genus level – no amplicons were retained during the amplicon extraction step in the pipeline. In light of this, we opted to use the in silico PCR step in the standard QIIME toolkit (QIIME extract_reads), rather than relying on the local alignment method implemented in RESCRIPt. While successful for the MiFish-E/U and Taberlet c/h primer sets, no results could be obtained for the mlCOIintF/jgHC02198 and gITS7/ITS4 primer sets due to the presence of degenerate bases in the primer sequences.
4.3 | Database curation
Comprehensive and well-annotated reference databases are of crucial importance to increase taxonomy assignment accuracy (Hleap et al., 2021). While database curation parameters between software packages were not compared in our experiment, CRABS boasts the most complete set of features and is the only software program able to take into account geographical species locations, a parameter shown to increase taxonomic assignment accuracy (Gold et al., 2021; Murali et al., 2018). Additionally, the efficiency of the taxonomy-dependent dereplication function differed greatly between software packages, with CRABS and RESCRIPt handling large datasets within seconds, while MetaCurator failed to handle datasets larger than ~500,000 sequences.
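One common interpretation of taxonomy-dependent dereplication is to retain a single copy of each unique sequence per taxon, so that identical sequences shared between different taxa are all kept. The Python sketch below illustrates this idea under that assumption; it is not taken from CRABS, RESCRIPt, or MetaCurator, and the record layout and taxon labels are hypothetical.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, str, str]  # (sequence ID, taxon label, sequence)


def dereplicate_by_taxon(records: Iterable[Record]) -> List[Record]:
    """Keep the first occurrence of each (taxon, sequence) combination."""
    seen = set()
    kept = []
    for seq_id, taxon, seq in records:
        key = (taxon, seq.upper())
        if key not in seen:  # set lookup keeps the pass linear in input size
            seen.add(key)
            kept.append((seq_id, taxon, seq))
    return kept


# Toy records (hypothetical IDs and taxon labels).
toy_records = [
    ("id1", "Species A", "ACGTAC"),
    ("id2", "Species A", "ACGTAC"),  # duplicate within the same taxon: dropped
    ("id3", "Species B", "ACGTAC"),  # same sequence, different taxon: kept
    ("id4", "Species A", "ACGTAT"),  # different sequence, same taxon: kept
]
unique_records = dereplicate_by_taxon(toy_records)
print([seq_id for seq_id, _, _ in unique_records])
```

Because each record is hashed once, a single pass of this kind scales linearly with the number of sequences.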
4.4 | Database exploration
Alongside the functionality to generate curated reference databases, CRABS facilitates the exploration of reference databases using multiple visualizations, thereby providing essential information on how to interpret taxonomy assignment results. For example, machine learning classifiers, such as SINTAX (Edgar, 2016) and RDP (Qiong et al., 2007), are known to overclassify sequences in situations where the correct label lies outside the scope of the reference database (Dave, 1991; Murali et al., 2018). Hence, it is crucial to determine the completeness of the reference database for missing barcodes of closely related species, implemented in the CRABS ‘--method db_completeness’ visualization. It should be noted, however, that the ‘--method db_completeness’ visualization should be interpreted as a guide only, as the function is built around the NCBI taxonomy database (Federhen, 2012), known to exhibit errors (Schoch et al., 2020). Therefore, consulting the primary literature remains essential.
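The idea behind a completeness check can be sketched as follows: for a species of interest, count how many congeners listed in a taxonomy have a barcode in the reference database. The Python sketch below is illustrative only and does not reproduce the CRABS visualization; the function name, the use of plain sets, and the placeholder species names are assumptions.

```python
from typing import Set, Tuple


def genus_completeness(
    target_species: str, taxonomy_species: Set[str], barcoded_species: Set[str]
) -> Tuple[int, int, float]:
    """Return (congeners with a barcode, congeners in the taxonomy, percentage).

    Species names are assumed to be binomials, with the genus as first word.
    """
    genus = target_species.split()[0]
    congeners = sorted(s for s in taxonomy_species if s.split()[0] == genus)
    present = [s for s in congeners if s in barcoded_species]
    pct = 100 * len(present) / len(congeners) if congeners else 0.0
    return len(present), len(congeners), pct


# Placeholder names (hypothetical), standing in for a taxonomy listing
# and for the set of species represented in a reference database.
taxonomy = {"Genusa alpha", "Genusa beta", "Genusa gamma", "Genusb delta"}
barcoded = {"Genusa alpha", "Genusa gamma"}
n_present, n_total, pct = genus_completeness("Genusa alpha", taxonomy, barcoded)
print(f"{n_present}/{n_total} congeners barcoded ({pct:.1f}%)")
```

A low percentage flags taxa for which an assignment to species level should be treated with caution, since an unsampled congener could share the same barcode. The result is only as reliable as the taxonomy listing it is computed from.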
Metabarcoding analyses attempt to classify sequences to species-level resolution, based on the variations present in the amplicon region. However, these partial gene sequences do not always permit species-level resolution and the variation might not be consistent between taxonomic groups, as has repeatedly been observed (Porter & Hajibabaei, 2020). The generation of amplicon-based phylogenetic trees, implemented in the CRABS ‘--method phylo’ visualization, can provide guidance about the resolution of the amplicon region for specific taxonomic groups, thereby aiding in the assignment of taxonomy at the correct resolution.
With multiple primer sets available for specific taxonomic groups targeting various gene regions (Zhang et al., 2020), CRABS visualizations could aid in determining the optimal primer set for a specific experimental design. For example, ‘--method db_completeness’ might provide information about which gene region contains the largest number of barcodes for taxa of interest, while the ‘--method diversity’ visualization gives insight into issues surrounding unintended co-amplification. The ‘--method amplicon_length’ visualization, on the other hand, could determine which primer sets can be multiplexed on an Illumina sequencing run, and the ‘--method phylo’ visualization can show which amplicon obtains the highest taxonomic resolution. Finally, the ‘--method primer_efficiency’ visualization shows which primer set contains the fewest mismatches for target taxa or provides information on how to optimize the primer sequences for the taxa of interest.
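Comparing primer sets by their number of mismatches to target taxa reduces to counting, position by position, the bases in a primer-binding site that the primer does not encode. The Python sketch below shows this calculation for degenerate primers; it is a simplified illustration rather than the CRABS implementation, and the primer and binding-site sequences are invented.

```python
# Bases encoded by each IUPAC nucleotide code.
IUPAC_SETS = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"}, "W": {"A", "T"},
    "K": {"G", "T"}, "M": {"A", "C"}, "B": {"C", "G", "T"},
    "D": {"A", "G", "T"}, "H": {"A", "C", "T"}, "V": {"A", "C", "G"},
    "N": {"A", "C", "G", "T"},
}


def count_mismatches(primer: str, binding_site: str) -> int:
    """Count positions where the binding-site base is not encoded by the primer.

    Both sequences must be on the same strand and of equal length. Ambiguous
    bases in the binding site are counted as mismatches in this sketch.
    """
    if len(primer) != len(binding_site):
        raise ValueError("primer and binding site must have the same length")
    return sum(
        1
        for p, s in zip(primer.upper(), binding_site.upper())
        if s not in IUPAC_SETS[p]
    )


# Toy binding sites (hypothetical) compared against a degenerate primer.
primer = "ACGR"
sites = {"taxon_1": "ACGA", "taxon_2": "ACGT", "taxon_3": "TCGC"}
mismatches = {taxon: count_mismatches(primer, site) for taxon, site in sites.items()}
print(mismatches)
```

Positions that mismatch across many target taxa are candidates for introducing a degenerate base when optimizing a primer.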
4.5 | Implemented features within CRABS
As shown in Table 1, CRABS is equally feature-rich or richer in certain feature categories, compared with the other three software packages. In particular, CRABS’ support for downloading sequencing data of interest from multiple online repositories is distinctive. Additionally, the ability to use extracted amplicons from the in silico PCR analysis as a database for pairwise global alignment analysis enables the retrieval of a larger portion of amplicon regions. Furthermore, CRABS incorporates the most comprehensive set of database curation parameters, as well as database export formats, thereby allowing reference databases to be used immediately in taxonomy assignment software packages. The implemented CRABS visualizations also allow for a thorough investigation of the reference database to aid taxonomy assignment of sequencing data. Finally, easy installation of the conda package, simple parameter settings, and a fully documented step-by-step workflow for reference database curation render CRABS user- and analysis-friendly.