4 | DISCUSSION
The necessity to generate and curate reference databases to increase
taxonomic assignment accuracy and resolution in eukaryotic metabarcoding
research has recently come to light (Gold et al., 2021; Hleap et al.,
2021). Since metabarcoding research targets a broad range of taxonomic
groups and gene regions through a vast number of primer sets,
flexibility is required from software packages to suit user-specific
needs. Here, we present CRABS, an easy-to-use software program with a
full suite of features to generate, curate, and explore reference
databases.
4.1 | Sequence retrieval
The increased diversity contained within CRABS-generated reference
databases compared to the other software programs tested can partially
be explained by CRABS’ ability to access multiple online sequencing
repositories, including BOLD, EMBL, NCBI, and MitoFish. As sequence data
only partially overlaps between online repositories (Arranz et al.,
2020; Meiklejohn et al., 2019; Porter & Hajibabaei, 2018), CRABS
facilitates the generation of reference databases using the largest
proportion of available sequences, thereby increasing the diversity
included in the final curated reference database. MetaCurator was the
second-best performing software package in our comparison, but does not
incorporate a function to download sequencing data. By using CRABS to
download sequencing data from multiple online repositories in the
MetaCurator pipeline, we may have influenced the output diversity
achieved by that program.
While CRABS can access multiple online repositories, downloaded file
sizes and time requirements are kept to a minimum by solely downloading
the gene region or taxonomic group of interest using the
‘db_download --query’ parameter. Additionally, sequence length
restrictions can be specified to exclude genome sequences, further
reducing file sizes and speeding up the process. EcoPCR, on the other
hand, recommends the download of the entire EMBL database, taking up
>2 TB of storage and a significant amount of time. While
time- and size-inefficient, ecoPCR-generated reference databases
included several species that were missed by CRABS due to the initial
sequence length exceeding the restriction parameter. Another benefit of
utilizing the full online repository is the identification of issues
around co-amplification from unintended taxonomic groups (Banos et al.,
2018). For example, ecoPCR identified the co-amplification of plants for
the gITS7/ITS4 (fungal) primer set, taxa not incorporated in the CRABS
reference database for which initial sequencing data was restricted to
fungal ITS sequences. To avoid the need to download complete online
repositories, we recommend using primer-specificity testing software,
such as Primer-BLAST (Ye et al., 2012), to determine which taxonomic
groups need to be included in the initial sequence download by CRABS.
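The trade-off introduced by a sequence-length restriction can be illustrated with a minimal Python sketch. This is not CRABS code: the helper names `parse_fasta` and `filter_by_length`, the toy records, and the thresholds are hypothetical and chosen only to show how a maximum-length cut-off discards genome-scale records, including any amplicon they may contain.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, min_len, max_len):
    """Keep records whose length lies within [min_len, max_len] (hypothetical helper)."""
    return [(h, s) for h, s in records if min_len <= len(s) <= max_len]


# Toy input: one short barcode and one long, genome-like record.
example = ">short_barcode\nACGTACGTAC\n>whole_genome\n" + "ACGT" * 50 + "\n"
records = list(parse_fasta(example))
kept = filter_by_length(records, min_len=5, max_len=100)
kept_names = [h for h, _ in kept]
print(kept_names)
```

In this toy case the 200 bp record is dropped even though it could contain the target region, which mirrors the species missed when the initial sequence length exceeded the restriction parameter.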
4.2 | Amplicon extraction
The extraction of the amplicon region from sequences deposited in online
repositories is a crucial part in the creation of curated reference
databases. While the different methodologies implemented in software
packages can be effective, CRABS’ combined implementation of in
silico PCR analysis and pairwise global alignments resulted in the most
complete reference databases for each of the four primer sets. In
particular, using amplicon regions extracted from the in silico PCR
analysis as seed sequences for pairwise global alignments
substantially increased the diversity included in the final reference
database, thereby outperforming an “in silico PCR-only”
approach. The proportion of additional barcodes retrieved by the
pairwise global alignment step will be heavily influenced by the chosen
primer set, with lower success for metabarcoding primers located within
the traditional barcoding region. Caution is warranted when relaxing the
parameter settings of the ‘pga’ function, as this may increase the
inclusion of false-positive hits in the reference database. We
therefore recommend the use of ‘pga --filter_method strict’
to reduce the chance of including erroneous sequences.
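The two-step logic described above (in silico PCR followed by pairwise global alignment against the recovered amplicons) can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the CRABS implementation: primer matching is exact apart from IUPAC degeneracy, the identity measure is derived from a unit-cost edit distance, sequences without primer hits are compared in full against each seed, and all function names, toy sequences, and the 0.85 identity threshold are hypothetical.

```python
import re

# IUPAC codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_regex(primer):
    return "".join(IUPAC[base] for base in primer.upper())


def in_silico_pcr(seq, fwd, rev):
    """Return the region between both primer sites, or None if a primer is not found."""
    pattern = (primer_regex(fwd) + "([ACGT]+?)"
               + primer_regex(reverse_complement(rev)))
    match = re.search(pattern, seq.upper())
    return match.group(1) if match else None


def edit_distance(a, b):
    """Levenshtein distance, i.e. the cost of a unit-penalty global alignment."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def global_identity(a, b):
    return 1 - edit_distance(a, b) / max(len(a), len(b), 1)


def build_reference(records, fwd, rev, min_identity=0.85):
    """Step 1: in silico PCR. Step 2: rescue primer-less sequences via seeds."""
    amplicons, leftovers = {}, {}
    for name, seq in records.items():
        hit = in_silico_pcr(seq, fwd, rev)
        if hit is not None:
            amplicons[name] = hit
        else:
            leftovers[name] = seq.upper()
    seeds = list(amplicons.values())  # amplicons from step 1 act as seeds
    for name, seq in leftovers.items():
        if any(global_identity(seq, seed) >= min_identity for seed in seeds):
            amplicons[name] = seq
    return amplicons


records = {
    "with_primers": "TTAAGGACATGCATGCAGAAGGTT",  # contains both primer sites
    "primerless": "CATGCATCCA",                  # deposited without primer regions
    "unrelated": "TTTTTTTTTT",
}
result = build_reference(records, fwd="AAGGR", rev="CCTTY")
print(result)
```

Here the second record lacks primer-binding regions and would be lost by an in silico PCR-only approach, but it is retained because it aligns closely to a seed amplicon. Lowering `min_identity` would eventually admit the unrelated record as well, which is the false-positive risk noted for relaxed parameter settings.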
MetaCurator, employing profile hidden Markov models to extract amplicon
regions, is the only software package in our test not taking into
account any information about the primer-binding regions. Instead,
MetaCurator imports up to 10 user-provided seed sequences that are
trimmed to the exact marker of interest. Without information about
primer-binding regions, MetaCurator often failed to recover several base
pairs at the beginning and end of the target amplicon region. While no
effect on taxonomy assignment was observed in our comparison, further
studies are required to determine the impact of partial reference
sequences on taxonomy assignment. Additionally, the amplicon extraction
step implemented in MetaCurator came at a great computational cost,
taking hundreds more CPU hours than any other method. Furthermore,
MetaCurator was unable to handle the high number of sequences included
in the mlCOIintF/jgHC02198 and gITS7/ITS4 database trials. Once the
number of query sequences reached approximately half a million,
MetaCurator could not run to completion (the program did not finish
within our two-week trial runtime). To circumvent this issue, larger
files were split into subsets that were run separately, after which the
results were combined, further adding to the computational cost and
potentially introducing other issues associated with separate runs.
Even greater computational problems were encountered for the QIIME
RESCRIPt plug-in. When following the authors’ guidelines to create a
database using RESCRIPt, we were unable to proceed past the alignment
step. Due to extreme sequence divergence across taxonomic groups and/or
excessive indels, alignments around a seed group of taxa contained an
average gap percentage of >70%. From these poor alignments
– even at the genus level – no amplicons were retained during the
amplicon extraction step in the pipeline. In light of this, we opted to
use the in silico PCR step in the standard QIIME toolkit (QIIME
extract_reads), rather than relying on the local alignment method
implemented in RESCRIPt. While successful for the MiFish-E/U and
Taberlet c/h primer sets, no results could be obtained for the
mlCOIintF/jgHC02198 and gITS7/ITS4 primer sets due to the presence of
degenerate bases in the primer sequences.
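For readers wishing to screen their own alignments before amplicon extraction, the gap percentage referred to above can be computed with a few lines of Python. The function name `gap_percentage` and the toy alignment are hypothetical; the sketch assumes the alignment is supplied as equal-length strings with "-" or "." marking gaps, and it is not part of RESCRIPt or QIIME.

```python
def gap_percentage(aligned_rows, gap_chars="-."):
    """Percentage of gap characters across all rows of an alignment."""
    total = sum(len(row) for row in aligned_rows)
    gaps = sum(row.count(char) for row in aligned_rows for char in gap_chars)
    return 100.0 * gaps / total if total else 0.0


# Toy alignment of three highly divergent sequences.
alignment = [
    "ACGT------",
    "--GT--A---",
    "A---------",
]
pct = gap_percentage(alignment)
print(f"{pct:.1f}% gaps")
```

An alignment such as this toy example, in which most columns are gaps, leaves little shared signal from which an amplicon region could be delimited.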
4.3 | Database curation
Comprehensive and well-annotated reference databases are of crucial
importance to increase taxonomy assignment accuracy (Hleap et al.,
2021). While database curation parameters between software packages were
not compared in our experiment, CRABS offers the most complete set of
features and is the only software program able to take into account
geographical species locations, a parameter shown to increase taxonomic
assignment accuracy (Gold et al., 2021; Murali et al., 2018).
Additionally, the efficiency of the taxonomy-dependent dereplication
function differed greatly between software packages, with CRABS and
RESCRIPt handling large datasets within seconds, while MetaCurator
failed to handle datasets larger than ~500,000 sequences.
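The principle of taxonomy-dependent dereplication can be sketched as follows. This is an illustrative Python example only, assuming records are available as (accession, species, sequence) tuples; the function name `dereplicate_by_taxon` and the placeholder species names are hypothetical, and the sketch does not reproduce the dereplication options of any of the compared packages.

```python
def dereplicate_by_taxon(records):
    """Keep the first record for every unique (species, sequence) combination."""
    seen = set()
    kept = []
    for accession, species, sequence in records:
        key = (species, sequence.upper())
        if key not in seen:  # set membership is a constant-time check on average
            seen.add(key)
            kept.append((accession, species, sequence))
    return kept


records = [
    ("acc1", "Aus albus", "ACGT"),
    ("acc2", "Aus albus", "acgt"),  # duplicate of acc1 within the same species
    ("acc3", "Aus niger", "ACGT"),  # same sequence, different species: retained
    ("acc4", "Aus albus", "ACGA"),  # same species, different sequence: retained
]
kept = dereplicate_by_taxon(records)
kept_accessions = [accession for accession, _, _ in kept]
print(kept_accessions)
```

Because identical sequences are only collapsed within a species, sequences shared between species remain visible in the database, and a single pass with hashed keys scales approximately linearly with the number of records.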
4.4 | Database exploration
Alongside the functionality to generate curated reference databases,
CRABS facilitates the exploration of reference databases using multiple
visualizations, thereby providing essential information on how to
interpret taxonomy assignment results. For example, machine learning
classifiers, such as SINTAX (Edgar, 2016) and RDP (Qiong et al., 2007),
are known to overclassify sequences in situations where the correct
label lies outside the scope of the reference database (Dave, 1991;
Murali et al., 2018). Hence, it is crucial to determine whether the
reference database is missing barcodes of closely related species, a
check implemented in the CRABS ‘--method db_completeness’
visualization. It should be noted, however,
that the ‘--method db_completeness’ visualization should be
interpreted as a guide only, as the function is built around the NCBI
taxonomy database (Federhen, 2012), known to exhibit errors (Schoch et
al., 2020). Therefore, consulting the primary literature remains
essential.
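The underlying idea of a completeness check can be expressed in a short Python sketch. It assumes a simple species-to-genus lookup and a set of species represented in the reference database; the function name `db_completeness`, the data structures, and the placeholder species names are hypothetical and do not reflect how CRABS queries the NCBI taxonomy.

```python
def db_completeness(target, genus_of, barcoded):
    """Summarise how many congeners of `target` have a barcode in the database."""
    genus = genus_of[target]
    congeners = sorted(sp for sp, g in genus_of.items() if g == genus)
    missing = [sp for sp in congeners if sp not in barcoded]
    return {
        "genus": genus,
        "congeners": len(congeners),
        "with_barcode": len(congeners) - len(missing),
        "missing": missing,
    }


# Placeholder taxonomy and reference database contents.
genus_of = {
    "Aus albus": "Aus",
    "Aus niger": "Aus",
    "Aus ruber": "Aus",
    "Bus major": "Bus",
}
barcoded = {"Aus albus", "Aus ruber", "Bus major"}
report = db_completeness("Aus albus", genus_of, barcoded)
print(report)
```

A species-level assignment to the target taxon in this toy example should be treated with caution, since one congener lacks a reference barcode and could not have been matched even if it were the true source of the sequence.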
Metabarcoding analyses attempt to classify sequences to species-level
resolution, based on the variations present in the amplicon region.
However, these partial gene sequences do not always permit species-level
resolution and the variation might not be consistent between taxonomic
groups, as has repeatedly been observed (Porter & Hajibabaei, 2020).
The generation of amplicon-based phylogenetic trees, implemented in the
CRABS ‘--method phylo’ visualization, can provide guidance
about the resolution of the amplicon region for specific taxonomic
groups, thereby aiding in the assignment of taxonomy at the correct
resolution.
With multiple primer sets available for specific taxonomic groups
targeting various gene regions (Zhang et al., 2020), CRABS
visualizations could aid in determining the optimal primer set for a
specific experimental design. For example,
‘--method db_completeness’ might provide information about which gene
region contains the largest number of barcodes for taxa of interest,
while the ‘--method diversity’ visualization gives insight into issues
surrounding unintended co-amplification. The
‘--method amplicon_length’ visualization, on the other hand, could help
determine which primer sets can be multiplexed on an Illumina
sequencing run, and the ‘--method phylo’ visualization can show which
amplicon obtains the highest taxonomic resolution. Finally, the
‘--method primer_efficiency’ visualization shows which primer set
contains the fewest mismatches for target taxa or provides information
on how to optimise the primer sequences for the taxa of interest.
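The comparison of primer sets by number of mismatches can be illustrated with the following Python sketch. It assumes that primer-binding sites have already been extracted and are of the same length and orientation as the primer; the function names, primer sequences, and binding sites are hypothetical toy values and do not correspond to any primer set discussed here.

```python
# IUPAC nucleotide codes mapped to the bases they represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_mismatches(primer, binding_site):
    """Count positions where the site base is not covered by the primer code."""
    if len(primer) != len(binding_site):
        raise ValueError("primer and binding site must have equal length")
    return sum(
        base not in IUPAC[code]
        for code, base in zip(primer.upper(), binding_site.upper())
    )


def mean_mismatches(primer, binding_sites):
    """Average number of mismatches of one primer across several binding sites."""
    return sum(count_mismatches(primer, s) for s in binding_sites) / len(binding_sites)


# Toy binding sites for three target taxa and two candidate primers.
sites = ["ACGTA", "ACGTG", "ACCTC"]
primers = {"primer_a": "ACGTR", "primer_b": "ACSTN"}
scores = {name: mean_mismatches(seq, sites) for name, seq in primers.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

In this toy case the more degenerate primer covers all three binding sites without mismatches, illustrating how per-position mismatch information can guide the choice between primer sets or the introduction of degenerate bases.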
4.5 | Implemented features within CRABS
As shown in Table 1, CRABS is as feature-rich as, or in certain
feature categories richer than, the other three software
packages. In particular, CRABS’ support for downloading sequencing data
of interest from multiple online repositories is distinctive.
Additionally, the ability to use extracted amplicons from the in
silico PCR analysis as a database for pairwise global alignment
analysis enables the retrieval of a larger portion of amplicon regions.
Furthermore, CRABS incorporates the most comprehensive set of database
curation parameters, as well as database export formats, thereby
allowing reference databases to be used immediately in taxonomy
assignment software packages. The implemented CRABS visualizations
also allow for a thorough investigation of the reference database to
aid taxonomy assignment of sequencing data. Finally, easy installation
of the conda package, simple parameter settings, and a fully documented
step-by-step workflow for reference database curation render CRABS
user- and analysis-friendly.