ABSTRACT
The measurement of biodiversity is an integral aspect of life science
research. With the establishment of second- and third-generation
sequencing technologies, an increasing amount of metabarcoding data is
being generated as we seek to describe the extent and patterns of
biodiversity in multiple contexts. The reliability and accuracy of
taxonomic assignment of metabarcoding sequencing data have been shown to
depend critically on the quality and completeness of reference
databases. Custom, curated, eukaryotic reference databases, however, are
scarce, as are the software programs for generating them. Here, we
present CRABS (Creating Reference databases for Amplicon-Based
Sequencing), a software package to create custom reference databases for
metabarcoding studies. CRABS includes tools to download sequences from
multiple online repositories (i.e., NCBI, BOLD, EMBL, MitoFish),
retrieve amplicon regions through in silico PCR analysis and
pairwise global alignments, curate the database through multiple
filtering parameters (e.g., dereplication, sequence length, sequence
quality, unresolved taxonomy), export the reference database in multiple
formats for immediate use in taxonomy assignment software, and
investigate the reference database through implemented visualizations
for diversity, primer efficiency, reference sequence length, and
taxonomic resolution. CRABS is a versatile tool for generating curated
reference databases of user-specified genetic markers to aid taxonomy
assignment from metabarcoding sequencing data. CRABS is available for
download as a conda package and via GitHub
(https://github.com/gjeunen/reference_database_creator).
Keywords: Reference database curation, environmental DNA,
eDNA, ancient DNA, aDNA, taxonomy assignment, Python
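
The amplicon retrieval and curation steps summarized in the abstract (in silico PCR, then filtering by length, ambiguous bases, and dereplication) can be illustrated with a minimal Python sketch. This is not the CRABS implementation: the function names, the exact-match primer search, and the toy sequences are all assumptions made for illustration, and a real workflow would need to tolerate primer mismatches and handle taxonomy.

```python
def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def in_silico_pcr(seq, fwd, rev):
    """Return the region between exact primer matches, or None.

    Assumption: primers must match exactly; no mismatches are allowed.
    """
    start = seq.find(fwd)
    if start == -1:
        return None
    # The reverse primer binds the opposite strand, so search for its
    # reverse complement downstream of the forward primer.
    end = seq.find(reverse_complement(rev), start + len(fwd))
    if end == -1:
        return None
    return seq[start + len(fwd):end]


def curate(records, fwd, rev, min_len, max_len):
    """Extract amplicons and keep unique ones that pass simple filters."""
    seen = set()
    kept = {}
    for name, seq in records.items():
        amplicon = in_silico_pcr(seq.upper(), fwd, rev)
        if amplicon is None:
            continue  # primers not found
        if not (min_len <= len(amplicon) <= max_len):
            continue  # length filter
        if "N" in amplicon:
            continue  # ambiguous-base filter
        if amplicon in seen:
            continue  # dereplication: keep first occurrence only
        seen.add(amplicon)
        kept[name] = amplicon
    return kept


# Hypothetical toy records; real input would come from a sequence repository.
records = {
    "seq1": "GGACGTTTTTAAAACCAAGG",  # valid amplicon
    "seq2": "ACGTTTTTAAAACCAA",      # duplicate amplicon of seq1
    "seq3": "TTTTTTTT",              # no primer sites
    "seq4": "ACGTTTNNAAAACCAA",      # contains ambiguous bases
    "seq5": "ACGTTTCCAA",            # amplicon too short
}
result = curate(records, fwd="ACGT", rev="TTGG", min_len=5, max_len=20)
print(result)
```

Only the first record survives in this toy example; the others are removed by dereplication, missing primer sites, ambiguous bases, and the length filter, respectively.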