4.1 The importance of good reference databases
Metabarcoding enables biodiversity monitoring either with or without the
taxonomic identification of the retrieved taxa. Taxonomic identification
clearly requires appropriate reference databases that can be obtained ad hoc (e.g. by amplifying sequences from all the taxa
from the target group) (Cilleros et al. 2019; Morinière et
al. 2019) or by searching public databases such as GenBank. Public
databases offer an ever-growing resource, given that they combine the
outcome of thousands of studies and provide a volume of data that
would be unreachable by ad hoc studies. Public databases are not
error-free, but analyses have shown that, for animals, the error rate of
GenBank for genus-level identification is generally low
(~0.7 / 3.5%), suggesting that it can be a formidable
data source for applications relying on molecular data to understand the
impact of environmental changes on biodiversity (Leray et al. 2019). However, public databases are opportunistic collections of
material from multiple studies, and thus do not aim at
taxonomic completeness. Ad hoc databases (see also Ratnasingham &
Hebert 2007) are therefore essential resources to obtain the taxonomic
coverage required if we want to identify most benthic
macroinvertebrates.
Several researchers have advocated that COI-based markers should be favoured
for metabarcoding because they are standard barcodes for animals, and
thus we can expect a very large availability of sequences in reference
databases (Andújar et al. 2018; Leray et al. 2019). For
benthic macroinvertebrates, a very large number of COI sequences is
available in GenBank (Table 3). For instance, BF1_BR2-COI is by far
the marker with the highest number of sequences of benthic Diptera, with
nearly 3,000 sequences of BF1_BR2-COI available against only 1,000
sequences of Inse01 (16S rDNA), although the number of available sequences
is surprisingly variable across taxa. Nevertheless, a very large number
of sequences does not necessarily provide better taxonomic coverage. In
fact, most genera of benthic Diptera do not have sequences in
the reference database for COI, and Inse01 sequences represent more
genera than BF1_BR2-COI (25% for Inse01 against just 15% for
BF1_BR2-COI; Fig. 3). The mismatch between number of sequences and
database completeness could be related to the different scopes of
studies employing the different markers. COI is the marker most used
in standard barcoding studies, which often aim at unveiling
diversity among closely related, cryptic taxa; such studies therefore often
consider many individuals from closely related, morphologically similar
species within genera (Hebert et al. 2004). Conversely, the 16S
and 18S rDNA genes are often used to build phylogenies (e.g. Alvarez-Presas et al. 2008; Criscione & Ponder 2013), and many
phylogenetic studies aim at representing the largest number of genera
and families. Such a process could also explain the strong differences
among taxa (e.g. a very high completeness for Euka02 with
Turbellaria, and a much better coverage for Inse01 with Gastropoda; Fig.
3). If the aim is species-level identification, databases should be
exhaustive at the species level, and markers should have species-level
resolution. However, for freshwater biomonitoring, genus-level
identification is often sufficient (Bailey et al. 2001; Chessman et al. 2007); thus our database provides a good completeness that
can allow the identification of most genera, particularly with the
markers Euka02 and Inse01.
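The distinction drawn above between the number of reference sequences and genus-level completeness can be illustrated with a minimal sketch. It assumes that each reference sequence is available as a (marker, genus) pair and that the list of genera expected in the target group is known; the function name `genus_coverage`, the marker labels and the toy records are hypothetical and are not taken from the data of this study.

```python
from collections import defaultdict


def genus_coverage(reference_records, target_genera):
    """Summarise a reference database per marker.

    reference_records: iterable of (marker, genus) pairs, one per sequence.
    target_genera: set of genus names expected in the target group.
    Returns {marker: (number of sequences, fraction of target genera covered)}.
    """
    n_sequences = defaultdict(int)
    genera_seen = defaultdict(set)
    for marker, genus in reference_records:
        n_sequences[marker] += 1
        # Only genera belonging to the target group count towards coverage.
        if genus in target_genera:
            genera_seen[marker].add(genus)
    return {
        marker: (n_sequences[marker], len(genera_seen[marker]) / len(target_genera))
        for marker in n_sequences
    }


# Toy data, invented for illustration only.
target = {"Chironomus", "Simulium", "Tipula", "Culex"}
records = (
    # marker_A: many sequences, all from a single genus
    [("marker_A", "Chironomus")] * 6
    # marker_B: fewer sequences, spread over three genera
    + [("marker_B", "Chironomus"), ("marker_B", "Simulium"), ("marker_B", "Tipula")]
)
coverage = genus_coverage(records, target)
for marker, (n_seq, fraction) in sorted(coverage.items()):
    print(f"{marker}: {n_seq} sequences, {fraction:.0%} of target genera")
```

In this toy case the marker with twice as many sequences covers only one of the four target genera, whereas the marker with fewer sequences covers three, which mirrors the pattern described for COI and 16S in benthic Diptera.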
Matching metabarcodes with reliable reference databases can yield
metabarcoding-based biomonitoring data that should be
comparable with historical data obtained through traditional
(e.g. morphological) approaches. Freshwater environments are
highly sensitive to human impacts, and the availability of long-term
time series is pivotal to identify trends in occupancy and in the
ecological quality of environments (Outhwaite et al. 2020).