4.1 The importance of good reference databases
Metabarcoding enables biodiversity monitoring either with or without the taxonomic identification of the retrieved taxa. Taxonomic identification requires appropriate reference databases, which can be built ad hoc (e.g. by amplifying sequences from all the taxa of the target group) (Cilleros et al. 2019; Morinière et al. 2019) or obtained by searching public databases such as GenBank. Public databases are an ever-growing resource: they combine the outcome of thousands of studies and provide an amount of data that would be unreachable by ad hoc studies. Public databases are not error-free, but analyses showed that, for animals, the error rate of GenBank for genus-level identification is generally low (~0.7 / 3.5%), suggesting that it can be a formidable data source for applications relying on molecular data to understand the impact of environmental changes on biodiversity (Leray et al. 2019). However, public databases are opportunistic collections of material from multiple studies and do not aim at taxonomic completeness. Ad hoc databases (see also Ratnasingham & Hebert 2007) are thus essential resources to obtain the taxonomic coverage required to identify most benthic macroinvertebrates.
Several researchers have advocated that COI-based markers should be favoured for metabarcoding because they are the standard barcodes for animals, so that a very large number of sequences can be expected in reference databases (Andújar et al. 2018; Leray et al. 2019). For benthic macroinvertebrates, a very large number of COI sequences is indeed available in GenBank (Table 3). For instance, BF1_BR2-COI is by far the marker with the highest number of sequences of benthic Diptera, with nearly 3,000 sequences available against only 1,000 sequences of Inse01 (16S rDNA). Still, the number of available sequences is surprisingly variable across taxa. Moreover, a very large number of sequences does not necessarily provide better taxonomic coverage. Most genera of benthic Diptera do not have COI sequences in reference databases, and Inse01 sequences represent slightly more genera than BF1_BR2-COI (25% for Inse01 against just 15% for BF1_BR2-COI; Fig. 3). The mismatch between the number of sequences and database completeness could be related to the different scopes of the studies employing the different markers. COI is the marker most used by standard barcoding studies, which often aim at unveiling diversity among closely related, cryptic taxa, and thus often consider many individuals from closely related, morphologically similar species within genera (Hebert et al. 2004). Conversely, the 16S and 18S rDNA genes are often used to build phylogenies (e.g. Alvarez-Presas et al. 2008; Criscione & Ponder 2013), and many phylogenetic studies aim at representing the largest number of genera and families. Such a process could also explain the strong differences among taxa (e.g. a very high completeness for Euka02 with Turbellaria, and a much better coverage for Inse01 with Gastropoda; Fig. 3). If the aim is species-level identification, databases should be exhaustive at the species level, and markers should have species-level resolution.
For freshwater biomonitoring, however, genus-level identification is often sufficient (Bailey et al. 2001; Chessman et al. 2007); our database thus provides a good completeness that allows the identification of most genera, particularly with the markers Euka02 and Inse01.
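The distinction between the number of available sequences and database completeness can be illustrated with a minimal sketch in Python. The function name `coverage_summary`, the genus lists and the marker labels are hypothetical and are not taken from our analyses. Completeness is assumed here to be the fraction of target genera with at least one reference sequence.

```python
from collections import Counter


def coverage_summary(target_genera, reference_records):
    """Return (n_sequences, completeness) for one marker.

    target_genera: genera expected in the target group (hypothetical checklist).
    reference_records: one genus name per reference sequence.
    Completeness is assumed to be the share of target genera that have
    at least one reference sequence.
    """
    targets = set(target_genera)
    # Count sequences per genus, ignoring genera outside the target list.
    per_genus = Counter(g for g in reference_records if g in targets)
    n_sequences = sum(per_genus.values())
    completeness = len(per_genus) / len(targets) if targets else 0.0
    return n_sequences, completeness


# Hypothetical toy data: four target genera and two markers.
target = ["Chironomus", "Simulium", "Tipula", "Culex"]
marker_a = ["Chironomus"] * 6                   # many sequences, one genus
marker_b = ["Chironomus", "Simulium", "Tipula"]  # few sequences, three genera

print(coverage_summary(target, marker_a))  # (6, 0.25)
print(coverage_summary(target, marker_b))  # (3, 0.75)
```

In this toy case the marker with twice as many sequences covers only one third as many genera, which is the kind of mismatch described above for COI and 16S markers.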
Matching metabarcodes against reliable reference databases yields metabarcoding-based biomonitoring data that should be comparable with historical data obtained through traditional (e.g. morphological) approaches. Freshwater environments are highly sensitive to human impacts, and the availability of long-term time series is pivotal to identify trends in occupancy and in the ecological quality of environments (Outhwaite et al. 2020).