Metabarcodes for the monitoring of freshwater benthic biodiversity through environmental DNA
Gentile Francesco Ficetola1,2, Frédéric Boyer1, Alice Valentini3, Aurélie Bonin1,2, Albin Meyer4, Tony Dejean3, Coline Gaboriaud3, Philippe Usseglio-Polatera4, Pierre Taberlet1,5
1 Univ. Grenoble Alpes, CNRS, Univ. Savoie Mont Blanc, LECA, Laboratoire d’Ecologie Alpine, F-38000 Grenoble
2 Department of Environmental Sciences and Policy; University of Milano, Milano, Italia.
3 SPYGEN, Le Bourget du Lac, France
4 Université de Lorraine, CNRS, LIEC, F-57000 Metz, France
5 UiT – The Arctic University of Norway, Tromsø Museum, Tromsø, Norway
Abstract
Environmental DNA and metabarcoding have great potential for the biomonitoring of freshwater environments. However, successful application of metabarcoding to biodiversity monitoring requires universal primers with high taxonomic coverage that amplify highly-variable, short metabarcodes with high taxonomic resolution. Moreover, reliable and extensive reference databases are essential to match the outcome of metabarcoding analyses with the available taxonomy and biomonitoring indices. Benthic invertebrates, particularly insects, are key taxa for freshwater biomonitoring. Nevertheless, so far, no formal comparison has assessed primers for metabarcoding of freshwater macrobenthos. Here we combined in vitro and in silicoanalyses to test the performance of metabarcoding primers amplifying regions in the 18S rDNA (Euka02 metabarcode), 16S rDNA (Inse01), and COI (BF1_BR2-COI) genes, and developed an extensive database of benthic invertebrates of France and Europe, with a special focus on three key insect orders (Ephemeroptera, Plecoptera and Trichoptera). In vitro analyses on 1514 individuals, belonging to 578 different taxonomic units showed very different amplification rates across primer combinations. The Euka02 marker showed the highest universality, while the Inse01 marker showed excellent performance for the amplification of insects. The BF1_BR2-COI metabarcode showed the highest resolution, while the resolution of Euka02 was often limited. By combining in vitro data with GenBank information, we developed a curated database including sequences representing 822 genera. The heterogeneous performance of the different metabarcodes highlights the complexity of the identification of the best markers, and advocates for the integration of multiple metabarcodes for a more comprehensive and accurate understanding of ecological impacts on freshwater biodiversity.
Keywords: freshwater biodiversity; biomonitoring; biotic indices; DNA metabarcoding; primer bias; invertebrates; cytochrome c oxidase I; amplification rate; universality; taxonomic resolution.
1 INTRODUCTION
Freshwater environments are essential providers of clean water and other services for human society. They also host a substantial biodiversity, still they are globally subjected to the joint impact of multiple stressors such as pollution, eutrophication, climate change and hydrological and hydromorphological modifications (Noges et al.2016; Iversen et al. 2019). As a consequence, numerous regulations have been adopted at both the national and international level for the protection of water resources, such as the European Water Framework Directive (Directive 2000/60/EC) and the Clean Water Act of the US Environmental Protection Agency (33 U.S.C. §§1251-1387 1972) (Pawlowski et al. 2018). These regulations generally require the monitoring of freshwater environments through a combination of physicochemical, hydrological, and biotic parameters, to obtain prompt measurements of water quality and of the ecological status of ecosystems.
Multiple approaches exist to assess freshwater quality using aquatic organisms. Benthic invertebrates are perhaps the most frequently used biological group in aquatic bioassessment (Birk et al. 2012), because they are (i) taxonomically, biologically and functionally diverse (Usseglio-Polatera et al. 2000; Usseglio-Polateraet al. 2001), (ii) rather easy to identify at the genus or family levels (Tachet et al. 2010), (iii) often sedentary and reacting rapidly to anthropogenic pressures in all types of freshwater bodies (Hering et al. 2006b; Archaimbault et al. 2010; Heringet al. 2013), and (iv) their occurrence integrates the effects of environmental changes over several months (Floury et al. 2013). Invertebrate assemblages are thus a tool of choice to assess the ecological status of water bodies (e.g. Marzin et al.2012; Hering et al. 2013; Mondy & Usseglio-Polatera 2013) and to demonstrate environmental degradation (Miler et al. 2013; Mondy & Usseglio-Polatera 2013; Theodoropoulos et al. 2020) or restoration (Arce et al. 2014; Kupilas et al. 2016; Camargo 2017; Carlson et al. 2018).
Generally, bioassessment indices relying on benthic communities are based on the standardized collection of invertebrate assemblages from monitored sites, followed by organism sorting and taxonomic identification using morphological criteria. Then, quality scores can be attributed on the basis of the presence and/or abundance of certain taxa (Friberg et al. 2006; Birk et al. 2012). As morphological identification is often challenging, in many case protocols do not require species-level identification, and identification at the genus or family level (and, in some cases, even at coarser levels) can be enough for the calculation of many biotic indices evaluating the ecological status of rivers (Bailey et al. 2001; Chessman et al.2007; Birk et al. 2012). Nevertheless, the morphological identification of hundreds collected specimens, including young, small-sized, larval stages and organisms damaged during sampling, remains time-consuming and requires a substantial taxonomic expertise, increasing the cost and time required for in-depth assessment of water quality (Haase et al. 2004; Hering et al. 2018).
Environmental DNA (eDNA) and metabarcoding are revolutionizing the monitoring of biodiversity at all levels, because they circumvent the challenge of morphological identification and allow the efficient detection of many taxa that are difficult to capture and detect using traditional methods (Taberlet et al. 2018). eDNA and metabarcoding are therefore extremely promising for the assessment of freshwater communities (Hering et al. 2018). DNA can be extracted from the tissue of pooled invertebrate communities, amplified using universal primers, sequenced, and identified on the basis of reference databases (Baird & Hajibabaei 2012; Yu et al. 2012; Andújaret al. 2018). This approach uses the same starting material than traditional biomonitoring, but allows skipping the complexity of morphology-based taxonomy (Baird & Hajibabaei 2012). Alternatively, DNA can be obtained directly from the water (Ficetola et al. 2008). Environmental DNA extracted from freshwaters allows the detection of many taxa that are difficult to capture and detect using traditional methods, but also poses new challenges compared to metabarcoding performed on the tissues of captured individuals. When in aquatic environments, DNA undergoes rapid degradation (Eichmiller et al.2016; Buxton et al. 2017); therefore eDNA is generally characterized by small fragment sizes (Jo et al. 2017; Bylemanset al. 2018), but see also (Sigsgaard et al. 2017). This generally precludes the use of ”standard” barcode primers, which often amplify long DNA fragments (e.g. >300 bp in the most frequently used COI markers; Andújar et al. 2018). Furthermore, highly degenerated primers increase the risk of non-specific amplification, thus this kind of primers is not really suitable for the amplification of the complex mix of DNA extracted from the environment. As a consequence, the monitoring of benthic invertebrates using eDNA requires the development and assessment of primers with appropriate features.
Besides the length of the amplified region, three main characteristics are essential for satisfactory eDNA metabarcodes. First, the eDNA amplification rate generally decreases with the number of mismatches between target fragments and primers. Primers must therefore be designed in order to have a consistently low number of mismatches within sequences of the target group (high universality or taxonomic coverage; Ficetola et al. 2010; Piñol et al. 2015; Marquina et al. 2019). Taxonomic coverage can be assessed through both in silico and in vitro analyses. In silico analyses can allow the rapid assessment of all the taxa for which information is publicly available in databases, but in vitro tests are still needed to confirm the conditions under which primers work in the real world. Second, the amplified region must be highly variable, to ensure the identification of amplified organisms at the desired taxonomic level (high resolution; Ficetola et al. 2010; Tang et al. 2012; Marquina et al. 2019). Finally, extensive databases are essential if we want to assign the amplified sequences to known taxa. Even though attempts have been made for the assessment of environmental quality without a taxonomic assignment of DNA fragments (Ji et al. 2013; Apothéloz-Perret-Gentil et al. 2017), taxonomic assignment is essential if we want to produce data comparable with traditional indices of water quality, or if we want to combine eDNA data with information obtained through traditional methods (e.g. to analyse long-term series of water body surveys). Despite several attempts to assess freshwater quality using eDNA (Hering et al. 2018; Serranaet al. 2019; Czechowski et al. 2020; Pont et al.2020; Yang & Zhang 2020), so far no formal comparison has been performed among short primers suitable for eDNA metabarcoding of freshwater macrobenthos. In addition, there is a pressing need of exhaustive reference databases for taxonomic assignment.
In this study we combined in vitro and in silico analyses to compare the performance of three primer pairs potentially suitable for the analysis of eDNA from freshwater invertebrates (macrobenthos), and we developed an extensive reference database for benthic invertebrates living in European freshwaters. We mostly focused on three insect orders (Ephemeroptera, Plecoptera and Trichoptera), which are among the most frequently used invertebrates for the bioassessment of streams (e.g. Brabec et al. 2004; Hering et al. 2006a; Gabriels et al. 2010; Arman et al. 2019; but see also Coxet al. 2019). We also considered a broad range of organisms belonging to other orders of insects and to other classes. We first produced the metabarcodes on the broadest available number of taxa from France, and then combined metabarcodes obtained in vitro with sequences available in public database, to obtain extensive and reliable measures of metabarcode performance, and to produce a extensive reference database for the monitoring of freshwaters through eDNA.
2 MATERIAL AND METHODS
We used the standardized database of European freshwater organisms (Schmidt-Kloiber & Hering 2015; download on 01 March 2018) as taxonomic reference for our analyses, considering all the benthic macro-invertebrates. Although in some cases this database considers non-monophyletic groups (e.g. Crustacea), it provides an exhaustive checklist of benthic macroinvertebrates that serve as an essential basis for monitoring bioassessment.
2.1 In vitro analyses of reference specimens
Most of the reference specimens were provided by OPIE-Benthos which is a working group of OPIE (Office Pour les Insectes et leur Environnement) especially dedicated to aquatic insect studies and aquatic ecosystem protection in France. OPIE-Benthos has developed a national inventory and reference collection of aquatic insects, including Ephemeroptera, Plecoptera, Trichoptera, and more recently aquatic Coleoptera, aquatic and semi-aquatic Heteroptera, aquatic larval stages of Megaloptera, Nevroptera and Diptera (Ptychopteridae) (http://www.opie-benthos.fr/opie/insecte.php). Corresponding organisms, identified at the highest possible level (species, if possible) by experienced taxonomists, were provided in triplicates (i.e. three specimens per taxon). The collection was completed by additional taxa (e.g. non-insect taxa) specifically sampled by the authors for this reference database.
Specimens were stored in 99% ethanol before DNA extraction. Total DNA was extracted from the entire organism. Samples (constituted of one specimen) were initially incubated overnight at 56 °C in 0.5 ml of lysis buffer (Tris-HCl 0.1 M, EDTA 0.1 M, NaCl 0.01 M and N-lauroyl sarcosine 1%, pH 7.5–8.0). Extractions were then completed using the DNeasy Blood Tissue Kit (Qiagen GmbH, Hilden, Germany), according to the manufacturer’s instructions. DNA extracts were recovered in a total volume of 300 μl of elution buffer. Negative extractions without specimens were systematically performed to monitor possible contaminations. Three DNA amplifications were carried out for each sample using the following primer pairs: Inse01, amplifying a ~155 bp region of the 16S mitochondrial rDNA (Taberletet al. 2018); Euka02, amplifying a ~123 bp region of the 18S rDNA (Guardiola et al. 2015; Taberlet et al.2018); and the BF1 and BR2 primers, which amplify a ~316 bp region of the cytochrome c oxidase I (Elbrecht & Leese 2017). Inse01 has been developed mostly to amplify insects, Euka02 to amplify all eukaryotes, while BF1 and BR2 were designed to amplify freshwater macroinvertebrates (Elbrecht & Leese 2017; Taberlet et al.2018). DNA amplifications were performed in a final volume of 20 μL, using 2 μL of DNA extract as template. The amplification mixture contained 10 µL of Applied Biosystems™ Master Mix AmpliTaq Gold™ 360, 0.2 μg/μL of bovine serum albumin (BSA, Roche Diagnostic, Basel, Switzerland) and 0.5 µM of each primer for COI and Inse01, or 0.2 µM of each primer for Euka2. Forward and reverse primers were 5’-labeled with eight-nucleotide tags with at least three differences between any pair of tags, so that each PCR replicate was identified by a unique combination of tags. This allowed the assignment of each sequence to the corresponding replicate during sequence analysis (Coissac 2012; Taberletet al. 2018). The PCR mixture was denatured at 95°C for 10 min, followed by 35 cycles of 30 s at 95°C, 30 s at 52°C for COI and Inse01 or 45°C for Euka2, and 1 min at 72°C (1m 30s for COI), and followed by a final elongation at 72°C for 7 min. Negative DNA extraction and PCR controls (ultrapure water, with 3 replicates as well) were analysed in parallel with the samples to monitor possible contaminations during the PCR step.
For Euka02 and Inse01, sequencing was performed by 2 × 125-bp pair-end sequencing on Illumina HiSeq 2500 platform, while for BF1_BR2-COI sequencing was performed by 2 × 250-bp pair-end sequencing on Illumina MiSeq platform at Fasteris (Geneva, Switzerland). Sequencing data were processed using the OBITools (Boyer et al. 2016). Raw sequences were first aligned (illuminapairedend) to recover the amplicon sequence and then demultiplex (ngsfilter) to assign them to the samples. This was followed by dereplication (obiuniq) keeping track for each sequence of its count in the samples. Then for each sample, the ratio of counts for the most abundant sequence and the second most abundant sequence was calculated. Only the most abundant sequences having a count greater than 1000 and a ratio above 1/10 were considered to get rid of badly amplified samples and samples were several product were amplified.
As a further validation step, all the retrieved metabarcodes were matched against NCBI using BLAST, to identify eventual cases in which the obtained metabarcode is a spurious amplification of a non-target organism (e.g. fungi or algae). The in vitro amplification rate was measured for each taxon as the proportion of specimens for which we obtained valid metabarcodes.
2.2 Setting up the composite reference databases
For each species within the database of European freshwater organisms (Schmidt-Kloiber & Hering 2015), we matched the binomial name with the NCBI taxonomy database to retrieve their NCBI taxonomic code (taxid). All the available metabarcodes for the three regions of interest, together with their associated taxid, were extracted from the EMBL sequence data repository (release 136) using the ecoPCR program (Ficetola et al. 2010) by matching the primer sequences with up to 3 errors and restricting the metabarcodes to relevant lengths (length >30 bp for Euka02, length 70-270 bp for Inse01, length 100-500 for BF1_BR2-COI). The three composite reference databases (one for each metabarcode region) were then built by aggregating metabarcodes for each genus with those obtained from specimens analysed in vitro . In order to obtain the most complete coverage of genera found in France, we obtained the taxid of all metabarcodes produced throughin vitro analyses as well as metabarcodes extracted from EMBL and associated to the taxid of a species found in France. For genera for which no such metabarcode existed, we included the metabarcodes extracted from EMBL and associated to the taxid of a species of the same genus found in Europe. If no such metabarcode existed, we included all the metabarcodes extracted from EMBL, and associated to a taxid belonging to this genus, also considering species that are not native in Europe.
2.3 Assessing the resolution of metabarcodes
We assessed the resolution of each metabarcoding region with the same procedure. First, the metabarcodes obtained as described above were compared to each other to find identical metabarcodes; this allowed producing a list of unique metabarcodes. For each unique metabarcode, we obtained the list of all the associated taxids. We tested taxonomic resolution at four levels: order, family, genus, and species. More specifically, we tested if, at a given taxonomic level, the list of associated taxids would collapse to a unique taxid or not (i.e.all taxids have the same ancestor taxid at that level). If a list would not collapse to one unique taxid for the tested taxonomic level, it meant that this metabarcode was not discriminant for this taxonomic level. Consider for instance a given metabarcode associated to multiple species within multiple genera within one single family. This particular metabarcode showed a family-level resolution, but not a species- or a genus-level resolution. It must be remarked that these measures of taxonomic resolution are heavily dependent on the available database. For example, if the database includes the metabarcode of only one species within a genus, this analysis could return a species-level resolution, even though it is possible that unanalysed species within the same genus share the same metabarcode.
3 RESULTS
3.1 In vitro analyses of reference specimens
We extracted and amplified DNA from 1514 individuals, belonging to 578 different taxa (Table 1). The majority of specimens were insects, and three insect orders with macrobenthic larvae (Ephemeroptera, Plecoptera and Trichoptera) altogether accounted for 80% of analysed specimens. Out of these specimens, 99% were morphologically identified at the family level or higher, 95% at the genus level or higher, and 62% at the species level. The average number of sampled individuals was 2.6 individuals per taxon (range: 1-12; median: 3). For Ephemeroptera, Plecoptera, Trichoptera and Megaloptera the analysed specimens covered well the diversity of French and European benthic fauna (100%, 74%, 78% and 100% of genera recorded in France for Ephemeroptera, Plecoptera, Trichoptera and Megaloptera, respectively; 70%, 52%, 65% and 100% of all the genera recorded in Europe; Table 2). Representation was relatively good for Coleoptera, Hemiptera and Neuroptera, whereas coverage was weaker for the remaining orders of insects and for non-insects.
The amplification rate using the three metabarcodes was highly heterogeneous among taxonomic groups (Fig. 1). Euka02 (18S) showed the highest average amplification success (88%), with consistently high amplification success in all the taxa except Malacostraca (Fig. 1). Within insects, Euka02 showed excellent amplification success in most of orders, but its amplification success was poor with Diptera (Fig. 1b).
As expected, Inse01 showed good amplification success for insects (82%), while it showed a limited amplification of the remaining taxa (Fig. 1a). Within insects, Inse01 showed excellent amplification success in all the orders except Trichoptera, where amplification success was 71% (Fig. 1b).
Finally, BF1_BR2-COI showed an average amplification rate of 48%, with highly variable results among taxa (Fig. 1a). BF1_BR2-COI showed a good amplification rate with Gastropoda, Clitellata and Malacostraca, while the rate was lower for several orders of insects. Within insects, BF1_BR2-COI showed good performance in Coleoptera and Diptera (amplification success ≥ 74%), while it amplified less than 50% of specimens from Ephemeroptera, Plecoptera and Trichoptera (Fig. 1b).
3.2 Combined database
When we combined sequences obtained in vitro with sequences obtained from GenBank, we obtained a total of 18 834 metabarcodes (3 441 for Euka02, 9 715 for Inse01 and 5 678 for BF1_BR2-COI). Insects accounted for the majority of metabarcodes, followed by Crustacea and Clitellata (Table 3). The combined database showed a good coverage of the diversity of European benthic fauna. For the Euka02 primer pair, the completeness of the database was particularly good (>80%) for Turbellaria, Coleoptera and Odonata. For Inse01, the completeness was particularly good for Coleoptera, Ephemeroptera and Odonata, while BF1_BR2-COI showed a relatively homogeneous completeness across taxa, with values between 50 and 70% for most of taxa (Fig. 2).
3.3 Taxonomic resolution of metabarcodes
The taxonomic resolution was strongly different among metabarcodes. For Euka02, 21% of metabarcodes were associated with more than one species in the database (Fig. 3a). The best resolution was observed for BF1_BR2-COI, with just 3% of metabarcodes associated with more than one species, while Inse01 showed an intermediate resolution (10% of metabarcodes associated with more than one species; Fig. 3a). The taxonomic resolutions of these metabacodes were clearly better if we consider the identification at the genus level (Fig. 3b). Euka02 showed the weakest performance, with around 6% of metabarcodes associated with more than one genus, while BF1_BR2-COI showed the best performance, with less than 1% of metabarcodes associated with more than one genus. Inse01 showed a generally good performance, with less than 1% of metabarcodes associated with more than one genus for most taxa. The performance was slightly poorer for Plecoptera and Trichoptera, with around 4% of metabarcodes associated with more than one genus. Family level identification was very good for all the metabarcodes, with a slightly poorer performance of Euka02 (Fig. 3c). It must be remarked that these values of resolution are calculated on an incomplete set of data, since our database did not include the sequences of many species and genera (Table 3), and all resolution estimates would probably be poorer if calculated on a complete database.