Introduction
Our planet’s current biodiversity is intricately complex, dynamic and
heterogeneous, but the evolutionary history of most taxa is poorly
documented. Such lack of knowledge may hamper conservation strategies,
and, given current anthropogenic pressures on ecosystems (Barnosky et
al., 2012; Dirzo et al., 2014), it may contribute to the irretrievable
loss of poorly known biodiversity. Genomic data at the inter- and/or
intraspecific level are essential to assess or infer ecological and
evolutionary properties, but despite the continuous development of
increasingly versatile genomic methods for non-model organisms
(Romiguier et al., 2014), the differential rates at which genomic
resources are acquired along various branches of the tree of life
contribute to this heterogeneity in knowledge (e.g. Gayral et al., 2013). Currently,
detailed insights into the demographic history, the genome-wide
development of reproductive isolation, the development of phenotypic
traits and mechanisms of adaptation exist for a restricted number of
model organisms with superior genomic resources (e.g. Cooney et al.,
2017; Ronco et al., 2020; Van Belleghem et al., 2017).
A critical challenge for model and non-model organisms alike remains
understanding the interplay of microevolutionary and macroevolutionary
drivers and dynamics of diversification (Erwin, 2000; Reznick &
Ricklefs, 2009), because both have traditionally been studied with
different approaches and at different timescales. Microevolutionary
studies typically consist of in-depth analyses of a small number of
species based on large intraspecific samples, with limited opportunities
for generalization across taxa. Macroevolutionary studies usually
rely on comparative analyses of a restricted set of
representatives for each of a large number of extant and/or fossil
species, limiting insight into mechanisms operating at the level of
individual species. Theoretical and empirical studies at both levels
have indicated the need for a large number of orthologous loci to
document phylogenetic relatedness, genetic diversity and population
history (Dutoit et al., 2017; Helyar et al., 2011; Leaché & Rannala,
2011; Wortley et al., 2005). Rapid advances in high-throughput
sequencing methods and genomic data analysis now allow the development
of large multi-locus datasets on model and non-model organisms alike, creating
novel perspectives for the integration of microevolutionary and
macroevolutionary dynamics.
A multitude of strategies exist to obtain molecular datasets at a
variety of taxonomic levels, enabling the development of genomic
sampling schemes that could tackle questions at macroevolutionary and
microevolutionary scales simultaneously. For non-model organisms the
majority of approaches consist of reduced representation sequencing,
where a subset of orthologous markers of the nuclear genome across taxa
or individuals is obtained, for example with RAD-seq (Miller et al.,
2007), transcriptomics (RNA-seq; Gayral et al., 2013), or by sequencing
libraries after targeted sequence capture/enrichment (Hyb-seq). The
latter strategy includes anchored hybrid enrichment (Lemmon et al.,
2012), the sequencing of ultraconserved elements (UCEs; Faircloth et
al., 2012), sequence capture using PCR-generated probes (Peñalba et al.,
2014), transcriptome-based exon capture (Bi et al., 2012), or if genomic
resources exist, by using conserved non-coding elements (CNEs; Vavouri
et al., 2007), including conserved non-exonic elements (CNEEs; Edwards
et al., 2017).
Here we develop a methodological framework combining several of the
abovementioned approaches to enrich a set of loci that enables
phylogenetic and population genetic studies in taxa with very limited
genomic resources. Several motivations drive this effort. First, as
comprehensive phylogenetic and population genetic studies require large
sample sizes, we require a strategy that is scalable to hundreds or thousands
of individuals without massive inflation of sequencing costs. Second,
microevolutionary and macroevolutionary studies each impose specific
constraints, e.g. related to orthology and to the identification of
coding vs. non-coding regions and, within coding regions, of synonymous
vs. non-synonymous sites, so that the advantages and disadvantages of an
integrative strategy need to be evaluated. Third, in the absence of a
well-assembled reference genome for the focal taxa or their close
relatives, it remains difficult to leverage many of the abovementioned
reduced representation sequencing methods. Specifically, we propose a
strategy based on target enrichment of entire open reading frames (ORFs)
of genes, which have been selected from ingroup-specific transcriptome
sequencing, supplemented with more universal UCE targets that were
identified from comparisons among distant genomes. The ORF of a gene is
a stretch of DNA sequence, in the correct reading frame, between start
and stop codons that encodes a protein, i.e. the coding sequence.
Because ORFs are usually clustered in gene-rich regions within
animal genomes (Osbourn & Field, 2009; Sproul et al., 2005), we include
UCEs to increase the evenness at which the genome is sampled.
Additionally, the integration of multiple types of markers has been
suggested to enhance opportunities to
resolve phylogenetic conflict (Chan et al., 2020; Hutter et al., 2019;
Reddy et al., 2017). Both ORFs and UCEs are useful markers for organisms
with no or limited genomic resources (Faircloth et al., 2012; Portik et
al., 2016), as they enable the reconstruction of phylogenetic
relationships across clades of varying age and taxonomic scale (Bi et
al., 2012; Bragg et al., 2016; Faircloth et al., 2012; Harvey et al.,
2016; Hugall et al., 2016; Lemmon et al., 2012; Teasdale et al., 2016),
and they allow the detection of SNPs for population-level analyses (De
Wit & Palumbi, 2012; Harvey et al., 2016; Schunter et al., 2014). A
novelty of our approach is to focus on the entire ORFs of genes, which
allows more rigorous assessment of genetic diversity at the population
level, for example, through more accurate assessment of synonymous vs.
non-synonymous genetic diversity and demographic history (Gayral et al.,
2013), including examinations of the speciation continuum (Roux et al.,
2016). Inclusion of multiple exons per gene also provides access to
additional intronic/intergenic flanking regions (compared to when a
single exon is used), which may contain substantial phylogenetic
information, especially at shallow taxonomic levels (Breinholt et al.,
2018).
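The definition of an ORF given above can be made concrete with a minimal sketch. It assumes the standard genetic code (ATG start; TAA, TAG and TGA stops), scans only the forward strand of an assembled transcript, and uses a hypothetical function name, `longest_orf`; it illustrates the concept and is not the target-selection pipeline used in this study.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def longest_orf(seq: str) -> str:
    """Return the longest ATG...stop ORF on the forward strand.

    The returned sequence includes the stop codon; an empty string is
    returned when no complete ORF is found.
    """
    seq = seq.upper()
    best = ""
    for frame in range(3):  # the three forward reading frames
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]
                if len(orf) > len(best):
                    best = orf
                start = None  # look for a further ORF in this frame
    return best


# Toy transcript: short untranslated flanks around a 15-bp ORF.
print(longest_orf("CCATGAAATTTGGGTAACC"))
```

Because the complete reading frame is recovered, each codon position is known, which is what makes the distinction between synonymous and non-synonymous sites tractable.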
Given the abovementioned requirements, we avoided RAD-seq, which
produces short, blind markers for which alignment and orthology
assessment may be challenging. Subjecting all samples to transcriptome
sequencing was not feasible because it allows neither the use of
historical, ethanol-preserved collections nor the pooling of as many
samples per sequencing run. Consequently, it would result in decreased
species/specimen representation and/or inflated costs. Sequence capture
approaches face two important challenges in target selection: 1) the
assessment of orthology, as only single-copy markers that are
orthologous across all taxa under study are phylogenetically informative
(Teasdale et al., 2016), and 2) the need to identify intron-exon
boundaries to select exome targets (Karin et al., 2019; Portik et al.,
2016). Both of these challenges typically require a reference genome (Bi
et al., 2012; Bragg et al., 2016; Portik et al., 2016). In
invertebrates, especially the mollusks we are concerned with here,
orthology assessments are usually undertaken with very distant genomes
due to the paucity of well-assembled genomes, e.g. divergence >400 Ma
for Lottia vs Eupulmonata in gastropods and Bathymodiolus vs Unionidae
in bivalves (Combosh et al., 2017;
Pfeiffer et al., 2019; Sun et al., 2019; Teasdale et al., 2016). Such
ancient divergences imply that ortholog assessments for the reference
may differ substantially from those for ingroup taxa. Here we relax the
need for well-assembled reference genomes by assessing orthology from
existing genomic databases and representative ingroup transcriptomes.
Additionally, by focusing on entire ORFs as functional biological units,
instead of individual exons, we do not need to establish intron-exon
boundaries prior to target enrichment. Whereas our proposed strategy
enhances versatility, various issues could complicate target enrichment
of entire ORFs, notably their subdivision into multiple exons. If exons
are regularly shorter than the probe length, many probes will be tiled
over exon boundaries within ORFs, which could drastically reduce the
enrichment efficiency in genomic libraries. Evaluation of various
metazoan genomes indicated that genes consist on average of several
short exons (number: 8.20 ± 1.90; length: 196 ± 69 bp; mean ± sd) that
are separated from one another by much longer introns (length: 3079 ± 2063
bp) (Zhu et al., 2009). The number and lengths of exons in transcripts,
the length of probes, the level of divergence between probes and targets
and the length distribution of genomic library fragments are all
important factors that could influence the success of our proposed
strategy.
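The concern about probes tiled over exon boundaries can be illustrated with a short calculation. The sketch below tiles probes along a spliced ORF and reports the fraction that overlap at least one exon-exon junction, i.e. probes that are interrupted by an intron in genomic DNA. The function name, the 120-bp probe length and the 60-bp tiling step are assumptions chosen for illustration; only the exon number and mean exon length are taken from the averages cited above.

```python
from itertools import accumulate


def junction_spanning_fraction(exon_lengths, probe_len, step):
    """Fraction of tiled probes that overlap at least one exon-exon junction.

    Probes are tiled along the spliced ORF (transcript coordinates), so any
    probe crossing a junction is interrupted by an intron in genomic DNA.
    """
    total = sum(exon_lengths)
    if probe_len > total:
        raise ValueError("probe is longer than the ORF")
    # Junction j lies between transcript bases j-1 and j.
    junctions = list(accumulate(exon_lengths))[:-1]
    starts = range(0, total - probe_len + 1, step)
    spanning = sum(
        any(s < j < s + probe_len for j in junctions) for s in starts
    )
    return spanning / len(starts)


# Hypothetical gene: 8 exons of 196 bp (averages cited above), 120-bp probes
# tiled every 60 bp (probe length and step are assumptions).
fraction = junction_spanning_fraction([196] * 8, probe_len=120, step=60)
print(f"{fraction:.2f} of probes span an exon boundary")
```

Varying `exon_lengths` and `probe_len` in this sketch shows how quickly the proportion of interrupted probes rises as exons become short relative to the probes.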
In light of these considerations, we describe here an
enrichment strategy for integrative studies of microevolution and
macroevolution in the Afrotropical freshwater bivalve tribe Coelaturini
(Parreysiinae; Unionidae) that can be readily expanded to other
non-model organisms. We present a new approach to select orthologous
single-copy genes from ingroup transcriptome assemblies, partly based on
manual data curation (see Teasdale et al., 2016) and a strategy to
successfully enrich their entire ORFs in genomic libraries.
Additionally, we developed (to our knowledge) the first set of UCEs for
bivalves and gastropods (and Mollusca altogether). We evaluate the
performance of target enrichment for these heterogeneous targets, and
analyze the obtained datasets to illustrate their value for
phylogenetics and population genetics. Finally, we skimmed the raw
sequencing data to evaluate the possibility of recovering off-target
mitochondrial sequences, on which previous Sanger-sequencing studies of
Unionidae (Lopes-Lima, Froufe, et al., 2017; Ortiz-Sepulveda et al.,
2020; Whelan et al., 2011), and bivalves in general (Combosh et al.,
2017), have relied heavily.