Target enrichment and sequencing
We validated our selection of
targets, bait design and our molecular protocols with four target
enrichment reactions. Pools were created at equimolar concentrations and
enriched with a ~19 h incubation period at 65 °C and
14-16 cycles in the post-hybridization PCR reaction. After enrichment,
we purified the resulting products using 0.8X magnetic beads and
rehydrated the enriched pool in 30 µl TLE. Post-enrichment libraries
were quantified with Qubit and BioAnalyser, after which the results of
multiple reactions were pooled at equimolar concentration and sequenced
paired-end (2×150 bp) on an Illumina NextSeq 500 of the LIGAN-MP
Genomics Platform of UMR8199.
Bioinformatic
analysis of sequencing results
ORFs with HybPiper. Demultiplexed raw reads were
cleaned with TrimGalore!, CutAdapt and
Trimmomatic, performing iterative FastQC quality
controls at each step. The cleaned paired-end reads of each specimen
were subjected to the HYBPIPER workflow v.1.3.1 (Johnson et al., 2016).
This workflow starts by mapping reads to our ORF targets and sorting
them per ORF in separate directories after removal of paired duplicated
reads with BLASTX and BWA v. 0.7.17 (H. Li & Durbin, 2009).
Subsequently, the reads per ORF are subjected to de novo assembly
using SPAdes v. 3.14.0 (Bankevich et al., 2012) resulting in
the construction of a supercontig per ORF and per specimen. The
resulting contigs are sorted and the boundaries between exons and
flanking regions are predicted with EXONERATE v. 2.4.0 (Slater &
Birney, 2005). Alignments of the supercontigs, which include exons and
intronic/intergenic flanking regions, were performed with MAFFT v.7.475
(Katoh & Standley, 2013) and MUSCLE, using our ORF targets to verify
exon predictions.
UCEs with Phyluce. Demultiplexed and cleaned reads
were also mapped to our UCE targets and treated with PHYLUCE. We
assembled contigs with Trinity, which subsequently underwent
orthology detection, paralog removal and the matching of contigs to
targets with lastz (Harris, 2007), after which they were
aligned with MAFFT or MUSCLE. Two essential parameters in the PHYLUCE
pipeline are –min_identity and –min_coverage
(–min_kmer_coverage was set to two), and we examined combinations
of each factor between 50 and 80% at intervals of 5% (49 combinations
in total). Many of these combinations are more permissive than the
recommendations of Bossert and Danforth (2017), but we used additional
downstream processing to eliminate potential contaminant sequences.
Parameter combinations were evaluated using the proportion of unique
contigs retrieved and the total number of UCEs retrieved for all 96
samples. The resulting data were analysed in R v. 3.6.1 (R Core Team,
2019).
Enrichment balance between
ORFs and UCEs. To verify the enrichment balance between ORFs and UCEs,
we examined the proportion of reads that mapped on either type of
target. This task has been performed starting from the reads that were
identified with HYBPIPER and PHYLUCE as mapping to ORF or UCE targets,
respectively, which also provides information on the specificity of our
enrichment strategy. More detailed information is obtained by extracting
kmers of 18 to 21 bases from various .fasta files with
Jellyfish v. 2.2.10 (Marcais & Kingsford, 2011), i.e. first
from .fasta files that contain the ORF and UCE target sequences, and
then for a sample of specimens in our dataset. Each kmer that occurs in
both the specimen and ORF dictionary is considered an ORF hit, whereas
each kmer that occurs in the specimen and UCE dictionary is considered
an UCE hit. The ratio of hits of both types indicates the enrichment
balance between both sets of targets.
Mitogenome skimming.Unionoida and some related bivalves have double uniparentally inherited
mitochondria, which appears to be unique within the animal kingdom and
which may play a role in sex determination (Breton et al., 2011; Zouros
et al., 1994). Against this background, and the abundant use of
mitochondrial fragments in previous phylogenetic inferences for
Unionidae (see above), we performed mitogenome skimming in two steps.
The tissues we used for genomic library preparation predominantly or
exclusively contain maternally-inherited mitochondria (Breton et al.,
2011; Froufe et al., 2020). First, we aligned reads on the mitochondrial
genome of Pygandon grandis (NCBI GenBank: NC_013661;
Breton et al., 2009) with Bowtie2 v. 2.3.5 (Langmead &
Salzberg, 2012) to produce a .bam file, which was sorted and indexed
with functions of SAMTools. Subsequently, the files for all
individuals were merged with SAMTools and visualized with IGV.
This merged .bam file was converted into .fastq files after which the
reads were assembled with SPAdes. The resulting assembly was
aligned to the Pyganodon grandis mitogenome and visually
inspected to retain unambiguous contigs. The retained contigs were used
to recover reads for 48 Coelaturini specimens from the Malawi Basin
using again Bowtie2, SAMTools and SPAdes,
after which individual assemblies were reassembled with Trinity. The
total assembly was evaluated and the procedure was iterated each time
with laterally extended starting contigs. Annotations were performed
with MITOS (Bernt et al., 2013).
Macroevolutionary
analyses
Alignment of UCE and ORF data. A single transcriptome was
available in NCBI GenBank for the subfamily Parreysiinae at the
start of these analyses, i.e. that of Scabies phaselus (Lea,
1856) (SRX5281799; Pfeiffer et al., 2019), which was hence used as
outgroup for phylogenetics. We used HYBPIPER as described above to
recover information on our ORF targets for this outgroup. Phylogenetic
analysis of the ORFs was performed using the supercontigs reconstructed
for ingroup taxa with HYBPIPER. These supercontigs contain exons and
intronic/intergenic flanking regions; alignments were produced with
MAFFT. The UCEs were processed for phylogenetic analysis with PHYLUCE.
Alignments for each UCE locus were obtained using
phyluce_align_seqcap_align with MAFFT as aligner and without
edge-trimming. Phylogenetic datasets of both UCEs and ORFs were filtered
to retain target loci with >50% taxon completeness. ORF
and UCE alignments were trimmed separately using BMGE v.1.2 (Cruscuolo
& Gribaldo, 2010) with a maximum gap rate per sequence and character of
0.3 and a maximum entropy threshold of 0.4 to remove ambiguous regions.
After concatenating trimmed single-locus alignments with AMAS v. 1.0
(Borowiec, 2016), we used Spruceup v. 2020.2.19 (Borowiec,
2019) to detect outliers, which were replaced as missing data (windows
size=20, overlap=15, criterion=lognorm, cutoff=0.99). Alignment
statistics were calculated with AMAS.
Phylogenetic inference. We performed phylogenetic analysis with
Maximum Likelihood (ML) on concatenated datasets for ORFs and UCEs
separately to evaluate the congruence between both. We used AMAS to
generate a partition file for the UCEs with the sliding-window site
characteristics method based on site entropies (SWSC-EN; Tagliacollo &
Lanfear, 2018) to define the limits of UCE ‘core’ and flanking regions.
A single partition was considered for the supercontig of each ORF.
Concatenated datasets and the partitioning information were subjected to
phylogenetic inference in IQTREE v. 2.0.3 (Minh et al., 2020), using the
integrated ModelFinder (Kalyaanamoorthy et al., 2017) to
determine the best-fit substitution model for each partition. We used
1000 bootstrap replicates; branch
support values were calculated with a Shimodaira-Hasegawa approximate
likelihood ratio test (SH-aLRT; Guindon et al., 2010) and ultrafast
Bootstrap (UFBoot2; Hoang et al., 2018).