Evaluation of target regions
Of the 84,484 ART_Illumina reads, 66,804 mapped onto the Venustaconcha genome (79.1%), all of which were unique hits.
These hits covered 3,931 of our 5,221 target regions (75.3%): 631
of our 633 complete single-copy BUSCO ORFs (99.7%), all of our
complete duplicate BUSCO ORFs, 295 of 296 (99.7%) Unioverse
ORFs, 996 of the 1,255 stringent UCEs (79.4%; i.e. UCEs found in at
least 6 of our 7 genomes) and 1,824 of our 2,852 less stringent UCEs (64.0%; UCEs
found in 5 of our 7 genomes), indicating a higher mapping efficiency for ORFs
than UCEs. This mapping indicated that ORFs were regularly (but not
always) retrieved completely on the same Venustaconcha contig,
and that we could expect to retrieve multiple exons per ORF. In the Venustaconcha genome these exons were typically larger than 200
nt and often separated from other exons of the same ORF by 1,000s or
10,000s of nt.
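The per-category percentages above follow directly from the reported counts, and can be recomputed as a quick consistency check. The sketch below is illustrative only: the variable and function names are assumptions, and the number of duplicate BUSCO ORFs is not stated in the text, so it is derived here as the remainder of the totals.

```python
# Counts taken from the text: (targets hit, targets designed) per category.
counts = {
    "single-copy BUSCO ORFs": (631, 633),
    "Unioverse ORFs": (295, 296),
    "stringent UCEs": (996, 1255),
    "less stringent UCEs": (1824, 2852),
}
TOTAL_HIT, TOTAL_TARGETS = 3931, 5221


def percent(hit, total):
    """Return hit/total as a percentage rounded to one decimal."""
    return round(100 * hit / total, 1)


recovery = {name: percent(h, t) for name, (h, t) in counts.items()}

# Assumption: the duplicate BUSCO ORFs make up whatever is left of the totals.
# This count is derived, not reported in the text.
dup_targets = TOTAL_TARGETS - sum(t for _, t in counts.values())
dup_hit = TOTAL_HIT - sum(h for h, _ in counts.values())

# Pool the categories into ORFs and UCEs to compare mapping efficiency.
orf_hit = 631 + 295 + dup_hit
orf_total = 633 + 296 + dup_targets
uce_hit = 996 + 1824
uce_total = 1255 + 2852

for name, pct in recovery.items():
    print(f"{name}: {pct}%")
print("all ORFs:", percent(orf_hit, orf_total), "%")
print("all UCEs:", percent(uce_hit, uce_total), "%")
```

The derived remainder is the same for hit and designed targets, which agrees with the statement that all duplicate BUSCO ORFs were covered.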
Our targets for bait design covered 5,221 genomic regions with a total length
of 2,272,996 nt. Of the 40,269 raw probes, 37,959 passed quality control
(94.3%). The impact of this filtering on the overall
coverage of our target regions was minimal; however, for three UCEs we
were not able to develop any probes, and for four ORFs the discarded
probes resulted in gaps of >300 nt, so that these ORFs were
expected to be incompletely covered upon target enrichment.
ORF and UCE recovery
On average, we obtained over 3 million reads per sample (range: 35,246
to 9,132,732), of which (mean±sd) 61.81±13.94% were on target. Expressed as a share of
all reads, 60.83±14.15% related to ORFs and only 0.97±0.73% to
UCEs. There was a weak but significant positive correlation between the
total number of reads per sample and the proportion of on-target reads
(R² = 0.045, F = 4.442, df = 1, 94,
p = 0.038), but no trend in the total number of reads per
phylogenetic clade within Coelaturini
(R² = 0.020, F = 1.961, df = 1, 94,
p = 0.165; Fig. S1). We observed unbalanced enrichment of UCEs
versus ORFs: whereas the UCE regions contain 26.17% of all
targeted nucleotides, the ratio of UCE k-mer hits to ORF k-mer
hits is around 0.05%, indicating a substantial underrepresentation of
UCEs compared to ORFs.
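For a regression with a single predictor, R² and F are linked through the degrees of freedom, so the two reported test results can be checked for internal consistency. The sketch below assumes both tests are ordinary least-squares fits with 1 model and 94 residual degrees of freedom; the function names are illustrative and not taken from the analysis itself.

```python
def f_from_r2(r2, df_model, df_resid):
    """F statistic implied by R^2: (R^2 / df_model) / ((1 - R^2) / df_resid)."""
    return (r2 / df_model) / ((1.0 - r2) / df_resid)


def r2_from_f(f, df_model, df_resid):
    """Inverse relation: R^2 = F * df_model / (F * df_model + df_resid)."""
    return f * df_model / (f * df_model + df_resid)


# Reported (R^2, F) pairs from the text; df = 1 and 94 is assumed for both.
reported = {
    "reads vs. on-target proportion": (0.045, 4.442),
    "reads vs. phylogenetic clade": (0.020, 1.961),
}

for label, (r2, f) in reported.items():
    implied_r2 = r2_from_f(f, 1, 94)
    implied_f = f_from_r2(r2, 1, 94)
    print(f"{label}: R^2 from F = {implied_r2:.3f}, F from rounded R^2 = {implied_f:.3f}")
```

Going from F to R² is the more informative direction here, because the reported R² values are rounded to three decimals and small rounding differences are amplified when converting back to F.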
On average, over 1,102 of the 1,114 ORFs were consistently enriched and
mapped for all 95 unionids (857 were consistently recovered for over 50%
of their length in all specimens), with the exception of the distant
iridinid specimen (dna0240; see Fig. 2). HYBPIPER detected hidden
paralogy in at most 2 ORFs per specimen. As the number of reads obtained
for a sample decreases, we see a gradual decrease in the recovery of
ORFs, which becomes more marked for samples with <500,000
reads (n = 12). As for UCEs, we recovered data for up to 1,905 out
of the 4,104 UCEs (46.5%), and the coverage per sample was proportional
to the number of reads (linear model: r² = 0.557, p < 0.001), as was
observed for the ORFs (Fig. 2). The combination of 55% and 60%
thresholds on sequence coverage and identity, respectively, maximized
the total yield over all specimens, but it decreased the number of
retrieved UCEs slightly to 1,895. On average, 281 UCEs were covered per
individual (range: 30-473; total of 26,982 regions recovered for 96
samples). The number of recovered UCEs and the proportion of unique
contigs recovered per individual decreased gradually as the thresholds
on sequence coverage and identity were altered, with more abrupt
decreases when the threshold on %identity was increased to ≥70% (Fig.
3). The consistency with which UCEs were recovered across taxa was low: 37
and 276 UCEs were recovered in >75% and >50%
of individuals, respectively. The length distribution of the retained
UCEs is highly similar to that of all UCEs (Fig. S2).
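The threshold-based filtering and the per-UCE consistency counts described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the record layout (`uce`, `sample`, `coverage`, `identity`), the function names, the toy data and the inclusive treatment of the coverage and identity thresholds are all hypothetical.

```python
from collections import defaultdict


def filter_hits(hits, min_coverage=55.0, min_identity=60.0):
    """Keep hits meeting both thresholds (inclusive bounds are an assumption)."""
    return [
        h for h in hits
        if h["coverage"] >= min_coverage and h["identity"] >= min_identity
    ]


def uces_recovered_in(hits, n_samples, min_fraction):
    """Return UCEs recovered in more than min_fraction of the samples."""
    samples_per_uce = defaultdict(set)
    for h in hits:
        samples_per_uce[h["uce"]].add(h["sample"])
    return sorted(
        uce for uce, samples in samples_per_uce.items()
        if len(samples) / n_samples > min_fraction
    )


# Hypothetical toy data: three UCEs in four samples (values in percent).
toy_hits = [
    {"uce": "uce1", "sample": "s1", "coverage": 90, "identity": 95},
    {"uce": "uce1", "sample": "s2", "coverage": 80, "identity": 88},
    {"uce": "uce1", "sample": "s3", "coverage": 70, "identity": 75},
    {"uce": "uce1", "sample": "s4", "coverage": 60, "identity": 65},
    {"uce": "uce2", "sample": "s1", "coverage": 58, "identity": 62},
    {"uce": "uce2", "sample": "s2", "coverage": 50, "identity": 90},  # fails coverage
    {"uce": "uce2", "sample": "s3", "coverage": 75, "identity": 59},  # fails identity
    {"uce": "uce2", "sample": "s4", "coverage": 56, "identity": 61},
    {"uce": "uce3", "sample": "s1", "coverage": 95, "identity": 99},
    {"uce": "uce3", "sample": "s2", "coverage": 85, "identity": 70},
    {"uce": "uce3", "sample": "s3", "coverage": 66, "identity": 80},
]

kept = filter_hits(toy_hits)
in_over_half = uces_recovered_in(kept, n_samples=4, min_fraction=0.50)
in_over_three_quarters = uces_recovered_in(kept, n_samples=4, min_fraction=0.75)
print(len(kept), in_over_half, in_over_three_quarters)
```

Rerunning `filter_hits` with a range of `min_coverage` and `min_identity` values would give the kind of threshold scan summarized in Fig. 3.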