Selection of ORF targets
To identify target ORFs for orthology assessment, we first clustered the
contigs of our 12 Agalma transcriptomes into a
supertranscriptome using CD-Hit-Est v. 4.8.1 (W. Li & Godzik,
2006) with a 95% similarity threshold, retaining the longest variant
per cluster. We predicted ORFs for the supertranscriptome with
TransDecoder v. 5.5.0 (Haas & Papanicolaou, 2018) in two
steps. First, we identified all likely coding regions, from which we
retained long ORFs, after which we selected the best supported ORF per
transcript. These ORF predictions were clustered again with
CD-Hit-Est at 95% similarity to create homologous clusters.
These homologous ORF clusters were expected to still contain a high
level of redundancy because of fragmentation, frameshifts, mis-indexing,
mis-assembly or the potential existence of isoforms or paralogs. We
evaluated orthology in two steps. First, we mapped our ORF clusters
against the BUSCO Metazoa_odb9 database (Waterhouse et al.,
2017) and against the Unioverse probe set (Pfeiffer et al.,
2019) to identify targets that have high chances of being single-copy
and orthologous (=candidate genes). This approach does not allow to
detect unexpressed homologs or weakly divergent paralogs, so additional
follow-up verifications are performed during probe design (see below).
We extracted all complete BUSCO hits, which consist of single-copy hits
and multi-copy hits. For ORFs with single-copy BUSCO hits the likelihood
of multiple gene copies is small, given the gene is included in the
single-copy orthologous database of BUSCO, and given the lack of
expression variants with >95% similarity after clustering
12 ingroup transcriptomes. Multi-copy BUSCO hits imply that at least two
ORF homologous clusters map to a BUSCO gene and these were manually
curated to distinguish among the following scenarios: 1) multiple copies
may indicate the existence of distant homologs or paralogs, 2) assembly
errors may have resulted in the creation of multiple contigs for the
same gene, 3) various isoforms may exist of a single expressed RNA
fragment. All ORFs that map to an individual BUSCO were merged into a
.fasta file after which contigs were aligned with MUSCLE v.3.8.1551
(Edgar, 2004) and the nucleotide and protein sequences of homologs were
visually inspected in SeaView v. 5 (Galtier et al., 1996; Gouy
et al., 2010). If the presence of paralogs was suspected in this
evaluation, we rejected all ORFs for the respective BUSCO hit. If
different splicing or minor difficulties in assembly were suspected, the
longest ORF was retained.
A similar approach was used to evaluate the 811 loci from the
Unioverse probe set, which was developed for anchored hybrid
enrichment using the distant Bathymodiolus platifrons genome
(Pfeiffer et al., 2019). This probe set contains 173,707 nucleotides, on
average 214 per locus. To avoid overlap with targets that were already
retained from the abovementioned BUSCO comparisons, we first screened
the Unioverse loci against the BUSCO Metazoa_odb9
database and against our already retained ORFs with Yass v.
1.15 (NoƩ & Kucherov, 2005). In total 109 Unioverse loci were
accounted for by these verifications, leaving 702 loci to be examined.
Our remaining ORF clusters produced hits on all 702 remaining
Unioverse loci, and all hits for the same Unioverse
locus were compiled and aligned. When several Unioverse loci
mapped onto the same ORF, we also performed alignments including all
concerned Unioverse loci and all associated ORFs in
SeaView. The subsequent evaluation of candidate genes followed
the criteria indicated above for our BUSCO evaluation. Finally, we
mapped all retained ORF targets and any subregions within these ORFs to
one another with Yass to examine the percentage of sequence
similarity over a certain alignment length. ORFs were removed if
alignment lengths and similarities were judged to potentially interfere
with target enrichment.