Annotations of the genome
Repeat sequences in the genome comprise simple sequence repeats
(SSRs), moderately repetitive sequences, and highly repetitive
sequences. The MISA tool (Thiel, Michalek, Varshney, & Graner, 2003)
was used to search for SSRs in the T. dalaica genome, with
default parameters applied. Tandem repeats were identified using Tandem
Repeats Finder 4.07b (Benson, 1999). RepeatMasker (Tarailo-Graovac &
Chen, 2009) was used to identify known transposable elements (TEs)
present in the T. dalaica genome, with Repbase (v. 22.11) as the
reference library (Bao, Kojima, & Kohany, 2015). RepeatModeler v.1.0.11 and
LTR_finder were also used to identify possible transposable elements
de novo, with default settings applied (Tarailo-Graovac & Chen, 2009;
Z. Xu & Wang, 2007). Next, all the TEs were identified using
RepeatMasker (Tarailo-Graovac & Chen, 2009).
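The SSR search described above can be illustrated with a minimal sketch. It is not the MISA implementation: the function name find_ssrs and the minimum repeat counts in MIN_REPEATS are hypothetical values chosen for illustration and are not the default MISA parameters used in this study. The sketch reports only perfect repeats of motifs 1 to 6 bp long.

```python
import re

# Hypothetical minimum copy numbers per motif length (illustrative only,
# not the MISA defaults applied in the study).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (start, end, motif, copies) for perfect SSRs (0-based, half-open)."""
    seq = seq.upper()
    hits = []
    for size, copies in sorted(min_repeats.items()):
        # One motif of `size` bases followed by at least copies-1 further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, copies - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ATAT"),
            # since those are already covered by the shorter motif length.
            if any(motif == motif[:k] * (size // k)
                   for k in range(1, size) if size % k == 0):
                continue
            hits.append((m.start(), m.end(), motif, (m.end() - m.start()) // size))
    return hits


if __name__ == "__main__":
    example = "GGC" + "AT" * 7 + "CCG" + "A" * 12 + "T"
    for hit in find_ssrs(example):
        print(hit)
```

Compound and imperfect SSRs, which dedicated tools can also report, are deliberately left out of this sketch.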
Homology-based ncRNA annotation was performed by scanning the covariance
models of rRNA, miRNA, and snRNA genes deposited in the Rfam database
(release 13.0) (Kalvari et al., 2018) against candidate regions in the
T. dalaica genome, which were first detected using BLASTN
(E-value ≤ 1e−5) (Camacho et al., 2009). The tRNAscan-SE (v1.3.1) search
server (Lowe & Eddy, 1997) and the RNAmmer v1.2 server (Lagesen et al.,
2007) were also used to predict tRNAs and rRNAs, respectively, with
default settings applied.
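As a rough sketch of how BLASTN hits might be turned into candidate regions for the covariance-model scan, the example below filters hits at E-value ≤ 1e−5 and merges overlapping hits per scaffold. It assumes the BLASTN results are in the standard 12-column tabular format (-outfmt 6) with Rfam sequences as queries and the genome as subject; the function name candidate_regions and the merging step are assumptions for illustration, not details reported for this study.

```python
from collections import defaultdict


def candidate_regions(blast_lines, max_evalue=1e-5):
    """Merge significant BLASTN hits (tabular, 12 columns) into genome intervals.

    Returns {scaffold: [(start, end), ...]} with 1-based inclusive coordinates.
    """
    by_scaffold = defaultdict(list)
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if float(fields[10]) > max_evalue:  # column 11 holds the E-value
            continue
        # Subject coordinates are reversed for minus-strand hits, so sort them.
        start, end = sorted((int(fields[8]), int(fields[9])))
        by_scaffold[fields[1]].append((start, end))

    merged = {}
    for scaffold, intervals in by_scaffold.items():
        intervals.sort()
        out = [list(intervals[0])]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:  # overlaps the previous interval
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[scaffold] = [tuple(iv) for iv in out]
    return merged
```

Each merged interval could then be extracted and searched with the Rfam covariance models, which keeps the more expensive search restricted to regions with sequence-level support.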
De novo, homology-based, and transcriptome-based strategies were
combined to predict possible protein-coding genes. For de novo prediction, a
variety of software, including Augustus v3.3 (Stanke, Steinkamp, Waack,
& Morgenstern, 2004), GeneID v1.4.4 (Blanco, Parra, & Guigo, 2007),
GlimmerHMM (Majoros, Pertea, & Salzberg, 2004), and SNAP (Korf, 2004),
was used. For homology-based prediction, the proteomes of six species
(Astyanax mexicanus, Danio rerio, Ictalurus punctatus, Takifugu rubripes,
Triplophysa siluroides, and Xiphophorus maculatus) were mapped onto the
T. dalaica genome, and each protein sequence was compared against its
best-aligned region, with possible coding regions predicted
using GeneWise v2.2.0 (Birney & Durbin, 2000). For transcriptome-based
prediction, RNA-seq reads were mapped to the genome using TopHat
(Trapnell et al., 2012), and subsequently assembled into gene models
(Cufflinks-set) using Cufflinks (Roberts, Trapnell, Donaghey, Rinn, &
Pachter, 2011). The Cufflinks-set was then fed into PASA (Haas et al.,
2003) to identify the donor and acceptor sites of possible exon regions,
with the resulting coding regions predicted using TransDecoder
(https://github.com/TransDecoder/TransDecoder/wiki). Finally, to
generate a consensus gene set, EVidenceModeler (EVM) v1.1.1 (Haas et
al., 2008) was used to integrate all gene models predicted by the de
novo, homology-, and transcriptome-based strategies. Low-quality genes,
i.e., those encoding fewer than 50 amino acids and/or harboring premature
termination or frameshifts, were then removed from the gene set.
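The low-quality filter can be sketched as follows. This is a simplified illustration, not the procedure used in the study: the function name is_low_quality is hypothetical, the input is assumed to be the coding sequence (CDS) of each gene model, a CDS length that is not a multiple of three is used as a crude proxy for a frameshift, and only the standard stop codons are considered.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def is_low_quality(cds, min_aa=50):
    """Return True if a CDS should be discarded under the assumed criteria."""
    cds = cds.upper()
    # Assumption: a length that is not a multiple of 3 indicates a frameshift.
    if len(cds) % 3 != 0:
        return True
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    # A terminal stop codon is expected and does not count as an amino acid.
    if codons and codons[-1] in STOP_CODONS:
        codons = codons[:-1]
    # Any remaining stop codon is a premature termination.
    if any(codon in STOP_CODONS for codon in codons):
        return True
    # Too few encoded amino acids.
    return len(codons) < min_aa
```

For example, a gene set held as a dictionary of CDS strings could be filtered with {name: cds for name, cds in gene_set.items() if not is_low_quality(cds)}, where gene_set is a hypothetical variable.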
Functional annotation of the predicted T. dalaica genes was
performed by searching the nr, KOG, UniProt (release 2018_10), and KEGG
(release 84.0) databases, using BLAST with an E-value cutoff of 1e−5 (Camacho
et al., 2009). Descriptions and KEGG pathways were then extracted from
the best hit sequence. Next, InterProScan (Quevillon et al., 2005) was
used to annotate predicted genes based on the InterPro database
(5.21-60.0), with GO terms assigned according to the best hits.
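Selecting the best hit for each predicted gene can be sketched as below, assuming the BLAST searches were written in the standard 12-column tabular format. The function name best_hits is hypothetical, and ranking by lowest E-value with ties broken by highest bit score is an assumption, since the text does not state how the best hit was defined.

```python
def best_hits(blast_lines, max_evalue=1e-5):
    """Return {query: subject} for the best-scoring hit of each query.

    Hits above the E-value cutoff are ignored. The best hit is taken to be
    the one with the lowest E-value, then the highest bit score (assumed).
    """
    best = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        key = (evalue, -bitscore)  # smaller key means better hit
        if query not in best or key < best[query][0]:
            best[query] = (key, subject)
    return {query: subject for query, (_, subject) in best.items()}
```

The description and pathway of each subject identifier would then be looked up in the corresponding database to annotate the query gene.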