Annotations of the genome
Repeat sequences in the genome comprise simple sequence repeats (SSRs), moderately repetitive sequences, and highly repetitive sequences. The MISA tool (Thiel, Michalek, Varshney, & Graner, 2003) was used with default parameters to search for SSRs in the T. dalaica genome. Tandem repeats were identified using Tandem Repeats Finder 4.07b (Benson, 1999). RepeatMasker (Tarailo-Graovac & Chen, 2009) was used to identify known transposable elements (TEs) in the T. dalaica genome, with Repbase (v. 22.11) as the query (Bao, Kojima, & Kohany, 2015). RepeatModeler v.1.0.11 and LTR_finder were also used with default settings to identify possible TEs de novo (Tarailo-Graovac & Chen, 2009; Z. Xu & Wang, 2007). Finally, all TEs were identified using RepeatMasker (Tarailo-Graovac & Chen, 2009).
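To illustrate what an SSR search involves, the following is a minimal Python sketch of a regex-based microsatellite scan. It is not MISA itself and does not reproduce its output format. The function name find_ssrs and the minimum repeat counts in MIN_REPEATS are assumptions chosen for illustration, not settings reported in this study.

```python
import re

# Hypothetical thresholds: motif length -> minimum number of tandem copies.
# These are illustrative assumptions, not parameters taken from the study.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, end, motif, copies) tuples for perfect SSRs.

    Coordinates are 0-based and half-open.
    """
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # A motif of unit_len bases followed by at least (min_rep - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ATAT"),
            # since those loci are reported under the shorter motif.
            redundant = any(
                motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            )
            if redundant:
                continue
            copies = (m.end() - m.start()) // unit_len
            hits.append((m.start(), m.end(), motif, copies))
    return sorted(hits)
```

Called on a short test string such as "GGG" + "AT" * 7 + "CCC" + "A" * 12 + "G", the sketch reports one dinucleotide locus and one mononucleotide locus. Compound and imperfect SSRs are outside its scope.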
Homology-based ncRNA annotation was performed by scanning the covariance models of rRNA, miRNA, and snRNA genes deposited in the Rfam database (release 13.0) (Kalvari et al., 2018) against candidate regions of the T. dalaica genome, which were preliminarily detected using BLASTN (E-value ≤ 1e−5) (Camacho et al., 2009). The tRNAscan-SE (v1.3.1) search server (Lowe & Eddy, 1997) and the RNAmmer v1.2 server (Lagesen et al., 2007) were also used with default settings to predict tRNAs and rRNAs, respectively.
De novo, homology-based, and transcriptome-based strategies were combined to predict protein-coding genes. For de novo prediction, Augustus v3.3 (Stanke, Steinkamp, Waack, & Morgenstern, 2004), GeneID v1.4.4 (Blanco, Parra, & Guigo, 2007), GlimmerHMM (Majoros, Pertea, & Salzberg, 2004), and SNAP (Korf, 2004) were used. For homology-based prediction, the proteomes of six species (Astyanax mexicanus, Danio rerio, Ictalurus punctatus, Takifugu rubripes, Triplophysa siluroides, and Xiphophorus maculatus) were mapped onto the T. dalaica genome, each protein sequence was compared against its best-aligned region, and possible coding regions were predicted using GeneWise v2.2.0 (Birney & Durbin, 2000). For transcriptome-based prediction, RNA-seq reads were mapped to the genome using TopHat (Trapnell et al., 2012) and subsequently assembled into gene models (Cufflinks-set) using Cufflinks (Roberts, Trapnell, Donaghey, Rinn, & Pachter, 2011). The Cufflinks-set was then fed into PASA (Haas et al., 2003) to identify the donor and acceptor sites of possible exon regions, and the resultant coding regions were predicted using TransDecoder (https://github.com/TransDecoder/TransDecoder/wiki). Finally, to generate a consensus gene set, EVidenceModeler (EVM) v1.1.1 (Haas et al., 2008) was used to integrate all gene models predicted by the de novo, homology-based, and transcriptome-based strategies. Low-quality genes, defined as those encoding fewer than 50 amino acids and/or harboring premature termination codons or frameshifts, were removed from the gene set.
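The final low-quality filter can be sketched as follows. This is a minimal Python illustration, not the pipeline used in the study. It assumes that each gene is supplied as a coding sequence (CDS) string, that the standard genetic code applies, and that a CDS length not divisible by three is an acceptable proxy for a frameshift. The names is_low_quality and filter_gene_set are hypothetical.

```python
# Stop codons of the standard genetic code (an assumption of this sketch).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_low_quality(cds, min_aa=50):
    """Return True if a CDS fails any of the three quality criteria."""
    cds = cds.upper()
    # Frameshift proxy: the CDS cannot be split into whole codons.
    if len(cds) % 3 != 0:
        return True
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    # A terminal stop codon is expected and is not counted as an amino acid.
    if codons and codons[-1] in STOP_CODONS:
        codons = codons[:-1]
    # Premature termination: a stop codon anywhere before the last codon.
    if any(codon in STOP_CODONS for codon in codons):
        return True
    # Too short: fewer than min_aa encoded amino acids.
    return len(codons) < min_aa


def filter_gene_set(cds_by_gene, min_aa=50):
    """Keep only genes whose CDS passes every criterion."""
    return {
        gene: cds
        for gene, cds in cds_by_gene.items()
        if not is_low_quality(cds, min_aa)
    }
```

Given a dictionary mapping gene identifiers to CDS strings, filter_gene_set returns the subset that would be retained under these assumptions.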
Functional annotation of the predicted T. dalaica genes was performed by searching the nr, KOG, UniProt (release 2018_10), and KEGG (release 84.0) databases using BLAST with an E-value cutoff of 1e−5 (Camacho et al., 2009). Descriptions and KEGG pathways were then extracted from the best-hit sequence. Next, InterProScan (Quevillon et al., 2005) was used to annotate predicted genes based on the InterPro database (5.21-60.0), with GO terms assigned according to the best hits.
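The best-hit selection described above can be sketched in Python as follows. The sketch assumes that the BLAST results are in the standard 12-column tabular format (-outfmt 6), with the E-value in column 11 and the bit score in column 12, and that the best hit is the one with the highest bit score among hits passing the E-value cutoff. Neither the output format nor the tie-breaking rule is stated in the text, and the function name best_hits is hypothetical.

```python
import csv
import io


def best_hits(tabular_text, max_evalue=1e-5):
    """Map each query ID to (subject ID, E-value, bit score) of its best hit."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        # Discard hits that fail the E-value cutoff.
        if evalue > max_evalue:
            continue
        current = best.get(query)
        # Keep the hit with the highest bit score for each query.
        if current is None or bitscore > current[2]:
            best[query] = (subject, evalue, bitscore)
    return best
```

Descriptions or pathway identifiers could then be looked up by the subject ID of each retained hit. That lookup depends on the database files and is not shown.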