Genome structure prediction and annotation
Repeat elements in the genome were identified using multiple tools and databases. RepeatMasker (v3.3.0) was run with the settings -nolow -norna -no_is (Tarailo-Graovac & Chen, 2009), and RepeatProteinMask was run against the RepBase library with the parameters -engine ncbi -noLowSimple -pvalue 1e-04 (Bao, Kojima, & Kohany, 2015). Additionally, RepeatModeler (v1.0.9) (Flynn et al., 2020), Piler (Edgar & Myers, 2005), and RepeatScout (v1.05) (Price, Jones, & Pevzner, 2005) were used to identify repeats based on sequence alignment. For de novo prediction, RepeatModeler, LTR_FINDER (v1.05) (Xu & Wang, 2007), and Tandem Repeats Finder (TRF, v4.04) were used.
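As an illustration of the settings quoted above, the following minimal Python sketch assembles the two command lines without executing them. The flags are taken from the text; the file name genome.fa, the function names, and the assumption that the input FASTA is passed as the final argument are hypothetical and not stated in the original.

```python
import shlex


def repeatmasker_cmd(genome_fasta):
    """Build a RepeatMasker argument list with the flags quoted in the text."""
    # -nolow, -norna and -no_is are the settings reported in the methods.
    return ["RepeatMasker", "-nolow", "-norna", "-no_is", genome_fasta]


def repeatproteinmask_cmd(genome_fasta):
    """Build a RepeatProteinMask argument list with the quoted parameters."""
    # Assumption: the input FASTA is given as the last argument.
    return [
        "RepeatProteinMask",
        "-engine", "ncbi",
        "-noLowSimple",
        "-pvalue", "1e-04",
        genome_fasta,
    ]


if __name__ == "__main__":
    # "genome.fa" is a placeholder path, not a file named in the paper.
    for cmd in (repeatmasker_cmd("genome.fa"), repeatproteinmask_cmd("genome.fa")):
        print(shlex.join(cmd))
```

Keeping each command as a list of arguments, rather than a single string, would allow it to be passed to subprocess.run without shell quoting issues if the tools are installed.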
Genome structure analysis incorporated homology-based, transcriptome-based, and de novo prediction methods. For homology-based annotation, protein sequences from eight teleost species (Cynoglossus semilaevis, Danio rerio, Gadus morhua, Gasterosteus aculeatus, Larimichthys crocea, Oreochromis niloticus, Oryzias latipes, and Takifugu rubripes) were downloaded from the NCBI database and used for annotation with GENEWISE (v2.4.0) (Doerks, Copley, Schultz, Ponting, & Bork, 2002). Transcript-based annotation involved assembling transcripts from RNA-Seq data with TRINITY (v2.1.1) (Grabherr et al., 2011), aligning them to the genome using PASA (Haas et al., 2003), and predicting open reading frames (ORFs) with TransDecoder (part of TRINITY). De novo gene prediction was carried out using AUGUSTUS (v2.5.5) (Stanke et al., 2006), GlimmerHMM (v3.0.4) (Majoros, Pertea, & Salzberg, 2004), and GENSCAN (v2.1) (Burge & Karlin, 1997). The gene models from all three approaches were then integrated using Glean.
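To make the idea of combining the three evidence types concrete, the sketch below tallies which evidence sources overlap each candidate gene interval. It is only a toy illustration under simplifying assumptions (a single sequence, no strand, 1-based inclusive coordinates, invented example intervals); it is not the method Glean uses, and all names in it are hypothetical.

```python
def overlaps(a, b):
    """True if two (start, end) intervals, 1-based inclusive, share a base."""
    return a[0] <= b[1] and b[0] <= a[1]


def evidence_support(candidates, evidence):
    """Map each candidate gene to the sorted evidence types overlapping it.

    candidates: {gene_name: (start, end)}
    evidence:   {evidence_type: [(start, end), ...]}
    """
    support = {}
    for name, interval in candidates.items():
        support[name] = sorted(
            source
            for source, intervals in evidence.items()
            if any(overlaps(interval, other) for other in intervals)
        )
    return support


if __name__ == "__main__":
    # Invented coordinates, for illustration only.
    candidates = {"g1": (100, 500), "g2": (900, 1200)}
    evidence = {
        "homology": [(120, 480)],
        "transcript": [(90, 510), (1000, 1100)],
        "de_novo": [(100, 500)],
    }
    for gene, sources in evidence_support(candidates, evidence).items():
        print(gene, sources)
```

A candidate supported by several independent evidence types is usually treated with more confidence than one supported by a single source; a real integration tool would also weigh exon boundaries, strand, and the reliability of each predictor.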
Functional annotation of the gene set was performed against several databases, including NR, KEGG (Kanehisa & Goto, 2000), SwissProt (Bairoch & Apweiler, 1997), and TrEMBL; homologous proteins were identified with BLASTp using an E-value cut-off of 1E-5 (Bairoch et al., 2005). Domain-based annotation was carried out with InterProScan (v4.7) (Mitchell et al., 2019).
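The E-value filter described above can be sketched as follows, assuming the BLASTp results are in the standard 12-column tabular format (-outfmt 6), where the E-value and bit score are the last two columns. The choice to keep only the highest-scoring hit per query, the inclusive treatment of the cut-off, and the example records are assumptions made for illustration, not details given in the text.

```python
EVALUE_CUTOFF = 1e-5  # cut-off reported in the methods


def best_hits(tabular_text, cutoff=EVALUE_CUTOFF):
    """Return {query: (subject, evalue, bitscore)} for hits with E <= cutoff.

    Expects BLAST tabular output (outfmt 6): qseqid, sseqid, pident, length,
    mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    best = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > cutoff:
            continue  # hit fails the E-value cut-off
        # Assumption: the hit with the highest bit score is kept per query.
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


if __name__ == "__main__":
    # Invented records, for illustration only.
    example = (
        "gene1\tsp|P1\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200\n"
        "gene1\tsp|P2\t60.0\t100\t40\t0\t1\t100\t1\t100\t1e-10\t80\n"
        "gene2\tsp|P3\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.01\t30\n"
    )
    for query, hit in best_hits(example).items():
        print(query, hit)
```

In this example gene2 receives no annotation, because its only hit has an E-value of 0.01, which is above the cut-off.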