Genome structure prediction and annotation
Repeat elements in the genome were identified using multiple tools and
databases. RepeatMasker (v3.3.0) was run with the options -nolow
-norna -no_is (Tarailo-Graovac & Chen, 2009), and RepeatProteinMask
was run against the RepBase library with the options -engine ncbi
-noLowSimple -pvalue 1e-04 (Bao, Kojima, & Kohany, 2015). In addition,
RepeatModeler (v1.0.9) (Flynn et al., 2020), Piler (Edgar & Myers,
2005), and RepeatScout (v1.05) (Price, Jones, & Pevzner, 2005) were
applied as sequence-alignment-based approaches. De novo prediction
used RepeatModeler, LTR_FINDER (v1.05) (Xu & Wang, 2007), and Tandem
Repeats Finder (TRF, v4.04).
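When several tools annotate the same genome, their repeat calls can overlap, so a non-redundant set is usually needed before total repeat content is reported. The following Python is a minimal sketch of that merging step, not the pipeline used in this study. It assumes each tool's output has already been parsed into (sequence, start, end) tuples with 1-based inclusive coordinates; the function names and the toy coordinates are hypothetical.

```python
from collections import defaultdict


def merge_repeat_intervals(intervals):
    """Merge overlapping or directly adjacent intervals per sequence.

    `intervals` is an iterable of (seq_id, start, end) tuples.
    Coordinates are ASSUMED to be 1-based and inclusive.
    """
    by_seq = defaultdict(list)
    for seq_id, start, end in intervals:
        by_seq[seq_id].append((start, end))

    merged = {}
    for seq_id, spans in by_seq.items():
        spans.sort()
        out = [list(spans[0])]
        for start, end in spans[1:]:
            if start <= out[-1][1] + 1:  # overlaps or abuts the previous span
                out[-1][1] = max(out[-1][1], end)
            else:
                out.append([start, end])
        merged[seq_id] = [tuple(span) for span in out]
    return merged


def masked_bases(merged):
    """Total number of bases covered by the merged intervals."""
    return sum(end - start + 1
               for spans in merged.values()
               for start, end in spans)


# Hypothetical toy calls pooled from several repeat finders
calls = [
    ("scaffold1", 100, 200),
    ("scaffold1", 150, 260),
    ("scaffold1", 261, 300),
    ("scaffold1", 500, 550),
    ("scaffold2", 10, 20),
]
merged = merge_repeat_intervals(calls)
total = masked_bases(merged)
```

Dividing `total` by the assembly length would give the fraction of the genome annotated as repeats under these assumptions.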
Genome structure analysis incorporated homology-based,
transcriptome-based, and de novo prediction methods. For homology-based
annotation, protein sequences from eight teleost species (Cynoglossus
semilaevis, Danio rerio, Gadus morhua, Gasterosteus aculeatus,
Larimichthys crocea, Oreochromis niloticus, Oryzias latipes, and
Takifugu rubripes) were downloaded from the NCBI database and used to
predict gene models with GENEWISE (v2.4.0) (Doerks, Copley, Schultz,
Ponting, & Bork, 2002).
Transcript annotation involved assembling transcripts from RNA-Seq data
with TRINITY (v2.1.1) (Grabherr et al., 2011), aligning them to the
genome using PASA (Haas et al., 2003), and predicting ORFs with
TransDecoder (part of TRINITY). De novo gene prediction was carried out
using AUGUSTUS (v2.5.5) (Stanke et al., 2006), GlimmerHMM (v3.0.4)
(Majoros, Pertea, & Salzberg, 2004), and GENSCAN (v2.1) (Burge &
Karlin, 1997), followed by integration using Glean.
Gene sets were functionally annotated against several databases,
including NR, KEGG (Kanehisa & Goto, 2000), SwissProt (Bairoch &
Apweiler, 1997), and TrEMBL, with homologous proteins identified by
BLASTp at an E-value cut-off of 1e-5 (Bairoch et al., 2005).
Domain-based annotation was performed with InterProScan (v4.7)
(Mitchell et al., 2019).
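The BLASTp E-value cut-off described above can be illustrated with a short Python sketch. It assumes the search results were written in the standard 12-column BLAST+ tabular format (-outfmt 6), where column 11 is the E-value and column 12 the bit score. Keeping only the highest-scoring hit per query, and treating the cut-off as inclusive, are assumptions made for illustration rather than choices stated in this study. The function name and sequence identifiers are hypothetical.

```python
import csv
import io

EVALUE_CUTOFF = 1e-5  # cut-off stated in the text


def best_hits(handle, evalue_cutoff=EVALUE_CUTOFF):
    """Return {query_id: (subject_id, evalue, bitscore)}.

    Reads BLAST tabular output (12 default columns) and keeps, for each
    query, the hit with the highest bit score among those whose E-value
    is at or below the cut-off (inclusive cut-off is an ASSUMPTION).
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > evalue_cutoff:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical example rows; identifiers and scores are made up
example = io.StringIO(
    "gene1\tsp|P1\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-50\t350\n"
    "gene1\tsp|P2\t60.0\t180\t72\t0\t1\t180\t5\t184\t1e-20\t150\n"
    "gene2\tsp|P3\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.01\t40\n"
    "gene3\tsp|P4\t45.0\t90\t49\t1\t1\t90\t3\t92\t1e-5\t60\n"
)
hits = best_hits(example)
```

In this toy input, gene2 has no hit passing the threshold and so receives no annotation, while gene1 keeps only its higher-scoring match.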