2.8 Gene prediction and annotation
To predict protein-coding genes in the assembled C. sonnerati genome, we combined three strategies: homology-based prediction, de novo prediction, and transcriptome-based prediction. First, protein sequences from Epinephelus lanceolatus, Plectropomus leopardus, Epinephelus akaara, Oreochromis niloticus, Lates calcarifer, Gymnodraco acuticeps, Pseudochaenichthys georgianus, and Cyclopterus lumpus were downloaded from Ensembl (Flicek et al., 2014) and aligned to the C. sonnerati genome. Exonerate (v2.2.0) was used for homology-based gene prediction. Second, Augustus (v3.3.1) (Stanke et al., 2004) and GENSCAN (Burge & Karlin, 1997) were used for de novo gene prediction. Third, protein-coding genes were predicted from transcriptome sequencing data using GMAP (version 2018-07-04) (Wu et al., 2005), and TransDecoder (v3.0.1) (https://github.com/TransDecoder/TransDecoder) was used to define the gene structures. Finally, MAKER (v3.00) (Cantarel et al., 2008) was used to integrate the results of the three approaches into a final set of gene models.
Gene functions were inferred from the best match of alignments to the non-redundant (NR), TrEMBL (Boeckmann et al., 2003), InterPro (Mitchell et al., 2015), and SwissProt (Boeckmann et al., 2003) protein databases using BLASTP (NCBI BLAST v2.6.0+) (Altschul et al., 1997; Camacho et al., 2009), and to the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (Kanehisa et al., 2012), with an e-value threshold of 1e-5. Protein domains were annotated using PfamScan (pfamscan_version) (Mistry et al., 2007) and InterProScan (v5.35-74.0) (Jones et al., 2014) based on the InterPro protein databases. Motifs and domains within gene models were identified using the Pfam database (Finn et al., 2008). Gene Ontology (GO) (Ashburner et al., 2000) IDs for each gene were obtained from Blast2GO (Conesa & Götz, 2008).
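The best-match rule described above can be illustrated with a minimal Python sketch. It assumes that the BLASTP results were written in the standard 12-column tabular format (-outfmt 6), that "best match" means the hit with the highest bit score per query, and that the threshold is inclusive (e-value ≤ 1e-5). None of these details is stated in the text, and the function name best_hits is hypothetical.

```python
import csv

EVALUE_CUTOFF = 1e-5  # threshold stated in the text; inclusive comparison is an assumption


def best_hits(handle, evalue_cutoff=EVALUE_CUTOFF):
    """Return {query_id: (subject_id, evalue, bitscore)} for the top hit per query.

    Expects BLAST tabular output (-outfmt 6): qseqid, sseqid, pident, length,
    mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    "Best" is taken here to be the highest bit score (an assumption).
    """
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > evalue_cutoff:
            continue  # hit does not meet the e-value threshold
        current = best.get(query)
        if current is None or bitscore > current[2]:
            best[query] = (subject, evalue, bitscore)
    return best
```

Called on an open file handle of BLASTP output, the function returns one retained hit per gene model; queries with no hit at or below the threshold are simply absent from the result and would remain unannotated for that database.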
In addition, tRNA genes were identified using tRNAscan-SE (v1.3.1) (Lowe & Eddy, 1997) with default parameters. For rRNA identification, rRNA sequences from closely related species were downloaded from the Ensembl database and aligned against our genome using BLASTN (Altschul et al., 1997; Camacho et al., 2009) with cut-offs of e-value < 1e-5, identity ≥ 85%, and match length ≥ 50 bp. miRNAs and snRNAs were identified using Infernal (v1.1.2) (Nawrocki et al., 2009) against the Rfam (v14.1) database (Finn et al., 2008) with default parameters.
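The rRNA cut-offs given above (e-value < 1e-5, identity ≥ 85%, match length ≥ 50 bp) can be expressed as a short filter. This is only a sketch: it assumes the alignments were exported in the standard BLAST tabular format (-outfmt 6), which the text does not specify, and the function name filter_rrna_hits is hypothetical.

```python
def filter_rrna_hits(lines, max_evalue=1e-5, min_identity=85.0, min_length=50):
    """Keep BLAST tabular (-outfmt 6) hits that pass the rRNA cut-offs.

    Columns used: pident (index 2), alignment length (index 3), evalue (index 10).
    Returns the retained rows as lists of string fields.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        identity = float(fields[2])
        length = int(fields[3])
        evalue = float(fields[10])
        # Strict '<' for the e-value and '>=' for identity and length, as in the text.
        if evalue < max_evalue and identity >= min_identity and length >= min_length:
            kept.append(fields)
    return kept
```

The function accepts any iterable of lines, such as an open file, so it can be applied directly to a saved BLASTN result; overlapping hits on the same locus are not merged here and would need a separate step.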