2.8 Gene prediction and annotation
To predict protein-coding genes in the assembled genome of C. sonnerati, we
used three strategies: homology-based prediction, de novo prediction, and
transcriptome-based prediction. First, protein sequences from Epinephelus
lanceolatus, Plectropomus leopardus, Epinephelus akaara, Oreochromis
niloticus, Lates calcarifer, Gymnodraco acuticeps, Pseudochaenichthys
georgianus, and Cyclopterus lumpus were downloaded from Ensembl
(Flicek et al., 2014) and aligned to the C. sonnerati genome for homology
annotation. Exonerate (v2.2.0) was used to conduct homology-based gene
prediction. Second, we adopted Augustus (v3.3.1) (Stanke et al., 2004)
and Genscan (Burge & Karlin, 1997) to perform de novo gene
prediction. Third, protein-coding gene prediction based on transcriptome
sequencing data was carried out using GMAP (version 2018-07-04) (Wu et
al., 2005). TransDecoder (v3.0.1)
(https://github.com/TransDecoder/TransDecoder) was then used to define the
gene structures. Finally, Maker (v3.00) (Cantarel et al., 2008) was used to
integrate the predictions from the three methods into a final set of gene
models.
Gene functions were inferred according to the best match of the
alignments to the non-redundant (NR), TrEMBL (Boeckmann et al., 2003),
InterPro (Mitchell et al., 2015), and SwissProt (Boeckmann et al., 2003)
protein databases using BLASTP (NCBI blast v2.6.0+) (Altschul et al.,
1997; Camacho et al., 2009) and the Kyoto Encyclopedia of Genes and
Genomes (KEGG) database (Kanehisa et al., 2012) with an e-value
threshold of 1e-5. The protein domains were annotated
using PfamScan (pfamscan_version) (Mistry et al., 2007) and
InterProScan (v5.35-74.0) (Jones et al., 2014) based on InterPro protein
databases. Motifs and domains within gene models were identified using the
Pfam database (Finn et al., 2008). Gene Ontology (GO) (Ashburner et
al., 2000) IDs for each gene were obtained from Blast2GO (Conesa &
Götz, 2008).
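The best-match step described above can be illustrated with a minimal Python sketch. It assumes the BLASTP results were written in the standard 12-column tabular format (-outfmt 6), that "best match" means the hit with the highest bit score, and that hits at or below the 1e-5 threshold are retained; none of these details are stated in the text. The function name and the gene and protein identifiers are hypothetical.

```python
import csv
import io

E_VALUE_THRESHOLD = 1e-5  # threshold stated in the text


def best_hits(tabular_text, max_evalue=E_VALUE_THRESHOLD):
    """Map each query ID to its best hit as (subject_id, evalue, bitscore).

    Assumes BLAST tabular output (-outfmt 6) with the default 12 columns,
    where column 11 is the e-value and column 12 is the bit score.
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue  # assumption: hits at or below the threshold are kept
        # Assumption: the best match is the hit with the highest bit score.
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Hypothetical hits: gene1 has two significant matches, gene2 has none.
example = "\n".join([
    "gene1\tprotA\t90.0\t200\t20\t0\t1\t200\t1\t200\t1e-50\t300",
    "gene1\tprotB\t95.0\t210\t10\t0\t1\t210\t1\t210\t1e-80\t410",
    "gene2\tprotC\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t25",
])
annotations = best_hits(example)
```

In this example, gene1 is assigned to protB (the higher-scoring hit), and gene2 receives no annotation because its only hit exceeds the e-value threshold.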
In addition, we used tRNAscan-SE (v1.3.1) (Lowe & Eddy,
1997) with default parameters to identify tRNA genes.
For rRNA identification, we first downloaded the
rRNA sequences of closely related species from the Ensembl database. These
rRNA sequences were then aligned against our genome using BLASTN
(Altschul et al., 1997; Camacho et al., 2009) with cut-offs of e-value
< 1e-5, identity ≥ 85%, and match length ≥ 50 bp. The miRNAs and snRNAs
were identified by searching the genome against the Rfam (v14.1) database
(Finn et al., 2008) with Infernal (v1.1.2) (Nawrocki et al., 2009) using
default parameters.
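The three rRNA cut-offs given above (e-value < 1e-5, identity ≥ 85%, match length ≥ 50 bp) can be expressed as a simple filter. The sketch below assumes the BLASTN hits are in the standard 12-column tabular format (-outfmt 6) and that "match length" corresponds to the alignment length column; the function name and the example records are hypothetical and are not taken from the study.

```python
def passes_rrna_filter(line, max_evalue=1e-5, min_identity=85.0, min_length=50):
    """Return True if a tabular BLASTN hit meets all three cut-offs.

    Assumes -outfmt 6 columns: column 3 is percent identity, column 4 is
    alignment length (taken here as the match length), column 11 is e-value.
    """
    fields = line.rstrip("\n").split("\t")
    identity = float(fields[2])
    length = int(fields[3])
    evalue = float(fields[10])
    return evalue < max_evalue and identity >= min_identity and length >= min_length


# Hypothetical hits illustrating each criterion.
hits = [
    "rRNA_18S\tchr1\t98.5\t1800\t27\t0\t1\t1800\t5000\t6799\t0.0\t3200",  # passes
    "rRNA_5S\tchr2\t84.9\t120\t18\t0\t1\t120\t100\t219\t1e-20\t95",  # identity < 85%
    "rRNA_5S\tchr3\t92.0\t40\t3\t0\t1\t40\t700\t739\t1e-8\t60",  # length < 50 bp
    "rRNA_28S\tchr4\t90.0\t300\t30\t0\t1\t300\t10\t309\t1e-5\t80",  # e-value not < 1e-5
]
kept = [h for h in hits if passes_rrna_filter(h)]
```

Only the first record is retained; each of the others fails exactly one criterion, which makes the role of each threshold easy to see.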