3.2 Genome annotation
A total of 526.92 Mb of repeat sequences, accounting for 50.47% of the assembly, were identified in the C. sonnerati genome. Transposable elements (TEs) made up 493.11 Mb, or 47.23% of the assembled genome (Table 4). This percentage was higher than that of Plectropomus leopardus (30.74%) (Zhou et al., 2020) and Epinephelus akaara (43.02%) (Ge et al., 2019). DNA transposons, LINEs, and LTRs were the three most abundant categories of repetitive elements, accounting for 24.82%, 13.74%, and 6.72%, respectively.
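As a quick consistency check on the figures above, the assembly size implied by each length and percentage pair can be recomputed. The following is a minimal sketch using only the numbers quoted in this paragraph; the function and variable names are illustrative and are not part of the annotation pipeline.

```python
# Values quoted in the text (Table 4); names below are illustrative only.
repeat_mb, repeat_pct = 526.92, 50.47   # all repeat sequences
te_mb, te_pct = 493.11, 47.23           # transposable elements


def implied_assembly_mb(length_mb, percent):
    """Assembly size (Mb) implied by a feature length and its genome share."""
    return length_mb / (percent / 100.0)


size_from_repeats = implied_assembly_mb(repeat_mb, repeat_pct)
size_from_tes = implied_assembly_mb(te_mb, te_pct)

# Combined share of the three largest repeat categories
# (DNA transposons, LINEs, LTRs), which should not exceed the TE total.
top_three_pct = 24.82 + 13.74 + 6.72

print(f"Implied assembly size from repeats: {size_from_repeats:.2f} Mb")
print(f"Implied assembly size from TEs:     {size_from_tes:.2f} Mb")
print(f"Top three categories combined:      {top_three_pct:.2f}%")
```

Both estimates come out near 1,044 Mb, so the two reported percentages appear to refer to the same assembly length.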
We predicted the protein-coding genes of the C. sonnerati genome using three methods: de novo, homology-based, and transcriptome-based gene prediction. A total of 26,130 protein-coding genes were predicted (Supplementary Table S2). The statistics of the predicted gene models were then compared with those of eight closely related teleost species (E. lanceolatus, P. leopardus, E. akaara, O. niloticus, L. calcarifer, G. acuticeps, P. georgianus, and C. lumpus); C. sonnerati displayed similar distribution patterns in exon and intron number, gene and CDS length, exon and intron length, and gene and CDS gene content (Figure 3). In total, 24,629 genes (approximately 94.26%) were functionally annotated in at least one of the databases (Table 5). This number is higher than those reported for E. akaara (23,808) (Ge et al., 2019) and P. leopardus (24,364) (Zhou et al., 2020), but lower than that for E. lanceolatus (24,794) (Zhou et al., 2019).
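The annotation rate and the cross-species ranking quoted above can be reproduced with simple arithmetic. This is a minimal sketch based only on the counts given in the text; the variable names are illustrative.

```python
# Counts quoted in the text; names below are illustrative only.
total_genes = 26130   # predicted protein-coding genes in C. sonnerati
annotated = 24629     # genes annotated in at least one database

annotated_pct = round(100 * annotated / total_genes, 2)

# Functionally annotated gene counts for the species compared in the text.
counts = {
    "C. sonnerati": annotated,
    "E. akaara": 23808,
    "P. leopardus": 24364,
    "E. lanceolatus": 24794,
}

# Rank species from most to fewest annotated genes.
ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

print(f"Annotated fraction: {annotated_pct}%")
for species, n in ranked:
    print(f"{species}: {n}")
```

The ranking places C. sonnerati second, behind E. lanceolatus and ahead of P. leopardus and E. akaara, matching the comparison in the text.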
For non-coding genes, 373 miRNAs, 2,232 tRNAs, 169 rRNAs and 515 snRNAs were also identified in the genome of C. sonnerati (Supplementary Table S3).