Genome annotations
Annotation of a newly assembled genome generally covers repeats, gene models, and gene functions. For repeat annotation, a total of 197,396 SSRs were identified using MISA. Combining the homology-based and de novo results, repeat sequences accounted for 35.01% of the T. dalaica genome assembly (Table 2). Among these, DNA transposons made up the greatest proportion (16.01%), followed by LTRs (8.9%) and LINEs (4.24%).
The final set of protein-coding genes was obtained by integrating the results of ab initio, homology-based, and RNA-seq-based predictions. This set consisted of 23,925 genes, with an average gene length of 12,128.58 bp, an average CDS length of 1,715.77 bp, and an average of 9.9 exons per gene. The distributions of these parameters were similar among T. dalaica and the species used for annotation (Figure S1), suggesting both gene conservation and annotation robustness. In addition, the homology-based ncRNA annotation identified a total of 1,664 miRNAs, 11,504 tRNAs, 684 rRNAs, and 1,207 snRNAs in the genome.
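As an illustration of how per-gene summary statistics such as the average gene length, average CDS length, and number of exons per gene can be derived from a gene-model file, the following is a minimal sketch and not the pipeline used in this study. It assumes a GFF3 file with `gene`, `mRNA`, `exon`, and `CDS` features and exactly one transcript per gene. The function names and the toy records are hypothetical.

```python
from collections import defaultdict


def parse_attributes(field):
    """Turn a GFF3 attribute column such as 'ID=t1;Parent=g1' into a dict."""
    return dict(item.split("=", 1) for item in field.strip().split(";") if "=" in item)


def summarize_gene_models(gff3_lines):
    """Return mean gene length, mean CDS length, and mean exon count per gene.

    Assumption (not from the paper): one mRNA per gene, with exon and CDS
    features whose Parent attribute points to that mRNA.
    """
    gene_length = {}
    transcript_to_gene = {}
    child_rows = []

    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and directives/comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip malformed rows
        ftype, start, end = cols[2], int(cols[3]), int(cols[4])
        attrs = parse_attributes(cols[8])
        length = end - start + 1  # GFF3 coordinates are 1-based and inclusive
        if ftype == "gene":
            gene_length[attrs["ID"]] = length
        elif ftype == "mRNA":
            transcript_to_gene[attrs["ID"]] = attrs["Parent"]
        elif ftype in ("exon", "CDS"):
            child_rows.append((ftype, attrs["Parent"], length))

    if not gene_length:
        raise ValueError("no gene features found")

    # Resolve exon/CDS rows to genes only after all transcripts are known.
    cds_length = defaultdict(int)
    exon_count = defaultdict(int)
    for ftype, parent, length in child_rows:
        gene = transcript_to_gene.get(parent)
        if gene is None:
            continue
        if ftype == "exon":
            exon_count[gene] += 1
        else:
            cds_length[gene] += length

    n = len(gene_length)
    return {
        "genes": n,
        "mean_gene_length": sum(gene_length.values()) / n,
        "mean_cds_length": sum(cds_length.values()) / n,
        "mean_exons_per_gene": sum(exon_count.values()) / n,
    }


# Hypothetical toy records, for illustration only.
TOY_GFF3 = [
    "##gff-version 3",
    "chr1\ttoy\tgene\t1\t1000\t.\t+\t.\tID=g1",
    "chr1\ttoy\tmRNA\t1\t1000\t.\t+\t.\tID=t1;Parent=g1",
    "chr1\ttoy\texon\t1\t200\t.\t+\t.\tID=e1;Parent=t1",
    "chr1\ttoy\texon\t801\t1000\t.\t+\t.\tID=e2;Parent=t1",
    "chr1\ttoy\tCDS\t101\t200\t.\t+\t0\tID=c1;Parent=t1",
    "chr1\ttoy\tCDS\t801\t900\t.\t+\t2\tID=c1;Parent=t1",
    "chr1\ttoy\tgene\t2001\t2500\t.\t-\t.\tID=g2",
    "chr1\ttoy\tmRNA\t2001\t2500\t.\t-\t.\tID=t2;Parent=g2",
    "chr1\ttoy\texon\t2001\t2500\t.\t-\t.\tID=e3;Parent=t2",
    "chr1\ttoy\tCDS\t2101\t2400\t.\t-\t0\tID=c2;Parent=t2",
]

stats = summarize_gene_models(TOY_GFF3)
print(stats)
```

With alternative isoforms, a representative transcript (for example, the longest CDS) would need to be selected per gene before averaging; that choice is not specified here.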
Functional annotations, including function descriptions, KEGG pathways, and GO term assignments, as well as database summaries, are shown in Table S5. In total, 23,594 genes (98.62% of the 23,925 predicted genes) could be assigned a potential function.
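The reported annotation rate can be checked directly from the two gene counts given above. The short sketch below only reproduces that arithmetic; the function name is hypothetical.

```python
def percent_annotated(annotated, total):
    """Percentage of genes with at least one functional annotation."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100 * annotated / total


TOTAL_GENES = 23925      # predicted protein-coding genes
ANNOTATED_GENES = 23594  # genes with a potential function assigned

pct = percent_annotated(ANNOTATED_GENES, TOTAL_GENES)
unannotated = TOTAL_GENES - ANNOTATED_GENES

print(f"{pct:.2f}% annotated, {unannotated} genes without annotation")
# prints "98.62% annotated, 331 genes without annotation"
```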