3.1 Genome assembly
In this study, we generated a high-quality chromosome-level genome assembly of C. sonnerati using a combination of PacBio sequencing and Hi-C sequencing technologies. We obtained 56.98 Gb of clean short-read sequencing data from the genome of C. sonnerati (Figure 1). The high-quality clean reads were then used to estimate genome size with a k-mer-based method (Liu et al., 2013). The genome size of C. sonnerati was estimated to be 1015 Mb, with a heterozygosity rate of 0.84% and a repeat-sequence proportion of 42.99% (Figure 2, Table 1).
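The k-mer-based estimate rests on a simple relation: genome size is approximately the total number of k-mers divided by the depth at the main peak of the k-mer frequency histogram. The following is a minimal sketch of that relation, not the exact procedure used in this study. The histogram values, the function name, and the `min_depth` cutoff for discarding low-depth error k-mers are all hypothetical.

```python
def estimate_genome_size(histogram, min_depth=3):
    """Estimate genome size from a k-mer frequency histogram.

    histogram: dict mapping depth -> number of distinct k-mers seen at
               that depth (toy values below, not data from this study).
    min_depth: hypothetical cutoff; k-mers below it are treated as
               sequencing errors and ignored.
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    # Depth at which the largest number of distinct k-mers occurs.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences = sum over depths of depth * distinct k-mers.
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram: an error peak at depth 1-2, a main peak at depth 20,
# and a small two-copy repeat peak at depth 40.
toy_histogram = {1: 5000, 2: 800, 18: 100, 19: 300, 20: 500,
                 21: 300, 22: 100, 40: 50}

size, peak = estimate_genome_size(toy_histogram)
print(f"peak depth = {peak}, estimated size = {size:.0f} k-mer positions")
```

In the toy data, the 50 k-mers at depth 40 are counted twice in the estimate, which is how repeated sequence contributes to the estimated size.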
Using SMRT cells on the PacBio Sequel platform, we generated ~100X coverage of subreads after removing adaptor sequences. The longest 150X subreads data were used for genome assembly of C. sonnerati. The draft assembly was produced using mecat2 (Xiao et al., 2017) with default parameters. To correct errors in the primary assembly, we polished it with gcpp (v1.9.0) (https://github.com/PacificBiosciences/gcpp). We then used Illumina-derived short reads to correct remaining errors with Pilon (v1.22) (Walker et al., 2014). The final assembly comprised 795 contigs with a total length of about 1043.66 Mb and a contig N50 of 2.49 Mb (Table 2); the k-mer-based estimate of 1015 Mb corresponds to 97.3% of this length. The C. sonnerati genome assembly was larger than that of the leopard coral grouper Plectropomus leopardus (881.55 Mb) (Zhou et al., 2020) but smaller than that of the red spotted grouper Epinephelus akaara (1135 Mb) (Ge et al., 2019). The completeness of the assembly was evaluated with BUSCO (Benchmarking Universal Single-Copy Orthologs) v3.0.2 and OrthoDB. Overall, 95.8% and 95.6% of BUSCOs were identified as complete in the assembled and annotated genome, respectively (Supplementary Table S1), indicating that the genome assembly is highly complete.
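For readers unfamiliar with the contiguity statistics reported here, the sketch below shows how an N50 is computed from a list of contig lengths and reproduces the 97.3% figure relating the k-mer estimate to the assembly length. The contig lengths are hypothetical toy values; only the two totals (1015 Mb and 1043.66 Mb) are taken from the text.

```python
def n50(lengths):
    """Return the N50: the largest length L such that contigs of
    length >= L together cover at least half of the total length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must not be empty")


# Hypothetical contig lengths (arbitrary units), not data from this study.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)
print("toy N50:", toy_n50)

# Values reported in the text, in Mb.
kmer_estimate_mb = 1015.0
assembly_mb = 1043.66
ratio_percent = 100 * kmer_estimate_mb / assembly_mb
print(f"k-mer estimate / assembly length = {ratio_percent:.1f}%")
```

In the toy example the total is 300, and the two longest contigs (80 + 70) already reach half of it, so the N50 is 70.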
To anchor the contigs, 801,816,224 clean read pairs were generated from the Hi-C library and mapped to the polished C. sonnerati genome using BWA (v0.7.17) with default parameters. This yielded 324,980,877 uniquely mapped read pairs, which were used for Hi-C-associated scaffolding. We clustered the 795 contigs into 24 groups using the agglomerative hierarchical clustering method (Burton et al., 2013). The clustered contigs were then ordered and oriented; in total, 767 contigs spanning 1.02 Gb were successfully ordered and oriented. We thus obtained the first high-quality chromosome-level assembly of C. sonnerati, with chromosome lengths ranging from 2.52 to 44.48 Mb and containing 98.01% of the total sequence (Table 3).
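The idea behind agglomerative hierarchical clustering of contigs can be illustrated with a small sketch: each contig starts as its own cluster, and the two clusters with the strongest Hi-C linkage are merged repeatedly until the desired number of groups remains. This is an illustration only, not the software used in this study. The average-linkage criterion, the function name, and the toy link matrix are assumptions.

```python
from itertools import combinations


def cluster_contigs(links, n_groups):
    """Greedy average-linkage agglomerative clustering.

    links[i][j] is a hypothetical normalized Hi-C link count between
    contigs i and j (symmetric matrix). Clusters are merged, strongest
    average inter-cluster linkage first, until n_groups remain.
    """
    clusters = [[i] for i in range(len(links))]

    def avg_link(a, b):
        total = sum(links[i][j] for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_groups:
        # Find the pair of clusters with the highest average linkage.
        x, y = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg_link(clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Toy matrix: contigs 0-2 share many links, as do contigs 3-4.
toy_links = [
    [0, 9, 8, 1, 0],
    [9, 0, 7, 0, 1],
    [8, 7, 0, 1, 1],
    [1, 0, 1, 0, 9],
    [0, 1, 1, 9, 0],
]
groups = cluster_contigs(toy_links, n_groups=2)
print(groups)
```

With the toy matrix, the contigs separate into the two groups that share the most links. In the assembly described above, the analogous target would be 24 groups.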