De novo assembly of the T. dalaica genome
All next-generation sequencing (NGS) data generated in this report are summarized in Table S1. To estimate genome size and heterozygosity, a total of 37.5 Gb of short reads, covering approximately 50-fold of the estimated genome, were selected from the NGS data and subjected to 17-mer analysis. As the main peak was located at a depth of 53 (Table S2), the genome size was estimated to be 631 Mb, similar to the previously reported genome sizes of other Triplophysa species such as T. siluroides and T. tibetana, which were estimated at 638.07 Mb (L. Yang et al., 2019) and 652 Mb (X. Yang et al., 2019), respectively. The estimated heterozygosity rate was approximately 0.375%.
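The k-mer estimate above follows the common relation genome size ≈ total k-mer count / depth of the main peak. A minimal sketch of that calculation is given below. It assumes a k-mer histogram supplied as a mapping from depth to the number of distinct k-mers at that depth; the function name, the `min_depth` error cutoff, and the toy histogram are illustrative assumptions and are not taken from this study.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    histogram: dict mapping k-mer depth -> number of distinct k-mers
               observed at that depth (hypothetical input format).
    min_depth: depths below this are treated as sequencing errors
               and ignored (an assumed cutoff, not from the paper).
    Returns (peak_depth, estimated_genome_size).
    """
    # Drop low-depth k-mers, which mostly arise from sequencing errors.
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")

    # The main peak is the depth with the most distinct k-mers.
    peak_depth = max(usable, key=usable.get)

    # Total k-mer occurrences = sum over depths of depth * count.
    total_kmers = sum(d * n for d, n in usable.items())

    return peak_depth, total_kmers / peak_depth


# Synthetic toy histogram, for illustration only.
toy_histogram = {1: 5000, 2: 800, 9: 100, 10: 400, 11: 100}
peak, size = estimate_genome_size(toy_histogram)
print(peak, size)
```

In practice the histogram would come from a k-mer counting tool, and the error cutoff would be chosen by inspecting the valley between the error peak and the main peak.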
Data used for the de novo assembly of the T. dalaica genome were generated using the PacBio Sequel II platform. Following removal of adaptors and low-quality reads, 126.5 Gb of subreads (nearly 200X coverage) remained, with an average read length of 19.7 kb and an N50 length of 32.4 kb. A preliminary assembly was generated with FALCON v0.3, and the final assembly, approximately 607.91 Mb with a contig N50 of 9.27 Mb, was obtained after several rounds of polishing with Arrow and Pilon. The contig N50 lengths previously reported for T. siluroides and T. tibetana were 2.87 Mb and 3.1 Mb, respectively (L. Yang et al., 2019; X. Yang et al., 2019). Thus, the contig N50 of T. dalaica was roughly three times longer than those previously reported. This may be because we used the newest Sequel II technology available, which offers substantially higher throughput and longer subread N50 than previous platforms.
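For readers unfamiliar with the N50 statistic used above for both subreads and contigs, the following is a minimal sketch of how it can be computed from a list of sequence lengths. The function name and the toy lengths are illustrative assumptions; this is not the code used to produce the reported values.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L
    together contain at least half of the total bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2

    running = 0
    for length in ordered:
        running += length
        # The first length at which the cumulative sum reaches
        # half of the total is the N50.
        if running >= half_total:
            return length

    raise ValueError("lengths must not be empty")


# Toy contig lengths, for illustration only.
toy_lengths = [2, 3, 4, 5, 6]
print(n50(toy_lengths))
```

Because N50 is weighted by length, a few long contigs raise it far more than many short ones, which is why longer subreads tend to translate into a higher contig N50.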
The quality of the assembled genome was thoroughly scrutinized. GC content analysis was conducted to assess potential contamination introduced before sequencing. A unimodal distribution of GC content was detected, with an average GC content of 38.77% for the assembly, suggesting no bacterial contamination. To evaluate the completeness of the assembly, all RNA-seq reads were mapped to the T. dalaica genome using HISAT2 (Kim, Langmead, & Salzberg, 2015) with default parameters. The percentage of aligned reads ranged from 84.78% to 91.08% (Table S3). Moreover, Benchmarking Universal Single-Copy Orthologs (BUSCO) (Simao, Waterhouse, Ioannidis, Kriventseva, & Zdobnov, 2015) was used to assess the assembly against the 4,584 single-copy genes conserved among Actinopterygii, with approximately 93.7% found as complete BUSCOs in the assembly (Table S4). Taken together, these results suggest that the genome assembly was robust and nearly complete.
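The GC-based contamination check described above relies on the distribution of GC content across the assembly. A minimal sketch of a windowed GC calculation is shown below. The function names, the non-overlapping window scheme, and the toy sequence are assumptions for illustration; the window size and tooling used in this study are not specified here.

```python
def gc_content(seq):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        # Window contains only ambiguous bases (e.g. N).
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def windowed_gc(seq, window):
    """GC fraction in consecutive non-overlapping windows.

    A trailing partial window is skipped.
    """
    return [
        gc_content(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, window)
    ]


# Toy sequence, for illustration only.
toy_sequence = "ATGCGGCCATAT"
print(windowed_gc(toy_sequence, 4))
```

A histogram of the per-window values with a single peak is consistent with one source genome, whereas a second peak at a clearly different GC fraction would prompt a closer look for contaminant sequences.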