De novo assembly of the T. dalaica genome
All next-generation sequencing (NGS) data generated in this study are summarized in
Table S1. To estimate genome size and heterozygosity, a total of
37.5 Gb of short reads, representing approximately 50-fold coverage of the estimated
genome, were selected from the generated NGS data. These reads
were then subjected to 17-mer analysis. As the main peak was located at a
depth of 53 (Table S2), the genome size was estimated to be 631 Mb,
which is similar to the previously reported genome sizes of other
Triplophysa species, such as T. siluroides and T. tibetana, which were
estimated to be 638.07 Mb (L. Yang et al., 2019) and 652 Mb (X. Yang et
al., 2019), respectively. Moreover, the estimated heterozygosity rate
was approximately 0.375%.
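
The k-mer estimate above rests on a commonly used relationship: genome size is approximately the total number of k-mers divided by the depth of the main peak. The sketch below illustrates that relationship only. The exact formula and software used for the 17-mer analysis are not stated here, so the function name, the `min_depth` cutoff, and the toy histogram are all hypothetical and are not the T. dalaica data.

```python
def estimate_genome_size(kmer_histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    kmer_histogram maps depth -> number of distinct k-mers observed at that depth.
    min_depth is a hypothetical cutoff that skips the low-depth peak usually
    attributed to sequencing errors.
    Returns (estimated_size, peak_depth).
    """
    usable = {d: n for d, n in kmer_histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # Main peak: the depth at which the most distinct k-mers occur.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences: each distinct k-mer counted once per observation.
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth, peak_depth


# Toy histogram with invented numbers, for illustration only.
toy_histogram = {1: 900, 2: 300, 9: 50, 10: 100, 11: 50}
size, peak = estimate_genome_size(toy_histogram)
print(f"peak depth = {peak}, estimated size = {size:.0f} bp")
```

With real data, the histogram would come from a k-mer counting tool, and the choice of `min_depth` would need to be checked against the shape of the distribution.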
Data used for the de novo assembly of the T. dalaica genome were
generated using the PacBio Sequel II platform. Following removal of
adaptors and low-quality reads, 126.5 Gb of subreads (nearly 200X
coverage) remained, with an average read length of 19.7 kb and an N50
length of 32.4 kb. Preliminary genome assembly was performed in FALCON
v0.3. The final assembly, obtained after several rounds of polishing
with Arrow and Pilon, was approximately 607.91 Mb with a contig N50 of
9.27 Mb. The contig N50 lengths of the previously reported T. siluroides
and T. tibetana assemblies were 2.87 Mb and 3.1 Mb, respectively (L.
Yang et al., 2019; X. Yang et al., 2019). Thus, the contig N50 of T.
dalaica is about three times as long as those previously reported. This
may be because we used the newest available Sequel II technology, whose
throughput and subread N50 length are substantially improved over
previous platforms.
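
For readers unfamiliar with the two summary statistics used above, the following minimal sketch shows how fold coverage and N50 are conventionally computed. It is not the pipeline used in this study; the function names are hypothetical, and the contig lengths in the example are invented. Only the 126.5 Gb and 631 Mb figures are taken from the text.

```python
def fold_coverage(total_bases, genome_size):
    """Sequencing depth: total sequenced bases divided by genome size."""
    return total_bases / genome_size


def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of all bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Subread depth from the figures reported in the text.
depth = fold_coverage(126.5e9, 631e6)
print(f"subread depth is about {depth:.0f}X")

# Invented contig lengths, for illustration only.
toy_contigs = [2, 3, 4, 5, 6]
print("toy N50 =", n50(toy_contigs))
```

Because N50 depends only on the sorted lengths, the same function applies equally to subreads and to contigs.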
The quality of the assembled genome was then assessed. GC content
analysis was conducted to check for potential contamination introduced
before sequencing. A unimodal distribution of GC content was
detected, with an average GC content of 38.77% for the assembly,
suggesting no bacterial contamination. To evaluate coverage of the
assembly, all RNA-seq reads were mapped to the T. dalaica genome
using HISAT2 (Kim, Langmead, & Salzberg, 2015), with default parameters
applied. The percentage of aligned reads ranged from 84.78% to 91.08%
(Table S3). Moreover, Benchmarking Universal Single-Copy Orthologs
(BUSCO) (Simao, Waterhouse, Ioannidis, Kriventseva, & Zdobnov, 2015)
was used to estimate the coverage of the 4,584 single-copy genes
conserved among Actinopterygii; approximately 93.7% of these were found
as complete BUSCOs in the assembly (Table S4). Taken together, these
results suggest that the genome assembly was robust and nearly complete.
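
The GC analysis described above can be illustrated with a short sketch that computes GC content in fixed-size windows; a second peak in the resulting distribution would hint at sequence of foreign origin. This is an assumed, simplified procedure, not the method used in this study: the window size, the handling of ambiguous bases, and the function names are hypothetical, and the sequence is a toy example.

```python
def gc_fraction(sequence):
    """Fraction of G and C among unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return gc / (gc + at)


def windowed_gc(sequence, window):
    """GC fraction of each non-overlapping window; a short tail is dropped."""
    return [
        gc_fraction(sequence[start:start + window])
        for start in range(0, len(sequence) - window + 1, window)
    ]


# Toy sequence, for illustration only.
toy_sequence = "GGCCAATTGCAT"
print("overall GC:", gc_fraction(toy_sequence))
print("per-window GC:", windowed_gc(toy_sequence, 4))
```

In practice the per-window values from all contigs would be plotted as a histogram and inspected for unimodality.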