Optimization of the in-silico mate-pair method improved contiguity and
accuracy of genome assembly
Abstract
A combination of short-read paired-end and mate-pair libraries of large
insert sizes is used as a standard method to generate genome assemblies
with high contiguity. Third-generation sequencing techniques are also
used to improve the quality of assembled genomes. However, both
mate-pair and third-generation libraries require
high-molecular-weight DNA, which makes them unsuitable for samples
containing only degraded DNA. An in silico method
that generates mate-pair libraries from a reference genome was devised
to assemble target genomes. Although this method significantly improved
the contiguity and completeness of assembled genomes, it introduced a
high level of errors into the assembly, and the strategies for using
reference genomes were not optimized. Here,
we tested different strategies for using reference genomes to generate
in silico mate-pairs. The results showed that using a closely related
reference genome from the same genus was more effective than using
divergent references. Retaining only the in silico mate-pairs conserved
between two references and using them to guide genome assembly reduced the
number of misassemblies (18.6%–46.1%) and increased the contiguity
of assembled genomes (9.7%–70.7%), while maintaining gene
completeness at a level that was either similar to or marginally lower than
that obtained via the current method. Finally, we developed a pipeline
implementing the optimized method and compared it with another
reference-guided assembler, RagTag. RagTag produced longer scaffolds (17.8
Mbp vs. 3.0 Mbp) but had a much higher misassembly rate
(85.68%) than our optimized in silico mate-pair method. The optimized
in silico pipeline developed in this study should facilitate further
studies on genomics, population genetics and conservation of endangered
species.
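
The idea of keeping only in silico mate-pairs that are conserved between two references can be illustrated with a minimal sketch. It is not the pipeline developed in this study. It assumes that mate-pairs generated from a first reference with a known insert size have already been aligned to a second reference, and that a pair counts as conserved when both mates land on the same sequence, on opposite strands, at a distance close to that insert size. The names `Hit`, `is_conserved`, `filter_conserved` and the `tolerance` parameter are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Hit:
    """Placement of one mate on the second reference (hypothetical structure)."""
    contig: str
    pos: int      # leftmost coordinate of the alignment
    strand: str   # "+" or "-"


def is_conserved(first: Optional[Hit], second: Optional[Hit],
                 insert_size: int, tolerance: float = 0.2) -> bool:
    """Return True if a mate-pair made from reference A is consistent on reference B.

    Assumed criteria (not taken from the paper): both mates align, to the
    same sequence, on opposite strands, and their separation deviates from
    the expected insert size by at most `tolerance` (as a fraction).
    """
    if first is None or second is None:
        return False  # at least one mate did not align
    if first.contig != second.contig:
        return False  # mates fall on different sequences
    if first.strand == second.strand:
        return False  # unexpected relative orientation
    span = abs(second.pos - first.pos)
    return abs(span - insert_size) <= tolerance * insert_size


def filter_conserved(placements: Dict[str, Tuple[Optional[Hit], Optional[Hit]]],
                     insert_size: int, tolerance: float = 0.2) -> List[str]:
    """Return the names of pairs that pass the conservation check, in input order."""
    return [name for name, (first, second) in placements.items()
            if is_conserved(first, second, insert_size, tolerance)]


# Toy placements on the second reference for pairs built with a 3 kbp insert.
placements = {
    "p1": (Hit("chr1", 1000, "+"), Hit("chr1", 3900, "-")),  # consistent
    "p2": (Hit("chr1", 1000, "+"), Hit("chr2", 4000, "-")),  # different sequences
    "p3": (Hit("chr1", 1000, "+"), Hit("chr1", 9000, "-")),  # distance too large
    "p4": (Hit("chr1", 1000, "+"), None),                    # one mate unaligned
    "p5": (Hit("chr1", 1000, "+"), Hit("chr1", 4000, "+")),  # same strand
}
kept = filter_conserved(placements, insert_size=3000)
print(kept)
```

In this toy input only `p1` passes all three checks. In practice the placements would come from a read aligner, and the thresholds would need tuning to the divergence between the two references.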