Introduction
DNA markers from the nuclear and
plastid genomes have been widely applied for phylogenetic, evolutionary,
and ecological studies in the last few decades (Palmer et al., 1985;
Palmer & Thompson, 1982; Palmer & Zamir, 1982). Researchers have
identified numerous single-to-low-copy nuclear genes (Li et al., 2008;
Li et al., 2017; Small et al., 2004; Wu et al., 2006) to estimate the
phylogeny for seed plants from transcriptomic and genomic data. Building
phylogenetic frameworks from different types of next-generation
sequencing (NGS) data and large-scale molecular markers is becoming
commonplace and fundamental for other applied biological studies (Wang
et al., 2020; Wen et al., 2020). Molecular markers can be obtained from
a variety of sequencing resources, including RNA-seq (Wang et al.,
2009), Hyb-Seq (Weitemier et al., 2014), and shallow whole-genome
sequencing (genome skimming) (Straub et al., 2012). For example,
Allen et al. (2017) obtained single-copy orthologs from whole-genome
sequencing data. Zhang et al. (2019) extracted single-copy orthologs and
ultraconserved elements from genome skimming data. Further, Liu et al. (2021)
captured single-copy nuclear genes, organellar genomes, and nuclear
ribosomal DNA from deep genome skimming data. Several tools support this
kind of marker extraction. aTRAM (Allen et al., 2015, 2018) uses a
BLAST-based iterative search-and-assemble approach to extract specific
genes from NGS data (Allen et al., 2017). HybPiper (Johnson et al., 2016)
and HybPhyloMaker (Fér & Schmickl, 2018) filter reads by mapping them to
a reference using BWA/Bowtie2 and subsequently assemble the filtered
reads into contigs. Our previous tool, Easy353 (Zhang et al., 2022),
enables researchers to mine Angiosperms353 (Johnson et al., 2019) genes
from transcriptomic and enriched genomic data using a reference-guided
de Bruijn graph. However, the
aforementioned workflows and tools still present the following
challenges for retrieving phylogenetic markers: (1) some markers are
sequenced at much lower or higher depth than others, so read coverage is
uneven across markers; (2) the stability and accuracy of assembly results
depend on the chosen reference sequence; (3) putative paralogs in the
assembly results can lead to misestimation of branch lengths; and
(4) the tools require high-performance computing servers and advanced
bioinformatics skills.
Molecular markers used in phylogenetics are typically short, and their
arrangement in the genome is often irrelevant to the analysis.
Single-to-low-copy orthologous genes, a small subset of genomic data, are
often used for phylogenetic studies at the genus level or higher
taxonomic levels. It is therefore unnecessary to assemble a complete
genome from the sequencing data to obtain these genes. With these
considerations, we introduce GeneMiner, a pipeline designed for the
extraction of phylogenetic markers from short-read NGS datasets. The
pipeline uses our own reference-guided de Bruijn graph construction
algorithm, which removes the need for independent assembly tools such as
SPAdes or Velvet. Compared to other available tools, GeneMiner can
capture gene fragments from both transcriptome and genome skimming raw
sequencing data more quickly, accurately, and comprehensively.
Importantly, GeneMiner can achieve all of this on personal computers.
Compared to our previous tool, Easy353, GeneMiner adds several new
features: (1) no
restriction on the type of molecular markers, supporting the direct use
of sequences in GenBank format as reference sequences; (2) a
verification method to evaluate the accuracy of recovered target genes;
(3) an optimized weighted node model to accommodate distantly related
reference sequences; (4) a collection of new methods such as
re-filtering, re-assembly, and soft boundary to improve assembly
capability. Additionally, GeneMiner runs on Windows, macOS, and Linux,
provides a user-friendly graphical interface for Windows and macOS users
(Figure 1-A), and offers distinct computational parameters that improve
accuracy over other tools in this category (Figure 1-B).
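
To make the idea of a reference-guided de Bruijn graph concrete, the following is a minimal toy sketch in Python. It is not GeneMiner's implementation: the function names (`kmers`, `filter_reads`, `build_graph`, `extend`), the k-mer size, and the sequences are all hypothetical and chosen only for illustration. The sketch keeps reads that share at least one k-mer with a reference, builds a de Bruijn graph from the retained reads, and greedily extends a seed taken from the reference. It omits the weighting, re-filtering, re-assembly, and verification steps described above.

```python
from collections import defaultdict


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def filter_reads(reads, reference, k):
    """Keep reads that share at least one k-mer with the reference."""
    ref_kmers = set(kmers(reference, k))
    return [r for r in reads if any(km in ref_kmers for km in kmers(r, k))]


def build_graph(reads, k):
    """Build a de Bruijn graph: (k-1)-mer -> {next (k-1)-mer: edge count}."""
    graph = defaultdict(lambda: defaultdict(int))
    for read in reads:
        for km in kmers(read, k):
            graph[km[:-1]][km[1:]] += 1
    return graph


def extend(graph, seed, max_len=1000):
    """Greedily walk rightward from seed along the most frequent edge."""
    contig, node, visited = seed, seed, {seed}
    while len(contig) < max_len and graph.get(node):
        nxt = max(graph[node], key=graph[node].get)
        if nxt in visited:  # stop on a cycle instead of looping forever
            break
        visited.add(nxt)
        contig += nxt[-1]
        node = nxt
    return contig


# Toy data (assumed): the reference differs from the sample at one base.
k = 5
reference = "ATCGCGTACGTTAGC"
sample = "ATGGCGTACGTTAGCCA"
reads = [sample[0:10], sample[4:14], sample[7:17], "TTTTTTTTTT"]

kept = filter_reads(reads, reference, k)
graph = build_graph(kept, k)

# Seed with the first reference k-mer prefix that is present in the graph.
seed = next(km[:-1] for km in kmers(reference, k) if km[:-1] in graph)
contig = extend(graph, seed)
print(kept)
print(contig)
```

In this toy case the off-target poly-T read is discarded, and the walk continues past the end of the reference because the retained reads extend beyond it. A greedy walk of this kind breaks down on repeats and sequencing errors, which is one reason practical tools add node weighting and verification.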