(1) Seed selection
As the initial step of sequence extension, seed selection is crucial in
determining the accuracy and completeness of an assembly. To simplify
this process, GeneMiner quickly scans through sequencing data to
automatically identify and select appropriate seeds without any manual
work. This saves time and effort while ensuring optimal assembly. We assume
that k-mers occurring at high frequency in both the reference and
sequencing data represent conserved regions with a higher probability of
occurrence in the target genes. To achieve an unbiased selection of
candidate seeds, we apply a weighted seed model that accounts for the
k-mer counts in both the reference and sequencing data. This model
assigns a weighted score to each candidate seed, with a stronger
preference toward conserved regions in the reference, to reduce the risk
of selecting high-frequency false positives or repeat regions in the
sequencing data. If the assembly produced from a particular candidate seed
is unsatisfactory in terms of assembly length and completeness,
GeneMiner will select a new candidate seed to optimize assembly
performance.
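
The seed selection procedure described above can be illustrated with a minimal sketch. The scoring formula is not given in the text, so the one used here (a log-damped sum in which the reference count carries a larger weight, `ref_weight`) is purely an assumption for illustration, as are the function names `count_kmers`, `rank_seeds`, and `pick_seed` and the callbacks `assemble` and `is_satisfactory`.

```python
from collections import Counter
import math


def count_kmers(sequences, k):
    """Count every k-mer occurring in an iterable of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def rank_seeds(ref_counts, read_counts, ref_weight=2.0):
    """Rank k-mers present in both data sets, best candidate first.

    Hypothetical score (an assumption, not the published formula):
        ref_weight * log(1 + ref_count) + log(1 + read_count)
    ref_weight > 1 expresses the stronger preference for the reference;
    the logarithm damps very high read counts such as repeats.
    """
    shared = ref_counts.keys() & read_counts.keys()
    scores = {
        kmer: ref_weight * math.log1p(ref_counts[kmer])
        + math.log1p(read_counts[kmer])
        for kmer in shared
    }
    # Descending score; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda kmer: (-scores[kmer], kmer))


def pick_seed(ranked_seeds, assemble, is_satisfactory):
    """Try seeds in rank order until one yields an acceptable assembly."""
    for seed in ranked_seeds:
        contig = assemble(seed)
        if is_satisfactory(contig):
            return seed, contig
    return None, None
```

In this sketch, a k-mer that is abundant in the reads but absent from the reference (for example a low-complexity repeat) never becomes a candidate, and `pick_seed` falls back to the next-ranked candidate whenever the supplied quality check rejects an assembly.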
(2) Weighted node model
The de Bruijn graph is the foundation of almost all
short-read genome and transcriptome assembly tools (Bao et al., 2014;
Cameron et al., 2017; Chang et al., 2015; Li et al., 2017; Pandey et
al., 2017). In genomics, each node in a de Bruijn graph
represents a k-mer. A directed edge connects one node to another
when the (k-1)-length suffix of the first matches the (k-1)-length prefix
of the second.
The k-mers are most often derived from unassembled DNA sequencing reads.
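
The graph definition above can be sketched in a few lines of Python. This is a minimal illustration that assumes error-free reads and ignores reverse complements; the function names `kmers_of` and `build_de_bruijn` are hypothetical and not part of GeneMiner.

```python
from collections import defaultdict


def kmers_of(reads, k):
    """Collect the set of distinct k-mers present in the reads."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}


def build_de_bruijn(reads, k):
    """Map each k-mer node to the sorted list of its successor nodes."""
    nodes = kmers_of(reads, k)
    # Index nodes by their (k-1)-length prefix so that finding the
    # successors of a node is a single dictionary lookup.
    by_prefix = defaultdict(list)
    for node in nodes:
        by_prefix[node[:-1]].append(node)
    # A node's successors are the nodes whose prefix equals its suffix.
    return {node: sorted(by_prefix.get(node[1:], [])) for node in sorted(nodes)}
```

For the reads `ACGTC` and `CGTA` with k = 3, the node `CGT` has two successors, `GTA` and `GTC`, which is exactly the kind of non-unique connection that a scoring scheme has to resolve.
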
GeneMiner uses de Bruijn graphs to
establish connections between k-mers. We employ a weighted node model
that combines information from both the reference sequences and the input
reads to guide seed selection and node connection, and we use depth-first
search with a stack to enable efficient greedy seed extension and
backtracking. The weighted node model covers both seed selection
and node-to-node connection choices, assigning a weighted score to each
candidate. By taking into account information from both the reads and the
reference sequences, the model balances the
impact of sequencing errors and reference bias on the assembly. As a result,
the weighted node model helps choose among the large numbers
of non-unique node-to-node connections and candidate seeds.
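
The combination of greedy extension, a stack, and backtracking can be sketched as follows. This is an illustrative example under stated assumptions, not GeneMiner's implementation: `weights` is a hypothetical dictionary mapping each k-mer node to its weighted score, extension proceeds only to the right, and `target_len` and `max_steps` are assumed stopping criteria.

```python
def extend_seed(seed, weights, target_len, max_steps=10_000):
    """Extend `seed` rightwards by greedy depth-first search.

    The highest-weight successor is always explored first. When a path
    reaches a dead end, the search backtracks to the most recent
    untried alternative on the stack. Returns the first contig that
    reaches `target_len`, or else the longest contig found.
    """
    k = len(seed)
    best = [seed]
    stack = [[seed]]  # each entry is a path of k-mers still to explore
    steps = 0
    while stack and steps < max_steps:
        steps += 1
        path = stack.pop()
        if len(path) > len(best):
            best = path
        # A path of n overlapping k-mers spells a contig of n + k - 1 bases.
        if len(path) + k - 1 >= target_len:
            break
        suffix = path[-1][1:]
        successors = [
            suffix + base
            for base in "ACGT"
            if suffix + base in weights and suffix + base not in path
        ]
        # Push in ascending weight order so the best node is popped next.
        successors.sort(key=lambda node: weights[node])
        stack.extend(path + [node] for node in successors)
    return best[0] + "".join(node[-1] for node in best[1:])
```

With the weights `{"ACG": 5, "CGT": 5, "GTA": 9, "GTC": 3, "TCA": 2, "CAG": 2}` and the seed `ACG`, the search first follows the higher-scoring branch through `GTA`. If a contig of 7 bases is requested, that branch dead-ends, and the search backtracks to `GTC` and continues from there. Storing whole paths on the stack keeps the sketch short; a production implementation would more likely store only branch points.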