Assembly
Similar to Easy353, GeneMiner uses a reference-guided de Brujin graph assembly with seed self-discovey and a greedy extension method (Figure 2-B). Using this design, a k-mer is initially chosen as a seed, acting as the starting point for the assembly and it is self-discovered from the datasets. After identifying the seed, a greedy extension technique is applied, progressively appending the most probable next k-mer at each stage. This is completed through utilization of information from the de Bruijn graph and k-mer count. The reference sequence is utilized throughout the program to ensure the assembled sequence resembles the reference sequence.
The GeneMiner method is updated from our Easy353 program in the following two ways: seed selection and the method for extending the seed into contigs. For choosing the seed, GeneMiner splits both of the reference sequences and filtered reads into k-mers usingka (different from the kfused in the filtering) and builds k-mer sets as R and T , respectively. The intersection of the two sets becomes the potential seed set. Seeds are weighted, and those with higher weights are prioritized. In Easy353, only the number of reads were considered (Wseed = C read ). However, in GeneMiner, Wseed is jointly determined by both reference sequences and reads. The weight is calculated as\(W_{\text{seed}}=C_{\text{ref}}\ast\operatorname{}{{(C}_{\text{read}}+}1)\), where Cread represents the counts of the k-mer in the set T , and Cref represents the counts of the k-mer in the set R. For extending the seed into contigs, each k-mer is taken as a node and assigned a weighted score according to its frequency and position in the reference sequences. In Easy353, the node-weighted score only considered the position of a k-mer in the reference sequences, and the weight formula used was\(W_{\text{node\ }}={C_{\text{read}}}^{(1-P_{R})}\). However, in GeneMiner, the node-weighted score is calculated as\(\text{\ \ }W_{\text{node\ }}={C_{\text{read}}\ast e}^{(1-|P_{R}-\ P_{T}|)}\), where Cread represents the occurrences of the k-mer in the set T , PR represents the average percentage distance of the k-mer in the set R , andPT represents the percentage distance of the k-mer in target gene. If a k-mer appears at multiple positions in a reference sequence, the Wnode is downgraded toCread to avoid the introducing of insertion errors or duplication issues. In the case of a given seed, any secondary branches branching off from the main path, such as ”bubbles” in the graph, will be disregarded during sequence reconstruction as long as they have a lower score.
Verifying results
One valuable step, which is often overlooked, when designing tools for NGS data, is verification of the results. GeneMiner employs a novel verification method based on mutation models and repeated resampling to assess the influence of the reference and also confirm the stability of assembly results statistically (Figure 2-C). The mutation model is generated by calculating both the variation rate and the indel rate in the reference sequences. Based on the initially generated contig (Seq1 ), mutations and resamples are applied using the mutation model to obtain ref1 ,ref2 …. refn . All simulated ref1 , ref2refn are implemented as the new reference for the whole assembly process and get each new recovered target gene is noted as Seq2 . A bootstrap score is then used to assess the stability of the assembly results, calculated as\(\frac{N}{L}\ast 100\), where N represents the matched base number in the pairwise alignment Seq1 andSeq2 , and L represents the alignment length. A recovered target gene with a high bootstrap scoreindicates a stable assembly and a probably well-chosen reference data. Conversely, a recovered target gene with a low bootstrap scorerequires further consideration for reference selection and is probably not able to be used for phylogenetic studies.