The impact of reference sequences
In Test II, when the variance in the reference sequences reaches 18%, the
number of high-quality assembly results significantly decreases. In our
test datasets, each gene included 50 reference sequences with the same
mutation rate, which mildly increased the software’s tolerance to
variance. In practice, one does not know the actual mutation rate
between the reference sequence and the target sequence in advance,
although it can be estimated from pairwise comparisons among the
reference sequences. For the best performance of GeneMiner, we
suggest that the average mutation rate in the pairwise comparison of
reference sequences should not exceed 15%.
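
The pairwise estimate described above can be sketched as follows. This is a minimal illustration, not part of GeneMiner: it assumes the reference sequences for a gene have already been aligned to equal length, and it uses the uncorrected proportion of differing sites (p-distance) as a stand-in for the mutation rate. The function names and the example sequences are hypothetical; only the 15% threshold comes from the text.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Columns with a gap or N in either sequence are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else 0.0

def mean_pairwise_distance(seqs):
    """Average p-distance over all unordered pairs of sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)

# Hypothetical aligned reference sequences for one gene.
references = [
    "ATGGCTAGCTAGGATCCGTA",
    "ATGGCTAGCTAGGATCCGTT",
    "ATGGATAGCTAGGTTCCGTT",
]

THRESHOLD = 0.15  # the 15% limit suggested in the text
rate = mean_pairwise_distance(references)
print(f"mean pairwise difference: {rate:.3f}")
print("within suggested limit" if rate <= THRESHOLD else "exceeds suggested limit")
```

An uncorrected p-distance tends to underestimate divergence between distant sequences, so a value close to the threshold is best treated with caution.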
Often, users have only a limited number of reference sequences
available. If this is the case, we suggest that users use all
the genes of species from the same family as reference sequences,
without giving special consideration to the mutation rate of those
sequences. In particular, if only one or a few closely related
reference sequences are available, the tolerance of GeneMiner to variance
in the reference sequences will decrease significantly. This is because the k-mer based
pre-filtering method is less effective at providing correct reads based
on limited reference sequences. Additionally, the assembly process might
fail to acquire the correct seed or to allocate enough weight to the
correct results. In such cases, if providing more reference sequences is
not possible, users should try to reduce the values of kf and ka in the
program (corresponding to “k1” and “k2”, respectively, in GeneMiner).
This adjustment should increase the output and produce longer contigs,
but it might also increase the number of assembly errors in the results.
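
The trade-off in reducing the filtering k-mer size can be illustrated with a toy example. The sketch below is not GeneMiner's implementation; it assumes a simple filter that keeps a read if it shares at least one exact k-mer with any reference sequence. The function names and sequences are hypothetical.

```python
def kmers(seq, k):
    """Set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def filter_reads(reads, references, k):
    """Keep reads that share at least one exact k-mer with any reference."""
    ref_kmers = set()
    for ref in references:
        ref_kmers |= kmers(ref, k)
    return [read for read in reads if kmers(read, k) & ref_kmers]

# Hypothetical single reference sequence.
reference = "ACGTTGCAAGCTTAGGCATC"

# A read from the target gene that differs from the reference at three
# sites, so its longest exact match to the reference is 5 bases.
divergent_read = "ACGTAGCAATCTTACGCATC"

# A read unrelated to the target gene.
unrelated_read = "TTTTTTTTTTTTTTTTTTTT"

reads = [divergent_read, unrelated_read]

for k in (6, 5, 2):
    kept = filter_reads(reads, [reference], k)
    print(k, len(kept))
```

In this example the divergent read is lost at k = 6 and recovered at k = 5, while at k = 2 the unrelated read also passes the filter. This mirrors the behaviour described above: a smaller k recovers more reads from a divergent target but admits more spurious ones.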
Paralogous genes can also affect the results, because paralogs can
lead to significant changes in branch lengths and even in the topology
of the phylogenetic tree. The most prevalent methods for identifying
orthologs rely on sequence similarity and sequence length, such as
reciprocal best hits (RBH), where two genes from different genomes are
assumed to be orthologs if each finds the other as its best hit in the
other genome. For GeneMiner, we
use reference-guided assembly and a constraint on sequence length to
obtain putative orthologs. If there are multiple highly similar paralogous
genes in both the reference sequences and the target species, the target
gene may be incorrectly assembled from different paralogous genes.
Therefore, we recommend that researchers remove paralogous genes from
the reference sequences before providing them, to obtain more accurate
results. Similarly, GeneMiner is not designed to handle polyploids and
is not robust to them. Researchers should aim to use sequencing
data from diploid species whenever possible. Additionally, GeneMiner can
export contigs with degenerate bases, which can be further improved
using our PPD script (Zhou et al., 2022). This feature allows us to
accurately identify putative paralogs that may be overlooked by other
tools. PPD identifies paralogs by detecting shared heterozygosity at a
locus between individuals and heterozygosity at a locus within
individuals.
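
The idea of shared heterozygosity can be sketched with a small example. This is not PPD's actual algorithm, only an illustration under simplifying assumptions: the contigs of several individuals are aligned to equal length, heterozygous sites are encoded as IUPAC degenerate bases, and a site is flagged when at least a chosen fraction of individuals is heterozygous there. The function name, the 0.5 threshold, and the sequences are hypothetical.

```python
# IUPAC codes that represent more than one nucleotide (N is treated as missing).
IUPAC_AMBIGUOUS = set("RYSWKMBDHV")

def shared_heterozygous_sites(alignment, min_fraction=0.5):
    """Return 0-based alignment positions where heterozygosity is shared.

    alignment maps an individual's name to its aligned contig. A site is
    flagged when at least two individuals, and at least min_fraction of
    the individuals with a called base, carry a degenerate base there.
    """
    seqs = [s.upper() for s in alignment.values()]
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("contigs must be aligned to the same length")
    flagged = []
    for pos in range(length):
        called = [s[pos] for s in seqs if s[pos] not in "-N?"]
        if not called:
            continue
        het = sum(base in IUPAC_AMBIGUOUS for base in called)
        if het >= 2 and het / len(called) >= min_fraction:
            flagged.append(pos)
    return flagged

# Hypothetical aligned contigs from three individuals.
alignment = {
    "ind1": "ATGRCCTA",
    "ind2": "ATGRCCTA",
    "ind3": "ATGACYTA",
}

print(shared_heterozygous_sites(alignment))
```

Here position 3 is heterozygous (R) in two of the three individuals and is flagged, whereas position 5 is heterozygous (Y) in only one individual and is not. A site that is heterozygous in many individuals at once is more plausibly explained by reads from two paralogous copies collapsing into one contig than by allelic variation.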