Discussion
In this study we optimized and applied a multiplexed long-read
sequencing approach that uses high-quality short-read exome
data to perform routine phasing of de novo mutations. We report
the phasing results of 77 DNMs from 64 of our 77 patient-parent trios
linked to male infertility, achieving successful phasing for 71% of the
109 DNMs investigated using this long-read targeted approach. In
contrast, only 9 of these DNMs (8%) could be reliably phased based on
short-read WES data alone.
Short-read exome sequencing has become an increasingly common tool in
the research and diagnosis of genetic disease, with patient-parent
trio-based sequencing now routine for the detection of DNMs. With only 8%
of DNMs being phasable in our cohort of 77 patients when using
short-read WES alone, it is clear that an alternative approach is needed
to determine the parent-of-origin and timing of DNMs. Our method uses
long-range PCR with standard optimisation steps to achieve large-scale
success as simply and quickly as possible. PCR is a simple and standard
wet lab practice, providing greater enrichment and target specificity
than any alternative target-based approach. Sequencing the
target-enriched long DNA strands with ONT allowed us, in most cases, to
acquire tens of thousands-fold coverage per target, with many targets
run per flow cell, supporting the scale demands of the project. To
overcome challenges with sequencing error and postzygotic mutations, we
used the WES data and Sanger-validated DNMs to polish the variant
analysis. This limited computational demand, as no complex algorithms
were required and processing could be quick.
DNMs are known to arise from mutational events occurring during
gametogenesis, predominantly during spermatogenesis rather than
oogenesis. This bias is assumed to be associated with the scale of male
gamete production and the failure of DNA repair mechanisms, which
together increase the opportunity for mutational events to occur (Aitken
& Baker, 2020; Evenson et al., 2020; Grégoire et al., 2013; Haldane,
1947; Kong et al., 2012). Previous literature has shown that DNMs occur
on the
paternal allele approximately 80% of the time (Kong et al., 2012;
Goldmann et al., 2016; Yuen et al., 2016). In agreement with this
literature, 83% of all phased DNMs in this study were determined to be
of paternal origin.
Parent-of-origin and zygosity information adds another layer to our
understanding of potential disease-causing variants. This is important
when investigating genetic diseases, especially those that likely have
complex and varied mechanisms. In our cohort of 77 patients, 51 patients
were confirmed to suffer from non-obstructive severe oligospermic or
azoospermic phenotypes. In the original publication related to this work
(Oud et al., 2022), we showed that 6 out of the 8 likely causative DNMs
identified in these patients were of paternal origin (Supplementary
Tables 8 and 14, Supplementary Figure 6). This suggests that DNMs with a
deleterious effect on the health of an individual can escape negative
selection in the paternal germline.
Accurate detection of the DNM allele frequency is critical to
differentiate prezygotic from postzygotic mutational events, which is
important in clinical settings for estimating the recurrence risk
(Almobarak et al., 2020; Scanga et al., 2021). Our approach yielded a
highly accurate
allele frequency average of 49.6% in the prezygotic mutations, with an
SEM of 0.84% (Supplementary Table 13). Though similar accuracy may be
achievable with more computationally demanding methods, the strength of
our method lies in utilizing the WES data and DNM validation practices
commonly available. This shows that bioinformatic cleaning and more
complex haplotype processing steps are unnecessary, with accurate
results achievable through simple DNM and DNM-anchored iSNP selection.
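To illustrate the allele-frequency summary described above, the following minimal Python sketch computes per-DNM variant allele fractions from read counts, together with their mean and standard error of the mean. The function names and read counts are hypothetical and are not taken from the study's pipeline or data.

```python
import math
import statistics


def allele_fraction(alt_reads, total_reads):
    """Fraction of reads at the DNM position that carry the mutant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def mean_and_sem(fractions):
    """Return the mean and the standard error of the mean (sample SD / sqrt(n))."""
    if len(fractions) < 2:
        raise ValueError("need at least two values to estimate the SEM")
    mean = statistics.mean(fractions)
    sem = statistics.stdev(fractions) / math.sqrt(len(fractions))
    return mean, sem


# Hypothetical (alt, total) read counts for four prezygotic DNMs.
counts = [(5000, 10000), (4800, 10000), (5200, 10000), (5000, 10000)]
fractions = [allele_fraction(alt, total) for alt, total in counts]
mean, sem = mean_and_sem(fractions)
```

A heterozygous prezygotic DNM is expected to sit near 0.5, whereas a fraction well below that is consistent with a postzygotic event. The cutoff used to separate the two is not stated in this section, so none is encoded in the sketch.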
In total, 8 of the 77 phased DNMs were classified as postzygotic events
(10%), broadly in agreement with the 6.5% to 10% reported in the
literature (Acuna-Hidalgo et al., 2015; Ye et al., 2018; Sasani et al.,
2019), supporting the validity of our method. Interestingly, while there
was a significant correlation between WES and ONT postzygotic
base/allele frequencies, 25% of the postzygotic DNMs could not be
determined from WES DNM base frequencies alone. This demonstrates the
importance of combining phasing analysis with deep-coverage long-read
sequencing to further characterise the timing of DNMs. As can be
expected for postzygotic DNMs (Girard et al., 2016), we see less
paternal bias, although our numbers are small (5 out of 8 postzygotic
DNMs are paternal, 62%).
Here we use a standard PCR amplicon targeting approach with long-read
sequencing, rather than CRISPR-Cas targeting. Although CRISPR-Cas has
recently become a method of choice for long-read targeted sequencing
(Hafford-Tear et al., 2019; Liu et al., 2019; Gilpatrick et al., 2020;
McDonald et al., 2021), the large number of targets and small target
sizes in our cohort would make CRISPR-Cas complex and costly. Standard
PCR targeting is optimal for routine applications that do not require
methylation data, read lengths greater than 10-20 kb, or directly
representative read counts (Aird et al., 2011). While the CRISPR-Cas
approach can achieve a 10-100-fold enrichment of the target region
compared to standard low-coverage long-read WGS, it still results in
95.4% off-target sequencing (Gilpatrick et al., 2020). This off-target
sequencing significantly limits the number of samples that can be
run per flow cell, and only a single sample can be run if demultiplexing
is based on the genomic position of the target. In contrast, with the
standard amplicon approach used here, dozens of samples were run per
flow cell and no off-target mapping was identified. Assuming the optimal
CRISPR-Cas approach of 2-3 gRNAs, and taking into account the reduced
sample number per flow cell, CRISPR-Cas methods also have a >40-fold
higher cost per target. Nonetheless, CRISPR-Cas target enrichment shows
great promise, and will likely be the best approach for targets larger
than 10-20 kb.
Although we did not observe more basecalling error from PCR extension in
larger targets, it is worth considering the potential
increases in base error and bias from PCR approaches, which would
compound the lower accuracy inherent to long-read sequencing. Our data
support the importance of minimizing target region sizes when
performing PCR-based amplification for targeted sequencing, especially
when performing primer optimisation for >100 bespoke primer
pairs. Limiting target sizes will reduce labour-intensive PCR
optimisation and, though not observed in our study, may also reduce
base error from PCR fidelity issues. We should, however, be mindful that
for 11% of the DNMs studied no iSNPs were found within the 5 kb window,
so minimizing the target region can also negatively impact phasing. For
another 18% of DNMs, the sequencing data were of insufficient
quality for phasing purposes, so clearly a balance must be found between
sequencing quality and target size.
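The dependence of phasing on nearby iSNPs can be made concrete with a short Python sketch. It assumes that an iSNP is a site where the child is heterozygous and the trio genotypes allow only one parental assignment of the two alleles, and it treats the window as a maximum distance from the DNM. Both are illustrative assumptions, as are the function names and the genotype encoding; this is not the study's implementation.

```python
def allele_origin(child, father, mother):
    """Map each allele of a heterozygous child genotype to its parent.

    Genotypes are 2-tuples of allele strings, e.g. ("A", "G").
    Returns {allele: "father" | "mother"}, or None when the site is not
    informative (child homozygous, both transmissions possible, or the
    genotypes are Mendelian-inconsistent).
    """
    a, b = child
    if a == b:
        return None
    a_paternal = a in father and b in mother  # a from father, b from mother
    a_maternal = a in mother and b in father  # a from mother, b from father
    if a_paternal == a_maternal:
        return None  # ambiguous (both fit) or inconsistent (neither fits)
    if a_paternal:
        return {a: "father", b: "mother"}
    return {a: "mother", b: "father"}


def informative_snps(dnm_pos, snps, window=5000):
    """Keep informative SNPs within `window` bp of the DNM.

    `snps` is a list of (pos, child, father, mother) tuples. The window
    semantics (distance from the DNM) are an assumption of this sketch.
    """
    kept = []
    for pos, child, father, mother in snps:
        if abs(pos - dnm_pos) > window:
            continue
        origin = allele_origin(child, father, mother)
        if origin is not None:
            kept.append((pos, origin))
    return kept
```

If `informative_snps` returns an empty list for a DNM, no parental haplotype can be assigned to the reads carrying it, which corresponds to the unphasable cases without iSNPs described above. A wider window raises the chance of finding an iSNP at the cost of a longer amplicon.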
Since ONT released the MinION platform in 2014, there have been
substantial advances in both the chemistry and the bioinformatic
tools. This has resulted in raw base accuracy moving from as low as
~60% (Loman and Watson, 2015) to the current 92-97% with
the R9.4.1 flow cell chemistry used in this investigation. It should be
noted that further increases in accuracy have also been suggested for
more recent flow cell chemistry, such as the R10.4.1 flow cell released
this year. Bioinformatic tools, including the variant caller ‘Clair’
used here, have also shown increased confidence in variant calls but are
thought to be reaching their limit, with greater confidence requiring
substantially different algorithms or improvements in chemistry (Luo et
al., 2020). Despite the bioinformatic improvements in base calling and
variant calling, we observe that long-read data on long-range PCR
products still yield far more false-positive variant calls than
short-read WES data. After filtering ONT variants by read depth and
quality scores, our anchored approach removed an additional 50% of the
remaining variants on average. If the false variants that were missed
prior to our anchored filtering approach had been included in the
phasing process, it is likely that some targets would have been phased
incorrectly or not phased at all. Many phasing tools, such as
‘WhatsHap’, carry out phasing on the assumption that the variants
within the VCF file are correct, so the removal of false variants is
important.
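The idea behind the anchored filtering step can be sketched in a few lines of Python. The sketch assumes that anchoring means retaining only those ONT calls that coincide with a variant supported by the short-read trio data or with the Sanger-validated DNM. The exact criteria used in the study are not given in this section, and the function name and variant encoding are hypothetical.

```python
def anchored_filter(ont_calls, wes_calls, validated_dnm):
    """Split ONT calls into those supported by trusted data and the rest.

    Variants are (chrom, pos, ref, alt) tuples. A call is kept only if it
    matches a WES-supported variant or the Sanger-validated DNM
    (an assumption of this sketch, not the published criteria).
    """
    trusted = set(wes_calls) | {validated_dnm}
    kept = [v for v in ont_calls if v in trusted]
    removed = [v for v in ont_calls if v not in trusted]
    return kept, removed


# Hypothetical calls for one amplicon.
dnm = ("chr1", 10500, "C", "T")
wes = [("chr1", 9800, "A", "G"), ("chr1", 11900, "G", "A")]
ont = [
    ("chr1", 9800, "A", "G"),   # iSNP also seen in WES
    ("chr1", 10120, "T", "C"),  # ONT-only call, likely an artefact
    ("chr1", 10500, "C", "T"),  # the validated DNM
    ("chr1", 11340, "G", "T"),  # ONT-only call, likely an artefact
]
kept, removed = anchored_filter(ont, wes, dnm)
```

Only the `kept` variants would then be written to the VCF file passed to a phasing tool, so that the phasing step does not build haplotypes around ONT-only artefacts.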
Our study provides an approach for accurately phasing DNMs and calling
their parent of origin in a set of 77 patients. To our knowledge,
this is the first time that phasing of DNMs has been investigated on
this scale using long-range PCR targeted ONT sequencing, where each
sample has a uniquely specific target. We optimized the method for
efficiency and streamlined the laboratory and computational pipelines
for processing large numbers of DNMs for detailed phasing analysis. We
incorporated additional short-read patient-parent trio sequencing data
and Sanger-validated DNMs that are commonly available from DNM discovery
pipelines like ours. This approach enabled us to improve DNM phasing and
postzygotic calling. This data-supported and anchored phasing approach
can be of great use in both research and diagnostic settings where DNMs
are routinely studied and interpreted.