Introduction
De novo mutations (DNMs) arise from mutational events that occur either during gametogenesis in the parental germ cells or postzygotically in the somatic and germ cells of the individual carrying them. On average, one to two DNMs are found in the coding region of a person’s genome (Durbin et al., 2010; O’Roak et al., 2011; Xu et al., 2011). DNMs are of particular significance because of their contribution to many diseases and genetic disorders, notably those affecting individual fitness, such as intellectual disability and male infertility (Awadalla et al., 2010; Veltman and Brunner, 2012; Gilissen et al., 2014; Acuna-Hidalgo et al., 2016; Taylor et al., 2019; Oud et al., 2022). Approximately 80% of DNMs are of paternal origin (Kong et al., 2012; Goldmann et al., 2016; Yuen et al., 2016; Oud et al., 2022). A major factor known to increase the number of DNMs in an individual is advanced parental age at the time of conception, particularly paternal age (Kong et al., 2012; Goldmann et al., 2016). Investigating the parental origin and timing of DNMs not only provides biological insight into how these DNMs arise and how they underlie genetic disorders, but has also been shown to be important for determining the recurrence risk of these disorders (Campbell et al., 2014; Almobarak et al., 2020).
Phasing analysis interrogates the diploid genome, allowing the alleles of the two parental chromosomes to be separated. This not only helps to determine the parental origin and timing of DNMs, but is also critical for identifying compound heterozygous mutations and for examining allele-specific expression, linked variants, and structural variation (Tewhey et al., 2011; Soifer et al., 2020; Ebert et al., 2021). With short-read whole genome sequencing (WGS) of parent-offspring trios, 15-20% of DNMs can be successfully phased and their parent-of-origin called (Goldmann et al., 2016). This percentage is expected to be even lower for whole exome sequencing (WES). These phasing challenges can be attributed to the limited sequencing read lengths, the presence of intronic gaps, and the reduced amount of genetic variation in exonic regions compared to intronic regions (Frigola et al., 2017). By definition, germline DNMs must be absent from the parental somatic cells, so their discovery requires exome or genome sequencing of parent-offspring trios. In a next step, the parent-of-origin and zygosity of a DNM can be identified by targeted amplification and long-read sequencing of a region spanning the DNM as well as one or more parentally informative single nucleotide polymorphisms (iSNPs). While this appears straightforward, long-read sequencing is subject to both random and positional errors, which may introduce false variants into the phasing and reduce the reliability of downstream analysis (Magi et al., 2018; Watson and Warr, 2019).
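The read-level logic of iSNP-based phasing can be illustrated with a minimal Python sketch. It is not the workflow used in this study: the function names, the input representation (one pair of base calls per long read) and the thresholds `min_fraction` and `min_reads` are hypothetical choices made for illustration. The sketch first checks whether a SNP is parentally informative from the trio genotypes, then assigns the DNM to the parental haplotype on which most DNM-carrying reads fall, leaving the call undetermined when the support is too weak to outweigh sequencing error.

```python
from collections import Counter


def informative_alleles(child, father, mother):
    """Return (paternal_allele, maternal_allele) for a SNP, or None.

    Genotypes are 2-tuples of bases. The SNP is informative only if the
    child is heterozygous and exactly one assignment of the child's two
    alleles to the two parents is compatible with the parental genotypes.
    """
    if len(set(child)) != 2:
        return None  # a homozygous child site cannot separate haplotypes
    a, b = child
    for pat, mat in ((a, b), (b, a)):
        forward_ok = pat in father and mat in mother
        reverse_ok = mat in father and pat in mother
        if forward_ok and not reverse_ok:
            return pat, mat
    return None


def phase_dnm(reads, dnm_alt, paternal_allele, maternal_allele,
              min_fraction=0.8, min_reads=10):
    """Assign a DNM to a parental haplotype from long reads.

    reads: iterable of (base_at_dnm, base_at_isnp), one pair per read that
    spans both positions. min_fraction and min_reads are illustrative
    thresholds (assumptions), not values taken from the study.
    """
    # Count iSNP alleles only on reads that carry the DNM allele.
    counts = Counter(
        isnp for dnm, isnp in reads
        if dnm == dnm_alt and isnp in (paternal_allele, maternal_allele)
    )
    total = sum(counts.values())
    if total < min_reads:
        return "undetermined"
    if counts[paternal_allele] / total >= min_fraction:
        return "paternal"
    if counts[maternal_allele] / total >= min_fraction:
        return "maternal"
    return "undetermined"


# Toy example: father A/A, mother A/G, child A/G at the iSNP,
# so the child's G allele must be maternal and the A allele paternal.
alleles = informative_alleles(("A", "G"), ("A", "A"), ("A", "G"))
toy_reads = [("T", "A")] * 18 + [("T", "G")] * 2 + [("C", "G")] * 20
origin = phase_dnm(toy_reads, "T", *alleles)
```

In this toy example, 18 of the 20 reads carrying the DNM allele also carry the paternal iSNP allele, so the DNM is assigned to the paternal haplotype; the two discordant reads stand in for sequencing or amplification errors.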
There are numerous methodologies for enriching target genomic regions prior to sequencing, most of which are either PCR- or CRISPR-based (Hafford-Tear et al., 2019; Gilpatrick et al., 2020; Player et al., 2020). Importantly, when sequence data from CRISPR-based targeting approaches are mapped to the reference genome, off-target mapping is several-fold greater than with PCR-based methods, so target coverage is often many times lower (Hafford-Tear et al., 2019; McDonald et al., 2021), and the cost per target is significantly higher. PCR approaches also have inherent challenges, including the presence of inhibitors, variable target length, optimisation time, amplification bias, and nucleotide errors (Potapov and Ong, 2017; Shagin et al., 2017). Despite these challenges, long-range PCR enrichment is arguably the more effective approach for scaled-up targeted phasing at present.
This study aims to identify the parent-of-origin and zygosity of 109 distinct DNMs previously identified in infertile men by patient-parent trio exome sequencing (Oud et al., 2022), using a targeted long-range PCR approach for phasing. Targeted amplification of the region encompassing each unique DNM is performed using an optimised long-range PCR workflow designed to quickly increase PCR success rates and reduce pre-sequencing base errors. The combination of patient-parent trio exome data, targeted Oxford Nanopore Technologies (ONT) sequencing and validated DNMs is used to improve confidence in phasing and allele frequency. Critical aspects of the process are assessed to ascertain the practicality of the approach for large-scale use, including amplification length, long-read sequencing error rates and overall phasing performance.