Structural variant discovery and genotyping

Short-read structural variant discovery was conducted with Delly v0.8.7 (Rausch et al., 2012), Manta v1.6.0 (X. Chen et al., 2016) and the wrapper programme Smoove v0.2.8 (Pedersen et al., 2020), which implements Lumpy-sv v0.2.13 for SV discovery (Layer et al., 2014), annotates variants with Duphold v0.2.1 (Pedersen & Quinlan, 2019) and genotypes SVs with SVTyper v0.7.0 (Chiang et al., 2015). Long-read SV discovery was conducted using CuteSV v1.0.11 (Jiang et al., 2020) and Sniffles v2.0.7 (Sedlazeck et al., 2018), and raw individual calls were refined for population genotyping using Jasmine v1.1.5 (Kirsche et al., 2023).
Each SV discovery tool differs in approach. For the short-read based discovery strategies, both Delly and Smoove (i.e., Lumpy-sv) implement two algorithms (paired-end and split-read), while Manta implements three (paired-end, split-read and assembly-based). The short-read tools also differ in the suggested strategy for population-level SV discovery. Both Delly and Smoove iterate through individual samples and subsequently merge SV calls for individual genotyping, whereas Manta recommends conducting SV discovery with all available samples at once to increase power (X. Chen et al., 2016). However, due to the assembly-based algorithm, Manta is computationally resource-heavy, and running >10 individuals at ~30x sequence coverage set can often exceed 125 Gb RAM (as observed in the Kakāpō125+ data). In instances where computational resources are limited, samples may be run in batches or individually, although this is not recommended due to the loss of power to resolve SVs and the challenges associated with merging variants called in different sample batches (Github, 2016a; Github, 2016b).
To assess the impacts of using a batched vs. joint calling strategy for SV discovery, Manta was run in two ways: 1) a batched approach where samples were divided into 14 batches (7 male batches and 7 female batches) with an average of 11 individuals per batch (Manta-Batch); and 2) a joint approach where all males were run together and all females were run together. For both datasets, male and female SV discovery was conducted separately due to the inclusion of the W chromosome in female alignments (Manta-Joint). In both cases, variants were merged into ‘batched’ and ‘joint’ datasets with BCFtools v1.12 (Danecek et al., 2021) with the merge -m all flag.
Long-read SV discovery approaches must incorporate methods to account for the low accuracy associated with long-read sequence data (Jiang et al., 2020; Sedlazeck et al., 2018). The two tools included here (CuteSV and Sniffles) also attempt to address two challenges associated with alignment-based SV discovery. For example, CuteSV uses multiple signature extraction methods to distinguish SVs from the background noise of long-read data, then implements clustering and refinement approaches to increase sensitivity and identify the signature of heterozygous SVs (Jiang et al., 2020). Sniffles similarly identifies the signature of different SV classes but implements additional methods to resolve nested SVs (Sedlazeck et al., 2018). SV discovery for both tools is performed on an individual-basis. Jasmine, which implements a modified minimum spanning forest algorithm, was used to merge SVs detected in individual kākāpō in each call set in preparation for population-scale genotyping with the available short-read data (Kirsche et al., 2023).
Regardless of discovery strategy, nominal genotype outputs from SV discovery tools are generally regarded as unreliable (Chander et al., 2019). To address this, both Delly and Smoove include genotyping programs (delly genotype, and SVTyper respectively), yet Manta, CuteSV and Sniffles do not. To genotype these call sets at the population-scale, SVs were filtered (as described below) and genotyped using the aligned kākāpō125+ short-reads with the genotyping tool BayesTyper v1.5 (Sibbesen et al., 2018). BayesTyper uses alignments of k-mers to a variant graph and reference genome, then implements a probabilistic model of k-mer counts to genotype individuals. BayesTyper has the benefit of being able to genotype a wide range of genomic variants (e.g., SNPs, small INDELs and SVs), in fact the inclusion of SNP data is recommended as it aids in matching relevant k-mers to sequence reads (Github, 2019.). Each VCF output from Manta was processed with the program BayesTyperTools convertAllele to convert symbolic allele notations to REF and ALT sequences. This step was not necessary for the long-read based call sets as they already provided REF and ALT sequences. For both Manta call sets (batch and joint), CuteSV and Sniffles, a SNP call set generated with DeepVariant (Guhlin et al., 2022 preprint) was used to aid SV genotyping. All VCFs were normalised, variants left-aligned and any multiallelic sites split with BCFtools norm prior to merging variants with BayesTyperTools combine. Finally, BayesTyper requires the generation of large intermediate files (>2Tb for this dataset) with the tool KMC (Kokot et al., 2017). As recommended, KMC v3.1.1 was run with k=55 and singleton k-mers included (-ci1) and a k-mer bloom filter for each individual was generated with BayesTyperTools makeBloom. Since BayesTyper cannot genotype more than 30 individuals at once, samples were batched into 5 groups of 30 and 1 group of 19 individuals prior to identifying variant clusters with BayesTyper cluster and genotyping with BayesTyper genotype under default settings.

Filtering parameters

Once SV discovery and genotyping were complete, filtering for each SV dataset was conducted in two stages for: 1) SV call quality; and 2) individual genotype quality. The outputs from SV call quality filters were used for comparisons of SV type frequency, size distributions and location (i.e., frequency per chromosome) between tools (described further in the Structural variant analyses analyses section below). For comparisons of genotype consistency and variability among individual kakāpō, the outputs from genotype quality filters were used (see Structural variant analyses below).
Upon completion of SV discovery, removal of SVs marked as low quality, and additional recommended filtering parameters specific to each tool, were implemented using BCFtools. A standardised filtering approach was not applied to outputs from all three short-read tools since each program recommends different statistics to assess the quality of SVs and genotypes. Structural variant filtering for all short-read tools excluded all breakend calls, and SVs ≥50kb in length as these likely represent unresolved complex variants, mapping error, and/or reference bias. Additional filtering for Delly excluded duplications and inversions <300bp, and deletions <50bp using the delly merge -m option. All remaining SVs that did not pass all variant call quality filters were removed with BCFtools (i.e., INFO/FILTER = “PASS”’). This excludes all SVs where paired-end support was <3 and a MAPQ score <20 (Rausch, 2014). Finally, genotype filtering for Delly SVs excluded all sites where <80% of variable genotypes passed all genotype filters with BCFtools (i.e., FMT/FT=”PASS”).
For Smoove, the lumpy_filter program identifies and discards interchromosomal read pair mismatches >3, and those with alternative matches, unless the identified split matches the location of the mate pair. This inbuilt filtering programme also removes reads where the depth is greater than 1,000x, as well as any orphaned reads. Variants are then genotyped and ready for annotation with the Smoove annotate programme. Once thesteps were complete, all breakends, deletions that did not have a depth fold-change relative to flanking regions (FORMAT/DHFFC) < 0.7, and duplications that did not have a depth fold-change relative to bins in the genome with similar GC-content (INFO/DHBFC) > 1.3 were excluded using BCFtools. or genotype filtering, an overall Mean Smoove Het Quality (INFO/MSHQ) ≥ 3 was implemented with BCFtools. The Smoove Het Quality (INFO/SHQ) metric scores confidence in individual heterozygous genotypes where 1 is a low confidence call and 4 is highest, with MSHQ representing the mean score for all heterozygous genotypes (Pedersen, 2018/2022).
Variants for both the Batch and Joint Manta outputs were filtered using BCFtools to exclude all variants <50bp in length, all breakend calls and all variants that did not pass all record-level filters (i.e. INFO/FILTER=PASS). Specifically, this excluded: all sites with a QUAL score <20; deletions and duplications not consistent with diploid expectations; SVs with breakpoint depths >3x the median chromosome depth; SVs <1kb in size where >40% of samples contained a MAPQ score of 0 around either breakend; all SVs that were significantly larger than the paired-read fragment size and did not have paired-read support for the alternate allele in any individual; and finally, SVs where no sample passed all sample-level filters.
CuteSV and Sniffles call sets were refined using Jasmine, followed by the manual removal of all imprecise sites including breakend calls. However, it is notable that while the CuteSV dataset had sufficient read depth to filter for SV specificity (i.e., INFO/IS_SPECIFIC=1), Sniffles did not retain any SVs once this metric was implemented. As a result, the Sniffles call set was not filtered on SV specificity in this study.
The SV call sets for both Manta datasets, CuteSV and Sniffles were genotyped using BayesTyper, which implements four ‘hard’ genotype filtering parameters by default. This includes the exclusion of variants with fixed heterozygous genotypes, alleles with <1 sampled k-mer, genotypes with a posterior probability <0.99, and alleles with k-mer coverage that fall below 1-e-0.275x. Here, x represents the mean of the negative binomial distribution for k-mer coverage for a specific sample (Sibbesen, 2018 Github). All variants with >20% genotypes missing and variants where <80% of genotypes passed all four BayesTyper quality metrics were excluded. Although BayesTyper ships with a programme for converting allele sequences to symbolic alleles (bayesTyperTools convertSeqToAlleleID), we found It challenging to resolve the class of all genotyped variants with this approach (i.e., insertions are incompatible and additional SV classes were changed or remained unresolved). To relate genotyping results back to the called SV class, BCFtools was used to identify the chromosomal positions of the genotyped variants and compared with the locations of SVs prior to file conversion with bayesTyperTools convertAllele.