Once SV discovery and genotyping were complete, filtering for each SV dataset was conducted in two stages for: 1) SV call quality; and 2) individual genotype quality. The outputs from SV call quality filters were used for comparisons of SV type frequency, size distributions and location (i.e., frequency per chromosome) between tools (described further in the Structural variant analyses analyses section below). For comparisons of genotype consistency and variability among individual kakāpō, the outputs from genotype quality filters were used (see Structural variant analyses below).
Upon completion of SV discovery, removal of SVs marked as low quality, and additional recommended filtering parameters specific to each tool, were implemented using BCFtools. A standardised filtering approach was not applied to outputs from all three short-read tools, since each program recommends different statistics to assess the quality of SVs and genotypes (Pedersen et al. 2020b; Anon 2022a; Anon 2022b). Structural variant filtering for all short-read tools excluded all breakends, and SVs ≥50kb in length as these likely represent unresolved complex variants, mapping error, and/or reference bias. Additional filtering for Delly excluded duplications and inversions <300bp, and deletions <50bp using the delly merge -m option. All remaining SVs that did not pass all variant call quality filters were removed with BCFtools (i.e., INFO/FILTER = “PASS”’). This excludes all SVs where paired-end support was <3 and a MAPQ score <20 (Anon 2022a). Finally, genotype filtering for Delly SVs excluded all sites where <80% of variable genotypes passed all genotype filters with BCFtools (i.e., FMT/FT=”PASS”).
For Smoove, the lumpy_filter program identifies and discards interchromosomal read pair mismatches >3, and those with alternative matches, unless the identified split matches the location of the mate pair. This inbuilt filtering programme also removes reads where the depth is greater than 1,000x, as well as any orphaned reads. Variants are then genotyped and ready for annotation with the Smoove annotate programme. Once these steps were complete, all breakends, deletions that did not have a depth fold-change relative to flanking regions (FORMAT/DHFFC) < 0.7, and duplications that did not have a depth fold-change relative to bins in the genome with similar GC-content (INFO/DHBFC) > 1.3 were excluded using BCFtools (Pedersen 2022). For genotype filtering, an overall Mean Smoove Het Quality (INFO/MSHQ) ≥ 3 was implemented with BCFtools (Pedersen et al. 2020b). The Smoove Het Quality (INFO/SHQ) metric scores confidence in individual heterozygous genotypes where 1 is a low confidence call and 4 is highest, with MSHQ representing the mean score for all heterozygous genotypes (Pedersen et al. 2020b).
Variants for both the Batch and Joint Manta outputs were filtered using BCFtools to exclude all variants <50bp in length, all breakend calls and all variants that did not pass all record-level filters (i.e. INFO/FILTER=PASS). Specifically, this excluded: all sites with a QUAL score <20; deletions and duplications not consistent with diploid expectations; SVs with breakpoint depths >3x the median chromosome depth; SVs <1kb in size where >40% of samples contained a MAPQ score of 0 around either breakend; all SVs that were significantly larger than the paired-read fragment size and did not have paired-read support for the alternate allele in any individual; and finally, SVs where no sample passed all sample-level filters.
Filtering of the CuteSV and Sniffles call sets was relatively simple, with all imprecise sites excluded from both call sets. However, it is notable that while the CuteSV had sufficient read depth to filter for SV specificity (i.e., INFO/IS_SPECIFIC=1), Sniffles did not retain any SVs once this metric was implemented. As a result, the Sniffles call set was not filtered on SV specificity in this study.
The SV call sets for both Manta datasets, CuteSV and Sniffles were genotyped using BayesTyper, which implements four ‘hard’ genotype filtering parameters by default. This includes the exclusion of variants with fixed heterozygous genotypes, alleles with <1 sampled k-mer, genotypes with a posterior probability <0.99, and alleles with k-mer coverage that fall below 1-e-0.275x. Here, x represents the mean of the negative binomial distribution for k-mer coverage for a specific sample (Sibbesen, 2018 GitHub). All variants with >20% genotypes missing and variants where <80% of genotypes passed all four BayesTyper quality metrics were excluded. Although BayesTyper ships with a programme for converting allele sequences to symbolic alleles (bayesTyperTools convertSeqToAlleleID), we found It challenging to resolve the class of all genotyped variants with this approach (i.e., insertions are incompatible and additional SV classes were changed or remained unresolved). To relate genotyping results back to the called SV class, BCFtools was used to identify the chromosomal positions of the genotyped variants and compared with the locations of SVs prior to file conversion with bayesTyperTools convertAllele.
Structural variant analyses
Structural variants were counted for each SV discovery tool prior to and after filtering. To explore the level of call consensus between these outputs, the number of overlapping SVs were identified using SURVIVOR v1.0.7 (Jeffares et al. 2017) in 1kb, 500bp, 50bp windows and for exact overlaps. To count as a consensus call, SV type and strand were required to match and a minimum variant length of 50bp were required. To assess whether some chromosomes carried more SVs relative to their size, we estimated the number of SVs per chromosome and the proportion of base-pairs of each chromosome within an SV (i.e., the sum of all SV lengths for a given chromosome / chromosome size).
Following SV discovery across the six approaches, all individuals were genotyped using the aligned kākāpō125+ short-read dataset. The genotype filtered SV data for all six variant call sets were used for comparisons of individual variability, assessing shifts in the the number of SVs per generation, and to assess population structure of SVs. When reporting the number of SVs per individual and number of SVs among kākāpō cohorts, we use presence or absence of SVs per individual. That is, we consider genotypes as evidence of whether or not the individual carries the SV (0/1 & 1/1 = carrier; 0/0 = non-carrier). Both Fiordland- and Rakiura-derived birds (herein, founders) were used for comparisons across three cohorts (n = 1, 3, 4 for Fiordland founders, F1 and F2 and n = 40, 60, 10 for Rakiura founders, F1 and F2 respectively). Due to the lek mating system and a relatively long lifespan, the kākāpō population has had significant backcrossing through the generations. Therefore, the F1 and F2 generations represented here excluded all individuals with backcrossed lineages, as this may bias true generational patterns in SVs carried by individuals. Finally, to compare variability in the SVs carried by individual kākāpō, genotypes from the genotype filtered SV data for all four strategies was used to conduct a discriminant analysis of principal components (DAPC) with the adegenet R package (Jombart 2008). Only individuals used for generational comparisons (n = 118) were used to assess individual variability and SV population structure.
In the absence of a previously validated catalogue of SVs, neither a ‘true’ nor ‘false’ positive rate of detection could be assessed. Despite not being able to estimate the precision and accuracy of SVs called in our data, we aimed to test the consistency of genotyping results using Mendelian Inheritance tests with parent-offspring trios. Although this does not eliminate the possibility of systematic error, nor does it provide an indication of the precision or accuracy of SV detection, departures from Mendelian Inheritance may indicate inconsistency of genotyping within a given SV call. Genotyping consistency is an important consideration for population studies as patterns of population structure or inferences about local adaptation may be impacted by inconsistencies.
To identify SVs that violate Mendelian Inheritance patterns, the BCFtools +mendelian plugin was used. Pedigree data provided by the New Zealand Department of Conservation identified 120 parent-offspring trios consisting of 158 unique individuals in the Kākāpō125+ sequence data. We tested SV genotypes by calculating the proportion of Mendelian Inheritance errors relative to the number of non-missing genotypes (i.e., GT != “mis”). Four thresholds were tested where adherence to Mendelian Inheritance expectations were either 100%, ≥95%, ≥90% and ≥80% of genotypes passed. It is important to note that not all 169 sequenced individuals were represented in pedigree trios, as they may not have descendants or antecedents represented in the short-read data analysed here. In addition, some individuals are represented multiple times in different family groups. This bias towards highly represented individuals in the kākāpō breeding population may not adequately capture all SVs called within the population. As such, we did not filter SVs using Mendelian Inheritance errors for downstream analysis. Rather, these tests may provide some insights into the relative performance of genotyping approaches among the pipelines used here.