Discussion
We explored six strategies for SV discovery and genotyping with short- and long-read data in the critically endangered kākāpō. We found that the choice of SV discovery tool heavily influenced the overall count, location, and size distribution of the SV types characterised. Further, the proportion of SVs retained after filtering for SV call quality and genotype quality varied across all six datasets. Finally, after leveraging a meticulously curated pedigree, we also found that the genotyping approaches varied in their ability to consistently genotype high-quality SVs. As a result, the number and type of SVs carried by individual kākāpō also differed, although there was some agreement between datasets as to which individuals carried a relatively high number of SVs. In addition, for the six genotyped and filtered datasets, the mean number of SVs in each of two kākāpō lineages differed both within and across generations. Our combined results highlight the challenges associated with discovering and genotyping SVs genome-wide. Despite these challenges, we address caveats, highlight key considerations, and provide recommendations to encourage the pursuit of genome-wide SV characterisation in biodiversity genomic research.
Challenges associated with resolving SVs
The SV discovery tools used here varied in the overall number, type, size, and location of the SVs detected, suggesting that each may be sensitive to different mapping characteristics. For instance, the overall number of SVs initially detected, and the number that passed call quality filters, varied greatly among the six datasets. Although all tools examined implement similar algorithms for SV discovery, the priority and sensitivity thresholds of each algorithm, as well as the approaches for merging calls across individuals, are unique to each tool. The methods used to estimate SV call quality metrics also vary between tools, ranging in complexity from counts of supporting reads (e.g., Delly, Sniffles) to Bayesian inference of allele likelihoods (e.g., Manta). Further, it can be challenging to determine appropriate thresholds for these metrics, in part because they are generally optimised for model species.
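Because each tool reports different call-quality metrics, filtering logic must be tool-specific. The following minimal Python sketch illustrates the idea; the INFO keys (RE for read support in Sniffles/CuteSV, PE for paired-end support in Delly) follow common VCF conventions, but the threshold of five supporting reads is an illustrative assumption, not the value used in this study.

```python
# Sketch: apply per-tool call-quality filters to SV records parsed from a VCF.
# The min_support threshold below is illustrative only; appropriate values are
# dataset-dependent, as discussed in the text.

def passes_filter(record, tool, min_support=5):
    """Return True if an SV record meets a tool-specific quality criterion."""
    info = dict(
        field.split("=", 1) if "=" in field else (field, "1")
        for field in record["INFO"].split(";")
    )
    if tool in ("sniffles", "cutesv"):
        # Long-read callers commonly report supporting-read counts (RE).
        return int(info.get("RE", 0)) >= min_support
    if tool == "delly":
        # Delly flags low-confidence calls via FILTER and reports
        # paired-end support (PE) in INFO.
        return record["FILTER"] == "PASS" and int(info.get("PE", 0)) >= min_support
    # Manta: rely on its internal PASS designation (Bayesian model).
    return record["FILTER"] == "PASS"

records = [
    {"INFO": "SVTYPE=INS;RE=12", "FILTER": "PASS"},
    {"INFO": "SVTYPE=DEL;RE=2", "FILTER": "PASS"},
]
kept = [r for r in records if passes_filter(r, "sniffles")]
print(len(kept))  # 1
```

In practice such filtering is typically done with tools like bcftools rather than custom scripts, but the key point stands: the same numeric threshold means different things under different support metrics.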
Variability in the number of each SV type detected was observed across all six datasets, especially when comparing long- and short-read based tools. All four short-read call quality filtered datasets had a very high prevalence of inversions. Both the individual-based strategy implemented by Delly and Smoove and the multi-sample approach implemented by Manta likely over-represented inversions relative to other SV types. This is not surprising given the challenges associated with resolving inversion breakpoints, even after merging calls into a consensus set (Ho et al., 2020; Mahmoud et al., 2019). In addition, no clear filtering approach for consistently resolving well-supported inversion breakpoints emerged for the tools used here. While inversions are likely over-represented in the short-read call sets (e.g., Hallast et al., 2021; Kim et al., 2017; Knief et al., 2017), the long-read based discovery strategies retained a relatively higher number of insertions than the short-read discovery tools. This is also not surprising given the known limitations of short-read data for characterising insertions (Delage et al., 2020).
All six datasets had a wide size distribution across all SV types, and it is challenging to determine whether the largest of these SVs are real. There is no clear consensus regarding the largest SV size that short-read data can accurately detect (Ho et al., 2020; Mahmoud et al., 2019). It is also unclear what depth, read length, and quality scores are required for long reads to consistently resolve large SVs (but see below). For example, duplications were, on average, the largest SV type across the six datasets, which could reflect highly repetitive regions and/or mapping error.
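Comparing size distributions per SV type across datasets reduces to summarising (type, length) pairs extracted from each call set. A minimal sketch, with made-up lengths purely for illustration:

```python
# Sketch: summarise SV size distributions per type from (SVTYPE, length) pairs,
# e.g., parsed from the SVLEN field of each dataset's VCF.
# The lengths below are invented for illustration, not values from the study.
from collections import defaultdict
from statistics import mean

calls = [
    ("DEL", 350), ("DEL", 1200), ("INS", 90),
    ("DUP", 45000), ("DUP", 220000), ("INV", 5000),
]

by_type = defaultdict(list)
for svtype, length in calls:
    by_type[svtype].append(length)

# Mean and maximum length per SV type; flagging types whose mean is driven by
# a few very large calls can help identify candidates for manual inspection.
summary = {t: (round(mean(v)), max(v)) for t, v in by_type.items()}
print(summary["DUP"])  # (132500, 220000)
```

A large gap between the mean and maximum within a type (as for the duplications here) is one signal that the tail of the distribution may warrant scrutiny for repeat-driven or mapping-error artefacts.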
Consensus on the location of SVs across the six datasets was limited, and was largely dependent on the number of tools being compared. The lack of complete overlap in the location of SVs between the two Manta datasets is interesting given the overall similarity in the number of SVs per chromosome and in the counts of each SV type. The two Manta strategies differed only in how individuals were grouped during initial SV discovery (i.e., samples divided into 14 batches, versus all males analysed jointly and all females analysed jointly). Given that Manta incorporates local assembly of reads when detecting SVs, different read sets may therefore have led to differences in both the power and the precision with which SVs were located in these analyses. Randomisation of sample batches would have aided in resolving this; however, it was not possible due to computational resource limitations. It is notable that although the two long-read based tools had a high level of agreement with one another, CuteSV had fewer overlapping calls with the short-read based tools. Whether this can be attributed to the smaller number of SVs in the CuteSV dataset and/or the relative accuracy of each tool is unknown.
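Quantifying agreement between call sets requires an explicit overlap criterion. A common choice (and the assumption in this sketch, not necessarily the criterion used in the study) is 50% reciprocal overlap between intervals:

```python
# Sketch: count SVs shared between two tools' call sets under a 50% reciprocal
# overlap criterion. Intervals are (chromosome, start, end) tuples; the example
# calls are invented for illustration.

def reciprocal_overlap(a, b, min_frac=0.5):
    """True if intervals a and b overlap by at least min_frac of BOTH lengths."""
    if a[0] != b[0]:          # different chromosomes never overlap
        return False
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return False
    return (overlap / (a[2] - a[1]) >= min_frac and
            overlap / (b[2] - b[1]) >= min_frac)

tool_a = [("chr1", 100, 1100), ("chr2", 5000, 6000)]
tool_b = [("chr1", 150, 1150), ("chr3", 10, 500)]

# For each call from tool A, check whether any call from tool B matches.
shared = sum(any(reciprocal_overlap(a, b) for b in tool_b) for a in tool_a)
print(shared)  # 1
```

The reciprocity requirement matters: a small call nested inside a much larger one would pass a one-sided overlap test but fail here, which is usually the desired behaviour when asking whether two tools detected the same event.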