Challenges associated with characterising SV diversity
The challenges of accurately detecting and genotyping SVs had implications for assessments of individual SV diversity. Although Delly and Smoove showed some concordance in identifying the three individuals with the highest number of SVs, the SV type largely driving this pattern was inversions. As noted previously, inversions are particularly difficult to resolve using short-read data, making it difficult to determine whether these variants reflect a true biological signal or systematic error in short-read mapping. The significantly lower frequency of inversions in both long-read datasets is notable, as long-read data should better resolve more complex variants such as duplications, insertions and inversions (Alkan et al., 2011; Chaisson et al., 2019; Mahmoud et al., 2019; Mérot et al., 2022). Further work is needed to determine whether the small sample size and relatively low sequence depth of the long-read data impeded discovery of inversions, or whether these calls are largely false positives in the short-read-based datasets.
Although the three individuals consistently carrying the most SVs in Delly and Smoove were not depth outliers, individuals with higher sequence depths tended to retain more SVs after genotype quality filtering in both datasets (Supplemental Figure 5). This has implications for the number of SVs detected in each generation and for the trends observed across datasets within each lineage. The datasets with the highest and lowest mean number of SVs in each generation were somewhat consistent, with Smoove having the most SVs followed by Manta – Batch, Manta – Joint, Sniffles, CuteSV and/or Delly, but overall trends differed between the two lineages. However, interpreting these results is challenging because the sample sizes representing generations of the Fiordland lineage are small (F0 = 1, F1 = 3, F2 = 4). In addition, there is some indication that the higher mean sequence depth of samples from later Rakiura generations is driving the higher mean number of SVs observed in those generations (e.g., mean Rakiura F0 ~15x coverage vs Rakiura F2 ~24x coverage). Although the effect of mean sequence depth was not as strong in the datasets genotyped using BayesTyper (i.e., Manta – Batch, Manta – Joint, CuteSV, Sniffles), these results highlight the importance of accounting for sequencing batch effects when interpreting results.
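One way to check a depth trend of this kind is a rank correlation between each individual's mean sequence depth and the number of SVs it retains after filtering. The following is a minimal sketch only, assuming such per-individual values are available. The function names are illustrative and the numbers are hypothetical placeholders, not data from this study. Spearman's rank correlation is used here as one reasonable choice for small samples and is not necessarily the approach behind Supplemental Figure 5.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the full run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-individual values (NOT from this study):
# mean sequence depth (x coverage) and SVs retained after genotype filtering.
depth = [12.0, 15.0, 18.0, 21.0, 24.0, 27.0]
sv_count = [310, 295, 340, 365, 350, 410]

rho = spearman(depth, sv_count)
print(f"Spearman rho between depth and retained SVs: {rho:.3f}")
```

A rho near 1 would indicate that deeper-sequenced individuals tend to retain more SVs. Repeating the calculation within each sequencing batch would help separate a depth effect from differences between generations.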