Caveats and considerations
Caveats – Benchmarking studies of the tools examined here demonstrate that SV discovery is a complex endeavour, even in well-studied model species like humans, and that it can be influenced by sequence type, depth and coverage, as well as by the characteristics of the SVs themselves (e.g., Cameron et al., 2019; Kosugi et al., 2019). Another important consideration is that many SV tools are developed and benchmarked with human clinical applications in mind (e.g., Cameron et al., 2019; X. Chen et al., 2016; Jiang et al., 2016; Jiang et al., 2020; Kosugi et al., 2019; Pedersen et al., 2019; Rausch et al., 2012), and come with the implicit assumption that researchers have access to high-quality, chromosomally assembled reference genomes. Together, these factors make it challenging to retrofit these tools for population-scale studies in non-model species (e.g., GRIDSS; Cameron et al., 2017).
An inherent problem with linear reference genomes is the potential for reference bias (Theissinger et al., 2023; Wold et al., 2021). In this study, the reference genome was assembled using an individual of pure Rakiura lineage. This is significant because the Fiordland founder is the only individual without direct relation to the Rakiura lineage. The tendency to observe more SVs in birds of Fiordland lineage may be attributable to the comparison of groups of more- and less-related birds against a single reference. It is likely that the Fiordland founder carried more genetic differences relative to the reference genome, and that these differences have been inherited by his descendants. However, the small sample size representing the Fiordland lineage makes comparisons between the two lineages difficult.
Below we highlight strategies to facilitate the characterisation of genome-wide SVs in biodiversity genomic research, with an emphasis on SV discovery given that SV genotyping is an especially active area of research (e.g., Nguyen et al., 2023). We acknowledge the extensive resources that made this study feasible, including a relatively small, high-quality diploid reference genome, a near whole-species high-coverage short-read dataset, and a meticulously curated pedigree for a genetically depauperate species (Guhlin et al., 2022 preprint). Further, we acknowledge that there are many other tools and approaches for SV discovery and genotyping not outlined in this study, and we encourage researchers to explore both established and developing approaches.
Considerations – Two key considerations for research that seeks to characterise genome-wide SVs within and among populations are data characteristics (e.g., read length, sequence depth) and available resources (e.g., financial, computational). We frame our recommendations below with these considerations in mind.
In the best-case scenario, we recommend generating long-read sequence data at a target sequence depth of ≥20x for consistent performance across a broad range of SV types (Sedlazeck et al., 2018; Jiang et al., 2020). However, the size profile of reads can influence the sequence depth required (e.g., 20 kb reads at ≥10x coverage for Sniffles; Sedlazeck et al., 2018). This was potentially demonstrated in our study, where all Sniffles calls failed filtering thresholds for specificity, possibly because the sequence depth and read length profile were lower than recommended. The low number of SVs retained in the CuteSV dataset after filtering is also likely due to this constraint, as this tool performs better when sequence coverage is ≥20x (Jiang et al., 2020). Nevertheless, the future promise of long-read approaches and their ability to outperform most short-read SV discovery tools make these strategies our recommended investment for the characterisation of SVs. In our experience, alignment and SV discovery tools for long-read data did not exceed our computational resources (128 GB of RAM, 1 TB of storage). However, we note that preparation of whole-genome long-read sequence data can be computationally intensive (e.g., basecalling of ONT data) and that neither CuteSV nor Sniffles provides recommended tools for population-scale genotyping.
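As a rough planning aid for the depth targets discussed above, expected mean sequence depth can be approximated as the total number of sequenced bases divided by the genome size. The sketch below is illustrative only: the function names and the example genome size and yield are hypothetical and are not values from this study, and realised depth will usually be lower once unmapped and filtered reads are excluded.

```python
def mean_depth(total_bases: int, genome_size: int) -> float:
    """Approximate mean depth as total sequenced bases / genome size (both in bp)."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def meets_target(total_bases: int, genome_size: int, target: float = 20.0) -> bool:
    """Return True if the approximate mean depth reaches the target (default 20x)."""
    return mean_depth(total_bases, genome_size) >= target


if __name__ == "__main__":
    # Hypothetical example: 30 Gb of sequence for a 1.2 Gb genome.
    depth = mean_depth(30_000_000_000, 1_200_000_000)
    print(f"approximate mean depth: {depth:.1f}x")
    print("meets 20x target:", meets_target(30_000_000_000, 1_200_000_000))
```

The same calculation can be applied per sample before SV discovery to flag individuals that fall below the chosen target.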
When long-read data is not practical, but substantial computational resources are available, we recommend using Manta with a target sequence depth of ≥20x (e.g., Supplementary Figure S13, Kosugi et al., 2019). We acknowledge that jointly calling samples can be challenging, as Manta was by far the most computationally intensive programme used in this study. We did not track our final usage, but to call each of the Joint and Batch data sets, we increased our RAM allocation to 460 GB. However, caution is advised when implementing a batched approach, given that Manta has more power to resolve SV breakpoints when all samples are pooled together. Another drawback is that, like CuteSV and Sniffles, Manta does not provide a recommended programme for population-wide genotyping, and interpreting its raw outputs is not recommended.
When financial and computational resources are limited, we suggest that Delly and Smoove are relatively interchangeable. Both tools call SVs in individual samples before merging them into a raw call set. As a result, they require significantly fewer computational resources than Manta and use similar computational resources to either CuteSV or Sniffles for SV discovery. They have the additional benefit of including their own genotyping methods, which were relatively efficient compared to BayesTyper (which required >3 TB of storage and was run using 16 cores, each with 8 GB of RAM, over several days). It is notable that the number of genotyped SVs increased for individuals with higher depth in Delly and Smoove, suggesting that consistency of sequence depth across samples may be an important consideration for these tools (Supplemental Figure 5).
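The three scenarios above can be summarised as a simple decision sketch. This is a deliberate simplification of our recommendations rather than a definitive workflow: the function name, the two boolean inputs and the `Suggestion` structure are hypothetical, and real decisions will also depend on read length profiles, sample numbers and study aims.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Suggestion:
    tools: Tuple[str, ...]          # SV discovery tools used in this study
    target_depth: Optional[float]   # target depth (x) stated in the text, if any
    builtin_genotyping: bool        # whether the tools include their own genotyping


def suggest_sv_tools(long_reads_practical: bool, substantial_compute: bool) -> Suggestion:
    """Map the two considerations (data type, compute) to the scenarios in the text."""
    if long_reads_practical:
        # Best case: long-read data at >=20x; no recommended population-scale genotyper.
        return Suggestion(("CuteSV", "Sniffles"), 20.0, False)
    if substantial_compute:
        # Short reads with substantial compute: Manta at >=20x, ideally jointly called.
        return Suggestion(("Manta",), 20.0, False)
    # Limited resources: Delly or Smoove; no explicit depth target is given in the text.
    return Suggestion(("Delly", "Smoove"), None, True)


if __name__ == "__main__":
    print(suggest_sv_tools(long_reads_practical=False, substantial_compute=True))
```

Checking long-read feasibility first reflects that long-read data is our recommended investment where it is practical.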