Alignment-based SV discovery methods

Once a high-quality reference genome has been assembled for a species of interest, SVs can be detected using alignment-based methods, in which short- and/or long-read sequence data for multiple individuals are aligned to the reference without prior assembly. These methods rely on identifying patterns in read mapping between a sample and the reference genome. Common algorithms include read-pair, read-depth, split-read, and de novo or local assembly (Hajirasouliha et al., 2010; Korbel et al., 2007; Yoon, Xuan, Makarov, Ye, & Sebat, 2009). No single algorithm is well suited to detecting all SV types. For example, read-pair algorithms, which assess the orientation of and distance between paired ends, are suitable for detecting deletions, duplications, and inversions. Read-depth approaches identify deletions and duplications through variation in mapping depth. Split-read algorithms identify reads whose alignments span a breakpoint, which makes them best suited to discerning the ends of an inversion or translocation (Figure 3). Finally, de novo and local assembly algorithms are best suited to identifying variants not present in the reference (e.g., insertions), because aligned reads may be reassembled into contigs alongside unmapped reads for pairwise comparison with the reference (Mahmoud et al., 2019). Early SV discovery programs typically relied on a single approach (e.g., BreakDancer, Pindel; Chen et al., 2009; Ye, Schulz, Long, Apweiler, & Ning, 2009). Although these programs are less computationally intensive, they are limited to calling only a few SV types and tend to underperform against generalist programs that incorporate at least three algorithms for SV detection (e.g., Delly, Lumpy, Manta, SvABA; Chen et al., 2016; Layer, Chiang, Quinlan, & Hall, 2014; Rausch et al., 2012; Wala et al., 2018).
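The read-pair and read-depth signals described above can be illustrated with a minimal Python sketch. It is not the logic of any published caller. The `ReadPair` record, the function names, the insert-size parameters, and the depth-ratio thresholds are all hypothetical and chosen only for illustration. It assumes a standard paired-end library in which concordant mates face each other (forward read upstream, reverse read downstream).

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Hypothetical, simplified record of one aligned read pair."""
    chrom1: str
    pos1: int
    strand1: str  # '+' or '-'
    chrom2: str
    pos2: int
    strand2: str  # '+' or '-'


def classify_pair(pair, mean_insert=400, sd_insert=50, n_sd=3):
    """Label a read pair by the SV type its mapping pattern suggests.

    Assumes concordant pairs map as '+' (left mate) then '-' (right mate).
    The insert-size parameters are illustrative defaults, not estimates.
    """
    if pair.chrom1 != pair.chrom2:
        return "translocation"
    # Order the two mates by mapping position.
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pair.pos1, pair.strand1), (pair.pos2, pair.strand2)]
    )
    if left_strand == right_strand:
        return "inversion"      # mates point the same way
    if left_strand == "-" and right_strand == "+":
        return "duplication"    # mates point away from each other
    # Simplification: use the distance between mate start positions.
    if right_pos - left_pos > mean_insert + n_sd * sd_insert:
        return "deletion"       # mates map too far apart
    return "concordant"         # shorter-than-expected spans are ignored here


def classify_depth(window_depth, median_depth, low=0.75, high=1.25):
    """Label a genomic window by its depth relative to the genome-wide median.

    The ratio thresholds are illustrative assumptions.
    """
    ratio = window_depth / median_depth
    if ratio <= low:
        return "deletion"
    if ratio >= high:
        return "duplication"
    return "normal"


if __name__ == "__main__":
    far_apart = ReadPair("chr1", 1000, "+", "chr1", 6000, "-")
    print(classify_pair(far_apart))
    print(classify_depth(window_depth=10, median_depth=30))
```

In practice a caller would cluster many such discordant pairs and require several supporting reads before reporting a variant. A single pair or window, as classified here, is only weak evidence.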
Because of the challenges of identifying and classifying SVs, many SV discovery approaches have a high (and systematic) false positive rate (e.g., Cameron et al., 2019), especially when data are derived from short-read sequencing methods (Figure 3; see more below). To reduce these false positives, an ensemble approach to SV characterization has been extensively applied to data from humans and domesticated species (Du et al., 2021; Ho et al., 2020; Zhou et al., 2019). With an ensemble approach, multiple SV callers are integrated into a single pipeline to increase confidence in SV discovery, and only variants identified by multiple SV callers are retained (Becker et al., 2018; Mohiyuddin et al., 2015; Zarate et al., 2018 preprint). Further, because validated SV call sets have been developed for humans, it is straightforward to benchmark appropriate program combinations for SV discovery (Collins et al., 2020; Ho et al., 2020; Parikh et al., 2016; Zook et al., 2020).
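The retention step of an ensemble approach can be sketched in Python as follows. This is one simple way to decide that calls from different programs "intersect", here by matching chromosome and SV type and requiring a reciprocal overlap above a threshold. It is an assumption for illustration and not the merging rule of any particular pipeline cited above. The `SV` record, the caller names, and the `min_callers` and `min_ro` parameters are hypothetical. The sketch also assumes every call has `end > start`, so zero-length insertion calls would need separate handling.

```python
from typing import NamedTuple


class SV(NamedTuple):
    """Hypothetical minimal SV call: interval plus type label."""
    chrom: str
    start: int
    end: int
    svtype: str


def reciprocal_overlap(a, b):
    """Return the smaller of the two overlap fractions, or 0.0 if no match."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a.end - a.start), overlap / (b.end - b.start))


def ensemble_calls(callsets, min_callers=2, min_ro=0.5):
    """Keep calls supported by at least `min_callers` callers.

    `callsets` maps a caller name to its list of SV calls. The first
    representative of each group of matching calls is retained.
    """
    retained = []
    for name, calls in callsets.items():
        for sv in calls:
            # Count this caller plus every other caller with a matching call.
            support = {name}
            for other_name, other_calls in callsets.items():
                if other_name == name:
                    continue
                if any(reciprocal_overlap(sv, o) >= min_ro for o in other_calls):
                    support.add(other_name)
            already_kept = any(
                reciprocal_overlap(sv, kept) >= min_ro for kept in retained
            )
            if len(support) >= min_callers and not already_kept:
                retained.append(sv)
    return retained


if __name__ == "__main__":
    # Placeholder caller names and coordinates, for illustration only.
    callsets = {
        "callerA": [SV("chr1", 1000, 2000, "DEL"), SV("chr2", 500, 900, "DUP")],
        "callerB": [SV("chr1", 1050, 2100, "DEL")],
        "callerC": [SV("chr1", 990, 1990, "DEL"), SV("chr3", 100, 400, "INV")],
    }
    for sv in ensemble_calls(callsets):
        print(sv)
```

Raising `min_callers` or `min_ro` makes the filter stricter, which should lower the false positive rate at the cost of discarding true variants that only one program is able to detect.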