Read processing and alignment

A highly contiguous reference genome, assembled by the Vertebrate Genome Project (VGP), is available for a single female kākāpō, ‘Jane’ (Rhie et al., 2021). As part of the Kākāpō125+ project, paired-end sequence libraries for 94 males and 75 females were sequenced to a target depth of 30x coverage on multiple Illumina platforms, including MiSeq2500, TruSeq Nano, and HiSeqX. Read lengths ranged from 125 to 150 bp. All preprocessing of raw sequence data was conducted by JG to maintain consistency across Kākāpō125+ subprojects. Briefly, reads were trimmed, adaptor content was removed, and overlapping reads were collapsed into a single read using the default quality thresholds (minimum quality of 2) for fastp v0.20.0 (S. Chen et al., 2018) and AdapterRemoval v2.2.4 (Schubert et al., 2016). These processed reads were aligned to the reference genome, and a machine learning program, DeepVariant (Poplin et al., 2018), was employed to generate high-quality SNPs for downstream analyses led by the Kākāpō125+ consortium (Guhlin et al., 2022, preprint). For short-read-based SV discovery, reads were aligned to the reference genome using Burrows-Wheeler Aligner v0.7.17 (BWA; Li & Durbin, 2009).
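
As a rough illustration of what a 30x target implies for paired-end libraries, the sketch below applies the standard expectation that mean depth equals total sequenced bases divided by genome size. The genome size used in the example call (1.2 Gb) is a round placeholder and not a figure reported in this study, the function names are hypothetical, and the calculation ignores losses from trimming, read collapsing, and unmapped reads.

```python
import math


def expected_depth(n_read_pairs, read_length_bp, genome_size_bp):
    """Expected mean depth for paired-end data: total bases / genome size."""
    total_bases = 2 * n_read_pairs * read_length_bp  # two reads per pair
    return total_bases / genome_size_bp


def read_pairs_for_depth(target_depth, read_length_bp, genome_size_bp):
    """Smallest number of read pairs whose expected depth reaches the target."""
    bases_per_pair = 2 * read_length_bp
    return math.ceil(target_depth * genome_size_bp / bases_per_pair)


if __name__ == "__main__":
    genome_size = 1_200_000_000  # ASSUMPTION: placeholder genome size in bp
    for read_length in (125, 150):  # read lengths reported in the text
        pairs = read_pairs_for_depth(30, read_length, genome_size)
        print(f"{read_length} bp reads: {pairs:,} pairs for 30x")
```

Shorter reads require proportionally more read pairs for the same target depth, which is one reason realised depth can differ between platforms.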
In addition to the near-whole-species resequencing data, ten individuals highly represented in the extant population (5 male, 5 female) were targeted for long-read sequencing on the Oxford Nanopore Technologies platform. All individuals were sequenced on a MinION with R9 flow cells and the PCR-free LSK-110 ligation sequencing kit. Basecalling was performed in Guppy v6.3.7 with the ‘super’ accuracy model (dna_r9.4.1_450bps_sup). Adapters were trimmed using Porechop v0.2.4 (Wick, 2017/2022), lambda DNA was removed using NanoLyse v1.2.0 (De Coster et al., 2018), and reads were filtered for a minimum Q-score of 10 and a minimum read length of 3 kb using NanoFilt v2.8.0 (De Coster et al., 2018). The quality of both raw and filtered long reads was visualised using NanoPlot v1.39.0 (De Coster et al., 2018). For long-read-based SV discovery, reads were aligned to the reference genome using Winnowmap v2.03 (Jain et al., 2020). Read mapping quality was assessed for both short- and long-read alignments using Mosdepth v0.3.3 (Pedersen & Quinlan, 2018) and Qualimap v2.2.2 (García-Alcalde et al., 2012), with summaries of outputs from these tools visualised using MultiQC v1.13 (Ewels et al., 2016). A minimum alignment depth of 4x was required for inclusion in long-read-based SV discovery.
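
The long-read thresholds described above (minimum Q-score of 10, minimum read length of 3 kb, minimum alignment depth of 4x) can be sketched as simple predicates. This is a minimal illustration and not the code of NanoFilt or Mosdepth. It assumes that a read's Q-score is the mean computed in error-probability space and converted back to the Phred scale, and that per-sample mean depth has already been obtained elsewhere. All function names are hypothetical.

```python
import math


def decode_phred(quality_string, offset=33):
    """Convert a FASTQ quality string (Phred+33 by default) to integer scores."""
    return [ord(char) - offset for char in quality_string]


def mean_qscore(phred_scores):
    """Mean read quality: average the error probabilities, then convert to Phred."""
    if not phred_scores:
        return 0.0
    mean_error = sum(10 ** (-q / 10) for q in phred_scores) / len(phred_scores)
    return -10 * math.log10(mean_error)


def read_passes(sequence, phred_scores, min_q=10.0, min_len=3000):
    """Keep a read only if it meets both the length and the quality threshold."""
    return len(sequence) >= min_len and mean_qscore(phred_scores) >= min_q


def sample_passes(mean_depth, min_depth=4.0):
    """Keep a sample for long-read SV discovery only if mean depth is sufficient."""
    return mean_depth >= min_depth
```

Averaging in error-probability space penalises reads with a few very poor bases more heavily than an arithmetic mean of Phred scores would, so a read can fall below the quality threshold even when most of its bases score well above it.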
The highly contiguous VGP reference genome assembly represents a female kākāpō and thus includes both the Z and W sex chromosomes. This may be problematic for SV discovery, as the W chromosome contains highly repetitive content that is homologous with sequence throughout the genome (Rhie et al., 2021). A preliminary analysis of SNPs indicated that this homology caused enough reads to map to the W chromosome that erroneous heterozygous SNP calls were produced in both females and males (data not shown). Because males are the homogametic sex (ZZ) and carry no W, and females are heterogametic (ZW) and carry a single copy, heterozygous SNP calls on the W in either sex indicate mis-mapping. To address this, reads were realigned for all individuals with single-end reads excluded. For males, the W chromosome was also excluded from the alignment. For females, the W chromosome scaffold was retained so that reads belonging to the W did not interfere with SV discovery on other chromosomes. For joint analyses of the kākāpō population, the Z and W chromosomes and all unplaced scaffolds were excluded from downstream analyses due to low confidence in variant discovery for these scaffolds.
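
The sex-specific alignment targets and the scaffold exclusions for joint analyses can be summarised as two selection rules over the list of reference sequence names. The sketch below is illustrative only: the sequence names ("chrZ", "chrW") and the unplaced-scaffold prefix ("scaffold_") are placeholders rather than the names used in the VGP assembly, and the function names are hypothetical.

```python
def alignment_contigs(all_contigs, sex, w_name="chrW"):
    """Reference sequences to align against for an individual of a given sex."""
    if sex == "male":
        # Males (ZZ) carry no W, so it is dropped to avoid mis-mapping.
        return [contig for contig in all_contigs if contig != w_name]
    if sex == "female":
        # Females (ZW) keep the W so that W-derived reads map to it.
        return list(all_contigs)
    raise ValueError(f"unknown sex: {sex!r}")


def joint_analysis_contigs(all_contigs, sex_chroms=("chrZ", "chrW"),
                           unplaced_prefix="scaffold_"):
    """Reference sequences retained for population-wide (joint) analyses."""
    return [
        contig for contig in all_contigs
        if contig not in sex_chroms and not contig.startswith(unplaced_prefix)
    ]


if __name__ == "__main__":
    # ASSUMPTION: placeholder sequence names for illustration only.
    contigs = ["chr1", "chr2", "chrZ", "chrW", "scaffold_101"]
    print(alignment_contigs(contigs, "male"))
    print(alignment_contigs(contigs, "female"))
    print(joint_analysis_contigs(contigs))
```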