Data processing and Stacks parameter testing
GBS sequence data was demultiplexed on the Cedar cluster hosted by Compute Canada using the process_radtags module in Stacks v. 2.3 (Rochette et al. 2019). Parameter testing following Pariset al. (2017) was conducted on the 24 individuals sequenced with and without the REPLI-g treatment (herein referred to as the “WGA test dataset”) using the denovo_map.pl script to determine the optimal values of the M and n parameters during subsequentde novo locus construction and SNP calling. The Mparameter controls the number of mismatches allowed between stacks in the same individual, which represent unique alleles, and the nparameter controls the number of mismatches in stacks across individuals as they are merged into loci (Catchen et al. 2011; Rochetteet al. 2019). We tested values between 1 and 9 for both parameters. Lower values of M and n permit fewer mismatches between stacks and, barring exceptionally high levels of natural polymorphism, should be more optimal in regional studies such as this one, where few geographic barriers exist between populations (Pariset al. 2017).
Following the recommendations in Paris et al. (2017), we additionally set the m parameter to 3, which controls the minimum allele depth, and used the r80 principle, a stringent approach to data filtering that retains only loci that are present in 80% of the dataset. When genomic data are assembled de novo, there is risk of constructing loci from contaminant DNA, and some studies have reported that WGA can increase the representation of such contaminants in raw sequence reads (Ellegaard et al. 2013; de Medeiros & Farrell 2018). However, contaminant DNA, if present, is typically unequally distributed among samples, so using the r80 parameter should reduce this risk (Paris et al. 2017); de Medeiros & Farrell (2018) found that a similar, stringent filtering approach was effective at removing such contaminants from their dataset. We assessed the number of recovered loci, polymorphic loci, and SNPs across each value of M and n independently for the WGA and non-WGA sequences in the WGA test dataset to assess any differences in the data that might be attributed to this treatment prior to GBS library preparation.
For CFM population genomic analyses, we processed all the WGA sequences from both sequencing runs together (n = 120, herein referred to as the “population genetic dataset”), specified a minimum minor allele frequency of 3%, limited the number of SNPs output per locus to one using the –write_random_snp option in the populationsmodule of Stacks to reduce genomic linkage, and removed any individuals with more than 50% missing data. COI sequences for the same specimens were aligned and quality checked following Mori et al.(2019).