Data processing and Stacks parameter testing
GBS sequence data was demultiplexed on the Cedar cluster hosted by
Compute Canada using the process_radtags module in Stacks v. 2.3
(Rochette et al. 2019). Parameter testing following Pariset al. (2017) was conducted on the 24 individuals sequenced with
and without the REPLI-g treatment (herein referred to as the “WGA test
dataset”) using the denovo_map.pl script to determine the
optimal values of the M and n parameters during subsequentde novo locus construction and SNP calling. The Mparameter controls the number of mismatches allowed between stacks in
the same individual, which represent unique alleles, and the nparameter controls the number of mismatches in stacks across individuals
as they are merged into loci (Catchen et al. 2011; Rochetteet al. 2019). We tested values between 1 and 9 for both
parameters. Lower values of M and n permit fewer
mismatches between stacks and, barring exceptionally high levels of
natural polymorphism, should be more optimal in regional studies such as
this one, where few geographic barriers exist between populations (Pariset al. 2017).
Following the recommendations in Paris et al. (2017), we
additionally set the m parameter to 3, which controls the minimum
allele depth, and used the r80 principle, a stringent approach to
data filtering that retains only loci that are present in 80% of the
dataset. When genomic data are assembled de novo, there is risk
of constructing loci from contaminant DNA, and some studies have
reported that WGA can increase the representation of such contaminants
in raw sequence reads (Ellegaard et al. 2013; de Medeiros &
Farrell 2018). However, contaminant DNA, if present, is typically
unequally distributed among samples, so using the r80 parameter
should reduce this risk (Paris et al. 2017); de Medeiros &
Farrell (2018) found that a similar, stringent filtering approach was
effective at removing such contaminants from their dataset. We assessed
the number of recovered loci, polymorphic loci, and SNPs across each
value of M and n independently for the WGA and non-WGA
sequences in the WGA test dataset to assess any differences in the data
that might be attributed to this treatment prior to GBS library
preparation.
For CFM population genomic analyses, we processed all the WGA sequences
from both sequencing runs together (n = 120, herein referred to as the
“population genetic dataset”), specified a minimum minor allele
frequency of 3%, limited the number of SNPs output per locus to one
using the –write_random_snp option in the populationsmodule of Stacks to reduce genomic linkage, and removed any individuals
with more than 50% missing data. COI sequences for the same
specimens were aligned and quality checked following Mori et al.(2019).