COI haplotype mapping and summary statistics
Due to missing nucleotide (nt) sequence at the 5’ and/or 3’ ends in 20
specimens (min. missing = 7 nt, max. missing = 80 nt, Table S1), we
created a masked dataset using the modal sequence of those missing
regions for each collection locality to ensure haplotype mapping was not
biased by missing data. Two specimens additionally failed to sequence
and were omitted from the COI dataset (final n = 104). The
minimum spanning haplotype network depicted a single large haplogroup
and nine additional smaller haplogroups (Fig. 1C). Central populations
(indicated by light blue and pink colours) had the greatest amount of
haplotype diversity, however overall haplotype variation was low (number
of segregating sites = 16, number of parsimony-informative sites = 13),
and there was no clear spatial relationship to haplotype variation;
except for the Swan River and Portage la Prairie populations, each
population had sequences in more than one haplogroup. The Swan River and
Portage la Prairie haplotypes were identical and clustered in the large
haplogroup with several specimens collected from central populations as
well as the western Sangudo population. The Athabasca population was
moderately distinct, and clustered mostly in a smaller haplogroup along
with a few other specimens from central populations.
DISCUSSION
Here, we present the first genomic analysis of population structure for
CFM. We used WGA to generate GBS libraries from these small insects, and
although we found some impact of WGA on the resulting raw sequence data,
there was no appreciable impact on filtered datasets and subsequent
population genomic analyses. Overall, the GBS dataset recovered very
little population structure across the majority of the sampled area,
although much more so than the comparable mitochondrial dataset, and the
only strongly differentiated populations were geographically disparate
and located at the edges of the canola production region. Given a lack
of alternative explanations for this pattern, we use these data to
expound on the hypothesis that CFM is a native species that has
unrecognized hosts both within and outside of the main agricultural
zone, which is where research on this newly described species has
focused thus far.
Whole genome amplification and GBS sequencing
performance
The observed differences in sequencing coverage between treatments in
the WGA test dataset may be attributed to multiple factors. Five of the
eight most highly sequenced samples were the same between the WGA and
non-WGA treatments, so those specimens may have had higher initial
molecular weight DNA compared to the other 16 individuals, which could
result in more sequence tags being cut and amplified (Andrews et
al. 2016). However, this does not sufficiently explain the overall
greater number of sequence reads attributed to the WGA samples. It is
possible that differences in quantification accuracy between the WGA and
non-WGA treatments prior to library preparation resulted in greater
amplification and sequencing output of the WGA samples, however we are
unable to follow-up on this possibility.
Perhaps most significantly, we observed a high level of adapter
contamination in both sequencing runs, regardless of WGA treatment. This
is generally the result of input DNA fragments being shorter than the
150 bp sequencing length, thus leading to adapter sequence integration
into the 3’ ends of the sequencing reads and subsequent sequencing of
these regions (Illumina 2020). Bioanalyzer results for the WGA test and
population genetic datasets confirmed that a high proportion of short
insert fragment lengths were present in our final libraries (shorter
than 150 bp excluding sequencing adapters, results not shown).
Nonoptimal size selection via our restriction enzyme choice or
substantial shearing of an intermediate library product may produce
short fragments, but due to a lack of size quantification in
intermediate steps we were unable to verify the stage at which this
occurred. Despite this, after processing the retained sequence reads
using Stacks, we were successful in assembling a moderate number of loci
with sufficient read depth for population genomic analyses (Appendix 1:
Table A3). The impact of this adapter contamination would have affected
both treatments in the WGA test dataset equivalently, as both groups of
samples were pooled together after primer ligation; thus, while a
greater number of useable sequencing reads would have likely increased
the overall number and depth of retained loci, this contamination does
not appear to have compromised the study, analytically.
Our results also indicate a tradeoff between sequencing coverage and
read depth when using WGA prior to GBS (Appendix 1: Table A3). This is
concordant with the findings of de Medeiros & Farrell (2018), who found
that samples with less input DNA were more prone to reduced genome
coverage after sequencing. Our results differ from those of Blairet al. (2015) and Cruaud et al. (2018), who both found
negligible differences in genome coverage and sequencing depth when
comparing WGA and non-WGA samples. However, we note that Blair et
al. (2015) used much higher quantities of input DNA for WGA than our
study system permitted, and Cruaud et al. (2018) pooled
individuals so they were unable to make the same individual comparisons
presented here, and in de Medeiros & Farrell (2018).
Reported differences in sequencing depth between treatments did not
appear to impact de novo locus construction and SNP calling in
the WGA test dataset, which was consistent with other studies (Blairet al. 2015; Cruaud et al. 2018; de Medeiros & Farrell
2018). Pairwise FST comparisons, observed
heterozygosity, and PCA indicated little difference in genotyping
between treatments when they were filtered together (Appendix 1: Table
A3, A4; Fig. A1). Additionally, similarity in the average number of SNPs
per locus between treatments (Table A3) suggests that the relatively
greater number of SNPs found in the non-WGA treatment can be attributed
to greater genomic coverage, rather than an inflation in the level of
measured polymorphism. Our results suggest that, despite the potential
for unequal amplification of genomic DNA by WGA, this approach is not
likely to produce significant biases that impact downstream de
novo SNP calling, provided that read depth is sufficient. Therefore, we
suggest that the benefits of WGA (namely, facilitating the use of single
specimens of small species for NGS) in studies that seek to randomly
sample markers across the genome outweigh the potential shortcoming of
reduced genome coverage.