COI haplotype mapping and summary statistics
Due to missing nucleotide (nt) sequence at the 5’ and/or 3’ ends in 20 specimens (min. missing = 7 nt, max. missing = 80 nt, Table S1), we created a masked dataset using the modal sequence of those missing regions for each collection locality to ensure haplotype mapping was not biased by missing data. Two specimens additionally failed to sequence and were omitted from the COI dataset (final n = 104). The minimum spanning haplotype network depicted a single large haplogroup and nine additional smaller haplogroups (Fig. 1C). Central populations (indicated by light blue and pink colours) had the greatest amount of haplotype diversity, however overall haplotype variation was low (number of segregating sites = 16, number of parsimony-informative sites = 13), and there was no clear spatial relationship to haplotype variation; except for the Swan River and Portage la Prairie populations, each population had sequences in more than one haplogroup. The Swan River and Portage la Prairie haplotypes were identical and clustered in the large haplogroup with several specimens collected from central populations as well as the western Sangudo population. The Athabasca population was moderately distinct, and clustered mostly in a smaller haplogroup along with a few other specimens from central populations.
DISCUSSION
Here, we present the first genomic analysis of population structure for CFM. We used WGA to generate GBS libraries from these small insects, and although we found some impact of WGA on the resulting raw sequence data, there was no appreciable impact on filtered datasets and subsequent population genomic analyses. Overall, the GBS dataset recovered very little population structure across the majority of the sampled area, although much more so than the comparable mitochondrial dataset, and the only strongly differentiated populations were geographically disparate and located at the edges of the canola production region. Given a lack of alternative explanations for this pattern, we use these data to expound on the hypothesis that CFM is a native species that has unrecognized hosts both within and outside of the main agricultural zone, which is where research on this newly described species has focused thus far.
Whole genome amplification and GBS sequencing performance
The observed differences in sequencing coverage between treatments in the WGA test dataset may be attributed to multiple factors. Five of the eight most highly sequenced samples were the same between the WGA and non-WGA treatments, so those specimens may have had higher initial molecular weight DNA compared to the other 16 individuals, which could result in more sequence tags being cut and amplified (Andrews et al. 2016). However, this does not sufficiently explain the overall greater number of sequence reads attributed to the WGA samples. It is possible that differences in quantification accuracy between the WGA and non-WGA treatments prior to library preparation resulted in greater amplification and sequencing output of the WGA samples, however we are unable to follow-up on this possibility.
Perhaps most significantly, we observed a high level of adapter contamination in both sequencing runs, regardless of WGA treatment. This is generally the result of input DNA fragments being shorter than the 150 bp sequencing length, thus leading to adapter sequence integration into the 3’ ends of the sequencing reads and subsequent sequencing of these regions (Illumina 2020). Bioanalyzer results for the WGA test and population genetic datasets confirmed that a high proportion of short insert fragment lengths were present in our final libraries (shorter than 150 bp excluding sequencing adapters, results not shown). Nonoptimal size selection via our restriction enzyme choice or substantial shearing of an intermediate library product may produce short fragments, but due to a lack of size quantification in intermediate steps we were unable to verify the stage at which this occurred. Despite this, after processing the retained sequence reads using Stacks, we were successful in assembling a moderate number of loci with sufficient read depth for population genomic analyses (Appendix 1: Table A3). The impact of this adapter contamination would have affected both treatments in the WGA test dataset equivalently, as both groups of samples were pooled together after primer ligation; thus, while a greater number of useable sequencing reads would have likely increased the overall number and depth of retained loci, this contamination does not appear to have compromised the study, analytically.
Our results also indicate a tradeoff between sequencing coverage and read depth when using WGA prior to GBS (Appendix 1: Table A3). This is concordant with the findings of de Medeiros & Farrell (2018), who found that samples with less input DNA were more prone to reduced genome coverage after sequencing. Our results differ from those of Blairet al. (2015) and Cruaud et al. (2018), who both found negligible differences in genome coverage and sequencing depth when comparing WGA and non-WGA samples. However, we note that Blair et al. (2015) used much higher quantities of input DNA for WGA than our study system permitted, and Cruaud et al. (2018) pooled individuals so they were unable to make the same individual comparisons presented here, and in de Medeiros & Farrell (2018).
Reported differences in sequencing depth between treatments did not appear to impact de novo locus construction and SNP calling in the WGA test dataset, which was consistent with other studies (Blairet al. 2015; Cruaud et al. 2018; de Medeiros & Farrell 2018). Pairwise FST comparisons, observed heterozygosity, and PCA indicated little difference in genotyping between treatments when they were filtered together (Appendix 1: Table A3, A4; Fig. A1). Additionally, similarity in the average number of SNPs per locus between treatments (Table A3) suggests that the relatively greater number of SNPs found in the non-WGA treatment can be attributed to greater genomic coverage, rather than an inflation in the level of measured polymorphism. Our results suggest that, despite the potential for unequal amplification of genomic DNA by WGA, this approach is not likely to produce significant biases that impact downstream de novo SNP calling, provided that read depth is sufficient. Therefore, we suggest that the benefits of WGA (namely, facilitating the use of single specimens of small species for NGS) in studies that seek to randomly sample markers across the genome outweigh the potential shortcoming of reduced genome coverage.