Materials and Methods
Plant materials. The diversity collections consisted of wild populations gathered across northern MN and Wisconsin (WI) (referred to here as the Natural Stand collection); cultivars and germplasm from the UMN cNWR breeding program (referred to here as the Cultivated collection); and a population of Zizania aquatica L. from the Platte River in MN (referred to here as the outgroup). A Temporal panel was also generated to look at potential changes in diversity over time within two natural stand lake populations. In all, a total of 889 individuals were evaluated in this study, which consisted of 530 Natural Stand and 209 Cultivated samples collected in 2018, in addition to 100 samples collected in 2010 for the Temporal panel.
The Natural Stand collection was obtained from 10 wild populations across northern Minnesota (50 samples per lake/river) and 2 wild populations from western Wisconsin (10 from Mud Hen Lake; 20 from Phantom Lake) (Figure 1; Table S1). An additional 50 Z. aquatica samples, also collected in central Minnesota, were used as an outgroup in this study. Three major hydrologic unit code (HUC) subbasins were represented in this collection including seven populations from the Upper Mississippi River (UMR) watershed, three populations from the Red River of the North (RRN) watershed, and two populations from the St. Croix River (SCR) watershed (Figure 1; Table S1). A distance (km) matrix for all Natural Stand populations is available in Table S2.
The Cultivated collection consisted of leaf samples from 209 open-pollinated individuals, representing 4 cultivars and 10 breeding populations from the UMN cNWR breeding program. Samples were collected at the UMN North Central Research and Outreach Center (NCROC) in Grand Rapids, MN in 2018 (Table S1; Figure 1; Figure S1).
Our Temporal collection consisted of 200 natural stand samples collected from Garfield and Shell Lakes in 2010 and in 2018, with 50 samples collected in each lake at each time point. GPS coordinates were used to ensure that the same population at the same location were collected at both time points (Table S1). We would like to note that samples from Garfield and Shell Lakes collected in 2018 were also a part of the Natural Stand collection (50 samples per lake).
DNA extraction and sequencing. During collection, approximately 8 cm of leaf tissue was harvested on ice and then lyophilized using a TissueLyser II (Qiagen, Valencia, CA, USA). Genomic DNA was extracted using a Qiagen DNeasy Plant Mini Kit (Qiagen, Valencia, CA, USA) according to the manufacturer’s instructions. DNA concentration was measured with a NanoDrop spectrophotometer (Thermo Scientific, Wilmington, DE, USA), samples were submitted to the UMN Genomics Center for library preparation and sequencing. The digestion step was performed using two restriction enzymes, Btg1 (5’-C/CRYGG-3’) andTaqI (5’-T/CGA-3’) (Shao et al., 2019). Afterward, unique barcodes were ligated to DNA fragments for sample identification and pooling. Single-end 150-bp sequencing to a depth of 2.5 million reads per sample was performed on an Illumina NovaSeq machine (Illumina, San Diego, CA, USA). Raw data were deposited in the National Center for Biotechnology Information Short Read Archive (NCBI SRA) under accession number PRJNA774842. BioSample accession numbers for individual samples are provided in Table S3.
Read mapping . Quality control was initially performed on the FASTQ files using fastQC version 0.11.7 32 to check for read quality and adapter contamination. After adapter trimming with Cutadapt version 1.18 33, reads were mapped to the reference genome v1.0 of cNWR cultivar ‘Itasca-C12’ 34of BWA-MEM version 0.7.13 35. SNPs were called using the ‘mpileup’ function from BCFtools version 2.3 36and sorted with the sort function from samtools version 1.937, resulting in 2,183 variant call format (VCF) files, one for each scaffold of the NWR genome 34. The 17 largest VCF files representing the 15 NWR chromosomes and 2 additional large scaffolds of the NWR reference genome were merged into a single VCF file with the ‘concat’ function from BCFtools. Filtering of the merged VCF file was carried out using VCFtools 38. All analyses were done with default parameters. A maximum missing rate of 20% across all samples and a minimum depth of 4 reads per variant site was used to obtain the final SNP set, which was deposited at the Data Repository for the University of Minnesota (DRUM) under the handle https://hdl.handle.net/11299/264603 (doi will be provided at the revisions stage, if accepted).
Marker statistics . Summary statistics were calculated for each SNP including their distribution across chromosomes and scaffolds, their polymorphism information content (PIC) value, and their transition/transversion (TsTv) ratio. Polymorphism information content (PIC) was calculated using the snpReady R package39. The number and type of transitions and transversions were calculated using VCFtools.
Genetic Diversity Assessments. To assess the structure and distribution of genetic variation within our collections, we conducted a principal coordinate analysis (PCoA). To generate the full sample PCoA, the VCF file resulting from our SNP calling pipeline was imported into the R statistical environment 40 using the vcfR package 41. For the individual Natural Stand, Cultivated, and Temporal PCoAs, the original VCF file was first subsetted to only include the relevant samples using PLINK version 1.90b6.10 42. For all PCoAs, the R packagevegan 43 was used to calculate a dissimilarity matrix based on the Jaccard distance using the ‘vegdist’ function, as well as eigenvectors and eigenvalues using the ‘cmdscale’ function. PCoA plots were then generated using the ggplot2 package44.
Neighbor-joining trees for the Natural Stand and Cultivated collections as well as the Temporal collection were created using the R packagesadegenet 45,46, ape 47, poppr 48,49, andvcfR . The temporal cluster analysis was performed similarly to the methods described in Jacquemyn et al., 2006 50. We estimated Prevosti’s genetic distance using the unweighted pair group method with arithmetic mean (UPGMA) algorithm option for the ‘aboot’ function (bootstrapped dendrograms) in poppr with the 1,000 bootstrap replicates, and a selected cutoff value of 50.
Population structure and admixture in our Natural Stand and Cultivated collections was assessed using Bayesian clustering implemented in STRUCTURE version 2.3.4 51. Genotypic data for these individuals and the final bi-allelic SNP set were loaded into STRUCTURE and analyzed with the admixture model. The Markov Chain Monte Carlo was run from K =2 to K =14 with a burn-in length of 1,000 followed by 10,000 iterations. Lower K values were used to look for larger structural diversity patterns, while the higher K values were chosen to test if we were able to separate each population into their own STRUCTURE-assigned cluster (e.g., area of collection). Each K value was run 3 times, then compiled into a merged file using the ‘clumppExport’ function from the pophelper R package and subsequently plotted with the pophelper package 52. The ideal number of clusters was determined using the DeltaK statistic from the Structure Harvester web tool version 0.6.94 53, which uses the Evanno et al. (2005) method for determining the number of clusters 54.
Analysis of Molecular Variance (AMOVA) was performed using the ‘poppr.amova’ function from the poppr R package based on the work of Excoffier et al. (1992) 55. Groups were defined based on their collection membership (e.g., their lake/river of origin or cultivar/breeding line identity), species (Z. palustris orZ. aquatica ), and, more broadly, their germplasm type (e.g., Natural Stand or Cultivated). We performed AMOVAs using the ”farthest neighbor” algorithm and the ”quasieuclid” (default) method in theade4 package using default parameters 56. The significance was determined using the ‘randtest’ function with 999 repetitions.
Correlation between geographic and genetic distances in our Natural Stand collection, were assessed with a Mantel test using the ‘mantel.test’ function from the ape R package, based on genetic distances calculated with the ‘dist.genpop’ function from theadegenet R package and geographic distances calculated using the ‘distm’ function from the geosphere R package57. The regression equation was found by fitting the genetic and geographic distances to a linear model.
Gene flow . Pairwise estimations of genetic differentiation (FST ) 58 between different subgroups based on geographic origin and germplasm type (cultivated population, natural stand, species) were calculated using the ‘stamppFst’ function from the StAMPP R package59. Dsuite version 0.4 60 was used to calculate Patterson’s D -statistics, also known as the ABBA-BABA statistic, in order to test for introgressions between groups. Since this analysis requires four groups, we defined the groups based on their membership in STRUCTURE-assigned groups when K =4. This approach split the Natural Stands into two groups, while the Cultivated collection formed a third group. The fourth group (the outgroup) wasZ. aquatica .
Genome-wide Scans for Signatures of Selection . We used a series of tests to identify putative selection events in cultivated NWR. We calculated nucleotide diversity (π) 61, Tajima’s D62, and FST 58 for each SNP using VCFtools. Natural Stand and Cultivated collections were analyzed separately. The results were plotted in the R statistical environment by averaging π values of ~10 Mb/bin. We also used the Cross-Population Composite Likelihood Ratio (XP-CLR) test63 to test for large deviations between Natural Stand and Cultivated collections.