Materials and Methods
Plant materials. The diversity collections consisted of wild
populations gathered across northern MN and Wisconsin (WI) (referred to
here as the Natural Stand collection); cultivars and germplasm from the
UMN cNWR breeding program (referred to here as the Cultivated
collection); and a population of Zizania aquatica L. from the
Platte River in MN (referred to here as the outgroup). A Temporal panel
was also generated to look at potential changes in diversity over time
within two natural stand lake populations. In all, a total of 889
individuals were evaluated in this study, which consisted of 530 Natural
Stand and 209 Cultivated samples collected in 2018, in addition to 100
samples collected in 2010 for the Temporal panel.
The Natural Stand collection was obtained from 10 wild populations
across northern Minnesota (50 samples per lake/river) and 2 wild
populations from western Wisconsin (10 from Mud Hen Lake; 20 from
Phantom Lake) (Figure 1; Table S1). An additional 50 Z. aquatica samples, also collected in central Minnesota, were used as an outgroup
in this study. Three major hydrologic unit code (HUC) subbasins
were represented in this collection including seven populations from the
Upper Mississippi River (UMR) watershed, three populations from the Red
River of the North (RRN) watershed, and two populations from the St.
Croix River (SCR) watershed (Figure 1; Table S1). A distance (km) matrix
for all Natural Stand populations is available in Table S2.
The Cultivated collection consisted of leaf samples from 209
open-pollinated individuals, representing 4 cultivars and 10 breeding
populations from the UMN cNWR breeding program. Samples were collected
at the UMN North Central Research and Outreach Center (NCROC) in Grand
Rapids, MN in 2018 (Table S1; Figure 1; Figure S1).
Our Temporal collection consisted of 200 natural stand samples collected
from Garfield and Shell Lakes in 2010 and in 2018, with 50 samples
collected in each lake at each time point. GPS coordinates were used to
ensure that the same population at the same location were collected at
both time points (Table S1). We would like to note that samples from
Garfield and Shell Lakes collected in 2018 were also a part of the
Natural Stand collection (50 samples per lake).
DNA extraction and sequencing. During collection, approximately 8
cm of leaf tissue was harvested on ice and then lyophilized using a
TissueLyser II (Qiagen, Valencia, CA, USA). Genomic DNA was extracted
using a Qiagen DNeasy Plant Mini Kit (Qiagen, Valencia, CA, USA)
according to the manufacturer’s instructions. DNA concentration was
measured with a NanoDrop spectrophotometer (Thermo Scientific,
Wilmington, DE, USA), samples were submitted to the UMN Genomics Center
for library preparation and sequencing. The digestion step was performed
using two restriction enzymes, Btg1 (5’-C/CRYGG-3’) andTaqI (5’-T/CGA-3’) (Shao et al., 2019). Afterward, unique
barcodes were ligated to DNA fragments for sample identification and
pooling. Single-end 150-bp sequencing to a depth of 2.5 million reads
per sample was performed on an Illumina NovaSeq machine (Illumina, San
Diego, CA, USA). Raw data were deposited in the National Center for
Biotechnology Information Short Read Archive (NCBI SRA) under accession
number PRJNA774842. BioSample accession numbers for individual samples
are provided in Table S3.
Read mapping . Quality control was initially performed on the
FASTQ files using fastQC version 0.11.7 32 to check
for read quality and adapter contamination. After adapter trimming with
Cutadapt version 1.18 33, reads were mapped to the
reference genome v1.0 of cNWR cultivar ‘Itasca-C12’ 34of BWA-MEM version 0.7.13 35. SNPs were called using
the ‘mpileup’ function from BCFtools version 2.3 36and sorted with the sort function from samtools version 1.937, resulting in 2,183 variant call format (VCF)
files, one for each scaffold of the NWR genome 34. The
17 largest VCF files representing the 15 NWR chromosomes and 2
additional large scaffolds of the NWR reference genome were merged into
a single VCF file with the ‘concat’ function from BCFtools. Filtering of
the merged VCF file was carried out using VCFtools 38.
All analyses were done with default parameters. A maximum missing rate
of 20% across all samples and a minimum depth of 4 reads per variant
site was used to obtain the final SNP set, which was deposited at the
Data Repository for the University of Minnesota (DRUM) under the handle
https://hdl.handle.net/11299/264603 (doi will be provided at the
revisions stage, if accepted).
Marker statistics . Summary statistics were calculated for each
SNP including their distribution across chromosomes and scaffolds, their
polymorphism information content (PIC) value, and their
transition/transversion (TsTv) ratio. Polymorphism information content
(PIC) was calculated using the snpReady R package39. The number and type of transitions and
transversions were calculated using VCFtools.
Genetic Diversity Assessments. To assess the structure and
distribution of genetic variation within our collections, we conducted a
principal coordinate analysis (PCoA). To generate the full sample PCoA,
the VCF file resulting from our SNP calling pipeline was imported into
the R statistical environment 40 using the vcfR package 41. For the individual Natural Stand,
Cultivated, and Temporal PCoAs, the original VCF file was first
subsetted to only include the relevant samples using PLINK version
1.90b6.10 42. For all PCoAs, the R packagevegan 43 was used to calculate a dissimilarity
matrix based on the Jaccard distance using the ‘vegdist’ function, as
well as eigenvectors and eigenvalues using the ‘cmdscale’ function. PCoA
plots were then generated using the ggplot2 package44.
Neighbor-joining trees for the Natural Stand and Cultivated collections
as well as the Temporal collection were created using the R packagesadegenet 45,46, ape 47, poppr 48,49, andvcfR . The temporal cluster analysis was performed similarly to
the methods described in Jacquemyn et al., 2006 50. We
estimated Prevosti’s genetic distance using the unweighted pair group
method with arithmetic mean (UPGMA) algorithm option for the ‘aboot’
function (bootstrapped dendrograms) in poppr with the 1,000
bootstrap replicates, and a selected cutoff value of 50.
Population structure and admixture in our Natural Stand and Cultivated
collections was assessed using Bayesian clustering implemented in
STRUCTURE version 2.3.4 51. Genotypic data for these
individuals and the final bi-allelic SNP set were loaded into STRUCTURE
and analyzed with the admixture model. The Markov Chain Monte Carlo was
run from K =2 to K =14 with a burn-in length of 1,000
followed by 10,000 iterations. Lower K values were used to look
for larger structural diversity patterns, while the higher K values were chosen to test if we were able to separate each population
into their own STRUCTURE-assigned cluster (e.g., area of collection).
Each K value was run 3 times, then compiled into a merged file
using the ‘clumppExport’ function from the pophelper R package
and subsequently plotted with the pophelper package 52. The ideal number of clusters was determined using
the DeltaK statistic from the Structure Harvester web tool version
0.6.94 53, which uses the Evanno et al. (2005) method
for determining the number of clusters 54.
Analysis of Molecular Variance (AMOVA) was performed using the
‘poppr.amova’ function from the poppr R package based on the work
of Excoffier et al. (1992) 55. Groups were defined
based on their collection membership (e.g., their lake/river of origin
or cultivar/breeding line identity), species (Z. palustris orZ. aquatica ), and, more broadly, their germplasm type (e.g.,
Natural Stand or Cultivated). We performed AMOVAs using the ”farthest
neighbor” algorithm and the ”quasieuclid” (default) method in theade4 package using default parameters 56. The
significance was determined using the ‘randtest’ function with 999
repetitions.
Correlation between geographic and genetic distances in our Natural
Stand collection, were assessed with a Mantel test using the
‘mantel.test’ function from the ape R package, based on genetic
distances calculated with the ‘dist.genpop’ function from theadegenet R package and geographic distances calculated using the
‘distm’ function from the geosphere R package57. The regression equation was found by fitting the
genetic and geographic distances to a linear model.
Gene flow . Pairwise estimations of genetic differentiation
(FST ) 58 between different
subgroups based on geographic origin and germplasm type (cultivated
population, natural stand, species) were calculated using the
‘stamppFst’ function from the StAMPP R package59. Dsuite version 0.4 60 was used
to calculate Patterson’s D -statistics, also known as the
ABBA-BABA statistic, in order to test for introgressions between groups.
Since this analysis requires four groups, we defined the groups based on
their membership in STRUCTURE-assigned groups when K =4. This
approach split the Natural Stands into two groups, while the Cultivated
collection formed a third group. The fourth group (the outgroup) wasZ. aquatica .
Genome-wide Scans for Signatures of Selection . We used a series
of tests to identify putative selection events in cultivated NWR. We
calculated nucleotide diversity (π) 61, Tajima’s D62, and FST 58 for
each SNP using VCFtools. Natural Stand and Cultivated collections were
analyzed separately. The results were plotted in the R statistical
environment by averaging π values of ~10 Mb/bin. We also
used the Cross-Population Composite Likelihood Ratio (XP-CLR) test63 to test for large deviations between Natural Stand
and Cultivated collections.