2.2 Genetic data collection, bioinformatic processing, and locus
assembly
Tissue samples were preserved in 95% ethanol or RNAlater
(Sigma-Aldrich) and extracted using the Maxwell RSC system (Promega).
The nuclear gene c-mos and the mitochondrial gene cytochrome b(cyt b ) were PCR-amplified for each individual using standard
primers (c-mos: S67, S68; Lawson, Slowinski, Crother, & Burbrink, 2005;
cyt b : L4910B, H15720; Burbrink, Lawson, & Slowinski, 2000) and
sequenced on an ABI 3730 capillary electrophoresis system.
Electropherograms were edited manually in Geneious v5.6.7
(http://www.geneious.com, Kearse et al., 2012) and resulting sequences
were aligned in MAFFT v.5 with default parameters (Katoh & Kuma, 2002).
We also sequenced genome-wide anonymous nuclear markers for each
individual following a modified version of the ddRADseq protocol of
Peterson, Weber, Kay, Fisher, and Hoekstra (2012). For each individual,
a total of 300–500 ng of genomic DNA was double digested using the
restriction enzymes Sbf I (restriction site 5’- CCTGCAGG-3’) andMsp I (restriction site 5’-CCGG-3’). The resulting double
digestion products were then bead-cleaned with AmpureXP beads
(Agencourt) and individually barcoded using custom oligonucleotide
adapters. Pooled samples were size-selected to a mean insert length of
541 base pairs (bp) (487–595 bp range) with internal standards with a
Pippin Prep (Sage Science, Beverly, MA). Resulting post-ligation
products were amplified for eight cycles with a high-fidelity polymerase
(Phusion, New England Biolabs). An Agilent TapeStation was used to
determine the final fragment size distribution and concentration of each
pool. Library pools were combined in equimolar amounts for sequencing on
one Illumina HiSeqX lane (with a 10% Phi X spike-in and 150 bp
paired-end reads).
Illumina reads from the ddRAD libraries were processed using STACKS v.
2.4 (Catchen, Hohenlohe, Bassham, Amores, & Cresko, 2013). Because the
ddRAD protocol generates strand-specific libraries, prior to read
filtering and assembly, we used a read-stitching approach (Hime,
Briggler, Reece, & Weisrock, 2019) to join the first read from an
Illumina read pair with the reverse complement of the second,
recapitulating the original orientation of fragments in the genome.
Stitched reads were quality-filtered and demultiplexed by individual
with the process_radtags function in STACKS with the following
parameters: demultiplex each library by in-line barcode, check for both
restriction enzyme cut sites, remove any read with an uncalled base,
rescue barcodes and RAD-Tags, and discard any read with average Phred
quality score < 20 over sliding windows of 15% of the total
read length. Next, we used STACKS to de novo assemble filtered
and stitched Illumina read pairs.
We aimed to produce three separate ddRAD data sets, including one forT. blandingii , one for T. pulverulenta , and a combined
data set comprising both species. Because the optimal de novoassembly of ddRADseq data can vary widely across taxa (Paris, Stevens,
& Catchen, 2017; Shafer et al., 2017), we tested a range of assembly
parameters to optimize the recovery of putatively single-copy
orthologous loci. Final assembly parameters were selected based on the
methods laid out in Paris et al. (2017). According to their
recommendations, in USTACKS, we kept m (the minimum number of reads
needed to form a stack) at 3 while in CSTACKS, we varied M (the number
of mismatches allowed during loci formation) and n (the number of
mismatches allowed during catalog formation) until we identified the
parameters at which the maximum number of polymorphic loci were
available across 80% (r = 0.8) of the population. For our data, this
was found to be M = 5 and n = 15. Further parameters were tested in
POPULATIONS separately for each species and for the genus as a whole in
order to balance missing data and number of polymorphic loci. WithinT. blandingii and T. pulverulenta , the percent missing
data was low (5% and 7.3% missing data respectively) and no further
processing was needed, and r = 0.8 was used. Because of dissimilarity
between the two species causing high levels of missing data in the
combined dataset, further restrictions were implemented. For the
genus-wide data set, we set r = 0.5 and p = 4 [p is the minimum number
of populations in which a locus must be present (here 4/5)]. This
approach increased the number of informative loci, but also the amount
of missing data. For each of our three separate data sets, we generated
a data set comprising only a single random SNP per locus (for population
clustering analyses and demographic modeling), and another data set
comprising full-length sequences for all loci (for use in phylogenetic
reconstruction).