2.2 Genetic data collection, bioinformatic processing, and locus assembly
Tissue samples were preserved in 95% ethanol or RNAlater (Sigma-Aldrich) and extracted using the Maxwell RSC system (Promega). The nuclear gene c-mos and the mitochondrial gene cytochrome b(cyt b ) were PCR-amplified for each individual using standard primers (c-mos: S67, S68; Lawson, Slowinski, Crother, & Burbrink, 2005; cyt b : L4910B, H15720; Burbrink, Lawson, & Slowinski, 2000) and sequenced on an ABI 3730 capillary electrophoresis system. Electropherograms were edited manually in Geneious v5.6.7 (http://www.geneious.com, Kearse et al., 2012) and resulting sequences were aligned in MAFFT v.5 with default parameters (Katoh & Kuma, 2002).
We also sequenced genome-wide anonymous nuclear markers for each individual following a modified version of the ddRADseq protocol of Peterson, Weber, Kay, Fisher, and Hoekstra (2012). For each individual, a total of 300–500 ng of genomic DNA was double digested using the restriction enzymes Sbf I (restriction site 5’- CCTGCAGG-3’) andMsp I (restriction site 5’-CCGG-3’). The resulting double digestion products were then bead-cleaned with AmpureXP beads (Agencourt) and individually barcoded using custom oligonucleotide adapters. Pooled samples were size-selected to a mean insert length of 541 base pairs (bp) (487–595 bp range) with internal standards with a Pippin Prep (Sage Science, Beverly, MA). Resulting post-ligation products were amplified for eight cycles with a high-fidelity polymerase (Phusion, New England Biolabs). An Agilent TapeStation was used to determine the final fragment size distribution and concentration of each pool. Library pools were combined in equimolar amounts for sequencing on one Illumina HiSeqX lane (with a 10% Phi X spike-in and 150 bp paired-end reads).
Illumina reads from the ddRAD libraries were processed using STACKS v. 2.4 (Catchen, Hohenlohe, Bassham, Amores, & Cresko, 2013). Because the ddRAD protocol generates strand-specific libraries, prior to read filtering and assembly, we used a read-stitching approach (Hime, Briggler, Reece, & Weisrock, 2019) to join the first read from an Illumina read pair with the reverse complement of the second, recapitulating the original orientation of fragments in the genome. Stitched reads were quality-filtered and demultiplexed by individual with the process_radtags function in STACKS with the following parameters: demultiplex each library by in-line barcode, check for both restriction enzyme cut sites, remove any read with an uncalled base, rescue barcodes and RAD-Tags, and discard any read with average Phred quality score < 20 over sliding windows of 15% of the total read length. Next, we used STACKS to de novo assemble filtered and stitched Illumina read pairs.
We aimed to produce three separate ddRAD data sets, including one forT. blandingii , one for T. pulverulenta , and a combined data set comprising both species. Because the optimal de novoassembly of ddRADseq data can vary widely across taxa (Paris, Stevens, & Catchen, 2017; Shafer et al., 2017), we tested a range of assembly parameters to optimize the recovery of putatively single-copy orthologous loci. Final assembly parameters were selected based on the methods laid out in Paris et al. (2017). According to their recommendations, in USTACKS, we kept m (the minimum number of reads needed to form a stack) at 3 while in CSTACKS, we varied M (the number of mismatches allowed during loci formation) and n (the number of mismatches allowed during catalog formation) until we identified the parameters at which the maximum number of polymorphic loci were available across 80% (r = 0.8) of the population. For our data, this was found to be M = 5 and n = 15. Further parameters were tested in POPULATIONS separately for each species and for the genus as a whole in order to balance missing data and number of polymorphic loci. WithinT. blandingii and T. pulverulenta , the percent missing data was low (5% and 7.3% missing data respectively) and no further processing was needed, and r = 0.8 was used. Because of dissimilarity between the two species causing high levels of missing data in the combined dataset, further restrictions were implemented. For the genus-wide data set, we set r = 0.5 and p = 4 [p is the minimum number of populations in which a locus must be present (here 4/5)]. This approach increased the number of informative loci, but also the amount of missing data. For each of our three separate data sets, we generated a data set comprising only a single random SNP per locus (for population clustering analyses and demographic modeling), and another data set comprising full-length sequences for all loci (for use in phylogenetic reconstruction).