Transcriptome sequencing from greenhouse plants
To identify genetic markers that could distinguish the fourClarkia species, we performed transcriptome sequencing from tissue obtained from Clarkia plants we grew in the greenhouse (five individuals per species; n = 20). We chose to use transcriptome sequencing because it was a reduced-representation genomic approach that produced long, contiguous sequences—as compared to ddRAD sequencing—that was necessary for us to develop subsequent amplicon probes. To grow plants in the greenhouse, we first cold-stratified and germinated seeds of each species of Clarkia in February 2016 in Ithaca, NY. Germinated seedlings were transferred into D40L conetainers (Stuewe & Sons, Tangent, 208 Oregon, USA) with a mix of 50% potting soil and 50% perlite. Once in the conetainers, seedlings were bottom-watered and grown in common conditions in a greenhouse for two months. We harvested seedlings for RNA extraction when seedlings had more than four true leaves, but before they had started flowering (March-April 2016).
The leaf tissue of seedlings was harvested and flash frozen in liquid nitrogen. To extract total RNA, we first mechanically homogenized approximately 100mg of leaf tissue with a nitrogen-chilled mortar and pestle. We then mixed the homogenized tissue with 1mL of TRIzol and 200uL chloroform, following the manufacturers guidelines for RNA isolation. We then used 200uL of the isolation to a RNeasy Mini Elute silica column (Qiagen). We added 5 ul of DNAase (NEB) to the final 45 uL of the final elution, and aliquoted 20 uL of NEBNext Oligo d(t)25 beads to isolate mRNA from the total RNA pool. We then followed the protocol for NEBNext Ultra Directional RNA Library prep kit (NEB #E7429L). Due to low mRNA yield, we modified the protocol such that PCR enrichment included 30 cycles. We individually indexed the 20 samples and ran these on a single lane of Illumina HighSeq, using single end 100 bp sequencing chemistry.
We combined data from the five C. speciosa individuals to generate a reference draft transcriptome assembly using Trinity (Haas et al. 2013). We note here that many of our sequence reads derived from likely chloroplast DNA (cpDNA). Our goal was to obtain DNA markers within the nuclear genome as the presence and amount of choloroplast organelles within each pollen grain is unknown. We note that future studies might instead sample non-leafy tissue where possible, which will reduce the number of choloroplast reads. However, for the present study, we removed these reads by initially aligning the total read pools from all individuals to the chloroplast genome of a related species,Oenothera picensis (NCBI accession number KX118607). From the pool of reads that did not align to the O. picensis cpDNA genome, we aligned these reads from each individual to the draftC. speciosa transcriptome from above. We then called SNPs using the GATK pipeline, using the same set of presets as in Toews et al. (2016). We allowed for filtered out SNPs with more than 50% missing data and a minor allele frequency of less than 5%.
Ideally, our goal was to identify a single genomic region that (1) we could PCR amplify and (2) included derived, fixed SNPs for each species. To do this, we estimated Weir and Cockerham per-SNPF ST estimates from VCFTools (Danecek et al. 2011). We generated F ST estimates for one species compared to the other 15 individuals, and replicated this across all four species. We then determined which transcriptome contig contained multiple SNPs that had F ST = 1 for each of the four species. We used BLAST to compare our top contig in our assembly to the nucleotide database at NCBI Genbank. The top hit was aClarkia unguiculata sequence (NCBI accession number EF017402), and our contig aligned to a region that spans the 5.8SrRNA gene, the internal transcribed spacer 2 gene, and the26S rRNA gene. To amplify this region across additional samples, we used the forward primer sequence [TCGTCGGCAGCGTC]GTGCCTCGGAGATCATCTGT and reverse primer sequence
[GTCTCGTGGGCTCG]GCCGTGAACCATCGAGTCTTT, with the brackets indicating the portion of the sequence (P5 and P7, respectively) that would align to our dual-indexed (i5 and i7) adaptors.