3RAD - Analytical considerations/caveats
First, when assembling 3RAD data it is essential to consider how assembly parameters, especially locus occupancy, will affect downstream analyses. Studies have shown that the amount of missing data can greatly affect phylogenetic inference (e.g., Crotti et al., 2019; Eaton et al., 2016; Huang & Knowles, 2016), with data sets that tolerate more missing data tending to yield more robust phylogenies than those that tolerate little. This result has been attributed to two related effects: larger data sets contain more phylogenetic signal, and informative sites are excluded when a greater percentage of taxon coverage is required for a locus to be retained in the final data matrix. Quickly mutating sites are therefore disproportionately omitted as the required taxon coverage increases, which removes potentially variable and informative characters (Crotti et al., 2019).
We examined the effect of increasing the minimum samples per locus parameter for data sets containing all individuals as well as those with A. unicolor B samples only, and found that increasing the number of samples required greatly lowered the number of SNPs retained in the final matrix (summarized in Table 1). Our phylogenetic inferences using these data sets (Figure 2; Supplementary Figures S1 & S2) reflect the findings of previous studies (e.g., Crotti et al., 2019), which showed that a disproportionate loss of informative SNPs results in phylogenies with ambiguous or unresolved evolutionary relationships. The All_M30 (30% locus occupancy) and All_M40 (40% locus occupancy) matrices yield many nodes with low support (< 50) as well as polytomies; in contrast, All_M20 and All_M10 yield very similar topologies in which a majority of nodes are highly supported (≥ 95).
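The filtering step described above can be illustrated with a minimal sketch. It assumes a toy genotype matrix in which missing calls are stored as None and the occupancy threshold is expressed as a fraction of samples. The function names, the toy data, and the thresholds are hypothetical and are not taken from the assembly pipeline used in this study.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy matrix: sample name -> one call per locus (None = missing).
Genotypes = Dict[str, List[Optional[str]]]


def locus_occupancy(genotypes: Genotypes) -> List[float]:
    """Return, for each locus, the fraction of samples with a non-missing call."""
    samples = list(genotypes.values())
    n_samples = len(samples)
    n_loci = len(samples[0])
    return [
        sum(calls[i] is not None for calls in samples) / n_samples
        for i in range(n_loci)
    ]


def filter_by_occupancy(
    genotypes: Genotypes, min_occupancy: float
) -> Tuple[Genotypes, List[int]]:
    """Keep only loci whose occupancy is at least min_occupancy (0 to 1).

    Returns the filtered matrix and the indices of the retained loci.
    """
    occupancy = locus_occupancy(genotypes)
    keep = [i for i, occ in enumerate(occupancy) if occ >= min_occupancy]
    filtered = {name: [calls[i] for i in keep] for name, calls in genotypes.items()}
    return filtered, keep


# Illustrative data only: four samples, four loci.
toy: Genotypes = {
    "s1": ["A", "C", None, "G"],
    "s2": ["A", None, None, "G"],
    "s3": ["T", "C", "T", "G"],
    "s4": ["A", None, None, None],
}

for threshold in (0.3, 0.7):
    _, kept = filter_by_occupancy(toy, threshold)
    print(f"min occupancy {threshold:.0%}: {len(kept)} loci retained")
```

Raising the threshold in this toy example shrinks the matrix in the same way that the higher locus occupancy settings reduce the number of SNPs retained in Table 1, although real assemblies apply the filter per locus across many more samples.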
Not only did the amount of missing data affect our phylogenetic inferences, but it also influenced our STRUCTURE and VAE analyses (Figure 4). We explored the effects of missing data on these clustering analyses with our A. unicolor B only data sets (UniB_M30, UniB_M40, UniB_M50, UniB_M60, and UniB_M70). For STRUCTURE, the optimal number of clusters identified by ΔK decreased from K = 3 for UniB_M30 to K = 2 for UniB_M40, UniB_M50, and UniB_M60, whereas UniB_M70 showed a peculiar increase back to K = 3. Although the UniB_M30 clusters reflect population structure (albeit with some admixture) corresponding to the three lineages within A. unicolor B, the data sets with K = 2 cluster B2 + B3 together with varying amounts of admixture. The three clusters inferred from the UniB_M70 data set do not necessarily reflect the structure found in UniB_M30, because several individuals do not cluster with other individuals from their lineage. This is most likely because too few informative sites remain to accurately detect structure in the data.
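For readers unfamiliar with the ΔK statistic, the sketch below shows one common way to compute it from replicate STRUCTURE runs: the absolute second-order difference of the mean log-likelihood across successive K values, divided by the standard deviation of the log-likelihood at K. The function name and the log-likelihood values are hypothetical and purely illustrative; they are not results from this study.

```python
from statistics import mean, stdev
from typing import Dict, List


def evanno_delta_k(ln_prob: Dict[int, List[float]]) -> Dict[int, float]:
    """Compute Delta K for every K that has both K-1 and K+1 available.

    ln_prob maps each K to the log-likelihoods of its replicate runs.
    Delta K = |mean L(K+1) - 2 * mean L(K) + mean L(K-1)| / sd L(K).
    """
    means = {k: mean(values) for k, values in ln_prob.items()}
    delta: Dict[int, float] = {}
    for k in sorted(ln_prob):
        if k - 1 not in ln_prob or k + 1 not in ln_prob:
            continue  # Delta K is undefined for the smallest and largest K
        spread = stdev(ln_prob[k])
        if spread == 0:
            continue  # avoid division by zero when replicates are identical
        second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
        delta[k] = second_diff / spread
    return delta


# Hypothetical replicate log-likelihoods for K = 1..5 (three runs each).
runs = {
    1: [-1000.0, -1002.0, -998.0],
    2: [-900.0, -901.0, -899.0],
    3: [-850.0, -852.0, -848.0],
    4: [-845.0, -840.0, -850.0],
    5: [-843.0, -838.0, -848.0],
}

delta_k = evanno_delta_k(runs)
best_k = max(delta_k, key=delta_k.get)
print(f"K with the highest Delta K: {best_k}")
```

Because ΔK depends on neighbouring K values, it cannot be evaluated for the smallest or largest K tested, which is worth keeping in mind when interpreting shifts between K = 2 and K = 3 across data sets.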
VAE clustering across these data sets reflected the same general trend as STRUCTURE, whereby increasing taxon coverage greatly decreased the amount of structure detectable in the data (Figure 4). UniB_M30 showed the greatest amount of structure for each lineage, though some overlap among lineages is still present. UniB_M70 shows practically no structure, which most likely results from the small number of loci retained in the data matrix, most of which are uniform. VAE, which uses the structure inherent in the data to train the model, is likely affected by this deficiency of informative sites.