3RAD - Analytical considerations/caveats
First, when assembling 3RAD data it is essential to consider the effects
that assembly parameters, especially locus occupancy, will have on
downstream analyses. Studies have shown that the amount of missing data
can greatly affect resulting phylogenetic inferences (e.g., Crotti et
al., 2019; Eaton et al., 2016; Huang & Knowles, 2016), with matrices
that tolerate more missing data tending to yield more robust phylogenies
than matrices with little missing data. This result has been attributed
to two factors: larger data sets contain more phylogenetic signal, and
informative sites are excluded when a greater percentage of taxon
coverage is required for a locus to be retained in the final data
matrix. Quickly mutating sites are therefore disproportionately omitted
as the required taxon coverage increases, which removes potentially
variable and informative characters (Crotti et al., 2019).
We examined the effect of increasing the minimum sample per locus
parameter for data sets containing all individuals as well as those with
A. unicolor B samples only and found that increasing the number of
samples required greatly lowered the number of SNPs retained in the
final matrix (summarized in Table 1). Our phylogenetic inferences using
these data sets (Figure 2; Supplementary Figures S1 & S2) are consistent
with previous studies (e.g., Crotti et al., 2019): the disproportionate
loss of informative SNPs results in phylogenies with ambiguous or
unresolved evolutionary relationships. The All_M30 (30% locus
occupancy) and All_M40 (40% locus occupancy) matrices yield trees with
polytomies and many poorly supported nodes (support < 50), whereas
All_M20 and All_M10 yield very similar topologies in which most nodes
are highly supported (≥ 95).
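
The effect of the minimum-samples-per-locus threshold can be illustrated
with a minimal sketch. It assumes a samples-by-loci matrix in which
missing calls are coded as NaN; the function name filter_by_occupancy,
the min_samples argument, and the toy matrix are hypothetical and are
not taken from our assembly pipeline or data.

```python
import numpy as np

def filter_by_occupancy(genotypes, min_samples):
    """Keep loci (columns) that have data for at least min_samples samples.

    genotypes: 2-D array, rows = samples, columns = loci, NaN = missing.
    Returns the filtered matrix and the boolean mask of retained loci.
    """
    genotypes = np.asarray(genotypes, dtype=float)
    # Number of samples with a non-missing call at each locus.
    samples_per_locus = (~np.isnan(genotypes)).sum(axis=0)
    keep = samples_per_locus >= min_samples
    return genotypes[:, keep], keep

nan = np.nan
# Hypothetical toy matrix: 5 samples x 5 loci with uneven coverage.
toy = np.array([
    [0, 1,   0,   2,   1],
    [0, nan, 1,   nan, nan],
    [1, 1,   nan, nan, nan],
    [0, 0,   1,   2,   nan],
    [0, 1,   nan, nan, nan],
])

# Count the loci retained at each threshold from 1 to 5 samples.
retained = {m: int(filter_by_occupancy(toy, m)[1].sum()) for m in range(1, 6)}
print(retained)
```

In this toy case every increase in the threshold removes another locus,
which mirrors in miniature the loss of SNPs summarized in Table 1.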
Not only did the amount of missing data affect our phylogenetic
inferences, but it also influenced our STRUCTURE and VAE analyses
(Figure 4). We explored the effects of missing data on these clustering
analyses with our A. unicolor B only data sets (UniB_M30,
UniB_M40, UniB_M50, UniB_M60, and UniB_M70). For STRUCTURE, the
best-supported number of clusters according to ΔK decreased from K=3
for UniB_M30 to K=2 for UniB_M40, UniB_M50, and UniB_M60, whereas
UniB_M70 unexpectedly returned to K=3. Although the UniB_M30 clusters
reflect population structure (albeit with some admixture) corresponding
to the three lineages within A. unicolor B, the data sets with K=2
group B2+B3 together with varying amounts of admixture. The three
clusters inferred from the UniB_M70 data set do not necessarily reflect
the structure found in UniB_M30, because several individuals do not
cluster with other individuals from their lineage. This is most likely
because too few informative sites remain to detect structure
accurately.
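
For readers unfamiliar with how a best-supported K is chosen, the sketch
below shows one common way to compute ΔK. It assumes that ΔK here is the
statistic of Evanno et al. (2005), i.e. the absolute second-order
difference of the mean log-likelihood across successive K divided by the
standard deviation among replicate runs at that K. The function name
delta_k and the log-likelihood values are hypothetical and are not
output from our STRUCTURE runs.

```python
import numpy as np

def delta_k(log_likelihoods):
    """Compute Delta K for each interior K.

    log_likelihoods: dict mapping K -> list of Ln P(D) values from
    replicate runs at that K. Uses the means across replicates; assumes
    the replicates at each K are not all identical (non-zero SD).
    """
    ks = sorted(log_likelihoods)
    mean = {k: np.mean(log_likelihoods[k]) for k in ks}
    sd = {k: np.std(log_likelihoods[k], ddof=1) for k in ks}
    result = {}
    for k in ks:
        # Delta K is undefined for the smallest and largest K tested.
        if k - 1 in mean and k + 1 in mean:
            second_diff = abs(mean[k + 1] - 2 * mean[k] + mean[k - 1])
            result[k] = second_diff / sd[k]
    return result

# Hypothetical replicate log-likelihoods for K = 1..4.
runs = {
    1: [-5000.0, -5002.0, -4998.0],
    2: [-4500.0, -4501.0, -4499.0],
    3: [-4300.0, -4302.0, -4298.0],
    4: [-4290.0, -4310.0, -4270.0],
}

dk = delta_k(runs)
best_k = max(dk, key=dk.get)
print(dk, best_k)
```

Because ΔK depends on the variance among replicate runs as well as on
the likelihood curve, noisy runs on a matrix with few informative sites
can shift the selected K.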
VAE clustering across these data sets showed the same general trend as
STRUCTURE, whereby increasing taxon coverage greatly decreased the
amount of structure detectable in the data (Figure 4). UniB_M30 showed
the most distinct clustering of the lineages, though some overlap among
them remained. UniB_M70 showed practically no noticeable structure,
most likely because few loci were retained in the data matrix and most
of those were nearly uniform. VAE, which uses the structure inherent in
the data to train the model, is likely affected by the deficiency of
informative sites.