2.4 Demographic events and population assignment
Genetic diversity indices for each population, including observed and
expected heterozygosity, private alleles, nucleotide diversity and
inbreeding coefficient, were calculated using populations in
STACKS v2.2 (Rochette et al., 2019). In addition, Tajima’s D was
calculated using VCFtools (Danecek et al., 2011) to assess recent
historical demographic events. Finally, population differentiation was
analysed using pairwise Fst, and the statistical
significance was calculated under 20,000 permutations in ARLEQUIN
v.3.5.2.1.
We assessed the population structure and differentiation using three
clustering approaches. Firstly, we used the discriminant analysis of
principal components (DAPC) (Jombart, Devillard, & Balloux, 2010) as
implemented in the adegenet package (Jombart, 2008) in R (Team,
2020). The cross-validation xvaldapc function was used to
determine the optimal number of principal components (PCs) using a
training set of 0.6 with 1000 replicates (n.rep = 1000 ).
The find.cluster function was used to infer the most
likely number of clusters with the Akaike Information Criterion (AIC)
and K-means of 100 (Jombart et al., 2010). The optimal number of genetic
clusters was identified as the lowest AIC value. Secondly, we ran
STRUCTURE v2.3.4. (Pritchard, Stephens, & Donnelly, 2000) with 200,000
MCMC iterations following a burn-in of 100,000 iterations, using the
admixture ancestry model and no sampling location as a prior, setting a
putative K from 1 to 6 in a total of 15 independent iterations in
each run (Falush, Stephens, & Pritchard, 2003). The web version of
STRUCTURE HARVESTER (Earl & vonHoldt, 2011)
(http://taylor0.biology.ucla.edu/structureHarvester/) was used to
infer the most likely K value based on the Evanno method (Evanno,
Regnaut, & Goudet, 2005) and the highest posterior mean log-likelihood
(mean LnP(K)). CLUMPP (Jakobsson & Rosenberg, 2007) and DISTRUCT v1.1
(Rosenberg, 2003) were used to summarise the similarity between
replicates and visualise the STRUCTURE plots, respectively. Finally,
the fineRADstructure package (Malinsky, Trucchi, Lawson, &
Falush, 2018) was run to understand the recent shared ancestry at high
resolution derived from haplotype linkage information among all the
samples. The database used contained 13,149 SNPs obtained after
running populations on the set of 1,907 neutral SNPs deselecting
the -write-single-snps option; this allowed for the inclusion of
linked SNPs in the different RAD-tags. This dataset was reordered first
according to linkage disequilibrium using the script sampleLD.R,
fineRADstructure was run using the default settings. The results were
graphically interpreted using the Finestructure package (Lawson,
Hellenthal, Myers, & Falush, 2012) and
the fineRADstructurePlot.R script (Malinsky et al., 2018).
2.5 Demographic history scenarios testing and species
tree
To investigate which specific demographic history hypotheses were
observed to be most compatible with genetic structure, we applied a
Bayesian approximate computation approach with supervised machine
learning as implemented in DIYABC Random Forest (DIYABC-RF) v 1.0
(Collin et al., 2021). Since the method is computational demanding, the
scenarios were kept simple and easily distinguishable from each other
(Cabrera & Palsboll, 2017). To save calculation time, we grouped
individuals into four groups (1) South Africa, (2) Brazil, (3) Spain and
Guatemala (Spain-Guatemala), and (4) Greece and Australia
(Greece-Australia) according to genetic groups defined by DAPC and
STRUCTURE analyses (see Results, Fig. 3). The modelling strategy was
focused on testing different phylogeographic scenarios reflecting
hypotheses of colonisation routes from South Africa to Brazil or to the
cluster Spain-Guatemala based on the results in Fig. 6 and the
historical records (Fig. 1). Main phylogeographic hypotheses are
explained graphically in Fig. S1A-B, Supporting information. These
scenarios were analysed individually, after selecting the best
hypothetical evolutionary scenario, we performed a second analysis only
for the three scenarios with the highest classification votes (Fig. S1B,
Supporting information).
In accordance with DIYABC’s requirements, those SNPs missing in all
individuals in one population need to be excluded, which reduced the
dataset to 312 SNPs across the 92 individuals. Uniform priors were used
as described in Karsten et al. (2015). Effective population sizes
(Ne ) were allowed to range between 10 to 100,000;
the priors for generation times between divergence events were set from
10 to 10,000 and admixture rate ranged from 0.001 to 0.999. A total of
20,000 data sets were simulated for each scenario and all simulated data
sets were used in each Random Forest training set. We used five noise
variables and ran 2,000 random trees to select the most likely
phylogeographic scenario.
The species tree was inferred using SNAPP (Bryant, Bouckaert,
Felsenstein, Rosenberg, & RoyChoudhury, 2012), a coalescent-based
method analysis that uses SNP markers directly to compute the likelihood
of the species tree under a finite-sites mutation model over all
possible gene trees. SNAPP is implemented in BEAST v2.6.2 (Bouckaert et
al., 2014) and estimates the probability of allele frequency change
across ancestor and descendent nodes, giving as results a posterior
distribution for the species tree (Bryant et al., 2012).
The putative neutral dataset was additionally filtered for
non-polymorphic loci following the model assumptions. Running SNAPP
comes at a high computational cost; therefore, we pruned our dataset to
include five randomly selected individuals from each population (N=30).
The mutation rates (u and v ) were calculated and ran for
at least one million generations per replicate with sampling every 1,000
generations. We conducted three independent runs and evaluated
convergence in Tracer v1.7 (Rambaut, Drummond, Xie, Baele, & Suchard,
2018). The following analyses were carried out with the runs whose
parameters presented ESS > 200. We removed 10% of trees as
burn-in and merged tree and log files from the different runs using
LOGCOMBINER v 2.4.1. Then, TREEANNOTATOR v2.6.2 was used to obtain
maximum credibility trees and the trees that were contained in the 95%
highest posterior density (HDP). To visualise the posterior distribution
of trees we used DENSITREE v2.2.7 (Bouckaert, 2010). This process was
repeated further by randomly resampling different individuals than those
used for the first run (with the exception of sampling sites from
Australia and Guatemala with < 10 individuals) to determine
the subsampling effect on the analysis.
2.6 Microbiome analysis
Sixty-three different individuals collected from the populations used in
the RADseq analyses were used to study their microbial structure and
community composition. Individuals from all populations were used,
except for Greece and Guatemala. In addition, we included new samples
from Colombia (CO) and Israel (IS) in this analysis, resulting in a
total of six sampling sites collected in nine different host plants
(Figure 1 and details in Supplementary Table 1). DNA was extracted
following the protocol mentioned above, and the V4 region of the 16S
rRNA gene was amplified in duplicates using the primers 515F-806R
(Caporaso et al., 2011) in one step PCR (95°C 5min, 25 cycles x [95°C
20s, 55°C 20s, 72°C 30s], 72°C 5min). The resulting amplicons were
pooled per sample and cleaned (AMPure XP magnetic beads, Beckman
Coulter, Indianapolis, USA) and PCR indexing was carried out using a
Nextera XT DNA Library Preparation Kit (Illumina Inc., San Diego,
US-CA). The paired-end sequencing was performed at the Molecular Core
Labs (Sequencing Facility) of The Natural History Museum on a complete
flow cell of the Illumina MiSeq platform using v3 chemistry (2 x 300
bp).
Sequence data were processed in Mothur v1.41.3 by adapting
the MiSeq SOP protocol (Kozich, Westcott, Baxter, Highlander, &
Schloss, 2013; Schloss et al., 2009). Following quality filtering, we
merged forward and reverse reads. Primers were removed, ambiguous bases
were discarded, and sequences with over 15 homopolymers were excluded
from further analysis. Unique sequences were aligned against the Silva
Seed v132 reference database (Quast et al., 2013), which was reduced and
customized to the V4 region to avoid errors and improve alignment
quality. The maximum length of the sequences was set to 275 bp, and
those that did not align well were removed. Duplicate sequences were
merged, and denoising of unique sequences was performed with the commandpre.cluster with an allowance of 1 difference per 100 bp
(Callahan et al., 2016) to infer Amplicon Sequence Variants (ASVs).
Singletons that did not cluster were removed at this stage with the
command split.abund. Chimaeras were removed using UCHIME and the
Silva Gold database (Edgar, Haas, Clemente, Quince, & Knight, 2011).
ASVs were taxonomically classified using the Silva NR v132 reference
database with the command classify.seqs , and unwanted lineages,
such as Archaea and Eukaryota, were removed with the commandremove.lineage .
ASVs were summarised at the phylum, family, and genus taxonomic levels
for further analysis. All statistical analyses were computed in R (Team,
2020) using the packages phyloseq, microbiome,
ape and vegan . The number of ASVs shared in groups between
different countries were visualized with Venn diagrams produced through
the Van de Peer lab website
(http://bioinformatics.psb.ugent.be/webtools/Venn/). Alpha
diversity (within samples and communities) was calculated with the
species richness (total number of ASVs) estimator, and the Shannon
(Shannon, 1948) and Inverse Simpson’s (Simpson, 1949) indexes. In
addition, the Bray-Curtis distance metric was used to quantify the
dissimilarity of community composition between samples in the different
sampling localities, and permutational multivariate analysis of variance
(PERMANOVA) was used on the distance matrix to test for statistical
differences between groups with the function adonis() from the
vegan package. Group homogeneity of dispersion was evaluated with the
function betadisper() from the vegan package (Anderson et al.,
2006). Pairwise comparisons were tested with the pairwise.adoniswrapper function by Martinez Arbizu (2020), and p-values were adjusted
using Bonferroni corrections. We also performed analysis of similarities
(ANOSIM) tests, with the Bray-Curtis distance metric, to evaluate
whether any differences in microbiome composition between groups were
attributed to geographical distance or medfly diet. ANOSIM results in a
ratio of between to within groups variation, with an R value closer to 1
suggesting dissimilarity between groups and values closer to 0
suggesting even distribution between and within groups (Clarke, 1993).
For these tests we used the anosim() function from the vegan
package.
Differential abundance analyses were performed on the most
abundant ASVs to assess differences in the microbiome composition across
sample sites with the package edgeR (Robinson, McCarthy, &
Smyth, 2010) implemented in R (Team, 2020). First, ASVs with relative
abundance ≥ 0.01% were chosen, and data normalisation was implemented
with function calcNormFactors() . Then, an exact binomial test was
performed to generalise over dispersed counts with the
function exactTest() (Xia, Sun, & Chen, 2018). Next, the top
differentially abundant ASVs were tabulated with
function topTags() and the false discovery rate (FDR) was
estimated by adjusting the p-values with the Benjamini-Hochberg
correction method (Benjamini & Hochberg, 1995); ASVs with an FDR
< 0.01 were considered significant and were chosen to
visualize differential abundance in the microbiome composition with the
package ggplot2 (Wickham, 2016; Xia et al., 2018).