Genome resequencing and variant calling
Illumina paired-end 150 bp libraries with insert size equal to 350 bp
were prepared and sequenced in a Novaseq platform. A total of 8.0G reads
were produced resulting in a mean coverage per site and sample of 28X
before filtering. Read quality was evaluated using FASTQC (55 )
after sorting reads by individual with AXE
(56 ).
Trimming and quality filtering treatment was conducted using Trim Galore
(57 ), resulting in a set of reads ranging between 90 and 138 bp
long. Reads were then mapped against the previously published reference
genome for A. marina (58 ) using the mem algorithm in the
Burrows-Wheeler Aligner (BWA; 59 ). Read groups were assigned and
BAM files generated with Picard Tools version 1.126
(http://broadinstitute.github.io/picard). We used the HaplotypeCaller +
GenotypeGVCFs tools from the Genome Analysis Toolkit (GATK; 28 )
version 4.1.8.1 to produce a set of SNPs in the variant call format
(vcf ). Using vcftools (60 ), we retained biallelic SNPs
excluding those out of a range of coverage between 4 and 50, or with a
genotyping phred quality score below 40. Twenty-two samples presenting
more than a 25% of missing data were discarded at this point. We then
applied GATK generic hard-filtering recommendations consisting of
QualByDepth (QD) > 2.0; FisherStrand (FS) < 60.0;
RMSMappingQuality (MQ) > 40; MappingQualityRankSumTest
(MQRankSum) > -12.5; ReadPosRankSum (RPRK) <
-8.0; and StrandOddsRatio (SOR) > 3.0 (GATK Best Practices;61 ). The resulting dataset (hereafter referred to as ‘Full
Dataset’) consisted of 178 individuals (Table 1; Table S1, Supplementary
Information) and 15,702,886 SNPs with a per-individual average coverage
of 16.8 and a missing data rate of 0.11. The ‘Full Dataset’ was further
filtered and customized for downstream analyses (check Table S5 from the
Supplementary Information for a list of genomic datasets and filters
applied). The program KING v2.2.7 (62 ) was used to confirm the
absence of close relatives in our set of sampled individuals. Widespread
self-fertilization was also ruled out using the program RMES and 300
variant positions randomly selected (Table S6, Supporting Information).
Population structure analyses To explore genome-wide population structure in Arabian mangroves, we
conducted a t-distributed stochastic neighbor embedding (t-SNE)
analysis, a powerful method for analyzing high-dimensional data. After
excluding the samples from Australia, SNP loci under linkage
disequilibrium were filtered out from the ‘Full Dataset’ with bcftools
(63 ) applying and r2 limit of 0.2 in windows of
10K bp. Using vcftools, a threshold for SNPs showing highly significant
deviations from Hardy-Weinberg equilibrium (HWE) with a p-value of
10-4 was also implemented to filter out false variants
arisen by the alignment of paralogous loci. Positions with less than
75% of individuals genotyped for each taxon/population were also
removed from the data matrix, along with those presenting a minor allele
frequency (MAF) below 0.02. To fulfil neutrality assumptions in
population structure analyses, we used PCADAPT (64 ) to detect and
exclude sites putatively under selection, applying a q-value threshold
of 0.05 , resulting in a final data matrix of 143,900 SNPs and 170
samples. We then used the Rtsne R package (65 ) to perform the
analysis with a perplexity value of 10, as recommended by the authors
for our number of points, and dimensionality equal to
2.
Using the same dataset as in the t-SNE analysis, we
also examined patterns of population divergence in Arabian mangroves
using a sparse non-negative matrix factorization method (SNMF;29 ). We ran the program five times per K value, with K ranging
from 2 to 20.
We computed pairwise F ST and Nei’s genetic
distance values with the hierfstat R package (66 ). For this
analysis, we did not exclude the Australian outgroup population
(Brisbane) from the dataset, yet applied the same filters than for the
t-SNE and SNMF analyses (SNP matrix = 113,706). We also computed the
observed (HO) and expected heterozygosity
(HE), and the nucleotide diversity (π) for each sampling
site using the population program from Stacks (67 ).
Phylogenetic analysis A maximum likelihood phylogeny was produced using the program IQ-TREE
(68 ) based on the SNP dataset used for F STand Nei’s genetic distance analysis, In this case, ambiguously constant
sites (position lacking homozygous representants of at least one of the
two alleles) were excluded from the ‘Neutral Dataset’ (SNP matrix =
29,433) to enable ascertainment bias correction and the generalized
time-reversible (GTR) model was
implemented.Population and demographic history
analyses We performed model comparisons under the likelihood framework developed
in fastSIMCOAL2 (30 ) to estimate demographic parameters and date
cladogenetic events among mangroves populations from the Red Sea and the
PAG plus the Sea of Oman lineages. Analyses were run independently for
each lineage, using three representative populations for each case. As
calibration node, we included in the models the Australian outgroup
population (Brisbane). We set the time of coalescence with the Arabian
lineages to 2.7 million years ago (26 ), and assumed a
conservative generation time of 25 years in the lower and upper
overlapping limits of the estimates for the Arabian Peninsula and
tropical areas, respectively (31, 47, 48 ). The mutation rate (µ)
was set to 3.265x10-8 bp/generation (31, 69 ).
For the Red Sea lineage, the populations of Duba, Al Kharrar and Farasan
Banks 2 were included in three models with different topologies:
simultaneous cladogenesis of the three lineages for a scenario of fast
diversification; subsequent cladogenesis; and a third scenario in which
Al Kharrar would have originated by the expansion and admixture of the
boreal and meridional populations. Because the Red Sea was never
completely drained through the glacial cycles during the last 400,000,
no assumptions were made about putative ancestral barriers to gene flow
and coalescence times were allowed to vary freely (Fig. S1A,
Supplementary Information).
In the case of the PAG lineage, two general hypotheses for lineage
differentiation were tested: a scenario of recent differentiation
following the colonization of the Gulf after the Last Glacial Maximum
(LGM); and a scenario of early lineage diversification in which
populations within the PAG became geographically isolated in glacial
refugia. To test these two competing hypotheses, we fixed the
coalescence times either to 560 generations or less (14,000 years); or
to above 720 generations (18,000 years). Representative populations
included in the model were Dammam from the northern basin, Ras Ghurab
from the southern basin, and Shinas from the Sea of Oman. Based on the
evolutionary relationships inferred in the phylogenetic analysis, three
different tree topologies were modeled: a case of simultaneous
cladogenesis corresponding to a fast diversification process; and two
more models in which the most recent split corresponded to Dammam-Ras
Ghurab; and to Ras Ghurab-Shinas, respectively. Each one of these
topologies was tested under the postglacial colonization and divergence
in glacial refugia scenarios, for a total of six models to be compared
(Fig. S1B, Supplementary Information).
As input data we used the folded site frequency spectra (SFS) generated
from resequencing data. We retained the samples corresponding to the
groups of study of each of the analyses from the ‘Full Dataset’. In
these analyses, SNP loci under linkage disequilibrium were filtered out
applying and more relaxed r2 limit of 0.4 in windows
of 10K bp. A HWE filter for SNPs with a p-value of
10-4 was implemented. Because singletons are important
for estimating parameters and likelihoods, no MAF filters were applied.
Positions with less than 50% of individuals genotyped for each
taxon/population were removed from the data matrices. Final matrices
were of 58,613 and 72,119 SNPs for the Red Sea and the PAG plus the Sea
of Oman datasets, respectively. The SFS were generated with easySFS
(https://github.com/isaacovercast/easySFS) maximizing the number of
segregating sites as recommended by the authors (pers. comm. ).
Parameters with the highest likelihood were estimated under each of the
models after 50 cycles of the algorithm, with 150,000 coalescent
simulations per cycle. This procedure was replicated 100 times and the
set of parameters with the highest final likelihood was retained as the
best point estimate. To identify the model that better fit the data, we
applied the Akaike information criterion. For estimating 95% CIs of the
parameters under the best model, we applied a non-parametric bootstrap
procedure. Bootstrapping was carried out by splitting the SNP matrices
in 1,000 SNP blocks and randomly combining them for each one of the
repetitions. A total of 100 analyses were run for confidence interval
estimation. Further details about fastSIMCOAL2 analyses, model sketches
and parameter files are provided in the Supplementary Information.
TreeMix v1.13 (32 ) was used to model historical patterns of gene
flow between mangrove populations. We ran separate analyses for the Red
Sea and the PAG plus the Sea of Oman lineages. Because in this analysis
there was no need to calibrate nodes, the Salalah population from the
Arabian Sea was used to root the trees in both models. SNPs datasets
were built applying the same filters as for the t-SNE and SNMF tests,
with the exemption of linkage disequilibrium filters as it can be
controlled for in the TreeMix command line. Resulting SNP matrices were
1,858,153 and 2,575,741 long for the Red Sea and the PAG plus the Sea of
Oman, respectively. We ran TreeMix for 0–10 migrations, grouping SNPs
in windows of 10K bp. Migration edges were plotted until 99.8% of the
variance in ancestry between populations was explained by the model
(32 ). The consistency of migration edges was evaluated by running
TreeMix with 50 total replicates for each added migration edge number
using a different, randomly generated seed. Results from the seed that
yielded the highest likelihood are reported.
To estimate changes in populations sizes for each one of the
biogeographic regions of the Arabian Peninsula, we also implemented an
analysis with MSMC2 program (33 ). Populations tested included
Duba, Al Kharrar and Farasan Banks 2 from the Red Sea; Salalah from the
Arabian Sea; Shinas from the Sea of Oman; and Ras Ghurab and Dammam from
the PAG. Two individuals per population with the lowest rates of missing
data were chosen for the analysis. We used the mpileup tool from
samtools (70 ) to generate a vcf dataset with both variant
and invariant sites from the bam files. The vcf file was
statistically phased with the tool whatshap (71 ), and then split
by individuals and chromosomes. Mask files were generated with the
script ‘vcfAllSiteParser.py’, and MSMC2 input files from the MSMC-Tools
package. We then generated the input files with the script
‘generate_multihetsep.py’ and run MSMC2 for each pair of individuals.
Times and population size were calibrated with generation time = 25
years and µ = 3.265x10-8 bp/generation.