Genome resequencing and variant calling
Illumina paired-end 150 bp libraries with insert size equal to 350 bp were prepared and sequenced in a Novaseq platform. A total of 8.0G reads were produced resulting in a mean coverage per site and sample of 28X before filtering. Read quality was evaluated using FASTQC (55 ) after sorting reads by individual with AXE (56 ). Trimming and quality filtering treatment was conducted using Trim Galore (57 ), resulting in a set of reads ranging between 90 and 138 bp long. Reads were then mapped against the previously published reference genome for A. marina (58 ) using the mem algorithm in the Burrows-Wheeler Aligner (BWA; 59 ). Read groups were assigned and BAM files generated with Picard Tools version 1.126 (http://broadinstitute.github.io/picard). We used the HaplotypeCaller + GenotypeGVCFs tools from the Genome Analysis Toolkit (GATK; 28 ) version 4.1.8.1 to produce a set of SNPs in the variant call format (vcf ). Using vcftools (60 ), we retained biallelic SNPs excluding those out of a range of coverage between 4 and 50, or with a genotyping phred quality score below 40. Twenty-two samples presenting more than a 25% of missing data were discarded at this point. We then applied GATK generic hard-filtering recommendations consisting of QualByDepth (QD) > 2.0; FisherStrand (FS) < 60.0; RMSMappingQuality (MQ) > 40; MappingQualityRankSumTest (MQRankSum) > -12.5; ReadPosRankSum (RPRK) < -8.0; and StrandOddsRatio (SOR) > 3.0 (GATK Best Practices;61 ). The resulting dataset (hereafter referred to as ‘Full Dataset’) consisted of 178 individuals (Table 1; Table S1, Supplementary Information) and 15,702,886 SNPs with a per-individual average coverage of 16.8 and a missing data rate of 0.11. The ‘Full Dataset’ was further filtered and customized for downstream analyses (check Table S5 from the Supplementary Information for a list of genomic datasets and filters applied). The program KING v2.2.7 (62 ) was used to confirm the absence of close relatives in our set of sampled individuals. Widespread self-fertilization was also ruled out using the program RMES and 300 variant positions randomly selected (Table S6, Supporting Information).
Population structure analyses To explore genome-wide population structure in Arabian mangroves, we conducted a t-distributed stochastic neighbor embedding (t-SNE) analysis, a powerful method for analyzing high-dimensional data. After excluding the samples from Australia, SNP loci under linkage disequilibrium were filtered out from the ‘Full Dataset’ with bcftools (63 ) applying and r2 limit of 0.2 in windows of 10K bp. Using vcftools, a threshold for SNPs showing highly significant deviations from Hardy-Weinberg equilibrium (HWE) with a p-value of 10-4 was also implemented to filter out false variants arisen by the alignment of paralogous loci. Positions with less than 75% of individuals genotyped for each taxon/population were also removed from the data matrix, along with those presenting a minor allele frequency (MAF) below 0.02. To fulfil neutrality assumptions in population structure analyses, we used PCADAPT (64 ) to detect and exclude sites putatively under selection, applying a q-value threshold of 0.05 , resulting in a final data matrix of 143,900 SNPs and 170 samples. We then used the Rtsne R package (65 ) to perform the analysis with a perplexity value of 10, as recommended by the authors for our number of points, and dimensionality equal to 2. Using the same dataset as in the t-SNE analysis, we also examined patterns of population divergence in Arabian mangroves using a sparse non-negative matrix factorization method (SNMF;29 ). We ran the program five times per K value, with K ranging from 2 to 20.
We computed pairwise F ST and Nei’s genetic distance values with the hierfstat R package (66 ). For this analysis, we did not exclude the Australian outgroup population (Brisbane) from the dataset, yet applied the same filters than for the t-SNE and SNMF analyses (SNP matrix = 113,706). We also computed the observed (HO) and expected heterozygosity (HE), and the nucleotide diversity (π) for each sampling site using the population program from Stacks (67 ).
Phylogenetic analysis A maximum likelihood phylogeny was produced using the program IQ-TREE (68 ) based on the SNP dataset used for F STand Nei’s genetic distance analysis, In this case, ambiguously constant sites (position lacking homozygous representants of at least one of the two alleles) were excluded from the ‘Neutral Dataset’ (SNP matrix = 29,433) to enable ascertainment bias correction and the generalized time-reversible (GTR) model was implemented.Population and demographic history analyses We performed model comparisons under the likelihood framework developed in fastSIMCOAL2 (30 ) to estimate demographic parameters and date cladogenetic events among mangroves populations from the Red Sea and the PAG plus the Sea of Oman lineages. Analyses were run independently for each lineage, using three representative populations for each case. As calibration node, we included in the models the Australian outgroup population (Brisbane). We set the time of coalescence with the Arabian lineages to 2.7 million years ago (26 ), and assumed a conservative generation time of 25 years in the lower and upper overlapping limits of the estimates for the Arabian Peninsula and tropical areas, respectively (31, 47, 48 ). The mutation rate (µ) was set to 3.265x10-8 bp/generation (31, 69 ).
For the Red Sea lineage, the populations of Duba, Al Kharrar and Farasan Banks 2 were included in three models with different topologies: simultaneous cladogenesis of the three lineages for a scenario of fast diversification; subsequent cladogenesis; and a third scenario in which Al Kharrar would have originated by the expansion and admixture of the boreal and meridional populations. Because the Red Sea was never completely drained through the glacial cycles during the last 400,000, no assumptions were made about putative ancestral barriers to gene flow and coalescence times were allowed to vary freely (Fig. S1A, Supplementary Information).
In the case of the PAG lineage, two general hypotheses for lineage differentiation were tested: a scenario of recent differentiation following the colonization of the Gulf after the Last Glacial Maximum (LGM); and a scenario of early lineage diversification in which populations within the PAG became geographically isolated in glacial refugia. To test these two competing hypotheses, we fixed the coalescence times either to 560 generations or less (14,000 years); or to above 720 generations (18,000 years). Representative populations included in the model were Dammam from the northern basin, Ras Ghurab from the southern basin, and Shinas from the Sea of Oman. Based on the evolutionary relationships inferred in the phylogenetic analysis, three different tree topologies were modeled: a case of simultaneous cladogenesis corresponding to a fast diversification process; and two more models in which the most recent split corresponded to Dammam-Ras Ghurab; and to Ras Ghurab-Shinas, respectively. Each one of these topologies was tested under the postglacial colonization and divergence in glacial refugia scenarios, for a total of six models to be compared (Fig. S1B, Supplementary Information).
As input data we used the folded site frequency spectra (SFS) generated from resequencing data. We retained the samples corresponding to the groups of study of each of the analyses from the ‘Full Dataset’. In these analyses, SNP loci under linkage disequilibrium were filtered out applying and more relaxed r2 limit of 0.4 in windows of 10K bp. A HWE filter for SNPs with a p-value of 10-4 was implemented. Because singletons are important for estimating parameters and likelihoods, no MAF filters were applied. Positions with less than 50% of individuals genotyped for each taxon/population were removed from the data matrices. Final matrices were of 58,613 and 72,119 SNPs for the Red Sea and the PAG plus the Sea of Oman datasets, respectively. The SFS were generated with easySFS (https://github.com/isaacovercast/easySFS) maximizing the number of segregating sites as recommended by the authors (pers. comm. ). Parameters with the highest likelihood were estimated under each of the models after 50 cycles of the algorithm, with 150,000 coalescent simulations per cycle. This procedure was replicated 100 times and the set of parameters with the highest final likelihood was retained as the best point estimate. To identify the model that better fit the data, we applied the Akaike information criterion. For estimating 95% CIs of the parameters under the best model, we applied a non-parametric bootstrap procedure. Bootstrapping was carried out by splitting the SNP matrices in 1,000 SNP blocks and randomly combining them for each one of the repetitions. A total of 100 analyses were run for confidence interval estimation. Further details about fastSIMCOAL2 analyses, model sketches and parameter files are provided in the Supplementary Information.
TreeMix v1.13 (32 ) was used to model historical patterns of gene flow between mangrove populations. We ran separate analyses for the Red Sea and the PAG plus the Sea of Oman lineages. Because in this analysis there was no need to calibrate nodes, the Salalah population from the Arabian Sea was used to root the trees in both models. SNPs datasets were built applying the same filters as for the t-SNE and SNMF tests, with the exemption of linkage disequilibrium filters as it can be controlled for in the TreeMix command line. Resulting SNP matrices were 1,858,153 and 2,575,741 long for the Red Sea and the PAG plus the Sea of Oman, respectively. We ran TreeMix for 0–10 migrations, grouping SNPs in windows of 10K bp. Migration edges were plotted until 99.8% of the variance in ancestry between populations was explained by the model (32 ). The consistency of migration edges was evaluated by running TreeMix with 50 total replicates for each added migration edge number using a different, randomly generated seed. Results from the seed that yielded the highest likelihood are reported.
To estimate changes in populations sizes for each one of the biogeographic regions of the Arabian Peninsula, we also implemented an analysis with MSMC2 program (33 ). Populations tested included Duba, Al Kharrar and Farasan Banks 2 from the Red Sea; Salalah from the Arabian Sea; Shinas from the Sea of Oman; and Ras Ghurab and Dammam from the PAG. Two individuals per population with the lowest rates of missing data were chosen for the analysis. We used the mpileup tool from samtools (70 ) to generate a vcf dataset with both variant and invariant sites from the bam files. The vcf file was statistically phased with the tool whatshap (71 ), and then split by individuals and chromosomes. Mask files were generated with the script ‘vcfAllSiteParser.py’, and MSMC2 input files from the MSMC-Tools package. We then generated the input files with the script ‘generate_multihetsep.py’ and run MSMC2 for each pair of individuals. Times and population size were calibrated with generation time = 25 years and µ = 3.265x10-8 bp/generation.