Candidate gene identification with genotype-environment association analysis
We used genotype-environment association (GEA) analysis to identify candidate genes evolving under specific environmental pressures, and to estimate their contribution to patterns of local adaptation. Here, we applied redundancy analysis (RDA), a canonical ordination method that combines regression and principal components analysis enabling an estimate of the variance of a set of response variables explained by multiple constraining or explanatory variables (72 ). In RDA applied to the GEA studies, explanatory variables are usually environmental parameters. Response variables correspond to individual genotypes or to population allele frequencies at each variant position. We chose to analyze population frequencies over individual genotypes to reduce computation time and resources. We used georeferenced environmental data extracted for the coordinates corresponding to the sampling sites and averaged over populations. An initial dataset specifically selected according to their relevance to mangrove ecology (73, 74 ) included seven marine parameters from Bio-ORACLE2 (75, 76 ) and MARSPEC (77 ) databases; and six terrestrial parameters from WorldClim2 (78, 79 ) (Table S7). To control for correlation among explanatory variables, pairwise correlation coefficients were computed and averaged for each one of the variables, and five variables exceeding a cutoff of 0.75 were removed from the dataset. Based on variance inflation in our RDA model, we further filtered four more variables for a final set of explanatory variables that included maximum of sea surface salinity averaged over months (MS_biogeo10_sss_max_5m, MARSPEC); maximum air temperature of the warmest month, minimum air temperature of the coldest month and annual precipitation (BIO5, BIO6 and BIO12, WorldClim). After excluding the samples of the Australian outgroup population from the ‘Full Dataset’, the HWE and MAF previously used were applied. Positions with less than 75% of individuals genotyped for each taxon/population were removed for a final dataset of 2,488,560 SNPs. Because RDA does not allow missing data, we replaced remaining non-genotyped positions with the major frequency allele. This approach may hinder genotype-environment signals, resulting in a higher false-negative rate, but allows us to screen a larger fraction of the genome.
To identify functional genes under directional selection linked to one or more of the tested ecological parameters, we searched for loci showing the highest contribution to the correlation patterns obtained in the redundancy analysis using the distribution of the loadings of the SNPs in the ordination space. SNPs loading at the center of the distribution do not have a relationship with the environmental explanatory variables. In turn, SNPs loading in the tails are more likely to be under selection as a function of one or more environmental predictors. Probability density functions were computed for the loading distributions of RDA axes, and two-tailed p-values were estimated for each SNP using on the area under the curve (AUC). We then applied a significance threshold α = 10-4 to detect loci putatively under selection. Finally, we used the annotation of our reference genome (58 ) to survey our set of candidate loci and identify those located within functionally annotated genes. To visualize genomic regions of high differential association across our multiple environmental predictors, per SNP lowest p-values of each one of the four RDA axes was plotted in a Manhattan plot. In addition, to explore differential association patterns of candidate genes at the individual level, we conducted a second RDA using the same environmental predictors and as response variables, the individual genotypes of the SNP outliers. The first two RDA axes were plotted for visual inspection, as well as pairwise Euclidean distances in the ordination space for all the populations included in the analysis.