Candidate gene identification with genotype-environment
association analysis
We used genotype-environment association (GEA) analysis to identify
candidate genes evolving under specific environmental pressures, and to
estimate their contribution to patterns of local adaptation. Here, we
applied redundancy analysis (RDA), a canonical ordination method that
combines regression and principal components analysis enabling an
estimate of the variance of a set of response variables explained by
multiple constraining or explanatory variables (72 ). In RDA
applied to the GEA studies, explanatory variables are usually
environmental parameters. Response variables correspond to individual
genotypes or to population allele frequencies at each variant position.
We chose to analyze population frequencies over individual genotypes to
reduce computation time and resources. We used georeferenced
environmental data extracted for the coordinates corresponding to the
sampling sites and averaged over populations. An initial dataset
specifically selected according to their relevance to mangrove ecology
(73, 74 ) included seven marine parameters from Bio-ORACLE2
(75, 76 ) and MARSPEC (77 ) databases; and six terrestrial
parameters from WorldClim2 (78, 79 ) (Table S7). To control for
correlation among explanatory variables, pairwise correlation
coefficients were computed and averaged for each one of the variables,
and five variables exceeding a cutoff of 0.75 were removed from the
dataset. Based on variance inflation in our RDA model, we further
filtered four more variables for a final set of explanatory variables
that included maximum of sea surface salinity averaged over months
(MS_biogeo10_sss_max_5m, MARSPEC); maximum air temperature of the
warmest month, minimum air temperature of the coldest month and annual
precipitation (BIO5, BIO6 and BIO12, WorldClim). After excluding the
samples of the Australian outgroup population from the ‘Full Dataset’,
the HWE and MAF previously used were applied. Positions with less than
75% of individuals genotyped for each taxon/population were removed for
a final dataset of 2,488,560 SNPs. Because RDA does not allow missing
data, we replaced remaining non-genotyped positions with the major
frequency allele. This approach may hinder genotype-environment signals,
resulting in a higher false-negative rate, but allows us to screen a
larger fraction of the genome.
To identify functional genes under directional selection linked to one
or more of the tested ecological parameters, we searched for loci
showing the highest contribution to the correlation patterns obtained in
the redundancy analysis using the distribution of the loadings of the
SNPs in the ordination space. SNPs loading at the center of the
distribution do not have a relationship with the environmental
explanatory variables. In turn, SNPs loading in the tails are more
likely to be under selection as a function of one or more environmental
predictors. Probability density functions were computed for the loading
distributions of RDA axes, and two-tailed p-values were estimated for
each SNP using on the area under the curve (AUC). We then applied a
significance threshold α = 10-4 to detect loci
putatively under selection. Finally, we used the annotation of our
reference genome (58 ) to survey our set of candidate loci and
identify those located within functionally annotated genes. To visualize
genomic regions of high differential association across our multiple
environmental predictors, per SNP lowest p-values of each one of the
four RDA axes was plotted in a Manhattan plot. In addition, to explore
differential association patterns of candidate genes at the individual
level, we conducted a second RDA using the same environmental predictors
and as response variables, the individual genotypes of the SNP outliers.
The first two RDA axes were plotted for visual inspection, as well as
pairwise Euclidean distances in the ordination space for all the
populations included in the analysis.