2.2. SDM development
Models were performed using data of species presence points and
explanatory variables. Two sets of variables were considered; climatic
and vegetation. For climatic variables, we focused on four primary
variables describing annual mean and variability of climatic conditions
including mean annual temperature (meananultmp), annual precipitation
(anulprc), temperature seasonality (tmpseas), and precipitation
seasonality (precseas) all downloaded from WorldClim dataset (Hijmans et
al., 2005). For the vegetation variable, we used the enhanced vegetation
index (EVI) of the MODIS products (MOD13A3), and adopted the same
variability in climatic variables for vegetation, e.i., mean annual EVI
and EVI seasonality. To do so, we downloaded monthly-provided
1-kilometer-resolution MOD13A3 for 2015 from EarthExplorer dataset
(https://earthexplorer.usgs.gov), extracted EVI bands in ENVI
version 5.1, mosaicked them to cover the entire study area in one scene,
and calculated the annual mean and standard deviation of the 12
monthly-EVI rasters. We used EVI instead of commonly-used NDVI, because
of its potential to minimize canopy background variations and maintain
sensitivity over dense vegetation conditions (Jiang et al., 2008). The
EVI also copes better with residual atmosphere contamination caused by
smoke and sub-pixel thin clouds. Before SDM analysis, using variance
inflation factor (VIF) in ‘usdm’ package (Naimi, 2015), we checked the
multicollinearity of variables and found no variable with VIF greater
than 6.
We focused on four SDM methods, GLM, GBM, RF, and MaxEnt, because of
their prevalence, well-performance, and approval over other methods
(Elith and Graham, 2009; Phillips et al., 2006; Shabani et al., 2016),
all implemented in R environment v 3.5.2. We first splitted MRC
occurrence points to training and test data and followed a
crossvalidation scheme on the training datast to fit the models. This
cross-validated scheme of training data was kept constant for tuning the
preliminary models and running the final version of the models. GLM, GBM
and RF were tuned up using the ‘caret’ package (Kuhn, 2021) with
considering different model-specific parameters and the best-fitted
model across the then folds was identified according to their ROC
scores. The fine-tuned model with the highest accuracy was then used to
generate the habitat suitability mpas and to predict the test dataset.
The GLM was performed using simple and quadratic terms of explanatory
variables and the model selection was based on a stepwise AIC selection
procedure. GBM was fitted with allowing the maximum number of trees up
to 2000, with three learning rates (i.e., shrinkage; 0.001, 0.01, 0.1),
three interaction depths (i.e., complexity of the tree, maximum nodes
per tree; 1, 3, and 5), and three values for subsampling fraction (i.e.,
bag fraction; 0.5, 0.65, and 0.8). The RF model was fitted with number
of trees (ntrees) 500 and 1000, number of variables randomly selected at
each split (i.e., mtry) 1 to 5, and node size 1 and 5. The MaxEnt model
was tuned up using the package ENMeval (Muscarella et al., 2014) with
allowing five combination of feature types (fc = L, LQ, LQH, LQHP, and
LQHPT) and regularization multiplier (rm) of 0.5, 1, 1.5, and 2. The
best fitted parameters for each model were then used to predict to the
environmental layers and to generate the corresponding habitat
suitability maps. Again the generation of habitat suitability maps was
carried on given the constant 10-fold crossvalidation of the occurrence
points, meaning that 10 habitat suitability map was predicted for each
model. The final ensemble habitat suitability map of the four SDM
algorithem was generated based on a proportionaly weighted average of
the obtained AUC score of the then repetition.
To address the purpose of our study, background data were selected in
two different ways including random and background weighting. For random
we selected 10000 background points spatially at random leaving cells
with MRC occurrence points. To create background weighting data by
generating a weighting surface we gave prominence to those areas having
less geographical proximity to others. Following Elith et al. (2010) we
first generated a density raster map from the occurrence points and then
allocated 10000 background points regarding its probability distribution
(Fig. 1). This method copes with the bias caused by the spatially
imbalanced-biased data in a way that favors occurrence points of
severely sampled areas over those of sparsely sampled areas (Shabani et
al., 2019). Of the 82 occurrence points of MRC, we used 12 newly sampled
records as out-of-bag data to test models’ performance. We used the area
under the curve (AUC) of the receiver operating characteristic (ROC)
plot to assess the discrimination capacity of models. AUC combines
specificity and sensitivity (Fielding and Bell, 1997), thus, neglects
the relative costs of errors of omission and commission
(Jiménez‐Valverde, 2012). Therefore, we also computed true statistic
skill (TSS) as a threshold-dependent measure of classification accuracy
calculated as sensitivity + specificity – 1. We used the package
‘PresenceAbsence’ to evaluate the performance of the models and the
threshold ‘10 percintile of training suitability’ was set to calculate
the threshold-dependent measures. In Addition to these
traditionally-used metrics which give an absolute measure of the model
performance, we plotted sensitivity and specificity of the models
against an ascending gradient of 100 thresholds to obtain more
informative inferences on the models predictive performance.