Random forests (RF)
The goal of RF is to create a set of classification rules (tree
branches) from the predictors included in a training dataset (70% of the
total dataset for each taxonomic group) (Breiman, 2001). The
classification rules are built by recursive binary partitioning of the
training data: each split divides a node into two child nodes, defining
two regions that are increasingly homogeneous with respect to the
classification variable (Cutler et al., 2007). Each split is selected
according to the Gini index, which measures the quality of the split's
contribution to the classification (Therneau & Atkinson, 1997). Data samples are generated
by a bootstrap technique (Cutler et al., 2007). Individuals (in our case
specific TDWG level 4 regions) present in a bootstrap sample are
referred to as ‘in-bag’ data, whereas the remaining individuals form the
‘out-of-bag’ (OOB) data. An unpruned classification tree is grown from
each bootstrap sample. We first tuned the RF to optimize its predictive
performance using the iterative random forest (iRF) algorithm (Basu et
al., 2018), which searches for the optimal number of variables to sample
at random as candidates at each split. We then applied RF to each
taxonomic group, using 10,000 trees and the optimal number of
variables (‘mtry’ parameter) found with the iRF algorithm. The standard
variable importance in RF measures the impact of a predictor on
the response variable without taking any other predictors
into account, that is, its marginal effect (here referred to as
‘unconditional variable importance’). In contrast, conditional
permutation importance (here referred to as ‘conditional variable
importance’) quantifies the impact of a predictor after controlling for the
other predictors in the model (Debeer & Strobl, 2020). The threshold
value of the conditional permutation importance determines the degree to
which the permutation is made conditional and was set at the standard
value of 0.95, following Debeer & Strobl (2020). RF analyses were
performed in R (R Core Team, 2023) with the packages tuneRF for tuning
the RF (Basu et al., 2018), randomForest (Liaw & Wiener, 2002), and
permimp for conditional variable importance (Debeer & Strobl, 2020).
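
The two building blocks described above, Gini-based split selection and the bootstrap partition into in-bag and OOB data, can be illustrated with a minimal sketch. It is written in plain Python (standard library only) for illustration and is not the R code used in this study. The function names (gini_impurity, best_split, bootstrap_split) and the toy data are hypothetical, and the sketch assumes the common definition of Gini impurity as one minus the sum of squared class proportions.

```python
import random
from collections import Counter


def gini_impurity(labels):
    """Gini impurity, 1 - sum(p_k^2), of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best binary split on one predictor.

    Starts from the unsplit node, so threshold is None if no split
    lowers the impurity.
    """
    n = len(values)
    best = (None, gini_impurity(labels))
    # Candidate thresholds: every distinct value except the largest.
    for threshold in sorted(set(values))[:-1]:
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        weighted = (
            len(left) * gini_impurity(left) + len(right) * gini_impurity(right)
        ) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


def bootstrap_split(n_rows, rng):
    """Sample n_rows indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, out_of_bag


# Hypothetical toy data: one predictor and a binary class for six regions.
predictor = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
classes = ["low", "low", "low", "high", "high", "high"]

# The split at 3.0 separates the two classes perfectly (weighted impurity 0.0).
threshold, impurity = best_split(predictor, classes)

# Rows never drawn into the bootstrap sample form the OOB set for that tree.
in_bag, out_of_bag = bootstrap_split(len(predictor), random.Random(42))
```

In a full RF, this split search would be repeated at every node over a random subset of ‘mtry’ predictors, and each tree would be grown from its own bootstrap sample. The sketch evaluates a single predictor and a single sample only.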