Random forests (RF)
The goal of RF is to create a set of classification rules (tree branches) from the predictors included in a training dataset (70% of the total dataset for each taxonomic group) (Breiman, 2001). The classification rules are built by recursive binary partitioning of the training data: each node is split into two child nodes, yielding regions that are increasingly homogeneous with respect to the classification variable (Cutler et al., 2007). Each split is selected according to the Gini index, which measures the quality of its contribution to the classification (Therneau & Atkinson, 1997). Data samples are generated by bootstrapping (Cutler et al., 2007). Individuals (in our case, specific TDWG level 4 regions) present in a bootstrap sample are referred to as ‘in-bag’ data, whereas the remaining individuals form the ‘out-of-bag’ (OOB) data. RF builds an unpruned classification tree from each bootstrap sample. We first tuned the RF to optimize its performance and enhance its predictive capabilities using the iterative random forest (iRF) algorithm (Basu et al., 2018). iRF searches for the optimal number of variables to sample randomly as candidates at each split (Basu et al., 2018). We then applied RF to each taxonomic group with 10,000 trees and the optimal number of variables (‘mtry’ parameter) found with the iRF algorithm. The standard variable importance in RF measures the impact of a predictor on the response variable without taking any other predictors into account, that is, its marginal effect (here referred to as ‘unconditional variable importance’), whereas conditional permutation importance (here referred to as ‘conditional variable importance’) quantifies the impact of a predictor after controlling for the other predictors in the model (Debeer & Strobl, 2020).
The threshold value in the conditional permutation importance controls the degree to which the permutation is made conditional, and was set to the standard value of 0.95, following Debeer & Strobl (2020). RF analyses were performed in R (R Core Team, 2023) using tuneRF for tuning the RF (Basu et al., 2018), the randomForest package (Liaw & Wiener, 2002), and the permimp package for conditional variable importance (Debeer & Strobl, 2020).
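
For readers less familiar with RF, the two building blocks described above (Gini-based split selection, and bootstrap sampling into in-bag and OOB data) can be sketched in a few lines. This is an illustrative sketch only, written in Python with the standard library rather than in R, and it is not the code used in this study: the function names (gini_impurity, gini_decrease, best_split, bootstrap_split) and the toy predictor values and class labels are hypothetical, and the randomForest package performs these steps internally.

```python
import random
from collections import Counter


def gini_impurity(labels):
    """Gini impurity, 1 - sum(p_k^2), of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(values, labels, threshold):
    """Decrease in Gini impurity when splitting on `values <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted_children = (
        len(left) / n * gini_impurity(left)
        + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(labels) - weighted_children


def best_split(values, labels):
    """Return the candidate threshold with the largest Gini decrease.

    Candidates are midpoints between consecutive sorted unique values;
    the predictor is assumed to take at least two distinct values.
    """
    unique = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(unique, unique[1:])]
    return max(candidates, key=lambda t: gini_decrease(values, labels, t))


def bootstrap_split(n, rng):
    """Draw n indices with replacement; return (in_bag, out_of_bag) indices."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in drawn]
    return in_bag, out_of_bag


# Hypothetical toy data: one predictor and a two-class response for six units.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["low", "low", "low", "high", "high", "high"]

threshold = best_split(values, labels)

rng = random.Random(42)  # fixed seed so the sketch is reproducible
in_bag, oob = bootstrap_split(len(values), rng)
```

In a full RF, a split search of this kind is repeated at every node over a random subset of ‘mtry’ predictors, and each tree is grown on its own bootstrap sample, with the corresponding OOB units available for estimating prediction error.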