Introduction
Identifying the factors that are most likely to explain variation in a response variable or phenomenon is a primary goal of science, including in the fields of ecology and evolution. Examples include investigating the relative importance of abiotic and biotic factors for species distribution and coexistence (Gravel et al., 2011; Harisena et al., 2021) or identifying best practices for conservation success (Kapos et al., 2008). However, the importance of different predictors in determining a relevant response variable is often poorly understood because, for instance, species distributions generally do not respond to a single predictor and many environmental factors are correlated. Further, in a complex world, not all variables of importance are known or can be anticipated, and often only proxy data are available (e.g., economic activity and human population used as proxies of invasion propagule pressure and environmental disturbance), particularly if analyses cover large spatial and temporal extents (e.g., global phenomena).
There is no universal definition of variable or relative importance (Achen, 1982; Budescu, 1993; Grömping, 2015; Wei et al., 2015), but it can be viewed as the proportion of the variation of the response variable potentially explained directly and indirectly by a given predictor (Johnson & LeBreton, 2004). Traditionally, multiple linear regression, its univariate generalizations (e.g., generalized linear and mixed models), or multivariate equivalents (e.g., redundancy analysis) have been used for modelling and identifying the likely most important predictors from a given data set (Weisberg, 1985; Montgomery et al., 2006). These methods are reasonably robust for prediction (Houlahan et al., 2017), theory testing (Johnson & LeBreton, 2004), and pattern identification (Grömping, 2009; Nathans et al., 2012). Partial regression coefficients or their standardized version (i.e., beta weights) are probably the most widely used measures to draw conclusions about the relative importance of predictors in ecology and evolution (Darlington & Hayes, 2017; Table S1). A beta weight is the change (in standard deviation units) in the response variable associated with a change of one standard deviation in a given predictor when the other predictors are held constant (Weisberg, 1985; Montgomery et al., 2006). When predictors are completely uncorrelated, zero-order correlations and standardized regression coefficients are equivalent, so the relative importance of each predictor can be expressed as the proportion of predictable variance for which it accounts.
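The equivalence stated above can be written out for the simplest case of two standardized predictors. The expressions below are a sketch of the standard textbook result rather than a formula taken from this study; the symbols are assumptions introduced here for illustration: r_{y1} and r_{y2} denote the zero-order correlations of each predictor with the response, and r_{12} denotes the correlation between the two predictors.

```latex
\begin{align}
\beta_1 &= \frac{r_{y1} - r_{y2}\, r_{12}}{1 - r_{12}^{2}}, &
\beta_2 &= \frac{r_{y2} - r_{y1}\, r_{12}}{1 - r_{12}^{2}}, \\
R^{2} &= \beta_1 r_{y1} + \beta_2 r_{y2}.
\end{align}
% Special case of uncorrelated predictors, r_{12} = 0:
\begin{equation}
\beta_1 = r_{y1}, \qquad \beta_2 = r_{y2}, \qquad
R^{2} = r_{y1}^{2} + r_{y2}^{2} = \beta_1^{2} + \beta_2^{2}.
\end{equation}
```

In this special case each squared beta weight is the share of the explained variance attributable to one predictor, which is why the partition is unambiguous only when r_{12} = 0.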
Beta weights are inadequate indices of variable importance when predictors are correlated, especially when correlations are high (Hoffman, 1960; Budescu, 1993; Green & Tull, 1974; Grömping, 2015; Wei et al., 2015), a situation that typifies empirical studies because many potential predictors covary (Azen & Budescu, 2003). When two or more predictors in a model are not statistically independent of one another, the reliability of beta weights is lessened (Farrar & Glauber, 1967). Although there are many rules of thumb for addressing multicollinearity (e.g., excluding variables that have |r| ≥ 0.70 with one or more other variables), the more variables included in the model, the greater the potential for multicollinearity or association among variables (Strobl et al., 2008) and the more problematic it becomes to identify the variables that are actually influential. Moreover, beta weights depend not only on the effect size of the predictor but also on its variance (and thus the sampling range or design), among other factors (Greenland et al., 1991). The characterization of variable importance thus becomes a challenging task for which the outputs from linear regression models are not well suited (Grömping, 2007). Although reliance on beta weights is a common practice in many fields including ecology and evolution (Table S1), beta weights provide limited information (Nathans et al., 2012). A hypothetical example (obtained with the script in Fig. S1 and similar to other ones in the statistical literature) that illustrates these points is presented in Fig. 1.
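The sensitivity of beta weights to correlation among predictors can also be seen in a few lines of code. The following is a minimal sketch, separate from the script in Fig. S1 and not a reproduction of Fig. 1. It assumes a two-predictor linear regression on standardized variables, and the function name `beta_weights` and the correlation values are hypothetical choices made only for illustration. The zero-order correlations with the response are held fixed while the correlation between the predictors is increased.

```python
def beta_weights(r_y1, r_y2, r_12):
    """Beta weights for a two-predictor regression on standardized variables.

    r_y1, r_y2: zero-order correlations of each predictor with the response.
    r_12: correlation between the two predictors (must satisfy |r_12| < 1).
    """
    if abs(r_12) >= 1.0:
        raise ValueError("predictors must not be perfectly correlated")
    denom = 1.0 - r_12 ** 2
    b1 = (r_y1 - r_y2 * r_12) / denom
    b2 = (r_y2 - r_y1 * r_12) / denom
    return b1, b2


# Hypothetical zero-order correlations with the response (illustrative only).
R_Y1, R_Y2 = 0.5, 0.4

# Same zero-order correlations, increasingly correlated predictors.
for r_12 in (0.0, 0.3, 0.6, 0.9):
    b1, b2 = beta_weights(R_Y1, R_Y2, r_12)
    r_squared = b1 * R_Y1 + b2 * R_Y2
    print(f"r_12={r_12:.1f}  beta_1={b1:.3f}  beta_2={b2:.3f}  R2={r_squared:.3f}")
```

With these illustrative values, the beta weights equal the zero-order correlations when the predictors are uncorrelated, whereas at r_12 = 0.9 the second beta weight becomes negative even though that predictor is positively correlated with the response.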
Many methods have been developed in the last few decades to attempt to quantify the relative importance of intercorrelated predictors (Liakhovitski et al., 2010; see reviews in Nathans et al., 2012; Johnson & LeBreton, 2004; Grömping, 2015), and progress in computing power increasingly allows these methods to be used even with large data sets. Examples include variance decomposition, variable transformation, neural networks, and machine learning (Grömping, 2015; Wei et al., 2015). Hierarchical partitioning (HP) (Chevan & Sutherland, 1991) is a method of variance decomposition based on measuring goodness of fit for all model subsets and measuring the independent contribution of each predictor (Mac Nally, 2002; Lai et al., 2022). The random forests approach (RF) is one of the best machine-learning algorithms for variable-importance assessment, and RF can be interpreted as a variance decomposition method in a broad sense (Grömping, 2015). RF can identify nonlinear relationships between the dependent and predictor variables, handle large numbers of variables with relatively small numbers of observations (Strobl et al., 2008), and be effective in identifying relevant variables in high-dimensional problems with highly correlated and interacting predictors, such as those commonly encountered in nature (Liu et al., 2021).
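The logic of hierarchical partitioning can be illustrated for the smallest possible case. The sketch below is an assumption-laden illustration, not the implementation used in this study: it considers only two standardized predictors, uses R² as the goodness-of-fit measure, and computes R² for each model subset directly from correlations. The function name `hp_two_predictors` and the correlation values are hypothetical. The independent contribution of a predictor is taken as the average increase in R² when it is added to each model that lacks it, and the joint contribution is the remainder of its single-predictor R².

```python
def hp_two_predictors(r_y1, r_y2, r_12):
    """Hierarchical partitioning of R^2 for two standardized predictors.

    Returns the independent and joint contribution of each predictor,
    using R^2 as the goodness-of-fit measure for every model subset.
    """
    if abs(r_12) >= 1.0:
        raise ValueError("predictors must not be perfectly correlated")

    # Goodness of fit for all model subsets: {}, {x1}, {x2}, {x1, x2}.
    r2_null = 0.0
    r2_x1 = r_y1 ** 2
    r2_x2 = r_y2 ** 2
    r2_full = (r_y1 ** 2 + r_y2 ** 2 - 2.0 * r_y1 * r_y2 * r_12) / (1.0 - r_12 ** 2)

    # Independent contribution: mean gain in R^2 from adding the predictor
    # to the empty model and to the model containing the other predictor.
    indep_x1 = 0.5 * ((r2_x1 - r2_null) + (r2_full - r2_x2))
    indep_x2 = 0.5 * ((r2_x2 - r2_null) + (r2_full - r2_x1))

    return {
        "r2_full": r2_full,
        "independent": (indep_x1, indep_x2),
        # Joint contribution: part of the single-predictor R^2 shared with
        # the other predictor.
        "joint": (r2_x1 - indep_x1, r2_x2 - indep_x2),
    }


# Hypothetical correlations, for illustration only.
result = hp_two_predictors(r_y1=0.5, r_y2=0.4, r_12=0.6)
print(result)
```

In this two-predictor sketch the independent contributions sum to the R² of the full model, which is the property that makes them interpretable as shares of explained variance. With more predictors the same averaging is applied over all model subsets, so the number of models to fit grows rapidly.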
The goal of our study is to compare three statistical approaches that may provide contrasting, complementary perspectives on the importance of ecological predictors. Despite the large number of variable importance methods available (Wei et al., 2015), we selected RF as a machine-learning method and HP as a variance decomposition method because both have been widely used across various scientific fields and have demonstrated high prediction accuracy. These methods also address two important properties of the data: RF allows for nonlinear relationships, and both methods can separate the variance explained independently by each predictor from the variance explained jointly. Using the important case of the worldwide distribution of alien species richness and its predictors, we compared the results of these measures of variable importance (HP and RF) with those of the standard regression measure generally used in ecology and evolution, namely, beta weights (Table S1). We also aimed to identify issues that are likely to apply to many other biological and scientific questions.