Introduction
Identifying the factors that are most likely to explain variation in a
response variable or phenomenon is a primary goal of science, including
in the fields of ecology and evolution. Examples include investigating
the relative importance of abiotic and biotic factors for species
distribution and coexistence (Gravel et al., 2011; Harisena et al.,
2021) or identifying best practices for conservation success (Kapos et
al., 2008). However, the relative importance of different predictors in
determining a response variable is often poorly understood because, for
instance, species distributions rarely respond to a single predictor
and many environmental factors are correlated. Further, in a complex
world, not all important variables are known or can be anticipated, and
often only proxy data are available (e.g., economic activity and human
population used as proxies of invasion propagule pressure and
environmental disturbance), particularly if analyses cover large
spatial and temporal extents (e.g., global phenomena).
There is no universal definition of variable or relative importance
(Achen, 1982; Budescu, 1993; Grömping, 2015; Wei et al., 2015), but it
can be viewed as the proportion of the variation of the response
variable potentially explained directly and indirectly by a given
predictor (Johnson & LeBreton, 2004). Traditionally, multiple linear
regression, its univariate generalizations (e.g., generalized linear and
mixed models), or multivariate equivalents (e.g., redundancy analysis)
have been used to model and identify the predictors most likely to be
important in a given data set (Weisberg, 1985; Montgomery et al.,
2006). These methods are reasonably robust for prediction (Houlahan et
al., 2017), theory testing (Johnson & LeBreton, 2004), and pattern
identification (Grömping, 2009; Nathans et al., 2012). Partial
regression coefficients or their standardized version (i.e., beta
weights) are probably the most widely used measures to draw conclusions
about the relative importance of predictors in ecology and evolution
(Table S1) (Darlington & Hayes, 2017). Beta weights are the change (in
standard deviation units) in the response variable associated with a
change by one standard deviation in a given predictor if other
predictors are held constant (Weisberg, 1985; Montgomery et al., 2006).
When predictors are completely uncorrelated, zero-order correlations and
standardized regression coefficients are equivalent, so the relative
importance of each predictor can be expressed as the proportion of
predictable variance for which it accounts.
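The two-predictor case makes this equivalence explicit. The following is a sketch using the standard formulas for a regression on two standardized predictors; the symbols r_{y1}, r_{y2} (zero-order correlations of each predictor with the response) and r_{12} (correlation between the predictors) are notation assumed here for illustration and do not come from the analyses in this study.

```latex
\[
\beta_1 = \frac{r_{y1} - r_{y2}\, r_{12}}{1 - r_{12}^{2}}, \qquad
\beta_2 = \frac{r_{y2} - r_{y1}\, r_{12}}{1 - r_{12}^{2}}, \qquad
R^{2} = \beta_1 r_{y1} + \beta_2 r_{y2}
\]
\[
r_{12} = 0 \;\Longrightarrow\; \beta_j = r_{yj}, \qquad
R^{2} = r_{y1}^{2} + r_{y2}^{2}, \qquad
\frac{\beta_j^{2}}{R^{2}} = \text{share of explained variance attributed to } x_j
\]
```

Only when r_{12} = 0 do the squared beta weights sum to R^2, which is what allows them to be read as shares of the predictable variance.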
Beta weights are inadequate indices of variable importance when
predictors are correlated, especially when correlations are high
(Hoffman, 1960; Budescu, 1993; Green & Tull, 1974; Grömping, 2015; Wei
et al., 2015), which typifies empirical studies because many potential
predictors covary (Azen & Budescu, 2003). When two or more predictors
in a model are not statistically independent of one another, the
reliability of beta weights is lessened (Farrar & Glauber, 1967).
Although there are many rules of thumb for addressing multicollinearity
(e.g., excluding variables that have |r| ≥ 0.70 with one or more other
variables), the more variables a model includes, the greater the
potential for multicollinearity or association among variables (Strobl
et al., 2008), and the more problematic it becomes to identify the
variables that are actually influential. Moreover, beta
weights depend not only on the effect size of the predictor but also on
its variance (and thus the sampling range or design), among other
factors (Greenland et al., 1991). The characterization of variable
importance becomes a challenging task for which the outputs from linear
regression models are not well suited (Grömping, 2007). Although
reliance on beta weights is a common practice in many fields including
ecology and evolution (Table S1), beta weights provide limited
information (Nathans et al., 2012). A hypothetical example (obtained
with the script in Fig. S1 and similar to others in the statistical
literature) that illustrates these points is presented in Fig. 1.
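The sensitivity of beta weights to predictor correlation can also be seen in a minimal numerical sketch. This is not the script in Fig. S1 and does not reproduce Fig. 1. It assumes a regression on two standardized predictors, and the function name `beta_weights` and all correlation values are hypothetical choices made for illustration.

```python
def beta_weights(r_y1, r_y2, r_12):
    """Beta weights and R^2 for y ~ x1 + x2 with standardized variables.

    r_y1, r_y2: zero-order correlations of each predictor with the response.
    r_12: correlation between the two predictors (must satisfy |r_12| < 1).
    """
    denom = 1.0 - r_12 ** 2
    beta_1 = (r_y1 - r_y2 * r_12) / denom
    beta_2 = (r_y2 - r_y1 * r_12) / denom
    r_squared = beta_1 * r_y1 + beta_2 * r_y2
    return beta_1, beta_2, r_squared


# Hypothetical values: the correlations with the response stay fixed and
# only the correlation between the two predictors changes.
R_Y1, R_Y2 = 0.6, 0.3
for r_12 in (0.0, 0.5, 0.8):
    b1, b2, r2 = beta_weights(R_Y1, R_Y2, r_12)
    print(f"r_12={r_12:.1f}  beta_1={b1:+.2f}  beta_2={b2:+.2f}  R2={r2:.2f}")
```

With these assumed values, the beta weight of the second predictor equals its zero-order correlation when the predictors are uncorrelated, shrinks to zero at r_12 = 0.5, and becomes negative at r_12 = 0.8, although its correlation with the response is 0.3 throughout.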
Many methods have been developed in the last few decades to attempt to
quantify the relative importance of intercorrelated predictors
(Liakhovitski et al., 2010; see reviews in Nathans et al., 2012; Johnson
& LeBreton, 2004; Grömping, 2015), and advances in computing power
increasingly allow these methods to be used even with large data sets.
Examples include variance decomposition, variable transformation, neural
networks, and machine learning (Grömping, 2015; Wei et al., 2015).
Hierarchical partitioning (HP) (Chevan & Sutherland, 1991) is a method
of variance decomposition based on measuring goodness of fit for all
model subsets and measuring the independent contribution of each
predictor (Mac Nally, 2002; Lai et al., 2022). The random forests
approach (RF) is one of the best machine-learning algorithms for
variable-importance assessment, and RF can be interpreted as a variance
decomposition method in a broad sense (Grömping, 2015). RF can identify
nonlinear relationships between the dependent and predictor variables,
handle large numbers of variables with relatively small numbers of
observations (Strobl et al., 2008), and can be effective in identifying
relevant variables in high-dimensional problems with highly correlated
and interacting predictors, such as those commonly encountered in nature
(Liu et al., 2021).
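The logic of hierarchical partitioning described above can be sketched in a few lines. The sketch assumes that a goodness-of-fit value (here R^2) is already available for every subset of predictors, and it computes each independent contribution as the average gain in fit over all orders of entry. The function name `independent_contributions`, the predictor names, and the R^2 values are hypothetical; published implementations may differ in detail.

```python
from itertools import combinations
from math import factorial


def independent_contributions(predictors, fit):
    """Average each predictor's gain in fit over all orders of entry.

    predictors: list of predictor names.
    fit: dict mapping a frozenset of predictor names to the goodness of fit
         (e.g., R^2) of the model with exactly those predictors. It must
         contain an entry for every subset, including the empty set.
    """
    p = len(predictors)
    result = {}
    for target in predictors:
        others = [name for name in predictors if name != target]
        total = 0.0
        for size in range(p):
            # Fraction of the p! entry orders in which one specific subset
            # of this size enters the model just before the target.
            weight = factorial(size) * factorial(p - 1 - size) / factorial(p)
            for subset in combinations(others, size):
                base = frozenset(subset)
                total += weight * (fit[base | {target}] - fit[base])
        result[target] = total
    return result


# Hypothetical R^2 values for all subsets of two correlated predictors.
fit = {
    frozenset(): 0.0,
    frozenset({"x1"}): 0.36,
    frozenset({"x2"}): 0.09,
    frozenset({"x1", "x2"}): 0.36,
}
independent = independent_contributions(["x1", "x2"], fit)
# Joint contribution: fit of the single-predictor model minus the
# independent contribution of that predictor.
joint = {name: fit[frozenset({name})] - independent[name] for name in independent}
print(independent, joint)
```

In this assumed example, adding x2 to a model that already contains x1 does not improve the fit, yet x2 still receives a small independent contribution because it explains variance when it enters first. The independent contributions sum to the fit of the full model. Because the number of subsets grows as 2^p, this exhaustive approach is practical only for a modest number of predictors.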
The goal of our study is to compare three statistical approaches that
may provide contrasting, complementary perspectives on the variable
importance of ecological predictors. Among the many variable importance
methods available (Wei et al., 2015), we selected RF as a
machine-learning method and HP as an example of a variance
decomposition method because both have been widely used across
scientific fields and have demonstrated high prediction accuracy.
Together, these methods also address two important properties of the
data: RF allows for nonlinear relationships, and both methods can
separate the independent variance explained by each predictor from the
conjointly explained variance. Using the important case of the
worldwide distribution of alien species richness and its predictors, we
compared the results of these two measures of variable importance (HP
and RF) with those of the standard regression measure generally used in
ecology and evolution, namely, beta weights (Table S1). We also aimed to
identify issues that are likely to apply to many other biological and
scientific questions.