Abstract:
Ecologists often rely on observational data to understand causal
relationships. Although observational causal inference methodologies
exist, model selection based on information criteria (e.g., AIC)
remains a common approach used to understand ecological relationships.
However, such approaches are meant for predictive inference and are not
appropriate for drawing causal conclusions. Here, we highlight the
distinction between predictive and causal inference and show how model
selection techniques can lead to biased causal estimates. Instead, we
encourage ecologists to apply the backdoor criterion, a graphical rule
for identifying which variables must (and must not) be adjusted for when
estimating causal effects from observational data.
As ecologists, we are often interested in answering causal questions
about human impacts on the natural world, such as the effect of
climate-induced bleaching events on coral reef ecosystems (e.g., Graham
et al. 2015), the impact of deforestation on biodiversity (e.g., Brook
et al. 2003), or the effect of conservation and management responses on
restoring ecosystem services (e.g., Sala et al. 2018). Often, randomized
controlled experiments are unfeasible, and ecologists instead rely on
observational data to answer fundamental causal questions in ecology
(MacNeil, 2008). Recently, new advances in technology such as
remote-sensing and animal-borne sensors, as well as increased
availability of citizen science and electronic data have further
increased opportunities to answer causal questions from observational
data (Sagarin and Pauchard 2010).
In recent years, researchers have advocated for the increased
application of causal inference in ecology for inferring cause-and-effect
relationships from observational data (e.g., Larsen et al. 2019;
Laubach et al. 2021), but these approaches have yet to be widely adopted.
Instead, drawing causal conclusions from observational data is typically
taboo, with Pearson’s oft-cited “correlation doesn’t equal causation”
used to block attempts to do so (Glymour 2009). This misconception –
that causality cannot be inferred using observational data – has
resulted in a culture where ecologists dependent on observational data
for understanding causal relationships avoid explicitly acknowledging
the causal goal of research projects and instead use coded language that
implies causality without explicitly saying so (Hernan 2018; Arif et al.
2021).
A common strategy used to understand ecological relationships is to
apply model selection, using information metrics such as Akaike’s
information criterion (AIC; Akaike 1973). Such approaches select the
‘best’ model among a candidate set and subsequently make inferences from
parameters that are of ecological interest within the winning model.
Often, these inferences are couched in causal language, implying that
once the 'best' model has been selected, its parameters can be
interpreted causally (Table 1). However, model selection is not a valid
method for inferring causal relationships – rather, these techniques
aim to select the best model for predicting a response variable of
interest. For example, AIC approximates a model's out-of-sample
predictive accuracy using only within-sample data (Akaike 1973).
Although numerous model selection criteria exist (e.g., BIC, Schwarz
1978; DIC, Spiegelhalter et al. 2002; WAIC, Watanabe 2013; LOO-CV,
Vehtari et al. 2017), they are all used to compare models based on
predictive accuracy (McElreath 2020; Laubach et al. 2021; Tredennick et
al. 2021). Thus, model selection is appropriate for predictive inference
(i.e., which model best predicts Y?), which is fundamentally distinct
from causal inference (i.e., what is the effect of X on Y?).
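To make concrete what these criteria measure, the sketch below (Python, with hypothetical data) uses the standard Gaussian-OLS form of AIC, n·ln(RSS/n) + 2k up to a constant shared by all models, to compare a model containing the true predictor against an intercept-only model:

```python
import numpy as np

def gaussian_aic(y, *covariates):
    """AIC for an OLS fit with Gaussian errors: n*ln(RSS/n) + 2k,
    where k counts the regression coefficients plus the error
    variance (constants shared by all candidate models are dropped)."""
    X = np.column_stack([np.ones(len(y)), *covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    k = X.shape[1] + 1
    return len(y) * np.log(rss / len(y)) + 2 * k

# Toy data (hypothetical): y depends on x
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = 2.0 * x + rng.normal(size=500)

aic_with_x = gaussian_aic(y, x)
aic_null = gaussian_aic(y)   # intercept-only model
```

Lower AIC indicates better estimated out-of-sample predictive accuracy; nothing in this computation references the causal structure that generated the data.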
To demonstrate this distinction, the directed acyclic graph (DAG) in
Figure 1 shows the causal structure of a hypothetical ecological system.
DAGs can be used to visualize causal relationships, where variables
(nodes) are connected to each other via directed arrows, pointing from
cause to effect (Elwert 2013). For example, forestry affects species Y
both directly (there is a directed arrow between them) and indirectly,
via the directed arrow from forestry to species A and from species A to
species Y (Fig. 1). To illustrate the difference between model selection
and causal inference we created a simulated dataset that matches the
linear causal structure of this DAG, setting the total (i.e., direct and
indirect) causal effect of forestry on species Y to -0.75 (Appendix S1).
We further specified candidate linear regression models that included
all possible covariate combinations, with species Y as the response. Using
our simulated data and our candidate models, both AIC and BIC selected a
‘best’ model where forestry, species A, human gravity, climate, and
invasive species Z were included as covariates (Appendix S1). However,
interpreting the coefficients of this model can yield biased causal
estimates. For example, the effect of forestry on species Y is estimated
as -0.36 [-0.38, -0.33], instead of -0.75 (Appendix S1).
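The flavour of this result can be reproduced with a small simulation. In the Python sketch below, the structural coefficients (and the assumption that human gravity drives forestry) are ours for illustration, not the values used in Appendix S1; they are chosen so the total effect of forestry on species Y is -0.75. Fitting a regression with a covariate set analogous to the AIC/BIC-selected model then returns a coefficient far from that total effect:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# Illustrative coefficients: total effect of forestry on species Y
#   = direct (-0.30) + indirect via species A (0.9 * -0.5) = -0.75
gravity = rng.normal(size=n)                       # human gravity
climate = rng.normal(size=n)
forestry = 0.5 * gravity + rng.normal(size=n)
species_a = 0.9 * forestry + 0.4 * climate + rng.normal(size=n)
species_y = (-0.30 * forestry - 0.5 * species_a + 0.3 * gravity
             + 0.2 * climate + rng.normal(size=n))
# Invasive species Z is a collider: caused by both forestry and species Y
species_z = 0.6 * forestry + 0.6 * species_y + rng.normal(size=n)

def forestry_coef(y, *covariates):
    """OLS coefficient on the first covariate (after an intercept)."""
    X = np.column_stack([np.ones(len(y)), *covariates])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# Conditioning on the mediator (species A) and the collider (species Z),
# as in the selected model, biases the coefficient away from -0.75
biased = forestry_coef(species_y, forestry, species_a, gravity,
                       climate, species_z)
print(f"{biased:.2f}")
```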
In this scenario, there are two statistical biases at play. The first is
overcontrol bias, which occurs when the inclusion of intermediate
variables along a causal pathway removes the indirect causal effect
between predictor and response (Cinelli et al. 2021). Here, the
inclusion of the intermediate variable species A removes the indirect
effect between forestry and species Y. Second, the inclusion of invasive
species Z as a covariate leads to collider bias, which can result from
adjusting for a variable that is caused by both predictor and response
(Cinelli et al. 2021). Here, the inclusion of invasive species Z induces
an additional, but non-causal, association between forestry and species
Y.
It is worth noting that although the true predictive model (i.e., the
data-generating model for species Y, where all direct predictor
variables – human gravity, species A, forestry, and climate were
included as covariates) was included as a candidate model, both AIC and
BIC selected a more complex model with invasive species Z as a
covariate. Here, even though invasive species Z is not a predictor
variable for species Y, its statistical (non-causal) association with
species Y improved within-sample fit, resulting in better estimated
out-of-sample predictive accuracy. Indeed, non-causal associations,
including those arising from collider bias and reverse causation, have
been shown to increase predictive accuracy (e.g., Luque-Fernandez et al.
2019; Griffith et al. 2020). Thus, a model selected based on predictive
accuracy should not be assumed to be causally accurate.
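This effect is easy to demonstrate. In the minimal sketch below (Python; the structure and coefficients are hypothetical), a model that adds a collider z, a variable caused by the response and thus not part of the data-generating process for y, nevertheless receives a lower (better) AIC than the data-generating model:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5_000

# Hypothetical structure: x causes y; z is a collider caused by both
x = rng.normal(size=n)
y = -0.75 * x + rng.normal(size=n)
z = 0.6 * x + 0.6 * y + rng.normal(size=n)

def gaussian_aic(y, *covariates):
    """AIC for an OLS fit with Gaussian errors, up to a shared constant."""
    X = np.column_stack([np.ones(len(y)), *covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    k = X.shape[1] + 1              # coefficients plus error variance
    return len(y) * np.log(rss / len(y)) + 2 * k

aic_true_model = gaussian_aic(y, x)        # the data-generating model
aic_with_collider = gaussian_aic(y, x, z)  # adds the non-causal covariate
```

Because z carries information about y's noise, it improves prediction, so AIC prefers the collider model even though z has no causal effect on y.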
A more subtle point is that even if a model captures the data generating
process for a response variable of interest, it may not be appropriate
for answering specific causal queries. For example, if we want to know
the total effect of forestry on species Y, a model with all direct
predictor variables – human gravity, species A, forestry, and climate
– included as covariates, returns a causal estimate of -0.21 [-0.23,
-0.18] instead of -0.75 (Appendix S1). Here, the inclusion of species
A as a covariate leads to overcontrol bias, removing the indirect effect
of forestry on species Y. Moreover, this model cannot be used to
determine the causal estimates of other distal drivers, such as climate
or fire. Ultimately, causal models must be built based on the specific
causal question at hand, as well as through the careful consideration of
the overall causal structure, including how different predictor
variables may be related to one another.
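How the choice of adjustment set plays out can be sketched in Python. The coefficients below, and the assumption that human gravity is the only backdoor variable for forestry, are ours for illustration (not the Appendix S1 values): conditioning on the mediator species A returns only the direct effect, while adjusting for human gravity alone recovers the total effect of -0.75:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Illustrative coefficients: total effect of forestry on species Y
#   = direct (-0.30) + indirect via species A (0.9 * -0.5) = -0.75
gravity = rng.normal(size=n)                       # human gravity
climate = rng.normal(size=n)
forestry = 0.5 * gravity + rng.normal(size=n)
species_a = 0.9 * forestry + 0.4 * climate + rng.normal(size=n)
species_y = (-0.30 * forestry - 0.5 * species_a + 0.3 * gravity
             + 0.2 * climate + rng.normal(size=n))

def forestry_coef(y, *covariates):
    """OLS coefficient on the first covariate (after an intercept)."""
    X = np.column_stack([np.ones(len(y)), *covariates])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# The data-generating ('true predictive') model conditions on the
# mediator species A, blocking the indirect path: only the direct
# effect of forestry remains
direct_only = forestry_coef(species_y, forestry, species_a,
                            gravity, climate)

# Backdoor adjustment for the total effect: adjust only for human
# gravity, blocking the path forestry <- gravity -> species Y
total = forestry_coef(species_y, forestry, gravity)

print(f"direct only: {direct_only:.2f}, total: {total:.2f}")
```

The right model thus depends on the causal query: the same system needs different adjustment sets for the direct and the total effect.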