Figure 1. Cumulative distribution of NSE values over 531 catchments
taken from the US CAMELS database using a single LSTM Deep Learning
model trained over all the catchments (solid line) and separate LSTM
models trained on the individual catchments (dotted line) (after
Nearing et al., 2021).
Benchmarking and data uncertainties
There is also now increasing recognition of the way in which data and
boundary condition uncertainties might influence how well models can be
evaluated or tested as hypotheses about how catchments function (e.g.
Beven and Binley, 1992; Beven and Freer, 2001; Liu et al., 2004; Coxon
et al., 2014; Beven and Smith, 2015; McMillan et al., 2018; Beven, 2019;
Beven and Lane, 2022; Westerberg et al., 2022). Clearly, we cannot
expect any model to perform better than the quality of the data and
boundary conditions it is supplied with, or of the data that are used in
evaluation. This applies to both hydrological models and the machine
learning methods that are intended to extract the maximum amount of
information from the data. Indeed, Figure 1 suggests that averaging of
potential observation errors across many catchments might be of value
relative to training on only the data from a single catchment, even
where those catchments include a wide range of physical characteristics.
One interpretation of this is that epistemic errors in the observations
might dominate model structural errors for some catchments (see also
Beven, 2020).
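As a point of reference for this kind of comparison, the sketch below shows how per-catchment NSE values and their empirical cumulative distribution (as plotted in Figure 1) might be computed; the array of NSE values is a hypothetical placeholder rather than the CAMELS results themselves.

```python
import numpy as np

def nse(obs: np.ndarray, sim: np.ndarray) -> float:
    """Nash-Sutcliffe efficiency: 1 - sum of squared errors / variance of observations."""
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def empirical_cdf(values: np.ndarray):
    """Return sorted values and their cumulative probabilities for plotting a CDF."""
    v = np.sort(values)
    p = np.arange(1, len(v) + 1) / len(v)
    return v, p

# Hypothetical: one NSE value per catchment for a given model
nse_per_catchment = np.array([0.85, 0.72, 0.91, 0.40, 0.66])
values, probs = empirical_cdf(nse_per_catchment)
print(list(zip(values, probs)))
```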
The principle that models cannot outperform their data will also apply to both the hydrometric data and any
tracer or geochemical data (e.g. Harmel et al., 2009; Krueger et al.,
2009; Hollaway et al., 2018a). There have been a number of recent
studies using “tracer-aided” model calibrations and evaluations
(Birkel and Soulsby, 2015; Delavau et al., 2017; Smith et al., 2021;
Stevenson et al., 2021) but these have not generally considered
uncertainties in the data, and making use of such data will normally
involve the introduction of additional parameters. For some water
quality models, many more parameters might be involved (e.g.
Hollaway et al., 2018b).
A particular aspect of epistemic uncertainty in the hydrometric data
arises when the observations associated with individual events have
runoff coefficients greater than 1 in catchments where the effects of
snowmelt and longer term storage are not significant so that event-based
coefficients can be calculated (Beven, 2019). Many hydrological models
are constrained to satisfy mass balance and can therefore never
reproduce an event that has a runoff coefficient greater than 1
(allowing for recession contributions from the previous event). Beven and
Westerberg (2011) called such events disinformative events, in the sense
of not providing useful information for model calibration (see also
Beven et al., 2011; Beven and Smith, 2015; Beven et al., 2022b). Such
events will also affect the simulation of subsequent events, since
underestimated rainfall inputs for one event will distort the antecedent
conditions for the next. We can equally envisage events where the
rainfall inputs are overestimated, giving artificially low runoff
coefficients, but these are much more difficult to identify securely.
Such issues are a good
argument for not imposing mass balance in flood forecasting models, but
rather using data assimilation in real-time to compensate for errors in
estimating the inputs.
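As an illustration of how such a screening for disinformative events might be implemented, the sketch below computes event runoff coefficients from event rainfall and runoff depths and flags those greater than 1. The column names and example values are purely illustrative, as is the assumption that recession contributions from the previous event have already been removed from the runoff totals.

```python
import pandas as pd

def event_runoff_coefficients(events: pd.DataFrame,
                              rain_col: str = "event_rainfall_mm",
                              flow_col: str = "event_runoff_mm") -> pd.DataFrame:
    """Compute event runoff coefficients and flag candidate disinformative events.

    Assumes rainfall and runoff have already been separated into events and
    converted to equivalent depths over the catchment area (mm), with any
    recession contribution from the previous event removed from the runoff total.
    """
    out = events.copy()
    out["runoff_coefficient"] = out[flow_col] / out[rain_col]
    # Coefficients > 1 are physically inconsistent for catchments without
    # snowmelt or significant carry-over storage: candidate disinformative events.
    out["disinformative"] = out["runoff_coefficient"] > 1.0
    return out

# Hypothetical example: three events with rainfall and runoff depths in mm
events = pd.DataFrame({
    "event_rainfall_mm": [25.0, 40.0, 12.0],
    "event_runoff_mm":   [10.0, 18.0, 15.0],   # third event exceeds its rainfall
})
print(event_runoff_coefficients(events))
```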
For flood hydrology, there is also the issue of uncertainty in the
estimation of flood peaks arising from rating curve uncertainties (e.g.
Clarke, 1999; Costa and Jarrett, 2008; Westerberg et al., 2011;
Domeneghetti et al., 2012; Coxon et al., 2015; McMillan and Westerberg,
2015). Uncertainties in rating curves can be estimated from statistical
theory when the rating curves are fitted to observed discharges using
regression methods. However, extrapolating well above the observed data
points to estimate peak flows can also involve epistemic uncertainties
as to the functional form of the curve (e.g. Hollaway et al., 2018a). In
some cases, it might be possible to constrain the extrapolation using
hydraulic modelling, but this can also introduce additional
uncertainties in boundary conditions and roughness parameter estimates.
Thus, in benchmarking models for flood flows it is important to consider
such uncertainties.
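A minimal sketch of how this extrapolation uncertainty arises is given below: a single power-law rating curve is fitted to hypothetical gauged stage-discharge pairs and then compared with an alternative two-segment form above an assumed bankfull stage. All values, including the break point and upper exponent, are illustrative assumptions rather than results from any particular site.

```python
import numpy as np

# Hypothetical gauged stage (m) and discharge (m^3/s) pairs, all below flood stage
stage = np.array([0.3, 0.5, 0.8, 1.2, 1.6, 2.0])
discharge = np.array([1.2, 3.0, 7.5, 16.0, 27.0, 41.0])

# Fit a single power-law rating curve Q = a * h**b by regression in log space.
b, log_a = np.polyfit(np.log(stage), np.log(discharge), 1)
a = np.exp(log_a)

def rating_single(h):
    return a * h ** b

def rating_two_segment(h, h_break=2.0, b_upper=2.2):
    # Alternative functional form: above an assumed bankfull stage the exponent
    # changes (e.g. because of floodplain flow); continuity is kept at the break.
    q_break = rating_single(h_break)
    if h <= h_break:
        return rating_single(h)
    return q_break * (h / h_break) ** b_upper

# Extrapolate well above the gauged range: the two plausible forms diverge,
# illustrating the epistemic uncertainty in estimated flood peaks.
h_peak = 3.5
print(f"single power law : {rating_single(h_peak):7.1f} m3/s")
print(f"two-segment curve: {rating_two_segment(h_peak):7.1f} m3/s")
```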
Benchmarking for a purpose
This also raises a more general issue for benchmarking. What do we wish
to benchmark for? Benchmarking is really a matter of trying to assess
the confidence we might have in a model or models as fit-for-purpose.
But fitness-for-purpose will depend on the purpose. We should expect
that different model structures or parameter sets might be more or less
suitable for different types of application, including the utility of
data assimilation in real-time. Thus, the first step in any benchmarking
exercise should be deciding on the purpose (see Figure 2). Different
purposes might require different types of evaluation (N-step ahead
predictions for forecasting; flood peaks for evaluating future change in
flood hazard; annual exceedance probabilities for flood frequencies;
flood inundation patterns for distributed models; …..) but all
benchmarking evaluations will need to allow for the uncertainties in the
observations.
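To make the point concrete, the sketch below shows two purpose-specific evaluation measures of the kind that might be used: an N-step-ahead forecast error and a check on whether a simulated flood peak falls within limits defined by an assumed relative uncertainty on the observed peak. The function names, example values, and the 20% bound are illustrative assumptions.

```python
import numpy as np

def n_step_ahead_error(obs: np.ndarray, forecasts: np.ndarray) -> float:
    """Root mean square error of N-step-ahead forecasts against observations.

    forecasts[t] is assumed to be the forecast issued N steps before time t.
    """
    return float(np.sqrt(np.mean((forecasts - obs) ** 2)))

def peak_within_limits(obs_peak: float, sim_peak: float,
                       rel_uncertainty: float = 0.2) -> bool:
    """Check whether a simulated flood peak lies within limits of acceptability
    defined by an assumed relative uncertainty on the observed peak."""
    lower, upper = obs_peak * (1 - rel_uncertainty), obs_peak * (1 + rel_uncertainty)
    return lower <= sim_peak <= upper

# Hypothetical observed flows and 3-step-ahead forecasts
obs = np.array([2.0, 3.5, 5.0, 4.0])
fc = np.array([2.2, 3.1, 5.6, 3.8])
print(n_step_ahead_error(obs, fc), peak_within_limits(obs_peak=5.0, sim_peak=5.7))
```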
This would be easier if we could safely assume that the uncertainties
involved in model forcings and evaluation could be considered as
aleatory and treated as stochastic variables. In that case, the power of
formal statistical methods for hypothesis testing could be brought to
bear. This is not the case, however. As well as the rating curve
extension problem, there are other sources of epistemic uncertainty in the modelling
process. Probably the most important is the question of estimating
catchment rainfalls, either at the catchment scale or in some
distributed form, from the limited rain gauge and uncertain radar data
that might be available. This is an epistemic uncertainty problem, with
the expectation that the uncertainty might vary in both time and space
in rather arbitrary ways.
This then suggests that some alternative to statistical hypothesis
testing might be needed for any benchmarking exercise. One approach is a
logical extension of the expectation that there might be equifinality of
model structures and parameter sets for different types of application,
hopefully with many that might be considered as fit-for-purpose. This
then suggests turning the problem around to consider what models and
parameter sets might be considered as not fit-for-purpose, while
allowing for the uncertainties in the forcing and evaluation data
(Beven, 2018, 2019). Beven and Lane (2019, 2022) discuss the principles
upon which such a rejectionist or model invalidation approach might be
based, including the principle of defining limits of acceptability for a
model to be considered as fit-for-purpose prior to any model runs being
made. Of course, because this involves a consideration of epistemic
sources of uncertainty, the definition of such limits of acceptability
might require an input of expert judgment (though see Beven and Smith,
2015, Beven, 2019, and Beven et al., 2022a,b for examples of doing so
based on historical event runoff coefficients, an approach applicable in
catchments without significant baseflow). Particularly for the
evaluation of distributed models, such judgments or feedback from local
stakeholders might be needed to decide if models are getting the right
results for the right reasons when distributed evaluation data are not
available (Beven, 2007; Beven and Lane, 2022; Beven et al., 2022b).
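A minimal sketch of such a rejectionist evaluation is given below, assuming that limits of acceptability have been defined at a set of evaluation points before any runs are made; the ensemble values, the limits, and the requirement that all points fall within the limits are illustrative assumptions.

```python
import numpy as np

def evaluate_limits_of_acceptability(simulations: np.ndarray,
                                     lower: np.ndarray,
                                     upper: np.ndarray,
                                     required_fraction: float = 1.0) -> np.ndarray:
    """Return a boolean mask of model runs retained as potentially fit-for-purpose.

    simulations : array of shape (n_runs, n_eval_points)
    lower, upper: limits of acceptability at each evaluation point, defined
                  before the runs are made (e.g. from observation uncertainty)
    required_fraction: fraction of evaluation points that must lie within the
                  limits for a run to be retained (1.0 = all points)
    """
    within = (simulations >= lower) & (simulations <= upper)
    return within.mean(axis=1) >= required_fraction

# Hypothetical ensemble of 3 runs evaluated at 5 points
sims = np.array([[1.0, 2.1, 3.2, 2.4, 1.1],
                 [1.4, 2.9, 4.5, 3.0, 1.6],
                 [0.6, 1.2, 1.9, 1.3, 0.7]])
lower = np.array([0.8, 1.8, 2.8, 2.0, 0.9])
upper = np.array([1.3, 2.6, 3.8, 2.8, 1.4])
print(evaluate_limits_of_acceptability(sims, lower, upper))  # [ True False False]
```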
Benchmarking and fitness-for-purpose in predicting the future
One of the implications of taking such an approach is that all the
models tried might be rejected (see, for example, Brazier et al., 2000;
Page et al., 2007; Choi and Beven, 2007; Dean et al., 2009; Hollaway et
al., 2018b). As I have written many times before, this is, of course, a
good thing in that it forces a re-evaluation of some sort. This could be
a re-evaluation of model structures, of how the model parameters are
sampled, of the consistency of the available observations, or of the
range for the limits of acceptability. Since it will always be possible
to extend the limits arbitrarily to ensure that not all the models are
rejected, it is important that the assumptions on which the limits are
based be clearly stated. We can extend this to the requirement that
there should be an audit trail to justify and record all the assumptions
associated with any benchmarking study, that then allows those
assumptions to be revisited later (Beven and Alcock, 2012; Beven and
Lane, 2022). The CURE uncertainty estimation toolbox, for example, has a
facility for producing such an audit trail as an output from an analysis
(Page et al., 2023).
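By way of illustration only (and not the format produced by the CURE toolbox), a simple audit-trail record of the kind envisaged might look like the following, with each assumption stated explicitly so that it can be revisited later.

```python
import json
from datetime import date

# A minimal, hypothetical audit-trail record for a benchmarking study;
# the structure and entries are illustrative only.
audit_trail = {
    "date": date.today().isoformat(),
    "purpose": "flood peak simulation for change assessment",
    "assumptions": [
        {"id": 1, "topic": "rainfall inputs",
         "statement": "catchment-average rainfall from 3 gauges, no radar correction"},
        {"id": 2, "topic": "limits of acceptability",
         "statement": "peak discharge limits set at +/-20% from rating curve analysis"},
        {"id": 3, "topic": "disinformative events",
         "statement": "events with runoff coefficient > 1 excluded from evaluation"},
    ],
}

# Write the record alongside the analysis outputs so the assumptions can be revisited
with open("benchmarking_audit_trail.json", "w") as f:
    json.dump(audit_trail, f, indent=2)
```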
In setting limits of acceptability, we are necessarily constrained to
using evaluations based on past events and historical time series
(unless doing so on a purely subjective basis as to what might be
considered fit-for-purpose). Beven and Lane (2022) suggest 8 principles
for setting limits of acceptability, including where this might involve
expert elicitation. However, in many cases the reasons for using a
hydrological model are to predict what might happen under future
conditions. This could be an expected change in the inputs projected by
a climate model, or a change in catchment characteristics as a result,
for example, of natural flood management measures, deforestation or
urbanisation. In the case of changes in inputs, the value of evaluations
based on historical data will depend on the range of past conditions
monitored (see Wi and Steinschneider, 2022, for an example using a deep
learning model). If future conditions, especially the extremes, are
expected to be outside the range of past behaviours, then both
process-based and data-based or machine learning models might be limited
in their abilities to predict such changes outside any training data
(e.g. Beven, 2020). In the case of changes in catchment characteristics,
the training data might again not include examples of such changes. We
then either have to transfer information from catchments where similar
changes have occurred or make subjective judgments about changes in
parameter values. This can work (e.g. Buytaert and Beven, 2009) but
might not work consistently. Where catchments have been monitored over
periods of such change, then predictions of that change can be evaluated
directly. If acceptable models are found, this can give
increased confidence in applications elsewhere.
It is clear that the types of limits of acceptability that might be used
in model evaluation, and the way in which they might be defined before
making model runs, will very much depend on the purpose for which a model
might be used. Taking each of the vertical pathways in Figure 2, for
example, it will be appreciated that what is required for N-step ahead
real time forecasting will be different to the use of a catchment model
for continuous simulation flood frequency estimation, for the
prediction of future catchment change, for distributed inundation
predictions for planning purposes, or for tracer or water quality
variables. What Figure 2 provides, however, is a common framework for
assessing model performance in a way that can allow considerations of
data uncertainties (and more subjective evaluation measures) to be
incorporated in a consistent and thoughtful way. It provides an
alternative to considering benchmarking in terms purely of relative
values of performance indices, which in the past have often ignored the
effects of observational errors on model performance (though such
evaluations might also include some additional dimensions of ease of
understanding and use, and costs of application). In this respect, we should learn from the poor
performance of both machine learning methods and conceptual hydrological
models in some catchments to really think about what might be considered
as fit-for-purpose for a particular application.
Of course, if it is necessary to reject all the models that are tried
for a particular purpose in a particular catchment of interest, it
should be the start of a learning process (as shown in Figure 2). This
could be learning about the failings of a particular model structure,
though it may often be difficult to understand why a model has failed,
especially in the case of a machine learning model. In many cases it
will be a result of providing the modelling process with inadequate or
inconsistent data. Machine learning, for example, should be able to deal
with data that have consistent errors (Beven, 2020). The fact that it
still seems to provide poor results on some catchments (e.g. Frame et
al., 2023) would certainly suggest that there are inconsistent errors or
forms of disinformation in some catchment datasets that limit predictive
performance. While such rejections do not help a decision maker, they
are important to advancing understanding of the modelling process (e.g.
Beven, 2018). In extremis, a decision maker could still resort to
trying to characterise the errors associated with each model run, and to
allow for those errors by being precautionary in her decisions. Still
better, of course, would be to understand just why models might fail
benchmarking tests.