Abstract
This commentary discusses a framework for benchmarking hydrological models for different purposes when the datasets for different catchments may be affected by epistemic uncertainties. The approach can be expected to result in an ensemble of models for use in prediction (including models of different types), but it also allows model rejection to be the start of a learning process to improve understanding.
On benchmarking and intercomparisons of hydrological models
One of the priority actions identified in the UK Flood Hydrology Roadmap
(Environment Agency, 2022) concerns the issue of how to benchmark models
for practical applications in flood hydrology. The aims would be
two-fold: to ensure that the models used for operational applications
can be considered fit-for-purpose, and to provide a framework that makes it easier to move models from research into practice. Previous
benchmarking exercises commissioned by the Environment Agency have been
one-off projects for the comparison of 1D hydraulic models (Environment
Agency, 2010) and later 2D hydraulic models (Environment Agency, 2013),
but these were primarily model-to-model intercomparisons using hypothetical datasets rather than tests of performance in real applications. At the time, there were good reasons for this: it
established confidence in models giving consistent results without
raising the additional concerns of data uncertainties in both model
inputs and inundation datasets for evaluation. However, in the wider
flood hydrology context, concerns about data and boundary condition
uncertainties cannot be avoided. The question, therefore, is how data
uncertainties might affect a benchmarking methodology.
There have been international intercomparisons of hydrological models in
the past, including those organised by the World Meteorological
Organisation (WMO) for real-time forecasting and snowmelt runoff models
(Sittner, 1976; Cavidias and Morin, 1986; Georgakakos and Smith, 1990).
Benchmarking has also been applied to land surface models, in projects such as PILPS and PLUMBER (Henderson-Sellers et al., 1996; Abramowitz, 2012; Best et al., 2015; Haughton et al., 2016). More
recently, model intercomparison and benchmarking projects have included
DMIP and IHM-MIP projects for distributed models (e.g. Smith et al.,
2004, 2012, 2013; Maxwell et al., 2014; Kollet et al., 2017); the Great
Lakes Model Intercomparison project (e.g. Mai et al., 2022);
benchmarking of NLDAS land surface models (e.g. Nearing et al., 2016,
2018); and the testing of model ensembles. These have taken the form
either of testing which model provides the best simulations according to
some metric (often using a split-record test, e.g. Knoben et al., 2019);
or testing against a benchmark model, either a chosen conceptual
hydrological model (e.g. Newman et al., 2017; Seibert et al., 2018) or a
purely data-based or machine learning model (e.g. Kratzert et al., 2019;
Lees et al., 2021). Some benchmarking projects have also concentrated on
seasonal and low-flow forecasts (e.g. Nicolle et al., 2014; Girons-Lopez et al., 2021).
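As an illustration of the kind of split-record, metric-based evaluation against a benchmark model described above, the following is a minimal sketch in Python. The Nash-Sutcliffe efficiency (NSE) is the standard definition; the data files, the candidate and benchmark simulations, and the 50:50 record split are hypothetical placeholders rather than the set-up of any of the studies cited.

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 minus the ratio of the sum of squared
    errors to the sum of squared deviations of the observations from their
    mean (1 is a perfect fit; 0 is no better than the observed mean)."""
    obs, sim = np.asarray(obs, dtype=float), np.asarray(sim, dtype=float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

# Hypothetical daily series for one catchment: observed discharge and
# simulations from a candidate model and a simple benchmark model.
obs = np.loadtxt("discharge_obs.txt")            # placeholder file names
sim_candidate = np.loadtxt("sim_candidate.txt")
sim_benchmark = np.loadtxt("sim_benchmark.txt")

# Split-record test: the first half of the record is reserved for
# calibration/training, and only the second half is used for evaluation.
split = len(obs) // 2
score_candidate = nse(obs[split:], sim_candidate[split:])
score_benchmark = nse(obs[split:], sim_benchmark[split:])

print(f"Candidate NSE (evaluation period): {score_candidate:.2f}")
print(f"Benchmark NSE (evaluation period): {score_benchmark:.2f}")
```

In practice the choice of metric, evaluation period and benchmark are themselves decisions that influence the outcome, which is part of the point made in the next paragraph.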
Experience from those intercomparisons involving hydrological models
suggests that for most purposes there will be no single model that can be considered better than the others: the relative performance will depend
on which catchment is being simulated, which period or events are being
simulated, and which performance measure or measures are chosen to do
the evaluation. I have, of course, argued for a long, long time that the
idea of an optimum hydrological model should be considered as untenable
in favour of a concept of equifinality of models and parameter sets
(e.g. Beven and Freer, 2001; Beven, 2006). Others have also suggested
that the use of multiple metrics can reflect subjective judgments about
the acceptability of different models (e.g. Gauch et al., 2022), though
different experts might vary in their rankings (Crochemore et al.,
2015).
Perhaps more interesting have been the benchmarking exercises involving
comparisons with machine learning models (e.g. Nearing et al., 2021). In
most of these studies it has been shown that the machine learning
methods generally produce better predictions in both calibration and
validation. This has included machine learning models trained on a large collection of catchments being compared against models calibrated on single catchments (Figure 1). However, it is also the case
that better does not always mean good. Distributions of the Nash-Sutcliffe efficiency (NSE) across a large number of catchments in the US CAMELS dataset show that for some 10% of catchments less than 50% of the variance in the discharge is captured by the models. Similar variation
in performance has been reported in hydrological modelling studies of
large numbers of catchments in France (Perrin et al., 2001) and the UK
(Lane et al., 2019; Lees et al., 2021). So something else is also going
on here, which clearly has an impact on benchmarking in the sense of
whether models might be fit-for-purpose.
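As a point of reference for the NSE figures quoted above: an NSE below 0.5 means that the model's error variance exceeds half the variance of the observed discharge, i.e. less than 50% of that variance is captured. A minimal sketch of how such a distribution of scores might be summarised, assuming a hypothetical file of per-catchment evaluation NSE values (the file name and data are placeholders):

```python
import numpy as np

# Hypothetical array of evaluation-period NSE scores, one per catchment
# (e.g. several hundred values for a CAMELS-style large-sample study).
nse_scores = np.loadtxt("evaluation_nse_per_catchment.txt")  # placeholder

# An NSE below 0.5 means the error variance exceeds half the variance of
# the observed discharge, i.e. less than 50% of the variance is captured.
fraction_poor = np.mean(nse_scores < 0.5)
print(f"Median NSE across catchments: {np.median(nse_scores):.2f}")
print(f"Fraction of catchments with NSE < 0.5: {fraction_poor:.1%}")
```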