Abstract
This commentary discusses a framework for the benchmarking of hydrological models for different purposes when the datasets for different catchments might involve epistemic uncertainties. The approach can be expected to result in an ensemble of models that might be used in prediction (including models of different types), but it also allows model rejection to serve as the starting point for a learning process to improve understanding.
On benchmarking and intercomparisons of hydrological models
One of the priority actions identified in the UK Flood Hydrology Roadmap (Environment Agency, 2022) concerns the issue of how to benchmark models for practical applications in flood hydrology. The aims would be two-fold: to ensure that the models used for operational applications can be considered fit-for-purpose, and to provide a framework that makes it easier to move models from research into practice. Previous benchmarking exercises commissioned by the Environment Agency have been one-off projects for the comparison of 1D hydraulic models (Environment Agency, 2010) and later 2D hydraulic models (Environment Agency, 2013), but these were primarily model-to-model intercomparisons using hypothetical datasets rather than tests of performance in real applications. At the time, there were good reasons for this: it established confidence in models giving consistent results without raising the additional concerns of data uncertainties in both model inputs and the inundation datasets used for evaluation. However, in the wider flood hydrology context, concerns about data and boundary condition uncertainties cannot be avoided. The question, therefore, is how data uncertainties might affect a benchmarking methodology.
There have been international intercomparisons of hydrological models in the past, including those organised by the World Meteorological Organisation (WMO) for real-time forecasting and snowmelt runoff models (Sittner, 1976; Cavadias and Morin, 1986; Georgakakos and Smith, 1990). Benchmarking has also been applied to land surface models, including in projects such as PILPS and PLUMBER (Henderson-Sellers et al., 1996; Abramowitz, 2012; Best et al., 2015; Haughton et al., 2016). More recently, model intercomparison and benchmarking projects have included DMIP and IHM-MIP for distributed models (e.g. Smith et al., 2004, 2012, 2013; Maxwell et al., 2014; Kollet et al., 2017); the Great Lakes Model Intercomparison project (e.g. Mai et al., 2022); benchmarking of the NLDAS land surface models (e.g. Nearing et al., 2016, 2018); and the testing of model ensembles. These have taken the form either of testing which model provides the best simulations according to some metric (often using a split record test, e.g. Knoben et al., 2019), or of testing against a benchmark model, either a chosen conceptual hydrological model (e.g. Newman et al., 2017; Seibert et al., 2018) or a purely data-based or machine learning model (e.g. Kratzert et al., 2019; Lees et al., 2021). Some benchmarking projects have also concentrated on seasonal and low flow forecasts (e.g. Nicolle et al., 2014; Girons-Lopez et al., 2021).
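To make the first of these approaches concrete, the following minimal sketch (in Python, with illustrative names such as nse and split_record_ranking that are not taken from any of the cited projects) ranks a set of candidate models by Nash-Sutcliffe efficiency over the evaluation portion of a split record, assuming each model has already been calibrated on the earlier portion:

```python
import numpy as np

def nse(q_obs, q_sim):
    """Nash-Sutcliffe efficiency: 1 minus the ratio of the squared simulation
    errors to the sum of squared deviations of the observations from their mean."""
    q_obs = np.asarray(q_obs, dtype=float)
    q_sim = np.asarray(q_sim, dtype=float)
    return 1.0 - np.sum((q_obs - q_sim) ** 2) / np.sum((q_obs - q_obs.mean()) ** 2)

def split_record_ranking(q_obs, simulations, eval_fraction=0.5):
    """Rank candidate models by NSE over the later (evaluation) part of a split record.

    `simulations` maps a model name to a simulated series of the same length as
    `q_obs`; each model is assumed to have been calibrated on the earlier part of
    the record only, so only the remaining part is used for the comparison.
    """
    n_eval = int(len(q_obs) * eval_fraction)
    scores = {name: nse(q_obs[-n_eval:], np.asarray(q_sim)[-n_eval:])
              for name, q_sim in simulations.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

In practice the resulting ranking will also depend on choices not shown here, such as the length and wetness of the evaluation period, the treatment of any warm-up period, and the metric itself.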
Experience from those intercomparisons involving hydrological models suggests that for most purposes there will be no single model that can be considered better than the others: the relative performance will depend on which catchment is being simulated, which period or events are being simulated, and which performance measure or measures are chosen for the evaluation. I have, of course, argued for a long, long time that the idea of an optimum hydrological model should be considered untenable in favour of a concept of equifinality of models and parameter sets (e.g. Beven and Freer, 2001; Beven, 2006). Others have also suggested that the use of multiple metrics can reflect subjective judgments about the acceptability of different models (e.g. Gauch et al., 2022), though different experts might vary in their rankings (Crochemore et al., 2015).
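The dependence on the choice of measure is easily illustrated. The sketch below (again illustrative, with synthetic data rather than results from any of the studies cited) scores two hypothetical simulations against the same observations using NSE, the Kling-Gupta efficiency of Gupta et al. (2009), and a simple peak error:

```python
import numpy as np

def nse(q_obs, q_sim):
    """Nash-Sutcliffe efficiency."""
    return 1.0 - np.sum((q_obs - q_sim) ** 2) / np.sum((q_obs - q_obs.mean()) ** 2)

def kge(q_obs, q_sim):
    """Kling-Gupta efficiency (Gupta et al., 2009): combines correlation,
    variability ratio and bias ratio into a single score."""
    r = np.corrcoef(q_obs, q_sim)[0, 1]
    alpha = q_sim.std() / q_obs.std()
    beta = q_sim.mean() / q_obs.mean()
    return 1.0 - np.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)

def peak_error(q_obs, q_sim):
    """Relative error in the largest peak (closer to zero is better)."""
    return (q_sim.max() - q_obs.max()) / q_obs.max()

# Synthetic "observed" discharges and two hypothetical simulations:
# model A reproduces the dynamics exactly but underestimates everything by 20%;
# model B is unbiased but adds random errors of timing and magnitude.
rng = np.random.default_rng(0)
q_obs = rng.gamma(shape=2.0, scale=2.0, size=1000)
q_sim_a = 0.8 * q_obs
q_sim_b = q_obs + rng.normal(0.0, 1.2, size=q_obs.size)

for name, q_sim in [("A (biased low)", q_sim_a), ("B (noisy, unbiased)", q_sim_b)]:
    print(f"{name}: NSE={nse(q_obs, q_sim):.2f}  KGE={kge(q_obs, q_sim):.2f}  "
          f"peak error={peak_error(q_obs, q_sim):+.2f}")
```

Each measure penalises a different aspect of the errors, so which model is "best" depends on which measure, or which expert weighting of measures, is adopted, which is exactly the subjectivity noted above.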
Perhaps more interesting have been the benchmarking exercises involving comparisons with machine learning models (e.g. Nearing et al., 2021). In most of these studies it has been shown that the machine learning methods generally produce better predictions in both calibration and validation. This includes cases where machine learning models trained on a large collection of catchments have been compared against models calibrated on single catchments (Figure 1). However, it is also the case that better does not always mean good. Distributions of the Nash-Sutcliffe efficiency (NSE) across the large number of catchments in the US CAMELS dataset show that there are some 10% of catchments where less than 50% of the variance in the discharge is captured by the models. Similar variation in performance has been reported in hydrological modelling studies of large numbers of catchments in France (Perrin et al., 2001) and the UK (Lane et al., 2019; Lees et al., 2021). So something else is also going on here, which clearly has an impact on benchmarking in the sense of whether models might be fit-for-purpose.
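As a reminder of where that 50% figure comes from: the NSE can be read as the proportion of the observed discharge variance explained relative to a mean-flow benchmark, so NSE < 0.5 corresponds to less than half of the variance being captured. The following minimal sketch, using synthetic per-catchment scores rather than the published CAMELS results, shows how such a summary might be computed:

```python
import numpy as np

# Synthetic per-catchment NSE scores standing in for a large-sample benchmarking
# result (one score per catchment, e.g. for a CAMELS-style dataset); the values
# and their distribution are purely illustrative.
rng = np.random.default_rng(42)
nse_scores = rng.beta(8.0, 2.0, size=500) * 1.6 - 0.6

# Because NSE = 1 - (sum of squared errors) / (sum of squared deviations of the
# observations from their mean), a score below 0.5 means the model reproduces
# less than 50% of the variance in the observed discharge, relative to simply
# using the mean flow as a predictor.
fraction_poor = np.mean(nse_scores < 0.5)
print(f"Catchments with NSE < 0.5: {fraction_poor:.1%}")
```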