1. Introduction
Developing industrially focused mathematical models is one of the grand research challenges for the design, operation and commercialisation of next-generation sustainable chemical and biochemical processes. Owing to the limited availability of petroleum resources and the severe environmental issues surrounding them, microorganism-based bio-production processes have become an attractive alternative to traditional chemical processes for the industrial synthesis of platform chemicals and high-value materials [1–3]. Because microbial metabolism is sophisticated, most bio-production processes share two characteristics. The first is that different strains and species behave similarly with respect to biomass growth, nutrient consumption, and bioproduct accumulation, owing to their delicate metabolic regulation mechanisms [4,5]. The second is that bioprocesses are difficult to reproduce: their performance varies from batch to batch even under similar operating conditions, because metabolic reactions are sensitive to changes in the culture environment [6,7].
To date, different predictive models have been proposed to account for bioprocess complexities. On the one hand, elaborate kinetic models have been developed by embedding new physical understanding into classic models such as the Monod and the Droop models [8,9]. These have been used to simulate, optimise, and scale up both fermentation processes and algal photo-production systems [7,10]. However, identifying a correct model structure to quantify the physical knowledge is a challenging task that usually requires long development times. It often results in a complex model structure, which causes problems with parameter estimation and identifiability and sacrifices the model's predictive capability [11]. On the other hand, frontier machine learning methods such as artificial neural networks, Gaussian processes, and reinforcement learning have been applied to bioprocess dynamic modelling and online optimisation, and their competency has been reported in a number of publications [12–14]. Although these data-driven models can capture complex process behaviours well within a specific operating range without prior physical knowledge, they suffer from other inherent weaknesses, such as the risk of overfitting and difficulty in extrapolating to a broader range of metabolism-governed process behaviours [11,15].
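For reference, the two classic kinetic laws named above are usually written in the following textbook forms. The notation is an assumption introduced here for illustration and is not taken from this paper: $\mu$ is the specific growth rate, $\mu_{\max}$ its maximum value, $S$ the limiting substrate concentration, $K_S$ the half-saturation constant, $q$ the intracellular nutrient quota, and $q_{\min}$ the minimum quota required for growth.

```latex
% Monod model: growth limited by the extracellular substrate concentration S
\mu = \mu_{\max} \, \frac{S}{K_S + S}

% Droop model: growth limited by the intracellular nutrient quota q
\mu = \mu_{\max} \left( 1 - \frac{q_{\min}}{q} \right)
```

Extended kinetic models typically add further terms (for example inhibition or light dependence) to these expressions, which is where the structural and identifiability difficulties described above tend to arise.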
To resolve these challenges, a third modelling strategy, hybrid modelling, has been proposed in recent years [16]. This strategy aims to combine physical knowledge and machine learning in a single hybrid model structure so as to inherit the respective advantages of kinetic and data-driven models. The structure of a hybrid model is flexible (e.g. parallel or sequential) and depends on the amount of available physical information and process data [17]. Despite its merits and industrial potential, only a few pioneering studies have attempted to improve this technology and apply it to bioprocess engineering [18–20]. In addition, hybrid model identification remains a challenge, as its kinetic part is difficult to construct from physical knowledge and its data-driven part is prone to overfitting. Therefore, this study aims to develop a general framework that integrates state-of-the-art automatic model structure identification technology into the hybrid modelling strategy to facilitate its future industrial application in bioprocess engineering.
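To make the idea of a sequential hybrid structure concrete, the following is a minimal sketch and not the framework developed in this study. It assumes a batch culture in which the mass balances form the kinetic part and the specific growth rate is supplied by a data-driven component. All names, parameter values, the synthetic training data, the polynomial regression standing in for a machine learning model, and the explicit Euler integrator are hypothetical choices made only for illustration.

```python
import numpy as np

# Hypothetical parameters, used only to generate synthetic data:
# max growth rate (1/h), half-saturation constant (g/L),
# biomass yield (g biomass per g substrate).
MU_MAX, K_S, YIELD = 0.5, 2.0, 0.4

# Synthetic "measurements" of specific growth rate versus substrate.
# They come from a Monod curve here; real data would be experimental.
s_data = np.linspace(0.0, 20.0, 30)
mu_data = MU_MAX * s_data / (K_S + s_data)

# Data-driven part: a degree-4 polynomial regression stands in for
# whatever machine learning model a real application would use.
coeffs = np.polyfit(s_data, mu_data, deg=4)


def mu_learned(s):
    """Data-driven estimate of the specific growth rate (non-negative)."""
    return max(float(np.polyval(coeffs, s)), 0.0)


def simulate(x0, s0, t_end=24.0, dt=0.01):
    """Kinetic part: batch mass balances, integrated with explicit Euler.

    dX/dt = mu * X,   dS/dt = -mu * X / YIELD
    Returns an array with columns (time, biomass, substrate).
    """
    n_steps = int(round(t_end / dt))
    x, s = x0, s0
    traj = [(0.0, x, s)]
    for k in range(n_steps):
        mu = mu_learned(s) if s > 0 else 0.0
        growth = mu * x * dt
        # Growth cannot consume more substrate than remains.
        growth = min(growth, YIELD * max(s, 0.0))
        x += growth
        s -= growth / YIELD
        traj.append(((k + 1) * dt, x, s))
    return np.array(traj)


traj = simulate(x0=0.1, s0=10.0)
print("final biomass (g/L):", traj[-1, 1])
print("final substrate (g/L):", traj[-1, 2])
```

In this arrangement the mass balances guarantee that biomass and substrate stay consistent with the assumed yield regardless of how well the regression fits, while the learned rate expression carries the part of the behaviour that is hard to derive from first principles. A parallel structure would instead add a data-driven correction to the output of a complete kinetic model.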