1. Introduction
Forest inventories continuously monitor the status of forested ecosystems through field campaigns for data collection and subsequent analysis (Smith, 2002). As forests play a key role in maintaining ecological stability, national forest inventories are playing an increasingly important role in driving academic and governmental decision making (Saarela et al., 2020). For example, Mexico’s National Forest and Soils Inventory (INFyS) is a pillar of its measurement, reporting and verification (MRV) system, and the foundation for the national inventory of greenhouse gas (GHG) emissions in the Land Use, Land-Use Change and Forestry (LULUCF) sector and for the national forest reference emissions level (FREL). MRV and FREL are components of a carbon accounting system used by the United Nations to incentivize practices that lower carbon emissions (Mitchell et al., 2017). National forest inventories usually focus on collecting field data over large geographic areas. Developing analytical tools that enhance the accessibility and understanding of nationwide forest inventory data is critical for democratizing information about forest structure at national and international scales.
Forest inventories based on a statistical sample are used to estimate mean or total amounts of forest inventory attributes within the population of interest (Tomppo, Haakana, et al., 2008). However, field surveys can be costly, time consuming and logistically challenging. Furthermore, collecting data exclusively from field surveys can result in designs that do not satisfy their statistical assumptions and in limited sample sizes due to nonresponse, which occurs when field plots that were part of the design cannot be accessed. Improper management of nonresponse can produce bias or increase uncertainty when generating estimates (McRoberts et al., 2005). Emerging satellite and machine learning (ML) technologies offer the opportunity to build standardized analytical tools that can mitigate problems associated with nonresponse and produce maps that serve multiple purposes (Tomppo, Olsson, et al., 2008).
Technologies for mapping forest attributes have evolved by modeling attributes contained in field data with remotely sensed satellite data, and then using these models to predict the spatial distribution of forest attributes (Schumacher et al., 2020; Wang et al., 2009). The integration of both data sources has been widely applied to better visualize national-scale estimates, reduce uncertainty, and improve dataset robustness (Haakana et al., 2019; Ohmann et al., 2014; Saarela et al., 2020; Tomppo et al., 2010). This approach has played a key role in modeling national estimates of forest structure such as aboveground biomass (AGB) as well as attributes such as forest age (Saarela et al., 2020; Schumacher et al., 2020). Both tree height and tree density are drivers of AGB and bioenergy potential in forest ecosystems. To obtain accurate spatial predictions of forest attributes, many studies employ ML models using a multivariate approach (Khaledian & Miller, 2020; Li et al., 2020; Soriano-Luna et al., 2018; Wadoux et al., 2020). ML is a field of artificial intelligence (AI), and one of its main objectives is to identify and model relationships between dependent data (such as forest inventory attributes) and independent data (such as remote sensing data), and to apply these models to generate predictions in a semi-autonomous way (James et al., 2013a). The performance of different types of ML models often varies when modeling forest attributes. For example, spatially explicit estimates of AGB varied by as much as 19% among linear (LM), generalized additive (GAM) and random forest (RF) empirical models in a temperate forest in central Mexico (Soriano-Luna et al., 2018). All three fitted AGB models performed well when predicting the spatial distribution of AGB, but GAM was better at representing AGB variations across the landscape.
Thus, different ML models yield different results, and studies often compare multiple models or algorithms to identify the best solution for predicting forest attributes or other response variables, as no silver bullet exists in ecological modeling (Qiao et al., n.d.).
One commonly used set of ML approaches for spatial prediction is ensemble learners, which integrate multiple ML models and algorithms (Holloway & Mengersen, 2018). Ensemble ML models are used in mapping forest attributes because they offer improvements in accuracy over individual algorithms (Healey et al., 2018). Examples of popular ensemble ML algorithms include RF (Breiman, 2001), which applies a bagging method, and Super Learner, which applies a stacking method and uses cross-validation to estimate the performance of multiple ML models (Polley & Laan, 2010). The latter has been shown to outperform the individual algorithms used to build the model (Davies & van der Laan, n.d.; Taghizadeh-Mehrjardi et al., 2021).
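To make the stacking idea concrete, the following is a minimal sketch in Python using only NumPy. It is not the implementation used in this study or in the Super Learner software: the synthetic data, the two base learners (a linear model and a k-nearest-neighbour regressor), and all function names are assumptions chosen for illustration. In addition, the meta-learner here is an unconstrained least-squares fit, whereas Super Learner implementations commonly constrain the weights to be non-negative.

```python
import numpy as np

def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns a predict function."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Xn: np.column_stack([np.ones(len(Xn)), Xn]) @ coef

def fit_knn(X, y, k=5):
    """k-nearest-neighbour regression; returns a predict function."""
    def predict(Xn):
        # Squared Euclidean distance from each new point to each training point.
        d = ((Xn[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        nearest = np.argsort(d, axis=1)[:, :k]
        return y[nearest].mean(axis=1)
    return predict

def fit_stack(X, y, learners, n_folds=5, seed=0):
    """Stack base learners using cross-validated (out-of-fold) predictions."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    Z = np.zeros((len(X), len(learners)))  # out-of-fold predictions
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
        for j, fit in enumerate(learners):
            Z[test_idx, j] = fit(X[train_idx], y[train_idx])(X[test_idx])
    # Cross-validated RMSE of each base learner on its own.
    cv_rmse = np.sqrt(((Z - y[:, None]) ** 2).mean(axis=0))
    # Meta-learner: least-squares weights on the out-of-fold predictions.
    weights, *_ = np.linalg.lstsq(Z, y, rcond=None)
    # Refit every base learner on all data for final prediction.
    models = [fit(X, y) for fit in learners]
    def predict(Xn):
        return np.column_stack([m(Xn) for m in models]) @ weights
    return predict, weights, cv_rmse

# Hypothetical data: three covariates standing in for remote sensing
# predictors and a response standing in for a forest attribute.
rng = np.random.default_rng(42)
X = rng.uniform(-2, 2, size=(300, 3))
y = 15 + 3 * X[:, 0] + 4 * np.sin(2 * X[:, 1]) + rng.normal(scale=0.5, size=300)

predict, weights, cv_rmse = fit_stack(X, y, [fit_linear, fit_knn])
print("cross-validated RMSE per base learner:", cv_rmse)
print("meta-learner weights:", weights)
print("stacked predictions for five samples:", predict(X[:5]))
```

Because the meta-learner is trained on out-of-fold predictions rather than in-sample fits, base learners that overfit their training data are not rewarded with inflated weights.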
Forests in Mexico are a critical natural resource, containing vast amounts of biodiversity and providing ecosystem goods and services (e.g., timber production, water security, soil conservation) as well as economic benefits. The National Forestry Commission of Mexico (CONAFOR) has been in charge of implementing the INFyS since 2004. The INFyS is a national program in which a stratified, systematic sample of permanent ground plots is used to measure tree variables (e.g., height, diameter at breast height, count) and site variables (e.g., forest type, site class, topographic data) across all forest lands every 5 years (CONAFOR, 2017).
The main goal of this study is to develop a methodological framework with which CONAFOR can generate country-level maps of INFyS forest attributes. Specifically, this involves operationalizing methods that integrate field data with remote sensing data in an ensemble ML framework to map forest attributes. We start with tree height and tree density, as these are key components of forest structure and can provide information that helps mitigate the impacts of nonresponse and supports the estimation of AGB, carbon storage and forest productivity over time (Humagain et al., 2017; Pirotti, 2010; Selkowitz et al., 2012). Accurate spatial predictions of such structural variables are fundamental for the management and conservation of forest ecosystems, as they are important components in the study of land-atmosphere interactions, carbon cycling, assessment of fire hazards and timber volume estimation (Chopping et al., 2008; Selkowitz et al., 2012). By developing workflows and products based on INFyS data, this study aims to support CONAFOR in generating information that decision makers can use to manage forests more effectively, preserve the country’s forest patrimony, and improve national and international reporting associated with MRV and FREL. We envision that this methodology could be extended to other forest attributes, such as AGB, carbon storage, and timber volume, and thereby improve Mexico’s national estimates of other relevant forest attributes.