P Shi -

Understanding soil organic carbon (SOC) content is essential for environmental sustainability and carbon neutrality. Traditional methods of predicting SOC content are often laborious and imprecise. With the development of machine learning techniques, however, the feasibility and accuracy of SOC prediction have greatly improved. This study evaluates six machine learning models for predicting SOC content: Random Forest (RF), Support Vector Machine (SVM), Partial Least Squares Regression (PLSR), Convolutional Neural Network (CNN), Artificial Neural Network (ANN), and Extreme Gradient Boosting (XGBoost). The research was conducted in north-central and north-western China, covering diverse land uses and climatic conditions. A comprehensive dataset was used, including soil samples, DEM data, rainfall and temperature data, soil moisture, erosion modulus, and NDVI. Each model was assessed with ten-fold cross-validation using the coefficient of determination (R²), mean absolute error (MAE), mean square error (MSE), root mean square error (RMSE), and the ratio of performance to interquartile distance (RPIQ). The XGBoost model outperformed the other models, achieving R²=0.715, MAE=0.424, MSE=0.707, RMSE=0.781, and RPIQ=2.565. For forest, grassland, and farmland, air temperature and soil pH were identified as the key factors influencing SOC content, both showing a negative correlation with SOC content. For unutilized land, the key factors affecting SOC content were NDVI and soil pH. Additionally, SHapley Additive exPlanations (SHAP) were introduced to explain the model’s predictions, demystifying the machine learning “black box” and improving the credibility of the predictions. This work demonstrates the potential of machine learning models to accurately predict SOC and identify key factors influencing SOC levels, providing new insights into soil management and climate change mitigation.
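The evaluation protocol described above (ten-fold cross-validation scored with R², MAE, MSE, RMSE, and RPIQ) can be illustrated with a minimal sketch. This is not the study's code: the function names, the synthetic data, and the ordinary least-squares model standing in for the six learners are all assumptions made for illustration, and RPIQ is taken here as the interquartile distance of the observed values divided by the RMSE.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R2, MAE, MSE, RMSE and RPIQ for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))
    rmse = float(np.sqrt(mse))
    mae = float(np.mean(np.abs(residuals)))
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    # Assumed RPIQ definition: interquartile distance (Q3 - Q1) over RMSE.
    q1, q3 = np.percentile(y_true, [25, 75])
    rpiq = float((q3 - q1) / rmse)
    return {"R2": r2, "MAE": mae, "MSE": mse, "RMSE": rmse, "RPIQ": rpiq}

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and split them into k disjoint folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Average each metric over k held-out folds."""
    folds = k_fold_indices(len(y), k, seed)
    per_fold = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train_idx], y[train_idx])
        y_hat = predict(model, X[test_idx])
        per_fold.append(regression_metrics(y[test_idx], y_hat))
    return {name: float(np.mean([m[name] for m in per_fold]))
            for name in per_fold[0]}

# Hypothetical stand-in model: ordinary least squares with an intercept.
def fit_linear(X, y):
    design = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

# Synthetic data only; these are not the study's covariates or SOC values.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

scores = cross_validate(X, y, fit_linear, predict_linear, k=10)
print(scores)
```

When metrics are averaged over folds in this way, the mean RMSE is not generally equal to the square root of the mean MSE; it can only be equal or smaller.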