AUTHOR=Mishra Umakant , Gautam Sagar , Riley William J. , Hoffman Forrest M. TITLE=Ensemble Machine Learning Approach Improves Predicted Spatial Variation of Surface Soil Organic Carbon Stocks in Data-Limited Northern Circumpolar Region JOURNAL=Frontiers in Big Data VOLUME=Volume 3 - 2020 YEAR=2020 URL=https://www.frontiersin.org/journals/big-data/articles/10.3389/fdata.2020.528441 DOI=10.3389/fdata.2020.528441 ISSN=2624-909X ABSTRACT=Various approaches of differing mathematical complexity are being applied for spatial prediction of soil properties. Regression kriging is a widely used hybrid approach that combines the correlation between soil properties and environmental controllers with the spatial autocorrelation between soil observations. In this study, we compared four machine learning approaches (gradient boosting machine (GBM), multivariate adaptive regression splines (MARS), random forest (RF), and support vector machine (SVM)) with regression kriging to predict the spatial heterogeneity of surface (0-30 cm) soil organic carbon (SOC) stocks at 250-m spatial resolution across the northern circumpolar permafrost region. We combined 1660 soil profile observations (calibration datasets) with georeferenced datasets of environmental factors (climate, topography, land cover, bedrock geology, and soil types) to predict the spatial heterogeneity of surface SOC stocks and evaluated the prediction accuracy at 714 randomly selected sites (validation datasets) across the study area. We found that the different prediction techniques inferred different importance rankings and different numbers of environmental predictors for SOC stocks. The regression kriging approach produced lower prediction errors than MARS and SVM, and prediction accuracy comparable to that of the GBM and RF techniques. However, the ensemble median prediction of SOC stocks obtained from all four machine learning techniques showed the highest prediction accuracy.
Although the choice of approach for spatial prediction of soil properties will depend on the availability of soil and environmental datasets and on computational resources, we conclude that the ensemble median prediction obtained from multiple machine learning approaches provides greater spatial detail and produces the highest prediction accuracy, and thus can be a better choice for predicting the spatial heterogeneity of soil properties.
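
The ensemble median described in the abstract can be illustrated with a minimal Python sketch. It assumes each of the four machine learning models has already produced one SOC stock prediction per validation site. The function names `ensemble_median` and `rmse`, the use of root-mean-square error as the accuracy measure, and all numeric values are assumptions made for illustration; they are not taken from the study.

```python
from math import sqrt
from statistics import median


def ensemble_median(predictions_by_model):
    """Return the per-site median across models.

    predictions_by_model maps a model name to a list of predictions,
    one per site, with sites in the same order for every model.
    """
    per_model = list(predictions_by_model.values())
    n_sites = len(per_model[0])
    if any(len(p) != n_sites for p in per_model):
        raise ValueError("every model must predict the same number of sites")
    # zip(*per_model) regroups the values so each tuple holds one site's
    # predictions from all models.
    return [median(site_values) for site_values in zip(*per_model)]


def rmse(predicted, observed):
    """Root-mean-square error (an assumed accuracy measure)."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return sqrt(sum(squared) / len(squared))


# Hypothetical predictions at three validation sites (made-up values).
predictions = {
    "GBM": [10.0, 20.0, 30.0],
    "MARS": [12.0, 18.0, 40.0],
    "RF": [11.0, 22.0, 28.0],
    "SVM": [9.0, 25.0, 31.0],
}
observed = [11.0, 21.0, 30.0]  # hypothetical observed values

ensemble = ensemble_median(predictions)
print(ensemble)                  # [10.5, 21.0, 30.5]
print(rmse(ensemble, observed))  # about 0.408
```

With an even number of models, the median is the mean of the two middle predictions, so a single outlying model (MARS at the third site in this made-up data) has limited influence on the ensemble value.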