Machine learning-based estimation of soil organic matter using RGB values

Mansur, Nauaf; Abbod, Mohsen

Nauaf Mansur ¹, Mohsen Abbod ²*

1, Department of Soil Science and Land Reclamation, Faculty of Agriculture, Homs University, Homs, Syria

2, Department of Plant Protection, Faculty of Agriculture, Homs University, Homs, Syria

E-mail: abbod.mohsen111@gmail.com

Received: 05/08/2025
Acceptance: 11/09/2025
Available Online: 13/09/2025
Published: 01/01/2026

Manuscript link
http://dx.doi.org/10.30493/DAS.2025.539371

Abstract

Soil color is a straightforward yet insightful indication of soil properties. This work employed machine learning (ML) to predict soil organic matter (SOM) content utilizing 50 soil samples from Homs province, based on RGB color values. The concentration of SOM was determined using the wet oxidation method, and predictive models were built utilizing Support Vector Machine (SVM), Gaussian Process Regression (GPR), Least Squares Kernel (LSK), Ensemble Tree, Artificial Neural Network (ANN), and Multiple Linear Regression (MLR). Among these, GPR, SVM, and Ensemble Tree demonstrated robust training performance (R²=0.91, 0.89, and 0.89, respectively); however, SVM displayed decreased robustness with external data. Subsequent validation using external testing and multicriteria decision-making (MCDM) analysis demonstrated that the ANN model attained the highest predictive accuracy and generalization ability, followed by GPR. The findings indicate that machine learning models can reliably forecast soil organic matter from RGB values, providing a faster and more cost-effective substitute for laboratory analyses. This facilitates effective large-scale soil monitoring in Homs and comparable regions for improved agricultural and environmental decision-making.

Keywords: Soil organic matter, RGB, Machine learning, Wet oxidation

Introduction

Soil color is a property that can be evaluated with limited time and effort. It is rare for soil investigations to exclude an analysis of soil color, as it offers valuable information regarding numerous chemical, physical, and biological transformations [1][2]. The coloration observed in the uppermost soil layer primarily signifies biological processes. In contrast, in the subsurface layers, soil color serves as a more significant indicator of physical and chemical transformations, including the oxidation and reduction of iron and manganese, resulting in distinct color variations [1][3].

Given the relationship between soil color and various biological, chemical, and physical processes, it is evident that all soil classification systems have recognized soil color as an essential diagnostic characteristic. A multitude of investigations have explored the correlation between soil hue and organic matter (OM) concentration, frequently employing Munsell charts for the assessment of color [4-6]. Therefore, soil spectral reflectance serves as an indirect indicator of a range of soil properties.

Recent technological advancements and developments in computing enable the acquisition of high-resolution digital imagery and the application of computer vision algorithms for swift and non-invasive characterization of soil properties. Artificial intelligence (AI) and machine learning (ML) algorithms systematically analyze extensive soil data, including type, moisture, nutrients, and pH levels, to refine predictions and recommendations for soil management. This optimization facilitates more precise fertilizer application, irrigation strategies, and overall farming practices that are specifically tailored to various soil types [7-9].

A number of investigations have examined the application of color data in the analysis of soil. For example, RGB values derived from the Munsell chart alongside a K-Nearest Neighbor (KNN) algorithm were utilized for the classification of soil color [10]. In another study, a significant correlation (R = 0.75) between the red band of RGB values and soil organic matter (SOM) was found, leading to the development of a robust predictive model [11]. Moreover, investigations utilizing smartphone imagery alongside multiple linear regression (MLR) and partial least squares (PLS) algorithms have shown similar precision to spectral instruments in forecasting soil organic matter (SOM) [12]. Machine learning techniques, including Support Vector Machine (SVM), Gaussian Process Regression (GPR), decision trees, and ensemble methods, were also used to estimate soil organic matter using cellphone images [13]. The findings illustrated the potential of proximal soil sensors for the rapid, accurate, and non-destructive prediction of soil properties.

This research presents innovative models aimed at forecasting soil organic matter (SOM) levels in Homs province. These models were constructed utilizing RGB spectral data obtained from 50 soil samples, alongside a range of machine learning algorithms. Such methodology might present an economical and efficient option for both researchers and agricultural practitioners in the area, serving as a viable substitute for conventional laboratory evaluations of soil organic matter.

Materials and Methods

Soil sample collection

Soil samples were systematically collected from the surface layer (0-25 cm) across three distinct regions within and adjacent to Homs Governorate. The collection sites included Shin, characterized by a humid Mediterranean climate and located 37 km west of Homs; Al-Sha’irat, situated in a dry climate zone 37 km southeast of Homs; and Al-Salamiyah, also in a dry climate zone, positioned 45 km northeast of Homs. A total of 50 soil samples were collected, air-dried, and cleared of debris. The samples were then divided, with one portion preserved in its natural state and the other sieved to a particle size of less than 2 mm for subsequent laboratory analysis.

Estimation of soil organic matter (SOM) content through wet oxidation method

The wet oxidation technique, as outlined by Walkley and Black (1934), was utilized to assess the soil organic matter content [14]. A 10 mL solution of 1N potassium dichromate (K₂Cr₂O₇) was introduced in excess to a 0.5 g soil sample in a highly acidic environment. This process resulted in the oxidation of the organic carbon present in the soil. The unreacted potassium dichromate was subsequently back-titrated with 1N ferrous sulfate, employing diphenylamine as an indicator. The quantity of potassium dichromate that underwent reaction with the organic carbon was ascertained by calculating the difference between the initial and residual amounts. Consequently, the calculations for the percentage of organic carbon and, in turn, the organic matter content within the soil were conducted [14].
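The back-titration arithmetic described above can be sketched as follows. This is a minimal illustration, not the authors' exact calculation: the blank titration, the recovery factor of 1.3, and the carbon-to-organic-matter factor of 1.724 are conventional values assumed here and are not stated in the text. The function name and parameters are hypothetical.

```python
def walkley_black_som(sample_g, blank_titre_ml, sample_titre_ml,
                      dichromate_ml=10.0, dichromate_normality=1.0,
                      recovery_factor=1.3, om_factor=1.724):
    """Return (organic carbon %, organic matter %) from a back-titration.

    blank_titre_ml: ferrous sulfate used for a soil-free blank (assumed step).
    sample_titre_ml: ferrous sulfate used for the soil sample.
    recovery_factor and om_factor are conventional assumptions.
    """
    if sample_g <= 0 or blank_titre_ml <= 0:
        raise ValueError("sample mass and blank titre must be positive")
    # Milliequivalents of dichromate consumed by organic carbon
    # (initial amount minus the residual amount found by titration).
    reacted_meq = dichromate_ml * dichromate_normality * (
        1.0 - sample_titre_ml / blank_titre_ml)
    # 1 meq of dichromate oxidizes 0.003 g of carbon.
    oc_percent = reacted_meq * 0.003 * 100.0 * recovery_factor / sample_g
    return oc_percent, oc_percent * om_factor
```

With the 10 mL of 1N dichromate and 0.5 g of soil given in the text, only the two titre volumes need to be supplied.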

Color measurement and determination of RGB values

The assessment of soil color involved the acquisition of images of soil samples placed in Petri dishes, utilizing a Panasonic Lumix DMC-FZ7 camera affixed to a tripod for stability and precision. Images were captured in an outdoor setting, ensuring uniform exposure to natural daylight without the influence of artificial lighting sources. The RGB values of the captured images were subsequently quantified utilizing ImageJ v1.54k (Fig. 1) [15].
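The study measured RGB values in ImageJ. Purely as an illustration of the underlying averaging step, the sketch below computes the mean red, green, and blue values over a region of pixels. The nested-list pixel representation and the function name are assumptions made for this example; in practice an image library would supply the pixel data.

```python
def mean_rgb(pixels):
    """Average the R, G, and B channels over a region of interest.

    pixels: iterable of rows, each row an iterable of (r, g, b) tuples.
    Returns a tuple (mean_r, mean_g, mean_b).
    """
    totals = [0.0, 0.0, 0.0]
    count = 0
    for row in pixels:
        for r, g, b in row:
            totals[0] += r
            totals[1] += g
            totals[2] += b
            count += 1
    if count == 0:
        raise ValueError("region contains no pixels")
    return tuple(total / count for total in totals)
```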

Figure 1. RGB-to-SOM (soil organic matter) modeling workflow

Predicting soil organic matter (SOM) using regression and machine learning algorithms

K-means clustering of datasets

In the modeling process, the dataset was partitioned into a training set and a test set using k-means clustering. The training set included 35 samples (70% of the dataset), and the remaining 15 samples (30%) formed the test set. To keep the test set representative, its 15 samples were drawn at random from across the clusters produced by the k-means algorithm, and the remaining 35 samples were assigned to the training set [16]. The k-means clustering procedure was executed in SPSS.
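One way to draw a test set across clusters can be sketched as below. The clustering itself was done in SPSS; this example assumes cluster labels are already available and allocates test samples to each cluster in proportion to its size. The proportional allocation rule is an assumption, since the text does not state how many samples came from each cluster.

```python
import random
from collections import defaultdict


def cluster_based_split(cluster_labels, n_test=15, seed=0):
    """Split sample indices into (train, test), sampling test items per cluster."""
    n = len(cluster_labels)
    rng = random.Random(seed)
    members = defaultdict(list)
    for index, label in enumerate(cluster_labels):
        members[label].append(index)

    # Proportional allocation with the largest-remainder rule (integer math).
    base = {c: n_test * len(m) // n for c, m in members.items()}
    remainder = {c: n_test * len(m) % n for c, m in members.items()}
    shortfall = n_test - sum(base.values())
    for c in sorted(remainder, key=remainder.get, reverse=True)[:shortfall]:
        base[c] += 1

    test = []
    for c, m in members.items():
        test.extend(rng.sample(m, base[c]))
    test_set = set(test)
    train = [i for i in range(n) if i not in test_set]
    return train, sorted(test)
```

With 50 labelled samples and `n_test=15`, this returns 35 training indices and 15 test indices, matching the 70/30 partition described above.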

Multiple linear regression (MLR) model

Multiple linear regression (MLR) [17] was employed to analyze the linear association between soil organic matter (SOM) content, serving as the dependent variable, and RGB values, which served as the independent variables. The samples were partitioned into a training subset comprising 70% (35 samples) and a testing subset consisting of 30% (15 samples), with the selection of the testing subset conducted randomly. Regression coefficients were estimated by least-squares fitting. The general form of the regression equation is:

$$Y = a_0 + \sum_{i=1}^{n} a_i X_i$$

Where Y is the dependent variable (SOM), X_i are the independent variables (RGB), n is the number of RGB channels (3 values), a₀ is the constant in the equation, and a_i represent the coefficients of the descriptors. The SOM values were converted to pOM (pOM = –log OM) and used as the dependent variable, while the RGB values were standardized and used as the independent variables in the MLR.
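The two transformations described above can be sketched as follows. The logarithm base is not stated in the text; base 10 is assumed here by analogy with pH. Standardization is taken to mean z-scores computed with the sample standard deviation, which is also an assumption.

```python
import math
import statistics


def to_pom(om_percent):
    """Convert SOM content to pOM = -log10(OM); base 10 is an assumption."""
    return -math.log10(om_percent)


def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]
```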

Machine learning models

Conventional linear regression methods frequently fall short when it comes to accurately representing intricate, non-linear relationships among variables. As a result, the implementation of more sophisticated machine learning algorithms is necessary, taking into consideration the dataset and required task [18][19]. This research utilized a variety of supervised machine learning methodologies, including Support Vector Machines (SVM), Gaussian Process Regression (GPR), Least Squares Kernel (LSK), Ensemble Tree approaches, and Multilayer Perceptron Artificial Neural Networks (MLP-ANN), to create effective models for predicting soil organic matter (SOM) levels based on RGB spectral data [20][21]. The efficacy of each model was evaluated through the coefficient of determination (R²), root mean squared error (RMSE), and mean absolute error (MAE). Bayesian optimization was employed to refine the hyperparameters of models, encompassing kernel functions, box constraint levels, kernel scales, sigma values, and basis function configurations, in order to attain the highest predictive performance.
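The three evaluation metrics named above have standard definitions, sketched here in plain Python. The function names are chosen for this example.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error."""
    n = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error."""
    n = len(y_true)
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n
```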

Model validation

Leave Many Out cross validation (LMO-CV)

To evaluate the internal reliability of the statistical models, leave-many-out cross-validation (LMO-CV) was used. This approach entailed the systematic exclusion of 20% of the training dataset to serve as a validation subset. A cross-validated coefficient of determination (R²_cv) exceeding 0.5 was taken as the threshold for internal robustness [22].
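A leave-many-out loop of this kind can be sketched as below. The sketch assumes each 20% block is held out exactly once and that R²_cv is computed as 1 − PRESS / SS_tot over all held-out predictions; the text does not specify these details. The single-predictor `fit_line` model is a stand-in used only to keep the example self-contained.

```python
import random


def fit_line(xs, ys):
    """Least-squares line for one predictor; returns a predict(x) function."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def lmo_q2(xs, ys, fit, leave_out_fraction=0.2, seed=0):
    """Cross-validated R² with blocks of leave_out_fraction held out in turn."""
    n = len(ys)
    order = list(range(n))
    random.Random(seed).shuffle(order)
    block = max(1, round(leave_out_fraction * n))
    press = 0.0
    for start in range(0, n, block):
        held = order[start:start + block]
        held_set = set(held)
        kept = [i for i in order if i not in held_set]
        predict = fit([xs[i] for i in kept], [ys[i] for i in kept])
        press += sum((ys[i] - predict(xs[i])) ** 2 for i in held)
    mean_y = sum(ys) / n
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss_tot
```

For a 35-sample training set, a 20% block corresponds to 7 samples and therefore 5 rounds.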

Y-Randomization test (Y-scrambling test)

Y-randomization (Y-scrambling) was performed to evaluate possible random correlations. This approach assesses whether the initial model accurately represents a genuine relationship rather than being a product of random variation [23]. The dependent variable underwent random permutation, leading to the construction of a new model. The validity of the original model is established when the mean R² and R²_cv of the randomized models fall below the original R² [23].
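The permutation procedure can be sketched as follows. The `c_rp2` helper uses a commonly cited definition, cRp² = R · sqrt(R² − R_r²), where R_r² is the mean R² of the randomized models; the text does not give the formula, so this definition is an assumption. The squared Pearson correlation stands in for a model's R² only to keep the example self-contained (the two coincide for a one-predictor least-squares fit).

```python
import math
import random


def squared_pearson(xs, ys):
    """R² of a one-predictor least-squares fit (squared Pearson correlation)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


def mean_randomized_r2(xs, ys, score, n_rounds=100, seed=0):
    """Mean score after shuffling the response n_rounds times."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_rounds):
        shuffled = list(ys)
        rng.shuffle(shuffled)  # break any real x-y relationship
        scores.append(score(xs, shuffled))
    return sum(scores) / n_rounds


def c_rp2(r2_original, r2_randomized_mean):
    """cRp² = R * sqrt(R² - R_r²), floored at zero (assumed definition)."""
    gap = max(r2_original - r2_randomized_mean, 0.0)
    return math.sqrt(r2_original) * math.sqrt(gap)
```

A model that captures a genuine relationship should show a mean randomized R² well below its original R², which keeps cRp² high.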

External validation

The models’ performance underwent additional assessment through an external test set. The model’s capability to forecast soil organic matter (SOM) in the external test set was evaluated through various statistical measures, including Q²_F1, Q²_F2, root mean square error (RMSE), and mean absolute error (MAE) [22].
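For reference, the external validation statistics can be sketched as below, using their usual definitions: Q²_F1 scales the prediction error by the test-set spread around the training mean, Q²_F2 by the spread around the test mean, and the concordance correlation coefficient (CCC) follows Lin's formula with population variances. Function names are chosen for this example.

```python
def q2_f1(y_test, y_pred, y_train_mean):
    """1 - PRESS / sum of squared deviations from the training-set mean."""
    press = sum((y - p) ** 2 for y, p in zip(y_test, y_pred))
    ss = sum((y - y_train_mean) ** 2 for y in y_test)
    return 1.0 - press / ss


def q2_f2(y_test, y_pred):
    """1 - PRESS / sum of squared deviations from the test-set mean."""
    mean_test = sum(y_test) / len(y_test)
    press = sum((y - p) ** 2 for y, p in zip(y_test, y_pred))
    ss = sum((y - mean_test) ** 2 for y in y_test)
    return 1.0 - press / ss


def ccc(y_true, y_pred):
    """Lin's concordance correlation coefficient."""
    n = len(y_true)
    mt, mp = sum(y_true) / n, sum(y_pred) / n
    var_t = sum((y - mt) ** 2 for y in y_true) / n
    var_p = sum((p - mp) ** 2 for p in y_pred) / n
    cov = sum((y - mt) * (p - mp) for y, p in zip(y_true, y_pred)) / n
    return 2.0 * cov / (var_t + var_p + (mt - mp) ** 2)
```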

Model comparison

The Multi-Criteria Decision Making (MCDM) module was utilized to systematically rank and determine the most effective model through a concurrent assessment of its performance across various statistical parameters. This methodology establishes a normalized scoring system that spans from 0 to 1, with 0 representing the lowest performance and 1 indicating the highest. The model that attains the highest score in multi-criteria decision-making is recognized for demonstrating superior performance in both internal and external validation. This is evidenced by critical metrics including elevated R², adjusted R², Q²_LMO, and the concordance correlation coefficient (CCC), alongside external validation indicators (Q²_F1, Q²_F2), in addition to exhibiting low values for root mean square error (RMSE) and mean absolute error (MAE). As a result, the model exhibiting the highest MCDM score is identified as the most optimal choice [24-26].
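The text does not specify which MCDM scheme was used. As one plausible sketch, the function below min-max normalizes each criterion to the 0 to 1 range, inverts error metrics so that lower values score higher, and averages the criteria with equal weights. The equal weighting, the normalization rule, and the criterion names are all assumptions.

```python
def mcdm_scores(table, lower_is_better=("RMSE", "MAE")):
    """Score models on a 0-1 scale from a {model: {criterion: value}} table."""
    criteria = list(next(iter(table.values())))
    scores = {model: 0.0 for model in table}
    for criterion in criteria:
        values = [table[model][criterion] for model in table]
        low, high = min(values), max(values)
        for model in table:
            if high == low:
                normalized = 1.0  # all models tie on this criterion
            else:
                normalized = (table[model][criterion] - low) / (high - low)
                if criterion in lower_is_better:
                    normalized = 1.0 - normalized
            scores[model] += normalized / len(criteria)
    return scores
```

Under this scheme a model reaches a score of 1 only if it is the best on every criterion, and a score of 0 only if it is the worst on every criterion.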

Results

Chemical analysis of soil samples

The content of soil organic matter in the collected samples was assessed utilizing the wet oxidation technique (Table 1). The percentage of soil organic matter ranged from 0.34% in sample C4 to 3.38% in sample A17, reflecting the diversity of soil types examined in this study. The coefficient of variation was determined to be 47%, indicating the variability of organic matter content across the selected locations, which is crucial for developing robust SOM predictive models.

Soil organic matter prediction using MLR model

The MLR process resulted in the following model:

pOM = -0.1216 + 0.042(R) + 0.053(G) + 0.12(B)

Where:

pOM is the predicted soil organic matter content (pOM = –logOM)

R, G, and B are the average red, green, and blue readings of a soil sample image
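Applying the fitted equation can be sketched as below. The coefficients are those reported above. Two assumptions are made: the inputs are the standardized channel values described in the methods, and pOM uses a base-10 logarithm, so that OM is recovered as 10 raised to the power of −pOM.

```python
def predict_pom(r_std, g_std, b_std):
    """Evaluate the reported MLR equation on standardized R, G, B values."""
    return -0.1216 + 0.042 * r_std + 0.053 * g_std + 0.12 * b_std


def pom_to_om(pom):
    """Back-transform pOM to SOM content, assuming pOM = -log10(OM)."""
    return 10.0 ** (-pom)
```

Because pOM is a negative logarithm, a larger predicted pOM corresponds to a lower organic matter content.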

As shown in the equation, the RGB values exhibited a good linear relationship with SOM content, with the blue channel carrying the largest positive coefficient (+0.12). Calculated metrics for this model were: n=35, R²_(train)=0.73, R²_(adj.)=0.71, RMSE_(train)=0.127, P<0.001, R²_(test)=0.6, RMSE_(test)=0.12, R²_cv=0.63. The R²_(train) and RMSE_(train) values show that the MLR model is statistically acceptable. The R²_cv value of 0.63, being greater than 0.5, indicates acceptable internal predictive ability on samples held out during cross-validation.

Y-Randomization test

A Y-randomization test was performed to assess the robustness of the MLR model. The dependent variable (pOM) was randomly scrambled while keeping the independent variables (RGB values) constant, and 100 new models were generated. These new models consistently yielded lower R² and R²_cv values compared to the original model (Table 2). The Y-randomization test result, with a cRp² of 0.71, confirms the robustness of the developed MLR model and indicates that the predicted pOM values are not due to chance [23].

Machine learning models

A notable linear correlation was identified through the MLR model (R²_train=0.73); however, this does not suggest a flawless linear relationship. A portion of the variation in the dependent variable is not accounted for by the linear model. In pursuit of enhancing model performance, several machine learning models were investigated. A variety of supervised algorithms, including SVM, GPR, LSK, Ensemble Tree, and MLP-ANN, were utilized to create more robust models for estimating SOM based on RGB values. The determination of optimal hyperparameters for these models was achieved through Bayesian optimization utilizing Expected Improvement acquisition functions, with the objective of minimizing the Mean Squared Error (MSE).

The Gaussian Process Regression model, refined through Bayesian optimization with a linear basis function and a sigma value of 1.4641, exhibited notable predictive accuracy, as indicated by a low Root Mean Square Error of 0.071. This demonstrates its efficacy in representing the relationship between RGB and SOM. Although GPR reached a training R² of 0.91, its performance on the external test set was lower (Q²_F1 = 0.71). The SVM model demonstrated strong training performance (R²_train=0.89); however, Q²_F1 dropped to 0.2 when the model was evaluated on the test set, suggesting a lack of robustness with external data (Table 3 and Supplementary Table 1).

The Ensemble tree algorithm demonstrated efficacy in elucidating the correlation between SOM and RGB across both the training and test datasets. The results indicated a strong performance, with R²_train and Q²_F2 values recorded at 0.89 and 0.63, respectively (Table 3 and Fig. 2).

The MLP-ANN was trained with a backpropagation algorithm. The number of neurons in the hidden layer was selected by varying the neuron count from 1 to 20 and comparing the corresponding root mean square error values. The network configuration comprising three neurons within a single hidden layer (3/3/1) demonstrated the highest level of performance for the specified inputs. The dataset was divided into training (70%), testing (15%), and validation (15%) subsets for the artificial neural network model. The model demonstrated a strong R²_train of 0.84 and a Q²_F2 of 0.82, indicating significant predictive capability, which is further corroborated by a Q²_LMO of 0.83. Although the LSK model demonstrated strong performance during the training phase, evidenced by an R² of 0.84, the statistical metrics exhibited a decline during both external and internal validation (Table 3 and Supplementary Table 1).

Model comparison

The predictive capability of the models for the external test set was evaluated using several metrics, including Q²_F1, Q²_F2, RMSE_test, MAE_test, and CCC_test (Table 3). Multicriteria Decision Making (MCDM) analysis revealed that the ANN model attained the highest score (MCDM score = 1), followed by the GPR model with a score of 0.79, and the Ensemble tree model with a score of 0.65 (Table 3). These results highlight the robustness and predictive capacity of these models, particularly the ANN, for estimating soil organic matter content in new samples. The consistent performance across different datasets further suggests the models’ reliability, generalizability, and broader applicability.

Discussion

This study employed various machine learning (ML) models to predict soil organic matter (SOM) from RGB values of soil samples in Homs Governorate, revealing significant performance variability across algorithms. The Multiple Linear Regression (MLR) baseline model established a linear relationship (R²_train=0.73, Q²_F1=0.61), with the blue channel carrying the largest coefficient (+0.12). This observation is consistent with prior research demonstrating color channel importance in SOM prediction [11][12]. Although MLR yielded statistically acceptable outcomes, its moderate performance corresponds with other research findings [27][28], highlighting the limitations of linear models in accurately representing the nonlinearities of soil properties, even with robustness validated through Y-randomization (cRp²=0.71).

Advanced machine learning (ML) techniques demonstrated superior performance compared to MLR. Specifically, Gaussian Process Regression (GPR) achieved the highest training accuracy (R²=0.91), though its external validation performance (Q²_F1=0.71) was reduced, aligning with established insights into GPR’s susceptibility to constrained datasets [29][30]. Support Vector Machine (SVM) showed strong training results (R²=0.89) but poor generalizability (Q²_F1=0.2). The Ensemble Tree’s balanced performance (R²=0.89, Q²_F1=0.63) and Artificial Neural Network’s (ANN) superior robustness (R²=0.84, Q²_F1=0.83) with optimal 3-neuron architecture corroborate other works on ANN efficacy for soil classification [31]. These findings demonstrate RGB-based ML models’ potential as cost-effective SOM prediction tools [7][32], though future work should address dataset limitations and environmental variability to enhance field applicability [19].

Conclusion

This research assessed the potential of different machine learning models to predict soil organic matter (SOM) in soil samples obtained from diverse locations within the Homs governorate, utilizing RGB spectral data. Among those explored, ANN, Ensemble Tree, and GPR demonstrated the most promising performance, outperforming SVM, LSK, and MLR. Validation confirmed the reliability and generalizability of these models, particularly the ANN, which exhibited superior validity and robustness. Consequently, artificial neural networks could serve as a significant, versatile instrument for evaluating soil organic matter content using standard RGB values, providing a more economical and time-saving option compared to conventional laboratory methods.

Supplementary Files

Supplementary Table

Conflict of interest statement
The authors declared no conflict of interest.
Funding statement
The authors declared that no funding was received in relation to this manuscript.
Data availability statement
The authors declared that the used imagery and RGB dataset will be available upon reasonable request from the corresponding author.

References

1. Schulze DG, Nagel JL, Van Scoyoc GE, Henderson TL, Baumgardner MF, Stott DE. Significance of organic matter in determining soil colors. Soil color. 1993;31:71-90.
2. Blume HP, Brümmer GW, Horn R, Kandeler E, Kögel-Knabner I, Kretzschmar R, Stahr K, Wilke BM. Scheffer/Schachtschabel: Lehrbuch der Bodenkunde. Springer-Verlag. 2016.
3. Chenu C, Rumpel C, Védère C, Barré P. Methods for studying soil organic matter: nature, dynamics, spatial accessibility, and interactions with minerals. In Soil microbiology, ecology and biochemistry. Elsevier. 2024:369-406.
4. Kirillova NP, Grauer-Gray J, Hartemink AE, Sileova TM, Artemyeva ZS, Burova EK. New perspectives to use Munsell color charts with electronic devices. Comput. Electron. Agric. 2018;155:378-85.
5. Swetha RK, Dasgupta S, Chakraborty S, Li B, Weindorf DC, Mancini M, Silva SH, Ribeiro BT, Curi N, Ray DP. Using Nix color sensor and Munsell soil color variables to classify contrasting soil types and predict soil organic carbon in Eastern India. Comput. Electron. Agric. 2022;199:107192.
6. Łachacz A, Załuski D. The usefulness of the Munsell colour indices for identification of drained soils with various content of organic matter. J. Soils Sediments. 2023;23(11):4017-31.
7. Grunwald S. Artificial intelligence and soil carbon modeling demystified: power, potentials, and perils. Carbon Footprints. 2022;1(1).
8. Swathi Kumari H, Veeramanju KT. Predictive models for optimal irrigation scheduling and water management: a review of AI and ML approaches. Int. J. Manag. Technol. Soc. Sci. 2024;9(2):94-110.
9. Khatti J, Grover KS. Prediction of UCS of fine-grained soil based on machine learning part 2: comparison between hybrid relevance vector machine and Gaussian process regression. Multiscale Multidiscip. Model. Exp. Des. 2024;7(1):123-63.
10. Maniyath SR, Hebbar R, Subramoniam SR. Soil color detection using Knn classifier. In 2018 International Conference on Design Innovations for 3Cs Compute Communicate Control (ICDI3C). IEEE. 2018:52-5.
11. Wu C, Xia J, Yang H, Yang Y, Zhang Y, Cheng F. Rapid determination of soil organic matter content based on soil colour obtained by a digital camera. Int. J. Remote Sens. 2018;39(20):6557-71.
12. Yang J, Shen F, Wang T, Luo M, Li N, Que S. Effect of smart phone cameras on color-based prediction of soil organic matter content. Geoderma. 2021;402:115365.
13. Taneja P, Vasava HK, Daggupati P, Biswas A. Multi-algorithm comparison to predict soil organic matter and soil moisture content from cell phone images. Geoderma. 2021;385:114863.
14. Walkley A, Black IA. An examination of the Degtjareff method for determining soil organic matter, and a proposed modification of the chromic acid titration method. Soil Sci. 1934;37(1):29-38.
15. Rasband W. ImageJ: A public domain Java image processing program. National Institute of Mental Health, Bethesda, Maryland, USA. 2008.
16. Leonard JT, Roy K. On selection of training and test sets for the development of predictive QSAR models. QSAR Comb. Sci. 2006;25(3):235-51.
17. Uyanık GK, Güler N. A study on multiple linear regression analysis. Procedia Soc. Behav. Sci. 2013;106:234-40.
18. Zeraatpisheh M, Ayoubi S, Mirbagheri Z, Mosaddeghi MR, Xu M. Spatial prediction of soil aggregate stability and soil organic carbon in aggregate fractions using machine learning algorithms and environmental variables. Geoderma Reg. 2021;27:e00440.
19. Heil J, Jörges C, Stumpe B. Fine-scale mapping of soil organic matter in agricultural soils using UAVs and machine learning. Remote Sens. 2022;14(14):3349.
20. Singh A, Thakur N, Sharma A. A review of supervised machine learning algorithms. In 2016 3rd international conference on computing for sustainable global development (INDIACom). IEEE. 2016:1310-5.
21. Popescu MC, Balas VE, Perescu-Popescu L, Mastorakis N. Multilayer perceptron and neural networks. WSEAS Trans. Circuits Syst. 2009;8(7):579-88.
22. Golbraikh A, Tropsha A. Beware of q2!. J. Mol. Graph. Model. 2002;20(4):269-76.
23. Kennedy PE, Cade BS. Randomization tests for multiple regression. Commun. Stat. Simul. Comput. 1996;25(4):923-36.
24. Akinsola JE, Awodele O, Kuyoro SO, Kasali FA. Performance evaluation of supervised machine learning algorithms using multi-criteria decision making techniques. In Proceedings of the International Conference on Information Technology in Education and Development (ITED). 2019:17-34.
25. Triantaphyllou E. Multi-criteria decision-making methods. In Multi-criteria decision making methods: A comparative study. Springer, Boston, MA. 2000:5-21.
26. Pore S, Pelloux A, Chatterjee M, Banerjee A, Roy K. Machine learning-based q-RASAR predictions of the bioconcentration factor of organic molecules estimated following the organisation for economic co-operation and development guideline 305. J. Hazard. Mater. 2024;479:135725.
27. El Jamaoui I, Sánchez MJ, Sirvent CP, Mana AA, López SM. Machine learning-driven modeling for soil organic carbon estimation from multispectral drone imaging: A case study in Corvera, Murcia (Spain). Model. Earth Syst. Environ. 2024;10(3):3473-94.
28. Forkuor G, Hounkpatin OK, Welp G, Thiel M. High resolution mapping of soil properties using remote sensing variables in south-western Burkina Faso: a comparison of machine learning and multiple linear regression models. PloS one. 2017;12(1):e0170478.
29. Lee K, Cho H, Lee I. Variable selection using Gaussian process regression-based metrics for high-dimensional model approximation with limited data. Struct. Multidiscip. Optim. 2019;59(5):1439-54.
30. Mohammed RO, Cawley GC. Over-fitting in model selection with Gaussian process regression. In International Conference on Machine Learning and Data Mining in Pattern Recognition. Cham: Springer International Publishing. 2017:192-205.
31. Macedo dos Santos-Tonial L, Colla MS, Carra JB, Fabris M, de Lima VA. Classification and total carbon determination of the soils using RGB digital images combined with machine learning. Commun. Soil Sci. Plant Anal. 2023;54(2):141-53.
32. Lim HH, Cheon E, Lee SR. Machine learning and hyperspectral imaging to predict soil water content: methodology and field validation. Earth Sci. Inform. 2025;18(1):109.

Cite this article:

Mansur, N., Abbod, M. Machine learning-based estimation of soil organic matter using RGB values. DYSONA – Applied Science, 2026;7(1): 73-81. doi: 10.30493/das.2025.539371
