Performance Evaluation of Classification Methods Utilizing Resampling Techniques for Water Quality Prediction on Imbalanced Data
Received: 30 April 2025 | Revised: 16 June 2025 | Accepted: 28 June 2025 | Online: 2 August 2025
Corresponding author: Rahmi Fadhilah
Abstract
Water quality anomaly detection with Machine Learning (ML) classifiers commonly faces two challenges: imbalanced class distributions and missing data. Classifiers trained on imbalanced datasets tend to be biased toward the majority class and neglect the minority class, while incomplete datasets limit the applicability of more complex models and hinder thorough analysis. This research addresses both problems by proposing a robust framework for an ML-based water quality anomaly detection system that combines missing-data imputation with several resampling techniques. A comparative study was conducted on six imputation methods for missing data, including Expectation Maximization (EM) and Multiple Imputation by Chained Equations (MICE), alongside three resampling techniques: Random Under Sampling (RUS), the Rapidly Converging Gibbs (RACOG) sampler, and RACOG combined with RUS (RACOG-RUS). These methods were evaluated across three classifiers: Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Naïve Bayes (NB). The models were assessed using stratified 5-fold cross-validation on accuracy, Area Under the Receiver Operating Characteristic Curve (ROC-AUC), and F1-score. Further experiments incorporated feature selection methods such as Boruta and Mean Decrease Accuracy (MDA) to optimize performance. The results show that RF combined with RACOG-RUS and EM achieved the highest F1-score, 0.9954, effectively addressing both class imbalance and missing data. Additionally, computational analysis highlights the efficiency of RF when optimized with appropriate hyperparameters.
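The evaluation protocol named in the abstract (Random Under Sampling, stratified 5-fold cross-validation, F1-score) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the synthetic one-feature data, and the nearest-class-mean classifier standing in for RF, XGBoost, or NB are all assumptions made for illustration, and RACOG, EM, and MICE are not reproduced. The sketch also assumes that resampling is applied to the training folds only, which the abstract does not state.

```python
import random
from collections import defaultdict

def random_under_sample(X, y, seed=0):
    """RUS: randomly keep only as many rows per class as the smallest class has."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    n_min = min(len(idx) for idx in by_class.values())
    keep = sorted(i for idx in by_class.values() for i in rng.sample(idx, n_min))
    return [X[i] for i in keep], [y[i] for i in keep]

def stratified_folds(y, k=5, seed=0):
    """Split indices into k folds so each fold keeps roughly the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for idx in by_class.values():
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def f1_score(y_true, y_pred, positive=1):
    """F1 = 2TP / (2TP + FP + FN) for the chosen positive (minority) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def fit_nearest_mean(X, y):
    """Stand-in classifier (assumption): store the mean feature value per class."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, label in zip(X, y):
        sums[label] += x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}

def predict_nearest_mean(means, X):
    """Predict the class whose stored mean is closest to each value."""
    return [min(means, key=lambda c: abs(x - means[c])) for x in X]

def cross_validate(X, y, k=5, seed=0):
    """Stratified k-fold CV; under-sampling touches the training part only."""
    folds = stratified_folds(y, k, seed)
    scores = []
    for f, test_idx in enumerate(folds):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        X_bal, y_bal = random_under_sample(
            [X[i] for i in train_idx], [y[i] for i in train_idx], seed + f
        )
        model = fit_nearest_mean(X_bal, y_bal)
        y_pred = predict_nearest_mean(model, [X[i] for i in test_idx])
        scores.append(f1_score([y[i] for i in test_idx], y_pred))
    return scores

# Synthetic imbalanced data (assumption): 90 normal readings, 10 anomalies.
data_rng = random.Random(42)
X = [data_rng.gauss(0, 1) for _ in range(90)] + [data_rng.gauss(3, 1) for _ in range(10)]
y = [0] * 90 + [1] * 10

scores = cross_validate(X, y)
print("mean F1 over folds:", sum(scores) / len(scores))
```

Keeping the held-out fold at its original class ratio means the reported F1-score reflects the imbalance the deployed model would face; resampling before splitting would let duplicated or synthetic rows leak between training and test data.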
Keywords:
water quality monitoring, machine learning classifier, class imbalance, missing value methods
License
Copyright (c) 2025 Rahmi Fadhilah, Heri Kuswanto, Dedy Dwi Prastyo

This work is licensed under a Creative Commons Attribution 4.0 International License.
