Performance Evaluation of Classification Methods Utilizing Resampling Techniques for Water Quality Prediction on Imbalanced Data

Authors

  • Rahmi Fadhilah, Department of Statistics, Sepuluh Nopember Institute of Technology, Surabaya, Indonesia
  • Heri Kuswanto, Department of Statistics, Sepuluh Nopember Institute of Technology, Surabaya, Indonesia
  • Dedy Dwi Prastyo, Department of Statistics, Sepuluh Nopember Institute of Technology, Surabaya, Indonesia
Volume: 15 | Issue: 4 | Pages: 26091-26099 | August 2025 | https://doi.org/10.48084/etasr.11832

Abstract

Water quality anomaly detection with Machine Learning (ML) classifiers commonly faces two challenges: imbalanced class distributions and missing data. Classifiers trained on imbalanced datasets tend to favor the majority class and neglect the minority class, which biases their reported accuracy, while incomplete datasets limit the applicability of more complex models and hinder thorough analysis. This research addresses both problems by proposing a robust framework for an ML-based water quality anomaly detection system that uses several resampling techniques. A comparative study was conducted on six imputation methods for missing data, including Expectation Maximization (EM) and Multiple Imputation by Chained Equations (MICE), alongside three resampling techniques: Random Under Sampling (RUS), the Rapidly Converging Gibbs (RACOG) sampler, and RACOG combined with RUS (RACOG-RUS). These methods were evaluated across three classifiers: Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Naïve Bayes (NB). The models were assessed using stratified 5-fold cross-validation and compared on accuracy, Receiver Operating Characteristic Area Under Curve (ROC-AUC), and F1-score. Further experiments incorporated feature selection methods such as Boruta and Mean Decrease Accuracy (MDA) to optimize performance. The results show that RF combined with RACOG-RUS and EM achieved the highest F1-score of 0.9954, effectively addressing both class imbalance and missing data. Additionally, computational analysis highlights the efficiency of RF when optimized with appropriate hyperparameters.
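
The following is a minimal sketch, not the authors' implementation, of three generic building blocks named in the abstract: random under-sampling, stratified 5-fold splitting, and the F1-score. It uses only the Python standard library. The function names, the synthetic labels, and the choice to resample only the training folds are assumptions made for illustration; imputation, RACOG, feature selection, and the classifiers themselves are omitted.

```python
import random
from collections import defaultdict


def random_under_sample(labels, seed=0):
    """Return sorted row indices with every class cut down to the minority-class size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    n_min = min(len(idx) for idx in by_class.values())
    kept = []
    for idx in by_class.values():
        kept.extend(rng.sample(idx, n_min))  # sample without replacement
    return sorted(kept)


def stratified_folds(labels, k=5, seed=0):
    """Split row indices into k folds that roughly preserve the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for idx in by_class.values():
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)  # deal each class round-robin across folds
    return folds


def f1_score(y_true, y_pred, positive=1):
    """F1 = 2TP / (2TP + FP + FN) for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    labels = [0] * 90 + [1] * 10  # synthetic labels, for illustration only
    folds = stratified_folds(labels, k=5)
    for held_out in range(5):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        train_labels = [labels[i] for i in train]
        # Assumption: balance the training rows only; the held-out fold stays untouched.
        balanced = [train[j] for j in random_under_sample(train_labels, seed=held_out)]
        print(held_out, len(train), len(balanced))
```

Resampling inside each training split, rather than before splitting, keeps the held-out fold at its original class ratio, so the F1-score is measured on data that resembles the imbalanced source.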

Keywords:

water quality monitoring, machine learning classifier, class imbalance, missing value methods

How to Cite

[1]
R. Fadhilah, H. Kuswanto, and D. D. Prastyo, “Performance Evaluation of Classification Methods Utilizing Resampling Techniques for Water Quality Prediction on Imbalanced Data”, Eng. Technol. Appl. Sci. Res., vol. 15, no. 4, pp. 26091–26099, Aug. 2025.
