Optimization of Random Forest for Health Data Classification Using PCA and K-Means SMOTE-ENN

Authors

  • Dadang Priyanto, Department of Computer Science, Universitas Bumigora, Mataram, Indonesia
  • Hairani Hairani, Department of Computer Science, Universitas Bumigora, Mataram, Indonesia (ORCID: https://orcid.org/0000-0002-6756-5896)
  • Khairan Marzuki, Department of Computer Science, Universitas Bumigora, Mataram, Indonesia
  • Muhammad Innuddin, Department of Computer Science, Universitas Bumigora, Mataram, Indonesia
Volume: 15 | Issue: 5 | Pages: 27646-27652 | October 2025 | https://doi.org/10.48084/etasr.12976

Abstract

Health data classification is challenging because health data are typically high-dimensional and have imbalanced class distributions. These characteristics complicate the training of classification models and degrade their performance and accuracy, so a method is needed that addresses both data complexity and class imbalance to yield accurate and reliable results. This study aims to improve the performance of the Random Forest (RF) classification model on health data by integrating two approaches: Principal Component Analysis (PCA) and K-Means SMOTE-ENN. PCA reduces data dimensionality while extracting the most informative features, which reduces noise and computational demands. K-Means SMOTE-ENN balances the class distribution by combining clustering-based oversampling with Edited Nearest Neighbors (ENN) data cleaning, which mitigates the overfitting caused by unrepresentative synthetic data. RF was chosen as the classifier for its strong performance on data with high dimensionality and complex variable interactions. Experimental results indicate that the joint application of PCA and K-Means SMOTE-ENN significantly enhances model performance. On the Pima Indians Diabetes dataset, accuracy rose to 98.41% and the Area Under the Curve (AUC) reached 98.33%. On the Heart Disease dataset, an accuracy of 97.56% and an AUC of 97.73% were achieved. Compared with previous methods, the proposed approach improves accuracy by 2.91% over SMOTE with a Stacking Ensemble on the Pima Indians Diabetes dataset, and improves accuracy by 6.26% and AUC by 14.73% over XGBoost on the Heart Disease dataset. These results show that combining PCA with K-Means SMOTE-ENN significantly improves the performance of RF on imbalanced healthcare data.
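
The resampling idea described in the abstract (oversample the minority class inside clusters, then clean with ENN) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' implementation. The function names, the toy points, the farthest-point k-means initialisation, the even split of synthetic samples across clusters, and the majority-vote ENN rule are all assumptions chosen for brevity, and the PCA and RF stages are omitted.

```python
import math
import random


def nearest(i, points, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def kmeans(X, n_clusters, iters=20):
    """Lloyd iterations with a deterministic farthest-point initialisation (assumption)."""
    centers = [X[0]]
    while len(centers) < n_clusters:
        centers.append(max(X, key=lambda x: min(math.dist(x, c) for c in centers)))
    labels = [0] * len(X)
    for _ in range(iters):
        labels = [min(range(n_clusters), key=lambda c: math.dist(x, centers[c])) for x in X]
        for c in range(n_clusters):
            members = [x for x, lab in zip(X, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def kmeans_smote(X, y, minority_label, n_clusters, k, rng):
    """Oversample the minority class only inside minority-dominated clusters (binary labels)."""
    labels = kmeans(X, n_clusters)
    n_needed = max(0, len(y) - 2 * y.count(minority_label))  # majority minus minority count
    groups = []
    for c in range(n_clusters):
        pts = [x for x, lab, t in zip(X, labels, y) if lab == c and t == minority_label]
        if len(pts) >= 2 and 2 * len(pts) > labels.count(c):
            groups.append(pts)
    X_new, y_new = list(X), list(y)
    for g, pts in enumerate(groups):
        # Simplification: split the quota evenly instead of weighting by cluster sparsity.
        quota = n_needed // len(groups) + (1 if g < n_needed % len(groups) else 0)
        for _ in range(quota):
            i = rng.randrange(len(pts))
            j = rng.choice(nearest(i, pts, k))
            lam = rng.random()
            # SMOTE interpolation: a random point on the segment between two minority samples.
            X_new.append(tuple(a + lam * (b - a) for a, b in zip(pts[i], pts[j])))
            y_new.append(minority_label)
    return X_new, y_new


def enn(X, y, k=3):
    """Edited Nearest Neighbours (majority-vote variant): drop samples whose label
    disagrees with most of their k nearest neighbours."""
    keep = []
    for i in range(len(X)):
        nbrs = nearest(i, X, k)
        if 2 * sum(y[j] == y[i] for j in nbrs) > len(nbrs):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (invented for illustration): a majority grid, a minority blob,
# and one minority point sitting inside the majority region as noise.
majority_pts = [(0.5 * a, 0.5 * b) for a in range(4) for b in range(5)]
minority_pts = [(5.0, 5.0), (5.5, 5.0), (5.0, 5.5), (5.5, 5.5), (6.0, 5.0), (5.0, 6.0)]
noisy_pts = [(0.75, 0.75)]
X = majority_pts + minority_pts + noisy_pts
y = [0] * len(majority_pts) + [1] * (len(minority_pts) + len(noisy_pts))

rng = random.Random(0)
X_bal, y_bal = kmeans_smote(X, y, minority_label=1, n_clusters=2, k=3, rng=rng)
X_clean, y_clean = enn(X_bal, y_bal, k=3)
print("after oversampling:", y_bal.count(0), y_bal.count(1))
print("after ENN cleaning:", y_clean.count(0), y_clean.count(1))
```

In this toy setting the synthetic points are generated only inside the minority-dominated cluster, and the cleaning step discards the isolated minority point that lies among majority samples. In practice one would more likely reach for library implementations of these steps and place them, together with PCA and the RF classifier, in a single pipeline fitted on training folds only.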

Keywords:

K-Means SMOTE-ENN, PCA, health data, data imbalance, data reduction

How to Cite

D. Priyanto, H. Hairani, K. Marzuki, and M. Innuddin, "Optimization of Random Forest for Health Data Classification Using PCA and K-Means SMOTE-ENN," Eng. Technol. Appl. Sci. Res., vol. 15, no. 5, pp. 27646–27652, Oct. 2025.
