Optimization of Random Forest for Health Data Classification Using PCA and K-Means SMOTE-ENN
Received: 1 July 2025 | Revised: 21 July 2025 and 3 August 2025 | Accepted: 15 August 2025 | Online: 26 August 2025
Corresponding author: Dadang Priyanto
Abstract
Health data classification is challenging because health data typically combine high dimensionality with imbalanced class distributions. Both properties complicate model training and degrade classification performance, so a method is needed that handles data complexity and class imbalance together and yields accurate, reliable results. This study aims to improve the performance of the Random Forest (RF) classifier on health data by integrating two approaches: Principal Component Analysis (PCA) and K-Means SMOTE-ENN. PCA reduces the data dimensionality while retaining the most informative components, which minimizes noise and lowers computational demands. K-Means SMOTE-ENN balances the class distribution by combining clustering-based oversampling with Edited Nearest Neighbors (ENN) data cleaning, which mitigates the overfitting caused by unrepresentative synthetic data. RF was chosen as the classifier for its strong performance on high-dimensional data with complex variable interactions. Experimental results indicate that the joint application of PCA and K-Means SMOTE-ENN significantly enhances model performance. On the Pima Indians Diabetes dataset, accuracy rose to 98.41% and the Area Under the Curve (AUC) reached 98.33%. On the Heart Disease dataset, the model achieved an accuracy of 97.56% and an AUC of 97.73%. Compared with previous methods, the proposed approach improves accuracy by 2.91% over SMOTE with a Stacking Ensemble on the Pima Indians Diabetes dataset, and improves accuracy by 6.26% and AUC by 14.73% over XGBoost on the Heart Disease dataset.
These results show that combining PCA with K-Means SMOTE-ENN significantly improves the performance of RF on imbalanced healthcare data.
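The resampling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes plain SMOTE interpolation in place of K-Means SMOTE (the clustering step is omitted), an ENN rule that drops any sample outvoted by its k nearest neighbours, and it leaves out PCA and RF entirely. The function names `nearest`, `smote`, and `enn` and the toy data are hypothetical.

```python
import math
import random


def nearest(idx, points, candidates, k):
    """Return the k candidate indices closest to points[idx] (Euclidean), excluding idx."""
    others = [j for j in candidates if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]


def smote(X, y, minority, n_new, k=3, seed=0):
    """Append n_new synthetic minority samples, each interpolated
    between a minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(y) if label == minority]
    X_out, y_out = list(X), list(y)
    for _ in range(n_new):
        i = rng.choice(minority_idx)
        j = rng.choice(nearest(i, X, minority_idx, k))
        gap = rng.random()  # position along the segment from X[i] to X[j]
        X_out.append(tuple(a + gap * (b - a) for a, b in zip(X[i], X[j])))
        y_out.append(minority)
    return X_out, y_out


def enn(X, y, k=3):
    """Keep only samples whose label matches the majority of their k nearest neighbours."""
    all_idx = range(len(X))
    keep = []
    for i in all_idx:
        votes = [y[j] for j in nearest(i, X, all_idx, k)]
        if votes.count(y[i]) * 2 > len(votes):
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data: class 0 is the majority; (5.2, 5.2) is a class-0 point
# sitting inside the class-1 region, i.e. a noisy sample.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5.2, 5.2),
     (5, 5), (5, 6), (6, 5), (6, 6)]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

X_bal, y_bal = smote(X, y, minority=1, n_new=2)  # oversample class 1 to 6 samples
X_clean, y_clean = enn(X_bal, y_bal)             # remove samples outvoted by neighbours
print(len(X), len(X_bal), len(X_clean))
```

In this toy setting the oversampling step evens out the two classes, and the cleaning step then discards the class-0 point that lies among class-1 samples. In practice, library implementations such as those in imbalanced-learn would normally be used instead of hand-written routines.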
Keywords:
K-Means SMOTE-ENN, PCA, health data, data imbalance, data reduction
License
Copyright (c) 2025 Dadang Priyanto, Hairani Hairani, Khairan Marzuki, Muhammad Innuddin

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.