Average Gain Ratio and Correlation-Based Feature Selection for Imprecise Classification in Algorithm C4.5
Received: 8 January 2025 | Revised: 17 February 2025 and 23 April 2025 | Accepted: 27 April 2025 | Online: 2 August 2025
Corresponding author: Saeful Amri
Abstract
The attribute-split step of the C4.5 algorithm has proven successful at building classifiers in the form of trees that are easy to understand and interpret. However, the split process tends to favor attributes with many values, even when they do not necessarily play a major role in the classification result, and it does not consider the correlation between attributes and labels, which degrades classification performance. The Average Gain Ratio (AGR) has been shown to overcome these weaknesses in the split process. Correlation-Based Feature Selection (CBFS) is also applied during splitting to measure the correlation of each attribute with the label. This study uses the AGR method to address the split-criterion problem and the CBFS method to select attributes correlated with the label, improving the performance of the C4.5 algorithm. Selecting split attributes with AGR, and the comparison with the CBFS method, was shown to improve the performance of the C4.5 classifier, as indicated by the average results for accuracy (87%), sensitivity (91%), G-mean (85%), AUC (84%), and cost (3.67) on six UCI datasets.
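The abstract does not give the exact AGR formula, but both AGR and CBFS build on standard information-theoretic quantities: C4.5's gain-ratio split criterion and the attribute–label correlation (commonly measured as symmetrical uncertainty) that CBFS evaluates. The sketch below shows these two standard building blocks for categorical attributes; the function names and the use of symmetrical uncertainty as the CBFS correlation measure are assumptions, not the paper's exact implementation.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(Y) of a list of discrete values."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def _conditional_entropy(feature_values, labels):
    """H(Y | X): entropy of the labels within each attribute-value group."""
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    n = len(labels)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def gain_ratio(feature_values, labels):
    """C4.5 split criterion: information gain normalized by split information,
    which penalizes attributes that fragment the data into many values."""
    info_gain = entropy(labels) - _conditional_entropy(feature_values, labels)
    split_info = entropy(feature_values)  # entropy of the partition itself
    return info_gain / split_info if split_info > 0 else 0.0

def symmetrical_uncertainty(feature_values, labels):
    """SU(X, Y) = 2·IG / (H(X) + H(Y)): an attribute-label correlation in
    [0, 1], a common correlation measure in CBFS-style selection."""
    hx, hy = entropy(feature_values), entropy(labels)
    ig = hy - _conditional_entropy(feature_values, labels)
    return 2 * ig / (hx + hy) if (hx + hy) > 0 else 0.0
```

An attribute that perfectly predicts the label gets a gain ratio and symmetrical uncertainty of 1.0, while an attribute independent of the label gets 0.0 on both, which is why ranking attributes by these scores filters out uninformative splits.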
Keywords:
decision tree, C4.5 algorithm, attribute split, attribute correlation, imprecise classification
License
Copyright (c) 2025 Saeful Amri, M. Alharis, Edy Winarno, Astrid Novita Putri

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.
