Average Gain Ratio and Correlation-Based Feature Selection for Imprecise Classification in Algorithm C4.5
Received: 8 January 2025 | Revised: 17 February 2025 and 23 April 2025 | Accepted: 27 April 2025 | Online: 2 August 2025
Corresponding author: Saeful Amri
Abstract
The attribute-split step of the C4.5 algorithm has proven successful at building classifiers in the form of trees that are easy to understand and interpret. However, the split process tends to favor attributes with many values, even when they do not necessarily play a major role in the classification result, and it does not consider the correlation between attributes and labels, which degrades classification performance. The Average Gain Ratio (AGR) has been shown to overcome these weaknesses in the split process. Correlation-Based Feature Selection (CBFS) is also applied during splitting to measure the correlation of each attribute with the label. This study uses the AGR method to address the split-criterion problem and the CBFS method to select attributes correlated with the label, improving the performance of the C4.5 algorithm. Selecting split attributes with AGR, and the comparison with the CBFS method, was shown to improve the performance of the C4.5 classifier, as indicated by the average results for accuracy (87%), sensitivity (91%), G-mean (85%), AUC (84%), and cost (3.67) on six UCI datasets.
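The abstract does not give the exact AGR formula, but both AGR and CBFS build on standard information-theoretic quantities: C4.5's gain-ratio split criterion and the attribute–label correlation (commonly measured as symmetrical uncertainty) that CBFS evaluates. The sketch below shows these two standard building blocks for categorical attributes; the function names and the use of symmetrical uncertainty as the CBFS correlation measure are assumptions, not the paper's exact implementation.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(Y) of a list of discrete values."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def _conditional_entropy(feature_values, labels):
    """H(Y | X): entropy of the labels within each attribute-value group."""
    groups = {}
    for v, y in zip(feature_values, labels):
        groups.setdefault(v, []).append(y)
    n = len(labels)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def gain_ratio(feature_values, labels):
    """C4.5 split criterion: information gain normalized by split information,
    which penalizes attributes that fragment the data into many values."""
    info_gain = entropy(labels) - _conditional_entropy(feature_values, labels)
    split_info = entropy(feature_values)  # entropy of the partition itself
    return info_gain / split_info if split_info > 0 else 0.0

def symmetrical_uncertainty(feature_values, labels):
    """SU(X, Y) = 2·IG / (H(X) + H(Y)): an attribute-label correlation in
    [0, 1], a common correlation measure in CBFS-style selection."""
    hx, hy = entropy(feature_values), entropy(labels)
    ig = hy - _conditional_entropy(feature_values, labels)
    return 2 * ig / (hx + hy) if (hx + hy) > 0 else 0.0
```

An attribute that perfectly predicts the label gets a gain ratio and symmetrical uncertainty of 1.0, while an attribute independent of the label gets 0.0 on both, which is why ranking attributes by these scores filters out uninformative splits.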
Keywords:
decision tree, C4.5 algorithm, attribute split, attribute correlation, imprecise classification
License
Copyright (c) 2025 Saeful Amri, M. Alharis, Edy Winarno, Astrid Novita Putri

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.
