A Novel Summarization-based Approach for Feature Reduction Enhancing Text Classification Accuracy

S. Rahamat Basha; J. Keziya Rani; J. J. C. Prasad Yadav

doi:10.48084/etasr.3173

Authors

S. Rahamat Basha, Department of Computer Science & Technology, Sri Krishnadevaraya University, India, http://orcid.org/0000-0003-3262-6350
J. Keziya Rani, Department of Computer Science & Technology, Sri Krishnadevaraya University, India
J. J. C. Prasad Yadav, Department of CSE, Rajeev Gandhi Memorial College of Engineering and Technology, India

Volume: 9 | Issue: 6 | Pages: 5001-5005 | December 2019 | https://doi.org/10.48084/etasr.3173

Corresponding author: S. Rahamat Basha

Abstract

Automatic summarization is the process of shortening one document (single-document summarization) or several documents (multi-document summarization). This paper proposes a new feature selection method for the nearest neighbor classifier that summarizes the original training documents according to a sentence importance measure. Our approach to single-document summarization scores sentence importance with two measures: the frequency of the terms in a sentence and the similarity of that sentence to the other sentences. All sentences are ranked accordingly, and the top-ranked sentences (subject to a threshold constraint) are selected for the summary. The summary of every document in the corpus is stored as a new document, which is then used in the summarization evaluation process.
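
The ranking step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the two measures are computed or combined, so the weighted sum, the `alpha` weight, the `keep_ratio` threshold, the cosine similarity, and the naive sentence splitter are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercased alphanumeric tokens; no stop-word removal or stemming.
    return re.findall(r"[a-z0-9]+", sentence.lower())


def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def summarize(text, keep_ratio=0.5, alpha=0.5):
    """Keep the top-ranked sentences of one document, in original order.

    keep_ratio and alpha are hypothetical parameters, not taken from the paper.
    """
    sentences = split_sentences(text)
    n = len(sentences)
    if n == 0:
        return ""
    bags = [Counter(tokenize(s)) for s in sentences]

    # Document-level term frequencies.
    doc_tf = Counter()
    for bag in bags:
        doc_tf.update(bag)

    # Measure 1: average document frequency of the terms in the sentence.
    tf_scores = [
        sum(doc_tf[t] * c for t, c in bag.items()) / max(1, sum(bag.values()))
        for bag in bags
    ]
    # Measure 2: average cosine similarity to every other sentence.
    sim_scores = [
        sum(cosine(bags[i], bags[j]) for j in range(n) if j != i) / max(1, n - 1)
        for i in range(n)
    ]

    # Scale each measure to [0, 1] and combine with a weighted sum (assumption).
    max_tf = max(tf_scores) or 1.0
    max_sim = max(sim_scores) or 1.0
    scores = [
        alpha * tf / max_tf + (1 - alpha) * sim / max_sim
        for tf, sim in zip(tf_scores, sim_scores)
    ]

    # Threshold: keep the best ceil(keep_ratio * n) sentences, at least one.
    k = max(1, math.ceil(keep_ratio * n))
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return " ".join(sentences[i] for i in sorted(ranked))


doc = (
    "Wheat prices rose sharply. "
    "Wheat exports and wheat prices are linked. "
    "The weather was pleasant yesterday."
)
summary = summarize(doc, keep_ratio=0.5)
```

With this toy document the off-topic third sentence scores lowest on both measures and is dropped. Applying `summarize` to every training document would yield the reduced corpus from which features for the nearest neighbor classifier are then extracted.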

Keywords

summarization, dimension reduction, feature selection, feature extraction, feature clustering, text classification
