A Comparative Approach of Dimensionality Reduction Techniques in Text Classification

S. Rahamat Basha; J. K. Rani

doi:10.48084/etasr.3146

Authors

S. Rahamat Basha, Department of Computer Science & Technology, Sri Krishnadevaraya University, India, http://orcid.org/0000-0003-3262-6350
J. K. Rani, Department of Computer Science & Technology, Sri Krishnadevaraya University, India

Volume: 9 | Issue: 6 | Pages: 4974-4979 | December 2019 | https://doi.org/10.48084/etasr.3146

Corresponding author: S. Rahamat Basha

Abstract

This work addresses document classification, a supervised learning task that requires a labeled document set for training and a test set of documents to be classified. Document categorization proceeds through a sequence of steps: text preprocessing, feature extraction, and classification. A self-made data set was used to train the classifiers in every experiment. The work compares accuracy, average precision, precision, and recall for two classifiers (KNN and Naive Bayes), with and without combinations of several feature selection techniques. The results indicate that the Naive Bayes classifier performed better in many situations.
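The sequence of steps named in the abstract (preprocessing, feature selection, classification, evaluation) can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The toy sentences, the stop word list, the crude suffix stripping (a stand-in for a real stemmer such as Porter's), the use of document frequency as the selection criterion, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter

# Hypothetical stop word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "on", "for"}

def preprocess(text):
    """Lowercase, drop stop words, strip a few suffixes (crude stand-in for stemming)."""
    tokens = []
    for word in text.lower().split():
        word = word.strip(".,;:!?")
        if not word or word in STOP_WORDS:
            continue
        for suffix in ("ing", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        tokens.append(word)
    return tokens

def select_features(docs, k):
    """Keep the k terms with the highest document frequency (assumed criterion)."""
    df = Counter(term for doc in docs for term in set(doc))
    return set(sorted(df, key=lambda t: (-df[t], t))[:k])

def train_nb(docs, labels, vocab):
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""
    classes = sorted(set(labels))
    counts = {c: Counter() for c in classes}
    for doc, label in zip(docs, labels):
        counts[label].update(t for t in doc if t in vocab)
    log_prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    log_like = {}
    for c in classes:
        total = sum(counts[c].values()) + len(vocab)
        log_like[c] = {t: math.log((counts[c][t] + 1) / total) for t in vocab}
    return log_prior, log_like

def predict_nb(model, doc):
    """Return the class with the highest log posterior score."""
    log_prior, log_like = model
    scores = {c: log_prior[c] + sum(log_like[c][t] for t in doc if t in log_like[c])
              for c in log_prior}
    return max(sorted(scores), key=scores.get)

def cosine(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def predict_knn(train_docs, labels, vocab, doc, k=3):
    """Majority vote among the k training documents most similar to doc."""
    query = Counter(t for t in doc if t in vocab)
    sims = [(cosine(query, Counter(t for t in d if t in vocab)), lab)
            for d, lab in zip(train_docs, labels)]
    sims.sort(key=lambda pair: pair[0], reverse=True)
    return Counter(lab for _, lab in sims[:k]).most_common(1)[0][0]

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred, positive):
    """Precision and recall for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return (tp / (tp + fp) if tp + fp else 0.0, tp / (tp + fn) if tp + fn else 0.0)

# Invented toy data, not the data set used in the paper.
train = [("The team wins the football match", "sport"),
         ("Players are training for the match", "sport"),
         ("The striker scores goals in the game", "sport"),
         ("The bank raises interest rates", "finance"),
         ("Markets and stocks are falling", "finance"),
         ("Investors buy stocks in the bank", "finance")]
test = [("The team scores in the match", "sport"),
        ("The bank and the stocks", "finance")]

train_docs = [preprocess(text) for text, _ in train]
train_labels = [label for _, label in train]
test_docs = [preprocess(text) for text, _ in test]
y_true = [label for _, label in test]

vocab = select_features(train_docs, k=10)
nb_model = train_nb(train_docs, train_labels, vocab)
nb_pred = [predict_nb(nb_model, d) for d in test_docs]
knn_pred = [predict_knn(train_docs, train_labels, vocab, d) for d in test_docs]

print("NB :", nb_pred, accuracy(y_true, nb_pred), precision_recall(y_true, nb_pred, "sport"))
print("KNN:", knn_pred, accuracy(y_true, knn_pred), precision_recall(y_true, knn_pred, "sport"))
```

Changing `k` in `select_features` is one way to observe how the size of the retained vocabulary affects each classifier. On a data set this small the numbers carry no statistical meaning.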

Keywords

stop word removal, stemming, feature weighting and selection, KNN, Naive Bayes
