2014, Vol. 29, Issue (3): 376-391. doi: 10.1007/s11390-014-1437-6

Special Issue: Artificial Intelligence and Pattern Recognition; Data Management and Data Mining

• Data Management and Data Mining •

Higher-Order Smoothing: A Novel Semantic Smoothing Method for Text Classification

Mitat Poyraz, Zeynep Hilal Kilimci, and Murat Can Ganiz*, Member, IEEE   

  1. Department of Computer Engineering, Dogus University, Istanbul 34722, Turkey
  • Received: 2013-09-01 Revised: 2014-03-11 Online: 2014-05-05 Published: 2014-05-05
  • About author: Mitat Poyraz is a business intelligence consultant at QlikView Turkey. Before joining QlikView Turkey, he was a research assistant and Master's student in the Computer Engineering Department of Dogus University, Istanbul, Turkey. His research interests are textual data mining, machine learning algorithms, and semantic smoothing approaches.
  • Supported by: This work was supported in part by the Scientific and Technological Research Council of Turkey (TÜBİTAK) under Grant No. 111E239. Points of view in this document are those of the authors and do not necessarily represent the official position or policies of TÜBİTAK.

Latent semantic indexing (LSI) is known to take advantage of implicit higher-order (or latent) structure in the association of terms and documents. Higher-order relations in LSI capture "latent semantics". These findings inspired a previously introduced Bayesian framework for classification, Higher-Order Naive Bayes (HONB), which makes explicit use of these higher-order relations. In this paper, we present a novel semantic smoothing method for the Naive Bayes algorithm named Higher-Order Smoothing (HOS). HOS is built on a graph-based data representation similar to that of HONB, which allows semantics in higher-order paths to be exploited. HOS takes the concept one step further by also exploiting the relationships between instances of different classes. As a result, we move beyond not only instance boundaries but also class boundaries to exploit the latent information in higher-order paths. This approach improves parameter estimation when labeled data is insufficient. Results of our extensive experiments demonstrate the value of HOS on several benchmark datasets.
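
As a rough illustration of what a "higher-order path" between terms can look like, the sketch below finds term pairs that never need to share a document but are linked through an intermediate term across two different documents (a second-order path). This is not the HOS or HONB formulation from the paper: the function names `first_order_pairs` and `second_order_pairs`, the toy documents, and the restriction to second-order paths are all assumptions made for illustration.

```python
from itertools import combinations


def first_order_pairs(docs):
    """Map each unordered term pair to the set of document ids where both occur."""
    pairs = {}
    for doc_id, terms in enumerate(docs):
        for a, b in combinations(sorted(set(terms)), 2):
            pairs.setdefault((a, b), set()).add(doc_id)
    return pairs


def second_order_pairs(docs):
    """Return term pairs (a, c) joined by a path a - d1 - b - d2 - c with d1 != d2.

    Illustrative assumption: only second-order paths are considered, and a
    pair is reported once regardless of how many such paths connect it.
    """
    # neighbours[t][u] = ids of documents in which t and u co-occur
    neighbours = {}
    for (a, b), ids in first_order_pairs(docs).items():
        neighbours.setdefault(a, {})[b] = ids
        neighbours.setdefault(b, {})[a] = ids

    linked = set()
    for bridge, nbrs in neighbours.items():
        for a, c in combinations(sorted(nbrs), 2):
            # Two distinct documents exist for the two hops exactly when the
            # union of the hop document sets has at least two members.
            if len(nbrs[a] | nbrs[c]) >= 2:
                linked.add((a, c))
    return linked


# Hypothetical toy corpus: each document is a list of terms.
toy_docs = [
    ["naive", "bayes"],
    ["bayes", "smoothing"],
    ["smoothing", "semantic"],
]
print(sorted(second_order_pairs(toy_docs)))
# → [('bayes', 'semantic'), ('naive', 'smoothing')]
```

In the toy corpus, "naive" and "smoothing" never appear in the same document, yet they are connected through "bayes". Relations of this kind are what the abstract refers to as moving beyond instance boundaries.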

ISSN 1000-9000(Print)

         1860-4749(Online)
CN 11-2296/TP

Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
Tel.:86-10-62610746
E-mail: jcst@ict.ac.cn
 
  Copyright ©2015 JCST, All Rights Reserved