2014, Vol. 29, Issue (3): 361-375. doi: 10.1007/s11390-014-1435-8

Special Issue: Artificial Intelligence and Pattern Recognition; Data Management and Data Mining; Computer Networks and Distributed Computing

• Data Management and Data Mining •

Inductive Model Generation for Text Classification Using a Bipartite Heterogeneous Network

Rafael Geraldeli Rossi, Alneu de Andrade Lopes, Thiago de Paulo Faleiros, and Solange Oliveira Rezende   

  1. Institute of Mathematics and Computer Science, University of São Paulo, São Carlos, Brazil
  • Received: 2013-09-02 Revised: 2014-01-18 Online: 2014-05-05 Published: 2014-05-05
  • About author: Rafael Geraldeli Rossi received the B.S. degree in information systems and the M.S. degree in computer science and computational mathematics from the University of São Paulo, Brazil, in 2009 and 2011, respectively. He is a Ph.D. candidate at the University of São Paulo. His research interests include machine learning, text mining, and graph-based methods.
  • Supported by: The work is supported by the São Paulo Research Foundation (FAPESP) of Brazil under Grant Nos. 2011/12823-6, 2011/23689-9, and 2011/19850-9.

Algorithms for numeric data classification have been applied to text classification. Text collections are usually represented with the vector space model, but the sparsity and high dimensionality of this representation sometimes impair the quality of general-purpose classifiers. Networks can also be used to represent text collections; they avoid the high sparsity and allow relationships among the different objects that compose a text collection to be modeled. Such network-based representations can improve the quality of the classification results. One of the simplest ways to represent a text collection as a network is a bipartite heterogeneous network, in which objects that represent the documents are connected to objects that represent the terms. Bipartite heterogeneous networks do not require the computation of similarities or relations among the objects and can be used to model any type of text collection. Given these advantages, in this article we present a text classifier that builds a classification model using the structure of a bipartite heterogeneous network. The algorithm, referred to as IMBHN (Inductive Model Based on Bipartite Heterogeneous Network), induces a classification model by assigning weights to the objects that represent the terms for each class of the text collection. An empirical evaluation using a large number of text collections from different domains shows that the proposed IMBHN algorithm produces significantly better results than the k-NN, C4.5, SVM, and Naive Bayes algorithms.
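
The abstract describes the idea only at a high level: documents and terms form the two sides of a bipartite network, and the model is a set of per-class weights attached to the term objects. The following is a minimal sketch of that idea, not the authors' implementation. The update rule is an assumption (a least-mean-squares style error correction), as are the function names, the whitespace tokenizer, the learning rate, and the number of epochs, none of which are given in the abstract.

```python
from collections import defaultdict


def build_bipartite_network(docs):
    """Link each document object to its term objects, weighted by term frequency.

    Returns {doc_id: {term: frequency}}. No document-document or term-term
    similarities are computed, matching the bipartite structure described above.
    """
    network = {}
    for doc_id, text in docs.items():
        freqs = defaultdict(int)
        for term in text.lower().split():  # assumed tokenizer: whitespace split
            freqs[term] += 1
        network[doc_id] = dict(freqs)
    return network


def induce_term_weights(network, labels, classes, rate=0.1, epochs=50):
    """Induce one weight per (term, class) pair.

    Assumed rule (hypothetical, LMS-style): for each labeled document, compare
    the weighted sum over its terms with the 0/1 class target and move the
    weights of the connected terms in proportion to the error.
    """
    weights = defaultdict(lambda: {c: 0.0 for c in classes})
    for _ in range(epochs):
        for doc_id, terms in network.items():
            for c in classes:
                target = 1.0 if labels[doc_id] == c else 0.0
                output = sum(freq * weights[t][c] for t, freq in terms.items())
                error = target - output
                for t, freq in terms.items():
                    weights[t][c] += rate * freq * error
    return weights


def classify(text, weights, classes):
    """Score an unseen document by summing the class weights of its known terms."""
    scores = {c: 0.0 for c in classes}
    for term in text.lower().split():
        if term in weights:  # terms never seen in training contribute nothing
            for c in classes:
                scores[c] += weights[term][c]
    return max(classes, key=lambda c: scores[c])
```

Because the induced model is just the table of term weights, a new document can be classified without revisiting the training network, which is what makes such a model inductive rather than transductive.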


ISSN 1000-9000 (Print), 1860-4749 (Online)
CN 11-2296/TP

Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences