Journal of Computer Science and Technology, 2018, Vol. 33, Issue (5): 1007-1022. doi: 10.1007/s11390-018-1871-y

Special Issue: Artificial Intelligence and Pattern Recognition

• Data Management and Data Mining •

Stochastic Variational Inference-Based Parallel and Online Supervised Topic Model for Large-Scale Text Processing

Yang Li^1,2,3 (Member, CCF), Wen-Zhuo Song^1,2 (Member, CCF), Bo Yang^1,2,* (Distinguished Member, CCF)

  • 1 College of Computer Science and Technology, Jilin University, Changchun 130012, China;
    2 Key Laboratory of Symbolic Computation and Knowledge Engineering, Ministry of Education, Jilin University, Changchun 130012, China;
    3 Aviation University of Air Force, Changchun 130062, China
  • Received: 2017-09-19; Revised: 2018-07-09; Online: 2018-09-17; Published: 2018-09-17
  • Contact: Bo Yang
  • Supported by:
    This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 61572226 and 61876069, and the Key Scientific and Technological Research and Development Project of Jilin Province of China under Grant Nos. 20180201067GX and 20180201044GX.

Topic modeling is a mainstream and effective technique for handling text data, with wide applications in text analysis, natural language processing, personalized recommendation, computer vision, etc. Among known topic models, supervised Latent Dirichlet Allocation (sLDA) is a popular and competitive supervised topic model. However, as datasets grow in scale, training sLDA becomes increasingly inefficient and time-consuming, which restricts it to a narrow range of applications. To address this problem, a parallel online sLDA, named PO-sLDA (Parallel and Online sLDA), is proposed in this study. It uses stochastic variational inference as the learning method to make training faster and more efficient, and a parallel computing mechanism implemented on the MapReduce framework to support cloud computing and big data processing. The online training capability of PO-sLDA broadens the application scope of the approach, making it useful for real-life applications with strict real-time demands. Validation on two datasets of different sizes shows that the proposed approach achieves accuracy comparable to that of sLDA while efficiently accelerating the training procedure. Moreover, its good convergence and online training capability make it attractive for large-scale text data analysis and processing.
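
To make the combination of stochastic variational inference and MapReduce-style parallelism more concrete, the following is a minimal sketch of one generic stochastic update of topic-word parameters, written in the style of online LDA. It is not the PO-sLDA algorithm itself: the function names (`step_size`, `map_stats`, `reduce_stats`, `svi_update`), the hyperparameters (`tau0`, `kappa`, `eta`), and the placeholder local step that spreads word counts uniformly over topics are all assumptions made for illustration, and the supervised (response) part of sLDA is omitted entirely.

```python
from functools import reduce


def step_size(t, tau0=1.0, kappa=0.7):
    """Robbins-Monro step size rho_t = (t + tau0) ** -kappa (kappa in (0.5, 1])."""
    return (t + tau0) ** (-kappa)


def map_stats(doc, num_topics):
    """Map step (hypothetical): expected topic-word counts for one document.

    `doc` maps word -> count. As a placeholder, each count is spread
    uniformly over topics; a real local variational step would go here.
    """
    return {word: [count / num_topics] * num_topics for word, count in doc.items()}


def reduce_stats(a, b):
    """Reduce step: sum per-word statistics coming from two mappers."""
    out = {w: list(v) for w, v in a.items()}
    for w, v in b.items():
        if w in out:
            out[w] = [x + y for x, y in zip(out[w], v)]
        else:
            out[w] = list(v)
    return out


def svi_update(lam, minibatch, corpus_size, t, eta=0.01, num_topics=2):
    """One stochastic update of the global topic-word parameters `lam`.

    The minibatch statistics are rescaled as if the whole corpus had been
    seen, then blended with the old parameters using step size rho_t.
    """
    # In a MapReduce setting, these map calls would run on separate workers.
    mapped = [map_stats(doc, num_topics) for doc in minibatch]
    total = reduce(reduce_stats, mapped, {})
    scale = corpus_size / len(minibatch)
    rho = step_size(t)
    new_lam = {}
    for w in set(lam) | set(total):
        old = lam.get(w, [eta] * num_topics)
        s = total.get(w, [0.0] * num_topics)
        new_lam[w] = [(1 - rho) * o + rho * (eta + scale * x) for o, x in zip(old, s)]
    return new_lam


if __name__ == "__main__":
    lam = {"topic": [0.01, 0.01], "model": [0.01, 0.01]}
    batch = [{"topic": 2, "model": 1}, {"topic": 1}]
    print(svi_update(lam, batch, corpus_size=100, t=0))
```

Because each update touches only a minibatch, new documents can be fed in as they arrive, which is the sense in which such a scheme supports online training; the map and reduce steps are the part that a framework such as MapReduce would distribute across workers.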

Key words: topic modeling; large-scale text classification; stochastic variational inference; cloud computing; online learning


ISSN 1000-9000(Print)

CN 11-2296/TP
