2015, Vol. 30, Issue (4): 903-916. doi: 10.1007/s11390-015-1569-3

Topics: Artificial Intelligence and Pattern Recognition; Data Management and Data Mining

• Special Section on Selected Paper from NPC 2011 •

Leveraging Large Data with Weak Supervision for Joint Feature and Opinion Word Extraction

Lei Fang(房磊), Biao Liu(刘 彪), Min-Lie Huang*(黄民烈), Member, CCF

  1. State Key Laboratory on Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  • Received: 2014-09-12 Revised: 2015-05-04 Online: 2015-07-05 Published: 2015-07-05
  • Contact: Min-Lie Huang is an associate professor in the Department of Computer Science and Technology, Tsinghua University, Beijing. E-mail: aihuang@tsinghua.edu.cn
  • About author: Lei Fang is a fifth-year Ph.D. student in the Department of Computer Science and Technology, Tsinghua University, Beijing. He received his Bachelor's degree in computer science and technology from Harbin Institute of Technology in 2010. His research interests include natural language processing, data mining, and machine learning.
  • Supported by:

    This work is partly supported by the National Basic Research 973 Program of China under Grant Nos. 2012CB316301 and 2013CB329403, the National Natural Science Foundation of China under Grant Nos. 61332007 and 61272227, and the Beijing Higher Education Young Elite Teacher Project.

Abstract: Product feature and opinion word extraction is an important task in fine-grained sentiment analysis. In this paper, we leverage large-scale unlabeled data for joint extraction of feature and opinion words under a knowledge-poor setting, in which only a few feature-opinion pairs are used as weak supervision. Our major contributions are two-fold. First, we propose a data-driven approach that represents product features and opinion words as a list of corpus-level syntactic relations, which captures rich language structures. Second, we build a simple yet robust unsupervised model that incorporates prior knowledge to extract new feature and opinion words. The extraction process is based on a bootstrapping framework which, to some extent, reduces error propagation under large data. Experimental results under various settings, compared with state-of-the-art baselines, demonstrate that our method is effective and promising.
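
The general idea of bootstrapping from a few seed feature-opinion pairs can be illustrated with a toy sketch. This is a generic seed-driven loop and not the model described in the paper: the input format (a list of feature-opinion candidate pairs, assumed to come from some syntactic relation extractor), the function name `bootstrap`, and the `min_support` threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def bootstrap(relations, seed_pairs, min_support=2, max_iters=10):
    """Grow feature and opinion word sets from a few seed pairs.

    relations: iterable of (feature_candidate, opinion_candidate) pairs,
        e.g. noun/adjective pairs linked by a syntactic relation
        (hypothetical input format).
    seed_pairs: a few known (feature, opinion) pairs used as weak supervision.
    min_support: a candidate is accepted only if it is linked to at least
        this many already-accepted words (hypothetical guard against
        error propagation).
    """
    features = {f for f, _ in seed_pairs}
    opinions = {o for _, o in seed_pairs}

    # Index co-occurrences in both directions.
    opinions_of = defaultdict(set)
    features_of = defaultdict(set)
    for f, o in relations:
        opinions_of[f].add(o)
        features_of[o].add(f)

    for _ in range(max_iters):
        # Both candidate sets are computed from the current state
        # before either accepted set is updated.
        new_features = {
            f for f, ops in opinions_of.items()
            if f not in features and len(ops & opinions) >= min_support
        }
        new_opinions = {
            o for o, fs in features_of.items()
            if o not in opinions and len(fs & features) >= min_support
        }
        if not new_features and not new_opinions:
            break  # nothing new was accepted, so the sets have converged
        features |= new_features
        opinions |= new_opinions
    return features, opinions


# Toy data: candidate pairs as they might be extracted from reviews.
relations = [
    ("screen", "great"), ("screen", "bright"),
    ("battery", "long"), ("battery", "bright"), ("battery", "great"),
    ("camera", "great"), ("camera", "bright"),
    ("day", "long"),
]
seeds = [("screen", "great"), ("battery", "long")]
features, opinions = bootstrap(relations, seeds)
```

On this toy input, "bright" is accepted as an opinion word because it is linked to two seed features, which in turn lets "camera" be accepted as a feature. The word "day" is linked to only one accepted opinion word and stays out, which shows how a support threshold can limit the spread of errors.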

Copyright © Editorial Office of Journal of Computer Science and Technology