2015, Vol. 30, Issue (1): 200-213. doi: 10.1007/s11390-015-1513-6

Topic: Data Management and Data Mining

• Special Section on Selected Paper from NPC 2011 •

Review Authorship Attribution in a Similarity Space

Tie-Yun Qian¹ (钱铁云), Member, CCF, ACM; Bing Liu² (刘兵), Fellow, IEEE; Qing Li³ (李青), Distinguished Member, CCF, Senior Member, IEEE; Jianfeng Si⁴ (司建锋)

  1 State Key Laboratory of Software Engineering, Wuhan University, Wuhan 430072, China;
  2 Department of Computer Science, University of Illinois at Chicago, Chicago 60607, U.S.A.;
  3 Multimedia Software Engineering Research Centre and Department of Computer Science, City University of Hong Kong, Hong Kong, China;
  4 Data Analytics Department, Institute for Infocomm Research, Singapore 138632, Singapore
  • Received: 2014-02-19; Revised: 2014-11-14; Online: 2015-01-05; Published: 2015-01-05
  • About the author: Tie-Yun Qian is an associate professor at the State Key Laboratory of Software Engineering at Wuhan University. She received her B.S. degree in computer science from Wuhan University of Technology in 1991, and her Ph.D. degree in computer science from Huazhong University of Science and Technology, Wuhan, in 2006. Her current research interests include text mining, web mining, and natural language processing. She has published more than 20 papers at top conferences, including ACL, EMNLP, and SIGIR. She is a member of CCF and ACM. She has served as a program committee member for many leading conferences, including WWW, COLING, DASFAA, WAIM, and APWeb.
  • Supported by:

    This work was supported by the National Natural Science Foundation of China under Grant Nos. 61272275, 61232002, 61272110, 61202036, 61379004, 61472337, and 61028003, and the 111 Project of China under Grant No. B07037.

Abstract: Authorship attribution, also known as authorship classification, is the problem of identifying the authors (reviewers) of a set of documents (reviews). The common approach is to build a classifier using supervised learning. This approach has two issues that limit its applicability. First, supervised learning needs a large set of documents from each author to serve as training data, which can be difficult to obtain in practice. For example, in the online review domain, most reviewers (authors) write only a few reviews, which are not enough to serve as training data. Second, the learned classifier cannot be applied to authors whose documents were not used in training. In this article, we propose a novel solution that addresses both problems. The core idea is to transform the original document space into a similarity space and learn there instead. Learning in the similarity space naturally handles both issues. Our experimental results on online reviews and reviewers show that the proposed method significantly outperforms state-of-the-art supervised and unsupervised baseline methods.
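
The abstract does not spell out how the similarity space is built, so the following is only a minimal sketch of one way the idea could look in practice, not the authors' method. It assumes that each pair of documents is mapped to a small vector of similarity scores and labeled by whether the two documents share an author. The three measures (word cosine, word Jaccard, character-trigram cosine) and all function names are illustrative assumptions.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a lowercased text."""
    t = text.lower()
    return Counter(t[i:i + n] for i in range(len(t) - n + 1))


def words(text):
    """Count lowercased whitespace-separated tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def jaccard(a, b):
    """Jaccard overlap between the key sets of two count vectors."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def similarity_vector(doc_a, doc_b):
    """Map a pair of documents to a point in a (hypothetical) similarity space."""
    wa, wb = words(doc_a), words(doc_b)
    ca, cb = char_ngrams(doc_a), char_ngrams(doc_b)
    return [cosine(wa, wb), jaccard(wa, wb), cosine(ca, cb)]


def make_pairs(corpus):
    """Turn {author: [docs]} into (similarity_vector, same_author) examples.

    The label is 1 when both documents come from the same author, else 0.
    """
    items = [(author, doc) for author, docs in corpus.items() for doc in docs]
    examples = []
    for i, (a1, d1) in enumerate(items):
        for a2, d2 in items[i + 1:]:
            examples.append((similarity_vector(d1, d2), 1 if a1 == a2 else 0))
    return examples
```

Under these assumptions, any binary classifier could be trained on the output of `make_pairs`. Because the features describe pairs of documents rather than particular authors, documents from authors absent from training can be mapped into the same space, and every pair of reviews yields one training example even when each reviewer has written only a few.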

Copyright © Editorial Office of the Journal of Computer Science and Technology