2015, Vol. 30, Issue (1): 200-213. doi: 10.1007/s11390-015-1513-6
Topic: Data Management and Data Mining
• Special Section on Selected Paper from NPC 2011
Tie-Yun Qian1 (钱铁云), Member, CCF, ACM; Bing Liu2 (刘兵), Fellow, IEEE; Qing Li3 (李青), Distinguished Member, CCF, Senior Member, IEEE; Jianfeng Si4 (司建锋)
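Authorship attribution, also known as authorship classification, is the problem of identifying the authors (reviewers) of a set of documents (reviews). The common approach is to build a classifier with supervised learning. This approach has two shortcomings that keep it from being applied in many domains. First, supervised learning needs a large number of documents from each author as training data, which is hard to obtain in practice. For example, reviewers on shopping sites often write only a few reviews, too few to form an adequate training set. Second, the learned classifier cannot be applied to unknown authors who do not appear in the training set. This paper proposes a new technique that addresses both problems. The core idea is to learn in a similarity space rather than in the original document space. Experiments on a dataset of online reviews and reviewers show that the proposed algorithm substantially outperforms existing supervised and unsupervised baseline methods.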
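The abstract gives only the core idea, so the following is a minimal sketch of what learning in a similarity space could look like, not the authors' method. It assumes that each (query document, candidate document) pair is mapped to a vector of similarity scores and that a binary classifier is trained on those vectors to decide whether the two documents share an author. The two features (word cosine, character-trigram cosine), the logistic regression, and all function names are illustrative assumptions.

```python
import math
from collections import Counter


def ngram_counts(text, n):
    """Character n-gram counts of a lowercased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two Counters (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity_vector(query_doc, candidate_doc):
    """Map a document pair to a point in the (assumed) similarity space."""
    words_q = Counter(query_doc.lower().split())
    words_c = Counter(candidate_doc.lower().split())
    return [
        cosine(words_q, words_c),                                    # word overlap
        cosine(ngram_counts(query_doc, 3), ngram_counts(candidate_doc, 3)),  # char trigrams
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(vectors, labels, lr=0.5, epochs=200):
    """Full-batch gradient descent on the mean logistic loss.

    labels: 1 if the pair shares an author, 0 otherwise.
    """
    dim, n = len(vectors[0]), len(vectors)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(vectors, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            for j in range(dim):
                grad_w[j] += (p - y) * x[j] / n
            grad_b += (p - y) / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def score(model, vector):
    """Higher score means the pair looks more like a same-author pair."""
    weights, bias = model
    return sum(w * x for w, x in zip(weights, vector)) + bias
```

Because the classifier in this sketch sees only pairwise similarities and never author identities, the same trained model can in principle score pairs involving authors who were absent from training, which is the property the abstract emphasizes.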
Copyright © Editorial Office of Journal of Computer Science and Technology