Review Authorship Attribution in a Similarity Space

Tieyun Qian; Bing Liu; Qing Li; Jianfeng Si

doi:10.1007/s11390-015-1513-6

Abstract

Authorship attribution, also known as authorship classification, is the problem of identifying the authors (reviewers) of a set of documents (reviews). The common approach is to build a classifier using supervised learning. This approach has two issues that limit its applicability. First, supervised learning needs a large set of documents from each author to serve as training data, which is often hard to obtain in practice. For example, in the online review domain, most reviewers (authors) write only a few reviews, which are not enough to serve as training data. Second, the learned classifier cannot be applied to authors whose documents were not used in training. In this article, we propose a novel solution to both problems. The core idea is that, instead of learning in the original document space, we transform it into a similarity space, where learning naturally addresses both issues. Our experimental results on online reviews and reviewers show that the proposed method significantly outperforms state-of-the-art supervised and unsupervised baseline methods.
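To make the idea of a similarity space concrete, the sketch below turns pairs of reviews into short vectors of similarity scores, labeled by whether the two reviews share an author. It is only an illustration: the abstract does not say which similarity measures or features are used, so the three scores here (cosine, Jaccard, length ratio), the whitespace tokenizer, and all function names are assumptions, not the authors' method.

```python
import math
from collections import Counter
from itertools import combinations


def bag_of_words(text):
    """Lowercased whitespace tokens with their counts (assumed tokenizer)."""
    return Counter(text.lower().split())


def similarity_vector(doc_a, doc_b):
    """Map a pair of documents to a small vector of similarity scores.

    The three measures are illustrative assumptions, not taken from the paper.
    """
    wa, wb = bag_of_words(doc_a), bag_of_words(doc_b)
    shared = set(wa) & set(wb)
    union = set(wa) | set(wb)

    dot = sum(wa[w] * wb[w] for w in shared)
    norm = (math.sqrt(sum(c * c for c in wa.values()))
            * math.sqrt(sum(c * c for c in wb.values())))
    cosine = dot / norm if norm else 0.0

    jaccard = len(shared) / len(union) if union else 0.0

    len_a, len_b = sum(wa.values()), sum(wb.values())
    longest = max(len_a, len_b)
    length_ratio = min(len_a, len_b) / longest if longest else 0.0

    return [cosine, jaccard, length_ratio]


def to_similarity_space(reviews):
    """reviews: list of (author, text) tuples.

    Returns one feature vector per unordered pair of reviews and a label
    that is 1 when both reviews have the same author, else 0.
    """
    features, labels = [], []
    for (author_a, text_a), (author_b, text_b) in combinations(reviews, 2):
        features.append(similarity_vector(text_a, text_b))
        labels.append(1 if author_a == author_b else 0)
    return features, labels


# Made-up toy data, for illustration only.
reviews = [
    ("alice", "great battery life and a great screen"),
    ("alice", "great screen but poor battery"),
    ("bob", "shipping was slow and the box was damaged"),
]
X, y = to_similarity_space(reviews)
```

Because each feature describes how similar two documents are rather than what any particular author writes, a binary classifier trained on `X` and `y` is not tied to the authors seen in training. A few reviews per author already yield many labeled pairs, which is one way to read the abstract's claim that learning in the similarity space addresses both issues.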
