Sentence Semantic Distance Computation and Novelty Detection

Computation on Sentence Semantic Distance for Novelty Detection

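  • Abstract (translated from Chinese): Novelty detection means removing redundancy from, and retrieving new information in, a set of sentences relevant to a given topic. Starting in 2002, the Text Retrieval Conference (TREC) set up the Novelty task, which evaluates how well systems locate relevant sentences and new content within a given set of relevant documents. Sentence-level relevance retrieval and novelty detection sit between automatic question answering and document-level information retrieval: the unit processed is larger than the precise phrases of question answering and smaller than the documents of information retrieval. Common information retrieval systems often return documents that are relevant but highly redundant, whereas sentence-level novelty detection aims to return sentences that are relevant to the user's need and free of repeated information. Research in this area has only just begun, and there are as yet no mature techniques or theoretical methods. At TREC 2003, we tried a sentence distance computation approach to detecting novel content. For sentence-level computation and comparison, pure word-form matching is very limited, because most words do not recur. Our main motivation is therefore to bring in semantic information, expand the content of a sentence, and compute the semantic-level distance between sentences directly. The computation combines WordNet with its statistical information, starting at the level of word synsets and extending to the sentence level. Based on sentence semantic distance, we treat novelty detection as a binary classification problem: new sentence versus non-new sentence. The feature vector used for classification includes several factors, among them the semantic distance from the current sentence to the topic and the distance from the current sentence to the valid preceding sentences. We then detect new sentences with Winnow and support vector machine classifiers, respectively. We ran several experiments to study the relationship between different factors and the final results. The experiments show that semantic computation holds promise for novelty detection. Given different documents, we further studied the ratio of the number of new sentences to the number of relevant sentences and found that this ratio declines at a particular rate (about 0.86). Guided by this ratio, we ran another group of comparative experiments, whose results indicate that the ratio helps improve novelty detection performance.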

     

    Abstract: Novelty detection aims to retrieve new information and filter out redundancy from given sentences that are relevant to a specific topic. In TREC 2003, the authors tried an approach to novelty detection based on semantic distance computation. The motivation is to expand a sentence by introducing semantic information. The computation of semantic distance between sentences combines WordNet with statistical information. Novelty detection is treated as a binary classification problem: new sentence or not. The feature vector, used in the vector space model for classification, consists of various factors, including the semantic distance from the sentence to the topic and the distance from the sentence to the relevant context preceding it. New sentences are then detected with Winnow and support vector machine classifiers, respectively. Several experiments are conducted to examine the relationship between different factors and performance. The experiments show that semantic computation is promising for novelty detection. The ratio of the number of new sentences to the number of relevant sentences is further studied for different relevant document sizes. It is found that the ratio decreases at a certain rate (about 0.86). Another group of experiments is then performed, guided by this ratio. The results show that the ratio helps improve novelty detection performance.
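
    The classification step described in the abstract can be illustrated with a minimal sketch of the classic Winnow update rule. This is not the authors' implementation: the abstract does not say which Winnow variant or which feature encoding was used. The sketch assumes that the two distances named in the abstract (sentence to topic, sentence to preceding context) are already available as numbers in [0, 1] and are thresholded into binary features. The distance values, the 0.5 cut-offs, and the names binarize, winnow_train and winnow_predict are all hypothetical.

```python
from typing import List, Sequence, Tuple


def binarize(dist_to_topic: float, dist_to_context: float,
             topic_cut: float = 0.5, context_cut: float = 0.5) -> List[int]:
    """Turn two distances into binary features (assumed encoding).

    Feature 0: the sentence is far from the preceding relevant context.
    Feature 1: the sentence is close to the topic.
    """
    return [int(dist_to_context > context_cut), int(dist_to_topic < topic_cut)]


def winnow_predict(w: Sequence[float], theta: float, x: Sequence[int]) -> int:
    """Predict 1 (new) if the weighted sum of active features reaches theta."""
    return 1 if sum(wi for wi, xi in zip(w, x) if xi) >= theta else 0


def winnow_train(X: Sequence[Sequence[int]], y: Sequence[int],
                 alpha: float = 2.0, epochs: int = 10) -> Tuple[List[float], float]:
    """Classic Winnow: multiplicative promotion and demotion on mistakes."""
    n = len(X[0])
    w = [1.0] * n          # all weights start at 1
    theta = float(n)       # threshold set to the number of features
    for _ in range(epochs):
        mistakes = 0
        for x, label in zip(X, y):
            if winnow_predict(w, theta, x) == label:
                continue
            mistakes += 1
            # Missed a new sentence: promote active features.
            # Flagged a non-new sentence: demote active features.
            factor = alpha if label == 1 else 1.0 / alpha
            w = [wi * factor if xi else wi for wi, xi in zip(w, x)]
        if mistakes == 0:
            break
    return w, theta


# Invented toy data: (distance to topic, distance to context, is_new).
samples = [(0.2, 0.8, 1), (0.3, 0.1, 0), (0.7, 0.9, 1), (0.6, 0.2, 0)]
X = [binarize(t, c) for t, c, _ in samples]
y = [label for _, _, label in samples]

w, theta = winnow_train(X, y)
predictions = [winnow_predict(w, theta, x) for x in X]
```

    On this toy data the weight of the "far from context" feature grows beyond that of the "close to topic" feature, because only the former separates new from non-new sentences. A support vector machine, the other classifier named in the abstract, could use the raw distances directly without thresholding.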

     
