Modeling Consensus Semantics in Social Tagging Systems

Bin Zhang (张斌); Yin Zhang (张引); Ke-Ning Gao (高克宁)

doi:10.1007/s11390-011-0179-y

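Extended abstract: In social tagging systems, users can annotate online resources with arbitrary tags in order to categorize and index them. The approach is simple and easy to use, which has led to its wide adoption and made it one of the most important ways of organizing information in the Web 2.0 era. However, because there is no predefined vocabulary, users usually cannot reach consensus on the semantics of tags or on how to use them to categorize information. As a result, tag-based categorization often suffers from inconsistency and redundancy. Ontology-based approaches can help users reach such consensus, but they face a number of problems, including the inability to describe ambiguous concepts and the difficulty of handling new concepts in a timely manner. We observe that tags used only a few times can appear only in very specific contexts, so their semantics are usually clear and specific. Although users can hardly reach consensus on the semantics of these rarely used tags, it is still possible to use this specific semantic information to model and describe the semantics of other tags. By analyzing a dataset extracted from a real-world blog service provider, we find that 1) rare tags provide a large amount of specific semantics, and 2) the semantics of rare tags are propagated to common tags through co-occurrence, while the relations between rare tags and common tags are far simpler than those among common tags. Based on these observations, this paper proposes a model, similar to random walk and spreading activation, that uses the semantics of rare tags to describe the semantics of other tags. Like the random walk process used by PageRank, the model treats a user's process of understanding tag semantics as a random walk over a tag network built from tag co-occurrence. By modeling tag co-occurrence counts as transition probabilities and rare tags as absorbing states, the model forms a Markov process with absorbing states. In this way, the semantics of a common tag are modeled as the vector of probabilities of reaching each rare tag when starting from that tag, so that the semantics of common tags are described in terms of the rare but unambiguous tags. In a comparison with latent semantic analysis on a concept clustering task, the proposed model performs better, which indicates that it captures and models the semantic information of tags well.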

Abstract: In social tagging systems, people can annotate online data with arbitrary tags in order to categorize and index them. However, the lack of an a priori vocabulary makes it difficult for people to reach consensus about the semantics of tags and about how to categorize data. Ontology-based approaches can help people reach such consensus, but they still face problems such as the inability to model ambiguous and new concepts properly. Tags that are used very few times can only be used in very specific contexts, so their semantics are very clear and detailed. Although people have no consensus on these tags, it is still possible to leverage their detailed semantics to model the other tags. In this paper we introduce a model, similar to random walk and spreading activation, that represents the semantics of tags using the semantics of unpopular tags. By comparing the proposed model with the classic Latent Semantic Analysis approach on a concept clustering task, we show that the proposed model can properly capture the semantics of tags.
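
The absorbing random walk outlined in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation: the function name `absorption_semantics`, the toy co-occurrence matrix, and the choice of which tags count as rare are all assumptions made for illustration. It applies the standard absorbing Markov chain result B = (I - Q)^{-1} R, where Q holds transitions among common tags and R holds transitions from common tags to rare (absorbing) tags, and it assumes every common tag can reach at least one rare tag.

```python
import numpy as np

def absorption_semantics(cooc, rare):
    """Return B, where B[i, j] is the probability that a random walk
    starting at the i-th common tag is absorbed at the j-th rare tag.

    cooc: (n, n) matrix of tag co-occurrence counts (illustrative input).
    rare: length-n boolean mask marking rare tags as absorbing states.
    """
    cooc = np.asarray(cooc, dtype=float)
    rare = np.asarray(rare, dtype=bool)
    common = ~rare

    # Row-normalize the counts of the common tags into jump probabilities.
    rows = cooc[common]
    P = rows / rows.sum(axis=1, keepdims=True)

    Q = P[:, common]  # common tag -> common tag
    R = P[:, rare]    # common tag -> rare (absorbing) tag

    # Solve (I - Q) B = R instead of forming the inverse explicitly.
    return np.linalg.solve(np.eye(Q.shape[0]) - Q, R)

# Toy example with hypothetical tags: two common, two rare.
tags = ["python", "programming", "django-orm-tips", "numpy-broadcasting"]
cooc = [
    [0, 2, 1, 1],
    [2, 0, 2, 0],
    [1, 2, 0, 0],
    [1, 0, 0, 0],
]
rare = [False, False, True, True]

B = absorption_semantics(cooc, rare)
# Each row of B is the semantic vector of one common tag over the rare tags.
print(B)
```

Each row of `B` sums to 1 under the stated reachability assumption, so rows can be compared directly (for example with cosine similarity) when clustering common tags.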

Modeling Consensus Semantics in Social Tagging Systems