Improving Web Document Clustering through Employing User-Related Tag Expansion Techniques

Peng Li; Bin Wang; Wei Jin

doi:10.1007/s11390-012-1243-y

Peng Li, Bin Wang, Wei Jin. Improving Web Document Clustering through Employing User-Related Tag Expansion Techniques. Journal of Computer Science and Technology, 2012, 27(3): 554-566. DOI: 10.1007/s11390-012-1243-y

Abstract

As high-quality descriptors of web page semantics, social annotations or tags have been used for web document clustering with promising results. However, most web pages have few tags (fewer than 10). This sparsity seriously limits the usage of tags for clustering. In this work, we propose a user-related tag expansion method to overcome this problem, which incorporates additional useful tags into the original tag document by utilizing user tagging data as background knowledge. Unfortunately, simply adding tags may cause topic drift, i.e., the dominant topic(s) of the original document may be changed. To tackle this problem, we have designed a novel generative model called Folk-LDA, which jointly models original and expanded tags as independent observations. Experimental results show that 1) our user-related tag expansion method can be effectively applied to over 90% of tagged web documents; 2) Folk-LDA can alleviate topic drift in expansion, especially for topic-specific documents; 3) the proposed tag-based clustering methods significantly outperform the word-based methods, which indicates that tags could be a better resource for the clustering task.
