Peng Li, Bin Wang, Wei Jin. Improving Web Document Clustering through Employing User-Related Tag Expansion Techniques[J]. Journal of Computer Science and Technology, 2012, 27(3): 554-566. DOI: 10.1007/s11390-012-1243-y

Improving Web Document Clustering through Employing User-Related Tag Expansion Techniques

As high-quality descriptors of web page semantics, social annotations (tags) have been used for web document clustering with promising results. However, most web pages have few tags (fewer than 10), and this sparsity seriously limits the use of tags for clustering. In this work, we propose a user-related tag expansion method to overcome this problem: it incorporates additional useful tags into the original tag document by using user tagging data as background knowledge. Unfortunately, simply adding tags may cause topic drift, i.e., the dominant topic(s) of the original document may change. To tackle this problem, we design a novel generative model called Folk-LDA, which jointly models original and expanded tags as independent observations. Experimental results show that 1) our user-related tag expansion method can be effectively applied to over 90% of tagged web documents; 2) Folk-LDA can alleviate topic drift in expansion, especially for topic-specific documents; and 3) the proposed tag-based clustering methods significantly outperform word-based methods, which indicates that tags could be a better resource for the clustering task.
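The abstract does not spell out the expansion procedure, so the following is only a minimal sketch of one plausible reading of "user-related tag expansion": candidate tags are taken from the other tagging activity of the users who annotated the document. The function name `expand_tags`, the `(user, doc, tag)` triple layout, and the `top_k` cutoff are all assumptions made for illustration and are not taken from the paper. Original and expanded tags are returned separately, which would let a downstream model treat them as distinct observations.

```python
from collections import Counter


def expand_tags(doc_id, annotations, top_k=5):
    """Hypothetical user-related tag expansion (illustrative only).

    annotations: list of (user, doc, tag) triples (assumed layout).
    Returns (original_tags, expanded_tags) as two separate lists.
    """
    # Tags already attached to the target document.
    original = Counter(t for u, d, t in annotations if d == doc_id)

    # Users who annotated the target document.
    users = {u for u, d, t in annotations if d == doc_id}

    # Tags those same users applied to other documents,
    # excluding tags the document already has.
    candidates = Counter(
        t
        for u, d, t in annotations
        if u in users and d != doc_id and t not in original
    )

    # Keep the most frequent candidates as the expansion.
    expanded = [t for t, _ in candidates.most_common(top_k)]
    return list(original), expanded
```

Calling `expand_tags("d1", annotations)` on a list of triples yields the document's own tags and a short list of candidate tags drawn from its annotators' other bookmarks. Frequency ranking is a simple stand-in here; it does nothing to prevent topic drift, which is the problem the abstract assigns to Folk-LDA.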