Citation: Peng Li, Bin Wang, Wei Jin. Improving Web Document Clustering through Employing User-Related Tag Expansion Techniques[J]. Journal of Computer Science and Technology, 2012, 27(3): 554-566. DOI: 10.1007/s11390-012-1243-y

Improving Web Document Clustering through Employing User-Related Tag Expansion Techniques

Funds: This work is supported by the National Natural Science Foundation of China under Grant No. 61070111.
  • Author Bio:

    Peng Li received his B.E. degree from Dalian University of Technology, China in 2007. He is a Ph.D. candidate in the Institute of Computing Technology, Chinese Academy of Sciences, China. His research interests include analysis of user generated data and information retrieval.

  • Received Date: August 31, 2011
  • Revised Date: February 14, 2012
  • Published Date: May 04, 2012
  • Abstract: As high-quality descriptors of web page semantics, social annotations or tags have been used for web document clustering and have achieved promising results. However, most web pages have few tags (fewer than 10). This sparsity seriously limits the usage of tags for clustering. To overcome this problem, we propose a user-related tag expansion method that incorporates additional useful tags into the original tag document by utilizing user tagging data as background knowledge. Unfortunately, simply adding tags may cause topic drift, i.e., the dominant topic(s) of the original document may be changed. To tackle this problem, we have designed a novel generative model called Folk-LDA, which jointly models original and expanded tags as independent observations. Experimental results show that 1) our user-related tag expansion method can be effectively applied to over 90% of tagged web documents; 2) Folk-LDA can alleviate topic drift in expansion, especially for topic-specific documents; 3) the proposed tag-based clustering methods significantly outperform the word-based methods, which indicates that tags could be a better resource for the clustering task.
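
The abstract does not spell out how the user-related tag expansion is computed, so the following is only a minimal sketch under stated assumptions, not the authors' procedure. It assumes the tagging log is a list of (user, document, tag) triples, that candidate tags are those the document's own taggers applied to other documents, and that candidates are ranked by raw frequency. The names `ASSIGNMENTS`, `expand_tags`, and `top_k` are hypothetical. Original and expanded tags are returned separately, echoing the abstract's point that the two should be treated as distinct observations rather than merged.

```python
from collections import Counter

# Hypothetical toy tagging log: (user, document, tag) triples.
ASSIGNMENTS = [
    ("alice", "doc1", "python"),
    ("alice", "doc2", "python"),
    ("alice", "doc2", "tutorial"),
    ("bob", "doc1", "programming"),
    ("bob", "doc3", "programming"),
    ("bob", "doc3", "python"),
    ("carol", "doc4", "cooking"),
]


def expand_tags(doc, assignments, top_k=2):
    """Return (original_tags, expanded_tags) for one document.

    Assumption: expansion candidates are tags that users who tagged `doc`
    applied to *other* documents, ranked by how often they occur.
    """
    original = {tag for _, d, tag in assignments if d == doc}
    taggers = {user for user, d, _ in assignments if d == doc}

    # Count tags these users applied elsewhere (their background tagging data).
    counts = Counter(
        tag for user, d, tag in assignments if user in taggers and d != doc
    )

    # Drop tags the document already has; break frequency ties alphabetically.
    candidates = sorted(
        (tag for tag in counts if tag not in original),
        key=lambda tag: (-counts[tag], tag),
    )
    return original, candidates[:top_k]


if __name__ == "__main__":
    original, expanded = expand_tags("doc1", ASSIGNMENTS)
    print("original:", sorted(original))
    print("expanded:", expanded)
```

In this toy log, tags from users who never tagged the document (such as "cooking" from carol) are never candidates. A frequency ranking like this one does nothing to prevent topic drift, which is the problem the abstract says Folk-LDA is designed to address.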
