Tao-Yuan Cheng, Shan Wang. A Novel Approach to Clustering Merchandise Records[J]. Journal of Computer Science and Technology, 2007, 22(2): 228-231.

A Novel Approach to Clustering Merchandise Records

More Information
  • Received Date: April 30, 2006
  • Revised Date: January 10, 2007
  • Published Date: March 14, 2007
  • Object identification is one of the major challenges in integrating data from multiple information sources. Because global identifiers are lacking, it is hard to find all records that refer to the same object in an integrated database. Traditional object identification techniques tend to rely on character-based or vector space model-based similarity measures, but these do not work well on merchandise databases. This paper proposes a new approach to object identification. First, we use merchandise images to judge whether two records refer to the same object; then, we use a Naïve Bayesian model to judge whether two merchandise names have similar meanings. Experiments on data downloaded from shopping websites show good performance.
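
The two-stage idea in the abstract (compare images first, then compare names with a Naïve Bayesian model) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the exact-hash image comparison, the two binary name-pair features, the way the two stages are combined, the record fields `name` and `image`, and the invented toy product names.

```python
import hashlib
import math
from collections import defaultdict


def image_fingerprint(image_bytes):
    """Stand-in image comparison: an exact SHA-256 hash of the raw bytes (assumption)."""
    return hashlib.sha256(image_bytes).hexdigest() if image_bytes else None


def pair_features(name_a, name_b):
    """Two binary features for a pair of names (hypothetical feature choice)."""
    ta, tb = set(name_a.lower().split()), set(name_b.lower().split())
    union = ta | tb
    jaccard = len(ta & tb) / len(union) if union else 0.0
    # Tokens containing a digit are treated as likely model numbers.
    models_a = {t for t in ta if any(c.isdigit() for c in t)}
    models_b = {t for t in tb if any(c.isdigit() for c in t)}
    return (jaccard >= 0.5, bool(models_a & models_b))


class PairNaiveBayes:
    """Bernoulli naive Bayes over name-pair features, with add-one smoothing."""

    def __init__(self):
        self.class_counts = {True: 0, False: 0}
        self.feature_counts = {True: defaultdict(int), False: defaultdict(int)}

    def fit(self, labeled_pairs):
        for name_a, name_b, label in labeled_pairs:
            self.class_counts[label] += 1
            for i, value in enumerate(pair_features(name_a, name_b)):
                if value:
                    self.feature_counts[label][i] += 1
        return self

    def _log_score(self, feats, label):
        total = sum(self.class_counts.values())
        n = self.class_counts[label]
        score = math.log((n + 1) / (total + 2))  # smoothed class prior
        for i, value in enumerate(feats):
            p = (self.feature_counts[label][i] + 1) / (n + 2)
            score += math.log(p if value else 1.0 - p)
        return score

    def same_meaning(self, name_a, name_b):
        feats = pair_features(name_a, name_b)
        return self._log_score(feats, True) > self._log_score(feats, False)


def same_object(rec_a, rec_b, model):
    """Stage 1: matching images decide; stage 2: fall back to the name model (assumed combination)."""
    fa, fb = image_fingerprint(rec_a["image"]), image_fingerprint(rec_b["image"])
    if fa is not None and fa == fb:
        return True
    return model.same_meaning(rec_a["name"], rec_b["name"])


# Invented toy training pairs: (name_a, name_b, names_have_similar_meaning)
TOY_PAIRS = [
    ("Acme X100 phone", "Acme X100 mobile phone", True),
    ("Bolt A540 camera", "Bolt Pro A540 camera", True),
    ("Acme X100 phone", "Bolt A540 camera", False),
    ("Zeta T10 camera", "Acme 6300 phone", False),
]
```

A model would be trained with `PairNaiveBayes().fit(TOY_PAIRS)` and then passed to `same_object` together with two record dictionaries. A real system would presumably need a tolerant image comparison (resized or recompressed images do not hash identically) and richer name features than the two used here.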
