
A Novel Approach to Clustering Merchandise Records

Tao-Yuan Cheng and Shan Wang   

  1. School of Information, Renmin University of China, Beijing 100872, China
  • Received: 2006-05-01 Revised: 2007-01-11 Online: 2007-03-10 Published: 2007-03-10

Object identification is one of the major challenges in integrating data from multiple information sources. Because global identifiers are lacking, it is hard to find all records that refer to the same object in an integrated database. Traditional object identification techniques tend to judge record similarity with character-based or vector space model-based measures, but these do not work well on merchandise databases. This paper proposes a new approach to object identification. First, we use merchandise images to judge whether two records refer to the same object; then, we use a naïve Bayesian model to judge whether two merchandise names have similar meanings. Experiments on data downloaded from shopping websites show good performance.
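
The two-stage idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's image comparison method, feature design, or training data, so everything below is an assumption made for illustration: records are taken to carry a hypothetical precomputed `image_hash`, name pairs are turned into "shared token" and "only in one name" features, and a small naive Bayes classifier with add-one smoothing is trained on a handful of invented labeled pairs.

```python
import math
from collections import Counter


def pair_features(name_a, name_b):
    """Hypothetical features: tokens both names share, and tokens only one has."""
    a, b = set(name_a.lower().split()), set(name_b.lower().split())
    return (["shared:" + t for t in sorted(a & b)]
            + ["only:" + t for t in sorted(a ^ b)])


class NaiveBayesMatcher:
    """Naive Bayes over pair features; label True means 'same merchandise'."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = {True: Counter(), False: Counter()}
        self.vocab = set()

    def fit(self, labeled_pairs):
        # labeled_pairs must contain at least one True and one False example.
        for name_a, name_b, is_match in labeled_pairs:
            self.class_counts[is_match] += 1
            for feat in pair_features(name_a, name_b):
                self.feature_counts[is_match][feat] += 1
                self.vocab.add(feat)
        return self

    def log_score(self, feats, label):
        # log P(label) + sum of log P(feature | label), with add-one smoothing;
        # the extra +1 in the denominator reserves mass for unseen features.
        total = sum(self.feature_counts[label].values())
        denom = total + len(self.vocab) + 1
        score = math.log(self.class_counts[label] / sum(self.class_counts.values()))
        for feat in feats:
            score += math.log((self.feature_counts[label][feat] + 1) / denom)
        return score

    def predict(self, name_a, name_b):
        feats = pair_features(name_a, name_b)
        return self.log_score(feats, True) > self.log_score(feats, False)


def same_object(rec_a, rec_b, matcher):
    """Stage 1: compare image fingerprints. Stage 2: fall back to the names."""
    hash_a, hash_b = rec_a.get("image_hash"), rec_b.get("image_hash")
    if hash_a is not None and hash_a == hash_b:
        return True
    return matcher.predict(rec_a["name"], rec_b["name"])


# Invented training pairs, for illustration only (not the paper's data).
TRAINING_PAIRS = [
    ("nokia 6230 phone", "nokia 6230 mobile phone", True),
    ("canon ixus 50 camera", "canon ixus 50 digital camera", True),
    ("nokia 6230 phone", "nokia 3310 phone", False),
    ("canon ixus 50 camera", "canon ixus 700 camera", False),
]
matcher = NaiveBayesMatcher().fit(TRAINING_PAIRS)
```

Calling `same_object(rec_a, rec_b, matcher)` on two record dictionaries returns a boolean. Exact fingerprint equality is the simplest possible stand-in for image matching; a real system would more likely need a near-duplicate image measure.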


ISSN 1000-9000(Print)

CN 11-2296/TP

Journal of Computer Science and Technology