Crowd-Guided Entity Matching with Consolidated Textual Data
Journal of Computer Science and Technology
Indexed by   SCIE, EI ...
Bimonthly    Since 1986
Journal of Computer Science and Technology 2017, Vol. 32 Issue (5) :858-876    DOI: 10.1007/s11390-017-1769-0
Special Section on Crowdsourced Data Management
Crowd-Guided Entity Matching with Consolidated Textual Data
Zhi-Xu Li (1,2), Member, CCF; Qiang Yang (1); An Liu (1,*), Member, CCF; Guan-Feng Liu (1), Member, CCF; Jia Zhu (3), Member, CCF; Jia-Jie Xu (1), Member, CCF; Kai Zheng (1,4), Member, CCF; Min Zhang (1), Member, CCF
1 School of Computer Science and Technology, Soochow University, Suzhou 215006, China;
2 Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou 510006, China;
3 School of Computer, South China Normal University, Guangzhou 510631, China;
4 Beijing Key Laboratory of Big Data Management and Analysis Methods, Beijing 100872, China

Abstract: Entity Matching (EM) identifies records referring to the same entity within or across databases. Existing methods that use only structured attribute values (such as numeric, date, or short string values) may fail when the structured information is not enough to reflect the matching relationships between records. Nowadays more and more databases have some unstructured textual attribute containing extra Consolidated Textual information (CText for short) about the record, but little work has been done on using the CText information for EM. Conventional string similarity metrics such as edit distance or bag-of-words are unsuitable for measuring the similarities between CTexts since each CText contains hundreds or thousands of words, while existing topic models do not work well either since there are no obvious gaps between topics in a CText. In this paper, we propose a novel co-occurrence-based topic model to identify various sub-topics from each CText, and then measure the similarity between CTexts on the multiple sub-topic dimensions. To avoid ignoring some hidden but important sub-topics, we let the crowd help us decide the weights of different sub-topics in doing EM. Our empirical study on two real-world data sets, based on the Amazon Mechanical Turk crowdsourcing platform, shows that our method outperforms the state-of-the-art EM methods and Text Understanding models.
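The idea of comparing CTexts along several sub-topic dimensions, with weights informed by the crowd, can be illustrated with a minimal sketch. This is not the model proposed in the paper: it assumes each CText has already been split into named sub-topics, it uses plain Jaccard similarity per sub-topic as a stand-in measure, and the function names, sub-topic names, and weight values are all hypothetical.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def ctext_similarity(
    ctext_a: Dict[str, Set[str]],
    ctext_b: Dict[str, Set[str]],
    weights: Dict[str, float],
) -> float:
    """Weighted average of per-sub-topic similarities.

    Assumption: `weights` maps each sub-topic to a non-negative weight,
    e.g. one derived from crowd answers. A sub-topic missing from a
    CText is treated as an empty token set.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    score = 0.0
    for topic, weight in weights.items():
        sim = jaccard(ctext_a.get(topic, set()), ctext_b.get(topic, set()))
        score += weight * sim
    return score / total


# Hypothetical records: two product descriptions split into sub-topics.
record_a = {"brand": {"acme"}, "specs": {"16gb", "ssd", "i7"}}
record_b = {"brand": {"acme"}, "specs": {"16gb", "hdd", "i7"}}
crowd_weights = {"brand": 1.0, "specs": 3.0}

print(ctext_similarity(record_a, record_b, crowd_weights))  # 0.625
```

Here the two records agree fully on `brand` (similarity 1.0) and share two of four distinct `specs` tokens (similarity 0.5), so the weighted score is (1.0 × 1.0 + 3.0 × 0.5) / 4.0 = 0.625. Raising the weight of a sub-topic makes disagreement on it count more toward the final matching decision.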
Keywords: Entity Matching, Consolidated Textual Data, Crowdsourcing
Received 2017-03-01
Fund:

This research is partially supported by the National Natural Science Foundation of China under Grant Nos. 61632016, 61402313, 61303019, 61472263, and 61572336, the Postdoctoral Scientific Research Funding of Jiangsu Province of China under Grant No. 1501090B, the National Postdoctoral Funding of China under Grant Nos. 2015M581859 and 2016T90493, and the Open Foundation of Guangdong Key Laboratory of Big Data Analysis and Processing of China under Grant No. 2017012.

Corresponding Author: An Liu, Email: anliu@suda.edu.cn
About the author: Zhi-Xu Li is an associate professor in the School of Computer Science and Technology at Soochow University, Suzhou. His research interests include data cleaning, big data applications, information extraction and retrieval, machine learning, deep learning, knowledge graphs, and crowdsourcing.
Cite this article:   
Zhi-Xu Li, Qiang Yang, An Liu, Guan-Feng Liu, Jia Zhu, Jia-Jie Xu, Kai Zheng, Min Zhang. Crowd-Guided Entity Matching with Consolidated Textual Data[J]. Journal of Computer Science and Technology, 2017, 32(5): 858-876
URL:  
http://jcst.ict.ac.cn:8080/jcst/EN/10.1007/s11390-017-1769-0
Copyright 2010 by Journal of Computer Science and Technology