Crowd-Guided Entity Matching with Consolidated Textual Data

Zhi-Xu Li; Qiang Yang; An Liu; Guan-Feng Liu; Jia Zhu; Jia-Jie Xu; Kai Zheng; Min Zhang

doi:10.1007/s11390-017-1769-0

Zhi-Xu Li, Qiang Yang, An Liu, Guan-Feng Liu, Jia Zhu, Jia-Jie Xu, Kai Zheng, Min Zhang. Crowd-Guided Entity Matching with Consolidated Textual Data[J]. Journal of Computer Science and Technology, 2017, 32(5): 858-876. DOI: 10.1007/s11390-017-1769-0

Abstract

Entity matching (EM) identifies records that refer to the same entity within or across databases. Existing methods that rely only on structured attribute values (such as numeric, date, or short string values) may fail when the structured information is insufficient to reflect the matching relationships between records. An increasing number of databases now contain unstructured textual attributes holding extra consolidated textual information (CText for short) about each record, but little work has been done on using CText for EM. Conventional string similarity metrics such as edit distance or bag-of-words are unsuitable for measuring the similarity between CTexts, since each CText contains hundreds or thousands of words, and existing topic models do not work well either, since there are no obvious gaps between topics in a CText. In this paper, we propose a novel co-occurrence-based topic model to identify various sub-topics in each CText, and then measure the similarity between CTexts along these multiple sub-topic dimensions. To avoid overlooking important hidden sub-topics, we let the crowd help decide the weights of the different sub-topics for EM. Our empirical study on two real-world data sets, conducted on the Amazon Mechanical Turk crowdsourcing platform, shows that our method outperforms state-of-the-art EM methods and text understanding models.
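The idea of comparing two CTexts along several weighted sub-topic dimensions can be illustrated with a minimal sketch. It is not the paper's model: it assumes the words of each CText have already been assigned to sub-topics, substitutes plain bag-of-words cosine similarity within each sub-topic, and uses made-up weights in place of crowd-derived ones. All function names, sub-topic names, and data are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def weighted_ctext_similarity(subtopics_a, subtopics_b, weights):
    """Weighted average of per-sub-topic similarities.

    subtopics_a / subtopics_b: dict mapping a sub-topic name to the list of
    words from one CText assigned to that sub-topic (assumed given).
    weights: dict mapping a sub-topic name to a non-negative weight
    (a stand-in for weights decided with the crowd's help).
    """
    total = sum(weights.values())
    if total == 0:
        return 0.0
    score = 0.0
    for topic, weight in weights.items():
        words_a = Counter(subtopics_a.get(topic, []))
        words_b = Counter(subtopics_b.get(topic, []))
        score += weight * cosine(words_a, words_b)
    return score / total


# Hypothetical records: identical on "specs", disjoint on "shipping".
record_a = {"specs": ["16gb", "ram", "ssd"], "shipping": ["free", "delivery"]}
record_b = {"specs": ["16gb", "ram", "ssd"], "shipping": ["paid", "courier"]}
weights = {"specs": 3.0, "shipping": 1.0}

similarity = weighted_ctext_similarity(record_a, record_b, weights)
print(round(similarity, 3))
```

With these illustrative weights the "specs" sub-topic dominates the score, so two records that agree there still score highly even though their "shipping" text differs. This is the kind of effect that per-sub-topic weighting is meant to capture.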
