Journal of Computer Science and Technology
Journal of Computer Science and Technology 2017, Vol. 32 Issue (5) :858-876    DOI: 10.1007/s11390-017-1769-0
Special Section on Crowdsourced Data Management
Crowd-Guided Entity Matching with Consolidated Textual Data
Zhi-Xu Li 1,2 (Member, CCF), Qiang Yang 1, An Liu 1,* (Member, CCF), Guan-Feng Liu 1 (Member, CCF), Jia Zhu 3 (Member, CCF), Jia-Jie Xu 1 (Member, CCF), Kai Zheng 1,4 (Member, CCF), Min Zhang 1 (Member, CCF)
1 School of Computer Science and Technology, Soochow University, Suzhou 215006, China;
2 Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou 510006, China;
3 School of Computer, South China Normal University, Guangzhou 510631, China;
4 Beijing Key Laboratory of Big Data Management and Analysis Methods, Beijing 100872, China

Abstract: Entity Matching (EM) identifies records referring to the same entity within or across databases. Existing methods that rely only on structured attribute values (such as numeric, date, or short string values) may fail when the structured information is insufficient to reflect the matching relationships between records. More and more databases now have an unstructured textual attribute containing extra Consolidated Textual information (CText for short) about the record, but little work has been done on using CText for EM. Conventional string similarity metrics such as edit distance or bag-of-words are unsuitable for measuring the similarity between CTexts, since each CText contains hundreds or thousands of words, and existing topic models do not work well either, since there are no obvious gaps between topics within a CText. In this paper, we propose a novel co-occurrence-based topic model to identify various sub-topics in each CText, and then measure the similarity between CTexts along these sub-topic dimensions. To avoid overlooking hidden but important sub-topics, we let the crowd help decide the weights of the different sub-topics in EM. Our empirical study on two real-world data sets, conducted on the Amazon Mechanical Turk crowdsourcing platform, shows that our method outperforms state-of-the-art EM methods and text understanding models.
Keywords: Entity Matching; Consolidated Textual Data; CTextEM; Crowdsourcing
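The idea of comparing two CTexts along weighted sub-topic dimensions can be illustrated with a minimal sketch. This is not the model proposed in the paper: it assumes the sub-topics have already been extracted into term lists, uses plain cosine similarity per sub-topic, and treats the weights as if they had been supplied by crowd workers. The function names, sub-topic labels, and weight values are all hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def weighted_subtopic_similarity(rec_a, rec_b, weights):
    """Weighted average of per-sub-topic cosine similarities.

    rec_a, rec_b: dict mapping a sub-topic label to a list of terms
                  (assumed output of some sub-topic extraction step).
    weights:      dict mapping a sub-topic label to a non-negative weight
                  (assumed here to come from crowd judgments).
    """
    total = sum(weights.values())
    if total == 0:
        return 0.0
    score = 0.0
    for topic, w in weights.items():
        sim = cosine(Counter(rec_a.get(topic, [])),
                     Counter(rec_b.get(topic, [])))
        score += w * sim
    return score / total


# Hypothetical records: same education terms, different research terms.
rec_a = {"education": ["phd", "soochow"], "research": ["data", "cleaning"]}
rec_b = {"education": ["phd", "soochow"], "research": ["image", "retrieval"]}
crowd_weights = {"education": 3.0, "research": 1.0}
score = weighted_subtopic_similarity(rec_a, rec_b, crowd_weights)
```

With these made-up weights the matching "education" sub-topic dominates the score; shifting weight toward "research" would pull the score toward zero, which is the kind of decision the abstract delegates to the crowd.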
Received: 2017-03-01
Funding:

This research is partially supported by the National Natural Science Foundation of China under Grant Nos. 61632016, 61402313, 61303019, 61472263, and 61572336, the Postdoctoral Scientific Research Funding of Jiangsu Province of China under Grant No. 1501090B, the National Postdoctoral Funding of China under Grant Nos. 2015M581859 and 2016T90493, and the Open Foundation of Guangdong Key Laboratory of Big Data Analysis and Processing of China under Grant No. 2017012.

Corresponding author: An Liu, Email: anliu@suda.edu.cn
About the author: Zhi-Xu Li is an associate professor in the School of Computer Science and Technology at Soochow University, Suzhou. His research interests include data cleaning, big data applications, information extraction and retrieval, machine learning, deep learning, knowledge graphs, and crowdsourcing.
Cite this article:
Zhi-Xu Li, Qiang Yang, An Liu, Guan-Feng Liu, Jia Zhu, Jia-Jie Xu, Kai Zheng, Min Zhang. Crowd-Guided Entity Matching with Consolidated Textual Data[J]. Journal of Computer Science and Technology, 2017, V32(5): 858-876
Link to this article:
http://jcst.ict.ac.cn:8080/jcst/CN/10.1007/s11390-017-1769-0