2017, Vol. 32, Issue (5): 858-876. doi: 10.1007/s11390-017-1769-0

Special Issue: Artificial Intelligence and Pattern Recognition; Data Management and Data Mining

• Special Section on Crowdsourced Data Management

Crowd-Guided Entity Matching with Consolidated Textual Data

Zhi-Xu Li1,2, Member, CCF, Qiang Yang1, An Liu1,*, Member, CCF, Guan-Feng Liu1, Member, CCF, Jia Zhu3, Member, CCF, Jia-Jie Xu1, Member, CCF, Kai Zheng1,4, Member, CCF, Min Zhang1, Member, CCF   

  1 School of Computer Science and Technology, Soochow University, Suzhou 215006, China;
  2 Guangdong Key Laboratory of Big Data Analysis and Processing, Guangzhou 510006, China;
  3 School of Computer, South China Normal University, Guangzhou 510631, China;
  4 Beijing Key Laboratory of Big Data Management and Analysis Methods, Beijing 100872, China
  • Received: 2017-03-01 Revised: 2017-08-09 Online: 2017-09-05 Published: 2017-09-05
  • Contact: An Liu, E-mail: anliu@suda.edu.cn
  • About author: Zhi-Xu Li is an associate professor in the School of Computer Science and Technology at Soochow University, Suzhou. His research interests include data cleaning, big data applications, information extraction and retrieval, machine learning, deep learning, knowledge graphs, and crowdsourcing.
  • Supported by:

    This research is partially supported by the National Natural Science Foundation of China under Grant Nos. 61632016, 61402313, 61303019, 61472263, and 61572336, the Postdoctoral Scientific Research Funding of Jiangsu Province of China under Grant No. 1501090B, the National Postdoctoral Funding of China under Grant Nos. 2015M581859 and 2016T90493, and the Open Foundation of Guangdong Key Laboratory of Big Data Analysis and Processing of China under Grant No. 2017012.

Entity Matching (EM) identifies records referring to the same entity within or across databases. Existing methods that use only structured attribute values (such as numeric, date, or short string values) may fail when the structured information is not enough to reflect the matching relationships between records. Nowadays, more and more databases have an unstructured textual attribute containing extra Consolidated Textual information (CText for short) about the record, but little work has been done on using the CText information for EM. Conventional string similarity metrics such as edit distance or bag-of-words are unsuitable for measuring the similarities between CTexts since each CText contains hundreds or thousands of words, while existing topic models do not work well either since there are no obvious gaps between topics in a CText. In this paper, we propose a novel cooccurrence-based topic model to identify various sub-topics from each CText, and then measure the similarity between CTexts on the multiple sub-topic dimensions. To avoid ignoring some hidden but important sub-topics, we let the crowd help us decide the weights of different sub-topics in doing EM. Our empirical study on two real-world data sets, based on the Amazon Mechanical Turk crowdsourcing platform, shows that our method outperforms the state-of-the-art EM methods and Text Understanding models.
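
The idea of comparing two CTexts along several sub-topic dimensions, with a weight per sub-topic, can be illustrated with a minimal sketch. The abstract does not specify the similarity function or how crowd answers become weights, so everything below is an assumption made for illustration: the per-sub-topic word sets stand in for the output of a topic model, Jaccard similarity is a swapped-in stand-in measure, and the names `jaccard` and `weighted_subtopic_similarity` are hypothetical.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two word sets; defined here as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def weighted_subtopic_similarity(
    ctext_a: Dict[str, Set[str]],
    ctext_b: Dict[str, Set[str]],
    weights: Dict[str, float],
) -> float:
    """Weighted average of per-sub-topic similarities.

    ctext_a / ctext_b map a sub-topic label to the words assigned to it
    (assumed to come from some upstream topic model, not shown here).
    weights maps a sub-topic label to a non-negative importance, e.g. one
    derived from crowd answers (how they are derived is not modeled here).
    """
    total = sum(weights.values())
    if total == 0:
        return 0.0
    score = 0.0
    for topic, w in weights.items():
        # A sub-topic missing from one record contributes an empty word set.
        score += w * jaccard(ctext_a.get(topic, set()), ctext_b.get(topic, set()))
    return score / total


# Hypothetical example: two product records with two sub-topics each.
record_a = {"brand": {"canon"}, "lens": {"zoom", "wide"}}
record_b = {"brand": {"canon"}, "lens": {"zoom", "macro"}}
crowd_weights = {"brand": 3.0, "lens": 1.0}
similarity = weighted_subtopic_similarity(record_a, record_b, crowd_weights)
```

Normalizing by the total weight keeps the score in [0, 1], so a single matching threshold could be applied regardless of how many sub-topics are weighted.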

ISSN 1000-9000 (Print), 1860-4749 (Online)
CN 11-2296/TP

Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences