Journal of Computer Science and Technology, 2018, Vol. 33, Issue (6): 1204-1218. doi: 10.1007/s11390-018-1882-8

Special Issue: Data Management and Data Mining


Modeling Topic-Based Human Expertise for Crowd Entity Resolution

Sai-Sai Gong1, Wei Hu1,*, Member, CCF, ACM, Wei-Yi Ge2, Yu-Zhong Qu1, Senior Member, CCF   

  1. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China;
  2. Science and Technology on Information Systems Engineering Laboratory, Nanjing 210007, China
  • Received: 2017-09-30 Revised: 2018-09-13 Online: 2018-11-15 Published: 2018-11-15
  • Contact: Wei Hu, E-mail: whu@nju.edu.cn
  • About author: Sai-Sai Gong is currently a Ph.D. student in the State Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, Nanjing. He received his B.S. degree in computer science and technology in 2009, and his M.S. degree in computer software and theory in 2012, both from Southeast University, Nanjing. His research interests include the Semantic Web, linked data browsing, and data integration.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China under Grant Nos. 61872172 and 61772264.

Entity resolution (ER) aims to identify whether two entities in an ER task refer to the same real-world thing. Crowd ER uses humans, in addition to machine algorithms, to obtain the truths of ER tasks. However, inaccurate or erroneous results are likely when humans give unreliable judgments. Previous studies have found that correctly estimating human accuracy or expertise in crowd ER is crucial to truth inference. Many of them, however, assume that humans have consistent expertise over all tasks, and ignore the fact that humans may have varied expertise on different topics (e.g., music versus sport). In this paper, we deal with crowd ER in the Semantic Web area. We identify multiple topics of ER tasks and model human expertise on different topics. Furthermore, we leverage similar task clustering to enhance the topic modeling and expertise estimation. We propose a probabilistic graphical model that computes ER task similarity, estimates human expertise, and infers the task truths in a unified framework. Our evaluation results on real-world and synthetic datasets show that, compared with several state-of-the-art approaches, our proposed model achieves higher accuracy on task truth inference and is more consistent with the real expertise of humans.
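To make the idea of topic-dependent expertise concrete, the following is a minimal Python sketch. It is not the probabilistic graphical model proposed in the paper: it assumes the topic of every task is already known (the paper learns topics and task similarity jointly), and it uses a simple alternating scheme in which task truths are inferred from accuracy-weighted votes and per-topic accuracies are then re-estimated. The function name `infer_truths`, the prior accuracy of 0.7, the smoothing rule, and the toy data are all illustrative assumptions.

```python
import math
from collections import defaultdict


def infer_truths(answers, task_topic, n_iter=20, prior_acc=0.7):
    """Toy topic-aware truth inference (illustrative, not the paper's model).

    answers:    list of (worker, task, label), label 1 = "match", 0 = "non-match"
    task_topic: dict mapping each task to its topic (assumed known here)
    Returns (truth_prob, accuracy): P(match) per task and the estimated
    accuracy per (worker, topic) pair.
    """
    accuracy = defaultdict(lambda: prior_acc)
    by_task = defaultdict(list)
    for worker, task, label in answers:
        by_task[task].append((worker, label))

    truth_prob = {}
    for _ in range(n_iter):
        # Step 1: infer P(match) for each task from log-odds weighted votes,
        # using each worker's accuracy on the topic of that task.
        for task, votes in by_task.items():
            topic = task_topic[task]
            log_odds = 0.0
            for worker, label in votes:
                acc = min(max(accuracy[(worker, topic)], 0.01), 0.99)
                weight = math.log(acc / (1.0 - acc))
                log_odds += weight if label == 1 else -weight
            truth_prob[task] = 1.0 / (1.0 + math.exp(-log_odds))

        # Step 2: re-estimate per-topic accuracy as the expected agreement
        # with the current truth estimates, smoothed toward the prior.
        agree = defaultdict(float)
        count = defaultdict(float)
        for worker, task, label in answers:
            key = (worker, task_topic[task])
            p = truth_prob[task]
            agree[key] += p if label == 1 else 1.0 - p
            count[key] += 1.0
        for key in count:
            accuracy[key] = (agree[key] + prior_acc) / (count[key] + 1.0)

    return truth_prob, dict(accuracy)


# Hypothetical toy data: worker A is reliable on music tasks but not on sport.
task_topic = {"m1": "music", "m2": "music", "m3": "music", "m4": "music",
              "s1": "sport", "s2": "sport", "s3": "sport", "s4": "sport"}
answers = [
    ("A", "m1", 1), ("B", "m1", 1), ("C", "m1", 1),
    ("A", "m2", 0), ("B", "m2", 0), ("C", "m2", 0),
    ("A", "m3", 1), ("B", "m3", 1), ("C", "m3", 0),
    ("A", "m4", 0), ("B", "m4", 0), ("C", "m4", 1),
    ("A", "s1", 0), ("B", "s1", 1), ("C", "s1", 1),
    ("A", "s2", 1), ("B", "s2", 0), ("C", "s2", 0),
    ("A", "s3", 1), ("B", "s3", 1), ("C", "s3", 1),
    ("A", "s4", 0), ("B", "s4", 0), ("C", "s4", 0),
]
truth_prob, accuracy = infer_truths(answers, task_topic)
```

On this toy data the estimated accuracy of worker A is higher for music than for sport, which a single per-worker accuracy could not express. A model that assumes consistent expertise over all tasks would average the two and weight A's votes identically on both topics.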

Key words: entity resolution; crowdsourcing; human expertise; topic modeling; task similarity


ISSN 1000-9000 (Print), 1860-4749 (Online)
CN 11-2296/TP
