Journal of Computer Science and Technology, 2020, Vol. 35, Issue (4): 739-750. doi: 10.1007/s11390-020-0139-5

Special Issue: Data Management and Data Mining

• Special Section on Entity Resolution

DEM: Deep Entity Matching Across Heterogeneous Information Networks

Chao Kong*, Member, CCF, Bao-Xiang Chen*, Li-Ping Zhang        

  1. School of Computer and Information, Anhui Polytechnic University, Wuhu 241000, China
  • Received: 2020-01-20 Revised: 2020-06-03 Online: 2020-07-20 Published: 2020-07-20
  • Contact: Chao Kong, Bao-Xiang Chen. E-mail: kongchao@ahpu.edu.cn; 3140205325@stu.ahpu.edu.cn
  • About author: Chao Kong received his Ph.D. degree in software engineering from the Institute for Data Science and Engineering, East China Normal University, Shanghai, in 2017. He is a lecturer at the School of Computer and Information, Anhui Polytechnic University (AHPU), Wuhu. His research interests include web data management, streaming data processing, social network analysis, and data mining.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China Youth Fund under Grant No. 61902001.

Heterogeneous information networks, which consist of multi-typed vertices representing objects and multi-typed edges representing relations between objects, are ubiquitous in the real world. In this paper, we study the problem of entity matching for heterogeneous information networks based on distributed network embedding and a multi-layer perceptron with a highway network, and we propose a new method named DEM (short for Deep Entity Matching). In contrast to traditional entity matching methods, DEM uses the multi-layer perceptron with a highway network to explore hidden relations and thereby improve matching performance. Importantly, we combine DEM with the network embedding methodology, which enables highly efficient computation in a vectorized manner. Because DEM models both the network structure and the entity attributes generically, it can flexibly model various heterogeneous information networks. To illustrate its functionality, we apply the DEM algorithm to two real-world entity matching applications: user linkage in the social network analysis scenario, which predicts the same or matched users across different social platforms, and record linkage, which predicts the same or matched records across different citation networks. Extensive experiments on real-world datasets demonstrate DEM's effectiveness and rationality.

Key words: heterogeneous information network; entity matching; network embedding; multi-layer perceptron
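
The abstract describes scoring candidate entity pairs by passing network embeddings through a multi-layer perceptron with a highway network. The abstract does not give the architecture, so the following is only a minimal sketch under stated assumptions: it uses the standard highway-layer form y = T(x) * H(x) + (1 - T(x)) * x, concatenates the two entity embeddings as input, and ends in a logistic output. All function names, parameter names, the tanh/sigmoid choices, and the toy embeddings are hypothetical and are not taken from the paper.

```python
import math
import random


def sigmoid(z):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, x):
    """Return W x + b, where weights is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def highway_layer(x, w_h, b_h, w_t, b_t):
    """Standard highway layer: y = T(x) * H(x) + (1 - T(x)) * x.

    H is a tanh transform and T is a sigmoid gate (both are assumptions).
    """
    h = [math.tanh(v) for v in affine(w_h, b_h, x)]
    t = [sigmoid(v) for v in affine(w_t, b_t, x)]
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


def init_params(dim, seed=0):
    """Randomly initialise the (hypothetical) parameters for input size dim."""
    rng = random.Random(seed)

    def matrix():
        return [[rng.uniform(-0.5, 0.5) for _ in range(dim)]
                for _ in range(dim)]

    return {
        "w_h": matrix(), "b_h": [0.0] * dim,
        # A negative gate bias makes the layer start close to the identity.
        "w_t": matrix(), "b_t": [-1.0] * dim,
        "w_out": [rng.uniform(-0.5, 0.5) for _ in range(dim)],
        "b_out": 0.0,
    }


def match_score(emb_a, emb_b, params):
    """Score a candidate pair of entity embeddings as a value in (0, 1)."""
    x = list(emb_a) + list(emb_b)  # concatenate the two embeddings
    y = highway_layer(x, params["w_h"], params["b_h"],
                      params["w_t"], params["b_t"])
    logit = sum(w * v for w, v in zip(params["w_out"], y)) + params["b_out"]
    return sigmoid(logit)


if __name__ == "__main__":
    # Toy embeddings for two user accounts on different platforms.
    user_a = [0.2, -0.1, 0.4]
    user_b = [0.25, -0.05, 0.35]
    params = init_params(dim=len(user_a) + len(user_b))
    print(match_score(user_a, user_b, params))
```

The parameters here are untrained, so the printed score is meaningless in itself; in practice they would be learned from labeled matched and unmatched pairs. The gate T lets each dimension either pass the input embedding through unchanged or replace it with the transformed value.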

