Journal of Computer Science and Technology, 2021, Vol. 36, Issue (4): 822-838. doi: 10.1007/s11390-021-1321-0

Special Issue: Data Management and Data Mining

• Special Section on AI4DB and DB4AI

Mixed Hierarchical Networks for Deep Entity Matching

Chen-Chen Sun1,2, Member, CCF, and De-Rong Shen3, Senior Member, CCF        

  1. Engineering Research Center of Learning-Based Intelligent System (Ministry of Education), Tianjin University of Technology, Tianjin 300384, China;
  2. School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384, China;
  3. School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
  • Received: 2021-02-01; Revised: 2021-07-12; Online: 2021-07-05; Published: 2021-07-30
  • About author: Chen-Chen Sun is a lecturer in the Engineering Research Center of Learning-Based Intelligent System (Ministry of Education) and the School of Computer Science and Engineering, Tianjin University of Technology, Tianjin. He received his Ph.D. degree in computer science from Northeastern University, Shenyang, in 2017. He is a member of CCF. His research interests include entity resolution and anomaly detection.
  • Supported by:
    The work was supported by the National Natural Science Foundation of China under Grant Nos. 62002262, 61672142, 61602103, 62072086 and 62072084, and the National Key Research and Development Project of China under Grant No. 2018YFB1003404.

Entity matching is a fundamental problem in data integration: it groups records according to the real-world entities they refer to. Entity matching is increasingly performed with deep learning techniques. We design mixed hierarchical deep neural networks (MHN) for entity matching, which exploit semantics at different levels of abstraction in the internal hierarchy of a record. A family of attention mechanisms is used at different stages of entity matching. Self-attention focuses on internal dependencies, inter-attention targets alignments, and multi-perspective weight attention discriminates importance. In particular, hybrid soft token alignment is proposed to address corrupted data. Attribute order is considered for the first time in deep entity matching. Then, to reduce the need for labeled training data, we propose an adversarial domain adaption approach (DA-MHN) that transfers matching knowledge between different entity matching tasks by maximizing classifier discrepancy. Finally, we conduct comprehensive experimental evaluations on 10 datasets (seven for MHN and three for DA-MHN), which demonstrate the superiority of the two proposed approaches. MHN clearly outperforms previous studies in accuracy, and each component of MHN is also tested. DA-MHN greatly surpasses existing studies in transferability.
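To make two of the ideas named in the abstract more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes token embeddings are given as lists of floats, uses dot-product scores for inter-attention (one common choice), and measures classifier discrepancy as the mean absolute difference between two classifiers' predicted probabilities (also one common definition). The function names `soft_align` and `classifier_discrepancy` are hypothetical and do not come from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def soft_align(tokens_a, tokens_b):
    """Inter-attention sketch (assumed dot-product scoring).

    For each token vector of record A, return an attention-weighted
    average of the token vectors of record B, i.e. a "soft" counterpart
    of that token in the other record.
    """
    dim = len(tokens_b[0])
    aligned = []
    for ta in tokens_a:
        weights = softmax([dot(ta, tb) for tb in tokens_b])
        aligned.append(
            [sum(w * tb[d] for w, tb in zip(weights, tokens_b)) for d in range(dim)]
        )
    return aligned


def classifier_discrepancy(probs1, probs2):
    """Mean absolute difference between two classifiers' class probabilities.

    probs1 and probs2 are lists of per-example probability vectors.
    In adversarial domain adaptation this quantity is alternately
    maximized (by the classifiers) and minimized (by the feature
    extractor) on unlabeled target data.
    """
    n_examples = len(probs1)
    n_classes = len(probs1[0])
    total = sum(
        abs(p - q)
        for row1, row2 in zip(probs1, probs2)
        for p, q in zip(row1, row2)
    )
    return total / (n_examples * n_classes)


# Toy usage with 2-dimensional "embeddings" (illustrative values only).
aligned = soft_align([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
disc = classifier_discrepancy([[0.9, 0.1]], [[0.6, 0.4]])
```

In the toy usage, the token from record A attends more strongly to the first token of record B because their dot product is larger, so the aligned vector leans toward that token. The two classifiers disagree by 0.3 on each class, so the discrepancy is 0.3.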

Key words: entity matching; attention mechanism; mixed hierarchical neural network (MHN); domain adaption; data integration


ISSN 1000-9000 (Print), 1860-4749 (Online)
CN 11-2296/TP
