Journal of Computer Science and Technology ›› 2021, Vol. 36 ›› Issue (4): 822-838. DOI: 10.1007/s11390-021-1321-0
Special Section: Data Management and Data Mining
Chen-Chen Sun1,2, Member, CCF, and De-Rong Shen3, Senior Member, CCF
1. Background (Context)
Entity matching is an important step in data preprocessing and a key component of data integration. It groups records from data sources that describe the same real-world entity, and it is widely applied in healthcare, e-commerce, finance, crime investigation, and other domains. Traditional entity matching methods fall into two categories: rule-based and machine-learning-based. In recent years, the rise of deep learning has brought new opportunities to entity matching, and deep entity matching shows clear advantages over traditional methods: deep learning can provide end-to-end solutions that greatly reduce human involvement, and it can fully capture the semantic similarity of textual data. Although deep learning has accelerated entity matching research, current deep entity matching still leaves considerable room for improvement.
2. Objective
This paper aims to remedy the insufficient semantic similarity comparison in existing deep entity matching methods, which usually attend to only a single semantic level (token level or record level). In addition, under low-resource conditions deep entity matching faces a usability dilemma, which this paper addresses through transfer learning.
3. Method
This paper proposes MHN, a mixed hierarchical neural network framework for deep entity matching. MHN computes semantic similarities separately from the semantic information at different abstraction levels of the record hierarchy, and then aggregates these similarities for the matching task. It employs a set of attention mechanisms to build a hierarchical representation learning model for entity matching: self-attention captures internal dependencies, inter-attention performs alignment, and multi-dimensional weight attention differentiates importance. For deep entity matching under low-resource conditions, an adversarial domain adaptation method, DA-MHN, is proposed, which learns shared discriminative features by maximizing classifier discrepancy.
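The alignment role of inter-attention can be illustrated with a minimal NumPy sketch: each token embedding of one record attends over the tokens of the other record, producing an aligned counterpart that is then compared element-wise. The function names and the comparison features below are illustrative assumptions, not the paper's exact model.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def inter_attention(a, b):
    """Soft-align token embeddings a (m, d) with b (n, d):
    every token of record a attends over the tokens of record b,
    yielding an aligned b-summary for each a-token, which is then
    compared with the token itself."""
    scores = a @ b.T                   # (m, n) pairwise similarities
    weights = softmax(scores, axis=1)  # each row sums to 1
    aligned = weights @ b              # (m, d) aligned summary of b
    # element-wise comparison features (an illustrative choice)
    return np.concatenate([a, aligned, np.abs(a - aligned), a * aligned], axis=1)

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 8))   # 4 token embeddings of record A
b = rng.normal(size=(5, 8))   # 5 token embeddings of record B
features = inter_attention(a, b)
print(features.shape)         # (4, 32)
```

In a full model such features would be fed to a comparison network per abstraction level and then aggregated; the sketch only shows the alignment step itself.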
4. Results & Findings
Extensive comparative experiments on 10 datasets demonstrate the effectiveness of the proposed methods. MHN was evaluated on 7 datasets of 3 types and shows clear advantages over 7 existing methods. Ablation tests on each aspect of MHN verify the contribution of its individual components. DA-MHN was compared on 3 datasets, and the results show that it is a competitive solution for deep entity matching under low-resource conditions.
5. Conclusions
For the deep entity matching problem, this paper proposes the mixed hierarchical neural network framework MHN and the adversarial transfer learning method DA-MHN. By mining the hierarchical structure of records, MHN captures semantic similarities at different abstraction levels, including the detailed token level and the abstract attribute level, which remedies the insufficient semantic similarity computation in deep entity matching. DA-MHN transfers matching knowledge between different deep entity matching tasks by maximizing classifier discrepancy, which addresses the usability problem of deep entity matching under low-resource conditions.
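DA-MHN's adversarial objective follows the maximum-classifier-discrepancy idea: two task classifiers are trained to disagree on target-domain examples while the shared feature extractor is trained to minimize that disagreement, pushing target features toward regions where the classifiers agree. The following NumPy sketch shows only the discrepancy term; the L1 form and variable names are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def classifier_discrepancy(logits1, logits2):
    """Mean L1 distance between two classifiers' predicted
    match/non-match distributions on target-domain record pairs.
    In MCD-style training the classifiers maximize this quantity
    on target data while the feature extractor minimizes it."""
    p1 = softmax(logits1, axis=1)
    p2 = softmax(logits2, axis=1)
    return np.abs(p1 - p2).mean()

# identical classifier outputs -> zero discrepancy
z = np.array([[2.0, -1.0], [0.5, 0.5]])
print(classifier_discrepancy(z, z))       # 0.0
# disagreeing outputs -> positive discrepancy
print(classifier_discrepancy(z, -z) > 0)  # True
```

The alternating maximize/minimize schedule over this term is what makes the learned features both discriminative for matching and shared across the source and target tasks.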