Journal of Computer Science and Technology ›› 2020, Vol. 35 ›› Issue (4): 739-750. doi: 10.1007/s11390-020-0139-5
Special Topic: Data Management and Data Mining
Chao Kong*, Member, CCF, Bao-Xiang Chen*, Li-Ping Zhang
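1. Background
Information networks focus on the interactions between objects and are an abstraction of the real world. This level of abstraction is powerful enough to express and store the essential information of the real world and, by exploiting link information, also provides a useful tool for mining knowledge from it. The massive data on today's Internet platforms is largely fragmented: it carries different attributes and is interrelated, forming heterogeneous information networks made up of nodes of different types and edges that express different relations. Given this trend toward fragmentation, only by matching, linking, and stitching fragmented data together can Internet platforms truly serve as "social sensors".
Designing an effective way to stitch this fragmented data together, that is, an entity matching method, has become a shared concern of academia and industry. Entity matching aims to find the same entities in different data sources, a key problem in data cleaning, data mining, and related fields. Research on this problem dates back as far as the 1940s, and after a long period of development entity matching is now widely used in data integration, knowledge acquisition, and user profiling. Nodes and edges in heterogeneous information networks are diverse in type, strongly interrelated, and semantically sparse. How to match this fragmented data accurately and efficiently, and so realize its value, is therefore a pressing problem, and it is the challenge this paper addresses.
Traditional entity matching methods only extract features from user-generated text and ignore the network-structural associations among attributes, which limits matching quality. They also hit computational bottlenecks on large-scale networks. This paper therefore proposes a new deep-learning-based entity matching algorithm across heterogeneous information networks. It combines highway networks and multilayer perceptrons to mine more of the latent relations in heterogeneous information networks, improving matching performance. It also uses network embedding to represent objects as dense, real-valued, low-dimensional vectors so that computation can be vectorized, improving execution efficiency.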
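For readers unfamiliar with highway networks, the following is a minimal sketch of a single highway layer in plain Python, using the standard formulation y = T(x) * H(x) + (1 - T(x)) * x. The function names, the choice of tanh for H, and the list-of-rows weight layout are assumptions made for illustration; the abstract does not state how DEM configures its highway layers.

```python
import math


def sigmoid(z):
    """Logistic function used for the transform gate."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, x):
    """Compute W x + b, where W is a square matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def highway_layer(x, w_h, b_h, w_t, b_t):
    """One highway layer: y = T(x) * H(x) + (1 - T(x)) * x.

    H is a tanh-activated affine transform (an illustrative choice) and
    T is a sigmoid gate deciding how much of H(x) replaces the input.
    """
    h = [math.tanh(v) for v in affine(w_h, b_h, x)]
    t = [sigmoid(v) for v in affine(w_t, b_t, x)]
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]
```

A strongly negative gate bias `b_t` makes the layer behave almost like the identity, which is the property that lets deep stacks of such layers remain trainable.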
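2. Objective
This paper aims to design a deep-learning-based entity matching algorithm across heterogeneous information networks that uses network structure information to capture the rich latent relations among entities and combines it with existing semantic information to find the same entities or objects in different data sources.
3. Method
This paper proposes a deep-learning-based entity matching algorithm across heterogeneous information networks: Deep Entity Matching (DEM). The method has three steps: (1) build the heterogeneous information networks, treating each attribute in a record as a node, establishing relations between attributes, and constructing (head, relation, tail) triples; (2) use a network embedding method to obtain an embedding vector for every node in each network; (3) block the nodes by attribute, compute similarity vectors for nodes under the same attribute, and feed them to a multilayer perceptron, which yields the final sets of "matching" and "non-matching" entities.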
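Steps (1) and (3) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the node naming scheme `attr:value`, the relation naming scheme `attr_a->attr_b`, and the element-wise absolute difference used as the similarity vector are all hypothetical choices, and the embedding step (2) and the multilayer perceptron are left out.

```python
from collections import defaultdict
from itertools import permutations


def build_triples(records):
    """Turn records (dicts of attribute -> value) into (head, relation, tail) triples.

    Assumption: each attribute value becomes a node named "attr:value", and
    every ordered pair of attributes in the same record is linked by a
    relation named "attr_a->attr_b".
    """
    triples = set()
    for record in records:
        for (attr_a, val_a), (attr_b, val_b) in permutations(record.items(), 2):
            head = f"{attr_a}:{val_a}"
            tail = f"{attr_b}:{val_b}"
            triples.add((head, f"{attr_a}->{attr_b}", tail))
    return triples


def block_by_attribute(nodes):
    """Group node names by attribute so only same-attribute nodes are compared."""
    blocks = defaultdict(list)
    for node in sorted(nodes):
        attr, _, _ = node.partition(":")
        blocks[attr].append(node)
    return dict(blocks)


def similarity_vector(emb_a, emb_b):
    """Element-wise absolute difference of two embeddings (an assumed choice)."""
    return [abs(x - y) for x, y in zip(emb_a, emb_b)]


if __name__ == "__main__":
    records = [{"name": "Alice", "city": "Paris"}]
    triples = build_triples(records)
    nodes = {head for head, _, _ in triples}
    print(sorted(triples))
    print(block_by_attribute(nodes))
```

In a full pipeline the vectors returned by `similarity_vector` for each attribute block would be concatenated and passed to the classifier; blocking keeps the number of candidate pairs manageable because nodes of different attributes are never compared.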
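4. Results
The experiments use four real-world datasets in two groups, user linkage and entity linkage, and compare DEM against four conventional machine learning methods and two state-of-the-art deep learning methods. Measured by the F1 score, DEM outperforms these baselines. In addition, experiments on reduced dataset sizes and an ablation study demonstrate the scalability of the method and the importance of the multilayer perceptron within DEM.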
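As a reminder of how the F1 score applies to matching, here is a small sketch that scores a set of predicted matching pairs against a set of true matching pairs. The function name and the pair representation are assumptions for illustration; the abstract does not describe the evaluation code.

```python
def f1_score(predicted_pairs, true_pairs):
    """F1 = 2 * precision * recall / (precision + recall) over matched pairs.

    Both arguments are iterables of hashable pairs such as (id_a, id_b).
    Returns 0.0 when there are no true positives.
    """
    predicted, actual = set(predicted_pairs), set(true_pairs)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)
```

For example, with three predicted pairs of which two are correct, and four true pairs in total, precision is 2/3, recall is 1/2, and F1 is 4/7.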
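5. Conclusions
The experimental results show that a deep learning approach that obtains structural information through network embedding and uses a multilayer perceptron as the classifier clearly improves entity matching performance. They also confirm that building heterogeneous information networks from record attributes is effective and that network structure information benefits the entity matching task. As future work, we consider whether transfer learning could reduce the amount of labeled data required and improve the efficiency of DEM.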