
Context-Aware Semantic Type Identification for Relational Attributes

Yue Ding, Yu-He Guo, Wei Lu, Hai-Xiang Li, Mei-Hui Zhang, Hui Li, An-Qun Pan, Xiao-Yong Du

Ding Y, Guo YH, Lu W et al. Context-aware semantic type identification for relational attributes. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 38(4): 927−946 July 2023. DOI: 10.1007/s11390-021-1048-y. CSTR: 32374.14.s11390-021-1048-y.


Funds: The work was supported by the National Key Research and Development Program of China under Grant No. 2020YFB2104100, the National Natural Science Foundation of China under Grant Nos. 61972403 and U1711261, the Fundamental Research Funds for the Central Universities of China, the Research Funds of Renmin University of China, and Tencent Rhino-Bird Joint Research Program.
    Author Bio:

    Yue Ding received her B.S. degree in Internet of Things from Nanjing University of Posts and Telecommunications, Nanjing, in 2018. She is currently pursuing her M.S. degree in Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education and School of Information at Renmin University of China, Beijing. Her research interests include data integration and machine learning

    Yu-He Guo received her B.S. degree in computer science from Renmin University of China, Beijing, in 2018. She is currently pursuing her M.S. degree in Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education and School of Information at Renmin University of China, Beijing. Her research interests lie in natural language processing and data integration

    Wei Lu is currently an associate professor in Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education and School of Information at Renmin University of China, Beijing. He received his Ph.D. degree in computer applied technology from Renmin University of China, Beijing, in 2011. His research interests include query processing in spatiotemporal and cloud database systems and their applications. He is a member of CCF

    Hai-Xiang Li is currently a senior expert at Tencent (Beijing) Technology Company Limited, Beijing. His research interests include transaction processing, query optimization, distributed consistency, high availability, database system architecture, cloud database and distributed database systems. He is a member of CCF

    Mei-Hui Zhang received her Ph.D. degree in computer science from National University of Singapore, Singapore, in 2013. She is currently a professor with Beijing Institute of Technology, Beijing, and was an assistant professor with Singapore University of Technology and Design, Singapore, from 2014 to 2017. Her research interests include big data management and analytics, large-scale data integration, modern database systems, blockchain and AI. She has served as PC Vice-Chair of ICDE 2018 and associate editor of VLDB 2018, VLDB 2019, VLDB 2020 and SIGMOD 2021. She is a winner of the VLDB 2020 Early Career Research Contribution Award. She is a member of CCF, ACM and IEEE

    Hui Li is currently a professor in College of Computer Science and Technology at Guizhou University, Guiyang. He received his Ph.D. degree in computer software and theory from Renmin University of China, Beijing, in 2012. His research interests include large-scale data analytics, high-performance database systems and data-driven intelligent applications. He is a member of CCF, ACM and IEEE

    An-Qun Pan is a technical director at Tencent (Shenzhen) Technology Company Limited, Shenzhen, with more than 15 years of experience in the research and development of distributed computing and storage systems. He is currently responsible for the research and development of distributed database system (TDSQL). He is a member of CCF

    Xiao-Yong Du is a professor in Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education and School of Information at Renmin University of China, Beijing. He received his Ph.D. degree in computer science from Nagoya Institute of Technology, Nagoya, in 1997. His research focuses on intelligent information retrieval, high performance database and unstructured data management. He is a fellow of CCF

    Corresponding author:

    Wei Lu: lu-wei@ruc.edu.cn

  • Extended Abstract:
    Research Background

    Relational databases remain a core technology and an important foundation of the information infrastructure. However, in many important application domains, such as government information systems, the phenomenon of "information islands" is still widespread across departments and industries. Within these information islands, because the individual information systems were built without unified standards or data quality monitoring, the attributes of relational schemas, the primary objects constituting database schemas, often lack naming semantics, which poses challenges for the data aggregation and fusion techniques required in subsequent data governance. This work therefore explores the effective identification of attribute semantics in relational schemas, providing a basis for the semantic alignment of attributes across relational schemas required by data aggregation and fusion.

    Objective

    Targeting the problem of inconsistent or missing attribute semantic labels in the relational schemas of isolated information systems, this paper proposes an automated method for identifying the semantics of relational attributes, which maps a given attribute in relational data to a known attribute class and thus provides technical support for data governance problems such as aligning attribute semantics across multi-source relational schemas.

    Method

    Starting from the main problems of existing methods, we cast attribute semantic identification as a multi-class classification problem. Based on the semantic encoding and knowledge encoding of relational attributes, we propose a context-aware method for identifying relational attribute semantics, which maps attributes of unknown class to given semantic classes and, for attribute sets whose classes are undefined, predicts candidate semantic classes with the help of an external knowledge base. Specifically, we introduce the context of the target attribute in the relational data to enhance its semantic representation, so as to distinguish groups of attributes whose value characteristics are similar but whose semantics differ; we also introduce entity data from an existing knowledge base to extract a knowledge encoding of the target attribute, which on the one hand enriches the feature representation of the target attribute and on the other hand allows candidate classes to be generated from the knowledge encoding for classes unknown to the semantic identification classifier.

    Results

    We conducted experiments on three real datasets and compared the proposed context-aware method with existing methods from multiple perspectives. The results show that our method effectively improves relational attribute semantic identification: on the high-quality BMdata dataset, the macro average F1-score improves by 6.14% and the weighted average F1-score by 0.28%; on the low-quality WebTable dataset, the macro average F1-score improves by 25.17% and the weighted average F1-score by 9.56%.

    Conclusion

    By semantically encoding and knowledge-encoding target attributes, this paper proposes a context-aware method for identifying relational attribute semantics, and experiments confirm its effectiveness. In the future, we will further improve the method from the perspectives of hierarchical class definition and model compression, and explore applications of target attribute semantic identification in schema matching, heterogeneous data detection, and other tasks.

    Abstract:

    Identifying semantic types for attributes in relations, known as attribute semantic type (AST) identification, plays an important role in many data analysis tasks, such as data cleaning, schema matching, and keyword search in databases. However, due to a lack of unified naming standards across prevalent information systems (a.k.a. information islands), AST identification still remains an open problem. To tackle this problem, in this paper we propose a context-aware method to identify the ASTs for relations. We transform AST identification into a multi-class classification problem and propose a schema context aware (SCA) model to learn the representation from a collection of relations associated with attribute values and schema context. Based on the learned representation, we predict the AST for a given attribute from an underlying relation, wherein the predicted AST is mapped to one of the labeled ASTs. To improve the performance of AST identification, especially for the case that the predicted semantic types of attributes are not included in the labeled ASTs, we then introduce knowledge base embeddings (a.k.a. KBVec) to enhance the above representation and construct a schema context aware model with knowledge base enhanced (SCA-KB) to get a stable and robust model. Extensive experiments based on real datasets demonstrate that our context-aware method outperforms the state-of-the-art approaches by a large margin, up to 6.14% and 25.17% in terms of macro average F1 score, and up to 0.28% and 9.56% in terms of weighted F1 score over high-quality and low-quality datasets respectively.

  • Given a relation R, attribute semantic type (AST) identification aims to identify the semantic types of attributes in R. AST identification has a wide range of applications in data cleaning[1], schema matching[2, 3], keyword search in databases[4], and so on. For example, in schema matching, which aligns attributes with the same meanings from multiple relations, AST identification is used to first figure out the meaning of each attribute in the relations, denoted as its semantic type.

    AST identification poses several challenges. First, due to a lack of unified naming standards across prevalent information systems, attributes of relations in databases are often named at will. Therefore, it is challenging to figure out the semantic types of attributes, since attributes with different names can refer to the same semantic type, while attributes with the same name can refer to different semantic types. Second, in many cases, like web tables, attributes in the relations do not have explicit names. In these cases, it is not applicable to figure out the semantic types of attributes based on the attribute names. Third, the same attribute under different kinds of context can even refer to quite different semantic types due to the finer granularity. For example, name under the teaching context refers to teachers' names, while name under the studying context refers to students' names. In such cases, it is challenging to infer the semantic types based only on the name attribute.

    Thus far, existing approaches to AST identification are divided into three categories: rule-based approaches, knowledge-based approaches, and feature-based approaches. Rule-based approaches[5–7] solve the problem by defining specific rules over the attribute values and doing regular expression matching or dictionary lookup to identify ASTs. Rule-based approaches are applicable to well-formed attribute values, e.g., e-mail address and gender, but they suffer from irregular attribute values, which is often the case in reality. Alternatively, knowledge-based approaches[4, 8–10] utilize external knowledge from the Internet or existing knowledge bases, and develop reasoning algorithms to facilitate the judgment of ASTs. It is worth mentioning that, due to the gap between the applications and knowledge bases, it is often the case that a corresponding match cannot be found in knowledge bases to assist AST identification, especially for the labeled attribute types with few instances[10]. More flexibly, feature-based approaches[11–13] abstract the syntax and semantic characteristics of the attribute values and apply classification models to identify ASTs with similar or various characteristics. Although a variety of feature-based approaches have been proposed to improve the effectiveness of AST identification, there are still two major challenges to be tackled. On one hand, as discussed above, the same attribute under different kinds of context can refer to quite different semantic types, like students' names and teachers' names. It is difficult to distinguish them using existing attribute-wise feature-based approaches. On the other hand, most feature-based approaches transform the AST identification problem into a multi-class classification problem; however, they suffer from unknown semantic types, i.e., the semantic type of the predicted attribute does not exist among the labeled attribute semantic types.

    In this paper, we study the above issues and propose a context-aware method for the AST identification based on schema context aware semantic embeddings and knowledge base embeddings. The main contributions of our paper are as follows.

    • Considering that attributes in the same relation are often mutually helpful to AST identification, we propose a schema context aware (SCA) model, which generates semantic embeddings from a collection of relations associated with attribute values and schema context and performs well on assigning the semantic type for a given attribute.

    • To enhance the representation and alleviate the problem that the predicted semantic types of attributes may not be labeled, we extract embeddings with reference to entities and ontology classes in knowledge bases and construct a schema context aware model with knowledge base enhanced (SCA-KB) model. For attributes with unknown types, the knowledge base embeddings are used for candidate types generation.

    • Extensive experiments over several real datasets demonstrate that our SCA-KB model outperforms recent state-of-the-art approaches by a significant margin. We can choose the appropriate ensemble approach in the SCA-KB model according to different data characteristics.

    The remainder of this paper is structured as follows. Section 2 provides a literature review of AST identification. Section 3 formalizes the AST identification problem to be studied in this paper. Section 4 introduces the architecture of our proposed context-aware model in detail. Section 5 includes the experimental datasets, model implementation details and evaluation metrics. Experimental results and findings are presented in Section 6 to demonstrate the performance of our approach. Finally, Section 7 summarizes this paper and outlines future work.

    AST identification for relational attributes has been studied for several years. Broadly speaking, existing approaches are divided into three categories: rule-based approaches, knowledge-based approaches, and feature-based approaches.

    Rule-based approaches answer the AST identification problem by defining domain-specific rules over the attribute values and doing regular expression matching or dictionary lookup to identify ASTs. These approaches are applicable to well-formed attribute values, e.g., credit code, e-mail address, gender and zip code.

    Some existing commercial data preparation and analysis systems (e.g., Google Data Studio[5], Microsoft Power BI[6], and Trifacta[7]) define heuristic rules for AST identification. Based on the expert experience for type customization, the rule-based approaches can identify attribute types efficiently on limited attributes. Considering the variety of ASTs and data expressions in relations, the rule-based approaches lack generality and scalability.

    Knowledge-based approaches process AST identification by mapping the attributes of the relations to the entity types extracted either from the Internet on the fly or from knowledge bases (KBs)[14].

    Venetis et al.[4] built a mapping from attributes to a pre-defined set of labels based on the knowledge extracted from the Internet. With Bing's knowledge graph[15], Zhao and He[8] built the mapping based on the collected entity types and rich synonymous names of known entities. Furthermore, using entities and ontology classes stored in existing KBs, e.g., DBPedia[16], Freebase[17] and YAGO[18], AST identification can be resolved by referring to the correspondences between attributes and ontology classes with strategies like majority voting[19]. Chen et al.[9] developed a novel KB lookup and reasoning algorithm for property feature extraction which indicates potential relations between attributes and provides discriminative predictive information.

    In reality, the above approaches make an implicit assumption that each attribute value in relations has a one-to-one mapping, via exact matching, to an entity in the knowledge base. However, due to the so-called knowledge gap, introduced by typos or a lack of unified naming standards, it is often the case that we cannot find a proper entity simply based on the underlying attribute values. Differing from knowledge-based approaches, in this paper we utilize entities in the knowledge base, i.e., DBpedia, to generate partial attribute features for AST prediction instead of relying on exact matching, so as to address this issue.

    Attribute values with the same semantics tend to have similar syntax and semantic characteristics. For this reason, feature-based approaches typically first extract semantic characteristics of the attribute and then apply classification models to process the AST identification problems based on the semantic characteristics.

    Ramnandan et al.[11] used the Kolmogorov-Smirnov (K-S) test and Term Frequency-Inverse Document Frequency (TF-IDF) to characterize the numerical and textual data respectively. Furthermore, Pham et al.[12] introduced the Mann-Whitney test and Jaccard similarity to characterize numerical data and textual data, and then trained a logistic regression and a random forest model for AST identification. As a major branch of machine learning, deep learning (DL) yields state-of-the-art achievements for predictive tasks in multiple domains[20–23]. Sherlock[13] extracts high-dimensional features including character distributions, pre-trained word embeddings, self-trained paragraph embeddings and column statistics from each attribute, and trains a multi-input deep neural network for type prediction. Chen et al.[9] embedded the phrase within a cell with a bidirectional recurrent neural network and an attention layer (Att-BiRNN) and learned attribute features and row features with a convolutional neural network (CNN).

    Feature-based approaches answer the AST identification problem in an attribute-wise paradigm. Nevertheless, in reality, attributes in the same relation are often mutually helpful for AST identification. For this reason, we propose to answer the AST identification problem in a relation-wise paradigm, i.e., we generate relation-wise embeddings with the consideration of schema context to predict AST in relations.

    We use symbol R to denote an underlying relation consisting of n attributes, where each attribute is denoted as a_i , 1 \leq i \leq n . Besides, let {{\boldsymbol{A}}_i} represent the list of values for attribute a_i in relation R and {\boldsymbol{A}}^ {\rm{S}} represent the attribute value matrix of relation R , i.e., {\boldsymbol{A}}^ {\rm{S}} = ({{\boldsymbol{A}}_1}, {{\boldsymbol{A}}_2}, \dots, {{\boldsymbol{A}}_n}) . For our purpose, we denote the semantic type of attribute a_i as t_i , where t_i is selected from a predefined attribute semantic type collection T .

    Problem Statement. Given a relation R with attribute value matrix {\boldsymbol{A}}^ {\rm{S}} , the problem of AST identification is to figure out t_i for each a_i in R .

    According to the problem statement in Section 3, we can conclude that when the set T of ASTs is predefined, the AST problem can be transformed into a multi-class classification problem. In this section, we introduce our context-aware method for semantic type identification. First, data preprocessing rules are defined in Subsection 4.1 to generate the candidate semantic type set and high-quality attribute-semantic-type-labeled relations. In Subsection 4.2, we propose a schema context aware (SCA) model to learn embeddings from a collection of relations associated with attribute values and schema context and figure out ASTs for attributes based on the embeddings. Further, to improve the performance of AST identification, especially when the predicted semantic types of attributes are not included in the candidate semantic type set, we map attribute values to corresponding entities in the knowledge base for KB embedding generation (Subsection 4.3) and construct a schema context aware model with knowledge base enhanced (SCA-KB) to get a stable and robust model (Subsection 4.4).

    We formulate the AST identification problem as a classification task. The unlabeled target attribute in relations will be mapped to the most likely matching semantics in the predefined candidate type set T . Therefore, the definition of types is particularly significant for the specific dataset. In this paper, we pre-define the disjoint candidate types and ground truth labels of attributes by referring to the metadata like attribute names of corresponding relational datasets.

    In addition, in order to flexibly adapt to case differences across datasets, we uniformly write attribute types in lowercase, for example, converting “Date”, “DATE” and other similar writing formats to “date” case-insensitively. For attribute names consisting of multiple words, we remove the spaces between the words and connect them with underscores, e.g., “birth Date” \Rightarrow “birth_date”.
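The normalization rules above can be sketched in a few lines of Python (the function name is ours for illustration):

```python
import re

def normalize_type_name(raw_name: str) -> str:
    """Normalize a raw attribute name into a standardized semantic type label:
    lowercase the whole name and join multi-word names with underscores."""
    # Split on runs of whitespace, then lowercase each word and rejoin with "_".
    words = re.split(r"\s+", raw_name.strip())
    return "_".join(w.lower() for w in words)

# "Date" and "DATE" both map to "date"; "birth Date" maps to "birth_date".
```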

    According to the type definition criteria mentioned above, the standardized candidate semantic type set T can be determined manually. Labeling attributes with reference to raw attribute names in relations and the corresponding standard type in T yields high-quality attribute-semantic-type-labeled relations. The processed data examples are shown in Table 1, Table 2 and Table 3.

    Table  1.  Example Book Relation for Attribute Semantic Type Identification
    title | author | date | price | publisher
    Echoes from Lane Field | Bill Swank | 1-Jun-99 | $16.99 | Turner Publishing Company
    A's Essential | Steve Travers | 1-Apr-07 | $9.99 | Triumph Books
    Endless Summers | Jack Torry | 1-Mar-96 | $14.99 | Taylor Trade Publishing
    Table  2.  Example Production Relation for Attribute Semantic Type Identification
    name | brand | price
    MacBook Pro 15.4 | Apple | $3599.00
    ThinkPad Helix 37014DU | Lenovo | $2975.22
    Alienware 17 ANW17 17.3 | Dell | $2599.00
    Table  3.  Example Movie Relation for Attribute Semantic Type Identification
    movie | year | director | genres | time
    High-Rise | 2015 | Ben Wheatley | Action, Drama, Sci-Fi | 112 (min)
    Mercy for Angels | 2015 | K.C. Amos | Action, Drama, Thriller | 91 (min)
    Barely Lethal | 2015 | Kyle Newman | Action, Adventure, Comedy | 96 (min)

    The state-of-the-art approach, Sherlock[13], extracts attribute-wise features for AST identification but cannot work well on identifying attributes with similar characteristics but different ASTs. Considering that attributes in the same relations are often mutually helpful to AST identification, we introduce a pre-trained language model BERT (Bidirectional Encoder Representations from Transformers) to generate semantic embeddings based on both attribute values and schema context.

    BERT borrows its structure from the Transformer[24]. The architecture of bare BERT is an encoder with 12 identical stacked layers, each taking the outputs of the former layer as inputs. Each layer is a Transformer block made up of two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network. Each sub-layer is wrapped by a residual connection[25] and followed by layer normalization[26]. The output of each sub-layer can be represented as (1).

    {{LayerNorm}}({\boldsymbol{X}} + {{Sublayer}}({\boldsymbol{X}})), (1)

    where {{Sublayer}}({\boldsymbol{X}}) is the function implemented by the sub-layer itself and {\boldsymbol{X}} is the input matrix of this sub-layer.

    Overall, the self-attention mechanism is the most important part of BERT. Different from the RNN, which is often employed in sequence modeling and summarizes preceding inputs in a single hidden state evolving step by step, the self-attention mechanism avoids sequential computation along the input and output sequences. That is, when calculating an output token vector {{\boldsymbol{z}}_i} , self-attention allows all vectors output by the former layer to be attended to and summed up with weight coefficients {{\boldsymbol{\alpha}}_i} altogether.

    e_{ij} = \frac{({\boldsymbol{E}}_i \cdot {\boldsymbol{W}}^{\rm{Q}}) \cdot ({\boldsymbol{E}}_j \cdot {\boldsymbol{W}}^{\rm{K}})^{\rm{T}}}{\sqrt{d_{\rm{z}}}}, (2)
    \alpha_{ij} = \frac{\exp e_{ij}}{\displaystyle\sum_{k = 1}^{n} \exp e_{ik}}, (3)
    {\boldsymbol{z}}_i = \sum\limits_{j = 1}^{n} \alpha_{ij} ({\boldsymbol{E}}_j \cdot {\boldsymbol{W}}^{\rm{V}}). (4)

    In (4), {{\boldsymbol{W}}}^{\rm{V}} is the parameter matrix transforming {{\boldsymbol{E}}_i} and {{\boldsymbol{E}}_j} . Note that {\boldsymbol{E}} is the input matrix of the current attention head. Each weight coefficient \alpha_{ij} is computed using a softmax function (3), where e_{ij} is computed by comparing the similarity of two input elements by dot product (2) and d_{\rm{z}} is the dimension of the input and output vectors in attention heads. The division operation is known as the scaled dot product. {{\boldsymbol{W}}}^{\rm{V}} , {{\boldsymbol{W}}}^{\rm{Q}} and {{\boldsymbol{W}}}^{\rm{K}} are all parameter matrices for linear transformation. The three matrices are unique for each attention head.
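Equations (2)–(4) amount to single-head scaled dot-product attention; a minimal NumPy sketch (the shapes and random inputs are illustrative, not the paper's actual dimensions):

```python
import numpy as np

def self_attention(E, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention, following (2)-(4).
    E: (n, d) input matrix of one head; Wq/Wk/Wv: (d, d_z) projection matrices."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    d_z = Q.shape[-1]
    # (2): pairwise similarity scores e_ij, scaled by sqrt(d_z)
    e = (Q @ K.T) / np.sqrt(d_z)
    # (3): row-wise softmax yields the attention weights alpha_ij
    alpha = np.exp(e - e.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)
    # (4): each output z_i is a weighted sum of the projected value vectors
    return alpha @ V

rng = np.random.default_rng(0)
E = rng.normal(size=(5, 8))                       # 5 tokens, dimension 8
W = [rng.normal(size=(8, 8)) for _ in range(3)]   # Wq, Wk, Wv
Z = self_attention(E, *W)                          # Z.shape == (5, 8)
```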

    As mentioned in Section 3, with the definition of ASTs and the acquisition of high-quality attribute-semantic-type-labeled training data, we transform the problem of AST identification into a classification problem. To map an attribute to the vector space for semantic embedding and achieve good performance on AST identification, we fine-tune the pre-trained BERT model for AST classification, the downstream task we define. We add a projection layer wrapped by tanh activation and a classification layer on top of BERT to compute the label probabilities (5).

    {\boldsymbol{P}} = {\rm{softmax}}( {\rm{tanh}}( {{\boldsymbol{C}}} \cdot {{\boldsymbol{W}}}_1 + {\boldsymbol{b}}_1 ) {\boldsymbol{W}}_2 + {\boldsymbol{b}}_2), (5)

    where {{\boldsymbol{C}}} is the output [CLS] representation, {{\boldsymbol{W}}}_1 \in {\mathbb{R}}^{d_{\rm{z}}\times d_{\rm{z}} } and {{\boldsymbol{W}}}_2 \in {\mathbb{R}}^{d_{\rm{z}} \times n_{{\rm{labels}}}} are matrices for linear transformation, and {\boldsymbol{b}}_1 \in {\mathbb{R}}^{d_{\rm{z}}} and {\boldsymbol{b}}_2 \in {\mathbb{R}}^{n_{{\rm{labels}}}} are corresponding biases. All the parameters of BERT are fine-tuned based on our training set with cross entropy.
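A minimal NumPy sketch of the classification head in (5); the dimensions and random parameters are illustrative only:

```python
import numpy as np

def classification_head(C, W1, b1, W2, b2):
    """Compute label probabilities from the [CLS] representation C,
    following (5): softmax(tanh(C W1 + b1) W2 + b2)."""
    logits = np.tanh(C @ W1 + b1) @ W2 + b2
    # Numerically stable softmax over the label dimension
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

rng = np.random.default_rng(1)
d_z, n_labels = 16, 4   # hypothetical hidden size and label count
P = classification_head(
    rng.normal(size=d_z),                 # C: [CLS] vector
    rng.normal(size=(d_z, d_z)), rng.normal(size=d_z),        # W1, b1
    rng.normal(size=(d_z, n_labels)), rng.normal(size=n_labels))  # W2, b2
```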

    As the input token sequence to BERT may be a single sentence or a sentence pair packed together, and the “sentence” here can be an arbitrary span of contiguous text rather than an actual linguistic sentence[27], we concatenate the attribute values to form a sequence input to BERT and add a special classifier token [CLS][28, 29] at the start of the input value sequence. All the tokens including [CLS] go through multiple self-attention layers in a fine-tuned BERT to get higher-level representations with regard to all other co-occurring tokens. Besides, for each layer, [CLS] attends to all the positions of the input sequence and sums them together, which can be regarded as a summarization of representations from the former layer. Finally, at the top layer, [CLS] is chosen as the semantic embedding of the input attribute. With the raw representation of the training dataset, the fine-tuning of the predefined neural network is performed, so as to adjust the parameters of the model. The architecture of the column content aware (CCA) model is shown in Fig.1(a). Briefly speaking, the input and output of the model are shown in Fig.1(b).

    Figure  1.  Column content aware model. (a) Model architecture. (b) Input and output example.

    Simply utilizing the attribute values is insufficient for identifying attribute semantics, since attributes with similar values can be referred to as quite different semantic types, e.g., the author in Book (Table 1) and the director in Movie (Table 3). The actual semantic type depends on the theme of the whole relation, which is inherent when the schema is defined and can be revealed from the values of its co-occurring attributes. To tackle this problem, we enhance the CCA model by leveraging the schema information, namely the schema context aware (SCA) model, to get more representative embeddings and identify ASTs precisely.

    The SCA model is motivated by the fact that context plays an important role in word representation learning. The training objectives of word embedding models implicitly follow Harris' Distributional Hypothesis[30], which can be stated as follows: words that occur in similar contexts tend to have similar meanings[31], wherein “context” is regarded as the co-occurring words that precede and follow the target word within some distance. We extend the context in word representation learning to relations and introduce the schema context. For an attribute whose semantics is to be predicted, there often exist many other attributes co-occurring in the same relation, and we thus extend Harris' Distributional Hypothesis as follows: “Attributes that occur in similar schema contexts tend to have similar semantic types”. The schema context we define for an attribute refers to the attributes co-occurring in the same relation. This simplification is useful especially when correlations between attributes are unknown, as in web tables.

    The architecture of the SCA model is shown in Fig.2(a). As mentioned above, the input token sequence to BERT can be two sentences packed together. To combine the information of the schema context, we keep the structure of the CCA model and add one more “sentence”, formed by concatenating the schema context, to the input. The two parts of information, the single column content and its schema context separated by a special token [SEP], are fed into the cross-encoder BERT, and a target value is predicted. Briefly speaking, the input and output of the model are shown in Fig.2(b).

    Figure  2.  Schema context aware model. (a) Model architecture. (b) Input and output example.
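The packing of the two "sentences" described above can be sketched as follows; the function name and the plain-string handling of [CLS]/[SEP] are ours for illustration (a real BERT tokenizer inserts the special tokens through its own encoding API):

```python
def build_sca_input(target_values, context_columns):
    """Pack the target column's values as the first "sentence" and the
    concatenated schema context (values of the co-occurring columns) as
    the second, separated by the special token [SEP]."""
    sentence_a = " ".join(str(v) for v in target_values)
    sentence_b = " ".join(str(v) for col in context_columns for v in col)
    return f"[CLS] {sentence_a} [SEP] {sentence_b} [SEP]"

# Target column: author values from Table 1; context: its co-occurring columns.
seq = build_sca_input(
    ["Bill Swank", "Steve Travers", "Jack Torry"],
    [["Echoes from Lane Field", "A's Essential"], ["1-Jun-99", "1-Apr-07"]],
)
```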

    After fine-tuning on the downstream classification task, a well-performing classifier can be obtained, and the representations generated during classification are high-quality semantic embeddings for attributes. In comparison, the advantages of our SCA model are as follows.

    1) Adopting a fine-tuned BERT as the feature extractor makes the entire training process end-to-end. Furthermore, with WordPiece tokenization (Byte Pair Encoding, BPE) adopted, the problem of out-of-vocabulary (OOV) words is solved, which is severe in Sherlock's[13] feature extraction procedure, since missing values may arise when using GloVe[32] for word embeddings or PV-DBOW[33] for paragraph embeddings.

    2) With the collection of attribute values and schema context, our model generates relation-wise representations which implicitly capture the theme of relations. By doing this, attributes with similar values but quite different semantics can be identified more precisely.

    In addition to characterizing the semantics of attributes with the language model, we create knowledge base embeddings (a.k.a. Knowledge Base Vector, KBVec) with reference to a knowledge base containing complex structured and unstructured information. KBVec utilizes prior knowledge in the KB to express potential type characteristics of attribute values. Each slot of KBVec indicates the possibility that the sample attribute belongs to a specific KB ontology class, and the dimension of the raw KBVec depends on the number of ontology classes. The classes defined by the KB have hierarchical relationships, e.g., actor, athlete, chef, and dancer are sub-classes of person. Therefore, in most cases, the values of each attribute will be mapped to several ontology classes with parent-child relationships, and the generated KBVec will be high-dimensional and sparse. We further use Principal Component Analysis (PCA) to reduce the dimensionality of the feature space while minimizing the feature loss during dimensionality reduction.

    The extraction algorithm of KBVec is shown in Algorithm 1. For each attribute value in the input attribute, the algorithm first retrieves the matched entities in the knowledge base and obtains all ontology classes (including all parent and child classes) they belong to (line 5). As look-up by lexical matching is ambiguous, the maximum number of returned results is set to N to avoid missing the right entity. After obtaining the candidate classes that the attribute value belongs to, we retrieve the slot ID of each candidate class in KBVec and enhance the representation in the corresponding slot (lines 6–9). Finally, we apply normalization to facilitate further processing and obtain the raw KBVec characterizing the category features of attributes. In order to reduce the noise introduced by high-dimensional embeddings, PCA is used to perform dimensionality reduction (lines 13 and 14). Note that we adopt the same PCA model for the training, validation and test sets of the same dataset, i.e., transforming the embeddings to a consistent dimension for further construction and application of the classification model.

    Algorithm 1. Extraction of KBVec
    Input: attributes with values Attrs, ontology classes clses with the size of d, a maximum number of matched entities N
    Output: KBVec
    1  Vector = [ ]
    2  for every attr in Attrs do
    3      Initialize the d-dimensional V for attr with zeros
    4      for every val in attr do
    5          Look up N candidate classes for val in the KB
    6          for every cls in the candidate classes do
    7              Find the index of cls in clses
    8              Add 1 to the corresponding slot of V
    9          end for
    10     end for
    11     Append V to Vector
    12 end for
    13 Apply z-score normalization to Vector
    14 Perform dimensionality reduction on Vector with PCA to get KBVec
    15 return KBVec
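A minimal Python sketch of Algorithm 1, for illustration only. The KB lookup is stubbed with a toy dictionary (`TOY_KB`, `lookup_classes` are hypothetical stand-ins for the DBpedia lookup service), and z-score normalization plus a variance-threshold PCA are implemented directly with NumPy:

```python
import numpy as np

# Toy stand-in for the KB lookup: maps an attribute value to candidate
# ontology classes. In the paper this is a DBpedia lookup returning at
# most N matched entities per value.
TOY_KB = {
    "Paris": ["place", "city", "settlement"],
    "London": ["place", "city", "settlement"],
    "Lionel Messi": ["person", "athlete"],
}
CLSES = ["place", "city", "settlement", "person", "athlete", "organisation"]

def lookup_classes(val, n=5):
    return TOY_KB.get(val, [])[:n]

def extract_kbvec(attrs, clses=CLSES, n=5, var_keep=0.9):
    d = len(clses)
    idx = {c: i for i, c in enumerate(clses)}
    vectors = []
    for attr in attrs:                      # one raw d-dimensional vector per attribute
        v = np.zeros(d)
        for val in attr:
            for cls in lookup_classes(val, n):
                v[idx[cls]] += 1            # enhance the matched slot
        vectors.append(v)
    x = np.array(vectors)
    # z-score normalization (guard against zero-variance slots)
    std = x.std(axis=0)
    std[std == 0] = 1.0
    x = (x - x.mean(axis=0)) / std
    # PCA via SVD, keeping enough components to explain >= var_keep variance
    xc = x - x.mean(axis=0)
    u, s, vt = np.linalg.svd(xc, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(ratio, var_keep)) + 1
    return xc @ vt[:k].T                    # KBVec: one row per attribute

kbvec = extract_kbvec([["Paris", "London"], ["Lionel Messi"]])
```

In the actual pipeline the same fitted PCA model would be reused for the validation and test sets, as noted above.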

    In summary, we obtain the high-quality semantic embedding with the SCA model, and extract the representative knowledge base embedding with the designed KBVec extraction algorithm.

    To improve the identification performance for both the predefined types and the unknown types, we conduct ensemble modeling and construct the SCA-KB model. We consider three ensemble approaches.

    Score Ensemble. Based on the average word vector ( {\boldsymbol{AvgWV}} ) of the attribute values and {\boldsymbol{KBVec}} , a basic multi-class classifier with logistic regression (LR) can be constructed 2. Applying both the SCA model and the basic classifier for AST identification, we get two prediction results {\boldsymbol{P}} and {\boldsymbol{P}}_{\rm{basic}} . The score ensemble approach takes their average as the final decision {\boldsymbol{P}}_{\rm{result}} , as shown in (6):

    \begin{aligned} {\boldsymbol{P}}_{\rm{basic}} &= {\rm{LR}}([{\boldsymbol{AvgWV}}, {\boldsymbol{KBVec}}]), \\ {\boldsymbol{P}}_{\rm{result}} &= ({\boldsymbol{P}}+{\boldsymbol{P}}_{\rm{basic}})/2. \end{aligned} \quad (6)
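The averaging in (6) is a one-liner; a NumPy sketch with made-up per-type probability vectors (the values below are purely illustrative, not from the paper):

```python
import numpy as np

# Hypothetical per-type probabilities from the SCA model and the basic
# LR classifier, with the same type ordering in both vectors.
p_sca = np.array([0.7, 0.2, 0.1])
p_basic = np.array([0.5, 0.4, 0.1])

p_result = (p_sca + p_basic) / 2   # score ensemble, Eq. (6)
pred = int(np.argmax(p_result))    # index of the final type decision
```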

    Feature Ensemble with LR. With the conventional feature fusion approach, we concatenate the extracted fine-tuned semantic embedding {\boldsymbol{C}} (i.e., the output [CLS] representation) with the knowledge base embedding {\boldsymbol{KBVec}} , and train a multi-class classifier with LR to get the prediction probabilities, as shown in (7):

    {\boldsymbol{P}}_{\rm{result}} = {\rm{LR}}([{\boldsymbol{C}}, {\boldsymbol{KBVec}}]). \quad (7)
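A sketch of the feature ensemble with LR using scikit-learn, assuming the fine-tuned [CLS] embeddings and the compressed KBVecs have already been materialized as arrays (random placeholders and dimensions below are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d_cls, d_kb = 200, 768, 32           # placeholder dimensions
C = rng.normal(size=(n, d_cls))         # fine-tuned [CLS] embeddings (stand-in)
kbvec = rng.normal(size=(n, d_kb))      # compressed knowledge base embeddings (stand-in)
y = rng.integers(0, 5, size=n)          # AST labels (stand-in)

X = np.concatenate([C, kbvec], axis=1)  # feature-level fusion, Eq. (7)
clf = LogisticRegression(max_iter=1000).fit(X, y)
p_result = clf.predict_proba(X)         # per-type prediction probabilities
```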

    Feature Ensemble with BERT. In this ensemble approach, {\boldsymbol{KBVec}} is concatenated with the generated semantic embedding in the SCA model during fine-tuning, so as to adjust the parameters of the model. Compared with the model architecture mentioned in Subsection 4.2, one more projection layer wrapped by tanh activation is added to convert the fusion vector back to the original dimension. The prediction probability can be represented as (8):

    {\boldsymbol{P}}_{\rm{result}} = {\rm{softmax}}({\rm{tanh}}({\rm{tanh}}([{\boldsymbol{C}}, {\boldsymbol{KBVec}}]{\boldsymbol{W}}_0 + {\boldsymbol{b}}_0){\boldsymbol{W}}_1 + {\boldsymbol{b}}_1){\boldsymbol{W}}_2 + {\boldsymbol{b}}_2), \quad (8)

    where {\boldsymbol{W}}_0 , {\boldsymbol{W}}_1 and {\boldsymbol{W}}_2 are matrices for linear transformation, and {\boldsymbol{b}}_0 , {\boldsymbol{b}}_1 and {\boldsymbol{b}}_2 are their corresponding biases. All of the parameters are fine-tuned on our training set.
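A NumPy sketch of the forward pass of the prediction head in (8), with made-up dimensions and random weights purely for shape checking; in the actual model these parameters are fine-tuned jointly with BERT:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_cls, d_kb, n_types = 768, 32, 71            # placeholder dimensions
d_in = d_cls + d_kb
rng = np.random.default_rng(0)

# W0 projects the fused vector back to the original hidden size (the extra
# projection layer); W2 maps to the per-type logits.
W0, b0 = rng.normal(size=(d_in, d_cls)) * 0.01, np.zeros(d_cls)
W1, b1 = rng.normal(size=(d_cls, d_cls)) * 0.01, np.zeros(d_cls)
W2, b2 = rng.normal(size=(d_cls, n_types)) * 0.01, np.zeros(n_types)

c, kbvec = rng.normal(size=d_cls), rng.normal(size=d_kb)
fused = np.concatenate([c, kbvec])            # [C, KBVec]

h0 = np.tanh(fused @ W0 + b0)                 # extra tanh projection layer
h1 = np.tanh(h0 @ W1 + b1)                    # original pooling layer
p_result = softmax(h1 @ W2 + b2)              # Eq. (8)
```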

    We conduct extensive experiments to verify the effectiveness of our method.

    Three datasets with different predefined semantic types are chosen to verify the performance of our model, and the ground truth labels of attributes are generated with reference to metadata such as attribute names. Multiple samplings based on the datasets yield our experimental datasets, which are split into the training, validation and test sets with a ratio of 3:1:1. More details can be seen in Table 4.

    Table 4. Statistics of Experimental Datasets

    | Dataset | Number of Relations | Average Number of Attributes | Number of ASTs | Size of the Training Set | Size of the Validation Set | Size of the Test Set |
    | --- | --- | --- | --- | --- | --- | --- |
    | BMdata[34] | 1361 | 4 | 8 | 3265 | 1089 | 1089 |
    | Magellan[35] | 6560 | 8 | 71 | 31016 | 10338 | 10339 |
    | WebTable[36] | 38493 | 2 | 83 | 56379 | 18793 | 18793 |

    Benchmark Dataset. The benchmark dataset (BMdata)[34] for entity resolution describing information about bibliography and e-commerce has canonical and common schemas which can be used as the ground truth labels of ASTs. According to the data preparation process mentioned in Subsection 4.1, eight common types including author, description, manufacturer, name, price, title, venue, and year are defined.

    Magellan Dataset. As an expansion of BMdata, the Magellan dataset contains more relations extracted and converted from data-rich websites[35]. The dataset includes relations from multiple application domains, e.g., anime, beer, bibliography, bike, book, e-commerce, movies, music, and restaurants. The Magellan dataset, frequently used for entity resolution, is of good quality, and thus the semantic type set can easily be defined with reference to the attribute names in the original relations. As a result, 71 common semantic types are defined after filtering.

    Web Table Corpus. Beyond the high-quality relational data published by researchers, there is a vast number of relations embedded in the web, providing extensive information on entities from various domains. Referring to the web tables (WebTable) extracted from a couple of thousand websites[36], we obtain 38493 relations and define 83 informative semantic types. Considering the multi-domain and low-quality nature of web tables, we explore whether our model can perform well on identifying ASTs that occur in diverse schema contexts.

    The above datasets include both well-structured data obtained from existing research work and relatively low-quality data captured from Internet full-text documents, which contains heterogeneous representations, misspellings and extraction errors. We evaluate the performance of our AST identification approach on these three datasets.

    Generally speaking, the number of samples in each semantic type varies a lot. As shown in Fig.3, which takes the Magellan dataset as an example, the imbalances of per-type sample numbers form “long-tail” distributions, where the common attribute types occur frequently in different contexts while the rare types are only contained in specific contexts. The AST identification model should capture and distinguish the differences between various types regardless of the imbalanced support for the respective types.

    Figure 3. Long-tail data distribution in the Magellan dataset. “⋯” denotes semantic types omitted to keep the statistical distribution readable.

    For semantic embedding, we implement the CCA model and the SCA model based on the pre-trained uncased BERT base model with PyTorch as the backend, and tune the hyper-parameters on each dataset respectively to construct the specific target model. During fine-tuning, most of the hyper-parameter settings are kept the same as in pre-trained BERT, except for the batch size, learning rate, and number of epochs. Devlin et al.[27] pointed out that almost all tasks work well with a small batch size, a small learning rate, and a few training epochs. For optimal performance, we fine-tune the BERT model on three TITAN RTX GPUs with a batch size of 32 and a learning rate of 2.0\times10^{-5} , and save the best model on the validation set during three training epochs.

    In addition to the above parameters, the maximum sequence length is set differently depending on the specific task. For BERT, long input sequences are disproportionately expensive because attention is quadratic in the sequence length[27]; commonly, the length of the input token sequence is limited to 512. To adapt to our task, for the CCA model, the maximum sequence length is set to 128, and any sentence longer than 128 tokens is truncated to satisfy this setting. For the SCA model, which takes as input not only the values of an attribute but also the schema context in relations, the maximum sequence length is set to 256 to accommodate both contents, and the predefined sentence separator ([SEP]) is used to separate the sequence pair. When the combined length of the sentence pair exceeds 256, we adopt a simple heuristic truncation approach that removes one token at a time from the longer sequence in turn, which prevents the truncated pair from being dominated by the longer sequence. To summarize, the parameters we set are listed in Table 5.
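The heuristic pair truncation (the same trick used in BERT's original pre-processing) can be sketched as:

```python
def truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Trim one token at a time from the end of the currently longer
    sequence until the pair fits within max_length, so neither sequence
    is cut disproportionately."""
    while len(tokens_a) + len(tokens_b) > max_length:
        longer = tokens_a if len(tokens_a) > len(tokens_b) else tokens_b
        longer.pop()
    return tokens_a, tokens_b

# Toy example: a 200-token and a 100-token sequence, 256-token budget.
a, b = list(range(200)), list(range(100))
truncate_seq_pair(a, b, 256)   # only the longer sequence a gets trimmed
```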

    Table 5. Experimental Parameter Settings

    | Parameter | Value |
    | --- | --- |
    | Batch size | 32 |
    | Learning rate (AdamW) | 2.0 × 10^{-5} |
    | Number of epochs | 3 |
    | Maximum sequence length | 128/256 |

    For knowledge base embedding generation, we introduce the DBpedia data[16] and extract the KBVec with Algorithm 1. When retrieving matched entities in the KB with keywords, we use the lookup service 3 and set the maximum number of returned results to 5. Further, with SPARQL queries, we obtain 760 KB ontology classes with parent-child relationships in total. Therefore, the raw KBVec dimension is set to 760, and each slot represents one specific ontology class. After obtaining the raw KBVec, we introduce PCA for dimensionality reduction and set the number of components to keep to 0.9, which means that enough components are retained to explain more than 90% of the variance.

    For performance comparison, we choose the state-of-the-art model Sherlock 4 as the strong baseline model for AST identification and reproduce its experiments on the different public datasets. Besides Sherlock, we embed attributes with the averaged word vector based on a word2vec model[37] trained on the latest dump of Wikipedia articles, and train a basic LR multi-class classification model (the AvgWV model) for AST identification.
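The variance-threshold PCA can be expressed directly in scikit-learn, where a fractional `n_components` keeps just enough components to explain that share of the variance. The data below is a random stand-in for the raw 760-dimensional KBVecs:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
raw_kbvec = rng.normal(size=(500, 760))   # stand-in for raw 760-dim KBVecs

# Keep enough principal components to explain more than 90% of the variance.
# The same fitted model is then reused for the validation and test sets.
pca = PCA(n_components=0.9).fit(raw_kbvec)
kbvec = pca.transform(raw_kbvec)
```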

    To measure the overall performance differences between AST identification models, we calculate the macro average F_1 score and the weighted average F_1 score respectively on the test sets. Unlike the macro average, which treats each type equally, the latter weights each type by its support (i.e., the number of samples in each type) and can better evaluate performance under imbalanced type distributions.

    Furthermore, the skewed data distributions test whether models detect each type consistently and accurately. To evaluate whether the model can effectively overcome the impact of the long-tail distribution on AST identification, the Matthews correlation coefficient (MCC) and the coefficient of variation (CV) are introduced for detailed per-type evaluation. MCC[38] is a measure of classification quality frequently used in machine learning. In the multi-class case, MCC can be defined in terms of a confusion matrix {\boldsymbol{M}} for N types. To simplify the definition, the following intermediate variables are considered.

    \begin{array}{l} t_n = \displaystyle\sum\limits_{i}^{N} {\boldsymbol{M}}_{in},\qquad p_n = \displaystyle\sum\limits_{i}^{N} {\boldsymbol{M}}_{ni},\\ c = \displaystyle\sum\limits_{n}^{N} {\boldsymbol{M}}_{nn},\qquad s = \displaystyle\sum\limits_{i}^{N} \displaystyle\sum\limits_{j}^{N} {\boldsymbol{M}}_{ij}, \end{array}

    where t_n represents the times type n truly occurred, p_n represents the times type n was predicted, c represents the total number of samples correctly predicted, and s represents the total number of samples. Based on above definitions, the multi-class MCC is defined as (9).

    MCC = \frac{ c \times s - \displaystyle\sum_{n}^{N} p_n \times t_n }{\sqrt{ (s^2 - \displaystyle\sum_{n}^{N} p_n^2) \times (s^2 - \displaystyle\sum_{n}^{N} t_n^2) }}. (9)

    The MCC value for the multi-class case ranges in (–1, 1], where the maximum value +1 represents a perfect prediction. Compared with other metrics like the F_1 score and accuracy, the MCC metric, which takes the full confusion matrix into account, is more informative on imbalanced datasets[39]. Moreover, CV is the ratio of the standard deviation \sigma to the mean \mu , CV = {\sigma}/{\mu} [40]. We compute CV based on the F_1 score of each type in the test set. The lower the value of CV, the less divergent the per-type F_1 scores and the more precise the estimate.
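Both metrics are cheap to compute; a sketch with scikit-learn and NumPy on toy predictions (the labels below are illustrative, not from the experiments):

```python
from sklearn.metrics import matthews_corrcoef, f1_score

# Toy 4-type ground truth and predictions with one misclassification.
y_true = [0, 0, 1, 1, 2, 2, 2, 3]
y_pred = [0, 0, 1, 2, 2, 2, 2, 3]

mcc = matthews_corrcoef(y_true, y_pred)        # multi-class MCC, Eq. (9)

# CV over the per-type F1 scores: lower means more uniform per-type quality.
per_type_f1 = f1_score(y_true, y_pred, average=None)
cv = per_type_f1.std() / per_type_f1.mean()
```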

    By fine-tuning BERT for the downstream AST identification task with the CCA model and the SCA model, we obtain semantic embeddings that are capable of capturing the semantic similarity of attributes, as well as well-performing classifiers. We use the performance of the classifiers to explore whether the semantic embeddings are effective. Table 6 shows the experimental results of different models on the three datasets. We use bold font to highlight the highest score for each metric.

    Table 6. Performance Evaluation for Semantic Embedding

    | Approach | BMdata Macro (%) | BMdata Weighted (%) | Magellan Macro (%) | Magellan Weighted (%) | WebTable Macro (%) | WebTable Weighted (%) |
    | --- | --- | --- | --- | --- | --- | --- |
    | AvgWV | 85.88 | 98.44 | 70.00 | 69.46 | 46.52 | 75.79 |
    | Sherlock[13] | 91.90 | 99.52 | 90.73 | 94.02 | 69.07 | 88.07 |
    | CCA | **96.70** | 99.70 | 93.39 | 96.30 | 79.68 | 93.65 |
    | SCA | 95.33 | **99.71** | **95.52** | **97.92** | **92.16** | **97.34** |

    Note: The macro average F_1 score and the weighted average F_1 score of different models on different datasets are listed above. AvgWV is the basic model we construct, which embeds the attribute values with the averaged word vector and uses LR for classification. Sherlock is a strong baseline model.

    The results show that the basic model, AvgWV, achieves only a basic performance for AST identification, since the word2vec model cannot generate embeddings for OOV words, rare words or misspelled words; BERT avoids such problems by using WordPiece tokenization. In comparison, the fine-tuned semantic embeddings generated by the CCA model and the SCA model perform much better than word2vec. Further, both models outperform Sherlock, especially the SCA model. For the BM dataset, which contains the two application domains of bibliography and e-commerce, the CCA model brings an improvement over Sherlock of 4.80% in macro average F_1 score and 0.18% in weighted average F_1 score, which verifies the effectiveness of the semantic embedding obtained from the pre-trained BERT model. It can also be noticed that the introduction of schema context performs comparably with the CCA model, with a decrease of 1.37% in macro average F_1 score and a slight improvement of 0.01% in weighted average F_1 score. On closer analysis, the relations contained in BMdata describe two independent contexts with no semantic overlap between attributes from the quite different schema contexts; thus, the features of the attribute values alone are enough to distinguish and identify different ASTs.

    To verify the effect of context information on semantic embedding, we introduce the Magellan data, multi-domain relational data with richer topics than BMdata. As introduced in Subsection 5.1, the Magellan dataset involves nine domains of data (e.g., anime, book, and music), and the ASTs need to be restricted and identified with the help of the specific context. The SCA model outperforms Sherlock, achieving a macro average F_1 score of 95.52% vs 90.73% and a weighted average F_1 score of 97.92% vs 94.02%. Beyond the Magellan data, which is high-quality relational data mainly used for entity resolution, we also adopt the web table corpus. Considering only attribute-wise features, the baseline Sherlock model performs poorly on this relatively low-quality web data covering various domains. In contrast, our SCA model can distinguish and identify features of different ASTs by incorporating schema context, achieving a 23.09% gain in macro average F_1 score and a 9.27% gain in weighted average F_1 score.

    In general, both the CCA model and the SCA model can generate distinguishable and representative embeddings for further classification. In particular, for datasets with mixed domains where attribute types cannot be determined from single-column content, the SCA model plays an important role in semantic embedding generation.

    By retrieving matched entities and their corresponding ontology classes in DBpedia, we obtain KBVec, which represents the potential type characteristics of attributes. To verify the effectiveness of KBVec, we concatenate it with the semantic embeddings generated by different approaches and train classifiers with LR for further performance evaluation.

    Experimental results in Table 7 demonstrate that the introduction of KBVec effectively improves AST identification. Especially for the basic classifier representing attributes with the averaged word vector (AvgWV), the macro average F_1 score and the weighted average F_1 score increase by 8.74% and 12.71% respectively on the Magellan dataset. Moreover, to reduce the impact of the high-dimensional sparse KBVec on classification, we adopt PCA to transform the raw 760-dimensional KBVec linearly into a low-dimensional space. The results show that ensembling the semantic embedding and the compressed KBVec as input features of the LR classifier obtains the best classification effect. There are a few exceptions where the compressed KBVec performs worse than the raw 760-dimensional KBVec; for example, for the CCA semantic embeddings on the Magellan dataset, the introduction of PCA causes a decrease of 0.05% in macro average F_1 score. This is because dimensionality reduction inevitably loses a certain amount of information, leading to a reduction in performance. Considering that PCA reduces the computational complexity and improves the execution efficiency, it is necessary to find a balance between effectiveness and efficiency.

    Table 7. Performance Evaluation for Knowledge Base Embedding

    | Semantic Embedding Approach | KBVec | PCA | BMdata Macro (%) | BMdata Weighted (%) | Magellan Macro (%) | Magellan Weighted (%) | WebTable Macro (%) | WebTable Weighted (%) |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | AvgWV | / | / | 85.88 | 98.44 | 70.00 | 69.46 | 46.52 | 75.79 |
    | AvgWV | 760 | / | 88.86 | 98.82 | 78.74 | 82.17 | 54.43 | 81.18 |
    | AvgWV | 760 (29/37/38) | 0.9 | 89.52 | 98.73 | 78.91 | 82.41 | 53.85 | 80.87 |
    | CCA | / | / | 96.70 | 99.70 | 93.39 | 96.30 | 79.68 | 93.65 |
    | CCA | 760 | / | 95.33 | 99.71 | 94.37 | 96.58 | 80.25 | 93.74 |
    | CCA | 760 (29/37/38) | 0.9 | 97.45 | 99.80 | 94.32 | 96.63 | 80.84 | 93.79 |
    | SCA | / | / | 95.33 | 99.71 | 95.52 | 97.92 | 92.16 | 97.34 |
    | SCA | 760 | / | 95.33 | 99.71 | 96.03 | 98.01 | 92.99 | 97.55 |
    | SCA | 760 (29/37/38) | 0.9 | 95.01 | 99.70 | 96.20 | 98.04 | 94.24 | 97.63 |

    Note: With ablation experiments, we explore the effectiveness of knowledge base embedding. For each approach, the first row means the semantic embeddings are concatenated with no KBVec, the second row means they are concatenated with the raw 760-dimensional KBVecs, and the third row means they are concatenated with the compressed KBVecs. The numbers in brackets indicate the dimension after compression on each dataset, and the decimal indicates the minimum amount of variance that needs to be explained with PCA.

    In general, the generated KBVec supplements the semantic embedding and enhances the quality of the embedding. On the other hand, when the semantics to be identified are not predefined, KBVec can be used to provide candidate types to help semantic inference (Subsection 6.6). In addition, adopting PCA for dimensionality reduction speeds up modeling and overcomes the impact of sparse high-dimensional features.

    With comprehensive analysis, the semantic embedding generated by the SCA model and the knowledge base embedding generated by our extraction algorithm characterize attributes in relations well. In Subsection 6.2, the fine-tuned semantic embeddings of attributes were concatenated with knowledge base embeddings as input to train a multi-class classifier, corresponding to the Feature Ensemble with LR mentioned in Subsection 4.4. In this subsection, we explore the other ensemble approaches of the SCA-KB model.

    In Table 8, Score Ensemble determines the prediction results with the averaged probabilities of the SCA model and the basic model, where the averaged word vector of attributes is concatenated with KBVec. Feature Ensemble with BERT concatenates the compressed KBVec with the BERT representation to fine-tune the model for the downstream classification task. Feature Ensemble with LR is the approach used in Subsection 6.2, and the best results from Table 7 are listed. Experimental results show that for the BM dataset, both Score Ensemble and Feature Ensemble with LR perform well. For the Magellan and WebTable datasets, only Feature Ensemble with LR achieves good performance; e.g., compared with the SCA model on the Magellan dataset, it achieves a macro average F_1 score of 96.20% (vs 95.52%) and a weighted average F_1 score of 98.04% (vs 97.92%). Meanwhile, Feature Ensemble with BERT performs poorly on all three datasets. One potential reason is that too many parameters in the neural network lead to over-fitting with high variance.

    Table 8. Performance Evaluation for Ensemble Approaches

    | Approach | BMdata Macro (%) | BMdata Weighted (%) | Magellan Macro (%) | Magellan Weighted (%) | WebTable Macro (%) | WebTable Weighted (%) |
    | --- | --- | --- | --- | --- | --- | --- |
    | Sherlock[13] | 91.90 | 99.52 | 90.73 | 94.02 | 69.07 | 88.07 |
    | SCA | 95.33 | 99.71 | 95.52 | 97.92 | 92.16 | 97.34 |
    | Score ensemble | ***98.04*** | *99.72* | 95.35 | 97.72 | 91.21 | 96.83 |
    | Feature ensemble with BERT | 86.88 | 98.96 | 94.24 | 97.11 | 88.87 | 95.84 |
    | Feature ensemble with LR | *97.45* | ***99.80*** | ***96.20*** | ***98.04*** | ***94.24*** | ***97.63*** |

    Note: The better the performance, the more effective the ensemble approach. Italics mark performance better than that of the SCA model, and bold highlights the highest score for each metric on each specific dataset.

    In general, for attributes that are easy to distinguish, both Score Ensemble and Feature Ensemble with LR can be used for semantic identification. For general data, Feature Ensemble with LR ensembles the fine-tuned semantic embedding and the knowledge base embedding effectively and improves the identification performance.

    Taking the Magellan data as an example and taking the per-type F_1 score of Sherlock as the baseline, Fig.4 shows the differences between models. The SCA model outperforms Sherlock in 46 out of 71 semantic types while underperforming it in four. Among them, the F_1 scores of artist, asin, month, review_count and others are significantly higher than those of Sherlock. In more detail, as seen in Table 9, samples with the ground truth label of artist are misidentified as restaurant, city, director, company, and category by Sherlock, while our SCA model greatly reduces such errors, e.g., improving the F_1 score of artist from 48.00% to 89.66%. Moreover, introducing knowledge base embedding improves the performance for 47 out of 71 semantic types, with four types getting worse and 20 types staying equal. On the basis of the SCA semantic embedding, the introduction of external knowledge especially improves the identification performance for album, song, width and others. For example, in the SCA model, album is wrongly identified as price, genres, song and copyright, with an F_1 score of 76.92%, while in Feature Ensemble with LR, album is only confused with copyright or price, with an F_1 score of 89.66%. More examples can be seen in Table 9.

    Figure 4. Part of the F_1 score gaps between models. Taking the per-type F_1 score of Sherlock as the baseline, the blue histograms represent the performance differences between SCA and Sherlock, while the orange histograms represent the performance differences between Feature Ensemble with LR and Sherlock. A gap greater than 0 indicates performance better than that of Sherlock, and vice versa.
    Table 9. Examples for Semantic Type Identification

    | Ground Truth | Sherlock Prediction | Sherlock F_1 (%) | SCA Prediction | SCA F_1 (%) | Feature Ensemble with LR Prediction | Feature Ensemble with LR F_1 (%) |
    | --- | --- | --- | --- | --- | --- | --- |
    | width | review_count, color, height, **width**, length | 46.04 | **width**, length, height, weight | 49.46 | **width**, length, height, weight | 53.69 |
    | album | title, author, company, summary, artist, **album**, copyright | 75.00 | price, genres, song, **album**, copyright | 76.92 | **album**, copyright, price | 89.66 |
    | artist | restaurant, city, director, company, category, **artist** | 48.00 | director, **artist**, album | 89.66 | **artist**, album | 90.32 |
    | actors | author, **actors**, director, creators | 91.42 | **actors**, director | 99.60 | **actors** | 99.80 |
    | song | title, **song** | 84.21 | **song** | 91.67 | **song** | 100.00 |

    Note: The table lists the prediction results (bold indicates the true prediction; the others are false predictions) and F_1 scores for samples with the specific ground truth labels. Note that for types with only true prediction results but an F_1 score below 100%, all samples with the ground truth label of the specific type are identified correctly (i.e., the recall is 100%), but other samples are still falsely identified as the specific type.

    Our SCA model generates semantic embeddings by considering the context in relations, which compensates for the lower expressiveness of numerical types (e.g., asin, month, and review_count) and confusing types (e.g., artist, director, and authors). Further, by taking both the fine-tuned semantic embedding and the compressed knowledge base embedding into consideration, our Feature Ensemble with LR model improves or maintains the identification performance for most ASTs. The knowledge base embeddings extracted by retrieving DBpedia featurize the potential type characteristics of attributes, which are not available from semantic embeddings.

    The experimental datasets we used show clear long-tail distributions, where the common ASTs appear with high frequency, while some ASTs are less represented, with few training samples. With detailed analysis of the evaluation results in Table 8, we observe that the improvements in macro average F_1 score are generally higher than those in weighted average F_1 score, which suggests that the significant improvements come from boosting the accuracy for the less-represented types.

    Our model tends to generalize well to all ASTs, especially the low-frequency types that comprise the long tail. To evaluate the performance of long-tail type identification, we calculate MCC on the overall prediction results and CV based on the F_1 score of each type. The experimental results listed in Table 10 demonstrate that the ensemble model effectively improves MCC when schema context and external knowledge are introduced. A higher MCC value close to 1 indicates that the experimental result is close to a perfect prediction compared with the ground truth labels generated from the column headers. Besides, a lower CV indicates that the per-type F_1 scores are less variant, i.e., our model performs stably on long-tail datasets.

    Table 10. MCC and CV for Different Approaches

    | Approach | BMdata MCC | BMdata CV | Magellan MCC | Magellan CV | WebTable MCC | WebTable CV |
    | --- | --- | --- | --- | --- | --- | --- |
    | AvgWV | 0.983 | 0.406 | 0.737 | 0.561 | 0.784 | 0.922 |
    | Sherlock[13] | 0.993 | 0.137 | 0.937 | 0.170 | 0.871 | 0.404 |
    | SCA | 0.996 | 0.078 | 0.979 | 0.132 | 0.973 | 0.202 |
    | Feature ensemble with LR | **0.998** | **0.072** | **0.981** | **0.113** | **0.977** | **0.130** |

    Note: The table lists MCC and CV for different approaches. The bold font highlights the best score for each metric (the highest MCC and the lowest CV).

    Further, taking the Magellan data as an example, the normalized confusion matrices of Sherlock and Feature Ensemble with LR are shown in Fig.5 and Fig.6 respectively, where the vertical axis represents the true types of attributes, sorted in descending order by the support of each type in the training set, and the horizontal axis indicates the predicted types. The darker the color, the closer the value in the matrix is to 1, i.e., the more attributes are predicted as the type on the horizontal axis. It can be clearly seen that values along the diagonal of the matrix are mostly close to 1, which means that both the baseline Sherlock model and the feature ensemble model predict most ASTs accurately. On closer comparison, there are scattered non-zero values in the lower left corner of the matrix in Fig.5; that is, for the long-tail types with fewer samples, the Sherlock model incorrectly predicts them as high-frequency types. In Fig.6, such false predictions are significantly reduced, which further demonstrates that incorporating context and external knowledge can effectively alleviate the lack of training data for rare types, and confirms that our model can effectively boost the accuracy for the long-tail types.

    Figure  5.  Normalized confusion matrix of ASTs prediction with Sherlock.
    Figure  6.  Normalized confusion matrix of ASTs prediction with Feature Ensemble with LR.

    We transform AST identification for relational attributes into multi-class classification under the assumption that all the semantic types to be identified are predefined. For unknown types that do not exist in the training set, the well-trained classifier simply returns the relatively most likely matching semantics. To provide more accurate candidate types for such samples and enhance the robustness of our model, we conduct unknown type detection according to the maximum predicted probability given by the classifier, and provide candidate classes with reference to the extracted raw KBVec.

    With the Efthymiou[41] data used in [9] and the SCA model trained on the Magellan data, we conduct preliminary experiments for unknown AST identification. Compared with the predefined types in the Magellan dataset, Efthymiou contains both predefined and un-predefined ASTs. In our ensemble model, we regard samples whose prediction probabilities are lower than a defined threshold (0.5 in this case study) as anomalous samples with unknown types. Further, AST identification for anomalous samples is performed by referring to the raw KBVecs, and the candidate types corresponding to the non-zero slots in the KBVecs are highlighted. Table 11 shows examples for the case study; four candidate classes are listed for each sample. For example, for anomalous sample 823 with the ground truth label of airport, the SCA model misidentifies it as city, since the attribute values of airport contain confusing city names. In our ensemble model, the true label airport is highlighted by referring to the KBVec.
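The detection rule itself is a simple threshold test; a sketch assuming per-type probabilities from the classifier and a raw KBVec whose slots map to class names (all names and values below are hypothetical, and for simplicity the classifier's type set and the KB class set share names here, which they need not in practice):

```python
import numpy as np

THRESHOLD = 0.5   # samples below this maximum probability are treated as unknown

def identify(p, raw_kbvec, clses, top_k=4):
    """Return the predicted type for confident samples, or top-k candidate
    KB classes (ranked by raw KBVec weight) for anomalous samples."""
    if p.max() >= THRESHOLD:
        return ("known", clses[int(np.argmax(p))])
    # Anomalous sample: rank the non-zero KBVec slots in descending order.
    order = np.argsort(raw_kbvec)[::-1]
    candidates = [clses[i] for i in order if raw_kbvec[i] > 0][:top_k]
    return ("unknown", candidates)

clses = ["city", "airport", "place", "person"]
p = np.array([0.4, 0.3, 0.2, 0.1])        # low-confidence classifier output
raw_kbvec = np.array([1.0, 5.0, 3.0, 0.0])
result = identify(p, raw_kbvec, clses)    # falls back to KBVec candidates
```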

    Table  11.  Case Study for Samples with Unknown Types
    Sample ID Attribute Value Ground Truth False Prediction Candidate Class
    823 “Gimpo International Airport”, “Gimhae International Airport”, “Cheongju International Airport”, “Gwangju Airport”, “Daegu International Airport” airport city airport, architectural structure, infrastructure, place
    1067 “Lesser yellowlegs”, “Willet”, “Wandering tattler”, “Spotted sandpiper”, “Whimbrel” bird color aircraft, animal, bird, species
    1242 “Al-Jamiatul Ahlia Darul Ulum Moinul Islam”, “Al-Jamiah Al-Islamiah Patiya”, “Al-Jamiatul Arabiatul Islamiah, Ziri”, “Jamia Tawakkulia Renga Madrasah”, “Jamiah Islamiah Yunusia Brahmanbaria” university company agent, educational institution, organisation, university
    1557 “Carnegie Library of Homestead, Munhall, Pennsylvania”, “606 W. Cork St, Winchester, Virginia”, “Box 360, Winchester, Virginia (Frank Reed Horton's mailbox)”, “Ivanhoe Club Building, 3215 Park Avenue, Kansas City, Missouri”, “410-11 Land Bank Building, Kansas City 6, Missouri” building neighborhood place, educational institution, organisation, building
    Note: In the table, false prediction is the prediction results of our SCA model, while the candidate classes are obtained by our SCA-KB model, Feature Ensemble with LR. In detail, the candidate classes inferred from KBVec are sorted in descending order of the probability values.

    In this paper, we presented a context-aware method for the AST identification problem. The main idea is to cast AST identification as a multi-class classification problem. To this end, we proposed an SCA model that generates an embedding for each attribute and determines the AST of the attribute from that embedding. Our design makes the embedding capture the semantics of the target attribute as well as its schema context. Since the actual AST of an attribute might not be included in the training set, in which case the SCA model cannot identify the semantic type properly, we further proposed an SCA-KB model that enhances the attribute embeddings with the entities and ontology types in a knowledge base. Compared with the SCA model, the SCA-KB model can identify semantic types of attributes that are covered by the knowledge base but absent from the training set. We conducted extensive experiments, and the results demonstrated that our context-aware method outperforms the state-of-the-art approaches by a large margin: up to 6.14% and 25.17% in terms of macro average F_1 score, and up to 0.28% and 9.56% in terms of weighted average F_1 score, over high-quality and low-quality datasets respectively.
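    The two averages reported above differ only in how the per-type F_1 scores are combined: macro weights every type equally, while weighted scales each type by its support. A dependency-free sketch (equivalent in spirit to sklearn.metrics.f1_score with average='macro' or 'weighted'):

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Return (macro, weighted) average F1 over the types present in y_true."""
    types = sorted(set(y_true))
    per_type, support = {}, Counter(y_true)
    for t in types:
        tp = sum(yt == yp == t for yt, yp in zip(y_true, y_pred))
        fp = sum(yp == t != yt for yt, yp in zip(y_true, y_pred))
        fn = sum(yt == t != yp for yt, yp in zip(y_true, y_pred))
        per_type[t] = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    macro = sum(per_type.values()) / len(types)           # each type counts equally
    weighted = sum(per_type[t] * support[t] for t in types) / len(y_true)
    return macro, weighted
```

    On long-tailed data such as Magellan, the macro score is the harder metric, since rare types contribute as much as frequent ones.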

    For future research, we will explore improvements to our AST identification method from two perspectives. On the one hand, the ontology types in the knowledge base are hierarchical, yet our SCA-KB model uses them independently, in a flat manner. To close this gap, we will attempt to enhance the attribute embeddings in our SCA-KB model by preserving the hierarchical context of ontology types. On the other hand, we will try to improve the efficiency of our context-aware method. In the current design, both the SCA and the SCA-KB model are based on BERT, which is well recognized as a prohibitively expensive pre-trained model. To improve efficiency, we will attempt to optimize the underlying pre-trained model with weight quantization[42], knowledge distillation[43], and parameter sharing[44], which are orthogonal to our work.
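    To give a flavor of the first optimization direction, 8-bit linear (affine) weight quantization can be sketched as below. This is a toy illustration only; Q-BERT[42] itself uses Hessian-based mixed-precision quantization.

```python
def quantize(weights, bits=8):
    """Map floats onto 2**bits integer levels; return (codes, dequantized floats)."""
    lo, hi = min(weights), max(weights)
    span = (hi - lo) or 1.0           # guard against a constant weight vector
    levels = 2 ** bits - 1            # 255 levels for 8-bit
    codes = [round((w - lo) * levels / span) for w in weights]
    dequant = [lo + c * span / levels for c in codes]
    return codes, dequant
```

    Each weight is then stored as a single byte plus the shared (lo, span) pair, at the cost of a small reconstruction error visible in the dequantized values.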

    Figure  1.   Column content aware model. (a) Model architecture. (b) Input and output example.

    Figure  2.   Schema context aware model. (a) Model architecture. (b) Input and output example.

    Figure  3.   Long-tail data distribution in the Magellan dataset. "⋯" indicates semantic types omitted to keep the statistical distribution compact.

    Figure  4.   Part of the F_1 score gaps between models. Taking the per-type F_1 score of Sherlock as the baseline, the blue bars represent the performance differences between SCA and Sherlock, while the orange bars represent the differences between feature ensemble with LR and Sherlock. A gap greater than 0 indicates performance better than Sherlock's, and vice versa.

    Figure  5.   Normalized confusion matrix of AST prediction with Sherlock.

    Figure  6.   Normalized confusion matrix of AST prediction with Feature Ensemble with LR.

    Table  1   Example Book Relation for Attribute Semantic Type Identification

    title author date price publisher
    Echoes from Lane Field Bill Swank 1-Jun-99 $16.99 Turner Publishing Company
    A's Essential Steve Travers 1-Apr-07 $9.99 Triumph Books
    Endless Summers Jack Torry 1-Mar-96 $14.99 Taylor Trade Publishing

    Table  2   Example Production Relation for Attribute Semantic Type Identification

    name brand price
    MacBook Pro 15.4 Apple $3599.00
    ThinkPad Helix 37014DU Lenovo $2975.22
    Alienware 17 ANW17 17.3 Dell $2599.00

    Table  3   Example Movie Relation for Attribute Semantic Type Identification

    movie year director genres time
    High-Rise 2015 Ben Wheatley Action, Drama, Sci-Fi 112 (min)
    Mercy for Angels 2015 K.C. Amos Action, Drama, Thriller 91 (min)
    Barely Lethal 2015 Kyle Newman Action, Adventure, Comedy 96 (min)
    Algorithm 1. Extraction of KBVec
    Input: attributes with values Attrs; ontology classes clses of size d; a maximum number of matched entities N
    Output: KBVec
    1  Vector = [ ]
    2  for every attr in Attrs do
    3    Initialize the d-dimensional V for attr with zeros
    4    for every val in attr do
    5      Look up N Candidate Classes for val in KB
    6      for every cls in Candidate Classes do
    7        Find the index of cls in clses
    8        Add 1 to the corresponding slot of V
    9      end for
    10   end for
    11   Append V to Vector
    12 end for
    13 Apply z-score normalization to Vector
    14 Perform dimensionality reduction with PCA to get KBVec
    15 return KBVec
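    A minimal Python sketch of Algorithm 1, assuming a `lookup_classes(value, n)` oracle for the knowledge-base lookup of line 5 (this helper is our assumption, not part of the paper). The PCA step of line 14 (e.g., via sklearn.decomposition.PCA) is noted but omitted to keep the sketch dependency-free.

```python
from statistics import mean, pstdev

def extract_kbvec(attrs, classes, lookup_classes, n=3):
    """Count candidate-class hits per attribute, then z-score normalize column-wise."""
    index = {cls: i for i, cls in enumerate(classes)}
    vectors = []
    for values in attrs:                          # lines 2-12
        v = [0.0] * len(classes)                  # line 3: zero-initialized V
        for val in values:                        # line 4
            for cls in lookup_classes(val, n):    # lines 5-6
                if cls in index:                  # line 7
                    v[index[cls]] += 1            # line 8
        vectors.append(v)                         # line 11: after the value loop
    # line 13: column-wise z-score normalization
    normed_cols = []
    for col in zip(*vectors):
        mu, sigma = mean(col), pstdev(col)
        normed_cols.append([(x - mu) / sigma if sigma else 0.0 for x in col])
    # line 14 (omitted here): dimensionality reduction, e.g., PCA
    return [list(row) for row in zip(*normed_cols)]
```

    In the paper the raw KBVecs are 760-dimensional (one slot per ontology class) before compression.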

    Table  4   Statistics of Experimental Datasets

    Dataset  Number of Relations  Average Number of Attributes  Number of ASTs  Training Set Size  Validation Set Size  Test Set Size
    BMdata[34]  1361  4  8  3265  1089  1089
    Magellan[35]  6560  8  71  31016  10338  10339
    WebTable[36]  38493  2  83  56379  18793  18793

    Table  5   Experimental Parameters Setting

    Parameter Value
    Batch size 32
    Learning rate (AdamW) 2.0 × 10^-5
    Number of epochs 3
    Maximum sequence length 128/256

    Table  6   Performance Evaluation for Semantic Embedding

    Approach BMdata Magellan WebTable
    Macro (%) Weighted (%) Macro (%) Weighted (%) Macro (%) Weighted (%)
    AvgWV 85.88 98.44 70.00 69.46 46.52 75.79
    Sherlock[13] 91.90 99.52 90.73 94.02 69.07 88.07
    CCA 96.70 99.70 93.39 96.30 79.68 93.65
    SCA 95.33 99.71 95.52 97.92 92.16 97.34
    Note: The macro average F_1 score and the weighted average F_1 score of different models under different datasets are listed above. Among them, AvgWV is the basic model we construct, which embeds the attribute values with the averaged word vector and uses LR for classification. Sherlock is a strong baseline model.

    Table  7   Performance Evaluation for Knowledge Base Embedding

    Semantic Embedding Approach  KBVec  PCA  BMdata  Magellan  WebTable
    Macro (%) Weighted (%) Macro (%) Weighted (%) Macro (%) Weighted (%)
    AvgWV / / 85.88 98.44 70.00 69.46 46.52 75.79
    760 / 88.86 98.82 78.74 82.17 54.43 81.18
    760 (29/37/38) 0.9 89.52 98.73 78.91 82.41 53.85 80.87
    CCA / / 96.70 99.70 93.39 96.30 79.68 93.65
    760 / 95.33 99.71 94.37 96.58 80.25 93.74
    760 (29/37/38) 0.9 97.45 99.80 94.32 96.63 80.84 93.79
    SCA / / 95.33 99.71 95.52 97.92 92.16 97.34
    760 / 95.33 99.71 96.03 98.01 92.99 97.55
    760 (29/37/38) 0.9 95.01 99.70 96.20 98.04 94.24 97.63
    Note: With ablation experiments, we explore the effectiveness of knowledge base embedding. For each approach, the first row concatenates the semantic embeddings with no KBVec, the second row with the raw 760-dimensional KBVecs, and the third row with the compressed KBVecs; the numbers in brackets indicate the dimensions after compression, and the decimal is the minimum fraction of variance to be explained by PCA.

    Table  8   Performance Evaluation for Ensemble Approaches

    Approach BMdata Magellan WebTable
    Macro (%) Weighted (%) Macro (%) Weighted (%) Macro (%) Weighted (%)
    Sherlock[13] 91.90 99.52 90.73 94.02 69.07 88.07
    SCA 95.33 99.71 95.52 97.92 92.16 97.34
    Score ensemble 98.04 99.72 95.35 97.72 91.21 96.83
    Feature ensemble with BERT 86.88 98.96 94.24 97.11 88.87 95.84
    Feature ensemble with LR 97.45 99.80 96.20 98.04 94.24 97.63
    Note: The better the performance, the more effective the ensemble approach. Underlining highlights performance better than that of the SCA model, and bold highlights the highest score for each metric on each dataset.

    Table  9   Examples for Semantic Type Identification

    Ground Truth | Sherlock Prediction | Sherlock F_1 Score (%) | SCA Prediction | SCA F_1 Score (%) | Feature Ensemble with LR Prediction | Feature Ensemble with LR F_1 Score (%)
    width | review_count, color, height, width, length | 46.04 | width, length, height, weight | 49.46 | width, length, height, weight | 53.69
    album | title, author, company, summary, artist, album, copyright | 75.00 | price, genres, song, album, copyright | 76.92 | album, copyright, price | 89.66
    artist | restaurant, city, director, company, category, artist | 48.00 | director, artist, album | 89.66 | artist, album | 90.32
    actors | author, actors, director, creators | 91.42 | actors, director | 99.60 | actors | 99.80
    song | title, song | 84.21 | song | 91.67 | song | 100.00
    Note: The table lists the prediction results (bold indicates the true prediction; the others are false predictions) and F_1 scores for samples with the given ground truth labels. Note that for types whose predictions are all correct yet whose F_1 score is below 100%, all samples with that ground truth label are identified correctly (i.e., the recall is 100%), but other samples are still falsely identified as that type.

    Table  10   MCC and CV for Different Approaches

    Approach BMdata Magellan WebTable
    MCC CV MCC CV MCC CV
    AvgWV 0.983 0.406 0.737 0.561 0.784 0.922
    Sherlock[13] 0.993 0.137 0.937 0.170 0.871 0.404
    SCA 0.996 0.078 0.979 0.132 0.973 0.202
    Feature ensemble with LR 0.998 0.072 0.981 0.113 0.977 0.130
    Note: The table lists MCC and CV for different approaches. The bold font is used to highlight the highest score for each metric.
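    For reference, MCC in the binary case and the coefficient of variation (standard deviation over mean) can be sketched as follows. The multiclass MCC used for the table corresponds to sklearn.metrics.matthews_corrcoef, and treating CV as the dispersion of a set of per-type scores is our reading of the setup, not stated in this excerpt.

```python
from math import sqrt
from statistics import mean, pstdev

def mcc_binary(tp, tn, fp, fn):
    """Matthews correlation coefficient for the binary case; 1.0 is perfect."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def coefficient_of_variation(scores):
    """Dispersion relative to the mean; lower means more uniform scores."""
    return pstdev(scores) / mean(scores)
```

    A lower CV thus indicates that an approach performs more evenly across semantic types, which is why Feature Ensemble with LR scores best on both columns.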

    [1] Kandel S, Paepcke A, Hellerstein J, Heer J. Wrangler: Interactive visual specification of data transformation scripts. In Proc. the 2011 SIGCHI Conference on Human Factors in Computing Systems, May 2011, pp.3363–3372. DOI: 10.1145/1978942.1979444.

    [2] Rahm E, Bernstein P A. A survey of approaches to automatic schema matching. The VLDB Journal, 2001, 10(4): 334–350. DOI: 10.1007/s007780100057.

    [3] Zapilko B, Zloch M, Schaible J. Utilizing regular expressions for instance-based schema matching. In Proc. the 7th International Conference on Ontology Matching, Nov. 2012, pp.240–241. DOI: 10.5555/2887596.2887623.

    [4] Venetis P, Halevy A, Madhavan J, Paşca M, Shen W, Wu F, Miao G X, Wu C. Recovering semantics of tables on the web. Proceedings of the VLDB Endowment, 2011, 4(9): 528–538. DOI: 10.14778/2002938.2002939.

    [5] Snipes G. Google data studio. Journal of Librarianship and Scholarly Communication, 2018, 6(1): eP2214. DOI: 10.7710/2162-3309.2214.

    [6] Kaelin M. Microsoft power BI: A cheat sheet. Technical Report, Techrepublic, 2019. https://www.techrepublic.com/article/microsoft-power-bi-a-smart-persons-guide, July 2023.

    [7] Black D. Data wrangling ‘decoder ring’ homogenizes polyglot data lakes. Technical Report, Enterprise Tech., 2016. https://www.enterpriseai.news/2016/02/11/trifactas-data-wrangling-decoder-ring-homogenizes-polyglot-data-lakes/, July 2023.

    [8] Zhao C, He Y Y. Auto-EM: End-to-end fuzzy entity-matching using pre-trained deep models and transfer learning. In Proc. the 2019 World Wide Web Conference, May 2019, pp.2413–2424. DOI: 10.1145/3308558.3313578.

    [9] Chen J Y, Jiménez-Ruiz E, Horrocks I, Sutton C. Learning semantic annotations for tabular data. In Proc. the 28th International Joint Conference on Artificial Intelligence, Jul. 2019, pp.2088–2094. DOI: 10.24963/ijcai.2019/289.

    [10] Chen J Y, Jiménez-Ruiz E, Horrocks I, Sutton C. ColNet: Embedding the semantics of web tables for column type prediction. In Proc. the 33rd AAAI Conference on Artificial Intelligence, Feb. 2019, pp.29–36. DOI: 10.1609/aaai.v33i01.330129.

    [11] Ramnandan S K, Mittal A, Knoblock C A, Szekely P. Assigning semantic labels to data sources. In Proc. the 12th European Semantic Web Conference, May 31–June 4, 2015, pp.403–417. DOI: 10.1007/978-3-319-18818-8_25.

    [12] Pham M, Alse S, Knoblock C A, Szekely P. Semantic labeling: A domain-independent approach. In Proc. the 15th International Semantic Web Conference, Oct. 2016, pp.446–462. DOI: 10.1007/978-3-319-46523-4_27.

    [13] Hulsebos M, Hu K, Bakker M, Zgraggen E, Satyanarayan A, Kraska T, Demiralp Ç, Hidalgo C. Sherlock: A deep learning approach to semantic data type detection. In Proc. the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Jul. 2019, pp.1500–1508. DOI: 10.1145/3292500.3330993.

    [14] Krishna S. Introduction to Database and Knowledge-Base Systems. World Scientific Publishing, 1992. DOI: 10.1142/1374.

    [15] Gao Y, Liang J, Han B, Yakout M, Mohamed A. Building a large-scale accurate and fresh knowledge graph. In Proc. the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2018. https://kdd2018tutorialt39.azurewebsites.net/, July 2023.

    [16] Auer S, Bizer C, Kobilarov G, Lehmann J, Cyganiak R, Ives Z. DBpedia: A nucleus for a web of open data. In Proc. the 6th International Semantic Web Conference and the 2nd Asian Semantic Web Conference, Nov. 2007, pp.722–735. DOI: 10.1007/978-3-540-76298-0_52.

    [17] Bollacker K, Evans C, Paritosh P, Sturge T, Taylor J. Freebase: A collaboratively created graph database for structuring human knowledge. In Proc. the 2008 ACM SIGMOD International Conference on Management of Data, Jun. 2008, pp.1247–1250. DOI: 10.1145/1376616.1376746.

    [18] Rebele T, Suchanek F, Hoffart J, Biega J, Kuzey E, Weikum G. YAGO: A multilingual knowledge base from Wikipedia, Wordnet, and Geonames. In Proc. the 15th International Semantic Web Conference, Oct. 2016, pp.177–185. DOI: 10.1007/978-3-319-46547-0_19.

    [19] Zwicklbauer S, Einsiedler C, Granitzer M, Seifert C. Towards disambiguating Web tables. In Proc. the 2013 International Semantic Web Conference, Oct. 2013, pp.205–208.

    [20] Goodfellow I, Bengio Y, Courville A. Deep Learning. MIT Press, 2016.

    [21] LeCun Y, Bengio Y, Hinton G. Deep learning. Nature, 2015, 521(7553): 436–444. DOI: 10.1038/nature14539.

    [22] Schmidhuber J. Deep learning in neural networks: An overview. Neural Networks, 2015, 61: 85–117. DOI: 10.1016/j.neunet.2014.09.003.

    [23] Wang W, Zhang M H, Chen G, Jagadish H V, Ooi B C, Tan K L. Database meets deep learning: Challenges and opportunities. ACM SIGMOD Record, 2016, 45(2): 17–22. DOI: 10.1145/3003665.3003669.

    [24] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A N, Kaiser Ł, Polosukhin I. Attention is all you need. In Proc. the 31st International Conference on Neural Information Processing Systems, Dec. 2017, pp.6000–6010. DOI: 10.5555/3295222.3295349.

    [25] He K M, Zhang X Y, Ren S Q, Sun J. Deep residual learning for image recognition. In Proc. the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2016, pp.770–778. DOI: 10.1109/CVPR.2016.90.

    [26] Ba J L, Kiros J R, Hinton G E. Layer normalization. arXiv: 1607.06450, 2016. https://arxiv.org/abs/1607.06450, July 2023.

    [27] Devlin J, Chang M W, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv: 1810.04805, 2018. https://arxiv.org/abs/1810.04805, July 2023.

    [28] May C, Wang A, Bordia S, Bowman S R, Rudinger R. On measuring social biases in sentence encoders. arXiv: 1903.10561, 2019. https://arxiv.org/abs/1903.10561, July 2023.

    [29] Qiao Y F, Xiong C Y, Liu Z H, Liu Z Y. Understanding the behaviors of BERT in ranking. arXiv: 1904.07531, 2019. https://arxiv.org/abs/1904.07531, July 2023.

    [30] Harris Z S. Distributional structure. Word, 1954, 10(2/3): 146–162. DOI: 10.1080/00437956.1954.11659520.

    [31] Erk K. Vector space models of word meaning and phrase meaning: A survey. Language and Linguistics Compass, 2012, 6(10): 635–653. DOI: 10.1002/lnco.362.

    [32] Pennington J, Socher R, Manning C. GloVe: Global vectors for word representation. In Proc. the 2014 Conference on Empirical Methods in Natural Language Processing, Oct. 2014, pp.1532–1543. DOI: 10.3115/v1/D14-1162.

    [33] Le Q, Mikolov T. Distributed representations of sentences and documents. In Proc. the 31st International Conference on Machine Learning, Jun. 2014, pp.1188–1196. DOI: 10.5555/3044805.3045025.

    [34] Köpcke H, Thor A, Rahm E. Evaluation of entity resolution approaches on real-world match problems. Proceedings of the VLDB Endowment, 2010, 3(1/2): 484–493. DOI: 10.14778/1920841.1920904.

    [35] Konda P, Das S, Suganthan G C P, Doan A, Ardalan A, Ballard J R, Li H, Panahi F, Zhang H J, Naughton J, Prasad S, Krishnan G, Deep R, Raghavendra V. Magellan: Toward building entity matching management systems. Proceedings of the VLDB Endowment, 2016, 9(12): 1197–1208. DOI: 10.14778/2994509.2994535.

    [36] Eberius J, Braunschweig K, Hentsch M, Thiele M, Ahmadov A, Lehner W. Building the Dresden Web Table Corpus: A classification approach. In Proc. the 2nd International Symposium on Big Data Computing, Dec. 2015, pp.41–50. DOI: 10.1109/BDC.2015.30.

    [37] Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv: 1301.3781, 2013. https://arxiv.org/abs/1301.3781, July 2023.

    [38] Matthews B W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure, 1975, 405(2): 442–451. DOI: 10.1016/0005-2795(75)90109-9.

    [39] Chicco D. Ten quick tips for machine learning in computational biology. BioData Mining, 2017, 10(1): Article No. 35. DOI: 10.1186/s13040-017-0155-3.

    [40] Everitt B S. The Cambridge Dictionary of Statistics (2nd edition). Cambridge University Press, 2002.

    [41] Efthymiou V, Hassanzadeh O, Rodriguez-Muro M, Christophides V. Matching Web tables with knowledge base entities: From entity lookups to entity embeddings. In Proc. the 16th International Semantic Web Conference, Oct. 2017, pp.260–270. DOI: 10.1007/978-3-319-68288-4_16.

    [42] Shen S, Dong Z, Ye J Y, Ma L J, Yao Z W, Gholami A, Mahoney M W, Keutzer K. Q-BERT: Hessian based ultra low precision quantization of BERT. In Proc. the 34th AAAI Conference on Artificial Intelligence, Feb. 2020, pp.8815–8821. DOI: 10.1609/aaai.v34i05.6409.

    [43] Jiao X Q, Yin Y C, Shang L F, Jiang X, Chen X, Li L L, Wang F, Liu Q. TinyBERT: Distilling BERT for natural language understanding. arXiv: 1909.10351, 2019. https://arxiv.org/abs/1909.10351, July 2023.

    [44] Lan Z Z, Chen M D, Goodman S, Gimpel K, Sharma P, Soricut R. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv: 1909.11942, 2019. https://arxiv.org/abs/1909.11942, July 2023.


Publication History
  • Received: 2020-10-04
  • Accepted: 2021-06-08
  • Published online: 2023-10-15
  • Issue date: 2023-07-30
