Improving Entity Linking in Chinese Domain by Sense Embedding Based on Graph Clustering

张照博; 钟芷漫; 袁平鹏; 金海

doi:10.1007/s11390-023-2835-4

Improving Entity Linking in Chinese Domain by Sense Embedding Based on Graph Clustering

Abstract
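Background: Entity linking connects a string in a text to the corresponding entity in a knowledge base through candidate entity generation and candidate entity ranking. It plays an important role in many NLP tasks, such as knowledge graph completion. Compared with entity linking in English, Chinese entity linking has to account for more factors, because Chinese text has neither spaces between words nor capital letters that directly signal entities. Moreover, in domains such as industrial manufacturing, entities are usually long strings, and the resulting problems, such as words nested inside entities, mean that common word embeddings cannot be applied directly to an entity's semantics; the entity must first be segmented into its internal words and characters before it can be vectorized. The words that make up an entity are also sometimes ambiguous: a long entity name such as "Dfvf3000发电机抗压实验仪表" (roughly, "Dfvf3000 generator compression-test instrument") does not convey its meaning directly, and it contains words such as "抗压" (pressure resistance), "实验" (experiment), and "仪表" (instrument) whose meanings may be ambiguous. The semantic space of domain words is therefore a subspace of the general-domain word embedding space, and using general-domain word embeddings directly may introduce considerable noise and reduce accuracy.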
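Objective: To improve entity linking in the industrial manufacturing domain, we design separate improvements for its two steps, candidate entity generation and candidate entity ranking. The main goal is to raise recall while preserving the precision of the semantic features used in the vectorized computations during ranking.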
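Methods: First, we implement an n-gram based candidate entity generation method to increase recall and reduce the noise caused by word nesting. The method treats the text as a character sequence and uses a sliding window to obtain the set of all possible entities. Second, we enhance the corresponding candidate entity ranking mechanism by introducing sense embedding. To resolve the conflict between the polysemy of word vectors and the single-sense requirement of the industrial manufacturing domain, we design SECEL, a sense embedding model based on graph clustering that includes two clustering algorithms (SCPM and WWF-MCL). It induces word senses with unsupervised clustering and learns sense embeddings in conjunction with context. Ranking uses three features for every candidate entity: word overlap, character overlap, and the semantic similarity computed from the sense embeddings. The three features are combined and fed to an XGBoost classifier, which is trained to output a predicted label for each candidate entity, namely whether it is the target entity.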
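The candidate generation and feature steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the exact-match lookup against a set of knowledge base entity names, and the use of Jaccard overlap are all assumptions made for illustration, and the sense-based similarity is taken as a precomputed number rather than derived from SCPM or WWF-MCL.

```python
def ngram_candidates(text, kb_entities, max_n=None):
    """Slide windows of every length over the character sequence.

    Assumption: a window is a candidate when it exactly matches a
    knowledge base entity name; the source does not specify the lookup.
    """
    max_n = max_n or len(text)
    found = []
    for n in range(1, min(max_n, len(text)) + 1):
        for start in range(len(text) - n + 1):
            span = text[start:start + n]
            if span in kb_entities and span not in found:
                found.append(span)
    return found


def overlap(a, b):
    """Jaccard overlap between two token collections (assumed measure)."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def ranking_features(mention_words, entity_words, sense_similarity):
    """Build the three-feature vector: word overlap, character overlap,
    and a sense-embedding similarity supplied by the caller."""
    word_overlap = overlap(mention_words, entity_words)
    char_overlap = overlap("".join(mention_words), "".join(entity_words))
    return [word_overlap, char_overlap, sense_similarity]
```

For example, calling `ngram_candidates("发电机抗压实验仪表", {"发电机", "抗压实验", "仪表", "实验仪表"}, max_n=4)` collects both nested names "抗压实验" and "实验仪表", which is the kind of nesting the ranking step then has to resolve. In the pipeline described above, vectors like the one returned by `ranking_features` would be passed to an XGBoost classifier; that training step is omitted here.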
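Results: To evaluate the graph-clustering-based sense embedding, we test the quality of the generated sense embeddings on a classical semantic similarity task (SCWS) and demonstrate their disambiguation ability in the general domain on a word sense disambiguation task (TWSI). Both SCPM and WWF-MCL outperform the baselines (GloVe, GenSense, ELMo) in semantic accuracy, and WWF-MCL improves accuracy by 1.8% over ELMo, the best-performing baseline. On the disambiguation task, the sense embeddings produced by both algorithms perform well, and WWF-MCL again achieves the best precision (+1.2%), recall (+1.9%), and F1 (+1.5%) relative to the best-performing baseline (Dependencies). The industrial manufacturing entity linking task uses two semi-manually constructed datasets, IEL-1 and IEL-2. On both datasets, the sense-embedding-enhanced SECEL model generally outperforms the word-embedding baselines. Under the best-performing parameter configuration, SECEL improves F1 by 2.68% and 4.2% on the two datasets, respectively. These results indicate that the proposed graph clustering algorithms capture and distinguish the different senses of polysemous words well, and that entity linking models built on such sense embeddings generally achieve better recall and precision than those built on word embeddings.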
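Conclusions: The improvement of sense embedding over word embedding shows that, for semantic computation in entity linking, and especially for entity linking in domains where each word has a single sense, it is worthwhile to account for the polysemy of an entity's internal words and to distinguish the senses effectively. The issue is not limited to entity linking: other tasks that rely on vectorized semantic computation and need to rule out polysemy can also introduce a sense embedding for disambiguation and thereby improve the accuracy of the semantic computation. In other words, with some adjustment, SECEL can also be applied to entity linking in other domains, such as medicine. In future work, we plan to further develop the IEL datasets to make them more complete and representative, and to improve the SECEL model to advance the development and application of entity linking in the industrial manufacturing domain.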

Abstract: Entity linking refers to linking a string in a text to the corresponding entity in a knowledge base through candidate entity generation and candidate entity ranking. It is of great significance to some NLP (natural language processing) tasks, such as question answering. Unlike English entity linking, Chinese entity linking requires more consideration because text sequences lack spacing and capitalization and because characters and words are ambiguous, which is more evident in certain scenarios. In Chinese domains such as industry, the generated candidate entities are usually composed of long strings and are heavily nested. In addition, the meanings of the words that make up industrial entities are sometimes ambiguous. Their semantic space is a subspace of the general word embedding space, so each word in an entity needs to be assigned its exact meaning. Therefore, we propose two schemes to achieve better Chinese entity linking. First, we implement an n-gram based candidate entity generation method to increase the recall rate and reduce the nesting noise. Then, we enhance the corresponding candidate entity ranking mechanism by introducing sense embedding. Considering the contradiction between the ambiguity of word vectors and the single sense required in the industrial domain, we design a sense embedding model based on graph clustering, which adopts an unsupervised approach for word sense induction and learns sense representation in conjunction with context. We test the embedding quality of our approach on classical datasets and demonstrate its disambiguation ability in general scenarios. Experiments confirm that our method better captures the underlying regularities of candidate entities in the industrial domain and achieves better entity linking performance.
