Exploiting Unlabeled Data for Neural Grammatical Error Detection

doi:10.1007/s11390-017-1757-4

Abstract

Identifying and correcting grammatical errors in text written by non-native writers has received increasing attention in recent years. Although a number of annotated corpora have been established to facilitate data-driven grammatical error detection and correction, they remain limited in quantity and domain coverage because human annotation is labor-intensive, time-consuming, and expensive. In this work, we propose to use unlabeled data to train neural-network-based grammatical error detection models. The basic idea is to cast error detection as a binary classification problem and to derive positive and negative training examples from unlabeled data. We introduce an attention-based neural network to capture long-distance dependencies that influence the word being detected. Experiments show that the proposed approach significantly outperforms SVMs and convolutional networks with fixed-size context windows.
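The idea of deriving labeled examples from unlabeled text can be illustrated with a minimal sketch. The abstract does not say how the examples are generated, so everything below is an assumption made for illustration: the confusion sets, the function name `make_examples`, and the label polarity (1 for the original word, 0 for a substituted word) are hypothetical and are not taken from the paper.

```python
import random

# Hypothetical confusion sets: small groups of words that writers
# commonly mistake for one another. These are illustrative only.
CONFUSION_SETS = [
    {"a", "an", "the"},
    {"in", "on", "at"},
    {"is", "are"},
]


def make_examples(tokens, rng):
    """Derive (tokens, position, label) triples from one unlabeled sentence.

    label 1: the word at `position` is the original one (assumed correct).
    label 0: the word at `position` was replaced by a confusable alternative.
    """
    examples = []
    for i, word in enumerate(tokens):
        for conf_set in CONFUSION_SETS:
            if word in conf_set:
                # The untouched sentence gives an example with label 1.
                examples.append((list(tokens), i, 1))
                # Swapping in another member of the set gives label 0.
                wrong = rng.choice(sorted(conf_set - {word}))
                corrupted = list(tokens)
                corrupted[i] = wrong
                examples.append((corrupted, i, 0))
                break
    return examples


rng = random.Random(0)
sentence = "she is in the garden".split()
for toks, pos, label in make_examples(sentence, rng):
    print(label, pos, " ".join(toks))
```

Each triple marks the position of the word under inspection, so a binary classifier can be trained on the whole sentence as context rather than on a fixed-size window. This sketch treats unlabeled text as error-free, which is itself an assumption that real data will sometimes violate.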
