Xiao Sun, De-Gen Huang, Hai-Yu Song, Fu-Ji Ren. Chinese New Word Identification: A Latent Discriminative Model with Global Features[J]. Journal of Computer Science and Technology, 2011, 26(1): 14-24. DOI: 10.1007/s11390-011-1107-x

Chinese New Word Identification: A Latent Discriminative Model with Global Features

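  • Summary: To further improve the accuracy of identifying new (out-of-vocabulary) words and tagging their parts of speech (POS), a hidden-variable semi-Markov conditional random field model (Hidden semi-CRF) is proposed. The model combines the strengths of the latent-dynamic conditional random field (LDCRF) and the semi-Markov conditional random field (semi-CRF), and it identifies new words and their POS tags simultaneously. Its computational complexity is lower than that of semi-CRF. Like semi-CRF, Hidden semi-CRF labels a subsequence of the input rather than a single position, so it can be applied to the same sequence labeling tasks as semi-CRF, such as new word identification and POS tagging. To raise the precision of new word identification, the notion of "global fragment features" is introduced: the fragments produced when a new word is segmented incorrectly are used as global fragment features when training and testing the CRF and Hidden semi-CRF models. This improves the precision of new word identification and, in turn, the overall accuracy of the Chinese lexical analysis system.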
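    By combining Latent-Dynamic Conditional Random Fields (LDCRF) with Semi-Markov Conditional Random Fields (semi-CRF), a new hidden-variable semi-Markov conditional random field model (Hidden semi-CRF) is proposed. At the lower layer of the model, LDCRF produces N-best outputs of new word boundaries. These candidate new words are then combined with all possible POS tags to construct the candidate entities of Hidden semi-CRF. The degree to which the candidate entities are pruned can be adjusted by changing the number of N-best outputs taken from LDCRF. In terms of feature selection, Hidden semi-CRF retains the advantages of semi-CRF.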
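    Hidden semi-CRF has lower complexity and computational cost than semi-CRF and, because it also inherits the properties of LDCRF, achieves higher accuracy than semi-CRF. Using the N-best outputs of LDCRF significantly reduces the number of candidate entities needed to train and test Hidden semi-CRF, which markedly improves its performance. Hidden semi-CRF assigns POS tags to new words while identifying them, so identification and POS tagging are completed in a single step. For new word identification, global fragment features (Global Fragment Information) are also used to train and test Hidden semi-CRF, which further improves the precision of new word identification and POS tagging. Because the N-best outputs of LDCRF are used to construct and prune the candidate entities of Hidden semi-CRF, and the degree of pruning can be tuned through the value of N, the computational complexity of Hidden semi-CRF is greatly reduced and POS tagging of incorrectly identified new words is avoided. In future work, Hidden semi-CRF will be applied to deeper analysis tasks such as Chinese lexical analysis, chunking, and syntactic parsing.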
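    New words are words that do not appear in the system lexicon. When real text is segmented and POS-tagged, new words degrade the results of Chinese word segmentation and POS tagging because both their boundaries and their parts of speech are unknown. Identifying the new words in a text with Hidden semi-CRF improves the precision of Chinese lexical analysis and, in turn, the precision and efficiency of deeper natural language processing applications such as syntactic parsing, grammatical analysis, and machine translation.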
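The candidate-entity construction described above (N-best boundary outputs combined with every POS tag, pruned through N) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the B/I/O boundary encoding, the function names `boundaries_to_spans` and `build_candidate_entities`, and the example POS tags are all assumptions made for illustration, and the N-best taggings are supplied by hand rather than produced by a trained LDCRF.

```python
from itertools import product


def boundaries_to_spans(tags):
    """Convert one B/I/O tag sequence into (start, end) spans, end exclusive.

    Assumed encoding: "B" opens a candidate new word, "I" continues it,
    "O" is outside any candidate. An "I" with no open span is ignored.
    """
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:      # a new "B" closes the previous span
                spans.append((start, i))
            start = i
        elif tag == "O":
            if start is not None:      # "O" closes the open span
                spans.append((start, i))
                start = None
    if start is not None:              # span running to the end of the input
        spans.append((start, len(tags)))
    return spans


def build_candidate_entities(nbest_taggings, pos_tags, n):
    """Pair every span found in the top-n taggings with every POS tag.

    A smaller n prunes harder: fewer distinct spans, fewer candidates.
    """
    spans = set()
    for tags in nbest_taggings[:n]:
        spans.update(boundaries_to_spans(tags))
    return sorted((s, e, pos) for (s, e), pos in product(spans, pos_tags))


# Hand-written stand-ins for the N-best boundary outputs of a boundary model.
nbest = [list("OBIOO"), list("OBOOO"), list("OOBIO")]
pos_tags = ["NN", "NR"]  # illustrative POS tag set

for n in (1, 2, 3):
    print(n, len(build_candidate_entities(nbest, pos_tags, n)))
```

With these hand-written inputs the candidate set grows as n increases, which is the knob the summary describes for trading coverage against computational cost.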

     

    Abstract: Chinese new words are particularly problematic in Chinese natural language processing. With the rapid development of the Internet and the information explosion, it is impossible to obtain a complete system lexicon for applications in Chinese natural language processing, because new words outside existing dictionaries are constantly being created. New word identification and POS tagging are usually performed as separate steps, so lexical features cannot be fully exploited. A latent discriminative model, which combines the strengths of the Latent-Dynamic Conditional Random Field (LDCRF) and semi-CRF, is proposed to detect new words together with their POS tags in a single pass, regardless of the type of new word, from Chinese text that has not been pre-segmented. Unlike semi-CRF, the proposed latent discriminative model uses LDCRF to generate candidate entities, which speeds up training and decreases the computational cost. The complexity of the proposed hidden semi-CRF can be further adjusted by tuning the number of hidden variables and the number of candidate entities taken from the N-best outputs of the LDCRF model. A new-word-generating framework is proposed for model training and testing, under which the definitions and distributions of new words conform to those in real text. A global feature called "Global Fragment Features" is adopted for new word identification. We tested our model on the corpus from SIGHAN-6. Experimental results show that the proposed method can detect even low-frequency new words together with their POS tags with satisfactory results. The proposed model performs competitively with state-of-the-art models.
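To illustrate why restricting a segment-level model to a pruned set of candidate entities lowers the cost, the following Python sketch decodes the best segmentation using only the candidate segments it is given. It is a simplified stand-in and not the proposed hidden semi-CRF: it has no hidden variables and no label-transition term, and the function name `semi_markov_decode`, the labels, and the hand-picked scores are assumptions made for this example.

```python
def semi_markov_decode(n, candidates, score):
    """Best-scoring segmentation of positions 0..n built from candidates only.

    candidates: iterable of (start, end, label), end exclusive.
    score: function (start, end, label) -> float, the segment score.
    Returns a list of segments covering 0..n, or None if no cover exists.
    The inner loop runs once per candidate, so fewer candidates means less work.
    """
    neg_inf = float("-inf")
    by_end = {}
    for s, e, label in candidates:
        by_end.setdefault(e, []).append((s, label))

    best = [neg_inf] * (n + 1)   # best[e]: best score of a cover of 0..e
    back = [None] * (n + 1)      # back[e]: (start, label) of the last segment
    best[0] = 0.0
    for e in range(1, n + 1):
        for s, label in by_end.get(e, []):
            if best[s] == neg_inf:
                continue         # no valid segmentation reaches position s
            total = best[s] + score(s, e, label)
            if total > best[e]:
                best[e] = total
                back[e] = (s, label)

    if best[n] == neg_inf:
        return None
    segments, e = [], n
    while e > 0:                 # follow back-pointers from the end
        s, label = back[e]
        segments.append((s, e, label))
        e = s
    return segments[::-1]


# Illustrative scores: single-character "O" segments and one candidate new word.
scores = {
    (0, 1, "O"): 0.5, (1, 2, "O"): 0.5, (2, 3, "O"): 0.5, (3, 4, "O"): 0.5,
    (1, 3, "NW-NN"): 1.5,
}
result = semi_markov_decode(4, scores.keys(), lambda s, e, lab: scores[(s, e, lab)])
print(result)
```

In a full semi-CRF every span up to a maximum length would be paired with every label; here the dynamic program only visits the segments passed in, which is the effect candidate generation is meant to achieve.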

     
