A Substitution-Translation-Restoration Framework for Handling Unknown Words in Statistical Machine Translation

Abstract: Unknown words are one of the key factors affecting translation quality. Traditionally, nearly all related work has focused on obtaining translations for the unknown words themselves. These approaches have two disadvantages. First, they usually rely on many additional resources, such as bilingual web data. Second, they cannot guarantee good lexical selection and reordering for the words surrounding an unknown word. This paper offers a new perspective on handling unknown words in statistical machine translation (SMT). Instead of making great efforts to find translations of unknown words, we focus on determining the semantic function that each unknown word plays in the sentence to be translated and on keeping that semantic function unchanged during translation. In this way, unknown words can help the lexical selection and phrase reordering of their surrounding words even though they themselves remain untranslated. To determine the semantic function of each unknown word, we propose two models: a distributional semantic model and a bidirectional language model. Extensive experiments on Chinese-to-English translation show that our method yields statistically significant improvements in translation quality for both phrase-based and linguistically syntax-based SMT models.
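The substitute, translate, restore idea described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: plain cosine similarity over context-word counts stands in for the distributional semantic model, the bidirectional language model is not shown, and every name here (`context_vector`, `pick_substitute`, `translate_with_unknowns`, the `translate` callback and its `(target_token, source_index)` output format) is hypothetical. It also assumes the unknown word occurs in some monolingual source-language corpus, and uses English toy tokens in place of Chinese.

```python
import math
from collections import Counter


def context_vector(word, corpus, window=2):
    """Count the tokens that occur within `window` positions of `word`."""
    counts = Counter()
    for sent in corpus:
        for i, tok in enumerate(sent):
            if tok == word:
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                counts.update(t for j, t in enumerate(sent[lo:hi], lo) if j != i)
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def pick_substitute(unknown, vocab, corpus):
    """Pick the in-vocabulary word whose contexts look most like the unknown word's."""
    target = context_vector(unknown, corpus)
    # sorted() only makes tie-breaking deterministic
    return max(sorted(vocab), key=lambda w: cosine(target, context_vector(w, corpus)))


def translate_with_unknowns(tokens, vocab, corpus, translate):
    """Substitute unknown words, translate, then restore the originals.

    `translate` is a hypothetical callback: it takes source tokens and returns
    a list of (target_token, source_index) pairs, i.e. a word-aligned output.
    """
    originals = {}   # source position -> original unknown word
    replaced = []
    for i, tok in enumerate(tokens):
        if tok in vocab:
            replaced.append(tok)
        else:
            originals[i] = tok
            replaced.append(pick_substitute(tok, vocab, corpus))  # substitution
    output = translate(replaced)                                  # translation
    # restoration: put the untranslated unknown word back at its aligned position
    return [originals.get(src, tgt) for tgt, src in output]


# Toy data (assumed for illustration only)
corpus = [
    ["he", "ate", "an", "apple", "today"],
    ["she", "ate", "an", "orange", "today"],
    ["he", "ate", "an", "pomelo", "today"],
    ["they", "drove", "a", "car", "home"],
]
vocab = {"he", "ate", "an", "apple", "today", "car"}


def toy_translate(tokens):
    """Stand-in for an SMT decoder: uppercases each token, monotone alignment."""
    return [(t.upper(), i) for i, t in enumerate(tokens)]
```

In this toy setting, `pomelo` shares its contexts with `apple`, so `apple` is substituted before decoding and `pomelo` is put back afterwards. A real decoder would reorder words, which is why the sketch restores by alignment index rather than by position in the output.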

     
