2017, Vol. 32, Issue (4): 796-804. DOI: 10.1007/s11390-017-1760-9

Special Issue: Artificial Intelligence and Pattern Recognition

• Special Issue on Deep Learning •

Optimizing Non-Decomposable Evaluation Metrics for Neural Machine Translation

Shi-Qi Shen1,2,3, Student Member, CCF, Yang Liu1,2,3,4,*, Senior Member, CCF, Mao-Song Sun1,2,3,4, Senior Member, CCF   

  1 Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China;
    2 State Key Laboratory of Intelligent Technology and Systems, Tsinghua University, Beijing 100084, China;
    3 Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China;
    4 Jiangsu Collaborative Innovation Center for Language Ability, Jiangsu Normal University, Xuzhou 221009, China
  • Received: 2016-12-20  Revised: 2017-05-20  Online: 2017-07-05  Published: 2017-07-05
  • Contact: Yang Liu  E-mail: liuyang2011@tsinghua.edu.cn
  • Supported by:

    This work is supported by the National Natural Science Foundation of China under Grant Nos. 61522204 and 61432013, and by the National High Technology Research and Development 863 Program of China under Grant No. 2015AA015407. It is also supported by the Singapore National Research Foundation under its International Research Centre @ Singapore Funding Initiative, administered by the IDM (Interactive Digital Media) Programme.

While optimizing model parameters with respect to evaluation metrics has recently proven to benefit end-to-end neural machine translation (NMT), the evaluation metrics used in training are restricted to being defined at the sentence level to facilitate online learning algorithms. This is undesirable because the final evaluation metrics used in the testing phase are usually non-decomposable (i.e., they are defined at the corpus level and cannot be expressed as the sum of sentence-level metrics). To minimize the discrepancy between training and testing, we propose to extend the minimum risk training (MRT) algorithm to take non-decomposable corpus-level evaluation metrics into consideration while still keeping the advantages of online training. This is done by calculating corpus-level evaluation metrics on a subset of training data at each step of online training. Experiments on Chinese-English and English-French translation show that our approach improves the correlation between training and testing and significantly outperforms the MRT algorithm using decomposable evaluation metrics.
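To make the idea concrete, below is a minimal Python sketch, not the authors' implementation, of how a corpus-level metric can enter an MRT-style objective. The helper names (corpus_bleu, mrt_risk_on_subset) and the sharpness hyperparameter alpha are illustrative assumptions; corpus_bleu pools n-gram statistics over a whole subset of sentences, so it cannot be written as a sum of per-sentence scores, and mrt_risk_on_subset estimates the expected corpus-level loss by sampling joint candidate configurations under a sharpened per-sentence distribution in the spirit of MRT. In practice, the candidate translations and their log-probabilities would come from an NMT model; here they are hard-coded toy values.

```python
import math
import random
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4):
    """Corpus-level BLEU: n-gram counts are pooled over all sentences before
    the precisions are combined, so the score is non-decomposable."""
    clipped = [0] * max_n
    total = [0] * max_n
    hyp_len = sum(len(h) for h in hyps)
    ref_len = sum(len(r) for r in refs)
    for hyp, ref in zip(hyps, refs):
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(hyp, n), ngrams(ref, n)
            clipped[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            total[n - 1] += sum(h_counts.values())
    # Add-one smoothing keeps this toy version defined for short sentences.
    log_prec = sum(math.log((clipped[n] + 1) / (total[n] + 1)) for n in range(max_n))
    bp = min(0.0, 1.0 - ref_len / max(hyp_len, 1))  # brevity penalty, in log space
    return math.exp(bp + log_prec / max_n)

def mrt_risk_on_subset(cands, logps, refs, num_configs=64, alpha=5e-3):
    """Monte-Carlo estimate of the expected corpus-level loss on a subset.
    cands[s]: candidate token lists for sentence s; logps[s]: their model
    log-probabilities (given as plain numbers here for illustration)."""
    # Per-sentence distribution Q over candidates, sharpened by alpha as in MRT.
    qs = []
    for lp in logps:
        m = max(alpha * v for v in lp)
        ws = [math.exp(alpha * v - m) for v in lp]
        z = sum(ws)
        qs.append([w / z for w in ws])
    risk = 0.0
    for _ in range(num_configs):
        # One "configuration" picks a candidate per sentence; the corpus-level
        # metric is then evaluated on the whole subset jointly.
        hyp = [c[random.choices(range(len(c)), weights=q)[0]]
               for c, q in zip(cands, qs)]
        risk += 1.0 - corpus_bleu(hyp, refs)
    return risk / num_configs

if __name__ == "__main__":
    refs = [["the", "cat", "sat"], ["a", "dog", "barked", "loudly"]]
    cands = [[["the", "cat", "sat"], ["a", "cat", "sat", "down"]],
             [["a", "dog", "barked", "loudly"], ["the", "dog", "barked"]]]
    logps = [[-1.0, -3.0], [-0.5, -2.5]]
    print("estimated corpus-level risk:", mrt_risk_on_subset(cands, logps, refs))
```

Because the metric is computed jointly over the subset, the risk cannot be split into per-sentence terms; the subset size then trades fidelity to the corpus-level test metric against the granularity of online updates.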

[1] Kalchbrenner N, Blunsom P. Recurrent continuous translation models. In Proc. the Conference on Empirical Methods in Natural Language Processing, Oct. 2013, pp.1700-1709.

[2] Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks. In Proc. Advances in Neural Information Processing Systems, Dec. 2014, pp.3104-3112.

[3] Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. In Proc. ICLR, May 2015.

[4] Och F J. Minimum error rate training in statistical machine translation. In Proc. the 41st Annual Meeting of the Association for Computational Linguistics, July 2003, pp.160-167.

[5] Chiang D. A hierarchical phrase-based model for statistical machine translation. In Proc. the 43rd Annual Meeting of the Association for Computational Linguistics, June 2005, pp.263-270.

[6] Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation, 1997, 9(8):1735-1780.

[7] Chung J, Gulcehre C, Cho K, Bengio Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv:1412.3555, 2014. https://arxiv.org/abs/1412.3555, May 2017.

[8] Ranzato M, Chopra S, Auli M, Zaremba W. Sequence level training with recurrent neural networks. In Proc. ICLR, May 2016.

[9] Shen S, Cheng Y, He Z, He W, Wu H, Sun M, Liu Y. Minimum risk training for neural machine translation. In Proc. the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Aug. 2016, pp.1683-1692.

[10] Williams R J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 1992, 8(3/4):229-256.

[11] Smith D A, Eisner J. Minimum risk annealing for training log-linear models. In Proc. the COLING/ACL on Main Conference Poster Sessions, July 2006, pp.787-794.

[12] He X, Deng L. Maximum expected BLEU training of phrase and lexicon translation models. In Proc. the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), July 2012, pp.292-301.

[13] Gao J, He X, Yih W, Deng L. Learning continuous phrase representations for translation modeling. In Proc. the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), June 2014, pp.699-709.

[14] Papineni K, Roukos S, Ward T, Zhu W J. BLEU: A method for automatic evaluation of machine translation. In Proc. the 40th Annual Meeting of the Association for Computational Linguistics, July 2002, pp.311-318.

[15] Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J. A study of translation edit rate with targeted human annotation. In Proc. the 7th Association for Machine Translation in the Americas, Aug. 2006, pp.223-231.

[16] Watanabe T, Suzuki J, Tsukada H, Isozaki H. Online large-margin training for statistical machine translation. In Proc. the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), June 2007, pp.764-773.

[17] Chiang D, Marton Y, Resnik P. Online large-margin training of syntactic and structural translation features. In Proc. the Conference on Empirical Methods in Natural Language Processing, Oct. 2008, pp.224-233.

[18] Chiang D. Hope and fear for discriminative training of statistical translation models. The Journal of Machine Learning Research, 2012, 13(1):1159-1187.

[19] Neubig G, Watanabe T. Optimization for statistical machine translation: A survey. Computational Linguistics, 2016, 42(2):1-54.

[20] Kar P, Narasimhan H, Jain P. Online and stochastic gradient methods for non-decomposable loss functions. In Proc. Advances in Neural Information Processing Systems, Dec. 2014, pp.694-702.

[21] Narasimhan H, Vaish R, Agarwal S. On the statistical consistency of plug-in classifiers for non-decomposable performance measures. In Proc. Advances in Neural Information Processing Systems, Dec. 2014, pp.1493-1501.

[22] Jean S, Cho K, Memisevic R, Bengio Y. On using very large target vocabulary for neural machine translation. In Proc. the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), July 2015, pp.1-10.

[23] Luong M T, Sutskever I, Le Q V, Vinyals O, Zaremba W. Addressing the rare word problem in neural machine translation. In Proc. the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), July 2015, pp.11-19.

[24] He D, Xia Y, Qin T, Wang L, Yu N, Liu T, Ma W Y. Dual learning for machine translation. In Proc. Advances in Neural Information Processing Systems, Dec. 2016, pp.820-828.

[25] Koehn P. Statistical significance tests for machine translation evaluation. In Proc. the Conference on Empirical Methods in Natural Language Processing, July 2004, pp.388-395.