Special Issue: Artificial Intelligence and Pattern Recognition

• Machine Learning and Data Mining

Predicting Chinese Abbreviations from Definitions: An Empirical Learning Approach Using Support Vector Regression

Xu Sun (1, 2), Hou-Feng Wang (1), and Bo Wang (1)

  1. Institute of Computational Linguistics, School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China
  2. Department of Computer Science, Graduate School of Information Science and Technology, The University of Tokyo, Tokyo 113-0033, Japan
  • Received: 2007-05-08; Revised: 2008-04-02; Online: 2008-07-10; Published: 2008-07-10

In Chinese, phrases and named entities play a central role in information retrieval. Abbreviations, however, make keyword-based approaches less effective. This paper presents an empirical learning approach to Chinese abbreviation prediction. Each abbreviation is treated as a reduced form of its corresponding definition (expanded form), and abbreviation prediction is formalized as a scoring and ranking problem over abbreviation candidates, which are generated automatically from the definition. Using Support Vector Regression (SVR) for scoring yields multiple abbreviation candidates together with their SVR values, which are then used to rank the candidates. Experimental results show that the SVR method outperforms both the popular heuristic rule for abbreviation prediction and the hidden Markov model (HMM).
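The generate-score-rank formulation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how candidates are generated, so the sketch assumes they are order-preserving character subsequences of the definition, and `toy_score` is a hypothetical stand-in for the trained SVR model (the function names `generate_candidates`, `rank_candidates`, and `toy_score` are invented for illustration).

```python
from itertools import combinations


def generate_candidates(definition, min_len=2):
    """Enumerate order-preserving character subsequences of the definition.

    Assumption: candidates are proper subsequences with at least `min_len`
    characters; the abstract does not specify the generation procedure.
    """
    chars = list(definition)
    n = len(chars)
    seen = set()
    candidates = []
    for k in range(min_len, n):  # strictly shorter than the definition
        for idx in combinations(range(n), k):
            cand = "".join(chars[i] for i in idx)
            if cand not in seen:  # drop duplicates from repeated characters
                seen.add(cand)
                candidates.append(cand)
    return candidates


def toy_score(definition, candidate):
    """Hypothetical placeholder for the SVR value of a candidate.

    A real system would extract features from (definition, candidate)
    and call a trained regressor here.
    """
    score = 0.0
    if candidate[0] == definition[0]:
        score += 1.0  # reward keeping the first character
    score -= 0.5 * abs(len(candidate) - 2)  # prefer two-character forms
    return score


def rank_candidates(definition, score_fn, top_n=3):
    """Score every candidate and return the top_n (candidate, score) pairs."""
    scored = [(c, score_fn(definition, c)) for c in generate_candidates(definition)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    # "北京大学" (Peking University) is commonly abbreviated as "北大".
    for cand, score in rank_candidates("北京大学", toy_score):
        print(cand, score)
```

Because the ranking step only needs a function that maps a (definition, candidate) pair to a real value, the placeholder scorer can be swapped for a regressor's prediction without changing the rest of the pipeline; returning the top few candidates rather than a single one mirrors the abstract's point that multiple candidates are obtained together with their scores.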

Key words: WFMS; Task Agent; TaskActivator; multi-TaskDomain architecture


ISSN 1000-9000(Print)

CN 11-2296/TP

Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences
P.O. Box 2704, Beijing 100190 P.R. China
E-mail: jcst@ict.ac.cn