Xiao Sun, De-Gen Huang, Hai-Yu Song, Fu-Ji Ren. Chinese New Word Identification: A Latent Discriminative Model with Global Features[J]. Journal of Computer Science and Technology, 2011, 26(1): 14-24. DOI: 10.1007/s11390-011-9411-z

Chinese New Word Identification: A Latent Discriminative Model with Global Features

Funds: This work is partially supported by the Doctor Startup Fund of Liaoning Province under Grant No.20101021.
  • Author Bio:

    Xiao Sun received the M.E. degree in 2004 from the Department of Computer Sciences and Engineering, Dalian University of Technology, Dalian, China. He now works at the School of Computer Science and Engineering, Dalian Nationalities University. He received a double Ph.D. degree from Dalian University of Technology, China, and the University of Tokushima, Japan. His research interests include natural language processing, machine translation, Chinese lexical analysis, and machine learning.

    De-Gen Huang was born in 1965. He is a professor at the Department of Computer Science and Engineering, Dalian University of Technology. His main research interests include natural language processing, machine learning, and machine translation. He is a senior member of CCF and an associate editor of Int. J. Advanced Intelligence.

    Hai-Yu Song received the B.E. degree in computer and application in 1996 and the M.E. degree in computer software and theory in 2003, both from Jilin University, China. He is now a Ph.D. candidate in computer software and theory at Jilin University and works at Dalian Nationalities University. His research interests include image analysis and understanding, image retrieval, data mining, and computer graphics.

    Fu-Ji Ren received the B.E. degree in 1982 and the M.E. degree in 1985 from the Department of Computer Sciences, Beijing University of Posts and Telecommunications, Beijing, China. He received the Ph.D. degree in 1991 from the Faculty of Engineering, Hokkaido University, Japan. He worked at CSK, Japan, where he was a chief researcher of NLP. From 1994 to 2000, he was an associate professor. His research interests include natural language processing, machine translation, artificial intelligence, language understanding, and communication.

  • Received Date: June 18, 2009
  • Revised Date: December 13, 2010
  • Published Date: December 31, 2010
  • Chinese new words are particularly problematic in Chinese natural language processing. With the rapid development of the Internet and the information explosion, it is impossible to obtain a complete system lexicon for Chinese natural language processing applications, because new words outside the dictionaries are constantly being created. New word identification and POS tagging are usually performed as separate procedures, so lexical features cannot be fully exploited. A latent discriminative model, which combines the strengths of the Latent Dynamic Conditional Random Field (LDCRF) and the semi-CRF, is proposed to detect new words and their POS tags simultaneously, regardless of the type of new word, from Chinese text that has not been pre-segmented. Unlike the semi-CRF, the proposed latent discriminative model applies LDCRF to generate candidate entities, which speeds up training and reduces the computational cost. The complexity of the proposed hidden semi-CRF can be further adjusted by tuning the number of hidden variables and the number of candidate entities taken from the N-best outputs of the LDCRF model. A new-word-generating framework is proposed for model training and testing, under which the definitions and distributions of new words conform to those in real text. A global feature called "Global Fragment Features" is adopted for new word identification. We tested our model on the corpus from SIGHAN-6. Experimental results show that the proposed method can detect even low-frequency new words together with their POS tags with satisfactory results, and that it performs competitively with state-of-the-art models.
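
The candidate-pruning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' model: there are no latent variables, POS labels, or learned weights here. It only shows, under stated assumptions, how restricting a segment-level Viterbi search to spans that occur in some N-best segmentation shrinks the search lattice. The function names, the toy input, and the toy scoring function are all hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def candidate_spans(nbest: List[List[str]]) -> Set[Span]:
    """Collect every span that occurs in any of the N-best segmentations."""
    spans: Set[Span] = set()
    for segmentation in nbest:
        pos = 0
        for word in segmentation:
            spans.add((pos, pos + len(word)))
            pos += len(word)
    return spans


def best_segmentation(text: str, spans: Set[Span],
                      score: Callable[[str], float]) -> List[str]:
    """Segment-level Viterbi restricted to the candidate spans."""
    n = len(text)
    starts_by_end: Dict[int, List[int]] = {}
    for start, end in spans:
        starts_by_end.setdefault(end, []).append(start)

    neg_inf = float("-inf")
    best = [neg_inf] * (n + 1)   # best[e]: best score of a segmentation of text[:e]
    back = [-1] * (n + 1)        # back[e]: start of the last segment ending at e
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in starts_by_end.get(end, []):
            if best[start] == neg_inf:
                continue
            total = best[start] + score(text[start:end])
            if total > best[end]:
                best[end], back[end] = total, start

    if best[n] == neg_inf:
        raise ValueError("candidate spans do not cover the text")

    words: List[str] = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


# Toy input (hypothetical): three N-best segmentations of the same string.
text = "abcd"
nbest = [["ab", "cd"], ["a", "bcd"], ["ab", "c", "d"]]

# Toy segment score (hypothetical stand-in for a trained model).
lexicon = {"ab": 1.0, "cd": 1.0}
toy_score = lambda segment: lexicon.get(segment, -1.0)

spans = candidate_spans(nbest)
result = best_segmentation(text, spans, toy_score)
```

In this toy case the pruned lattice holds 6 spans, whereas an unrestricted search over a 4-character string would consider all 10 possible spans. Raising or lowering N changes the number of candidate spans, which is the kind of complexity adjustment the abstract describes.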