2010, Vol. 25, Issue (4): 761-770. doi: 10.1007/s11390-010-1059-6

• Special Section on Advances in Machine Learning and Applications

2D Correlative-Chain Conditional Random Fields for Semantic Annotation of Web Objects

Yan-Hui Ding(丁艳辉), Member, CCF, Qing-Zhong Li*(李庆忠), Senior Member, CCF, Yong-Quan Dong(董永权), Member, CCF, and Zhao-Hui Peng(彭朝晖), Member, CCF

  1. School of Computer Science and Technology, Shandong University, Jinan 250014, China
  • Received: 2009-05-11 Revised: 2010-01-25 Online: 2010-07-09 Published: 2010-07-09
  • About author:
    Yan-Hui Ding is a Ph.D. candidate in computer science, Shandong University. He is a member of CCF. His research interests include Web information integration and Web information extraction.
    Qing-Zhong Li is a professor at School of Computer Science and Technology, Shandong University. He is a senior member of CCF. His research interests include Web information integration and enterprise information integration.
    Yong-Quan Dong is a Ph.D. candidate in computer science, Shandong University. He is a member of CCF. His research interests include Web information integration and Web data management.
    Zhao-Hui Peng is a lecturer at School of Computer Science and Technology, Shandong University. He received his Ph.D. degree from School of Information, Renmin University. He is a member of CCF. His research interests include information retrieval.
  • Supported by: the National Natural Science Foundation of China under Grant No. 90818001 and the Natural Science Foundation of Shandong Province of China under Grant No. Y2007G24.

Semantic annotation of Web objects is a key problem in Web information extraction. The Web contains an abundance of useful semi-structured information about real-world objects, and an empirical study shows that Web information about objects of the same type, across different Web sites, exhibits strong two-dimensional sequence characteristics and correlative characteristics. Conditional Random Fields (CRFs) are the state-of-the-art approach for exploiting sequence characteristics to improve labeling. However, because of the correlative characteristics between Web object elements, previous CRFs are limited for semantic annotation of Web objects and cannot handle the long-distance dependencies between Web object elements efficiently. To better incorporate long-distance dependencies, this paper first describes them by correlative edges, which are built by exploiting structured information and the characteristics of records from external databases. It then presents two-dimensional Correlative-Chain Conditional Random Fields (2DCC-CRFs) for semantic annotation of Web objects. This approach extends a classic model, two-dimensional Conditional Random Fields (2DCRFs), by adding correlative edges. Experimental results on a large amount of real-world data collected from diverse domains show that the proposed approach can significantly improve the semantic annotation accuracy of Web objects.
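As a rough illustration of the graph structure the abstract describes, the sketch below builds a two-dimensional lattice over Web object elements and then adds correlative edges between non-adjacent elements. It is not the authors' construction: the function names `grid_edges` and `correlative_edges`, the toy element texts, and the rule of linking two elements when both match values from the same column of an external database are all assumptions made for illustration only.

```python
from itertools import combinations


def grid_edges(n_rows, n_cols):
    """Edges of a 2D lattice: each node links to its right and lower neighbor."""
    edges = set()
    for r in range(n_rows):
        for c in range(n_cols):
            if c + 1 < n_cols:
                edges.add(((r, c), (r, c + 1)))
            if r + 1 < n_rows:
                edges.add(((r, c), (r + 1, c)))
    return edges


def correlative_edges(elements, db_columns, lattice):
    """Hypothetical rule (an assumption, not taken from the paper): link two
    elements that are not lattice neighbors when their text matches values
    from the same column of an external database."""
    def column_of(text):
        for name, values in db_columns.items():
            if text in values:
                return name
        return None

    matched = {pos: column_of(text) for pos, text in elements.items()}
    edges = set()
    # sorted() guarantees a < b, the same ordering grid_edges uses.
    for a, b in combinations(sorted(elements), 2):
        same_column = matched[a] is not None and matched[a] == matched[b]
        if same_column and (a, b) not in lattice:
            edges.add((a, b))
    return edges


# Toy data: three records (rows), each with a name and a price element.
elements = {
    (0, 0): "Camera A", (0, 1): "$100",
    (1, 0): "Camera B", (1, 1): "$200",
    (2, 0): "Camera C", (2, 1): "$300",
}
db_columns = {
    "name": {"Camera A", "Camera B", "Camera C"},
    "price": {"$100", "$200", "$300"},
}

lattice = grid_edges(3, 2)
extra = correlative_edges(elements, db_columns, lattice)
print(sorted(extra))
```

In this toy setting the lattice already connects neighboring records, so the added edges only join elements that are two rows apart. That is the kind of long-distance dependency a plain 2D lattice cannot represent directly.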




ISSN 1000-9000 (Print), 1860-4749 (Online)
CN 11-2296/TP

Journal of Computer Science and Technology
Institute of Computing Technology, Chinese Academy of Sciences