2016, Vol. 31, Issue (4): 661-672. doi: 10.1007/s11390-016-1655-1

Topics: Artificial Intelligence and Pattern Recognition; Data Management and Data Mining

• Special Section on Selected Paper from NPC 2011 •

Web News Extraction via Tag Path Feature Fusion Using DS Theory

Gong-Qing Wu¹ (吴共庆), Member, CCF, Lei Li¹ (李磊), Member, IEEE, Li Li² (李莉), and Xindong Wu³ (吴信东), Fellow, IEEE

  • ¹ School of Computer and Information, Hefei University of Technology, Hefei 230009, China;
    ² IFLYTEK CO., LTD., Hefei 230088, China;
    ³ Department of Computer Science, University of Vermont, Burlington, VT 05405, U.S.A.
  • Received: 2016-02-29; Revised: 2016-04-25; Online: 2016-07-05; Published: 2016-07-05
  • About author: Gong-Qing Wu received his Bachelor's degree in computer science and technology from Anhui Normal University, Wuhu, in 1995, his Master's degree in computer science and technology from the University of Science and Technology of China, Hefei, in 2003, and his Ph.D. degree in computer science and technology from Hefei University of Technology, Hefei, in 2013. He is an associate professor of computer science and technology at Hefei University of Technology, Hefei. His research interests include data mining and web intelligence. He received a Best Paper Award at ICTAI 2011 and a Best Paper Award at WI 2012.
  • Supported by: This work was supported by the National Basic Research 973 Program of China under Grant No. 2013CB329604, the Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT) of the Ministry of Education of China under Grant No. IRT13059, and the National Natural Science Foundation of China under Grant Nos. 61273297, 61229301, and 61503114.

Abstract: Web news pages differ greatly from one another in content, layout style, and parse structure, and the layout style and parse structure of a given page may also change over time. Designing features that extract content well from massive, heterogeneous collections of web news pages is therefore a challenging problem. Our extensive case studies indicate a potential correlation between web content layouts and their tag paths. Inspired by this observation, we design a series of tag path features for extracting web news. Because each feature has its own strengths, we fuse them using DS (Dempster-Shafer) evidence theory and build a content extraction method, CEDS, on the fused feature. Experimental results on the CleanEval datasets and on web news pages selected randomly from well-known websites show that the F1-score of CEDS is 8.08% higher than that of CETR and 3.08% higher than that of CEPR-TPR, two popular existing content extraction methods.
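
The abstract names DS evidence theory as the fusion mechanism but does not give the feature definitions or the mass assignments used in CEDS. The following is therefore only a minimal sketch of Dempster's rule of combination itself, assuming a two-hypothesis frame ("content" versus "noise") and two hypothetical features whose mass values are invented for illustration. The names `combine`, `feature_a`, and `feature_b` are assumptions, not identifiers from the paper.

```python
from itertools import product


def combine(m1, m2):
    """Fuse two mass functions with Dempster's rule of combination.

    A mass function maps a frozenset of hypotheses (a focal element)
    to its mass; the masses of one function are expected to sum to 1.
    """
    fused = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            # Agreeing evidence accumulates on the intersection.
            fused[common] = fused.get(common, 0.0) + mass_a * mass_b
        else:
            # Disjoint focal elements contribute to the conflict mass.
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources cannot be combined")
    # Renormalize so the fused masses sum to 1 again.
    return {s: v / (1.0 - conflict) for s, v in fused.items()}


# Assumed frame of discernment for a text node: content or noise.
CONTENT = frozenset({"content"})
NOISE = frozenset({"noise"})
EITHER = CONTENT | NOISE  # mass placed here expresses uncertainty

# Hypothetical evidence from two tag path features (illustrative values).
feature_a = {CONTENT: 0.6, NOISE: 0.1, EITHER: 0.3}
feature_b = {CONTENT: 0.5, NOISE: 0.2, EITHER: 0.3}

fused = combine(feature_a, feature_b)
for focal, mass in fused.items():
    print(sorted(focal), round(mass, 4))
```

In this toy example both features lean towards "content", so the fused mass on "content" is higher than either feature assigns on its own, which is the behaviour that motivates fusing features with complementary strengths.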

Copyright © Editorial Office of Journal of Computer Science and Technology