Web News Extraction via Tag Path Feature Fusion Using DS Theory

Gong-Qing Wu, Lei Li, Li Li, Xindong Wu

Gong-Qing Wu, Lei Li, Li Li, Xindong Wu. Web News Extraction via Tag Path Feature Fusion Using DS Theory[J]. Journal of Computer Science and Technology, 2016, 31(4): 661-672. DOI: 10.1007/s11390-016-1655-1
CSTR: 32374.14.s11390-016-1655-1

Funding: This work was supported by the National Basic Research 973 Program of China under Grant No. 2013CB329604, the Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT) of the Ministry of Education of China under Grant No. IRT13059, and the National Natural Science Foundation of China under Grant Nos. 61273297, 61229301, and 61503114.
    Author Bio:

    Gong-Qing Wu received his Bachelor's degree in computer science and technology from Anhui Normal University, Wuhu, in 1995, his Master's degree in computer science and technology from the University of Science and Technology of China, Hefei, in 2003, and his Ph.D. degree in computer science and technology from Hefei University of Technology, Hefei, in 2013. He is an associate professor of computer science and technology at Hefei University of Technology, Hefei. His research interests include data mining and web intelligence. He received a Best Paper Award at ICTAI 2011 and a Best Paper Award at WI 2012.

    Abstract: The content, layout style, and parse structure of web news pages differ greatly from one page to another. In addition, the layout style and parse structure of a given news page may change over time. For these reasons, designing features that deliver excellent extraction performance on massive, heterogeneous collections of web news pages is a challenging problem. Our extensive case studies indicate a potential correlation between the content layout of a web page and its tag paths. Inspired by this observation, we design a series of tag path features for extracting web news. Because each feature has its own strengths, we fuse them using DS (Dempster-Shafer) evidence theory into a single combined feature, and design a content extraction method, CEDS, based on it. Experimental results on the CleanEval datasets and on web news pages selected randomly from well-known websites show that the F1-score of CEDS is 8.08% higher than that of CETR and 3.08% higher than that of CEPR-TPR, two popular existing content extraction methods.
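
    The fusion step mentioned in the abstract can be illustrated with Dempster's rule of combination, the standard way to merge two bodies of evidence in DS theory. The sketch below is only an illustration, not the CEDS implementation: the two-hypothesis frame ("content" versus "noise"), the names `feature_a` and `feature_b`, and all mass values are assumptions made up for the example, since this page does not give the paper's features or their mass assignments.

```python
from itertools import product


def combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of hypotheses to a mass in [0, 1],
    and the masses within one function are assumed to sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            # Agreeing evidence supports the intersection of the two sets.
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            # Disjoint sets contribute to the conflict mass.
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("total conflict: the two sources cannot be combined")
    # Renormalize so the remaining masses sum to 1.
    return {h: mass / (1.0 - conflict) for h, mass in combined.items()}


# Hypothetical frame of discernment for one text node of a page.
CONTENT = frozenset({"content"})
NOISE = frozenset({"noise"})
EITHER = CONTENT | NOISE  # mass left on "don't know"

# Made-up masses, standing in for the evidence from two tag path features.
feature_a = {CONTENT: 0.6, NOISE: 0.1, EITHER: 0.3}
feature_b = {CONTENT: 0.5, NOISE: 0.2, EITHER: 0.3}

fused = combine(feature_a, feature_b)
for hypothesis, mass in fused.items():
    print(sorted(hypothesis), round(mass, 4))
```

    With these made-up inputs, both sources lean toward "content", so the fused mass on "content" ends up higher than either source assigned on its own, while the mass left on the undecided set shrinks. More than two features can be fused by applying `combine` repeatedly, because Dempster's rule is associative and commutative.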

Publication History
  • Received:  2016-02-28
  • Revised:  2016-04-24
  • Published:  2016-07-04
