Abstract: The content, layout, and parse-tree structure of web news pages differ greatly from one page to another, and the layout and parse-tree structure of a given page may change over time. Designing features that extract content accurately from massive, heterogeneous collections of web news pages is therefore a challenging problem. Our extensive case studies indicate a potential correlation between the content layout of a web page and its tag-path features. Inspired by this observation, we design a series of tag-path features for extracting web news content. Because each feature has its own strengths, we fuse them using Dempster-Shafer (DS) evidence theory to obtain a combined feature, and on that basis design a web content extraction algorithm, CEDS. Experimental results on the CleanEval datasets and on web news pages randomly selected from well-known websites show that the F1-score of CEDS is 8.08% higher than that of CETR and 3.08% higher than that of CEPR-TPR.
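The abstract names DS evidence theory as the fusion step but does not give the formula used. As a rough illustration only, the sketch below applies the standard Dempster rule of combination to two mass functions. The hypothesis labels, the mass values, and the names `combine`, `feature_a`, and `feature_b` are all assumptions made for this example; they are not taken from CEDS.

```python
from itertools import product


def combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of hypotheses to its mass.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            # Agreeing evidence: the product mass goes to the intersection.
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            # Contradictory evidence: accumulate the conflict mass.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    norm = 1.0 - conflict
    return {h: w / norm for h, w in combined.items()}


# Hypothetical frame of discernment: a text node is content or noise.
CONTENT = frozenset({"content"})
NOISE = frozenset({"noise"})
EITHER = CONTENT | NOISE  # mass left uncommitted by a feature

# Made-up masses standing in for two tag-path features (assumption).
feature_a = {CONTENT: 0.6, NOISE: 0.1, EITHER: 0.3}
feature_b = {CONTENT: 0.5, NOISE: 0.2, EITHER: 0.3}

fused = combine(feature_a, feature_b)
for hypothesis, mass in fused.items():
    print(sorted(hypothesis), round(mass, 4))
```

In this toy case both sources lean toward "content", so the fused mass on that hypothesis ends up higher than either source assigns on its own, which is the intuition behind fusing features with complementary strengths.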