2016, Vol. 31, Issue (4): 661-672. doi: 10.1007/s11390-016-1655-1
Topics: Artificial Intelligence and Pattern Recognition; Data Management and Data Mining
• Special Section on Selected Paper from NPC 2011 •
Gong-Qing Wu1(吴共庆), Member, CCF, Lei Li1(李磊), Member, IEEE, Li Li2(李莉), and Xindong Wu3(吴信东), Fellow, IEEE
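Web pages differ considerably in content, layout, and parse tree structure. Moreover, the page layout and parse tree structure of a news web page may change over time. Designing extraction features that perform well on massive, heterogeneous news web pages is therefore a challenging problem. Extensive case studies indicate a potential association between the content layout of a web page and its tag path features. Motivated by this observation, this paper designs a series of tag path features for extracting web news content. Because each feature has its own strengths, the paper fuses them using Dempster-Shafer (DS) evidence theory to obtain a combined feature with better performance, and designs a web content extraction algorithm, CEDS, based on that feature. Experiments on the CleanEval dataset and on a dataset of web news pages randomly selected from well-known websites show that CEDS outperforms the CETR algorithm by 8.08% and the CEPR-TPR algorithm by 3.08% on the overall F-measure.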
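The fusion step mentioned in the abstract rests on Dempster's rule of combination from DS evidence theory. The sketch below illustrates that general rule only; it is not the CEDS algorithm, and the function name `combine_dempster`, the two-hypothesis frame ("content" versus "noise"), and all mass values are hypothetical choices made for illustration. It shows how two features that each assign belief to "content", "noise", or "either" could be merged into one belief assignment.

```python
from itertools import product


def combine_dempster(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps a frozenset of hypotheses to a mass in [0, 1];
    the masses of one function are assumed to sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            # Agreeing evidence accumulates on the intersection of the two sets.
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            # Disjoint sets contribute to the conflict mass K.
            conflict += mass_a * mass_b
    if conflict >= 1.0 - 1e-12:
        raise ValueError("sources are in total conflict; the rule is undefined")
    # Normalize by 1 - K so the fused masses sum to 1 again.
    return {s: v / (1.0 - conflict) for s, v in combined.items()}


# Hypothetical frame of discernment: a text node is either content or noise.
CONTENT = frozenset({"content"})
NOISE = frozenset({"noise"})
EITHER = CONTENT | NOISE  # mass left uncommitted

# Illustrative (made-up) evidence from two tag path features for one node.
feature_a = {CONTENT: 0.6, NOISE: 0.1, EITHER: 0.3}
feature_b = {CONTENT: 0.5, NOISE: 0.2, EITHER: 0.3}

fused = combine_dempster(feature_a, feature_b)
for hypothesis, mass in fused.items():
    print(sorted(hypothesis), round(mass, 4))
```

In this toy case both features lean toward "content", so the fused mass on "content" ends up higher than either input, while the uncommitted mass shrinks. How the paper derives masses from its tag path features is not stated in the abstract, so that mapping is left out here.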