Web News Extraction via Tag Path Feature Fusion Using DS Theory

Gong-Qing Wu (吴共庆); Lei Li (李磊); Li Li (李莉); Xindong Wu (吴信东)

doi:10.1007/s11390-016-1655-1

Abstract: Web pages differ greatly from one another in content, layout style, and parse tree structure. In addition, the layout style and parse tree structure of a web news page may change over time. For these reasons, designing features that extract content well from massive, heterogeneous collections of web news pages is a challenging problem. Our extensive case studies indicate that there is a latent association between the content layout of a web page and its tag path features. Inspired by this observation, we design a series of tag path features for extracting web news content. Because each feature has its own strengths, we fuse them with the DS (Dempster-Shafer) evidence theory to obtain a combined feature with better overall performance, and then design a content extraction method, CEDS, based on that combined feature. Experimental results on the CleanEval datasets and on web news pages selected randomly from well-known websites show that CEDS achieves an F₁-score 8.08% higher than CETR and 3.08% higher than CEPR-TPR, two popular existing content extraction methods.
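The abstract does not spell out how the tag path features are turned into evidence, so the following is only a minimal sketch of Dempster's rule of combination, the core operation of DS theory, and not the CEDS algorithm itself. The two-hypothesis frame (`content` versus `noise`), the feature names `feature_a` and `feature_b`, and all mass values are assumptions made up for illustration.

```python
from itertools import product

# Assumed frame of discernment: a text node is either news content or noise.
CONTENT = frozenset({"content"})
NOISE = frozenset({"noise"})
EITHER = CONTENT | NOISE  # mass left uncommitted (ignorance)


def combine(m1, m2):
    """Fuse two mass functions (dict: frozenset -> mass) with Dempster's rule."""
    fused = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            # Agreeing evidence supports the intersection of the two sets.
            fused[inter] = fused.get(inter, 0.0) + wa * wb
        else:
            # Disjoint sets contribute to the conflict mass K.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("total conflict: the sources cannot be combined")
    # Normalise by 1 - K so the fused masses sum to 1.
    return {k: v / (1.0 - conflict) for k, v in fused.items()}


# Hypothetical masses assigned to one text node by two illustrative features.
feature_a = {CONTENT: 0.6, NOISE: 0.1, EITHER: 0.3}
feature_b = {CONTENT: 0.5, NOISE: 0.2, EITHER: 0.3}

fused = combine(feature_a, feature_b)
```

With these made-up inputs the conflict mass is 0.17, and the fused belief in `content` rises to about 0.76, higher than either feature assigned on its own. This is the general effect the abstract relies on when it combines features with complementary strengths. More than two features can be fused by applying `combine` repeatedly, since Dempster's rule is associative and commutative.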
