Gong-Qing Wu, Lei Li, Li Li, Xindong Wu. Web News Extraction via Tag Path Feature Fusion Using DS Theory[J]. Journal of Computer Science and Technology, 2016, 31(4): 661-672. DOI: 10.1007/s11390-016-1655-1

Web News Extraction via Tag Path Feature Fusion Using DS Theory

  • The contents, layout styles, and parse structures of web news pages differ greatly from one page to another, and the layout style and parse structure of a given page may change over time. Designing features that extract well across massive, heterogeneous collections of web news pages is therefore a challenging issue. Our extensive case studies indicate a potential relevance between web content layouts and their tag paths. Inspired by this observation, we design a series of tag path features for extracting web news. Because each feature has its own strengths, we fuse all of them using DS (Dempster-Shafer) evidence theory and, on that basis, design a content extraction method, CEDS. Experimental results on both the CleanEval datasets and web news pages selected randomly from well-known websites show that the F1-score of CEDS is 8.08% and 3.08% higher than that of the existing popular content extraction methods CETR and CEPR-TPR, respectively.
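
The fusion step mentioned in the abstract can be illustrated with Dempster's rule of combination, the standard operator of DS evidence theory. The sketch below is not the CEDS implementation: the two-hypothesis frame (`content` vs. `noise`), the names `combine`, `feature_a`, and `feature_b`, and all mass values are assumptions chosen only to show how evidence from two features might be merged for a single tag path.

```python
from itertools import product

# Hypothetical frame of discernment: a tag path holds news content or noise.
CONTENT = frozenset({"content"})
NOISE = frozenset({"noise"})
EITHER = CONTENT | NOISE  # mass assigned here expresses uncertainty


def combine(m1, m2):
    """Combine two mass functions with Dempster's rule.

    Each mass function maps frozenset hypotheses to masses that sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            # Agreeing evidence accumulates on the shared hypothesis.
            combined[intersection] = combined.get(intersection, 0.0) + wa * wb
        else:
            # Disjoint hypotheses contribute to the conflict mass K.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; rule is undefined")
    # Normalise by 1 - K so the fused masses sum to 1 again.
    return {h: w / (1.0 - conflict) for h, w in combined.items()}


# Illustrative (made-up) evidence from two features for one tag path.
feature_a = {CONTENT: 0.6, NOISE: 0.1, EITHER: 0.3}
feature_b = {CONTENT: 0.5, NOISE: 0.2, EITHER: 0.3}

fused = combine(feature_a, feature_b)
```

Because the rule is associative and commutative, more than two features can be fused by applying `combine` repeatedly, in any order.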