Online Web News Extraction via Tag Path Feature Fusion

Fund Project:

National Natural Science Foundation of China (61273297, 61229301, 61273292); Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT) of the Ministry of Education (IRT13059); National Program on Key Basic Research Project of China (973 Program) (2013CB329604); National High-Tech R&D Program of China (863 Program) (2012AA011005)

    Abstract:

    Accurate extraction of content from Web news pages is a key technology for improving the quality of Web news analysis and applications. Because the Web lacks publication standards, uses widely differing publishing formats, and is itself a highly heterogeneous big data carrier, Web news extraction remains an open research problem. Extensive case studies in this work indicate a potential correlation between the layout of Web content and its tag paths. Inspired by this observation, the paper designs a series of tag path extraction features that distinguish Web content from noise from different perspectives. Based on a similarity analysis of these features, it proposes a feature fusion strategy with group feature selection and presents CEPF, a Web news extraction method based on feature fusion. CEPF is a fast, general-purpose, training-free, online Web news extraction algorithm that can extract news pages across multiple sources, styles, and languages. Experimental results on public data sets such as CleanEval show that CEPF achieves better performance than the state-of-the-art CETR method.
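
    To make the notion of a tag path concrete, the following is a minimal Python sketch. It is not the CEPF algorithm and does not reproduce the paper's features or fusion strategy, which the abstract does not specify. It assumes that a tag path is the sequence of open tags from the document root down to a text node, and it computes one simple illustrative feature: the average number of text characters per text node for each tag path. The names `TagPathCollector` and `text_to_path_ratio` are hypothetical and introduced only for this example.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Text inside these elements is not visible page content.
SKIP_TAGS = {"script", "style"}


class TagPathCollector(HTMLParser):
    """Groups visible text by tag path (hypothetical helper, not from the paper)."""

    def __init__(self):
        super().__init__()
        self.stack = []                  # currently open tags, root first
        self.chars = defaultdict(int)    # tag path -> total text characters
        self.nodes = defaultdict(int)    # tag path -> number of text nodes

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or any(t in SKIP_TAGS for t in self.stack):
            return
        path = "/".join(self.stack)
        self.chars[path] += len(text)
        self.nodes[path] += 1


def text_to_path_ratio(html):
    """Return {tag path: average characters per text node} (illustrative feature)."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return {p: collector.chars[p] / collector.nodes[p] for p in collector.chars}
```

    Under these assumptions, a navigation menu made of short links tends to produce a tag path such as `html/body/div/ul/li/a` with a low ratio, while article paragraphs under `html/body/div/p` tend to produce a higher one. A single feature of this kind is only a rough signal; the abstract describes combining several tag path features rather than relying on one.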

Get Citation

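Wu Gongqing, Hu Jun, Li Li, Xu Zhehao, Liu Pengcheng, Hu Xuegang, Wu Xindong. Online Web news content extraction based on tag path feature fusion. Journal of Software (软件学报), 2016, 27(3): 714-735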

History
  • Received: January 31, 2015
  • Revised: May 08, 2015
  • Online: March 07, 2016
Copyright: Institute of Software, Chinese Academy of Sciences Beijing ICP No. 05046678-4