Web Evolution and Incremental Crawling

    Abstract:

    With the massive and ever-increasing number of pages on the Web, incremental crawling has become a promising method for acquiring online information. Its main advantage is resource savings, which come from not downloading pages that have not changed. To predict changes accurately, the evolution of the Web is studied to find out how pages change. Incremental crawlers typically combine the change frequency, change extent, and document quality of each page to determine its relative priority as well as its download frequency. This paper summarizes recent research on Web evolution and incremental crawling. First, page change is modeled as a Poisson process, methods are given for estimating its parameters, especially the change frequency, and experimental results are shown. Second, building on how pages change, three public large-scale incremental crawling systems are introduced, with emphasis on their scheduling policies and their strategies for improving page quality. Third, theoretical analysis and exploration are carried out to find the optimal scheduling policy: three approaches from different points of view are applied to this goal, and a heuristic approximate solution is provided to make it feasible in practice. Finally, research trends in this area are predicted, and three main issues are listed.
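
The Poisson change model mentioned in the abstract can be illustrated with a minimal sketch. It assumes a page is revisited at a fixed interval and that each revisit only reveals whether the page changed since the previous visit. Under a Poisson process with rate λ, the probability of seeing no change over an interval Δ is exp(-λΔ), so λ can be estimated from the fraction of unchanged revisits. The function name, its parameters, and the 0.5 smoothing term are assumptions made for this illustration and are not taken from the paper.

```python
import math


def estimate_change_rate(changed_flags, interval_days):
    """Estimate a page's Poisson change rate (changes per day).

    changed_flags: one boolean per revisit, True if the page differed
                   from the previously stored copy (hypothetical input).
    interval_days: fixed time between consecutive revisits, in days.
    """
    n = len(changed_flags)
    if n == 0:
        raise ValueError("need at least one revisit")
    if interval_days <= 0:
        raise ValueError("interval_days must be positive")

    unchanged = sum(1 for changed in changed_flags if not changed)

    # Under the Poisson model, P(no change in one interval) = exp(-rate * interval).
    # The +0.5 smoothing (an assumption of this sketch) avoids log(0)
    # when every revisit detected a change.
    p_unchanged = (unchanged + 0.5) / (n + 0.5)
    return -math.log(p_unchanged) / interval_days


if __name__ == "__main__":
    # A page checked daily for 10 days that changed on 3 of those visits.
    flags = [True, False, False, True, False, False, False, True, False, False]
    print(estimate_change_rate(flags, interval_days=1.0))
```

For the page in the usage example, the estimate is ln(10.5 / 7.5), roughly 0.34 changes per day. Because several changes within one interval are observed as a single change, this estimator is meant to correct the undercount that simply dividing detected changes by elapsed time would give.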

Get Citation

Meng Tao, Wang Jimin, Yan Hongfei. Web Evolution and Incremental Crawling. Journal of Software, 2006, 17(5): 1051-1067

History
  • Received: October 11, 2005
  • Revised: January 12, 2006