Xu Xiao, Zhang Weizhe, Zhang Hongli, Fang Binxing. WAN-based distributed Web crawling. Journal of Software (软件学报), 2010, 21(5): 1067-1082 |
WAN-Based Distributed Web Crawling |
Received: 2008-09-27 Revised: 2009-09-03 |
DOI: |
Keywords: search engine; WAN-based distributed crawling; Web partition; agent collaboration; agent deployment |
Funding: Supported by the National Natural Science Foundation of China under Grant No.60703014; the National Basic Research Program of China (973 Program) under Grant No.G2005CB321806; the National High-Tech Research and Development Plan of China (863 Program) under Grant No.2009AA01Z437; the Specialized Research Fund for the Doctoral Program of Higher Education of China under Grant No.20070213044; the China Postdoctoral Science Foundation under Grant No.20070410263; the Heilongjiang Postdoctoral Foundation of China under Grant No.LBH-Z07108; the Development Program for Outstanding Young Teachers in Harbin Institute of Technology of China under Grant No.HITQNJS.2007.034 |
|
Abstract: |
This paper analyzes the advantages of WAN-based distributed Web crawlers over LAN-based crawlers and
identifies three core issues for such systems: Web partition, agent collaboration, and agent deployment.
Centering on these three issues, it presents a comprehensive survey of the implementation schemes and
strategies adopted by the academic and business communities, and discusses in depth the experiences,
problems, and challenges encountered by WAN-based distributed Web crawlers. It also describes the
evaluation models and indicators currently used for these crawlers. Finally, it summarizes the conclusions
and offers suggestions for future research. |