Dependency-Driven Task Scheduling Scheme of Big Data Processing

Fund Project:

National High Technology Research and Development Program of China (863) (2013AA01A209); National Natural Science Foundation of China (61172048, 61303250)

    Abstract:

    Big data processing currently pays little attention to dependencies among the data that tasks operate on, so task execution can trigger large amounts of data transfer and reduce processing efficiency. To reduce data transfer and improve task execution performance, this paper proposes D3S2 (data-dependency-driven scheduling scheme), a task scheduling scheme for big data processing that considers both data dependency and data locality. D3S2 consists of two parts: (1) a dependency-aware placement mechanism (DAPM), which mines data dependencies from log information and then clusters strongly related data and places it in the same rack, reducing cross-rack data migration; and (2) a transfer-aware scheduling mechanism (TASM), which, once data placement is complete, schedules tasks together under a data locality constraint so as to minimize the data transfer cost during task execution. DAPM and TASM provide the basis for each other's decisions and iteratively adjust the scheduling scheme, with the goal of minimizing task execution cost, until an optimal scheme is reached. Experiments on the Hadoop platform show that, compared with native Hadoop, D3S2 reduces the amount of data transferred during job execution without increasing job completion time.
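The abstract does not spell out the algorithms behind DAPM and TASM, so the following is only a toy sketch of the general idea, not the authors' method. It assumes that the log is a list of jobs, each listing the data blocks it read, that "dependency" can be approximated by how often two blocks are read by the same job, and that transfer cost is the number of blocks a job must fetch from another rack. The function names `co_access_counts`, `place_blocks`, and `schedule`, the rack labels, and the `capacity` parameter are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_access_counts(job_log):
    """Count how often each pair of blocks is read by the same job (assumed dependency measure)."""
    counts = Counter()
    for blocks in job_log:
        for a, b in combinations(sorted(set(blocks)), 2):
            counts[(a, b)] += 1
    return counts


def place_blocks(job_log, racks, capacity):
    """Greedy placement: visit the strongest pairs first and co-locate them when there is room."""
    counts = co_access_counts(job_log)
    placement = {}
    load = {rack: 0 for rack in racks}

    def put(block, preferred=None):
        if block in placement:
            return
        # Use the partner's rack if it still has room; otherwise fall back
        # to the least-loaded rack (capacity is only a soft limit here).
        if preferred is not None and load[preferred] < capacity:
            rack = preferred
        else:
            rack = min(racks, key=lambda r: load[r])
        placement[block] = rack
        load[rack] += 1

    for (a, b), _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        put(a, placement.get(b))
        put(b, placement.get(a))
    # Place blocks that never co-occur with any other block.
    for blocks in job_log:
        for block in blocks:
            put(block)
    return placement


def schedule(job_blocks, placement):
    """Run the job on the rack holding most of its blocks; return (rack, cross-rack block count)."""
    blocks = set(job_blocks)
    local = Counter(placement[b] for b in blocks)
    rack, hits = max(sorted(local.items()), key=lambda kv: kv[1])
    return rack, len(blocks) - hits


if __name__ == "__main__":
    # Hypothetical access log: each entry is the set of blocks one job read.
    job_log = [["a", "b"], ["a", "b"], ["c", "d"], ["c", "d"], ["a", "c"]]
    placement = place_blocks(job_log, racks=["r1", "r2"], capacity=2)
    total = sum(schedule(job, placement)[1] for job in job_log)
    print(placement, "cross-rack blocks:", total)
```

In this toy log, blocks that are frequently read together end up in the same rack, so most jobs find all of their input locally. The sketch runs placement once and then schedules; the feedback loop between placement and scheduling that the abstract describes is not modeled here.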

Cite this article

王玢, 吴雅婧, 阳小龙, 孙奇福. Dependency-Driven Task Scheduling Scheme of Big Data Processing. Journal of Software, 2017, 28(12): 3385-3398

History
  • Received: 2016-07-04
  • Revised: 2016-12-07
  • Published online: 2017-03-27