Journal of Software, 2016, 27(7): 1789-1804

Task-Based Parallel Programming Model Supporting Fault Tolerance
WANG Yi-Zhuo, CHEN Xu, JI Wei-Xing, SU Yan, WANG Xiao-Jun, SHI Feng
(School of Computer Science, Beijing Institute of Technology, Beijing 100081, China)
Received:December 31, 2014    Revised:March 02, 2015
Abstract: Task-based parallel programming models have become the mainstream approach to parallel programming; they improve the performance of parallel computer systems by exploiting task parallelism. This paper presents a task-based parallel programming model that supports hardware fault tolerance. The model incorporates fault tolerance mechanisms into task-based parallel programming, aiming to improve system reliability while preserving performance. It uses the task as the basic unit of scheduling, execution, fault detection, and recovery, and provides fault tolerance at the application level. A buffer-commit computation model supports the detection of and recovery from transient faults; application-level diskless checkpointing supports recovery from permanent faults of the node-failure type; and a fault-tolerant work-stealing scheduling scheme provides dynamic load balancing. Experimental results show that the proposed model provides hardware fault tolerance with low performance overhead.
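
The abstract names a buffer-commit computation model but does not describe its mechanics. The sketch below is one plausible reading, not the authors' implementation: it assumes a task writes its results into a private buffer, that a transient fault is detected by running the task twice and comparing the buffers, and that results reach shared state only when the two runs agree. The names `run_task_buffer_commit` and `make_flaky_square` are hypothetical and exist only for this illustration.

```python
import copy
from typing import Any, Callable, Dict

Task = Callable[[Dict[str, Any]], Dict[str, Any]]


def run_task_buffer_commit(task: Task, state: Dict[str, Any],
                           max_attempts: int = 3) -> int:
    """Run `task` with buffered output and commit only verified results.

    Assumption (not from the paper): detection is done by duplicate
    execution and comparison. Returns the number of attempts used.
    """
    for attempt in range(1, max_attempts + 1):
        # Each execution sees a private snapshot and returns a buffer of
        # updates; shared state is never written during execution.
        buf_a = task(copy.deepcopy(state))
        buf_b = task(copy.deepcopy(state))
        if buf_a == buf_b:
            state.update(buf_a)  # commit: results become visible
            return attempt
        # Mismatch: treat it as a transient fault, discard both buffers,
        # and re-execute the task (recovery at task granularity).
    raise RuntimeError("no matching pair of results; fault may be permanent")


def make_flaky_square(fault_on_call: int) -> Task:
    """Build a task computing x*x that corrupts one chosen call (simulated bit flip)."""
    calls = {"n": 0}

    def task(snapshot: Dict[str, Any]) -> Dict[str, Any]:
        calls["n"] += 1
        result = snapshot["x"] ** 2
        if calls["n"] == fault_on_call:
            result ^= 1  # flip the lowest bit to mimic a transient error
        return {"y": result}

    return task


if __name__ == "__main__":
    shared = {"x": 7}
    used = run_task_buffer_commit(make_flaky_square(fault_on_call=1), shared)
    print(used, shared)
```

In this sketch the corrupted first execution never reaches `shared`, because the commit happens only after the comparison succeeds. Duplicate execution roughly doubles the work per task, so a real system would need a cheaper detection scheme or selective replication to stay within the low overhead the abstract reports.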
Foundation item: National Natural Science Foundation of China (61300011)
Reference text:

WANG Yi-Zhuo, CHEN Xu, JI Wei-Xing, SU Yan, WANG Xiao-Jun, SHI Feng. Task-Based Parallel Programming Model Supporting Fault Tolerance. Journal of Software, 2016, 27(7): 1789-1804