Journal of Software, 2012, 23(2): 411-427

Fault Tolerance Scheme Using Parallel Recomputing for OpenMP Programs
FU Hong-Yi, DING Yan, SONG Wei, YANG Xue-Jun
(Key Laboratory of Science and Technology for National Defense of Parallel and Distributed Processing, National University of Defense Technology, Changsha 410073, China; Institution of Software, College of Computer, National University of Defense Technology, Changsha 410073, China)
Received: January 05, 2010    Revised: March 30, 2010
Abstract: Fault recovery based on parallel recomputing assigns the recovery computation to the nodes that have not failed and executes it in parallel, which shortens recomputation time and lowers fault recovery overhead. Building on this technique, this paper proposes PR-OMP, a fault tolerance approach for OpenMP programs that redistributes the workload of the failed thread to all the surviving threads. The paper discusses the key issues of PR-OMP, including program division (segmented recomputing), computational state saving, workload redistribution, and fault detection, together with implementation details. It also extends traditional compiler data flow analysis to OpenMP programs and uses the extended analysis to reduce the amount of data saved as computational state. A compilation tool, GiFT-OMP, was designed and implemented to support PR-OMP. Experimental evaluation demonstrates the effectiveness of PR-OMP and its supporting tool, shows that the approach incurs only a minor fault recovery overhead, and analyzes its performance and scalability.
Key words: fault tolerance; OpenMP; parallel recomputing; data flow analysis
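The workload redistribution step named in the abstract can be illustrated with a minimal sketch. It is not the PR-OMP implementation: it only shows the index arithmetic for splitting a failed worker's contiguous iteration range among the survivors, assuming a static block partition of the loop. The function names `static_chunks` and `redistribute` are hypothetical, and state saving and fault detection are left out entirely.

```python
def static_chunks(n_iters, n_workers):
    """Split range(n_iters) into contiguous (start, stop) chunks, one per worker.

    Assumption: a simple block partition, similar in spirit to a static
    loop schedule; the first `extra` workers get one additional iteration.
    """
    base, extra = divmod(n_iters, n_workers)
    chunks, start = [], 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)
        chunks.append((start, start + size))
        start += size
    return chunks


def redistribute(chunks, failed):
    """Split the failed worker's chunk among the surviving workers.

    Returns {survivor_id: (start, stop)}; the sub-ranges together cover
    exactly the iterations that the failed worker was responsible for.
    """
    survivors = [w for w in range(len(chunks)) if w != failed]
    if not survivors:
        raise ValueError("no surviving worker to recompute the lost work")
    lo, hi = chunks[failed]
    parts = static_chunks(hi - lo, len(survivors))
    return {w: (lo + a, lo + b) for w, (a, b) in zip(survivors, parts)}


# Hypothetical scenario: 100 iterations on 4 workers, worker 1 fails.
chunks = static_chunks(100, 4)
plan = redistribute(chunks, failed=1)
print(plan)
```

With three survivors, each one recomputes roughly a third of the lost range, so the recomputation time shrinks accordingly compared with a single worker redoing the whole chunk.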
Foundation items: National Natural Science Foundation of China (60921062, 61003087); National High Technology Research and Development Program of China (863 Program) (2009AA01Z102)
Reference text:

FU Hong-Yi, DING Yan, SONG Wei, YANG Xue-Jun. Fault Tolerance Scheme Using Parallel Recomputing for OpenMP Programs. Journal of Software, 2012, 23(2): 411-427