Fault Tolerance Scheme Using Parallel Recomputing for OpenMP Programs

Abstract:

This paper proposes PR-OMP, a fault tolerance approach for OpenMP programs built on a novel fault recovery scheme called parallel recomputing. By redistributing the workload of the failed thread across all surviving threads, PR-OMP significantly reduces the overhead of fault recovery. The paper discusses the key issues of PR-OMP, namely program division, computational state saving, workload redistribution, and fault detection, together with implementation details. It also presents an extended data flow analysis for OpenMP, which is used to reduce the amount of data saved as computational state. Experimental evaluation shows that the approach incurs only a minor overhead in fault recovery.
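The redistribution idea can be illustrated with a minimal sketch. It is not the PR-OMP implementation, which the abstract does not detail. It assumes a loop of n iterations that was block-partitioned among the threads, and it splits the failed thread's block evenly among the survivors. The names range_t, block_slice, and recompute_slice are hypothetical and introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open iteration range [begin, end). */
typedef struct {
    size_t begin;
    size_t end;
} range_t;

/* Block-partition n items among `parts` workers and return the slice of
   worker `idx`. The first (n % parts) workers receive one extra item. */
static range_t block_slice(size_t n, size_t parts, size_t idx)
{
    assert(parts > 0 && idx < parts);
    size_t base = n / parts;
    size_t extra = n % parts;
    range_t r;
    r.begin = idx * base + (idx < extra ? idx : extra);
    r.end = r.begin + base + (idx < extra ? 1 : 0);
    return r;
}

/* Hypothetical helper: the part of the failed thread's block that the
   surviving thread `tid` should recompute. Assumes the original loop was
   block-partitioned with block_slice(). */
static range_t recompute_slice(size_t n, size_t nthreads,
                               size_t failed, size_t tid)
{
    assert(nthreads > 1 && failed < nthreads);
    assert(tid < nthreads && tid != failed);
    range_t lost = block_slice(n, nthreads, failed);
    /* Rank of this thread among the survivors (the failed id is skipped). */
    size_t rank = tid < failed ? tid : tid - 1;
    range_t sub = block_slice(lost.end - lost.begin, nthreads - 1, rank);
    sub.begin += lost.begin;
    sub.end += lost.begin;
    return sub;
}
```

In an OpenMP program, each surviving thread would call recompute_slice with its own thread id and re-execute only that sub-range, so the lost work is recovered in parallel rather than by a single thread.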

Citation:

Fu Hongyi, Ding Yan, Song Wei, Yang Xuejun. A fault tolerance scheme using parallel recomputing for OpenMP programs. Journal of Software, 2012, 23(2): 411-427.

History
  • Received: January 05, 2010
  • Revised: March 30, 2010
  • Online: February 07, 2012