Journal of Software, 2012, 23(4): 1022-1035

Scalable Triple Modular Redundancy Fault Tolerance Mechanism for MPI-Oriented Large Scale Parallel Computing
WANG Zhi-Yuan, YANG Xue-Jun, ZHOU Yun
(College of Computer, National University of Defense Technology, Changsha, Hunan 410073, China)
Received: October 08, 2010    Revised: January 20, 2011
Abstract: As systems scale up, the performance of parallel computing keeps improving while its reliability keeps declining, so a fault tolerance mechanism is needed to tolerate or recover from hardware failures and data errors. The two mechanisms in common use today, Checkpoint/Restart and N-modular redundancy, both introduce additional overhead, which limits the scalability of parallel computing to some extent. With the demand for high-performance computing still growing, designing scalable fault tolerance mechanisms is therefore urgent and important. Taking triple modular redundancy (TMR) as a representative case, this paper describes how traditional TMR is implemented for large-scale MPI parallel computing, analyzes the practical problems the mechanism faces, and shows that traditional TMR limits the scalability of parallel computing. To address these problems, the paper designs scalable triple modular redundancy (STMR) and verifies its validity and scalability. STMR handles not only the fail-stop failures targeted by Checkpoint/Restart but also most data errors that the hardware cannot detect directly. Finally, a simulation using the system parameters of BlueGene/L predicts how the scalability of parallel computing changes as the system grows under TMR and under STMR; the results further confirm that STMR is a scalable fault tolerance mechanism.
Foundation items: National Natural Science Foundation of China (61003081, 61003087, 60921062)
Reference text:

WANG Zhi-Yuan, YANG Xue-Jun, ZHOU Yun. Scalable Triple Modular Redundancy Fault Tolerance Mechanism for MPI-Oriented Large Scale Parallel Computing. Journal of Software, 2012, 23(4): 1022-1035