High-performance computing (HPC) applications are continuously used to support Lawrence Livermore National Laboratory's programmatic and scientific objectives. Fail-stop failures may become more prevalent in future HPC systems as the number of components continues to increase and the overall reliability of those components decreases. Most applications use the message-passing interface (MPI) for data communication; however, the current MPI standard does not support fault-tolerance. Large-scale HPC programs traditionally use checkpoint/restart to survive failures, yet more efficient failure-tolerance methods are possible.
We studied the advantages and disadvantages of different failure-recovery programming models that MPI applications could use in the future to tolerate frequent failures, thus increasing overall reliability. The models that we considered were (a) error code checking, whereby recovery takes place if an error is returned by an MPI operation; (b) try blocks, where groups of MPI operations are protected; and (c) global restart, where the state of the MPI is globally cleaned up and reinitialized upon a failure. When we weighed the programming complexity of using these models in large codes, the feasibility of implementing them, and their performance, we found that global-restart is the best option for large-scale bulk synchronous codes because it resembles the fault-tolerance semantics of checkpoint/restart, the most common fault-tolerance approach in HPC.
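The control-flow difference between the first two models can be sketched in a toy example (plain Python, no real MPI; all names here are invented for illustration): per-call error code checking scatters recovery logic across every operation, while a try block concentrates it in a single handler that protects a group of operations.

```python
# Toy illustration of two failure-recovery styles (not real MPI code).
class CommFailure(Exception):
    pass

def send(step, fail_at):
    """Stand-in for an MPI operation; returns an error code as MPI would."""
    return 1 if step == fail_at else 0  # 1 = failure, 0 = success

# (a) Error code checking: every call's return value is tested.
def run_with_error_codes(steps, fail_at):
    recovered = 0
    for step in range(steps):
        if send(step, fail_at) != 0:
            recovered += 1          # per-call recovery action
    return recovered

# (b) Try block: one recovery path protects the whole group of operations.
def run_with_try_block(steps, fail_at):
    try:
        for step in range(steps):
            if send(step, fail_at) != 0:
                raise CommFailure(step)
        return 0
    except CommFailure:
        return 1                    # single recovery path for the block
```

In real MPI code, the error-code style corresponds to installing the MPI_ERRORS_RETURN error handler and testing each call's return value, which is why it scales poorly in programming effort for large codes.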
We developed an efficient global-restart approach, called REINIT, in which the MPI is automatically reinitialized upon failure. In this approach, failed processes are replaced by new processes, which can be either spawned immediately after a failure or pre-allocated by the resource manager. Although this approach eliminates job startup time, the application still has to re-execute its initialization phase. We subsequently developed a new method, Checkpointable MPI, which removes the requirement that processes re-execute the initialization phase and provides a much more efficient method for failure recovery in the MPI.
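The global-restart semantics can be sketched as a toy model (plain Python with invented names, not the actual REINIT API): on a failure, execution returns to a resilient entry point, the run resumes from the last checkpoint rather than from the beginning, and a replaced process does not fail the same way twice.

```python
# Toy sketch of global-restart semantics (hypothetical names, no real MPI).
class NodeFailure(Exception):
    pass

def compute_step(step, fail_once):
    """One iteration of work; fails the first time it hits a faulty step."""
    if step in fail_once:
        fail_once.remove(step)      # fail only once, like a replaced process
        raise NodeFailure(step)
    return step

def resilient_main(total_steps, fail_once):
    checkpoint = 0                  # last completed step
    restarts = 0
    while True:
        try:
            for step in range(checkpoint, total_steps):
                compute_step(step, fail_once)
                checkpoint = step + 1   # record progress ("checkpoint")
            return checkpoint, restarts
        except NodeFailure:
            restarts += 1           # global cleanup + reinit, resume from checkpoint
```

The key property, and the reason global restart matches checkpoint/restart semantics so closely, is that recovery logic lives in exactly one place (the outer loop) instead of being threaded through every communication call.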
HPC applications support the mission needs of the US Departments of Energy, Defense, and Homeland Security, as well as other federal and state agencies. Deploying and operating exascale HPC systems requires addressing several challenges, key among them being the need for resilience to frequent faults. Because of the number of expected components (e.g., processors), amount of memory, scaling trends in device sizes and voltage, and a complex software stack, the overall fault rate of exascale systems is expected to be higher than that of current petascale systems (Dongarra et al. 2011). While a portion of faults and errors originate from hardware components in the system, others arise from software defects. However, regardless of their origin, many faults and errors become visible to applications and runtime systems as process or node failures. Therefore, applications and runtime libraries must employ fault-tolerance mechanisms to deal with process and node failures effectively.
A natural place in which process and node failures can be detected and mitigated is the communication layer; when processes or nodes fail, surviving processes can detect the failure when they attempt to communicate with the failed ones. The MPI is the most popular programming interface (MPI Forum 2015) for providing process communication in large-scale HPC systems, yet despite its popularity, the MPI does not provide fault-tolerance mechanisms that programmers can leverage to cope with process and node failures. Even worse, the MPI standard specifies that, if a failure occurs, the state of the MPI is undefined, and thus most MPI implementations have no choice but to abort the application when a failure is encountered. The predominant approach to tolerating failures in MPI applications has been checkpoint/restart (C/R), in which the application periodically saves checkpoint files containing its state and, after a failure, restarts from the most recent checkpoint. This approach assumes that the time to save a checkpoint, which comprises the state of all executing processes, is less than the mean time between failures (MTBF) of the system. Since the MTBF is expected to shrink in exascale systems, it is unclear whether this approach will be sufficient to cope with failures at large scale.
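The checkpoint-cost/MTBF trade-off can be made concrete with Young's first-order approximation for the optimal checkpoint interval, t_opt ≈ sqrt(2 δ M), where δ is the time to write one checkpoint and M is the MTBF. This is a standard rule of thumb, not a result of this project, but it shows why a shrinking MTBF strains C/R:

```python
import math

def optimal_interval(checkpoint_cost, mtbf):
    """Young's approximation: optimal seconds of work between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

def checkpoint_overhead(checkpoint_cost, mtbf):
    """Rough fraction of time spent writing checkpoints at the optimal interval."""
    return checkpoint_cost / optimal_interval(checkpoint_cost, mtbf)
```

For example, a 5-minute checkpoint (300 s) on a system with a 24-hour MTBF gives an optimal interval of about two hours and a checkpointing overhead of a few percent; if the MTBF drops to one hour, the overhead climbs past 20 percent, leaving much less time for useful computation.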
The first goal of this project was to explore new fault-tolerant (FT) programming models or abstractions, which can complement or improve the C/R approach. The purpose of MPI FT programming models is to provide a user-level mechanism to mitigate process and node failures, in particular to repair the state of the MPI after a failure. After identifying the best FT programming models in the context of the Laboratory's applications, the second goal of the project was to implement those models in open-source MPI libraries and to evaluate their performance and scalability on representative Laboratory HPC workloads. A third goal of the project was to work with the MPI Forum to standardize the FT models that resulted from this project.
This research has broad applicability across NNSA and Laboratory missions. The FT methods that the project developed will help the Laboratory's HPC applications tolerate node and process failures more efficiently, making complex simulations, for instance, more reliable. Our FT models are also currently being discussed in the MPI Forum (the MPI standardization body).
In summary, we found that to select the right programming model for an MPI application, programmers must consider several aspects, including the programming complexity of the model, the class of application and algorithm (what works for master-worker applications may not work for bulk synchronous applications), and the availability and practicality of current implementations of the models. We found that global-restart is the best option for large-scale bulk synchronous codes, as it resembles the fault-tolerance semantics of checkpoint/restart, the most common fault-tolerance approach in HPC. We developed REINIT, an efficient approach for MPI fault tolerance in which the MPI is automatically reinitialized when a failure is detected. Later, we designed a new method, Checkpointable MPI, which removes REINIT's requirement that live processes re-execute their initialization phase, providing a transparent and much more efficient method to recover from failures in the MPI.
The REINIT programming model that this project developed is currently being discussed in the MPI Forum as a solution for providing FT. We will work with the MPI Forum to standardize this approach. The REINIT model became part of the DOE Exascale Computing Project (ECP), the DOE project that accelerates the delivery of a capable exascale computing ecosystem. REINIT is a deliverable of the Open MPI (OMPI-X) project within ECP.
Dongarra, J., et al. 2011. "The International Exascale Software Project Roadmap." The International Journal of High Performance Computing Applications 25(1): 3–60. doi: 10.1177/1094342010391989.
MPI Forum. 2015. "MPI: A Message-Passing Interface Standard. Version 3.1." Knoxville, TN, June 2015.
Chakraborty, S., et al. 2017. "EReinit: Scalable and Efficient Fault-Tolerance for Bulk-Synchronous MPI Applications." Workshop on Exascale MPI 2017 at Supercomputing 2017, Denver, CO, November 2017. LLNL-CONF-706037.
Emani, M., et al. 2017. "Checkpointable MPI: A Transparent, Fault-Tolerance Approach for MPI." Short Paper at Workshop on Exascale MPI 2017 at Supercomputing 2017, Denver, CO, November 2017. LLNL-CONF-739586.
Fang, A., et al. 2015. "Fault Tolerance Assistant (FTA): An Exception Handling Approach for MPI Programs." Extended Abstract at 3rd Workshop on Exascale MPI at Supercomputing 2015, Austin, Texas, November 2015. LLNL-ABS-676900.
———. 2016. "Fault Tolerance Assistant (FTA): An Exception Handling Programming Model for MPI Applications." Technical Report. Lawrence Livermore National Laboratory, Livermore, CA, May 2016. doi: 10.2172/1258538. LLNL-TR-692704.
Laguna, I. 2016. "REINIT: A Simple and Scalable Fault-Tolerance Model for MPI Applications." SIAM Conference on Parallel Processing for Scientific Computing (PP14), Paris, France, April 2016. LLNL-PRES-688119.
Laguna, I., et al. 2016. "Evaluating and Extending User-Level Fault Tolerance in MPI Applications." The International Journal of High Performance Computing Applications 30(3): 305–319. LLNL-JRNL-663434.
Lawrence Livermore National Laboratory • 7000 East Avenue • Livermore, CA 94550
Operated by Lawrence Livermore National Security, LLC, for the Department of Energy's National Nuclear Security Administration.