Failure Recovery Abstractions for Large-Scale Parallel Applications

Ignacio Laguna Peralta (15-ERD-039)

Project Description

The DOE has identified resilience as one of the major challenges to achieving exascale computing—specifically, the ability of the system and applications to work through frequent faults and failures. An exascale machine (one capable of a quintillion floating-point operations per second) will comprise many more hardware and software components than today's petascale machines, which will increase the overall probability of failures. Most NNSA defense-program simulations use the message-passing interface programming model. However, the message-passing interface standard does not provide resilience mechanisms: it specifies that if a failure occurs, the state is undefined and applications must abort. To address the exascale resilience problem, we are developing multiple resilient programming abstractions for large-scale high-performance computing applications, with an emphasis on compute-node and process failures (among the most common failure types) in the message-passing interface programming model. We will investigate the performance of the abstractions in several applications and study their costs in terms of programmability.

We expect to deliver resilience programming abstractions that will enable efficient fault tolerance in stockpile stewardship simulations at exascale. The abstractions will comprise a set of programming interfaces for handling node and process failures in the message-passing interface programming model. They will encapsulate several failure recovery models, reducing the code and reasoning required to implement these models at large scale. As a result of this research, Laboratory code teams will be more productive in their large-scale simulations, concentrating more on the science behind the simulation and less on code to cope with frequent failures, especially in exascale simulations.

We have designed and implemented a novel approach to handling process failures, called Reinit, which allows high-performance computing applications to recover faster at scale. The figure shows recovery-time measurements for process failures on the Sierra supercomputer at Lawrence Livermore National Laboratory. Reinit takes less than 4 seconds to recover with 1,000 nodes and 12,000 processes—4 times faster than traditional job restarts.

Mission Relevance

This project directly supports one of Lawrence Livermore's core competencies: high-performance computing, simulation, and data science. Simulations (for example, those that extend the lifetime of nuclear weapons in the stockpile) will require the resilience abstractions provided by this project to make effective use of exascale computing, in support of the stockpile stewardship science strategic focus area and the Laboratory's central mission in national security.

FY16 Accomplishments and Results

In FY16 we (1) studied the complexity of a large number of applications and fault-tolerance programming models and found that high complexity limits the integration of fault-tolerance models into applications; (2) designed, as a result, a global-restart fault-tolerance model called Reinit, which has low programming complexity and is suitable for many classes of message-passing interface applications; and (3) developed a prototype implementation in MVAPICH (a message-passing interface library) and in the SLURM (Simple Linux Utility for Resource Management) workload manager.
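The global-restart pattern behind Reinit can be illustrated in miniature: the application registers a restart point, and when the runtime detects a process failure, surviving work resumes from that point rather than the whole job being killed and resubmitted. The Python sketch below simulates this control flow only; the class and function names are illustrative assumptions, not the actual Reinit or MPI interface.

```python
# Illustrative sketch of a global-restart recovery loop. This is NOT the
# Reinit API; it only models the control flow: roll back to the last
# registered restart point when a (simulated) process failure occurs.

class SimulatedProcessFailure(Exception):
    """Stands in for a node or process failure detected by the runtime."""

def run_with_global_restart(step, total_steps, max_restarts=3):
    """Run step(i) for each timestep; on failure, resume from the last
    completed step (the registered restart point) instead of aborting."""
    restarts = 0
    i = 0
    restart_point = 0
    while i < total_steps:
        try:
            step(i)
            i += 1
            restart_point = i       # advance the restart point
        except SimulatedProcessFailure:
            if restarts >= max_restarts:
                raise               # give up after too many failures
            restarts += 1
            i = restart_point       # roll back and retry from there
    return restarts

# Usage: inject a single failure at step 2 and observe one recovery.
injected = {"done": False}
def step(i):
    if i == 2 and not injected["done"]:
        injected["done"] = True
        raise SimulatedProcessFailure()

print(run_with_global_restart(step, 5))  # prints 1 (one restart occurred)
```

The key property of global restart, and the reason its programming complexity is low, is that the recovery action is uniform: every process returns to the same well-defined point, so the application needs no per-failure reasoning.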

Publications and Presentations

  • Fang, A., et al., Fault Tolerance Assistant (FTA): An exception handling programming model for MPI applications (2016). LLNL-TR-692704.
  • Laguna, I., Reinit: A simple and scalable fault-tolerance model for MPI applications. SIAM Conf. Parallel Processing, Paris, France, Apr. 12–15, 2016. LLNL-PRES-688119.
  • Laguna, I., et al., "Evaluating and extending user-level fault tolerance in MPI applications." Int. J. High Perform. Comput. Appl. 30(3), 305 (2016). LLNL-JRNL-663434.