Reliable Linear Solvers in Unreliable Computing Environments

Christopher Vogl | 21-FS-007

Project Overview

The past few decades have seen incredible advances in scientific computing algorithms and hardware that have enabled a wide array of technologies, from high-fidelity physics simulations to facial recognition with machine learning. A core assumption in almost all of those algorithms is that the underlying hardware is reliable, namely that data is not corrupted or lost as it is transferred between memory and devices. The last few years have seen strong interest in moving scientific computation beyond the walls of high-performance computing (HPC) facilities to the "edge" devices that collect the data to be processed. Consider, for example, the operational and cybersecurity benefits of computing on smart power grid sensors in real time instead of first aggregating the data in a central location. The downside is that devices in such edge computing environments are expected to experience hardware failures more frequently than their HPC counterparts (e.g., a 5G-connected device temporarily losing cellular signal). Modern scientific computing algorithms, built on the assumption of reliable hardware, are unlikely to succeed in these unreliable computing environments.

To address the need for scientific computing algorithms that can succeed in unreliable computing environments, this project sought to establish whether a set of recent modifications to specialized algorithms, which made those algorithms resilient to data corruption and delay, can be applied more generally to modern algorithms. The ability to solve a linear system is a critical function of many modern scientific computing algorithms. As such, a variant of a particular linear solver was developed with a novel criterion for rejecting data that is determined to be corrupt. This variant produced an accurate solution despite periodic data corruption at a frequency that prevented the original algorithm from succeeding. A second variant of the linear solver replaces the original approach of using only the most recent solution approximation with an approach that implicitly includes previous solution approximations. This variant produced an accurate solution faster than the original method when periodic data delays, which resulted in data arriving out of order, were introduced. The results of this work provide a proof of concept that the same modifications can introduce the resilience needed by modern scientific computing algorithms to succeed in unreliable computing environments. The results have also secured funding to generalize this work to bring resilience to some of the most powerful algorithms in use today.
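The report does not specify the solver or the exact form of either modification, so the following Python sketch is purely illustrative: it uses a simple Jacobi iteration as a stand-in for the actual linear solver, a residual-growth test as an assumed form of the corruption-rejection criterion, and under-relaxation as an assumed way of implicitly retaining previous solution approximations. All function names, thresholds, and the corruption model are hypothetical.

import numpy as np

def resilient_jacobi(A, b, n_iters=200, omega=0.7, reject_growth=2.0,
                     corrupt_every=25, rng=np.random.default_rng(0)):
    """Illustrative sketch only: a Jacobi-style iteration hardened with
    (a) a residual-based rejection test for corrupted updates and
    (b) under-relaxation, which blends previous iterates into each step.
    Jacobi itself, the rejection test, and all parameters are assumptions,
    not the project's actual algorithm."""
    D = np.diag(A)                      # diagonal of A
    R = A - np.diagflat(D)              # off-diagonal part of A
    x = np.zeros_like(b, dtype=float)
    res_norm = np.linalg.norm(b - A @ x)

    for k in range(n_iters):
        # Candidate Jacobi update computed from the most recent iterate.
        candidate = (b - R @ x) / D

        # Emulate periodic data corruption, e.g., a bit flip on a device.
        if corrupt_every and k > 0 and k % corrupt_every == 0:
            i = rng.integers(len(candidate))
            candidate[i] *= 1e6

        # Assumed rejection criterion: discard the update if it would
        # grow the residual norm by more than a fixed factor.
        cand_res = np.linalg.norm(b - A @ candidate)
        if cand_res > reject_growth * res_norm:
            continue                    # treat the data as corrupt; skip it

        # Under-relaxation: the new iterate is a weighted average of the
        # previous iterate and the accepted update, so earlier iterates are
        # implicitly retained in the current approximation.
        x = (1.0 - omega) * x + omega * candidate
        res_norm = np.linalg.norm(b - A @ x)

    return x, res_norm

# Diagonally dominant test system, for which plain Jacobi converges.
n = 50
A = np.eye(n) * 4.0 + np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
b = np.ones(n)
x, res = resilient_jacobi(A, b)
print(res)  # residual norm after the hardened iteration

In this sketch, a corrupted component inflates the candidate residual by orders of magnitude and is discarded by the rejection test, while the relaxation weight keeps a fraction of every earlier iterate in the current approximation, loosely mirroring the described idea of implicitly including previous solution approximations.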

Mission Impact

This research addresses the lack of a key enabling capability identified by the E and Z programs at Lawrence Livermore National Laboratory: the ability to apply modern scientific computing algorithms in unreliable computing environments. The ability to, for example, operate on data collected by edge devices without first transmitting the data to a central location will improve the resilience of the nation's critical infrastructure to cyber and physical attacks, as well as to natural disasters. The framework developed in this project enables an entire class of resilient algorithms that capitalizes on decades of HPC research by enhancing existing algorithms with proven resiliency-enhancing modifications. This new class of algorithms will enable new tools and capabilities to meet national security challenges.