Floating Point Reproducibility for Accelerator-Based Systems

Ignacio Laguna Peralta | 20-FS-005

Project Overview

Because of major trends and limitations in computer architecture, future supercomputers are expected to rely on accelerators, leading toward a period of highly heterogeneous systems. However, the high degree of parallelism that accelerators expose, together with the different compilers and optimizations used in these systems, makes numerical reproducibility challenging. In scientific codes, including Lawrence Livermore National Laboratory's mission codes, numerical reproducibility builds on IEEE 754 floating-point arithmetic. This presents severe challenges because round-off error makes floating-point operations such as addition and multiplication non-associative, so the result of a computation can depend on the order in which its operations are evaluated.
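The non-associativity mentioned above can be seen in a few lines. The following is a minimal Python sketch, not code from the project; the operands are arbitrary illustrative values, and Python floats are assumed to be IEEE 754 double precision (true on all common platforms).

```python
# Floating-point addition is not associative: regrouping the same three
# operands changes the rounded result in double precision.
a, b, c = 0.1, 0.2, 0.3  # illustrative values, not taken from the project

left = (a + b) + c   # a + b is rounded first, then c is added
right = a + (b + c)  # b + c is rounded first, then a is added

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False
```

On an accelerator, the grouping is often determined by how threads are scheduled, so the same source code can take either path from one run to the next.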

In this project, we used compiler-assisted numerical sampling and random code generation to study two important aspects of numerical reproducibility in heterogeneous systems: (1) reproducibility within a device when the device threads perform operations in different orders; and (2) reproducibility between host and device executions and how compiler optimizations affect it. We found that, depending on the metric used to measure round-off error, there can be numerical differences of an order of magnitude between different reduction operations on accelerators for small data sets. We also found that reproducibility between host and device architectures can be quite challenging.
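To illustrate the first aspect, the sketch below sums the same small data set in two orders: a left-to-right loop, as a single thread might, and a pairwise tree, as a parallel reduction might. This is an illustration under stated assumptions, not the project's sampling method. The helper names `sequential_sum` and `pairwise_sum` are hypothetical, and the data set is deliberately adversarial (large values that cancel) so that the effect is visible with only four elements.

```python
import math

def sequential_sum(values):
    """Sum left to right, as a single thread would."""
    total = 0.0
    for v in values:
        total += v
    return total

def pairwise_sum(values):
    """Sum by recursive halving, mimicking a tree-shaped parallel reduction."""
    n = len(values)
    if n == 0:
        return 0.0
    if n == 1:
        return values[0]
    mid = n // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])

# Hypothetical, adversarial input: 1.0 is below the rounding granularity
# of 1e16 in double precision, so it can be lost depending on the order.
data = [1e16, 1.0, -1e16, 1.0]

reference = math.fsum(data)  # correctly rounded sum, used as ground truth
seq = sequential_sum(data)
tree = pairwise_sum(data)

print(reference)  # 2.0
print(seq)        # 1.0
print(tree)       # 0.0
```

Both orders are valid evaluations of the same mathematical sum, yet neither matches the correctly rounded result, and they disagree with each other. Real data sets are rarely this extreme, but the same mechanism is behind the differences between reduction operations.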

Mission Impact

The Laboratory conducts large-scale scientific simulations on its high-performance computing systems, which are based on accelerators. Laboratory simulation teams can use the results of this investigation to identify code regions (kernels) in a scientific application that are highly sensitive to numerical non-determinism. The project outcomes are expected to make Livermore's scientific simulations more reproducible by letting programmers reduce numerical non-determinism algorithmically, making simulation results more reliable and trustworthy. This work supports Livermore's core competency in high-performance computing, simulation, and data science.

Publications, Presentations, and Patents

Laguna Peralta, I. 2019a. "Varity" (computer software). October 29, 2019. doi:10.11578/dc.20200109.1. LLNL-CODE-798680

——— 2019b. "Tools and Techniques for Floating-Point Analysis." IDEAS Best Practices for HPC Software Developers Webinar Series, October 2019. LLNL-PRES-788144

——— 2020. "Varity: Quantifying Floating-Point Variations in HPC Systems Through Randomized Testing." 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), New Orleans, LA, May 2020. LLNL-CONF-793958