XPlacer: Extensible and Portable Optimizations of Data Placement in Memory

Chunhua Liao | 18-ERD-006

Project Overview

Modern supercomputers with heterogeneous components (e.g., Graphics Processing Units, or GPUs) use various types of memory and increasingly complex designs to meet processors' growing demand for data. Putting data into the proper part of a memory system, called data placement, is essential for optimal program performance. However, data placement decisions are difficult to make because of architectural complexity and because suitable placements are sensitive to changes in inputs, program phases, and architectures.

The goal of this project was to explore combined compiler and runtime techniques to automatically optimize the memory usage of CUDA and RAJA applications running on the latest generations of NVIDIA GPUs. The technical approach included two complementary methods: (1) a white-box method using instrumentation-based memory access pattern analysis and heuristics to guide CUDA unified memory usage; and (2) a black-box, machine learning-based method to find the best choices using hybrid discrete and unified memory APIs. The resulting framework, known as XPlacer, has led to a 3.7-times speedup for LULESH (a proxy application) and a 33.8-times speedup for the Smith-Waterman algorithm. Our machine-learning method achieved 94 percent accuracy in predicting the optimal memory advice choice for selected kernels.
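As a rough illustration of the white-box idea, the sketch below maps per-allocation access counts to a CUDA unified memory advice flag. The `AccessCounts` type, the `suggest_advice` function, and the three rules are hypothetical and chosen only for illustration; they are not the heuristics used by XPlacer. The returned strings are the names of real `cudaMemAdvise` flags, but no CUDA call is made here.

```python
from dataclasses import dataclass


@dataclass
class AccessCounts:
    """Hypothetical counters for one allocation, as instrumentation might record them."""
    cpu_reads: int = 0
    cpu_writes: int = 0
    gpu_reads: int = 0
    gpu_writes: int = 0


def suggest_advice(c: AccessCounts) -> str:
    """Return a cudaMemAdvise flag name using a toy rule set (an assumption, not XPlacer's)."""
    cpu = c.cpu_reads + c.cpu_writes
    gpu = c.gpu_reads + c.gpu_writes
    if cpu == 0 and gpu == 0:
        return "none"  # never touched in the profiled phase
    # Read by both sides and never written: read-only copies can be replicated.
    if c.cpu_writes == 0 and c.gpu_writes == 0 and c.cpu_reads > 0 and c.gpu_reads > 0:
        return "cudaMemAdviseSetReadMostly"
    # Touched by one side only: keep the pages resident on that side.
    if cpu == 0 or gpu == 0:
        return "cudaMemAdviseSetPreferredLocation"
    # Both sides access it and at least one writes: map it for both to limit page faults.
    return "cudaMemAdviseSetAccessedBy"


if __name__ == "__main__":
    print(suggest_advice(AccessCounts(cpu_reads=5, gpu_reads=100)))
```

In a real tool, the chosen flag would be passed to `cudaMemAdvise` for the corresponding managed allocation, and the thresholds would need tuning per architecture and input, which is the sensitivity the project overview describes.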

Although this work was done in the context of CUDA and RAJA applications, the developed program analysis, code transformation, and adaptive runtime can be readily reused by or adapted to other programming environments (e.g., OpenMP, OpenACC). In particular, the machine learning research under this project laid the foundation for future work on incorporating machine learning techniques into modern programming models such as OpenMP.

Mission Impact

The outcome of this research provides essential compiler and runtime techniques for enhancing the computing efficiency of data-intensive dynamic applications in current and future computing systems with complex memory hierarchies. The project has a broad impact on hardware description, programming models, compiler analysis, machine learning, code generation, and runtime adaptation. As such, it directly supports Lawrence Livermore National Laboratory's core competency in high-performance computing, simulation, and data science.

Publications, Presentations, and Patents

Bari, A. S., et al. 2018. "Is Data Placement Optimization Still Relevant on Newer GPUs?" 9th International Workshop on Performance Modeling, Benchmarking, and Simulation of High-Performance Computer Systems (PMBS18), Dallas, TX, November 2018. LLNL-CONF-757796

Mendonca, G., et al. 2020. "AutoParBench: A Unified Test Framework for OpenMP-based Parallelizers." International Conference on Supercomputing, Barcelona, Spain, June/July 2020. LLNL-CONF-795158

Mishra, A., et al. 2019. "Data Reuse Analysis for GPU Offloading with OpenMP." Supercomputing 2019, Denver, CO, November 2019. LLNL-POST-784875

Pirkelbauer, P., et al. 2020. "XPlacer: Automatic Analysis of CPU/GPU Access Patterns." 34th International Parallel & Distributed Processing Symposium, New Orleans, LA, May 2020. LLNL-CONF-795057

Ren, J., et al. 2019. "Opera: Data Access Pattern Similarity Analysis To Optimize OpenMP Task Affinity." 24th International Workshop on High-level Parallel Programming Models and Supportive Environments (HIPS), held in conjunction with the 33rd International Parallel & Distributed Processing Symposium (IPDPS), Rio de Janeiro, Brazil, May 2019. LLNL-CONF-769889

Stoltzfus, L., et al. 2018. "Data Placement Optimization in GPU Memory Hierarchy Using Predictive Modeling." Proceedings of the Workshop on Memory Centric High Performance Computing (MCHPC 18), 45–49. LLNL-CONF-758021

Wang, A., et al. 2019. "Leveraging Smart Data Transfer and Extending Metadirective to Improve Adaptive Computing." Supercomputing 2019, Denver, CO, November 2019. LLNL-POST-782626

Xu, H., et al. 2019. "Machine Learning Guided Optimal Use of GPU Unified Memory." MCHPC 19: Workshop on Memory Centric High Performance Computing, Denver, CO, November 2019. LLNL-CONF-793704

Yan, Y., et al. 2019. "Extending OpenMP Metadirective Semantics for Runtime Adaptation." Fifteenth International Workshop on OpenMP (IWOMP 2019), Auckland, New Zealand, September 2019. LLNL-CONF-774899