Toward the Intelligent Center for High-Performance Computing

Jae-Seung Yeom | 23-ERD-045

Project Overview

Modern supercomputers comprise diverse architectures, and a growing number of systems are themselves heterogeneous. Understanding how scientific applications perform across these architectures is crucial to using the systems efficiently. Furthermore, manually mapping the different applications in a workflow to heterogeneous resources at an HPC center places a heavy burden on users and wastes center resources. To enable intelligent scheduling on multi-resource systems that removes this burden from users, we have developed performance models based on machine learning (ML) techniques that predict the performance of applications across computing architectures. Combined with information on resource availability, these models will help identify an effective mapping of workflow tasks to compute resources.
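As an illustration only, the idea of combining predicted performance with resource availability can be sketched as a simple greedy mapping. This is not the project's scheduler or model: the `Resource` class, the `map_tasks` function, the task and resource names, and the runtime table (which stands in for the output of an ML performance model) are all hypothetical, and the numbers are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    available_at: float  # time (s) at which the resource becomes free


def map_tasks(tasks, resources, predicted_runtime):
    """Greedily assign each task to the resource with the earliest predicted finish.

    predicted_runtime[(task, resource_name)] is a hypothetical stand-in for
    the runtime (s) an ML performance model would predict.
    """
    free_at = {r.name: r.available_at for r in resources}
    mapping = {}
    for task in tasks:
        # Predicted finish time = when the resource frees up + predicted runtime.
        best = min(
            free_at,
            key=lambda name: free_at[name] + predicted_runtime[(task, name)],
        )
        finish = free_at[best] + predicted_runtime[(task, best)]
        mapping[task] = (best, finish)
        free_at[best] = finish  # the resource is busy until this task finishes
    return mapping


# Made-up example: the GPU partition is faster for both tasks but busy until t=100.
resources = [Resource("cpu", 0.0), Resource("gpu", 100.0)]
predicted_runtime = {
    ("sim", "cpu"): 100.0, ("sim", "gpu"): 30.0,
    ("train", "cpu"): 400.0, ("train", "gpu"): 60.0,
}
mapping = map_tasks(["sim", "train"], resources, predicted_runtime)
print(mapping)
```

In this made-up case the simulation task goes to the CPU partition even though the GPU is predicted to run it faster, because the GPU is not yet available; the training task then waits for the GPU. A real scheduler would also need to account for data movement, task dependencies, and prediction uncertainty, which this sketch ignores.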

Mission Impact

Without fundamental changes to how complex workflows are programmed and mapped to LLNL HPC resources, the laboratory's mission-critical workflows require heroic efforts to be successful. This project provides the techniques and foundations for next-generation HPC centers to implement many game-changing system software capabilities that support workflow applications from the laboratory's mission areas. This effort also supports the Core Competency area of High-Performance Computing, Simulation, and Data Science. The work advances the discipline in this area through the development of novel performance modeling approaches for workflow applications. These approaches will serve as the basis for optimization techniques that reconcile increasingly complex workflows with heterogeneous resources at the laboratory, and for software solutions that bridge existing gaps between traditional HPC software and cloud computing resource management.

Publications, Presentations, and Patents

J. Yeom, et al., "Ubique: A New Model for Untangling Inter-task Data Dependence in Complex HPC Workflows" (Presentation, IEEE International Conference on e-Science, Denver, CO, Nov. 2023).

D. Nichols, et al., "Predicting Cross-Platform Relative Performance with Deep Generative Models" (Presentation, ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis, Dallas, TX, Nov. 2022).

I. Lumsden, et al., "Enabling Transparent, High-Throughput Data Movement for Scientific Workflows on HPC Systems" (Presentation, ACM/IEEE International Conference for High Performance Computing, Networking, Storage, and Analysis, Denver, CO, Nov. 2023).

Simulation of resource availability in HPC systems: https://github.com/LLNL/dr_evt.git

A tool to make workflows more adaptable and portable across complex, reconfigurable storage hierarchies: https://github.com/flux-framework/dyad.git