Improving Simulation Workflows: A Data Analytics Approach

Ming Jiang (16-ERD-036)

 

Executive Summary

We are developing intelligent analytics to predict failures and dynamically adjust the highly complex workflows commonly found in hydrodynamics codes used to simulate fluid flows in shocked substances. Semi-automating a currently disruptive, time-consuming tuning process improves the accuracy and efficiency of codes used across several energy- and nuclear weapons-related missions.

Project Description

Simulation workflows are highly complex and often require a manual tuning process that is cumbersome for users. Developing a simulation workflow is as much an art as a science: it requires finding and adjusting the right combination of parameters for the simulation to run to completion. Such workflows come into play in many codes, including arbitrary Lagrangian–Eulerian (ALE) hydrodynamics codes, which are often used to simulate fluid flows in shocked substances. These codes include (1) Lawrence Livermore National Laboratory’s ALE3D, a three-dimensional ALE multiphysics numerical simulation code; (2) KULL, used to model high-energy-density physics; (3) BLAST, a finite-element code for shock hydrodynamics; and (4) HYDRA, used to model radiation hydrodynamics. In general, ALE codes require tuning a number of parameters that control how the computational mesh moves during the simulation. The process can be disruptive and time-consuming; a few hours of simulation can require many days of manual tuning. There is an urgent need to semi-automate this process to save time for the user and improve the efficiency of the codes.

To address this need, we are developing novel predictive analytics for simulations, along with an in situ infrastructure that will run the analytics simultaneously with the simulation to predict failures and dynamically adjust the workflow accordingly. Our goal is to predict simulation failures ahead of time and proactively avoid them to the greatest extent possible. We are investigating supervised-learning algorithms to develop classifiers that predict simulation failures using the simulation state as a set of learning features. We are also investigating supervised-learning algorithms to find correlations between parameter adjustments and changes in the simulation state, as well as flow-analysis techniques to extract high-level flow features for our predictive analytics. Combined with knowledge of which simulation states lead to failures, these techniques will let us generate workflows that avoid such failures in a systematic way.
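As a concrete illustration of the failure-prediction component, the sketch below trains a standard supervised classifier to flag runs that are likely to fail, using per-cycle simulation-state statistics as learning features. The feature choices (e.g., worst zone aspect ratio), the synthetic labels, and the random-forest model are illustrative assumptions, not the project’s actual algorithms or data.

    # Minimal sketch (assumptions throughout): predict whether an ALE run will
    # fail, using per-cycle simulation-state statistics as learning features.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    rng = np.random.default_rng(0)

    # Hypothetical training data: one row per cycle of past runs, with columns
    # such as worst zone aspect ratio, minimum Jacobian, relative volume change.
    n_samples, n_features = 2000, 8
    X = rng.normal(size=(n_samples, n_features))
    # Label: 1 if the run eventually failed (e.g., mesh tangling), 0 otherwise.
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n_samples) > 1.0).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_train, y_train)

    # In situ, the same model would score the current cycle's feature vector and
    # flag cycles whose predicted failure probability exceeds a threshold.
    print(classification_report(y_test, clf.predict(X_test)))

In practice, the labels would come from completed and failed production runs, and the features from the actual simulation state rather than synthetic data.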

This research has the potential to significantly reduce the development time for simulation workflows and minimize the user’s effort. It addresses the urgent need to move toward a semi-automated pipeline for ALE simulations, thus enabling large-scale uncertainty quantification studies, which seek to reduce the uncertainties in computational and real-world applications. Our approach is based on analyzing where and when simulations fail and on developing solutions that help existing workflows avoid those failures. We will leverage the expertise of simulation users by codifying their knowledge and experience into this framework. We expect the predictive-analytics work to yield novel machine-learning algorithms designed specifically for predicting and avoiding simulation failures, and we expect our in situ infrastructure to provide a novel approach to integrating traditional high-performance computing simulations with cutting-edge data analytics.

Mission Relevance

Our effort in predictive analytics and in situ infrastructure supports the Laboratory’s core competency in high-performance computing, simulation, and data science. Our work will directly benefit Livermore's ALE codes and thus support the Laboratory’s stockpile stewardship science strategic focus area, as well as the NNSA goal of managing the nation’s nuclear stockpile.

FY17 Accomplishments and Results

In FY17 we (1) developed a supervised-learning framework for ALE simulations; (2) conducted experiments to evaluate the framework’s design choices and its generalization performance; (3) performed a study of how ALE mesh-relaxation strategies influence simulation output, using a set of general integrated metrics as well as problem-specific metrics developed for the Hohlraum test problem; and (4) developed a characterization of scientific workflow management systems for extreme-scale applications, such as the one we developed for integrating machine learning with ALE simulations, and published our results in the journal Future Generation Computer Systems.
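The report does not define the integrated metrics used in the mesh-relaxation study, so the sketch below shows only one plausible form of the general idea: reduce a per-zone quantity over the whole mesh to a single scalar per cycle, so that different relaxation settings can be compared as time series. The quad-mesh layout, the aspect-ratio measure, and the function names are assumptions.

    # Illustrative integrated metric (assumed form, not the project's definition):
    # reduce a per-zone quality measure over the mesh to one scalar per cycle.
    import numpy as np

    def zone_aspect_ratios(x, y):
        """Edge-length aspect ratio per zone of a 2D structured quad mesh.

        x, y: node coordinates with shape (ny_nodes, nx_nodes).
        Returns an array of shape (ny_nodes - 1, nx_nodes - 1).
        """
        dx = np.hypot(np.diff(x, axis=1), np.diff(y, axis=1))[:-1, :]  # bottom edges
        dy = np.hypot(np.diff(x, axis=0), np.diff(y, axis=0))[:, :-1]  # left edges
        longer, shorter = np.maximum(dx, dy), np.minimum(dx, dy)
        return longer / np.maximum(shorter, 1e-30)

    def integrated_quality(x, y):
        """Single scalar per cycle: the worst aspect ratio over all zones."""
        return float(zone_aspect_ratios(x, y).max())

    # Toy example: a uniform mesh versus one perturbed to mimic mesh motion.
    ny, nx = 33, 65
    yy, xx = np.meshgrid(np.linspace(0, 1, ny), np.linspace(0, 2, nx), indexing="ij")
    print(integrated_quality(xx, yy))                                  # 1.0
    print(integrated_quality(xx + 0.01 * np.sin(4 * np.pi * yy), yy))  # > 1.0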

Figure 1. High-level overview of our first-of-a-kind infrastructure for integrating machine learning into high-performance computing (HPC) simulations. At the top level is the user interface (user), where workflow management interacts with the simulation run and visual analytics interacts with the machine-learning process. At the middle level is data collection, whose key component is the feature aggregator, which aggregates massive simulation data into learning features that serve as training data. At the bottom level is predictive analytics, where machine learning generates statistical models (classifiers) that are then used by the inference algorithm.
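The data-collection and predictive-analytics levels of the figure can be sketched as a simple in situ loop: each cycle, a feature aggregator reduces bulky per-zone state to a compact feature vector, and a previously trained classifier scores it so the workflow layer can react before a failure occurs. The function names, the dictionary-based state hand-off, and the threshold trigger below are assumptions about what such a hookup might look like, not the infrastructure’s actual API.

    # Hypothetical in situ hookup (illustrative names, not the project's API):
    # aggregate per-zone state into learning features each cycle and let a
    # pre-trained classifier flag cycles where intervention is recommended.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def aggregate_features(zone_state):
        """Reduce bulky per-zone arrays to a small per-cycle feature vector."""
        aspect = zone_state["aspect_ratio"]
        dvol = zone_state["relative_volume_change"]
        return np.array([aspect.max(), aspect.mean(), dvol.min(), dvol.std()])

    def in_situ_monitor(cycles, classifier, threshold=0.8):
        """Yield (cycle, probability) whenever a failure looks likely."""
        for cycle, zone_state in cycles:
            features = aggregate_features(zone_state).reshape(1, -1)
            p_fail = classifier.predict_proba(features)[0, 1]
            if p_fail > threshold:
                # In the real infrastructure this decision would be fed back to
                # the workflow manager, e.g. to adjust mesh-relaxation settings.
                yield cycle, p_fail

    # Toy usage: a classifier fit on random 4-feature data and synthetic cycles.
    rng = np.random.default_rng(1)
    clf = LogisticRegression().fit(rng.normal(size=(500, 4)),
                                   rng.integers(0, 2, size=500))
    fake_cycles = ((c, {"aspect_ratio": rng.uniform(1.0, 5.0, 1000),
                        "relative_volume_change": rng.normal(0.0, 0.1, 1000)})
                   for c in range(10))
    for cycle, p in in_situ_monitor(fake_cycles, clf, threshold=0.5):
        print(f"cycle {cycle}: predicted failure probability {p:.2f}")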

Publications and Presentations

Silva, R., et al. 2017. "A Characterization of Workflow Management Systems for Extreme-Scale Applications." Future Generation Computer Systems 75: 228–238. doi: 10.1016/j.future.2017.02.026. LLNL-JRNL-706700.
