With the growing complexity of high-performance computing (HPC) systems, application performance variation has increased enough to disrupt the overall throughput of the systems. Such performance variation is expected to worsen in the future, when job schedulers will have to manage flow resources such as power, network, and input/output, in addition to traditional resources such as nodes and memory. We studied the simultaneous impact of inter-job interference, the service levels of InfiniBand (a computer-networking communications standard), and power capping on different applications in a controlled experimental setup, with the goal of understanding the range of performance variation, as well as potential mitigation strategies.
Performance variation and unpredictability in modern supercomputers are growing concerns. On current HPC systems, user applications already experience about 20 percent run-to-run variation with the exact same input configuration (Bhatele et al. 2013). Such variation is typically attributed to interference from neighboring jobs or to manufacturing differences on power-limited systems (Inadomi et al. 2015). A limited understanding of the causes and range of run-to-run variation can reduce the overall efficiency of the system. As we venture toward exascale, scientific reproducibility will worsen if system software does not adapt to managing several flow resources, such as network bandwidth and power, at the same time. For instance, most modern schedulers do not consider flow resources when making allocation decisions, even though it has been shown that there is tremendous scope for improving data throughput and utilization (Savoie et al. 2016; Subramoni et al. 2010). One reason for this is the lack of information about application performance variation and its sources. To boost throughput, system schedulers need to predict, to some degree, both application performance and the associated range of run-to-run variation.
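As an illustration of how run-to-run variation might be quantified, the following is a minimal Python sketch. The report does not define the metric behind the 20 percent figure, so both measures here (the spread between slowest and fastest run relative to the fastest, and the coefficient of variation) are assumptions, and the runtimes are hypothetical values rather than measurements.

```python
from statistics import mean, stdev


def run_to_run_variation(runtimes):
    """Return (relative_range, coefficient_of_variation) for repeated runs.

    Both metrics are assumed definitions chosen for illustration; the
    report does not state which measure of variation it uses.
    """
    if len(runtimes) < 2:
        raise ValueError("need at least two runs to measure variation")
    fastest = min(runtimes)
    slowest = max(runtimes)
    # Spread of the runs, expressed relative to the fastest run.
    relative_range = (slowest - fastest) / fastest
    # Sample standard deviation normalized by the mean runtime.
    cov = stdev(runtimes) / mean(runtimes)
    return relative_range, cov


# Hypothetical runtimes (seconds) for one application run five times
# with the same input configuration.
runtimes = [100.0, 104.0, 110.0, 120.0, 101.0]
rel_range, cov = run_to_run_variation(runtimes)
print(f"relative range: {rel_range:.1%}, coefficient of variation: {cov:.1%}")
```

The relative range is sensitive to a single slow run, while the coefficient of variation summarizes the whole sample; a scheduler interested in worst-case slowdown would likely care more about the former.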
Our goal was to construct an application-performance dataset subject to changing parameters such as power, network bandwidth, and rank-to-node mapping and placement. Such a dataset can be used to create effective performance-prediction models, which can then be used for future job-scheduling research. Creating this dataset was extremely challenging because job-scheduling environments are dynamic in nature. Therefore, in this work, we ran simpler control jobs instead of complex applications. For such control jobs, resource dependencies can be well understood as the network parameters, power parameters, and external interference are varied.
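A controlled sweep over such parameters can be laid out as a simple grid. The sketch below is only an illustration in Python: the class name, field names, parameter values, and repeat count are all hypothetical and are not taken from the experiments described here.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class RunConfig:
    """One point in the parameter sweep (all fields are hypothetical)."""
    power_cap_watts: int   # per-socket power cap
    service_level: int     # InfiniBand service level assigned to the job
    placement: str         # rank-to-node placement strategy
    interference: bool     # whether a competing job runs alongside


# Hypothetical parameter values, chosen only to show the structure.
POWER_CAPS = [60, 80, 100]
SERVICE_LEVELS = [0, 1]
PLACEMENTS = ["packed", "spread"]
INTERFERENCE = [False, True]


def build_sweep(repeats=3):
    """Return (config, repetition) pairs covering the full grid.

    Each configuration is repeated so that run-to-run variation can be
    measured for identical settings.
    """
    configs = [
        RunConfig(*values)
        for values in product(POWER_CAPS, SERVICE_LEVELS, PLACEMENTS, INTERFERENCE)
    ]
    return [(cfg, rep) for cfg in configs for rep in range(repeats)]


sweep = build_sweep()
print(len(sweep), "runs planned")
```

Repeating every configuration is what lets the resulting dataset separate the effect of a parameter from ordinary run-to-run noise.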
Collecting the observations needed to train performance models led to a first-of-its-kind dataset. Initial analysis with statistical and machine-learning models indicated a high level of accuracy. Using decision trees and random forests, we were able to predict the performance of a particular application with only five percent of training data. Our next step is to explore more detailed prediction models and algorithms for advanced research of HPC system software with such data. Feeding such information into online system software, such as job schedulers and runtime systems, will help significantly improve utilization and data throughput.
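The idea of predicting runtime from a small training fraction can be sketched as follows. This is not the models or data used in this work: it is a single, hand-written regression tree in plain Python (a stand-in for the decision trees and random forests mentioned above), trained on five percent of a synthetic dataset. The runtime formula, feature names, and parameter values are invented for illustration only.

```python
import random


def synthetic_runtime(power_cap, service_level, interference, rng):
    """Made-up runtime model (seconds); NOT measured data."""
    base = 100.0 * 115.0 / power_cap          # lower power cap -> slower run
    if interference:
        # Assumed: a competing job hurts less on a dedicated service level.
        base *= 1.20 if service_level == 0 else 1.05
    return base + rng.gauss(0.0, 1.0)          # run-to-run noise


def sse(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(rows, targets, depth=0, max_depth=8, min_leaf=2):
    """Greedy regression tree: pick the split that minimizes total SSE."""
    leaf_value = sum(targets) / len(targets)
    if depth >= max_depth or len(rows) < 2 * min_leaf:
        return leaf_value
    best = None
    for f in range(len(rows[0])):
        for thr in sorted({r[f] for r in rows})[:-1]:
            left = [t for r, t in zip(rows, targets) if r[f] <= thr]
            right = [t for r, t in zip(rows, targets) if r[f] > thr]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, f, thr)
    if best is None:
        return leaf_value
    _, f, thr = best
    lo = [(r, t) for r, t in zip(rows, targets) if r[f] <= thr]
    hi = [(r, t) for r, t in zip(rows, targets) if r[f] > thr]
    return (f, thr,
            build_tree([r for r, _ in lo], [t for _, t in lo],
                       depth + 1, max_depth, min_leaf),
            build_tree([r for r, _ in hi], [t for _, t in hi],
                       depth + 1, max_depth, min_leaf))


def predict(tree, row):
    """Walk from the root to a leaf; leaves are plain floats."""
    while isinstance(tree, tuple):
        f, thr, left, right = tree
        tree = left if row[f] <= thr else right
    return tree


rng = random.Random(0)
# Features: (power cap in watts, service level, interference flag).
rows = [(rng.choice([60, 70, 80, 90, 100, 115]), rng.choice([0, 1]),
         rng.choice([0, 1])) for _ in range(2000)]
targets = [synthetic_runtime(p, sl, i, rng) for p, sl, i in rows]

n_train = len(rows) // 20                      # train on 5 percent of the runs
tree = build_tree(rows[:n_train], targets[:n_train])

test_rows, test_targets = rows[n_train:], targets[n_train:]
mape = sum(abs(predict(tree, r) - t) / t
           for r, t in zip(test_rows, test_targets)) / len(test_rows)
train_mean = sum(targets[:n_train]) / n_train
baseline_mape = sum(abs(train_mean - t) / t
                    for t in test_targets) / len(test_targets)
print(f"tree error: {mape:.1%}, mean-only baseline error: {baseline_mape:.1%}")
```

A small training fraction can work in a setting like this because the number of distinct parameter combinations is small relative to the number of runs; whether that holds for real applications depends on the dataset. A random forest would average many such trees built on resampled data to reduce the variance of a single tree.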
This research has broad application across the DOE and NNSA mission space as we prepare for future HPC systems by redesigning the system software stack to support multiple objectives. The dataset from this research is the first step in developing practical job-placement and resource-management policies for that redesigned stack. The work helps us understand current and future computational systems better and enables us to manage resources on large-scale systems more efficiently, leading to greater data throughput and more science accomplished per dollar. This project supports the Lawrence Livermore National Laboratory's core competencies in high-performance computing, simulation, and data science.
We studied the simultaneous impact of network quality-of-service, power capping, and placement on application performance. We delivered a statistically significant dataset that can be leveraged as part of future HPC system software design and showed that performance can improve significantly when we know which parameters to tune. Our future work involves building efficient models to incorporate such data into job schedulers and computational runtimes.
Bhatele, A., et al. 2013. "There Goes the Neighborhood: Performance Degradation Due to Nearby Jobs." Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC ’13), Denver, CO, November 2013. LLNL-CONF-635776.
Inadomi, Y., et al. 2015. "Analyzing and Mitigating the Impact of Manufacturing Variability in Power-Constrained Supercomputing." Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’15), Article No. 78, Austin, Texas, November 2015. doi: 10.1145/2807591.2807638.
Savoie, L., et al. 2016. "I/O Aware Power Shifting." Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS 2016), 740–749, Chicago, IL, May 2016. doi: 10.1109/IPDPS.2016.15.
Subramoni, H., et al. 2010. "Improving Application Performance and Predictability Using Multiple Virtual Lanes in Modern Multi-Core Infiniband Clusters." 39th International Conference on Parallel Processing (ICPP ’10), San Diego, California, September 2010. doi: 10.1109/ICPP.2010.54.
Patki, T., et al. 2018. "Understanding Simultaneous Impact of Network QoS and Power on HPC Application Performance." Supercomputing 2018 (SC ’18), Dallas, TX, November 2018. LLNL-PRES-761308.