Computation Power at Scale

Barry Rountree (14-ERD-065)

Abstract

The field of high-performance computing faces a profound change as we move toward exascale computation, that is, one quintillion (10^18) floating-point operations per second. For the first time, users will have to optimize codes in the presence of limited and variable electrical power. While it may initially be possible to hide these limitations by over-provisioning and under-utilizing scarce power resources, we will demonstrate how addressing them directly can lead to full utilization of power resources and an order-of-magnitude improvement in throughput. We intend to address three difficult problems under specified power bounds: model creation for performance prediction, job scheduling, and run-time optimization. We will investigate a hardware over-provisioning strategy, in which the supercomputing center houses more compute nodes than it can power fully at one time. Intuitively, hardware over-provisioning allows the cluster to use all available machine-room power, and therefore achieve maximum performance, through judicious scheduling of per-node power. To do so, we will leverage LLNL's hardware resources, our expertise in job scheduling, and our continuing relationship with our academic partner, the University of Arizona.
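The intuition behind hardware over-provisioning can be illustrated with a toy calculation. The sketch below is not the project's model: the square-root performance curve, the idle power, the per-node cap limits, and the power budget are all hypothetical values chosen only to show why running more nodes at a reduced power cap can outperform running fewer nodes at full power under the same machine-room budget.

```python
import math


def node_perf(cap_w, idle_w=50.0):
    """Hypothetical concave model: performance grows with the square root
    of the power available above idle (diminishing returns)."""
    return math.sqrt(max(cap_w - idle_w, 0.0))


def best_configuration(budget_w, nodes_available, min_cap_w=60.0, max_cap_w=115.0):
    """Search over node counts, splitting the budget evenly across nodes.

    Returns (node_count, per_node_cap_w, total_throughput) for the best
    configuration found, or None if no node can be powered at min_cap_w.
    """
    best = None
    for n in range(1, nodes_available + 1):
        # A node cannot draw more than its maximum, even if budget remains.
        cap = min(budget_w / n, max_cap_w)
        if cap < min_cap_w:
            break  # adding more nodes would push every cap below the minimum
        throughput = n * node_perf(cap)
        if best is None or throughput > best[2]:
            best = (n, cap, throughput)
    return best


if __name__ == "__main__":
    budget_w = 11500.0  # assumed machine-room budget for this example
    # Worst-case provisioning: only as many nodes as can run at full power.
    print(best_configuration(budget_w, nodes_available=100))
    # Over-provisioning: more nodes on the floor than can run at full power.
    print(best_configuration(budget_w, nodes_available=200))
```

With these assumed numbers, the over-provisioned search selects 115 nodes at a 100 W cap rather than 100 nodes at 115 W. The size of the gain depends entirely on the assumed performance curve and should not be read as a measured result.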

Exascale computing presents a new performance problem: how to get the most science out of each watt, rather than out of each node. Any code run at scale will have to address this issue, and if we do not have solutions ready for code teams when we take delivery of our first power-limited systems, the result will be unnecessarily poor performance. Our ultimate goal is to influence the design of the first several generations of exascale systems and their software ecosystems to maximize performance per watt. We will produce a configuration-based model, a power-aware job scheduler, and a run-time system. By demonstrating that power-aware approaches reliably yield significant performance improvements, we expect to influence the design of exascale systems.

Mission Relevance

Many of the Laboratory's core missions in national and energy security depend on the predictive simulation capability of large-scale computers, which are moving into the exascale realm. Exascale systems will be intrinsically power limited. Our research will enable optimized software in a new era of power-constrained supercomputing, in support of LLNL's core competency in high-performance computing, simulation, and data science.

FY15 Accomplishments and Results

In FY15 we (1) developed a complete multiple-node model for determining the limits of power-constrained application performance; (2) developed new power-aware scheduling algorithms for practical resource management in power-constrained, high-performance computing; (3) enhanced run-time system functionality and performed inter-node power balancing to enable a run-time system for power-constrained applications; (4) began work relating computational power, performance, and temperature to characterize performance under power and thermal bounds, using the power laboratory we created in FY14, which allows power and thermal measurement and control on advanced resources from computer manufacturers; and (5) completed initial work on assessing InfiniBand performance counters.

Manufacturing variation in processors leads to a wide distribution in the power required to achieve a given level of performance. Under a uniform power bound, this variation in efficiency leads to nonlinear, application-specific variation in performance. Power-limited supercomputers must account for, and exploit, this variation to optimize performance.
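A small worked example may help show why a uniform power bound interacts poorly with processor variation. The sketch below is an illustration under stated assumptions, not the project's method: each processor is described by a hypothetical (base power, watts per GHz) pair, power is assumed to grow linearly with frequency (the report notes the real relationship is nonlinear and application-specific), and the application is assumed to be bulk-synchronous, so it runs at the pace of its slowest processor.

```python
def uniform_cap_freqs(procs, cap_w):
    """Frequency each processor reaches when all receive the same power cap.

    procs is a list of (base_w, watts_per_ghz) pairs; both values are
    hypothetical per-processor efficiency parameters.
    """
    return [(cap_w - base_w) / slope for base_w, slope in procs]


def balanced_caps(procs, budget_w):
    """Split a total budget so every processor reaches the same frequency.

    Solves sum(base_i + slope_i * f) == budget_w for the common frequency f,
    then returns (f, per-processor caps).
    """
    total_base = sum(base_w for base_w, _ in procs)
    total_slope = sum(slope for _, slope in procs)
    freq = (budget_w - total_base) / total_slope
    caps = [base_w + slope * freq for base_w, slope in procs]
    return freq, caps


if __name__ == "__main__":
    # Three nominally identical processors with different (assumed) efficiency.
    procs = [(20.0, 20.0), (20.0, 25.0), (20.0, 30.0)]
    budget_w = 210.0

    uniform = uniform_cap_freqs(procs, budget_w / len(procs))
    freq, caps = balanced_caps(procs, budget_w)

    print("uniform cap, slowest processor (GHz):", min(uniform))
    print("balanced caps (W):", caps, "common frequency (GHz):", freq)
```

In this toy setting the same total budget yields a higher slowest-processor frequency when less efficient parts receive more power, which is one simple way a system could exploit variation rather than merely tolerate it.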

Publications and Presentations

Bailey, P., et al., Adaptive configuration selection for power-constrained heterogeneous systems. The 43rd Intl. Conf. Parallel Processing (ICPP-2015), Beijing, China, Sept. 1–4, 2015. LLNL-CONF-662222.
Marathe, A., et al., A run-time system for power-constrained HPC applications. 14th INFORMS Computing Soc. Conf., Richmond, VA, Jan. 11–13, 2015. LLNL-CONF-667408.