Lawrence Livermore National Laboratory

Maya Gokhale (16-ERD-005)

Executive Summary

In this project, researchers are developing hardware modules that address a long-standing, memory-bottleneck problem in high-performance computing. This novel approach to memory creates efficiencies in performance and energy use, supporting the large computational tasks that underpin many NNSA missions and the design of future exascale architectures.

Project Description

The memory wall—the growing disparity of speed between a central processing unit and the memory outside the unit's chip—is perhaps the most prominent obstacle to processing expanding data volumes. While improvements continue to increase the number of processing cores on a chip, improvements to memory bandwidth and latency (amount of time for the memory to respond to a command request) at the equivalent scale are not forthcoming, resulting in teraflop compute cores that must sit idle waiting on data. New three-dimensional memories, such as the hybrid memory cube with its very-high-speed serial interface, offer greatly improved bandwidth over double-data-rate memory at the expense of latency. However, it is a continuing challenge to keep the central processing unit cores doing useful work because memory bandwidth improvements mainly benefit regular, streaming access patterns. We are addressing the issues of memory latency, bandwidth, and power by incorporating associative indexing functionality into the memory itself. Commonly employed in data science and simulations, particularly in scientific computing that incorporates sparse data structures, associative indexing encompasses search methods in which data is accessed by content, pattern, or key rather than directly by the memory address at which the data is stored. We will design, develop a prototype, and evaluate a collection of associative-indexing hardware modules optimized for in-memory search. Incorporating logic with memory has become feasible with the emergence of three-dimensional memory stacks that put a memory controller in a base logic layer. In contrast to other methods that place central processing units in the memory, our novel approach offloads data access tasks to the memory, allowing the high-performance central processing unit to compute on the actual data stream. Achieving the ultimate goal of this work—accelerating data lookup and pattern search while reducing energy use—will result in very significant improvements in performance and energy for a wide range of data-centric workloads.

Data-intensive workloads often analyze large, complex, inter-related data structures and exhibit irregular, data-dependent memory-access patterns that can only use a fraction of available bandwidth. Such access patterns result in wasted memory bandwidth and power and limited performance. In our approach (see figure), an in-memory indexing accelerator creates, updates, and searches sparse data structures using index arrays, linked lists, trees, and hash tables (tables used to map keys to values). Our experimental evaluation uses a field-programmable gate-array emulator to quantitatively assess the performance and power usage of alternative hardware implementations. By exploring specialized acceleration functions for scalable compute-node performance improvement, we are seeking solutions to the memory-access bottleneck that is a major challenge in the design of future architectures for exascale computing.

Mission Relevance

This project supports the NNSA goal of strengthening the science, technology, and engineering base and advances the Laboratory's core competency of high-performance computing, simulation, and data science through its efforts to accelerate data lookup and pattern search while reducing energy use. The architecture foundations established by this work will have an impact on future exascale architectures, smaller-scale clusters, and single-node configurations.

FY17 Accomplishments and Results

In FY17 we (1) completed the design and implementation of an efficient hardware key/value lookup unit; (2) incorporated the hardware design into our emulator; (3) characterized performance and energy for a range of memory latencies representative of volatile and nonvolatile memory systems; (4) obtained access to an IBM ConTutto board, which is designed for experimenting with novel memory technologies within a working system; and (5) initiated experiments on the board based on the nonvolatile, magnetoresistive random-access memory.

Figure 1.
Accelerating hash table lookup with a near-memory hardware pipeline. A hash table is a data structure that implements a structure that can map keys to values to compute a data index into an array of buckets or slots, from which a desired value can be found. The lookup accelerator retrieves table entries by streaming hash table indices through a sequence of hardware building blocks including load store units (LSU), a hash unit, and a compare unit that selects the requested keys. Near-memory hardware acceleration delivers up to 12.8 times the performance of equivalent operations in the main processor.

Publications and Presentations

Gokhale, M. B. 2017. "FPGAs in Extreme Scale Computing: Niche or Mainstream?" Argonne Workshop on Reconfigurable Computing, Urbana, IL, 12–13 October 2016. LLNL-PRES-704337.

———. 2017. "Parallel Programming Models for Custom Computing Arrays: Lessons from FCCM Roots." FCCM 2016 24th IEEE International Symposium on Field-Programmable Custom Computing Machines, Washington, DC, 1–3 May 2016. LLNL-PRES-729894.

Lloyd, G. S., and M. Gokhale. 2016. "Evaluating the Feasibility of Storage Class Memory as Main Memory." MEMSYS '16 Proceedings of the Second International Symposium on Memory Systems, Association for Computing Machinery, New York, pp. 437–441. doi:10.1145/2989081.2989118. LLNL-CONF-687015.

———. 2017. "Near Memory Key/Value Lookup Acceleration." MEMSYS 2017, The International Symposium on Memory Systems, Alexandria, VA, 2–5 October 2017. LLNL-CONF-731026.