In-Memory Associative Indexing: An Approach to Efficient High-Performance Computing

Maya Gokhale | 16-ERD-005

Overview

The memory wall, the growing disparity between the speed of a central processing unit (CPU) and the speed of the memory outside the unit's chip, is perhaps the most prominent obstacle to processing expanding volumes of data. While technological improvements continue to increase the number of processing cores on a chip, improvements in memory bandwidth (i.e., the rate at which a processor can read data from or store data in semiconductor memory) and memory latency (i.e., the time between a processor's request for a byte in memory and its retrieval) are not keeping pace, so teraflop-scale supercomputer cores must sit idle while waiting for data. If the data are not in the processor's cache, they take longer to obtain because the processor must communicate with the external memory cells. Keeping the CPU cores doing useful work is a continuing challenge because memory-bandwidth improvements such as high-bandwidth memory mainly benefit regular, streaming access patterns.

We addressed the issues of memory latency, bandwidth, and computational power for associative-indexing applications by incorporating key-value store search capabilities into the memory itself. Commonly employed in data science and simulations, particularly in scientific computing that incorporates sparse data structures, this associative-indexing technique is used in search methods wherein data is accessed by content rather than directly accessed by the memory address where the data is stored. We designed, prototyped, and evaluated a key-value store lookup accelerator optimized for in-memory search. Incorporating logic with memory has become feasible with the emergence of three-dimensional memory stacks that put a memory controller in a base logic layer. In contrast to conventional key-value search accelerators that embed CPUs in the memory, our novel approach created a customized hardware pipeline in the memory itself to search the key-value store. Our memory-based accelerator outperformed conventional key-value store search by as much as twelve times. We evaluated the use of the accelerator in future memory systems using hardware-based emulation.

Background and Research Objectives

The term memory wall (Wulf and McKee 1995) was coined to describe the growing disparity between CPU speed, the speed of memory outside the processor chip, and disk-drive input/output rates. The memory wall is one of the most significant obstacles to processing expanding volumes of data. From 1986 to 2000, CPU speed improved at an annual rate of 55 percent, while memory speed improved at only 10 percent. Given these trends, computer scientists expected that memory latency would become a serious impediment to computer performance. While technological improvements continue to increase the number of processing cores on a chip, improvements in memory bandwidth (i.e., the rate at which a processor can read data from or store data in semiconductor memory) and memory latency (i.e., the time between a processor's request for a byte in memory and its retrieval) have not kept pace, which means the teraflop-scale computing cores of supercomputers must sit idle while waiting for data. If the data are not in the processor's cache, they take longer to obtain because the processor must communicate with the external memory cells. Keeping the CPU cores doing useful work is a continuing challenge because memory-bandwidth improvements such as high-bandwidth memory mainly benefit regular, streaming access patterns. Additionally, data-intensive workloads often analyze large, complex, interrelated data structures and exhibit irregular, data-dependent memory access patterns that can use only a fraction of the available bandwidth. With these access patterns, only a small part of each fetched cache line is used (e.g., when traversing a linked index structure in search of the relevant data item). In such cases, it is common for only eight bytes of a 64-byte cache line to be processed before a new memory request for a different cache line is issued, wasting memory bandwidth and limiting computational performance.
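To make the wasted-bandwidth scenario concrete, the following C sketch (illustrative only; the node layout and field names are assumptions, not the project's data structures) traverses a linked index structure in which each node fills a 64-byte cache line but only the eight-byte key and eight-byte next pointer are examined per hop:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node sized to one 64-byte cache line; only the key and
 * next pointer are touched during a search, so most of each fetched
 * line goes unused. */
typedef struct node {
    uint64_t      key;         /* 8 bytes examined per step            */
    struct node  *next;        /* 8 bytes followed to the next line    */
    char          payload[48]; /* fetched with the line but never read */
} node_t;

/* Walk the list until the key is found; each hop typically lands on a
 * different cache line, so bandwidth is spent on data that is never used. */
static const node_t *list_search(const node_t *head, uint64_t key)
{
    for (const node_t *n = head; n != NULL; n = n->next) {
        if (n->key == key)
            return n;
    }
    return NULL;
}
```

In this pattern, roughly 48 of every 64 fetched bytes are discarded, which is the bandwidth waste the project's in-memory search engine is designed to avoid.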

We addressed the issues of memory latency, memory bandwidth, and computing power for associative-indexing applications by incorporating key-value store search capability into the memory itself. A key-value store is a data-storage paradigm designed for storing, retrieving, and managing associative arrays (i.e., a data structure also known as a dictionary or hash table). It contains a collection of objects or records that have many different fields within them, each containing data. These records are stored and retrieved using a key that uniquely identifies the record and is used to find the data quickly within the database. Commonly employed in data science and simulations, particularly in scientific computing that incorporates sparse data structures, this associative-indexing technique is used in search methods wherein data is accessed by content rather than directly by the memory address at which the data is stored. Key-value stores are used in many applications, ranging from web search to high-performance computing (HPC) task graphs to lightweight databases. We focused on a specific search algorithm for open-address hashing (Cormen et al. 2001) and designed, prototyped, and evaluated a customized key-value store lookup accelerator to search key-value stores organized in an open-address hashing index structure. Incorporating logic with memory has become feasible with the emergence of three-dimensional memory stacks that put a memory controller in a base logic layer. In contrast to conventional key-value search accelerators that embed CPUs in the memory, our novel approach uses a customized hardware pipeline in the memory itself to search the key-value store.
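For reference, the search half of textbook open-address hashing (Cormen et al. 2001) can be sketched in C as follows; the table size, sentinel key, and hash function are illustrative assumptions rather than the accelerator's actual index format:

```c
#include <stdint.h>

#define TABLE_SIZE 1024u         /* illustrative; assumed to be a power of two */
#define EMPTY_KEY  UINT64_MAX    /* sentinel marking an unused slot            */

typedef struct {
    uint64_t key;
    uint64_t value;
} kv_slot_t;

/* The table must be initialized so every unused slot has key == EMPTY_KEY. */
static kv_slot_t table[TABLE_SIZE];

/* Simple multiplicative hash; a stand-in for whatever hash function the
 * real index structure uses. */
static uint64_t hash_key(uint64_t key)
{
    return (key * 0x9E3779B97F4A7C15ULL) & (TABLE_SIZE - 1);
}

/* Open-address lookup with linear probing: examine successive slots
 * starting at hash_key(key) until the key or an empty slot is found. */
static int kv_lookup(uint64_t key, uint64_t *value_out)
{
    uint64_t start = hash_key(key);
    for (uint64_t i = 0; i < TABLE_SIZE; i++) {
        const kv_slot_t *slot = &table[(start + i) & (TABLE_SIZE - 1)];
        if (slot->key == key) {
            *value_out = slot->value;
            return 1;                 /* hit                       */
        }
        if (slot->key == EMPTY_KEY)
            return 0;                 /* miss: empty slot reached  */
    }
    return 0;                         /* table full, key absent    */
}
```

In software, each probe of this loop can cost a full cache-line fetch from memory; the accelerator's premise is to execute the probe sequence next to the memory instead of on the CPU.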

Our approach to integrating heterogeneous hardware modules into purpose-built accelerators is well aligned with emerging trends in computer architecture research to exploit specialized accelerators that offload general-purpose processors. Our innovation is to demonstrate the benefit of co-locating data-intensive accelerators with the memory where the data resides, thereby reducing the CPU's memory bandwidth requirements. Our near-memory computing approach brings only relevant data across the memory bus to the CPU. Our memory-based accelerator outperformed conventional key-value store search by as much as twelve times. We evaluated the use of the accelerator in future memory systems using field-programmable gate array (FPGA) hardware emulation.
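As a rough, host-side illustration of this offload model, the sketch below assumes a hypothetical memory-mapped command block (the nm_kv_cmd_t layout and nm_kv_lookup wrapper are invented for this example and are not the project's interface): the CPU posts a key, the in-memory pipeline walks the probe sequence internally, and only the eight-byte result crosses the memory bus.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical command block exposed by a near-memory lookup engine;
 * the field names, layout, and status codes are invented for this sketch. */
typedef struct {
    volatile uint64_t key;     /* key to search for                      */
    volatile uint64_t value;   /* result written back by the engine      */
    volatile uint32_t status;  /* 0 = idle, 1 = busy, 2 = hit, 3 = miss  */
} nm_kv_cmd_t;

/* Host-side wrapper: post the key, wait for completion, and read back only
 * the matching eight-byte value. The probe traversal runs inside the memory
 * stack, so intermediate cache lines never cross the memory bus. */
static bool nm_kv_lookup(nm_kv_cmd_t *cmd, uint64_t key, uint64_t *value_out)
{
    cmd->key = key;
    cmd->status = 1;              /* start the in-memory search     */
    while (cmd->status == 1)
        ;                         /* poll until the engine finishes */
    if (cmd->status == 2) {
        *value_out = cmd->value;
        return true;
    }
    return false;
}
```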

Impact on Mission

This project supported Lawrence Livermore National Laboratory's core competencies in HPC, simulation, and data science. Our research is highly relevant to DOE post-exascale architectures, addressing the memory-access bottleneck through research into specialized acceleration functions that improve scalable compute-node performance. This research helped establish the Laboratory's leadership in extreme heterogeneity in computer architectures for the post-exascale computing era. We expect that future research will include cooperative design and software support for using heterogeneous hardware blocks. We are also participating in a joint DOE/NSA architecture project in which heterogeneous hardware block technology is an integral component. Our research also resulted in the development and patenting of a near-memory data reorganization engine (Gokhale 2018). Several well-known computer and memory-system vendors have asked us for advice about developing future HPC products that incorporate near-memory computing.

Conclusion

Our approach to integrating heterogeneous hardware modules into purpose-built accelerators is aligned with emerging trends in computer architecture research into specialized accelerators that offload general-purpose processors. We demonstrated the benefit of co-locating data-intensive accelerators with the memory where the data resides, thereby reducing the CPU memory bandwidth requirements. Our near-memory computing approach brings only relevant data across the memory bus to the CPU. In contrast to conventional key-value search accelerators that embed CPUs in the memory, our novel approach created a customized hardware pipeline in the memory itself to search the key-value store. Our memory-based accelerator outperformed conventional key-value store search by as much as twelve times. We evaluated the use of the accelerator in future memory systems using hardware emulation.

References

Cormen, T. H., et al. 2001. Introduction to Algorithms, 2nd ed. MIT Press, Cambridge, MA.

Gokhale, M. B. 2018. Near-Memory Data Reorganization Engine. US Patent 9965187, filed February 18, 2016, issued May 8, 2018.

Wulf, W. A. and S. A. McKee. 1995. "Hitting the Memory Wall: Implications of the Obvious." ACM SIGARCH Computer Architecture News 23(1): 20–24. doi: 10.1145/216585.216588.

Publications and Presentations

Gokhale, M. B. 2016. "FPGAs in Extreme Scale Computing: Niche or Mainstream?" FPGA Workshop, Urbana, IL, October 2016. LLNL-PRES-704337.

——— . 2017. "Parallel Programming Models for Custom Computing Arrays: Lessons from FCCM Roots." IEEE International Symposium on Field Programmable Custom Computing Machines (FCCM 2017), Napa, CA, April 2017. LLNL-PRES-729894.

——— . 2018. Near-Memory Data Reorganization Engine. US Patent 9965187, filed February 18, 2016, issued May 8, 2018.

Jain, A. K., et al. 2017. "LiME: An Open Source Memory System Emulation Platform." Open Source Supercomputing Workshop at SC17, Denver, CO, November 2017. LLNL-PRES-741468.

——— . 2018. "Microscope on Memory: MPSoC-Enabled Computer Memory System Assessments." IEEE 26th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), Boulder, CO, April 2018. LLNL-CONF-748088.

Lloyd, G. S. and M. B. Gokhale. 2016. "Evaluating the Feasibility of Storage Class Memory as Main Memory." Proceedings of the Second International Symposium on Memory Systems (MEMSYS 16), Alexandria, VA, October 2016. doi: 10.1145/2989081.2989118. LLNL-CONF-687015.

——— . 2017. "Near Memory Key/Value Lookup Acceleration." International Symposium on Memory Systems (MEMSYS 17), Alexandria, VA, October 2017. LLNL-CONF-731026.