Interactive Exploratory Graph-Enabled Data Analytics at High-Performance Computing Scales

Roger Pearce | 21-ERD-020

Project Overview

Exploratory Data Analytics (EDA) is often the first step data scientists take to explore a new dataset or develop a new analytic task. For many data scientists, the de facto standard is the Jupyter Notebook environment (i.e., interactive Python), in which relatively small datasets can be interactively manipulated using tools such as NumPy, SciPy, Pandas, and NetworkX. Within the mission spaces of Lawrence Livermore National Laboratory (LLNL) and its external agency partners, the volume of data generated by many mission-critical analytics requires High Performance Computing (HPC) scale resources, but the batch processing modality of HPC systems is at odds with the interactive EDA workflow. This project aimed to close the gap between interactive exploratory data science workflows and batch-style HPC by investigating the fundamental computer science shifts required to realize interactive exploratory HPC-scale data science.

Such fundamental shifts in computing modality require co-design research efforts between algorithms and systems. At the systems level: new approaches to persist, snapshot, and version dynamic data structures were investigated; new asynchronous communication programming models for HPC were developed; and new approaches for driving HPC calculations from interactive Python frontends were investigated. At the algorithms level: new algorithms to support interactive queries on property graphs (e.g., knowledge or provenance graphs) were investigated; new parallel distributed data structures were developed for an important class of graph centrality analytics; and new analytic techniques for detecting temporal and topological coordination behaviors were investigated.

Mission Impact

This project advances the Livermore Core Competency in High Performance Computing. The LDRD research team investigated fundamental computer science algorithms, data structures, and system software to meet current and future national security challenges, particularly those requiring large-scale data analytics. The research products from this project, including publications in multiple top-tier high-performance computing conferences and journals, have resulted in new externally sponsored projects within Global Security that continue to increase the technology readiness level of the fundamental research started under this LDRD project.

Publications, Presentations, and Patents

Steil, Trevor, Tahsin Reza, Benjamin Priest, and Roger Pearce. "Embracing Irregular Parallelism in HPC with YGM." In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-13. 2023.

Fletcher, Lance, Trevor Steil, and Roger Pearce. "Optimizing a Distributed Graph Data Structure for K-Path Centrality Estimation on HPC." In 2023 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1-7. IEEE, 2023.

Piercey, Preston, Roger Pearce, and Nate Veldt. "Coordinated Botnet Detection in Social Networks via Clustering Analysis." In Proceedings of the 52nd International Conference on Parallel Processing Workshops, pp. 192-196. 2023.

Reza, Tahsin, Trevor Steil, Geoffrey Sanders, and Roger Pearce. "Distributed Approximate Minimal Steiner Trees with Millions of Seed Vertices on Billion-Edge Graphs." Journal of Parallel and Distributed Computing (2023): 104717. https://doi.org/10.1016/j.jpdc.2023.104717

Youssef, Karim, Abdullah Al Raqibul Islam, Keita Iwabuchi, Wu-chun Feng, and Roger Pearce. "Optimizing Performance and Storage of Memory-Mapped Persistent Data Structures." In 2022 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1-7. IEEE, 2022.

Reza, Tahsin, Geoffrey Sanders, and Roger Pearce. "Towards Distributed 2-Approximation Steiner Minimal Trees in Billion-edge Graphs." In 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 549-559. IEEE, 2022.

Pearce, Roger, and Geoffrey Sanders. "Persistent memory as the substrate for HPC-scale graph analytics." The Next Wave 23, no. 2 (2022): 33-39. ISSN 2640-1789 [online], 2640-1797 [print]. Available at: www.nsa.gov/thenextwave

Pirkelbauer, Peter, Seth Bromberger, Keita Iwabuchi, and Roger Pearce. "Towards scalable data processing in python with CLIPPy." In 2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms (IA3), pp. 43-52. IEEE, 2021.

Steil, Trevor, Tahsin Reza, Keita Iwabuchi, Benjamin W. Priest, Geoffrey Sanders, and Roger Pearce. "Tripoll: computing surveys of triangles in massive-scale temporal graphs with metadata." In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1-12. 2021.

Steil, Trevor, Geoffrey Sanders, and Roger Pearce. "Towards distributed square counting in large graphs." In 2021 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1-7. IEEE, 2021.

Youssef, Karim, Keita Iwabuchi, Wu-chun Feng, and Roger Pearce. "Privateer: Multi-versioned Memory-mapped Data Stores for High-Performance Data Science." In 2021 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1-7. IEEE, 2021.

Reza, Tahsin, Hassan Halawa, Matei Ripeanu, Geoffrey Sanders, and Roger A. Pearce. "Scalable pattern matching in metadata graphs via constraint checking." ACM Transactions on Parallel Computing (TOPC) 8, no. 1 (2021): 1-45. https://doi.org/10.1145/3434391