Geoffrey Sanders (17-ERD-024)
Executive Summary
We are developing state-of-the-art graph search technologies that meet diverse and rapidly changing customer data-analysis needs. These technologies have applications in cyber security and intelligence, such as efficiently searching through a large collection of enterprise network transactions for the malicious communication behavior of a botnet.
Project Description
Consider looking through a large network for victims of a complicated distributed cyber attack. A signature of the attack manifests itself in the topology as communications between machines and in the metadata (data about data) as time-stamps, communication type, message sizes, and classifications from machine learning algorithms. Viewing the data as a large labeled graph, the signature has a particular community structure. Graph analytics are powerful tools for extracting meaningful information from the relationships (edges) between the entities (vertices) that are represented in large relational data sets. An ideal graph mining toolkit must meet several criteria: scalability (efficient in speed and memory), versatility (efficient implementation for a wide range of analytics), flexibility (analytics are customizable to the presence of attributes and temporal metadata), and robustness (analytics are somewhat insensitive to noise). These criteria are not all adequately addressed by current methodology. Our project aims to demonstrate that simulating biased random walks in an asynchronous parallel-computing environment is a powerful framework for scalable and customizable graph analytics. This effort expands current capabilities’ robustness to noise, factors rich metadata (including dynamic information) into analytics, and establishes a novel high-performance computing paradigm for massive scale-free graphs.
We expect to design biased random-walk policies targeted towards specific graph substructures (topology and metadata), and provide analysts with a new set of tools beyond simple uniform random walks. (A uniform random-walk is comparable to a car driver randomly navigating its way through a road network connecting the country--interstates, highways, surface streets, and backroads. It will rank highly portions of the graph that are most central, typically the interstates. However, if an analyst wishes to rank roads for drivers that avoid interstates and only drive on interstates less than 10% of the time, then a biased random-walk model is needed.) The simulations will be implemented in HavoqGT, a Laboratory-developed asynchronous high-performance computing graph framework. This vertex-centric paradigm allows us to simulate many thousands of highly sophisticated random walkers in parallel and obtain accurate results quickly, thus enabling graph analytic capabilities that are not currently available. We expect to develop novel, efficient, and robust graph mining tasks, including fuzzy pattern matching with hundred-edge query patterns and analytics involving dynamic metadata. We will develop the ability to perform such analytics on graphs with several billions of edges on high-performance computing resources.
Mission Relevance
This project supports NNSA's goal of expanding and applying science and technology capabilities for dealing with broad national security challenges. Our demonstration of efficient and sophisticated biased random-walk calculations also supports the Laboratory's core competency in high-performance computing, simulation, and data science.
FY17 Accomplishments and Results
In FY17, we (1) developed novel asynchronous graph analytics, including developing and evaluating a high-performance-computing exact pattern matching capability, designing text-based biased random-walk analytics, using exact matching for fuzzy pattern matching, and developing random-walk techniques driven by centrality approximations; (2) extended HavoqGT for more efficient searching and indexing, including expanding metadata support for vertices and edges and devising a message batching strategy; (3) developed new mathematical analysis for quality guarantees and complexity reduction; and (4) acquired a metadata-rich dataset (Reddit).
Publications and Presentations
Pearce, R. 2017. "Triangle Counting for Scale-Free Graphs at Scale in Distributed Memory." 2017 IEEE High Performance Extreme Computing Conference (HPEC). September 12-14, 2017, Waltham, MA. LLNL-CONF-737552.
Reza, T., et al., 2017. "Towards Practical and Robust Labeled Pattern Matching in Trillion-Edge Graphs." IEEE Cluster. September 5-8, 2017, Hawaii. LLNL-CONF-735657.