Vectorization of Dynamic Subgraphs via Generative Models
Geoffrey Sanders | 23-FS-039
Project Overview
An important class of data analysis tasks stem from comparing subsets of connected records within massive sets of complex relational data. A common approach is to represent each set of connected records with a small graph, or set of data entities (graph vertices) and their relationships (graph edges), and efficient methods to gauge similarity for pairs of graphs are of high interest. This project concentrated on dynamic graphs, where each edge record has an associated timestamp denoting the time of observation. Pre-existing techniques for comparing dynamic graphs concentrate on either computing graph edit distance (number of vertex and edge deletion, addition, and timestamp modifications) or vectorizing the graph with counts of a limited set of dynamic graph motifs (tiny fundamental subgraphs) and computing distances between the vectors. These approaches are less able to see similarities in graphs that are fairly different in size but come from identical graph generation processes. The motif counting approach can be improved for graphs from the same process, but suffers from requiring many types of motifs meaning it is expensive. Moreover, many motifs are not present for small graphs, meaning realizing a a much larger graph came from the same process is difficult.
This project utilized a collection of K trainable generative models that, for each dynamic graph G and model M, cheaply compute a likelihood that graph G came from model M. This yields a K-length vector of model likelihoods representing each graph. Comparing the relative scores in this vector facilitates computing dynamic graph similarities based on the underlying generative processes involved within a massive set of graphs. The contributions of the project were: (i) assessing existing trainable graph generators' utility for likelihood-based vectorization of dynamic graphs, (ii) showing dynamic information greatly simplifies likelihood computations for many generative models (due to limiting the number of graph permutations required in the analysis), (iii) designing a prototype for a custom, trainable dynamic graph generator, and (iv) some preliminary evaluation of the custom generator on challenging synthetic problems designed to demonstrate the efficacy of the proposed approach.
Mission Impact
Mission relevant data analysis tasks from cyber- and cyber-physical security, or biology, involve growing sets of relational (graph) data. Foundational methodology for representing such data on computers in ways that facilitate efficient indexing, querying, clustering, anomaly detection, and other such analyses greatly aides in the ability to use these datasets to various ends. This project developed an initial approach that will allow dynamic graphs to be indexed by what generative processes are likely in producing the graph, able to better detect completely anomalous dynamic behavior or similarity to previous behavior of interest. Additionally, this work could be leveraged to further improve generative modelling, explanability of graph neural networks, and other challenging machine learning tasks associated with real-world, mission relevant, relational datasets involving temporal information. This work contributes to LLNL's Core Competency High Performance Computing, Simulation and Data Science and can impact mission work in cybersecurity and biology."