All-Source Data Fusion for Detecting and Monitoring Threats on a Global Scale

David Widemann (15-ERD-050)

Project Description

The vast amount of digital information from social networks, e-mail, real-time chat, and blogs offers enormous opportunities for analysts to gain insight into patterns of activity and to detect national security threats. However, systems must be developed to analyze not only the enormous quantity of data, but also the diverse forms in which it arrives. We intend to improve analysts' abilities to discover illicit production of weapons of mass destruction by automatically matching intelligence to process templates created by Lawrence Livermore subject matter experts. A large, robust, and unified semantic vector space (a mathematical vector structure in which related concepts or words lie close together) will be created from large collections of documents. Additionally, novel algorithms will be developed to map non-text data such as imagery, graphs, process templates, and video into the semantic space. We will integrate all technologies developed in this project into existing Livermore computing assets. If successful, this will give analysts a powerful automated tool for analysis across enormous data sets. We intend to assemble a diverse team of in-house machine-learning and high-performance computing experts, and build upon existing algorithms to develop the initial version of our system.
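As a minimal illustration of what "proximate" means in a semantic space, the sketch below ranks words by cosine similarity. The three-dimensional vectors, the vocabulary, and the function names are hypothetical and chosen only for illustration; they are not taken from the project's trained space, which would have far more dimensions and would be learned from large document collections.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings: made-up numbers, not learned from any corpus.
toy_space = {
    "centrifuge": [0.9, 0.8, 0.1],
    "enrichment": [0.8, 0.9, 0.2],
    "football": [0.1, 0.0, 0.9],
}

def nearest(query, space):
    """Return the other words in the space, most similar to `query` first."""
    q = space[query]
    scored = [(word, cosine_similarity(q, vec))
              for word, vec in space.items() if word != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranked = nearest("centrifuge", toy_space)
for word, score in ranked:
    print(f"{word}: {score:.3f}")
```

In this toy space the related term ranks well above the unrelated one. The same nearest-neighbor comparison would apply to any item mapped into the space, which is what makes a shared space attractive for cross-modal search.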

We expect our system to provide analysts with a powerful, highly automated, and unified tool for quickly identifying, comparing, and even anticipating threats of illicit production of weapons of mass destruction in an ever-shifting global intelligence landscape. The system will be scalable to various data sets and incorporate many different data modalities including text, entity graphs, process templates, images, and video. All data modalities will share a common, robust semantic space trained on vast real-world data collections, enabling cross-modal context-aware searching, analysis, and prediction of fast-emerging threats. We will also employ a variety of metrics to measure the efficacy of our system, both at an individual-component level and at a threat-detection level. Individual pieces will be measured by community-accepted benchmarks such as the probability distribution of text; search precision and recall for documents, imagery, and video; and new metrics for cross-modal transfer. System-level performance metrics will be developed to measure how much analyst productivity improves with our tools relative to a baseline. The system will be designed to integrate into the existing Laboratory computing infrastructure.
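For readers unfamiliar with the search metrics named above, the following sketch computes precision and recall for a single query. The document identifiers and the function name are hypothetical; the project's actual evaluation code and benchmark data are not described in this summary.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for one query.

    retrieved: document IDs returned by the search system
    relevant:  document IDs a human judged relevant to the query
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Guard against division by zero when either set is empty.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall

# Hypothetical example: 4 documents returned, 3 documents truly relevant,
# 2 of the returned documents are among the relevant ones.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d1", "d3", "d5"]

precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Precision measures how much of what the system returned was useful, and recall measures how much of the useful material the system found. The same two quantities can be computed whether the retrieved items are documents, images, or video clips.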

Mission Relevance

By moving the burden of integrating the pieces from the analyst to the machine, a successful project will free analysts to perform high-level decision-making and analysis tasks rather than spend significant time "in the weeds." We will create an improved computational capability to provide decision makers with early warning of threats in a fast-moving, ever-changing intelligence and threat landscape. Developing this capability is well aligned with the strategic focus area of cyber security, space, and intelligence. Our approach will result in better awareness of threats facing the nation and support the Laboratory's core competency in high-performance computing, simulation, and data science.

FY16 Accomplishments and Results

In FY16 we (1) created an algorithm, ROPE, that gives robust order-preserving embeddings using semantic vector spaces (it combines word embedding and compressive sensing so that a user can search and retrieve by semantic concept); (2) determined that the ROPE algorithm represents a major advance over standard keyword searches; and (3) developed an algorithm that dynamically creates spatial and temporal heat maps to monitor nuclear activity reported in the U.S. Director of National Intelligence Open Source Center database, as shown in the figure.

Spatiotemporal heat map of nuclear-related activity using the U.S. Director of National Intelligence Open Source Center's Web-scraping database. A large response indicates that the country in question has been referenced in many nuclear-related documents.
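The counting idea behind such a heat map can be sketched as follows. This is only an illustration under stated assumptions, not the project's algorithm, which this summary does not describe: the record fields (`country`, `month`, `nuclear_related`), the placeholder country names, and both function names are hypothetical, and the records are invented rather than drawn from the Open Source Center database.

```python
from collections import Counter

# Hypothetical records: each stands for one document that has already been
# tagged with the country it references, its month, and a relevance flag.
records = [
    {"country": "Country A", "month": "2016-01", "nuclear_related": True},
    {"country": "Country A", "month": "2016-01", "nuclear_related": True},
    {"country": "Country A", "month": "2016-02", "nuclear_related": True},
    {"country": "Country B", "month": "2016-01", "nuclear_related": False},
    {"country": "Country B", "month": "2016-02", "nuclear_related": True},
]

def heat_map_counts(records):
    """Count nuclear-related documents per (country, month) cell."""
    counts = Counter()
    for rec in records:
        if rec["nuclear_related"]:
            counts[(rec["country"], rec["month"])] += 1
    return counts

def to_grid(counts):
    """Arrange the counts as rows (countries) by columns (months)."""
    countries = sorted({country for country, _ in counts})
    months = sorted({month for _, month in counts})
    # A Counter returns 0 for cells with no matching documents.
    grid = [[counts[(c, m)] for m in months] for c in countries]
    return countries, months, grid

countries, months, grid = to_grid(heat_map_counts(records))
for country, row in zip(countries, grid):
    print(country, row)
```

Each cell of the grid corresponds to one country in one time period, and a larger count corresponds to the "large response" described in the caption. A plotting library could then color the cells by count.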

Publications and Presentations

Sattigeri, P., and J. J. Thiagarajan, Sparsifying word representation for deep unordered sentence modeling. Association for Computational Linguistics 2016, Berlin, Germany, Aug. 7–12, 2016. LLNL-CONF-699499.
Widemann, D. P., and J. Ordonez, Semantic evaluation using robust order-preserving embeddings. (2016). LLNL-TR-699098.
Widemann, D., E. X. Wang, and J. Thiagarajan, ROPE: Recoverable order-preserving embedding of natural language. (2016). LLNL-TR-682663.