Predictive Models Based on Disjointed Feature Sets for Applications in Biomedicine and Cyber Security

Todd Wasson (15-ERD-053)

Project Description

From electronic medical records, to smart cities, to computer-network traffic monitoring, technological advances have enabled the collection of increasingly massive data sets. The ability to model the systems represented by these data, and especially to make predictions about those systems, can provide immeasurable benefits. In healthcare, for example, accurate and timely prediction of the future condition of hospitalized patients would allow better allocation of resources, potentially saving thousands of lives and substantially reducing hospital costs. Within cyber security, effective modeling of network traffic can help identify suspicious behavior and predict and prevent cyber attacks. While the field of big-data analytics has seen substantial advances, it has not kept pace with advances in data collection, and making sense of these data sets and extracting useful information from them remains a challenge. We are creating a framework and a tool set for learning from increasingly prevalent "messy" real-world data, which are often incomplete, heterogeneous, and high dimensional. At present, messy data are modeled only in part, via existing techniques capable of handling a subset of the data, or, worse, are discarded outright. We are developing models, tools, and applications for critical data domains, creating three independent but complementary tools en route to a unified approach built on cutting-edge cluster computing. The methodological development is motivated by, and applied to, healthcare data sets via collaboration with the Kaiser Permanente Medical Groups, as well as to cyber security, with an eye toward generalization to other domains.

We are developing novel statistical modeling capabilities and tools and will ultimately merge these disparate capabilities into a cohesive whole. The resulting tools will enable quantifiable prediction based on arbitrary subsets of data, supporting both learning and predictive tasks that use entire data sets rather than requiring complete observations. Specifically, we are developing the first software tools for (1) jointly modeling time-series data (observations collected sequentially in time) and nontemporal data, (2) selecting features across time-series and heterogeneous data together, (3) learning from heterogeneous time-series data whose values are missing nonrandomly, and (4) performing all of these tasks jointly across problem domains. We expect to show significant impact with application to biomedical clinical data, yielding a deployable tool to improve the care of hospitalized patients. In addition, our software will be relevant to cyber security data, enabling hitherto impossible inference on heterogeneous, potentially unreliably gathered network-sensor data.
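The idea of quantifiable prediction from arbitrary subsets of features can be illustrated with a minimal sketch. This is a hypothetical toy example, not the project's actual models: if features are modeled jointly (here, simply as a multivariate Gaussian), any observed subset can be conditioned on to predict the remaining features, with an explicit uncertainty, instead of requiring complete observations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data: three correlated features (say, two vitals and a lab value).
z = rng.normal(size=(2000, 1))
X = np.hstack([z, 2.0 * z, -z]) + 0.1 * rng.normal(size=(2000, 3))

mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)

def predict_from_subset(x_obs, obs_idx, target_idx):
    """Conditional-Gaussian prediction of target features from an arbitrary
    observed subset; returns both a point prediction and its uncertainty."""
    o = np.asarray(obs_idx)
    t = np.asarray(target_idx)
    S_oo = Sigma[np.ix_(o, o)]
    S_to = Sigma[np.ix_(t, o)]
    mean = mu[t] + S_to @ np.linalg.solve(S_oo, x_obs - mu[o])
    cov = Sigma[np.ix_(t, t)] - S_to @ np.linalg.solve(S_oo, S_to.T)
    return mean, cov

# Predict feature 2 from feature 0 alone, then from features 0 and 1 together:
# observing more features should never increase the predictive uncertainty.
m1, c1 = predict_from_subset(np.array([1.0]), [0], [2])
m2, c2 = predict_from_subset(np.array([1.0, 2.0]), [0, 1], [2])
```

The same conditioning logic underlies richer models; the Gaussian assumption here is only for brevity.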

Mission Relevance

Harvesting the full potential of important data sets, including cyber and electronic-medical-record data, and developing a generalized analytic capability that can be broadly applied to other areas of interest to the Laboratory are aligned with the Laboratory's core competency in high-performance computing, simulation, and data science. The challenges of big data sweep across all mission areas, and there is an immense need to improve capabilities for extracting knowledge and insight from large, complex collections of data. Our research also supports the strategic focus area in cyber security, space, and intelligence through predictive analysis of the behavior of complex systems.

FY16 Accomplishments and Results

In FY16 we had four achievements. (1) We implemented time-series modeling via Gaussian processes for irregularly sampled temporal features, including vital-signs data. (2) We produced four alternative distributed Markov-chain Monte Carlo implementations for learning the parameters of complex Bayesian models in parallel, yielding a fully Bayesian SparkPlug package. This package is a substantial extension of the pre-existing SparkPlug distributed-learning framework developed at Lawrence Livermore to infer a variety of statistical models at scale. Bayesian SparkPlug allows these models to be learned with distributions on their parameters, enabling direct measures of confidence or certainty and the incorporation of prior information or beliefs about the data. (3) We presented the results of our work with the Kaiser Permanente Medical Groups to the community in an invited presentation at the 2016 American Thoracic Society Conference. (4) We engaged new potential collaborators at the University of California, Davis, among others, and procured new intensive-care-unit data sets from our University of Virginia collaborators.
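Gaussian-process regression is a natural fit for irregularly sampled vital signs because it makes no assumption of a fixed sampling grid. The following is a minimal numpy sketch of standard GP interpolation with uncertainty, using a made-up signal and an assumed squared-exponential kernel; the project's actual models are richer than this.

```python
import numpy as np

def sq_exp_kernel(a, b, length=1.0, var=1.0):
    """Squared-exponential covariance between two sets of time points."""
    d = a[:, None] - b[None, :]
    return var * np.exp(-0.5 * (d / length) ** 2)

rng = np.random.default_rng(1)

# Irregularly spaced measurement times, as with real vital-sign data.
t_obs = np.sort(rng.uniform(0.0, 10.0, size=15))
y_obs = np.sin(t_obs) + 0.1 * rng.normal(size=15)  # toy "vital sign"
noise = 0.1

# Standard GP regression: posterior mean and variance on a dense grid.
t_grid = np.linspace(0.0, 10.0, 101)
K = sq_exp_kernel(t_obs, t_obs) + noise**2 * np.eye(t_obs.size)
K_star = sq_exp_kernel(t_grid, t_obs)
mean = K_star @ np.linalg.solve(K, y_obs)
cov = sq_exp_kernel(t_grid, t_grid) - K_star @ np.linalg.solve(K, K_star.T)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
```

The posterior standard deviation grows in gaps between measurements, which is exactly the behavior wanted when downstream predictions must be accompanied by a measure of confidence.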

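The distributed Markov-chain Monte Carlo learning described in accomplishment (2) can be illustrated by a consensus-style strategy: run independent samplers on shards of the data against fractionated subposteriors, then combine the per-shard draws by precision weighting. The following is a hedged single-process sketch on toy data with a simple Metropolis sampler; it is an illustration of the general idea, not the Bayesian SparkPlug implementation itself.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy problem: infer the mean of a Gaussian (known unit variance, N(0, 10) prior).
data = rng.normal(loc=3.0, scale=1.0, size=4000)
shards = np.array_split(data, 4)  # emulate four parallel workers

def metropolis_subposterior(shard, n_shards, n_iter=6000, step=0.05):
    """Metropolis sampler for one shard's subposterior; the prior is
    fractionated (raised to the 1/n_shards power) so that the shard
    posteriors multiply back to the full posterior."""
    def log_post(theta):
        return (-0.5 * theta**2 / (10.0 * n_shards)
                - 0.5 * np.sum((shard - theta) ** 2))
    theta = 0.0
    cur = log_post(theta)
    draws = np.empty(n_iter)
    for i in range(n_iter):
        prop = theta + step * rng.normal()
        lp = log_post(prop)
        if np.log(rng.uniform()) < lp - cur:
            theta, cur = prop, lp
        draws[i] = theta
    return draws[1000:]  # discard burn-in

sub_draws = [metropolis_subposterior(s, len(shards)) for s in shards]

# Consensus step: combine draw-by-draw with precision (inverse-variance) weights.
w = np.array([1.0 / np.var(d) for d in sub_draws])
consensus = sum(wi * di for wi, di in zip(w, sub_draws)) / w.sum()
```

Because the workers never exchange data during sampling, only their finished draws, this family of schemes parallelizes naturally on a cluster framework such as Spark.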
Publications and Presentations

  • Mayhew, A. P., et al., Modeling patient subpopulations improves sepsis mortality prediction. American Thoracic Society Intl. Conf., San Francisco, CA, May 13–18, 2016. LLNL-PRES-692041.