Generating Host-Based Data using Machine Learning

Mark Desantis | 21-FS-045

Project Overview

Data from host-based sensors can be critical to the development of effective cyber-security analytics. However, obtaining such data can be costly and difficult. Privacy and security concerns also generally prevent organizations from sharing such data, significantly limiting research collaboration and progress. In this research, we explore the feasibility of generating synthetic, and thus sharable, host-based data (HBD) using generative adversarial networks (GANs). We compare the effectiveness of various GAN architectures in producing realistic data. We also investigate the suitability of inverse reinforcement learning (IRL) strategies for this task. While further exploration is necessary, preliminary results suggest that the synthetic data generated by some of these methods may be of sufficient quality to be useful in collaborative cyber-defense research efforts.

Mission Impact

In the cyber-security analysis domain, HBD gives the most detailed picture of the on-host environment, which is required when seeking out and identifying sophisticated threat actors. Developing a means to quickly generate effective surrogate HBD enables better collaboration on analytics development between cyber-focused departments within Lawrence Livermore National Laboratory (LLNL), NNSA, DOE, and external sponsors. Additionally, growing modern machine learning skills and knowledge (such as GANs and IRL) leaves LLNL staff well placed to address this and other mission needs.

The results of this feasibility study were briefed to potential transition partners in October 2021. One of these partners is funding long-term research efforts at LLNL, and the further research ideas identified here may be continued under one of those projects. An annual community host-based data conference associated with that project sponsor is expected to resume once COVID-related travel restrictions are lifted. HBDGen research, and follow-on efforts, are expected to be of considerable interest to the broader research community that attends that conference.

The HBDGen research is expected to be useful in facilitating future LLNL-hosted LLAMA workshops. These two-week workshops typically host around 40 local, national, and international researchers. For the past four years, LLAMA workshops have looked at host-based data. High-quality synthetic, and thus sharable, host-based data could greatly benefit this ongoing collaboration effort.