Wesam Sakla | 20-FS-020
This research developed a novel, supervised machine-learning (ML) method for detecting adversarial inputs to deep neural network (DNN) models trained for image classification. We built on recent work in supervised adversarial detection that hypothesizes that the trajectories of activations propagated forward through the layers of a trained DNN differ between genuine and adversary-modified inputs. We constructed class-specific feature-embedding spaces for the activations of each layer in a trained DNN and fed these features to a transformer-based architecture of the kind typically used in natural language processing (NLP). The transformer learns attention-based features from the trajectory of class-embedding scores produced by the forward pass of an input through the trained DNN. By learning these features with self-attention, the detection model captures global patterns and correlations among layer activations that facilitate the detection of adversarial inputs.
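The pipeline above can be illustrated with a minimal sketch: each layer's activations are scored against per-class embedding centroids, the resulting layer-by-layer trajectory of class scores is passed through self-attention, and the pooled output feeds a downstream detector. This is a toy illustration, not the report's implementation; the function names, the use of cosine similarity for class-embedding scores, and the single-head attention with Q = K = V are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def class_embedding_scores(activation, centroids):
    # Cosine similarity of one layer's activation vector to each class
    # centroid (a hypothetical stand-in for the class-embedding spaces).
    a = activation / np.linalg.norm(activation)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return c @ a

def self_attention(X):
    # Single-head self-attention over the layer-trajectory sequence,
    # with Q = K = V = X for simplicity (no learned projections).
    d = X.shape[1]
    A = softmax(X @ X.T / np.sqrt(d), axis=-1)
    return A @ X

def detector_features(layer_activations, centroids_per_layer):
    # One class-score vector per layer, stacked in forward-pass order.
    traj = np.stack([class_embedding_scores(a, c)
                     for a, c in zip(layer_activations, centroids_per_layer)])
    attended = self_attention(traj)
    # Mean-pool the attended sequence into one feature vector for a
    # downstream genuine-vs-adversarial classifier head.
    return attended.mean(axis=0)

# Toy example: a 4-layer DNN, 10 classes, varying layer widths.
widths = [32, 64, 64, 16]
acts = [rng.standard_normal(w) for w in widths]
cents = [rng.standard_normal((10, w)) for w in widths]
feat = detector_features(acts, cents)
```

Because every attention output attends over all layers at once, a perturbation that shifts class scores in only a few layers still influences the pooled feature, which is the intuition behind using attention here rather than a strictly sequential model.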
Experimental results show that the attention-based approach for detecting adversarial inputs outperforms a baseline approach that uses traditional sequence-modeling architectures such as gated recurrent units (GRUs). Even for subtle white-box attacks, our approach detects adversarial inputs with greater than 94 percent accuracy. These results can pave the way for training more robust machine-learning classification models that incorporate attention-based developments from the NLP community.
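For contrast with the attention-based detector, the GRU-style baseline processes the same layer-wise trajectory strictly sequentially, compressing it into a single recurrent hidden state. The sketch below is a generic single-layer GRU cell from the standard formulation, not the report's baseline; the dimensions and initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell (standard update/reset/candidate gating)."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(hidden_dim)
        # Each gate maps the concatenated [input, hidden] vector to hidden_dim.
        self.Wz = rng.uniform(-s, s, (hidden_dim, input_dim + hidden_dim))
        self.Wr = rng.uniform(-s, s, (hidden_dim, input_dim + hidden_dim))
        self.Wh = rng.uniform(-s, s, (hidden_dim, input_dim + hidden_dim))

    def step(self, x, h):
        xh = np.concatenate([x, h])
        z = sigmoid(self.Wz @ xh)                              # update gate
        r = sigmoid(self.Wr @ xh)                              # reset gate
        h_tilde = np.tanh(self.Wh @ np.concatenate([x, r * h]))  # candidate
        return (1 - z) * h + z * h_tilde

def encode_trajectory(cell, trajectory, hidden_dim):
    # One recurrent step per DNN layer, in forward-pass order; the final
    # hidden state summarizes the whole trajectory for the detector head.
    h = np.zeros(hidden_dim)
    for x in trajectory:
        h = cell.step(x, h)
    return h

# Toy usage: a 6-layer trajectory of 10-dimensional class-embedding scores.
rng = np.random.default_rng(1)
cell = GRUCell(input_dim=10, hidden_dim=8)
traj = rng.standard_normal((6, 10))
h = encode_trajectory(cell, traj, hidden_dim=8)
```

The sequential bottleneck is visible here: information from early layers reaches the detector only through the recurrent state, whereas self-attention relates any pair of layers directly, one plausible reason for the accuracy gap the experiments observed.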
This research resulted in a novel forensic method for detecting subtle adversarial attacks on ML models trained for image classification. These techniques may provide insight for training more reliable ML models for mission-critical applications at Lawrence Livermore National Laboratory that demand high levels of safety and security. These insights are equally valuable to programs internal and external to the Laboratory and the Department of Energy.