Adversarial attacks on neural network policies

OpenAI Blog · Papers

Summary

OpenAI researchers demonstrate that adversarial attacks, previously studied in computer vision, are also effective against neural network policies in reinforcement learning, significantly degrading performance in both white-box and black-box settings even with small, imperceptible perturbations.

# Adversarial attacks on neural network policies

Source: [https://openai.com/index/adversarial-attacks-on-neural-network-policies/](https://openai.com/index/adversarial-attacks-on-neural-network-policies/)

OpenAI

## Abstract

Machine learning classifiers are known to be vulnerable to inputs maliciously constructed by adversaries to force misclassification. Such adversarial examples have been extensively studied in the context of computer vision applications. In this work, we show adversarial attacks are also effective when targeting neural network policies in reinforcement learning. Specifically, we show existing adversarial example crafting techniques can be used to significantly degrade test-time performance of trained policies. Our threat model considers adversaries capable of introducing small perturbations to the raw input of the policy. We characterize the degree of vulnerability across tasks and training algorithms, for a subclass of adversarial-example attacks in white-box and black-box settings. Regardless of the learned task or training algorithm, we observe a significant drop in performance, even with small adversarial perturbations that do not interfere with human perception. Videos are available at [http://rll.berkeley.edu/adversarial](http://rll.berkeley.edu/adversarial).
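
As a rough illustration of the white-box setting, the sketch below applies the fast gradient sign method (FGSM), one of the crafting techniques the paper draws on, to a policy's image observation. It is a minimal sketch, not the paper's exact setup: the PyTorch stand-in policy `policy_net`, the observation shape, and the epsilon value are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-in policy: a flattened 84x84 frame -> logits over 6 actions.
policy_net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(84 * 84, 256),
    nn.ReLU(),
    nn.Linear(256, 6),
)

def fgsm_perturb(policy_net, obs, epsilon=0.01):
    """Return an adversarial observation within an L-infinity ball around `obs`."""
    obs = obs.clone().detach().requires_grad_(True)
    logits = policy_net(obs)
    # Use the policy's own greedy action as the label and *increase* the
    # cross-entropy loss, pushing the policy away from that action.
    action = logits.argmax(dim=-1)
    loss = F.cross_entropy(logits, action)
    loss.backward()
    adv = obs + epsilon * obs.grad.sign()  # one signed-gradient step
    return adv.clamp(0.0, 1.0).detach()    # keep pixels in a valid range

obs = torch.rand(1, 84, 84)                # stand-in observation
adv_obs = fgsm_perturb(policy_net, obs)
print((adv_obs - obs).abs().max())         # perturbation bounded by epsilon
```

In the black-box setting the paper also studies, the same kind of perturbation would be crafted against a separately trained surrogate policy and then transferred to the target policy.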

Similar Articles

Testing robustness against unforeseen adversaries

OpenAI Blog

OpenAI researchers developed a method to evaluate neural network robustness against unforeseen adversarial attacks, introducing a new metric called UAR (Unforeseen Attack Robustness) that assesses model performance against unanticipated distortion types beyond the commonly studied Lp norms.

Robust adversarial inputs

OpenAI Blog

Researchers demonstrated adversarial images that reliably fool neural network classifiers across multiple scales and perspectives, challenging assumptions about the robustness of multi-scale image capture systems used in autonomous vehicles.

Attacking machine learning with adversarial examples

OpenAI Blog

This article examines adversarial attacks on machine learning models and demonstrates why gradient masking, a defensive technique that attempts to deny attackers access to useful gradients, is ultimately ineffective: attackers can circumvent it by training substitute models that mimic the defended model's behavior.

OpenAI Red Teaming Network

OpenAI Blog

OpenAI launches a Red Teaming Network to crowdsource adversarial testing of AI models from diverse experts and perspectives. The program accepts applications on a rolling basis, offers flexible time commitments (as little as 5 hours a year) with compensation, and emphasizes safety expertise and candidates from underrepresented backgrounds.