Testing robustness against unforeseen adversaries

OpenAI Blog Papers

Summary

OpenAI researchers developed a method to evaluate neural network robustness against unforeseen adversarial attacks, introducing a new metric called UAR (Unforeseen Attack Robustness) that assesses model performance against unanticipated distortion types beyond the commonly studied Lp norms.

We’ve developed a method to assess whether a neural network classifier can reliably defend against adversarial attacks not seen during training. Our method yields a new metric, UAR (Unforeseen Attack Robustness), which evaluates the robustness of a single model against an unanticipated attack, and highlights the need to measure performance across a more diverse range of unforeseen attacks.

Cached at: 04/20/26, 02:56 PM

# Testing robustness against unforeseen adversaries

Source: [https://openai.com/index/testing-robustness/](https://openai.com/index/testing-robustness/)

OpenAI

We’ve developed a method to assess whether a neural network classifier can reliably defend against adversarial attacks not seen during training. Our method yields a new metric, UAR (Unforeseen Attack Robustness), which evaluates the robustness of a single model against an unanticipated attack, and highlights the need to measure performance across a more diverse range of unforeseen attacks.

Modern neural networks have achieved high accuracies on a wide range of benchmark tasks. However, they remain susceptible to [*adversarial examples*](https://openai.com/index/adversarial-example-research/), small but carefully crafted distortions of inputs created by adversaries to fool the networks. For example, the adversarial example with $L_\infty$ distortion below differs from the original image by at most 32 in each RGB pixel value; a human can still classify the changed image, but it is confidently misclassified by a standard neural network.

Sample images (black swan) generated by adversarial attacks with different distortion types. Each distortion is optimized to fool the network.

![A graph showing negative transfer between Distortion A and Distortion B](https://images.ctfassets.net/kftzwdyauwt9/49208ece-2b43-4ef1-282ea831991c/5a0f206397f212177910596832994127/negative-transfer.svg?w=3840&q=90)

An example where adversarial robustness does not transfer well. Hardening a model against Distortion A initially increases robustness against both Distortions A and B. However, as we harden further, adversarial robustness is harmed for Distortion B but remains about the same for Distortion A (A = $L_\infty$, B = $L_1$).
The accuracy of the model against Distortion A peaks at a hardening level of 8 because that is sufficient to defend against the attack and further hardening hurts clean accuracy; see full paper for details.

We’ve created a three-step method to assess how well a model performs against a new held-out type of distortion. Our method evaluates against diverse unforeseen attacks at a wide range of distortion sizes and compares the results to a strong defense which has knowledge of the distortion type. It also yields a new metric, UAR, which assesses the adversarial robustness of models against unforeseen distortion types.

Typical papers on adversarial defense evaluate only against the widely studied $L_\infty$ or $L_2$ distortion types. However, we [show](http://arxiv.org/abs/1908.08016) that evaluating against different $L_p$ distortions gives very similar information about adversarial robustness.[A](https://openai.com/index/testing-robustness/#citation-bottom-A) We conclude that evaluating against $L_p$ distortions is insufficient to predict adversarial robustness against other distortion types. Instead, we suggest that researchers evaluate models against adversarial distortions that are not similar to those used in training. We offer the $L_1$, $L_2$-JPEG, Elastic, and Fog attacks as a starting point. We provide implementations, pre-trained models, and calibrations for a variety of attacks in our [code package](https://github.com/ddkang/advex-uar).

We found that considering too narrow a range of distortion sizes can reverse qualitative conclusions about adversarial robustness. To pick a range, we examine images produced by an attack at different distortion sizes and choose the largest range for which the images are still human-recognizable. However, as shown below, an attack with a large distortion budget only uses it against strong defenses.
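To make the ideas of distortion type and distortion size concrete, here is a minimal Python sketch. It is not taken from the authors' code package. It measures the $L_\infty$, $L_1$, and $L_2$ sizes of a perturbation and clips a perturbed input to an $L_\infty$ budget of 32, the bound quoted for the example image earlier in the article. The function names `distortion_sizes` and `project_linf` and the four-value "image" are hypothetical, chosen only for illustration.

```python
from math import sqrt


def distortion_sizes(original, perturbed):
    """Return the L-infinity, L1, and L2 sizes of the perturbation."""
    deltas = [p - o for o, p in zip(original, perturbed)]
    return {
        "linf": max(abs(d) for d in deltas),   # largest single-value change
        "l1": sum(abs(d) for d in deltas),     # total absolute change
        "l2": sqrt(sum(d * d for d in deltas)),  # Euclidean length of the change
    }


def project_linf(original, perturbed, epsilon):
    """Clip each value to within epsilon of the original, then to [0, 255]."""
    projected = []
    for o, p in zip(original, perturbed):
        clipped = max(o - epsilon, min(o + epsilon, p))
        projected.append(max(0, min(255, clipped)))
    return projected


# Hypothetical pixel values, for illustration only.
original = [10, 200, 128, 255]
perturbed = [60, 150, 130, 255]

projected = project_linf(original, perturbed, epsilon=32)
sizes = distortion_sizes(original, projected)
print(projected, sizes)
```

Measuring the size a perturbation actually reaches, rather than only the budget it was allowed, is one simple way to check how much of a budget an attack uses against a given model.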
We recommend choosing a calibrated range of distortion sizes by evaluating against adversarially trained models (we also provide calibrated sizes for a wide variety of attacks in our [code package](https://github.com/ddkang/advex-uar)).

Sample images (espresso maker) of the same strong attack applied to different defense models. Attacking stronger defenses causes greater visual distortion.

We developed a new metric, UAR, which compares the robustness of a model against an attack to adversarial training against that attack. Adversarial training is a strong defense that uses knowledge of an adversary by training on adversarially attacked images.[B](https://openai.com/index/testing-robustness/#citation-bottom-B) A UAR score near 100 against an unforeseen adversarial attack implies performance comparable to a defense with prior knowledge of the attack, making this a challenging objective.

We computed the UAR scores of adversarially trained models for several different distortion types. As shown below, the robustness conferred by adversarial training does not transfer broadly to unforeseen distortions. In fact, robustness against a known distortion can reduce robustness against unforeseen distortions. These results underscore the need for evaluation against significantly more diverse attacks like Elastic, Fog, Gabor, and Snow.

![A table of UAR scores for adversarially trained models](https://images.ctfassets.net/kftzwdyauwt9/b42fb0c8-0a83-4d47-4565c0d97fa3/8bf2790929ad4c45b7e3f8c3517e7a79/uar-scores.svg?w=3840&q=90)

We hope that researchers developing adversarially robust models will use our methodology to evaluate against a more diverse set of unforeseen attacks. Our [code](https://github.com/ddkang/advex-uar) includes a suite of attacks, adversarially trained models, and calibrations which allow UAR to be easily computed.
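The comparison behind UAR can be sketched in a few lines of Python. This is a rough illustration, not the authors' implementation. It assumes UAR is the model's accuracy summed over a set of calibrated distortion sizes, divided by the summed accuracy of adversarially trained models at the same sizes, and scaled to 100; the linked paper gives the exact definition. The function name `uar`, the choice of six sizes, and all accuracy values are hypothetical.

```python
def uar(model_accuracies, adv_trained_accuracies):
    """UAR-style score: 100 * sum(model accuracy) / sum(adversarially trained accuracy).

    Both lists hold accuracies under the same attack, one entry per
    calibrated distortion size, in the same order.
    """
    if len(model_accuracies) != len(adv_trained_accuracies):
        raise ValueError("need one accuracy per distortion size in both lists")
    denominator = sum(adv_trained_accuracies)
    if denominator == 0:
        raise ValueError("adversarially trained accuracies sum to zero")
    return 100.0 * sum(model_accuracies) / denominator


# Hypothetical accuracies (percent) at six distortion sizes; not real results.
adv_trained = [90.0, 85.0, 80.0, 70.0, 60.0, 50.0]  # models trained against this attack
evaluated = [80.0, 60.0, 40.0, 20.0, 10.0, 7.5]     # model that never saw this attack

score = uar(evaluated, adv_trained)
print(f"UAR-style score: {score:.1f}")
```

Under this reading, a score near 100 means the evaluated model holds up about as well as models trained with knowledge of the attack, which matches the interpretation given in the article.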
*If you’re interested in topics in AI Safety, consider [applying](https://openai.com/careers/) to work at OpenAI.*
