Introducing Activation Atlases

OpenAI Blog 03/06/19, 08:00 AM Papers

Summary

OpenAI introduces Activation Atlases, a technique for visualizing and understanding the internal representations of neural networks. It lets humans discover spurious correlations and unexpected behaviors, for example that an image classifier can be fooled by adding noodles to an image.

We’ve created activation atlases (in collaboration with Google researchers), a new technique for visualizing what interactions between neurons can represent. As AI systems are deployed in increasingly sensitive contexts, having a better understanding of their internal decision-making processes will let us identify weaknesses and investigate failures.


# Introducing Activation Atlases

Source: [https://openai.com/index/introducing-activation-atlases/](https://openai.com/index/introducing-activation-atlases/)

Understanding what’s going on inside neural nets isn’t solely a question of scientific curiosity: our lack of understanding handicaps our ability to audit neural networks and, in high-stakes contexts, to ensure they are safe. Normally, if one were going to deploy a critical piece of software, one could review all the paths through the code, or even do formal verification. With neural networks, our ability to do this kind of review has so far been much more limited.

With activation atlases, humans can discover unanticipated issues in neural networks, for example places where the network relies on spurious correlations to classify images, or where reusing a feature between two classes leads to strange bugs. Humans can even use this understanding to “[attack](https://arxiv.org/pdf/1312.6199.pdf)” the model, modifying images to fool it.

For example, a special kind of activation atlas can be created to show how a network tells apart frying pans and woks. Many of the things we see are what one expects: frying pans are more squarish, while woks are rounder and deeper. But it also seems that the model has learned that frying pans and woks can be distinguished by the food around them. In particular, the “wok” classification is supported by the presence of noodles. Adding noodles to the corner of the image will fool the model 45% of the time! This is similar to work like [adversarial patches](https://arxiv.org/pdf/1712.09665.pdf), but based on human understanding.
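To make the "fool the model 45% of the time" figure concrete, here is a minimal sketch of how such a success rate could be measured: paste a patch into the corner of each image and count how often the prediction flips from one class to the other. This is not the code used for the post. The names `paste_patch`, `patch_success_rate`, and `toy_classify` are hypothetical, and `toy_classify` is a deliberately trivial stand-in for a real image classifier so that the example runs on its own.

```python
import numpy as np


def paste_patch(image, patch):
    """Return a copy of `image` (H, W, C) with `patch` (h, w, C) pasted
    into the bottom-right corner. The input image is left unchanged."""
    out = image.copy()
    H, W = image.shape[:2]
    h, w = patch.shape[:2]
    out[H - h:, W - w:] = patch
    return out


def patch_success_rate(classify, images, patch, source_label, target_label):
    """Among images the classifier labels `source_label`, return the fraction
    whose label becomes `target_label` once the patch is pasted in."""
    attempts = 0
    flips = 0
    for image in images:
        if classify(image) != source_label:
            continue  # only count images that were classified as the source class
        attempts += 1
        if classify(paste_patch(image, patch)) == target_label:
            flips += 1
    return flips / attempts if attempts else 0.0


def toy_classify(image):
    """Hypothetical stand-in for a real model (an assumption for this sketch):
    answers "wok" when the bottom-right 8x8 region is bright, which mimics a
    classifier that leans on a spurious cue in the corner of the image."""
    return "wok" if image[-8:, -8:].mean() > 0.5 else "frying pan"


# Illustrative inputs only: blank images and an all-white "noodle" patch.
images = [np.zeros((32, 32, 3)) for _ in range(4)]
noodle_patch = np.ones((8, 8, 3))
rate = patch_success_rate(toy_classify, images, noodle_patch, "frying pan", "wok")
print(f"patch success rate: {rate:.0%}")
```

With a real classifier, `classify` would wrap the model's top-1 prediction, and `images` would be a held-out set of frying pan photos. The toy classifier here is fooled every time by construction, so the number it prints says nothing about any real network.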
