Extracting Concepts from GPT-4

OpenAI Blog 06/06/24, 12:00 AM Papers

Summary

OpenAI uses sparse autoencoders to extract and interpret concepts ("features") from large language models such as GPT-4, addressing the fundamental challenge of understanding what happens inside neural networks. It releases a research paper, code, and a feature visualizer to help researchers train autoencoders at scale and, over time, improve AI safety through better interpretability.

Using new techniques for scaling sparse autoencoders, we automatically identified 16 million patterns in GPT-4's computations.

Original Article


Cached at: 04/20/26, 02:47 PM

# Extracting Concepts from GPT-4

Source: [https://openai.com/index/extracting-concepts-from-gpt-4/](https://openai.com/index/extracting-concepts-from-gpt-4/)

Unlike with most human creations, we don’t really understand the inner workings of neural networks. For example, engineers can directly design, assess, and fix cars based on the specifications of their components, ensuring safety and performance. However, neural networks are not designed directly; we instead design the algorithms that train them. The resulting networks are not well understood and cannot be easily decomposed into identifiable parts. This means we cannot reason about AI safety the same way we reason about something like car safety.

In order to understand and interpret neural networks, we first need to find useful building blocks for neural computations. Unfortunately, the neural activations inside a language model activate with unpredictable patterns, seemingly representing many concepts simultaneously. They also activate densely, meaning each activation is always firing on each input. But real world concepts are very sparse: in any given context, only a small fraction of all concepts are relevant. This motivates the use of sparse autoencoders, a method for identifying a handful of "features" in the neural network that are important to producing any given output, akin to the small set of concepts a person might have in mind when reasoning about a situation. Their features display sparse activation patterns that naturally align with concepts easy for humans to understand, even without direct incentives for interpretability.

While sparse autoencoder research is exciting, there is a long road ahead with many unresolved challenges. In the short term, we hope the features we've found can be practically useful for monitoring and steering language model behaviors and plan to test this in our frontier models. Ultimately, we hope that one day, interpretability can provide us with new ways to reason about model safety and robustness, and significantly increase our trust in powerful AI models by giving strong assurances about their behavior.

Today, we are sharing a [paper](https://arxiv.org/abs/2406.04093) detailing our experiments and methods, which we hope will make it easier for researchers to train autoencoders at scale. We are releasing a full suite of autoencoders for GPT‑2 small, along with [code](https://github.com/openai/sparse_autoencoder) for using them, and [the feature visualizer](https://openaipublic.blob.core.windows.net/sparse-autoencoder/sae-viewer/index.html) to get a sense of what the GPT‑2 and GPT‑4 features may correspond to.
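To make the idea of sparse features concrete, here is a minimal sketch of a sparse autoencoder forward pass in Python with NumPy. It is an illustration under stated assumptions, not the released code: the article does not describe the architecture, so the single-layer encoder and decoder, the top-k rule used to enforce sparsity, and every name (`topk_sae_forward`, `W_enc`, `W_dec`, `b_enc`, `b_dec`, `k`) are hypothetical. The weights are random rather than trained, so the reconstruction is not meaningful; the point is only that a dense activation vector is mapped to a much larger feature vector in which at most `k` entries are nonzero.

```python
import numpy as np


def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Map a dense activation vector to sparse features, then reconstruct it.

    Assumed design (not taken from the article): a linear encoder with ReLU,
    keeping only the k largest feature activations, and a linear decoder.
    """
    # Dense pre-activations over all candidate features.
    pre = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)
    # Indices of the k largest entries; every other feature is set to zero.
    top = np.argpartition(pre, -k)[-k:]
    features = np.zeros_like(pre)
    features[top] = pre[top]
    # Reconstruct the original activation from the few active features.
    recon = features @ W_dec + b_dec
    return features, recon


# Toy sizes chosen for illustration only.
rng = np.random.default_rng(0)
d_model, n_features, k = 16, 64, 4

W_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
W_dec = W_enc.T.copy()          # untrained; tied to the encoder at initialization
b_enc = np.zeros(n_features)
b_dec = np.zeros(d_model)

x = rng.normal(size=d_model)    # stand-in for one model activation vector
features, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)

active = int(np.count_nonzero(features))
print(f"{active} of {n_features} features active")
```

In a trained autoencoder, the weights would be fit to minimize the reconstruction error between `x` and `recon`, and each of the few active features would ideally correspond to a human-recognizable concept.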

Similar Articles

OpenAI’s technology explained

I built a tool that shows you what GPT-2 is "thinking" in real-time as it generates 3D graph of concept activations per token [R]

At the Edge of Understanding: Sparse Autoencoders Trace The Limits of Transformer Generalization

A Geometric View for Understanding Concept Learning and Neuron Interpretation in Sparse Autoencoders

Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
