Language models can explain neurons in language models

OpenAI Blog Papers

Summary

OpenAI proposes using a language model (GPT-4) to automatically generate and score explanations for neurons in other language models, and open-sources datasets and tools covering all 307,200 neurons in GPT-2. The approach scales to every neuron and can be improved iteratively, though GPT-4's explanations still score worse than those written by humans.

We use GPT-4 to automatically write explanations for the behavior of neurons in large language models and to score those explanations. We release a dataset of these (imperfect) explanations and scores for every neuron in GPT-2.

Cached at: 04/20/26, 02:57 PM

# Language models can explain neurons in language models

Source: [https://openai.com/index/language-models-can-explain-neurons-in-language-models/](https://openai.com/index/language-models-can-explain-neurons-in-language-models/)

Although the vast majority of our explanations score poorly, we believe we can now use ML techniques to further improve our ability to produce explanations. For example, we found we were able to improve scores by:

- *Iterating on explanations.* We can increase scores by asking GPT-4 to come up with possible counterexamples, then revising explanations in light of their activations.
- *Using larger models to give explanations.* The average score goes up as the explainer model's capabilities increase. However, even GPT-4 gives worse explanations than humans, suggesting room for improvement.
- *Changing the architecture of the explained model.* Training models with different activation functions improved explanation scores.

We are open-sourcing our datasets and visualization tools for GPT-4-written explanations of all 307,200 neurons in GPT-2, as well as code for explanation and scoring [using publicly available models](https://github.com/openai/automated-interpretability) on the OpenAI API. We hope the research community will develop new techniques for generating higher-scoring explanations and better tools for exploring GPT-2 using explanations.

We found over 1,000 neurons with explanations that scored at least 0.8, meaning that according to GPT-4 they account for most of the neuron's top-activating behavior. Most of these well-explained neurons are not very interesting. However, we also found many interesting neurons that GPT-4 didn't understand. We hope as explanations improve we may be able to rapidly uncover interesting qualitative understanding of model computations.
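
The text above does not spell out how an explanation's score is computed. As a rough illustration only, the sketch below assumes the score is the Pearson correlation between the activations an explanation predicts for each token ("simulated" activations) and the neuron's real activations, and then filters neurons at the 0.8 threshold mentioned above. The function names `correlation_score` and `well_explained`, the neuron identifiers, and the sample numbers are all hypothetical and are not taken from the released code.

```python
import math
from typing import Dict, List, Sequence


def correlation_score(simulated: Sequence[float], actual: Sequence[float]) -> float:
    """Pearson correlation between simulated and real per-token activations.

    Assumption: this stands in for the explanation score; the actual
    scoring code may differ in detail.
    """
    if len(simulated) != len(actual) or len(actual) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    n = len(actual)
    mean_s = sum(simulated) / n
    mean_a = sum(actual) / n
    cov = sum((s - mean_s) * (a - mean_a) for s, a in zip(simulated, actual))
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_s == 0 or var_a == 0:
        # A constant prediction (or a constant neuron) carries no signal.
        return 0.0
    return cov / math.sqrt(var_s * var_a)


def well_explained(scores: Dict[str, float], threshold: float = 0.8) -> List[str]:
    """Return the ids of neurons whose score meets the threshold, sorted."""
    return sorted(nid for nid, score in scores.items() if score >= threshold)


if __name__ == "__main__":
    # Hypothetical activations for one neuron over five tokens.
    real = [0.0, 0.1, 3.2, 0.0, 2.9]
    predicted = [0.0, 0.0, 3.0, 0.5, 3.0]
    print(round(correlation_score(predicted, real), 3))
    print(well_explained({"layer0/n1": 0.91, "layer0/n2": 0.12}))
```

Under this assumption, a score near 1 means the explanation predicts where the neuron fires almost perfectly, while a score near 0 means the explanation says little about the neuron's behavior.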

Similar Articles

OpenAI’s technology explained

OpenAI Blog

OpenAI publishes an explainer on its core technology, detailing how language models like GPT-4 are developed through pre-training (learning from vast text data) and post-training (alignment with human values and safety practices). The article emphasizes OpenAI's nonprofit mission structure and explains the distinction between raw base models and refined, usable versions.

Better language models and their implications

OpenAI Blog

OpenAI introduces GPT-2, a 1.5 billion parameter transformer-based language model trained on 40GB of internet text that achieves state-of-the-art performance on language modeling benchmarks and demonstrates zero-shot capabilities in reading comprehension, translation, question answering, and summarization. Due to safety concerns, only a smaller model and technical paper are released publicly rather than the full trained model.

Can Language Model Agents be Helpful Circuit Explainers in Mechanistic Interpretability?

arXiv cs.AI

This paper investigates whether language model agents can automate the explanation phase of mechanistic interpretability by introducing AgenticInterpBench, a benchmark with 84 semi-synthetic circuits, and HyVE, an agentic explainer that iteratively hypothesizes, validates, and explains circuit components. Experiments show promise but identify reliable validation as a key obstacle.

Language models are few-shot learners

OpenAI Blog

OpenAI introduces GPT-3, a 175-billion parameter autoregressive language model that demonstrates strong few-shot learning capabilities across diverse NLP tasks without gradient updates or fine-tuning, representing a paradigm shift in how language models can be applied to new tasks through text interactions alone.

Extracting Concepts from GPT-4

OpenAI Blog

OpenAI introduces sparse autoencoders as a method to extract and interpret concepts from large language models like GPT-4, addressing the fundamental challenge of understanding neural network behavior. They release a research paper, code, and feature visualization tools to help researchers train autoencoders at scale and improve AI safety through better interpretability.