@AnthropicAI: To support other researchers getting hands-on experience with NLAs, we’ve partnered with Neuronpedia to release NLAs on…
Summary
Anthropic and Neuronpedia have partnered to release Natural Language Autoencoders (NLAs) on open models, allowing researchers to gain hands-on experience with this interpretability tool.
Cached at: 05/08/26, 09:59 AM
To support other researchers getting hands-on experience with NLAs, we’ve partnered with Neuronpedia to release NLAs on open models.
Try them out here: https://t.co/8duHfPR1Jy
Natural Language Autoencoders
Source: https://www.neuronpedia.org/nla © Neuronpedia 2026