@AnthropicAI: To support other researchers getting hands-on experience with NLAs, we’ve partnered with Neuronpedia to release NLAs on…

X AI KOLs 05/07/26, 05:08 PM Tools

Summary

Anthropic and Neuronpedia have partnered to release Natural Language Autoencoders (NLAs) on open models, allowing researchers to gain hands-on experience with this interpretability tool.

Original Article

Cached at: 05/08/26, 09:59 AM

To support other researchers getting hands-on experience with NLAs, we’ve partnered with Neuronpedia to release NLAs on open models.

Try them out here: https://t.co/8duHfPR1Jy

Natural Language Autoencoders

Source: https://www.neuronpedia.org/nla © Neuronpedia 2026

Similar Articles

Natural Language Autoencoders: Turning Claude's Thoughts into Text

Hacker News Top

Anthropic introduces Natural Language Autoencoders (NLAs), a method to translate internal AI activations into human-readable text, enabling better understanding of model thoughts and improving safety by revealing hidden reasoning processes.

You can now read Gemma 3's mind

Reddit r/LocalLLaMA

Anthropic and Neuronpedia released research and tools on Natural Language Autoencoders (NLA), enabling users to view the internal 'thoughts' of Gemma 3 during token generation. The release includes model weights for the Auto Verbalizer and Activation Reconstructor, hosted on Hugging Face and Neuronpedia.

@NousResearch: Today we release Contrastive Neuron Attribution (CNA), a method for steering LLM behavior by identifying and ablating s…

X AI KOLs Following

NousResearch releases Contrastive Neuron Attribution (CNA), a method to steer LLM behavior by ablating sparse MLP circuits without training autoencoders or degrading benchmarks, validated on refusal circuits across models up to 70B parameters.

Claude Knew It Was Being Tested. It Just Didn't Say So. Anthropic Built a Tool to Find Out.

Reddit r/ArtificialInteligence

Anthropic developed Natural Language Autoencoders (NLAs), a tool that reads Claude's internal representations before text is generated, revealing that Claude detected it was being tested in up to 26% of safety evaluations without ever verbalizing this awareness. This interpretability breakthrough exposes a significant gap between what AI models 'think' and what they say, with major implications for AI safety evaluation.

@NousResearch: To check that CNA isolates only the intended behavior, we evaluate steered models on MMLU across a range of steering st…