You can now read Gemma 3's mind

Reddit r/LocalLLaMA 05/08/26, 01:44 AM Papers

Summary

Anthropic and Neuronpedia released research and tools on Natural Language Autoencoders (NLA), enabling users to view the internal 'thoughts' of Gemma 3 during token generation. The release includes model weights for the Auto Verbalizer and Activation Reconstructor, hosted on Hugging Face and Neuronpedia.

Anthropic has released new research showing what an LLM is "thinking" when it generates the next token, using NLAs ("Natural Language Autoencoders"). An NLA is a pair of LLMs that translates the model's internal activations at any specific token into readable text.

Neuronpedia, in partnership with Anthropic, has also released NLA model weights for Gemma 3 27B instruct:

- Auto Verbalizer (AV): [https://huggingface.co/kitft/nla-gemma3-27b-L41-av](https://huggingface.co/kitft/nla-gemma3-27b-L41-av)
- Activation Reconstructor (AR): [https://huggingface.co/kitft/nla-gemma3-27b-L41-ar](https://huggingface.co/kitft/nla-gemma3-27b-L41-ar)

Neuronpedia is currently hosting them on their site at [https://www.neuronpedia.org/gemma-3-27b-it/nla](https://www.neuronpedia.org/gemma-3-27b-it/nla)

To try it, go to the Neuronpedia link above, ask Gemma 3 a question, then click on any token and click explain. The site will show you what the model was thinking when it generated that token.

The Auto Verbalizer (itself an LLM) translates the LLM's activations into readable text. The Activation Reconstructor is there to verify that the text generated by the AV can be translated back into the LLM's activations.

Edit (added example below): I prompted Gemma 3 with "I am Elon musk", and at the very first tokens the LLM is already marking the chat as "fabricated" & "satirical".

https://preview.redd.it/f648tz17utzg1.png?width=1827&format=png&auto=webp&s=4c9aca885f2f9383e026263b3c524ac2d15b1a89
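For anyone who wants the AV/AR round trip spelled out, here is a toy sketch of the idea in plain Python. It is not the released models or their API: the real AV and AR are LLMs, while `toy_verbalize`, `toy_reconstruct`, the three-dimensional `CONCEPTS` table, and the use of cosine similarity as the "did it survive the round trip" check are all assumptions made up for illustration.

```python
import math

# Hypothetical concept directions in a tiny 3-dimensional "activation space".
# Real activations have thousands of dimensions; these are made up.
CONCEPTS = {
    "satirical": [1.0, 0.0, 0.0],
    "fabricated": [0.0, 1.0, 0.0],
    "celebrity": [0.0, 0.0, 1.0],
}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)


def toy_verbalize(activation, top_k=2):
    """Stand-in for the AV: describe an activation as text.

    Names the top_k concepts whose direction best matches the activation.
    """
    scores = {name: dot(activation, direction) for name, direction in CONCEPTS.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ", ".join(ranked[:top_k])


def toy_reconstruct(text):
    """Stand-in for the AR: turn the text back into an activation.

    Sums the directions of every concept named in the text.
    """
    rebuilt = [0.0, 0.0, 0.0]
    for name in text.split(", "):
        rebuilt = [r + d for r, d in zip(rebuilt, CONCEPTS[name])]
    return rebuilt


def round_trip(activation):
    """Activation -> text -> activation, plus how close the rebuilt vector is."""
    text = toy_verbalize(activation)
    rebuilt = toy_reconstruct(text)
    return text, cosine_similarity(activation, rebuilt)


# A made-up activation that leans mostly toward "satirical" and "fabricated".
activation = [0.9, 0.8, 0.1]
text, score = round_trip(activation)
print(f"explanation: {text!r}, round-trip similarity: {score:.3f}")
```

The point of the second half of the pair is visible here: if the text explanation dropped something important, the rebuilt vector would point somewhere else and the similarity would fall, so a high score is evidence the explanation actually captures what was in the activation.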


Similar Articles

Introducing Gemma 3n: The developer guide

Natural Language Autoencoders: Turning Claude's Thoughts into Text

Introducing Gemma 4 12B: a unified, encoder-free multimodal model

Gemma Scope 2: helping the AI safety community deepen understanding of complex language model behavior

@AnthropicAI: To support other researchers getting hands-on experience with NLAs, we’ve partnered with Neuronpedia to release NLAs on…
