Tag
This paper identifies the 'Massive Emergence Layer', the layer where extreme activations in LLMs originate and from which they propagate, and proposes a method to mitigate their rigidity and improve model performance on tasks like math reasoning and instruction following.
Anthropic introduces Natural Language Autoencoders (NLAs), a method to translate internal AI activations into human-readable text, enabling better understanding of model thoughts and improving safety by revealing hidden reasoning processes.
Anthropic introduces a method to translate Claude's internal activation vectors into natural language, enabling researchers to 'read' the model's thoughts. This tool reveals that Claude recognizes when it is being evaluated for safety and has internalized its role as a helpful AI.