claude-3-sonnet

Tag

Cards List
#claude-3-sonnet

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

arXiv cs.AI · 2026-05-29 Cached

This paper demonstrates that sparse autoencoders can extract interpretable features from Claude 3 Sonnet, a production-scale language model, addressing scalability concerns for dictionary learning. The features are multilingual, multimodal, and include safety-relevant concepts like deception and sycophancy, with causal influence on model outputs.

0 favorites 0 likes
← Back to home

Submit Feedback