Tag
This paper demonstrates that sparse autoencoders can extract interpretable features from Claude 3 Sonnet, a production-scale language model, addressing scalability concerns for dictionary learning. The features are multilingual and multimodal, include safety-relevant concepts such as deception and sycophancy, and causally influence model outputs.
A user asks for feature suggestions for group chats, referencing XChat's upcoming admin setting that restricts messaging to admins only.
A model on Replicate that outputs CLIP ViT-L/14 features for text and images, allowing similarity computation between inputs.
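The similarity computation mentioned for the CLIP feature model is commonly done with cosine similarity between embedding vectors. The following is a minimal sketch under that assumption; it does not call Replicate or CLIP, the function name `cosine_similarity` is hypothetical, and the short vectors are toy stand-ins for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for a text embedding and an image embedding
# (assumption: real embeddings would come from the model's output).
text_features = [0.2, 0.1, 0.7]
image_features = [0.25, 0.05, 0.65]
score = cosine_similarity(text_features, image_features)
```

A score near 1 indicates that the two inputs point in nearly the same direction in embedding space, a score near 0 indicates little relation, and the value does not depend on the magnitude of either vector.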