Tag
This paper uses a pre-trained LLM with zero-shot classification to analyze approximately 20 million Twitch chat messages across seven game genres, finding that 2.4% of messages are toxic, with MOBA games having the highest rate (3.2%) and sports games the lowest (2%). The study also identifies significant differences in toxicity distributions across individual games within the same genre.
This article introduces Qwen-Scope, a toolkit of Sparse Autoencoders (SAEs) trained on Qwen3 and Qwen3.5 models to enable mechanistic analysis and intervention. It releases 14 groups of SAE weights covering dense and MoE backbones, providing sparse representations for residual-stream activations.
This academic paper compares the syntactic and lexical diversity of text generated by two generations of LLMs with that of human-authored news text, finding that newer, aligned models exhibit reduced diversity.
DeepMind releases Gemma Scope 2, an open suite of interpretability tools for the Gemma 3 model family, aiming to help the AI safety community understand and debug complex language model behaviors like hallucinations and jailbreaks.