This paper proposes a trait-space monitoring method that detects emergent misalignment in LLMs during supervised finetuning by tracking representational drift in activation space. It achieves an AUROC of 0.990 with low false-positive and false-negative rates, outperforming unsupervised baselines.
STRIDE is a new training data attribution framework for LLMs that models functional effects in activation space using sparse recovery and steering operators, achieving state-of-the-art accuracy with a 13x speedup over previous methods.
UniSteer introduces a text-guided activation flow matching method that learns a universal conditional velocity field in activation space, enabling versatile LLM behavior control and classification without task-specific intervention modules.
This article explores how sparse autoencoders (SAEs) can capture curved neural geometry, revealing three distinct ways SAE features represent manifolds, and presents an unsupervised pipeline to uncover geometric structure in neural representations.