Tag
This paper analyzes linear activation steering in language models by decomposing each intervention into an angular component and a radial component. It finds that concepts are encoded primarily in angular structure, while norm adjustments are crucial for stability. This supports spherical steering methods and shows that a single additive coefficient conflates the angular and radial effects.
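The angular/radial split described above can be illustrated with a small sketch. This is only one plausible reading, not the paper's implementation: the function names `decompose_steering` and `spherical_steer`, the toy vectors, and the choice to model spherical steering as a norm-preserving rescaling are all assumptions made for illustration.

```python
import numpy as np

def decompose_steering(h, v, alpha):
    """Split the additive update h + alpha * v into an angle and a norm ratio.

    Hypothetical helper: returns (angle in radians between h and the steered
    vector, ratio of the steered norm to the original norm).
    """
    h_new = h + alpha * v
    norm_old = np.linalg.norm(h)
    norm_new = np.linalg.norm(h_new)
    cos = np.dot(h, h_new) / (norm_old * norm_new)
    # Clip guards against tiny floating-point overshoot outside [-1, 1].
    angle = float(np.arccos(np.clip(cos, -1.0, 1.0)))
    radial_scale = float(norm_new / norm_old)
    return angle, radial_scale

def spherical_steer(h, v, alpha):
    """Keep only the angular part: point along h + alpha * v, keep ||h||.

    Assumed stand-in for a spherical steering method.
    """
    h_new = h + alpha * v
    return h_new * (np.linalg.norm(h) / np.linalg.norm(h_new))

# Toy example (assumed values): one coefficient changes both angle and norm.
h = np.array([3.0, 0.0])
v = np.array([0.0, 1.0])
angle, scale = decompose_steering(h, v, alpha=3.0)
steered = spherical_steer(h, v, alpha=3.0)
```

In this toy case the single coefficient `alpha` both rotates the activation and lengthens it, which is the conflation the summary refers to; the spherical variant applies the same rotation while leaving the norm unchanged.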
This paper systematically tests linear probes for deception detection in large language models, finding they fail under distributional shifts but style-augmented probes recover performance, and revealing that deception is encoded through distributed sub-threshold features.
This piece argues that the mathematics used in AI is mainly linear algebra, calculus, and similar tools dating from before the 19th century, whereas emerging phenomena such as scaling laws, emergent abilities, double descent, in-context learning, and representation geometry still lack a mathematical explanation. It draws an analogy to the "clouds" over physics in 1900 and suggests that these open questions may drive the development of 21st-century mathematics.
This paper introduces the concept of 'minimal cores' in overcomplete reasoning traces, showing that on average 46% of steps can be removed while preserving the final answer, and that minimal cores improve trace separation and reduce intrinsic dimensionality.
This paper introduces Minor Component Unlearning (MCU), a novel approach to LLM unlearning that targets minor components in representations to resist relearning attacks. It addresses the vulnerability of existing methods by focusing on robust directions within the model's spectral structure.