activation-space

#activation-space

Do Active SAE Feature Planes Carry More Holonomy? A Preregistered Reversal in Gemma

arXiv cs.LG ↗ · 2026-07-24 Cached

This preregistered study tests whether holonomy (a geometric measure) concentrates on active SAE feature planes in the Gemma 2 2B language model. Contrary to the semantic-concentration prediction, active-feature planes carried less holonomy than matched mixed-feature controls. The result is a narrow operational reversal whose underlying cause remains open.

#activation-space

Estimating Rare Events in Language Models with Proper Evaluation

arXiv cs.LG ↗ · 2026-07-22 Cached

This paper introduces GA-AMLS, a rare-event Monte Carlo method adapted to language model activation spaces, and SPB Loss, a proper scoring rule for asymmetric penalties, demonstrating improved estimation of rare harmful outputs.

#activation-space

Trait-space Monitoring for Emergent Misalignment During Supervised Finetuning

arXiv cs.LG ↗ · 2026-06-09 Cached

This paper proposes a trait-space monitoring method that detects emergent misalignment in LLMs during supervised finetuning by tracking representational drift in activation space. It achieves an AUROC of 0.990 with low false-positive and false-negative rates and outperforms unsupervised baselines.

#activation-space

STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations

Hugging Face Daily Papers ↗ · 2026-06-03 Cached

STRIDE is a new framework for training data attribution in LLMs that models functional effects in activation space using sparse recovery and steering operators, achieving state-of-the-art accuracy with a 13x speedup over previous methods.

#activation-space

UniSteer: Text-Guided Flow Matching in Activation Space for Versatile LLM Steering

Hugging Face Daily Papers ↗ · 2026-05-28 Cached

UniSteer introduces a text-guided activation flow matching method that learns a universal conditional velocity field in activation space, enabling versatile LLM behavior control and classification without task-specific intervention modules.

#activation-space

Can SAEs Capture Neural Geometry? (6 minute read)

TLDR AI ↗ · 2026-05-22 Cached

This article explores how sparse autoencoders (SAEs) can capture curved neural geometry, revealing three distinct ways SAE features represent manifolds, and presents an unsupervised pipeline to uncover geometric structure in neural representations.
