activation-space

Trait-space Monitoring for Emergent Misalignment During Supervised Finetuning

arXiv cs.LG · 2d ago

This paper proposes a trait-space monitoring method that detects emergent misalignment in LLMs during supervised finetuning by tracking representational drift in activation space. It reports a 0.990 AUROC with low false-positive and false-negative rates, outperforming unsupervised baselines.

STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations

Hugging Face Daily Papers · 2026-06-03

STRIDE is a new framework for training data attribution in LLMs that models functional effects in activation space using sparse recovery and steering operators. It reports state-of-the-art accuracy with a 13x speedup over previous methods.

UniSteer: Text-Guided Flow Matching in Activation Space for Versatile LLM Steering

Hugging Face Daily Papers · 2026-05-28

UniSteer introduces a text-guided activation flow matching method that learns a universal conditional velocity field in activation space, enabling versatile LLM behavior control and classification without task-specific intervention modules.

Can SAEs Capture Neural Geometry? (6 minute read)

TLDR AI · 2026-05-22

This article explores how sparse autoencoders (SAEs) can capture curved neural geometry, revealing three distinct ways SAE features represent manifolds, and presents an unsupervised pipeline to uncover geometric structure in neural representations.
