representation-control


SAE Interventions are Unreliable: Post-Intervention Recovery of Suppressed Behavior

arXiv cs.LG · 2026-06-18

This paper shows that interventions on Sparse Autoencoder (SAE) features can be unreliable: a suppressed behavior can recover through residual-space optimization even while the intervention remains active. The result exposes a gap between feature-level control and complete control over the behavior of language models.
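
To make "intervening on an SAE feature" concrete, here is a minimal sketch of one common form of the operation: encode an activation, zero a single feature, decode, and add back the SAE's reconstruction error. It is a toy illustration, not the paper's code. The function names, the ReLU encoder, and the layout of `w_enc` and `w_dec` (one row per feature) are assumptions made for the example.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sae_encode(x, w_enc, b_enc):
    """Feature activations: ReLU(W_enc x + b_enc). Assumed encoder form."""
    pre = [a + b for a, b in zip(matvec(w_enc, x), b_enc)]
    return [max(0.0, p) for p in pre]


def sae_decode(f, w_dec):
    """Reconstruction: sum_i f[i] * w_dec[i], one decoder row per feature."""
    dim = len(w_dec[0])
    return [sum(f[i] * w_dec[i][j] for i in range(len(f))) for j in range(dim)]


def suppress_feature(x, w_enc, b_enc, w_dec, idx):
    """Zero feature `idx` and return (patched activation, original activation of idx).

    The reconstruction error is added back so that everything the SAE
    does not capture passes through unchanged.
    """
    f = sae_encode(x, w_enc, b_enc)
    error = [a - b for a, b in zip(x, sae_decode(f, w_dec))]
    f_ablated = list(f)
    f_ablated[idx] = 0.0
    patched = [r + e for r, e in zip(sae_decode(f_ablated, w_dec), error)]
    return patched, f[idx]


# Hypothetical 2-dimensional activation with 3 SAE features.
w_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_enc = [0.0, 0.0, -10.0]
w_dec = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
patched, activation = suppress_feature([2.0, 3.0], w_enc, b_enc, w_dec, idx=0)
print(patched, activation)
```

In this sketch the patched activation equals the original minus `f[idx] * w_dec[idx]`, so the intervention changes only one decoder direction and leaves the rest of the residual space untouched.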


FineSteer: A Unified Framework for Fine-Grained Inference-Time Steering in Large Language Models

arXiv cs.CL · 2026-04-20

FineSteer is an inference-time steering framework that decomposes steering into two stages: conditional steering and fine-grained vector synthesis. It uses Subspace-guided Conditional Steering (SCS) and Mixture-of-Steering-Experts (MoSE) mechanisms to improve safety and truthfulness while preserving model utility. Experiments show a 7.6% improvement over state-of-the-art methods on TruthfulQA with minimal utility loss.
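
The two-stage idea (first decide whether to steer, then build the steering vector) can be illustrated with a toy sketch. This is not FineSteer's implementation. The gating rule (projection norm onto an orthonormal basis compared against a threshold), the softmax weighting over expert vectors, and every name and parameter below are assumptions chosen only to show the general shape of such a pipeline.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def subspace_score(h, basis):
    """Norm of the projection of h onto the span of an orthonormal basis."""
    return math.sqrt(sum(dot(h, u) ** 2 for u in basis))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def synthesize(h, experts):
    """Mix expert steering vectors with input-dependent softmax weights."""
    weights = softmax([dot(h, v) for v in experts])
    return [sum(w * v[j] for w, v in zip(weights, experts)) for j in range(len(h))]


def steer(h, basis, experts, threshold, alpha):
    """Stage 1: steer only if h lies sufficiently in the subspace.
    Stage 2: add the synthesized vector, scaled by alpha."""
    if subspace_score(h, basis) < threshold:
        return list(h)  # gate closed: hidden state left untouched
    vec = synthesize(h, experts)
    return [x + alpha * y for x, y in zip(h, vec)]


# Hypothetical 3-dimensional hidden states.
basis = [[1.0, 0.0, 0.0]]
experts = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(steer([0.1, 0.0, 0.0], basis, experts, threshold=0.5, alpha=2.0))
print(steer([2.0, 0.0, 0.0], basis, experts, threshold=0.5, alpha=2.0))
```

Gating before synthesis means inputs that fall outside the subspace pass through unmodified, which is one plausible way a conditional scheme could limit utility loss on benign inputs.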
