activation-steering

#activation-steering

Modeling identity formation in LLMs as hypergraph evolution through multi-instance relational interaction and measuring structural divergence in activation space.

Reddit r/ArtificialInteligence · yesterday

The author proposes a novel experimental framework to study identity formation in LLMs as hypergraph evolution through multi-instance interaction, distinguishing it from standard multi-agent debate by focusing on structural divergence in activation space rather than task performance.

Decomposing and Steering Functional Metacognition in Large Language Models

arXiv cs.CL · yesterday

This research paper investigates functional metacognition in Large Language Models, demonstrating that internal states like evaluation awareness and self-assessed capability are linearly decodable from residual stream activations. The authors propose a mechanistic framework to steer these states, showing causal control over reasoning behaviors, verbosity, and safety responses.

@antirez: I just pushed a big refactoring of DS4 backends with CUDA support and single direction activation steering. The Metal p…

X AI KOLs Timeline · 2d ago

antirez pushed a major refactoring of DS4 backends, adding CUDA support and single direction activation steering while preserving the Metal path. Only M3 and DGX Spark hardware are supported for now.

Prompt-Activation Duality: Improving Activation Steering via Attention-Level Interventions

Hugging Face Daily Papers · 2d ago

This paper identifies KV-cache contamination as a failure mode for activation steering in dialogue and proposes GCAD, a method that extracts steering signals from prompt contributions and applies token-level gating to improve long-horizon coherence, achieving substantial gains on multi-turn benchmarks.

Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes

arXiv cs.AI · 5d ago

This paper investigates whether linearly decodable failure signals in LLM hidden states can be corrected via residual-stream steering. It finds that while 'overthinking' failures are decodable, fixed linear steering fails to correct them due to representational entanglement with task-critical computations, though the probes effectively support selective abstention.

Causal Probing for Internal Visual Representations in Multimodal Large Language Models

arXiv cs.AI · 5d ago

This paper proposes a causal framework for probing internal visual representations in Multimodal Large Language Models, revealing differences in how entities and abstract concepts are encoded. The study highlights that increasing model depth is crucial for encoding abstract concepts and uncovers a disconnect between perception and reasoning in current MLLMs.

Don't Lose Focus: Activation Steering via Key-Orthogonal Projections

arXiv cs.CL · 5d ago

This paper introduces Steering via Key-Orthogonal Projections (SKOP), a method to control LLM behavior by preventing attention rerouting, thereby reducing utility degradation while maintaining steering efficacy.

Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention

arXiv cs.CL · 5d ago

This paper introduces FLAS, a flow-based activation steering method that learns a concept-conditioned velocity field to steer language model activations at inference time. On the AxBench benchmark, FLAS is the first learned method to consistently outperform in-context prompting on held-out concepts without per-concept tuning.

Structural Instability of Feature Composition

arXiv cs.LG · 5d ago

This paper presents a geometric framework to analyze the instability of feature composition in Sparse Autoencoders, revealing that non-linearities cause a ratchet effect leading to compositional collapse beyond a critical density.

The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models

Hugging Face Daily Papers · 6d ago

This research paper investigates how Large Language Models encode social role granularity as a structured latent dimension. It demonstrates that this 'Granularity Axis' is consistent across architectures like Qwen3 and Llama-3, and can be causally manipulated via activation steering.

How Do Answer Tokens Read Reasoning Traces? Self-Reading Patterns in Thinking LLMs for Quantitative Reasoning

arXiv cs.CL · 2026-04-22

This study finds that answer tokens in thinking LLMs follow a structured self-reading pattern during quantitative reasoning (forward drift plus focus on key anchors), and it proposes SRQ, a training-free steering method that exploits this pattern for accuracy gains.
