attention-heads

#attention-heads

MechRL: Reinforcement Learning Agents Perform Circuit Discovery for Mechanistic Interpretability

arXiv cs.LG · 2026-05-27 Cached

Proposes MechRL, a reinforcement learning approach to automate circuit discovery in transformer language models. A PPO agent trained on multiple tasks discovers attention head circuits that match known canonical circuits and generalizes to a held-out task.

Spectral Probe-Circuits: A Three-Step Recipe for Identifying Attention-Head Circuits in Pretrained Transformers

arXiv cs.LG · 2026-05-26 Cached

Introduces a label-free, three-step recipe for identifying attention-head circuits in pretrained transformers using a spectral signal and a task-pattern screen, validated on models from 51M to 1B parameters across multiple architectures.

From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment

arXiv cs.LG · 2026-05-22 Cached

P2D is a unified framework that uses task-sensitive attention heads for both data selection and structural pruning, achieving an 8.3 percentage-point performance gain and a 7.0× speedup by updating only 10% of heads on 10% of the data.

Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics

arXiv cs.LG · 2026-05-21 Cached

This paper investigates how weight decay acts as a control parameter for the transition between memorization and generalization in transformers trained on modular arithmetic, and introduces two cheap online diagnostic metrics, computed from attention activations, that track these dynamics.

The Point of No Return: Counterfactual Localization of Deceptive Commitment in Language-Model Reasoning

arXiv cs.CL · 2026-05-19 Cached

Introduces counterfactual localization to identify when language models become committed to deception during reasoning, using five environments and a corpus of 1.46M sentences across four reasoning models. Shows that attention-based transition features generalize across environments for detecting deceptive commitment.

Language-Switching Triggers Take a Latent Detour Through Language Models

Hugging Face Daily Papers · 2026-05-18 Cached

This paper identifies a circuit underlying a language-switching backdoor in an 8B-parameter language model: a three-word Latin trigger redirects English output to French via attention heads and orthogonal latent subspaces, and the final-layer MLP converts the latent signal into French logits.

Mechanisms of Prompt-Induced Hallucination in Vision-Language Models

arXiv cs.CL · 2026-04-20 Cached

This paper investigates prompt-induced hallucinations in vision-language models through mechanistic analysis, identifying specific attention heads responsible for the models' tendency to favor textual prompts over visual evidence. The authors demonstrate that ablating these PIH-heads reduces hallucinations by at least 40% without additional training, revealing model-specific mechanisms underlying this failure mode.
