activation-patching

#activation-patching

Negative Before Positive: Asymmetric Valence Processing in Large Language Models

arXiv cs.CL · yesterday

This paper uses mechanistic interpretability to investigate how large language models process emotional valence. Applying activation patching and steering to three open-source LLMs, the authors find that negative valence is localized to early layers while positive valence peaks in mid-to-late layers, and they validate this finding with topic-controlled flip tests.


Hallucination as Trajectory Commitment: Causal Evidence for Asymmetric Attractor Dynamics in Transformer Generation

arXiv cs.CL · 2026-04-20

This paper presents causal evidence that hallucination in autoregressive language models results from early trajectory commitment governed by asymmetric attractor dynamics. Using same-prompt bifurcation and activation patching experiments on Qwen2.5-1.5B, the authors show that hallucinated trajectories diverge at the first token and exhibit strong causal asymmetry across model layers.
