This paper introduces a general formulation of non-linear interventions for large language models, extending beyond the Linear Representation Hypothesis to manipulate features encoded along non-linear manifolds, and validates the approach on refusal-bypass steering.
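To make the linear-vs-non-linear distinction concrete, here is a minimal toy sketch (not the paper's method): a feature encoded as an angle on a 2-d circle in activation space, where a non-linear intervention reparameterizes the activation into the manifold coordinate, shifts it, and decodes back, so the edited point stays on the manifold. The `encode`/`decode` functions and the circular manifold are illustrative assumptions.

```python
import numpy as np

def encode(h):
    # Activation -> manifold coordinate (angle on the unit circle).
    return np.arctan2(h[1], h[0])

def decode(theta):
    # Manifold coordinate -> activation on the circle.
    return np.array([np.cos(theta), np.sin(theta)])

def nonlinear_intervene(h, delta):
    # Shift the feature along the manifold by `delta` radians,
    # rather than adding a fixed direction vector (the linear case).
    return decode(encode(h) + delta)

h = decode(0.3)                       # a point encoding the feature
h_edit = nonlinear_intervene(h, 0.5)  # edited point, still on the circle
```

A purely linear edit (adding a constant vector) would generally move `h` off the circle; the non-linear edit preserves the manifold constraint by construction.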
This paper investigates how large language models process emotional valence through mechanistic interpretability. Using activation patching and steering on three open-source LLMs, the authors find that negative valence is localized to early layers while positive valence peaks in mid-to-late layers, and they validate this through topic-controlled flip tests.
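Activation steering, one of the techniques the summary mentions, can be sketched in a few lines: a concept direction (e.g. a hypothetical "positive valence" vector, often obtained as a mean difference over contrastive prompts) is added, scaled, to a layer's hidden state. All names and dimensions below are illustrative assumptions, not details from the paper.

```python
import numpy as np

def steer(hidden, direction, alpha):
    # Add a scaled, unit-normalized steering vector to a hidden state.
    d = direction / np.linalg.norm(direction)
    return hidden + alpha * d

# Toy 4-d residual stream with a hypothetical valence direction.
valence_dir = np.array([1.0, 0.0, 0.0, 0.0])
h = np.array([0.2, 0.5, -0.1, 0.3])
h_steered = steer(h, valence_dir, alpha=2.0)
```

In practice this addition is applied at a chosen layer via a forward hook during generation; varying `alpha` (including negative values) tests whether the direction causally shifts the model's expressed valence.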