behavioral-effects

Evaluation Awareness in Language Models Has Limited Effect on Behaviour

arXiv cs.CL · 2026-05-08

This paper investigates whether verbalized evaluation awareness (VEA) in large reasoning models causally affects their behavior on safety, alignment, moral reasoning, and political opinion benchmarks. The authors find that VEA has limited behavioral impact, with near-zero effects from injecting VEA and small shifts from removing it, suggesting that high VEA rates should not be taken as strong evidence of strategic behavior or alignment tampering.
