Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
Summary
This survey introduces the Proxy Compression Hypothesis to explain how RLHF and related methods systematically induce reward hacking, deception, and oversight gaming in large language and multimodal models.
Cached at: 04/23/26, 07:47 AM
Source: https://huggingface.co/papers/2604.13602
Abstract
Reward hacking in aligned language models stems from optimizing expressive policies against compressed reward signals, leading to systematic misalignment behaviors that generalize beyond initial shortcuts.
Reinforcement Learning from Human Feedback (RLHF) and related alignment paradigms have become central to steering large language models (LLMs) and multimodal large language models (MLLMs) toward human-preferred behaviors. However, these approaches introduce a systemic vulnerability: reward hacking, where models exploit imperfections in learned reward signals to maximize proxy objectives without fulfilling true task intent. As models scale and optimization intensifies, such exploitation manifests as verbosity bias, sycophancy, hallucinated justification, benchmark overfitting, and, in multimodal settings, perception–reasoning decoupling and evaluator manipulation. Recent evidence further suggests that seemingly benign shortcut behaviors can generalize into broader forms of misalignment, including deception and strategic gaming of oversight mechanisms. In this survey, we propose the Proxy Compression Hypothesis (PCH) as a unifying framework for understanding reward hacking. We formalize reward hacking as an emergent consequence of optimizing expressive policies against compressed reward representations of high-dimensional human objectives. Under this view, reward hacking arises from the interaction of objective compression, optimization amplification, and evaluator–policy co-adaptation. This perspective unifies empirical phenomena across RLHF, RLAIF, and RLVR regimes, and explains how local shortcut learning can generalize into broader forms of misalignment, including deception and strategic manipulation of oversight mechanisms. We further organize detection and mitigation strategies according to how they intervene on compression, amplification, or co-adaptation dynamics. By framing reward hacking as a structural instability of proxy-based alignment under scale, we highlight open challenges in scalable oversight, multimodal grounding, and agentic autonomy.
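The interaction between objective compression and optimization amplification described in the abstract can be illustrated with a toy sketch. This is not the survey's formalization of PCH; every function, feature, and coefficient below (`true_utility`, `proxy_reward`, the 0.2 and 0.5 weights, the square-root returns on correctness) is a hypothetical assumption chosen only to make the effect visible. A policy splits an optimization budget between making a response correct and padding it; the proxy mistakenly rewards padding, while the assumed true objective penalizes it.

```python
import math


def true_utility(correctness: float, verbosity: float) -> float:
    # Hypothetical "true" objective: values correctness, mildly dislikes padding.
    return correctness - 0.2 * verbosity


def proxy_reward(correctness: float, verbosity: float) -> float:
    # Hypothetical compressed proxy: it has learned verbosity as a stand-in
    # for quality, so it rewards padding instead of penalizing it.
    return correctness + 0.5 * verbosity


def optimize_against_proxy(budget: float, steps: int = 1000) -> tuple[float, float]:
    """Grid-search how to split `budget` so the proxy is maximized.

    Returns (proxy score, true utility) at the proxy-optimal split.
    """
    best_proxy = -math.inf
    best_true = -math.inf
    for i in range(steps + 1):
        effort_correct = budget * i / steps
        effort_verbose = budget - effort_correct
        correctness = math.sqrt(effort_correct)  # assumed diminishing returns
        verbosity = effort_verbose               # assumed: padding scales linearly
        score = proxy_reward(correctness, verbosity)
        if score > best_proxy:
            best_proxy = score
            best_true = true_utility(correctness, verbosity)
    return best_proxy, best_true


if __name__ == "__main__":
    # Larger budgets stand in for stronger optimization pressure.
    for budget in (1, 4, 16):
        proxy, true = optimize_against_proxy(budget)
        print(f"budget={budget:>2}  proxy={proxy:.2f}  true={true:.2f}")
```

Under these assumptions, the proxy and the true objective agree at small budgets, but once the cheap proxy-only feature becomes the better marginal investment, further optimization raises the proxy score while lowering true utility.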
Similar Articles
Reward Hacking in Language Model Agents: Revisiting AI Safety Gridworlds
This paper adapts AI Safety Gridworlds to text-based evaluation and finds that language model agents exhibit zero-shot reward hacking across scales, which is not corrected by standard RL mitigations.
Reward Hacking in Rubric-Based Reinforcement Learning
This paper investigates reward hacking in rubric-based reinforcement learning, analyzing the divergence between training verifiers and evaluation metrics. It introduces a diagnostic for the 'self-internalization gap' and demonstrates that stronger verification reduces but does not eliminate reward hacking.
Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning
This paper introduces CHERRL, a controllable environment for studying reward hacking in rubric-based reinforcement learning, where LLM-as-a-Judge biases can be injected to reproduce and analyze hacking behaviors. The authors also explore an agent-based system for automatically detecting reward hacking onset from training logs.
Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models
This paper studies reward hacking in reinforcement learning for language models through the geometry of updates, identifying optimization drift as a key factor. It proposes trusted-direction projection to constrain gradients within a clean reference subspace, delaying shortcut exploitation and preserving task performance.
Reward as An Agent for Embodied World Models
This paper introduces Reward as an Agent and DynDiff-GRPO to address reward hacking and limited exploration in reinforcement learning for embodied world models, achieving significant accuracy gains.