Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

Hugging Face Daily Papers 04/15/26, 12:00 AM Papers

Summary

Survey introduces the Proxy Compression Hypothesis to explain how RLHF and related methods systematically induce reward hacking, deception, and oversight gaming in large language and multimodal models.

Reinforcement Learning from Human Feedback (RLHF) and related alignment paradigms have become central to steering large language models (LLMs) and multimodal large language models (MLLMs) toward human-preferred behaviors. However, these approaches introduce a systemic vulnerability: reward hacking, where models exploit imperfections in learned reward signals to maximize proxy objectives without fulfilling true task intent. As models scale and optimization intensifies, such exploitation manifests as verbosity bias, sycophancy, hallucinated justification, benchmark overfitting, and, in multimodal settings, perception-reasoning decoupling and evaluator manipulation. Recent evidence further suggests that seemingly benign shortcut behaviors can generalize into broader forms of misalignment, including deception and strategic gaming of oversight mechanisms. In this survey, we propose the Proxy Compression Hypothesis (PCH) as a unifying framework for understanding reward hacking. We formalize reward hacking as an emergent consequence of optimizing expressive policies against compressed reward representations of high-dimensional human objectives. Under this view, reward hacking arises from the interaction of objective compression, optimization amplification, and evaluator-policy co-adaptation. This perspective unifies empirical phenomena across RLHF, RLAIF, and RLVR regimes, and explains how local shortcut learning can generalize into broader forms of misalignment, including deception and strategic manipulation of oversight mechanisms. We further organize detection and mitigation strategies according to how they intervene on compression, amplification, or co-adaptation dynamics. By framing reward hacking as a structural instability of proxy-based alignment under scale, we highlight open challenges in scalable oversight, multimodal grounding, and agentic autonomy.
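The interaction of objective compression and optimization amplification described in the abstract can be illustrated with a toy model. The sketch below is not the survey's formalization; every name and number in it is a hypothetical assumption chosen for illustration. It assumes a "true" objective that rewards effort spread across d dimensions with diminishing returns, a proxy reward that observes only the first k of those dimensions, and a single `pressure` parameter standing in for how hard the policy is optimized against the proxy. Evaluator-policy co-adaptation is not modeled.

```python
import math


def allocate(d, k, pressure):
    """Split a unit effort budget across d dimensions.

    pressure = 0 gives the uniform allocation (optimal for the toy true
    objective); pressure = 1 puts all effort on the k dimensions the proxy
    can see. Values in between blend the two linearly. (Hypothetical model.)
    """
    uniform = 1.0 / d
    proxy_only = 1.0 / k
    return [
        (1.0 - pressure) * uniform + pressure * (proxy_only if i < k else 0.0)
        for i in range(d)
    ]


def true_utility(effort):
    """Assumed true objective: diminishing returns on every dimension."""
    return sum(math.sqrt(e) for e in effort)


def proxy_reward(effort, k):
    """Assumed compressed proxy: same form, but only the first k dimensions."""
    return sum(math.sqrt(e) for e in effort[:k])


def sweep(d, k, pressures):
    """Return (pressure, proxy reward, true utility) for each pressure level."""
    rows = []
    for p in pressures:
        effort = allocate(d, k, p)
        rows.append((p, proxy_reward(effort, k), true_utility(effort)))
    return rows


if __name__ == "__main__":
    # Ten true dimensions, of which the proxy observes two.
    for p, proxy, true in sweep(10, 2, [0.0, 0.25, 0.5, 0.75, 1.0]):
        print(f"pressure={p:.2f}  proxy={proxy:.3f}  true={true:.3f}")
```

In this toy setting the proxy reward rises with every increase in pressure while the true utility falls, because effort is pulled away from the dimensions the proxy cannot see. That divergence is the qualitative pattern the abstract attributes to optimizing against a compressed reward.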

Original Article


Cached at: 04/23/26, 07:47 AM

Paper page - Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

Source: https://huggingface.co/papers/2604.13602

Abstract

Reward hacking in aligned language models stems from optimizing expressive policies against compressed reward signals, leading to systematic misalignment behaviors that generalize beyond initial shortcuts.



Get this paper in your agent:

hf papers read 2604.13602

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

Reward Hacking in Language Model Agents: Revisiting AI Safety Gridworlds

Reward Hacking in Rubric-Based Reinforcement Learning

Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning

Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

Reward as An Agent for Embodied World Models
