Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

Hugging Face Daily Papers 05/24/26, 12:00 AM Papers

reward-hacking reinforcement-learning language-models alignment optimization gradients shortcuts

Summary

This paper studies reward hacking in reinforcement learning for language models through the geometry of updates, identifying optimization drift as a key factor. It proposes trusted-direction projection to constrain gradients within a clean reference subspace, delaying shortcut exploitation and preserving task performance.

Reward hacking arises when a model improves a proxy reward by exploiting shortcuts rather than solving the intended task. We study this failure mode through the geometry of reinforcement learning updates in language models and argue that hacking emerges when optimization drifts away from a stable low-dimensional learning trajectory. We analyze this drift through dominant singular directions of parameter updates and show that reward-hacking runs exhibit substantially larger directional change than clean runs. Motivated by this observation, we introduce trusted-direction projection, which constrains gradients to remain within a clean reference subspace. Across reward-hacking experiments on mathematical reasoning, the proposed approach delays shortcut exploitation and better preserves task performance.

Original Article

View Cached Full Text

Cached at: 05/26/26, 06:45 PM

Paper page - Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

Source: https://huggingface.co/papers/2605.25189

Abstract

Research examines reward hacking in language models through reinforcement learning update geometry, identifying optimization drift from stable trajectories and proposing trusted-direction projection to constrain gradients and delay shortcut exploitation.

Reward hackingarises when a model improves a proxy reward by exploiting shortcuts rather than solving the intended task. We study this failure mode through the geometry ofreinforcement learning updatesinlanguage modelsand argue that hacking emerges whenoptimization drifts away from astable low-dimensional learning trajectory. We analyze this drift through dominantsingular directionsofparameter updatesand show that reward-hacking runs exhibit substantially larger directional change than clean runs. Motivated by this observation, we introducetrusted-direction projection, which constrains gradients to remain within a clean reference subspace. Across reward-hacking experiments on mathematical reasoning, the proposed approach delaysshortcut exploitationand better preserves task performance.

View arXiv page View PDF Add to collection

Get this paper in your agent:

hf papers read 2605\.25189

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2605.25189 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2605.25189 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2605.25189 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

Paper page - Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

Abstract

Models citing this paper0

Datasets citing this paper0

Spaces citing this paper0

Collections including this paper0

Similar Articles

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

Our approach to alignment research

Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

Parameter Alignment Mitigates Catastrophic Forgetting in Multilingual Expert Language Models

Submit Feedback

Similar Articles

Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

Our approach to alignment research

Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

Parameter Alignment Mitigates Catastrophic Forgetting in Multilingual Expert Language Models