Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

Hugging Face Daily Papers

Summary

This paper studies reward hacking in reinforcement learning for language models through the geometry of updates, identifying optimization drift as a key factor. It proposes trusted-direction projection to constrain gradients within a clean reference subspace, delaying shortcut exploitation and preserving task performance.

Reward hacking arises when a model improves a proxy reward by exploiting shortcuts rather than solving the intended task. We study this failure mode through the geometry of reinforcement learning updates in language models and argue that hacking emerges when optimization drifts away from a stable low-dimensional learning trajectory. We analyze this drift through dominant singular directions of parameter updates and show that reward-hacking runs exhibit substantially larger directional change than clean runs. Motivated by this observation, we introduce trusted-direction projection, which constrains gradients to remain within a clean reference subspace. Across reward-hacking experiments on mathematical reasoning, the proposed approach delays shortcut exploitation and better preserves task performance.
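One way to make "directional change" between dominant singular directions concrete is sketched below. This is an illustrative assumption, not the paper's stated metric: each parameter update is treated as a small matrix, its top right singular vector is estimated by power iteration, and two updates are compared by one minus the absolute cosine between those vectors. The function names (`top_right_singular_vector`, `directional_change`) and the toy matrices are hypothetical.

```python
import math
import random


def matvec(a, v):
    """Compute A @ v for a matrix stored as a list of rows."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def rmatvec(a, u):
    """Compute A^T @ u for a matrix stored as a list of rows."""
    return [sum(a[i][j] * u[i] for i in range(len(a))) for j in range(len(a[0]))]


def normalize(v):
    """Scale v to unit Euclidean norm; a zero vector is rejected."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def top_right_singular_vector(a, iters=200, seed=0):
    """Estimate the dominant right singular vector via power iteration on A^T A."""
    rng = random.Random(seed)
    v = normalize([rng.gauss(0.0, 1.0) for _ in a[0]])
    for _ in range(iters):
        v = normalize(rmatvec(a, matvec(a, v)))
    return v


def directional_change(update_a, update_b):
    """Return 1 - |cos| between the dominant right singular directions.

    0 means the two updates share a dominant direction; values near 1
    mean the dominant directions are nearly orthogonal. The sign of a
    singular vector is arbitrary, hence the absolute value.
    """
    va = top_right_singular_vector(update_a)
    vb = top_right_singular_vector(update_b)
    cos = abs(sum(x * y for x, y in zip(va, vb)))
    return 1.0 - min(cos, 1.0)


# Toy updates (hypothetical): two that share a dominant direction, one that does not.
clean_early = [[3.0, 0.0], [0.0, 1.0]]
clean_late = [[2.5, 0.0], [0.0, 0.5]]
drifted = [[0.1, 0.0], [0.0, 2.0]]

print(directional_change(clean_early, clean_late))
print(directional_change(clean_early, drifted))
```

In this toy setup the first comparison stays near 0 and the second approaches 1, which mirrors the qualitative claim that reward-hacking runs show larger directional change than clean runs. Real updates would be far larger, and a library SVD would replace the hand-written power iteration.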

Cached at: 05/26/26, 06:45 PM


Source: https://huggingface.co/papers/2605.25189

Abstract

This research examines reward hacking in language models through the geometry of reinforcement learning updates. It identifies optimization drift away from stable trajectories as a key factor and proposes trusted-direction projection, which constrains gradients and delays shortcut exploitation.

Reward hacking arises when a model improves a proxy reward by exploiting shortcuts rather than solving the intended task. We study this failure mode through the geometry of reinforcement learning updates in language models and argue that hacking emerges when optimization drifts away from a stable low-dimensional learning trajectory. We analyze this drift through dominant singular directions of parameter updates and show that reward-hacking runs exhibit substantially larger directional change than clean runs. Motivated by this observation, we introduce trusted-direction projection, which constrains gradients to remain within a clean reference subspace. Across reward-hacking experiments on mathematical reasoning, the proposed approach delays shortcut exploitation and better preserves task performance.
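The idea of constraining a gradient to a clean reference subspace can be sketched as an orthogonal projection. The abstract does not say how the reference subspace is built or applied, so this is a minimal sketch under assumptions: the subspace is spanned by a few "trusted" direction vectors (hypothetically, dominant directions from a clean run), an orthonormal basis is obtained with modified Gram-Schmidt, and the gradient is replaced by its projection onto that basis. The names `orthonormal_basis` and `project_onto_subspace` are illustrative.

```python
import math


def dot(u, v):
    """Euclidean inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(directions, tol=1e-12):
    """Orthonormalize direction vectors with modified Gram-Schmidt.

    Directions that are (numerically) linear combinations of earlier
    ones are dropped, so the result spans the same subspace.
    """
    basis = []
    for d in directions:
        r = list(d)
        for q in basis:
            c = dot(r, q)
            r = [ri - c * qi for ri, qi in zip(r, q)]
        norm = math.sqrt(dot(r, r))
        if norm > tol:
            basis.append([ri / norm for ri in r])
    return basis


def project_onto_subspace(grad, basis):
    """Return the component of grad lying in span(basis).

    Assumes basis is orthonormal, so the projection is sum_i (g . q_i) q_i.
    """
    projected = [0.0] * len(grad)
    for q in basis:
        c = dot(grad, q)
        projected = [p + c * qi for p, qi in zip(projected, q)]
    return projected


# Hypothetical trusted directions spanning the first two coordinates.
trusted = orthonormal_basis([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])

# A gradient with a component outside the trusted subspace.
grad = [2.0, 3.0, 5.0]
safe_grad = project_onto_subspace(grad, trusted)
print(safe_grad)  # [2.0, 3.0, 0.0]
```

The third coordinate, which lies outside the trusted subspace, is removed, while the in-subspace components pass through unchanged. In a training loop, such a projection would presumably be applied per parameter block before the optimizer step, though that placement is also an assumption here.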



