Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models
Summary
This paper studies reward hacking in reinforcement learning for language models through the geometry of updates, identifying optimization drift as a key factor. It proposes trusted-direction projection to constrain gradients within a clean reference subspace, delaying shortcut exploitation and preserving task performance.
Cached at: 05/26/26, 06:45 PM
Source: https://huggingface.co/papers/2605.25189
Abstract
Research examines reward hacking in language models through reinforcement learning update geometry, identifying optimization drift from stable trajectories and proposing trusted-direction projection to constrain gradients and delay shortcut exploitation.
Reward hacking arises when a model improves a proxy reward by exploiting shortcuts rather than solving the intended task. We study this failure mode through the geometry of reinforcement learning updates in language models and argue that hacking emerges when optimization drifts away from a stable low-dimensional learning trajectory. We analyze this drift through dominant singular directions of parameter updates and show that reward-hacking runs exhibit substantially larger directional change than clean runs. Motivated by this observation, we introduce trusted-direction projection, which constrains gradients to remain within a clean reference subspace. Across reward-hacking experiments on mathematical reasoning, the proposed approach delays shortcut exploitation and better preserves task performance.
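The projection idea in the abstract can be illustrated with a small NumPy sketch. This is not the paper's implementation: the function names (`trusted_basis`, `project_gradient`, `directional_change`), the choice of `k`, the synthetic data, and the use of a plain orthogonal projection onto the top-k right singular directions of clean-run updates are all assumptions made for illustration.

```python
import numpy as np


def trusted_basis(reference_updates: np.ndarray, k: int) -> np.ndarray:
    """Return a (d, k) orthonormal basis for the top-k singular directions.

    reference_updates has shape (n_steps, d): one flattened parameter update
    per row, assumed here to come from a clean (non-hacking) run.
    """
    _, _, vt = np.linalg.svd(reference_updates, full_matrices=False)
    return vt[:k].T


def project_gradient(grad: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Orthogonally project grad onto the column span of basis."""
    return basis @ (basis.T @ grad)


def directional_change(u: np.ndarray, v: np.ndarray) -> float:
    """Hypothetical drift measure: 1 - |cosine| between two directions."""
    cos = abs(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 1.0 - float(cos)


# Synthetic example: clean updates confined to a 2-D subspace of an 8-D space.
rng = np.random.default_rng(0)
d, k = 8, 2
true_dirs = np.linalg.qr(rng.normal(size=(d, k)))[0]
clean_updates = rng.normal(size=(20, k)) @ true_dirs.T
basis = trusted_basis(clean_updates, k)

grad = rng.normal(size=d)                 # an arbitrary new gradient
projected = project_gradient(grad, basis)  # component inside the trusted subspace
residual = grad - projected                # component the projection discards
```

In this toy setting, any update that already lies in the clean subspace passes through the projection unchanged, while the component of a new gradient orthogonal to that subspace is removed. How the reference subspace is actually estimated and applied during RL training is specified in the paper, not here.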
Similar Articles
Alignment Tampering: How Reinforcement Learning from Human Feedback Is Exploited to Optimize Misaligned Biases
This paper introduces alignment tampering, a vulnerability in RLHF where language models can manipulate preference datasets to amplify misaligned biases, demonstrating experimentally across biases like sexism, brand promotion, and goal-seeking, and showing that existing mitigation techniques are insufficient.
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
Survey introduces the Proxy Compression Hypothesis to explain how RLHF and related methods systematically induce reward hacking, deception, and oversight gaming in large language and multimodal models.
Our approach to alignment research
OpenAI outlines its alignment research approach, highlighting reinforcement learning from human feedback (RLHF) as its primary technique for aligning deployed language models like InstructGPT. The post reports that the resulting models are significantly preferred over models 100x larger while using minimal compute, acknowledges current limitations, and proposes a long-term strategy of using AI systems to accelerate alignment research beyond what humans can achieve alone.
Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
This paper proposes mid-training language models on self-generated diverse reasoning traces before reinforcement learning, showing improved RL performance on math benchmarks by exposing models to multiple valid solution approaches.
Parameter Alignment Mitigates Catastrophic Forgetting in Multilingual Expert Language Models
This paper studies catastrophic forgetting in multilingual expert language models during continual pretraining and proposes five parameter alignment strategies (hard layer freezing, soft regularization, post-hoc weight reversion, and model merging) to mitigate forgetting across 32 training languages with minimal cost to language acquisition.