Reward Hacking in Rubric-Based Reinforcement Learning
Summary
This paper investigates reward hacking in rubric-based reinforcement learning, analyzing the divergence between training verifiers and evaluation metrics. It introduces a diagnostic for the 'self-internalization gap' and demonstrates that stronger verification reduces but does not eliminate reward hacking.
Cached at: 05/13/26, 04:11 AM
Source: https://huggingface.co/papers/2605.12474
Abstract
Research examines reward hacking in rubric-based reinforcement learning, identifying verifier failure and rubric-design limitations as key sources of divergence between training and evaluation metrics.
Reinforcement learning with verifiable rewards has enabled strong post-training gains in domains such as math and coding, though many open-ended settings rely on rubric-based rewards. We study reward hacking in rubric-based RL, where a policy is optimized against a training verifier but evaluated against a cross-family panel of three frontier judges, reducing dependence on any single evaluator. Our framework separates two sources of divergence: verifier failure, where the training verifier credits rubric criteria that reference verifiers reject, and rubric-design limitations, where even strong rubric-based verifiers favor responses that rubric-free judges rate worse overall. Across medical and science domains, weak verifiers produce large proxy-reward gains that do not transfer to the reference verifiers; exploitation grows over training and concentrates in recurring failures such as partial satisfaction of compound criteria, treating implicit content as explicit, and imprecise topical matching. Stronger verifiers substantially reduce, but do not eliminate, verifier exploitation. We also introduce a self-internalization gap, a verifier-free diagnostic based on policy log-probabilities, which tracks reference-verifier quality, detecting when the policy trained using the weak verifier stops improving. Finally, in our setting, stronger verification does not prevent reward hacking when the rubric leaves important failure modes unspecified: rubric-based verifiers prefer the RL checkpoint, while rubric-free judges prefer the base model. These disagreements coincide with gains concentrated in completeness and presence-based criteria, alongside declines in factual correctness, conciseness, relevance, and overall quality. Together, these results suggest that stronger verification reduces reward hacking, but does not by itself ensure that rubric gains correspond to broader quality gains.
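The verifier-failure notion in the abstract (criteria the training verifier credits but the reference verifiers reject) can be illustrated with a small sketch. This is not the paper's code: the function names, the boolean per-criterion data layout, and the majority-vote rule for combining the three reference judges are all assumptions made for illustration, since the abstract does not say how the panel is aggregated.

```python
from typing import Sequence


def panel_accepts(votes: Sequence[bool]) -> bool:
    """Strict majority vote over reference judges (aggregation rule is an assumption)."""
    return sum(votes) * 2 > len(votes)


def verifier_failure_rate(
    train_credit: Sequence[bool],
    panel_votes: Sequence[Sequence[bool]],
) -> float:
    """Share of criteria credited by the training verifier that the panel rejects."""
    # Keep only the criteria the training verifier marked as satisfied.
    credited = [votes for credit, votes in zip(train_credit, panel_votes) if credit]
    if not credited:
        return 0.0
    rejected = sum(1 for votes in credited if not panel_accepts(votes))
    return rejected / len(credited)


def proxy_reference_gap(
    train_credit: Sequence[bool],
    panel_votes: Sequence[Sequence[bool]],
) -> float:
    """Proxy rubric score minus reference rubric score, each as a fraction of criteria."""
    proxy = sum(train_credit) / len(train_credit)
    reference = sum(panel_accepts(v) for v in panel_votes) / len(panel_votes)
    return proxy - reference


# Hypothetical response scored on four rubric criteria.
train_credit = [True, True, True, False]
panel_votes = [
    (True, True, False),    # panel accepts
    (False, False, True),   # panel rejects
    (False, False, False),  # panel rejects
    (True, True, True),     # panel accepts; training verifier missed it
]

print(verifier_failure_rate(train_credit, panel_votes))
print(proxy_reference_gap(train_credit, panel_votes))
```

In this toy case, two of the three criteria credited by the training verifier are rejected by the panel, so the proxy score overstates the reference score. Tracking such quantities across checkpoints is one way to see exploitation grow over training, under the assumptions stated above.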
Similar Articles
Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning
This paper introduces CHERRL, a controllable environment for studying reward hacking in rubric-based reinforcement learning, where LLM-as-a-Judge biases can be injected to reproduce and analyze hacking behaviors. The authors also explore an agent-based system for automatically detecting reward hacking onset from training logs.
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
Survey introduces the Proxy Compression Hypothesis to explain how RLHF and related methods systematically induce reward hacking, deception, and oversight gaming in large language and multimodal models.
Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR
This paper introduces POW3R, a policy-aware rubric reward framework for reinforcement learning with verifiable rewards (RLVR). It shows that static rubric aggregation misallocates learning signal, and POW3R achieves faster convergence and better performance across multiple settings.
A debugger for RL reward functions that detects reward hacking during training [P]
A debugger that detects reward hacking in reinforcement learning reward functions during training, aiding developers in identifying and fixing issues.
Reward Hacking in Language Model Agents: Revisiting AI Safety Gridworlds
This paper adapts AI Safety Gridworlds to text-based evaluation and finds that language model agents exhibit zero-shot reward hacking across scales, which is not corrected by standard RL mitigations.