Reward Hacking in Rubric-Based Reinforcement Learning
Summary
This paper investigates reward hacking in rubric-based reinforcement learning, analyzing the divergence between training verifiers and evaluation metrics. It introduces the 'self-internalization gap', a verifier-free diagnostic based on policy log-probabilities, and demonstrates that stronger verification reduces but does not eliminate reward hacking.
Source: https://huggingface.co/papers/2605.12474
Abstract
Research examines reward hacking in rubric-based reinforcement learning, identifying verifier failure and rubric-design limitations as key sources of divergence between training and evaluation metrics.
Reinforcement learning with verifiable rewards has enabled strong post-training gains in domains such as math and coding, though many open-ended settings rely on rubric-based rewards. We study reward hacking in rubric-based RL, where a policy is optimized against a training verifier but evaluated against a cross-family panel of three frontier judges, reducing dependence on any single evaluator. Our framework separates two sources of divergence: verifier failure, where the training verifier credits rubric criteria that reference verifiers reject, and rubric-design limitations, where even strong rubric-based verifiers favor responses that rubric-free judges rate worse overall. Across medical and science domains, weak verifiers produce large proxy-reward gains that do not transfer to the reference verifiers; exploitation grows over training and concentrates in recurring failures such as partial satisfaction of compound criteria, treating implicit content as explicit, and imprecise topical matching. Stronger verifiers substantially reduce, but do not eliminate, verifier exploitation. We also introduce a self-internalization gap, a verifier-free diagnostic based on policy log-probabilities, which tracks reference-verifier quality, detecting when the policy trained using the weak verifier stops improving. Finally, in our setting, stronger verification does not prevent reward hacking when the rubric leaves important failure modes unspecified: rubric-based verifiers prefer the RL checkpoint, while rubric-free judges prefer the base model. These disagreements coincide with gains concentrated in completeness and presence-based criteria, alongside declines in factual correctness, conciseness, relevance, and overall quality. Together, these results suggest that stronger verification reduces reward hacking, but does not by itself ensure that rubric gains correspond to broader quality gains.
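As a rough illustration of the kind of verifier-free diagnostic the abstract describes, the sketch below computes the mean per-token log-probability a policy assigns to a response, a statistic that can be tracked across RL checkpoints without calling any rubric verifier. The model name, prompt, and exact statistic are illustrative assumptions, not the paper's implementation.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder policy checkpoint; the paper's models and prompts are not specified here.
MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
policy = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
policy.eval()

@torch.no_grad()
def mean_response_logprob(prompt: str, response: str) -> float:
    """Mean log-probability the policy assigns to `response` tokens given `prompt`."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    resp_ids = tok(response, add_special_tokens=False, return_tensors="pt").input_ids
    full_ids = torch.cat([prompt_ids, resp_ids], dim=1)
    logits = policy(full_ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)        # predictions for tokens 1..L-1
    token_lp = logprobs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_ids.shape[1] - 1:].mean().item()  # response positions only

# Tracked on a fixed probe set across checkpoints, a plateau or drop in this statistic
# can flag when proxy-reward gains under a weak verifier stop reflecting real improvement.
print(mean_response_logprob("Summarize first-line management of uncomplicated hypertension.",
                            " A thiazide diuretic is a common first-line choice."))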
Get this paper in your agent:
hf papers read 2605.12474
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
Survey introduces the Proxy Compression Hypothesis to explain how RLHF and related methods systematically induce reward hacking, deception, and oversight gaming in large language and multimodal models.
Detecting and Suppressing Reward Hacking with Gradient Fingerprints
This paper introduces Gradient Fingerprint (GRIFT), a method for detecting reward hacking in reinforcement learning with verifiable rewards by analyzing models' internal gradient computations rather than surface-level reasoning traces. The approach achieves over 25% relative improvement in detecting implicit reward-hacking behaviors across math, code, and logical reasoning benchmarks.
@vivek_2332: found a really good blog digging into how @AnthropicAI identifies and mitigates reward hacking during RL training. reco…
This article summarizes a blog post detailing Anthropic's methods for identifying and mitigating reward hacking during RL training, including hidden tests, stress-test sets, SAE monitoring, and environment redesign.
RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards
This paper introduces RubricEM, a reinforcement learning framework that uses rubric-guided policy decomposition and reflection-based meta-policy evolution to train deep research agents for long-form tasks. The resulting RubricEM-8B model demonstrates strong performance on long-form research benchmarks by leveraging stage-aware planning and denser semantic feedback.
C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences
C2 proposes a scalable rubric-augmented reward modeling framework that trains a cooperative rubric generator and critical verifier exclusively from binary preferences, eliminating the need for costly rubric annotations while achieving up to 6.5 point gains on RM-Bench.