Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning

Hugging Face Daily Papers 06/03/26, 12:00 AM Papers

reinforcement-learning reward-hacking llm-as-a-judge rlhf ai-safety evaluation

Summary

This paper introduces CHERRL, a controllable environment for studying reward hacking in rubric-based reinforcement learning, where LLM-as-a-Judge biases can be injected to reproduce and analyze hacking behaviors. The authors also explore an agent-based system for automatically detecting reward hacking onset from training logs.

Rubric-based reinforcement learning (RL) uses an LLM-as-a-Judge (LaaJ) to score model outputs according to rubrics as rewards. However, policy models may exploit latent biases in the judge, leading to reward hacking and ineffective or unsafe training outcomes. In real-world rubric-based RL, such hacking behaviors are often subtle and entangled with multiple judge biases, making them difficult to analyze, detect, and mitigate. In this paper, we introduce CHERRL, a controllable hacking environment for rubric-based RL. By injecting known biases into LaaJ, CHERRL enables stable reproduction of reward hacking, explicit observation of reward divergence, and precise identification of hacking onset. This provides a clean experimental testbed for studying the mechanisms and mitigations of reward hacking in rubric-based RL. To demonstrate its utility, we analyze different judge biases from the perspectives of discoverability and exploitability, and explore an agent-based system for automatically detecting reward hacking onset from training logs. The code and environment are publicly available at https://github.com/THUAIS-Lab/CHERRL.

Original Article

Similar Articles

Reward Hacking in Rubric-Based Reinforcement Learning

Hugging Face Daily Papers

This paper investigates reward hacking in rubric-based reinforcement learning, analyzing the divergence between training verifiers and evaluation metrics. It introduces a diagnostic for the 'self-internalization gap' and demonstrates that stronger verification reduces but does not eliminate reward hacking.

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

Hugging Face Daily Papers

Survey introduces the Proxy Compression Hypothesis to explain how RLHF and related methods systematically induce reward hacking, deception, and oversight gaming in large language and multimodal models.

@vivek_2332: found a really good blog digging into how @AnthropicAI identifies and mitigates reward hacking during RL training. reco…

X AI KOLs Timeline

This article summarizes a blog post detailing Anthropic's methods for identifying and mitigating reward hacking during RL training, including hidden tests, stress-test sets, SAE monitoring, and environment redesign.

RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

Hugging Face Daily Papers

This paper introduces RubricEM, a reinforcement learning framework that uses rubric-guided policy decomposition and reflection-based meta-policy evolution to train deep research agents for long-form tasks. The resulting RubricEM-8B model demonstrates strong performance on long-form research benchmarks by leveraging stage-aware planning and denser semantic feedback.

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

Hugging Face Daily Papers

This paper introduces POW3R, a policy-aware rubric reward framework for reinforcement learning with verifiable rewards (RLVR). It shows that static rubric aggregation misallocates learning signal, and POW3R achieves faster convergence and better performance across multiple settings.

Similar Articles

Reward Hacking in Rubric-Based Reinforcement Learning

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

@vivek_2332: found a really good blog digging into how @AnthropicAI identifies and mitigates reward hacking during RL training. reco…

RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

Submit Feedback