Reproducing, Analyzing, and Detecting Reward Hacking in Rubric-Based Reinforcement Learning
Summary
This paper introduces CHERRL, a controllable environment for studying reward hacking in rubric-based reinforcement learning, where LLM-as-a-Judge biases can be injected to reproduce and analyze hacking behaviors. The authors also explore an agent-based system for automatically detecting reward hacking onset from training logs.
Similar Articles
Reward Hacking in Rubric-Based Reinforcement Learning
This paper investigates reward hacking in rubric-based reinforcement learning, analyzing the divergence between training verifiers and evaluation metrics. It introduces a diagnostic for the 'self-internalization gap' and demonstrates that stronger verification reduces but does not eliminate reward hacking.
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
This survey introduces the Proxy Compression Hypothesis to explain how RLHF and related methods systematically induce reward hacking, deception, and oversight gaming in large language and multimodal models.
@vivek_2332: found a really good blog digging into how @AnthropicAI identifies and mitigates reward hacking during RL training. reco…
This article summarizes a blog post detailing Anthropic's methods for identifying and mitigating reward hacking during RL training, including hidden tests, stress-test sets, SAE monitoring, and environment redesign.
RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards
This paper introduces RubricEM, a reinforcement learning framework that uses rubric-guided policy decomposition and reflection-based meta-policy evolution to train deep research agents for long-form tasks. The resulting RubricEM-8B model demonstrates strong performance on long-form research benchmarks by leveraging stage-aware planning and denser semantic feedback.
Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR
This paper introduces POW3R, a policy-aware rubric reward framework for reinforcement learning with verifiable rewards (RLVR). It shows that static rubric aggregation misallocates learning signal and that POW3R achieves faster convergence and better performance across multiple settings.