This article summarizes a blog post detailing Anthropic's methods for identifying and mitigating reward hacking during RL training, including hidden tests, stress-test sets, SAE monitoring, and environment redesign.
found a really good blog digging into how @AnthropicAI identifies and mitigates reward hacking during RL training. recommended by @sheriyuo. my notes: Identifying Reward Hacking 1. frontier model reads training trajectories, summarizes them, flags hacky behavior. Running on hundreds of thousands of trajectories per run by 4.6. 2. 3 stress-test sets stay live during training: problems where past models hacked, impossible tasks that force failure (hacking usually shows up after honest attempts fail), and hack-frequency tracking on the training distribution itself. 3. hidden tests: hold out tests the model never sees. hack rate = solutions that pass visible tests but fail hidden ones. catches verifier overfitting cleanly. 4. agentic code behavior scores: 6 dim rubric on trajectories. instruction following, safety, verification, efficiency, adaptability, honesty. 5. impossible gui tasks for over-eagerness: container rigged so the user's request is actually impossible. Right move: ask the user. hacky move: fabricate and proceed. 6. prompt-injection differentials: run the eval with anti-hack and pro-hack prompts. the gap tells you hacking propensity vs just bad instruction-following. 7. white-box SAE monitoring: find features that fire on reward hacking, sample trajectories during training, flag anomalous activations. diagnostic only, not a training signal. 8. human reviewers alongside the automated stack. Their findings feed back into better classifiers over time. Mitigating Reward Hacking 1. environment redesign: kill hackable surface area, tighten specs to match reward signals. the spec-reward gap is what hacks exploit. 2. reward signal hardening: rewards modified to be harder to game. specifics not disclosed. 3. instruction-following as a lever: once it's solid, a simple "don't hack" preamble drops hack rate sharply. size of the drop is itself a useful signal. 4. pre-exposure prompting: tell the model during training that the hacky behavior is expected. breaks the link between learning a specific hack and generalizing to broader misalignment. 5. stress tests run throughout training, not at the end. hacks get caught inside the run instead of after the model's already shaped around them. 6. disclosure gap worth flagging: detection is documented in depth, mitigation stays high-level. What they did, rarely how, no ablations.
An analysis of why RL for coding tasks is gaining traction due to verifiable rewards, and why the emerging framework Harbor addresses the bottleneck of environment complexity in RL training.
Survey introduces the Proxy Compression Hypothesis to explain how RLHF and related methods systematically induce reward hacking, deception, and oversight gaming in large language and multimodal models.
This paper investigates reward hacking in rubric-based reinforcement learning, analyzing the divergence between training verifiers and evaluation metrics. It introduces a diagnostic for the 'self-internalization gap' and demonstrates that stronger verification reduces but does not eliminate reward hacking.
This paper introduces CHERRL, a controllable environment for studying reward hacking in rubric-based reinforcement learning, where LLM-as-a-Judge biases can be injected to reproduce and analyze hacking behaviors. The authors also explore an agent-based system for automatically detecting reward hacking onset from training logs.