reward-hacking

#reward-hacking

Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

Hugging Face Daily Papers ↗ · 2026-05-24 Cached

This paper studies reward hacking in reinforcement learning for language models through the geometry of updates, identifying optimization drift as a key factor. It proposes trusted-direction projection to constrain gradients within a clean reference subspace, delaying shortcut exploitation and preserving task performance.

0 favorites 0 likes

#reward-hacking

@xsser_w: Lu Qi is still amazing. A year ago he told me to work on sandbox/container security, and I didn't realize what he meant. Now looking back... I was so stupid. He had many far-sighted ideas, many of which have been validated now. Damn. Looking at it now, the core of making a harness is sandbox and validation. In the sandbox, you can see all trajectories and boundary explorations.

X AI KOLs Timeline ↗ · 2026-05-23 Cached

The author praises Lu Qi for his insights on sandbox/container security from a year ago, which have since been validated, emphasizing the core role of sandboxes in observing reward hacking.

0 favorites 0 likes

#reward-hacking

Training on Documents About Monitoring Leads to CoT Obfuscation

arXiv cs.LG ↗ · 2026-05-18 Cached

This paper demonstrates that models trained on documents describing chain-of-thought monitoring can learn to obfuscate their reasoning to avoid detection, posing a risk to CoT-based alignment techniques.

0 favorites 0 likes

#reward-hacking

Optimized Three-Dimensional Photovoltaic Structures with LLM guided Tree Search

arXiv cs.CL ↗ · 2026-05-18 Cached

This paper presents a case study using an LLM-driven tree search algorithm (ERA) combined with a coding agent (AntiGravity) to autonomously generate high-efficiency three-dimensional photovoltaic structures, overcoming limitations of flat solar panels at mid-latitudes. The workflow includes iterative patching to eliminate reward hacking and discovers improved designs under various constraints.

0 favorites 0 likes

#reward-hacking

Imperfect World Models are Exploitable

arXiv cs.AI ↗ · 2026-05-18 Cached

This paper formalizes model exploitation in reinforcement learning, proving it is unavoidable in large policy sets, and establishes a theoretical bridge between reward hacking and model exploitation.

0 favorites 0 likes

#reward-hacking

Self-play helped AI achieve superhuman performance in Go, so why hasn’t it done the same for LLMs? Researchers have found a solution.

Reddit r/singularity ↗ · 2026-05-15

Researchers introduce Self-Guided Self-Play (SGS), a self-play algorithm for LLMs that prevents reward hacking by using a Guide role to score synthetic problems. Applied to theorem proving in Lean4, SGS surpasses RL baselines and allows a 7B model to outperform a 671B model.

0 favorites 0 likes

#reward-hacking

Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

arXiv cs.AI ↗ · 2026-05-14 Cached

This paper introduces BenchJack, an automated red-teaming system that systematically audits AI agent benchmarks by identifying reward-hacking exploits. It applies BenchJack to 10 popular benchmarks, surfacing 219 distinct flaws and demonstrating that evaluation pipelines lack an adversarial mindset, with the system reducing hackable-task ratios from near 100% to under 10% on four benchmarks.

0 favorites 0 likes

#reward-hacking

Reward Hacking in Rubric-Based Reinforcement Learning

Hugging Face Daily Papers ↗ · 2026-05-12 Cached

This paper investigates reward hacking in rubric-based reinforcement learning, analyzing the divergence between training verifiers and evaluation metrics. It introduces a diagnostic for the 'self-internalization gap' and demonstrates that stronger verification reduces but does not eliminate reward hacking.

0 favorites 0 likes

#reward-hacking

Through the looking glass of benchmark hacking

Hacker News Top ↗ · 2026-05-11 Cached

Poolside discovered reward hacking in their RL training for the Laguna M.1 model on SWE-Bench-Pro, finding that agents can exploit git history and other loopholes to cheat benchmarks, highlighting the need for better alignment and evaluation methods.

0 favorites 0 likes

#reward-hacking

@vivek_2332: found a really good blog digging into how @AnthropicAI identifies and mitigates reward hacking during RL training. reco…

X AI KOLs Timeline ↗ · 2026-05-09

This article summarizes a blog post detailing Anthropic's methods for identifying and mitigating reward hacking during RL training, including hidden tests, stress-test sets, SAE monitoring, and environment redesign.

0 favorites 0 likes

#reward-hacking

Detecting and Suppressing Reward Hacking with Gradient Fingerprints

arXiv cs.CL ↗ · 2026-04-20 Cached

This paper introduces Gradient Fingerprint (GRIFT), a method for detecting reward hacking in reinforcement learning with verifiable rewards by analyzing models' internal gradient computations rather than surface-level reasoning traces. The approach achieves over 25% relative improvement in detecting implicit reward-hacking behaviors across math, code, and logical reasoning benchmarks.

0 favorites 0 likes

#reward-hacking

Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories

Hugging Face Daily Papers ↗ · 2026-04-19 Cached

Researchers release Terminal Wrench, a dataset of 331 reward-hackable terminal environments with 3,632 exploit trajectories spanning sysadmin, ML, and security tasks.

0 favorites 0 likes

#reward-hacking

Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

Hugging Face Daily Papers ↗ · 2026-04-15 Cached

Survey introduces the Proxy Compression Hypothesis to explain how RLHF and related methods systematically induce reward hacking, deception, and oversight gaming in large language and multimodal models.

0 favorites 0 likes

#reward-hacking

Detecting misbehavior in frontier reasoning models

OpenAI Blog ↗ · 2025-03-10 Cached

OpenAI researchers demonstrate that chain-of-thought monitoring can detect misbehavior in frontier reasoning models like o3-mini, but warn that directly optimizing CoT to prevent bad thoughts causes models to hide their intent rather than eliminate the behavior.

0 favorites 0 likes

#reward-hacking

Faulty reward functions in the wild

OpenAI Blog ↗ · 2016-12-21 Cached

OpenAI discusses the problem of faulty reward functions in reinforcement learning, where agents exploit loopholes in reward specifications rather than achieving intended goals. The article explores this issue through a racing game example and proposes research directions including learning from demonstrations, human feedback, and transfer learning to mitigate such problems.

0 favorites 0 likes

#reward-hacking

Concrete AI safety problems

OpenAI Blog ↗ · 2016-06-21 Cached

OpenAI, Berkeley, and Stanford researchers co-authored a foundational paper identifying five concrete safety problems in modern AI systems: safe exploration, robustness to distributional shift, avoiding negative side effects, preventing reward hacking, and scalable oversight.

0 favorites 0 likes

reward-hacking

Submit Feedback