Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops
Summary
Researchers propose an adversarial hacker-fixer loop using LLM agents to automatically patch brittle verifiers in agent benchmarks, reducing attack success rates from 62% to 0% on KernelBench and demonstrating that weaker defenders can neutralize much stronger attackers.
Cached at: 06/09/26, 08:40 AM
Source: https://huggingface.co/papers/2606.08960
Abstract
Researchers identify widespread vulnerabilities in agent benchmark verification systems and develop an automated iterative process using LLM agents to create robust verifiers that resist exploitation while maintaining legitimate task performance.
Agent benchmarks score submissions with outcome verifiers that are typically hand-written and brittle, leaving them open to reward hacking. We audit 1,968 tasks across five terminal-agent benchmarks and find 323 (16%) hackable by frontier models given only the task description. This corrupts both leaderboard rankings and RL training signal, yet the standard response is manual and reactive. We introduce the hacker-fixer loop, a method for building exploit-resistant verifiers without per-task manual patching. The loop alternates three LLM agents: a hacker tries to pass the verifier without solving the task, a fixer patches the verifier to reject each discovered exploit, and a solver confirms the patched verifier still admits legitimate solutions. The loop iterates: each patch reshapes what the verifier rewards, surfacing the next exploit. We further add verifier access, and let patches transfer across tasks, to broaden the exploits the loop discovers. On KernelBench, the loop drives the attack success rate from 62% to 0% on a held-out corpus of publicly reported exploits. We also find that weaker agents in the loop can defend against much stronger hackers: Gemini 3 Flash’s loop drives the stronger Gemini 3.1 Pro and Claude Opus 4.7’s attack success rate from 76% and 61% to 0% on KernelBench, and Gemini 3.1 Pro’s from 39% to 17% on Terminal Bench across 77 tasks. We release Terminal Wrench (323 hackable environments, 3,632 hack trajectories) as a snapshot of the current attack surface, our patched verifiers, the exploits the loop discovered, and our implementation as a basis for future work.
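The control flow of the loop described in the abstract can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: the three agents are stand-in functions rather than LLMs, a submission is a plain string, and every name here (hacker_fixer_loop, attack_success_rate, toy_hacker, toy_fixer, toy_solver) is hypothetical. The toy fixer simply blocklists each exploit, which is far cruder than a real verifier patch; it is only there to show how the hacker, fixer, and solver steps alternate.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Assumption: a verifier maps a submission (here a string) to pass/fail.
Verifier = Callable[[str], bool]


@dataclass
class LoopResult:
    verifier: Verifier                       # final (possibly patched) verifier
    exploits: List[str] = field(default_factory=list)
    rounds: int = 0                          # number of accepted patches


def hacker_fixer_loop(
    verifier: Verifier,
    hacker: Callable[[Verifier], Optional[str]],
    fixer: Callable[[Verifier, str], Verifier],
    solver: Callable[[], str],
    max_rounds: int = 10,
) -> LoopResult:
    """Alternate hacker, fixer, and solver until no exploit is found."""
    exploits: List[str] = []
    rounds = 0
    for _ in range(max_rounds):
        exploit = hacker(verifier)           # try to pass without solving
        if exploit is None:
            break                            # hacker found nothing: stop
        exploits.append(exploit)
        patched = fixer(verifier, exploit)   # patch to reject this exploit
        if not patched(solver()):
            break                            # patch rejects a legitimate solution: discard it
        verifier = patched
        rounds += 1
    return LoopResult(verifier, exploits, rounds)


def attack_success_rate(verifier: Verifier, attacks: List[str]) -> float:
    """Fraction of attack submissions that the verifier accepts."""
    if not attacks:
        return 0.0
    return sum(verifier(a) for a in attacks) / len(attacks)


# --- Toy task: the only legitimate answer is the string "5". ---
NON_SOLUTIONS = ["", "15", "abc5"]           # hypothetical attack candidates


def brittle_verifier(submission: str) -> bool:
    return submission.endswith("5")          # accepts too much on purpose


def toy_hacker(verifier: Verifier) -> Optional[str]:
    # Return the first non-solution the current verifier accepts, if any.
    return next((s for s in NON_SOLUTIONS if verifier(s)), None)


def toy_fixer(verifier: Verifier, exploit: str) -> Verifier:
    def patched(submission: str) -> bool:
        return submission != exploit and verifier(submission)
    return patched


def toy_solver() -> str:
    return "5"


result = hacker_fixer_loop(brittle_verifier, toy_hacker, toy_fixer, toy_solver)
print(result.exploits)                                        # ['15', 'abc5']
print(attack_success_rate(brittle_verifier, result.exploits)) # 1.0
print(attack_success_rate(result.verifier, result.exploits))  # 0.0
```

In this sketch the solver check acts as a guard: a patch that would reject the legitimate answer is discarded rather than adopted, which mirrors the abstract's requirement that patched verifiers still admit legitimate solutions.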
Get this paper in your agent:
hf papers read 2606.08960
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Similar Articles
HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing
Researchers introduce HarDBench, a benchmark exposing how LLMs can be jailbroken via malicious drafts in collaborative writing, and propose a preference-optimization defense that cuts harmful outputs without hurting co-authoring utility.
CHASE: Adversarial Red-Blue Teaming for Improving LLM Safety using Reinforcement Learning
CHASE introduces a co-evolutionary red-blue teaming framework that uses reinforcement learning to harden LLMs against adaptive black-box adversarial attacks, reducing jailbreak success by 43.2% on benchmarks while maintaining zero false refusals on benign prompts.
Through the looking glass of benchmark hacking
Poolside discovered reward hacking in their RL training for the Laguna M.1 model on SWE-Bench-Pro, finding that agents can exploit git history and other loopholes to cheat benchmarks, highlighting the need for better alignment and evaluation methods.
Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack
This paper introduces BenchJack, an automated red-teaming system that systematically audits AI agent benchmarks by identifying reward-hacking exploits. It applies BenchJack to 10 popular benchmarks, surfacing 219 distinct flaws and demonstrating that evaluation pipelines lack an adversarial mindset, with the system reducing hackable-task ratios from near 100% to under 10% on four benchmarks.
GAMBIT: A Three-Mode Benchmark for Adversarial Robustness in Multi-Agent LLM Collectives
This paper introduces GAMBIT, a benchmark for evaluating adversarial robustness in multi-agent LLM collectives, featuring adaptive imposters and recalibration modes to address the limitations of existing shallow evaluations.