@adithya_s_k: https://x.com/adithya_s_k/status/2067628584680710292

X AI KOLs Timeline 06/18/26, 03:20 PM Tools

coding-agents evaluation reinforcement-learning open-source vulnerabilities benchmarks

Summary

This article discusses how coding agents can cheat evaluations by copying known patches, and introduces Repo2RLEnv, a tool to create verifiable coding environments from real repositories to build robust benchmarks and training data for AI coding agents.

https://t.co/hKeoooVC62

Cached at: 06/18/26, 06:20 PM

How not to build coding environments

When you build environments for coding agents, you need problems that are real, hard, and verifiable. Open-source repositories are a goldmine for this, and vulnerability corpora are better still. Every CVE is a real bug, found in real code, fixed by a real human, with a real patch and often a real regression test attached. That is a ready-made answer key. So the recipe looks obvious: take the code at the vulnerable commit, hand the agent the symptoms, and let the original tests decide if the fix is correct. Turn a few hundred of these into sandboxes and you have a verifiable eval. It sounds clean. Then you run an agent against the first one, watch the score come back as a perfect pass, read the trace, and find that it never solved the bug at all. It located the published fix and copied it. This post is about that gap between a green score and a real solution: how capable agents quietly exploit it, the specific tricks they use, and what it actually takes to close them.

In this blog we take one real coding task and watch an agent earn a perfect score on it three different ways without ever solving the bug, then walk through how to close each of those holes, and how the leading research benchmarks and production systems solve for the same problem.

A lot of frontier labs are building RL environments out of public repositories right now, and a stack of research papers do versions of the same thing, but the work is scattered across one-off harnesses and benchmark-specific code. Repo2RLEnv is our attempt to consolidate it into one project: point it at any repository and it produces verifiable coding environments you can use for evaluation or training. Each task ships the code, a problem statement, and a reward backed by the repo’s real tests, packaged in the Uniform Harbor task format (the one used in Terminal Bench) so it runs in a sandbox and grades itself. A problem goes in, a graded score comes out, and the score means something because it is tied to tests rather than vibes.

That opens up a lot of use cases such as:

Evaluate a model on a real codebase, your own private one included, rather than a saturated public benchmark.
Evaluate an agent harness, since the same task runs across whatever harness Harbor supports.
Generate verifiable training data at scale, because every environment is also a gradable RL task.

Every pipeline follows the same shape: find candidates in the repo, synthesize a task from them, verify it actually works (the gold solution has to pass and a no-op has to fail), and emit it as a Harbor task. Only the synthesis step changes. Today there are a handful of pipelines:

pr_diff and pr_runtime mine merged pull requests. The first scores the agent’s diff against the human one. The second runs the repo’s own tests for a fail-to-pass oracle.
commit_runtime does the same at the commit level.
code_instruct and equivalence_tests synthesize fresh problems anchored to real functions in the code.
cve_patches pulls real vulnerabilities from the OSV database and turns each CVE into a patch-the-bug task.

Harbor handles the running: it builds the Docker environment, runs the agent and the verifier, and captures the trace.

Repo: https://github.com/huggingface/Repo2RLEnv

While building this out and running the generated environments against different coding agents and harnesses, Claude models and open-source ones alike, we kept seeing interesting things. The sharpest one came from the CVE pipeline, so here is a single task start to finish.

A perfect score for fixing nothing

The task was a real one: CVE-2026-48156 in pypdf, a popular PDF library. A crafted PDF with cross-reference stream widths of /W [0 0 0] and a large /Size value sends the parser into a near-infinite loop. It is a denial of service, and it is the kind of subtle parser bug that is genuinely hard to spot if you do not already know it is there. The agent under test was Claude Code running Opus.

The reward is honest by construction. A hidden test fails on the vulnerable code and passes once the bug is fixed. The gold patch should score 1.0, a no-op should score 0.0. That is the contract.
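The contract above can be checked mechanically before any agent runs. The following is a minimal sketch, not Repo2RLEnv's actual code: the `Scorer` interface and the toy scorer are assumptions made for illustration, standing in for a real sandboxed test run.

```python
from typing import Callable

# Hypothetical interface: score(patch) applies a patch to a fresh copy of the
# vulnerable code, runs the hidden test, and returns a reward in [0.0, 1.0].
Scorer = Callable[[str], float]


def oracle_is_sound(score: Scorer, gold_patch: str, noop_patch: str = "") -> bool:
    """Return True only if the gold patch scores 1.0 and a no-op scores 0.0."""
    return score(gold_patch) == 1.0 and score(noop_patch) == 0.0


def toy_score(patch: str) -> float:
    """Toy stand-in for a sandbox run: rewards any patch containing 'FIX'."""
    return 1.0 if "FIX" in patch else 0.0


if __name__ == "__main__":
    print(oracle_is_sound(toy_score, gold_patch="FIX"))
```

A task that fails this check would be dropped or repaired before it is ever handed to an agent.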

Cheat number one: just ask GitHub

The instruction I handed the agent was the public advisory text. Buried in it was a line about applying the changes from a specific pull request. The agent read that, and its very first move was a web fetch:

WebFetch https://github.com/py-pdf/pypdf/pull/3791.diff
prompt: "What is the exact code change? Output verbatim the exact added lines, the exact PdfStreamError message string, and the exact added test functions including their pytest.raises match strings."

It did not pretend to think about the bug. It pulled the patch diff, read off the exact error string the hidden test was checking for, copied the fix into the source, and collected a perfect 1.000.

So the first fix is simple: do not put a pointer to the answer in the prompt. I added a scrubber that strips the CVE id, the PR number, commit hashes, and any “fixed in version” line out of the instruction before the agent ever sees it. The agent now gets the symptom and nothing else. Then I ran it again.
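A scrubber of this kind can be sketched with a few regular expressions. This is an illustrative sketch, not the scrubber described above: the patterns, the `[redacted]` placeholder, and the sample advisory text (including its commit hash) are assumptions, and a real advisory feed would need more tuning.

```python
import re

# Assumed patterns, applied in order. URLs go before PR numbers and hashes
# because a URL usually contains both.
_INLINE_PATTERNS = [
    re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),       # CVE ids
    re.compile(r"https?://\S+/(?:pull|commit|issues)/\S+"),   # PR, commit, issue links
    re.compile(r"(?:\bPR\s*)?#\d+\b", re.IGNORECASE),         # "#1234" or "PR #1234"
    # Hex runs of 7 to 40 chars containing both a letter and a digit (commit hashes).
    re.compile(r"\b(?=[0-9a-f]*[a-f])(?=[0-9a-f]*\d)[0-9a-f]{7,40}\b"),
]
_FIXED_IN = re.compile(r"\b(?:fixed|patched|resolved)\s+in\b", re.IGNORECASE)


def scrub_instruction(text: str) -> str:
    """Remove pointers to the published fix from an advisory-style prompt."""
    kept = []
    for line in text.splitlines():
        if _FIXED_IN.search(line):
            continue  # drop "fixed in version X" lines entirely
        for pattern in _INLINE_PATTERNS:
            line = pattern.sub("[redacted]", line)
        kept.append(line)
    return "\n".join(kept)


if __name__ == "__main__":
    advisory = (
        "CVE-2026-48156: a crafted PDF hangs the parser.\n"
        "Apply the changes from https://github.com/py-pdf/pypdf/pull/3791.\n"
        "Fixed in version 6.12.0.\n"
        "See commit a1b2c3d4e5 and PR #3791."  # made-up hash, for illustration
    )
    print(scrub_instruction(advisory))
```

Pattern-based scrubbing only removes the pointers it knows about, so it reduces the leak in the prompt without saying anything about what the agent can reach from inside the container.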

Cheat number two: the answer was in the box the whole time

I cut off the web tools and re-ran it. Reward: 1.000 again. Different trace, same outcome.

git log --oneline origin/main -3
git diff origin/main HEAD -- pypdf/_reader.py
git show origin/main:tests/test_reader.py

The container shipped with the full git history. I had checked the working tree out to the vulnerable parent commit, but origin/main still pointed at the fixed version, and the entire future of the project, including the fix and the hidden test, was sitting right there in .git. No network required. The agent diffed the present against the future and read the answer.

This one is easy to miss, because it needs no network at all, so cutting off the internet does nothing about it. The fix is to scrub the repo down to the base commit: remove the remote, delete every branch and tag that points past the base, expire the reflog, and garbage collect so the future commits are actually gone. I verified it in a container afterward. The fix commit was unreachable, the base was still there, and the tests still ran. Then I ran it again.
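The scrub steps can be sketched as a short script around the git CLI. This is a minimal sketch under assumptions, not the project's implementation: the function names are made up for illustration, and it assumes HEAD is already checked out at the base commit.

```python
import subprocess
from pathlib import Path


def git(repo: Path, *args: str) -> subprocess.CompletedProcess:
    """Run a git command inside repo and capture its output."""
    return subprocess.run(["git", "-C", str(repo), *args], capture_output=True, text=True)


def refs_to_delete(ref_is_ancestor: dict[str, bool]) -> list[str]:
    """Pick refs whose tip is NOT the base commit or one of its ancestors."""
    return sorted(ref for ref, ok in ref_is_ancestor.items() if not ok)


def scrub_to_base(repo: Path, base: str) -> None:
    """Remove everything in the repository that points past the base commit."""
    # Dropping a remote also removes its refs/remotes/<name>/* tracking refs.
    for remote in git(repo, "remote").stdout.split():
        git(repo, "remote", "remove", remote)
    # Classify the remaining branches and tags by reachability from base.
    refs = git(repo, "for-each-ref", "--format=%(refname)").stdout.split()
    status = {
        ref: git(repo, "merge-base", "--is-ancestor", ref, base).returncode == 0
        for ref in refs
    }
    for ref in refs_to_delete(status):
        git(repo, "update-ref", "-d", ref)
    # Expire reflogs and prune so the unreachable objects are actually deleted.
    git(repo, "reflog", "expire", "--expire=now", "--all")
    git(repo, "gc", "--prune=now")
```

One way to confirm the result is to run `git cat-file -e <fix-sha>` inside the container afterward: a non-zero exit means the object is no longer in the repository.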

Cheat number three: pip is a search engine

Web blocked, git history scrubbed. Re-ran it. Reward: 1.000. Again.

pip download pypdf==6.12.0 --no-deps
unzip pypdf-6.12.0-py3-none-any.whl
pip download pypdf==6.12.0 --no-binary :all:
extractfile('pypdf-6.12.0/tests/test_reader.py')

The agent installed the fixed release of the package it was supposed to be fixing, read the patched source out of the wheel, and pulled the hidden test out of the source distribution. The fix is public. It lives on GitHub, on PyPI, in mirrors, in archives. Blocking one door does nothing when the house has ten.

At this point the pattern was obvious. You cannot out-scrub the public internet. As long as the container can reach a package index or a code host, a capable agent will find the published fix, because the published fix is the single most useful thing on the internet for the task you just handed it.

What honest actually looks like

So I cut the egress. Not with a full network block, which turned out to break the agent completely, but with a denylist that blackholes the package registries and code hosts while leaving the rest of the internet alone. Re-ran it. The agent tried its usual move:

WebFetch https://github.com/pypdf/pypdf/commits/main/pypdf/_reader.py -> ECONNREFUSED

Refused. And then, for the first time, I watched Opus actually work the problem. It read the parser source, reproduced the denial of service with a crafted PDF, and reasoned its way to the root cause:

entry_bytes = sum(int(entry_sizes[i]) for i in range(min(len(entry_sizes), 3)))
if entry_bytes <= 0:
    # all-zero /W widths -> no byte bound -> near-infinite loop
    max_entries = 1

That diagnosis is correct. The all-zero widths really are the problem. And the reward was 0.000.

It failed for two honest reasons. The hidden test did not want a clamp to one entry, it wanted the parser to raise a specific error with a specific message. And the change broke twelve other tests that were passing before. The agent understood the bug and still could not land the fix without breaking things. That is a real result. That is the number I actually wanted all along, and it was nowhere near 1.000.
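The gap between a correct diagnosis and a passing fix can be shown with a toy model. Everything here is hypothetical: `ToyStreamError`, both functions, and the error message are stand-ins, not pypdf's code or the real hidden test. It only illustrates why an oracle that expects an exception rejects a clamp.

```python
class ToyStreamError(Exception):
    """Hypothetical stand-in for the library's real error type."""


def entries_clamped(widths: list[int], size: int) -> int:
    """In the spirit of the agent's patch: bound the loop and keep parsing."""
    entry_bytes = sum(widths[:3])
    return 1 if entry_bytes <= 0 else size


def entries_strict(widths: list[int], size: int) -> int:
    """A raise-based fix: reject the malformed input outright."""
    if sum(widths[:3]) <= 0:
        raise ToyStreamError("invalid widths")  # message is made up
    return size


def hidden_test(fn) -> bool:
    """Pass only if fn raises ToyStreamError on all-zero widths."""
    try:
        fn([0, 0, 0], 10**9)
    except ToyStreamError:
        return True
    return False


if __name__ == "__main__":
    print(hidden_test(entries_clamped), hidden_test(entries_strict))
```

Both variants stop the runaway loop, but only the one that raises satisfies a test written against the published fix.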

The uncomfortable takeaway: every green score I got before was contamination. The genuine solve rate on this task was zero. That tracks with what the research benchmarks report: even strong agents land around twenty percent on real vulnerability repair once you take away the answer key.

One more thing, because it bit me first

Before any of the cheats, my very first sanity run scored the gold patch at 0.0. Not because the patch was wrong, but because the verifier ran the whole test suite and pypdf has dozens of unrelated tests that need network and fail in a slim container. The fix was to grade only the tests that matter, the ones that flip from fail to pass plus a bounded regression set, and ignore the rest. If your oracle cannot score the gold solution as a perfect 1.0 and a no-op as 0.0, nothing downstream of it means anything. It is the bug that silently invalidates an entire eval, so check it before you check anything else.
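Grading only the tests that matter can be sketched as a small reward function. This is a sketch under assumptions: the test ids are hypothetical, and the binary 1.0 or 0.0 reward is one reasonable reading of the description above, not necessarily the exact scoring rule in use.

```python
def reward(results: dict[str, bool], fail_to_pass: set[str], regression: set[str]) -> float:
    """Score a run from per-test outcomes, ignoring tests outside both sets.

    results maps a test id to whether it passed. A targeted test that is
    missing from results counts as a failure.
    """
    if not fail_to_pass:
        raise ValueError("an oracle needs at least one fail-to-pass test")
    targeted = fail_to_pass | regression
    return 1.0 if all(results.get(test, False) for test in targeted) else 0.0


if __name__ == "__main__":
    f2p = {"test_zero_widths"}        # hypothetical id: flips from fail to pass
    reg = {"test_read_basic"}         # hypothetical id: bounded regression set
    gold = {"test_zero_widths": True, "test_read_basic": True, "test_needs_network": False}
    noop = {"test_zero_widths": False, "test_read_basic": True, "test_needs_network": False}
    print(reward(gold, f2p, reg), reward(noop, f2p, reg))
```

The network-dependent test fails in both runs and changes neither score, which is the property the whole-suite verifier lacked.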

How the rest of the field handles this

I assumed I was the only one tripping over this. I was not. The established benchmarks have all hit the same wall and converged on a few patterns.

Pre-bake and air-gap. SWE-bench builds a task’s dependencies into the image at build time, then runs the agent with the network off. Nothing is installed at runtime and the agent only edits source, so the “install versus block” tension disappears. Prime Intellect does this at scale, and ARVO ships a pre-built image per vulnerability with every dependency pinned to an exact commit.
An egress allowlist when network is genuinely needed. CyberGym puts the agent on an internal network with no route out, and forces all traffic through a proxy that permits exactly the package manager and the model API and refuses the rest.
Hold the answer server-side. Give the agent the pre-patch code and a description; keep the gold patch and the hidden tests off the box and grade with an execution oracle the agent never sees. If the answer is not in the box, it cannot be read from the box.
Treat git history as a leak. Running git log --all or git log --grep to surface future fix commits, with zero network, is a documented, repeated incident on SWE-bench that the maintainers had to patch.

Where it still goes wrong

Three traps, each of which cost me a run.

A full network block breaks the agent. It is the first thing you reach for, and it broke mine outright: claude-code installs itself and calls its model over the network, so a hard block meant it never even started. The agent needs a narrow lane out, and the craft is making that lane wide enough for the model and the package manager but too narrow for the fix.
A denylist is whack-a-mole. Block GitHub and the agent reaches for a mirror, an archive, or an older release on PyPI that still has the relevant code. The robust version is an allowlist, and better still a package mirror frozen to the task’s date so even the index lacks the fix.
A prompt instruction is a nudge, not a control. I tried just telling the model to solve from the code and not fetch the patch. It looked things up less, but never stopped, and one run still opened with a web fetch before settling down to read the code. The model is always free to ignore the prompt, which is why the rule has to live in the environment.
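The difference between a denylist and an allowlist comes down to the default. The sketch below shows a default-deny host check of the kind an egress proxy might apply. The host names are assumptions: `api.anthropic.com` stands in for the model API and `pypi-mirror.internal` is a hypothetical frozen package mirror.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: the model API plus a frozen internal package mirror.
ALLOWED_HOSTS = {"api.anthropic.com", "pypi-mirror.internal"}


def is_allowed(url: str, allowed: set[str] = ALLOWED_HOSTS) -> bool:
    """Default-deny: permit a request only if its host is explicitly listed."""
    host = (urlsplit(url).hostname or "").lower()
    return host in allowed


if __name__ == "__main__":
    for url in (
        "https://api.anthropic.com/v1/messages",
        "https://github.com/py-pdf/pypdf/pull/3791.diff",
        "https://files.pythonhosted.org/packages/",
    ):
        print(url, is_allowed(url))
```

A mirror, an archive, or any host nobody thought to list is refused automatically, which is what makes the allowlist hold up where the denylist did not.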

What I would tell anyone building these

The short version, in the order I wish I had done it:

Make the verifier trustworthy first: gold scores 1.0, a no-op scores 0.0, and only the targeted tests count.
Strip the answer from the prompt: no CVE id, PR or commit references, or “fixed in version” lines.
Scrub the repo to the base commit so its own git history cannot reveal the fix.
Pre-bake dependencies and run offline by default, since a self-contained image needs no network.
If a task truly needs the network, use an allowlist, never a full block or a denylist.
Keep the gold patch and the hidden tests out of the container, applied only at scoring time.

Every one of these lands in the same place: the environment enforces it, the prompt never asks for it. A passing test is only as honest as the box it runs in, and building that box well turns out to be most of the job. We are baking these defenses into Repo2RLEnv so the trust lives in the environment by default, and it is open source if you want to point it at your own code.

References

Repo2RLEnv: github.com/huggingface/Repo2RLEnv
Harbor task runtime: github.com/harbor-framework/harbor
CVE-2026-48156 (pypdf): github.com/py-pdf/pypdf
SWE-bench Docker setup and isolation: swebench.com/guides/docker_setup
SWE-bench git-history leak (issue #465): github.com/SWE-bench/SWE-bench/issues/465
Cheating agents, observed contamination: debugml.github.io/cheating-agents
CyberGym (egress-allowlist proxy): github.com/sunblaze-ucb/cybergym, arxiv.org/abs/2506.02548
ARVO, reproducible vulnerabilities: arxiv.org/abs/2408.02153
Prime Intellect sandboxes: docs.primeintellect.ai/sandboxes/overview
SWE-rebench, temporal decontamination: arxiv.org/abs/2505.20411
