@jasonzhou1993: https://x.com/jasonzhou1993/status/2069413003897012435
Summary
Crabbox is a new tool that gives AI coding agents isolated cloud environments to test and verify PRs, enabling them to work in parallel without conflicts and reducing the review bottleneck.
Cached at: 06/23/26, 04:11 PM
wtf is Crabbox & how it lets you ship 10x more PRs
I used to manage only 2 or 3 Claude Code sessions at the same time.
Since April this year that number went up fast, especially after we set up loops. At any given time now, I have at least 5 to 10 sessions running in parallel. Most of them I never prompted directly: they come from loops that find issues, pick up work, verify changes, and open PRs on their own.
Peter Steinberger (author of OpenClaw) has been working like this since the beginning of 2026.
That produces a volume of PRs that was not possible before. But it also creates a new problem. Every one of those PRs has to be reviewed and shipped to real customers, and each one carries the risk of breaking something.
So the bottleneck has moved.
It is no longer writing the code. It is getting code merged into the codebase.
This post covers the part of the harness that fixes that, and introduces Peter’s new side project, Crabbox, which resolves it.
Agent needs its own box to verify work
It is common practice now to have an agent spawn a sub-agent that tests the work with Playwright CLI and records evidence: a screenshot, or a video of the whole flow, attached to the PR. That is what makes an agent’s work easy to trust and merge. Instead of taking its word for it, I can see it working.
This was fine when I ran 3 or 4 agents. It breaks quickly once you have many parallel sessions, because they end up testing against the same environment and conflicting with each other.
To verify, each agent needs the dev server actually running on its own code. Even if you give each ticket its own git worktree, that only isolates writing code. Running the app many times over, locally, does not scale:
- Ports are often hard-coded, for good reasons. The second instance cannot start.
- One laptop has one Docker daemon, one database, one OS. Every “isolated” session is secretly sharing them. One agent trying a new schema can break every other session at once.
- A real production stack also eats RAM and CPU. Five of them will not fit.
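The port problem is easy to reproduce. The following is a minimal sketch, not anything from our stack: `try_bind` is a hypothetical helper, and the two sockets stand in for two dev servers that both insist on the same port.

```python
import socket
from typing import Optional


def try_bind(port: int, host: str = "127.0.0.1") -> Optional[socket.socket]:
    """Try to listen on host:port; return the socket, or None if the port is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
        return sock
    except OSError:
        # Typically "address already in use": another process owns the port.
        sock.close()
        return None


# First "dev server": let the OS pick any free port (port 0 means "any").
first = try_bind(0)
port = first.getsockname()[1]

# Second "dev server": hard-codes the same port, so the bind is refused.
second = try_bind(port)
print("second instance started:", second is not None)

first.close()
```

A git worktree does nothing about this: both worktrees still compete for the same port on the same machine.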
The only thing that scales is to stop running everything on one laptop.
Each agent should get its own isolated environment in the cloud: its own machine, its own database, its own dev server. The sandboxes do not touch each other.
@SToneoneX actually hand-rolled a version of this for our platform @SuperDesignDev which worked like magic. We ran it on Fly.io: a Firecracker VM with the full stack inside (local Supabase via docker-in-docker, Redis, and the dev servers), booted from a base image with a persistent volume. We added an on-box orchestrator to bring it all up, a CDP browser to drive it, suspend and resume so a box came back hot in about 3 seconds, and an idle watchdog that powered it off after 45 minutes so we never paid for a box someone forgot to stop. It unlocked a lot.
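To make the idle watchdog idea concrete, here is a rough sketch of the logic under stated assumptions. It is not the code we ran on Fly.io: the `IdleWatchdog` class, its `touch` and `check` methods, and the injected `power_off` callback are all hypothetical names for illustration. Only the 45-minute limit comes from the setup described above.

```python
import time
from typing import Callable

IDLE_LIMIT_SECONDS = 45 * 60  # the 45-minute limit mentioned above


class IdleWatchdog:
    """Powers a box off once it has seen no activity for `limit` seconds."""

    def __init__(
        self,
        power_off: Callable[[], None],
        limit: float = IDLE_LIMIT_SECONDS,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._power_off = power_off
        self._limit = limit
        self._clock = clock
        self._last_activity = clock()
        self.stopped = False

    def touch(self) -> None:
        """Record activity, e.g. a command ran or a browser connected."""
        self._last_activity = self._clock()

    def check(self) -> bool:
        """Call periodically; powers off once and returns True when idle too long."""
        idle_for = self._clock() - self._last_activity
        if not self.stopped and idle_for >= self._limit:
            self._power_off()
            self.stopped = True
        return self.stopped


# Demo with a fake clock so nothing actually waits 45 minutes.
now = [0.0]
events = []
dog = IdleWatchdog(power_off=lambda: events.append("off"), clock=lambda: now[0])

now[0] = 44 * 60
still_up = not dog.check()        # 44 idle minutes: box stays up
dog.touch()                       # an agent ran something, timer resets
now[0] = 88 * 60
still_up_again = not dog.check()  # only 44 minutes since the last activity
now[0] = 89 * 60
powered_off = dog.check()         # 45 idle minutes reached: power off
```

Injecting the clock and the power-off action keeps the policy testable without a real box.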
But it still had a sharp edge. To get code onto the box, it pulled the branch from GitHub with git fetch. You had to push first. Uncommitted working-tree changes could not be verified at all.
So the moment an agent tested on the box, found a bug, and fixed it on your local machine, you were stuck. You had a dirty file locally, and a box that only knew about what was already pushed.
The normal commit, push, CI flow does not work well here. Your repo fills up with junk commits. And you do not want to rebuild the box from scratch every time either.
What you actually want is simple.
Make a change, and re-test in seconds.
That is where we saw Peter’s new side project, Crabbox: https://github.com/openclaw/crabbox
How Crabbox works
Crabbox lets an agent warm up a box in the cloud, sync the dirty diff from your local worktree, and run the test in real time. Three commands:
1. crabbox warmup spins up a box.
2. crabbox run –
3. crabbox stop turns the box off and deletes it.
That is the whole loop. After an agent finishes a task it can: warmup, run setup to install deps and start the dev server, run your tests or drive Playwright itself, and if it hits a bug, fix it locally and run again. The latest change syncs automatically, so it always tests the newest version. Then stop.
The “fix locally, run again, latest change auto-syncs” part is the bit that makes it click. No commit spam, no box rebuilds, and you are always testing the latest version in seconds.
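The shape of that loop can be sketched as follows. This is an illustration only and deliberately avoids guessing at Crabbox's command-line syntax: `warmup`, `run`, `stop`, and `fix_locally` are hypothetical callables you would wire to the real commands and to your agent, and `npm test` is a placeholder test command.

```python
from typing import Callable


def verify_on_box(
    warmup: Callable[[], None],
    run: Callable[[str], bool],
    stop: Callable[[], None],
    fix_locally: Callable[[], None],
    test_command: str,
    max_attempts: int = 3,
) -> bool:
    """Warm a box, set it up, then test, fix, and re-test until green or out of attempts."""
    warmup()
    try:
        if not run("./setup.sh"):  # install deps and start the dev server
            return False
        for attempt in range(max_attempts):
            if run(test_command):
                return True
            if attempt < max_attempts - 1:
                # Edit the local worktree; the next run is assumed to sync the change.
                fix_locally()
        return False
    finally:
        stop()  # always release the box, even when setup or tests fail


# Demo with fakes: setup succeeds, the first test run fails, the second passes.
log = []
results = iter([True, False, True])


def fake_run(command: str) -> bool:
    log.append(f"run {command}")
    return next(results)


ok = verify_on_box(
    warmup=lambda: log.append("warmup"),
    run=fake_run,
    stop=lambda: log.append("stop"),
    fix_locally=lambda: log.append("fix"),
    test_command="npm test",
)
```

The `finally` block is the design choice that matters: a box that is never stopped keeps costing money.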
The setup is three files
- A Dockerfile that encapsulates everything your local machine has: Node, your package manager, any CLIs (in my case the Supabase CLI), a browser. You can prompt an agent to write it.
- A .crabbox.yaml. This is the config for every Crabbox command. It defines the sandbox provider, which files to skip syncing, and which environment variables to forward.
- A setup.sh, so the agent runs one script to bring the whole dev server up, instead of stepping through commands by hand.
On the sync excludes: you usually do not need to list node_modules, .next, or .env*, because they are already gitignored. The real value is excluding heavy folders you do not need on the box (mine has an evidence folder). The environment variables you list get pushed straight to the box over the encrypted SSH connection. They never go through a broker, and they are never written into the synced repo. Relatively safe.
The whole thing is a handful of files. This is the entire footprint that makes a codebase verifiable by any agent in an isolated cloud box:
The three files above do the real work. cbx.sh is a small convenience that wraps the warmup-and-poll dance into one command. And the skill is what lets me just say “test this via Crabbox” and have the agent know the whole sequence on its own: warm a box, run setup, drive Playwright, bring evidence back, stop.
That is the entire change. The app itself stays untouched.
Getting evidence back out
Crabbox has good primitives for proof:
- --artifact-glob on any run auto-downloads matching files after the command finishes. This is useful when the agent writes an e2e script that drops a video or screenshot. The file lands back on your machine automatically.
- crabbox artifacts collect takes a screenshot of the box’s screen. artifacts video screen-records the session.
- crabbox artifacts publish uploads straight to your S3 bucket, so you can drop the image or video inline as a PR comment.
One note from experience: which of these are available depends on the provider. More on that at the end.
One flag worth knowing: --no-sync
By default, every crabbox run first syncs your dirty diff to the box. That is the whole point. But you do not always want it.
Use --no-sync whenever the command only reads or drives a box that already has your code, and you have not changed anything since the last run:
- reading a file, tailing logs, checking status
- driving Playwright CLI in the box (you are testing code that is already there, you do not want every click to re-upload)
- polling a long-running command, where a resync mid-run can stomp on files underneath it
Rule of thumb: sync when you changed code and want to test the new version. Use --no-sync for anything that just reads or drives the running box.
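If you want to encode that rule of thumb in a wrapper script or a skill, it could look roughly like this. The `CommandKind` enum and both helper functions are hypothetical names for illustration. The only Crabbox detail used is the --no-sync flag itself.

```python
from enum import Enum
from typing import List


class CommandKind(Enum):
    TEST_NEW_CODE = "test_new_code"          # run tests against the latest local changes
    READ_ONLY = "read_only"                  # read a file, tail logs, check status
    DRIVE_BROWSER = "drive_browser"          # Playwright clicks against code already on the box
    POLL_LONG_RUNNING = "poll_long_running"  # check on a command that is still running


def needs_sync(kind: CommandKind, changed_since_last_run: bool) -> bool:
    """Sync only when code changed and the goal is to test the new version."""
    return kind is CommandKind.TEST_NEW_CODE and changed_since_last_run


def extra_flags(kind: CommandKind, changed_since_last_run: bool) -> List[str]:
    """Flags to add to a run: nothing (default sync) or --no-sync."""
    return [] if needs_sync(kind, changed_since_last_run) else ["--no-sync"]


# A test run after a local fix syncs; everything else leaves the box untouched.
test_flags = extra_flags(CommandKind.TEST_NEW_CODE, changed_since_last_run=True)
log_flags = extra_flags(CommandKind.READ_ONLY, changed_since_last_run=True)
poll_flags = extra_flags(CommandKind.POLL_LONG_RUNNING, changed_since_last_run=False)
```

Note that polling never syncs here, even when local files changed, because a resync mid-run is exactly the case to avoid.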
/Crabbox-setup skill
I packaged all of this into a skill: the Crabbox testing suite, plus the wider codebase harness it plugs into. It is open, and you can grab it here:
https://github.com/AI-Builder-Club/skills
Point it at your repo and it scaffolds the pieces above (the Dockerfile, .crabbox.yaml, setup.sh, and the crabbox-test skill), adapted to your stack.
And if you want the full walkthrough, we have a step-by-step workshop that goes into how this fits into running compounding loops at @aibuilderclub_