@FakeMaidenMaker: https://x.com/FakeMaidenMaker/status/2055146731625447516
Summary
This article delves into the concept of Harness Engineering, noting that bare models achieve a 0% completion rate in complex engineering tasks. However, through layered context management, proper tool orchestration, and task structuring—along with other engineering infrastructure—AI coding efficiency can be significantly improved, enabling even small teams to build production-grade software. The article provides practical guidance across five core dimensions.
Cached at: 05/16/26, 07:35 AM
Understanding Harness Engineering at a Glance: Boost Your AI’s Efficiency 10x
The authors of SWE-Bench just released a paper with some chilling data.
They created ProgramBench, which requires AI to completely rebuild real open-source software projects from scratch — no internet access, no code similarity checks, only final behavior verification.
The result: Claude Opus 4.7, GPT-5.4, Gemini 3.1 Pro — all top-tier models scored 0% completion rate.
Original arXiv paper: ProgramBench: Can Language Models Rebuild Programs From Scratch?
Note: This doesn’t mean AI can’t write code. It can write a lot of code, and make functions look beautiful.
But ask it to build a real project from scratch that actually runs, and it will stuff all the logic into a single monolithic file — no modularity, no architecture, no long-term planning — ultimately failing behavioral verification.
1. Where LLMs Are Really Falling Short
LangChain ran an experiment: keeping the same gpt-5.2-codex model with its weights untouched, they optimized only the engineering structure around the model. Result: the coding agent’s score on Terminal-Bench 2.0 jumped from 52.8 to 66.5, climbing from outside the Top 30 into the Top 5.
OpenAI shared an even more striking case: a three-person team, over five months, drove Codex to write about 1 million lines of code and merged roughly 1500 PRs. And this was not a demo: it was a real software product with internal daily active users and external alpha testers.
The OpenAI team themselves said their focus has shifted from writing code to four things: designing the environment, clarifying intent, building feedback loops, and making the agent able to see, verify, and fix things on its own.
Put these two data points together and the conclusion is clear:
Bare models achieve 0% completion on serious engineering tasks. The same model, paired with a proper engineering infrastructure, enables a three-person team to build production-grade software.
The gap between them is Harness.
2. What Harness Engineering Is All About
The literal translation of “harness” is a horse harness.
A large model (Opus 4.7 / gpt 5.5) is like a wild horse — if you can’t control it with reins, its speed is meaningless.
Harness Engineering is the craft of building that harness — the technology to control the wild horse through the harness.
Hooks, Skills, MCP, CLAUDE.md / AGENTS.md, sub-agents, plugins, tools — you’ve probably heard of or used a few of these. Harness can also be understood as the collective term for designing them as a system.
Many people use AI like this: they see a new MCP and install it for fun; they see a hook example and copy it; they write a couple of lines in CLAUDE.md and forget about it. Then they hit a bug and don’t know which piece to fix, and the various features don’t work together. Too much harness also tires the horse, so the results stay limited.
The term “Harness Engineering” exists to force you to examine these scattered actions as a system.
It gives you a checklist: are your AI workflows complete across the 5 core dimensions? The goal is to keep your horse in top condition.
The industry has given a formula:
Agent = Model + Harness
That’s why you feel some tools are smoother to use, while others have strong models but feel brainless. The harness is just different.
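To make the formula concrete, here is a minimal sketch of an agent loop in Python. It is illustrative only: `run_agent`, the `final:` convention, and the `"name argument"` action format are assumptions made up for this example, not the interface of any real product. The model is just a function that returns the next action; everything else in the loop (tool lookup, transcript, step limit) is harness.

```python
from typing import Callable

# Assumption: the "model" is any function that maps the transcript so far
# to the next action string, either "toolname argument" or "final: answer".
Model = Callable[[list[str]], str]


def run_agent(model: Model, tools: dict[str, Callable[[str], str]],
              task: str, max_steps: int = 5) -> str:
    """Drive the model in a loop: ask for an action, run the tool, feed back the result."""
    transcript = [f"task: {task}"]
    for _ in range(max_steps):
        action = model(transcript)
        if action.startswith("final:"):
            return action[len("final:"):].strip()
        name, _, arg = action.partition(" ")
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool: {name}"
        # The observation goes back into the transcript so the next step can use it.
        transcript.append(f"{action} -> {observation}")
    return "stopped: step limit reached"
```

Swapping the model leaves the loop unchanged, and changing the loop leaves the model unchanged, which is one way to read why the same weights can behave so differently across tools.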
Looking at the evolution of AI engineering:
- The innermost layer is Prompt Engineering — concerned with how to instruct AI.
- The middle layer is Context Engineering — concerned with what information to give AI and when.
- The outermost Harness Engineering wraps both of those, adding tool orchestration, state persistence, verification loops, task decomposition, sub-agents, permission sandboxes, and rollback mechanisms — forming a complete engineering infrastructure.
Next, we’ll use the 5 dimensions of Harness to review your current Claude Code / Codex / OpenClaw / Cursor configuration, so you can better understand whether your agent needs extra harness or a lighter load.
3. The 5 Core Dimensions of Harness
Breaking down the scattered Harness concepts into concrete engineering actions, we get 5 dimensions:
Context Management, Execution Capability, Task Orchestration, Feedback Mechanisms, Architectural Guardrails
3.1 Context Management: Three-Layer Memory Architecture
AI forgets project rules during multi-turn conversations because of how the context window works. Each turn, the model sees a flat list of messages — everything you’ve said mixed with the current question, no distinction between project rules and casual chat. The longer the conversation, the more diluted the earlier constraints become.
OpenAI’s Codex engineering article hit the nail on the head: from the agent’s perspective, knowledge it can’t access at runtime doesn’t exist. Rules you speak aloud, discuss on Slack, or silently assume as a team — if they’re not present as files in the repository, AI won’t see them.
The core practice of Harness at this layer is to turn rules into files and structure them. Production practice is divided into three layers.
First layer: The AGENTS.md or CLAUDE.md in the project root — acts as the project map. Loaded at the top of every new conversation context. Contents include tech stack, directory structure, forbidden actions, commands to run before commit, UI style restrictions. Keep it around 100 lines.
Second layer: Detailed rule files split by topic in the docs/ directory, e.g., frontend.md, security.md, api-design.md. AI reads them on demand after seeing guidance in AGENTS.md, not loaded all at once.
Third layer: Built-in memory mechanisms in tools like Claude Code: lightweight index entries of ~150 characters each, always loaded; detailed files pulled on demand; raw records accessed only via search (e.g., grep). This design lets AI see the big picture without overloading the context.
OpenAI admitted they fell into a trap initially — they stuffed all rules into a single massive AGENTS.md of thousands of lines, which actually made AI overlook key information. They fixed it by switching to the map + detailed document two-layer structure.
Core principle: Layer persistent context to avoid excessive context space consumption that impairs model attention.
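The three layers can be sketched as three loading functions. This is a rough illustration under assumptions: the function names and the `logs/` directory are hypothetical, and real tools such as Claude Code implement their memory differently and internally.

```python
from pathlib import Path


def build_base_context(repo: Path) -> str:
    """Layer 1: the project map, loaded into every new conversation."""
    agents_file = repo / "AGENTS.md"
    return agents_file.read_text(encoding="utf-8") if agents_file.exists() else ""


def load_topic_doc(repo: Path, topic: str) -> str:
    """Layer 2: a detailed rule file under docs/, read only when the task needs it."""
    doc = repo / "docs" / f"{topic}.md"
    return doc.read_text(encoding="utf-8") if doc.exists() else ""


def search_raw_records(repo: Path, needle: str, max_hits: int = 5) -> list[str]:
    """Layer 3: raw records are never loaded wholesale, only searched (grep-like)."""
    hits: list[str] = []
    # Assumption: raw records live as *.log files in a hypothetical logs/ directory.
    for log_file in sorted((repo / "logs").glob("*.log")):
        for line in log_file.read_text(encoding="utf-8").splitlines():
            if needle in line:
                hits.append(f"{log_file.name}: {line.strip()}")
                if len(hits) >= max_hits:
                    return hits
    return hits
```

Only the first function runs unconditionally. The other two cost context only when they are called, which is the point of layering.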
3.2 Execution Capability: Giving the Model Hands and Feet
Models themselves can only output text. They can tell you to run npm install in the terminal, but they can’t actually run the command, see the output, or adjust based on errors. This pure-output ability prevents AI from closing the loop — every step requires a human intermediary.
Harness at this layer connects the model to a real operational environment. There are three levels, from basic to advanced.
Basic level: Terminal + File system + Browser. Terminal lets AI run commands, install dependencies, execute tests, check logs. File system lets it read code, modify files, write intermediate documents. Browser lets it see real pages, click buttons, take screenshots for verification.
Intermediate level: MCP. MCP is a standard protocol for AI to access external capabilities. Common integration targets include databases, search engines, web scrapers, design tools, monitoring systems.
Advanced level: Skills. Skills encapsulate multi-step workflows into reusable capability packages — e.g., writing a tech tweet, generating a weekly report, scraping competitor data from a certain website. When AI encounters a matching need, it calls the entire Skill instead of redesigning steps each time.
But you can’t stack tools indefinitely. Vercel hit a counterexample while building an internal text-to-SQL Agent. Initially, they built a bunch of specialized tools: schema lookup, query validation, error recovery. Success rate: 80%. Then they removed 80% of the specialized tools, letting Claude use only Unix basics like grep, cat, find, ls to read files and write SQL. Success rate: 100%, speed 3.5x faster, tokens saved 37%.
Why? More tools mean more choices for the model at each step, increasing the chance of picking the wrong tool or path.
Core principle: Choose each tool wisely — more is not better.
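One way to keep the tool surface small is an explicit allowlist. The sketch below is an assumption-laden illustration, not Vercel's implementation: `run_tool` and `ALLOWED_COMMANDS` are hypothetical names, and an allowlist like this is not a security sandbox on its own.

```python
import shlex
import subprocess

# Hypothetical allowlist: the handful of Unix basics named in the Vercel example.
ALLOWED_COMMANDS = {"grep", "cat", "find", "ls"}


def run_tool(command_line: str, timeout: float = 10.0) -> str:
    """Run a model-requested command only if its program is on the allowlist."""
    args = shlex.split(command_line)
    if not args or args[0] not in ALLOWED_COMMANDS:
        return f"refused: {args[0] if args else '(empty)'} is not an allowed tool"
    result = subprocess.run(args, capture_output=True, text=True, timeout=timeout)
    # Return both streams so the model can see errors and adjust.
    return result.stdout + result.stderr
```

Adding a tool then becomes a deliberate one-line change to the allowlist rather than something that accumulates by accident.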
3.3 Task Orchestration: Enabling AI to Execute Long Tasks
AI’s typical failure mode on long tasks is trying to one-shot a complete feature. With a limited context window, asking it to swallow a requirement like “a list page with search, filtering, and pagination” in one go is like forcing an engineer to grind straight through with no design doc, no task breakdown, and no iterations. Failure is inevitable.
Anthropic described this failure mode in detail in their article on long-task Harness: the model tries to write everything at once, runs out of context, realizes mid-way the earlier approach is wrong, goes back to fix previous code, and spirals into chaos.
Harness at this layer structures long tasks.
Step 1: Plan Mode. AI first outputs the task plan (which sub-tasks, how to implement each), then proceeds only after human confirmation. This brake shifts the cost of wrong direction to the planning phase.
Step 2: Stepwise execution. Do one sub-task at a time, verify each before moving on. Prevents context overflow chaos.
Step 3: Externalize state. After completing each feature, have AI write a document, usually called progress.md or plan.md. This document should detail: what features are done, what technical approach was used, key architectural decisions made, unresolved bugs, and remaining todos. This document is external memory across context windows. When a new conversation starts, AI reads it and immediately gets up to speed.
Step 4: Parallelization. Use sub-agents to run independent sub-tasks simultaneously.
Anthropic’s long-task engineering offers a more advanced pattern called the Ralph Loop — essentially a two-stage relay.
Stage 1: Initializer Agent. Runs only once at project start. Its job: set up the dev environment, break the entire requirement into a feature list, write the first progress.md, and make the initial git commit.
Stage 2: Coding Agent. Runs at the start of each new conversation window. Its fixed sequence:
- Read git log to see history.
- Read progress.md to see progress and remaining items.
- Pick the highest-priority item from the incomplete list.
- Complete it, run verification, commit to git.
- Update progress.md: what was done this round, what to do next.
Even if the AI session breaks, the model version changes, or the context window fills up, the next round’s AI reads git log and progress.md and immediately enters the flow.
Core principle: progress.md and git commits are AI’s save points. Only with a save can you safely take on long tasks.
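The "read progress, pick the next item, update progress" part of the loop can be sketched as two small helpers. The checklist format (`- [ ]` / `- [x]` lines in priority order) and the function names are assumptions for this example; the source does not prescribe a format for progress.md.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: progress.md keeps todos as a markdown checklist, highest priority first.
TODO = re.compile(r"^- \[ \] (.+)$")


def next_item(progress_file: Path) -> Optional[str]:
    """Return the first unchecked item, or None when everything is done."""
    for line in progress_file.read_text(encoding="utf-8").splitlines():
        match = TODO.match(line)
        if match:
            return match.group(1)
    return None


def mark_done(progress_file: Path, item: str, note: str) -> None:
    """Check the item off and append a short note for the next session to read."""
    lines = progress_file.read_text(encoding="utf-8").splitlines()
    updated = [f"- [x] {item}" if line == f"- [ ] {item}" else line for line in lines]
    updated.append(f"  - note ({item}): {note}")
    progress_file.write_text("\n".join(updated) + "\n", encoding="utf-8")
```

A fresh session only needs to call `next_item` to know where to resume, which is what makes the file work as a save point.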
3.4 Feedback Mechanisms: AI-Driven Test Development
Models don’t run code. They read code literally and judge based on whether it looks like it would work. If the code follows common patterns, variable names match, indentation is neat — it passes the “looks good” test. But whether code actually works is unrelated to how it looks; only a real run confirms it. That’s why AI often confidently says “fixed” but you open the project and find a pile of errors.
Harness at this layer automates verification instead of relying on humans. Three types of feedback:
Rule feedback: Let AI automatically run linter, typecheck, unit tests, integration tests before every commit. Any failure means incomplete.
Visual feedback: For UI tasks, have AI use tools like Playwright to open the browser, walk through user paths, and take screenshots as evidence of completion.
LLM review feedback: Have a separate AI review the just-written code — find logic holes, architectural issues, potential bugs.
Anthropic gave a specific number in their engineering blog: giving the model a feedback loop to verify its own work improves output quality by 2 to 3 times. This is the single most certain Harness investment.
There’s a counterintuitive design point here. Letting the code-generating AI review itself is far less effective than expected, because the generator naturally defends its own work. Anthropic’s experience: separate the generator and evaluator into two independent agents with different role configurations and prompts. Cross-review finds real issues. This mirrors traditional software engineering: “never review your own code.”
Core principle: AI saying “fixed” is useless. Passing tests means done.
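A rule-feedback gate can be as simple as "run every check, and call the task done only if all of them exit with 0". The sketch below assumes hypothetical names (`run_checks`, `is_done`, `CheckResult`); the actual check commands depend on your project and are not specified by the source.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    output: str


def run_checks(checks: dict[str, list[str]]) -> list[CheckResult]:
    """Run every check command; a check passes only on exit code 0."""
    results = []
    for name, command in checks.items():
        proc = subprocess.run(command, capture_output=True, text=True)
        results.append(CheckResult(name, proc.returncode == 0, proc.stdout + proc.stderr))
    return results


def is_done(results: list[CheckResult]) -> bool:
    """Done means at least one check ran and every check passed."""
    return bool(results) and all(r.passed for r in results)
```

In use, `checks` would map names such as "lint" or "unit tests" to whatever commands your project already runs, and the captured output of any failing check is what gets fed back to the agent.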
3.5 Architectural Guardrails: Stopping AI’s Bad Code Before Commit
AI writing code has a hidden problem: it mimics patterns already in the repository. Good code gets copied, but so does bad code. And each AI commit individually looks reasonable, but the cumulative effect is the project getting worse over time.
OpenAI’s Codex article specifically mentioned this: agents replicate existing patterns in the repo. If those patterns are unstable, inconsistent, or have bad code style, AI will amplify them.
Harness at this layer moves architectural rules from documents to executable code, automatically blocking bad code before it enters the main branch.
Basic level: Pre-commit hooks. Automatically run a batch of checks before git commit; non-compliant commits are blocked.
Second level: Architecture linter. Specifically checks architectural violations — e.g., UI layer must not directly access database layer, module dependencies must be unidirectional, file size exceeding threshold must be split. This is different from the syntax linter in 3.4; that checks syntax errors, this checks architectural errors.
Third level: CI gate as safety net. Even if local hooks are bypassed, CI runs the checks again to ensure the main branch always satisfies architectural constraints.
OpenAI also has a more aggressive practice called “garbage collection”: periodically run background Codex tasks that scan the entire codebase, find deviations from architectural principles, and automatically open small PRs to pay down technical debt. The logic: the faster AI writes code, the faster technical debt accumulates, so debt cleanup must also be automated.
Core principle: Anticipate AI’s limitations and set up guardrails in advance.
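As an illustration of an architecture linter, the sketch below uses Python's standard `ast` module to flag imports that cross a forbidden layer boundary. The rule table (`ui` must not import `db`) and the function name are hypothetical; OpenAI's own linters are not public in this form.

```python
import ast

# Hypothetical layering rule: modules in the "ui" layer must not import from "db".
FORBIDDEN = {"ui": {"db"}}


def layer_violations(source: str, layer: str) -> list[str]:
    """Return one message per forbidden import found in a module's source code."""
    banned = FORBIDDEN.get(layer, set())
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            # Compare only the top-level package, e.g. "db" in "db.models".
            if name.split(".")[0] in banned:
                violations.append(f"line {node.lineno}: {layer} must not import {name}")
    return violations
```

Wired into a pre-commit hook or CI job, a non-empty result would block the commit, and the message itself tells the agent which rule it broke.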
4. How Top Teams Build Their Harness
4.1 Anthropic / Claude Code: Textbook Long-Task Engineering
Claude Code is Anthropic’s own agent harness for driving Claude. It breaks Harness into 12 independent components, but instead of listing them all, we’ll highlight two designs most worth learning.
Design 1: Three-Layer Memory Architecture
Claude Code’s memory system has three tiers.
Top tier: Lightweight indices, about 150 characters each, always loaded into context. Their role is to ensure AI always knows what files, modules, and key conventions exist in the project. Because each entry is short, dozens or even hundreds of them won’t blow the context window.
Middle tier: Detailed files — e.g., README.md, ARCHITECTURE.md, API docs, module design docs. These are not loaded by default. AI sees a reference in the lightweight index like “see docs/architecture.md” and reads it only when needed. After reading, the information enters context and can be compressed out when no longer relevant.
Bottom tier: Raw records — e.g., full git log, full conversation history, full log files. This tier is never auto-loaded; AI can only retrieve actively via commands like grep or tail. Huge data volume but zero context pollution.
The core idea: context is a scarce resource and must be layered by “frequency of reading” and “importance”: always-needed on top, occasionally-needed in the middle, potentially-useful at the bottom.
Design 2: Ralph Loop for Cross-Context Long Tasks
For long tasks that take days and span multiple conversation windows, Anthropic designed a workflow called the Ralph Loop — a two-stage relay.
Stage 1: Initializer Agent. Runs only once at project start. It does four things: set up the dev environment, break the entire requirement into a feature list, write the first progress.md with todos, and make the initial git commit.
Stage 2: Coding Agent. Runs at the start of each new conversation window. Its fixed flow:
- Read git log for history.
- Read progress.md for current progress and incomplete list.
- Pick the highest-priority incomplete item.
- Complete it, run verification, commit to git.
- Update progress.md: what was done this round, what to do next.
Even if the AI session breaks, model version changes, or context window fills, the next round’s AI reads git log and progress.md and immediately enters the flow. The key isn’t AI intelligence; it’s that progress.md and git history together form an external memory spanning contexts.
What you can learn from this design:
- Your project should at least have a progress.md that clearly states what’s done, key architectural decisions, unresolved bugs, and next steps. Don’t keep it all in your head.
- Commit after each completed feature. Git history itself is a working record readable by AI.
- Layer context. Put what AI must always see into AGENTS.md, what it might need to look up into docs/, and leave the rest to search.
4.2 OpenAI / Codex: Rebuilding the Dev Environment for AI
Behind the OpenAI Codex team’s case of 3 people driving AI to write 1 million lines of code in 5 months, the most critical factor is not model strength. It is an engineering philosophy they called Codex legibility.
Codex legibility means the future codebase must be readable not just by humans, but also by agents. This implies all existing developer infrastructure (logs, monitoring, debugging tools, local environments) designed for human engineers must be adapted into a form that AI can directly use.
OpenAI’s specific practices include five items:
1. Each git worktree automatically starts an independent application instance. When Codex works on a change, the corresponding branch automatically starts a separate dev server. AI can open this instance, interact with it, see responses, and verify whether the change works as intended. No more manual npm run dev — AI does it itself.
2. Chrome DevTools Protocol integrated into the agent runtime. This gives AI browser-level debugging capabilities: inspect DOM, listen to network requests, inject JS, take screenshots. Reproducing a UI bug no longer requires a human to open the browser; AI can reproduce and locate it.
3. Logs, metrics, and traces exposed to AI querying. OpenAI allows Codex to directly use query languages like LogQL, PromQL, and TraceQL to access production monitoring data. This means when troubleshooting, AI doesn’t need a human to paste logs; it can grep logs, check metric anomalies, and trace spans itself.
4. Custom linter turning architectural constraints into executable rules. OpenAI converts rules like “which layers cannot call which,” “naming must be consistent,” “which patterns are forbidden” into automated linter rules. Any AI-written code violating them gets blocked.
5. Background garbage collection tasks. This is the most interesting. OpenAI runs periodic background Codex tasks that scan the entire codebase, find deviations from architectural principles, and automatically open small PRs to fix them. In other words, technical debt repayment is also automated, no longer relying on manual refactoring.
Together, these five items transform the entire development environment into a workbench that AI can read, operate, and verify.
What you can learn from this design:
- If your project’s logs are unreadable even for humans, they’re even worse for AI. Restructuring logs into structured formats (JSON, fixed schema) is the first step toward AI-friendliness.
- Move architectural principles from your head, Slack, and code review comments into linter rules. The real value isn’t just preventing violations; it makes architecture a learnable object — AI can learn to write compliant code by seeing linter error messages.
- During system design, ask yourself: Can AI see this state? If not, figure out how to expose it.
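For the first takeaway, structured logs with a fixed schema can be produced with Python's standard `logging` module alone. This is a minimal sketch: the key names (`ts`, `level`, `logger`, `message`) and the helper `make_logger` are choices made for this example, not a required schema.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of keys."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def make_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes one JSON line per event to the given stream."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)  # defaults to stderr when stream is None
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines from the root logger
    return logger
```

One JSON object per line means an agent can filter on `level` or `logger` with plain text tools instead of parsing free-form messages.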
4.3 Nous Research / Superpowers: Open-Source Reusable Skills Framework
Superpowers is Nous Research’s open-source Agent Skills framework. If Claude Code is Anthropic’s internal engineering practice, and Codex is OpenAI’s blueprint for rebuilding the dev environment, then Superpowers is an open-source version that individual developers can directly copy and use at home.
It includes several built-in workflows:
TDD workflow: AI first writes tests, then implementation. Only passes when tests pass. This makes the feedback mechanism from 3.4 default — AI must pass tests to deliver.
Two-stage Code Review: First, a generator agent writes code. Second, a reviewer agent reviews. The two agents use different role configurations and prompts, enforcing separation of generator and evaluator.
Sub-agent collaboration template: Includes standard flows for task decomposition, parallel execution, and result aggregation, ready to use out of the box.
Superpowers’ sister project, Hermes Agent, explores an even more advanced direction: Self-Evolution. This system uses two tools, DSPy and GEPA, to continuously optimize the Harness itself. Simply put, DSPy treats prompts and tool descriptions as optimizable parameters; GEPA uses a genetic algorithm-like approach to find better configurations from successful and failed trajectories in execution logs. The entire process does not retrain the model, only tunes the Harness. That means the Harness can automatically improve from tasks it has done before.
What you can learn from this design:
- If you don’t want to design a TDD workflow and two-stage review from scratch, directly copy Superpowers’ templates to get started.
- Encapsulate your most frequent task types (e.g., writing a blog post, analyzing a dataset) into Skills. Next time, AI calls the Skill itself without you re-teaching the flow.
- Keep an eye on the DSPy / GEPA technical line. Models getting stronger is a big trend, but Harness automatically improving might be a more accessible path for your workflow.
5. Three Counterintuitive Harness Principles
After seeing how top teams do it, here are the most common pitfalls.
Principle 1: More Harness Is Not Always Better
Intuitively, more tools make AI stronger. In reality, more tools increase the model’s choice space at each step, raising the probability of picking the wrong tool, wrong path, or calling the wrong interface.
Vercel’s internal text-to-SQL agent gave the most direct proof. Initially, they equipped the agent with a full set of specialized tools: schema lookup, query validation, error recovery. Success rate: 80%. Then they removed 80% of the specialized tools, letting Claude only use Unix basics like grep, cat, find, ls to read schema files and write SQL. Success rate rose to 100%, speed 3.5x faster, tokens saved 37%.
Manus also observed the same pattern: a heavily armed agent becomes dumber.
A practical judgment criterion: before adding any tool, ask yourself, “What specific behavioral gap does it solve?” If you can’t answer, don’t add it.
Principle 2: How Context Is Organized Matters More Than How Big It Is
Many people’s first reaction to Claude 4 Opus’s million-token context window is “let’s dump everything in.” That’s an expensive misconception.
Chroma Research ran a study on “Context Rot,” testing 18 popular models under long contexts. The conclusion: as input length grows, the model’s ability to use context becomes increasingly unreliable. Even with constant task difficulty, simply increasing context from 10k to 500k tokens causes significant performance degradation.
Stanford’s famous “Lost in the Middle” paper showed another phenomenon: models use information at the beginning and end of the context best, while neglecting information in the middle. This means where you place key information directly affects whether AI will use it.
Manus also added an economic angle. In their typical agent tasks, input-to-output token ratio is about 100:1. Most of the agent’s time and cost is spent repeatedly feeding context to the model, and the price difference between KV-cache hit and miss is 10x. For the same task, good context organization that allows cache reuse can make a monthly API bill differ by an order of magnitude.
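A quick back-of-the-envelope calculation shows how the cache hit rate moves the bill at a 100:1 input-to-output ratio. All prices below are made-up placeholders chosen only so that cached input is 10x cheaper than uncached input; they are not any vendor's actual pricing.

```python
def task_cost(input_tokens: int, output_tokens: int, cache_hit_rate: float,
              input_price: float = 3.0, cached_price: float = 0.3,
              output_price: float = 15.0) -> float:
    """Cost in dollars; prices are per million tokens and purely illustrative."""
    cached = input_tokens * cache_hit_rate
    uncached = input_tokens - cached
    return (uncached * input_price + cached * cached_price
            + output_tokens * output_price) / 1_000_000


# 100M input tokens and 1M output tokens, matching the 100:1 ratio in the text.
no_cache = task_cost(100_000_000, 1_000_000, cache_hit_rate=0.0)
mostly_cached = task_cost(100_000_000, 1_000_000, cache_hit_rate=0.9)
```

Under these assumed prices the same workload costs about 315 dollars with no cache hits and about 72 dollars at a 90% hit rate. The exact gap depends on real prices and on how much of the spend is output tokens.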
Together, these three things mean context design is a real engineering problem — not as simple as dumping everything in and letting the model sort it out.
Principle 3: Harness Should Shrink as Models Evolve
Every Harness component has an implicit assumption: the model can’t do this itself yet, so it needs an external patch. But models are improving fast, and old patches may become obsolete.
Anthropic gave a concrete example in their Harness Design blog. In early versions of Claude Code, they added an explicit “planning” step, forcing AI to output a plan before acting. After a certain new version of Claude was released, Anthropic found that planning ability had been internalized by the model — the external step became redundant overhead and was removed.
A practical judgment method: after each model version upgrade, revisit your Harness. Which components are still truly bridging a capability gap? Which are just historical baggage? Which validations can now be handled by the model itself?
A good Harness is just thick enough to cover what still lies outside the model’s capability boundary, and no thicker.
Conclusion: Harness Engineering Becomes a Core Competence
The result that all large models scored 0% on ProgramBench actually proves one thing from the opposite angle:
Harness Engineering is becoming increasingly important.
Right now, AI can write a large portion of your code. But AI won’t do your requirements analysis, task decomposition, context architecture design, tool selection (which to add or remove), verification standard setup, or architectural boundary enforcement.
The sum of these things is Harness Engineering.
It sounds like a new term, but it’s essentially the systematic migration of traditional software engineering methodologies to AI workflows:
- AGENTS.md corresponds to requirements documents and design documents
- Task orchestration corresponds to iterative sprint breakdown in agile
- Feedback mechanisms correspond to unit tests and code reviews
- Architectural guardrails correspond to code standards and security audits
These were always the fundamentals of a good engineer.
Next time your agent messes something up, revisit your Harness across the 5 dimensions and see where each stands.
In the vast majority of cases, AI is already smart enough. What’s missing are the reins to control it well.