@shao__meng: Why do agents like Claude Code, Cursor, Codex, Aider, and Cline exhibit different behaviors even when their underlying models may be identical? @addyosmani argues that this is due to the "shell" above the model: the Harness, which includes "prompts, tools, context strategies, hooks, sandboxes, sub-agents, feedback loops, and recovery paths."
Addy Osmani's article argues that the performance differences between AI coding agents like Claude Code, Cursor, and Cline stem from their Harness, the layer of prompts, tools, and constraints around the model, rather than from the underlying model itself. It details best practices for harness engineering, including hooks, sandboxing, and context management, to bridge the gap between model capability and actual agent performance.
Agent = Model + Harness. Let's systematically define what the Harness is: anything that is "not the model itself" belongs to the shell:
* **Instruction Layer:** System prompts, CLAUDE.md, AGENTS.md, skill files, sub-agent instructions.
* **Capability Layer:** Tools, skills, MCP servers, and their descriptions.
* **Infrastructure:** File systems, sandboxes, headless browsers.
* **Orchestration Layer:** Sub-agent dispatch, task handoffs, model routing.
* **Execution Control:** Hooks, middleware (deterministic logic such as linting, context compression).
* **Observability:** Logs, traces, cost and latency monitoring.
A bare model is not an agent. It only becomes an agent when the shell provides it with state, tool execution, feedback loops, and enforced constraints.
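To make the split concrete, here is a minimal sketch of an agent loop in which everything outside the `model(...)` call belongs to the shell. All names are hypothetical illustrations, not any vendor's actual API:

```python
# Minimal agent loop sketch: everything except the model call is "harness".
# All names here are hypothetical illustrations, not a real SDK.

def run_agent(model, task: str, harness) -> str:
    messages = [
        {"role": "system", "content": harness.system_prompt},  # instruction layer
        {"role": "user", "content": task},
    ]
    while True:
        reply = model(messages, tools=harness.tool_schemas)  # the bare model
        if reply.tool_call is None:
            return reply.text  # no more actions: the agent is done
        error = harness.hooks.check(reply.tool_call)  # execution control
        if error:
            messages.append({"role": "tool", "content": error})
            continue
        # Capability layer: run the tool (Bash, file edit, MCP server, ...)
        result = harness.tools[reply.tool_call.name](**reply.tool_call.args)
        harness.log(reply.tool_call, result)                  # observability
        messages.append({"role": "tool", "content": result})  # feedback loop
        messages = harness.compact(messages)                  # context management
```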
**A Paradigm Shift: It’s Not a "Model Issue," It’s a "Configuration Issue"**
The industry’s default reaction is: Agent fails → Wait for the next generation of models. Harness Engineering rejects this default.
Every type of failure is a locatable engineering signal:
* Ignoring code standards? Write it into AGENTS.md.
* Executing destructive commands? Add a hook to block it.
* Losing focus on long tasks? Split into a planner + executor.
* Producing non-compilable code? Inject type-checking as a backpressure signal into the loop.
The same model, placed in a finely tuned shell, can perform far better than when running on a generic framework. The gap between the model's theoretical capability and what you actually see is primarily the "harness gap."
**The Most Critical Workflow: The Ratchet**
Every mistake becomes a permanent rule.
* An incident where "commented-out tests were committed" → Add "Never comment out tests" to AGENTS.md, implement a pre-commit hook to detect `.skip(` (sketched after this list), and have a reviewer sub-agent intercept it.
* Constraints are added only when real failures are observed, and removed only when a stronger model renders them redundant.
* Every line in the system prompt should be traceable to a specific historical failure.
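As a concrete instance of the ratchet, the `.skip(` check mentioned above could be a pre-commit hook along these lines (a minimal sketch; the article does not show the actual hook, and the pattern should be adapted to your test runner):

```python
#!/usr/bin/env python3
# Pre-commit hook sketch: block commits whose added lines disable tests via .skip(.
import subprocess
import sys

# Inspect staged changes only: that is what would actually be committed.
staged_diff = subprocess.run(
    ["git", "diff", "--cached", "--unified=0"],
    capture_output=True, text=True, check=True,
).stdout

violations = [
    line for line in staged_diff.splitlines()
    if line.startswith("+") and not line.startswith("+++") and ".skip(" in line
]
if violations:
    print("Commit blocked: added lines appear to disable tests via .skip(:")
    for line in violations:
        print("  " + line)
    sys.exit(1)  # non-zero exit aborts the commit
```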
**Implication: There is no universally optimal harness.**
A harness is shaped by the "failure history" of a specific codebase; it is engineering discipline, not just a framework.
**Design Methodology: Reverse-Engineering Components from Behavior**
1. **File System + Git — Persistent State**
Models can only operate on content within the context window. The file system serves as the workspace, staging area, and coordination surface for multiple agents. Git provides free version control, branch experiments, and rollback capabilities.
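As one illustration, a harness can lean on Git for free checkpoints around risky edits. The sketch below uses hypothetical helper names and assumes the agent works inside a Git repo with `pytest` as the verifier:

```python
# Sketch: Git as the harness's rollback mechanism around risky agent edits.
import subprocess

def git(*args: str) -> str:
    return subprocess.run(
        ["git", *args], capture_output=True, text=True, check=True
    ).stdout.strip()

def with_checkpoint(edit_fn) -> None:
    """Record HEAD before an agent edit; restore tracked files if verification fails."""
    checkpoint = git("rev-parse", "HEAD")
    try:
        edit_fn()                                      # the agent mutates files here
        subprocess.run(["pytest", "-q"], check=True)   # self-verification step
    except subprocess.CalledProcessError:
        # Discard the failed edit. Note: this also discards any pre-existing
        # uncommitted changes to tracked files; a real harness would stash first.
        git("checkout", checkpoint, "--", ".")
        raise
```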
2. **Bash + Code Execution — General-Purpose Tools**
The agent runs a ReAct loop (reason → act → observe → repeat). Instead of pre-building a tool for every action, let the agent assemble tools on the fly using Bash; agents generally perform well in shell environments.
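A minimal Bash tool for such a loop might look like the sketch below; bounding time and output size is the harness's job, not the model's:

```python
# Sketch: Bash as the agent's general-purpose tool in a ReAct loop.
import subprocess

def run_bash(command: str, timeout_s: int = 60, max_chars: int = 10_000) -> str:
    """Act: execute a shell command; return a bounded observation for the model."""
    try:
        proc = subprocess.run(
            ["bash", "-lc", command],
            capture_output=True, text=True, timeout=timeout_s,
        )
        observation = f"exit={proc.returncode}\n{proc.stdout}{proc.stderr}"
    except subprocess.TimeoutExpired:
        observation = f"error: command timed out after {timeout_s}s"
    return observation[:max_chars]  # cap what flows back into the context window
```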
3. **Sandbox + Default Toolchain**
Bash must run securely. A good sandbox comes pre-loaded with runtimes, test CLIs, and headless browsers, allowing the agent to "self-verify."
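One common shape is to route the Bash tool through a throwaway container. The sketch below assumes Docker and a hypothetical pre-built image (`agent-sandbox:latest`) loaded with the runtimes, test CLIs, and headless browser the agent needs to self-verify:

```python
# Sketch: wrap the Bash tool in a throwaway container so "act" steps are isolated.
import subprocess

def run_sandboxed(command: str, workdir: str, timeout_s: int = 120) -> str:
    proc = subprocess.run(
        [
            "docker", "run", "--rm",
            "--network=none",            # no outbound network by default
            "--memory=1g", "--cpus=1",   # bound resource use
            "-v", f"{workdir}:/work",    # mount the agent's workspace
            "-w", "/work",
            "agent-sandbox:latest",      # hypothetical pre-loaded toolchain image
            "bash", "-lc", command,
        ],
        capture_output=True, text=True, timeout=timeout_s,
    )
    return f"exit={proc.returncode}\n{proc.stdout}{proc.stderr}"
```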
4. **Memory + Search — Continuous Learning**
Models do not know the world after their training cutoff. AGENTS.md injects domain knowledge into every session; web search and MCP tools supplement real-time information.
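Injection itself can be as simple as the sketch below, which appends the repository's AGENTS.md to the base system prompt at session start:

```python
# Sketch: inject repo-level domain knowledge (AGENTS.md) into every session.
from pathlib import Path

def build_system_prompt(repo_root: str, base_prompt: str) -> str:
    rules = Path(repo_root, "AGENTS.md")
    if rules.exists():
        # Repo rules ride along verbatim so every session starts with them.
        return f"{base_prompt}\n\n# Repository rules (AGENTS.md)\n{rules.read_text()}"
    return base_prompt
```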
5. **Combating Context Rot**
As context fills, reasoning degrades. Three main techniques:
* **Compaction:** Intelligent compression and unloading of old context.
* **Tool-call Offloading:** Save long outputs (e.g., 2000-line logs) to disk, keeping only headers and footers in the context (see the sketch after this list).
* **Progressive Disclosure:** Reveal instructions and tools on demand, rather than loading everything at startup.
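A sketch of tool-call offloading, with hypothetical thresholds and scratch directory: long outputs are written to disk, and only the head, tail, and a retrieval path stay in context:

```python
# Sketch of tool-call offloading: persist a long tool output to disk and keep
# only the head and tail in the context window, with a path for later retrieval.
from pathlib import Path
import uuid

def offload_output(output: str, scratch_dir: str = ".agent/outputs",
                   keep_lines: int = 20, threshold_lines: int = 200) -> str:
    lines = output.splitlines()
    if len(lines) <= threshold_lines:
        return output                      # short outputs stay in context
    path = Path(scratch_dir) / f"{uuid.uuid4().hex}.log"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(output)                # full output remains retrievable
    head, tail = lines[:keep_lines], lines[-keep_lines:]
    elided = len(lines) - 2 * keep_lines
    return "\n".join(
        head
        + [f"... {elided} lines elided; full output saved to {path} ..."]
        + tail
    )
```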
6. **Long-Range Execution**
Addressing "premature stopping" and "planning failures":
* **Loops:** Intercept the model’s intent to exit and force it to continue toward the goal in a new context window (see the sketch after this list).
* **Planning:** Mandate the writing of a step-by-step plan file, checking after each step with a self-verification hook.
* **Splits:** Assign generation and evaluation to different agents to avoid the positive bias inherent in self-evaluation.
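The "Loops" pattern might look like the following sketch, where a deterministic completion check (hypothetical `is_done`, e.g. tests green and all plan items ticked) decides whether to relaunch the model in a fresh context:

```python
# Sketch of the "Loops" pattern: when the model tries to stop, a completion
# check decides whether to relaunch it in a fresh context window.
def run_until_done(agent_turn, is_done, goal: str, max_rounds: int = 10) -> None:
    for round_no in range(max_rounds):
        agent_turn(goal)   # one full session; the model will eventually try to exit
        if is_done():      # deterministic check, e.g. tests pass, plan items complete
            return
        # Intercept the exit: start a new context pointed back at the same goal.
        goal = f"{goal}\n\nPrevious attempt ended early (round {round_no + 1}); continue."
    raise RuntimeError("Goal not reached within the round budget.")
```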
7. **Hooks — The Enforcement Layer**
Hooks bridge the gap between "requested behavior" and "enforced behavior," attaching at lifecycle mount points: before tool calls, after file edits, before commits.
Success should be silent; failure should be verbose. If type-checking passes, it’s silent; if it fails, the error is directly injected into the loop for self-correction.
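Two hook sketches in that spirit, assuming a TypeScript repo checked with `npx tsc --noEmit`; the destructive-command patterns are illustrative, not exhaustive:

```python
# Sketch of two hooks: a pre-tool-call gate for destructive commands, and a
# post-edit type check that is silent on success and verbose on failure.
import re
import subprocess

DESTRUCTIVE = re.compile(
    r"\brm\s+-rf\b|\bgit\s+push\s+--force\b|\bdrop\s+table\b", re.IGNORECASE
)

def before_tool_call(tool: str, args: dict) -> str | None:
    """Return an error string to block the call, or None to allow it."""
    if tool == "bash" and DESTRUCTIVE.search(args.get("command", "")):
        return "Blocked: destructive command requires human approval."
    return None

def after_file_edit(path: str) -> str | None:
    """Run the type checker; inject errors into the loop only when it fails."""
    proc = subprocess.run(["npx", "tsc", "--noEmit"], capture_output=True, text=True)
    if proc.returncode == 0:
        return None  # success is silent
    return f"Type check failed after editing {path}:\n{proc.stdout}{proc.stderr}"
```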
8. **Rulebooks and Tool Selection**
* AGENTS.md remains the highest-leverage configuration point at the root of the repository. But treat it as a pilot’s checklist, not a style guide—concise, with every item backed by a failure history.
* Ten highly focused tools are always better than fifty overlapping ones.
* Tool descriptions enter the prompt, so unaudited MCP servers are equivalent to prompt injection risk surfaces.
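Since descriptions flow straight into the prompt, the tool registry is part of the attack surface. Below is a sketch of a registry that keeps the tool count small and leaves a hook for auditing third-party descriptions (the audit shown is a placeholder, not a real defense):

```python
# Sketch: tool descriptions are injected into the model's prompt verbatim,
# so the registry doubles as a security boundary.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tool:
    name: str
    description: str   # this text ends up inside the prompt
    fn: Callable

def audit(description: str) -> None:
    """Stand-in for a human or automated review of unaudited MCP descriptions."""
    if "ignore previous instructions" in description.lower():
        raise ValueError("Suspicious tool description; refusing to register.")

def render_tool_prompt(tools: list[Tool], max_tools: int = 10) -> str:
    if len(tools) > max_tools:
        raise ValueError("Prefer a few focused tools over many overlapping ones.")
    for tool in tools:
        audit(tool.description)  # review third-party text before it enters the prompt
    return "\n".join(f"- {t.name}: {t.description}" for t in tools)
```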
**What It Looks Like in Production**
Use the speculative breakdown of Claude Code’s architecture as a reference for a mature shell:
* Context Injection = Knowledge Layer
* Loop State = Memory Store + Worktree Isolator
* Destructive Action Hooks = Permission Gates
* Sub-Agent Context Firewalls = Multi-Agent Layer
* Tool Dispatch Registry = Unified Slot for MCP and Bash
**The Shell Won’t Disappear; It Will Shift**
Stronger models won’t eliminate the shell; they will shift it:
* The "context anxiety relief layer" spawned by older models has been largely phased out by newer ones.
* But as capability ceilings rise, new failure modes emerge.
* Every piece of scaffolding in the shell encodes "what the model currently cannot do independently." As models strengthen, remove the obsolete and build new scaffolding to reach the next horizon.
**Feedback from the Training Loop**
Models are typically post-trained inside specific harnesses, so they become exceptionally good at the actions those harnesses favor (file-system operations, Bash, sub-agent dispatch), which produces a degree of overfitting.
The best harness is the one customized for your specific tasks and workflows.
**Harness-as-a-Service**
The industry is shifting from "building on LLM APIs (providing completion)" to "building on Harness APIs (providing runtime)." SDKs deliver the loop, tools, context management, hooks, and sandboxes directly.
The new default paradigm: Choose a harness framework → Configure its core pillars → Focus solely on domain-specific prompt and tool design.
This turns debugging into "tuning a well-layered configuration surface" rather than "reinventing the entire agent architecture."
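In that paradigm, application code shrinks to roughly the shape below. `harness_sdk`, its classes, and every parameter are hypothetical stand-ins for whichever runtime you adopt, not a real package:

```python
# Pseudocode-level sketch of "Harness-as-a-Service": the SDK ships the loop,
# sandbox, context management, and hooks; you supply domain prompts and tools.
# "harness_sdk" and all names below are hypothetical, not a real package.
from harness_sdk import Agent, Sandbox

agent = Agent(
    model="some-frontier-model",              # routing handled by the runtime
    system_prompt=open("AGENTS.md").read(),   # your domain rules
    tools=[run_tests, deploy_preview],        # your focused, domain-specific tools
    sandbox=Sandbox(image="agent-sandbox:latest", network="none"),
    hooks={"before_tool_call": before_tool_call},  # e.g. the gate sketched earlier
)
result = agent.run("Fix the failing checkout test and open a PR.")
```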
**Future Directions**
* The similarity between top coding agents is already higher than the similarity between their underlying models—shell patterns are converging.
* Open questions are moving beyond "single agents": parallel orchestration of multi-agents, agents analyzing their own traces to fix harness-level failures, and environments that assemble tools on the fly.
* Next phase: The harness is no longer just a static configuration file; it increasingly behaves like a compiler.
**Related Reading**
* A detailed breakdown of Claude Code's six-layer architecture, revealing how it functions as a complex agent harness with input, knowledge, execution, integration, multi-agent, and observability layers beyond just the AI model.
* A paper analyzing Claude Code's architecture as an agentic coding tool, identifying five human values and thirteen design principles that inform its implementation, including safety systems, context management, and extensibility mechanisms; the study compares Claude Code with OpenClaw to show how different deployment contexts lead to different architectural solutions for common AI agent design challenges.
* Everything Claude Code, an open-source performance optimization system and framework for AI agent harnesses, providing configurations, skills, and security tools for applications like Claude Code and Cursor.
* An author's account of how Claude Code's advantage lies in its harness rather than the model itself, including an open-sourced rebuild of that harness for DeepSeek V4 to improve its coding capabilities.
* An analysis of the arXiv paper "Dive into Claude Code," discussing key engineering aspects of coding agent systems like Claude Code in real-world environments, including shell execution, file modification, and external service invocation.