@dongxi_nlp: https://x.com/dongxi_nlp/status/2065200644802101633
Summary
The article argues that in a Coding Agent, tool invocations should be treated as contracts rather than simple functions. It emphasizes the Harness's adjudication role in validation, permissions, and lifecycle management, and walks through the composition and lifecycle of a tool contract.
Cached at: 06/12/26, 10:56 AM
Tools Are Contracts, Not Functions
Harness series, part 3: Tools Are Contracts.
If a Coding Agent has a world, then every tool call is a request, made through generated text, to change the state of that world.
Tool calls can easily appear too simple: the model outputs JSON, the Harness parses it, some local function gets executed, and the result is placed back into the next prompt.
That’s the tiny agent version.
But a coding agent’s tool call is not a simple function.
A tool call comes from generated text; it is requesting access to the workspace, shell, network, transcript, or another Agent.
This difference changes the entire state and environment of the Coding Agent.
The model can ask. The harness decides.
The Proposal and Contract
Imagine the model outputs a JSON tool call that asks to write a file.
It looks structured. It is still just a proposal.
The model producing seemingly valid JSON does not automatically grant write access.
The Harness still must answer:
- Does this tool exist?
- Do the args match the schema?
- Is the path within the workspace?
- Is this a create, overwrite, or patch?
- Has the model recently read the existing file?
- Is the file-state baseline fresh?
- Does this action require approval?
- What should be shown to the human before approval?
- How much of the result output can safely enter the context?
- Which transcript and state need updating after execution?
This string of questions is the tool contract.
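Two of the questions above, path containment and baseline freshness, can be sketched in a few lines. This is a minimal illustration, not the article's implementation: the helper names `inside_workspace`, `fingerprint`, and `baseline_is_fresh` are hypothetical, and the sketch assumes the Harness stores a content hash each time the model reads a file.

```python
import hashlib
from pathlib import Path
from typing import Optional


def inside_workspace(workspace: Path, candidate: str) -> bool:
    """Return True if candidate resolves to the workspace or a location under it."""
    root = workspace.resolve()
    target = (root / candidate).resolve()  # resolves ".." and symlinks
    return target == root or root in target.parents


def fingerprint(path: Path) -> str:
    """Hash the file contents; assumed to be recorded when the model reads the file."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def baseline_is_fresh(path: Path, recorded: Optional[str]) -> bool:
    """An existing file may be written only if the recorded hash still matches."""
    if not path.exists():
        return True  # creating a new file needs no baseline
    return recorded is not None and fingerprint(path) == recorded
```

If the file changed on disk after the model last read it, `baseline_is_fresh` returns False and the Harness can ask the model to re-read before editing.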
The Naive Design
A naive design treats tools as a function map:
tools = {
    "read_file": read_file,
    "write_file": write_file,
    "run_shell": run_shell,
    "search_code": search_code,
}
This code is short, so it looks clean, but it packs many responsibilities into a single call.
Parsing, schema validation, path safety, permission policy, sandbox choice, execution, output clipping, transcript recording, and state updates are all crammed into tool(...).
A demo can be written this way. But a coding agent that modifies real repositories needs stronger boundaries. The problem isn’t the function itself; the problem is that a tool boundary has different trust rules.
Function call vs Tool call
In a normal program, a function call is an implementation detail.
In a coding agent, a tool call is generated text requesting a change to a real-world location.
What A Tool Contract Contains
A useful tool contract cannot consist of just a name and a handler. It should describe:
tool:
  name: patch_file
  description: Apply a surgical edit to an existing file.
  args:
    path: string      # absolute or workspace-relative path
    old_text: string  # exact substring to replace
    new_text: string  # replacement content
  returns:
    success: boolean
    message: string
  policies:
    path: must be inside workspace
    state: file must have a fresh baseline
    size: patch must be smaller than N tokens
  lifecycle:
    - parse
    - schema
    - path
    - policy
    - exec
    - bound
    - record
Some fields are for the model. Some belong only to the Harness.
The model needs enough information to make a correct request. The Harness needs enough metadata to make a correct ruling.
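One way to keep the two audiences apart is to hold the whole contract in a single record and expose only a filtered view to the model. The sketch below assumes Python 3.9+; the class `ToolContract`, its field names, and `model_view` are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToolContract:
    # Model-facing fields: enough to form a correct request.
    name: str
    description: str
    args_schema: dict[str, str]
    # Harness-only fields: enough to rule on the request.
    handler: Callable[..., Any]
    mutates: bool = False
    requires_fresh_baseline: bool = False
    max_result_tokens: int = 2000
    lifecycle: tuple[str, ...] = (
        "parse", "schema", "path", "policy", "exec", "bound", "record",
    )

    def model_view(self) -> dict[str, Any]:
        """Return only what the model should see in its tool listing."""
        return {
            "name": self.name,
            "description": self.description,
            "args": dict(self.args_schema),
        }
```

The handler, policy flags, and result limit never enter the prompt; the model sees only the name, description, and argument schema.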
The Lifecycle
A minimal viable lifecycle:
parse → schema → path → policy → exec → bound → record
Each step intercepts a different type of problem.
- parse handles malformed output
- schema validation handles missing fields and wrong types
- path validation handles workspace escape
- policy handles risky actions, approval, sandboxing, denial
- execution runs the real handler
- bounding prevents a single command or file read from flooding the next prompt
- recording enables recovery and auditing in the next round
So the mental model of “just call the function” is insufficient; the function is only a tiny piece of the lifecycle.
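The lifecycle can be sketched as one function in which the handler call is a single line among many. This is a simplified illustration under stated assumptions: `run_tool_call`, `Rejected`, and the shape of the `tools` registry (a dict with `required`, `mutates`, and `handler` keys) are hypothetical, every tool is assumed to take a string `path` argument, policy is reduced to one `allow_mutation` flag, and handler exceptions are not caught.

```python
import json
from pathlib import Path


class Rejected(Exception):
    """A lifecycle stage refused the proposal."""


def run_tool_call(raw, tools, workspace, transcript,
                  allow_mutation=False, max_chars=200):
    """Run one proposal through parse, schema, path, policy, exec, bound, record."""
    try:
        try:
            call = json.loads(raw)                                # parse
        except json.JSONDecodeError as err:
            raise Rejected(f"parse: {err.msg}")
        if not isinstance(call, dict):
            raise Rejected("parse: expected a JSON object")
        name = call.get("tool")                                   # schema
        spec = tools.get(name) if isinstance(name, str) else None
        if spec is None:
            raise Rejected("schema: unknown tool")
        args = call.get("args")
        if not isinstance(args, dict) or any(
            not isinstance(args.get(key), str) for key in spec["required"]
        ):
            raise Rejected("schema: missing or malformed args")
        root = workspace.resolve()                                # path
        target = (root / args["path"]).resolve()
        if root not in target.parents:
            raise Rejected("path: outside workspace")
        if spec["mutates"] and not allow_mutation:                # policy
            raise Rejected("policy: mutation not approved")
        output = str(spec["handler"](target, args))               # exec
        if len(output) > max_chars:                               # bound
            output = output[:max_chars] + " [truncated]"
        status = "ok"
    except Rejected as err:
        status, output = "rejected", str(err)
    # record: rejections are logged as well as successes
    transcript.append({"request": raw, "status": status, "output": output})
    return status, output
```

Rejections are recorded in the transcript too, so a later round can see why a request failed rather than silently repeating it.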
Validate Before Approval
Validation should happen before approval.
- If patch_file points outside the workspace, reject first.
- If old_text is missing or ambiguous, reject first.
- If the tool call immediately repeats the previous failed request, reject or retry first.
- If the write target is an existing file but lacks a fresh baseline, reject first.
Approval is a product surface.
Users should not be asked to judge an already invalid request. When approval is needed, the summary should remain bounded.
The approval prompt for a file edit should show the affected path and the change shape, avoiding dumping unbounded raw content.
The Harness can later show a diff, count, or preview. The approval boundary should be small and clear.
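A rough sketch of both halves for a patch_file request: validate first, then build a small approval summary. The function names `validate_patch` and `approval_summary` and the 60-character preview limit are assumptions made for illustration, and "ambiguous" is interpreted here as old_text matching more than once.

```python
from typing import Optional


def validate_patch(file_text: str, old_text: str) -> Optional[str]:
    """Return a rejection reason, or None if the patch may proceed to approval."""
    if not old_text:
        return "old_text is missing"
    matches = file_text.count(old_text)  # non-overlapping occurrences
    if matches == 0:
        return "old_text not found"
    if matches > 1:
        return f"old_text is ambiguous ({matches} matches)"
    return None


def approval_summary(path: str, old_text: str, new_text: str,
                     preview_chars: int = 60) -> str:
    """Describe the edit by path and change shape, with clipped previews."""
    def clip(text: str) -> str:
        flat = text.replace("\n", "\\n")  # keep each preview on one line
        if len(flat) <= preview_chars:
            return flat
        return flat[:preview_chars] + "..."

    return "\n".join([
        f"patch_file {path}",
        f"- {len(old_text)} chars: {clip(old_text)}",
        f"+ {len(new_text)} chars: {clip(new_text)}",
    ])
```

The summary stays three lines long no matter how large the patch is, so the human is judging a shape rather than scrolling through raw content.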
Bounded Results Are Part Of Safety
Tool output becomes context. Context influences Agent behavior.
- If a search returns 10,000 results, the model does not see the code any more clearly.
- If run_shell emits a huge log, the next round may lose the real user request.
- If read_file repeats the same unchanged file every round, useful context is displaced by duplicate text.
So the tool contract needs result limits:
max_result_tokens: 2000
This isn’t just about token cost; it concerns the accuracy of the model’s working set.
The Harness should decide which result evidence goes into the transcript, which becomes durable state, and which remains as an external artifact reference.
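A minimal sketch of enforcing such a limit, assuming no tokenizer is available and approximating one token as four characters (a rough heuristic, not a property of any particular model). The function name `bound_result` and the head/tail split are illustrative choices.

```python
def bound_result(text: str, max_result_tokens: int = 2000,
                 chars_per_token: int = 4) -> str:
    """Keep the head and tail of an oversized result and note what was dropped."""
    if max_result_tokens <= 0 or chars_per_token <= 0:
        raise ValueError("limits must be positive")
    budget = max_result_tokens * chars_per_token  # approximate character budget
    if len(text) <= budget:
        return text
    head = budget * 2 // 3      # most of the budget goes to the start
    tail = budget - head        # the rest keeps the end, e.g. a final error line
    omitted = len(text) - budget
    return f"{text[:head]}\n[... {omitted} characters omitted ...]\n{text[-tail:]}"
```

Keeping the tail matters for shell output, where the last lines often carry the error; the omitted-count marker tells the model that what it sees is partial evidence.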
When evaluating a coding-agent tool, don’t start by asking:
Can the model call it?
Instead, ask:
- Is the argument schema precise?
- What can this tool read or mutate?
- Which paths, commands, and resources are allowed?
- What kind of fresh state is needed before running?
- What policy determines allow, ask, sandbox, deny?
- What result evidence will come back to the model?
- What durable state changes after success or failure?
- How will this call and result be paired later?
The model can request. The Harness is responsible for ruling.
A Tool Call is only executed after the Harness has approved it.
This process is the contract.
In short: tools are contracts.