@dongxi_nlp: https://x.com/dongxi_nlp/status/2071729771126346093


Summary

This article explains why the harness (the runtime framework) is the real product in a coding agent, and analyzes in detail the six key components and the boundary controls it must provide.

https://t.co/HdGsN4cU5N

Cached at: 06/30/26, 07:36 AM

Coding Agent Harness: A Collection of Seven Articles

01. The Harness Is The Product

Why the most important part of a coding agent is often the least visible.

Every time a coding agent is discussed, people inevitably start with the model.

“Which model? How big is the context? How good is it at writing code?”

These questions matter, but they’re no longer the first ones we should ask. The first question should be:

What exactly does the harness do?

At its most basic, an LLM predicts text. A reasoning model can do more: it can follow structure, output tool calls, and work within a protocol.

A coding agent goes further: it places the model inside a runtime. This runtime can inspect a real repo, request tools, edit files, run checks, remember what happened, and continue over multiple turns.

This runtime layer is the harness.

For a coding agent, the harness itself is the product.

The Naive Version

A mini version of a coding agent often looks like this:

Naive agent loop:

user request -> big prompt -> model response -> run whatever tool it asked for -> paste result back into prompt -> repeat

This loop looks simple because validation, permissions, result limits, and state updates are hidden.

It’s a useful sketch, but such a mini agent quickly runs into problems. For example:

  • What happens when the model wants to edit a file it never read?
  • What happens when a shell command touches files outside the workspace?
  • What happens when a tool returns 50,000 lines of output?
  • What happens when the file on disk has changed, but the transcript still holds the old file read?
  • What happens when the tool result doesn’t match the tool call that produced it?

This system hasn’t become a coding agent yet.

What The Harness Owns

The model can propose. The harness decides.

From an architecture perspective, the critical surface goes beyond this line:

Interface is not the agent:

model(prompt) -> answer

That’s just one interface. The real agent also needs tool lifecycle, permission decisions, transcripts, file truth, and next-turn state.

It’s closer to:

What the harness actually owns:

input routing -> message design -> prompt assembly -> model output -> parser -> tool validation -> permission policy -> execution -> bounded result -> transcript + state update -> next turn

Every arrow is a boundary. The model proposes, the harness controls the path to real-world effect.

Every arrow is a place where the harness can protect the user; it’s also a place where it can silently go wrong.

Six Things A Coding-Agent Harness Must Do

Here, we compress a Coding Agent into six core components:

  • Live repo context
    The agent shouldn’t start from an empty prompt. It needs to know the workspace, current files, relevant project docs, and which repo state is safe to expose.

  • Prompt shape
    Context quality often looks like model quality. Stable prefixes, clear tool contracts, the current request, and controlled history can cause behavior differences larger than switching models.

  • Structured tools
    Tools shouldn’t be treated as helper functions. They are contracts between model proposals and real side effects.
    The harness parses arguments, validates paths, checks policies, executes actions, trims output, and records what happened.

  • Context reduction
    If the harness blindly appends everything, the model will eventually see too much, too little, or the wrong things.
    Good context is a projection, not an ever-growing blob.

  • Transcripts and memory
    Transcript answers: “What happened?”
    Working state answers: “What’s important now?”
    These are two different tasks and should be treated differently.

  • Delegation
    Subagents aren’t magic parallelism. They should be bounded workers: scoped tools, isolated state, distilled results.
    Delegation becomes truly useful when it acts as a context firewall.

The Main Loop

A coding agent is an observe-act loop, but the truly valuable part happens between observe and act.

The model is inside the loop:

while tool_steps < max_steps:
    prompt = build_prompt(
        workspace=workspace_state,
        tools=tool_schemas,
        memory=session_memory,
        file_state=file_state,
        history=projected_history,
    )
    output = model.complete(prompt)
    action = parse_model_output(output)
    if action.kind == "final":
        return action.text
    result = validate_authorize_execute_record(action)
    projected_history = update_context(result)
    tool_steps += 1  # count the step so max_steps actually bounds the loop

The model is inside the loop, but the harness owns the loop.

This distinction is critical.
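To make the ownership concrete, here is a toy, runnable version of such a loop. It is only a sketch under stated assumptions: `run_agent`, `scripted_model`, the dict-shaped actions, and the 500-character result cap are all hypothetical stand-ins, and the "model" is a scripted function rather than a real LLM call.

```python
from typing import Callable


def run_agent(
    model: Callable[[list[str]], dict],
    tools: dict[str, Callable[..., str]],
    request: str,
    max_steps: int = 5,
) -> str:
    """Toy observe-act loop: the harness, not the model, owns every step."""
    history = [f"user: {request}"]
    for _ in range(max_steps):  # the harness enforces the step budget
        action = model(history)
        if action["kind"] == "final":
            return action["text"]
        handler = tools.get(action["name"])
        if handler is None:
            # Unknown tools become a bounded error result, not a crash.
            result = f"error: unknown tool {action['name']}"
        else:
            result = handler(**action["args"])
        # Only a bounded slice of the result enters the next turn (assumed cap).
        history.append(f"tool {action['name']}: {result[:500]}")
    return "stopped: step budget exhausted"


def scripted_model(history: list[str]) -> dict:
    """Hypothetical stand-in for a model: one tool call, then a final answer."""
    if len(history) == 1:
        return {"kind": "tool", "name": "read_file", "args": {"path": "src/config.py"}}
    return {"kind": "final", "text": f"done after {len(history) - 1} tool step(s)"}
```

Even in this small form, the step budget, the unknown-tool handling, and the result cap all live outside the model, which is the point of the distinction.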

A Concrete Example

Suppose the model emits a tool call:

A tool call is a proposal:

write_file(
    path="src/config.py",
    content="…"
)

Before the write becomes a real side effect, the harness must check path safety, baseline freshness, approval, and state update.

The model made a proposal. The harness must now answer a series of questions:

  • Is src/config.py inside the workspace?
  • Could this path escape via symlink?
  • Is this a new file or an overwrite?
  • Has the model recently read the existing file?
  • Is the known file baseline stale?
  • Does this write require human approval?
  • Should the edit result include a diff summary?
  • Should a validation command be run afterward?
  • How much output should go into the next turn’s prompt?
  • What state needs to be updated to avoid staleness in the next turn?

These judgments should not be left to whatever the model happens to generate. The prompt can describe desired behavior. The harness is responsible for enforcing boundaries.
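The first two questions (is the path inside the workspace, and could it escape via symlink) can be sketched with the standard library. This is a minimal illustration, not the author's implementation; `resolve_inside_workspace` is a hypothetical helper name, and `Path.is_relative_to` requires Python 3.9 or later.

```python
from pathlib import Path


def resolve_inside_workspace(workspace: Path, requested: str) -> Path:
    """Return the resolved target path, or raise if it escapes the workspace."""
    root = workspace.resolve()
    # resolve() follows symlinks and collapses "..", so the check is made
    # against the real location rather than the spelling of the path.
    # An absolute `requested` replaces `root` in the join and is then
    # checked like any other path.
    target = (root / requested).resolve()
    if not target.is_relative_to(root):  # Python 3.9+
        raise PermissionError(f"path escapes workspace: {requested}")
    return target
```

A check like this is only the first gate; baseline freshness and approval would still have to pass before the write happens.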

The Product Lesson

Once you start seeing the harness, many coding-agent behaviors become easier to diagnose.

  • If the agent keeps repeating, check the loop and retry policy.
  • If it edits stale code, check file-state baselines.
  • If it gets progressively worse, check context projection.
  • If it runs something unexpected, check the permission policy.
  • If it can’t recover from tool errors, check tool result objects.
  • If it can’t explain what happened, check traces, audits, and doctor surfaces.

The model is important, but model quality is just one layer.

The real agent experience comes from the entire harness around the model:

The agent experience is a stack:

  • model
  • repo context
  • prompt structure
  • tool contracts
  • validation
  • permissions
  • transcript
  • file state
  • diagnostics


That’s why “the harness is the product” isn’t just a slogan. It’s an engineering rule:

Anything that must be reliable belongs in the harness.

Alright, hopefully this gives you a better understanding of the harness.

Next up: The Stale Read Trap.


02. The Stale Read Trap

The model remembers the transcript, but the harness must remember the truth.

What is the Stale Read Trap?

A coding agent’s most dangerous bug often looks ordinary.

The agent reads a file. The user changes that file. The model remembers the file content from the transcript, but the file on disk has changed.

The agent continues editing the old version.

No dramatic error, no obvious hallucination, just the model confidently working with outdated evidence. I call this the “stale read trap” (a term I made up ;)).

Remember:

Transcript text is not file truth.

If a coding agent can’t tell whether the file content the model remembers still matches the disk, it will eventually patch yesterday’s code.

The Failure

Imagine this sequence:

Failure sequence:

turn 1:
  read_file("src/config.py")
  -> transcript now contains the file content

outside the agent:
  formatter / user / git checkout changes src/config.py

turn 2:
  model uses the old transcript text
  patch_file(
    path="src/config.py",
    old="TIMEOUT = 30",
    new="TIMEOUT = 60"
  )

The problem isn’t read_file. The problem is that after the disk changes, the agent still believes the old read.

The model is acting in good faith; from its perspective, the file content is right there in the conversation.

The issue is that the conversation is merely history. The workspace is the current truth.

The Naive Design

A small agent often treats tool output as ordinary transcript text:

Naive transcript:

User: change the timeout
Tool: here is src/config.py
Assistant: I will patch line 12
Tool: patch succeeded

Plain transcript text cannot detect external file changes.

This works as long as nothing changes outside the transcript. But the transcript and disk are not in sync.

Transcript versus disk:

transcript says:
  12: TIMEOUT = 30

disk now says:
  12: REQUEST_TIMEOUT = 45
  13: RETRY_TIMEOUT = 30

If transcript and disk disagree, the harness should believe the disk.

If the harness only appends text, the model may act in a world that no longer exists.

What The Harness Needs To Know

The harness needs to separately record file truth, not just “what the model has read.”

It needs File State Records. In short, track file state at all times:

What file state records:

path: src/config.py
read range: whole file or lines 1-80
baseline hash: abc123…
mtime / size: last known disk fingerprint
content baseline: bounded text the model saw
source: read_file, agent_change, external_change
status: fresh, changed, stale, partial, truncated

Transcript is chronological. File content is stateful.

Transcript answers: What happened in the conversation? File state answers: What file content can the model trust now?
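A file state record of this kind could look roughly like the following. It is a sketch under assumptions: the `FileState` class, its field names, and the `record_read` and `status` helpers are hypothetical, and the record covers only a subset of the fields listed above (no bounded content baseline, no "truncated" status).

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass
class FileState:
    path: str
    sha256: str       # baseline hash of the whole file at read time
    mtime_ns: int     # last known disk fingerprint
    size: int
    full_read: bool   # False when the model saw only a line range
    source: str       # "read_file", "agent_change", or "external_change"


def fingerprint(path: Path) -> tuple[str, int, int]:
    """Hash the file contents and capture mtime and size."""
    data = path.read_bytes()
    stat = path.stat()
    return hashlib.sha256(data).hexdigest(), stat.st_mtime_ns, stat.st_size


def record_read(path: Path, full_read: bool = True) -> FileState:
    """Record what the disk looked like when the model read the file."""
    digest, mtime_ns, size = fingerprint(path)
    return FileState(str(path), digest, mtime_ns, size, full_read, "read_file")


def status(state: FileState) -> str:
    """Compare the recorded baseline with the disk as it is now."""
    path = Path(state.path)
    if not path.exists():
        return "stale"
    digest, _, _ = fingerprint(path)
    if digest != state.sha256:
        return "changed"
    return "fresh" if state.full_read else "partial"
```

The transcript never appears in this code: the answer to "can the model trust this content?" comes from the disk and the recorded baseline alone.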

The Harness Contract

A mature harness treats read_file as a contract.

The harness contract:

read_file(path)
  -> record baseline
  -> attach bounded content to transcript
  -> remember what the model saw

before edit(path)
  -> compare baseline with disk
  -> reject stale / partial / missing / changed
  -> require fresh read or inject current changed lines

after successful edit(path)
  -> record post-edit baseline
  -> mark the agent’s own write as fresh

This transforms file read from a convenient tool into a safety mechanism.

The model can still propose edits. The harness is responsible for judging whether the proposal is based on the latest file truth.

Partial Reads Are Partial

Watch out for partial reads in the stale read trap.

The model reads only lines 100-140, then tries to rewrite the entire file. The harness should immediately be extra cautious about such partial reads. Range reads can help answer questions, but they don’t automatically authorize high-risk edits.

Partial reads are partial:

read_file("server.py", start=100, end=140)
  -> useful context
  -> partial baseline
  -> not enough for whole-file overwrite

A local slice can answer a question without authorizing a risky edit.

A simple rule is useful:

Existing-file edits should require a fresh full-file baseline, unless the edit tool has a stricter exact-context contract.

This prevents the model from understanding only a fragment and then guessing about the entire file.
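The rule can be expressed as a small gate that runs before any edit. The sketch below is an illustration only: `Baseline`, `check_edit_allowed`, and the `exact_context` flag are hypothetical names, and the baseline hash is assumed to cover the whole file even when the model saw only a range.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Baseline:
    sha256: str      # hash of the whole file when it was last read
    full_read: bool  # False if the model saw only a line range


def check_edit_allowed(
    baseline: Optional[Baseline],
    disk_bytes: Optional[bytes],
    exact_context: bool = False,
) -> tuple[bool, str]:
    """Decide whether an edit may proceed; returns (allowed, reason)."""
    if disk_bytes is None:
        return True, "new file"  # nothing on disk to go stale against
    if baseline is None:
        return False, "read the file first"
    if hashlib.sha256(disk_bytes).hexdigest() != baseline.sha256:
        return False, "file changed since last read"
    if not baseline.full_read and not exact_context:
        return False, "partial read cannot authorize this edit"
    return True, "ok"
```

The returned reason is meant to go back to the model as a tool result, so a rejected edit can be followed by a fresh read instead of a guess.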

What The Harness Should Track

When read_file runs, record:

  • workspace-relative path
  • line range and total line count
  • whether this read covered the entire file
  • modification time, file size, SHA-256 fingerprint
  • bounded baseline text
  • the source of the state

When write_file or patch_file succeeds, record the post-edit content as a new known baseline. When constructing the next turn’s prompt, have the agent lazily refresh tracked files:

  • unchanged content stays fresh
  • timestamp-only change with same content: quietly refresh
  • external disk edit: produce a current changed-line snippet
  • deleted or oversized changed file: emit a stale warning
  • fresh read_file clears stale or external-change state
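The refresh cases above can be sketched as a single function that runs while the next prompt is built. This is a rough illustration with assumed names (`Tracked`, `track`, `refresh`) and an assumed 100,000-byte size limit; it also takes the shortcut of trusting an unchanged modification time, which a stricter harness might not.

```python
import difflib
import hashlib
import os
from dataclasses import dataclass


@dataclass
class Tracked:
    path: str
    text: str      # baseline text the model saw
    sha256: str
    mtime_ns: int


def track(path: str) -> Tracked:
    """Record a baseline for a file the model has just read."""
    with open(path, "rb") as fh:
        data = fh.read()
    return Tracked(path, data.decode("utf-8", "replace"),
                   hashlib.sha256(data).hexdigest(), os.stat(path).st_mtime_ns)


def refresh(t: Tracked, max_bytes: int = 100_000) -> tuple[str, str]:
    """Return (status, note) for a tracked file, checked lazily."""
    if not os.path.exists(t.path):
        return "stale", "file was deleted"
    st = os.stat(t.path)
    if st.st_mtime_ns == t.mtime_ns:
        return "fresh", ""  # cheap check (assumption: same mtime, same content)
    if st.st_size > max_bytes:
        return "stale", "timestamp changed and file is too large to diff"
    with open(t.path, "rb") as fh:
        data = fh.read()
    if hashlib.sha256(data).hexdigest() == t.sha256:
        t.mtime_ns = st.st_mtime_ns  # timestamp-only change: quietly refresh
        return "fresh", ""
    new_text = data.decode("utf-8", "replace")
    diff = list(difflib.unified_diff(
        t.text.splitlines(), new_text.splitlines(), lineterm="", n=0))
    # Skip the two file-header lines and the hunk markers; keep changed lines.
    snippet = "\n".join(line for line in diff[2:] if not line.startswith("@@"))
    return "changed", snippet
```

On a "changed" result the baseline is deliberately left alone, so the file stays flagged until a fresh read replaces the record.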

Why This Changes Agent Behavior

Once file state exists, agent behavior changes noticeably.

  • When a file hasn’t changed, repeated reads can be compressed into a summary, avoiding context flooding.
  • When a file changes outside the agent, the next turn’s prompt can show the current changed lines instead of the old transcript content.
  • When a file is only partially read, edits can be rejected before causing damage.
  • When the agent writes a file itself, the post-edit content becomes a fresh baseline.
  • After a session resume, the harness can restore not just text but also what the agent can trust.

This changes the real product experience.

To sum up in one sentence:

The model can remember text. The harness must remember truth.


03. Tools Are Contracts, Not Functions

Each time a coding agent issues a tool call, it uses generated text to request a change in its world state.

Tool calls can easily seem too simple: the model outputs JSON, the harness parses it, some local function executes, and the result is placed back into the next turn’s prompt.

That’s the tiny agent version.

But a coding agent’s tool call is not a simple function. The tool call comes from generated text, and it is requesting access to the workspace, shell, network, transcript, or another agent.

This difference changes the entire state and environment of the coding agent.

The model can ask. The harness decides.

The Proposal and Contract

Imagine the model outputs:

The model proposal:

{
  "name": "write_file",
  "args": {
    "path": "src/config.py",
    "content": "…"
  }
}

Valid-looking JSON is still only a request for power.

It looks structured, but it’s still just a proposal. The model producing seemingly valid JSON does not automatically grant write access.

The harness still needs to answer:

  • Does this tool exist?
  • Do the args match the schema?
  • Is the path inside the workspace?
  • Is this a create, overwrite, or patch?
  • Has the model recently read the existing file?
  • Is the file-state baseline fresh?
  • Does this action require approval?
  • What should be shown to the human before approval?
  • How much result output can safely enter context?
  • What transcript and state need updating after execution?

This series of questions is the tool contract.

The Naive Design

The naive design treats tools as a function map:

The naive function map:

tool = tools[model_json["name"]]
result = tool(**model_json["args"])
history.append(result)

One short line can hide parsing, policy, limits, and state updates.

This code is short, so it looks clean, but it also crams a lot of responsibility into one line. Parsing, schema validation, path safety, permission policy, sandbox choice, execution, output clipping, transcript recording, state updates—all shoved into tool(...).

This works for a demo. But a coding agent that modifies a real repo needs stronger boundaries. The problem isn’t the function itself; the problem is that the tool boundary has different trust rules.

Function call vs Tool call:

In a normal program, a function call is an implementation detail. In a coding agent, a tool call is where generated text requests a change in the real world.

What A Tool Contract Contains

A useful tool contract shouldn’t just have a name and handler. It should describe:

What the contract contains:

name: write_file
args schema: path, content
risk: workspace_write
path policy: must stay inside workspace
state requirement: existing file needs fresh baseline
approval: required for writes
approval summary: path + operation + size
execution: atomic write or exact patch
result budget: bounded diff / chars / lines
state update: record post-edit baseline
transcript record: paired call and result

Some fields are for the model. Some fields belong exclusively to the harness.

The model needs enough information to make a correct request. The harness needs enough metadata to make the correct decision.
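One way to keep model-facing and harness-only fields apart is to hold them in one record and expose only a projection to the model. The sketch below is illustrative: `ToolContract`, its field names, `describe_for_model`, and `validate_args` are hypothetical, and the schema is a plain name-to-type mapping rather than a full JSON Schema.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ToolContract:
    name: str
    args_schema: dict[str, type]   # model-facing
    risk: str                      # harness-only from here down
    needs_approval: bool
    needs_fresh_baseline: bool
    max_result_chars: int

    def describe_for_model(self) -> dict[str, Any]:
        """Only the fields the model needs to form a correct request."""
        return {
            "name": self.name,
            "args": {key: t.__name__ for key, t in self.args_schema.items()},
        }


def validate_args(contract: ToolContract, args: dict[str, Any]) -> list[str]:
    """Return a list of schema errors; an empty list means the args are valid."""
    errors = []
    for key, expected in contract.args_schema.items():
        if key not in args:
            errors.append(f"missing argument: {key}")
        elif not isinstance(args[key], expected):
            errors.append(f"{key} must be {expected.__name__}")
    for key in args:
        if key not in contract.args_schema:
            errors.append(f"unexpected argument: {key}")
    return errors


WRITE_FILE = ToolContract(
    name="write_file",
    args_schema={"path": str, "content": str},
    risk="workspace_write",
    needs_approval=True,
    needs_fresh_baseline=True,
    max_result_chars=2000,
)
```

The risk level, approval flag, and result budget never reach the prompt, so the model cannot talk its way around them.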

The Lifecycle

A minimal viable lifecycle:

parse -> schema validation -> path validation -> policy -> execution -> bounding -> recording

Every step intercepts a different type of problem.

  • parse handles malformed output
  • schema validation handles missing fields and wrong types
  • path validation handles workspace escape
  • policy handles risky actions, approval, sandbox, denial
  • execution runs the real handler
  • bounding prevents a single command or file read from overwhelming the next turn’s prompt
  • recording allows recovery and auditing in the next turn

So the mental model of “just call the function” is insufficient. The function is only a small part of the lifecycle.
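The stages listed above can be strung together so that each failure is attributed to the stage that caught it. This is a compact sketch, not a reference design: `run_tool_call`, the `approve` callback, the record shape, and the 2,000-character cap are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Callable


def run_tool_call(
    raw: str,
    handlers: dict[str, Callable[..., object]],
    schemas: dict[str, dict[str, type]],
    workspace: Path,
    approve: Callable[[str, dict], bool],
    transcript: list[dict],
    max_chars: int = 2000,
) -> dict:
    """Run one proposed tool call through every lifecycle stage."""

    def finish(ok: bool, stage: str, output: str) -> dict:
        record = {"ok": ok, "stage": stage, "output": output}
        transcript.append(record)  # recording: every outcome is kept
        return record

    try:  # parse: malformed output
        call = json.loads(raw)
        name, args = call["name"], call["args"]
        if not isinstance(name, str):
            raise TypeError("tool name must be a string")
    except (json.JSONDecodeError, KeyError, TypeError):
        return finish(False, "parse", "malformed tool call")

    schema = schemas.get(name)  # schema: unknown tool, missing fields, wrong types
    if schema is None or name not in handlers:
        return finish(False, "schema", f"unknown tool: {name}")
    if not isinstance(args, dict) or set(args) != set(schema):
        return finish(False, "schema", f"expected arguments: {sorted(schema)}")
    if any(not isinstance(args[key], t) for key, t in schema.items()):
        return finish(False, "schema", "argument has wrong type")

    if "path" in args:  # path: workspace escape
        root = workspace.resolve()
        if not (root / args["path"]).resolve().is_relative_to(root):
            return finish(False, "path", "path escapes workspace")

    if not approve(name, args):  # policy: approval or denial
        return finish(False, "policy", "denied by policy")

    try:  # execution: the real handler
        output = str(handlers[name](**args))
    except Exception as exc:
        return finish(False, "execution", f"{type(exc).__name__}: {exc}")

    return finish(True, "recorded", output[:max_chars])  # bounding
```

The handler call is one line out of roughly forty, which matches the claim that the function is only a small part of the lifecycle.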

Validate Before Approval

Validation should happen before approval.

  • If patch_file points outside the workspace, reject first.
  • If old_text is missing or ambiguous, reject first.
  • If the tool call immediately repeats a previous failed request, reject or retry first.
  • If a write targets an existing file but lacks a fresh baseline, reject first.
  • Approval is a product surface; users should not be asked to judge an already invalid request.
  • When approval is needed, the summary should remain bounded.

Validate before approval:

tool: write_file
path: src/config.py
operation: update existing file
content: 84 lines, 2410 chars
risk: workspace_write
requires: human approval

Invalid requests should fail before a human is asked to decide.

The approval prompt for a file edit should show the affected path and the shape of the change, avoiding dumping unbounded raw content. The harness can later show diff, count, preview. The approval boundary should be small and clear.
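A bounded approval summary along these lines could be produced as follows. It is a minimal sketch: `approval_summary` and its parameters are hypothetical, and it reports only the shape of the change (counts), never the raw content.

```python
def approval_summary(
    tool: str,
    path: str,
    content: str,
    exists: bool,
    risk: str = "workspace_write",
    max_path_chars: int = 120,
) -> str:
    """Build a small, fixed-size summary for the human approval prompt."""
    operation = "update existing file" if exists else "create new file"
    # Very long paths are shortened from the left so the file name stays visible.
    shown_path = path if len(path) <= max_path_chars else "…" + path[-max_path_chars:]
    return "\n".join([
        f"tool: {tool}",
        f"path: {shown_path}",
        f"operation: {operation}",
        f"content: {len(content.splitlines())} lines, {len(content)} chars",
        f"risk: {risk}",
        "requires: human approval",
    ])
```

Because the summary has a fixed number of lines, a 10 MB write and a 10-byte write produce approval prompts of the same size.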

Bounded Results Are Part Of Safety

Tool output becomes context, and context affects agent behavior.

  • If search returns 10,000 results, the model is not clearer.
  • If run_shell spews a huge log, the next turn may lose the actual user request.
  • If read_file repeats the same unchanged file every turn, useful context gets crowded out by duplicate text.

Therefore, tool contracts need result limits:

Bounded result:

stdout: clipped
stderr: clipped
large files: offset + limit
search: max matches
diff: compact preview
binary data: metadata, not raw bytes

Tool output becomes the model’s next working set, so result limits protect both token budget and reasoning quality.

This isn’t just about token cost; it’s about the accuracy of the model’s working set. The harness should decide which result evidence enters the transcript, which becomes durable state, and which remains as an external artifact reference.
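As one example of a result limit, the clipping of stdout or stderr could keep the head and tail of the output and say how much was dropped. The function name `clip_output` and the default limits are assumptions chosen for illustration.

```python
def clip_output(text: str, max_lines: int = 40, max_chars: int = 4000) -> str:
    """Keep the head and tail of long output and note what was omitted."""
    lines = text.splitlines()
    if len(lines) <= max_lines and len(text) <= max_chars:
        return text  # already within budget, pass through unchanged
    if len(lines) > max_lines:
        keep = max(1, max_lines // 2)
        omitted = len(lines) - 2 * keep
        lines = lines[:keep] + [f"... [{omitted} lines omitted] ..."] + lines[-keep:]
    clipped = "\n".join(lines)
    if len(clipped) > max_chars:
        # A few very long lines can still blow the budget, so cap characters too.
        clipped = clipped[:max_chars] + f"\n... [truncated at {max_chars} chars]"
    return clipped
```

Keeping the tail matters for shell output in particular, since the error that ended a command usually appears in the last few lines.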

When evaluating a coding-agent tool, don’t just ask:

Can the model call it?

Ask instead:

  • Is the argument schema precise?
  • What can this tool read or mutate?
  • Which paths, commands, and resources are allowed?
  • What kind of fresh state is needed before execution?
  • What policy decides allow, ask, sandbox, deny?
  • What result evidence will go back to the model?
  • After success or failure, what durable state changes?
  • How can this call and result be paired later?

The model can request. The harness decides. A tool call only executes after the harness approves it.

This process is the contract.

In short: Tools are contracts.


04. The Agent’s Toolkit: “/”

Slash commands let the coding agent go beyond a simple prompt box.

Not all input should become a prompt.

Every coding-agent UI starts with a simple idea: type in a request, send it to the model, get back a result. This is the prompt-box view of an agent.

Ordinary work requests are fine:

Ordinary work requests:

fix the failing tests
explain this module
add a config option

These are work requests. They can become a prompt.

But product-level coding agents have another class of input:

/status, /tools, /reset, /goal pause, /audit

These inputs shouldn’t be sent to the model as ordinary requests to understand. They are tools the user uses to control the agent runtime.

Do not send control-plane intent to the model as ordinary chat.

The Naive Prompt Box

Tiny agents often process all text through the same path:

The naive prompt box:

raw user input -> append to transcript -> build prompt -> call model -> parse model output

All text becomes model input, including commands that should control the runtime.

This is simple, but it also traps the product.

  • If the user types /status, the model might fabricate a status.
  • If the user types /reset, the model might discuss resetting without clearing session state.
  • If the user types /goal pause, the model might treat it as an instruction within the current task.
  • If the user types /statsu (misspelled), the model might try to guess instead of returning command help.

The problem occurs before model quality. The harness didn’t route the input correctly.

Slash Commands Are User Tools

Article 03 said that model tools are contracts. Slash commands belong to another category of tools—they belong to the user and the harness.

They can inspect, steer, reset, pause, diagnose, and configure the runtime before any model call occurs.

A simple set of command families:

Slash command families:

session: /status /memory /session /reset /help
tools: /tools
goal: /goal /goal pause /goal resume /goal clear
capability: /skills /mcp /hooks
policy + diagnostics: /permissions /audit /doctor
future context: /compact

  • Some commands produce local output.
  • Some commands modify session state.
  • Some commands display harness metadata.
  • Some commands prepare context for the next model turn.

All of them need to be routed before prompt construction.

The key term here is already apparent: the input router.

The Input Router

A useful coding-agent harness has two parser boundaries. The first handles user input. The second handles model output.

The input router should decide:

The input router:

raw user input
  -> empty? ignore
  -> known slash command? handle locally
  -> unknown slash command? return command help
  -> local query? answer locally
  -> normal request? prepare message and call model

Only the last path should enter agent.ask().

The prompt box is no longer the only entry point. The harness has its own control surface.
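A minimal input router could be sketched like this. The `Route` type, the `route_input` function, and the small `KNOWN_COMMANDS` set are hypothetical; the "local query" branch is left out for brevity, and treating every input that starts with "/" as a command is a simplification (a request that begins with a file path would need extra handling).

```python
import difflib
from dataclasses import dataclass

# Assumed subset of commands, for illustration only.
KNOWN_COMMANDS = {"/status", "/tools", "/reset", "/goal", "/audit", "/help"}


@dataclass
class Route:
    kind: str      # "ignore", "command", "help", or "prompt"
    payload: str


def route_input(raw: str) -> Route:
    """Decide where user input goes before any prompt is built."""
    text = raw.strip()
    if not text:
        return Route("ignore", "")
    if text.startswith("/"):
        command = text.split()[0]
        if command in KNOWN_COMMANDS:
            return Route("command", text)  # handled locally, no model call
        # Unknown command: return help, with a suggestion if one is close.
        close = difflib.get_close_matches(command, sorted(KNOWN_COMMANDS), n=1)
        hint = f" Did you mean {close[0]}?" if close else ""
        return Route("help", f"Unknown command {command}.{hint} Try /help.")
    return Route("prompt", text)  # only this path reaches the model
```

With this in place, a typo such as /statsu gets deterministic help from the harness instead of a guess from the model.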

Not Every Input Becomes A Prompt

The Special Case: /goal

/goal is the most interesting slash command because it sits between user control and model work.

The goal is explicit session state.

Goal control state:

session["goal"] = {
  "objective": "…",
  "status": "active",
  "created_at": "…",
  "updated_at": "…",
  "events": []
}

user controls: /goal pause /goal resume /goal clear
model tools: get_goal() update_goal(status="complete")

Goal state is control-plane state, not fuzzy chat memory.

User-facing commands control the lifecycle: /goal, /goal pause, /goal resume, /goal complete, /goal clear.

The model can see the goal context in the prompt, but it cannot arbitrarily create, pause, resume, or clear goals. Its goal tools are limited to get_goal() and update_goal(status="complete").

The goal can guide the model, but the user still holds the control state. Completion should depend on evidence, not just the model thinking it’s done.
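The split between user control and model tools can be sketched as a small controller. Everything here is an assumption for illustration: the `GoalController` class, the `user_set` and `user_command` methods, and especially the `evidence` argument on `update_goal`, which is one possible way to make completion depend on more than the model's say-so.

```python
import copy
from datetime import datetime, timezone
from typing import Optional


def _now() -> str:
    return datetime.now(timezone.utc).isoformat()


class GoalController:
    """Goal state lives in the harness; the model gets a narrow interface."""

    def __init__(self) -> None:
        self.goal: Optional[dict] = None

    def _log(self, actor: str, event: str) -> None:
        self.goal["events"].append({"actor": actor, "event": event, "at": _now()})
        self.goal["updated_at"] = _now()

    # --- user control plane (slash commands) ---
    def user_set(self, objective: str) -> None:
        now = _now()
        self.goal = {"objective": objective, "status": "active",
                     "created_at": now, "updated_at": now, "events": []}

    def user_command(self, action: str) -> None:
        if action == "clear":
            self.goal = None
            return
        transitions = {"pause": "paused", "resume": "active", "complete": "complete"}
        if self.goal is None or action not in transitions:
            raise ValueError(f"cannot apply /goal {action}")
        self.goal["status"] = transitions[action]
        self._log("user", action)

    # --- model tools ---
    def get_goal(self) -> Optional[dict]:
        return copy.deepcopy(self.goal)  # a copy, so the model cannot mutate state

    def update_goal(self, status: str, evidence: list[str]) -> bool:
        """The model may only mark an active goal complete, and only with evidence."""
        if self.goal is None or self.goal["status"] != "active":
            return False
        if status != "complete" or not evidence:
            return False
        self.goal["status"] = "complete"
        self._log("model", "complete: " + "; ".join(evidence))
        return True
```

A paused goal stays paused no matter what the model requests, which keeps the control state in the user's hands.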

Diagnostics Should Stay Local

Some commands should directly inspect the harness. /audit and /doctor are good examples.

They should not ask the model:

does my session look healthy?
