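@XAMTO_AI: Thinking of building a production-grade agent harness from scratch? Dream on. Teams that assumed picking any framework would finish the job have mostly crashed. The truth is that "choosing a framework" cannot settle this; behind it are 15 hard responsibilities you cannot avoid: each one has to be built as an installable, versionable, language-portable w…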


Summary

The article argues that production agent harnesses should not be monolithic frameworks but rather a stack of independent, replaceable workers connected by a shared trigger primitive, outlining 15 core responsibilities and how the iii engine implements this approach.

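Thinking of building a production-grade agent harness from scratch? Dream on. Teams that assumed picking any framework would finish the job have mostly crashed. The truth is that "choosing a framework" cannot settle this; behind it are 15 hard responsibilities you cannot avoid. Each one has to be built as a worker that is installable, versionable, and portable across languages. A single turn has to really run end to end, not be faked in a demo. Policy, approvals, budgets, and traces are all required, and that is the line between production-grade and a toy. @mfpiccolo has written the whole approach up. If you are serious about agents, don't scroll past, go read the original: https://iii.dev/blog/how-to-build-your-own-agent-harness/…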

Cached at: 2026/06/18 16:18



How to Build Your Own Agent Harness

Source: https://iii.dev/blog/how-to-build-your-own-agent-harness/ · May 28, 2026 · Mike Piccolo, Founder & CEO of iii


How to Build Your Own Agent Harness — loops, tools, memory, sandbox, and observability

Most agent teams don’t build a harness. They adopt one: LangChain, LangGraph, OpenAI Agents SDK, Anthropic SDK, CrewAI, AutoGen. The loop, the tools, the memory, and the orchestration are picked off the shelf as a single decision. The harness is a framework you import. If something inside it doesn’t fit, you fork it, fight it, or work around it.

I think that shape is wrong, and it’s the reason every long-running agent team eventually ends up rewriting its harness from scratch. The harness isn’t one thing. It’s ten or twelve different things bundled together because the surrounding ecosystem doesn’t give you a way to compose them. Pi agent packages are on the right track, but they are still in the paradigm of “add another service and integrate it with all the others.” The iii engine treats all workers the same and removes the integration logic completely. The provider router, the credential vault, the policy engine, the approval gate, the model catalog, the session storage, the budget tracker, the after-call hook fanout, and the durable turn loop are independent concerns. They all interoperate with your queue, HTTP/API server, streaming, and even browser workers. A framework that ships them as one block is selling you a tradeoff you didn’t have to make.

The bet underneath iii is that they shouldn’t be one block. There should be a set of workers on a shared engine, each replaceable, each versioned independently, each connected by a single primitive: **a trigger (`iii.trigger()`)** that every other worker also uses. The harness becomes a stack of installable workers, and “build your own” stops meaning “fork a framework.” It means “swap a few workers.”

This post walks through what that actually looks like: the complete stack that drives an iii agent turn today, why each layer is its own worker, and how you replace any of them.

The 15 jobs an agent harness has to do

If you strip a production agent harness back to its responsibilities, you get a list that looks roughly like this:

  1. Accept a turn request from a client and persist it
  2. Resolve credentials for whichever model provider gets called
  3. Look up what the chosen model can actually do (vision, tools, streaming, context window)
  4. Drive the per-turn state machine: provision, stream the assistant, run tools, steer, tear down
  5. Load and serve skill bodies that describe each function’s request shape, error codes, and usage notes
  6. Assemble the system prompt, mode paragraph, identity preamble, working directory, and default skills appendix
  7. Stream tokens back to the client as the model produces them
  8. Check every tool call (which is just a function call) against a policy before it runs
  9. Pause tool calls that need a human decision and route the answer back to the right turn
  10. Track LLM spend against per-workspace or per-agent budgets
  11. Run hooks before and after tool calls (logging, redaction, custom side effects)
  12. Persist the session as a branching tree so forks and resumes work
  13. Compact session history when the context window fills up
  14. Emit an event stream that the UI subscribes to
  15. Carry one OpenTelemetry trace across every step so you can debug it (the piece I see missing at every company building agents)

Every serious agent harness does most of these. The expensive ones do all of them. The cheap ones cut corners and rebuild the corners later when they hit production. The frameworks bundle them into a monolith and ship one version of each. That last part is the part that costs you, because a year in, you find out that the policy engine you want is not the policy engine the framework ships, and replacing it means replacing the harness.

The iii harness ships every one of those fifteen jobs as a separate worker on the workers.iii.dev registry. Each speaks the same WebSocket protocol. Each registers functions and triggers on the same engine bus. Each can be installed with `iii worker add`, swapped out, and written in any language with an SDK.

The stack, by worker

Here is the actual production stack from the `iii-hq/workers` monorepo, with each worker’s job in one line. The whole bundle ships at github.com/iii-hq/workers/harness:

| Worker | Job |
|---|---|
| `iii-directory` | Skill and prompt registry. Workers publish skills at `iii://<worker>/<function>`; the agent fetches them on demand via `directory::skills::get`. Ships with the iii engine (Rust). |
| `harness` | Meta-worker. Loads `iii-permissions.yaml`. Exposes `policy::check_permissions` and the `ui::*` plane. Pumps `agent::events` to subscribed browsers. |
| `turn-orchestrator` | The durable FSM driving each agent turn (seven states since the refactor described later in this post, previously eleven). Owns `run::start`, `turn::step`, `turn::get_state`. Also assembles the system prompt at the `provisioning` step. |
| `approval-gate` | Bus entry point for operator decisions. Handles `approval::resolve` by writing the decision to iii state, where the orchestrator’s `turn::on_approval` trigger picks it up. |
| `session` | Branching session storage. `session-tree::*` for the parent-linked entry tree; `session-inbox::*` for per-session queues. |
| `llm-budget` | Workspace + agent spend caps. 14 `budget::*` functions including check, record, alerts, forecast, period rollover. |
| `hook-fanout` | Generic publish-and-collect over a stream topic. The pattern every iii hook is built from. |
| `auth-credentials` | File-backed provider credential vault under `auth::*`. |
| `models-catalog` | Static model capability catalogue. `models::list`, `models::get`, `models::supports`. |
| `provider-anthropic` | Anthropic Messages API SSE streamed into an iii channel. |
| `provider-openai` | OpenAI Chat Completions SSE streamed into an iii channel. |
| `provider-kimi` | Kimi (Moonshot) Chat Completions SSE. |
| `provider-lmstudio` | Local LM Studio SSE for desktop development. |
| `context-compaction` | Optional `agent::events` subscriber that compacts session history when the token count crosses a threshold. |

Fourteen workers. One engine. Each is on a published version. Each is independently runnable as a standalone process (`pnpm dev:<worker>` in dev, `iii worker add <specific-worker>` as a release binary) or as part of the composite entry point that spins them up together.

The reason this matters: every box in that table is a place where someone can hand you a different worker, and you keep the rest. Don’t like the static model catalogue? Plug in a worker that registers `models::list` and reads from a live API. Don’t like file-backed credentials? Plug in a worker that registers `auth::get_token` and reads from a secrets manager. Want a different turn FSM for a workflow that branches differently? Replace `turn-orchestrator`: every dependent calls `run::start` and reads `turn_state` through the same bus, so the rest of the stack doesn’t change.

How the loop actually runs

The shape of one turn looks like this, walking through the workers in the order they fire.

A browser, CLI, or chat client POSTs a turn through `harness::trigger` with `{session_id, message_id, payload}`. The harness meta-worker forwards `payload` to `run::start`. That hop exists so the OpenTelemetry span wrapper can seed the session and message IDs as baggage, which propagates to every nested `iii.trigger` call across every worker in the stack. The trace tree on the other side is one connected graph.

`run::start` lands on the turn-orchestrator. It persists the run request, seeds the initial `TurnStateRecord` in iii state at `session/<sid>/turn_state`, and returns immediately. The actual work happens inside the durable per-turn state machine, woken by publishes to the `turn-step` FIFO.

The two terminal states are `stopped` (clean exit via `finishSession()`) and `failed` (an unexpected handler throw routes here, acks the queue so it stops retrying, and surfaces `message_complete{stop_reason:'error'}` plus `agent_end` so the UI shows the reason). Teardown is an inline `finishSession()` port called from any turn-end path, not a separate enqueued step.

`provisioning` does three things. It boots an `iii-sandbox` microVM if the run needs isolated execution. It calls `directory::skills::download` for every namespace in `system_default_skills` (default `["iii://iii-directory/index"]`) so iii-directory pre-caches the skill bodies the run starts with. And it assembles the system prompt in three layers: a mode paragraph picked from `run_request.mode` (`plan`, `ask`, or `agent`), the iii identity preamble that teaches the model the `agent_trigger` convention and the `directory::skills::get` on-demand discovery pattern, and an appended index of the default skills the agent boots with. The caller can override the whole prompt by passing `system_prompt` on `run::start`; otherwise the orchestrator builds it. Function schemas come from the live engine catalog.
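The three-layer prompt assembly can be sketched as a pure function. This is a minimal illustration, not the orchestrator's code: the `RunRequest` type, the paragraph texts, and the `assembleSystemPrompt` name are all assumptions made for the example; only the layering and the `system_prompt` override come from the description above.

```typescript
// Hypothetical sketch: type names and paragraph texts are illustrative only.
type Mode = "plan" | "ask" | "agent";

interface RunRequest {
  mode: Mode;
  system_prompt?: string; // caller override, used verbatim when present
}

const MODE_PARAGRAPHS: Record<Mode, string> = {
  plan: "You are in plan mode: propose steps, do not execute them.",
  ask: "You are in ask mode: answer questions, do not change anything.",
  agent: "You are in agent mode: you may call functions to complete the task.",
};

const IDENTITY_PREAMBLE =
  "Call functions via agent_trigger. Fetch a function's skill with directory::skills::get before first use.";

function assembleSystemPrompt(run: RunRequest, defaultSkills: string[]): string {
  // An explicit override skips all three layers.
  if (run.system_prompt !== undefined) return run.system_prompt;
  const skillsIndex = ["Default skills:", ...defaultSkills.map((s) => `- ${s}`)].join("\n");
  // Layer order: mode paragraph, identity preamble, default skills index.
  return [MODE_PARAGRAPHS[run.mode], IDENTITY_PREAMBLE, skillsIndex].join("\n\n");
}

const planPrompt = assembleSystemPrompt({ mode: "plan" }, ["iii://iii-directory/index"]);
console.log(planPrompt);
```

Keeping the assembly a pure function of the run request and the skill list makes the override path trivial to reason about: either the caller's string is returned untouched, or all three layers are present.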

`assistant_streaming` calls `provider::<name>::stream` on whichever provider worker matches the run’s `provider` field. The provider worker pulls credentials via `auth::get_token` (auth-credentials) and streams the model’s SSE response into an iii channel; the orchestrator drains that channel, emitting `message_update` events on `agent::events` for the UI fanout. Channel creation and the read loop live behind a pull-based `MessagePump` in `provider-stream.ts`, so the streaming state stays focused on transitions.

When the assistant returns tool calls, the FSM enters `function_execute`. Every tool call passes through `dispatchWithHook`, the single chokepoint in the orchestrator. `consultBefore` calls `policy::check_permissions` directly with a 5-second timeout. The policy worker (the harness meta-worker, in the default stack) reads `iii-permissions.yaml`, matches the call’s `function_id` against the rule set, and returns one of three outcomes:

  • `allow`: dispatch proceeds; the orchestrator triggers the target function and writes the result
  • `deny`: dispatch short-circuits with a `DenialEnvelope`, and the result becomes a denial record
  • `needs_approval`: the individual call parks into the turn’s `awaiting_approval` list. The rest of the batch keeps dispatching. The turn transitions to `function_awaiting_approval` only when one or more entries are pending.
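The three outcomes can be sketched as a rule match followed by a batch partition. This is a hedged illustration: the `Rule` shape, the trailing-`*` glob matching, and the deny-when-unmatched default are assumptions for the example and do not describe the real `iii-permissions.yaml` schema.

```typescript
// Hypothetical sketch: rule shape and matching are assumptions,
// not the actual iii-permissions.yaml format.
type Decision = "allow" | "deny" | "needs_approval";

interface Rule { pattern: string; decision: Decision; }
interface ToolCall { function_call_id: string; function_id: string; }

// A trailing "*" matches any suffix; otherwise the id must match exactly.
function matchesPattern(pattern: string, functionId: string): boolean {
  return pattern.endsWith("*")
    ? functionId.startsWith(pattern.slice(0, -1))
    : pattern === functionId;
}

// First matching rule wins; an unmatched call is denied (assumed default).
function checkPermissions(rules: Rule[], call: ToolCall): Decision {
  const rule = rules.find((r) => matchesPattern(r.pattern, call.function_id));
  return rule ? rule.decision : "deny";
}

// Allowed calls dispatch, denied calls become denial records, and calls
// that need approval park without blocking the rest of the batch.
function dispatchBatch(rules: Rule[], batch: ToolCall[]) {
  const dispatched: string[] = [];
  const denied: string[] = [];
  const awaitingApproval: string[] = [];
  for (const call of batch) {
    const decision = checkPermissions(rules, call);
    if (decision === "allow") dispatched.push(call.function_call_id);
    else if (decision === "deny") denied.push(call.function_call_id);
    else awaitingApproval.push(call.function_call_id);
  }
  const nextState = awaitingApproval.length > 0 ? "function_awaiting_approval" : "steering_check";
  return { dispatched, denied, awaitingApproval, nextState };
}

const rules: Rule[] = [
  { pattern: "fs::read", decision: "allow" },
  { pattern: "shell::*", decision: "needs_approval" },
];
const result = dispatchBatch(rules, [
  { function_call_id: "c1", function_id: "fs::read" },
  { function_call_id: "c2", function_id: "shell::exec" },
  { function_call_id: "c3", function_id: "net::fetch" },
]);
console.log(result);
```

The point of the partition is that one parked call does not stall its siblings: `c1` still dispatches while `c2` waits for an operator.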

The approval wake is reactive and shared. The orchestrator registers exactly one `turn::on_approval` state trigger on scope `approvals`. When the console calls `approval::resolve`, the approval-gate worker writes `approvals/<sid>/<cid> = {decision, reason}` to iii state. That write fires `turn::on_approval`, which advances the affected session. `function_awaiting_approval` reads only the decisions that just landed, dispatches each one as it arrives (`allow` becomes a pre-approved dispatch, `deny` or `aborted` becomes a synthetic denial), and advances when `awaiting_approval[]` is empty. No per-call resume functions to register. No startup re-scan to recover pending approvals. One trigger covers every session.
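A small in-memory model makes the "one trigger covers every session" idea concrete. Everything here is a stand-in: `StateStore`, `onWrite`, and `resolveApproval` are hypothetical names for this sketch, not iii SDK calls; only the `approvals/<sid>/<cid>` key shape and the allow/deny/aborted handling follow the text.

```typescript
// Hypothetical sketch: StateStore and its methods are illustrative stand-ins
// for iii state and state triggers.
type ApprovalDecision = "allow" | "deny" | "aborted";
interface Approval { decision: ApprovalDecision; reason?: string; }
type StateTrigger = (key: string, value: Approval) => void;

class StateStore {
  private data = new Map<string, Approval>();
  private triggers = new Map<string, StateTrigger[]>();

  onWrite(scope: string, trigger: StateTrigger): void {
    this.triggers.set(scope, [...(this.triggers.get(scope) ?? []), trigger]);
  }
  get(scope: string, key: string): Approval | undefined {
    return this.data.get(`${scope}/${key}`);
  }
  set(scope: string, key: string, value: Approval): void {
    this.data.set(`${scope}/${key}`, value);
    for (const t of this.triggers.get(scope) ?? []) t(key, value);
  }
}

// Parked call ids per session, as the orchestrator would track them.
const parkedCalls = new Map<string, string[]>([["s1", ["c1", "c2"]]]);
const log: string[] = [];
const store = new StateStore();

// One trigger on the "approvals" scope serves every session.
store.onWrite("approvals", (key, { decision }) => {
  const [sid, cid] = key.split("/");
  const pending = parkedCalls.get(sid) ?? [];
  if (!pending.includes(cid)) return; // ignore stale or duplicate writes
  log.push(decision === "allow" ? `dispatch ${cid}` : `synthetic denial ${cid}`);
  const rest = pending.filter((id) => id !== cid);
  parkedCalls.set(sid, rest);
  if (rest.length === 0) log.push(`advance ${sid}`);
});

// What approval::resolve is described as doing: persist approvals/<sid>/<cid>.
function resolveApproval(sid: string, cid: string, approval: Approval): void {
  store.set("approvals", `${sid}/${cid}`, approval);
}

resolveApproval("s1", "c1", { decision: "allow" });
resolveApproval("s1", "c2", { decision: "deny", reason: "too risky" });
console.log(log);
```

Because the wake is driven by the write itself, a new approval surface only needs to produce the same write; nothing has to be registered per call.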

Fail-closed by construction: if the policy worker is unreachable or the 5-second timeout fires, `consultBefore` denies the call with a `gate_unavailable` envelope. If `iii::durable::publish` itself errors, the hook fanout returns `publish_failed: true` and the orchestrator treats it as a deny.
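Fail-closed behaviour is easy to get subtly wrong, so here is a minimal sketch of the pattern using `Promise.race`. The signature and the `Outcome` type are assumptions for illustration; the article does not show the real `consultBefore`.

```typescript
// Hypothetical sketch: the real consultBefore signature is not shown in the
// article; this only illustrates the fail-closed pattern.
type Outcome =
  | { decision: "allow" | "needs_approval" }
  | { decision: "deny"; reason: string };

async function consultBefore(
  checkPermissions: () => Promise<Outcome>,
  timeoutMs = 5000,
): Promise<Outcome> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<Outcome>((resolve) => {
    timer = setTimeout(() => resolve({ decision: "deny", reason: "gate_unavailable" }), timeoutMs);
  });
  try {
    // Whichever settles first wins: the policy answer or the timeout denial.
    return await Promise.race([checkPermissions(), timeout]);
  } catch {
    // An unreachable or throwing policy worker also results in a denial.
    return { decision: "deny", reason: "gate_unavailable" };
  } finally {
    clearTimeout(timer); // do not keep the process alive after the race
  }
}

// Usage: a policy check that never answers is denied after the timeout.
consultBefore(() => new Promise<Outcome>(() => {}), 20).then((outcome) => console.log(outcome));
```

The design choice worth noting is that every failure path converges on the same denial value, so callers never have to distinguish "policy said no" from "policy could not be asked" to stay safe.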

A few latency wins fall out of this shape. The after-function-call hook short-circuits `publish_collect` via a subscriber-presence cache when no durable subscriber is registered for the topic, removing roughly 500ms per executed function call. `tearing_down` is inlined into `finishSession()`, removing one durable queue hop per turn. `context-compaction` subscribes to a dedicated `agent::turn_end` stream the orchestrator emits at turn boundaries, so compactor wakeups are per-turn instead of per-event. The session-create fanout state trigger gates by scope alone and matches in-process, so the previous per-write `harness::session::is_create_event` RPC is gone.
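The subscriber-presence short-circuit can be illustrated with a toy fanout. The `HookFanout` class and its method names are hypothetical; the real worker caches presence of durable subscribers, which this sketch reduces to an in-memory map.

```typescript
// Hypothetical sketch: an in-memory stand-in for the publish-and-collect
// pattern with a presence check in front of it.
type Subscriber = (event: unknown) => unknown;

class HookFanout {
  private subscribers = new Map<string, Subscriber[]>();
  collectRounds = 0; // counts how often the slow collect path actually ran

  subscribe(topic: string, fn: Subscriber): void {
    this.subscribers.set(topic, [...(this.subscribers.get(topic) ?? []), fn]);
  }

  publishCollect(topic: string, event: unknown): unknown[] {
    const subs = this.subscribers.get(topic);
    // Presence check: with no subscriber there is nothing to wait for.
    if (!subs || subs.length === 0) return [];
    this.collectRounds += 1;
    return subs.map((fn) => fn(event));
  }
}

const fanout = new HookFanout();
const repliesBefore = fanout.publishCollect("after_function_call", { function_id: "fs::read" });
fanout.subscribe("after_function_call", () => "logged");
const repliesAfter = fanout.publishCollect("after_function_call", { function_id: "fs::read" });
console.log(repliesBefore, repliesAfter, fanout.collectRounds);
```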

After the batch completes, `steering_check` decides whether to continue, stop, or hit `max_turns`. If continue, loop back to `assistant_streaming`. If stop or max, `finishSession()` runs inline: emit `agent_end`, free the sandbox, transition to `stopped`.
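The loop described in this section can be written down as a transition table. The state names are the ones the text mentions; the event names, the `no_tool_calls` edge, and the table itself are assumptions for this sketch rather than the orchestrator's actual definition.

```typescript
// Hypothetical sketch: state names come from the text, event names and the
// exact edges are assumptions.
type TurnState =
  | "provisioning" | "assistant_streaming" | "function_execute"
  | "function_awaiting_approval" | "steering_check" | "stopped" | "failed";

type TurnEvent =
  | "provisioned" | "tool_calls" | "no_tool_calls" | "batch_done"
  | "approval_pending" | "approvals_resolved"
  | "continue" | "stop" | "max_turns" | "error";

const TRANSITIONS: Record<TurnState, Partial<Record<TurnEvent, TurnState>>> = {
  provisioning: { provisioned: "assistant_streaming" },
  assistant_streaming: { tool_calls: "function_execute", no_tool_calls: "steering_check" },
  function_execute: { batch_done: "steering_check", approval_pending: "function_awaiting_approval" },
  function_awaiting_approval: { approvals_resolved: "steering_check" },
  steering_check: { continue: "assistant_streaming", stop: "stopped", max_turns: "stopped" },
  stopped: {},
  failed: {},
};

function step(state: TurnState, event: TurnEvent): TurnState {
  if (state === "stopped" || state === "failed") {
    throw new Error(`${state} is terminal`);
  }
  if (event === "error") return "failed"; // unexpected handler throw
  const next = TRANSITIONS[state][event];
  if (!next) throw new Error(`no transition from ${state} on ${event}`);
  return next;
}

// Walk one turn that parks a call for approval, loops once, then stops.
let current: TurnState = "provisioning";
const events: TurnEvent[] = [
  "provisioned", "tool_calls", "approval_pending", "approvals_resolved",
  "continue", "no_tool_calls", "stop",
];
for (const event of events) current = step(current, event);
console.log(current);
```

Expressing the machine as data keeps teardown out of the table entirely, which mirrors the text: `stopped` is reached from `steering_check`, and cleanup is whatever runs on that edge.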

Throughout the whole run, every worker that participates emits OTel spans tagged with `iii.session.id`, `iii.message.id`, and `iii.function.id`. Those tags are what the engine’s `engine::traces::group_by` reads to populate “Group by Session” / “Group by Message” / “Group by Function” in the traces UI. The instrumentation is automatic: `src/runtime/worker.ts` wraps every `registerFunction` in a `Proxy` so no per-worker code has to remember to add spans.
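The `Proxy` technique is standard JavaScript and can be shown without any tracing library. In this sketch the `BusWorker` interface, the `spans` array, and `instrument` are hypothetical; a real implementation would start and end OpenTelemetry spans where this one pushes a record.

```typescript
// Hypothetical sketch: spans are recorded in an array instead of being sent
// to an OpenTelemetry exporter.
type Handler = (payload: unknown) => unknown;

interface BusWorker {
  registerFunction(id: string, handler: Handler): void;
}

interface Span { functionId: string; ok: boolean; }
const spans: Span[] = [];

function instrument<T extends BusWorker>(worker: T): T {
  return new Proxy(worker, {
    get(target, prop, receiver) {
      if (prop !== "registerFunction") return Reflect.get(target, prop, receiver);
      // Intercept registration and wrap the handler so each call emits a span.
      return (id: string, handler: Handler) =>
        target.registerFunction(id, (payload) => {
          try {
            const result = handler(payload);
            spans.push({ functionId: id, ok: true });
            return result;
          } catch (err) {
            spans.push({ functionId: id, ok: false });
            throw err;
          }
        });
    },
  });
}

// A toy worker that stores handlers in a map.
const handlers = new Map<string, Handler>();
const worker = instrument({
  registerFunction: (id: string, handler: Handler) => void handlers.set(id, handler),
});
worker.registerFunction("math::double", (n) => (n as number) * 2);
const doubled = handlers.get("math::double")?.(21);
console.log(doubled, spans);
```

Worker authors call `registerFunction` as usual and never see the wrapper, which is what makes the instrumentation impossible to forget.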

Build your own

The interesting part is that none of the workers above are special. Each one is a process that opens a WebSocket to the engine, registers some functions and triggers, and runs. The contract is the same as the contract every application worker uses. The harness is built on the same primitive your business logic is built on.

Which means “build your own harness” decomposes into the same operation as “write any worker.” You pick the layer you want to replace, you write a worker that registers the same functions on the bus, you `iii worker add` it, and the rest of the stack starts using your worker.

Two layers don’t show up in the worker table above but matter for how the harness behaves. Skills are how each worker advertises what its functions do. Every worker can publish a skill at `iii://<worker>/<function>` that the agent fetches via `directory::skills::get` before calling that function for the first time. The system prompt is assembled per turn from a mode paragraph, the iii identity preamble, and the default skill bodies the run was configured with. Both are bus-driven: skills are served by the iii-directory worker, the system prompt is assembled by the turn-orchestrator. Both are replaceable.

Five concrete examples.

**Replace the model catalogue with a live API.** Write a worker that registers `models::list`, `models::get`, `models::supports`. Have it fetch from your provider’s catalog endpoint every N minutes and cache. Publish it. `iii worker add your-org/dynamic-models-catalog`. Stop the static models-catalog worker. The turn-orchestrator never knows the difference. It calls `iii.trigger('models::list')` and the engine routes to whichever worker registered that function id most recently.
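The routing rule in that last sentence is what makes the swap invisible to callers. The following toy bus models it; `Bus`, `register`, and `refresh` are hypothetical names for this sketch and are not the iii SDK, and the model ids are placeholders.

```typescript
// Hypothetical sketch: a toy in-memory bus, not the iii engine or its SDK.
type BusHandler = (payload?: unknown) => unknown;

class Bus {
  private handlers = new Map<string, BusHandler>();

  // A later registration for the same function id replaces the earlier one.
  register(id: string, handler: BusHandler): void {
    this.handlers.set(id, handler);
  }

  trigger(id: string, payload?: unknown): unknown {
    const handler = this.handlers.get(id);
    if (!handler) throw new Error(`no worker registered ${id}`);
    return handler(payload);
  }
}

const bus = new Bus();

// Static catalogue worker.
bus.register("models::list", () => ["static-model-a"]);

// Replacement worker: serves a cache refreshed from a live source.
let cachedModels: string[] = [];
function refresh(fetchModels: () => string[]): void {
  cachedModels = fetchModels();
}
bus.register("models::list", () => cachedModels);
refresh(() => ["live-model-a", "live-model-b"]);

// The caller's code is identical before and after the swap.
console.log(bus.trigger("models::list"));
```

In a real worker `refresh` would run on a timer against the provider's catalog endpoint; it is passed a function here so the sketch stays offline.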

**Add a new provider.** The shape is one that `provider-kimi` and `provider-lmstudio` already prove out. Each is one worker that registers `provider::<name>::stream` and `provider::<name>::complete`, drains an SSE stream from the upstream API into an iii channel, and writes its model usage to llm-budget via `budget::record`. Adding a fifth provider is writing one folder with one `iii.worker.yaml` and one `register.ts`. Publish to the registry, or keep it local. The turn-orchestrator picks the provider by the run’s `provider` field; new providers become available the instant the worker connects.
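The core of a provider worker is the SSE drain. As a rough sketch, assuming the upstream speaks the OpenAI-style Chat Completions stream format (`data: {json}` lines ending with `data: [DONE]`), the parsing step might look like this. `drainSse` and the array standing in for an iii channel are hypothetical, and the function takes a whole response body where a real worker would consume chunks as they arrive.

```typescript
// Hypothetical sketch: assumes OpenAI-style Chat Completions SSE lines and
// uses a plain array in place of an iii channel.
interface StreamResult { text: string; done: boolean; }

function drainSse(raw: string, onDelta: (delta: string) => void): StreamResult {
  let text = "";
  for (const line of raw.split("\n")) {
    if (!line.startsWith("data:")) continue; // skip blank lines and comments
    const data = line.slice("data:".length).trim();
    if (data === "[DONE]") return { text, done: true };
    const chunk = JSON.parse(data) as { choices?: { delta?: { content?: string } }[] };
    const delta = chunk.choices?.[0]?.delta?.content;
    if (delta) {
      text += delta;
      onDelta(delta); // forward each token delta to the channel
    }
  }
  return { text, done: false }; // stream ended without the terminator
}

const channel: string[] = [];
const sseBody = [
  'data: {"choices":[{"delta":{"content":"Hel"}}]}',
  "",
  'data: {"choices":[{"delta":{"content":"lo"}}]}',
  "",
  "data: [DONE]",
  "",
].join("\n");
const streamed = drainSse(sseBody, (delta) => channel.push(delta));
console.log(streamed, channel);
```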

**Serve skills from a private artifact store.** Write a worker that registers `directory::skills::get` and `directory::skills::list`, backed by your internal docs system or a private S3 bucket. Disconnect or rename the default iii-directory worker. The orchestrator’s bootstrap calls `directory::skills::download` per namespace; your worker answers. The agent’s “fetch the per-function skill before calling a new function” pattern keeps working unchanged because the wire shape is the same.

**Override the system prompt entirely.** `run::start` accepts an optional `system_prompt` field. Pass it and the orchestrator uses your string verbatim, skipping the mode paragraph + identity preamble + skills appendix assembly. Useful when you have an existing prompt asset you want the harness to honour without modification. Skill download still runs in bootstrap, so the agent keeps `directory::skills::get` on-demand discovery even with a custom prompt.

**Replace the approval gate UI surface.** The default `approval-gate` worker registers `approval::resolve`. The wire schema is one function call:

// Resolve one parked tool call for one session.
iii.trigger('approval::resolve', {
  session_id: '...',
  function_call_id: '...',
  decision: 'allow', // one of 'allow' | 'deny' | 'aborted'
  reason: 'optional human text',
})

The handler persists `approvals/<sid>/<cid> = {decision, reason}` to iii state. The orchestrator’s single `turn::on_approval` state trigger picks that write up and wakes the right session. If you want to drive approvals from Slack instead of the console, write a Slack worker that listens for `/approve <id>` and `/deny <id>` slash commands, then calls `approval::resolve` with the right payload. The orchestrator never knows the difference. The whole approval-gate worker stays untouched. You added a new worker; you didn’t replace the existing one.
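The Slack worker's only real logic is turning command text into that payload. A minimal sketch follows, assuming the `<id>` is written as `<session_id>:<function_call_id>` and that any trailing text is the reason; both conventions, and the `parseSlashCommand` name, are made up for this example.

```typescript
// Hypothetical sketch: the "<session_id>:<function_call_id>" id format is an
// assumption, not something the article specifies.
interface ResolvePayload {
  session_id: string;
  function_call_id: string;
  decision: "allow" | "deny";
  reason?: string;
}

function parseSlashCommand(text: string): ResolvePayload | null {
  const match = /^\/(approve|deny)\s+([^\s:]+):(\S+)(?:\s+(.+))?$/.exec(text.trim());
  if (!match) return null; // not an approval command, or malformed id
  const [, verb, session_id, function_call_id, reason] = match;
  return {
    session_id,
    function_call_id,
    decision: verb === "approve" ? "allow" : "deny",
    ...(reason ? { reason } : {}),
  };
}

console.log(parseSlashCommand("/approve s1:c9 looks fine"));
console.log(parseSlashCommand("/deny s1:c9"));
```

A non-null result would then be passed as the second argument to `iii.trigger('approval::resolve', ...)`, exactly as the console does.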

If you want a different policy engine (OPA, Cedar, your own DSL), write a worker that registers `policy::check_permissions` and returns `{ decision, rule_id?, matched_constraint? }`. Disconnect the default policy worker (which is wrapped inside the harness meta-worker, so you’d disable that handler or run a stripped-down meta-worker). The turn-orchestrator’s `consultBefore` doesn’t know the difference. Same 5-second timeout, same fail-closed semantics, same wire shape.

The point of these examples isn’t the specific replacements. It’s the shape of the operation. Every harness layer in the iii stack is reachable through one or two function ids on the bus. Replacing a layer is writing a worker that registers those ids. The rest of the system stays.

The harness is a slider, not a fork in the road

The classic harness debate frames itself as thin vs thick. Anthropic’s thin loop versus LangGraph’s explicit DAG. The framing assumes you pick one side and live with it.

When the harness is composed of workers on the same bus, thin vs thick is just a count of how many workers you install. A thin harness is `turn-orchestrator` plus `provider-anthropic` plus `auth-credentials` plus a minimal `harness` meta-worker. That’s it. No approvals, no budgets, no policy engine, no hook fanout. Run anything. Trust the model. Useful for autonomous research agents, experimental loops, anything internal.

A thick harness is all thirteen workers plus `context-compaction` plus a custom policy worker plus a custom approval-gate plus a Slack-integrated approval surface plus the budget worker enforcing per-workspace caps. Useful for an agent running customer workflows where every tool call needs to be auditable and every model spend has to roll up to a finance dashboard.

The architectural distance between thin and thick isn’t a rewrite. It’s a config change. Same primitives, same wire protocol, same trace shape, same observability story. The slider moves by adding and removing workers from your `config.yaml`. Everything else holds.

It applies inside a single worker too. The turn-orchestrator just shipped a refactor that collapsed its FSM from eleven states to seven, deleted the per-call `turn::approval_resume::<sid>/<cid>` mechanism in favour of one reactive `turn::on_approval` state trigger on scope `approvals`, and inlined `tearing_down` into a `finishSession()` port. Every other worker in the stack (approval-gate, session, llm-budget, providers, models-catalog, auth-credentials, hook-fanout, context-compaction) stayed unchanged. The `approval::resolve` wire shape didn’t move. The contracts held. That’s the property the composition gives you: a major internal rewrite of one worker is a self-contained change because every neighbour talks to it through bus-level function ids.

This is the part the framework model can’t give you. A framework picks a position on the slider for you and locks you in. The worker model leaves the slider in your hand.

What this means in practice

If you’ve been running an agent on top of a framework and feeling the same boundary problems most teams hit at scale, the answer is probably not “rewrite the harness in our own framework.” The policy engine doesn’t extend the way you need. The approval UI is wired into the framework’s chat surface. The credential store can’t talk to your secrets manager. The budget tracker is in a sidecar database the trace can’t see. The answer is to switch to a substrate where the harness is decomposed in the first place.

The fastest way to feel the argument is to clone github.com/iii-hq/workers, run `pnpm install` and `pnpm build`, and run the composite entry point. You’ll get the full fourteen-worker harness pointed at an iii engine. You can disable any worker by removing its entry from the boot list. You can swap any worker by writing a replacement that registers the same function ids. You can extend any worker by adding a subscriber to its hook topics. `hook-fanout::publish_collect` is the generic pattern every iii hook builds on.

The docs live at iii.dev/docs. The engine is at github.com/iii-hq/iii. The worker registry is at workers.iii.dev. The harness bundle is at github.com/iii-hq/workers/harness.

The bet

A harness is not a thing you install. A harness is a set of jobs your system has to do for an agent to run durably, safely and observably. The framework era bundled those jobs together because nothing underneath gave you a way to compose them.

iii’s bet is that one primitive, a worker that connects to the engine over WebSocket and registers functions and triggers, is small enough to absorb every one of those jobs separately, and that the resulting stack is more useful than any framework because every layer is independently replaceable.

You don’t adopt the iii harness. You install the workers you want, write the ones you need, and end up with a harness shaped exactly like your system. Same protocol on every layer. Same trace across every call. Same `iii worker add` for the parts you take from the registry as for the parts you publish yourself.

That’s what “build your own agent harness” looks like when the substrate is the right shape. Pick the workers. Write the missing ones. Compose. The harness is the composition.

Join us in building the perfect agent harness that the modern world needs: discord.gg/iiidev

iii is open source. Get started at iii.dev/docs. The harness workers are at github.com/iii-hq/workers and the engine is at github.com/iii-hq/iii.

— Mike Piccolo, Founder & CEO
