@kmeanskaran: https://x.com/kmeanskaran/status/2071160257943052683
Summary
A detailed guide on building a production-grade agent harness for multi-agent LLM systems, covering components like orchestrator, subagents, skills, backend state management, and context engineering.
View Cached Full Text
Cached at: 06/28/26, 04:11 PM
A Guide to Shipping Your Agent Harness into Production
Getting one LLM call to work is easy. Getting five agents to work together reliably in production is a different problem entirely. Real users break things. Real costs add up fast. Real failures cascade in ways your local environment never shows you.
The difference is the harness.
This guide covers the complete engineering picture: what an Agent Harness is, the components that make it up, how to design the backend that supports it, and the optimizations that turn a working system into a production-grade one.
Part 1: The Agent Harness
What it is
An Agent Harness is the infrastructure layer that wraps a language model into a useful system. It is not the model. It is not the prompt. It is the scaffolding that answers every question the model itself cannot:
-
Where does work product live between agent steps?
-
What context does each agent see, and what is it explicitly excluded from seeing?
-
How do agents coordinate without stepping on each other’s context windows?
-
What happens when an agent produces bad output or a call fails?
-
How is cost tracked, bounded, and controlled?
The LLM is a reasoning engine. The harness is the system that makes it useful. Every serious agent deployment is built on one explicitly designed or accidentally accumulated. The explicit version is faster, cheaper, and easier to debug.
The five components
A well-designed harness composes five concepts. Each one is a solution to a specific failure mode that emerges when you try to run agents in production.
Agent Harness: The Five Components
Agent Harness: The Five Components
The five components are:
-
**Orchestrator: ** reads the brief, delegates in sequence, verifies completion, reports done
-
**Subagents: **isolated execution contexts, each with one job and one output artifact
-
**Skills: ** per-agent knowledge documents that define role, output format, and rules
-
**Backend: **the shared virtual filesystem where agents pass work product between steps
-
**Context Engineering: ** the discipline of controlling what each agent sees, when, and in what order
Understanding why each component exists is more useful than memorizing the API. The design decisions that follow from understanding the why are the ones that scale.
Part 2: Backends
The backend is the state management layer of the harness. It answers the question every multi-agent system must answer: where does work product live?
The problem with message-passing state
The naive approach is to pass agent outputs through conversation messages. Agent A produces structured insights and returns them. The orchestrator stores the response and passes it to Agent B as message content.
This creates three compounding problems.
-
Context bloat. Every message in the orchestrator’s thread costs tokens on every subsequent call. After four agents have run and echoed their outputs back, the orchestrator is carrying thousands of tokens of content that only the next agent will need. You pay for it on every call until the job ends.
-
**No inspection surface. **There is no way to read what Agent A produced without parsing the orchestrator’s conversation history. Debugging means wading through reasoning traces.
-
Coupling. Agent B’s behavior depends on the exact format of Agent A’s response appearing in the conversation. A format change in A’s output schema breaks B. The coupling is invisible until it fails.
The virtual filesystem
The StateBackend solves all three problems by giving every job a virtual filesystem is an in-memory workspace scoped to the duration of the run.
Agents don’t pass content to each other through messages. They write files to the workspace and read from it.
Agents orchestration with skills.md
Agents orchestration with skills.md
A typical workspace for a five-agent pipeline looks like this: a brief.md seeded at job start (the only input), then one file written per agent that extracted insights, drafts for each platform, and a final review notes file. When the extractor finishes, it writes its output file and confirms in one sentence. The orchestrator stores one line in its thread and not the content. The next agent reads directly from the file when it runs.
This single change reduces the orchestrator’s accumulated context by roughly 85% on a five-agent pipeline. It makes every intermediate output inspectable. And it decouples agents from each other where each agent reads a file by path, not a message by position.
Seeding and result assembly
The workspace starts empty. Before the orchestrator runs, seed it with everything the pipeline needs: skill files, shared context documents, and the per-job brief. Seeding separates the file system setup from the agent’s runtime where the workspace is fully defined before the first LLM call is made.
When the pipeline finishes, read the workspace to assemble the structured result. This is the only place where file content is read by the calling code.
Best practice: never read workspace content during the pipeline run from the orchestrator’s code. The orchestrator infers progress from file existence which files have which appeared not file content. Content is read exactly once, at the end.
Progress inference from the workspace
One under appreciated property of the file-based workspace: you can infer pipeline progress from which files exist. No explicit progress callbacks needed in the agent code. The workspace state is the progress state.
Part 3: Skills
What skills are
Every agent has a specific job. The job description is what to produce, what format it should take, what rules to follow which lives in a Skill file. Skills are markdown documents loaded into the agent’s context alongside its system prompt.
The key design decision is progressive disclosure: each agent loads only the skill it needs.
The naive approach is to include all skills for all agents. This is wrong for two reasons.
Token cost. Skills are static context loaded on every call. An X writer loading LinkedIn formatting instructions pays for those tokens every time it runs and contributing nothing to the output.
Focus degradation. Models occasionally apply instructions from the wrong context. An agent carrying instructions it doesn’t need will, with some non-zero probability, produce output influenced by those instructions. The more irrelevant context an agent carries, the worse this gets.
Each agent declares exactly which skill it loads. The skill resolver loads only the matching file. Five agents. Five skills. Each agent sees only its own.
Tool scoping follows the same principle
Tools are an extension of the skills concept. Like skills, they add tokens (tool definitions count toward input) and add behavioral surface area.
Give agents only the tools they can actually use. The reviewer agent gets a fact-checking tool and it’s the only one that verifies external claims. No other agent gets it. A writer that can’t call external APIs gets no tools. A tool it can’t use is overhead: tokens paid, behavioral risk incurred, no upside.
Best practice: the skill and tool set of an agent should be the minimum required to do its specific job. Expand scope only when a specific task demands it.
Part 4: Subagents and Isolated Contexts
Why shared context windows fail
When you run a multi-agent pipeline in a single conversation thread, every agent accumulates the full history. By the time the reviewer runs, it’s carrying the orchestrator’s planning reasoning, every writer’s output confirmation, and any back-and-forth from corrections. You pay for every one of those tokens. Quality degrades because the reviewer is reasoning in a context full of content that has nothing to do with its job.
Subagents run in isolation
Each subagent gets a fresh conversation: its own system prompt, its own skill, and only what it explicitly reads from the workspace.
Shared Thread of agents vs Isolated Contexts (Subagents)
Shared Thread of agents vs Isolated Contexts (Subagents)
The orchestrator maintains a flat list of agent descriptions. When it decides to delegate, it picks by description. The subagent runs in its own context, does its work, writes to the workspace, and its context is garbage-collected when done.
The cost difference is significant. A five-agent pipeline run entirely in one shared thread accumulates 15,000–30,000 tokens of history by the final step. The same pipeline with isolated subagents keeps each agent’s context at 2,000–5,000 tokens. The orchestrator’s thread stays lean because it accumulates file paths and one-line confirmations, not content.
**Best practice: **Design subagents to have exactly one output artifact in the workspace. One job, one file. This makes progress tracking trivial and outputs inspectable.
Part 5: Context Engineering
Context engineering is the discipline of controlling what enters an agent’s context window, when, and in what order. It has more impact on cost and quality than any other engineering decision.
The static-before-dynamic rule
This is the foundational rule. It must be followed without exception.
Static content is anything identical across many requests: system prompts, skill files, tool definitions, shared context documents. Dynamic content is anything that changes per request: job IDs, user input, timestamps, per-job parameters.
The correct ordering is: tool definitions first (most stable), then system prompt, then skill files, then conversation history, then the current user message (fully dynamic, never cached).
Provider-level prompt caching stores computed tensor representations of the prefix up to the first dynamic content. A cache read costs 10% of normal input price. A cache miss costs 100%.
Every violation of static-before-dynamic breaks the cache prefix. Common violations: putting a job ID or timestamp in the system prompt, embedding user-specific data in a skill file path, including environment-specific flags in system messages, or changing tool definitions between requests. The fix is always the same that move the dynamic value to the user turn message.
Durable memory vs. per-job context
Not all context ages the same way.
Durable memory is stable across every job: project conventions, behavioral guidelines, how agents should handle edge cases. This lives in a shared document loaded as memory at graph construction time. It gets computed and cached once per worker process.
Per-job context is task-specific: the user’s README, the requested platforms, the tone. This belongs in a brief document seeded into the workspace at job start where the only dynamic input to the pipeline.
The test for which category a piece of context belongs in: would it be identical across 1,000 different jobs? If yes, it’s durable memory. If it changes per job, it belongs in the brief and should never appear in the system prompt.
Thread compaction
For long-running pipelines, conversation threads grow. Older turns are less relevant than recent turns but still cost tokens on every subsequent call.
Thread summarization middleware handles this automatically. When the thread exceeds a configurable token threshold, older turns are compacted into a summary paragraph. The last N messages stay verbatim and recent context is the most relevant.
Best practice: Don’t set the summarization threshold too low. Summarizing at 80% of context capacity gives the agent room to work without constantly compacting. Summarizing at 40% wastes tokens on summary overhead.
Part 6: The Orchestrator
Coordination, not execution
The orchestrator’s role is exactly one thing: read the brief, delegate in sequence, verify completion, report done.
It explicitly does not produce content. This is a hard architectural constraint, not a guideline.
The orchestrator has the broadest context in the system. If it starts reasoning about domain-specific tasks like writing content, making factual judgments, formatting outputs and it will produce adequate results at the cost of long reasoning traces filling its context window, confusion between its coordination role and the domain role it just assumed, and bypassing the quality controls that subagents implement. Any time the orchestrator is tempted to produce domain-specific output, a subagent is missing from the design.
Build once, reuse always
The orchestrator’s construction is expensive: loading skill files from disk, initializing the model client, compiling the graph. Cache it at the process level. Build once per worker, reuse across every job.
A worker processing 40 jobs per hour builds the graph once and reuses it 40 times. A module-level singleton is Python’s simplest and most reliable pattern here.
Enforcing pipeline invariants in the prompt
The orchestrator’s system prompt is where pipeline rules are codified: only generate platforms listed in the brief, always run verification last, pass the job ID explicitly to every subagent, never write draft files directly, confirm file existence before reporting done.
These are coordination rules, not hints. Make them explicit and direct. The orchestrator’s prompt should read like a technical runbook, not a creative brief.
Part 7: The Caching Stack
Caching in agent systems has more leverage than in traditional applications because you can avoid work at multiple levels. Each level has different cost savings, different hit rates, and different implementation complexity.
3-layer caching
3-layer caching
Layer 1: Provider prompt caching
Anthropic’s prompt cache stores computed KV tensor representations of the stable prompt prefix. Subsequent requests with identical prefixes read from cache at 10% of normal input token price.
The prerequisite is strict adherence to the static-before-dynamic ordering. Cache control is applied transparently on every system message — the call site doesn’t change.
Critical constraints engineers miss: the minimum token threshold is 1,024 tokens. Content below this fails to cache silently; no error, no warning, just a cache miss and full-price billing. Tool definition changes invalidate the entire cache hierarchy. You get at most four cache breakpoints per request.
Layer 2: Redis LLM response cache
Prompt caching reduces token cost but doesn’t eliminate API latency where you still make an HTTP call and wait for a response. A Redis response cache operates upstream of the API entirely: a cache hit means no HTTP call, no latency, zero cost.
Every LLM call in the system, orchestrator and all subagents automatically checks Redis before making any API call. The cache key is a hash of the full serialized message list combined with the model configuration. Including the model configuration in the key means a model upgrade creates new keys automatically.
Version your keys on prompt changes. Without versioning, a broken prompt gets cached and served for hours. A version bump at deploy time flushes the entire cache without touching Redis directly.
TTL by environment: Development at 5 minutes (prompt edits visible immediately), Staging at 1 hour (stable enough to catch regressions), Production at 24 hours (maximize cost savings).
Layer 3: Content identity cache
For systems where the same source material recurs popular open-source repos submitted by different users, documents processed multiple times and a content-identity cache can eliminate the most expensive pipeline step entirely.
Hash the raw source content. The same document with different user-specified parameters hashes to the same key because the source content is identical. This cache operates on content identity, not prompt identity. It bypasses the LLM entirely on a hit: no API call, no tokens, no latency. TTL can be much longer even seven days is reasonable for most content.
Part 8: Token Optimization
Tokens are the cost unit of LLM systems. Every inefficiency compounds across every user, every request, every retry.
Estimate before you execute
Never run a job without estimating its token cost first. A rough estimator uses character count divided by four (a reliable approximation for English text) plus fixed overhead for skills and context files. Log this for every job. After a week of production traffic you have real P50/P95 data. The numbers that let you set alert thresholds with confidence instead of guesswork.
Validate and truncate at the boundary
Validate input size before the job enters the queue. When input is oversized, truncate rather than reject some users with large inputs should still get results, just from the most information-dense portion of their content.
Snap truncation to a structural boundary (paragraph break, section heading) so the model doesn’t receive mid-sentence content. Append a truncation marker so the model knows the document is incomplete.
Model routing by task
Not every agent in your pipeline needs the most capable (and expensive) model. Structured extraction from a markdown document is something a smaller model handles well. Multi-document cross-referencing with judgment calls benefits from stronger reasoning.
Route cheaper models to extraction, classification, and format validation. Reserve capable models for final review and complex multi-step reasoning. Done correctly, this reduces total pipeline cost by 40–60% with no quality reduction on the overall output.
Part 9: The Async Job Architecture
Why you need a job queue
Agent pipelines take 45–120 seconds for non-trivial work. HTTP connections timeout in 30 seconds by default. Even if they don’t, holding a connection open per active job is a poor use of resources.
The correct architecture: accept the request immediately, return a job ID, run the pipeline asynchronously. The client polls for status. The result is written to a fast store the moment the worker completes. Total overhead for the poll loop: milliseconds.
Agent Orchestrator
Agent Orchestrator
Celery for LLM workloads
LLM tasks are I/O-bound, not CPU-bound. A worker thread spends most of its time waiting for API responses. This means you can run far more concurrency than CPU cores. The gevent pool uses cooperative multitasking where threads yield during I/O waits, allowing other tasks to run. A 4-core machine running 32 concurrent LLM jobs is reasonable when each job spends 70–80% of its time waiting for API responses.
The dual-store pattern
Redis is fast but ephemeral. Postgres is durable but slower. Use both for different purposes.
Redis handles the real-time polling use case like clients check status every few seconds and need sub-millisecond responses. Postgres handles the historical use case: user job history, billing, debugging jobs from yesterday.
The write pattern: write to both on every state change. The read pattern: check Redis first, fall back to Postgres if the key has expired. Never make Redis your source of truth. TTL expiration is silent.
Part 10: Development Workflow
Local model first, cloud model for validation
Development to Production
Development to Production
Separate development iteration from cost by building your harness to accept any LangChain-compatible model. Use a local Ollama model during development is free, fast, no API key needed.
Local output is lower quality than frontier models. It is good enough to verify that files are written to the correct workspace paths, the orchestrator delegates in the correct sequence, result assembly parses workspace files into the expected structure, and error handling works as designed.
When a skill file change needs quality validation, flip to the cloud provider for a single test run. Switch back immediately. This separation makes the iteration cycle for prompt and skill development essentially free.
Testing the harness, not the model
Unit tests for an agent harness should test harness behavior, not model output. The model is non-deterministic; the harness is not.
What to test: workspace seeding produces the expected file structure, result assembly correctly reads each file type, token estimation returns correct totals for known inputs, truncation snaps to paragraph boundaries correctly, cache key generation is deterministic, and state transitions follow the defined valid transition map.
What not to test at the unit level: whether the model produces good content. That’s integration testing with a real model, run periodically, not on every commit.
Part 11: Observability
Observability in LLM systems is harder than in traditional systems because the most important failure mode is qualitative degradation (invisible without the right tooling). A slow API call shows up in latency metrics. An agent that produces subtly wrong content doesn’t.
Structured logs
Every log line should be machine-parseable. Consistent field names across all log lines like job ID, status, step, elapsed time, whether the response was cached means you can grep, filter, and aggregate across your entire log history without a structured logging system. With this format, extracting P95 latency is a shell one-liner against your log file.
LLM call tracing
Structured logs give you job-level visibility. A tracing layer gives you call-level visibility: the full prompt, the response, actual token counts, per-call latency broken down by time-to-first-token and generation time, and the complete call tree across orchestrator and subagents.
When a user reports wrong output, you open the trace for their job ID and see exactly what prompt produced the problem. Without this, debugging hallucinations or incorrect agent behavior is guesswork.
Cache hit rate as a cost signal
Track cache performance explicitly. A sudden drop in LLM response cache hit rate is a signal that a prompt changed which usually means a dynamic value leaked into a previously static section, a model was upgraded, or a skill file was accidentally modified. Alert on this. It’s a cost event that’s easy to miss until the billing cycle closes.
Summary: Design Principles
Every decision in this guide comes from one of five principles.
Minimize accumulated context. Agents that carry less context are cheaper, faster, and more focused. Every component like the workspace backend, subagent isolation, thread compaction serves this principle.
Static before dynamic, always. Prompt caching is the highest-ROI optimization in a deployed agent system. It requires static content to come before dynamic content in every prompt, without exception.
Scope knowledge to role. Agents should know exactly what they need to do their job. Skills, tools, and context files should be narrowed to the minimum. Unnecessary context costs tokens and degrades focus.
Separate concerns cleanly. Orchestrators coordinate. Subagents execute. Backends hold state. These roles should not overlap. When they do, debugging becomes significantly harder.
Estimate, validate, and bound before spending. Token costs compound. Input validation, pre-execution estimation, and hard ceilings prevent runaway spend from becoming a production incident.
The infrastructure described here is not glamorous. None of it shows up in a demo. All of it is the difference between an agent that works on your laptop and a system that serves real users reliably.
Closing Thoughts
This guide is written from the other perspective: the one where you have real users, real costs, and a system that has to work at 3am when you’re not watching it.
The five components don’t solve interesting AI problems. They solve boring infrastructure problems. The boring problems are the ones that kill production systems.
I will be publishing one project on Agent Harness and Ops with video demo. Agent-Harness-Ops project deployment will be on @Railway to understand CI/CD lifecycle.
The model is the easy part. It was always the easy part.
Follow @kmeanskaran for more deployment and Ops article on AI/ML.
Further reading:
-
Externalization in LLM Agents: Memory, Skills, Protocols and Harness Engineering (arXiv 2604.08224)
-
Prompt Caching — Anthropic API Docs
-
Don’t Break the Cache: Prompt Caching for Long-Horizon Agentic Tasks (arXiv 2601.06007)
-
Optimizing Sequential Multi-Step Tasks with Parallel LLM Agents (arXiv 2507.08944)
-
Context Window Overflow — Redis Blog
-
Taming the AI Inference Queue: Redis, Celery & RabbitMQ at Scale
Similar Articles
@eyad_khrais: https://x.com/eyad_khrais/status/2069552027382980882
A comprehensive guide to building AI agent harnesses, covering tool execution, context management, state/memory, and guardrails, based on lessons from building Claude Code and other harnesses for enterprise.
@sydneyrunkle: https://x.com/sydneyrunkle/status/2062217190724579673
A guide on building custom agent harnesses using LangChain's create_agent, focusing on middleware for customization.
@Potatoloogs: https://x.com/Potatoloogs/status/2057391224592667051
This article deeply analyzes the concept of Agent Harness, which is the engineering infrastructure wrapped around an LLM, including 12 components such as orchestration loops, tool calling, memory systems, context management, etc. The article cites practices from companies like Anthropic, OpenAI, and LangChain, arguing for the critical role of the harness in production-grade AI agents.
@janehu07: https://x.com/janehu07/status/2058359677843599494
This learning note introduces the concept of an agent harness as the infrastructure layer around an LLM, proposing the ETCLOVG taxonomy (Execution, Tooling, Context, Lifecycle, Observability, Verification, Governance) and demonstrating its application through a coding agent case study.
best of the best agentic harnesses do this…
The author shares insights on building effective agent harnesses: the best ones minimize LLM reliance for trivial tasks and reserve LLMs for complex reasoning, distinguishing genuine harnesses from simple wrappers.