@kmeanskaran: https://x.com/kmeanskaran/status/2071160257943052683

X AI KOLs Timeline 06/28/26, 09:14 AM News

production agent-harness multi-agent llm backend orchestration context-engineering

Summary

A detailed guide on building a production-grade agent harness for multi-agent LLM systems, covering components like orchestrator, subagents, skills, backend state management, and context engineering.

https://t.co/8bdau2a2as

Original Article

View Cached Full Text

Cached at: 06/28/26, 04:11 PM

A Guide to Shipping Your Agent Harness into Production

Getting one LLM call to work is easy. Getting five agents to work together reliably in production is a different problem entirely. Real users break things. Real costs add up fast. Real failures cascade in ways your local environment never shows you.

The difference is the harness.

This guide covers the complete engineering picture: what an Agent Harness is, the components that make it up, how to design the backend that supports it, and the optimizations that turn a working system into a production-grade one.

Part 1: The Agent Harness

What it is

An Agent Harness is the infrastructure layer that wraps a language model into a useful system. It is not the model. It is not the prompt. It is the scaffolding that answers every question the model itself cannot:

Where does work product live between agent steps?
What context does each agent see, and what is it explicitly excluded from seeing?
How do agents coordinate without stepping on each other’s context windows?
What happens when an agent produces bad output or a call fails?
How is cost tracked, bounded, and controlled?

The LLM is a reasoning engine. The harness is the system that makes it useful. Every serious agent deployment is built on one explicitly designed or accidentally accumulated. The explicit version is faster, cheaper, and easier to debug.

The five components

A well-designed harness composes five concepts. Each one is a solution to a specific failure mode that emerges when you try to run agents in production.

Agent Harness: The Five Components

The five components are:

**Orchestrator: ** reads the brief, delegates in sequence, verifies completion, reports done
**Subagents: **isolated execution contexts, each with one job and one output artifact
**Skills: ** per-agent knowledge documents that define role, output format, and rules
**Backend: **the shared virtual filesystem where agents pass work product between steps
**Context Engineering: ** the discipline of controlling what each agent sees, when, and in what order

Understanding why each component exists is more useful than memorizing the API. The design decisions that follow from understanding the why are the ones that scale.

Part 2: Backends

The backend is the state management layer of the harness. It answers the question every multi-agent system must answer: where does work product live?

The problem with message-passing state

The naive approach is to pass agent outputs through conversation messages. Agent A produces structured insights and returns them. The orchestrator stores the response and passes it to Agent B as message content.

This creates three compounding problems.

Context bloat. Every message in the orchestrator’s thread costs tokens on every subsequent call. After four agents have run and echoed their outputs back, the orchestrator is carrying thousands of tokens of content that only the next agent will need. You pay for it on every call until the job ends.
**No inspection surface. **There is no way to read what Agent A produced without parsing the orchestrator’s conversation history. Debugging means wading through reasoning traces.
Coupling. Agent B’s behavior depends on the exact format of Agent A’s response appearing in the conversation. A format change in A’s output schema breaks B. The coupling is invisible until it fails.

The virtual filesystem

The StateBackend solves all three problems by giving every job a virtual filesystem is an in-memory workspace scoped to the duration of the run.

Agents don’t pass content to each other through messages. They write files to the workspace and read from it.

Agents orchestration with skills.md

A typical workspace for a five-agent pipeline looks like this: a brief.md seeded at job start (the only input), then one file written per agent that extracted insights, drafts for each platform, and a final review notes file. When the extractor finishes, it writes its output file and confirms in one sentence. The orchestrator stores one line in its thread and not the content. The next agent reads directly from the file when it runs.

This single change reduces the orchestrator’s accumulated context by roughly 85% on a five-agent pipeline. It makes every intermediate output inspectable. And it decouples agents from each other where each agent reads a file by path, not a message by position.

Seeding and result assembly

The workspace starts empty. Before the orchestrator runs, seed it with everything the pipeline needs: skill files, shared context documents, and the per-job brief. Seeding separates the file system setup from the agent’s runtime where the workspace is fully defined before the first LLM call is made.

When the pipeline finishes, read the workspace to assemble the structured result. This is the only place where file content is read by the calling code.

Best practice: never read workspace content during the pipeline run from the orchestrator’s code. The orchestrator infers progress from file existence which files have which appeared not file content. Content is read exactly once, at the end.

Progress inference from the workspace

One under appreciated property of the file-based workspace: you can infer pipeline progress from which files exist. No explicit progress callbacks needed in the agent code. The workspace state is the progress state.

Part 3: Skills

What skills are

Every agent has a specific job. The job description is what to produce, what format it should take, what rules to follow which lives in a Skill file. Skills are markdown documents loaded into the agent’s context alongside its system prompt.

The key design decision is progressive disclosure: each agent loads only the skill it needs.

The naive approach is to include all skills for all agents. This is wrong for two reasons.

Token cost. Skills are static context loaded on every call. An X writer loading LinkedIn formatting instructions pays for those tokens every time it runs and contributing nothing to the output.

Focus degradation. Models occasionally apply instructions from the wrong context. An agent carrying instructions it doesn’t need will, with some non-zero probability, produce output influenced by those instructions. The more irrelevant context an agent carries, the worse this gets.

Each agent declares exactly which skill it loads. The skill resolver loads only the matching file. Five agents. Five skills. Each agent sees only its own.

Tool scoping follows the same principle

Tools are an extension of the skills concept. Like skills, they add tokens (tool definitions count toward input) and add behavioral surface area.

Give agents only the tools they can actually use. The reviewer agent gets a fact-checking tool and it’s the only one that verifies external claims. No other agent gets it. A writer that can’t call external APIs gets no tools. A tool it can’t use is overhead: tokens paid, behavioral risk incurred, no upside.

Best practice: the skill and tool set of an agent should be the minimum required to do its specific job. Expand scope only when a specific task demands it.

Part 4: Subagents and Isolated Contexts

Why shared context windows fail

When you run a multi-agent pipeline in a single conversation thread, every agent accumulates the full history. By the time the reviewer runs, it’s carrying the orchestrator’s planning reasoning, every writer’s output confirmation, and any back-and-forth from corrections. You pay for every one of those tokens. Quality degrades because the reviewer is reasoning in a context full of content that has nothing to do with its job.

Subagents run in isolation

Each subagent gets a fresh conversation: its own system prompt, its own skill, and only what it explicitly reads from the workspace.

Shared Thread of agents vs Isolated Contexts (Subagents)

The orchestrator maintains a flat list of agent descriptions. When it decides to delegate, it picks by description. The subagent runs in its own context, does its work, writes to the workspace, and its context is garbage-collected when done.

The cost difference is significant. A five-agent pipeline run entirely in one shared thread accumulates 15,000–30,000 tokens of history by the final step. The same pipeline with isolated subagents keeps each agent’s context at 2,000–5,000 tokens. The orchestrator’s thread stays lean because it accumulates file paths and one-line confirmations, not content.

**Best practice: **Design subagents to have exactly one output artifact in the workspace. One job, one file. This makes progress tracking trivial and outputs inspectable.

Part 5: Context Engineering

Context engineering is the discipline of controlling what enters an agent’s context window, when, and in what order. It has more impact on cost and quality than any other engineering decision.

The static-before-dynamic rule

This is the foundational rule. It must be followed without exception.

Static content is anything identical across many requests: system prompts, skill files, tool definitions, shared context documents. Dynamic content is anything that changes per request: job IDs, user input, timestamps, per-job parameters.

The correct ordering is: tool definitions first (most stable), then system prompt, then skill files, then conversation history, then the current user message (fully dynamic, never cached).

Provider-level prompt caching stores computed tensor representations of the prefix up to the first dynamic content. A cache read costs 10% of normal input price. A cache miss costs 100%.

Every violation of static-before-dynamic breaks the cache prefix. Common violations: putting a job ID or timestamp in the system prompt, embedding user-specific data in a skill file path, including environment-specific flags in system messages, or changing tool definitions between requests. The fix is always the same that move the dynamic value to the user turn message.

Durable memory vs. per-job context

Not all context ages the same way.

Durable memory is stable across every job: project conventions, behavioral guidelines, how agents should handle edge cases. This lives in a shared document loaded as memory at graph construction time. It gets computed and cached once per worker process.

Per-job context is task-specific: the user’s README, the requested platforms, the tone. This belongs in a brief document seeded into the workspace at job start where the only dynamic input to the pipeline.

The test for which category a piece of context belongs in: would it be identical across 1,000 different jobs? If yes, it’s durable memory. If it changes per job, it belongs in the brief and should never appear in the system prompt.

Thread compaction

For long-running pipelines, conversation threads grow. Older turns are less relevant than recent turns but still cost tokens on every subsequent call.

Thread summarization middleware handles this automatically. When the thread exceeds a configurable token threshold, older turns are compacted into a summary paragraph. The last N messages stay verbatim and recent context is the most relevant.

Best practice: Don’t set the summarization threshold too low. Summarizing at 80% of context capacity gives the agent room to work without constantly compacting. Summarizing at 40% wastes tokens on summary overhead.

Part 6: The Orchestrator

Coordination, not execution

The orchestrator’s role is exactly one thing: read the brief, delegate in sequence, verify completion, report done.

It explicitly does not produce content. This is a hard architectural constraint, not a guideline.

The orchestrator has the broadest context in the system. If it starts reasoning about domain-specific tasks like writing content, making factual judgments, formatting outputs and it will produce adequate results at the cost of long reasoning traces filling its context window, confusion between its coordination role and the domain role it just assumed, and bypassing the quality controls that subagents implement. Any time the orchestrator is tempted to produce domain-specific output, a subagent is missing from the design.

Build once, reuse always

The orchestrator’s construction is expensive: loading skill files from disk, initializing the model client, compiling the graph. Cache it at the process level. Build once per worker, reuse across every job.

A worker processing 40 jobs per hour builds the graph once and reuses it 40 times. A module-level singleton is Python’s simplest and most reliable pattern here.

Enforcing pipeline invariants in the prompt

The orchestrator’s system prompt is where pipeline rules are codified: only generate platforms listed in the brief, always run verification last, pass the job ID explicitly to every subagent, never write draft files directly, confirm file existence before reporting done.

These are coordination rules, not hints. Make them explicit and direct. The orchestrator’s prompt should read like a technical runbook, not a creative brief.

Part 7: The Caching Stack

Caching in agent systems has more leverage than in traditional applications because you can avoid work at multiple levels. Each level has different cost savings, different hit rates, and different implementation complexity.

3-layer caching

Layer 1: Provider prompt caching

Anthropic’s prompt cache stores computed KV tensor representations of the stable prompt prefix. Subsequent requests with identical prefixes read from cache at 10% of normal input token price.

The prerequisite is strict adherence to the static-before-dynamic ordering. Cache control is applied transparently on every system message — the call site doesn’t change.

Critical constraints engineers miss: the minimum token threshold is 1,024 tokens. Content below this fails to cache silently; no error, no warning, just a cache miss and full-price billing. Tool definition changes invalidate the entire cache hierarchy. You get at most four cache breakpoints per request.

Layer 2: Redis LLM response cache

Prompt caching reduces token cost but doesn’t eliminate API latency where you still make an HTTP call and wait for a response. A Redis response cache operates upstream of the API entirely: a cache hit means no HTTP call, no latency, zero cost.

Every LLM call in the system, orchestrator and all subagents automatically checks Redis before making any API call. The cache key is a hash of the full serialized message list combined with the model configuration. Including the model configuration in the key means a model upgrade creates new keys automatically.

Version your keys on prompt changes. Without versioning, a broken prompt gets cached and served for hours. A version bump at deploy time flushes the entire cache without touching Redis directly.

TTL by environment: Development at 5 minutes (prompt edits visible immediately), Staging at 1 hour (stable enough to catch regressions), Production at 24 hours (maximize cost savings).

Layer 3: Content identity cache

For systems where the same source material recurs popular open-source repos submitted by different users, documents processed multiple times and a content-identity cache can eliminate the most expensive pipeline step entirely.

Hash the raw source content. The same document with different user-specified parameters hashes to the same key because the source content is identical. This cache operates on content identity, not prompt identity. It bypasses the LLM entirely on a hit: no API call, no tokens, no latency. TTL can be much longer even seven days is reasonable for most content.

Part 8: Token Optimization

Tokens are the cost unit of LLM systems. Every inefficiency compounds across every user, every request, every retry.

Estimate before you execute

Never run a job without estimating its token cost first. A rough estimator uses character count divided by four (a reliable approximation for English text) plus fixed overhead for skills and context files. Log this for every job. After a week of production traffic you have real P50/P95 data. The numbers that let you set alert thresholds with confidence instead of guesswork.

Validate and truncate at the boundary

Validate input size before the job enters the queue. When input is oversized, truncate rather than reject some users with large inputs should still get results, just from the most information-dense portion of their content.

Snap truncation to a structural boundary (paragraph break, section heading) so the model doesn’t receive mid-sentence content. Append a truncation marker so the model knows the document is incomplete.

Model routing by task

Not every agent in your pipeline needs the most capable (and expensive) model. Structured extraction from a markdown document is something a smaller model handles well. Multi-document cross-referencing with judgment calls benefits from stronger reasoning.

Route cheaper models to extraction, classification, and format validation. Reserve capable models for final review and complex multi-step reasoning. Done correctly, this reduces total pipeline cost by 40–60% with no quality reduction on the overall output.

Part 9: The Async Job Architecture

Why you need a job queue

Agent pipelines take 45–120 seconds for non-trivial work. HTTP connections timeout in 30 seconds by default. Even if they don’t, holding a connection open per active job is a poor use of resources.

The correct architecture: accept the request immediately, return a job ID, run the pipeline asynchronously. The client polls for status. The result is written to a fast store the moment the worker completes. Total overhead for the poll loop: milliseconds.

Agent Orchestrator

Celery for LLM workloads

LLM tasks are I/O-bound, not CPU-bound. A worker thread spends most of its time waiting for API responses. This means you can run far more concurrency than CPU cores. The gevent pool uses cooperative multitasking where threads yield during I/O waits, allowing other tasks to run. A 4-core machine running 32 concurrent LLM jobs is reasonable when each job spends 70–80% of its time waiting for API responses.

The dual-store pattern

Redis is fast but ephemeral. Postgres is durable but slower. Use both for different purposes.

Redis handles the real-time polling use case like clients check status every few seconds and need sub-millisecond responses. Postgres handles the historical use case: user job history, billing, debugging jobs from yesterday.

The write pattern: write to both on every state change. The read pattern: check Redis first, fall back to Postgres if the key has expired. Never make Redis your source of truth. TTL expiration is silent.

Part 10: Development Workflow

Local model first, cloud model for validation

Development to Production

Separate development iteration from cost by building your harness to accept any LangChain-compatible model. Use a local Ollama model during development is free, fast, no API key needed.

Local output is lower quality than frontier models. It is good enough to verify that files are written to the correct workspace paths, the orchestrator delegates in the correct sequence, result assembly parses workspace files into the expected structure, and error handling works as designed.

When a skill file change needs quality validation, flip to the cloud provider for a single test run. Switch back immediately. This separation makes the iteration cycle for prompt and skill development essentially free.

Testing the harness, not the model

Unit tests for an agent harness should test harness behavior, not model output. The model is non-deterministic; the harness is not.

What to test: workspace seeding produces the expected file structure, result assembly correctly reads each file type, token estimation returns correct totals for known inputs, truncation snaps to paragraph boundaries correctly, cache key generation is deterministic, and state transitions follow the defined valid transition map.

What not to test at the unit level: whether the model produces good content. That’s integration testing with a real model, run periodically, not on every commit.

Part 11: Observability

Observability in LLM systems is harder than in traditional systems because the most important failure mode is qualitative degradation (invisible without the right tooling). A slow API call shows up in latency metrics. An agent that produces subtly wrong content doesn’t.

Structured logs

Every log line should be machine-parseable. Consistent field names across all log lines like job ID, status, step, elapsed time, whether the response was cached means you can grep, filter, and aggregate across your entire log history without a structured logging system. With this format, extracting P95 latency is a shell one-liner against your log file.

LLM call tracing

Structured logs give you job-level visibility. A tracing layer gives you call-level visibility: the full prompt, the response, actual token counts, per-call latency broken down by time-to-first-token and generation time, and the complete call tree across orchestrator and subagents.

When a user reports wrong output, you open the trace for their job ID and see exactly what prompt produced the problem. Without this, debugging hallucinations or incorrect agent behavior is guesswork.

Cache hit rate as a cost signal

Track cache performance explicitly. A sudden drop in LLM response cache hit rate is a signal that a prompt changed which usually means a dynamic value leaked into a previously static section, a model was upgraded, or a skill file was accidentally modified. Alert on this. It’s a cost event that’s easy to miss until the billing cycle closes.

Summary: Design Principles

Every decision in this guide comes from one of five principles.

Minimize accumulated context. Agents that carry less context are cheaper, faster, and more focused. Every component like the workspace backend, subagent isolation, thread compaction serves this principle.

Static before dynamic, always. Prompt caching is the highest-ROI optimization in a deployed agent system. It requires static content to come before dynamic content in every prompt, without exception.

Scope knowledge to role. Agents should know exactly what they need to do their job. Skills, tools, and context files should be narrowed to the minimum. Unnecessary context costs tokens and degrades focus.

Separate concerns cleanly. Orchestrators coordinate. Subagents execute. Backends hold state. These roles should not overlap. When they do, debugging becomes significantly harder.

Estimate, validate, and bound before spending. Token costs compound. Input validation, pre-execution estimation, and hard ceilings prevent runaway spend from becoming a production incident.

The infrastructure described here is not glamorous. None of it shows up in a demo. All of it is the difference between an agent that works on your laptop and a system that serves real users reliably.

Closing Thoughts

This guide is written from the other perspective: the one where you have real users, real costs, and a system that has to work at 3am when you’re not watching it.

The five components don’t solve interesting AI problems. They solve boring infrastructure problems. The boring problems are the ones that kill production systems.

I will be publishing one project on Agent Harness and Ops with video demo. Agent-Harness-Ops project deployment will be on @Railway to understand CI/CD lifecycle.

The model is the easy part. It was always the easy part.

Follow @kmeanskaran for more deployment and Ops article on AI/ML.

@kmeanskaran: https://x.com/kmeanskaran/status/2071160257943052683

A Guide to Shipping Your Agent Harness into Production

Part 1: The Agent Harness

Part 2: Backends

Part 3: Skills

Part 4: Subagents and Isolated Contexts

Part 5: Context Engineering

Part 6: The Orchestrator

Part 8: Token Optimization

Part 9: The Async Job Architecture

Part 10: Development Workflow

Part 11: Observability

Summary: Design Principles

Closing Thoughts

Further reading:

Similar Articles

@eyad_khrais: https://x.com/eyad_khrais/status/2069552027382980882

@sydneyrunkle: https://x.com/sydneyrunkle/status/2062217190724579673

@Potatoloogs: https://x.com/Potatoloogs/status/2057391224592667051

@janehu07: https://x.com/janehu07/status/2058359677843599494

best of the best agentic harnesses do this…

Submit Feedback

Similar Articles

@eyad_khrais: https://x.com/eyad_khrais/status/2069552027382980882

@sydneyrunkle: https://x.com/sydneyrunkle/status/2062217190724579673

@Potatoloogs: https://x.com/Potatoloogs/status/2057391224592667051

@janehu07: https://x.com/janehu07/status/2058359677843599494

best of the best agentic harnesses do this…