@dair_ai: NEW paper worth reading. A full agentic workflow can be distilled into model weights and run at roughly 100x lower infe…

X AI KOLs Following 05/22/26, 03:30 PM Papers

agentic-workflows distillation inference-cost fine-tuning model-compilation ai-agents

Summary

This paper demonstrates that agentic workflows can be distilled into small fine-tuned models, achieving near-frontier quality while reducing inference cost by two orders of magnitude compared to orchestration approaches.

NEW paper worth reading. A full agentic workflow can be distilled into model weights and run at roughly 100x lower inference cost while preserving near-frontier task quality. The workflow includes multi-step LLM calls, tool invocations, intermediate scratchpads, and decision structure. Instead of expressing all of that at runtime through a framework, the paper amortizes the behavior into a compiled model through targeted distillation. This is the strongest economic argument for agent compilation so far. Runtime loops are flexible, but expensive. Compiled workflows trade some flexibility for a massive inference-cost reduction. Paper: https://arxiv.org/abs/2605.22502 Learn to build effective AI agents in our academy: https://academy.dair.ai

Original Article

View Cached Full Text

Cached at: 05/22/26, 03:51 PM

NEW paper worth reading.

A full agentic workflow can be distilled into model weights and run at roughly 100x lower inference cost while preserving near-frontier task quality.

The workflow includes multi-step LLM calls, tool invocations, intermediate scratchpads, and decision structure.

Instead of expressing all of that at runtime through a framework, the paper amortizes the behavior into a compiled model through targeted distillation.

This is the strongest economic argument for agent compilation so far. Runtime loops are flexible, but expensive. Compiled workflows trade some flexibility for a massive inference-cost reduction.

Paper: https://arxiv.org/abs/2605.22502

Learn to build effective AI agents in our academy: https://academy.dair.ai

Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost

Source: https://arxiv.org/html/2605.22502 Simon Dennis i14, University of Melbourne &Rivaan Patil i14 &Kevin Shabahang i14 &Hao Guo i14

Abstract

Agent orchestration frameworks have proliferated, collectively exceeding 290,000 GitHub stars across LangGraph, CrewAI, Google ADK, OpenAI Agents SDK, Semantic Kernel, Strands, and LlamaIndex. All follow the same pattern: an external orchestrator above the LLM, injecting instructions and routing decisions every turn. Recent work has shown this architecture is dominated for procedural tasks by simply providing the procedure in a frontier model’s system prompt(Dennis et al.,2026a), at the cost of consuming the context window, requiring a frontier model for every conversation, and exposing proprietary procedures to third-party providers.Compilingthe procedure into the weights of a small fine-tuned model—creating asubterranean agent—should resolve all of these concerns, and prior work (SimpleTOD, FireAct, SynTOD, WorkflowLLM, Agent Lumos) has shown the technique works. Yet developer adoption has overwhelmingly favored orchestration. We identify three perceived barriers and address each empirically across travel booking (14 nodes), Zoom support (14 nodes, product-specific knowledge), and insurance claims (55 nodes, 6 decision hubs).

Quality.On a travel booking task, a controlled same-model comparison isolates the effect of compilation: a 3B compiled model beats the same 3B base model with explicit orchestration on four of five quality metrics (p<0.001p<0.001). The 3B model reaches only∼\sim82% of the frontier in-context baseline on graceful handling and naturalness; scaling to 8B on Zoom support and insurance claims closes that gap, achieving 87–98% of in-context quality and matching a LangGraph orchestrator built on a∼\sim70×\timeslarger frontier model. Compiled models also have lower failure rates than the orchestrator in travel (5.5% vs. 24%) and insurance (9% vs. 17%); Zoom failure rates are comparable.

Cost.Compiled models are 128–462×\timescheaper per conversation than the in-context baseline, combining a∼\sim65×\timesper-token reduction from self-hosting with a roughly22–7×7\timestoken volume reduction. The advantage grows with procedure complexity because the compiled model’s prompt is constant-size. Local inference also reduces latency (2.8×\timesfaster in insurance).

Flexibility.The recompile cycle takes 30–50 minutes on production hardware—a CI/CD cycle, not the prohibitively long retraining often assumed.

Each barrier turns out to be smaller than commonly assumed: the quality gap is small (87–98% of frontier), the cost gap favors compilation by two orders of magnitude (and grows with procedure complexity), and the flexibility gap is a deployment cycle rather than a paradigm shift. Compilation is the natural solution to procedural knowledge that needs to persist beyond a single conversation’s context: persistent structure belongs in the weights, transient state belongs in the prompt.

1Introduction

Agent orchestration frameworks—LangGraph(LangChain, Inc.,2024), CrewAI(Moura,2024), Google ADK(Google,2026), OpenAI’s Agents SDK(OpenAI,2026), Semantic Kernel(Microsoft,2026), Strands(Amazon Web Services,2026), LlamaIndex(LlamaIndex,2026)—have proliferated, collectively exceeding 290,000 GitHub stars. All follow the same pattern: an external orchestrator above the LLM, injecting instructions and routing decisions every turn.Dennis et al. (2026a)showed that for procedural tasks this architecture is dominated by a simpler alternative—giving the model the entire procedure in its system prompt and letting it self-orchestrate, achieving near-perfect quality (4.53–5.00 on a 5-point scale). However, the in-context approach requires a frontier model for every conversation, inflates token consumption with the procedure embedded in every API call, consumes context window capacity, and exposes proprietary procedures to third-party providers.

Compilingthe procedure into the weights of a small fine-tuned model—creating what we call asubterranean agent—should resolve all of these concerns. The technique is established: SimpleTOD(Hosseini-Asl et al.,2020), FireAct(Chen et al.,2023), SynTOD(Samarinas et al.,2024), WorkflowLLM(Fan et al.,2024), and Agent Lumos(Yin et al.,2024)all compile agent capabilities into model weights, and several reach quality competitive with frontier models. Yet developer adoption has overwhelmingly favored orchestration; the compilation papers above remain academic prototypes with∼\sim3,000 stars combined—roughly 100×\timesless community engagement than the orchestration frameworks.Why aren’t people compiling procedures into weights?

We identify three perceived barriers and address each empirically across three domains—travel booking (14 nodes), Zoom support (14 nodes, product-specific knowledge), and insurance claims processing (55 nodes, 6 decision hubs)—usingn=200n=200scenarios per condition per domain with a LangGraph orchestrator and in-context baseline as controls:

•Quality(§4). Will a fine-tuned small model match a frontier model on procedural tasks? An 8B compiled model achieves 87–98% of in-context frontier quality, competitive with a LangGraph orchestrator using a∼\sim70×\timeslarger frontier model.
•Cost(§5). Is the per-conversation inference cost actually lower once self-hosting is accounted for? Compiled models are 128–462×\timescheaper per conversation than the in-context baseline; the advantage grows with procedure complexity because the compiled model’s prompt is constant-size.
•Flexibility(§6). When the procedure changes, can a compiled model adapt without a prohibitively long retraining effort? The recompile cycle takes 30–50 minutes on production hardware—a CI/CD cycle, not a major retraining job.

2Compiling Agentic Workflows into Weights

UserOrchestratorLLMinject promptsparse outputSurface OrchestrationRuntime: User + Orchestrator + LLMUserLLMOrchestratortrain dataSubterranean Agenttraining onlyRuntime: User + LLM onlyFigure 1:Architectural comparison.Left: Surface orchestration interposes an orchestrator between user and LLM, injecting instructions and parsing outputs every turn.Right: The subterranean approach uses the orchestrator only during training data generation; at runtime, the procedure is compiled into the LLM’s weights and the user talks directly to the LLM.Figure1contrasts the two architectures. In surface orchestration (left), an orchestrator sits between the user and the LLM, injecting prompts and parsing outputs every turn. In the subterranean approach (right), the orchestrator is used only during training data generation; at runtime, the user talks directly to the LLM, which has the procedure compiled into its weights. The subterranean pipeline works as follows:

1.Define the procedureas a flowchart with nodes (turns) and edges (transitions)
2.Generate synthetic conversationsby traversing all valid paths through the flowchart
3.Fine-tune the LLMon these conversations using full parameter updates
4.Deploy without orchestration—the LLM has learned to self-orchestrate

The orchestrator never touches the LLM at runtime. The procedure shapes the training data, and the model implicitly follows it through learned statistical regularities rather than explicit instruction. At runtime, the compiled model runs the procedure directly—no interpreter needed.

We represent procedures as directed graphsF=(N,E,n0,T)F=(N,E,n_{0},T):

•NN: Nodes with role (agent/user) and prompt template
•E⊆N×N×CE\subseteq N\times N\times C: Edges with optional conditions
•n0∈Nn_{0}\in N: Start node
•T⊆NT\subseteq N: Terminal nodes (success, abandonment, escalation)

We evaluate on three domains (flowcharts in AppendixD).

Travel booking(14 nodes, 3 decision hubs; Figure2). A customer contacts an agent to book a trip. The agent greets the customer, gathers travel preferences (destinations, dates, budget), then assesses whether enough information has been collected. If not, the conversation loops back for more details. Once ready, the agent presents options; the customer may accept, reject, or ask for alternatives—each leading down a different branch. If the customer accepts, the agent confirms the booking; otherwise the conversation may loop through new searches or end in abandonment. The procedure yields 86 unique acyclic paths of 4–17 turns.

Zoom support(14 nodes, parallel structure). A user reports a Zoom problem—audio issues, video freezing, connection drops, or screen-sharing failures. The agent triages the issue into the appropriate troubleshooting track, walks the user through diagnostic steps, and checks whether each step resolves the problem. If not, the agent can retry with alternative fixes or escalate. This is a product-specific domain: the agent must know Zoom’s UI, settings menus, and common error codes, so the training data encodes this domain knowledge directly into the model’s weights. The procedure yields 60 unique acyclic paths of 4–17 turns.

Insurance claims(55 nodes, 6 decision hubs). A policyholder files a claim—auto, property, health, or liability. The agent walks through intake, determines the claim type, gathers supporting documents (looping back when documents are incomplete), assesses coverage and exclusions, and negotiates a settlement with offer/counter-offer exchanges. The procedure is nearly 4×\timeslarger than the other domains, with nested loops (e.g., document request→\toreview→\tore-request) and cross-phase dependencies (coverage decisions constrain settlement options). The procedure yields 2,381 unique acyclic paths of 9–39 turns, testing whether compilation scales to substantially more complex workflows.

For each conversation, we sample a path through the flowchart and a set of scenario variables (destinations, budgets, user personalities, claim types, etc.), then generate the conversation turn by turn along the path using Claude Sonnet 4.5. At each node, the generator receives the node’s prompt template and the full conversation history, producing a contextually appropriate response. The model at inference sees only natural dialogue with no procedural annotations—the structure is implicit in how conversations flow. Dataset sizes are reported with each experiment below.

We fine-tune usingfull parameter updates. Procedural internalization requires modifying the model’s implicit state-tracking behavior—a deeper change than stylistic alignment. A systematic study of parameter-efficient (LoRA) fine-tuning across ranks 16–128 found that low-rank methods fail to approach full fine-tuning on procedural tasks(Dennis et al.,2026b). At inference, the subterranean agent uses only a minimal system prompt (e.g., “You are a helpful travel booking assistant”)—no procedural instructions, no flowchart state, no routing logic. Training configurations are reported with each experiment below.

3Evaluation Methodology

All three experiments share two baselines. TheLangGraph orchestratoruses Claude Sonnet 4.5 orchestrated via LangGraph(LangChain, Inc.,2024), the most widely adopted agent framework (∼\sim30K GitHub stars as of March 2026). Each flowchart node maps to a LangGraph graph node; at decision hubs, an LLM classifier selects the next edge. This is the same orchestrator evaluated inDennis et al. (2026a)and serves as the frontier-model baseline (∼\sim70×\timesmore parameters than the 3B compiled model). Thein-context baselinegives Claude Sonnet 4.5 theentireserialized flowchart in its system prompt, allowing it to self-orchestrate with one API call per turn. It establishes an upper bound for what a frontier model can achieve with full procedural context, at the cost of consuming context window capacity. The comparison between these two baselines—orchestration vs. in-context prompting—is reported separately(Dennis et al.,2026a).

We usedynamic user simulationvia Claude Sonnet 4.5, which generates contextually appropriate user responses based on the full conversation history and scenario variables. The user simulator has no knowledge of the underlying flowchart—it role-plays a customer with given preferences. All conditions receive the same scenario specifications and user simulator. We evaluaten=200n=200scenarios per condition per domain, with scenarios designed to cover the full range of flowchart paths, user styles (specific to vague), satisfaction levels (enthusiastic to skeptical), budget realism, and group complexity.

We use an approach-agnostic LLM-as-judge methodology(Zheng et al.,2023)where Claude Sonnet 4.5 evaluates conversation quality without knowing which system produced it. To address potential judge self-preference bias (Claude as both data generator and judge)(Panickssery et al.,2024), we re-score all conversations with an independent GPT-4.1 judge using the identical rubric; the GPT-4.1 judge yields a comparable 83–99% of in-context quality across all three domains, and the qualitative findings (compiled model dominates the same-model orchestrator, in-context baseline leads) are robust to judge choice (AppendixC). Each conversation is scored on five criteria (1–5 scale) with specific behavioral anchors at every level:

Task Success: Did the agent execute the procedure correctly through to an appropriate terminal state, with consistent and accurate handling at each decision point? (5 = complete procedure with clear terminal state; 3 = middle stages completed but conversation fizzled; 1 = no meaningful progress)

Information Accuracy: Did the agent correctly use and retain all user-provided information? (5 = every detail correctly reflected; 1 = fabricated details or ignored input)

Consistency: Did the agent maintain coherent state across the conversation? (5 = no contradictions or repeated questions; 1 = contradicts itself repeatedly)

Graceful Handling: How well did the agent handle changes, ambiguity, and edge cases? (5 = smoothly adapts; 1 = any deviation breaks the flow. Capped at 3 if the user posed no challenges.)

Naturalness: Does the conversation read like talking to a skilled human agent? (5 = indistinguishable from human; 1 = mechanical, scripted)

For statistical analysis, conditions evaluated within the same run are paired by scenario index; conditions evaluated separately use unpaired tests. Because the user simulator responds dynamically, conversations diverge after the first turns even for paired conditions—the pairing is on scenariointent, not identical content. We report Wilcoxon signed-rank tests (paired) or Mann–WhitneyUU(unpaired), Cohen’sddeffect sizes, and bootstrap 95% CIs (10,000 resamples, percentile method), with Holm–Bonferroni correction across the five quality criteria (α=0.05\alpha=0.05) within each pairwise comparison.

4Barrier 1: Quality

Will a fine-tuned small model match a frontier model on procedural tasks? We compare three compiled models against the LangGraph orchestrator and in-context baseline.

4.1Experiment 1: Travel Booking (3B)

The travel booking experiment provides a controlled same-model comparison. We evaluate four conditions (n=200n=200each): (1) a3B subterranean agent(Qwen 2.5 3B Instruct, fine-tuned), (2) a3B surface orchestratorusing the same base model with explicit flowchart-based state tracking (injecting node prompts and routing at every turn), (3) theLangGraph orchestrator, and (4) thein-context baseline. Conditions 1 vs. 2 isolate the effect of compilation (same model, same procedure, different architecture); conditions 1 vs. 3 test whether a compiled 3B model is competitive with a frontier orchestrator.

Training.We generated 2,125 synthetic conversations by sampling paths through the travel flowchart and using Claude Sonnet 4.5 to produce turn-by-turn dialogue (1,912 for training, 213 held out for evaluation). The base model is Qwen 2.5 3B Instruct in bf16 precision; we fine-tune all parameters on a single RTX 5090. Optimization uses AdamW 8-bit, learning rate2×10−52\times 10^{-5}with cosine decay, and an effective batch size of 16 via gradient accumulation. We train for 20 epochs (∼\sim3.5 hours wall-clock) and select the best checkpoint by held-out eval loss, which converges at epoch∼\sim4 and plateaus thereafter. At inference the fine-tuned model receives only a minimal system prompt; no procedural instructions, flowchart state, or routing logic are injected at runtime.

Table 1:Quality scores—Travel Booking (n=200n=200, Claude Sonnet 4.5 judge, 1–5 scale). Bold indicates best per criterion. 95% bootstrap CIs and pairwise significance tests in AppendixB.Compilation helps.The 3B subterranean agent leads the same-model orchestrator on all five criteria (Table1). The advantage reaches significance on task success (Δ=+0.18\Delta=+0.18,p<0.001p<0.001), consistency (Δ=+0.22\Delta=+0.22,p<0.001p<0.001), graceful handling (Δ=+0.20\Delta=+0.20,p<0.001p<0.001), and naturalness (Δ=+0.17\Delta=+0.17,p<0.001p<0.001); information accuracy trends positive but does not reach significance (Δ=+0.05\Delta=+0.05,p=.29p=.29).

Competitive with frontier orchestrator.Against the LangGraph orchestrator (∼\sim70×\timesmore parameters), the 3B model leads on information accuracy (4.75 vs. 4.21,p<0.001p<0.001) but the orchestrator leads on graceful handling (4.62 vs. 4.07,p<0.001p<0.001) and naturalness (4.84 vs. 4.12,p<0.001p<0.001). Task success and consistency are comparable.

The gap.The quality gap with the in-context baseline is uneven: information accuracy is strong (102%) while graceful handling and naturalness lag (∼\sim82%), suggesting the 3B model learns the procedure but lacks capacity for natural edge-case handling. This motivates scaling to a larger model.

4.2Experiment 2: Zoom Support (8B)

To close the graceful-handling and naturalness gap observed in travel, we scale to Qwen3-8B with substantially more training data. Zoom support also tests a product-specific domain where the model must internalize domain knowledge (Zoom’s UI, settings, error codes), not just procedure structure. We evaluate three conditions (n=200n=200each): (1) an8B subterranean agent, (2) theLangGraph orchestrator, and (3) thein-context baseline.

Training.The base pipeline produces 870 conversations per run (90/10 train/eval split). To increase volume, we ran the pipeline eight times with different random seeds (seeds 42–49) and concatenated the training splits (8×783=6,2648\times 783=6{,}264training conversations), with no deduplication needed as identical paths produce different conversations due to seed-dependent scenario sampling. Qwen3-8B (bf16), full fine-tuning with DeepSpeed ZeRO-3 on 8×\timesA100. Learning rate2×10−52\times 10^{-5}, effective batch size 32. 10 epochs, best checkpoint at epoch 2.

Table 2:Quality scores—Zoom Support (n=200n=200, Claude Sonnet 4.5 judge, 1–5 scale). Bold indicates best per criterion. 95% bootstrap CIs and pairwise significance tests in AppendixB.Gap closed on graceful handling and naturalness.The 8B model achieves 92% of the in-context baseline on graceful handling (vs. 82% for the 3B) and 97% on naturalness (vs. 82%). The remaining gap is concentrated in information accuracy (87%), where broad world knowledge—not procedure following—is the bottleneck.

Competitive with frontier orchestrator.The 8B subterranean agent leads the LangGraph orchestrator on naturalness (4.87 vs. 4.64,p<0.001p<0.001); the orchestrator leads on information accuracy (4.75 vs. 4.26,p<0.001p<0.001). Task success, consistency, and graceful handling are comparable. Full statistics in AppendixB.

4.3Experiment 3: Insurance Claims (8B)

The 14-node travel and Zoom procedures are moderately complex. Insurance claims (55 nodes, 6 decision hubs, 2,381 paths) tests whether compilation scales to a substantially more complex workflow—and whether the cost advantage grows accordingly. We evaluate three conditions (n=200n=200each): (1) an8B subterranean agent, (2) theLangGraph orchestrator, and (3) thein-context baseline.

Training.We generated 3,000 synthetic conversations from the 55-node insurance flowchart using Claude Sonnet 4.5 (2,700 for training, 300 held out for evaluation). The base model is Qwen3-8B in bf16 precision, fine-tuned with full parameter updates using DeepSpeed ZeRO-3 across 8×\timesA100 GPUs (the same configuration as Zoom). Optimization uses AdamW with learning rate2×10−52\times 10^{-5}and effective batch size 32. The longer 20-epoch budget (vs. 10 for Zoom) reflects the larger procedure—more node-specific behavior and longer trajectories require more passes through the data; the best checkpoint by held-out eval loss is reached at epoch 3.

Table 3:Insurance claims processing (55 nodes, 6 decision hubs,n=200n=200, Claude Sonnet 4.5 judge). The 8B compiled model achieves 92–98% of in-context quality at 462×\timesless cost. 95% bootstrap CIs and pairwise significance tests in AppendixB.The 8B compiled model achieves 92–98% of in-context quality on insurance claims (4.40–4.92 vs. 4.78–5.00; Table3), demonstrating that compilation scales to a substantially larger procedure (55 nodes vs. 14 for travel and Zoom).

The 8B compiled model leads the LangGraph orchestrator on graceful handling (4.81 vs. 4.38,p<0.001p<0.001), naturalness (4.92 vs. 4.58,p<0.001p<0.001), and consistency (4.51 vs. 4.39); task success and information accuracy are comparable (neither difference reaches significance).

4.4Efficiency and Failure Modes

Table 4:Efficiency comparison across domains (n=200n=200per condition). Wall-clock time includes all LLM calls (generation + routing for LG Orch). The subterranean agent’s self-hosted inference eliminates the network round-trips of an external API; the LangGraph orchestrator’s time grows with procedure complexity due to API calls for routing.Table 5:Failure rates across domains (conversations with task success≤3\leq 3,n=200n=200per condition). The compiled model achieves lower failure rates than the LangGraph orchestrator in travel and insurance; Zoom rates are comparable.TravelZoomInsurance3B Sub.3B Orch.LG8B Sub.LG8B Sub.LGTotal failures1194822181834Failure rate5.5%4.5%24.0%11.0%9.0%9.0%17.0%Turn counts and wall-clock times vary by domain (Table4). On travel, the 3B subterranean agent produces longer conversations (22.6 turns) than the baselines (∼\sim16), reflecting a one-question-per-turn style learned from training data. Despite the extra turns, its wall-clock time is comparable to the LangGraph orchestrator’s because the subterranean agent is self-hosted while the orchestrator incurs Claude API latency on every turn. In Zoom and insurance, the compiled model is faster: 29.5s vs. 52.1s in Zoom and 43.2s vs. 120.8s in insurance. The insurance gap is largest because the LangGraph orchestrator makes additional API calls at each of the 6 decision hubs, and the 55-node procedure inflates every prompt. The compiled model achieves substantially lower failure rates (conversations the judge scored task success≤3\leq 3) than the LangGraph orchestrator in travel (5.5% vs. 24.0%) and insurance (9.0% vs. 17.0%; Table5); Zoom rates are comparable (11.0% vs. 9.0%). The orchestrator’s high failure rate in travel (24%) reflects routing errors at decision hubs, a failure mode the compiled model eliminates by construction.

The subterranean agent and the LangGraph orchestrator produce conversations of comparable length, but with differentgranularity. The subterranean agent learned an “interview style” from its training data: each agent turn asks a single focused question, waits for the user’s response, then proceeds (64% of turns contain exactly one question). The LangGraph orchestrator, constrained to its current node’s template, sometimes packs multiple questions into a single turn. The total word count per conversation is comparable across conditions (∼\sim1,200–1,400 words), confirming that the same information is exchanged—just sliced into different-sized turns. The one-question-per-turn rhythm produces clearer audit trails and may reduce user cognitive load.

That a 3B–8B model with∼\sim70×\timesfewer parameters than the LangGraph orchestrator’s Claude Sonnet 4.5 achieves competitive quality is at first glance surprising.Dennis et al. (2026a)identify three structural costs of orchestration: it fragments reasoning by generating from local node context only, introduces routing failure modes absent in non-orchestrated architectures, and constrains the model’s natural conversational style through template injection. The subterranean agent avoids all three: it reasons over the full procedure holistically through internalized weights, has zero routing failures by construction, and produces unconstrained responses shaped by natural training data. These structural advantages compensate for the large capacity gap.

5Barrier 2: Cost

Is the per-conversation inference cost actually lower once self-hosting is accounted for? The cost reduction has two independent components (Table6).

Per-token cost.The compiled subterranean agent is self-hosted on commodity GPU hardware rather than served through a frontier API. We deploy the 8B model on a reserved cloud A100 80GB at $2\.50/hr using vLLM$Kwon et al\.,[2023](https://arxiv.org/html/2605.22502#bib.bib11)$for batched inference\. Published benchmarks for 8B models on A100 hardware report 4–5K total tokens/s at batch size 64$Patel et al\.,[2024](https://arxiv.org/html/2605.22502#bib.bib18)$, decomposable into∼\\sim15K tokens/s for prefill and∼\\sim3K tokens/s for autoregressive decode\. Dividing the hourly GPU cost by these throughputs yields effective rates of∼\\sim$ 0.05/M input tokens and∼\sim$0.23/M output tokens—against Claude Sonnet 4.5’s published rates of $3/M input and $15/M output, that works out toroughly 65×\timescheaperper token.

Token volume.The in-context baseline must include the serialized procedure in the system prompt on every turn, an overhead that grows with procedure complexity:∼\sim2×\timesfor travel (14 nodes) up to∼\sim7×\timesfor insurance (55 nodes). The compiled model’s prompt is constant-size regardless of procedure complexity, so this overhead disappears entirely. The two factors compound:128×\timescheaper for travel, 462×\timesfor insurance, with the largest advantage on the most complex procedure. The compiled model is also 77–249×\timescheaper than the LangGraph orchestrator, which uses fewer tokens than the in-context baseline but still pays Claude API rates on every generation and routing call.

Table 6:Inference cost per conversation across domains. API costs assume Claude Sonnet 4.5 pricing ($3/M input, $15/M output). Self-hosted cost assumes Qwen3-8B on a reserved A100 80GB at $2\.50/hr via vLLM, yielding a∼\\sim65×\\timesper\-token cost reduction\. The IC/Sub ratio compounds this with a token volume reduction that grows with procedure complexity\.One\-time compilation cost\.Compilation incurs a one\-time setup cost separate from running cost:∼\\sim$ 40 for data generation plus∼\sim $10–40 for fine\-tuning compute \($ 50–80 total). Translating this into a per-conversation figure requires knowing the deployed agent’s lifetime conversation volume, which is application-specific; break-even against the in-context baseline arrives within 500 conversations on every domain, and for 10,000+ conversations compilation adds less than $0.01 per conversation.

6Barrier 3: Flexibility

A compiled model must be retrained from scratch when its procedure changes, but the recompile pipeline parallelizes cleanly across all three stages and completes in well under an hour on a production GPU cluster.

Data generation.Claude Sonnet 4.5 traverses the new flowchart to produce∼\sim1,600 synthetic conversations. The conversations are independent, so the API calls parallelize trivially; with reasonable concurrency the step takes 15–30 minutes, limited by per-account API rate limits rather than local compute. (Sequentially it would take∼\sim60 minutes.)

Fine-tuning.The 8B model fits comfortably in a single H200 for full-precision AdamW (parameters, gradients, and optimizer states together∼\sim96 GB), so no sharding is needed and 8×\timesH200 simply runs data parallelism with the H200’s∼\sim3×\timeshigher BF16 throughput, compressing 12 epochs to 10–15 minutes. On a single A100 80GB (the same hardware used for inference) the optimizer state has to be 8-bit AdamW to fit, and training takes∼\sim3 hours.

Evaluation.A 50-scenario vLLM-batched spot-check takes∼\sim5 minutes on a warm inference server (the model is already loaded), or∼\sim10–15 minutes including server start-up.

The fully optimized cycle (8×\timesH200 plus parallelized API requests) completes in 30–50 minutes—comparable to a CI/CD build for a large application. A practitioner without access to an 8-GPU cluster can recompile on a single A100 80GB in roughly 3–4 hours, dominated by training rather than data generation.

7Related Work

Agent frameworks and their failure modes.The 2024–2025 period saw rapid proliferation of LLM agent frameworks(LangChain, Inc.,2024; Moura,2024; Google,2026; OpenAI,2026; Microsoft,2026; Amazon Web Services,2026; LlamaIndex,2026). The reliability costs are well documented:Cemri et al. (2026)identified 14 failure modes,Zhu et al. (2026)showed cascading failures are the primary bottleneck, andGupta (2026)found 60% pass@1 agents show only 25% consistency across trials.Dennis et al. (2026a)showed that for procedural tasks, in-context prompting dominates orchestrated approaches, establishing the quality ceiling that compilation targets.

Compiling agent capabilities into weights.The idea of replacing modular agent pipelines with a single trained model has been explored along three lines.

Collapsing dialogue pipelines.SimpleTOD(Hosseini-Asl et al.,2020)recast all sub-tasks of task-oriented dialogue (understanding, action decision, response generation) as a single sequence prediction problem, achieving state-of-the-art on MultiWOZ. AutoTOD(Xu et al.,2024)extended this to autonomous action sequencing, explicitly identifying failure modes of modular systems—error accumulation and poor generalization—that a unified model avoids.

Distilling agent reasoning from frontier models.FireAct(Chen et al.,2023)fine-tuned Llama2-7B on GPT-4 ReAct trajectories, achieving a 77% performance increase on HotpotQA. AgentTuning(Zeng et al.,2024)instruction-tuned Llama 2 on diverse agent interaction trajectories, producing a 70B model that matched GPT-3.5-turbo on unseen tasks while preserving general capabilities. Agent Lumos(Yin et al.,2024)decomposed agent behavior into planning and grounding modules, training open-source models that surpassed GPT agents on several benchmarks.

Scaling to complex workflows.WorkflowLLM(Fan et al.,2024)compiled API orchestration knowledge from 106K workflow samples spanning 1,503 APIs into an 8B model. SynTOD(Samarinas et al.,2024)generated synthetic training data from state transition graphs—closely related to our flowchart-guided approach—showing that procedural dialogue data can be synthesized without crowdsourcing.Hsiao (2026)formalized procedural knowledge as hierarchical task networks, showing that structured decomposition improves agent performance.

These papers prove the technique works—and several achieve quality competitive with frontier models. However, none quantifies the inference cost advantage of compilation over orchestrated or in-context alternatives, none measures the recompile cycle, and none compares against both a same-model orchestrated baseline and a frontier-model baseline to isolate the effect of compilation from model capacity.

8Conclusion

We posed three barriers to adoption and addressed each empirically.Quality:the 8B compiled model achieves 87–98% of in-context frontier quality, competitive with a LangGraph orchestrator using a∼\sim70×\timeslarger frontier model; a controlled same-model comparison confirms compilation itself helps, beating the same base model with orchestration on four of five metrics (p<0.001p<0.001).Cost:compiled models are 128–462×\timescheaper per conversation than the in-context baseline, with the advantage growing on more complex procedures because the compiled model’s prompt is constant-size.Flexibility:the recompile cycle takes 30–50 minutes on production hardware—a CI/CD cycle, not a prohibitively long retraining job. Each of the three barriers turns out to be smaller than commonly assumed. The quality gap to a frontier model is small (2–13%); the cost gap is two orders of magnitude in the practitioner’s favor and grows with procedure complexity; and the flexibility gap is a deployment cycle, not a paradigm shift. Compilation is the natural solution to procedural knowledge that needs to persist beyond a single conversation’s context: persistent structure belongs in the weights, transient state belongs in the prompt.

References

Amazon Web Services [2026]Amazon Web Services.Strands agents sdk.https://strandsagents.com/, 2026.Open-source agent SDK with model-driven orchestration loop.
Cemri et al. [2026]Muhammed Cemri, Yue Shi, Jayaganesh Jeyakumar, Oznur Kislal, George Karypis, and Akash Srivastava.Why do multi-agent LLM systems fail?arXiv preprint arXiv:2503.13657, 2026.
Chen et al. [2023]Baian Chen, Chang Shu, Ehsan Shareghi, Nigel Collier, Karthik Narasimhan, and Shunyu Yao.Fireact: Toward language agent fine-tuning.arXiv preprint arXiv:2310.05915, 2023.
Dennis et al. [2026a]Simon Dennis, Michael Diamond, Rivaan Patil, Kevin Shabahang, and Hao Guo.In-context prompting obsoletes agent orchestration for procedural tasks.arXiv preprint, 2026a.
Dennis et al. [2026b]Simon Dennis, Kevin Shabahang, Hao Guo, and Rivaan Patil.Procedural knowledge is not low-rank: Why LoRA fails to internalize multi-step procedures.arXiv preprint, 2026b.
Fan et al. [2024]Shengda Fan, Xin Cong, Yuepeng Fu, Zhong Zhang, Shuyan Zhang, Yuanwei Liu, Yesai Wu, Yankai Lin, Zhiyuan Liu, and Maosong Sun.Workflowllm: Enhancing workflow orchestration capability of large language models.arXiv preprint arXiv:2411.05451, 2024.
Google [2026]Google.Agent development kit.https://google.github.io/adk-docs/, 2026.Event-driven agent framework with workflow and LLM agent types.
Gupta [2026]Aayush Gupta.ReliabilityBench: Evaluating LLM agent reliability under production-like stress conditions.arXiv preprint arXiv:2601.06112, 2026.
Hosseini-Asl et al. [2020]Ehsan Hosseini-Asl, Bryan McCann, Chien-Sheng Wu, Semih Yavuz, and Richard Socher.A simple language model for task-oriented dialogue.Advances in Neural Information Processing Systems, 33, 2020.
Hsiao [2026]Victor Hsiao.Procedural knowledge improves agentic llm workflows.arXiv preprint arXiv:2511.07568, 2026.
Kwon et al. [2023]Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica.Efficient memory management for large language model serving with PagedAttention.InProceedings of the 29th Symposium on Operating Systems Principles, 2023.
LangChain, Inc. [2024]LangChain, Inc.Langgraph: Build resilient language agents as graphs.https://github.com/langchain-ai/langgraph, 2024.Directed state graph framework for LLM agent orchestration.
LlamaIndex [2026]LlamaIndex.Llamaindex workflows.https://www.llamaindex.ai/workflows, 2026.Event-driven agent workflow framework.
Microsoft [2026]Microsoft.Semantic kernel: Multi-agent orchestration.https://learn.microsoft.com/en-us/semantic-kernel/, 2026.Enterprise agent framework with sequential, concurrent, and handoff patterns.
Moura [2024]João Moura.Crewai: Framework for orchestrating role-playing ai agents.https://github.com/crewAIInc/crewAI, 2024.Role-based multi-agent orchestration with Flows and Crews.
OpenAI [2026]OpenAI.Openai agents sdk.https://openai.github.io/openai-agents-python/, 2026.Agent framework with handoff-based orchestration.
Panickssery et al. [2024]Arjun Panickssery, Samuel R. Bowman, and Shi Feng.Llm evaluators recognize and favor their own generations.arXiv preprint arXiv:2404.13076, 2024.
Patel et al. [2024]Krishna Patel, Tirth Patel, Mihir Vij, Yueqing Zhu, Siddharth Jain, Matthew Franusich, Yin Liang, Xiao Liu, Zhengyu Liu, Ben Athiwaratkun, Yanqi Zou, Shreyas Vishwanath, Arindam Basu, and Hui Guan.LLM-Inference-Bench: Inference benchmarking of large language models on AI accelerators.arXiv preprint arXiv:2411.00136, 2024.
Samarinas et al. [2024]Chris Samarinas, Pattaramanee Promthaw, Atharva Nijasure, Hansi Zeng, Julian Killingback, and Hamed Zamani.Simulating task-oriented dialogues with state transition graphs and large language models.arXiv preprint arXiv:2404.14772, 2024.
Xu et al. [2024]Hong-Da Xu, Xin-Lan Mao, Pei Yang, Fei Sun, and He Huang.Rethinking task-oriented dialogue systems: From complex modularity to zero-shot autonomous agent.ACL, 2024.
Yin et al. [2024]Da Yin, Faeze Brahman, Abhilasha Ravichander, Khyathi Chandu, Kai-Wei Chang, Yejin Choi, and Bill Yuchen Lin.Agent lumos: Unified and modular training for open-source language agents.ACL, 2024.
Zeng et al. [2024]Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang.Agenttuning: Enabling generalized agent abilities for llms.Findings of ACL, 2024.
Zheng et al. [2023]Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica.Judging llm-as-a-judge with mt-bench and chatbot arena.Advances in Neural Information Processing Systems, 36, 2023.
Zhu et al. [2026]Kunlun Zhu, Zijia Liu, Bingxuan Li, Muxin Tian, Yingxuan Yang, Jiaxun Zhang, Pengrui Han, Qipeng Xie, Fuyang Cui, Weijia Zhang, Xiaoteng Ma, Xiaodong Yu, Gowtham Ramesh, Jialian Wu, Zicheng Liu, Pan Lu, James Zou, and Jiaxuan You.Where LLM agents fail and how they can learn from failures.arXiv preprint arXiv:2509.25370, 2026.

Appendix AConversation Examples

We present excerpts from actual evaluation conversations, selected to illustrate typical behavioral differences between conditions rather than best-case performance. Conversations are lightly edited for length; full unedited transcripts are available in the repository. See AppendixA.2for Zoom support.

A.1Travel Booking

Scenario:Two friends planning a 6-day trip to Japan, budget $1,100/person, interested in water sports and festivals. User style: uncertain.

Subterranean agent(task success: 5, naturalness: 5, 14 turns):

Agent:Hello! I’d be happy to help you plan your next trip. What kind of travel are you looking to book today? User:Hey! My friend and I are thinking about going to Japan for about 6 days. We’re really into water sports and checking out beautiful scenery, and we heard there’s some cool festival happening that we’d love to catch. Agent:That sounds like an amazing trip! Japan has incredible water sports opportunities and some fantastic festivals. When are you planning to go, and what’s your approximate budget per person for this 6-day adventure? User:We’re thinking around $1100 total per person\. As for when, we’re pretty flexible—whenever that festival is happening would work for us\. Agent:Perfect\! For water sports and festivals, I’d recommend timing it around the Gion Festival in late May/early June \[…\] I can put together some options that include accommodation near the coast for water sports and easy access to festival events\. User:That sounds perfect\! Yeah, go ahead and put something together—we’re pretty easy to please \[…\] Agent:Excellent\! I’ve crafted three itinerary options for your Japan adventure:Option 1: Kyoto & Coast Explorer\($ 550/person)—Split between Kyoto’s temples (Gion Festival backdrop) and beach time in Wakayama. […]Option 2: […]Option 3: […] User:Option 1 sounds great! […] Let’s go with that one! Agent:Fantastic choice! […] Are you ready for me to finalize all the details and create your complete booking? User:Yes, let’s do it! […] Agent:Perfect! Here’s your finalized itinerary summary: [destination, dates, budget, highlights]. […] Shall I confirm this booking for you? User:Everything looks perfect! […] Thanks so much for putting this together! Agent:Wonderful! Your Kyoto & Wakayama adventure is confirmed! [closing with travel tips]

3B Orchestrated agent(task success: 4, naturalness: 4, 18 turns):

Agent:Hi there! Welcome to your travel planning journey. How can I assist you today? User:Hey! Yeah, I’m looking to plan a trip actually. My friend and I are thinking about going to Japan for about 6 days. […] Agent:Great to hear about your trip to Japan! […] how much are you thinking in terms of budget? And do you have a preferred start date for your trip? User:We’re thinking around $1100 total, and we’re pretty flexible on dates. […] Agent:Sounds exciting! The festival you heard about could be a great highlight. For the dates, we can definitely look into a few options around then. And sticking with the budget of around $1100,does that sound right to you? User:Yeah, $1100 sounds about right […] We’re also really into water sports and checking out scenic spots […] Agent:Got it! Water sports and scenic spots are right up your alley. Let’s find a festival that matches those interests.Do you have a preferred start date for your trip, or are you still flexible? User:We’re pretty flexible! […] Agent:Perfect! Flexibility is always good when it comes to festivals. […]Just let me know if you have a preferred month or whether we should focus on specific regions of Japan. [Agent asks the same question about dates/preferences three times before presenting options at turn 14.]

The subterranean agent reaches option presentation by turn 5 and booking confirmation by turn 7. The orchestrated agent repeats the date/preference question three times (turns 4, 6, 8) before presenting options at turn 14—a symptom of the orchestrator’s node-level generation lacking global conversation awareness.

A.2Zoom Support

Scenario:Audio drops intermittently during Zoom meetings. Connection indicator sometimes shows yellow/red bar. User on Windows 11 desktop app with “Your network bandwidth is low” error.

8B Subterranean agent(task success: 5, naturalness: 4, 12 turns):

Agent:Hello! I’m here to help you with any Zoom issues you’re experiencing. What seems to be the problem? User:Hi, my audio keeps cutting out during Zoom meetings. It drops for a few seconds then comes back. I also see the connection bar going yellow or red sometimes. Agent:That sounds like a bandwidth-related issue. Are you on WiFi or Ethernet? And have you noticed if this happens in all meetings or just specific ones? User:I’m on WiFi. It happens in most meetings, but it’s worse when there are a lot of participants. Agent:That’s a good clue—more participants means more video streams competing for bandwidth. Let’s try a few things:Step 1:In your Zoom settings, go to Video and uncheck “HD Video.” This significantly reduces bandwidth usage. […] User:OK, I turned off HD Video. What else can I try? Agent:Step 2:During your next meeting, you can check your actual connection quality: click the signal strength icon in the top-left corner of the Zoom window to see your Statistics panel. […]Step 3:Under Settings>>Audio, check “Suppress background noise” is set to Auto rather than High. User:That all makes sense. I’ll try these in my next meeting. Is there anything else I should check on my end? Agent:One more thing: if possible, try connecting via Ethernet cable instead of WiFi for your important meetings. That alone often resolves intermittent audio issues. […] User:Great, thanks! I’ll try the Ethernet cable and those settings changes. Appreciate the help! Agent:You’re welcome! Those changes should make a noticeable difference. Feel free to reach back out if the issue persists after trying these steps.

Claude orchestrated agent(task success: 4, naturalness: 3, 9 turns):

Agent:Hi there! I’m here to help with your Zoom issue. Could you tell me what’s going on? User:My audio keeps dropping during Zoom calls. The connection indicator shows yellow/red bars and I get a “bandwidth is low” message. Agent:I understand you’re having audio issues with bandwidth warnings. What operating system and Zoom client are you using? And what’s your internet connection type? User:Windows 11, desktop app, WiFi connection. Agent:Thanks. Here are several things to try: (1) Disable HD video in Settings>>Video, (2) Close other bandwidth-heavy applications, (3) Check your WiFi signal strength and consider moving closer to your router, (4) Under Audio settings, try enabling “Use original sound for musicians” which can reduce processing overhead. User:Wait, the “original sound” thing—wouldn’t that actually make quality worse for regular calls? Agent:You’re right, that suggestion was for a specific use case. For general meetings, keep that disabled. The HD video and WiFi optimization steps should be your primary focus. […]

The subterranean agent provides Zoom-specific guidance in a natural diagnostic flow—disabling HD video, checking Statistics panel, adjusting noise suppression—reaching resolution in 12 turns with one focused suggestion per turn. The LangGraph orchestrator packs multiple recommendations into a single turn (4 suggestions at once), including the incorrect “original sound for musicians” advice which it must then retract. The compiled model’s domain-specific training enables accurate product knowledge, while its learned conversational rhythm produces clearer, more actionable troubleshooting steps.

A.3Insurance Claims

Scenario:Vehicle stolen overnight from outside apartment. Police report filed. Claimant has Premium tier comprehensive coverage, $500 deductible. Frustrated, high urgency—needs a car for work.

8B Subterranean agent(task success: 5, naturalness: 5, 22 turns):

Agent:Hello, I’m Sarah from the claims department. I’m here to help you file your claim today. What type of incident would you like to report? User:Hi Sarah. My car was stolen last night from right outside my apartment. I woke up this morning and it was just gone. I already filed a police report, but I need to get this claim started as soon as possible. Agent:I’m so sorry to hear about your stolen vehicle—that must be incredibly stressful. To get your claim started right away, could you please provide your policy number and confirm your full name for verification? User:It’s Linda Robinson, and my policy number is 643-6323376. How long is this whole process going to take? I need a car to get to work. Agent:Thank you, Linda. I’ve confirmed your Premium tier policy with comprehensive coverage, which does cover vehicle theft. The claim filing process typically takes 5–7 business days for initial review, and we’ll work with you to expedite this given your transportation needs. Can you tell me more details about when and where you last saw your vehicle? User:It was parked on the street right in front of my building around 10 PM last night. When I came out at 7 AM to drive to work, it was completely gone—no broken glass or anything. The police said there’s been a string of thefts in my neighborhood lately. What happens next? Agent:Based on your comprehensive coverage and the police report, we’ll process this as a theft claim and arrange for a claims adjuster to contact you within 24–48 hours, including rental car coverage under your policy. Do you have the vehicle’s VIN number available? User:I have the VIN somewhere in my paperwork—can I email it to you? It’s a 2019 Honda Accord, silver. And wait, you said rental car coverage—does my policy actually cover that? I really need to figure out how I’m getting to work tomorrow. Agent:Yes, your Premium tier policy includes rental car coverage up to $40 per day for up to 30 days while we process your claim. You can absolutely email the VIN—in the meantime, I can authorize a rental vehicle for you today so you can get to work tomorrow. User:Oh thank goodness! So can I just go pick up a rental car today, or do I need authorization paperwork first? Agent:I’ll email you an authorization letter within the next hour that you’ll need to present at any of our partner rental locations—Enterprise, Hertz, or Budget. Just bring that letter, your driver’s license, and a credit card. User:What are we looking at in terms of getting compensated for my car? I still owe money on it, and the car’s worth about $24,500. Will insurance cover the full value? Agent:With a theft claim, we’ll pay the actual cash value based on condition and market value, minus your $500 comprehensive deductible. If you still owe money, we’ll pay your lender directly up to the loan balance, and any remaining value goes to you. Our adjuster will provide a detailed valuation within 3–5 business days. User:So $24,500 minus $500 is $24,000—that should cover what I owe with maybe a little left over. When will I actually see that money? Agent:Typically 7–10 business days from when our adjuster completes the valuation. Once approved, we’ll issue payment directly to your lender for the loan balance and any remainder to you. [Conversation continues with agent confirming next steps and closing naturally.]

Claude orchestrated agent(task success: 3, naturalness: 4, 16 turns):

Agent:Hello! I’m Sarah, a claims specialist here to help you today. What type of claim do you need to file—auto, home, health, or another type? User:Hi Sarah. My car was stolen from right outside my apartment last night. I already filed a police report, but I need to get this claim started because I need a car for work. Agent:I’m so sorry to hear that. Could you please provide me with your auto policy number and your full name as it appears on the policy? User:It’s Mark Perez, policy number 385-6891254. How long is this whole process going to take? Agent:Thank you, Mark.*reviewing account*You have ourPremium tier auto coverage, which includes comprehensive coverage for theft. Many theft claims are resolved within 7–10 days. Now, could you walk me through what happened in more detail? [Agent gathers vehicle details and confirms rental coverage at $50/day.] User:So when you say “actual cash value,” what does that mean exactly? I still owe money on this car, and I need to make sure the settlement covers my loan. Agent:Actual Cash Value (ACV) means the current market value of your vehicleat the time of the loss. If the settlement amount is less than your loan balance, you would be responsible for the difference.However, I see you have Premium coverage—let me check if you have gap coverage.*checking policy details*Do you recall if you added gap insurance when you purchased your policy? User:Wait, you didn’t answer my question about the gap coverage—do I have it or not? I owe about $23,500 to Capital One. Can you checkright now? Agent:Mark, I want to be completely honest:I’m not seeing gap coverage listed as an active endorsement on your policy at this moment.However, Premium tier policies sometimes include it, and I want to be absolutely certain. Let me escalate this to verify with our underwriting team, or I can have your adjuster confirm this within 24–48 hours. User:I need to know thisnow, not in 24–48 hours. Can you transfer me to someone who can pull up my full policy? Agent:You’re absolutely right. Let me connect you with a senior claims specialist right now who has direct access to your complete policy documentation.*Transferring call* [Agent produces an extensive internal handoff summary with bullet-pointed claim details, visible to the user.]

The subterranean agent handles the 55-node insurance procedure naturally: it verifies coverage, gathers incident details, authorizes a rental, explains the valuation process, and closes the claim—all in a smooth conversational flow. The orchestrated agent, despite using Claude Sonnet 4.5, produces responses with markdown formatting (bold headers, bullet lists) inappropriate for a phone conversation, fails to definitively answer the gap coverage question, and ultimately escalates to a “senior specialist”—a handoff that no actual system backs. The node-level generation produces locally reasonable but globally disconnected responses: the agent checks policy details multiple times without resolving the question, a symptom of the orchestrator lacking cross-turn memory of what has already been attempted.

Appendix BStatistical Details

Tables7–9report pairwise comparisons for all domains using the Claude Sonnet 4.5 judge. Within-run comparisons (Travel: Sub/Orch) use Wilcoxon signed-rank tests (paired); cross-run comparisons (all others) use Mann-WhitneyUUtests (unpaired).pp-values are Holm–Bonferroni corrected within each comparison (5 criteria). Effect size is Cohen’sdd(pooled SD). 95% CIs are bootstrap (10,000 resamples).

Table 7:Pairwise comparisons—Travel Booking (n=200n=200).WW: Wilcoxon signed-rank (paired). Significance after Holm–Bonferroni:p∗<.05{}^{*}p<.05,p∗∗<.01{}^{**}p<.01,p∗⁣∗∗<.001{}^{***}p<.001.Table 8:Pairwise comparisons—Zoom Support (n=200n=200).UU: Mann-Whitney (unpaired, cross-run). Significance after Holm–Bonferroni:p∗<.05{}^{*}p<.05,p∗∗<.01{}^{**}p<.01,p∗⁣∗∗<.001{}^{***}p<.001.Table 9:Pairwise comparisons—Insurance Claims (55 nodes,n=200n=200).UU: Mann-Whitney (unpaired, cross-run). Significance after Holm–Bonferroni:p∗<.05{}^{*}p<.05,p∗∗<.01{}^{**}p<.01,p∗⁣∗∗<.001{}^{***}p<.001.

Appendix CJudge Robustness: GPT-4.1 Replication

To test whether our findings depend on the choice of judge model, we re-scored all conversations using OpenAI GPT-4.1 as an independent judge with the identical rubric. Table10presents the means.

Table 10:GPT-4.1 judge replication (n=200n=200per condition per domain). Cf. Tables1–3(Claude Sonnet 4.5 judge).Three findings are robust across both judges: (1) the compiled model achieves 83–99% of in-context quality (GPT-4.1) vs. 82–102% (Claude)—a comparable range; (2) the 3B compiled model strongly outperforms the 3B orchestrator on all metrics under both judges; (3) the in-context baseline dominates all other conditions.

Two patterns differ between judges. Under GPT-4.1, the LangGraph orchestrator leads the compiled model on more metrics than under the Claude judge, particularly in travel and insurance. The compiled model retains advantages on consistency (Zoom, Insurance) and information accuracy (Insurance) but loses on task success, graceful handling, and naturalness in most domains. Naturalness scores compress to near-parity across all conditions under GPT-4.1 (3.92–4.04), in contrast to the wider spread under Claude (4.12–5.00).

The core value proposition—competitive quality at two orders of magnitude less cost—holds under both judges. The compiled model achieves 85–113% (Claude) and 87–108% (GPT-4.1) of the LangGraph orchestrator’s quality, while costing 128–462×\timesless than the in-context baseline. The quality ranking (In-Context>>LangGraph≈\approxCompiled>>3B Orch) is consistent across judges.

Appendix DProcedure Flowcharts

Figures2–4present the three procedure flowcharts. Blue nodes are agent turns, orange nodes are user turns, green nodes indicate successful completion, and red nodes are exit states.

Refer to caption Figure 2:Travel booking flowchart (14 nodes, 3 decision hubs, 3 terminal states). Multi-way decision hubs route among 4–6 alternatives based on conversation state.Figure 3:Zoom technical support flowchart (14 nodes, 3 decision hubs, 3 terminal states).Figure 4:Insurance claims processing flowchart (55 nodes, 6 decision hubs, 5 terminal states).