@hooeem: https://x.com/hooeem/status/2062266452921491934
Summary
A guide explaining how to make agentic workflows up to 462x cheaper by compiling fixed procedures into smaller fine-tuned models instead of repeatedly prompting frontier models.
View Cached Full Text
Cached at: 06/03/26, 09:55 PM
How to make agentic workflows 100x cheaper (full guide):
Your agentic workflows are wasting your tokens. They’re wasting your money repeatedly for an orchestration loop. Here’s how you fix it and make your workflows 100x cheaper (yes, 100x cheaper).
In fact in testing it was found to be 128x, 296x, and 462x cheaper in the three tested domains, so 100x is an understatement.
The paper this research has come from has been written by Simon Dennis, Riviaan Patil, Kevin Shabahang, & Hao Guo from the University of Melbourne (I’ll link the paper in full at the end).
This article is going to tell you how to utilise their research so that you can make your agentic workflows 100x cheaper and the contents of this article is the following:
-
Why your agentic workflow costs so much
-
The one idea to take away
-
How it’s actually done (theory)
-
Does it hold up?
-
Run your own numbers (find how much it cost)
-
Is this for you?
-
Full build guide
This guide can make your conversations up to 462x cheaper whilst keeping 87-98% of the frontier quality kept, so let’s get started!
1: Why your agentic workflow costs so much
Okay, so you have a fixed procedure - an agentic workflow.
Then you have where this agentic workflow lives and depending on that single choice is exactly what drives the cost of your agentic workflow.
A: Orchestration
This is the most common setup you see today, this is where software sits on top of the model and, every single turn, injects instructions and decides where the conversation goes next.
The cost? $0.05-0.17 per conversation.
B: In-context
This is the “just prompt it” route, this is where you paste the whole workflow into the model’s system prompt and you let it run yourself.
This is the most expensive approach, the cost? $0.10-0.33 per conversation.
C: Compiled
This is the method we’re going to learn from the research paper - we teach a smaller model the procedure once and then we host it ourself.
The procedure is buried inside the model and the cost? $0.0003-0.001 per conversation.
2: The one idea to take away
If the shape of your workflow, it’s steps, it’s branches, it’s order, doesn’t change from one conversation to the next then why pay to describe it every time? FFFFFFFF*****CCCCCKKKKK THAT.
Go put the unchanging shape into the model itself and keep the prompt for the only thing that actually varies.
Look:
3: How?
Let’s dive in…
Step 1: Draw the workflow as a flowchart
Map your procedure as boxes and arrows, yes, a frickin flow chart, each box is a turn each arrow is a possible next step. Mark the start and the ways a conversation can end.
Why? It’s a precise way of writing down the procedure a computer can walk through automatically. If you’re able to write down your agentic workflow then you’re ready for step 2.
Step 2: Generate practice conversations from it
A frontier model such as Claude Sonnet will be able to walk through every sensible path that can be made from your flowchart and will write out realistic example conversations with varying details each time. This is how you teach it and to do this you need to get it to run 2000-6000 conversations which will cost around $40 in API calls. (you don’t need to chat to your workflow 2000 times yourself lol)
Step 3: Fine tune a small model on those conversations
Take a small open model: the study used Qwen 2.5 (3 billion parameters) and Qwen3 (8 billion), and train it on the examples until it absorbs the procedure. It needs to learn the workflow in it’s entirety not as a set of instructions it reads each time.
One important caveat from the study: this has to be a full retrain, not the cheap shortcut (LoRA), the shortcut was shown to fail at learning multi-step procedures.
-
Hardware: 1 high-end GPU (e.g. an A100 or H200)
-
Cost: ~$10–40
-
“3B” = 3 billion parameters
Fine-tuning: taking an existing model and continuing its training on your own examples so it specialises. Parameters are the model’s adjustable internals; 3–8 billion is small enough to run on a single rented GPU, versus the ~70× larger frontier models behind the paid APIs.
Step 4: Deploy it, with no orchestrator at all
Host the trained model yourself and let conversations fly. The model self-orchestrates from what it learned. The expensive parts of the old setup are simply gone.
-
Serving: self-hosted on your own GPU
-
Prompt size: constant
-
Routing errors: eliminated by design
Self-hosting: running the model on a machine you rent or own (the study used a cloud A100 at ~$2.50/hour) instead of paying a provider per token. This is where the ~65× per-token saving comes from.
I will be diving deeper into the build at the end of this article.
4: Does it hold up?
Cheap is easy if you don’t care about quality. The study measured both, across three workflows, judged blind by an AI grader (and re-checked by a second, different grader).
They found that despite it being ridiculously cheaper, it’s quality kept vs. the gold standard.
-
For naturalness its quality kept 97%.
-
For graceful handling its quality kept 92%.
-
For task success its quality kept 91%.
-
For information accuracy its quality kept 87%.
Overall the 8B compiled model scored 87–98% of the frontier in-context baseline. Not only that but in two of three workflows tested in the article the compiled model failed far less often than the orchestrator version.
5: How much does it cost?
-
At low volume the one-time setup dominates, so the advantage looks modest. As volume grows, the compiled cost approaches the raw 128–462× advantage from the cost ladder.
-
The study notes break-even against the in-context approach arrives within 500 conversations visible in the first row above.
-
For 10,000+ conversations, the study reports compilation adds less than $0.01 per conversation once the setup is spread out.
6: Is this for you?
A strong fit when…
-
Your workflow is procedural - you can draw it as a flowchart with clear steps and branches.
-
The procedure is stable - it doesn’t change shape from one conversation to the next.
-
You run enough volume for the one-time setup to pay off (break-even under 500 conversations).
-
You want to keep the procedure private rather than expose it to a third-party API.
-
Latency and per-conversation cost matter to you at scale.
A poor fit when…
-
The task is open-ended - not a defined procedure you can chart. The study only tested procedural workflows.
-
Success depends on broad world knowledge - that was the compiled model’s weakest area (87%).
-
You need the absolute highest quality and cost is no object - the in-context frontier baseline still scored highest.
-
Your procedure changes constantly - though a refresh is only 30–50 minutes, not a full rebuild.
-
You have no access to a suitable GPU or the skills to run a fine-tune (or to delegate it).
7: Building it
You do not need to understand every single nerdy thing that’s going on, and you shouldn’t expect to either. Think of it like building a house, you draw the plans, a builder does the construction.
Here we are simply deciding the workflow and we’ll judge whether the result is good or not (nice).
Here’s what we’re about to do (we mentioned it briefly earlier but now we’re going to give those who are actually going to do this the tools):
-
WRITE WORKFLOW DOWN AS FLOWCHART
-
GET AI TO PRACTICE CONVERSATIONS WITH IT
-
GET SMALL MODEL TO LEARN IT
-
IT THEN RUNS THAT
It should cost around $50-80 bucks to set this fucker up assuming you have a GPU to use and after 500 conversations with your agentic workflow it has already paid for itself!
Stage 1: Draw your workflow as a flowchart
Write down your procedure as a simple map: boxes for each thing that gets said, arrows for what can happen next, and a few clearly-marked endings (the customer is happy, the customer gives up, or it’s handed to a human). That’s it. If you can sketch your process on a whiteboard, you’ve done the hard part of this stage.
This is the one stage that is genuinely yours. You know your workflow better than anyone; nobody else can draw it for you.
A few tips from the study:
-
Keep agent and customer turns alternating, the agent says something, the customer replies, and so on.
-
Every ending should be one of three kinds: success, gave up, or handed to a human.
-
Write each agent step to ask one thing at a time. The study found this single-question rhythm is what makes the trained model feel natural and easy to follow.
-
For a sense of scale: the paper’s simple workflows had 14 boxes; its most complex (insurance claims) had 55 boxeswith 6 branching points. Both worked.
The flowchart as a file (procedure.json) The map is saved as a plain text file.
Here is a tiny travel-booking example; replace it with your own or get Claude to help you with it by explaining your workflow.
json{ “system_prompt”: “You are a helpful travel booking assistant.”, “start”: “greet”, “terminals”: { “booked”: “success”, “abandoned”: “abandonment”, “escalated”: “escalation” }, “scenario_variables”: { “destination”: [“Japan”, “Portugal”, “Peru”], “budget_per_person”: [“£650”, “£900”, “£2,000”], “trip_length”: [“a weekend”, “6 days”, “two weeks”], “user_style”: [“uncertain”, “specific”, “price-conscious”] }, “nodes”: { “greet”: { “role”: “agent”, “prompt”: “Warmly greet the customer and ask what trip they’d like to book.” }, “user_request”: { “role”: “user”, “prompt”: “You want {destination} for {trip_length}, budget about {budget_per_person} each. Be {user_style}.” }, “gather”: { “role”: “agent”, “prompt”: “Ask ONE focused follow-up question about dates or interests.” }, “user_detail”: { “role”: “user”, “prompt”: “Answer, staying consistent with your budget and style.” }, “present”: { “role”: “agent”, “prompt”: “Present 2–3 concrete options that fit the budget.” }, “user_choose”: { “role”: “user”, “prompt”: “React; pick one or ask for alternatives.” }, “confirm”: { “role”: “agent”, “prompt”: “Summarise the choice and ask the customer to confirm.” }, “user_confirm”: { “role”: “user”, “prompt”: “Confirm you’re happy to book.” }, “booked”: { “role”: “agent”, “prompt”: “Confirm the booking and close with one travel tip.” }, “abandoned”: { “role”: “agent”, “prompt”: “Politely acknowledge they’re not ready.” }, “escalated”: { “role”: “agent”, “prompt”: “Explain you’re handing off to a human specialist.” } }, “edges”: [ { “from”: “greet”, “to”: “user_request” }, { “from”: “user_request”, “to”: “gather” }, { “from”: “gather”, “to”: “user_detail” }, { “from”: “user_detail”, “to”: “present” }, { “from”: “user_detail”, “to”: “gather”, “condition”: “needs more info” }, { “from”: “present”, “to”: “user_choose” }, { “from”: “user_choose”, “to”: “confirm” }, { “from”: “user_choose”, “to”: “present”, “condition”: “wants alternatives” }, { “from”: “user_choose”, “to”: “abandoned”, “condition”: “not interested” }, { “from”: “confirm”, “to”: “user_confirm” }, { “from”: “user_confirm”, “to”: “booked” }, { “from”: “present”, “to”: “escalated”, “condition”: “too complex” } ] }
Here is a live artefact to help you with this stage:
https://claude.ai/public/artifacts/3f0bd0cf-980b-407c-a515-9880f66103e7
Stage 2: Let a clever AI have thousands of conversations with it
You don’t need a pile of real customer transcripts to start. Instead, a top-tier AI (the study used Claude) walks every sensible route through your flowchart and writes out realistic example conversations, thousands of them, changing the details each time, like a different destination or a more sceptical customer. These examples are the “textbook” your small model will learn from.
The clever part: the finished examples read as completely natural dialogue. None of the flowchart labels show up in them. The procedure is hidden inside how the conversations flow which is exactly how the small model will end up learning it.
What it costs? $40 in usage.
For whoever runs it generate.py (the tested data generator). This walks the flowchart, writes each turn with a frontier model, and saves the conversations.
pythonimport json, random, os from anthropic import Anthropic client = Anthropic() # reads your ANTHROPIC_API_KEY GENERATOR_MODEL = “claude-sonnet-4-5” # the study used Claude Sonnet 4.5
F = json.load(open(“procedure.json”)) NODES, EDGES = F[“nodes”], F[“edges”]; TERMINALS = set(F[“terminals”].keys())
def enumerate_acyclic_paths(max_paths=10000): # List every distinct route through the flowchart (no box visited twice). # This gives even coverage of all endings — a simple random walk lopsidedly # over-samples short “gave up / escalated” routes (a bug found in testing). paths = [] def dfs(node, seen, acc): acc = acc + [node] if node in TERMINALS: paths.append(acc); return for e in EDGES: if e[“from”] == node and e[“to”] not in seen and len(paths) < max_paths: dfs(e[“to”], seen | {e[“to”]}, acc) dfs(F[“start”], {F[“start”]}, []); return paths ALL_PATHS = enumerate_acyclic_paths()
def fill(t, s): for k, v in s.items(): t = t.replace(“{”+k+“}”, str(v)) return t
def generate_turn(node, scenario, history): who = “the booking AGENT” if node[“role”] == “agent” else “the CUSTOMER” transcript = “\n”.join(f’{m[“role”].upper()}: {m[“content”]}’ for m in history) or “(start)” r = client.messages.create(model=GENERATOR_MODEL, max_tokens=400, messages=[{“role”:“user”,“content”: f“Write one turn as {who}.\nYOUR INSTRUCTION: {fill(node[‘prompt’], scenario)}\n“ f“CONVERSATION SO FAR:\n{transcript}\n\nWrite only {who}’s next message, naturally. “ f“No labels, no mention of any procedure.“}]) return r.content[0].text.strip()
def generate_conversation(): path = random.choice(ALL_PATHS) scenario = {k: random.choice(v) for k, v in F[“scenario_variables”].items()} turns = [] for nid in path: node = NODES[nid]; role = “assistant” if node[“role”] == “agent” else “user” text = generate_turn(node, scenario, turns) # Merge two same-role turns in a row into one (keeps a valid chat format — # another bug found and fixed in testing). if turns and turns[-1][“role”] == role: turns[-1][“content”] += “ “ + text else: turns.append({“role”: role, “content”: text}) return [{“role”: “system”, “content”: F[“system_prompt”]}] + turns
def build(n=2125, eval_frac=0.10): # the study’s volume + 90/10 split convs = [{“messages”: generate_conversation()} for _ in range(n)] random.shuffle(convs); cut = int(len(convs)*(1-eval_frac)) os.makedirs(“data”, exist_ok=True) for name, part in [(“train”, convs[:cut]), (“eval”, convs[cut:])]: with open(f“data/{name}.jsonl“, “w”) as f: for c in part: f.write(json.dumps(c)+“\n”) if name == “main”: build()
Stage 3: Train the small model on those conversations
Take a small, free, open model and let it study the practice conversations until it has absorbed the procedure. It doesn’t memorise a rulebook to re-read each time, it picks up the workflow as a habit, the way a new employee eventually stops checking the manual.
It then needs a powerful graphics computer (a GPU).
The one rule you must insist on: it has to be a full training, not the popular cheap shortcut (called “LoRA”). The study tested the shortcut and it failed to learn multi-step procedures properly. If someone offers to do it the quick way, the answer is no.
For whoever runs it: the settings that matter, and the tested script. Base model: Qwen 2.5 (3B) for simple workflows, Qwen3-8B for complex ones. Full fine-tune, never LoRA. Learning rate 2×10⁻⁵, 10–20 passes over the data, keep the best version by held-out score.
pythonimport torch from datasets import load_dataset from transformers import AutoModelForCausalLM, AutoTokenizer from trl import SFTConfig, SFTTrainer
BASE_MODEL = “Qwen/Qwen2.5-3B-Instruct” # use “Qwen/Qwen3-8B” for complex workflows tok = AutoTokenizer.from_pretrained(BASE_MODEL) model = AutoModelForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.bfloat16) ds = load_dataset(“json”, data_files={“train”: “data/train.jsonl”, “eval”: “data/eval.jsonl”})
cfg = SFTConfig( output_dir=“compiled-model”, num_train_epochs=20, # 10–20 passes per_device_train_batch_size=2, gradient_accumulation_steps=8, # effective batch 16 (use 32 for the 8B model) learning_rate=2e-5, lr_scheduler_type=“cosine”, bf16=True, optim=“adamw_8bit”, # lets it fit one GPU (needs bitsandbytes); use “adamw_torch” if you have plenty of memory eval_strategy=“epoch”, save_strategy=“epoch”, load_best_model_at_end=True, metric_for_best_model=“eval_loss”, greater_is_better=False, assistant_only_loss=True, # learn the AGENT’s turns only max_length=4096, # NOTE: this was called max_seq_length in older versions — the old name now errors report_to=[], ) SFTTrainer(model=model, processing_class=tok, train_dataset=ds[“train”], eval_dataset=ds[“eval”], args=cfg).train()
Stage 4: Switch it on
Put the trained model on a computer you control. Its only instruction is a single line. The model runs the whole workflow from what it learned.
This is where the saving comes from: you’re no longer renting a giant model by the word, and you’re no longer re-sending the procedure on every reply. A whole conversation now costs a tiny fraction of a penny.
For whoever runs it: serve and query. The study used vLLM on a rented GPU (about $2.50/hour).
bashvllm serve ./compiled-model –max-model-len 4096 –port 8000
The bottom line
Stop over paying for your agentic workflows on repeat. If your workflow is procedural, stable and high-volume, compiling it into a small self-hosted model is the natural move with near-frontier quality, fewer failures, and a cost that drops by two orders of magnitude with the advantage growing the more complex your workflow gets.
The article: https://arxiv.org/pdf/2605.22502
Similar Articles
@dair_ai: NEW paper worth reading. A full agentic workflow can be distilled into model weights and run at roughly 100x lower infe…
This paper demonstrates that agentic workflows can be distilled into small fine-tuned models, achieving near-frontier quality while reducing inference cost by two orders of magnitude compared to orchestration approaches.
The power of structured workflows and small local models
The author details their experience building a custom agent loop using a small local model (Qwen3.5 9B) with structured workflows and a map-reduce pattern to manage context limits, replacing Claude Code for most tasks.
@Zephyr_hg: https://x.com/Zephyr_hg/status/2062176187384807488
A practical guide arguing that mastering sub-agents requires building four specific workflows in a weekend, covering decomposition, context packaging, verification, and cost control, rather than spending 200 hours on tutorials.
@DeRonin_: https://x.com/DeRonin_/status/2054235707791778034
A practical guide on reducing AI coding expenses by 80% through smarter token management, including multi-model routing, prompt caching, and context discipline, rather than simply switching to cheaper models.
@Aurimas_Gr: You must know these 𝗔𝗴𝗲𝗻𝘁𝗶𝗰 𝗦𝘆𝘀𝘁𝗲𝗺 𝗪𝗼𝗿𝗸𝗳𝗹𝗼𝘄 𝗣𝗮𝘁𝘁𝗲𝗿𝗻𝘀 as an 𝗔𝗜 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿. If you a…
The article describes five key workflow patterns for building agentic AI systems in enterprise settings, as summarized by Anthropic: prompt chaining, routing, parallelization, orchestrator, and evaluator-optimizer, with tips to prefer simpler workflows before using full agents.