Measured token consumption across 4 agent runtimes doing the same tasks. Costs ranged from 1x to 4x depending on cache architecture

Reddit r/AI_Agents News

Summary

A comparison of token consumption across four agent runtimes (Claude Code, OpenClaw, Hermes, and OpenClacky) on the same tasks reveals costs ranging from 0.8x to 4x relative to Claude Code, driven by differences in cache architecture and tool schema design.

I've been digging into why some agent runtimes burn through tokens so much faster than others, even when using the same model. Ran a controlled comparison on three real tasks and the gap was bigger than I expected.

Setup: same model (Claude Sonnet), same tasks, measuring total input + output tokens. The agents tested were Claude Code, OpenClaw, Hermes, and ours (OpenClacky, open source).

Rough results, normalized to Claude Code as 1.0x:

- Hermes: ~3-4x. It ships 52 built-in tools. Every API call sends the full schema. That's 10-25k tokens of tool definitions per turn. If the schema shifts (dynamic tools), the whole thing is a cache miss.
- OpenClaw: ~1.5x. Solid runtime, but skill loading touches the system prompt, which breaks prefix matching on every skill invocation.
- Claude Code: 1.0x baseline. Good cache engineering, closed-source.
- OpenClacky: ~0.8x. 16 tools, frozen system prompt, double cache markers. Cache hit rate stays above 90%.

The underlying issue is pretty simple. On every turn, the API receives: system prompt + tool definitions + full conversation history. If prompt caching hits, you pay 1/10th price (Anthropic) or half price (OpenAI) for everything the model has already seen. If it misses, full price for all of it again.

Most runtimes break their own cache without realizing it. The common ways:

- Adding or removing tools mid-session changes the system prompt bytes
- Loading new context into the system prompt (skills, memory, rules)
- Compressing history at the wrong time rewrites what was already cached
- Model switches split the cache namespace

The fix isn't complicated in concept: freeze the prefix, put dynamic state elsewhere, use rolling cache markers so history growth doesn't invalidate prior turns. Took us two failed architectures and eight months to get the ordering right though.

If you're running local models through something like LiteLLM or a local OpenAI-compatible server, it works. Cache benefits depend on your provider though. Anthropic and OpenAI have the best caching infra right now. Local setups still benefit from the smaller prompts regardless.

Happy to go deeper on methodology if anyone wants.
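
To make the hit-versus-miss arithmetic concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch under stated assumptions: the function name and the example numbers are made up for illustration, output tokens are folded into the per-turn growth, and any cache-write surcharge a provider may add is ignored. The only figure taken from the post is the 1/10th price for cached input.

```python
def session_input_cost(prefix_tokens, tokens_per_turn, turns,
                       cached_price=0.1, cache_hit=True):
    """Total input cost of a session, in units of one full-price input token.

    prefix_tokens   -- system prompt + tool definitions (hypothetical size)
    tokens_per_turn -- new tokens appended to the history each turn
    cached_price    -- price of a cached token relative to a fresh one
                       (0.1 is the Anthropic figure quoted in the post)
    cache_hit       -- True: everything sent on the previous turn is read
                       from cache; False: every turn is a full miss
    """
    total = 0.0
    seen = 0      # tokens the provider has already processed for this session
    history = 0   # conversation tokens accumulated so far
    for _ in range(turns):
        history += tokens_per_turn
        prompt_len = prefix_tokens + history   # what the API receives this turn
        cached = seen if cache_hit else 0
        total += cached * cached_price + (prompt_len - cached)
        seen = prompt_len
    return total


# Illustrative numbers only: a 20k-token prefix, 1k new tokens per turn, 20 turns.
miss = session_input_cost(20_000, 1_000, 20, cache_hit=False)
hit = session_input_cost(20_000, 1_000, 20, cache_hit=True)
print(f"always miss: {miss:,.0f}  always hit: {hit:,.0f}  ratio: {miss / hit:.1f}x")
```

The model is deliberately crude, but it shows why a large tool schema hurts twice: it inflates the prefix, and every miss re-bills that prefix at full price.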
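
One way to read "freeze the prefix, use rolling cache markers" is sketched below in Python. This is an interpretation, not OpenClacky's actual implementation: `build_request` and `as_blocks` are hypothetical helper names, and the payload follows the shape of Anthropic's Messages API, where a `cache_control` entry of type `ephemeral` marks the end of a cacheable prefix. The first marker sits on the system block (which also covers the tool definitions sent ahead of it), and the second moves to the newest message on every turn.

```python
import copy

EPHEMERAL = {"type": "ephemeral"}  # cache_control value used by Anthropic's API


def as_blocks(content):
    """Normalize message content (str or list of blocks) to a list of blocks."""
    if isinstance(content, str):
        return [{"type": "text", "text": content}]
    return copy.deepcopy(content)


def build_request(model, system_prompt, tools, history, max_tokens=1024):
    """Assemble a request dict with a frozen prefix and a rolling marker.

    system_prompt and tools must be byte-identical on every turn of a
    session; anything dynamic (skills, memory, rules) should be appended
    to history as new messages by the caller, never spliced in here.
    """
    # Marker 1: end of the static prefix (tool definitions + system prompt).
    system = [{"type": "text", "text": system_prompt,
               "cache_control": dict(EPHEMERAL)}]

    # Copy the history so the caller's stored messages are never mutated.
    messages = [{"role": m["role"], "content": as_blocks(m["content"])}
                for m in history]

    # Marker 2: rolling marker on the last block of the newest message, so
    # the next turn can reuse everything up to this point.
    if messages and messages[-1]["content"]:
        messages[-1]["content"][-1]["cache_control"] = dict(EPHEMERAL)

    return {
        "model": model,
        "max_tokens": max_tokens,
        "tools": copy.deepcopy(tools),
        "system": system,
        "messages": messages,
    }
```

With the official `anthropic` Python SDK, the returned dict should be usable as `client.messages.create(**request)`, though that call is not exercised here. The design point is that history stays append-only: earlier messages are re-sent unchanged, and only the position of the second marker moves.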