Explains how prompt caching works in LLMs, using Claude as a case study, detailing the transformer's KV cache mechanism and the cost benefits of caching static prefixes in agentic workflows.

X AI KOLs News

Summary

Explains how prompt caching works in LLMs, using Claude as a case study, detailing the transformer's KV cache mechanism and the cost benefits of caching static prefixes in agentic workflows.

https://t.co/jrDKv6Z3JN
Original Article
View Cached Full Text

Cached at: 06/23/26, 02:35 PM

Prompt caching, clearly explained

A case study on how Claude achieves 92% cache hit-rate

Every time an AI agent takes a step, it pays a tax.

It re-reads everything from scratch.

The system instructions. The tool definitions. The project context it already loaded three turns ago. All of it. Every single turn.

That’s the context tax. And for long-running agentic workflows, it’s often the most expensive line item in your entire AI infrastructure.

Here’s the math: a system prompt with 20,000 tokens running over 50 turns means 1 million tokens of redundant computation billed at full price, producing zero new value.

The fix is prompt caching. But to use it well, you need to understand what’s actually happening under the hood.

Start With What Changes and What Doesn’t

Before you can optimize anything, you need to think clearly about the structure of an agent’s prompt (context).

Every request your agent sends has two fundamentally different parts:

The static prefix: it includes system instructions, tool definitions, project context, behavioral guidelines. This content is identical across every single turn of a session.

The dynamic tail: user messages, tool outputs, terminal observations. This is unique to every request and grows as the conversation progresses.

This distinction is everything. The static prefix is the expensive part you keep recomputing for no reason. The dynamic tail is the only part that actually needs fresh computation.

Prompt caching works by storing the mathematical state of the static prefix so that future requests can skip recomputing it entirely. You pay to process that prefix once. Every subsequent turn reads from memory instead.

Why This Works: What a Transformer Actually Does

To really understand why caching is so effective, you need to understand what happens inside the model when it reads your prompt.

Every LLM inference request has two phases:

Phase 1: Prefill

This is where the model processes your full input prompt. It’s compute-bound, meaning it runs dense matrix multiplications across every token in your context. The model reads everything and builds up a representation of it. This is the slow, expensive phase.

Phase 2: Decode

This is where the model generates output tokens, one at a time. It’s memory-bound rather than compute-bound because the model spends most of its time reading previously computed state rather than running heavy calculations.

During the prefill phase, the transformer builds three vectors for each token: a Query, a Key, and a Value. The attention mechanism uses these to figure out how each token relates to every other token in the sequence.

Here’s the critical insight: the Key and Value vectors depend only on the tokens that came before them. Once they’re calculated for a given prefix, they never need to change.

The illustration below visually explains what we just discussed:

Without caching, those Key-Value tensors get thrown away the moment a request completes. The next request starts from scratch and recalculates them for all 20,000 tokens again.

KV caching solves this by storing those tensors. The infrastructure keeps them on the inference servers, indexed by a cryptographic hash of the input text. When a new request comes in with the same prefix, the hash matches, the tensors are retrieved immediately, and the model skips all that computation.

This drops the computational complexity from O(n²) per generated token down to O(n). For a 20,000-token prefix repeated across 50 turns, that’s an enormous reduction.

The Economics

Understanding the pricing structure is what makes this architectural decision so consequential.

Here’s how Anthropic prices caching across their model families:

Three numbers to internalize:

  • Cache reads cost 10% of the base input price, a 90% discount on every token read from cache

  • Cache writes cost 25% more than the base input price, a small premium to store the KV tensors

  • Extended 1-hour caching costs 2x the base price

The math only works if your cache hit rate stays high. Which brings us to the best real-world example of what that looks like in practice.

Claude Code: A 30-Minute Session Walkthrough

Claude Code is built entirely around one objective: keep the cache hot.

To understand what that means concretely, lets walk through how a typical 30-minute coding session looks like and track exactly what gets billed and what doesn’t.

Minute 0: Session Start

Claude Code loads its system prompt and tool definitions. It also reads the CLAUDE.md file in your project root, which describes the codebase and conventions. This payload regularly exceeds 20,000 tokens.

This is the most expensive moment of the entire session. Every single token is new. But you only pay this cost once.

Minutes 1 to 5: First Commands

You type your first instruction, something like “look at the auth module and suggest improvements.”

Claude Code dispatches an Explore Subagent. It navigates through the codebase, opens files, runs grep commands, and builds a picture of the relevant code. All of this gets appended to the dynamic tail.

The 20,000-token static foundation? Already in cache. Being read back at $0.30/MTok instead of $3.00/MTok. You’re only paying for the new tool outputs and your message.

Minutes 6 to 15: Deep Work

The Plan Subagent receives the findings from the Explore Subagent. Rather than passing the raw results verbatim (which would blow up the dynamic tail unnecessarily), Claude Code passes a concise summary. This keeps the suffix manageable and the cache efficient.

The planner produces a structured implementation plan. You review it, approve it, and Claude Code starts making changes. Every turn in this loop reads the 20,000-token prefix from cache. Each cache hit resets the TTL, keeping the cache warm for future turns.

Minutes 16 to 25: Iteration

You ask for adjustments. Claude Code revises its approach. More tool calls, more terminal output. The dynamic tail is growing, but it represents only the new, unique content in this session.

At this point, the session has processed hundreds of thousands of tokens total. But the 20,000-token foundation has been read from cache every single turn.

Minute 28: Running /cost

Without caching, a session like this easily crosses 2 million tokens. At Sonnet 4.5 rates, that’s around $6.00.

With caching running at high efficiency:

  • The vast majority of tokens are read from cache at $0.30/MTok

  • Only the new dynamic tail tokens are computed fresh

In practice, you’d expect somewhere in the range of an 80%+ cost reduction on a single task. Now multiply that by every user, every day.

To summarise here’s how the system prompt layout looks are the session continues:

The Rule That Breaks Everything

Here is the most counterintuitive thing about prompt caching.

1 + 2 = 3. But 2 + 1 is a cache miss.

The infrastructure hashes the prompt. The hash is an identifier for cryptography. The hash changes if anything in that order changes, even if two elements are in a different order. The cache is empty. The whole prefix is recalculated at full price.

Three rules that follow from this:

  • Don’t add or take away tools during a session. The cached prefix includes tools. Changing the tools makes everything that comes after it useless.

  • Never switch models mid-session. Caches are model-specific. Switching to a cheaper model mid-conversation requires rebuilding the entire cache.

  • Never change the prefix to change the state. Instead, Claude Code adds a tag to the next user message that reminds the system. The prefix never changes.

What It Means for You

Everything above explains how Claude Code handles caching. The same rules apply if you are making your own agent.

This is how to structure your prompts:

  • At the top are system instructions and rules. Don’t change in the middle.

  • Load all the tools you’ll need ahead of time. Don’t add or take them away.

  • Retrieved context and documents after that. Static for the duration.

  • At the bottom, the history of the conversation and the tool’s outputs.

With auto-caching turned on, the breakpoint moves forward automatically as the conversation goes on.

Claude Code is in charge of its own cache. Anthropic just added auto-caching to its API, so you can do the same for your own agent.

Without auto-caching, you had to remember where the token boundaries were. A wrong boundary meant not getting to the cache.

Use cache-safe forking to compact for the context limit. Use the same system prompt, tools, and conversation, then add the compaction as a new message.

The compaction call looks almost exactly like the last one. The cached prefix is used again. The only thing that is billed as new is the compaction instruction.

To see if an API is working, keep an eye on these three fields in every response:

  • cache_creation_input_tokens⁣: tokens put into memory

  • cache_read_input_tokens⁣: tokens read from memory

  • input_tokens⁣: tokens worked as usual

Your cache efficiency score is the number of read tokens compared to the number of creation tokens. Keep an eye on it the same way you keep an eye on uptime.

Key Takeaways

Prompt caching is not a feature you turn on. It is an architectural discipline you build around.

Claude Code is the best example of what that field looks like when it’s done on a large scale.

A cache hit rate of 92%. A cut in costs of 81%.

This is the blueprint if you are making agents. You can’t ignore the tax; it exists. The only thing that matters is whether you are paying for it or getting rid of it.

References:

  • https://www.dailydoseofds.com/p/kv-caching-in-llms-explained-visually/

  • https://x.com/trq212/status/2024574133011673516?s=20

  • https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus

Similar Articles

Probing the Prompt KV Cache: Where It Becomes Dispensable

arXiv cs.CL

This paper systematically investigates when and which parts of the prompt KV cache become dispensable during LLM decoding, showing that redundancy primarily involves chat template scaffolding rather than task content, and replacement with neutral filler preserves accuracy.