@nateherk: https://x.com/nateherk/status/2057450555212013627

Summary

A practical guide explaining how prompt caching works in Claude Code, how it reduces token costs by 90%, and common habits that break the cache, helping developers extend session length and reduce costs.

https://t.co/tbQe5AAnAw

Cached at: 05/21/26, 09:40 PM

How Anthropic Engineers Actually Save Tokens

I Saved 300 Million Tokens This Week

91 million in a single day. 300+ million in a week.

I didn’t change a single setting. This was prompt caching doing its job in the background.

But once I actually understood what caching is and how to stop breaking it, my sessions started lasting way longer for the same usage. So here’s the 80/20 of prompt caching in Claude Code, no API deep dive required.

TL;DR

→ Cached tokens cost 10% of normal input. 91M cached billed like 9M.

→ Claude Code subscription TTL is 1 hour. API default is 5 minutes. Sub-agents are always 5 minutes.

→ Cache lives in three layers: system, project, conversation.

→ Switching models mid-session breaks the cache. That includes “opusplan” mode.

What Caching Actually Costs

Every cached token costs 10% of a normal input token.

So when my dashboard showed 91 million cached on a single day, I got billed like I processed 9 million. That’s the whole reason long Claude Code sessions feel “free” compared to what they would cost without caching.
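To make that arithmetic concrete, here is a minimal sketch. The 10% rate is the figure quoted above; the function name and the use of whole-token integer math are my own choices for illustration.

```python
# Cached reads are billed at 10% of a normal input token (figure from the text).
CACHE_READ_PERCENT = 10


def billed_equivalent(cached_tokens: int) -> int:
    """Full-price input tokens that would cost the same as these cached reads."""
    return cached_tokens * CACHE_READ_PERCENT // 100


print(f"{billed_equivalent(91_000_000):,}")  # 9,100,000
```

So 91 million cached tokens land at roughly the 9 million figure mentioned above.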

Two numbers worth knowing inside the dashboard:

→ **Cache create.** The one-time cost of writing something into the cache. Pays off the next turn.

→ **Cache read.** Tokens Claude reused from a cache (your CLAUDE.md, tool definitions, prior messages). 10x cheaper than fresh input.

If your cache read number is high, you’re winning. If it’s low, you’re paying for the same context over and over.
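One rough way to judge whether the cache read number is “high” is to look at it as a share of everything on the input side. This is a sketch under my own definition, not a metric the dashboard or Anthropic is known to use, and every number except the 91M is made up for the example.

```python
def cache_read_share(fresh_input: int, cache_create: int, cache_read: int) -> float:
    """Fraction of all input-side tokens that were served from cache."""
    total = fresh_input + cache_create + cache_read
    if total == 0:
        return 0.0
    return cache_read / total


# Hypothetical day: 1M fresh input, 8M cache create, 91M cache read.
share = cache_read_share(1_000_000, 8_000_000, 91_000_000)
print(f"{share:.0%}")  # 91%
```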

Quick quote from Thariq at Anthropic that stuck with me:

“We actually run alerts on our prompt cache hit rate and declare SEVs if they’re too low.”

And a great X article by him as well: https://x.com/trq212/status/2024574133011673516?s=20

When the hit rate is high, four things happen: Claude Code feels faster, their serving cost goes down, your subscription limits feel more generous, and long coding sessions stay practical. When it’s low, everyone loses.

So the incentives line up. They want your hit rate high. You want it high. The only thing in the way is a handful of small habits that quietly reset everything.

How the Cache Grows Each Turn

Cache works on prefix matching. Without going down a rabbit hole, that just means Claude reuses cached tokens as long as everything before that point is identical to what was cached.
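A toy illustration of prefix matching, assuming the context is just a list of chunks. Real caching works on token blocks rather than named chunks, so treat the chunk labels and the function name as hypothetical.

```python
from typing import Sequence


def reusable_prefix_len(cached: Sequence[str], request: Sequence[str]) -> int:
    """Count the leading chunks of `request` that are identical to `cached`."""
    n = 0
    for old, new in zip(cached, request):
        if old != new:
            break
        n += 1
    return n


cached = ["system", "tools", "CLAUDE.md", "msg1", "reply1"]

# Appending to the end keeps the whole cached prefix reusable.
print(reusable_prefix_len(cached, cached + ["msg2"]))  # 5

# Changing something early cuts off everything that follows it.
edited = ["system", "tools-v2", "CLAUDE.md", "msg1", "reply1", "msg2"]
print(reusable_prefix_len(cached, edited))  # 1
```

The second call shows why an early change is so expensive: only the part before the edit still matches.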

Here’s how a fresh session actually plays out:

From Claude Code docs

1️⃣ **Turn 1.** No cache yet. The system prompt, your project context (CLAUDE.md, memory, rules), and your first message all get processed fresh and written to cache.

2️⃣ **Turn 2.** Everything from turn 1 is now cached. Claude only has to process the new exchange: its previous reply and your next message. Cheap turn.

3️⃣ **Turn 3.** Same deal. Old turns stay cached. Only the new exchange is fresh.
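The three turns above can be sketched as a small simulation. The token counts are invented, and the model is deliberately simple: each turn writes its full context to cache, and nothing expires.

```python
def simulate_turns(prefix_tokens: int, exchange_tokens: list[int]) -> list[tuple[int, int]]:
    """Return (cached_read, fresh) token counts for each turn.

    prefix_tokens: system prompt plus project context, sent on every turn.
    exchange_tokens: size of the new exchange added on each turn.
    """
    rows = []
    cached_so_far = 0  # nothing is cached before turn 1
    context = prefix_tokens
    for new_tokens in exchange_tokens:
        context += new_tokens
        rows.append((cached_so_far, context - cached_so_far))
        cached_so_far = context  # this turn's context is now in the cache
    return rows


for turn, (cached, fresh) in enumerate(simulate_turns(20_000, [500, 800, 600]), start=1):
    print(f"turn {turn}: cached={cached:,} fresh={fresh:,}")
# turn 1: cached=0 fresh=20,500
# turn 2: cached=20,500 fresh=800
# turn 3: cached=21,300 fresh=600
```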

The cache is organized in three layers:

From Thariq’s X article

→ **System layer.** Base instructions, tool definitions (read, write, bash, grep, glob), output style. Globally cached.

→ **Project layer.** CLAUDE.md, memory, project rules. Cached per project.

→ **Conversation layer.** Replies and messages. Grows each turn.

If anything in the system layer or project layer changes mid-session, everything has to be re-cached from scratch. That’s the expensive move. Imagine you’re on message 16 and you change the system prompt or wait an hour. Every token from message 1 onwards has to be reprocessed.
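One way to picture why an early change cascades is to key each layer by a hash of its own content plus everything before it. This is a sketch of the idea only; the hashing scheme, function names, and layer contents are assumptions, not how Claude Code actually stores its cache.

```python
import hashlib


def chained_keys(layers: list[tuple[str, str]]) -> dict[str, str]:
    """Key each layer by a hash of its content plus everything before it."""
    keys = {}
    running = hashlib.sha256()
    for name, content in layers:
        running.update(content.encode("utf-8"))
        running.update(b"\x00")  # separator so layer boundaries stay distinct
        keys[name] = running.hexdigest()
    return keys


def invalidated(before: list[tuple[str, str]], after: list[tuple[str, str]]) -> list[str]:
    """Names of layers whose chained key no longer matches."""
    old, new = chained_keys(before), chained_keys(after)
    return [name for name in old if old[name] != new.get(name)]


session = [("system", "base instructions + tools"),
           ("project", "CLAUDE.md v1"),
           ("conversation", "messages 1-16")]

project_changed = [("system", "base instructions + tools"),
                   ("project", "CLAUDE.md v2"),
                   ("conversation", "messages 1-16")]

print(invalidated(session, project_changed))  # ['project', 'conversation']
```

The conversation text is unchanged, yet its key no longer matches because something before it changed. That is the “every token from message 1 onwards” cost described above.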

The 1-Hour vs 5-Minute Confusion

This is where most people get tripped up.

→ **Claude Code subscription:** 1 hour TTL by default.

→ **Claude API:** 5 minutes by default. You can bump it to an hour at extra cost.

→ **Sub-agents on any plan:** 5 minutes. Always.

→ **Claude.ai web chat:** not officially documented. Probably the same as the subscription, but I haven’t confirmed it.

A few months back when everyone was complaining about Claude subscriptions getting eaten alive, people thought Anthropic had quietly dropped the TTL to 5 minutes without telling anybody. Turns out they didn’t. It’s still 1 hour. But the documentation is split across Claude Code and API pages, which are two very different things, and that’s where a lot of the confusion came from.

The 5-minute number matters if you’re running heavy sub-agent workflows or using the API directly. For 95% of Claude Code users, the 1-hour window is the only one to care about.
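As a quick mental model, the TTLs listed above can be written as a lookup plus an idle-time check. The durations come from the text; the dictionary keys and the function are illustrative labels, not real settings.

```python
from datetime import timedelta

# TTL values as stated in the text; the keys are made-up labels.
TTL = {
    "claude_code_subscription": timedelta(hours=1),
    "api_default": timedelta(minutes=5),
    "sub_agent": timedelta(minutes=5),
}


def cache_is_warm(context: str, idle: timedelta) -> bool:
    """True if the time since the last request is still inside the TTL."""
    return idle < TTL[context]


# A 20-minute coffee break: fine in the main session, too long for a sub-agent.
print(cache_is_warm("claude_code_subscription", timedelta(minutes=20)))  # True
print(cache_is_warm("sub_agent", timedelta(minutes=20)))  # False
```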

Three Habits That Cover 95% of People

Here’s what stood out as actually useful day to day.

1️⃣ **Don’t pause too long.**

If you’ve been idle for over an hour, everything has fallen out of cache. The next message rebuilds it from scratch. Cheaper to hand off to a fresh session than to resume a stale one.

2️⃣ **Start fresh when you switch tasks.**

A /compact or /clear breaks the cache anyway, so use that moment to actually reset.

I built a session handoff skill that’s been my replacement for /compact. It summarizes what we built, the open decisions, the important files, and exactly where to pick back up. Then I /clear, paste the summary, and keep moving like nothing happened.

The compact command can also take a long time to run. The handoff skill typically finishes in under a minute.

3️⃣ **In Claude chat, use Projects for big docs.**

Caching on claude.ai isn’t documented in detail, but Projects are clearly optimized differently than a thread. If you’re going to paste big documents in, drop them into a Project instead of into the conversation.
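The first two habits come down to one comparison: how many tokens get processed uncached when you resume a stale session versus when you start fresh with a handoff summary. A back-of-the-envelope sketch follows. All token counts are hypothetical, and it ignores the cost of writing the summary and any global caching of the system layer.

```python
def fresh_tokens_on_resume(prefix_tokens: int, history_tokens: int) -> int:
    """Stale session: the prefix and the full history are processed uncached."""
    return prefix_tokens + history_tokens


def fresh_tokens_on_handoff(prefix_tokens: int, summary_tokens: int) -> int:
    """New session: only the prefix and a short handoff summary are processed."""
    return prefix_tokens + summary_tokens


# Hypothetical long session: 20k prefix, 150k of history, 2k handoff summary.
print(fresh_tokens_on_resume(20_000, 150_000))  # 170000
print(fresh_tokens_on_handoff(20_000, 2_000))   # 22000
```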

What Quietly Breaks the Cache

A few things will reset everything without warning.

→ **Switching the model.** Each model keeps its own cache, so after a switch the next request reads the entire history with zero cache hits.

→ **“opusplan” mode.** This is the setting that uses Opus during plan mode and Sonnet for execution. I’ve recommended it before in token hacks videos for a reason. But it’s important to understand that each plan toggle is a model switch, which means a fresh cache each time. In the long run it still helps your session limits. Just know what’s happening under the hood.

Editing CLAUDE.md mid-session is fine. The edit doesn’t apply until the next restart, so the live cache stays safe.
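The model-switch behavior can be pictured with a toy cache that is keyed by model. This is a simplified sketch: the class is hypothetical, `"opus"` and `"sonnet"` are placeholder labels rather than real model IDs, and expiry is ignored.

```python
class PrefixCache:
    """Toy cache keyed by model; each model only sees prefixes it wrote itself."""

    def __init__(self) -> None:
        self._longest: dict[str, int] = {}

    def request(self, model: str, context_tokens: int) -> tuple[int, int]:
        """Return (cache_read, fresh) for an append-only context of this size."""
        cached = min(self._longest.get(model, 0), context_tokens)
        self._longest[model] = context_tokens
        return cached, context_tokens - cached


cache = PrefixCache()
print(cache.request("opus", 50_000))    # (0, 50000)     first turn, nothing cached
print(cache.request("opus", 52_000))    # (50000, 2000)  cheap follow-up turn
print(cache.request("sonnet", 54_000))  # (0, 54000)     model switch, zero cache hits
```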

My Free Token Dashboard

The screenshots I’ve been pulling are from a token dashboard.

https://github.com/nateherkai/token-dashboard

It’s a simple GitHub repo. You give the link to Claude Code, tell it to set it up on localhost, and it pulls in all your past sessions. Not a blank slate. You see your input, output, cache create, and cache read numbers per day from the jump.

One caveat: the dashboard tracks tokens on a local device. If you switch from your desktop to a laptop, the numbers won’t match. Each machine has its own picture.
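For a sense of what a per-day rollup involves, here is a minimal sketch. It is not the dashboard’s code. The four counter names mirror the usage fields the Anthropic API reports, but the record layout, the `day` labels, and the numbers are assumptions for illustration.

```python
from collections import defaultdict

FIELDS = ("input_tokens", "output_tokens",
          "cache_creation_input_tokens", "cache_read_input_tokens")


def totals_per_day(rows: list[dict]) -> dict[str, dict[str, int]]:
    """Sum each token counter per day; missing counters count as zero."""
    totals = defaultdict(lambda: dict.fromkeys(FIELDS, 0))
    for row in rows:
        for field in FIELDS:
            totals[row["day"]][field] += row.get(field, 0)
    return dict(totals)


# Hypothetical per-request records.
records = [
    {"day": "day-1", "input_tokens": 1_200, "output_tokens": 900,
     "cache_creation_input_tokens": 20_000, "cache_read_input_tokens": 0},
    {"day": "day-1", "input_tokens": 300, "output_tokens": 700,
     "cache_creation_input_tokens": 1_500, "cache_read_input_tokens": 20_000},
    {"day": "day-2", "input_tokens": 400, "output_tokens": 600,
     "cache_creation_input_tokens": 2_000, "cache_read_input_tokens": 21_500},
]

print(totals_per_day(records)["day-1"]["cache_read_input_tokens"])  # 20000
```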

Wrap

Prompt caching is one of those things you can get extremely deep on. Thariq’s article goes way further than what’s in here, and it’s worth reading if you want the full picture.

But you don’t need the full picture to get the benefit. You need the 80/20: cached tokens are 10x cheaper, the TTL is 1 hour on Claude Code, model switches break the cache, and a clean handoff between tasks beats letting a session rot.
