Why does it feel like big LLM providers are literally hiding prompt caching?
Summary
An article discussing how prompt caching can significantly reduce LLM API costs, pointing out that providers under-explain it and offering a simple rule to structure prompts for maximum cache hits.
Similar Articles
@nateherk: https://x.com/nateherk/status/2057450555212013627
A practical guide explaining how prompt caching works in Claude Code, how it reduces token costs by 90%, and common habits that break the cache, helping developers extend session length and reduce costs.
Explains how prompt caching works in LLMs, using Claude as a case study, detailing the transformer's KV cache mechanism and the cost benefits of caching static prefixes in agentic workflows.
Probing the Prompt KV Cache: Where It Becomes Dispensable
This paper systematically investigates when and which parts of the prompt KV cache become dispensable during LLM decoding, showing that redundancy primarily involves chat template scaffolding rather than task content, and replacement with neutral filler preserves accuracy.
How I easily cut my input token burn ~90% on long agent runs
The author shares a practical tip to reduce input token costs by ~90% on long agent runs using prompt caching: placing unchanged text (system prompt, tool definitions, context) at the start of every prompt to leverage cached prefixes from LLM providers.
Prompt Caching in the API
OpenAI introduces Prompt Caching, an automatic feature that discounts cached input tokens by 50% and improves latency by reusing recently seen prompt prefixes on GPT-4o, GPT-4o mini, o1-preview, and o1-mini models. It applies automatically to prompts of 1,024 tokens or more and requires no changes to existing integrations.
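The rule that recurs across the articles above (keep unchanged text at the start of every prompt) can be illustrated with a small sketch. This is not taken from any of the linked articles: the function names, sample strings, and prompt layout are hypothetical, and the comparison counts characters rather than tokens, so it only approximates how a provider matches cached prefixes.

```python
def build_prompt(system_prompt, tool_definitions, context, user_message):
    """Join the static parts first, in a fixed order; only the tail varies."""
    return "\n\n".join([system_prompt, tool_definitions, context, user_message])


def shared_prefix_len(a, b):
    """Count the leading characters that two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Hypothetical static content that is identical on every request.
SYSTEM = "You are a helpful assistant."
TOOLS = "tools: search(query), read_file(path)"
CONTEXT = "Project notes (long, unchanged text)"

# Static-first ordering: two requests differ only after the static block.
good_1 = build_prompt(SYSTEM, TOOLS, CONTEXT, "Summarize the notes.")
good_2 = build_prompt(SYSTEM, TOOLS, CONTEXT, "List open questions.")

# Anti-pattern: the changing text comes first, so the prefixes diverge at once.
bad_1 = "\n\n".join(["Summarize the notes.", SYSTEM, TOOLS, CONTEXT])
bad_2 = "\n\n".join(["List open questions.", SYSTEM, TOOLS, CONTEXT])

# Length of the static block plus the separator that follows it.
static_len = len("\n\n".join([SYSTEM, TOOLS, CONTEXT])) + 2

print("static-first shared prefix:", shared_prefix_len(good_1, good_2))
print("dynamic-first shared prefix:", shared_prefix_len(bad_1, bad_2))
```

With the static block first, the two requests share the entire static prefix, which is the part a provider-side cache could reuse. With the changing text first, they share nothing, even though most of the content is identical.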