Why does it feel like big LLM providers are literally hiding prompt caching?

Reddit r/artificial 07/01/26, 03:52 PM News

Summary

An article discussing how prompt caching can significantly reduce LLM API costs, pointing out that providers under-explain it and offering a simple rule to structure prompts for maximum cache hits.

I know the info is there. Somewhere in the pricing pages, docs, or API notes. But for something that can seriously change what you pay in production, it is weirdly under-explained. Especially for providers other than OpenAI, which does have a decent explainer here: https://developers.openai.com/api/docs/guides/prompt-caching

So basically: two prompts can look almost identical, but one can be much cheaper to run just because it is ordered better. Put the changing parts too early, like the user query, variables, timestamps, metadata, or anything request-specific, and you break the stable prefix the cache depends on.

The practical rule is simple: keep the repeatable stuff first. Start with system instructions, fixed rules, examples, schemas, and formatting requirements. Then put the dynamic user input and request-specific data near the end.

That is it. Just good prompt structure... But if you run LLMs at scale, this tiny detail can be the difference between insanely expensive LLM usage and a product with actually good ROI.

Full blog post here
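To make the ordering rule concrete, here is a rough sketch in Python. Everything in it is made up for illustration: the prompt text, the function names, and the timestamps are not from any provider's SDK. It also compares characters, while real caches match on tokens and usually only kick in above a minimum prompt length, so treat the numbers as a stand-in for the idea and nothing more.

```python
import os

# Hypothetical static content: identical on every request.
SYSTEM_RULES = "You are a support assistant. Follow the rules below."
EXAMPLES = 'Example: Q: Where is my order? A: {"intent": "order_status"}'
SCHEMA = 'Schema: {"intent": string, "reply": string}'
STATIC_BLOCK = "\n".join([SYSTEM_RULES, EXAMPLES, SCHEMA])


def cache_friendly(user_query: str, timestamp: str) -> str:
    """Stable parts first, request-specific parts last."""
    return "\n".join([STATIC_BLOCK, f"Time: {timestamp}", f"User: {user_query}"])


def cache_hostile(user_query: str, timestamp: str) -> str:
    """Same content, but the timestamp comes first and changes the prefix."""
    return "\n".join([f"Time: {timestamp}", STATIC_BLOCK, f"User: {user_query}"])


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    return len(os.path.commonprefix([a, b]))


if __name__ == "__main__":
    # Two requests that differ only in timestamp and user query.
    req1 = ("Where is my order?", "2026-01-07T15:52")
    req2 = ("Cancel my subscription", "2026-01-07T15:53")

    good = shared_prefix_len(cache_friendly(*req1), cache_friendly(*req2))
    bad = shared_prefix_len(cache_hostile(*req1), cache_hostile(*req2))

    print("static block length:", len(STATIC_BLOCK))
    print("shared prefix, stable parts first:", good)
    print("shared prefix, timestamp first:", bad)
```

With the stable parts first, the shared prefix covers the whole static block. With the timestamp first, the two prompts diverge almost immediately, even though they contain the same text.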

Similar Articles

@nateherk: https://x.com/nateherk/status/2057450555212013627

X AI KOLs Timeline

A practical guide explaining how prompt caching works in Claude Code, how it reduces token costs by 90%, and common habits that break the cache, helping developers extend session length and reduce costs.

Explains how prompt caching works in LLMs, using Claude as a case study, detailing the transformer's KV cache mechanism and the cost benefits of caching static prefixes in agentic workflows.

X AI KOLs

Probing the Prompt KV Cache: Where It Becomes Dispensable

arXiv cs.CL

This paper systematically investigates when and which parts of the prompt KV cache become dispensable during LLM decoding, showing that redundancy primarily involves chat template scaffolding rather than task content, and replacement with neutral filler preserves accuracy.

How I easily cut my input token burn ~90% on long agent runs

Reddit r/AI_Agents

The author shares a practical tip to reduce input token costs by ~90% on long agent runs using prompt caching: placing unchanged text (system prompt, tool definitions, context) at the start of every prompt to leverage cached prefixes from LLM providers.

Prompt Caching in the API

OpenAI Blog

OpenAI introduces Prompt Caching, an automatic feature that reduces API costs by 50% and improves latency by reusing recently cached input tokens on GPT-4o, GPT-4o mini, o1-preview, and o1-mini models. The feature automatically applies to prompts longer than 1,024 tokens without requiring developer integration changes.
