An article discussing how prompt caching can significantly reduce LLM API costs, pointing out that providers under-explain it and offering a simple rule to structure prompts for maximum cache hits.
Claude is now officially available on Microsoft Foundry, letting Azure customers use it directly with their existing authentication, billing, and compliance setup. The initial rollout includes Claude Opus 4.8 and Haiku 4.5, with support for prompt caching and extended thinking.
Alex, a new LangChain team member, published an article explaining how Deep Agents uses prompt caching to reduce API costs.
Anthropic released Fable 5, a powerful but expensive new model. Because agents fan out into many model calls and output tokens are priced high, cost-aware routing becomes essential for agent builders.
The author shares a practical tip to reduce input token costs by ~90% on long agent runs using prompt caching: placing unchanged text (system prompt, tool definitions, context) at the start of every prompt to leverage cached prefixes from LLM providers.
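The prefix rule can be illustrated with a small Python sketch. The prompt pieces and function names are hypothetical, and the shared-prefix length is measured in characters as a stand-in for tokens. Real providers match token prefixes and only cache above a minimum length (OpenAI, for example, starts at 1,024 tokens), and Anthropic's API additionally expects explicit `cache_control` breakpoints.

```python
from os.path import commonprefix

# Hypothetical prompt pieces: these three never change between calls.
SYSTEM_PROMPT = "You are a coding agent. Follow the repository style guide."
TOOL_DEFINITIONS = "Tools: read_file(path), write_file(path, text), run_tests()"
PROJECT_CONTEXT = "Repository summary: a small Python web service."


def build_prompt(user_turn: str) -> str:
    """Unchanging text first, per-request text last, so prompts share a long prefix."""
    stable_prefix = "\n".join([SYSTEM_PROMPT, TOOL_DEFINITIONS, PROJECT_CONTEXT])
    return stable_prefix + "\n" + user_turn


def build_prompt_badly(user_turn: str) -> str:
    """Anti-pattern: the varying text comes first, so prompts diverge immediately."""
    return "\n".join([user_turn, SYSTEM_PROMPT, TOOL_DEFINITIONS, PROJECT_CONTEXT])


def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the identical leading span (a character-level proxy for cacheable tokens)."""
    return len(commonprefix([a, b]))


first, second = "Fix the failing test.", "Add logging to the handler."
good = shared_prefix_chars(build_prompt(first), build_prompt(second))
bad = shared_prefix_chars(build_prompt_badly(first), build_prompt_badly(second))
print(good, bad)
```

With the stable text first, everything up to the user turn is reusable; with the user turn first, the two prompts share nothing.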
A comparison of token consumption across four agent runtimes (Claude Code, OpenClaw, Hermes, and OpenClacky) on the same tasks reveals costs ranging from 0.8x to 4x relative to Claude Code, driven by differences in cache architecture and tool schema design.
This article introduces practical techniques to cut AI coding costs by 80%, including prompt caching, context trimming, multi-model routing (using Kimi 2.6 for daily coding tasks and advanced models for core architecture), and more.
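Multi-model routing can be sketched as choosing the cheapest model that meets a task's capability requirement. Everything here is hypothetical: the model names, tiers, per-million-token prices, and the task-to-tier table are illustrative placeholders, not Kimi's or any provider's actual catalog or pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str
    tier: int                # higher means more capable (illustrative scale)
    usd_per_mtok_in: float   # hypothetical input price per million tokens
    usd_per_mtok_out: float  # hypothetical output price per million tokens


# Hypothetical catalog, for illustration only.
MODELS = [
    Model("cheap-coder", tier=1, usd_per_mtok_in=0.5, usd_per_mtok_out=2.0),
    Model("mid-generalist", tier=2, usd_per_mtok_in=3.0, usd_per_mtok_out=15.0),
    Model("frontier-architect", tier=3, usd_per_mtok_in=15.0, usd_per_mtok_out=75.0),
]

# Hypothetical minimum tier per task type.
REQUIRED_TIER = {"rename": 1, "unit_test": 1, "refactor": 2, "architecture": 3}


def estimated_cost(model: Model, in_tok: int, out_tok: int) -> float:
    return (in_tok * model.usd_per_mtok_in + out_tok * model.usd_per_mtok_out) / 1_000_000


def route(task_type: str, in_tok: int, out_tok: int) -> Model:
    """Cheapest model whose tier is high enough; unknown tasks default to the top tier."""
    need = REQUIRED_TIER.get(task_type, 3)
    eligible = [m for m in MODELS if m.tier >= need]
    return min(eligible, key=lambda m: estimated_cost(m, in_tok, out_tok))


print(route("unit_test", 20_000, 2_000).name)
print(route("architecture", 20_000, 2_000).name)
```

Defaulting unknown tasks to the strongest tier trades cost for safety; a real router might instead classify the task first with a cheap model.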
This thread shares strategies to reduce token usage in AI agents, including prompt caching, context summarization, using smaller models, trimming tool outputs, subagents, RAG, and tight system prompts.
A practical guide explaining how prompt caching works in Claude Code, how it cuts input-token costs by up to 90%, and which common habits break the cache, helping developers extend sessions and reduce costs.
A practical guide listing 10 strategies to reduce costs when using LLM APIs, including model selection, prompt caching, batch processing, and monitoring expenses.
Explains Cache-Augmented Generation (CAG), a method that preloads static knowledge into the model's key-value (KV) cache to cut latency and cost compared with traditional RAG, and shows how to combine the two approaches for the best performance.
The article argues that the real challenge in AI isn't just building smarter models but making them cost-efficient at scale, highlighting the importance of reducing token usage, improving speed, and optimizing infrastructure.
An analysis of Anthropic's prompt caching costs for Claude derives a 62.5-minute break-even rule: refresh the cache if you expect to need it again within that time, otherwise let it expire to save costs.
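One derivation that reproduces the 62.5-minute figure, assuming Anthropic's published multipliers for the default 5-minute cache (writes at 1.25× and reads at 0.1× the base input price) and that each cache hit resets the 5-minute lifetime: keeping a prefix warm costs one 0.1× read every 5 minutes, while letting it lapse costs one 1.25× rewrite, so the two break even after (1.25 / 0.1) × 5 = 62.5 minutes. The article's own derivation may differ, and this sketch ignores the non-cached tokens and output of each keep-alive request.

```python
# Assumed multipliers relative to the base input-token price, matching
# Anthropic's published rates for the default 5-minute cache.
CACHE_WRITE_MULT = 1.25  # cost of writing a prefix into the cache
CACHE_READ_MULT = 0.10   # cost of reading a cached prefix (a hit also resets the TTL)
TTL_MINUTES = 5.0        # cache lifetime after the last hit


def keep_warm_cost(gap_minutes: float) -> float:
    """Approximate cost per cached token of keep-alive reads, one per TTL, over the gap."""
    return gap_minutes / TTL_MINUTES * CACHE_READ_MULT


def break_even_minutes() -> float:
    """Gap at which keep-alive reads cost as much as one fresh cache write."""
    return CACHE_WRITE_MULT / CACHE_READ_MULT * TTL_MINUTES


def should_refresh(expected_gap_minutes: float) -> bool:
    """Keep the cache warm only if the next real use comes before the break-even gap."""
    return expected_gap_minutes < break_even_minutes()


print(break_even_minutes())  # 62.5
```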
Anthropic's Head of Product released a free 28-minute masterclass on putting AI agents into production, covering prompt caching, tool search, programmatic tool calling, compaction, and advisor strategy.
A tutorial by Vasco Schiavo explaining the math behind the cost of AI agents, focusing on why agents can be expensive and the importance of prompt caching.
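To show why agents get expensive, here is a simple cost model (an assumption, not necessarily the tutorial's): each turn resends the whole conversation, so the context at turn k is S + k·d tokens and total input grows quadratically, N·S + d·N(N−1)/2. The cached variant assumes the previous turn's full prompt is always a cache hit (read at 0.1×) and only the new tokens are written (at 1.25×), using Anthropic's multipliers; output tokens and cache expiry are ignored.

```python
def uncached_input_tokens(static: int, per_turn: int, turns: int) -> int:
    """Total input tokens when every turn resends the full, growing context."""
    return sum(static + k * per_turn for k in range(turns))


def cached_input_cost(static, per_turn, turns, write=1.25, read=0.10):
    """Input cost in base-price token units, assuming the prior prompt is always cached."""
    cost = 0.0
    previous = 0  # tokens already in the cache from the prior turn
    for k in range(turns):
        context = static + k * per_turn
        new_tokens = context - previous
        cost += previous * read + new_tokens * write
        previous = context
    return cost


S, d, N = 10_000, 2_000, 50  # illustrative sizes: static prefix, growth per turn, turns
plain = uncached_input_tokens(S, d, N)
assert plain == N * S + d * N * (N - 1) // 2  # closed form: grows with N squared
cached = cached_input_cost(S, d, N)
print(plain, round(cached), f"{cached / plain:.0%}")
```

Caching does not remove the quadratic growth, but it shrinks the coefficient on the repeated part by roughly 10×.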
A new optimization technique for open-source RL training engines introduces prompt caching during training, achieving up to 7.5x speedup on long-prompt, short-response workloads by reducing redundant compute.
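A back-of-envelope token-count model, not the technique's actual measurement, suggests why long-prompt, short-response workloads benefit most. Assume each prompt of P tokens is sampled G times (as in group-based RL methods such as GRPO) with responses of R tokens, and that compute is proportional to tokens processed. Without prefix reuse the engine processes G(P + R) tokens; reusing the prompt's computation leaves P + G·R, so the speedup approaches G as P grows relative to R. Real speedups also depend on attention cost, batching, and memory.

```python
def token_speedup(prompt_tokens: int, response_tokens: int, group_size: int) -> float:
    """Ratio of tokens processed without vs. with reuse of the shared prompt."""
    without_reuse = group_size * (prompt_tokens + response_tokens)
    with_reuse = prompt_tokens + group_size * response_tokens
    return without_reuse / with_reuse


# Illustrative (hypothetical) workloads, with G = 8 samples per prompt.
for p, r in [(8_000, 200), (2_000, 2_000), (200, 8_000)]:
    print(p, r, round(token_speedup(p, r, 8), 2))
```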
OpenClaw is a CLI tool that supports Anthropic Claude models via API key or Claude CLI reuse, with features including adaptive thinking defaults for Claude 4.6, fast mode service tier toggling, and configurable prompt caching. Anthropic has reportedly re-allowed OpenClaw-style Claude CLI usage.
OpenAI introduces Prompt Caching, an automatic feature that gives a 50% discount on recently cached input tokens and lowers latency on GPT-4o, GPT-4o mini, o1-preview, and o1-mini. It applies to prompts of 1,024 tokens or longer and requires no integration changes from developers.
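The cached-token arithmetic can be sketched as follows. The 1,024-token minimum and the 50% discount come from the summary above; the 128-token increments in which cache hits are counted reflect OpenAI's prompt caching guide and should be checked against the current docs. The `matching_prefix_tokens` input and the price parameter are hypothetical: in practice the API reports the cached count itself, in `usage.prompt_tokens_details.cached_tokens` for Chat Completions.

```python
MIN_CACHEABLE_TOKENS = 1_024  # prompts shorter than this are never cached
CACHE_INCREMENT = 128         # assumed granularity of cache hits beyond the minimum
CACHED_DISCOUNT = 0.5         # cached input tokens billed at half price


def cached_tokens(prompt_tokens: int, matching_prefix_tokens: int) -> int:
    """Tokens billed as cached, given how much of the prompt matches a cached prefix."""
    usable = min(prompt_tokens, matching_prefix_tokens)
    if usable < MIN_CACHEABLE_TOKENS:
        return 0
    extra = (usable - MIN_CACHEABLE_TOKENS) // CACHE_INCREMENT * CACHE_INCREMENT
    return MIN_CACHEABLE_TOKENS + extra


def input_cost_usd(prompt_tokens: int, cached: int, usd_per_mtok: float) -> float:
    """Input cost with cached tokens discounted; usd_per_mtok is a placeholder price."""
    billable = (prompt_tokens - cached) + cached * (1 - CACHED_DISCOUNT)
    return billable * usd_per_mtok / 1_000_000


hit = cached_tokens(3_000, 2_500)  # 2432 tokens billed at the discount
print(hit, input_cost_usd(3_000, hit, usd_per_mtok=2.0))
```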
Explains how prompt caching works in LLMs, using Claude as a case study, detailing the transformer's KV cache mechanism and the cost benefits of caching static prefixes in agentic workflows.
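A toy, single-head attention in pure Python shows the KV-cache idea: keys and values for earlier tokens are computed once and reused, so a stored prefix (such as a static system prompt) lets a new request skip recomputing it. This is an illustrative sketch with random weights, one head, one layer, and no positional encoding; it is not how Claude or any production inference engine is implemented.

```python
import copy
import math
import random

D = 4  # toy embedding size
random.seed(0)


def rand_matrix(rows, cols):
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Random projection weights for queries, keys, and values.
W_Q, W_K, W_V = (rand_matrix(D, D) for _ in range(3))


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over the given keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(D)]


def causal_outputs_no_cache(xs):
    """Recompute every key and value at each step (the work a KV cache avoids)."""
    outputs = []
    for t in range(len(xs)):
        keys = [matvec(W_K, x) for x in xs[: t + 1]]
        values = [matvec(W_V, x) for x in xs[: t + 1]]
        outputs.append(attend(matvec(W_Q, xs[t]), keys, values))
    return outputs


class KVCache:
    """Stores each token's key and value so later steps only project the new token."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        self.keys.append(matvec(W_K, x))
        self.values.append(matvec(W_V, x))
        return attend(matvec(W_Q, x), self.keys, self.values)


tokens = rand_matrix(6, D)  # stand-ins for token embeddings
shared_prefix, suffix = tokens[:4], tokens[4:]

prefix_cache = KVCache()
for x in shared_prefix:  # processed once, like a static system prompt
    prefix_cache.step(x)

request = copy.deepcopy(prefix_cache)  # a new request starts from the stored prefix
cached_out = [request.step(x) for x in suffix]
reference = causal_outputs_no_cache(tokens)[4:]
print(max(abs(a - b) for o, r in zip(cached_out, reference) for a, b in zip(o, r)))
```

The printed difference should be zero or negligible: reusing the stored keys and values gives the same outputs as recomputing them, at a fraction of the work.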