Tag
Fable 5 model only used 4 prompts and $173 worth of tokens to create a game called 'Super Smart Racing', demonstrating its extremely strong generative capabilities.
NVIDIA's full-stack inference software, codesigned with hardware, has reduced token costs by up to 5x on the Blackwell platform in just one month, enabling lower cost per token for AI factories. Companies like Baseten, Cognition, Deep Infra, and Together AI are using the stack to optimize inference performance.
A report indicates that operating an AI datacenter in orbit costs 8 to 12 times more per token than a terrestrial datacenter, highlighting significant cost barriers for space-based AI computation.
Bellwethr is developing an open methodology for tracking the real USD cost of a single inference token from capable models, with a draft benchmark suite and community contributions underway.
Announcing version 1 of mattpocock/skills, a collection of AI skill definitions that reduces token costs by 63% and introduces new skills for codebase design, domain modeling, and more.
A discussion on how harness designs can reduce token costs by structuring information instead of feeding everything into a language model's context, citing an example of an RLM agent processing many lines of logs with few active tokens.
A GitHub tool that reduces Claude API costs by dynamically adjusting effort/thinking parameters based on prompt complexity.
This paper proposes a unified framework called Efficiency Frontier, which treats large model context management as a deployment optimization problem, jointly modeling task performance, token overhead, and preprocessing reuse. On 5,000 HotpotQA instances, deployment optimization saves 25% of token usage, while memory compression is more than half the cost of full context in high-precision scenarios.
A practical guide explaining how prompt caching works in Claude Code, how it reduces token costs by 90%, and common habits that break the cache, helping developers extend session length and reduce costs.
OpenSquilla has launched an open-source AI agent runtime designed to reduce token costs through intelligent routing, caching, and a four-tier memory architecture, claiming 60-80% cost savings.