OpenAI's GPT-5.5 costs 49–92% more than GPT-5.4 in practice despite claimed token-efficiency improvements, and Anthropic's Claude Opus 4.7 likewise raised effective costs by 12–27% for longer prompts. Both moves reflect a broader trend of rising frontier-model prices as the two companies face massive projected losses.
This paper presents empirical measurements of information density in web pages from the perspective of LLM agents, using a curated benchmark of 100 URLs across five categories. It finds that structural extraction reduces token count by an average of 71.5% while preserving answer quality, and reveals an undocumented compression layer in Claude Code.
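A minimal sketch of the kind of measurement described, assuming the pipeline is roughly "strip a page to its structural text and compare token counts"; the benchmark, the actual extractor, and the exact metric are the paper's own and not reproduced here.

```python
# Sketch: compare token counts of raw HTML vs. structurally extracted text.
# Assumes tiktoken and BeautifulSoup; the paper's real extractor may differ.
import tiktoken
from bs4 import BeautifulSoup

enc = tiktoken.get_encoding("cl100k_base")

def token_count(text: str) -> int:
    return len(enc.encode(text))

def structural_extract(html: str) -> str:
    """Keep headings, paragraphs, and list items; drop scripts, styles, chrome."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    parts = [el.get_text(" ", strip=True)
             for el in soup.find_all(["h1", "h2", "h3", "p", "li"])]
    return "\n".join(p for p in parts if p)

def reduction(html: str) -> float:
    raw, extracted = token_count(html), token_count(structural_extract(html))
    return 1 - extracted / raw  # 0.715 would match the reported 71.5% average
```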
GitHub improved token efficiency in their agentic workflows by logging token usage via an API proxy and building daily optimization workflows, reducing overhead from unused MCP tool registrations.
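GitHub's actual proxy isn't public; the sketch below illustrates the idea, assuming an OpenAI-style chat completions endpoint that returns a `usage` object. The upstream URL, auth, and log path are placeholders.

```python
# Sketch: thin wrapper that forwards a chat request and logs token usage as JSONL.
import json, os, time
import requests

UPSTREAM = "https://api.openai.com/v1/chat/completions"
LOG_PATH = "token_usage.jsonl"

def proxied_chat(payload: dict) -> dict:
    resp = requests.post(
        UPSTREAM,
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json=payload,
        timeout=60,
    )
    resp.raise_for_status()
    body = resp.json()
    usage = body.get("usage", {})  # prompt_tokens / completion_tokens / total_tokens
    with open(LOG_PATH, "a") as f:
        f.write(json.dumps({"ts": time.time(),
                            "model": payload.get("model"),
                            **usage}) + "\n")
    return body
```

A daily job can then aggregate the JSONL log to surface overhead such as prompts inflated by unused MCP tool registrations.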
BACR introduces adaptive token budgeting and curriculum-aware scheduling to prevent LLMs from overthinking easy problems and underthinking hard ones, cutting token use by 34% while boosting accuracy by up to 8.3%.
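BACR's scheduler isn't shown in the summary; as a hedged illustration of adaptive token budgeting in general, one could cap the reasoning budget with a difficulty estimate. The heuristic and budget tiers below are invented for illustration, not BACR's method.

```python
# Sketch: pick a reasoning-token budget from a rough difficulty estimate.
def estimate_difficulty(problem: str) -> float:
    """Crude proxy: longer problems with more numbers/operators score higher (0..1)."""
    signals = sum(ch.isdigit() or ch in "+-*/=^" for ch in problem)
    return min(1.0, (len(problem.split()) + 4 * signals) / 400)

def token_budget(problem: str, lo: int = 256, hi: int = 4096) -> int:
    d = estimate_difficulty(problem)
    return int(lo + d * (hi - lo))  # easy -> small budget, hard -> large budget
```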
MiMo-V2.5 & Pro introduce frontier agent capabilities with improved token efficiency.
Ling-2.6-flash is a 104B-total/7.4B-active sparse instruct model optimized for token efficiency, aiming to cut costs and boost throughput on agent tasks.
TACO introduces a self-evolving compression framework that automatically learns to shrink redundant terminal interaction history, cutting token overhead by ~10% while boosting accuracy by 1–4% across TerminalBench and other code-agent benchmarks.
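TACO learns its compressor; a hand-written heuristic aimed at the same goal (shrinking redundant terminal history) might deduplicate repeated steps and truncate long outputs, as in the sketch below. This is illustrative only, not TACO's learned policy.

```python
# Sketch: heuristic compression of terminal interaction history.
def compress_history(history: list[dict], max_output_lines: int = 20) -> list[dict]:
    """history: [{'cmd': str, 'output': str}, ...] in chronological order."""
    compressed, seen = [], set()
    for step in history:
        key = (step["cmd"], step["output"][:200])
        if key in seen:                       # drop exact repeats of cmd + output prefix
            continue
        seen.add(key)
        lines = step["output"].splitlines()
        if len(lines) > max_output_lines:     # keep head and tail of long outputs
            half = max_output_lines // 2
            lines = lines[:half] + ["... [truncated] ..."] + lines[-half:]
        compressed.append({"cmd": step["cmd"], "output": "\n".join(lines)})
    return compressed
```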
YourMemory is an MCP-based memory tool that uses self-pruning to reduce token waste by up to 84%, improving efficiency in AI context management.
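YourMemory's pruning policy isn't described beyond "self-pruning"; a generic sketch of pruning stored memories down to a token budget follows. The scoring function and budget are assumptions, not the tool's actual behavior.

```python
# Sketch: prune memories to a token budget by recency-weighted usefulness.
import time

def prune(memories: list[dict], budget_tokens: int = 2000) -> list[dict]:
    """memories: [{'text': str, 'tokens': int, 'last_used': float, 'hits': int}, ...]"""
    now = time.time()
    def score(m):
        age_days = (now - m["last_used"]) / 86400
        return m["hits"] / (1 + age_days)      # frequently and recently used wins
    kept, used = [], 0
    for m in sorted(memories, key=score, reverse=True):
        if used + m["tokens"] <= budget_tokens:
            kept.append(m)
            used += m["tokens"]
    return kept
```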
This paper proposes a reinforcement learning framework that improves LLM reasoning efficiency by modeling token significance to selectively penalize unimportant tokens while preserving essential reasoning, using both significance-aware and dynamic length rewards to reduce verbosity without sacrificing accuracy.
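The summary doesn't give the exact reward, but the shape of the idea can be sketched: a task reward minus a penalty on low-significance tokens and a length term that only kicks in past a target. The weights are placeholders, and the per-token significance scores would come from the paper's significance model.

```python
# Sketch: reward = task reward - penalty on low-significance tokens - length term.
def shaped_reward(task_reward: float,
                  significances: list[float],   # s_i in [0, 1], one per generated token
                  target_len: int,
                  alpha: float = 0.01,
                  beta: float = 0.001) -> float:
    verbosity_penalty = alpha * sum(1.0 - s for s in significances)  # unimportant tokens cost more
    length_penalty = beta * max(0, len(significances) - target_len)  # only beyond the target length
    return task_reward - verbosity_penalty - length_penalty
```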
A developer shares their experience with Recurrent Language Models (RLMs), claiming they effectively handle extremely long contexts of tens of millions of tokens, a significant advance in context handling.
This paper introduces GenericAgent, a self-evolving LLM agent system designed to maximize context information density. It addresses long-horizon limitations through hierarchical memory, reusable SOPs, and efficient compression, achieving better performance with fewer tokens compared to leading agents.
OpenAI introduces Prompt Caching, an automatic feature that cuts costs and latency by reusing recently seen input tokens, offering a 50% discount on cached input tokens for GPT-4o, GPT-4o mini, o1-preview, and o1-mini. It applies automatically to prompts longer than 1,024 tokens and requires no integration changes.
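Because caching matches on the prompt prefix, the practical takeaway is to keep static content (system prompt, tool definitions) at the front and per-request content at the end. A minimal sketch with the official `openai` Python client; the model name and prompt text are placeholders.

```python
# Sketch: structure requests so the long static prefix is cache-friendly.
# Caching applies automatically to prompts over 1,024 tokens; no extra flags needed.
from openai import OpenAI

client = OpenAI()

STATIC_SYSTEM_PROMPT = "You are a helpful assistant. <long, unchanging instructions go here>"

def ask(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": STATIC_SYSTEM_PROMPT},  # stable prefix, cacheable
            {"role": "user", "content": question},                # varies per call
        ],
    )
    return resp.choices[0].message.content
```

The response's `usage` object reports how many input tokens were served from cache, which makes the savings easy to verify per request.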