Tag
Trellis introduces RadixAttention to optimize the prefill phase of LLM inference by caching prefix tokens in a radix tree, reducing redundant computation in chat and agentic sessions.
The article discusses a C++/WinRT pattern for caching the result of a Windows Runtime IAsyncOperation, including handling failures, so that multiple coroutines can share the cached result or exception.
This paper presents a stateful inference architecture for multi-agent tool calling that reuses KV cache across turns and employs speculative decoding, achieving 2.1x-4.2x speedup over vLLM and SGLang on agentic workflows.
DeepSeek releases a native coding agent called DeepSeek reasonix, which emphasizes heavy caching and low cost.
A comparison between Redis and Memcached covering data structures, performance, scalability, and operational considerations to help choose the right caching solution.
New research by Joshua Gu shows that AI agents perform better when they manage a small buffer in their context window as a cache for external context, challenging the common practice of pushing context entirely out of the prompt.
A discussion on effective FinOps strategies for managing costs in large-scale AI agent operations, covering tactics like model routing, prompt trimming, caching, and the need to track cost by agent, workflow, and customer.
This paper introduces PEEK, a system that caches orientation knowledge about recurring external contexts as a context map, enabling LLM agents to reuse context knowledge across invocations and significantly improving efficiency and accuracy on long-context reasoning and information aggregation tasks.
The author describes using HAProxy caching to reduce unnecessary load on snac threads in the FediMeteo service, following previous similar optimizations with nginx. The approach aims to keep the lightweight ActivityPub server efficient by having the reverse proxy absorb repeated public requests.
A thread sharing a structured install order for agentic projects: direnv with a secrets manager for credential safety, litellm or portkey as a model proxy for cost and fallback management, uv plus git commits on passing evals for reproducibility, and mitmproxy for full observability of LLM calls. It also highlights common failure modes and security gaps.
The article discusses how the KV cache is evolving into a memory hierarchy for LLM inference, optimizing memory management during decoding.
A practical blueprint for designing a backend system capable of handling 1 million concurrent users, covering architecture decisions like language selection, load balancing, database sharding, multi-layer caching, and resilience patterns.
LM Studio announces a beta update to its MLX engine, introducing batching for vision models and improved caching for faster inference.