Tag: caching

#caching

Introducing RadixAttention to Trellis

Lobsters Hottest · yesterday Cached

Trellis introduces RadixAttention, which optimizes the prefill phase of LLM inference by caching shared token prefixes in a radix tree, reducing redundant computation in chat and agentic sessions.

#caching

Sharing the result of a single Windows Runtime IAsyncOperation among multiple coroutines, part 3

The Old New Thing (Raymond Chen) · 6d ago Cached

The article discusses a C++/WinRT pattern for caching the result of a Windows Runtime IAsyncOperation, including handling failures, so that multiple coroutines can share the cached result or exception.

#caching

Stateful Inference for Low-Latency Multi-Agent Tool Calling

arXiv cs.LG · 2026-05-27 Cached

This paper presents a stateful inference architecture for multi-agent tool calling that reuses KV cache across turns and employs speculative decoding, achieving 2.1x-4.2x speedup over vLLM and SGLang on agentic workflows.

#caching

DeepSeek reasonix, DeepSeek native coding agent with high caching and low cost

Hacker News Top · 2026-05-24

DeepSeek releases a native coding agent called DeepSeek reasonix, featuring high caching and low cost.

#caching

No Slop Grenade

Hacker News Top · 2026-05-21 Cached

A comparison between Redis and Memcached covering data structures, performance, scalability, and operational considerations to help choose the right caching solution.

#caching

@lateinteraction: Agents often externalize some context: a repository in coding agents, a corpus in RAG, and the user prompt in an RLM. N…

X AI KOLs Following · 2026-05-20 Cached

New research by Joshua Gu shows that AI agents perform better when they manage a small buffer in their context window as a cache for external context, challenging the common practice of pushing context entirely out of the prompt.

#caching

What FinOps tools and tactics actually work for large AI agent operations?

Reddit r/AI_Agents · 2026-05-19

A discussion on effective FinOps strategies for managing costs in large-scale AI agent operations, covering tactics like model routing, prompt trimming, caching, and the need to track cost by agent, workflow, and customer.

#caching

PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

Hugging Face Daily Papers · 2026-05-19 Cached

This paper introduces PEEK, a system that caches orientation knowledge about recurring external contexts as a context map, enabling LLM agents to reuse context knowledge across invocations and significantly improving efficiency and accuracy on long-context reasoning and information aggregation tasks.

#caching

FediMeteo, HAProxy, and the art of not wasting snac threads

Lobsters Hottest · 2026-05-18 Cached

The author describes using HAProxy caching to reduce unnecessary load on snac threads in the FediMeteo service, following previous similar optimizations with nginx. The approach aims to keep the lightweight ActivityPub server efficient by having the reverse proxy absorb repeated public requests.

#caching

@DeRonin_: anybody who uses or learns agentic systems, SHOULD READ THIS the install order I run before any new agentic project: 1.…

X AI KOLs Following · 2026-05-17 Cached

A thread sharing a structured install order for agentic projects: direnv with a secrets manager for credential safety, litellm or portkey as a model proxy for cost and fallback management, uv plus git commits on passing evals for reproducibility, and mitmproxy for full observability of LLM calls. The thread also highlights common failure modes and security gaps.

#caching

KV Cache Is Becoming the Memory Hierarchy of Inference

Hacker News Top · 2026-05-17 Cached

The article discusses how the KV cache is evolving into a memory hierarchy for LLM inference, optimizing memory management during decoding.

#caching

@Akintola_steve: https://x.com/Akintola_steve/status/2055620856802357587

X AI KOLs Timeline · 2026-05-16 Cached

A practical blueprint for designing a backend system capable of handling 1 million concurrent users, covering architecture decisions like language selection, load balancing, database sharding, multi-layer caching, and resilience patterns.

#caching

@lmstudio: Batching for vision models is now available in Beta with our latest MLX engine update The updated engine also brings ma…

X AI KOLs Following · 2026-05-14 Cached

LM Studio announces a beta update to its MLX engine, introducing batching for vision models and improved caching for faster inference.
