prefix-caching

#prefix-caching

@jino_rohit: over the last 6-8 months, ive been trying to move towards the ml systems and ai infra space. these are some of my favor…

X AI KOLs Timeline ↗ · 2d ago Cached

The author summarizes 6-8 months of work in ML systems and AI infrastructure: a lightweight Python LLM inference engine (tachyon) that reaches 600+ tokens/s on consumer hardware using continuous batching and prefix caching, blog posts on CUDA/CUTE DSL and collective communication, and contributions to SGLang and vLLM.

Introducing RadixAttention to Trellis

Lobsters Hottest ↗ · 2026-06-03 Cached

Trellis adds RadixAttention to speed up the prefill phase of LLM inference: prefix tokens are cached in a radix tree, so chat and agentic sessions that share a prefix do not recompute it.
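The prefix lookup behind this idea can be sketched in a few lines. This is a minimal illustration, not Trellis's implementation: it uses a plain token-level trie (a real radix tree would compress single-child chains into one edge), it tracks only which token prefixes are cached rather than storing any KV tensors, and the names `PrefixNode`, `PrefixCache`, `insert`, and `match_len` are hypothetical.

```python
from typing import Dict, List


class PrefixNode:
    """One node per cached token (uncompressed trie, for illustration only)."""

    def __init__(self) -> None:
        self.children: Dict[int, "PrefixNode"] = {}


class PrefixCache:
    """Tracks which token prefixes have already been prefilled."""

    def __init__(self) -> None:
        self.root = PrefixNode()

    def insert(self, tokens: List[int]) -> None:
        # Walk down the trie, creating nodes for tokens not seen before.
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, PrefixNode())

    def match_len(self, tokens: List[int]) -> int:
        # Length of the longest cached prefix of `tokens`.
        node = self.root
        matched = 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node = child
            matched += 1
        return matched


cache = PrefixCache()
cache.insert([1, 2, 3, 4])      # tokens from an earlier request
request = [1, 2, 3, 9, 10]      # new request sharing a 3-token prefix
hit = cache.match_len(request)
to_prefill = request[hit:]      # only these tokens need fresh prefill work
```

In a serving engine each node would also reference the KV blocks for its tokens, and an eviction policy would prune nodes when memory runs low.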

@m_sirovatka: KV Cache re-use is the most important thing for agentic rollouts. We've integrated Mooncake Store into prime-rl with vL…

X AI KOLs Following ↗ · 2026-06-02 Cached

Mooncake Store has been integrated into prime-rl with vLLM for distributed KV cache reuse, enabling cross-node prefix caching that serves agentic workloads with high token reuse efficiently.

ObjectCache: Layerwise Object-Storage Retrieval for KV Cache Reuse

arXiv cs.AI ↗ · 2026-05-25 Cached

ObjectCache proposes using S3-compatible object storage for LLM KV cache reuse to reduce cost and increase capacity, with a co-designed storage protocol and transfer schedule that minimizes latency overhead. Experiments show it adds only 5.6% latency over local DRAM for 64K contexts.

@GithubProjects: Reasonix is a terminal-based AI coding agent built specifically for DeepSeek, designed to keep token costs low through …

X AI KOLs Timeline ↗ · 2026-05-23 Cached

Reasonix is a terminal-based AI coding agent optimized for DeepSeek models, achieving 99.82% cache hit rate and reducing token costs from ~$61 to ~$12 per workload through stable prefix caching.

Not All Tokens Are Worth Caching: Learning Semantic-Aware Eviction for LLM Prefix Caches

arXiv cs.LG ↗ · 2026-05-20

The paper proposes a semantic-adaptive eviction policy for LLM prefix caches that learns reuse patterns across different token types, improving time to first token (TTFT) by 1.4x-2.7x over existing policies.

@QingQ77: A terminal AI coding agent designed specifically for DeepSeek API prefix caching mechanism, maintaining ultra-low token costs in long sessions through a cache-first architecture. https://github.com/esengine/DeepSeek-Reasonix… Reaso…

X AI KOLs Timeline ↗ · 2026-05-09 Cached

Reasonix is a terminal AI coding agent built around the DeepSeek API's prefix caching mechanism; its cache-first architecture keeps token costs very low in long sessions. In testing, 435 million input tokens cost only about $12, with a cache hit rate of 99.82%.

Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

arXiv cs.LG ↗ · 2026-05-08 Cached

This paper introduces sparse prefix caching for hybrid and recurrent LLMs, which stores recurrent states at a limited set of checkpoint positions to avoid dense caching while minimizing recomputation. The method outperforms standard heuristics on real-world data, especially when requests share substantial but non-identical prefixes.
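The trade-off described here can be illustrated with a small sketch. It is not the paper's algorithm: it assumes the checkpoint positions are already chosen and sorted, and it only computes where a new request could resume and how many shared tokens must be recomputed. A recurrent state cannot be truncated to an arbitrary earlier position, so the request resumes from the latest checkpoint at or before the end of the shared prefix. The function name `resume_point` and the example positions are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple


def resume_point(checkpoints: List[int], shared_prefix_len: int) -> Tuple[int, int]:
    """Return (resume position, tokens to recompute within the shared prefix).

    `checkpoints` holds the token positions where a recurrent state was saved,
    sorted ascending. Position 0 (the empty state) is always available.
    """
    idx = bisect_right(checkpoints, shared_prefix_len)
    start = checkpoints[idx - 1] if idx > 0 else 0
    return start, shared_prefix_len - start


# States were saved at three positions of an earlier request (assumed values).
saved = [256, 512, 1024]
start, recompute = resume_point(saved, 700)   # new request shares 700 tokens
```

Denser checkpoints shrink the recomputed span but cost more memory; choosing their positions is the problem the paper addresses.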
