kv-cache

Tag

Cards List
#kv-cache

@polydao: This Stanford lecture on AI inference will teach you more about how LLMs work in production than most ML courses > Clau…

X AI KOLs Timeline · 16h ago

A Stanford lecture on AI inference emphasizes practical bottlenecks such as the KV cache and techniques such as speculative decoding and continuous batching, offering more real-world insight than typical ML courses.
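
For a flavor of one technique the lecture covers, here is a toy greedy speculative-decoding loop: a cheap draft model proposes a few tokens and the target model keeps the longest agreeing prefix. The `draft_next` / `target_next` callables are placeholders invented for this sketch, not anything from the lecture.

```python
# Toy greedy speculative decoding: a small draft model proposes K tokens,
# the target model verifies them, and we keep the longest agreeing prefix
# plus one corrected token. `draft_next` and `target_next` are stand-in
# callables, not part of any real API.

def speculative_step(prefix, draft_next, target_next, k=4):
    # 1. Draft K candidate tokens autoregressively with the cheap model.
    draft = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)            # cheap model's greedy next token
        draft.append(t)
        ctx.append(t)

    # 2. Verify: the target model scores each candidate position
    #    (in a real system this is a single batched forward pass).
    accepted = []
    ctx = list(prefix)
    for t in draft:
        target_t = target_next(ctx)    # expensive model's greedy next token
        if target_t == t:
            accepted.append(t)         # draft and target agree: keep it
            ctx.append(t)
        else:
            accepted.append(target_t)  # disagreement: take target's token, stop
            break
    return accepted

# Example with trivial stand-in models.
if __name__ == "__main__":
    draft_next = lambda ctx: len(ctx) % 5
    target_next = lambda ctx: len(ctx) % 5 if len(ctx) < 7 else 0
    print(speculative_step([1, 2, 3], draft_next, target_next))
```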

#kv-cache

When Does Value-Aware KV Eviction Help? A Fixed-Contract Diagnostic for Non-Monotone Cache Compression

arXiv cs.LG · 2d ago Cached

This paper introduces a fixed-contract diagnostic tool to analyze why KV cache compression methods succeed or fail in long-context LLM inference. It identifies three failure modes—missing evidence, scoring irrelevant tokens, and breaking related evidence—and evaluates them on LongBench and NeedleBench.

#kv-cache

ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing

arXiv cs.CL · 2d ago Cached

This paper introduces ReST-KV, a novel method for robust KV cache eviction in large language models that uses layer-wise output reconstruction and spatial-temporal smoothing to improve efficiency. The method significantly reduces decoding latency and outperforms state-of-the-art baselines on long-context benchmarks like LongBench and RULER.
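
As a rough illustration of the general pattern (score cached tokens, smooth the scores across decode steps, evict the lowest under a budget), a minimal sketch follows. It is not ReST-KV's actual algorithm; the attention-based scores and EMA smoothing are assumptions made for the example.

```python
import numpy as np

# Illustrative only: score each cached token (here by recent attention mass),
# smooth the scores over decode steps with an exponential moving average,
# and evict the lowest-scoring tokens when the cache exceeds its budget.
# This mimics the *spirit* of temporally smoothed eviction, not ReST-KV itself.

def smooth_and_evict(attn_weights, ema_scores, keep, alpha=0.9):
    """attn_weights: (cache_len,) attention from the newest query to each cached token.
    ema_scores:   (cache_len,) running smoothed importance scores.
    keep:         cache budget (number of tokens to retain)."""
    ema_scores = alpha * ema_scores + (1.0 - alpha) * attn_weights
    if len(ema_scores) <= keep:
        return np.arange(len(ema_scores)), ema_scores
    keep_idx = np.argsort(ema_scores)[-keep:]   # highest smoothed scores survive
    keep_idx = np.sort(keep_idx)                # preserve positional order
    return keep_idx, ema_scores[keep_idx]

# Usage: cache of 6 tokens, budget of 4.
attn = np.array([0.05, 0.30, 0.02, 0.40, 0.03, 0.20])
scores = np.zeros(6)
idx, scores = smooth_and_evict(attn, scores, keep=4)
print(idx)  # positions retained under the budget -> [0 1 3 5]
```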

#kv-cache

Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

Hugging Face Daily Papers · 2d ago Cached

Orthrus is a dual-architecture framework that combines autoregressive LLMs with diffusion models for fast parallel token generation while maintaining exact inference fidelity via shared KV caches and consensus mechanisms, achieving up to 7.8x speedup.

#kv-cache

Why your current hardware will choke on 2026 Multi-Agent workflows (Mac Studio vs. RTX 5090)

Reddit r/ArtificialInteligence · 2d ago

A comparison of hardware requirements for running multi-agent AI workflows locally, highlighting VRAM and KV-cache constraints.
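
The VRAM point can be made concrete with the standard KV-cache sizing arithmetic; the model shape below is a generic 70B-class example chosen for illustration, not taken from the post.

```python
# Standard KV-cache size estimate:
#   bytes = 2 (K and V) * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem
# The model shape below is a generic illustration, not from the Reddit post.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# e.g. a 70B-class model with GQA: 80 layers, 8 KV heads, head_dim 128, fp16 cache
gib = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=128_000) / 2**30
print(f"{gib:.1f} GiB of KV cache at 128K context")  # ~39.1 GiB
```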

#kv-cache

Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

arXiv cs.LG · 3d ago Cached

This paper introduces Louver, a novel index structure for KV cache retrieval that reformulates sparse attention as a range searching problem, guaranteeing zero false negatives and improving efficiency over existing methods.
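
To show what "retrieval as range searching" can look like in the simplest possible form, here is a toy 1-D index built on Python's bisect; the scalar projection and the query-dependent range are invented for the example and are not Louver's construction.

```python
import bisect

# Toy illustration of "sparse attention as range searching": project every
# cached key to a scalar, keep the projections sorted, and at decode time
# fetch only the keys whose projection falls inside a query-dependent range.
# The projection and the range bound are made up for this example.

class RangeIndex:
    def __init__(self):
        self.proj = []   # sorted scalar projections
        self.ids = []    # token ids aligned with self.proj

    def insert(self, token_id, projection):
        pos = bisect.bisect_left(self.proj, projection)
        self.proj.insert(pos, projection)
        self.ids.insert(pos, token_id)

    def query(self, lo, hi):
        """Return all token ids whose projection lies in [lo, hi]."""
        left = bisect.bisect_left(self.proj, lo)
        right = bisect.bisect_right(self.proj, hi)
        return self.ids[left:right]

index = RangeIndex()
for tid, p in [(0, 0.1), (1, 0.7), (2, 0.4), (3, 0.9)]:
    index.insert(tid, p)
print(index.query(0.3, 0.8))  # -> [2, 1]: only keys inside the query's range
```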

#kv-cache

WiCER: Wiki-memory Compile, Evaluate, Refine Iterative Knowledge Compilation for LLM Wiki Systems

arXiv cs.CL · 3d ago Cached

The paper introduces WiCER, an iterative algorithm for compiling domain knowledge into LLM Wiki systems to minimize information loss and catastrophic failure rates during knowledge distillation. It demonstrates that this approach improves upon full-context KV cache inference by preserving critical facts better than blind compilation methods.

#kv-cache

LKV: End-to-End Learning of Head-wise Budgets and Token Selection for LLM KV Cache Eviction

arXiv cs.LG · 3d ago Cached

This paper introduces LKV, a method for end-to-end learning of head-wise budgets and token selection to optimize KV cache eviction in large language models, achieving state-of-the-art performance with high compression rates.
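
A toy version of the two ingredients (head-wise budgets plus per-head token selection) might look like the sketch below; the importance scores are random placeholders here, whereas LKV learns budgets and selection end to end.

```python
import numpy as np

# Toy head-wise budgeting: split a total KV budget across attention heads in
# proportion to per-head importance scores, then keep the top tokens per head.
# The scores are placeholders; LKV learns these quantities end to end.

def allocate_budgets(head_scores, total_budget, min_per_head=1):
    w = head_scores / head_scores.sum()
    return np.maximum(min_per_head, np.floor(w * total_budget).astype(int))

def select_tokens(token_scores, budgets):
    """token_scores: (heads, cache_len); returns kept indices per head."""
    kept = []
    for h, budget in enumerate(budgets):
        idx = np.argsort(token_scores[h])[-budget:]
        kept.append(np.sort(idx))
    return kept

rng = np.random.default_rng(0)
head_scores = rng.random(4)            # placeholder per-head importance
token_scores = rng.random((4, 10))     # placeholder per-token importance
budgets = allocate_budgets(head_scores, total_budget=16)
print(budgets, [k.tolist() for k in select_tokens(token_scores, budgets)])
```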

#kv-cache

RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory

arXiv cs.LG · 3d ago Cached

This paper introduces RateQuant, a method for optimal mixed-precision KV cache quantization that uses rate-distortion theory to address distortion model mismatch. It significantly reduces perplexity compared to existing methods like KIVI and QuaRot with minimal calibration overhead.
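
For intuition on the rate-distortion angle, the classical high-rate bit-allocation formula assigns more bits to higher-variance channels; the sketch below applies that textbook formula to made-up per-channel KV variances and is not RateQuant's actual procedure.

```python
import numpy as np

# Textbook rate-distortion bit allocation (high-rate approximation): channels
# with larger variance get more bits,
#   b_i = R_avg + 0.5 * log2(sigma_i^2 / geometric_mean(sigma^2)).
# Applied here to made-up per-channel KV variances as an illustration only.

def allocate_bits(variances, avg_bits):
    log_var = np.log2(variances)
    bits = avg_bits + 0.5 * (log_var - log_var.mean())
    return np.clip(np.round(bits), 1, 8).astype(int)   # snap to integer bit widths

variances = np.array([8.0, 1.0, 0.125, 64.0])  # placeholder per-channel variances
print(allocate_bits(variances, avg_bits=4))     # -> [5 3 2 6]: avg stays ~4 bits
```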

#kv-cache

I solved kv-cache

Reddit r/AI_Agents · 3d ago

The author has open-sourced a KV-cache project called catalyst-brain and claims it dramatically reduces RAM usage for local models, potentially enabling effectively unlimited context windows.

#kv-cache

@davideciffa: Huge thanks to @csujun, now Luce DFlash is 10-15% faster, by implementing per-layer K/V truncation in the draft graph f…

X AI KOLs Timeline · 3d ago Cached

Luce DFlash achieved a 10-15% speedup by implementing per-layer K/V truncation in the draft graph for sliding-window-attention (SWA) layers.
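
The tweet does not spell out the implementation, but the general idea of per-layer K/V truncation for SWA layers can be sketched as follows; the window sizes, sink-token count, and the `truncate_kv_for_swa` helper are all assumptions made for illustration.

```python
# Generic sketch of per-layer K/V truncation for sliding-window-attention (SWA)
# layers: a layer that only ever attends to the last `window` positions does not
# need older K/V entries, so its cache can be cut before a cheap draft pass.
# General idea only; not Luce DFlash's actual draft-graph implementation.

def truncate_kv_for_swa(kv_per_layer, windows, sink_tokens=4):
    """kv_per_layer: list of (K, V) pairs, each a sequence of per-token entries.
    windows: per-layer attention window, or None for full-attention layers."""
    out = []
    for (k, v), window in zip(kv_per_layer, windows):
        if window is None or len(k) <= window + sink_tokens:
            out.append((k, v))                       # full-attention layer: keep all
        else:
            keep_k = list(k[:sink_tokens]) + list(k[-window:])
            keep_v = list(v[:sink_tokens]) + list(v[-window:])
            out.append((keep_k, keep_v))             # sink tokens + most recent window
    return out

# Usage: a single SWA layer with a 10-token cache, window 3, 2 sink tokens.
k = [[float(i)] for i in range(10)]
v = [[float(i)] for i in range(10)]
trimmed = truncate_kv_for_swa([(k, v)], windows=[3], sink_tokens=2)
print(len(trimmed[0][0]))  # 5 entries kept: 2 sink tokens + last 3 in the window
```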

#kv-cache

Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

Hugging Face Daily Papers · 4d ago Cached

This paper introduces a learned global retention-based KV cache eviction method that improves long-context reasoning by selectively retaining useful tokens and reducing attention dilution, while significantly lowering memory usage.

#kv-cache

@pallavishekhar_: KV Cache in LLMs Read here: https://outcomeschool.com/blog/kv-cache-in-llms…

X AI KOLs Timeline · 4d ago Cached

This article explains the concept of KV Cache in Large Language Models, detailing how it optimizes text generation by storing and reusing key-value pairs to avoid redundant computations during inference.
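
A minimal single-head sketch of the idea the article explains: each decode step computes and appends only the newest token's key and value, then attends over the whole cache. All weights and dimensions are arbitrary placeholders.

```python
import numpy as np

# Minimal single-head illustration of KV caching: at each decode step only the
# newest token's key and value are computed and appended, then attention runs
# over the whole cache. Weights and dimensions are arbitrary placeholders.

d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = [rng.standard_normal((d, d)) for _ in range(3)]
K_cache, V_cache = [], []

def decode_step(x):
    """x: (d,) embedding of the newest token."""
    K_cache.append(x @ Wk)           # reuse: only one new key/value per step
    V_cache.append(x @ Wv)
    q = x @ Wq
    K = np.stack(K_cache)            # (cache_len, d)
    V = np.stack(V_cache)
    scores = K @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V               # attention output for the newest position

for _ in range(4):                   # 4 decode steps; the cache grows by 1 each step
    out = decode_step(rng.standard_normal(d))
print(len(K_cache), out.shape)       # 4 (8,)
```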

#kv-cache

@ickma2311: Efficient AI Lecture 12: Transformer and LLM This lecture is not only about how LLMs work. It also explains the buildin…

X AI KOLs Timeline · 5d ago Cached

Lecture notes from an Efficient AI course covering Transformer and LLM fundamentals, including multi-head attention, positional encoding, KV cache, and the connection between model architecture and inference efficiency. The content explains how design choices in transformers affect memory, latency, and hardware efficiency.

#kv-cache

@Modular: HTTP routing has been a solved problem for many years. Then came Large Language Models. Their backends aren't interchan…

X AI KOLs Following · 5d ago Cached

Modular published a blog post explaining why traditional HTTP routing doesn't work for LLM inference workloads. The article describes how their distributed inference framework handles stateful, heterogeneous GPU pods with KV caches, specialized prefill/decode backends, and conversation-level routing that traditional stateless routing algorithms cannot address.
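
A toy version of the conversation-level routing the post argues for: follow-up turns should land on the pod that already holds that conversation's KV cache rather than being spread statelessly. This is a generic sticky-routing sketch, not Modular's framework.

```python
import hashlib

# Toy conversation-affinity router: a follow-up request should land on the pod
# that already holds that conversation's KV cache, instead of being spread
# round-robin like a stateless HTTP request. Generic sketch only.

class AffinityRouter:
    def __init__(self, pods):
        self.pods = pods
        self.sticky = {}                    # conversation_id -> pod

    def route(self, conversation_id):
        if conversation_id in self.sticky:  # warm KV cache: reuse the same pod
            return self.sticky[conversation_id]
        # New conversation: a deterministic hash keeps placement stable.
        h = int(hashlib.sha256(conversation_id.encode()).hexdigest(), 16)
        pod = self.pods[h % len(self.pods)]
        self.sticky[conversation_id] = pod
        return pod

router = AffinityRouter(["pod-a", "pod-b", "pod-c"])
print(router.route("conv-42"), router.route("conv-42"))  # same pod both turns
```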

#kv-cache

Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

Hugging Face Daily Papers · 6d ago Cached

Proposes Memory-Efficient Looped Transformer (MELT), a novel recurrent LLM architecture that decouples reasoning depth from memory consumption by sharing a single KV cache across loops and using chunk-wise training with interpolated transition and attention-aligned distillation.

#kv-cache

Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility

Hugging Face Daily Papers · 2026-05-07 Cached

The paper introduces SPEED, a layer-asymmetric KV visibility policy that reduces long-context inference costs by processing prompt tokens only in lower layers during prefill while maintaining full-depth attention during decoding.

#kv-cache

@KL_Div: LLMs require more GPU memory as they generate longer responses. Can we make GPU memory constant without significantly s…

X AI KOLs Timeline · 2026-04-23 Cached

IceCache introduces Dynamic Continuous Indexing to keep GPU memory usage constant during long LLM generations with minimal accuracy loss.

#kv-cache

TTKV: Temporal-Tiered KV Cache for Long-Context LLM Inference

arXiv cs.CL · 2026-04-23 Cached

TTKV introduces a temporal-tiered KV cache that mimics human memory, cutting 128K-context LLM inference latency by 76%, doubling throughput, and reducing cross-tier traffic by 5.94×.
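
As a purely illustrative rendering of the "temporal tiers" idea, the sketch below labels cache positions hot, warm, or cold by recency; the tier sizes are invented, and the code has nothing to do with TTKV's actual policy or its reported numbers.

```python
# Illustrative temporal tiering: keep the most recent tokens' KV in a fast
# "hot" tier (e.g. GPU HBM) and demote older entries to slower tiers. The tier
# sizes are invented for this sketch; TTKV's policy is described in the paper.

def assign_tiers(seq_len, hot=1024, warm=8192):
    """Return a tier label per position: the newest `hot` positions are hot,
    the next `warm` are warm, everything older is cold."""
    tiers = []
    for pos in range(seq_len):
        age = seq_len - 1 - pos            # 0 for the newest token
        if age < hot:
            tiers.append("hot")
        elif age < hot + warm:
            tiers.append("warm")
        else:
            tiers.append("cold")
    return tiers

tiers = assign_tiers(seq_len=131_072)
print(tiers.count("hot"), tiers.count("warm"), tiers.count("cold"))
# 1024 8192 121856
```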

#kv-cache

Can we already use Google's TurboQuant (TQ) for KV Cache in llama-server? Or are we waiting for a PR?

Reddit r/LocalLLaMA · 2026-04-22

Community discussion on whether Google's TurboQuant compression can already be applied to KV cache in llama-server or if implementation is still pending.
