kv-cache

#kv-cache

@ickma2311: Efficient AI Lecture 12: Transformer and LLM This lecture is not only about how LLMs work. It also explains the buildin…

X AI KOLs Timeline · yesterday

Lecture notes from an Efficient AI course covering Transformer and LLM fundamentals, including multi-head attention, positional encoding, KV cache, and the connection between model architecture and inference efficiency. The content explains how design choices in transformers affect memory, latency, and hardware efficiency.
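
As a rough illustration of how architecture choices drive KV cache memory, the standard size estimate can be sketched as below. The model dimensions are hypothetical and are not taken from the lecture; the function name `kv_cache_bytes` is likewise an assumption made for this example.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and token."""
    # Factor of 2: one tensor for keys and one for values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical 32-layer model, 32 KV heads of dimension 128, fp16 (2 bytes).
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
full_context = kv_cache_bytes(32, 32, 128, seq_len=4096)
print(per_token)             # 524288 bytes, i.e. 0.5 MiB per token
print(full_context / 2**30)  # 2.0 GiB for a 4096-token context

# Sharing KV heads (grouped-query attention) shrinks the cache proportionally.
print(kv_cache_bytes(32, 8, 128, seq_len=4096) / 2**30)  # 0.5 GiB
```

The cache grows linearly with sequence length and batch size, which is why head count, head dimension, and numeric precision all bear directly on inference memory.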

@Modular: HTTP routing has been a solved problem for many years. Then came Large Language Models. Their backends aren't interchan…

X AI KOLs Following · yesterday

Modular published a blog post explaining why traditional HTTP routing doesn't work for LLM inference workloads. The article describes how their distributed inference framework handles stateful, heterogeneous GPU pods with KV caches, specialized prefill/decode backends, and conversation-level routing that traditional stateless routing algorithms cannot address.

@KL_Div: LLMs require more GPU memory as they generate longer responses. Can we make GPU memory constant without significantly s…

X AI KOLs Timeline · 2026-04-23

IceCache introduces Dynamic Continuous Indexing to keep GPU memory usage constant during long LLM generations with minimal accuracy loss.

TTKV: Temporal-Tiered KV Cache for Long-Context LLM Inference

arXiv cs.CL · 2026-04-23

TTKV introduces a temporal-tiered KV cache that mimics human memory, cutting 128K-context LLM inference latency by 76% and doubling throughput while reducing cross-tier traffic by 5.94×.

Can we already use Google's TurboQuant (TQ) for KV Cache in llama-server? Or are we waiting for a PR?

Reddit r/LocalLLaMA · 2026-04-22

Community discussion on whether Google's TurboQuant compression can already be applied to KV cache in llama-server or if implementation is still pending.

INT3 compression+fused metal kernels [R]

Reddit r/MachineLearning · 2026-04-22

A solo researcher released Spiral, a tool that compresses LLMs to INT3 and the KV cache to INT2 using custom fused Metal kernels for Apple Silicon. It currently ships a Qwen-7B preview.

KV Cache Compression 900000x Beyond TurboQuant and Per-Vector Shannon Limit

Hacker News Top · 2026-04-21

A new paper proposes sequential KV cache compression using probabilistic language tries and predictive delta coding, achieving theoretical compression ratios of ~914,000× beyond TurboQuant by exploiting the sequential structure of language model tokens rather than treating vectors independently.

@HuggingPapers: Cut your losses in parallel reasoning STOP learns to prune doomed trajectories early by reading KV-cache states, cuttin…

X AI KOLs Timeline · 2026-04-21

The STOP method prunes doomed reasoning trajectories early by reading KV-cache states, cutting token usage by 70% and improving AIME/GPQA accuracy across 1.5B–20B models.

River-LLM: Large Language Model Seamless Exit Based on KV Share

Hugging Face Daily Papers · 2026-04-20

River-LLM proposes a training-free early-exit framework for decoder-only LLMs that uses KV-sharing to eliminate KV-cache gaps, achieving 1.71–2.16× speedup without quality loss.

Context Is Software, Weights Are Hardware

Hacker News Top · 2026-04-19

Aravind Jayendran argues that while longer context windows improve LLM performance, they cannot fully replace weight updates, framing context as transient software and weights as hardware that fundamentally alters model capabilities.

High-Fidelity KV Cache Summarization Using Entropy and Low-Rank Reconstruction

Hacker News Top · 2026-04-19

Proposes an SRC pipeline that uses entropy-based selection and low-rank reconstruction to summarize KV cache instead of pruning tokens, reducing VRAM for million-token LLM contexts while avoiding catastrophic attention errors.

Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter

Hacker News Top · 2026-04-19

Researchers propose Prefill-as-a-Service (PrfaaS), a system that offloads long-context prefill to remote compute-dense clusters and streams KVCache over commodity Ethernet, enabling independent scaling and 32-54% higher throughput for a 1T-parameter hybrid model.

Cut Your Losses! Learning to Prune Paths Early for Efficient Parallel Reasoning

Hugging Face Daily Papers · 2026-04-17

This paper introduces STOP (Super Token for Pruning), a lightweight method that learns to prune unpromising reasoning paths early during parallel decoding by appending learnable tokens and reading KV cache states, achieving 70% token reduction while improving performance on AIME and GPQA benchmarks.

Efficient Memory Management for Large Language Model Serving with PagedAttention

Papers with Code Trending · 2023-09-12

This paper introduces PagedAttention, an algorithm inspired by virtual memory paging, and vLLM, a serving system that significantly improves LLM throughput by reducing memory fragmentation in key-value caches.
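
The paging idea in that summary can be illustrated with a toy allocator: each sequence's cache lives in fixed-size blocks that need not be contiguous, and a per-sequence block table maps logical positions to physical blocks. This is a minimal sketch of the bookkeeping only, not vLLM's implementation; `BlockAllocator` and its methods are hypothetical names invented for this example.

```python
class BlockAllocator:
    """Toy paged KV-cache bookkeeping: fixed-size blocks plus block tables."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # unused physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one new token; return (physical_block, offset)."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:
            # The last block is full (or the sequence is new): take any free block.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[-1], length % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


alloc = BlockAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    alloc.append_token("a")  # three tokens need two blocks
alloc.append_token("b")      # a second sequence takes one more block
print(alloc.block_tables)
alloc.free("a")              # blocks from "a" become reusable immediately
print(len(alloc.free_blocks))
```

Because a sequence wastes at most one partially filled block, memory is reserved on demand instead of being preallocated for the maximum possible length.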
