Community discussion on whether Google's TurboQuant compression can already be applied to KV cache in llama-server or if implementation is still pending.
A solo researcher released Spiral, a tool that compresses LLMs to INT3 and KV-cache to INT2 with custom fused Metal kernels for Apple Silicon, currently shipping a Qwen-7B preview.
A new paper proposes sequential KV cache compression using probabilistic language tries and predictive delta coding, achieving theoretical compression ratios of ~914,000× beyond TurboQuant by exploiting the sequential structure of language model tokens rather than treating vectors independently.
STOP method prunes doomed reasoning trajectories early via KV-cache states, cutting token usage 70% and boosting AIME/GPQA accuracy across 1.5B–20B models.
River-LLM proposes a training-free early-exit framework for decoder-only LLMs that uses KV-sharing to eliminate KV-cache gaps, achieving 1.71–2.16× speedup without quality loss.
Aravind Jayendran argues that while longer context windows improve LLM performance, they cannot fully replace weight updates, framing context as transient software and weights as hardware that fundamentally alters model capabilities.
This paper proposes an SRC pipeline that uses entropy-based selection and low-rank reconstruction to summarize the KV cache instead of pruning tokens, reducing VRAM for million-token LLM contexts while avoiding catastrophic attention errors.
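As a rough illustration of the entropy-based selection step (all names here are hypothetical, and the paper's low-rank reconstruction stage is omitted), one plausible reading is that tokens whose attention-weight distributions are high-entropy, i.e. diffuse and hard to summarize, are retained verbatim, while peaked, low-entropy tokens become candidates for summarization:

```python
# Sketch of entropy-based KV-cache token selection (illustrative only;
# helper names and the keep/summarize direction are assumptions, not the
# paper's actual SRC pipeline).
import math

def entropy(p):
    # Shannon entropy of an attention-weight distribution.
    return -sum(x * math.log(x) for x in p if x > 0)

def select_tokens(attn_rows, budget):
    # attn_rows: per-token attention distributions over the cache.
    # Keep the `budget` highest-entropy tokens; summarize the rest.
    scored = sorted(range(len(attn_rows)),
                    key=lambda i: entropy(attn_rows[i]),
                    reverse=True)
    keep = sorted(scored[:budget])
    summarize = sorted(scored[budget:])
    return keep, summarize

rows = [
    [0.97, 0.01, 0.01, 0.01],   # peaked -> low entropy
    [0.25, 0.25, 0.25, 0.25],   # uniform -> maximal entropy
    [0.70, 0.10, 0.10, 0.10],
]
keep, summarize = select_tokens(rows, budget=1)
print(keep, summarize)  # the uniform row is kept; the peaked rows are summarized
```

The point of summarizing (rather than pruning) the low-entropy rows is that their contribution to attention is still represented, just compressed, which is how such a scheme could avoid the catastrophic attention errors that hard pruning can cause.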
Researchers propose Prefill-as-a-Service (PrfaaS), a system that offloads long-context prefill to remote compute-dense clusters and streams KVCache over commodity Ethernet, enabling independent scaling and 32–54% higher throughput for a 1T-parameter hybrid model.
This paper introduces STOP (Super Token for Pruning), a lightweight method that learns to prune unpromising reasoning paths early during parallel decoding by appending learnable tokens and reading KV cache states, achieving 70% token reduction while improving performance on AIME and GPQA benchmarks.
This paper introduces PagedAttention, an algorithm inspired by virtual memory paging, and vLLM, a serving system that significantly improves LLM throughput by reducing memory fragmentation in key-value caches.
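The core of PagedAttention can be sketched in a few lines: logical token positions are translated through a per-sequence block table into fixed-size physical blocks, so KV memory is allocated on demand rather than reserved contiguously for the maximum context length. This is a minimal sketch of that idea, not vLLM's actual API; the class and method names are illustrative.

```python
# Minimal sketch of PagedAttention-style KV-cache paging (illustrative,
# not vLLM's implementation).
BLOCK_SIZE = 16  # tokens per physical KV block

class Allocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        return self.free.pop()

class BlockTable:
    def __init__(self, allocator):
        self.allocator = allocator
        self.blocks = []  # logical block index -> physical block id

    def append_token(self, position):
        # Allocate a new physical block only when the current one fills up.
        if position // BLOCK_SIZE >= len(self.blocks):
            self.blocks.append(self.allocator.allocate())

    def physical_slot(self, position):
        # Translate a logical token position to (physical block, offset).
        return self.blocks[position // BLOCK_SIZE], position % BLOCK_SIZE

alloc = Allocator(num_blocks=64)
table = BlockTable(alloc)
for pos in range(40):       # cache 40 tokens
    table.append_token(pos)

print(len(table.blocks))    # 3 blocks cover 40 tokens (ceil(40/16))
```

Because blocks need not be contiguous, freed blocks from finished sequences are immediately reusable, which is what eliminates the internal and external fragmentation of contiguous pre-allocation.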