proveKV – Honest 36× lossless (vs f32; 18× vs fp16) KV‑cache compression for LLMs (zero PPL regression)
Summary
An open-source repo, proveKV, demonstrates a reproducible KV-cache compression technique that reports 36× lossless (vs f32) and 68× lossy memory reduction on SmolLM2-1.7B with zero perplexity (PPL) regression. The repo includes Rust examples and an audit pipeline.
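The headline ratios depend on which baseline precision they are measured against, so it can help to convert them into average bits per cached element. The sketch below is only arithmetic on the figures quoted above and is not taken from the proveKV repo. The function names are hypothetical, and it assumes the 68x lossy figure is also measured against f32, which the summary does not state explicitly.

```python
from fractions import Fraction

# Bit widths of the uncompressed baselines.
F32_BITS = 32
FP16_BITS = 16


def bits_per_element(ratio, baseline_bits):
    """Average bits per cached element implied by a compression ratio."""
    return Fraction(baseline_bits) / Fraction(ratio)


def ratio_against(bits, baseline_bits):
    """Compression ratio of the same encoding against another baseline."""
    return Fraction(baseline_bits) / bits


# 36x vs f32 (lossless figure from the summary).
lossless_bits = bits_per_element(36, F32_BITS)
# 68x, ASSUMED here to be vs f32 as well (lossy figure).
lossy_bits = bits_per_element(68, F32_BITS)

print(f"lossless: {float(lossless_bits):.3f} bits/element, "
      f"{ratio_against(lossless_bits, FP16_BITS)}x vs fp16")
print(f"lossy:    {float(lossy_bits):.3f} bits/element, "
      f"{ratio_against(lossy_bits, FP16_BITS)}x vs fp16")
```

Exact fractions are used so the conversion does not pick up floating-point rounding. Under these assumptions, 36x against f32 works out to less than one bit per element on average, which matches the 18x-vs-fp16 figure in the title.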
Similar Articles
Dynamic KV Cache Quantization and Load-on-demand mmproj/MTP: my llama.cpp wishlist
A developer has implemented a proof-of-concept PR for llama.cpp that adds dynamic KV cache quantization via an HTTP endpoint, allowing users to requantize their KV cache on-demand without fully reloading the model. The post also outlines a wishlist including load-on-demand mmproj/MTP swapping and an automatic --fit flag for context optimization.
KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs
KV Packet proposes a recomputation-free cache reuse framework for LLMs that uses trainable soft-token adapters to bridge context discontinuities, eliminating overhead while maintaining performance comparable to full recomputation baselines on Llama-3.1 and Qwen2.5.
KV Cache Compression 900000x Beyond TurboQuant and Per-Vector Shannon Limit
A new paper proposes sequential KV cache compression using probabilistic language tries and predictive delta coding, achieving theoretical compression ratios of ~914,000× beyond TurboQuant by exploiting the sequential structure of language model tokens rather than treating vectors independently.
CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM
CONF-KV is a KV-cache management system that uses model uncertainty to dynamically adjust cache retention, improving memory efficiency for long-context LLM inference while maintaining accuracy within 1.5-2.1 perplexity points.
Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
This paper introduces LaProx, a KV-cache eviction strategy for long-context LLM inference that reformulates the problem as an output-aware matrix multiplication approximation, achieving high performance with only 5% cache usage.