proveKV – Honest 36× lossless (vs f32, 18× vs fp16) KV‑cache compression for LLMs (zero PPL regression)

Reddit r/LocalLLaMA 06/05/26, 02:38 AM Tools

kv-cache compression llm open-source memory-efficiency rust

Summary

An open-source repo, proveKV, demonstrates a reproducible KV-cache compression technique that reports 36× lossless and 68× lossy memory reduction versus an f32 KV cache on SmolLM2-1.7B, with zero PPL regression. The repo includes Rust examples and an audit pipeline.

I’m sharing a new open‑source repo that demonstrates a reproducible KV‑cache compression technique.

- Result: 36× lossless / 68× lossy memory reduction vs. f32‑raw KV cache on SmolLM2‑1.7B + WikiText‑2 (0% ΔPPL).
- Transparency: The numbers flow directly from the source code → CLAIMS.json → validation receipts, verified by an automated audit script (prove_audit.sh).
- What’s inside: Rust examples, a full audit pipeline, and a detailed README that walks through the three baseline calculations and why the “+1” offset was removed to get honest numbers.

If you’re interested in KV‑cache efficiency, give it a look and let me know what you think: https://github.com/RecursiveIntell/proveKV
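
To make the baseline arithmetic concrete, here is a minimal Rust sketch of how a ratio quoted against an f32 cache relates to the same result quoted against fp16. It is not code from the proveKV repo: the function names `raw_kv_bytes` and `ratio`, and the cache shape (layers, KV heads, head dimension, token count), are assumptions chosen for illustration. Only the 36× figure comes from the post.

```rust
/// Bytes in an uncompressed KV cache: one key tensor and one value tensor
/// per layer, each holding kv_heads * head_dim elements per token.
fn raw_kv_bytes(layers: u64, kv_heads: u64, head_dim: u64, tokens: u64, bytes_per_elem: u64) -> u64 {
    2 * layers * kv_heads * head_dim * tokens * bytes_per_elem
}

/// Compression ratio: baseline size divided by compressed size.
fn ratio(baseline_bytes: u64, compressed_bytes: u64) -> f64 {
    baseline_bytes as f64 / compressed_bytes as f64
}

fn main() {
    // Hypothetical cache shape, for illustration only (not taken from the post).
    let (layers, kv_heads, head_dim, tokens) = (24, 32, 64, 3072);

    let f32_bytes = raw_kv_bytes(layers, kv_heads, head_dim, tokens, 4); // 4 bytes per f32
    let f16_bytes = raw_kv_bytes(layers, kv_heads, head_dim, tokens, 2); // 2 bytes per fp16

    // Compressed size implied by the post's 36x claim against the f32 baseline.
    let compressed = f32_bytes / 36;

    println!("f32 baseline:  {} bytes", f32_bytes);
    println!("fp16 baseline: {} bytes", f16_bytes);
    println!("compressed:    {} bytes", compressed);
    println!("ratio vs f32:  {:.1}x", ratio(f32_bytes, compressed));
    println!("ratio vs fp16: {:.1}x", ratio(f16_bytes, compressed));
}
```

Because an fp16 element is half the size of an f32 element, the same compressed cache yields half the ratio against fp16. That is why the title quotes 36× vs f32 and 18× vs fp16 for one result, and by the same arithmetic the 68× lossy figure would be 34× against fp16.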
