(Yet Another) KV cache calculator - kvanta.vcerny.cz
Summary
KVANTA, a new open-source KV cache calculator, has been released. It supports any LLM or VLM hosted on Hugging Face.
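For readers unfamiliar with what such a calculator estimates, the following is a minimal sketch of the commonly used KV cache size formula for standard multi-head or grouped-query attention. It is not taken from KVANTA and may differ from what the tool does. The function name `kv_cache_bytes` and the example values are hypothetical. The dictionary keys follow the field names typically found in a Hugging Face `config.json` for Llama-style models, and architectures with sliding-window or latent attention would need a different calculation.

```python
def kv_cache_bytes(config, seq_len, batch_size=1, bytes_per_element=2):
    """Estimate KV cache size in bytes for a plain MHA/GQA transformer.

    Assumes every layer caches one key and one value vector per KV head
    for every token. bytes_per_element=2 corresponds to FP16/BF16.
    """
    layers = config["num_hidden_layers"]
    heads = config["num_attention_heads"]
    # Models without grouped-query attention have one KV head per query head.
    kv_heads = config.get("num_key_value_heads", heads)
    # Some configs state head_dim explicitly; otherwise derive it.
    head_dim = config.get("head_dim") or config["hidden_size"] // heads
    # Factor 2 accounts for storing both keys and values.
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_element


# Hypothetical config values, chosen to resemble an 8B-class GQA model.
example_config = {
    "num_hidden_layers": 32,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "hidden_size": 4096,
}

print(kv_cache_bytes(example_config, seq_len=8192) / 2**30, "GiB")  # prints "1.0 GiB"
```

With these assumed values each token costs 128 KiB of cache at FP16, so an 8192-token context takes 1 GiB per sequence. Quantizing the cache lowers `bytes_per_element`, which is where the capacity gains reported by KV-cache quantization projects come from.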
Similar Articles
KVarN: Native vLLM backend for KV-cache quantization by Huawei
Huawei CSL releases KVarN, a native vLLM attention backend for KV-cache quantization that delivers 3-5x more KV-cache capacity and up to ~1.3x the throughput of FP16, with no calibration required. It claims up to ~2.4x the throughput of TurboQuant while maintaining FP16-level accuracy on models like Qwen3-32B.
KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs
KV Packet proposes a recomputation-free cache reuse framework for LLMs that uses trainable soft-token adapters to bridge context discontinuities, eliminating overhead while maintaining performance comparable to full recomputation baselines on Llama-3.1 and Qwen2.5.
proveKV – Honest 36× lossless (vs f32, 18x vs fp16) KV‑cache compression for LLMs (zero PPL regression)
An open-source repo, proveKV, demonstrates a reproducible KV-cache compression technique achieving 36x lossless (vs f32) and 68x lossy memory reduction on SmolLM2-1.7B with zero PPL regression, including Rust examples and an audit pipeline.
CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM
CONF-KV is a KV-cache management system that uses model uncertainty to dynamically adjust cache retention, improving memory efficiency for long-context LLM inference while maintaining accuracy within 1.5-2.1 perplexity points.
Dynamic KV Cache Quantization and Load-on-demand mmproj/MTP: my llama.cpp wishlist
A developer has implemented a proof-of-concept PR for llama.cpp that adds dynamic KV cache quantization via an HTTP endpoint, allowing users to requantize their KV cache on-demand without fully reloading the model. The post also outlines a wishlist including load-on-demand mmproj/MTP swapping and an automatic --fit flag for context optimization.