(Yet Another) KV cache calculator - kvanta.vcerny.cz

Reddit r/LocalLLaMA 05/25/26, 03:17 PM Tools

kv-cache llm vlm calculator open-source hugging-face

Summary

KVANTA, a new open-source (Apache 2.0) web-based KV cache calculator, has been released. It is intended to support any LLM/VLM from Hugging Face.

Hello everyone, I thought all public web-based KV cache calculators kinda suck, so I decided to create one I would like to use myself: KVANTA [https://kvanta.vcerny.cz](https://kvanta.vcerny.cz). It should support any LLM/VLM from Hugging Face; if not, let me know! (Also, it's Apache 2.0.)

https://preview.redd.it/rk8i48ftva3h1.png?width=1754&format=png&auto=webp&s=7a2e8908d7d0a6c2efd92be5fb7f0ec548e7aba9
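
For readers unfamiliar with what such a calculator estimates, here is a minimal sketch of the textbook KV cache size formula for a standard multi-head or grouped-query attention transformer. It is not taken from KVANTA and may differ from what the tool does (sliding-window, MLA, or hybrid models need other formulas). The class `AttentionConfig`, the function `kv_cache_bytes`, and the example values are hypothetical; the field names are assumed to mirror common Hugging Face `config.json` keys.

```python
from dataclasses import dataclass


@dataclass
class AttentionConfig:
    # Hypothetical container; field names assumed to match common config.json keys.
    num_hidden_layers: int
    num_key_value_heads: int
    head_dim: int


def kv_cache_bytes(cfg, seq_len, batch_size=1, bytes_per_element=2):
    """Estimate KV cache size in bytes for plain full attention.

    bytes_per_element=2 assumes FP16/BF16 storage; quantized caches use less.
    """
    # Factor 2: one key tensor and one value tensor are cached per layer.
    per_token = (
        2
        * cfg.num_hidden_layers
        * cfg.num_key_value_heads
        * cfg.head_dim
        * bytes_per_element
    )
    return per_token * seq_len * batch_size


# Illustrative values for a 32-layer GQA model with 8 KV heads of dimension 128.
cfg = AttentionConfig(num_hidden_layers=32, num_key_value_heads=8, head_dim=128)
total = kv_cache_bytes(cfg, seq_len=8192)
print(f"{total / 2**30:.2f} GiB")  # 1.00 GiB
```

With these values each token costs 128 KiB of cache, so an 8192-token context at FP16 comes to 1 GiB per sequence.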


Similar Articles

KVarN: Native vLLM backend for KV-cache quantization by Huawei

Hacker News Top

Huawei CSL releases KVarN, a native vLLM attention backend for KV-cache quantization that delivers 3-5x more KV-cache capacity and up to ~1.3x the throughput of FP16, with no calibration required. It claims up to ~2.4x the throughput of TurboQuant while maintaining FP16-level accuracy on models like Qwen3-32B.

KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs

Hugging Face Daily Papers

KV Packet proposes a recomputation-free cache reuse framework for LLMs that uses trainable soft-token adapters to bridge context discontinuities, eliminating overhead while maintaining performance comparable to full recomputation baselines on Llama-3.1 and Qwen2.5.

proveKV – Honest 36× lossless (vs f32, 18x vs fp16) KV‑cache compression for LLMs (zero PPL regression)

Reddit r/LocalLLaMA

An open-source repo, proveKV, demonstrates a reproducible KV-cache compression technique achieving 36x lossless (vs f32) and 68x lossy memory reduction on SmolLM2-1.7B with zero PPL regression, including Rust examples and an audit pipeline.

CONF-KV: Confidence-Aware KV Cache Eviction with Mixed-Precision Storage for Long-Horizon LLM

Hugging Face Daily Papers

CONF-KV is a KV-cache management system that uses model uncertainty to dynamically adjust cache retention, improving memory efficiency for long-context LLM inference while maintaining accuracy within 1.5-2.1 perplexity points.

Dynamic KV Cache Quantization and Load-on-demand mmproj/MTP: my llama.cpp wishlist