(Yet Another) KV cache calculator - kvanta.vcerny.cz

Reddit r/LocalLLaMA Tools

Summary

A new open-source (Apache 2.0) web-based KV cache calculator named KVANTA has been released. It is intended to support any LLM/VLM from Hugging Face.

Hello everyone, I thought all public web-based KV cache calculators kinda suck, so I decided to create one I would like to use myself: KVANTA, [https://kvanta.vcerny.cz](https://kvanta.vcerny.cz). It should support any LLM/VLM from Hugging Face; if not, let me know! (Also, it's Apache 2.0.)

https://preview.redd.it/rk8i48ftva3h1.png?width=1754&format=png&auto=webp&s=7a2e8908d7d0a6c2efd92be5fb7f0ec548e7aba9
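For anyone wondering what a calculator like this estimates, here is a rough sketch of the usual back-of-the-envelope KV cache formula for a plain transformer with full attention. This is not KVANTA's code. The function name `kv_cache_bytes` and the example config values are made up for illustration, and real models with sliding-window attention, MLA, or hybrid layers need a more careful treatment.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes for a standard full-attention decoder.

    Assumes every layer caches one K and one V tensor of shape
    (batch_size, num_kv_heads, seq_len, head_dim). bytes_per_elem=2
    corresponds to FP16/BF16; use 1 for an 8-bit cache.
    """
    kv_tensors_per_layer = 2  # one K tensor plus one V tensor
    return (kv_tensors_per_layer * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Illustrative GQA config (hypothetical values): 32 layers, 8 KV heads,
# head_dim 128, 8192-token context, FP16 cache.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)
print(f"{size / 2**30:.2f} GiB")  # prints "1.00 GiB"
```

With these numbers each token costs 128 KiB of cache, so halving `bytes_per_elem` (an 8-bit cache) halves the total.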

Similar Articles

KVarN: Native vLLM backend for KV-cache quantization by Huawei

Hacker News Top

Huawei CSL releases KVarN, a native vLLM attention backend for KV-cache quantization that delivers 3-5x more KV-cache capacity and up to ~1.3x the throughput of FP16, with no calibration required. It claims up to ~2.4x the throughput of TurboQuant while maintaining FP16-level accuracy on models like Qwen3-32B.

KV Packet: Recomputation-Free Context-Independent KV Caching for LLMs

Hugging Face Daily Papers

KV Packet proposes a recomputation-free cache reuse framework for LLMs that uses trainable soft-token adapters to bridge context discontinuities, eliminating overhead while maintaining performance comparable to full recomputation baselines on Llama-3.1 and Qwen2.5.

Dynamic KV Cache Quantization and Load-on-demand mmproj/MTP: my llama.cpp wishlist

Reddit r/LocalLLaMA

A developer has implemented a proof-of-concept PR for llama.cpp that adds dynamic KV cache quantization via an HTTP endpoint, allowing users to requantize their KV cache on-demand without fully reloading the model. The post also outlines a wishlist including load-on-demand mmproj/MTP swapping and an automatic --fit flag for context optimization.