@no_stp_on_snek: Always start with uncompressed k and compressed V and go more aggressively from there. Model families have different se…


Summary

A tip on KV-cache compression for transformer models: start with uncompressed keys and compressed values, then adjust based on model family sensitivity; try asymmetric before symmetric compression.


Cached at: 05/23/26, 06:11 PM

Always start with uncompressed K and compressed V and go more aggressively from there.

Model families have different sensitivities to K compression in particular.

Asym first over sym.
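
One way to read the tip as a procedure is a ladder of cache settings walked from conservative to aggressive. The sketch below is only an illustration under stated assumptions: it takes "asym" to mean different bit widths for K and V (the post does not spell this out), the bit widths in LADDER are made up, and pick_kv_config and the evaluate callback are hypothetical names, not part of any library. The callback is assumed to return a lower-is-better quality metric such as perplexity, measured with the cache compressed to the given widths.

```python
from typing import Callable, Optional, Sequence, Tuple

# (k_bits, v_bits); 16 stands for "left uncompressed". Illustrative values only.
KVConfig = Tuple[int, int]

# Hypothetical ladder, ordered from most conservative to most aggressive.
# Asymmetric settings (K kept wider than V) come before the symmetric one.
LADDER: Sequence[KVConfig] = (
    (16, 8),  # K uncompressed, V lightly compressed: the starting point
    (16, 4),  # K still uncompressed, V compressed harder
    (8, 4),   # start compressing K, still asymmetric
    (4, 4),   # symmetric, tried last
)


def pick_kv_config(
    evaluate: Callable[[int, int], float],
    baseline: float,
    tolerance: float,
    ladder: Sequence[KVConfig] = LADDER,
) -> Optional[KVConfig]:
    """Return the last config on the ladder whose metric stays within tolerance.

    `evaluate(k_bits, v_bits)` must return a lower-is-better metric;
    `baseline` is the same metric with an uncompressed cache.
    Returns None if even the first rung degrades quality too much.
    """
    chosen: Optional[KVConfig] = None
    for k_bits, v_bits in ladder:
        loss = evaluate(k_bits, v_bits) - baseline
        if loss > tolerance:
            break  # stop at the first rung that hurts too much
        chosen = (k_bits, v_bits)
    return chosen


# Toy stand-in for a real benchmark: a model family that tolerates
# V compression well but reacts badly to any K compression.
def toy_metric(k_bits: int, v_bits: int) -> float:
    k_penalty = 0.5 if k_bits < 16 else 0.0
    v_penalty = 0.01 if v_bits < 8 else 0.0
    return 10.0 + k_penalty + v_penalty


print(pick_kv_config(toy_metric, baseline=10.0, tolerance=0.1))  # (16, 4)
```

Because sensitivity to K compression differs between model families, the ladder would need to be re-run per model rather than reusing one result everywhere.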

Similar Articles

KV Cache Compression 900000x Beyond TurboQuant and Per-Vector Shannon Limit

Hacker News Top

A new paper proposes sequential KV cache compression using probabilistic language tries and predictive delta coding, achieving theoretical compression ratios of ~914,000× beyond TurboQuant by exploiting the sequential structure of language model tokens rather than treating vectors independently.

Models Take Notes at Prefill: KV Cache Can Be Editable and Composable

arXiv cs.LG

This paper proposes that the KV cache in transformers acts as a notebook of memoized conclusions, enabling surgical editing and composition without full recomputation. The method achieves significant latency reductions while preserving decision equivalence across model scales.