@no_stp_on_snek: Always start with uncompressed k and compressed V and go more aggressively from there. Model families have different se…
Summary
A tip on KV-cache compression for transformer models: start with uncompressed keys and compressed values, then adjust based on model family sensitivity; try asymmetric before symmetric compression.
Cached at: 05/23/26, 06:11 PM
Always start with uncompressed K and compressed V, and get more aggressive from there.
Model families differ in how sensitive they are to K compression in particular.
Try asymmetric (K and V compressed differently) before symmetric.
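The tip can be read as a search order over cache settings. Below is a minimal sketch of that order in NumPy, not the author's method. Everything named here is an assumption for illustration: the `LADDER` of (K bits, V bits) steps, the per-row absmax round-to-nearest quantizer, the relative-error metric on a single attention head, and the `tolerance` threshold. A real evaluation would measure perplexity or task accuracy on the actual model.

```python
import numpy as np

def quantize_dequantize(x, bits):
    """Per-row absmax round-to-nearest quantization, then dequantize.
    bits=None means "leave uncompressed" (hypothetical convention)."""
    if bits is None:
        return x.copy()
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

def attention(q, k, v):
    """Single-head scaled dot-product attention (no mask)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def relative_error(q, k, v, k_bits, v_bits):
    """Relative L2 error of attention output under a given K/V setting."""
    ref = attention(q, k, v)
    out = attention(q,
                    quantize_dequantize(k, k_bits),
                    quantize_dequantize(v, v_bits))
    return float(np.linalg.norm(out - ref) / np.linalg.norm(ref))

# Assumed search order: V first with K untouched, then K, then symmetric.
LADDER = [(None, 8), (None, 4), (8, 4), (4, 4)]

def pick_config(q, k, v, tolerance):
    """Walk the ladder and keep the last step whose error stays in budget."""
    chosen = (None, None)  # fully uncompressed baseline
    for k_bits, v_bits in LADDER:
        if relative_error(q, k, v, k_bits, v_bits) > tolerance:
            break
        chosen = (k_bits, v_bits)
    return chosen

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((16, 64))
    k = rng.standard_normal((128, 64))
    v = rng.standard_normal((128, 64))
    for k_bits, v_bits in LADDER:
        print(k_bits, v_bits, relative_error(q, k, v, k_bits, v_bits))
    print("chosen:", pick_config(q, k, v, tolerance=0.05))
```

Random Gaussian tensors will not reproduce the K sensitivity of any particular model family, so the numbers this prints only illustrate the procedure. Swapping in K/V tensors captured from a real model is what would make the ladder informative.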
Similar Articles
KV Cache Compression 900000x Beyond TurboQuant and Per-Vector Shannon Limit
A new paper proposes sequential KV cache compression using probabilistic language tries and predictive delta coding, achieving theoretical compression ratios of ~914,000× beyond TurboQuant by exploiting the sequential structure of language model tokens rather than treating vectors independently.
@Michaelzsguo: KV cache is the model’s working memory during generation. As the context window gets longer, the model has to keep more…
DeepSeek's KV cache compression innovations, including MLA and CSA/HCA, reduce KV cache size by 93%, enabling efficient long-context inference and SSD-based caching, as demonstrated by antirez's ds4.c project.
@che_shr_cat: 1/ We have spent years optimizing KV cache via head-sharing (GQA/MQA), but we ignored a fundamental assumption: why do …
This thread challenges the fundamental assumption that Transformers require separate Q, K, and V projections, proposing that merging them can yield massive memory savings for KV cache.
Models Take Notes at Prefill: KV Cache Can Be Editable and Composable
This paper proposes that the KV cache in transformers acts as a notebook of memoized conclusions, enabling surgical editing and composition without full recomputation. The method achieves significant latency reductions while preserving decision equivalence across model scales.
The KV-cache wall: why fixed-size memory sequence models keep coming back
Explores the growing memory bottleneck of KV-cache in transformer inference, explaining why alternative architectures with fixed-size memory like Mamba and RWKV are gaining renewed attention.