@no_stp_on_snek: Always start with uncompressed k and compressed V and go more aggressively from there. Model families have different se…

X AI KOLs Following 05/23/26, 04:40 PM News

Summary

A tip on KV-cache compression for transformer models: start with uncompressed keys and compressed values, then adjust based on model family sensitivity; try asymmetric before symmetric compression.

Original Article

Cached at: 05/23/26, 06:11 PM

Always start with uncompressed k and compressed V and go more aggressively from there.

Model families have different sensitivities to K compression in particular.

Asym first over sym.
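The tuning order in the post can be sketched as a small sweep. This is only an illustration under stated assumptions: "compression" is modeled as uniform min/max quantization per vector (the post names no method), the bit-widths and the `SCHEDULE` list are hypothetical, and the cache is random data, not activations from a real model. Treating K and V with different bit-widths before matching them is one reading of "asym first over sym", not a confirmed one.

```python
import math
import random


def fake_quantize(vec, bits):
    """Quantize then dequantize one vector with uniform min/max levels.

    bits=None leaves the vector uncompressed. This is a stand-in for
    whatever compression scheme is actually in use (assumption).
    """
    if bits is None:
        return list(vec)
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return list(vec)
    scale = (hi - lo) / ((1 << bits) - 1)
    return [lo + round((x - lo) / scale) * scale for x in vec]


def attention(q, keys, values):
    """Single-query softmax attention over a cached list of keys/values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for i, x in enumerate(v):
            out[i] += (w / total) * x
    return out


def relative_error(q, keys, values, k_bits, v_bits):
    """Relative L2 error of attention output after compressing the cache."""
    ref = attention(q, keys, values)
    k_hat = [fake_quantize(k, k_bits) for k in keys]
    v_hat = [fake_quantize(v, v_bits) for v in values]
    out = attention(q, k_hat, v_hat)
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(out, ref)))
    den = math.sqrt(sum(a * a for a in ref))
    return num / den


# Toy cache: 64 cached tokens, head dimension 16 (arbitrary sizes).
rng = random.Random(0)
dim, tokens = 16, 64
q = [rng.gauss(0, 1) for _ in range(dim)]
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(tokens)]
values = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(tokens)]

# Hypothetical schedule, least to most aggressive: K stays uncompressed
# first, V is compressed first, and K is only touched in later steps.
SCHEDULE = [(None, 8), (None, 4), (8, 4), (4, 4)]

results = {cfg: relative_error(q, keys, values, *cfg) for cfg in SCHEDULE}
for (k_bits, v_bits), err in results.items():
    print(f"K bits={k_bits}, V bits={v_bits}: relative error {err:.4f}")
```

On a real model the error metric would be a task-level one (perplexity or a downstream benchmark) measured per model family, and the sweep would stop at the last configuration that stays within an acceptable quality budget.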

Similar Articles

KV Cache Compression 900000x Beyond TurboQuant and Per-Vector Shannon Limit

Hacker News Top

A new paper proposes sequential KV cache compression using probabilistic language tries and predictive delta coding, achieving theoretical compression ratios of ~914,000× beyond TurboQuant by exploiting the sequential structure of language model tokens rather than treating vectors independently.

@Michaelzsguo: KV cache is the model’s working memory during generation. As the context window gets longer, the model has to keep more…

X AI KOLs Timeline

DeepSeek's KV cache compression innovations, including MLA and CSA/HCA, reduce KV cache size by 93%, enabling efficient long-context inference and SSD-based caching, as demonstrated by antirez's ds4.c project.

@che_shr_cat: 1/ We have spent years optimizing KV cache via head-sharing (GQA/MQA), but we ignored a fundamental assumption: why do …

X AI KOLs Timeline

This thread challenges the fundamental assumption that Transformers require separate Q, K, and V projections, proposing that merging them can yield massive memory savings for KV cache.

Models Take Notes at Prefill: KV Cache Can Be Editable and Composable

arXiv cs.LG

This paper proposes that the KV cache in transformers acts as a notebook of memoized conclusions, enabling surgical editing and composition without full recomputation. The method achieves significant latency reductions while preserving decision equivalence across model scales.

The KV-cache wall: why fixed-size memory sequence models keep coming back

Reddit r/ArtificialInteligence

Explores the growing memory bottleneck of KV-cache in transformer inference, explaining why alternative architectures with fixed-size memory like Mamba and RWKV are gaining renewed attention.