@rohanpaul_ai: Interesting, this paper shows that Transformers may not need separate key and value projections to work well. This pape…
Summary
This paper investigates whether Transformers need separate key and value projections. It finds that sharing them can cut the KV cache by 50% with only 3.1% higher perplexity, and that combining the idea with GQA or MQA cuts the cache further.
Cached at: 06/09/26, 02:51 PM
Interesting, this paper shows that Transformers may not need separate key and value projections to work well.
This paper’s design cut the KV cache by 50% in language modeling with only 3.1% higher perplexity, meaning KV-cache memory at inference fell sharply while prediction quality stayed close to the baseline.
In a standard attention layer, the Query asks what each token needs, the Key labels what each token offers, and the Value carries the information sent back.
Here, the surprising result is that Key and Value can often share the same learned map, because the model can use one representation both as an address and as the content being retrieved.
The best variant, Q-K=V, kept Query separate, so attention still had direction: one token can ask a different token for information instead of every relation becoming mirror-like.
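A minimal sketch of the Q-K=V idea, assuming a single head, a causal mask, and NumPy. The function name `shared_kv_attention` and the weight names `w_q` and `w_kv` are hypothetical and chosen for illustration; this is not the paper's implementation, only one way the shared Key/Value map described above could look.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shared_kv_attention(x, w_q, w_kv):
    """Single-head causal attention where Key and Value share one projection.

    Illustrative assumption: one matrix w_kv plays both roles, so only one
    tensor per token would need to be cached instead of two.
    """
    q = x @ w_q        # (T, d_head): Query keeps its own projection
    kv = x @ w_kv      # (T, d_head): used as both Key and Value
    scores = q @ kv.T / np.sqrt(kv.shape[-1])
    # Causal mask: position i may only attend to positions j <= i.
    seq_len = x.shape[0]
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores)
    return weights @ kv, kv   # output, plus the single tensor to cache

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 8
x = rng.normal(size=(seq_len, d_model))
w_q = rng.normal(size=(d_model, d_head))
w_kv = rng.normal(size=(d_model, d_head))
out, cache = shared_kv_attention(x, w_q, w_kv)
print(out.shape, cache.shape)
```

Because Query still has its own matrix, the score from token i to token j generally differs from the score from j to i, which is the directionality the thread refers to.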
When stacked with GQA and MQA, the same idea reached 87.5% and 96.9% cache cuts, because it stores one tensor per head instead of two while those methods reduce the number of stored heads.
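The arithmetic behind those percentages can be sketched as follows. The head counts are assumptions picked because they reproduce the quoted figures (16 query heads, 4 KV heads for GQA, 1 KV head for MQA); the thread does not state the actual configuration, and the helper `kv_cache_cut` is hypothetical.

```python
def kv_cache_cut(n_heads, n_kv_heads, shared_kv):
    """Fraction of the baseline KV cache that is saved.

    Baseline: every head stores a separate Key tensor and Value tensor.
    Head sharing (GQA/MQA) scales the cache by n_kv_heads / n_heads.
    Sharing Key and Value stores one tensor instead of two, a factor of 0.5.
    """
    head_fraction = n_kv_heads / n_heads
    tensor_fraction = 0.5 if shared_kv else 1.0
    return 1.0 - head_fraction * tensor_fraction

# Assumed configuration: 16 query heads.
shared_only = kv_cache_cut(16, 16, shared_kv=True)   # 0.5
with_gqa = kv_cache_cut(16, 4, shared_kv=True)       # 0.875
with_mqa = kv_cache_cut(16, 1, shared_kv=True)       # 0.96875

for name, cut in [("Q-K=V", shared_only), ("GQA + Q-K=V", with_gqa), ("MQA + Q-K=V", with_mqa)]:
    print(f"{name}: {cut:.1%} smaller KV cache")
```

Under these assumptions the two savings multiply: the head-sharing factor and the one-tensor factor are independent, which is why the combined cut is larger than either alone.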
The weak variant is Q=K-V: tying Query and Key makes attention too symmetric for causal language modeling, and it gives no KV-cache savings because Key and Value are still stored separately.
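The symmetry point can be checked numerically. This is a small illustration with random weights, not an experiment from the paper; the variable names are hypothetical. When Query and Key use the same matrix, the pre-softmax score matrix equals its own transpose, so token i scores token j exactly as j scores i before masking.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6))          # 4 tokens, model width 6 (assumed sizes)
w_shared = rng.normal(size=(6, 3))   # one matrix used for two roles
w_q = rng.normal(size=(6, 3))        # separate Query matrix

# Q=K-V: Query and Key are tied, so scores = (x W)(x W)^T.
tied_scores = (x @ w_shared) @ (x @ w_shared).T

# Q-K=V: Query is separate, so scores = (x W_q)(x W_kv)^T.
untied_scores = (x @ w_q) @ (x @ w_shared).T

tied_is_symmetric = np.allclose(tied_scores, tied_scores.T)
untied_is_symmetric = np.allclose(untied_scores, untied_scores.T)
print(tied_is_symmetric, untied_is_symmetric)
```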
Link – arxiv.org/abs/2606.04032v2
Title: “Do Transformers Need Three Projections? Systematic Study of QKV Variants”
Similar Articles
Do transformers need three projections? Systematic study of QKV variants
This paper systematically studies variants of QKV projection sharing in transformers, finding that sharing key and value projections (Q-K=V) achieves 50% KV cache reduction with only 3.1% perplexity degradation, and combining with GQA/MQA can reach up to 96.9% cache reduction—enabling practical on-device inference with minimal quality loss.
@che_shr_cat: 1/ We have spent years optimizing KV cache via head-sharing (GQA/MQA), but we ignored a fundamental assumption: why do …
This thread challenges the fundamental assumption that Transformers require separate Q, K, and V projections, proposing that merging them can yield massive memory savings for KV cache.
@rohanpaul_ai: New Microsoft paper argues that transformers generalize better when they learn compact internal states, not just next t…
Microsoft's NextLat paper proposes a self-supervised training method where transformers predict their next hidden state instead of just the next token, leading to more compact world models, better planning and reasoning, and up to 3.3x faster generation.
@jiqizhixin: What if your AI’s memory didn’t have to balloon with every extra sentence? University of Oxford, Technion, AITHYRA, and…
Introduces KV-Compression Aware Training (KV-CAT), a method that encourages transformers to learn compressible key-value caches during training, improving memory efficiency for long-context tasks without sacrificing performance.
Models Take Notes at Prefill: KV Cache Can Be Editable and Composable
This paper proposes that the KV cache in transformers acts as a notebook of memoized conclusions, enabling surgical editing and composition without full recomputation. The method achieves significant latency reductions while preserving decision equivalence across model scales.