@rohanpaul_ai: Interesting, this paper shows that Transformers may not need separate key and value projections to work well. This pape…


Summary

This paper investigates whether Transformers need separate key and value projections, finding that sharing them can reduce the KV cache by 50% with only 3.1% higher perplexity, with further cuts when combined with GQA and MQA.


Cached at: 06/09/26, 02:51 PM

Interesting, this paper shows that Transformers may not need separate key and value projections to work well.

This paper’s design cut the KV cache by 50% in language modeling with only 3.1% higher perplexity, meaning KV-cache memory at inference was halved while prediction quality stayed close to the baseline.

A standard attention layer computes a Query to ask what each token needs, a Key to label what each token offers, and a Value to carry the information sent back.

Here, the surprising result is that Key and Value can often share the same learned map, because the model can use one representation both as an address and as the content being retrieved.

The best variant, Q-K=V, kept Query separate, so attention still had direction: one token can ask a different token for information instead of every relation becoming mirror-like.
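
To make the Q-K=V idea concrete, here is a minimal single-head sketch in NumPy. It is an illustration under assumptions, not the paper's implementation: the function name `shared_kv_attention`, the weight names `w_q` and `w_kv`, and the toy dimensions are all hypothetical. The point is that one projected tensor serves as both key and value, so only that tensor needs caching.

```python
import numpy as np

def shared_kv_attention(x, w_q, w_kv):
    """Single-head causal attention where one projection acts as key and value."""
    q = x @ w_q     # queries keep their own projection
    kv = x @ w_kv   # one tensor, used both as the address (K) and the content (V)
    n, d = q.shape
    scores = (q @ kv.T) / np.sqrt(d)
    # Causal mask: position i may only attend to positions <= i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over each row.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ kv
    return out, kv  # kv is the only tensor that has to be cached

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 6, 16, 8
x = rng.standard_normal((seq_len, d_model))
w_q = rng.standard_normal((d_model, d_head))
w_kv = rng.standard_normal((d_model, d_head))

out, cache = shared_kv_attention(x, w_q, w_kv)

baseline_cache_floats = 2 * seq_len * d_head  # separate K and V tensors
shared_cache_floats = cache.size              # one shared tensor
```

In this sketch the shared cache holds half as many floats as a separate K and V cache for the same sequence, which is where the 50% figure comes from.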

When stacked with GQA and MQA, the same idea reached 87.5% and 96.9% cache cuts respectively, because the two savings multiply: sharing Key and Value stores one tensor per head instead of two, while GQA and MQA reduce the number of heads that are stored.
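
The 87.5% and 96.9% figures can be reproduced with simple arithmetic. The sketch below is a back-of-the-envelope check, not the paper's accounting: the head counts (16 query heads, 4 KV heads for GQA, 1 for MQA) are assumptions picked because they match the quoted numbers, and the post itself does not state the configuration. The helper name `kv_cache_reduction` is hypothetical.

```python
def kv_cache_reduction(n_heads, n_kv_heads, share_kv):
    """Fraction of KV cache saved relative to full multi-head attention."""
    baseline = 2 * n_heads            # one K and one V tensor for every head
    tensors_per_head = 1 if share_kv else 2
    cached = tensors_per_head * n_kv_heads
    return 1 - cached / baseline

# Assumed configuration: 16 query heads (not stated in the post).
shared_only = kv_cache_reduction(16, 16, share_kv=True)  # K=V sharing alone
with_gqa = kv_cache_reduction(16, 4, share_kv=True)      # plus GQA with 4 KV heads
with_mqa = kv_cache_reduction(16, 1, share_kv=True)      # plus MQA with 1 KV head

print(shared_only, with_gqa, with_mqa)  # 0.5 0.875 0.96875
```

Under these assumptions the results are 50%, 87.5%, and 96.875%, the last of which rounds to the 96.9% quoted above.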

The weak variant is Q=K-V, because tying Query and Key makes attention too symmetric for causal language modeling, and it gives no KV-cache savings since Key and Value are still stored separately.
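
The symmetry problem with tied Query and Key can be seen directly in the raw attention scores. The following is a small NumPy illustration with hypothetical weight names and toy sizes, not code from the paper. When Query and Key share one projection, the score token i gives token j always equals the score token j gives token i, before any masking or softmax is applied.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 16))      # 5 tokens, model width 16 (toy sizes)

w_qk = rng.standard_normal((16, 8))   # tied projection, as in Q=K-V
w_q = rng.standard_normal((16, 8))    # separate projections, as in Q-K=V
w_k = rng.standard_normal((16, 8))

# Raw (pre-mask, pre-softmax) attention scores.
tied = (x @ w_qk) @ (x @ w_qk).T
untied = (x @ w_q) @ (x @ w_k).T

tied_is_symmetric = np.allclose(tied, tied.T)        # always True
untied_is_symmetric = np.allclose(untied, untied.T)  # False for random weights
```

The causal mask and row-wise softmax change the final weights, but the underlying compatibility score stays mirror-like in the tied case, which is the loss of direction described above.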


Link – arxiv.org/abs/2606.04032v2

Title: “Do Transformers Need Three Projections? Systematic Study of QKV Variants”
