@rohanpaul_ai: Interesting, this paper shows that Transformers may not need separate key and value projections to work well. This pape…

X AI KOLs Timeline 06/09/26, 11:39 AM Papers

transformer attention kv-cache language-modeling efficiency deep-learning

Summary

This paper investigates whether Transformers need separate key and value projections. It finds that sharing them can cut the KV cache by 50% at a cost of only 3.1% higher perplexity, and that combining the idea with GQA and MQA cuts the cache further.

Original Article

Cached at: 06/09/26, 02:51 PM

Interesting, this paper shows that Transformers may not need separate key and value projections to work well.

This paper’s design cut the KV cache by 50% in language modeling with only 3.1% higher perplexity, meaning inference memory fell sharply while prediction quality stayed close.

A normal attention layer computes three projections: Query asks what each token needs, Key labels what each token offers, and Value carries the information sent back.

Here, the surprising result is that Key and Value can often share the same learned map, because the model can use one representation both as an address and as the content being retrieved.
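
A minimal sketch of the idea in plain Python, assuming a single head, causal masking, and tiny hand-picked weights. The names `W_q`, `W_kv`, and `shared_kv_attention` are illustrative and not taken from the paper. Each token is projected once by `W_kv`, and that one vector is used both to score against the query and as the content that gets averaged, so the cache holds one vector per token instead of two.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def shared_kv_attention(tokens, W_q, W_kv):
    """Causal single-head attention where Key and Value share one projection.

    Returns (outputs, cache). The cache stores one vector per token,
    which serves as both key and value.
    """
    cache = []      # one shared key/value vector per token seen so far
    outputs = []
    for x in tokens:
        q = matvec(W_q, x)
        cache.append(matvec(W_kv, x))   # computed once, used twice below
        scale = 1.0 / math.sqrt(len(q))
        # Used as an address: score the query against every cached vector.
        scores = [scale * sum(a * b for a, b in zip(q, kv)) for kv in cache]
        weights = softmax(scores)       # cache holds only past and current tokens
        # Used as content: weighted average of the same cached vectors.
        dim = len(cache[0])
        out = [sum(w * kv[d] for w, kv in zip(weights, cache)) for d in range(dim)]
        outputs.append(out)
    return outputs, cache

# Hypothetical 2-dimensional toy inputs and weights.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_q = [[0.5, -0.2], [0.1, 0.3]]
W_kv = [[1.0, 0.0], [0.0, 1.0]]

outputs, cache = shared_kv_attention(tokens, W_q, W_kv)
print(len(cache), "cached vectors for", len(tokens), "tokens")
```

A standard layer would append two vectors per token (a key and a value) at the same point in the loop, which is where the 50% saving comes from.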

The best variant, Q-K=V, kept Query separate, so attention still had direction: one token can ask a different token for information instead of every relation becoming mirror-like.

When stacked with GQA and MQA, the same idea reached 87.5% and 96.9% cache cuts, because it reduces projection storage while those methods reduce stored heads.
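
The two percentages are consistent with simple arithmetic, if sharing K=V halves what is stored per cached head and head-sharing multiplies that by the fraction of heads kept. The sketch below assumes 16 query heads, with GQA keeping 4 KV heads and MQA keeping 1. Those head counts are an assumption chosen because they reproduce the quoted figures, not a configuration confirmed by the post, and `cache_cut` is a hypothetical helper.

```python
def cache_cut(num_heads, num_kv_heads, shared_kv):
    """Fraction of the baseline KV cache removed.

    The baseline stores a key and a value (2 tensors) for each of num_heads heads.
    """
    tensors_per_head = 1 if shared_kv else 2
    baseline = 2 * num_heads
    stored = tensors_per_head * num_kv_heads
    return 1.0 - stored / baseline

# Assumed head counts (hypothetical, not taken from the paper).
HEADS = 16
print(f"K=V only:  {cache_cut(HEADS, 16, True):.1%}")
print(f"K=V + GQA: {cache_cut(HEADS, 4, True):.1%}")
print(f"K=V + MQA: {cache_cut(HEADS, 1, True):.1%}")
```

Under those assumptions the three lines print 50.0%, 87.5%, and 96.9%.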

The weak variant is Q=K-V: tying Query and Key makes attention too symmetric for causal language modeling, and it gives no KV-cache savings because Key and Value are still cached separately.
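
The symmetry problem can be seen directly: if Query and Key share one projection, the pre-softmax score between tokens i and j is the same dot product in both directions. A small sketch in plain Python, with hypothetical toy weights and helper names (`score_matrix`, `is_symmetric`) invented for illustration:

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def score_matrix(tokens, W_q, W_k):
    """Unnormalized attention scores: scores[i][j] = (W_q x_i) . (W_k x_j)."""
    queries = [matvec(W_q, x) for x in tokens]
    keys = [matvec(W_k, x) for x in tokens]
    return [[sum(a * b for a, b in zip(q, k)) for k in keys] for q in queries]

def is_symmetric(m, tol=1e-12):
    """True if m[i][j] equals m[j][i] for every pair, within tol."""
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(n))

# Hypothetical toy inputs and weights.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 2.0]]
W_shared = [[0.5, -0.2], [0.1, 0.3]]
W_other = [[0.0, 1.0], [-1.0, 0.4]]

tied = score_matrix(tokens, W_shared, W_shared)      # Q=K: one projection for both
separate = score_matrix(tokens, W_shared, W_other)   # Query kept separate
print("Q=K symmetric:", is_symmetric(tied))
print("separate Q symmetric:", is_symmetric(separate))
```

The causal mask and softmax still apply afterwards; the sketch only shows that the raw scores lose their direction when Query and Key are tied.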

Link – arxiv.org/abs/2606.04032v2

Title: “Do Transformers Need Three Projections? Systematic Study of QKV Variants”

Similar Articles

Do transformers need three projections? Systematic study of QKV variants

@che_shr_cat: 1/ We have spent years optimizing KV cache via head-sharing (GQA/MQA), but we ignored a fundamental assumption: why do …

@rohanpaul_ai: New Microsoft paper argues that transformers generalize better when they learn compact internal states, not just next t…

@jiqizhixin: What if your AI’s memory didn’t have to balloon with every extra sentence? University of Oxford, Technion, AITHYRA, and…

Models Take Notes at Prefill: KV Cache Can Be Editable and Composable
