Do transformers need three projections? Systematic study of QKV variants
Summary
This paper systematically studies variants of QKV projection sharing in transformers, finding that sharing key and value projections (Q-K=V) achieves 50% KV cache reduction with only 3.1% perplexity degradation, and combining with GQA/MQA can reach up to 96.9% cache reduction—enabling practical on-device inference with minimal quality loss.
# Do Transformers Need Three Projections? Systematic Study of QKV Variants
Source: [https://arxiv.org/html/2606.04032](https://arxiv.org/html/2606.04032)
###### Abstract
Transformers have become the standard solution for various AI tasks, with the query, key, and value (QKV) attention formulation playing a central role. However, the individual contribution of these three projections and the impact of omitting some remain poorly understood. We systematically evaluate three projection sharing constraints: a) Q-K=V (shared key-value), b) Q=K-V (shared query-key), and c) Q=K=V (single projection). The last two variants produce symmetric attention maps; to address this, we also explore asymmetric attention via 2D positional encodings. Through experiments spanning synthetic tasks, vision (MNIST, CIFAR, TinyImageNet, anomaly detection), and language modeling (300M and 1.2B parameter models on 10B tokens), we find that the projection-sharing transformers perform on par with, or occasionally better than, the QKV transformer. In language modeling, Q-K=V projection sharing achieves 50% KV cache reduction with only 3.1% perplexity degradation. Crucially, projection sharing is complementary to head sharing (GQA/MQA): combining Q-K=V with GQA-4 yields 87.5% cache reduction, while Q-K=V + MQA achieves 96.9%, enabling practical on-device inference. We show that Q-K=V preserves quality because keys and values can occupy similar representational spaces and attention operates in a low-rank regime, whereas Q=K-V breaks attention directionality. Our results systematically characterize projection sharing as an underexplored instance of weight tying in attention, with direct, quantifiable inference memory benefits that are particularly valuable for edge deployment. The code is publicly available at [https://github.com/Brainchip-Inc/Do-Transformers-Need-3-Projections](https://github.com/Brainchip-Inc/Do-Transformers-Need-3-Projections).
Machine Learning, ICML
## 1 Introduction
Since their inception, Transformers (Vaswani et al., [2017](https://arxiv.org/html/2606.04032#bib.bib3)) have evolved from language-specific tools into the backbone of multimodal AI (Yin et al., [2024](https://arxiv.org/html/2606.04032#bib.bib64); Han et al., [2022](https://arxiv.org/html/2606.04032#bib.bib35)). However, as context windows expand and the demand for real-time inference grows, the research community has shifted focus toward architectural efficiency. High-efficiency variants, ranging from linear-complexity models like the Performer and Linformer to modern implementations like Ring Attention and blockwise schemes, seek to alleviate the quadratic bottleneck of self-attention (Tay et al., [2022](https://arxiv.org/html/2606.04032#bib.bib4)).
Despite these advances, a fundamental structural question remains: is the tripartite (Query, Key, Value) projection truly necessary? While Convolutional Neural Networks (CNNs) (LeCun et al., [1995](https://arxiv.org/html/2606.04032#bib.bib2)) and contemporary State Space Models (SSMs) (Gu and Dao, [2023](https://arxiv.org/html/2606.04032#bib.bib36)) often utilize more unified internal representations, Transformers maintain a persistent redundancy across their projection matrices. To investigate this, we propose and evaluate three *Projective Sharing* architectures:
- Q=K-V: Unified $Q$ and $K$; separate $V$.
- Q-K=V: Separate $Q$; unified $K$ and $V$.
- Q=K=V: Single projection for all three.
Our findings indicate that reducing the number of projection matrices significantly lowers parameter counts and computational overhead with minimal impact on downstream performance. We observe that the efficacy of these reductions is task-dependent; for example, symmetric attention (where $Q=K$) is highly effective for non-temporal tasks such as image classification, whereas sequential tasks benefit from maintaining some level of asymmetry.
### 1.1 Projection Sharing vs. Head Sharing
Our approach addresses a different dimension of efficiency than current industry standards such as Grouped Query Attention (GQA) by Ainslie et al. ([2023](https://arxiv.org/html/2606.04032#bib.bib65)) and Multi-Query Attention (MQA) by Shazeer ([2019](https://arxiv.org/html/2606.04032#bib.bib88)). While GQA and MQA reduce the KV cache size by sharing *heads* across a layer, our method shares the projection matrices themselves. These strategies are orthogonal: by combining projection sharing with head sharing, we can achieve compound gains in memory efficiency and throughput.
### 1.2 Our Contributions
- Systematic Evaluation: We benchmark projection-sharing strategies across 12 diverse tasks, including synthetic reasoning, computer vision, and Large Language Model (LLM) pre-training.
- Cache Optimization: We demonstrate that the Q-K=V configuration reduces the KV cache footprint by 50% while incurring only a 3.1% increase in perplexity for 300M-parameter models.
- Scale Validation: We validate our findings at 1.2B parameter scale (~10B tokens), confirming that relative quality rankings remain stable across model sizes. MQA maintains near-parity with QKV (1.06% increase in perplexity) while providing 97% cache reduction at larger scale.
- Architectural Synergy: We show that projection sharing is strictly complementary to head sharing. A combined Q-GQA-4 configuration achieves an 87.5% cache reduction, while Q-MQA reaches a 96.9% reduction.
- Insights: We provide architectural insights explaining why Q-K=V works (shared representational space) while Q=K-V fails (breaks attention directionality). Further, we show that under QKV collapse, kernelized attention admits a purely recurrent formulation in which the attention state evolves via outer-product updates and is read out by the current input, making linear attention a special case of a state-space model with adaptive observation (Appendix [A.1](https://arxiv.org/html/2606.04032#A1.SS1)).
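
The recurrent view in the last bullet can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: it assumes the standard causal linear-attention recurrence of Katharopoulos et al. (2020) with an `elu(x) + 1` feature map, applied to a single shared projection `U` that stands in for Q = K = V. The function names, the feature map, and the normalizer `z` are illustrative assumptions, and Appendix A.1 may use a different formulation.

```python
import numpy as np

def phi(x):
    # Assumed positive feature map elu(x) + 1 (not specified in this section).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def collapsed_linear_attention_recurrent(U):
    """Causal kernelized attention with Q = K = V = U, computed as a recurrence.

    U has shape (n, d): one shared projection per token.
    """
    n, d = U.shape
    S = np.zeros((d, d))  # attention state, updated by outer products
    z = np.zeros(d)       # running normalizer
    out = np.zeros((n, d))
    for t in range(n):
        f = phi(U[t])
        S += np.outer(f, U[t])       # state update from the current token
        z += f
        out[t] = (f @ S) / (f @ z)   # read-out by the current input
    return out

def collapsed_linear_attention_parallel(U):
    """Same quantity computed with an explicit causal n x n attention matrix."""
    F = phi(U)
    A = np.tril(F @ F.T)                 # causal kernel scores
    A /= A.sum(axis=-1, keepdims=True)   # row normalization
    return A @ U
```

On random inputs the two functions agree to floating-point precision, which is the sense in which the collapsed attention can be evaluated with a fixed-size state instead of a growing cache.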
## 2 Related Works
### 2.1 Background: The Standard Attention Mechanism
The Transformer architecture (Vaswani et al., [2017](https://arxiv.org/html/2606.04032#bib.bib3)) has become the foundation for modern deep learning across multiple domains, from natural language processing (Brown et al., [2020](https://arxiv.org/html/2606.04032#bib.bib24)) to computer vision (Dosovitskiy et al., [2021](https://arxiv.org/html/2606.04032#bib.bib7)) and beyond. At its core, the Transformer block comprises several interconnected components: multi-head self-attention, position-wise feed-forward networks, layer normalization (Ba et al., [2016](https://arxiv.org/html/2606.04032#bib.bib26)), residual connections (He et al., [2016](https://arxiv.org/html/2606.04032#bib.bib6)), and positional encodings.
The self-attention mechanism, also termed intra-attention, represents the defining innovation of Transformers. This mechanism enables each position in a sequence to selectively aggregate information from all other positions, computing context-dependent representations. Self-attention has demonstrated remarkable effectiveness across diverse tasks including machine translation, abstractive summarization (Gupta and Gupta, [2019](https://arxiv.org/html/2606.04032#bib.bib67)), visual question answering (Wu et al., [2017](https://arxiv.org/html/2606.04032#bib.bib66)), multimodal understanding (Radford et al., [2021](https://arxiv.org/html/2606.04032#bib.bib15)), and object recognition (Dosovitskiy et al., [2021](https://arxiv.org/html/2606.04032#bib.bib7)).
Formally, for a single attention head operating on input $X \in \mathbb{R}^{n \times d}$, the attention mechanism computes:
$$A_h = \text{Softmax}(\alpha Q_h K_h^T) V_h, \tag{1}$$
where $Q_h = XW_q$, $K_h = XW_k$, and $V_h = XW_v$ represent learned linear projections with weight matrices $W_q, W_k, W_v \in \mathbb{R}^{d \times d_k}$. The scaling factor $\alpha = 1/\sqrt{d_k}$ stabilizes gradients during training, where $d_k = d/H$ and $H$ denotes the number of attention heads. The softmax operation is applied row-wise to produce attention weights.
In multi-head attention, $H$ heads compute attention in parallel: $A_1, \ldots, A_H$. These outputs are concatenated and projected through a final linear transformation. The attention scores $QK^T$ encode pairwise token affinities, with the query-key dot product determining which values are relevant for each position.
### 2.2 The necessity of three separate projections
While the QKV formulation has become standard, its necessity remains an open question. Unlike the more parsimonious representations in CNNs (LeCun et al., [1998](https://arxiv.org/html/2606.04032#bib.bib28)), RNNs, or state space models (Gu and Dao, [2023](https://arxiv.org/html/2606.04032#bib.bib36)), Transformers maintain three distinct representations per token. Recent work has begun questioning this design: approaches like linear attention (Katharopoulos et al., [2020](https://arxiv.org/html/2606.04032#bib.bib69)), kernel-based attention (Choromanski et al., [2021](https://arxiv.org/html/2606.04032#bib.bib82)), and attention-free models (Zhai et al., [2021](https://arxiv.org/html/2606.04032#bib.bib70)) suggest that simpler mechanisms may suffice. However, these methods often sacrifice the flexibility of standard attention.
Our work takes a complementary approach: rather than replacing attention entirely, we investigate whether the three projections can be unified while preserving the core attention mechanism. We first introduced this idea in Borji ([2023](https://arxiv.org/html/2606.04032#bib.bib94)) (footnote 1: the first author previously published under the name Ali Borji). Subsequently, Kowsher et al. ([2025](https://arxiv.org/html/2606.04032#bib.bib95)) proposed a similar approach. Several other works are also tangentially related (Fusco et al., [2022](https://arxiv.org/html/2606.04032#bib.bib92); Mai et al., [2023](https://arxiv.org/html/2606.04032#bib.bib91)).
DeepSeek-V2's Multi-Head Latent Attention (MLA) (Liu et al., [2024](https://arxiv.org/html/2606.04032#bib.bib98)) reduces the KV cache by compressing K and V into a shared latent vector that is cached and expanded at inference. Unlike Q-K=V, K and V remain functionally independent after expansion: MLA trades added projection parameters for a richer compressed representation, whereas Q-K=V achieves cache reduction through a simple hard equality constraint.
## 3 Our Approach
Figure 1: Our proposed Projection-Shared Attention Variants. An attention mechanism with 2D positional encoding is denoted as (X)+.
### 3.1 Proposed Projection-Shared Attention Variants
We systematically examine three projection-sharing constraints that progressively reduce the number of learned transformations (Figure [1](https://arxiv.org/html/2606.04032#S3.F1.2)).
Variant 1: Q=K-V. We eliminate the separate query projection, setting $Q=K$:
$$A = \text{Softmax}(\alpha K K^T) V. \tag{2}$$
This formulation produces a symmetric attention matrix $KK^T$. Symmetric attention has been explored in prior work on graph neural nets (Veličković et al., [2018](https://arxiv.org/html/2606.04032#bib.bib71)) and relational reasoning (Santoro et al., [2017](https://arxiv.org/html/2606.04032#bib.bib81)), where the lack of directional bias can be beneficial. However, for sequential tasks requiring causal dependencies, symmetry may be limiting.
To address this, we introduce (Q=K-V)+, which injects asymmetry via 2D positional encodings. We first construct a fixed 2D sinusoidal positional encoding $P \in \mathbb{R}^{n \times n \times m}$ (Vaswani et al., [2017](https://arxiv.org/html/2606.04032#bib.bib3)). The $n \times n$ attention map is then broadcast along the channel dimension and added to $P$. To map the resulting tensor back to a 2D attention matrix, we apply a $1 \times 1$ convolution (equivalently, a linear projection across channels). This design is inspired by relative positional encodings (Shaw et al., [2018](https://arxiv.org/html/2606.04032#bib.bib83); Huang et al., [2020](https://arxiv.org/html/2606.04032#bib.bib12)) and 2D positional embeddings in vision Transformers (Dosovitskiy et al., [2021](https://arxiv.org/html/2606.04032#bib.bib7)). See Appendix [A.2](https://arxiv.org/html/2606.04032#A1.SS2) for the full construction.
Variant 2: Q-K=V. We unify the key and value projections, setting $V=K$:
$$A = \text{Softmax}(\alpha Q K^T) K. \tag{3}$$
This formulation preserves asymmetric attention maps since $Q$ and $K$ remain independent. The constraint that keys and values share representations can be viewed as imposing a form of weight tying (Press and Wolf, [2017](https://arxiv.org/html/2606.04032#bib.bib11)), which has proven effective in language modeling.
Variant 3: Q=K=V. The most aggressive simplification uses a single projection for all three roles:
$$A = \text{Softmax}(\alpha K K^T) K. \tag{4}$$
This combines the symmetric attention of variant one with the representational bottleneck of variant two. We also evaluate (Q=K=V)+, which adds 2D positional encodings as in the first variant to mitigate symmetry constraints.
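
Equations (1) to (4) differ only in which projections are tied, which a short single-head NumPy sketch makes explicit. This is an illustrative sketch rather than the released code: the function name `attention_variant`, the `variant` strings, and the absence of masking, multiple heads, and the output projection are all assumptions made for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract row max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_variant(X, W_q, W_k, W_v, variant="QKV"):
    """Single-head attention under a projection-sharing constraint.

    variant is one of "QKV", "Q=K-V", "Q-K=V", "Q=K=V".
    Returns (output, pre-softmax scores, attention weights).
    Tied projections are simply not used (e.g. W_q is ignored when Q = K).
    """
    K = X @ W_k
    Q = K if variant in ("Q=K-V", "Q=K=V") else X @ W_q
    V = K if variant in ("Q-K=V", "Q=K=V") else X @ W_v
    alpha = 1.0 / np.sqrt(K.shape[-1])   # scaling factor 1 / sqrt(d_k)
    scores = alpha * (Q @ K.T)
    weights = softmax(scores, axis=-1)   # row-wise softmax
    return weights @ V, scores, weights
```

With `variant="Q=K-V"` or `"Q=K=V"` the returned `scores` matrix equals its own transpose, which is the symmetry discussed above; with `"Q-K=V"` the scores stay asymmetric and the output is a weighted sum of the keys themselves.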
##### Scope of (X)+ variants
The 2D positional encoding in the (X)+ variants is targeted at non-causal settings (vision, synthetic tasks) where symmetric attention from $Q=K$ is the principal limitation. Causal language modeling already enforces asymmetry via the causal mask, so (X)+ addresses a problem that does not meaningfully exist there; we therefore evaluate (X)+ only on non-causal tasks (Tables [2](https://arxiv.org/html/2606.04032#S4.T2) and [3](https://arxiv.org/html/2606.04032#S4.T3)) and treat it as a task-specific heuristic rather than a universal augmentation.
### 3.2 Combining Projection Sharing with Head Sharing
Our projection-sharing approach operates on a different axis than recent head-sharing methods, enabling compound optimizations.
Head sharing mechanisms. Grouped Query Attention (GQA) (Ainslie et al., [2023](https://arxiv.org/html/2606.04032#bib.bib65)) and Multi-Query Attention (MQA) (Shazeer, [2019](https://arxiv.org/html/2606.04032#bib.bib88)) reduce memory by sharing key-value heads across multiple query heads. In GQA-$g$, $H$ query heads attend to $g < H$ shared KV heads. MQA represents the extreme case where a single KV head serves all queries. These methods have demonstrated strong empirical performance: MQA powers models like PaLM (Chowdhery et al., [2022](https://arxiv.org/html/2606.04032#bib.bib72)) and Falcon (Almazrouei et al., [2023](https://arxiv.org/html/2606.04032#bib.bib79)), while GQA is adopted in Llama 2 (Touvron et al., [2023](https://arxiv.org/html/2606.04032#bib.bib80)) and Mistral (Jiang et al., [2023](https://arxiv.org/html/2606.04032#bib.bib73)).
Orthogonal combination. Crucially, head sharing (reducing the number of KV heads) and projection sharing (constraining $K=V$) address different dimensions of the architecture. They can be combined multiplicatively:
- Q-GQA-$g$: Apply the K=V constraint within each of the $g$ GQA groups, yielding a cache reduction of $1 - \frac{g}{2H}$.
- Q-MQA: Apply the K=V constraint to the single MQA head, achieving near-maximal cache compression.
For example, GQA-4 alone provides 75% cache reduction (4 groups vs. 16 heads). Adding K=V (Q-GQA-4) halves each group's cache, yielding 87.5% total reduction. Q-MQA achieves 96.9% reduction, approaching the theoretical limit for cache-based Transformers while maintaining practical model quality, as we demonstrate in Section [4.3](https://arxiv.org/html/2606.04032#S4.SS3). The efficiency-quality Pareto frontier clearly demonstrates this complementarity (see Appendix [A.4](https://arxiv.org/html/2606.04032#A1.SS4), Figure [10](https://arxiv.org/html/2606.04032#A1.F10)).
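
The percentages quoted above follow from counting cached tensors per token, and can be checked with a few lines of Python. The helper name `kv_cache_reduction` and its arguments are hypothetical; the calculation assumes every head has the same dimension and that the baseline caches one K and one V tensor per head.

```python
def kv_cache_reduction(num_heads, kv_heads=None, share_kv=False):
    """Fraction of the KV cache saved relative to standard multi-head QKV.

    num_heads: number of query heads H in the baseline.
    kv_heads:  number of cached KV heads g (H for no head sharing, 1 for MQA).
    share_kv:  True when K = V, so only one tensor per KV head is cached.
    """
    if kv_heads is None:
        kv_heads = num_heads
    tensors_per_head = 1 if share_kv else 2
    baseline = 2 * num_heads  # K and V for every head
    return 1.0 - (kv_heads * tensors_per_head) / baseline

H = 16  # heads in the 300M configuration
for name, g, shared in [("Q-K=V", H, True), ("GQA-4", 4, False),
                        ("MQA", 1, False), ("Q-GQA-4", 4, True),
                        ("Q-MQA", 1, True)]:
    print(f"{name}: {100 * kv_cache_reduction(H, g, shared):.1f}% cache reduction")
```

With `share_kv=True` the expression reduces to $1 - \frac{g}{2H}$, the formula given for Q-GQA-$g$.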
### 3.3 Computational and Memory Analysis
Table [1](https://arxiv.org/html/2606.04032#S3.T1) compares the computational complexity and parameter counts of our variants against standard QKV attention. Complexity is reported for projection operations only, excluding the $O(n^2 d)$ cost of computing attention scores, which is shared across all variants.
For Q=K-V and Q-K=V attention, projection complexity is $2nd^2$ versus $3nd^2$ for QKV, a 33% reduction. Parameter counts decrease proportionally ($2d^2$ vs. $3d^2$). The (X)+ variant adds $n^2 m$ operations and $m$ parameters for positional encoding, so (Q=K-V)+ remains cheaper than QKV when $2nd^2 + n^2 m < 3nd^2$, that is, when $nm < d^2$. For instance, with $m=100$ and $d=1000$, (Q=K-V)+ is more efficient than QKV for sequences below 10,000 tokens. Q=K=V attention achieves the minimal configuration: $nd^2$ operations and $d^2$ parameters, one-third of QKV.
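
The break-even condition can be verified numerically. The sketch below only encodes the operation counts stated in this section; the function names are hypothetical and constant factors are ignored, as in the text.

```python
def projection_ops(n, d, num_projections, m=0):
    """Projection cost: num_projections * n * d^2, plus n^2 * m for (X)+ variants."""
    return num_projections * n * d**2 + n**2 * m

def break_even_length(d, m):
    """Sequence length at which a two-projection (X)+ variant matches QKV cost.

    Solves 2*n*d^2 + n^2*m = 3*n*d^2 for n, giving n = d^2 / m.
    """
    return d**2 / m

d, m = 1000, 100
print(break_even_length(d, m))  # sequence length where the two costs coincide
for n in (5_000, 20_000):
    plus = projection_ops(n, d, num_projections=2, m=m)
    qkv = projection_ops(n, d, num_projections=3)
    print(n, "(Q=K-V)+ cheaper" if plus < qkv else "QKV cheaper")
```

Below the break-even length the positional term $n^2 m$ is smaller than the saved projection $nd^2$; above it, the quadratic term dominates.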
Table 1: Comparison of proposed Transformers and QKV baseline in terms of computational complexity and parameter count. $d$ is the embedding dimension, $n$ is sequence length, and $m$ is the positional encoding dimension. Complexity excludes the shared $O(n^2 d)$ attention score computation. Positional embeddings use fixed sinusoidal features (not learned).
Practical deployment benefits. While parameter reductions are modest (self-attention projections constitute only ~30% of total Transformer parameters), the inference memory benefits are substantial. During autoregressive generation, Transformers cache past key-value states to avoid redundant computation (Vaswani et al., [2017](https://arxiv.org/html/2606.04032#bib.bib3)). Standard QKV and Q=K-V attention must cache both $K$ and $V$ separately. In contrast, Q-K=V and Q=K=V cache only the $K$ tensor, since $V$ can be reused from $K$. This yields 50% KV cache reduction, enabling:
- 2$\times$ longer context window for the same memory budget
- 2$\times$ higher throughput (concurrent users per GPU)
- 40–50% reduction in serving costs for memory-bound deployments
Recent work highlights KV cache as the primary bottleneck for long-context LLM serving (Pope et al., [2023](https://arxiv.org/html/2606.04032#bib.bib19); Liu et al., [2023](https://arxiv.org/html/2606.04032#bib.bib74)). Our approach complements cache optimization techniques including quantization (Dettmers and others, [2023](https://arxiv.org/html/2606.04032#bib.bib75); Xiao et al., [2023](https://arxiv.org/html/2606.04032#bib.bib54)), offloading (Sheng and others, [2023](https://arxiv.org/html/2606.04032#bib.bib76)), and windowed attention (Child et al., [2019](https://arxiv.org/html/2606.04032#bib.bib77); Beltagy et al., [2020](https://arxiv.org/html/2606.04032#bib.bib78)).
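
The K-only cache can be sketched for a single head as follows. This is an illustrative NumPy example under simplifying assumptions (one head, no batching, no output projection); the class `SharedKVCache` and the reference function `full_causal_shared_kv` are hypothetical names, not part of the released code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class SharedKVCache:
    """Autoregressive Q-K=V decoding that stores a single tensor per head."""

    def __init__(self, d_k):
        self.K = np.zeros((0, d_k))  # cached keys, reused as values

    def step(self, x_t, W_q, W_k):
        q = x_t @ W_q
        k = x_t @ W_k
        self.K = np.vstack([self.K, k[None, :]])      # append the new key only
        w = softmax(self.K @ q / np.sqrt(q.shape[-1]))
        return w @ self.K                             # values are the cached keys

def full_causal_shared_kv(X, W_q, W_k):
    """Reference: Q-K=V attention over the whole sequence with a causal mask."""
    Q, K = X @ W_q, X @ W_k
    n = X.shape[0]
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    mask = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(mask, scores, -np.inf)          # block future positions
    return softmax(scores, axis=-1) @ K
```

Feeding tokens one at a time through `SharedKVCache.step` reproduces the rows of `full_causal_shared_kv`, while the cache holds one `(n, d_k)` array instead of two.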
### 3.4 Design Considerations
Diagonal dominance in symmetric attention. Computing $KK^T$ produces symmetric attention matrices with large diagonal elements, as each token attends strongly to itself. Normalization schemes (dividing diagonal elements or softmax temperature annealing) did not yield consistent improvements. Q-K=V naturally avoids this by computing $QK^T$, preserving the off-diagonal attention distribution of standard transformers.
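
The diagonal-dominance effect is easy to reproduce on synthetic data. The sketch below uses random Gaussian projections as a stand-in for trained keys and queries, which is an assumption: the diagonal of $KK^T$ holds squared norms $\lVert k_i \rVert^2$, which are always positive, while off-diagonal entries can take either sign. The helper name `mean_self_attention_weight` is hypothetical.

```python
import numpy as np

def mean_self_attention_weight(Q, K):
    """Average softmax weight that each token places on itself."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # stabilize the exponentials
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return float(np.mean(np.diag(w)))

rng = np.random.default_rng(0)
n, d_k = 32, 64
K = rng.normal(size=(n, d_k))   # stand-in for learned key projections
Q = rng.normal(size=(n, d_k))   # independent stand-in for query projections

symmetric = mean_self_attention_weight(K, K)    # Q = K, scores are K K^T
asymmetric = mean_self_attention_weight(Q, K)   # separate Q, scores are Q K^T
print(f"self-attention weight, Q=K: {symmetric:.3f}; separate Q: {asymmetric:.3f}")
```

In this toy setting most of the attention mass sits on the diagonal when $Q=K$, whereas with an independent $Q$ the self-weight is close to the uniform value $1/n$. Trained models need not behave this way.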
Extension to encoder-decoder architectures. While our primary focus is decoder-only models (prevalent in modern LLMs (Brown et al., [2020](https://arxiv.org/html/2606.04032#bib.bib24))), the approach extends to encoder-decoder settings. Tasks requiring cross-attention, such as machine translation (Vaswani et al., [2017](https://arxiv.org/html/2606.04032#bib.bib3)) or vision-language modeling (Alayrac et al., [2022](https://arxiv.org/html/2606.04032#bib.bib14)), can preserve standard QKV or Q-K=V formulations for cross-attention while applying projection sharing to self-attention layers. This is analogous to how MQA is applied selectively in T5 (Raffel et al., [2020](https://arxiv.org/html/2606.04032#bib.bib84)) and other encoder-decoder models.
Synergies with other efficiency techniques. Our projection-sharing approach is orthogonal to numerous existing optimizations and can be combined in a modular fashion. Quantization offers immediate compounding benefits: the KV cache can be quantized to INT8 or INT4 (Dettmers and others, [2023](https://arxiv.org/html/2606.04032#bib.bib75)), yielding multiplicative memory savings (e.g., halving the cache through projection sharing and halving it again through INT8 leaves 25% of the original, a 75% total reduction). Sparse attention mechanisms with local or strided patterns (Child et al., [2019](https://arxiv.org/html/2606.04032#bib.bib77); Zaheer et al., [2020](https://arxiv.org/html/2606.04032#bib.bib18)) reduce the $O(n^2)$ complexity of attention computation, while projection sharing orthogonally reduces the per-token cache footprint. Alternative activations present another avenue: recent work questions the necessity of softmax in attention (Lu et al., [2021](https://arxiv.org/html/2606.04032#bib.bib37); Koohpayegani and Pirsiavash, [2024](https://arxiv.org/html/2606.04032#bib.bib38)), suggesting that softmax-free variants combined with projection sharing could yield further simplifications. Finally, Flash Attention and other hardware-efficient implementations (Dao et al., [2022](https://arxiv.org/html/2606.04032#bib.bib20)) can accelerate our variants, particularly Q=K=V attention, which exhibits the simplest memory access patterns.
When to apply each variant. The choice among attention variants depends on task characteristics:
- Sequential/causal tasks (language modeling): Q-K=V provides the best quality-efficiency trade-off, maintaining asymmetric attention while halving cache.
- Non-causal tasks (vision, set processing): Q=K-V or Q=K=V may suffice, optionally augmented with (X)+ to inject directional bias where symmetric attention limits performance.
- Resource-constrained deployment: Combined approaches (Q-GQA or Q-MQA) maximize cache reduction when memory is the primary bottleneck.
This task-dependent behavior aligns with broader findings in efficient Transformers: no single architecture wins across all domains (Tay et al., [2022](https://arxiv.org/html/2606.04032#bib.bib4)). Our systematic evaluation in Section [4](https://arxiv.org/html/2606.04032#S4) characterizes when each variant is appropriate.
This formulation establishes a principled framework for trading model complexity against performance, a trade-off that becomes increasingly critical as language models scale to billions of parameters and serve millions of users (Kaplan et al., [2020](https://arxiv.org/html/2606.04032#bib.bib89); Hoffmann et al., [2022](https://arxiv.org/html/2606.04032#bib.bib90)).
## 4 Experiments and Results
We evaluate projection-sharing variants across three domains: synthetic reasoning (5 tasks), computer vision (6 tasks), and language modeling (300M and 1.2B parameters on 10B tokens). All models are trained from scratch with matched hyperparameters to isolate architectural effects, except set anomaly detection, which uses pre-trained ResNet34 features (He et al., [2016](https://arxiv.org/html/2606.04032#bib.bib6)). Our goal is controlled comparison of attention mechanisms rather than state-of-the-art performance (Dehghani et al., [2023](https://arxiv.org/html/2606.04032#bib.bib23); Zhu et al., [2019](https://arxiv.org/html/2606.04032#bib.bib22); DeRose et al., [2020](https://arxiv.org/html/2606.04032#bib.bib21)). Synthetic and vision experiments used a single NVIDIA GTX 1080 Ti GPU.
### 4.1 Synthetic tasks
Table 2: Performance on synthetic tasks. Multiple runs over different configurations (such as number of attention heads, embedding dimension, learning rate, sequence length, etc.) are conducted, and the results are averaged.
We focus on five specific tasks outlined below. The input list, which has a predetermined length, consists of numbers ranging from 0 to 9, inclusive of both 0 and 9.
- Reverse: A list of numbers is reversed. For instance, the input list [4, 3, 9, 8, 1] would be transformed into [1, 8, 9, 3, 4].
- Sort: The input list is arranged in ascending order. For example, [4, 3, 9, 8, 1] would be transformed into [1, 3, 4, 8, 9].
- Sub: Each element of the list is subtracted from 9. For example, [4, 3, 9, 8, 1] would be transformed into [5, 6, 0, 1, 8].
- Swap: The first half of an even-length list is exchanged with the second half. For instance, [4, 3, 9, 8, 1, 7] would be transformed into [8, 1, 7, 4, 3, 9].
- Copy: The input list is retained as is. For example, [4, 3, 9, 8, 1] remains unchanged as [4, 3, 9, 8, 1].
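
The five target functions are simple enough to state as code. The sketch below is a hypothetical data generator written from the task descriptions above, not the authors' pipeline; the names `TASKS` and `make_example` and the seeding scheme are assumptions.

```python
import random

def reverse_task(xs):
    return xs[::-1]

def sort_task(xs):
    return sorted(xs)

def sub_task(xs):
    return [9 - x for x in xs]  # each element subtracted from 9

def swap_task(xs):
    if len(xs) % 2 != 0:
        raise ValueError("swap is defined for even-length lists")
    half = len(xs) // 2
    return xs[half:] + xs[:half]  # exchange first and second halves

def copy_task(xs):
    return list(xs)

TASKS = {"reverse": reverse_task, "sort": sort_task, "sub": sub_task,
         "swap": swap_task, "copy": copy_task}

def make_example(task, length, seed=None):
    """Draw a random digit list and return the (input, target) pair for a task."""
    rng = random.Random(seed)
    xs = [rng.randint(0, 9) for _ in range(length)]  # digits 0..9 inclusive
    return xs, TASKS[task](xs)
```

Each target has the same length as its input, which matches the per-token prediction setup described for the encoder.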
Here, only one transformer encoder is used. In training, we feed the input sequence into the encoder to generate predictions for each token in the input. We utilize the standard cross-entropy loss for this purpose. Each number is encoded as a one-hot vector. We apply a gradient clip value of 5 and set the 2D positional embedding dimension (i.e., $m$) to 10. Additionally, we employ the Adam optimizer along with the CosineWarmupScheduler, using a warm-up period of 5.
We perform experiments with different configurations of transformer models by varying the embedding dimension (32, 64, 256), the number of layers (2, 4), the number of heads (2, 4), and the input sequence length (16, 64, 128), with a learning rate of 1e-3. Each configuration is run three times for two epochs, and the results are then averaged across the configurations.
The QKV transformer exhibits faster convergence compared to the Q=K=V and Q=K-V transformers (see loss curves in Appendix [A.3](https://arxiv.org/html/2606.04032#A1.SS3)). However, all transformers demonstrate good performance on synthetic tasks, as indicated by the accuracies presented in Table [2](https://arxiv.org/html/2606.04032#S4.T2). The Q=K-V transformer achieves performance comparable to that of the QKV transformer, whereas the Q=K=V transformer performs considerably worse. Incorporating positional information, (X)+, substantially boosts the performance. Sample self-attention maps over synthetic tasks are shown in Appendix [A.3](https://arxiv.org/html/2606.04032#A1.SS3).
### 4.2 Vision tasks
Table 3: The performance of transformers on vision tasks. The average column does not include the TinyImageNet performance.
We evaluated performance on various vision tasks, including image classification on MNIST (LeCun et al., [1998](https://arxiv.org/html/2606.04032#bib.bib28)), FashionMNIST (Xiao et al., [2017](https://arxiv.org/html/2606.04032#bib.bib27)), CIFAR-10 (Krizhevsky et al., [2009](https://arxiv.org/html/2606.04032#bib.bib29)), CIFAR-100 (Krizhevsky et al., [2009](https://arxiv.org/html/2606.04032#bib.bib29)), and Tiny ImageNet (200 classes; footnote 2: [https://paperswithcode.com/dataset/tiny-imagenet](https://paperswithcode.com/dataset/tiny-imagenet)), as well as anomaly detection.
Classification. We explore various settings for patch size (4, 7), learning rate (1e-3, 1e-4), embedding dimension (64, 256, 512), number of layers (2, 4), and number of heads (2, 4). For each configuration, we performed two experiments, each lasting $k$ epochs. The value of $k$ depends on the dataset: 20 epochs for MNIST and FashionMNIST, 40 epochs for CIFAR-10, and 50 epochs for CIFAR-100. We employ the cross-entropy loss function and the Adam optimizer with the MultiStepLR scheduler. For 2D positional encoding, we set the positional encoding dimension to 50.
As indicated in Table [3](https://arxiv.org/html/2606.04032#S4.T3), the (Q=K-V)+ transformer exhibits performance comparable to that of the QKV transformer on the MNIST, FashionMNIST, and CIFAR datasets. The Q=K=V transformer, while slightly behind these two variants on MNIST and FashionMNIST, still performs at a reasonably competitive level on the CIFAR datasets.
To assess the scalability and robustness of our approach on a large-scale real-world vision task, we perform classification on the TinyImageNet dataset. Its training set contains 100K images across 200 classes; each class has 500 training images, 50 validation images, and 50 test images. We use a Vision Transformer (ViT) model configured with the following parameters: image size of 224, patch size of 16, 200 classes, embedding dimension of 768, 12 layers, 12 attention heads, MLP dimension of 3072, and a dropout rate of 0.1. The optimization process and loss function are as above. All models were trained from scratch (i.e., no use of pretrained backbones). We evaluate three self-attention variants, each run twice. Figure [2](https://arxiv.org/html/2606.04032#S4.F2.fig1) shows the training loss and validation accuracy over epochs. Numerical results are provided in Table [3](https://arxiv.org/html/2606.04032#S4.T3). The corresponding training times per epoch are 40, 35, and 32 minutes on GPU, demonstrating improved efficiency with small impact on accuracy. Notably, the Q=K=V Transformer, despite employing only one projection, achieves the best results in this instance. Continued training over more epochs could potentially close the performance gap between the Transformer architectures.
Figure 2: Training loss and validation accuracy of attention variants for image classification on the TinyImageNet dataset.
Set Anomaly Detection. We applied transformers to sets (i.e., unordered inputs). A model is trained to find the odd one out in a set of ten images, using the CIFAR-100 dataset. Nine images are from one class, and one is different. Two sample sets are shown in Figure [6](https://arxiv.org/html/2606.04032#A1.F6) (Appendix [A.3.2](https://arxiv.org/html/2606.04032#A1.SS3.SSS2)). CIFAR-100 has 60K 32$\times$32 images over 100 classes (600 per class). See Appendix [A.3.2](https://arxiv.org/html/2606.04032#A1.SS3.SSS2) for details on this task.
The second-to-last column of Table [3](https://arxiv.org/html/2606.04032#S4.T3) presents the results of this experiment. It shows comparable performance across models, with (Q=K-V)+ exhibiting a slight advantage.
Image Segmentation. Hwa et al. ([2025](https://arxiv.org/html/2606.04032#bib.bib93)) extended our earlier work (Borji, [2023](https://arxiv.org/html/2606.04032#bib.bib94)) by applying QKV and Q=K-V attention variants to semantic segmentation of abdominal MRI slices, labeling pixels across three categories (large bowel, small bowel, and stomach). They found that the Q=K-V variant remained competitive with standard QKV attention even in this larger-scale, more complex setting. See Appx. [A.3.3](https://arxiv.org/html/2606.04032#A1.SS3.SSS3).
### 4.3 NLP tasks
Dataset and Scale. We trained 300M and 1.2B parameter GPT-style language models on up to 10B tokens from the SlimPajama dataset (Systems, [2023](https://arxiv.org/html/2606.04032#bib.bib13)), a cleaned and deduplicated subset of RedPajama. The 300M models were trained for 4,238 steps (~10B tokens), while 1.2B models were trained for 8,475 steps (~10B tokens) to validate scaling behavior.
Model Architecture. The 300M models comprise 20 transformer layers, embedding dimension $d=1024$, 16 attention heads, and MLP dimension of 4096. The 1.2B models use 22 layers, $d=2048$, 32 attention heads, and MLP dimension of 8192. All models use a vocabulary size of 50,304 tokens. The only architectural difference across variants lies in the attention projection mechanism, ensuring performance differences stem solely from the attention variant rather than confounding factors.
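
As a sanity check on the stated sizes, the parameter counts can be approximated from these hyperparameters. The sketch below makes several assumptions that this section does not confirm: no bias terms, normalization and position parameters ignored, a two-matrix MLP, tied input and output embeddings, and four $d \times d$ attention matrices (Q, K, V, and the output projection) in the baseline. The function name `approx_params` is hypothetical.

```python
def approx_params(layers, d, mlp_dim, vocab, attn_matrices=4):
    """Rough parameter count for a GPT-style decoder.

    attn_matrices: number of d x d attention matrices per layer
                   (4 for QKV plus output projection, 3 when one projection is tied).
    """
    attention = attn_matrices * d * d
    mlp = 2 * d * mlp_dim            # up- and down-projection
    embedding = vocab * d            # assumed tied with the output layer
    return layers * (attention + mlp) + embedding

small = approx_params(layers=20, d=1024, mlp_dim=4096, vocab=50_304)
large = approx_params(layers=22, d=2048, mlp_dim=8192, vocab=50_304)
small_tied = approx_params(20, 1024, 4096, 50_304, attn_matrices=3)

print(f"300M config: {small / 1e6:.0f}M parameters")
print(f"1.2B config: {large / 1e9:.2f}B parameters")
print(f"300M config with one tied projection: {small_tied / 1e6:.0f}M parameters")
```

Under these assumptions the estimates land close to the nominal 300M and 1.2B sizes, and attention accounts for one third of each layer's weights, in line with the roughly 30% share quoted in Section 3.3.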
Training Infrastructure\.Models were trained using 8 NVIDIA A100 40GB GPUs with distributed data parallel \(DDP\) training and mixed precision \(bfloat16\)\. We used the AdamW optimizer withβ1=0\.9\\beta\_\{1\}=0\.9,β2=0\.95\\beta\_\{2\}=0\.95, weight decay of 0\.1, and a cosine learning rate schedule with linear warmup\. Gradient clipping was applied with a maximum norm of 1\.0\. Complete training and architectural details \(activation, normalization, tokenizer, dropout, warmup, gradient accumulation, evaluation cadence\) are provided in Appendix[A\.5](https://arxiv.org/html/2606.04032#A1.SS5)\.
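The cosine schedule with linear warmup mentioned above can be sketched in a few lines. This is a minimal illustration, not the training code: the peak learning rate, minimum learning rate, and warmup length below are hypothetical placeholders (the actual values are deferred to Appendix A.5), and only the 4,238-step run length comes from the text. AdamW and gradient clipping are not shown.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 4238    # 300M run length stated in the text
WARMUP_STEPS = 200    # hypothetical, for illustration only
PEAK_LR = 3e-4        # hypothetical, for illustration only
MIN_LR = 3e-5         # hypothetical, for illustration only

schedule = [
    lr_at_step(s, TOTAL_STEPS, WARMUP_STEPS, PEAK_LR, MIN_LR)
    for s in range(TOTAL_STEPS + 1)
]
```

The schedule rises linearly during warmup, reaches the peak at the end of warmup, and decays smoothly to the minimum at the final step.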
#### 4\.3\.1 Main Results: Language Model Quality
Table [4](https://arxiv.org/html/2606.04032#S4.T4) presents the primary results from training 300M parameter language models on SlimPajama\. These results reveal several surprising findings that challenge conventional assumptions about attention mechanisms\.
Table 4: Comparison of attention variants on 300M parameter language models trained on 10B tokens from SlimPajama\. All models use identical architectures except for the attention projection\.

| Variant | PPL degradation vs\. QKV baseline | Notes |
| --- | --- | --- |
| Q\-K=V | \+3\.1% | Best proj\. variant, 50% cache ↓ |
| Q=K\-V | \+4\.9% | No cache benefit |
| Q=K=V | \+25\.4% | Not recommended |
| GQA\-4 | \+0\.7% | 75% cache ↓ |
| MQA | \+1\.5% | 93\.8% cache ↓ |
| Q\-GQA\-4 | \+3\.9% | 87\.5% cache ↓ |
| Q\-MQA | \+4\.8% | 96\.9% cache ↓ |

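The cache percentages in Table 4 follow from two independent factors: the fraction of KV heads that are kept (head sharing) and whether one or two tensors are cached per head (projection sharing). The sketch below assumes that GQA-4 means 4 KV heads and uses the 16 attention heads of the 300M configuration; the function and variable names are illustrative.

```python
def kv_cache_fraction(num_heads, num_kv_heads, share_kv):
    """Fraction of the baseline cache (all heads, separate K and V) that remains."""
    head_fraction = num_kv_heads / num_heads      # GQA / MQA keep fewer KV heads
    tensor_fraction = 0.5 if share_kv else 1.0    # K=V stores one tensor, not two
    return head_fraction * tensor_fraction

NUM_HEADS = 16  # 300M configuration

# (number of KV heads, whether K and V are shared)
variants = {
    "QKV": (16, False),
    "Q-K=V": (16, True),
    "GQA-4": (4, False),
    "MQA": (1, False),
    "Q-GQA-4": (4, True),
    "Q-MQA": (1, True),
}

reductions = {
    name: 100.0 * (1.0 - kv_cache_fraction(NUM_HEADS, kv_heads, shared))
    for name, (kv_heads, shared) in variants.items()
}

for name, pct in reductions.items():
    print(f"{name:8s} cache reduction: {pct:.1f}%")
```

Because the two factors multiply, projection sharing halves whatever cache remains after head sharing, which is why the two techniques compose.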
Q\-K=V emerges as the clear winner among the proposed attention mechanisms\. Surprisingly, this variant achieves better quality than Q=K\-V attention despite identical parameter counts and computational costs: validation perplexity of 5\.27 vs\. 5\.36, only 3\.1% worse than the QKV baseline\. This challenges the intuition that Query and Key projections are equally important: our results suggest that an independent Value projection is less critical for maintaining model quality than keeping Q distinct from K\. Validation curves show that Q\-K=V tracks the baseline closely throughout training \(see Appendix [A\.4](https://arxiv.org/html/2606.04032#A1.SS4), Figure [11](https://arxiv.org/html/2606.04032#A1.F11)\)\. While Q=K\-V attention achieves competitive training performance \(4\.9% worse than baseline\), it offers no inference benefits over standard QKV attention, as we detail in Section [4\.3\.3](https://arxiv.org/html/2606.04032#S4.SS3.SSS3)\. This makes Q=K\-V attention less suitable for practical deployment despite its good training quality\. The Q=K=V variant, despite using 50% fewer attention parameters, suffers a catastrophic quality loss, with 25\.4% worse perplexity\. Forcing Q, K, and V to share a single projection is too restrictive for language modeling\.
Training efficiency\. All variants achieve similar training throughput \(423k\-460k tokens/second\), with the Q=K=V variant being slightly faster due to reduced projection overhead\. However, these speed differences are marginal \(8\.7% at most\) and do not compensate for quality losses\. Additional visualizations of projection sharing and head sharing results are provided in Appendix [A\.4](https://arxiv.org/html/2606.04032#A1.SS4) \(Figures [8](https://arxiv.org/html/2606.04032#A1.F8) and [9](https://arxiv.org/html/2606.04032#A1.F9)\)\.
#### 4\.3\.2 Parameter Count and Compute
Table [5](https://arxiv.org/html/2606.04032#S4.T5) breaks down the parameter distribution across model components\. Although attention parameter reductions are substantial \(25\-50%\), they translate to modest overall savings because attention projections constitute only about one\-third of total parameters in transformer models\. The parameter and compute improvements are therefore modest; the true benefit of Q\-K=V attention lies in inference memory efficiency, as we demonstrate next\.
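The 25-50% attention reduction and the modest overall saving can be reproduced with a rough parameter count. The sketch below is an approximation under stated assumptions, not the paper's accounting: it assumes four attention matrices per layer (Q, K, V, and an output projection), a two-matrix MLP, an embedding tied with the LM head, and it ignores biases, normalization, and positional parameters.

```python
def transformer_params(n_layers, d_model, d_mlp, vocab, attn_projections=4):
    """Approximate weight count under the simplifying assumptions above."""
    attn = attn_projections * d_model * d_model   # Q, K, V, O by default
    mlp = 2 * d_model * d_mlp                     # up and down projections
    embed = vocab * d_model                       # assumed tied with the LM head
    return {
        "attn": n_layers * attn,
        "mlp": n_layers * mlp,
        "embed": embed,
        "total": n_layers * (attn + mlp) + embed,
    }

cfg = dict(n_layers=20, d_model=1024, d_mlp=4096, vocab=50304)  # 300M models
qkv = transformer_params(**cfg, attn_projections=4)
two_proj = transformer_params(**cfg, attn_projections=3)   # Q-K=V or Q=K-V
one_proj = transformer_params(**cfg, attn_projections=2)   # Q=K=V

attn_cut_two = 1 - two_proj["attn"] / qkv["attn"]
attn_cut_one = 1 - one_proj["attn"] / qkv["attn"]
total_cut_two = 1 - two_proj["total"] / qkv["total"]

print(f"QKV total: {qkv['total'] / 1e6:.0f}M")
print(f"attention cut: {attn_cut_two:.0%} / {attn_cut_one:.0%}, overall cut: {total_cut_two:.1%}")
```

Under these assumptions attention holds one-third of each transformer block's weights (somewhat less of the whole model once the embedding is counted), so removing one of four attention matrices trims the total by only a few percent. With the 1.2B configuration (22 layers, d=2048), one projection per layer amounts to roughly 92M weights, in line with the 1,215M and 1,123M counts reported for QKV and Q-K=V.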
Table 5: Parameter count analysis for 300M parameter models\. Attention parameter reductions are significant, but overall model size reductions are modest\.
Table [6](https://arxiv.org/html/2606.04032#S4.T6) shows inference computational costs \(multiply\-accumulate operations\) at sequence length 2048\. The computational savings \(5\.4% for Q=K\-V and Q\-K=V, 10\.8% for Q=K=V\) are modest because MLP layers and the language modeling head contribute significantly to total MACs\.
Table 6: Inference computational cost \(MACs\) at sequence length 2048\. Attention savings are diluted by MLP and LM head costs\.
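The dilution effect can be illustrated with a simple per-token MAC model. This is a sketch under assumptions that are not taken from the paper: each token attends to all 2048 positions (no causal halving), attention uses four projection matrices (Q, K, V, output), the MLP has two matrices, and softmax, normalization, and embedding lookups are ignored.

```python
def macs_per_token(n_layers, d_model, d_mlp, vocab, seq_len, attn_projections=4):
    """Approximate multiply-accumulates needed to process one token."""
    proj = attn_projections * d_model * d_model   # attention projection matrices
    mix = 2 * seq_len * d_model                   # QK^T scores + weighted sum of values
    mlp = 2 * d_model * d_mlp                     # MLP up and down projections
    lm_head = vocab * d_model                     # output logits
    return n_layers * (proj + mix + mlp) + lm_head

cfg = dict(n_layers=20, d_model=1024, d_mlp=4096, vocab=50304, seq_len=2048)
base = macs_per_token(**cfg)
saving_two_proj = 100 * (1 - macs_per_token(**cfg, attn_projections=3) / base)
saving_one_proj = 100 * (1 - macs_per_token(**cfg, attn_projections=2) / base)

print(f"one projection removed:  {saving_two_proj:.1f}% fewer MACs")
print(f"two projections removed: {saving_one_proj:.1f}% fewer MACs")
```

With these assumptions the model lands on the same savings quoted above, which suggests that the reported figures come from a comparable accounting.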
#### 4\.3\.3 KV Cache Memory Analysis
This section reveals why Q\-K=V attention is transformative for practical deployment\. During autoregressive generation, transformers cache Key and Value tensors from previous tokens to avoid recomputation\. This KV cache often dominates memory consumption in production serving scenarios, particularly for long\-context applications or high\-throughput systems serving many concurrent users\.
Table 7: KV cache memory requirements\. Q\-K=V achieves 50% cache reduction, a benefit that Q=K\-V attention cannot provide despite competitive training quality\.
Table [7](https://arxiv.org/html/2606.04032#S4.T7) reveals a critical distinction: Q=K\-V attention provides zero cache savings because it still requires caching both K and V tensors separately\. In contrast, Q\-K=V attention \(K=V\) achieves 50% cache reduction by storing only K and reusing it as V during generation\. The Q=K=V variant also achieves 50% savings, but at the cost of a large quality loss\.
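A minimal single-head sketch of decoding with a shared key-value cache is shown below. It is an illustration of the idea rather than the released implementation: it ignores multi-head structure, positional encodings, and batching, and the names `SharedKVCache` and `decode_step` are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

class SharedKVCache:
    """Stores one tensor per position; it is read both as keys and as values."""
    def __init__(self, d_head):
        self.k = np.zeros((0, d_head))

    def append(self, k_new):
        self.k = np.vstack([self.k, k_new[None, :]])

def decode_step(x_t, w_q, w_kv, cache):
    """One autoregressive step of a Q-K=V head."""
    q = x_t @ w_q
    cache.append(x_t @ w_kv)                      # a single projection feeds the cache
    scores = cache.k @ q / np.sqrt(q.shape[-1])   # cached entries act as keys
    weights = softmax(scores)
    return weights @ cache.k                      # the same entries act as values

rng = np.random.default_rng(0)
d_model, d_head, steps = 16, 8, 5
w_q = rng.normal(size=(d_model, d_head))
w_kv = rng.normal(size=(d_model, d_head))
xs = rng.normal(size=(steps, d_model))

cache = SharedKVCache(d_head)
outputs = np.stack([decode_step(x, w_q, w_kv, cache) for x in xs])
```

After five steps the cache holds a single array of shape (5, 8), whereas a standard QKV or Q=K-V head would need a second array of the same size for V.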
Practical impact at scale\. For longer contexts, the memory savings become dramatic\. At 32k tokens: QKV and Q=K\-V require 2\.62 GB, Q\-K=V requires 1\.31 GB \(50% savings\)\. At 128k tokens: QKV and Q=K\-V require 10\.49 GB, Q\-K=V requires 5\.24 GB \(50% savings\)\. For a batch size of 32 with 32k tokens, memory usage is reduced from 83\.9 GB to 41\.9 GB, yielding a VRAM savings of 42 GB\.
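These figures can be reproduced from the 300M architecture with a short calculation. The sketch assumes 16-bit cache entries, reads "32k" and "128k" as 32,000 and 128,000 tokens, and reports decimal gigabytes; these conventions are inferred from the numbers rather than stated in the text.

```python
def kv_cache_bytes(n_layers, d_model, seq_len, batch=1, cached_tensors=2, bytes_per_value=2):
    """KV cache size: one d_model-wide vector per layer, position, and cached tensor."""
    return cached_tensors * n_layers * d_model * seq_len * batch * bytes_per_value

LAYERS, D_MODEL = 20, 1024   # 300M models
GB = 1e9                     # decimal gigabytes (assumed)

report = {}
for label, seq_len, batch in (
    ("32k", 32_000, 1),
    ("128k", 128_000, 1),
    ("32k, batch 32", 32_000, 32),
):
    both = kv_cache_bytes(LAYERS, D_MODEL, seq_len, batch, cached_tensors=2) / GB
    shared = kv_cache_bytes(LAYERS, D_MODEL, seq_len, batch, cached_tensors=1) / GB
    report[label] = (both, shared)
    print(f"{label}: QKV / Q=K-V {both:.2f} GB, Q-K=V {shared:.2f} GB")
```

The cache grows linearly in both sequence length and batch size, so the absolute saving from halving it grows at the same rate.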
Real\-world deployment scenario\. Consider deploying a code completion model with 32k context serving 100 concurrent users on A100 40GB GPUs: 1\) QKV or Q=K\-V: KV cache of 2\.62 GB per user → 15 users per GPU → requires 7 GPUs \($14k/month\); 2\) Q\-K=V: KV cache of 1\.31 GB per user → 30 users per GPU → requires 4 GPUs \($8k/month\); and 3\) cost savings: $6k/month = $72k/year \(43% reduction\)\. We confirm these projections with end\-to\-end inference benchmarks on a single A100 \(Tables [14](https://arxiv.org/html/2606.04032#A1.T14) and [15](https://arxiv.org/html/2606.04032#A1.T15) in Appendix [A\.4](https://arxiv.org/html/2606.04032#A1.SS4)\)\.
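The scenario arithmetic can be checked as follows. The sketch assumes, as the scenario implicitly does, that the full 40 GB of each GPU is available for KV cache (model weights and activations are ignored) and that a GPU costs $2k per month, which is the rate implied by $14k for 7 GPUs.

```python
import math

def gpus_needed(users, gpu_mem_gb, cache_gb_per_user):
    """Users that fit on one GPU, and GPUs required for all users."""
    users_per_gpu = math.floor(gpu_mem_gb / cache_gb_per_user)
    return users_per_gpu, math.ceil(users / users_per_gpu)

USERS = 100
GPU_MEM_GB = 40
COST_PER_GPU_MONTH = 2_000   # implied by the figures in the text, not stated directly

qkv_users, qkv_gpus = gpus_needed(USERS, GPU_MEM_GB, 2.62)
kv_users, kv_gpus = gpus_needed(USERS, GPU_MEM_GB, 1.31)

qkv_cost = qkv_gpus * COST_PER_GPU_MONTH
kv_cost = kv_gpus * COST_PER_GPU_MONTH
monthly_saving = qkv_cost - kv_cost
relative_saving = 100 * monthly_saving / qkv_cost

print(f"QKV:   {qkv_users} users/GPU, {qkv_gpus} GPUs, ${qkv_cost}/month")
print(f"Q-K=V: {kv_users} users/GPU, {kv_gpus} GPUs, ${kv_cost}/month")
print(f"saving: ${monthly_saving}/month ({relative_saving:.0f}%)")
```

The relative saving is below 50% because GPU counts are rounded up to whole devices.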
This analysis reveals that Q\-K=V is the only 2\-projection variant with practical deployment advantages\. Q=K\-V attention, despite achieving slightly better training quality in some configurations, offers no cache benefits and should be avoided for production deployment\.
#### 4\.3\.4 Scaling with Sequence Length
Table [8](https://arxiv.org/html/2606.04032#S4.T8) shows how computational costs scale with sequence length\. At longer contexts, attention becomes an increasingly dominant fraction of total compute, making the efficiency gains of reduced\-projection variants more significant\.
Table 8: Attention MACs \(% of total\) across sequence lengths; longer contexts amplify efficiency gains\.
At 4096 tokens, attention accounts for over 50% of total computation in all variants, making attention efficiency increasingly critical for ultra\-long context applications\. The benefits of reduced\-projection attention therefore become more pronounced as context lengths increase, a crucial consideration for modern LLMs that increasingly target 32k, 128k, or even longer contexts\. The relative rankings across all variants remain stable \(Table [16](https://arxiv.org/html/2606.04032#A1.T16) in Appendix [A\.4](https://arxiv.org/html/2606.04032#A1.SS4)\)\.
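The growth of the attention share with context length can be illustrated with a simplified cost model for the standard QKV configuration. The accounting here is an assumption (full, non-causal attention span per token; four attention matrices; two-matrix MLP; LM head included), so the exact percentages in Table 8 may differ.

```python
def attention_fraction(seq_len, n_layers=20, d_model=1024, d_mlp=4096,
                       vocab=50304, attn_projections=4):
    """Share of per-token MACs spent in attention (projections plus token mixing)."""
    attn = n_layers * (attn_projections * d_model**2 + 2 * seq_len * d_model)
    mlp = n_layers * 2 * d_model * d_mlp
    lm_head = vocab * d_model
    return attn / (attn + mlp + lm_head)

lengths = (512, 1024, 2048, 4096)
fractions = {n: attention_fraction(n) for n in lengths}

for n, frac in fractions.items():
    print(f"seq_len={n:5d}: attention = {frac:.1%} of MACs")
```

Only the token-mixing term depends on sequence length, so it is this term, not the projections, that drives the attention share upward at long contexts.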
Table 9:1\.2B parameter models trained on 10B tokens\.
#### 4\.3\.5 Scaling to 1\.2B parameters
To validate our findings at larger scale, we trained 1\.2B parameter models \(22 layers, 2048 embedding dimension, 32 attention heads\) on 10B tokens from SlimPajama\.
Architecture scaling\. The 1\.2B models maintain the same architectural patterns as our 300M experiments, with parameter counts of 1,215M \(QKV\), 1,123M \(Q\-K=V\), 1,077M \(GQA\-8\), 1,036M \(MQA\), 1,054M \(Q\-GQA\-8\), and 1,033M \(Q\-MQA\)\. See Table [9](https://arxiv.org/html/2606.04032#S4.T9)\.
Quality preservation at scale\. Our findings generalize effectively to larger models\. MQA achieves near\-parity with QKV \(5\.057 vs\. 5\.004 perplexity, \+1\.06% degradation\) with 97% cache reduction, a gap small enough to be practically negligible at this scale\. GQA\-8 provides the best quality\-efficiency balance with only \+0\.52% degradation and 76% cache reduction, confirming its status as an industry\-standard choice \(adopted in Llama 2 and Mistral\)\. Q\-K=V maintains reasonable quality \(\+2\.48% degradation\) with 50% cache savings\. At 1\.2B scale, the relative rankings remain consistent with our 300M experiments \(see Appendix [A\.4](https://arxiv.org/html/2606.04032#A1.SS4), Figure [12](https://arxiv.org/html/2606.04032#A1.F12)\)\.
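The degradation percentages quoted throughout are relative perplexity increases over the QKV baseline. A small sketch of the calculation follows; the 300M baseline perplexity is not stated in this section, so the value used for the second check is inferred from the reported 3.1% figure and should be read as an assumption.

```python
def ppl_degradation(ppl_variant, ppl_baseline):
    """Relative perplexity increase over the baseline, in percent."""
    return 100.0 * (ppl_variant / ppl_baseline - 1.0)

# 1.2B models: MQA vs. QKV, both perplexities quoted in the text.
mqa_gap = ppl_degradation(5.057, 5.004)

# 300M models: infer the baseline from Q-K=V (5.27 at +3.1%), then check Q=K-V (5.36).
implied_baseline_300m = 5.27 / 1.031          # inferred, not reported directly
qk_shared_gap = ppl_degradation(5.36, implied_baseline_300m)

print(f"MQA at 1.2B:   +{mqa_gap:.2f}%")
print(f"Q=K-V at 300M: +{qk_shared_gap:.1f}%")
```

Both values agree with the reported degradations after rounding, so the perplexities and percentages in this section are mutually consistent.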
Combined approaches scale effectively\. Q\-GQA\-8 achieves 88% cache reduction with 3\.08% degradation, while Q\-MQA reaches 98\.5% cache reduction with 4\.16% degradation\. Notably, these compound gains remain practical: even the most aggressive variant \(Q\-MQA\) incurs less than 5% quality loss while reducing the KV cache by 67×\.
Comparison with 300M results\. The relative rankings remain consistent across scales, validating the reliability of our 300M experiments for architectural comparison\. However, the absolute degradation percentages differ slightly: Q\-K=V shows 2\.48% degradation at 1\.2B versus 3\.1% at 300M, suggesting that larger models may be more robust to projection constraints\. This trend, if it continues at 7B\+ scale, would make projection sharing even more attractive for large production models\.
Table 10: Deployment recommendations for different resource constraint scenarios based on 300M model results\.
Implications for deployment\. At 1\.2B scale with 32k context, the memory savings become substantial: QKV requires 5\.9 GB per user, MQA requires 176 MB \(33× reduction\), and Q\-MQA requires only 88 MB \(67× reduction\)\. For a batch size of 32 concurrent users, this translates to 189 GB \(QKV\) vs 5\.6 GB \(MQA\) vs 2\.8 GB \(Q\-MQA\), enabling dramatically higher throughput in production serving scenarios\. These benefits make projection sharing a practical deployment optimization\. Table [10](https://arxiv.org/html/2606.04032#S4.T10) summarizes deployment recommendations under different resource constraints\.
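The per-user figures can be approximated from the 1.2B architecture. The sketch assumes 16-bit cache entries, a head dimension of 64 (2048 / 32), "32k" read as 32,768 tokens, and MQA keeping a single KV head; none of these conventions are stated explicitly in the text.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len,
                   cached_tensors=2, bytes_per_value=2):
    """KV cache size for one sequence."""
    return cached_tensors * n_layers * n_kv_heads * d_head * seq_len * bytes_per_value

LAYERS, HEADS, D_HEAD, SEQ = 22, 32, 64, 32_768   # 1.2B models, assumed conventions

qkv = kv_cache_bytes(LAYERS, HEADS, D_HEAD, SEQ)
mqa = kv_cache_bytes(LAYERS, 1, D_HEAD, SEQ)
q_mqa = kv_cache_bytes(LAYERS, 1, D_HEAD, SEQ, cached_tensors=1)

print(f"QKV:   {qkv / 1e9:.1f} GB")
print(f"MQA:   {mqa / 2**20:.0f} MiB ({qkv // mqa}x smaller)")
print(f"Q-MQA: {q_mqa / 2**20:.0f} MiB ({qkv // q_mqa}x smaller)")
```

Under these assumptions the sizes match the quoted 5.9 GB, 176 MB, and 88 MB when the first is read in decimal gigabytes and the latter two in binary mebibytes. The head-count arithmetic itself gives reduction factors of 32 and 64, so the 33× and 67× factors appear consistent with dividing a decimal-GB figure by a MiB figure.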
#### 4\.3\.6 Downstream Task Evaluation
Table 11: 5\-shot downstream accuracy \(%\) on standard benchmarks for 1\.2B models\. Q\-K=V loses only 0\.41 percentage points on average while halving KV cache; the perplexity gap to QKV does not translate to a comparable downstream gap \(HW = HellaSwag\)\.
While perplexity is a useful pretraining metric, it does not always predict downstream task performance\. To validate that projection\-sharing variants remain practically usable, we evaluate all 1\.2B models on five standard few\-shot benchmarks using the EleutherAI lm\-eval\-harness \(Gao et al\., [2024](https://arxiv.org/html/2606.04032#bib.bib96)\): HellaSwag, PIQA, ARC\-Easy, ARC\-Challenge, and WinoGrande, all in the 5\-shot setting\. Results are shown in Table [11](https://arxiv.org/html/2606.04032#S4.T11)\.
Q\-K=V remains competitive on downstream tasks\. Despite a 2\.48% perplexity gap to the QKV baseline, Q\-K=V loses only 0\.41 percentage points of average downstream accuracy \(35\.99% vs\. 36\.40%\)\. This decoupling between perplexity degradation and task accuracy strengthens the practical case for projection sharing: the inference memory savings come without a corresponding loss in capability on the kinds of tasks production systems actually serve\.
Perplexity is not a reliable predictor of downstream rank\. Although GQA\-8 attains better validation perplexity than Q\-K=V \(Table [9](https://arxiv.org/html/2606.04032#S4.T9)\), the two are nearly identical on downstream tasks \(35\.86% vs\. 35\.99%\)\. This is consistent with prior observations that small perplexity differences at this scale do not translate reliably to task\-level differences\.
Combined approaches preserve quality at aggressive cache reductions\. Q\-GQA\-8 slightly exceeds the QKV baseline \(36\.72% vs\. 36\.40%\) while reducing cache by 87\.5%, supporting the view that projection sharing and head sharing operate on complementary axes\. Q\-MQA, the most aggressive variant \(96\.9% cache reduction\), shows the largest degradation \(34\.38% average accuracy\), establishing a practical envelope: useful compression with bounded quality cost up to the Q\-GQA regime; beyond that, the trade\-off begins to bite\.
## 5 Discussion and Conclusion
We evaluated self\-attention with reduced projections, with and without 2D positional encoding, against standard QKV attention across 12 tasks\. Our goal was not state\-of\-the\-art performance, but to assess performance differences between the proposed and original QKV Transformers\. A comprehensive summary of all variants is provided in Appendix [A\.4](https://arxiv.org/html/2606.04032#A1.SS4), Table [13](https://arxiv.org/html/2606.04032#A1.T13)\. Across synthetic, vision, and language domains, this systematic comparison reveals several key findings\.
K=V projection is effective and scalable\. Q\-K=V achieves 50% cache reduction with 2\.48% degradation at 1\.2B scale \(vs 3\.1% at 300M\), offering an efficiency\-quality trade\-off that is orthogonal to and stackable with head sharing\.
Why Q\-K=V works\. Two complementary readings explain the small quality cost of K=V\. The first is that V’s role is less essential than commonly assumed \(He and Hofmann, [2024](https://arxiv.org/html/2606.04032#bib.bib97)\); the second is that *K is rich enough to absorb V’s role*: when the K=V constraint is imposed during training, the shared projection successfully serves both addressing and content functions\. Both readings are consistent with the same operational claim: attention requires asymmetry between Q and the shared K\-V representation, not three fully independent projections\. Analysis of trained QKV models supports this: K and V projection matrices exhibit high cosine similarity \(0\.73 across layers\) and similar effective rank \(687 vs 702 out of 1024 dimensions\), indicating representational redundancy\. In contrast, Q maintains lower cosine similarity with both K \(0\.42\) and V \(0\.31\), preserving the asymmetry required for directional attention\. This explains why the K=V constraint causes minimal quality loss while Q=K forces symmetric attention patterns that break causal dependencies\. Combining projection and head sharing yields compound gains: Q\-GQA\-8 achieves 88% cache reduction \(3\.08% degradation\), while Q\-MQA reaches 98\.5% reduction \(4\.16% degradation\), enabling edge deployment\.
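One way to compute statistics of this kind on a pair of projection matrices is sketched below. The exact definitions used for the reported numbers are not given in this section, so both choices here are assumptions: cosine similarity is taken between flattened weight matrices, and effective rank is the number of singular values above a relative threshold. The random matrices only demonstrate the functions; they are not the trained weights.

```python
import numpy as np

def matrix_cosine(a, b):
    """Cosine similarity between two weight matrices, treated as flat vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def effective_rank(w, rel_tol=1e-3):
    """Count singular values larger than rel_tol times the largest one (assumed definition)."""
    s = np.linalg.svd(w, compute_uv=False)
    return int((s > rel_tol * s[0]).sum())

rng = np.random.default_rng(0)
w_k = rng.normal(size=(64, 64))
w_v = w_k + 0.5 * rng.normal(size=(64, 64))   # correlated with w_k by construction
w_q = rng.normal(size=(64, 64))               # independent of w_k
low_rank = rng.normal(size=(64, 5)) @ rng.normal(size=(5, 64))

cos_kv = matrix_cosine(w_k, w_v)
cos_qk = matrix_cosine(w_q, w_k)
rank_low = effective_rank(low_rank)
```

Applied to the K, V, and Q weights of each layer in a trained model, functions like these would yield per-layer values that can then be averaged across layers.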
Insight: Q\-K=V works, Q=K\-V fails\. The K=V constraint preserves model quality because keys and values can share representational space while attention patterns \($QK^{\top}$\) remain flexible\. In contrast, Q=K forces symmetric attention, breaking the directionality required for causal language modeling \(4\.9% drop with zero cache benefit\)\. Q=K=V combines both pathologies, causing catastrophic 25\.4% degradation\.
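The symmetry argument is easy to verify numerically. The sketch below uses random inputs and weights for a single head and compares pre-softmax attention logits with separate and with tied query-key projections; masking and softmax are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 6, 16, 8
x = rng.normal(size=(n_tokens, d_model))
w_q = rng.normal(size=(d_model, d_head))
w_k = rng.normal(size=(d_model, d_head))

def attention_logits(x, w_q, w_k):
    """Scaled dot-product scores before masking and softmax."""
    return (x @ w_q) @ (x @ w_k).T / np.sqrt(w_q.shape[1])

def is_symmetric(m):
    return bool(np.allclose(m, m.T))

separate = attention_logits(x, w_q, w_k)   # QKV and Q-K=V: Q and K differ
tied = attention_logits(x, w_q, w_q)       # Q=K-V and Q=K=V: Q and K coincide

print("separate Q, K symmetric:", is_symmetric(separate))
print("tied Q = K symmetric:   ", is_symmetric(tied))
```

With tied projections the score that token i assigns to token j always equals the score j assigns to i, so the logits cannot encode a direction between two positions. Sharing K and V leaves the logits untouched and therefore keeps this freedom.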
## 6 Limitations
Several limitations apply\. Our largest validated scale is 1\.2B parameters; whether the Q\-K=V degradation trend continues to improve at 7B\+ scale remains unconfirmed\. Our explanation for why Q\-K=V preserves quality is empirical rather than formal\. Evaluation is restricted to sequences up to 2048 tokens, and we do not characterize length extrapolation\. We omit a Q=V ablation, as Q is not cached during generation and its addressing role differs fundamentally from V’s payload role, making this the least natural constraint to study\.
## Acknowledgments
We thank the BrainChip research team for compute support and infrastructure that made the language modeling experiments possible\. We are grateful to the ICML 2026 reviewers and area chairs for their thoughtful feedback, which substantially improved the manuscript\. We also thank D\. Hwa, T\. Holmes, and K\. Drechsler for extending our work to medical image segmentation \(Appendix [A\.3\.3](https://arxiv.org/html/2606.04032#A1.SS3.SSS3)\)\.
## Impact Statement
The development of more efficient Transformer models, as explored in this research, offers positive societal benefits like broadening AI accessibility by enabling use on less powerful hardware and potentially reducing the energy footprint of AI computations\. Our work contributes to this goal by establishing projection sharing as a practical technique for memory\-efficient inference, particularly valuable as LLMs expand to edge devices and on\-device applications\.
## References
- J\. Ainslie, J\. Lee\-Thorp, M\. de Jong, Y\. Zemlyanskiy, F\. Lebrón, and S\. Sanghai \(2023\)GQA: training generalized multi\-query transformer models from multi\-head checkpoints\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),pp\. 4895–4901\.Cited by:[§1\.1](https://arxiv.org/html/2606.04032#S1.SS1.p1.1),[§3\.2](https://arxiv.org/html/2606.04032#S3.SS2.p2.3)\.
- J\. Alayrac, J\. Donahue, P\. Luc, A\. Miech, I\. Barr, Y\. Hasson, K\. Lenc, A\. Mensch, K\. Millican, M\. Reynolds,et al\.\(2022\)Flamingo: a visual language model for few\-shot learning\.arXiv preprint arXiv:2204\.14198\.Cited by:[§3\.4](https://arxiv.org/html/2606.04032#S3.SS4.p2.1)\.
- E\. Almazrouei, H\. Alobeidli, A\. Alshamsi, A\. Cappelli, R\. Cojocaru, M\. Debbah, É\. Goffinet, D\. Hesslow, J\. Launay, Q\. Malartic,et al\.\(2023\)The falcon series of open language models\.arXiv preprint arXiv:2311\.16867\.Cited by:[§3\.2](https://arxiv.org/html/2606.04032#S3.SS2.p2.3)\.
- J\. L\. Ba, J\. R\. Kiros, and G\. E\. Hinton \(2016\)Layer normalization\.arXiv preprint arXiv:1607\.06450\.Cited by:[§2\.1](https://arxiv.org/html/2606.04032#S2.SS1.p1.1)\.
- I\. Beltagy, M\. E\. Peters, and A\. Cohan \(2020\)Longformer: the long\-document transformer\.arXiv preprint arXiv:2004\.05150\.Cited by:[§3\.3](https://arxiv.org/html/2606.04032#S3.SS3.p4.1)\.
- A\. Borji \(2023\)Key\-value transformer\.arXiv preprint arXiv:2305\.19129\.Cited by:[§A\.3\.3](https://arxiv.org/html/2606.04032#A1.SS3.SSS3.p1.1),[§2\.2](https://arxiv.org/html/2606.04032#S2.SS2.p2.1),[§4\.2](https://arxiv.org/html/2606.04032#S4.SS2.p7.1)\.
- T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell,et al\.\(2020\)Language models are few\-shot learners\.Advances in neural information processing systems33,pp\. 1877–1901\.Cited by:[§2\.1](https://arxiv.org/html/2606.04032#S2.SS1.p1.1),[§3\.4](https://arxiv.org/html/2606.04032#S3.SS4.p2.1)\.
- J\. Chen, Y\. Lu, Q\. Yu, X\. Luo, E\. Adeli, Y\. Wang, L\. Lu, A\. L\. Yuille, and Y\. Zhou \(2021\)TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2102.04306),2102\.04306Cited by:[§A\.3\.3](https://arxiv.org/html/2606.04032#A1.SS3.SSS3.p4.1)\.
- R\. Child, S\. Gray, A\. Radford, and I\. Sutskever \(2019\)Generating long sequences with sparse transformers\.arXiv preprint arXiv:1904\.10509\.Cited by:[§3\.3](https://arxiv.org/html/2606.04032#S3.SS3.p4.1),[§3\.4](https://arxiv.org/html/2606.04032#S3.SS4.p3.2)\.
- K\. Choromanski, V\. Likhosherstov, D\. Dohan, X\. Song, A\. Gane, T\. Sarlos, P\. Hawkins, J\. Q\. Davis, A\. Mohiuddin, L\. Kaiser, D\. Belanger, L\. J\. Colwell, and A\. Weller \(2021\)Rethinking attention with performers\.International Conference on Learning Representations \(ICLR\)\.Note:arXiv:2009\.14794Cited by:[§2\.2](https://arxiv.org/html/2606.04032#S2.SS2.p1.1)\.
- A\. Chowdhery, S\. Narang, J\. Devlin, M\. Bosma, G\. Mishra, A\. Roberts, P\. Barham,et al\.\(2022\)PaLM: scaling language modeling with pathways\.arXiv preprint arXiv:2204\.02311\.Cited by:[§3\.2](https://arxiv.org/html/2606.04032#S3.SS2.p2.3)\.
- T\. Dao, D\. Fu, S\. Ermon, A\. Rudra, and C\. Ré \(2022\)Flashattention: fast and memory\-efficient exact attention with io\-awareness\.Advances in neural information processing systems35,pp\. 16344–16359\.Cited by:[§3\.4](https://arxiv.org/html/2606.04032#S3.SS4.p3.2)\.
- M\. Dehghani, J\. Djolonga, B\. Mustafa, P\. Padlewski, J\. Heek, J\. Gilmer, A\. P\. Steiner, M\. Caron, R\. Geirhos, I\. Alabdulmohsin,et al\.\(2023\)Scaling vision transformers to 22 billion parameters\.InInternational conference on machine learning,pp\. 7480–7512\.Cited by:[§4](https://arxiv.org/html/2606.04032#S4.p1.1)\.
- J\. Deng, W\. Dong, R\. Socher, L\. Li, K\. Li, and L\. Fei\-Fei \(2009\)Imagenet: a large\-scale hierarchical image database\.In2009 IEEE conference on computer vision and pattern recognition,pp\. 248–255\.Cited by:[§A\.3\.2](https://arxiv.org/html/2606.04032#A1.SS3.SSS2.p2.1)\.
- J\. F\. DeRose, J\. Wang, and M\. Berger \(2020\)Attention flows: analyzing and comparing attention mechanisms in language models\.IEEE Transactions on Visualization and Computer Graphics27\(2\),pp\. 1160–1170\.Cited by:[§4](https://arxiv.org/html/2606.04032#S4.p1.1)\.
- T\. Dettmerset al\.\(2023\)SPQR: a sparse\-quantized representation for near\-lossless llm weight compression\.arXiv preprint arXiv:2306\.03078\.Cited by:[§3\.3](https://arxiv.org/html/2606.04032#S3.SS3.p4.1),[§3\.4](https://arxiv.org/html/2606.04032#S3.SS4.p3.2)\.
- A\. Dosovitskiy, L\. Beyer, A\. Kolesnikov, D\. Weissenborn, X\. Zhai, T\. Unterthiner, M\. Dehghani, M\. Minderer, G\. Heigold, S\. Gelly, J\. Uszkoreit, and N\. Houlsby \(2021\)An image is worth 16x16 words: transformers for image recognition at scale\.International Conference on Learning Representations\.External Links:[Document](https://dx.doi.org/10.48550/ARXIV.2010.11929),2010\.11929Cited by:[§2\.1](https://arxiv.org/html/2606.04032#S2.SS1.p1.1),[§2\.1](https://arxiv.org/html/2606.04032#S2.SS1.p2.1),[§3\.1](https://arxiv.org/html/2606.04032#S3.SS1.p3.5)\.
- F\. Fusco, D\. Pascual, and P\. Staar \(2022\)PNLP\-mixer: an efficient all\-mlp architecture for language\.\(2022\)\.Cited by:[§2\.2](https://arxiv.org/html/2606.04032#S2.SS2.p2.1)\.
- L\. Gao, J\. Tow, B\. Abbasi, S\. Biderman, S\. Black, A\. DiPofi, C\. Foster, L\. Golding, J\. Hsu, A\. Le Noac’h, H\. Li, K\. McDonell, N\. Muennighoff, C\. Ociepa, J\. Phang, L\. Reynolds, H\. Schoelkopf, A\. Skowron, L\. Sutawika, E\. Tang, A\. Thite, B\. Wang, K\. Wang, and A\. Zou \(2024\)The language model evaluation harness\.Zenodo\.External Links:[Document](https://dx.doi.org/10.5281/zenodo.12608602),[Link](https://zenodo.org/records/12608602)Cited by:[§4\.3\.6](https://arxiv.org/html/2606.04032#S4.SS3.SSS6.p1.1)\.
- A\. Gu and T\. Dao \(2023\)Mamba: linear\-time sequence modeling with selective state spaces\.arXiv preprint arXiv:2312\.00752\.Cited by:[§1](https://arxiv.org/html/2606.04032#S1.p2.1),[§2\.2](https://arxiv.org/html/2606.04032#S2.SS2.p1.1)\.
- S\. Gupta and S\. K\. Gupta \(2019\)Abstractive summarization: an overview of the state of the art\.Expert Systems with Applications121,pp\. 49–65\.Cited by:[§2\.1](https://arxiv.org/html/2606.04032#S2.SS1.p2.1)\.
- K\. Han, Y\. Wang, H\. Chen, X\. Chen, J\. Guo, Z\. Liu, Y\. Tang, A\. Xiao, C\. Xu, Y\. Xu,et al\.\(2022\)A survey on vision transformer\.IEEE transactions on pattern analysis and machine intelligence45\(1\),pp\. 87–110\.Cited by:[§1](https://arxiv.org/html/2606.04032#S1.p1.1)\.
- happyharrycn, Maggie, P\. Culliton, P\. Yadav, and S\. L\. Lee \(2022\)UW\-madison gi tract image segmentation\.Kaggle\.External Links:[Link](https://kaggle.com/competitions/uw-madison-gi-tract-image-segmentation)Cited by:[§A\.3\.3](https://arxiv.org/html/2606.04032#A1.SS3.SSS3.p7.1)\.
- B\. He and T\. Hofmann \(2024\)Simplifying transformer blocks\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§5](https://arxiv.org/html/2606.04032#S5.p3.1)\.
- K\. He, X\. Zhang, S\. Ren, and J\. Sun \(2016\)Deep residual learning for image recognition\.InProceedings of the IEEE conference on computer vision and pattern recognition,pp\. 770–778\.Cited by:[§A\.3\.2](https://arxiv.org/html/2606.04032#A1.SS3.SSS2.p2.1),[§A\.3\.3](https://arxiv.org/html/2606.04032#A1.SS3.SSS3.p4.1),[§2\.1](https://arxiv.org/html/2606.04032#S2.SS1.p1.1),[§4](https://arxiv.org/html/2606.04032#S4.p1.1)\.
- J\. Hoffmann, S\. Borgeaud, A\. Mensch, E\. Buchatskaya, T\. Cai, E\. Rutherford, D\. d\. L\. Casas, L\. A\. Hendricks, J\. Welbl, A\. Clark, T\. Hennigan, E\. Noland, K\. Millican, G\. van den Driessche, B\. Damoc, A\. Guy, S\. Osindero, K\. Simonyan, E\. Elsen, J\. W\. Rae, O\. Vinyals, and L\. Sifre \(2022\)Training compute‑optimal large language models\.arXiv preprint arXiv:2203\.15556\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2203.15556)Cited by:[§3\.4](https://arxiv.org/html/2606.04032#S3.SS4.p5.1)\.
- Z\. Huang, D\. Liang, P\. Xu, and B\. Xiang \(2020\)Improve transformer models with better relative position embeddings\.InFindings of the Association for Computational Linguistics: EMNLP 2020,pp\. 3327–3335\.Cited by:[§3\.1](https://arxiv.org/html/2606.04032#S3.SS1.p3.5)\.
- D\. Hwa, T\. Holmes, and K\. Drechsler \(2025\)Integration of key\-value attention into pure and hybrid transformers for semantic segmentation\.InBVM Workshop,pp\. 305–310\.Cited by:[§A\.3\.3](https://arxiv.org/html/2606.04032#A1.SS3.SSS3.p1.1),[§A\.3\.3](https://arxiv.org/html/2606.04032#A1.SS3.SSS3.p9.1),[§4\.2](https://arxiv.org/html/2606.04032#S4.SS2.p7.1)\.
- A\. Q\. Jiang, A\. Sablayrolles, A\. Mensch, C\. Bamford, D\. S\. Chaplot, D\. de las Casas, F\. Bressand,et al\.\(2023\)Mistral 7b\.arXiv preprint arXiv:2310\.06825\.Cited by:[§3\.2](https://arxiv.org/html/2606.04032#S3.SS2.p2.3)\.
- J\. Kaplan, S\. McCandlish, T\. Henighan, T\. B\. Brown, B\. Chess, R\. Child, S\. Gray, A\. Radford, J\. Wu, and D\. Amodei \(2020\)Scaling laws for neural language models\.arXiv preprint arXiv:2001\.08361\.Cited by:[§3\.4](https://arxiv.org/html/2606.04032#S3.SS4.p5.1)\.
- A\. Katharopoulos, A\. Vyas, N\. Pappas, and F\. Fleuret \(2020\)Transformers are rnns: fast autoregressive transformers with linear attention\.InInternational conference on machine learning,pp\. 5156–5165\.Cited by:[§2\.2](https://arxiv.org/html/2606.04032#S2.SS2.p1.1)\.
- S\. A\. Koohpayegani and H\. Pirsiavash \(2024\)Sima: simple softmax\-free attention for vision transformers\.InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision,pp\. 2607–2617\.Cited by:[§3\.4](https://arxiv.org/html/2606.04032#S3.SS4.p3.2)\.
- M\. Kowsher, N\. J\. Prottasha, C\. Yu, O\. Garibay, and N\. Yousefi \(2025\)Does self\-attention need separate weights in transformers?\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 3: Industry Track\),pp\. 535–543\.Cited by:[§2\.2](https://arxiv.org/html/2606.04032#S2.SS2.p2.1)\.
- A\. Krizhevsky, G\. Hinton,et al\.\(2009\)Learning multiple layers of features from tiny images\.Cited by:[§4\.2](https://arxiv.org/html/2606.04032#S4.SS2.p1.1)\.
- Y\. LeCun, Y\. Bengio,et al\.\(1995\)Convolutional networks for images, speech, and time series\.The handbook of brain theory and neural networks3361\(10\),pp\. 1995\.Cited by:[§1](https://arxiv.org/html/2606.04032#S1.p2.1)\.
- Y\. LeCun, L\. Bottou, Y\. Bengio, and P\. Haffner \(1998\)Gradient\-based learning applied to document recognition\.Proceedings of the IEEE86\(11\),pp\. 2278–2324\.Cited by:[§2\.2](https://arxiv.org/html/2606.04032#S2.SS2.p1.1),[§4\.2](https://arxiv.org/html/2606.04032#S4.SS2.p1.1)\.
- A\. Liu, B\. Feng, B\. Wang, B\. Wang, B\. Liu, C\. Zhao, C\. Dengr, C\. Ruan, D\. Dai, D\. Guo,et al\.\(2024\)Deepseek\-v2: a strong, economical, and efficient mixture\-of\-experts language model\.arXiv preprint arXiv:2405\.04434\.Cited by:[§2\.2](https://arxiv.org/html/2606.04032#S2.SS2.p3.1)\.
- Z\. Liu, A\. Desai, F\. Liao, W\. Wang, V\. Xie, Z\. Xu, A\. Kyrillidis, and A\. Shrivastava \(2023\)Scissorhands: exploiting the persistence of importance hypothesis for llm kv cache compression at test time\.Advances in Neural Information Processing Systems36,pp\. 52342–52364\.Cited by:[§3\.3](https://arxiv.org/html/2606.04032#S3.SS3.p4.1)\.
- J\. Lu, J\. Yao, J\. Zhang, X\. Zhu, H\. Xu, W\. Gao, C\. Xu, T\. Xiang, and L\. Zhang \(2021\)Soft: softmax\-free transformer with linear complexity\.Advances in Neural Information Processing Systems34,pp\. 21297–21309\.Cited by:[§3\.4](https://arxiv.org/html/2606.04032#S3.SS4.p3.2)\.
- F\. Mai, A\. Pannatier, F\. Fehr, H\. Chen, F\. Marelli, F\. Fleuret, and J\. Henderson \(2023\)Hypermixer: an mlp\-based low cost alternative to transformers\.InProceedings of the 61st annual meeting of the Association for Computational Linguistics \(volume 1: long papers\),pp\. 15632–15654\.Cited by:[§2\.2](https://arxiv.org/html/2606.04032#S2.SS2.p2.1)\.
- R\. Pope, S\. Douglas, A\. Chowdhery, J\. Devlin, J\. Bradbury, J\. Heek, K\. Xiao, S\. Agrawal, and J\. Dean \(2023\)Efficiently scaling transformer inference\.Proceedings of machine learning and systems5,pp\. 606–624\.Cited by:[§3\.3](https://arxiv.org/html/2606.04032#S3.SS3.p4.1)\.
- O\. Press and L\. Wolf \(2017\)Using the output embedding to improve language models\.InProceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers,Valencia, Spain,pp\. 157–163\.Cited by:[§3\.1](https://arxiv.org/html/2606.04032#S3.SS1.p4.3)\.
- A\. Radford, J\.W\. Kim, C\. Hallacy, A\. Ramesh, G\. Goh, S\. Agarwal, G\. Sastry, A\. Askell, P\. Mishkin, J\. Clark, G\. Krueger, and I\. Sutskever \(2021\)Learning transferable visual models from natural language supervision\.InProceedings of the 38th International Conference on Machine Learning \(ICML\),Cited by:[§2\.1](https://arxiv.org/html/2606.04032#S2.SS1.p2.1)\.
- C\. Raffel, N\. Shazeer, A\. Roberts, K\. Lee, S\. Narang, M\. Matena, Y\. Zhou, W\. Li, and P\. J\. Liu \(2020\)Exploring the limits of transfer learning with a unified text\-to\-text transformer\.Journal of Machine Learning Research21\(140\),pp\. 1–67\.External Links:[Link](http://jmlr.org/papers/v21/20-074.html)Cited by:[§3\.4](https://arxiv.org/html/2606.04032#S3.SS4.p2.1)\.
- A\. Santoro, D\. Raposo, D\. G\. T\. Barrett, M\. Malinowski, R\. Pascanu, P\. Battaglia, and T\. Lillicrap \(2017\)A simple neural network module for relational reasoning\.InAdvances in Neural Information Processing Systems 30 \(NeurIPS 2017\),pp\. 4974–4983\.Cited by:[§3\.1](https://arxiv.org/html/2606.04032#S3.SS1.p2.2)\.
- P\. Shaw, J\. Uszkoreit, and A\. Vaswani \(2018\)Self\-attention with relative position representations\.arXiv preprint arXiv:1803\.02155\.External Links:[Link](https://arxiv.org/abs/1803.02155)Cited by:[§3\.1](https://arxiv.org/html/2606.04032#S3.SS1.p3.5)\.
- N\. Shazeer \(2019\)Fast transformer decoding: one write\-head is all you need\.arXiv preprint arXiv:1911\.02150\.Cited by:[§1\.1](https://arxiv.org/html/2606.04032#S1.SS1.p1.1),[§3\.2](https://arxiv.org/html/2606.04032#S3.SS2.p2.3)\.
- Y\. Shenget al\.\(2023\)FlexGen: high\-throughput generative inference of large language models with a single gpu\.arXiv preprint arXiv:2303\.06865\.Cited by:[§3\.3](https://arxiv.org/html/2606.04032#S3.SS3.p4.1)\.
- C\. Systems \(2023\)SlimPajama: a 627b token cleaned and deduplicated version of redpajama\.External Links:[Link](https://huggingface.co/datasets/cerebras/SlimPajama-627B)Cited by:[§4\.3](https://arxiv.org/html/2606.04032#S4.SS3.p1.2)\.
- Y\. Tay, M\. Dehghani, D\. Bahri, and D\. Metzler \(2022\)Efficient transformers: a survey\.ACM Computing Surveys55\(6\),pp\. 1–28\.Cited by:[§1](https://arxiv.org/html/2606.04032#S1.p1.1),[§3\.4](https://arxiv.org/html/2606.04032#S3.SS4.p4.2)\.
- H\. Touvron, L\. Martin, G\. Stone, S\. Albert,et al\.\(2023\)LLaMA 2: open foundation and fine\-tuned chat models\.arXiv preprint arXiv:2307\.09288\.Cited by:[§3\.2](https://arxiv.org/html/2606.04032#S3.SS2.p2.3)\.
- A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\)Attention is all you need\.Advances in neural information processing systems30\.Cited by:[§1](https://arxiv.org/html/2606.04032#S1.p1.1),[§2\.1](https://arxiv.org/html/2606.04032#S2.SS1.p1.1),[§3\.1](https://arxiv.org/html/2606.04032#S3.SS1.p3.5),[§3\.3](https://arxiv.org/html/2606.04032#S3.SS3.p3.6),[§3\.4](https://arxiv.org/html/2606.04032#S3.SS4.p2.1)\.
- P\. Veličković, G\. Cucurull, A\. Casanova, A\. Romero, P\. Liò, and Y\. Bengio \(2018\)Graph attention networks\.InInternational Conference on Learning Representations \(ICLR\),External Links:[Link](https://openreview.net/forum?id=rJXMpikCZ)Cited by:[§3\.1](https://arxiv.org/html/2606.04032#S3.SS1.p2.2)\.
- H\. Wu, B\. Xiao, N\. C\. F\. Codella, M\. Liu, X\. Dai, L\. Yuan, and L\. Zhang \(2021\)CvT: introducing convolutions to vision transformers\.Proc IEEE Int Conf Comput Vis,pp\. 22–31\.External Links:[Link](https://api.semanticscholar.org/CorpusID:232417787),2103\.15808Cited by:[§A\.3\.3](https://arxiv.org/html/2606.04032#A1.SS3.SSS3.p5.1)\.
- Q\. Wu, D\. Teney, P\. Wang, C\. Shen, A\. Dick, and A\. Van Den Hengel \(2017\)Visual question answering: a survey of methods and datasets\.Computer Vision and Image Understanding163,pp\. 21–40\.Cited by:[§2\.1](https://arxiv.org/html/2606.04032#S2.SS1.p2.1)\.
- G\. Xiao, J\. Lin, M\. Seznec, H\. Wu, J\. Demouth, and S\. Han \(2023\)Smoothquant: accurate and efficient post\-training quantization for large language models\.InInternational conference on machine learning,pp\. 38087–38099\.Cited by:[§3\.3](https://arxiv.org/html/2606.04032#S3.SS3.p4.1)\.
- H\. Xiao, K\. Rasul, and R\. Vollgraf \(2017\)Fashion\-mnist: a novel image dataset for benchmarking machine learning algorithms\.arXiv preprint arXiv:1708\.07747\.Cited by:[§4\.2](https://arxiv.org/html/2606.04032#S4.SS2.p1.1)\.
- S\. Yin, C\. Fu, S\. Zhao, K\. Li, X\. Sun, T\. Xu, and E\. Chen \(2024\)A survey on multimodal large language models\.National Science Review11\(12\),pp\. nwae403\.Cited by:[§1](https://arxiv.org/html/2606.04032#S1.p1.1)\.
- M\. Zaheer, K\. Guruganesh, N\. Dubey, J\. Ainslie, C\. Alberti, S\. Ontanon, P\. Pham, A\. Ravula, Q\. Wang, L\. Yang,et al\.\(2020\)Big bird: transformers for longer sequences\.Advances in Neural Information Processing Systems \(NeurIPS\)33\.Cited by:[§3\.4](https://arxiv.org/html/2606.04032#S3.SS4.p3.2)\.
- S\. Zhai, W\. Talbott, N\. Srivastava, C\. Huang, H\. Goh, R\. Zhang, and J\. Susskind \(2021\)An attention free transformer\.arXiv preprint arXiv:2105\.14103\.Cited by:[§2\.2](https://arxiv.org/html/2606.04032#S2.SS2.p1.1)\.
- S\. Zheng, J\. Lu, H\. Zhao, X\. Zhu, Z\. Luo, Y\. Wang, Y\. Fu, J\. Feng, T\. Xiang, P\. H\.S\. Torr, and L\. Zhang \(2021\)Rethinking semantic segmentation from a sequence\-to\-sequence perspective with transformers\.InProc IEEE Comput Soc Conf Comput Vis Pattern Recognit,pp\. 6877–6886\.External Links:[Document](https://dx.doi.org/10.1109/CVPR46437.2021.00681),2012\.15840Cited by:[Figure 7](https://arxiv.org/html/2606.04032#A1.F7),[Figure 7](https://arxiv.org/html/2606.04032#A1.F7.3.2),[§A\.3\.3](https://arxiv.org/html/2606.04032#A1.SS3.SSS3.p3.3)\.
- X\. Zhu, D\. Cheng, Z\. Zhang, S\. Lin, and J\. Dai \(2019\)An empirical study of spatial attention mechanisms in deep networks\.InProceedings of the IEEE/CVF international conference on computer vision,pp\. 6688–6697\.Cited by:[§4](https://arxiv.org/html/2606.04032#S4.p1.1)\.
## Appendix AAppendix
### A\.1Unifying Linear Attention and State\-Space Models via QKV Collapse
Standard self\-attention employs three distinct learned projections of each token: queries, keys, and values, enabling content\-based addressing and selective information routing across tokens\. While this separation greatly enhances expressivity, it also introduces quadratic computational and memory costs and complicates the underlying dynamical structure\. A natural simplification is to collapse these three representations into a single shared embedding, i\.e\., $q_t = k_t = v_t = z_t$, where $z_t = W x_t$\. This tying removes explicit addressing and enforces a single\-stream representation in which each token simultaneously defines what is stored, how it is matched, and what is retrieved\.
Under this constraint, kernelized \(linear\) attention admits a particularly simple form\. Recall that linear attention replaces the softmax kernel with a positive feature map $\phi(\cdot)$, allowing the attention computation to be reordered as

$$y_t = \frac{\phi(q_t)^{\top} \sum_{i \le t} \phi(k_i) v_i^{\top}}{\phi(q_t)^{\top} \sum_{i \le t} \phi(k_i)}. \tag{5}$$

Substituting $q_t = k_t = v_t = z_t$ yields the recurrence

$$S_t = \sum_{i \le t} \phi(z_i) z_i^{\top}, \qquad y_t = \frac{\phi(z_t)^{\top} S_t}{\phi(z_t)^{\top} \sum_{i \le t} \phi(z_i)}, \tag{6}$$

where $S_t$ is a running state that aggregates outer products of the current representation with itself\. Importantly, the state update can be written incrementally as

$$S_t = S_{t-1} + \phi(z_t) z_t^{\top}, \tag{7}$$

optionally with a decay factor, $S_t = \lambda S_{t-1} + \phi(z_t) z_t^{\top}$, to ensure stability\. No token–token interaction matrix is ever formed; all computation proceeds through a streaming state update and a local readout\.
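As an illustration of Eqs\. \(6\) and \(7\), the following NumPy sketch runs the streaming update and checks it against a reference that explicitly forms the causal $n \times n$ kernel matrix\. The feature map $\phi(x) = \mathrm{elu}(x) + 1$, the function names, and the choice to apply the optional decay to the normalizer as well as to $S_t$ are assumptions made for this example, not details taken from the paper\.

```python
import numpy as np

def phi(x):
    # Positive feature map. elu(x) + 1 is one common choice (assumed here).
    return np.where(x > 0, x + 1.0, np.exp(x))

def collapsed_linear_attention(Z, decay=1.0):
    """Streaming form of Eqs. (6)-(7) with q_t = k_t = v_t = z_t."""
    n, d = Z.shape
    S = np.zeros((d, d))   # running state S_t
    norm = np.zeros(d)     # running sum of phi(z_i) for the denominator
    Y = np.zeros((n, d))
    for t in range(n):
        f = phi(Z[t])
        S = decay * S + np.outer(f, Z[t])   # Eq. (7), optionally decayed
        norm = decay * norm + f             # assumption: decay the normalizer too
        Y[t] = (f @ S) / (f @ norm)         # input-conditioned readout
    return Y

def reference_attention(Z):
    """Same output, computed by explicitly forming the causal n x n kernel matrix."""
    F = phi(Z)
    K = np.tril(F @ F.T)   # phi(z_t)^T phi(z_i) for i <= t
    return (K @ Z) / K.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 4))
Y_stream = collapsed_linear_attention(Z)
Y_ref = reference_attention(Z)
```

With `decay=1.0` the two routes should agree up to floating\-point error, while the streaming version only ever stores a $d \times d$ state\.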
This formulation reveals a direct structural correspondence between linear attention with collapsed QKV and state\-space models \(SSMs\)\. Classical discrete\-time SSMs evolve a hidden state according to
$$h_t = A h_{t-1} + B x_t, \qquad y_t = C h_t, \tag{8}$$

where $A$ controls state dynamics and $B$ injects input into the state\. In the linear\-attention recurrence above, $S_t$ plays the role of the hidden state, the outer\-product term $\phi(z_t) z_t^{\top}$ acts as an input\-dependent update, and the optional decay corresponds to a stable transition operator\. The key difference is that attention employs an input\-conditioned readout, $y_t = \phi(z_t)^{\top} S_t$, rather than a fixed observation matrix\. Conceptually, linear attention therefore behaves as a state\-space model with adaptive, content\-dependent observation\.
Collapsing Q, K, and V removes explicit content\-based routing and converts attention into a dynamical memory system closely related to fast\-weight models and Hebbian associative updates\. The resulting model emphasizes continuous temporal integration and efficient long\-range aggregation rather than selective retrieval and symbolic addressing\. This unification clarifies why linear attention and modern SSMs share similar scaling properties, streaming behavior, and inductive biases, while also explaining their limitations in tasks requiring sharp, discrete information routing\. From an architectural perspective, the QKV collapse highlights a continuum between programmable memory \(attention\) and dynamical systems \(SSMs\), reinforcing the view that representational structure, not scale alone, determines the qualitative behavior of sequence models\.
### A\.22D Positional Encodings
We use 2D positional encodings in the “\+\+” variants to restore directional asymmetry in attention when projection sharing \(e\.g\., $Q=K$\) produces symmetric attention maps \($QK^{\top} = KK^{\top}$\)\.

Construction: We define a fixed 2D sinusoidal positional encoding

$$P \in \mathbb{R}^{n \times n \times m},$$

where $n$ is the sequence length and $m$ the positional embedding dimension\. Each entry $P_{i,j}$ encodes the relative interaction between query position $i$ and key position $j$, allowing the model to distinguish directional relationships \($i<j$ vs\. $i>j$\)\.

Integration into Attention: Given raw attention scores

$$A = QK^{\top} \in \mathbb{R}^{n \times n},$$

we broadcast $A$ along a channel dimension, add the positional encoding to obtain $A' \in \mathbb{R}^{n \times n \times m}$ \(that is, $A'_{i,j,c} = A_{i,j} + P_{i,j,c}$\), and apply a $1 \times 1$ convolution \(linear projection\) to map $A' \in \mathbb{R}^{n \times n \times m} \to \mathbb{R}^{n \times n}$\.

Intuition: This modifies attention to combine content\-based similarity with positional/directional bias, breaking symmetry caused by projection sharing and enabling order\-sensitive behavior\.
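The integration step can be sketched in NumPy as follows\. The construction of $P$ used here \(half of the channels encode the query index $i$ and the other half the key index $j$, each with standard 1D sinusoids\) is a hypothetical choice, since only the shape of $P$ is specified above\. The $1 \times 1$ convolution weights `w` are random stand\-ins for learned parameters, and `m` is assumed to be divisible by 4\.

```python
import numpy as np

def sinusoid(pos, dim):
    """Standard 1D sinusoidal features for integer positions, shape (len(pos), dim)."""
    i = np.arange(dim // 2)
    freq = 1.0 / (10000 ** (2 * i / dim))
    angles = pos[:, None] * freq[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def make_2d_pe(n, m):
    """Hypothetical P in R^{n x n x m}: first half encodes i (query), second half j (key)."""
    half = m // 2
    s = sinusoid(np.arange(n), half)   # (n, half)
    P = np.zeros((n, n, m))
    P[:, :, :half] = s[:, None, :]     # varies with the query index i
    P[:, :, half:] = s[None, :, :]     # varies with the key index j
    return P

def asymmetric_scores(Z, P, w, b=0.0):
    A = Z @ Z.T                    # symmetric scores, since Q = K = Z
    A_prime = A[:, :, None] + P    # broadcast along channels and add P: (n, n, m)
    return A_prime @ w + b         # 1x1 conv = linear map over channels: (n, n)

rng = np.random.default_rng(0)
n, d, m = 5, 8, 8
Z = rng.normal(size=(n, d))
P = make_2d_pe(n, m)
w = rng.normal(size=m)             # stand-in for learned 1x1 conv weights
scores = asymmetric_scores(Z, P, w)
```

In this sketch the content term $ZZ^{\top}$ is symmetric, so any difference between `scores[i, j]` and `scores[j, i]` comes entirely from the positional channels\.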
### A\.3Additional Synthetic and Vision Results
#### A\.3\.1Synthetic results
Figure [3](https://arxiv.org/html/2606.04032#A1.F3) shows the loss over time for the synthetic tasks\. Figure [4](https://arxiv.org/html/2606.04032#A1.F4) displays sample attention maps\. Note that the attention maps of the KV \(Q=K\-V\) transformer are symmetric about the line $y=x$\. Notable patterns can be observed within the attention maps\. For instance, in the reversing task, the QKV model has learned to attend to the token located at the flipped index of the current position\. However, it also allocates some attention to tokens near the flipped index\. This behavior arises because the model does not require precise, strict attention to solve this problem, but rather benefits from an approximate, noisy attention map\. Figure [5](https://arxiv.org/html/2606.04032#A1.F5) shows the code to compute and normalize the self\-attention map, plus a visualization of the maps\.
Figure 3: Loss over time for the synthetic tasks for QKV, Q=K\-V, and \(Q=K\-V\)\+\.














Figure 4: Attention maps over synthetic tasks\. Rows from top to bottom: Reverse, Sort, Swap, Sub, and Copy\. Columns from left to right: QKV, Q=K\-V, and \(Q=K\-V\)\+\.

Figure 5: Top: Code to compute and normalize the self\-attention map\. Bottom: Un\-normalized \(left\) and normalized \(right\) attention maps\.
#### A\.3\.2Set Anomaly Detection
We aim to apply transformers to sets \(*i\.e\.*, unordered inputs\)\. A model is trained to find the odd one out in a set of ten images from CIFAR\-100: nine images come from one class, and one comes from a different class\. Two sample sets are shown in Figure [6](https://arxiv.org/html/2606.04032#A1.F6)\. CIFAR\-100 has 60K $32 \times 32$ images over 100 classes \(600 per class\)\.
To extract high\-level, low\-dimensional features from the images, we employ a ResNet34 model \(He et al\., [2016](https://arxiv.org/html/2606.04032#bib.bib6)\) pretrained on the ImageNet dataset \(Deng et al\., [2009](https://arxiv.org/html/2606.04032#bib.bib5)\)\. To monitor training progress and determine when to stop, we create a validation set by splitting the training set into 90% for training and 10% for validation, with a balanced distribution across classes\.
Figure 6: Two sets of samples from the anomaly detection dataset, with the first image in each set representing the anomaly\.

We define an epoch as a pass in which each image in the dataset serves as the “anomaly” exactly once\. Therefore, the length of the dataset equals the total number of images it contains\. When constructing the training set, we follow a two\-step process\. First, we randomly sample a class that is different from the class of the image at the corresponding index \(*i\.e\.*, the `idx` argument of `__getitem__(self, idx)`\)\. Then, in the second step, we sample 9 images from the newly selected class\.
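The two\-step sampling procedure can be sketched with the Python standard library as follows\. The class name `SetAnomalyDataset`, its arguments, and the decision to return image indices rather than ResNet features are assumptions made for brevity; the toy labels are not CIFAR\-100\.

```python
import random
from collections import defaultdict

class SetAnomalyDataset:
    """Each item is a set of 10 image indices: the anomaly first, then 9 from another class."""

    def __init__(self, labels, set_size=10, seed=0):
        self.labels = list(labels)
        self.set_size = set_size
        self.rng = random.Random(seed)
        self.by_class = defaultdict(list)
        for idx, label in enumerate(self.labels):
            self.by_class[label].append(idx)
        self.classes = sorted(self.by_class)

    def __len__(self):
        # One item per image, so every image is the anomaly exactly once per epoch.
        return len(self.labels)

    def __getitem__(self, idx):
        anomaly_class = self.labels[idx]
        # Step 1: pick a class different from the anomaly's class.
        other = self.rng.choice([c for c in self.classes if c != anomaly_class])
        # Step 2: sample 9 distinct images from that class.
        normals = self.rng.sample(self.by_class[other], self.set_size - 1)
        return [idx] + normals, 0   # target = position of the anomaly in the set

labels = [c for c in range(5) for _ in range(12)]   # toy labels: 5 classes x 12 images
ds = SetAnomalyDataset(labels)
indices, target = ds[7]
```

Placing the anomaly first mirrors the layout in Figure 6; a training pipeline could shuffle the set and update the target accordingly\.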
We perform set\-level classification by assigning one logit per image and applying softmax across images, ensuring permutation\-equivariant predictions that identify the anomalous image regardless of input order\.
In our experiments, we vary the embedding dimension \(256 or 512\) as well as the depth and the number of heads \(2 or 4 each\)\. We set the learning rate to 5e\-4 for all configurations and apply a dropout rate of 0\.1 throughout the model for regularization\. The learning rate is scheduled with the CosineWarmupScheduler, with the warm\-up parameter set to 100 so that training starts gradually\. Each setting is executed twice for a total of 20 epochs, and the results are averaged \(see Table [3](https://arxiv.org/html/2606.04032#S4.T3)\)\.
#### A\.3\.3Image Segmentation
Hwa et al\. \([2025](https://arxiv.org/html/2606.04032#bib.bib93)\) ran experiments based on an earlier version of our work \(Borji, [2023](https://arxiv.org/html/2606.04032#bib.bib94)\)\. They applied the proposed models to a more complex and larger\-scale scenario: semantic segmentation of abdominal MRI slices, in which each pixel is labeled as one of three categories: large bowel, small bowel, or stomach\.
They implemented several models with QKV \(default\) and KV \(corresponding to Q=K\-V here\) attention variants, as detailed below\. They skipped the K variant \(corresponding to Q=K=V here\) to focus computational resources on the more competitive variants\. All models share a convolutional decoder adapted from SETR \(Fig\. [7](https://arxiv.org/html/2606.04032#A1.F7), left side\)\. They modified the decoder to halve the feature dimensions during upsampling, reducing the overall parameter count \(Fig\. [7](https://arxiv.org/html/2606.04032#A1.F7), right side\)\.
They implemented the SETR encoder as outlined in Zheng et al\. \([2021](https://arxiv.org/html/2606.04032#bib.bib9)\), using a ViT\-B/16 backbone with feature dimension $D=768$, number of heads $H=12$, and number of layers $L=12$, with both QKV and KV attention mechanisms\. They refer to these architectures as SETR\-QKV and SETR\-KV, respectively\.
Furthermore, they explored SETR\-KV\+Pos, where they introduced positional encoding within the KV attention block to create asymmetry\. The 2D positional encoding dimension $m$ was set to 50\. Additionally, they constructed two models with a hybrid encoder\. Drawing inspiration from TransUNet \(Chen et al\., [2021](https://arxiv.org/html/2606.04032#bib.bib10)\), they integrated the first four convolutional layers of the ResNet\-50 architecture \(He et al\., [2016](https://arxiv.org/html/2606.04032#bib.bib6)\) into the encoder to capture higher\-dimensional features before the patch embedding stage\. In the fourth layer, they increased the number of blocks from 6 to 9 to improve feature extraction while maintaining a feature dimension of 1024\. Unlike the approach in Chen et al\. \([2021](https://arxiv.org/html/2606.04032#bib.bib10)\), no skip connections were used\. They refer to these models as SETR\-QKV\-CE and SETR\-KV\-CE, respectively\.
Finally, they developed an additional hybrid model using a Convolutional Vision Transformer \(CvT\) \(Wu et al\., [2021](https://arxiv.org/html/2606.04032#bib.bib16)\) as the encoder\. The models SETR\-QKV\-CVT and SETR\-KV\-CVT utilize a CvT\-13 encoder, with the multi\-head attention \(MHA\) in the Convolutional Transformer Blocks implemented with QKV and KV attention, respectively\.


Figure 7: Left: The standard SETR architecture \(Zheng et al\., [2021](https://arxiv.org/html/2606.04032#bib.bib9)\)\. Right: The SETR\-PUP decoder, modified to also reduce feature dimensions during upsampling\.

All models were trained for 100 epochs without early stopping to ensure comparable results\. The input resolution was set to $224 \times 224$ with a fixed patch size of $16 \times 16$\. The AdamW optimizer was used with a learning rate of 1e\-4, polynomial learning rate scheduling with factor 0\.9, and a batch size of 32\. During training, on\-the\-fly data augmentation was applied, namely horizontal flipping, vertical flipping, shift\-scale\-rotate, coarse dropout, and random brightness\-contrast, each applied with 50% probability\. All models were trained from scratch \(*i\.e\.*, without pretrained backbones\)\.
The medical image dataset used was UW\-Madison GI Tract Image Segmentation \(happyharrycn et al\., [2022](https://arxiv.org/html/2606.04032#bib.bib17)\), which consists of abdominal MRI slices\. Annotations of the three classes were provided in the form of run\-length encoded \(RLE\) organ segmentations\. During preprocessing, they transformed the RLE ground truth data into 2D grayscale multi\-class masks\. The dataset was split into training, validation, and test sets with a ratio of 80:16:4\.
The performance metrics computed for the tested architectures include the Jaccard index and the weighted Jaccard index \(Table [12](https://arxiv.org/html/2606.04032#A1.T12)\)\. Model complexity is represented by the number of learnable parameters, while computational efficiency is assessed by the number of multiply\-accumulate operations \(MACs\), collected through the `torchinfo` and `ptflops` Python modules\. Their results indicate that all tested attention variants perform comparably to or slightly better than their corresponding QKV implementations, while also reducing both parameter count and MACs by approximately 10%\.
Table 12: The results of semantic segmentation experiments\. No performance drop was observed among most of the KV variants, while simultaneously seeing a reduction in parameter count and MACs\. The asterisk \(\*\) indicates that the MACs calculation does not account for the calculations related to 2D positional encoding\.

For model details and additional results, please refer to Hwa et al\. \([2025](https://arxiv.org/html/2606.04032#bib.bib93)\)\.
### A\.4Additional LLM Results
This section provides additional visualizations and detailed results for the language modeling experiments described in Section [4\.3](https://arxiv.org/html/2606.04032#S4.SS3)\. We present comprehensive comparisons of projection sharing variants, head sharing mechanisms, and their combinations across both 300M and 1\.2B parameter scales\.
Figures [8](https://arxiv.org/html/2606.04032#A1.F8) and [9](https://arxiv.org/html/2606.04032#A1.F9) visualize the core trade\-offs between model quality \(perplexity\) and inference efficiency \(KV cache reduction\)\. Figure [10](https://arxiv.org/html/2606.04032#A1.F10) synthesizes these results into an efficiency\-quality Pareto frontier, demonstrating that projection sharing and head sharing operate on complementary optimization axes\. Figures [11](https://arxiv.org/html/2606.04032#A1.F11) and [12](https://arxiv.org/html/2606.04032#A1.F12) show complete training curves, confirming that quality rankings remain stable throughout training and across model scales\. Table [13](https://arxiv.org/html/2606.04032#A1.T13) provides a comprehensive reference for all evaluated variants\. These visualizations reveal that Q\-K=V achieves the best balance between cache reduction and model quality, while combined approaches like Q\-MQA push the efficiency frontier to near\-theoretical limits with 96\.9% cache reduction \(at 300M scale\)\. The consistency of results across scales validates the reliability of our architectural comparisons and provides confidence in the generalizability of these findings to larger production models\.
Figure 8: Projection sharing variants on 300M parameter LLMs trained on 10B tokens\. Left: Validation perplexity \(lower is better\)\. Right: KV cache reduction \(higher is better\)\. Q\-K=V achieves 50% cache reduction with only 3\.1% perplexity degradation\. KV \(Q=K\-V\) provides no cache benefit despite 4\.8% degradation due to still requiring separate K and V caches\. K \(Q=K=V\) causes catastrophic 25\.4% degradation, making it impractical\.

Figure 9: Head sharing and combined approaches on 300M parameter LLMs\. Left: Validation perplexity\. Right: KV cache reduction\. Orange bars: head sharing only \(GQA\-4, MQA\)\. Green bars: combined projection \+ head sharing \(Q\-GQA\-4, Q\-MQA\)\. Combined approaches achieve up to 96\.9% cache reduction while maintaining less than 5% perplexity degradation, demonstrating that projection sharing and head sharing are complementary optimization axes\.

Figure 10: Efficiency\-quality Pareto frontier for attention variants\. Projection sharing \(blue circles\) and head sharing \(orange triangles\) occupy complementary regions\. Combined approaches \(green diamonds\) achieve the highest cache reductions\. The shaded region indicates the practical deployment zone \(<5% perplexity degradation\)\. Q\-K=V fills the gap between the QKV baseline and head\-sharing methods, providing 50% cache reduction with only 3\.1% degradation\.

Figure 11: Validation curves for 300M parameter models\. Left: Validation loss\. Right: Validation perplexity over 10B training tokens\. Q\-K=V \(dark teal\) matches baseline QKV \(olive\) closely on held\-out data, achieving 50% cache reduction with only 3\.1% perplexity degradation\. Q=K\-V \(light pink\) shows higher validation loss, confirming suboptimal generalization\. All head\-sharing and combined variants converge to practical validation performance\.

Figure 12: Validation curves for 1\.2B parameter models\. Left: Validation loss\. Right: Validation perplexity over 10B training tokens\. Rankings on held\-out data remain consistent with 300M scale\. Q\-K=V \(green\) and head\-sharing variants track baseline QKV \(gray/brown\) closely, while combined approaches \(Q\-GQA\-8, Q\-MQA\) maintain <5% degradation with 88\-98\.5% cache reduction, confirming scalability of our findings\.

Table 13: Comprehensive summary of all attention mechanism variants evaluated\. PE = Positional Encoding\. Cache column shows what must be stored during autoregressive generation\. Cache reduction and perplexity degradation were reported for 300M parameter models\. The “—” entries in the PPL Δ column correspond to \(X\)\+ variants, which were evaluated only on non\-causal tasks \(vision and synthetic\); see Section [2](https://arxiv.org/html/2606.04032#S2), “Scope of \(X\)\+ variants\.”

| Group | \# | Notation | Projections | Cache | Cache ↓ | PPL Δ | Key Insight |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | 1 | QKV | Q, K, V | K\+V | 0% | 0% | Standard attention |
| Projection Sharing | 2 | Q=K\-V | Q=K, V | K\+V | 0% | \+4\.9% | Symmetric, no cache benefit |
| Projection Sharing | 3 | \(Q=K\-V\)\+ | Q=K, V, \+PE | K\+V | 0% | — | Adds 2D PE for asymmetry |
| Projection Sharing | 4 | Q\-K=V | Q, K=V | K | 50% | \+3\.1% | 50% Cache reduction \(Optimal\) |
| Projection Sharing | 5 | Q=K=V | Q=K=V | K | 50% | \+25\.4% | Too constrained |
| Projection Sharing | 6 | \(Q=K=V\)\+ | Q=K=V, \+PE | K | 50% | — | PE partially recovers quality on synthetic |
| Head Sharing \(Comparison Baselines\) | 7 | GQA\-4 | Q, K, V \(4 groups\) | K\+V | 75% | \+0\.7% | 4 groups, 16 heads total |
| Head Sharing \(Comparison Baselines\) | 8 | MQA | Q, K, V \(1 head\) | K\+V | 93\.8% | \+1\.5% | Single KV head for all Q |
| Combined: Projection \+ Head Sharing | 9 | Q\-GQA\-4 | Q, K=V \(4 groups\) | K | 87\.5% | \+3\.9% | K=V within each group |
| Combined: Projection \+ Head Sharing | 10 | Q\-MQA | Q, K=V \(1 head\) | K | 96\.9% | \+4\.8% | K=V on single head |

#### Key Takeaways from Additional Results
The visualizations and comprehensive comparisons in this appendix support several important conclusions:
1. Q\-K=V is the clear winner for projection sharing\. It achieves 50% cache reduction with only 3\.1% perplexity degradation at 300M scale and 2\.48% at 1\.2B scale, representing a new point on the efficiency\-quality Pareto frontier\.
2. Cache reduction, not parameter reduction, drives practical benefits\. While all projection sharing variants reduce parameters, only K=V constraints reduce inference memory\. This explains why Q=K\-V fails to provide deployment advantages despite competitive training quality\.
3. Projection and head sharing are strictly complementary\. Combined approaches achieve 87\.5% \(Q\-GQA\-4\) to 96\.9% \(Q\-MQA\) cache reduction, enabling practical on\-device inference for billion\-parameter models\.
4. Quality rankings remain stable across scales\. The relative performance of all variants is consistent from 300M to 1\.2B parameters, with larger models showing slightly better robustness to projection constraints\.
5. No training instabilities observed\. All variants converge smoothly without requiring specialized initialization, learning rate schedules, or architectural modifications beyond the attention mechanism itself\.
These results establish projection sharing as a practical optimization for memory\-efficient transformer deployment, particularly for applications requiring long contexts or high throughput in resource\-constrained environments\.
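The cache\-reduction figures quoted for the 300M models follow from simple counting, sketched below in Python\. The sketch assumes that cache size is proportional to the number of cached KV heads times the number of cached tensors per head \(two for separate K and V, one for K=V\), with 16 query heads as in the 300M configuration\. The function and variable names are illustrative\.

```python
def kv_cache_fraction(n_heads, n_kv_heads, share_kv):
    """Fraction of the baseline (multi-head, separate K and V) cache that remains."""
    head_fraction = n_kv_heads / n_heads          # head sharing (GQA/MQA)
    tensor_fraction = 0.5 if share_kv else 1.0    # K=V stores one tensor instead of two
    return head_fraction * tensor_fraction

N_HEADS = 16  # attention heads in the 300M configuration

# variant name -> (number of KV heads, whether K and V are shared)
variants = {
    "QKV": (16, False),
    "Q-K=V": (16, True),
    "GQA-4": (4, False),
    "MQA": (1, False),
    "Q-GQA-4": (4, True),
    "Q-MQA": (1, True),
}

reduction = {
    name: 100 * (1 - kv_cache_fraction(N_HEADS, kv_heads, share))
    for name, (kv_heads, share) in variants.items()
}
```

Because the two factors multiply, projection sharing halves whatever cache remains after head sharing, which is why the combined variants reach the largest reductions\.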
##### Inference Wall\-Clock Benchmarks\.
To validate that the theoretical KV cache reductions translate to measurable deployment gains, we benchmarked all 1\.2B variants on a single NVIDIA A100 GPU using bfloat16 with standard causal attention\. We report both a forward\-pass benchmark across batch sizes $\{1, 4, 16\}$ and sequence lengths $\{1024, 2048\}$ \(Table [14](https://arxiv.org/html/2606.04032#A1.T14)\), and an autoregressive generation benchmark with a 128\-token prompt generating 128 new tokens \(Tables [15\(a\)](https://arxiv.org/html/2606.04032#A1.T15.st1) and [15\(b\)](https://arxiv.org/html/2606.04032#A1.T15.st2)\)\. All variants share identical hardware, software, and runtime configuration\.
Table 14: Forward\-pass inference benchmark on a single A100 \(1\.2B models, bf16\)\. All variants reduce peak memory and improve throughput versus the QKV baseline at every batch size and sequence length tested\.

Table 15: Autoregressive generation benchmark on a single A100 \(1\.2B models, bf16, 128\-token prompt, 128 tokens generated\)\. \(Left\) raw measurements\. \(Right\) savings versus QKV\. Q\-K=V consistently outperforms QKV across all configurations\.

\(a\) Raw measurements\.

\(b\) Savings versus QKV\.
Across all configurations, Q\-K=V achieves 6\.5–6\.9% peak memory reduction, 4\.4–5\.3% higher decode throughput, and 4\.3–5\.0% lower per\-token latency relative to QKV\. The 6\.5–6\.9% total memory reduction reflects KV cache as one component of peak memory; activations, weights, and workspace dominate the remainder\. The structural 50% KV cache reduction is fully realized in production serving systems \(e\.g\., vLLM\) where K and V are allocated separately per decode step\. Combined approaches push further: Q\-MQA achieves 12\.8–13\.6% memory reduction and 11\.7–13\.2% throughput improvement, approaching the cache\-bound limit for transformer generation\.
##### Perplexity Across Context Lengths\.
To confirm that projection sharing’s quality cost does not compound with longer contexts, we evaluated all 1\.2B variants at three sequence lengths \(512, 1024, 2048\) on a held\-out SlimPajama validation subset\. Table [16](https://arxiv.org/html/2606.04032#A1.T16) reports relative perplexity degradation versus QKV at each length\. These results use fixed\-length truncation without document\-packed inputs; absolute perplexities are therefore not directly comparable to Table [9](https://arxiv.org/html/2606.04032#S4.T9), and short\-context values may be inflated by low\-context positions\. We include them to characterize relative rankings across lengths rather than as precise degradation estimates\.
Table 16: Relative perplexity degradation \(%\) versus QKV at varying sequence lengths for 1\.2B models\. Relative rankings are stable across context lengths\. Under this evaluation, degradation decreases with sequence length for all variants, suggesting the quality\-efficiency trade\-off does not worsen in the long\-context regime where cache savings matter most\. Results use fixed\-length truncation; see text for methodology caveats\.

The relative rankings are stable across all sequence lengths, confirming that the efficiency hierarchy in Table [9](https://arxiv.org/html/2606.04032#S4.T9) generalizes across context lengths\. Q\-K=V’s degradation decreases from 5\.4% at 512 tokens to 2\.2% at 2048 tokens, aligning closely with its \+2\.48% in Table [9](https://arxiv.org/html/2606.04032#S4.T9)\. MQA shows a slight apparent advantage over QKV under this evaluation; we note this does not fully align with Table [9](https://arxiv.org/html/2606.04032#S4.T9) \(\+1\.06% degradation there\), and attribute the discrepancy to the truncation\-based evaluation methodology\. Q\-MQA results were unstable on this evaluation subset and are omitted\.
### A\.5Full Training Configuration
We provide complete training and architectural details for the language modeling experiments described in Section [4\.3](https://arxiv.org/html/2606.04032#S4.SS3), extending the summary in Section 3\.3\.
Architecture\. The 300M models use 20 transformer layers, embedding dimension $d=1024$, 16 attention heads \(head dimension 64\), and feed\-forward dimension 4096\. The 1\.2B models use 22 layers, $d=2048$, 32 attention heads \(head dimension 64\), and feed\-forward dimension 8192\. Both configurations use GELU activation in the feed\-forward sublayers\. Pre\-Norm LayerNorm \($\epsilon = 10^{-5}$\) is applied before each attention and feed\-forward sublayer\. Input and output embeddings are tied, with vocabulary size 50,304 using the GPT\-2 tokenizer\. Positional information is encoded via learned absolute position embeddings with maximum sequence length 2048\. Residual dropout is set to 0\.1\.
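As a rough sanity check on the stated model sizes, the Python sketch below counts weights from the hyperparameters above\. It assumes full $d \times d$ attention projections \(Q, K, V, and an output projection\) and a two\-matrix feed\-forward block, and it ignores biases and LayerNorm parameters\. `approx_params` is a hypothetical helper, not code from the paper\.

```python
def approx_params(n_layers, d_model, d_ff, vocab=50304, max_len=2048, shared_kv=False):
    """Rough weight count; tied input/output embeddings are counted once."""
    n_proj = 3 if shared_kv else 4              # Q, K(=V), [V], output projection
    attn = n_proj * d_model * d_model
    ffn = 2 * d_model * d_ff                    # up- and down-projection
    embed = vocab * d_model + max_len * d_model # token + learned position embeddings
    return n_layers * (attn + ffn) + embed

small = approx_params(20, 1024, 4096)           # "300M" configuration
large = approx_params(22, 2048, 8192)           # "1.2B" configuration
small_shared = approx_params(20, 1024, 4096, shared_kv=True)
saved = small - small_shared                    # weights removed by tying K and V
```

Under these assumptions the counts come to roughly 305M and 1\.21B parameters, and tying K and V removes one $d \times d$ matrix per layer \(about 21M weights at the smaller scale\)\.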
Optimization\. All models are trained from scratch with AdamW \($\beta_1 = 0.9$, $\beta_2 = 0.95$, weight decay 0\.1, gradient clipping at norm 1\.0\)\. The learning rate schedule is a 1000\-step linear warmup to a peak of $6 \times 10^{-5}$, followed by cosine decay to a minimum of $6 \times 10^{-6}$\.
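A minimal Python sketch of such a schedule is given below\. The exact warmup convention \(here the rate reaches the peak on the last warmup step\) and the decay horizon \(here the total number of training steps\) are assumptions, as neither is spelled out above\. `learning_rate` is an illustrative name\.

```python
import math

def learning_rate(step, total_steps, warmup_steps=1000, peak_lr=6e-5, min_lr=6e-6):
    """Linear warmup to peak_lr, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)               # hold at min_lr past the horizon
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

TOTAL_STEPS = 4238  # step count reported for the 300M models
lr_start = learning_rate(0, TOTAL_STEPS)
lr_peak = learning_rate(1000, TOTAL_STEPS)
lr_end = learning_rate(TOTAL_STEPS, TOTAL_STEPS)
```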
Infrastructure\. Training uses bfloat16 mixed precision on 8× NVIDIA A100 40GB GPUs with distributed data parallelism and gradient accumulation of 36 steps\. The 300M models are trained for 4,238 steps \($\sim$10B tokens\); the 1\.2B models for 8,475 steps \($\sim$10B tokens\)\. Validation perplexity is evaluated every 500 steps on a held\-out 10M\-token subset of SlimPajama\. The only architectural difference across variants is the attention projection mechanism; all other components are held identical to ensure a controlled comparison\.
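The reported step counts can be related to the 10B\-token budget with a short calculation, sketched below in Python\. It assumes every training sequence has the maximum length of 2048 tokens\. The per\-GPU micro\-batch size is not reported, so the value computed here is an inference from the other settings rather than a stated hyperparameter\.

```python
TARGET_TOKENS = 10e9   # approximate token budget
SEQ_LEN = 2048         # assumed training sequence length (the stated maximum)
N_GPUS = 8
GRAD_ACCUM = 36

def tokens_per_step(steps):
    """Tokens consumed per optimizer step for a given total step count."""
    return TARGET_TOKENS / steps

def implied_micro_batch(steps):
    """Sequences per GPU per micro-step implied by the settings (inferred, not reported)."""
    return tokens_per_step(steps) / (N_GPUS * GRAD_ACCUM * SEQ_LEN)

mb_300m = implied_micro_batch(4238)   # 300M models
mb_1p2b = implied_micro_batch(8475)   # 1.2B models
```

Both values come out close to whole numbers \(about 4 and 2 sequences per GPU, respectively\), which is consistent with the larger model using half the per\-step batch and twice as many steps for the same token budget\.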