GQLA: Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding

arXiv cs.LG Papers

Summary

GQLA proposes a minimal modification to Multi-head Latent Attention (MLA) that exposes both an MQA-absorb path and a GQA path over the same trained weights, enabling hardware-adaptive decoding without retraining. The method compresses KV cache and supports tensor parallelism, demonstrated by converting LLaMA-3-8B from GQA to GQLA.

arXiv:2605.15250v1 Announce Type: new Abstract: Multi-head Latent Attention (MLA), the attention used in DeepSeek-V2/V3, jointly compresses keys and values into a low-rank latent and matches the H100 roofline almost perfectly. Its trained weights, however, expose only one decoding path - an absorbed MQA form - which ties efficient inference to H100-class compute-bandwidth ratios, forfeits tensor parallelism along the head axis, and yields no Multi-Token Prediction (MTP) gain on commodity inference GPUs such as the export-restricted H20. We propose Group-Query Latent Attention (GQLA), a minimal modification of MLA whose trained weights expose two algebraically equivalent decoding paths over the same parameters: an MQA-absorb path identical to MLA's, and a GQA path with a per-group expanded cache. The runtime picks the path that matches the target hardware - no retraining, no custom kernels - so a single set of GQLA weights pins the rooflines of both H100 (MQA-absorb, s_q=1) and H20 (GQA + MTP, s_q=2), while supporting up to 8-way zero-redundancy tensor parallelism on the GQA path. To avoid pretraining from scratch we extend TransMLA into TransGQLA, which converts a pretrained GQA checkpoint into a GQLA model; on LLaMA-3-8B it compresses the per-token KV cache to 28.125% of the GQA baseline on the MQA-absorb path while structurally preserving GQA-level traffic on the per-group path.

Cached at: 05/18/26, 06:38 AM

# Group-Query Latent Attention for Hardware-Adaptive Large Language Model Decoding
Source: [https://arxiv.org/html/2605.15250](https://arxiv.org/html/2605.15250)
###### Abstract

Multi-head Latent Attention (MLA), the attention used in DeepSeek-V2/V3, jointly compresses keys and values into a low-rank latent and matches the H100 roofline almost perfectly. Its trained weights, however, expose only one decoding path, an absorbed MQA form, which ties efficient inference to H100-class compute–bandwidth ratios, forfeits tensor parallelism along the head axis, and yields no Multi-Token Prediction (MTP) gain on commodity inference GPUs such as the export-restricted H20. We propose Group-Query Latent Attention (GQLA), a minimal modification of MLA whose trained weights expose *two* algebraically equivalent decoding paths over the same parameters: an MQA-absorb path identical to MLA’s, and a GQA path with a per-group expanded cache. The runtime picks the path that matches the target hardware (no retraining, no custom kernels), so a single set of GQLA weights pins the rooflines of both H100 (MQA-absorb, $s_q = 1$) and H20 (GQA + MTP, $s_q = 2$), while supporting up to 8-way zero-redundancy tensor parallelism on the GQA path. To avoid pretraining from scratch we extend TransMLA into TransGQLA, which converts a pretrained GQA checkpoint into a GQLA model; on LLaMA-3-8B it compresses the per-token KV cache to $28.125\%$ of the GQA baseline on the MQA-absorb path while structurally preserving GQA-level traffic on the per-group path.


Fanxu Meng, Institute for Artificial Intelligence, Peking University, fxmeng@stu.pku.edu.cn

![Refer to caption](https://arxiv.org/html/2605.15250v1/x1.png)

Figure 1: Multi-Head Attention (MHA), Grouped-Query Attention (GQA), Multi-Query Attention (MQA), Multi-head Latent Attention (MLA), and our proposed Group-Query Latent Attention (GQLA). MLA’s joint low-rank latent compression yields the smallest KV cache but locks decoding into a single MQA-absorb path. GQLA inherits the latent compression and additionally exposes a GQA decoding path over the same trained weights, so the runtime can pick the path best matched to the target hardware (Section [3.1](https://arxiv.org/html/2605.15250#S3.SS1)).

## 1 Introduction

Autoregressive decoding in modern Large Language Models (LLMs) is fundamentally bottlenecked by Key–Value (KV) cache traffic: every generated token must read the entire history of cached keys and values from off-chip memory (Pope et al., [2023](https://arxiv.org/html/2605.15250#bib.bib8); Zadouri et al., [2025](https://arxiv.org/html/2605.15250#bib.bib7)). A line of work has therefore focused on shrinking the KV cache: Multi-Query Attention (MQA; Shazeer, [2019](https://arxiv.org/html/2605.15250#bib.bib1)) shares one KV head across all query heads, Grouped-Query Attention (GQA; Ainslie et al., [2023](https://arxiv.org/html/2605.15250#bib.bib2)) shares one KV head per group, and most recently Multi-head Latent Attention (MLA; Liu et al., [2024a](https://arxiv.org/html/2605.15250#bib.bib3)) jointly compresses keys and values into a low-rank latent, reaching state-of-the-art KV-cache reduction in DeepSeek-V2/V3 (Liu et al., [2024a](https://arxiv.org/html/2605.15250#bib.bib3), [b](https://arxiv.org/html/2605.15250#bib.bib13)).

A central design feature of MLA is that its trained weights admit two algebraically equivalent execution paths: during training and prefill the latent is expanded back into per-head keys and values and attention is computed in an MHA-like form (compute-friendly), while during decoding the up-projections are absorbed into the query and output projections so that attention runs against the latent directly in an MQA-like form (memory-friendly). On the NVIDIA H100, whose BF16 roofline (Williams et al., [2009](https://arxiv.org/html/2605.15250#bib.bib6)) ridges around $295$ FLOPs/byte, the absorbed MQA path with the canonical configuration $(h_q, d_h, r_{kv}, d_h^R) = (128, 128, 512, 64)$ and single-token decoding lands its arithmetic intensity at $\approx 242$ FLOPs/byte, just below the ridge. This perfect H100 fit, however, is the only operating point MLA exposes.

##### Three coupled hardware drawbacks of MLA\.

Because MLA is structurally locked into the MQA\-absorb path:

- *Hardware coupling.* The operating point is anchored to H100’s compute–bandwidth ratio. The export-restricted H20 retains the bandwidth but cuts compute by $\sim 7\times$, dropping its ridge to $\sim 37$ FLOPs/byte; MLA then sits far above the ridge and decoding becomes compute-bound (§[4.2](https://arxiv.org/html/2605.15250#S4.SS2)).
- *TP-unfriendly.* The absorbed form funnels every query head through one shared latent KV, so tensor parallelism must replicate the latent on every device.
- *MTP-unfriendly.* Multi-Token Prediction (MTP; Gloeckle et al., [2024](https://arxiv.org/html/2605.15250#bib.bib14); Liu et al., [2024b](https://arxiv.org/html/2605.15250#bib.bib13)) scales the arithmetic intensity linearly with the number of query tokens per step, so a single extra token doubles it, pushing MLA past the H100 ridge and leaving zero throughput gain on the already compute-bound H20.

##### Group\-Query Latent Attention \(GQLA\)\.

We propose a minimal variant of MLA (Figure [1](https://arxiv.org/html/2605.15250#S0.F1) right; Figure [2](https://arxiv.org/html/2605.15250#S2.F2)) that preserves the joint low-rank latent compression but indexes the up-projections by $g$ groups instead of replicating them across all $h_q$ query heads. The trained weights then admit two algebraically equivalent decoding paths, each paired with a natural cache content:

- *MQA-absorb path* (shared with MLA): cache holds the latent $\mathbf{c}^{KV}$ and shared RoPE key, $r_{kv} + d_h^R$ elements/token; all $h_q$ heads attend directly to the latent.
- *GQA path* (only available to GQLA): cache holds the per-group expanded $K_C, V$ plus the shared RoPE key, $2 g d_h + d_h^R$ elements/token; decoding runs vanilla GQA without per-step latent expansion.
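
As a quick check on the two cache layouts, the sketch below counts cached elements per token for each path. It assumes the canonical dimensions quoted in this paper ($d_h = 128$, $d_h^R = 64$, $r_{kv} = 512$, $g = 8$), BF16 storage at 2 bytes per element, and the LLaMA-3 GQA baseline of $2 g d_h = 2048$ elements/token; the function names are illustrative and do not come from any library.

```python
# Hypothetical helpers: per-token KV-cache size for GQLA's two decoding paths.
BYTES_PER_ELEMENT = 2  # BF16 (assumption stated in the lead-in)


def mqa_absorb_cache_elements(r_kv: int, d_rope: int) -> int:
    """Latent c^{KV} plus one shared RoPE key."""
    return r_kv + d_rope


def gqa_cache_elements(g: int, d_h: int, d_rope: int) -> int:
    """Per-group expanded K and V plus one shared RoPE key."""
    return 2 * g * d_h + d_rope


mqa = mqa_absorb_cache_elements(r_kv=512, d_rope=64)
gqa = gqa_cache_elements(g=8, d_h=128, d_rope=64)

print(mqa, mqa * BYTES_PER_ELEMENT)  # 576 1152
print(gqa, gqa * BYTES_PER_ELEMENT)  # 2112 4224

# Ratio of the latent cache to a GQA baseline of 2 * 8 * 128 = 2048 elements/token.
print(mqa / 2048)  # 0.28125
```

The last ratio is where the $28.125\%$ figure in the abstract comes from under these assumptions.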

With the recommended configuration $h_q = 128, g = 8$ plus one MTP head, the same trained weights pin both rooflines: H100 + MQA-absorb at $s_q = 1$ inherits MLA’s H100 sweet spot, while H20 + GQA at $s_q = 2$ lands on the H20 ridge and MTP recovers near-ideal throughput gain. The GQA path additionally supports up to $8$-way zero-redundancy tensor parallelism along the group axis. The path switch requires no retraining and no custom kernels: MQA-absorb reuses MLA’s absorb kernel, GQA reuses the standard GQA kernel.

##### TransGQLA and Sparse GQLA\.

To avoid pretraining from scratch we extend TransMLA (Meng et al., [2026](https://arxiv.org/html/2605.15250#bib.bib4)) into TransGQLA, which converts a pretrained GQA checkpoint into a GQLA model via a single targeted change to the head-merging step that keeps the up-projections indexed by group rather than by query head. We also describe a sparse-attention extension: because GQLA’s GQA-path query-per-KV-head ratio $h_q/g = 16$ matches the Tensor-Core MMA tile, sparse GQLA preserves the GQA path on H20-class hardware, whereas sparse MLA (Liu et al., [2025](https://arxiv.org/html/2605.15250#bib.bib5)) is structurally locked to the sparse MQA-absorb path on every device.

##### Contributions\.

- We identify three coupled hardware drawbacks of MLA’s MQA-absorb-only design: hardware coupling to H100, loss of head-axis tensor parallelism, and zero MTP gain on commodity inference GPUs.
- We introduce GQLA (Section [3.1](https://arxiv.org/html/2605.15250#S3.SS1)), whose trained weights expose two algebraically equivalent decoding paths over the same parameters; the recommended $(h_q, g) = (128, 8)$ + one MTP head simultaneously removes all three drawbacks at deployment time without retraining or custom kernels.
- We introduce TransGQLA (Section [3.2](https://arxiv.org/html/2605.15250#S3.SS2)), a one-line modification of the TransMLA pipeline that converts a pretrained GQA checkpoint into a GQLA model while retaining tensor parallelism, and extend the design to fine-grained sparse attention (Section [3.3](https://arxiv.org/html/2605.15250#S3.SS3)).
- We give a Roofline analysis (Section [4](https://arxiv.org/html/2605.15250#S4)) verifying that the same GQLA weights pin the H100 and H20 rooflines, and empirically validate TransGQLA on LLaMA-3-8B (Section [5](https://arxiv.org/html/2605.15250#S5)).

## 2 Related Work

##### KV\-cache reduction via attention design\.

The dominant family of architectural KV-cache reductions trades query/KV head multiplicity: MQA (Shazeer, [2019](https://arxiv.org/html/2605.15250#bib.bib1)) collapses all query heads onto a single KV head, GQA (Ainslie et al., [2023](https://arxiv.org/html/2605.15250#bib.bib2)) interpolates by sharing one KV head per group, and MLA (Liu et al., [2024a](https://arxiv.org/html/2605.15250#bib.bib3)) pushes the idea further by jointly compressing keys and values into a low-rank latent coupled with a decoupled-RoPE pathway. System-level techniques such as FlashAttention (Dao et al., [2022](https://arxiv.org/html/2605.15250#bib.bib9)), paged KV caches, and quantised KV storage are complementary: they reduce per-byte cost but do not change the asymptotic per-token cache footprint. GQLA stays in the architectural family, inheriting MLA’s latent compression while regaining the GQA execution path that MLA discards.

##### Roofline\-driven attention design\.

Zadouri et al. ([2025](https://arxiv.org/html/2605.15250#bib.bib7)) present a hardware-aware roofline study of latent attention on the H100 and characterise the design choices that govern arithmetic intensity. Pope et al. ([2023](https://arxiv.org/html/2605.15250#bib.bib8)) and Gholami et al. ([2024](https://arxiv.org/html/2605.15250#bib.bib10)) argue more broadly that LLM inference is increasingly bandwidth-limited as compute scales faster than HBM bandwidth. Our analysis (Section [4](https://arxiv.org/html/2605.15250#S4)) follows the same methodology and extends it to the export-restricted H20 to motivate hardware-adaptive path selection.

##### Converting pretrained MHA/GQA models\.

Training a new attention architecture from scratch is expensive, so several recent papers convert existing checkpoints. TransMLA (Meng et al., [2026](https://arxiv.org/html/2605.15250#bib.bib4)) converts a GQA model into an MLA model in two steps: an exact head-merging reformulation, followed by RoRoPE/FreqFold/balanced low-rank compression of the latent. MHA2MLA (Ji et al., [2025](https://arxiv.org/html/2605.15250#bib.bib12)) pursues a similar goal under a different parameterisation. TransGQLA (Section [3.2](https://arxiv.org/html/2605.15250#S3.SS2)) reuses the TransMLA pipeline almost verbatim, with a targeted change in the head-merging step that preserves the GQA execution path and tensor parallelism.

##### Sparse and long\-context attention\.

DeepSeek Sparse Attention (DSA; Liu et al., [2025](https://arxiv.org/html/2605.15250#bib.bib5)) extends MLA with token-dependent top-$k$ selection of past keys/values for long-context inference. As shown in Section [3.3](https://arxiv.org/html/2605.15250#S3.SS3), sparse MLA is structurally locked to the absorbed MQA path by MMA tile constraints, whereas sparse GQLA naturally supports both paths. HISA (Xu et al., [2026](https://arxiv.org/html/2605.15250#bib.bib15)) is orthogonal: it replaces the DSA-style indexer with hierarchical scoring to accelerate top-$k$ selection itself, and composes with GQLA: HISA accelerates the “before top-$k$” indexer while GQLA accelerates the “after top-$k$” attention.

![Refer to caption](https://arxiv.org/html/2605.15250v1/x2.png) (a) GQA path of GQLA.
![Refer to caption](https://arxiv.org/html/2605.15250v1/x3.png) (b) MQA-absorb path of GQLA.

Figure 2: The two algebraically equivalent decoding paths of GQLA over a single set of trained weights. Left: the GQA path materialises $g$ key/value groups from the latent and runs standard GQA attention; paired with the per-group expanded cache, it is the H20-deployment working point. Right: the MQA-absorb path absorbs $W^{UK}, W^{UV}$ into the query and output projections so that all $h_q$ query heads attend to the latent directly; paired with the compact latent cache, it is the H100-deployment working point. Both paths produce numerically identical outputs (Section [4.2](https://arxiv.org/html/2605.15250#S4.SS2)); the deployment-time choice is driven by the target hardware.

## 3 Methods

### 3.1 Group-Query Latent Attention

##### Architecture\.

Let $\mathbf{x}_t \in \mathbb{R}^{D}$ denote the $t$-th token embedding. A down-projection $W^{DKV} \in \mathbb{R}^{r_{kv} \times D}$ compresses it into a low-rank latent $\mathbf{c}_t^{KV}$; the up-projections $W^{UK}, W^{UV} \in \mathbb{R}^{gd \times r_{kv}}$ expand the latent into $g$ key/value groups of per-head dimension $d$, matching the KV-cache footprint of a GQA model with $g$ groups. Queries are decomposed analogously by $W^{DQ} \in \mathbb{R}^{r_q \times D}$ and $W^{UQ} \in \mathbb{R}^{hd \times r_q}$ into $h$ heads. Positional information follows MLA’s decoupled-RoPE strategy: a per-head query path $\mathbf{q}_{t,i}^{R} \in \mathbb{R}^{d^R}$ from $W^{QR} \in \mathbb{R}^{h d^R \times r_q}$ and a single shared key path $\mathbf{k}_t^{R} \in \mathbb{R}^{d^R}$ from $W^{KR} \in \mathbb{R}^{d^R \times D}$. The query and key representations are

$$
\begin{aligned}
\mathbf{c}_t^{Q} &= W^{DQ}\mathbf{x}_t, \\
\mathbf{q}_t^{C} &= [\mathbf{q}_{t,1}^{C}; \dots; \mathbf{q}_{t,h}^{C}] = W^{UQ}\mathbf{c}_t^{Q}, \\
\mathbf{q}_t^{R} &= [\mathbf{q}_{t,1}^{R}; \dots; \mathbf{q}_{t,h}^{R}] = \text{RoPE}_t(W^{QR}\mathbf{c}_t^{Q}), \\
\mathbf{q}_{t,i} &= [\mathbf{q}_{t,i}^{C}; \mathbf{q}_{t,i}^{R}], \\
\mathbf{c}_t^{KV} &= W^{DKV}\mathbf{x}_t, \\
\mathbf{k}_t^{C} &= [\mathbf{k}_{t,1}^{C}; \dots; \mathbf{k}_{t,g}^{C}] = W^{UK}\mathbf{c}_t^{KV}, \\
\mathbf{k}_t^{R} &= \text{RoPE}_t(W^{KR}\mathbf{x}_t), \\
\mathbf{k}_{t,i} &= [\mathbf{k}_{t,i}^{C}; \mathbf{k}_t^{R}].
\end{aligned}
\tag{1}
$$

##### Two equivalent decoding paths\.

GQLA exposes two algebraically equivalent decoding paths over the same trained weights, differing only in how the latent $\mathbf{c}_t^{KV}$ is consumed. The GQA path (Eq. ([2](https://arxiv.org/html/2605.15250#S3.E2))) materialises $g$ key/value groups from the latent and runs ordinary GQA attention against a per-group expanded cache of $2 g d_h + d_h^R$ elements/token. The MQA-absorb path (Eq. ([3](https://arxiv.org/html/2605.15250#S3.E3))) absorbs $W^{UK}, W^{UV}$ into the query and output projections so that the latent itself plays the role of a single shared key and value, attending against a compact latent cache of $r_{kv} + d_h^R$ elements/token (the shared RoPE key is stored once across groups). Switching between paths requires only a one-shot compress/expand of the KV cache at deployment time, never at runtime.

##### GQA path

$$
\begin{aligned}
\mathbf{v}_t^{C} &= [\mathbf{v}_{t,1}^{C}; \mathbf{v}_{t,2}^{C}; \dots; \mathbf{v}_{t,g}^{C}] = W^{UV}\mathbf{c}_t^{KV}, \\
[\mathbf{k}_{t,i}; \mathbf{v}_{t,i}] &= \text{repeat}([\mathbf{k}_{t,i}; \mathbf{v}_{t,i}], h/g), \\
\mathbf{o}_{t,i} &= \sum_{s=1}^{t} \operatorname{softmax}_s\!\left(\tfrac{\mathbf{q}_{t,i}^{\top}\mathbf{k}_{s,i}}{\sqrt{d + d^R}}\right)\mathbf{v}_{s,i}^{C}, \\
\mathbf{y}_t &= W^{O}[\mathbf{o}_{t,1}; \dots; \mathbf{o}_{t,h}].
\end{aligned}
\tag{2}
$$

##### MQA\-absorb path

$$
\begin{aligned}
[\hat{W}^{UK}; \hat{W}^{UV}] &= \text{repeat}([W^{UK}; W^{UV}], h/g), \\
\mathbf{q}_{t,i}^{A} &= (\hat{W}^{UK}_{i})^{\top}\mathbf{q}_{t,i}^{C}, \\
\hat{\mathbf{q}}_{t,i} &= [\mathbf{q}_{t,i}^{A}; \mathbf{q}_{t,i}^{R}], \\
\hat{\mathbf{k}}_t &= [\mathbf{c}_t^{KV}; \mathbf{k}_t^{R}], \quad \hat{\mathbf{v}}_t = \mathbf{c}_t^{KV}, \\
\hat{\mathbf{o}}_{t,i} &= \sum_{s=1}^{t} \operatorname{softmax}_s\!\left(\tfrac{\hat{\mathbf{q}}_{t,i}^{\top}\hat{\mathbf{k}}_s}{\sqrt{d + d^R}}\right)\hat{\mathbf{v}}_s, \\
\mathbf{o}_{t,i} &= \hat{W}^{UV}_{i}\hat{\mathbf{o}}_{t,i}, \\
\mathbf{y}_t &= W^{O}[\mathbf{o}_{t,1}; \dots; \mathbf{o}_{t,h}],
\end{aligned}
\tag{3}
$$

where $\hat{W}^{UK}_{i}, \hat{W}^{UV}_{i} \in \mathbb{R}^{d \times r_{kv}}$ are the $i$-th query-head slices of the up-projection matrices after their group-wise replication along the head axis.
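
The equivalence of Eqs. (2) and (3) can be checked numerically. The sketch below is a minimal NumPy illustration under stated assumptions: toy dimensions rather than the paper's configuration, random weights, RoPE treated as already applied to the cached keys and queries, and the shared output projection $W^{O}$ omitted because it is identical on both paths. All variable and function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
h, g, d, d_rope, r_kv, T = 4, 2, 8, 4, 6, 5  # toy sizes, not the paper's config
rep = h // g                                  # query heads per KV group

W_uk = rng.standard_normal((g, d, r_kv))      # per-group slices of W^{UK}
W_uv = rng.standard_normal((g, d, r_kv))      # per-group slices of W^{UV}
c_kv = rng.standard_normal((T, r_kv))         # cached latents c_s^{KV}
k_rope = rng.standard_normal((T, d_rope))     # shared RoPE keys (assumed already rotated)
q_c = rng.standard_normal((h, d))             # content queries q_{t,i}^C
q_r = rng.standard_normal((h, d_rope))        # RoPE queries q_{t,i}^R
scale = np.sqrt(d + d_rope)


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def gqa_path():
    """Expand the latent into per-group K/V, then run ordinary GQA (Eq. 2)."""
    out = np.empty((h, d))
    for i in range(h):
        j = i // rep                          # 0-based group of head i
        k = c_kv @ W_uk[j].T                  # (T, d) expanded keys
        v = c_kv @ W_uv[j].T                  # (T, d) expanded values
        scores = (k @ q_c[i] + k_rope @ q_r[i]) / scale
        out[i] = softmax(scores) @ v
    return out


def mqa_absorb_path():
    """Absorb W^{UK} into the query and W^{UV} into the output (Eq. 3)."""
    out = np.empty((h, d))
    for i in range(h):
        j = i // rep
        q_a = W_uk[j].T @ q_c[i]              # absorbed query in latent space
        scores = (c_kv @ q_a + k_rope @ q_r[i]) / scale
        o_hat = softmax(scores) @ c_kv        # attention output in latent space
        out[i] = W_uv[j] @ o_hat
    return out


print(np.allclose(gqa_path(), mqa_absorb_path()))
```

The two functions differ only in where the up-projections are applied: before the attention (GQA path) or folded into the query and output (MQA-absorb path). The attention scores and the weighted sums agree up to floating-point rounding.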

### 3.2 TransGQLA

Following TransMLA (Meng et al., [2026](https://arxiv.org/html/2605.15250#bib.bib4)), we convert a pretrained GQA checkpoint into a GQLA model and refer to the procedure as TransGQLA. TransGQLA reuses the entire TransMLA pipeline, namely merging grouped heads, decoupling RoPE (RoRoPE), frequency folding (FreqFold), and key–value norm balancing, with a single targeted change in the head-merging step.

##### Merging grouped heads to a latent head\.

The first stage of TransMLA folds GQA’s $g$ KV heads into a single latent and *replicates* the up-projections $W^{UK}, W^{UV}$ across all $h$ query heads, so the non-absorbed computation behaves as MHA. TransGQLA omits the replication: $W^{UK}, W^{UV}$ remain indexed by group $j \in [1, g]$ rather than by query head $i \in [1, h]$. The merged module thus behaves as a standard GQA (not MHA) and is structurally identical to the GQA path of Section [3.1](https://arxiv.org/html/2605.15250#S3.SS1); the MQA-absorb path is reachable, exactly as in MLA, via the absorb operation. The per-group structure also preserves tensor parallelism along the group axis, a property MLA loses once absorbed.

Concretely, the merged GQA attention is re\-expressed as

$$
\begin{aligned}
\mathbf{q}_t &= [\mathbf{q}_{t,1}; \dots; \mathbf{q}_{t,h}] = W^{Q}\mathbf{x}_t, \\
\mathbf{c}_t^{KV} &= [\mathbf{c}_t^{K}; \mathbf{c}_t^{V}] = W^{DKV}\mathbf{x}_t, \\
\hat{\mathbf{q}}_{t,i}^{R} &= \overline{\text{RoPE}}_t\!\left((W^{UK}_{j(i)})^{\top}\mathbf{q}_{t,i}\right), \\
\hat{\mathbf{k}}_t^{R} &= \overline{\text{RoPE}}_t\!\left(\mathbf{c}_t^{K}\right), \quad \hat{\mathbf{v}}_t = \mathbf{c}_t^{V}, \\
\hat{\mathbf{o}}_{t,i} &= \sum_{s=1}^{t} \operatorname{softmax}_s\!\left(\tfrac{\hat{\mathbf{q}}^{R^{\top}}_{t,i}\hat{\mathbf{k}}_s^{R}}{\sqrt{d}}\right)\hat{\mathbf{v}}_s, \\
\mathbf{y}_t &= W^{O}[W^{UV}_{j(1)}\hat{\mathbf{o}}_{t,1}; \dots; W^{UV}_{j(h)}\hat{\mathbf{o}}_{t,h}],
\end{aligned}
\tag{4}
$$

where $j(i) = \lceil i/(h/g) \rceil$ routes the $i$-th query head to its group, and each $W^{UK}_{j} = W^{UV}_{j} \in \mathbb{R}^{d \times gd}$ is initialised as a sparse identity block selecting the $j$-th group out of the $gd$-dimensional latent (mirroring GQA’s repeat_kv). The operator $\overline{\text{RoPE}}$ consolidates the $g$ identical per-head RoPE rotations into a single one that applies the original pattern repeatedly every $d$ dimensions across the unified key.

By itself this reformulation does not reduce the KV cache, which remains $\mathbf{c}_t^{KV} \in \mathbb{R}^{2gd}$; compression is delivered by the subsequent pipeline stages.
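
To make the routing $j(i)$ and the sparse identity initialisation concrete, here is a small pure-Python sketch. It uses LLaMA-3-8B-style head counts (32 query heads, 8 KV groups) for the routing and a toy latent for the selection block; `group_of_head` and `selection_block` are hypothetical names, not part of TransMLA or TransGQLA.

```python
import math


def group_of_head(i: int, h: int, g: int) -> int:
    """1-based j(i) = ceil(i / (h / g)) for a 1-based query-head index i."""
    return math.ceil(i / (h // g))


def selection_block(j: int, g: int, d: int) -> list[list[int]]:
    """d x (g*d) 0/1 matrix that picks the j-th (1-based) d-wide slice."""
    start = (j - 1) * d
    return [[1 if col == start + row else 0 for col in range(g * d)]
            for row in range(d)]


# 32 query heads sharing 8 KV groups: heads 1-4 map to group 1, 5-8 to group 2, ...
print([group_of_head(i, h=32, g=8) for i in (1, 4, 5, 32)])  # [1, 1, 2, 8]

# Toy latent made of g = 2 groups of width d = 3.
latent = [10, 11, 12, 20, 21, 22]
block = selection_block(j=2, g=2, d=3)
print([sum(w * x for w, x in zip(row, latent)) for row in block])  # [20, 21, 22]
```

Because each block only selects a slice, applying it reproduces what GQA's key/value repetition does implicitly, which is why the merge step is exact before any compression.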

##### RoRoPE, FreqFold, and balanced KV compression\.

The remaining stages (decoupling positional information via head-wise rotation (RoRoPE), grouping nearby rotational frequencies before PCA (FreqFold), and balancing the norms of $K_{\text{nope}}$ and $V$ prior to joint low-rank compression) are inherited from TransMLA without modification. They operate on the merged $gd$-dimensional latent and are agnostic to whether the post-merge model is interpreted as MHA (TransMLA) or GQA (TransGQLA); see Meng et al. ([2026](https://arxiv.org/html/2605.15250#bib.bib4)) for details.

### 3.3 Sparse GQLA

Following DSA (Liu et al., [2025](https://arxiv.org/html/2605.15250#bib.bib5)), fine-grained sparse attention computes attention only over a token-dependent subset $\mathcal{S}_t = \mathrm{top}\text{-}k(I_{t,:})$ of past positions, with per-head output

$$
\mathbf{u}_{t,j} = \sum_{s \in \mathcal{S}_t} \operatorname{softmax}\!\left(\frac{\mathbf{q}_{t,j}^{\top}\mathbf{k}_{s,\,g(j)}}{\sqrt{d}}\right)\mathbf{v}_{s,\,g(j)},
\tag{5}
$$

where $g(j) = \lceil j/(h/g) \rceil$ routes query head $j$ to its KV group.
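
A minimal NumPy sketch of Eq. (5) for one query token follows. It assumes the index scores $I_{t,:}$ are already available as a vector (the indexer itself is out of scope here) and uses toy shapes; `sparse_gqa_decode` is a hypothetical name, and the loop over heads is written for clarity rather than speed.

```python
import numpy as np


def sparse_gqa_decode(q, k, v, index_scores, top_k):
    """Top-k sparse GQA for a single query token.

    q: (h, d) queries; k, v: (T, g, d) cached keys/values per group;
    index_scores: (T,) indexer scores I_{t,:}. Returns (h, d).
    """
    h, d = q.shape
    g = k.shape[1]
    rep = h // g                                  # query heads per KV group
    selected = np.argsort(index_scores)[-top_k:]  # S_t = top-k(I_{t,:})
    out = np.empty((h, d))
    for j in range(h):
        grp = j // rep                            # 0-based g(j)
        scores = k[selected, grp] @ q[j] / np.sqrt(d)
        p = np.exp(scores - scores.max())
        p /= p.sum()                              # softmax over the selected set only
        out[j] = p @ v[selected, grp]
    return out


rng = np.random.default_rng(1)
T, h, g, d = 6, 4, 2, 3
q = rng.standard_normal((h, d))
k = rng.standard_normal((T, g, d))
v = rng.standard_normal((T, g, d))
idx = rng.standard_normal(T)

print(sparse_gqa_decode(q, k, v, idx, top_k=2).shape)
```

With `top_k` equal to the cache length the function reduces to dense GQA, and with `top_k=1` each head simply returns the value of the highest-scoring position in its group.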

##### Sparse MLA is locked into the MQA\-absorb path\.

Because $\mathcal{S}_t$ varies across query tokens, the natural execution model issues one compute block per token, packing all $h$ heads of that token into a single GEMM against the retrieved keys. Modern Tensor Cores execute this GEMM through fixed-shape MMA tiles (e.g. m16n16k16) whose $m$ dimension must be at least $16$, requiring that at least $16$ query heads share each KV head. MLA in its MHA-mode form has $m = 1$ and degenerates into low-intensity GEMV; sparse MLA is therefore forced into the absorbed MQA form on every device, inheriting the same compute overhead and TP loss that hurt dense MLA on H20-class hardware.

##### Sparse GQLA preserves the GQA path\.

GQLA’s canonical configuration has $h_q/g = 16$ query heads per KV group on the GQA path, exactly the $m = 16$ MMA tile, so Eq. ([5](https://arxiv.org/html/2605.15250#S3.E5)) maps onto Tensor Cores at full efficiency without leaving the GQA path. The same hardware-driven rule as the dense case applies: memory-bound hardware switches to sparse MQA-absorb to minimise KV traffic, while compute-bound hardware stays in sparse GQA to keep FLOPs low and retain group-axis tensor parallelism. No custom kernels are required for either path.
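
The tile argument reduces to comparing the number of query heads that share one KV head with the tile's $m$ dimension. The sketch below works through that arithmetic for $h_q = 128$; it is a simplified illustration that assumes $m = 16$ as quoted above and ignores everything else about kernel scheduling. The helper names are hypothetical.

```python
MMA_TILE_M = 16  # m dimension of an m16n16k16 tile, as quoted in the text


def rows_per_kv_head(h_q: int, kv_heads: int) -> int:
    """Query heads of one token that multiply against the same KV head."""
    return h_q // kv_heads


def fills_tile(h_q: int, kv_heads: int, tile_m: int = MMA_TILE_M) -> bool:
    """True when one token's heads supply at least tile_m rows per KV head."""
    return rows_per_kv_head(h_q, kv_heads) >= tile_m


h_q = 128
layouts = {
    "MLA, MHA-mode (one KV head per query head)": h_q,
    "GQLA, GQA path (g = 8)": 8,
    "MQA-absorb (single shared latent)": 1,
}
for name, kv_heads in layouts.items():
    print(name, rows_per_kv_head(h_q, kv_heads), fills_tile(h_q, kv_heads))
```

Under these assumptions only the MHA-mode layout falls short of the tile, while the GQA path with $g = 8$ meets it exactly.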

##### Composing with HISA\.

The $O(L)$ indexer that produces $\{I_{t,s}\}$ becomes the dominant cost at $L \geq 64$K. HISA (Xu et al., [2026](https://arxiv.org/html/2605.15250#bib.bib15)) is a training-free hierarchical-scoring replacement that accelerates the indexer kernel while preserving IoU $> 99\%$ with the original top-$k$ set. GQLA and HISA compose naturally: HISA accelerates the “before top-$k$” indexer while GQLA keeps the “after top-$k$” attention filling the MMA tile, pushing end-to-end sparse long-context decoding to the hardware peak from both sides.

## 4 Roofline Analysis

### 4.1 The Roofline model and the H100/H20 ridges

The Roofline model (Williams et al., [2009](https://arxiv.org/html/2605.15250#bib.bib6)) characterises a kernel by its arithmetic intensity $I$ (FLOPs per byte of off-chip traffic) and bounds attainable throughput as $\min(I \cdot \mathrm{BW},\, \mathrm{FLOPs}_{\max})$. The boundary between the memory- and compute-bound regimes is the *ridge point* $I^{\star} = \mathrm{FLOPs}_{\max}/\mathrm{BW}$: efficient decoding designs an attention whose arithmetic intensity lands as close to $I^{\star}$ as possible on the target device.

Standard MHA decoding has $I \approx 1$ (Zadouri et al., [2025](https://arxiv.org/html/2605.15250#bib.bib7)): each cached BF16 element is consumed by exactly one query element of the new token. Table [1](https://arxiv.org/html/2605.15250#S4.T1) contrasts the two GPUs we analyse. The H100 ridge sits at $I^{\star} \approx 295$ FLOPs/byte, leaving MHA decoding roughly $300\times$ (about two and a half orders of magnitude) inside the memory-bound regime; closing this gap requires redesigning attention itself, not just kernel-level optimisation (Dao et al., [2022](https://arxiv.org/html/2605.15250#bib.bib9); Pope et al., [2023](https://arxiv.org/html/2605.15250#bib.bib8)). The export-restricted H20 keeps (and slightly exceeds) the H100’s HBM bandwidth but cuts compute by $\sim 7\times$, dropping the ridge to $I^{\star} \approx 37$. Although hardware FLOPs have historically outpaced bandwidth (Gholami et al., [2024](https://arxiv.org/html/2605.15250#bib.bib10)), the H100 $\to$ H20 pair inverts that trend, and an arithmetic intensity well matched to H100 is far above the H20 ridge, which amounts to wasted compute on the cheaper card.
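
The ridge points quoted above follow directly from peak compute and bandwidth. The sketch below assumes commonly quoted datasheet-style figures (about 989 TFLOP/s dense BF16 and 3.35 TB/s for H100, about 148 TFLOP/s and 4.0 TB/s for H20). These numbers are assumptions chosen because they reproduce the ridges stated in the text, not values read from this paper's Table 1, and the helper names are hypothetical.

```python
# Assumed peaks: dense BF16 FLOP/s and HBM bytes/s (not taken from the paper's Table 1).
H100 = {"peak_flops": 989e12, "bandwidth": 3.35e12}
H20 = {"peak_flops": 148e12, "bandwidth": 4.0e12}


def ridge_point(gpu: dict) -> float:
    """I* = FLOPs_max / BW, in FLOPs per byte."""
    return gpu["peak_flops"] / gpu["bandwidth"]


def attainable_flops(intensity: float, gpu: dict) -> float:
    """Roofline bound min(I * BW, FLOPs_max)."""
    return min(intensity * gpu["bandwidth"], gpu["peak_flops"])


print(round(ridge_point(H100)))  # 295
print(round(ridge_point(H20)))   # 37

# Fraction of H100 peak compute reachable by an I = 1 kernel such as MHA decoding.
print(attainable_flops(1.0, H100) / H100["peak_flops"])
```

An intensity of $1$ leaves the H100 far below its compute peak, which is the gap the attention redesign is meant to close.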

Table 1: BF16 Roofline parameters (dense peak compute, no 2:4 sparsity). H20 has $\sim 1/7$ the peak compute of H100 but slightly higher HBM bandwidth, so its ridge sits $\sim 8\times$ lower.

![Refer to caption](https://arxiv.org/html/2605.15250v1/imgs/roofline_h100.png)
![Refer to caption](https://arxiv.org/html/2605.15250v1/imgs/roofline_h20.png)

Figure 3: Roofline analysis of BF16 decoding on H100 (left) and H20 (right). Black solid line: $\min(I \cdot \mathrm{BW}, \mathrm{peak})$; vertical dashed line: ridge $I^{\star}$. On H100, MLA and GQLA share the MQA-absorb path: $s_q = 1$ lands just below the ridge, while $s_q = 2$ MTP overshoots it and becomes compute-bound. On H20, MLA-MQA-absorb is far above the ridge (severely compute-bound), whereas GQLA’s GQA path at $(g, s_q) \in \{(8,2), (4,1)\}$ pins the ridge and saturates both bandwidth and compute.

### 4.2 GQLA on the Roofline

We now apply the Roofline analysis to GQLA’s two decoding paths and explain why it remains close to the achievable peak on both H100-class (compute-rich) and H20-class (compute-poor) GPUs, while MLA cannot. The combined design space is two paths $\times$ one deployment knob (the per-step query-token count $s_q$; ordinary decoding has $s_q = 1$, MTP/speculative decoding gives $s_q \geq 2$). Notation is summarised in Appendix [B](https://arxiv.org/html/2605.15250#A2); we use the DeepSeek-V2/V3 canonical configuration $(h_q, g, d_h, d_h^R, r_{kv}) = (128, 8, 128, 64, 512)$ unless otherwise stated. Some recent open models (Team et al., [2026](https://arxiv.org/html/2605.15250#bib.bib16); GLM Team, Zhipu AI, [2025](https://arxiv.org/html/2605.15250#bib.bib17)) use $h_q = 64$, which halves all $I$ values but leaves the qualitative conclusions unchanged.

#### 4.2.1 MQA-absorb path: compact latent cache

The MQA-absorb path stores per token only the jointly compressed latent $\mathbf{c}_s^{KV} \in \mathbb{R}^{r_{kv}}$ (shared by all heads) and the MLA-style decoupled RoPE key $\mathbf{k}_s^{R} \in \mathbb{R}^{d_h^R}$ (stored once, no replication), giving

$$
N_{\mathrm{MQA}}^{\mathrm{tok}} = r_{kv} + d_h^R, \quad B_{\mathrm{MQA}}^{\mathrm{tok}} = 2(r_{kv} + d_h^R)
\tag{6}
$$

($1152$ bytes/token at the canonical configuration). Decoding one step reads all $L$ cached tokens once and reuses them across the $s_q$ new query tokens (FlashAttention-style), so $B_{\mathrm{MQA}} = 2L(r_{kv} + d_h^R)$ is independent of $s_q$. After absorption (Eq. ([3](https://arxiv.org/html/2605.15250#S3.E3))), each (head, query-token, cache-position) triplet contributes $2(2r_{kv} + d_h^R)$ FLOPs, hence

$$
F_{\mathrm{MQA}} = 2L\,h_q s_q\,(2r_{kv} + d_h^R),
\tag{7}
$$

$$
I_{\mathrm{MQA}} = \frac{h_q s_q (2r_{kv} + d_h^R)}{r_{kv} + d_h^R}.
\tag{8}
$$

$I_{\mathrm{MQA}}$ scales linearly with $s_q$ (DeepSeek-AI, [2025](https://arxiv.org/html/2605.15250#bib.bib18)): $I_{\mathrm{MQA}}(s_q{=}1) \approx 242$ sits just below the H100 ridge (memory-bound) and $I_{\mathrm{MQA}}(s_q{=}2) \approx 484$ overshoots it (compute-bound). MLA enables MTP by default in DeepSeek-V3 ($s_q = 2$), so its per-step time grows from $2.82$ to $4.61\,\mu\text{s}$ on H100 and the MTP throughput gain shrinks from the ideal $2\times$ to $\sim 1.22\times$.
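
The two intensity values can be reproduced by evaluating Eq. (8) at the canonical configuration. This is a small arithmetic sketch; the function name is illustrative.

```python
def mqa_absorb_intensity(h_q: int, s_q: int, r_kv: int, d_rope: int) -> float:
    """Eq. (8): FLOPs per byte of KV-cache traffic on the MQA-absorb path."""
    return h_q * s_q * (2 * r_kv + d_rope) / (r_kv + d_rope)


for s_q in (1, 2):
    print(s_q, round(mqa_absorb_intensity(h_q=128, s_q=s_q, r_kv=512, d_rope=64), 1))
# 1 241.8
# 2 483.6
```

Relative to an H100 ridge of about $295$ FLOPs/byte, the first value is below the ridge and the second is above it, matching the memory-bound and compute-bound labels in the text.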

#### 4.2.2 GQA path: per-group expanded cache

The GQA path stores per-group expanded $K_C, V$ ($g d_h$ elements each) plus the MLA-style shared RoPE key (stored once across groups), so

$$N_{\mathrm{GQA}}^{\mathrm{tok}} = 2 g d_h + d_h^R, \quad B_{\mathrm{GQA}}^{\mathrm{tok}} = 2(2 g d_h + d_h^R) \tag{9}$$

(4224 bytes/token at $g = 8$). The cache is structurally close to LLaMA-3 GQA's $2 g d_h = 2048$, with only $d_h^R$ extra elements for the shared RoPE key, but $K, V$ are constrained at training time into the rank-$r_{kv}$ subspace spanned by GQLA's up-projections, so the GQA path differs in expressivity from a freely parameterised same-$d_h$ standard GQA. Per (head, query-token, cache-position) FLOPs are $2(2 d_h + d_h^R)$, giving

$$F_{\mathrm{GQA}} = 2L\,h_q s_q\,(2 d_h + d_h^R), \tag{10}$$

$$I_{\mathrm{GQA}} = \frac{h_q s_q (2 d_h + d_h^R)}{2 g d_h + d_h^R}. \tag{11}$$

$I_{\mathrm{GQA}}$ scales linearly with $s_q$ and roughly inversely with $g$. Two configurations pin the H20 ridge: $(g, s_q) = (8, 2)$ gives $I_{\mathrm{GQA}} \approx 38.8$, while $(g, s_q) = (4, 1)$ gives $I_{\mathrm{GQA}} \approx 37.6$.
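The two ridge-pinning configurations can be checked numerically from Eqs. (9) and (11). The sketch below is a minimal illustration; the H20 peak figures (148 TFLOP/s dense BF16, 4.0 TB/s HBM, giving a ridge of about 37 FLOPs/byte) are assumptions not stated in this section, and the helper names are hypothetical.

```python
# Canonical head shape from the text; BF16 stores 2 bytes per element.
H_Q, D_H, D_H_ROPE = 128, 128, 64

# ASSUMPTION: H20 dense BF16 peak (FLOP/s) and HBM bandwidth (bytes/s).
H20_PEAK_FLOPS, H20_PEAK_BW = 148e12, 4.0e12


def gqa_cache_bytes_per_token(g):
    """Eq. (9): per-group K and V (g * d_h each) plus one shared RoPE key."""
    return 2 * (2 * g * D_H + D_H_ROPE)


def gqa_intensity(g, s_q):
    """Eq. (11): FLOPs per byte on the GQA path (context length cancels)."""
    flops_per_position = 2 * H_Q * s_q * (2 * D_H + D_H_ROPE)
    return flops_per_position / gqa_cache_bytes_per_token(g)


h20_ridge = H20_PEAK_FLOPS / H20_PEAK_BW

if __name__ == "__main__":
    for g, s_q in [(8, 2), (4, 1), (8, 1), (4, 2)]:
        print(f"g={g}, s_q={s_q}: I_GQA={gqa_intensity(g, s_q):.1f} "
              f"(assumed H20 ridge ~{h20_ridge:.0f})")
```

The extra rows $(8, 1)$ and $(4, 2)$ are included only to show how halving $g$ or doubling $s_q$ moves the operating point away from the ridge.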

#### 4.2.3 Operating points across hardware

Table 2: Per-step Roofline operating points ($L = 8192$, BF16, canonical config). Per-step time $= \max(F/\mathrm{FLOPs}_{\max}, B/\mathrm{BW})$; throughput $= s_q/\text{step}$. The recommended $(h_q, g, s_q) = (128, 8, 2)$ pairs H100 + MQA-absorb at 354K tok/s with H20 + GQA at 221K tok/s on the same trained weights; $(g, s_q) = (4, 1)$ is an equally ridge-optimal H20 alternative. MLA on H20 is always compute-bound (65K tok/s, zero MTP gain).

Table [2](https://arxiv.org/html/2605.15250#S4.T2) tabulates $\max(F/\mathrm{FLOPs}_{\max}, B/\mathrm{BW})$ across hardware $\times$ path $\times\, s_q$. Three observations summarise the design space: (1) on H100 the MQA-absorb path with $s_q = 1$ has the shortest step time ($2.82\,\mu\text{s}$/step), and enabling MTP turns it compute-bound, shrinking the gain to $1.22\times$; (2) MLA on H20 is always compute-bound, so MTP delivers zero throughput gain; (3) GQLA's GQA path pins the H20 ridge at 221K tok/s with either $(g, s_q) \in \{(8, 2), (4, 1)\}$, a $3.4\times$ improvement over MLA on the same device. The path switch requires no retraining and no custom kernels: MQA-absorb reuses MLA's absorb kernel, GQA reuses standard GQA kernels, and the MTP head is a standard DeepSeek-V3 component.
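The operating points described for Table 2 can be regenerated from Eqs. (6) to (11) with a few lines of code. The sketch below is one possible way to do so; the hardware peaks in `HARDWARE` are assumed values (they are not given in this section) and the function names are hypothetical. Results agree with the quoted figures to within rounding.

```python
# Canonical configuration from the text (BF16, 2 bytes per element).
H_Q, D_H, D_H_ROPE, R_KV = 128, 128, 64, 512
L_CTX = 8192

# ASSUMPTION: (dense BF16 FLOP/s, HBM bytes/s) per device.
HARDWARE = {"H100": (989e12, 3.35e12), "H20": (148e12, 4.0e12)}


def path_cost(path, s_q, g=8, L=L_CTX):
    """Return (FLOPs, bytes) for one decoding step on the given path."""
    if path == "mqa_absorb":  # Eqs. (6)-(7)
        return (2 * L * H_Q * s_q * (2 * R_KV + D_H_ROPE),
                2 * L * (R_KV + D_H_ROPE))
    if path == "gqa":  # Eqs. (9)-(10)
        return (2 * L * H_Q * s_q * (2 * D_H + D_H_ROPE),
                2 * L * (2 * g * D_H + D_H_ROPE))
    raise ValueError(f"unknown path: {path}")


def operating_point(hw, path, s_q, g=8):
    """Roofline step time, throughput, and which roof binds."""
    peak_flops, peak_bw = HARDWARE[hw]
    flops, nbytes = path_cost(path, s_q, g)
    t_compute, t_memory = flops / peak_flops, nbytes / peak_bw
    step = max(t_compute, t_memory)
    return {
        "step_us": step * 1e6,
        "tok_per_s": s_q / step,
        "bound": "compute" if t_compute > t_memory else "memory",
    }


if __name__ == "__main__":
    rows = [("H100", "mqa_absorb", 1, 8), ("H100", "mqa_absorb", 2, 8),
            ("H20", "mqa_absorb", 1, 8), ("H20", "mqa_absorb", 2, 8),
            ("H20", "gqa", 2, 8), ("H20", "gqa", 1, 4)]
    for hw, path, s_q, g in rows:
        op = operating_point(hw, path, s_q, g)
        print(f"{hw:5s} {path:10s} g={g} s_q={s_q}: {op['step_us']:6.2f} us, "
              f"{op['tok_per_s'] / 1e3:6.1f}K tok/s, {op['bound']}-bound")
```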

#### 4.2.4 Choosing $(g, s_q)$

The choice between the two H20 ridge-optimal points trades cache size against expressivity, TP cap, and MTP training cost. We recommend $(g, s_q) = (8, 2)$ as the default: it gives the largest latent subspace ($g d_h = 1024 > r_{kv} = 512$, so the rank-$r_{kv}$ PCA compression has $2\times$ redundancy), an 8-way zero-redundancy TP cap, and the exact $h_q/g = 16$ Tensor-Core MMA tile required by sparse GQLA (§[3.3](https://arxiv.org/html/2605.15250#S3.SS3)). $(g, s_q) = (4, 1)$ is a lighter H20-only alternative: the GQA-path cache halves to 2176 bytes/token and no MTP head is needed, at the cost of a square $W^{UK} \in \mathbb{R}^{512 \times 512}$ (PCA redundancy $1\times$) and a 4-way TP cap. Crucially, $I_{\mathrm{MQA}}$ does not contain $g$, so both configurations remain deployable on H100 at the same $2.82\,\mu\text{s}$/step MQA-absorb operating point. A third option, combining $g = 4$'s small cache with $s_q = 2$ MTP on H20, would require pushing $r_{kv}$ down to $\leq 256$ and is left to future work.
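The trade-offs listed above reduce to a handful of integer ratios, which the following sketch tabulates for the two candidate configurations. It is illustrative only: the `tp_cap = g` rule is inferred from the 8-way and 4-way caps quoted in the text rather than taken from a stated formula, and `describe_config` is a hypothetical helper.

```python
# Canonical configuration from the text (BF16, 2 bytes per element).
H_Q, D_H, D_H_ROPE, R_KV = 128, 128, 64, 512


def describe_config(g, s_q):
    """Summarise the quantities that differ between (g, s_q) choices."""
    return {
        # Eq. (9): per-group K and V plus one shared RoPE key, in bytes.
        "gqa_cache_bytes_per_token": 2 * (2 * g * D_H + D_H_ROPE),
        # Ratio of expanded key width g * d_h to the latent rank r_kv.
        "pca_redundancy": g * D_H / R_KV,
        # ASSUMPTION: one KV group per tensor-parallel rank at most.
        "tp_cap": g,
        # Query heads sharing one KV group.
        "heads_per_group": H_Q // g,
        # An MTP head is only needed when more than one token is decoded per step.
        "needs_mtp_head": s_q > 1,
    }


if __name__ == "__main__":
    for g, s_q in [(8, 2), (4, 1)]:
        print((g, s_q), describe_config(g, s_q))
```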

## 5 Experiments

We evaluate TransGQLA on the open-source GQA checkpoint LLaMA-3-8B (Grattafiori et al., [2024](https://arxiv.org/html/2605.15250#bib.bib11)), with two questions in mind: (i) how much capability is lost when GQA weights are reorganised into the GQLA latent form without any further training; and (ii) how rapidly that loss can plausibly be recovered through continued pretraining.

##### Setup.

LLaMA-3-8B has $h_q = 32$ query heads and $g = 8$ KV groups with $d_h = 128$, giving an original GQA cache of $2 g d_h = 2048$ BF16 elements per token per layer. We apply the TransGQLA pipeline of Section [3.2](https://arxiv.org/html/2605.15250#S3.SS2): GQA-preserving head merging keeps $g = 8$ KV groups and retains both decoding paths, followed by TransMLA-style RoRoPE, FreqFold, and activation-balanced low-rank compression (Meng et al., [2026](https://arxiv.org/html/2605.15250#bib.bib4)) into a shared latent of 576 dimensions. This compresses the per-layer KV cache on the MQA-absorb path to $28.125\%$ of the GQA baseline ($-71.875\%$); the GQA-path cache is $2 g d_h + d_h^R$ elements/token, comparable to the original. Because the two paths are algebraically equivalent (Section [3.1](https://arxiv.org/html/2605.15250#S3.SS1)) they produce numerically identical outputs, so we report a single accuracy figure per row. Continued pretraining draws from a 60B-token open-domain corpus; hyperparameters are in Appendix [A](https://arxiv.org/html/2605.15250#A1).
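The equivalence invoked here rests on the identity $\mathbf{q}^\top (W \mathbf{c}) = (W^\top \mathbf{q})^\top \mathbf{c}$: a key up-projection can be applied to the cached latent (GQA path) or folded into the query (MQA-absorb path). The toy sketch below illustrates this for the non-RoPE part of the key scores only, with one up-projection per group. It uses plain Python lists with integer entries so that both paths agree exactly; the sizes, names, and layout are illustrative assumptions and do not reproduce the paper's Eq. (3).

```python
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def scores_gqa_path(queries, W_uk, latent, heads_per_group):
    """Expand the latent into one key per group, then score each head."""
    keys = [matvec(W, latent) for W in W_uk]  # per-group expanded cache
    return [dot(q, keys[h // heads_per_group]) for h, q in enumerate(queries)]


def scores_mqa_absorb_path(queries, W_uk, latent, heads_per_group):
    """Fold each group's up-projection into its queries, score on the latent."""
    absorbed = [matvec(transpose(W_uk[h // heads_per_group]), q)
                for h, q in enumerate(queries)]
    return [dot(q_tilde, latent) for q_tilde in absorbed]


def random_instance(seed, h_q=8, g=2, d_h=4, r_kv=3):
    """Toy sizes only; not the canonical configuration."""
    rng = random.Random(seed)
    vec = lambda n: [rng.randint(-3, 3) for _ in range(n)]
    W_uk = [[vec(r_kv) for _ in range(d_h)] for _ in range(g)]  # g x (d_h x r_kv)
    queries = [vec(d_h) for _ in range(h_q)]
    latent = vec(r_kv)
    return queries, W_uk, latent, h_q // g


if __name__ == "__main__":
    args = random_instance(seed=0)
    print(scores_gqa_path(*args) == scores_mqa_absorb_path(*args))
```

The GQA path caches `keys` (one vector per group), whereas the absorb path caches only `latent`, which is where the difference in cache size between the two paths comes from.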

##### Benchmarks.

We report zero-shot accuracy on six commonsense-reasoning benchmarks (MMLU, ARC (easy/challenge avg.), PIQA, HellaSwag, OpenBookQA, Winogrande) and their unweighted average.

##### Conversion preserves nearly all capability.

Table [3](https://arxiv.org/html/2605.15250#S5.T3) reports the results. The 0-token row measures the pure architectural transformation: at the aggressive $\sim\!3.6\times$ KV-cache compression (the $28.125\%$ MQA-absorb cache of the Setup), TransGQLA loses only $\sim\!9.7$ Avg. points relative to the 15T-token pretrained LLaMA-3-8B and remains within a few points of the source on PIQA and HellaSwag, confirming that the GQA-preserving merge of Section [3.2](https://arxiv.org/html/2605.15250#S3.SS2) transforms the model into a GQLA backbone with very little information loss.

##### Projected recovery from continued pretraining.

Because the GQA-preserving merge does not change the joint $K, V$ subspace that the latent-compression stages then act on, the TransGQLA and TransMLA conversions coincide at 0 tokens (this is reflected in the identical Avg. scores in Table [3](https://arxiv.org/html/2605.15250#S5.T3)). We therefore expect TransGQLA to follow TransMLA's continued-pretraining trajectory: TransMLA recovers to within $0.5$ Avg. points of the original LLaMA-3-8B after 30B tokens (a $\sim\!500\times$ reduction relative to the 15T-token pretraining budget) while retaining the same $-71.875\%$ KV-cache compression. The corresponding TransGQLA continued-pretraining run is in progress and will be added in the camera-ready (see Limitations).

Table 3: Commonsense-reasoning accuracy after the architectural conversion (0-token row). KV Mem. is the per-token cache change relative to the GQA baseline on the MQA-absorb path. TransGQLA and TransMLA agree at 0 tokens because the GQA-preserving merge does not change the joint $K, V$ subspace being compressed; we project that the corresponding TransGQLA continued-pretraining run will follow TransMLA's recovery trajectory (TransMLA reaches $63.39$ Avg. after 30B tokens, within $0.5$ pts of the LLaMA-3-8B baseline).
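For reference, the KV Mem. figure and the token-budget ratio quoted in this section follow from simple arithmetic on numbers already given in the text. The short sketch below spells this out; the helper names are hypothetical.

```python
def kv_mem_change(latent_elems, g, d_h):
    """Relative per-token cache change versus a GQA baseline of 2 * g * d_h elements."""
    baseline = 2 * g * d_h
    return latent_elems / baseline - 1.0


def budget_reduction(pretrain_tokens, continued_tokens):
    """How many times smaller the continued-pretraining budget is."""
    return pretrain_tokens // continued_tokens


# LLaMA-3-8B: g = 8 KV groups, d_h = 128; shared latent of 576 elements.
change = kv_mem_change(576, g=8, d_h=128)
# 15T-token pretraining budget versus 30B tokens of continued pretraining.
ratio = budget_reduction(15 * 10**12, 30 * 10**9)

if __name__ == "__main__":
    print(f"KV Mem. change on the MQA-absorb path: {change:+.3%}")
    print(f"Token-budget reduction: {ratio}x")
```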

## 6 Conclusion

We identified three coupled hardware drawbacks of MLA's MQA-absorb-only design (hardware coupling to H100-class ratios, loss of head-axis tensor parallelism, and zero MTP gain on commodity inference GPUs) and proposed Group-Query Latent Attention as a minimal architectural fix. By indexing the up-projections by group rather than by query head, GQLA's trained weights admit two algebraically equivalent decoding paths over the same parameters: a compact-latent MQA-absorb path identical to MLA's, and a per-group expanded GQA path. The deployment runtime picks the path that matches the target hardware, with no retraining and no custom kernels, so a single recommended configuration $(h_q, g) = (128, 8)$ with one MTP head pins the rooflines of both H100 and H20 simultaneously, while supporting up to 8-way zero-redundancy tensor parallelism on the GQA path.

To make the design accessible without pretraining from scratch, TransGQLA converts a pretrained GQA checkpoint into a GQLA model via a one-line change to the TransMLA head-merging step. Because the GQA-preserving merge does not change the joint $K, V$ subspace that the latent compression then acts on, we project the TransGQLA continued-pretraining trajectory to follow that of TransMLA. The same dual-path design extends to fine-grained sparse attention, where GQLA's $h_q/g = 16$ ratio matches the Tensor-Core MMA tile and sparse GQLA preserves the GQA path on H20-class hardware, something sparse MLA cannot do.

More broadly, GQLA suggests that exposing multiple algebraically equivalent decoding paths over a single set of trained weights is a practical design principle for hardware-adaptive attention. We hope it provides a useful, hardware-agnostic alternative to MLA for groups that target both flagship and commodity inference accelerators.

## Limitations

##### Hardware claims are based on roofline modelling.

Our central claim that GQLA stays close to the achievable peak on both H100- and H20-class GPUs rests on the Roofline analysis of Section [4](https://arxiv.org/html/2605.15250#S4), using published peak BF16 FLOPs and HBM bandwidth figures (dense, no 2:4 sparsity). The Roofline model is a first-order tool and does not capture cache hierarchy, kernel-launch overheads, multi-stream scheduling, or Hopper-specific TMA/WGMMA effects. A full kernel-level benchmark of MQA-absorb and GQA-path decoding on actual H20 and H100 hardware is left for future work.

##### TransGQLA continued-pretraining results pending.

The TransGQLA continued-pretraining run is currently in progress; the camera-ready will report the corresponding accuracy trajectory in place of the projection from TransMLA used in Table [3](https://arxiv.org/html/2605.15250#S5.T3). Because the GQA-preserving merge does not change the joint $K, V$ subspace that the latent compression acts on, we do not expect qualitative changes, but the exact numbers may shift slightly.

##### Single model, single domain.

We validate TransGQLA on a single backbone (LLaMA-3-8B) and on commonsense-reasoning benchmarks. A broader study across model sizes (1B–70B) and downstream task families (long-context retrieval, code, math, instruction following) would strengthen the empirical case. The conversion recipe itself is architecture-agnostic and we expect it to transfer.

##### Larger per-head dimensions and aggressive latent compression.

We adopt the DeepSeek-V2/V3 head shape $d_h = d_h^V = 128$, $d_h^R = 64$ for a fair comparison with MLA. Two extensions are orthogonal to the dual-path design and left to future work: (i) widening $d_h^V$ to 256 following GLM-5 (GLM Team, Zhipu AI, [2025](https://arxiv.org/html/2605.15250#bib.bib17)); and (ii) pushing $r_{kv}$ down to $\leq 256$ to combine $g = 4$'s small GQA-path cache with $s_q = 2$ MTP on H20. In both cases the new values can be substituted directly into Eqs. ([6](https://arxiv.org/html/2605.15250#S4.E6))–([11](https://arxiv.org/html/2605.15250#S4.E11)).

##### Sparse GQLA is described but not benchmarked.

Section [3.3](https://arxiv.org/html/2605.15250#S3.SS3) presents sparse GQLA as a structural argument: $h_q/g = 16$ matches the Tensor-Core MMA tile, so sparse GQLA preserves the GQA path on H20-class hardware. End-to-end sparse-decoding benchmarks against DSA (Liu et al., [2025](https://arxiv.org/html/2605.15250#bib.bib5)) are not yet included and constitute an obvious next experiment.

##### Path-selection policy.

We assume a static, hardware-driven policy decided once at deployment: runtime path switching would require re-expanding or re-compressing the entire KV cache. For mixed-batch or online-batching workloads where attention's effective arithmetic intensity varies, more refined serving policies may be needed, e.g. deploying the same weights on different replicas under different paths and routing requests accordingly.
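A static policy of this kind could be as simple as comparing the two roofline step times once, at deployment. The sketch below is a hypothetical illustration rather than the authors' serving code: `Hardware`, `step_time`, and `choose_path` are invented names, the hardware peaks are assumed values not stated in this section, and the cost formulas are Eqs. (6) to (10) at the canonical configuration.

```python
from dataclasses import dataclass

# Canonical configuration from the text (BF16, 2 bytes per element).
H_Q, D_H, D_H_ROPE, R_KV = 128, 128, 64, 512


@dataclass(frozen=True)
class Hardware:
    name: str
    peak_flops: float  # dense BF16 FLOP/s
    peak_bw: float     # HBM bytes/s


def step_time(hw, path, s_q, g, L):
    """Roofline per-step time for one decoding path."""
    if path == "mqa_absorb":
        flops = 2 * L * H_Q * s_q * (2 * R_KV + D_H_ROPE)
        nbytes = 2 * L * (R_KV + D_H_ROPE)
    elif path == "gqa":
        flops = 2 * L * H_Q * s_q * (2 * D_H + D_H_ROPE)
        nbytes = 2 * L * (2 * g * D_H + D_H_ROPE)
    else:
        raise ValueError(f"unknown path: {path}")
    return max(flops / hw.peak_flops, nbytes / hw.peak_bw)


def choose_path(hw, s_q, g=8, L=8192):
    """Pick, once at deployment, the path with the smaller modelled step time."""
    return min(("mqa_absorb", "gqa"), key=lambda p: step_time(hw, p, s_q, g, L))


# ASSUMPTION: peak figures for the two devices discussed in the text.
H100 = Hardware("H100", 989e12, 3.35e12)
H20 = Hardware("H20", 148e12, 4.0e12)

if __name__ == "__main__":
    plan = {hw.name: choose_path(hw, s_q) for hw, s_q in [(H100, 1), (H20, 2)]}
    print(plan)
```

Because the choice is made before any cache is written, it avoids the re-expansion or re-compression cost noted above; a replica-level router would call `choose_path` once per replica, not once per request.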

## Ethics Statement

This work focuses on the systems-level efficiency of LLM inference and does not introduce any new training data, modelling capability, or downstream application beyond what is already provided by the underlying open-source LLaMA-3-8B checkpoint (Grattafiori et al., [2024](https://arxiv.org/html/2605.15250#bib.bib11)). The proposed GQLA architecture and the TransGQLA conversion procedure are intended to make existing models cheaper and easier to deploy on commodity hardware; they neither change the model's capabilities in a way that introduces new risks nor remove any existing safety mitigations. All evaluation benchmarks (MMLU, ARC, PIQA, HellaSwag, OpenBookQA, Winogrande) are publicly released for academic use and are used here in their standard zero-shot configuration.

## References

- J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai (2023). GQA: training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of EMNLP.
- T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré (2022). Flashattention: fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems 35, pp. 16344–16359.
- DeepSeek-AI (2025). DeepSeek-V3 / R1 inference system overview. Note: [https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md](https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md)
- A. Gholami, Z. Yao, S. Kim, C. Hooper, M. W. Mahoney, and K. Keutzer (2024). Ai and memory wall. IEEE Micro 44(3), pp. 33–39.
- GLM Team, Zhipu AI (2025). GLM-5 model card. Note: [https://github.com/THUDM/GLM-5](https://github.com/THUDM/GLM-5)
- F. Gloeckle, B. Y. Idrissi, B. Roziere, D. Lopez-Paz, and G. Synnaeve (2024). Better & faster large language models via multi-token prediction. In International Conference on Machine Learning, pp. 15706–15734.
- A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024). The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- T. Ji, B. Guo, Y. Wu, Q. Guo, L. Shen, Z. Chen, X. Qiu, Q. Zhang, and T. Gui (2025). Towards economical inference: enabling deepseek's multi-head latent attention in any transformer-based llms. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 33313–33328.
- A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, C. Dengr, C. Ruan, D. Dai, D. Guo, et al. (2024a). Deepseek-v2: a strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434.
- A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, et al. (2024b). Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437.
- A. Liu, A. Mei, B. Lin, B. Xue, B. Wang, B. Xu, B. Wu, B. Zhang, C. Lin, C. Dong, et al. (2025). Deepseek-v3.2: pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556.
- F. Meng, P. Tang, Z. Yao, X. Sun, and M. Zhang (2026). TransMLA: migrating gqa models to mla with full deepseek compatibility and speedup. Advances in Neural Information Processing Systems 38, pp. 81977–82019.
- R. Pope, S. Douglas, A. Chowdhery, J. Devlin, J. Bradbury, J. Heek, K. Xiao, S. Agrawal, and J. Dean (2023). Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems 5, pp. 606–624.
- N. Shazeer (2019). Fast transformer decoding: one write-head is all you need. arXiv preprint arXiv:1911.02150.
- K. Team, T. Bai, Y. Bai, Y. Bao, S. Cai, Y. Cao, Y. Charles, H. Che, C. Chen, G. Chen, et al. (2026). Kimi k2.5: visual agentic intelligence. arXiv preprint arXiv:2602.02276.
- S. Williams, A. Waterman, and D. Patterson (2009). Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM 52(4), pp. 65–76.
- Y. Xu, F. Meng, F. Jiang, Y. Wang, R. Zhou, J. Wu, Z. Pan, Z. Wang, X. Tang, W. Pei, et al. (2026). HISA: efficient hierarchical indexing for fine-grained sparse attention. arXiv preprint arXiv:2603.28458.
- T. Zadouri, H. Strauss, and T. Dao (2025). Hardware-efficient attention for fast decoding. In Second Conference on Language Modeling.

## Appendix A Continued-Pretraining Hyperparameters

We list the optimiser and data settings used for the TransGQLA continued-pretraining runs on LLaMA-3-8B. These values are inherited from the corresponding TransMLA recipe.

Table 4: Continued-pretraining hyperparameters used in the experiments of Section [5](https://arxiv.org/html/2605.15250#S5).

## Appendix B Notation

The following table summarises the notation used throughout the paper; the "typical value" column adopts the canonical configuration $(h_q, g, d_h, d_h^R, r_{kv}) = (128, 8, 128, 64, 512)$.

Table 5: Notation summary. Data are BF16 (2 bytes per element); FLOPs use the dense "2 FLOPs per multiply-add" count. Roofline "compute" uses the dense peak BF16 TFLOPs (no 2:4 sparsity, see Section [4.1](https://arxiv.org/html/2605.15250#S4.SS1)).
