Do Value Vectors in Deep Layers Need Context from the Residual Stream?
Summary
This paper investigates whether deep layer value vectors in transformer attention need context from the residual stream. It proposes Bank of Values (BoV), which uses context-free token-specific value vectors in the last third of layers, improving validation loss and benchmark scores over standard attention.
# Do Value Vectors in Deep Layers Need Context from the Residual Stream?
Source: [https://arxiv.org/html/2606.02780](https://arxiv.org/html/2606.02780)
Muyu He, Yuchen Liu, Qingya Huang, Li Zhang
###### Abstract
The success of the transformer architecture as the backbone of modern LLMs is in large part due to its use of attention layers. An attention layer follows the standard neural network paradigm: it takes the residual stream as input and thereby produces context-dependent query, key, and value vectors. However, we find that model performance meaningfully improves when deeper layers learn only a context-free value vector to preserve the original token information, without drawing on any context from the residual stream. When the model has access to this context-free value vector, adding back the context-dependent component provides little additional benefit for aggregate benchmark performance. Such context-free value vectors can be stored as sparse model parameters, eliminating the need to recompute or persistently cache these values. Through systematic ablations on the key design choices for such context-free value vectors, we propose Bank of Values (BoV), a new way of computing value vectors in attention by learning a lookup table of token-specific value vectors for each of the last third of layers. Across 135M and 780M models, BoV improves validation loss over standard attention and, at 780M, the average score across 21 benchmarks, matching the previous best method that adds token information to the value vector while using less compute and memory. (Footnote 1: Full architecture and training code is available at [https://github.com/RiddleHe/nanochat](https://github.com/RiddleHe/nanochat).)
Equal contribution: Muyu He, Yuchen Liu. Corresponding author: Muyu He (muyuhe0327@gmail.com).
## 1 Introduction

Figure 1: Overview of Bank of Values. (a) Standard Attention: a context-dependent value vector is computed directly from the residual stream. (b) SVFormer: a context-free value vector is copied from the first layer’s value vector into all subsequent layers. (c) Bank of Values: in each of the last third of layers, a context-free value vector is looked up by token id from that layer’s value vector table $\mathbf{E}_v$ and scaled by a learnable coefficient.

Modern LLMs invariably adopt the transformer architecture, whose basic formulation necessarily includes the attention layer (Vaswani et al., [2017](https://arxiv.org/html/2606.02780#bib.bib45)). Attention computes dot products between query and key vectors to obtain attention scores, then writes a weighted sum of value vectors back to the residual stream according to these scores. Despite its sophisticated design, the transformer retains an old paradigm first proposed in ResNet (He et al., [2016](https://arxiv.org/html/2606.02780#bib.bib15)): the input to each layer of a deep network is the normalized sum of all previous layers’ outputs, which we refer to as the residual stream. This design proves effective for computing queries and keys, as it allows deeper layers to accumulate context information propagated from other tokens through attention (Ghandeharioun et al., [2024](https://arxiv.org/html/2606.02780#bib.bib14)). However, whether context information is necessary for value vectors is left largely unexamined, even though value vectors do not themselves participate in cross-token computation. As a result, the residual stream remains the default input for computing value vectors, diluting the original context-free token information available to them.
Recent work has begun to explore whether the value vector should carry more of the original token information and less of the accumulated context. Value Residual Learning (Zhou et al., [2024](https://arxiv.org/html/2606.02780#bib.bib51)) linearly interpolates the value vectors in deeper layers with the value vector from the first attention layer. Since the value vector in the first layer is a linear transformation of the token embedding, it deposits original token information into the value vectors in deeper layers. Similar work modulates value vectors in deeper layers with a token-specific value vector that is either shared across layers (Li, [2026](https://arxiv.org/html/2606.02780#bib.bib25)) or learned separately per layer (Karpathy, [2025](https://arxiv.org/html/2606.02780#bib.bib19)). All of these methods claim improvements over the standard attention mechanism, implying that a larger share of original token information in the value vector benefits deeper layers. At the same time, studies have explored using only a context-free value vector for attention, directly reusing the first layer’s value vector (SVFormer, Zhou et al., [2024](https://arxiv.org/html/2606.02780#bib.bib51)), or components of it (SkipV1Former, Wu et al., [2025](https://arxiv.org/html/2606.02780#bib.bib47)), in all subsequent layers. These methods report degraded performance relative to the standard attention baseline. Together, the results suggest that context information from the residual stream is essential to value vectors in deeper layers, and that adding a context-free component that preserves the original token information is beneficial but not sufficient.

We find that all the aforementioned studies share the same oversight: when targeting value vectors in deeper layers, the effect of computing a single context-free value vector is never studied in isolation. In the rare cases where it is, the context-free value vector is nevertheless applied to all layers, which fails to account for different layers’ varying dependence on context. Systematically ablating how value vectors are computed along four axes (Section [3](https://arxiv.org/html/2606.02780#S3)), we find that almost all of the reported gains in previous work come from the context-free component that represents the original token, not from the context-dependent component. In these deeper layers, the value can be computed without the residual stream at little to no cost to performance.
The insight that deeper layers largely benefit from a context-free value vector has an important consequence: since these value vectors do not require context information, they need not be computed from a particular input sequence, nor cached for reuse in subsequent decode steps. Instead, context-free value vectors can persist as FLOP-free, sparse model parameters that are looked up in $O(1)$ time during a forward pass. We therefore propose Bank of Values (BoV), which replaces the standard value vector computation from the residual stream with a static lookup in a dedicated value vector table in each of the last third of layers. As a result, these layers drop the value matrix from their parameters and no longer cache value vectors, storing only the token indices needed to look the values up. Although value lookup tables have been explored in the works above, they are always added on top of the value computed from the residual stream. Consequently, those models still compute and cache that value, achieving lower quality than BoV at higher memory and compute cost (Section [5.2](https://arxiv.org/html/2606.02780#S5.SS2)).

Empirically, we train a transformer model with BoV at two model sizes, 135M and 780M, for $1.50\times 10^{18}$ and $3.91\times 10^{19}$ FLOPs, respectively. In a FLOP-controlled setting, BoV lowers validation loss at both scales on a held-out split of roughly 41.9M tokens from the ClimbMix dataset (Diao et al., [2025](https://arxiv.org/html/2606.02780#bib.bib12)) and, at 780M, raises the average score across the 21 DCLM CORE benchmarks (Li et al., [2024](https://arxiv.org/html/2606.02780#bib.bib24)), relative to an otherwise identical model with standard attention. Moreover, compared with existing methods that add original token information to the value vector, BoV matches the top-performing variant and surpasses the rest, all while using less memory and fewer FLOPs per token.
## 2 Preliminaries

We establish notation for a single forward pass of a transformer model, omitting the batch dimension for clarity. A model embeds a length-$T$ token sequence into $\mathbf{e}\in\mathbb{R}^{T\times d}$, where $d$ is the hidden dimension. Its normalized form $\mathbf{x}_0=\mathrm{RMSNorm}(\mathbf{e})$, which we call the initial token embedding, is the input to the first transformer block and carries no context information on its own.

Transformer blocks are joined by residual connections (He et al., [2016](https://arxiv.org/html/2606.02780#bib.bib15)). The input $\mathbf{x}_i$ to the $i$-th block’s attention layer, which we call the residual stream, is the RMSNorm of the sum of the initial token embedding $\mathbf{x}_0$ and the outputs of all preceding layers. Unlike the context-free $\mathbf{x}_0$, the residual stream $\mathbf{x}_i$ accumulates information from other tokens through the attention layers below it, and is therefore context-dependent.
The attention layer projects query, key, and value vectors from $\mathbf{x}_i$,

$$\mathbf{Q}=\mathbf{x}_i\mathbf{W}_Q,\quad \mathbf{K}=\mathbf{x}_i\mathbf{W}_K,\quad \mathbf{V}=\mathbf{x}_i\mathbf{W}_V, \tag{1}$$

where $\mathbf{W}_Q,\mathbf{W}_K,\mathbf{W}_V\in\mathbb{R}^{d\times d}$.

Each of $\mathbf{Q},\mathbf{K},\mathbf{V}$ is split into $H$ heads of dimension $d_h=d/H$, and each head $h$ computes scaled dot-product attention,

$$\mathbf{O}^{(h)}=\mathrm{softmax}\!\left(\frac{\mathbf{Q}^{(h)}{\mathbf{K}^{(h)}}^{\!\top}}{\sqrt{d_h}}\right)\mathbf{V}^{(h)}. \tag{2}$$

The per-head outputs are concatenated and projected by $\mathbf{W}_O\in\mathbb{R}^{d\times d}$ into the attention output $\mathbf{a}_i=\mathrm{concat}_{h=1}^{H}(\mathbf{O}^{(h)})\,\mathbf{W}_O$, which is added back to the residual stream, updating each token’s representation with information from other tokens.
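As a concrete reading of Eqs. 1 and 2, the following is a minimal NumPy sketch of one multi-head attention layer. NumPy stands in for PyTorch here, the function names are ours, and the causal mask is an assumption (it is standard for decoder-only LLMs but is not written out in Eq. 2).

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads, causal=True):
    """x: (T, d) residual-stream input; every weight matrix is (d, d)."""
    T, d = x.shape
    d_h = d // n_heads

    def split(m):
        # (T, d) -> (H, T, d_h): one slice of the feature axis per head
        return m.reshape(T, n_heads, d_h).transpose(1, 0, 2)

    Q, K, V = split(x @ W_q), split(x @ W_k), split(x @ W_v)  # Eq. 1
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_h)          # (H, T, T)
    if causal:
        # Assumption: decoder-style mask so position t sees only positions <= t
        future = np.triu(np.ones((T, T), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    O = softmax(scores) @ V                                   # Eq. 2, per head
    # Concatenate heads back to (T, d), then apply the output projection
    return O.transpose(1, 0, 2).reshape(T, d) @ W_o

rng = np.random.default_rng(0)
T, d, H = 5, 8, 2
x = rng.standard_normal((T, d))
W_q, W_k, W_v, W_o = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
a = multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads=H)
```

With the causal mask, the first position can attend only to itself, so its output reduces to its own value vector passed through $\mathbf{W}_O$. This makes explicit that values enter the output only through the weighted sum and never through the score computation.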
Figure 2: Value of the unbounded, learnable coefficient for each component of the value vector at each target layer of the 12-layer model, for the substitutive (a) and additive (b, c) variants.
## 3 Motivation

Table 1: Validation loss (BPB) on the 12-layer, 135M model across variants. $\gamma_v$ indicates an unbounded, learnable coefficient.

In this work, we comprehensively study to what extent value vectors in attention layers need context information from the residual stream, and to what extent they instead benefit from the original token information. We provide the model with two context-free sources of original token information, which we collectively denote $\tilde{\mathbf{V}}$: a value vector $\tilde{\mathbf{V}}=\mathbf{x}_0\mathbf{W}_V$ computed from the initial token embedding $\mathbf{x}_0$ with the target layer’s value matrix $\mathbf{W}_V$, or the value vector $\tilde{\mathbf{V}}=\mathbf{V}_1$ from the first attention layer. In each variant, $\tilde{\mathbf{V}}$ either replaces or is linearly combined with the standard value vector $\mathbf{V}=\mathbf{x}_i\mathbf{W}_V$ computed from the residual stream.

Our exploration spans four axes. (1) Additive or substitutive. We study whether the model benefits more from using $\tilde{\mathbf{V}}$ directly, which substitutes $\mathbf{V}$ entirely and supplies only context-free, original token information, or from a weighted combination of $\tilde{\mathbf{V}}$ and $\mathbf{V}$, which mixes context-free and context-dependent information. Unless otherwise noted, all variants modify only the last third of attention layers. (2) Learnable or fixed coefficient. We study whether each component of the value vector is best scaled by an unbounded, learnable coefficient or by a fixed one. In the substitutive variants, the fixed coefficient for $\tilde{\mathbf{V}}$ is $1$; in the additive variants, the fixed coefficient for both $\tilde{\mathbf{V}}$ and $\mathbf{V}$ is $0.5$. The magnitudes of the learned coefficients reveal how strongly the model prefers $\tilde{\mathbf{V}}$ over $\mathbf{V}$. (3) Source of original token information. We study using $\mathbf{x}_0\mathbf{W}_V$, which is unique to each target layer, or $\mathbf{V}_1$, which is shared across all target layers, as $\tilde{\mathbf{V}}$. (4) Target layers. We study the effect of targeting the last third of layers, every other layer, or every layer.
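The four axes above can be read as a small configuration space. The sketch below, in NumPy, is one way to enumerate it. All function and option names are hypothetical, the coefficients are plain floats standing in for learnable parameters, and the parity chosen for "every other layer" is an assumption since the text does not specify it.

```python
import numpy as np

def context_free_value(x0, W_v_layer, V1, source):
    """Axis 3: where the context-free value comes from."""
    if source == "x0_Wv":      # layer-specific projection of the initial embedding
        return x0 @ W_v_layer
    if source == "V1":         # first attention layer's value, shared by all targets
        return V1
    raise ValueError(f"unknown source: {source}")

def deep_layer_value(V, V_tilde, mode, coeffs=None):
    """Axes 1 and 2: combine context-dependent V with context-free V_tilde.

    coeffs=None selects the fixed coefficients from the text
    (1 for substitutive, 0.5 and 0.5 for additive); otherwise pass the
    current values of the learnable scalars.
    """
    if mode == "substitutive":
        gamma = 1.0 if coeffs is None else coeffs
        return gamma * V_tilde
    if mode == "additive":
        g_ctx, g_free = (0.5, 0.5) if coeffs is None else coeffs
        return g_ctx * V + g_free * V_tilde
    raise ValueError(f"unknown mode: {mode}")

def target_layers(n_layers, scheme):
    """Axis 4: 0-based indices of the layers whose value vectors are modified."""
    if scheme == "last_third":
        return list(range(n_layers - n_layers // 3, n_layers))
    if scheme == "every_other":
        return list(range(1, n_layers, 2))  # assumption: parity is not stated
    if scheme == "every":
        return list(range(n_layers))
    raise ValueError(f"unknown scheme: {scheme}")

rng = np.random.default_rng(0)
T, d = 4, 8
x0, x_i = rng.standard_normal((T, d)), rng.standard_normal((T, d))
W_v = rng.standard_normal((d, d))
V = x_i @ W_v                                    # standard, context-dependent value
V_tilde = context_free_value(x0, W_v, V1=None, source="x0_Wv")
v_sub = deep_layer_value(V, V_tilde, "substitutive", coeffs=2.5)
v_add = deep_layer_value(V, V_tilde, "additive")
```

For the 12-layer model, `target_layers(12, "last_third")` selects the last four layers, which is the default setting for the variants in this section.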
We conduct all experiments on a 12-layer, 135M transformer model in the style of GPT-2 (Radford et al., [2019](https://arxiv.org/html/2606.02780#bib.bib32)), using a simplified implementation of nanochat (Karpathy, [2025](https://arxiv.org/html/2606.02780#bib.bib19)). We train every variant under a compute-controlled budget of $1.50\times 10^{18}$ FLOPs, which yields 1.97B tokens for the standard attention baseline. (Footnote 2: This budget is cost-effective: its validation loss, measured in bits per byte (BPB), is only 0.06 BPB higher than that of training on 40B tokens.) All runs use a fixed global batch size of 524,288 tokens and a sequence length of 2048. We use the Muon optimizer for all matrix parameters with a learning rate of $0.02$, and the AdamW optimizer for the remaining parameters, following nanochat’s standard practices. A complete training configuration is given in Appendix [B](https://arxiv.org/html/2606.02780#A2).
From Table [1](https://arxiv.org/html/2606.02780#S3.T1) and Figure [2](https://arxiv.org/html/2606.02780#S2.F2), we make the following observations about the relative contributions of $\mathbf{V}$ and $\tilde{\mathbf{V}}$ to the value vector.

(i) Deep layers prefer a larger share of $\tilde{\mathbf{V}}$ over $\mathbf{V}$. Figure [2](https://arxiv.org/html/2606.02780#S2.F2)(b, c) shows that, in the two additive variants, the learned coefficients for $\tilde{\mathbf{V}}$ are much larger in absolute value than those for $\mathbf{V}$. Several layers drive the coefficient of $\mathbf{V}$ toward $0$, giving it minimal influence, while the final layer pushes the magnitude of $\tilde{\mathbf{V}}$’s coefficient past $10$. When the value vector is computed from a single source, Figure [2](https://arxiv.org/html/2606.02780#S2.F2)(a) shows that the coefficients for $\tilde{\mathbf{V}}$ stay consistently above $2$ and rise past $10$ in the final layer, while the coefficient for $\mathbf{V}$ is driven toward $0$ or below. Fixing the coefficients for both components in the additive variants at $0.5$ (Table [1](https://arxiv.org/html/2606.02780#S3.T1)) increases the validation loss, suggesting that the model benefits from an uneven weighting in favor of $\tilde{\mathbf{V}}$.

(ii) Only deep layers, not shallow ones, benefit from substituting $\mathbf{V}$ with $\tilde{\mathbf{V}}$. With a fixed coefficient of $1$, substituting $\mathbf{V}$ with $\mathbf{V}_1$ in every layer leads to a higher loss than the standard attention baseline, whereas restricting the substitution to the last third of layers achieves significantly lower loss (Table [1](https://arxiv.org/html/2606.02780#S3.T1)). With a learnable coefficient, substituting $\mathbf{V}$ with $\mathbf{V}_1$ or $\mathbf{x}_0\mathbf{W}_V$ in every other layer still lags behind substituting in the last third of layers (Table [1](https://arxiv.org/html/2606.02780#S3.T1)). Substituting $\mathbf{V}$ with $\tilde{\mathbf{V}}$ therefore provides little benefit in the shallow layers but a meaningful improvement in the deep ones.

(iii) Once deep layers have access to $\tilde{\mathbf{V}}$, adding $\mathbf{V}$ yields little improvement. With learnable coefficients, computing the value vector directly as $\mathbf{V}_1$ achieves a slightly lower loss than $\mathbf{V}+\mathbf{V}_1$, while computing it directly as $\mathbf{x}_0\mathbf{W}_V$ matches the loss of $\mathbf{V}+\mathbf{x}_0\mathbf{W}_V$ (Table [1](https://arxiv.org/html/2606.02780#S3.T1)). This corroborates observation (i): the contribution of $\mathbf{V}$ is minimized at deep layers in the additive variants.

(iv) Deep layers benefit from a layer-specific value vector rather than a shared one. With both a fixed and a learnable coefficient, using a layer-specific $\mathbf{x}_0\mathbf{W}_V$ as $\tilde{\mathbf{V}}$ consistently outperforms using a shared $\mathbf{V}_1$ (Table [1](https://arxiv.org/html/2606.02780#S3.T1)). This is consistent with $\tilde{\mathbf{V}}=\mathbf{x}_0\mathbf{W}_V$ being strictly more expressive than $\tilde{\mathbf{V}}=\mathbf{V}_1$: although both are computed from the same input $\mathbf{x}_0$, multi-head attention lets each layer’s value matrix $\mathbf{W}_V$ project $\mathbf{x}_0$ into a different low-rank subspace, which a single shared projection cannot reproduce.

Overall, when restricted to the last third of layers, computing the value vector directly from a layer-specific $\mathbf{x}_0\mathbf{W}_V$ provides the largest gain, and the role of $\mathbf{V}$ is marginal. This shows that value vectors in deep layers benefit primarily from context-free, original token information, an insight we build on in our architecture design in Section [4](https://arxiv.org/html/2606.02780#S4).
## 4 Bank of Values: Persisting Context-Free Value Vectors as Model Parameters

Figure 3: PyTorch-style pseudocode for Bank of Values. `bov_value()` performs an $O(Td)$ lookup of all past value vectors, scaled by a coefficient `gamma_v`. `forward()` stores past token indices in `kv_cache.idx` and uses them to copy value vectors into `v_hist` for dot-product attention.

From Section [3](https://arxiv.org/html/2606.02780#S3), value vectors in deep attention layers are most effective when computed directly from $\mathbf{x}_0\mathbf{W}_V$. Importantly, since both $\mathbf{x}_0$ and $\mathbf{W}_V$ depend only on model parameters, the value vector that a token with id $i$ produces,

$$\mathbf{v}_i=\mathrm{RMSNorm}(\mathbf{e}_i)\,\mathbf{W}_V, \tag{3}$$

is deterministic and independent of the preceding sequence, where $\mathbf{e}_i$ is the embedding of the token with id $i$. We can therefore learn each $\mathbf{v}_i$ directly as a model parameter, stored at each target layer in a value table

$$\mathbf{E}_v\in\mathbb{R}^{|\mathcal{V}|\times d},\qquad \mathbf{E}_v[i]=\mathbf{v}_i, \tag{4}$$

where $|\mathcal{V}|$ is the vocabulary size and $d$ the hidden dimension. During a forward pass, the value vector of each token in the last third of layers is looked up directly from $\mathbf{E}_v$. Scaling each looked-up value by an unbounded, learnable per-layer coefficient $\gamma_v\in\mathbb{R}$, we obtain an attention variant that replaces the dense value matrix $\mathbf{W}_V$ with a sparse lookup table $\mathbf{E}_v$, computing the value vector at position $p$ as

$$\mathbf{v}_p=\gamma_v\,\mathbf{E}_v[i_p], \tag{5}$$

where $i_p$ is the id of the token at position $p$. We call this attention variant Bank of Values (BoV). The architecture of BoV is shown in Figure [1](https://arxiv.org/html/2606.02780#S1.F1), alongside standard attention and SVFormer (Zhou et al., [2024](https://arxiv.org/html/2606.02780#bib.bib51)).
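To make Eqs. 3 to 5 concrete, here is a minimal NumPy sketch that builds a value table from an embedding table and a value matrix, then checks that a lookup by token id gives the same result as projecting the initial token embedding. The sizes are toy values, and the parameter-free RMSNorm (no learnable gain, `eps=1e-6`) is an assumption. Only `bov_value` and `gamma_v` are names taken from the Figure 3 caption.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Assumption: parameter-free RMSNorm over the feature axis
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

rng = np.random.default_rng(0)
vocab_size, d = 50, 8                     # toy sizes chosen for illustration
E = rng.standard_normal((vocab_size, d))  # token embedding table, one row per id
W_v = rng.standard_normal((d, d))         # value matrix of one target layer

# Eqs. 3-4: one context-free value row per token id
E_v = rms_norm(E) @ W_v                   # shape (vocab_size, d)
gamma_v = 1.0                             # a learnable scalar during training

def bov_value(token_ids, E_v, gamma_v):
    # Eq. 5: gather rows by token id and scale; no matrix multiply involved
    return gamma_v * E_v[token_ids]

token_ids = np.array([3, 17, 3, 42])
v_lookup = bov_value(token_ids, E_v, gamma_v)
v_projected = rms_norm(E[token_ids]) @ W_v  # x_0 W_V for the same tokens
```

The two paths agree because nothing in Eq. 3 depends on the surrounding sequence, which is why repeated tokens (positions 0 and 2 above) receive identical values. During training the rows of `E_v` are free parameters, so the product `rms_norm(E) @ W_v` serves here only as one possible starting point for the table.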
#### Inference\.
In standard attention, the model caches the value vectors of all previous tokens during decode to avoid recomputing them, reducing the per-step cost of producing past values from $O(Td^2)$ to $O(d^2)$. In BoV, by contrast, the value vector of each token is looked up directly from $\mathbf{E}_v$, so no projection or recomputation is ever required, and the value vectors of all $T$ preceding tokens are retrieved by a single `gather` in $O(Td)$ time. When a target layer computes attention, the model gathers the value vectors of all preceding tokens from $\mathbf{E}_v$ and discards them as soon as the layer finishes. Across the target layers, therefore, at most one layer materializes the full set of value vectors at any moment, in contrast to standard attention, which must keep a value cache for every layer at all times.
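The decode-time behavior described above can be sketched as follows. This is a single-head NumPy illustration under our own simplifications, not the paper's implementation: `BoVDecodeCache` and `decode_step` are hypothetical names, while `idx`, `v_hist`, `bov_value`, and `gamma_v` follow the Figure 3 caption.

```python
import numpy as np

class BoVDecodeCache:
    """Hypothetical per-layer decode state: cached keys plus token ids, no values."""

    def __init__(self):
        self.keys = []  # one (d,) key vector per past token
        self.idx = []   # past token ids, standing in for a value cache

def bov_value(idx, E_v, gamma_v):
    # O(T d) gather of every past token's value row, scaled by gamma_v
    return gamma_v * E_v[np.asarray(idx)]

def decode_step(q, k, token_id, cache, E_v, gamma_v):
    """One single-head decode step; returns an attention output of shape (d,)."""
    cache.keys.append(k)
    cache.idx.append(token_id)
    K = np.stack(cache.keys)                       # (T, d)
    v_hist = bov_value(cache.idx, E_v, gamma_v)    # (T, d), rebuilt on every step
    scores = K @ q / np.sqrt(q.shape[-1])          # (T,)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v_hist                              # v_hist is dropped on return

rng = np.random.default_rng(0)
vocab_size, d = 20, 4
E_v = rng.standard_normal((vocab_size, d))
gamma_v = 2.0
cache = BoVDecodeCache()
outputs = []
for token_id in [7, 3, 7]:
    q, k = rng.standard_normal(d), rng.standard_normal(d)
    outputs.append(decode_step(q, k, token_id, cache, E_v, gamma_v))
```

The only per-token state the layer keeps for values is one integer id, and `v_hist` exists only for the duration of a single call.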
#### Memory overhead\.
Because BoV persists the value vectors of all vocabulary tokens as the parameter table $\mathbf{E}_v$, the fixed memory cost of $\mathbf{E}_v$ trades off predictably against the value cache of standard attention across context lengths. The net V-cache memory saved across all layers is

$$\Delta_{\mathrm{KV}}=\frac{L}{3}\,d\,(T-|\mathcal{V}|), \tag{6}$$

where $L$ is the number of layers, $T$ the context length, $d$ the hidden dimension, and $|\mathcal{V}|$ the vocabulary size. As Eq. [6](https://arxiv.org/html/2606.02780#S4.E6) shows, standard attention is more memory-efficient at shorter contexts, but beyond the breakeven length $T=|\mathcal{V}|$, BoV saves memory, with the saved fraction growing and eventually saturating. Figure [5](https://arxiv.org/html/2606.02780#S5.F5)(a) gives an overview of this tradeoff for the vocabulary sizes of two tokenizers (nanochat and Qwen3).
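A short calculation makes the breakeven in Eq. 6 tangible. The sketch below counts elements rather than bytes and considers a single sequence. The model sizes are illustrative placeholders rather than values from the paper, and the normalization used for the saved fraction (the full V cache of standard attention, $L\,d\,T$) is our assumption about how such a fraction would be defined.

```python
def v_cache_saved(n_layers, d, T, vocab_size):
    """Eq. 6: V-cache elements saved by BoV (negative means BoV costs more)."""
    n_target = n_layers // 3  # BoV is applied to the last third of layers
    return n_target * d * (T - vocab_size)

def saved_fraction(n_layers, d, T, vocab_size):
    # Assumption: normalize by the full V cache of standard attention, L * d * T
    return v_cache_saved(n_layers, d, T, vocab_size) / (n_layers * d * T)

n_layers, d, vocab_size = 24, 1024, 50_000  # illustrative, not from the paper
for T in (8_192, 50_000, 100_000, 1_000_000):
    saved = v_cache_saved(n_layers, d, T, vocab_size)
    frac = saved_fraction(n_layers, d, T, vocab_size)
    print(f"T={T:>9,}  saved elements={saved:>15,}  fraction={frac:+.4f}")
```

Under that normalization the fraction simplifies to $\tfrac{1}{3}\,(1-|\mathcal{V}|/T)$: it is negative below the breakeven, zero at $T=|\mathcal{V}|$, and approaches one third as $T$ grows, which matches the saturating behavior described above.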
For the workloads of modern LLMs, we argue that BoV is preferable for two reasons. First, long-context inference is a common requirement across a wide range of LLM applications, precisely the regime in which persisting value vectors as parameters becomes advantageous. Second, the lookup into $\mathbf{E}_v$ requires only the token indices $i_p$, which are available well before the forward pass reaches the target layer. Since only entries corresponding to those token indices need to be retrieved in a forward pass, $\mathbf{E}_v$ is a sparse parameter that does not participate in dense matrix multiplication. Therefore, $\mathbf{E}_v$ can be offloaded to host memory by default, and only the needed entries are prefetched on demand, improving memory utilization.
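One way to picture the on-demand prefetch is to stage only the distinct rows a sequence needs and then expand them back to sequence order. The NumPy sketch below illustrates that idea only. `stage_rows` is a hypothetical helper, both arrays live in ordinary host memory here, and no actual host-to-device transfer is performed.

```python
import numpy as np

def stage_rows(host_table, token_ids):
    """Select the distinct rows of host_table needed for token_ids.

    Returns the staged rows and, for each position, the index of its row
    within the staged block.
    """
    unique_ids, inverse = np.unique(token_ids, return_inverse=True)
    staged = host_table[unique_ids]  # the only rows that would need to move
    return staged, inverse

# Toy table: 10 token ids, hidden dimension 4
host_table = np.arange(40.0).reshape(10, 4)
token_ids = np.array([5, 2, 5, 5, 9])

staged, inverse = stage_rows(host_table, token_ids)
values = staged[inverse]             # per-position values, in sequence order
```

Because the token ids are known as soon as the input is tokenized, a selection like this could in principle be issued before the forward pass reaches the target layer. The number of staged rows is bounded by the number of distinct tokens in the sequence, not by its length.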
We provide PyTorch\-style pseudocode for BoV in Figure[3](https://arxiv.org/html/2606.02780#S4.F3)\.
Figure 4: Validation loss over training at the (a) 135M and (b) 780M scales, for the standard attention baseline, BoV, and other attention variants that at least partially compute value vectors from a context-free component.

## 5 Experiments

Table 2: Validation loss, aggregate CORE score, and representative benchmark scores for standard attention, BoV, and other variants, trained under an identical FLOP-controlled budget on the 780M model.

Table 3: Comparison on four representative benchmarks between BoV (top), which retrieves value vectors as $\mathbf{E}_v[i]$, and an additive variant (bottom), which adds the retrieved values to the standard value vector as $\mathbf{V}+\mathbf{E}_v[i]$, on the 780M model. The better result in each column is in bold.

Figure 5: (a) Fraction of V cache memory saved by Bank of Values over standard attention as a function of context length, for the nanochat and Qwen3 tokenizers; and (b) ablation of BoV’s design components on the 135M model.

#### Model architecture and training setup.
We train both BoV and the standard attention baseline at the 24-layer, 780M scale in the style of nanochat (Karpathy, [2025](https://arxiv.org/html/2606.02780#bib.bib19)). We fix the compute budget for all experiments at $3.91\times 10^{19}$ FLOPs, which corresponds to 8.19B tokens for the baseline and 8.39B for BoV, owing to its lower FLOPs per token. As in Section [3](https://arxiv.org/html/2606.02780#S3), we use the Muon optimizer with learning rate $0.02$ on all matrix parameters, and the AdamW optimizer with learning rate $0.15$ on the value vector table $\mathbf{E}_v$ in BoV. The global batch size is fixed at 1,048,576 tokens with a sequence length of 2048. A complete training configuration can be found in Appendix [B](https://arxiv.org/html/2606.02780#A2). We additionally train both architectures at the 12-layer, 135M scale, following the recipe in Section [3](https://arxiv.org/html/2606.02780#S3).
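The token counts above imply per-token costs that can be recovered with simple arithmetic. The sketch below uses only the rounded figures quoted in this paragraph, so the derived numbers are approximate, and the variable names are ours.

```python
budget_flops = 3.91e19          # shared compute budget at the 780M scale
tokens_baseline = 8.19e9        # tokens seen by standard attention
tokens_bov = 8.39e9             # tokens seen by BoV
batch_tokens = 1_048_576        # global batch size in tokens

# FLOPs per token implied by a fixed budget
flops_per_token_baseline = budget_flops / tokens_baseline
flops_per_token_bov = budget_flops / tokens_bov

# Relative per-token saving of BoV; equals 1 - tokens_baseline / tokens_bov
relative_saving = 1.0 - flops_per_token_bov / flops_per_token_baseline

# Approximate number of optimizer steps at the fixed batch size
steps_baseline = round(tokens_baseline / batch_tokens)
steps_bov = round(tokens_bov / batch_tokens)

print(f"baseline: {flops_per_token_baseline:.3e} FLOPs/token, ~{steps_baseline} steps")
print(f"BoV:      {flops_per_token_bov:.3e} FLOPs/token, ~{steps_bov} steps")
print(f"BoV per-token saving: {relative_saving:.1%}")
```

On these rounded inputs, BoV spends roughly 2.4% fewer FLOPs per token, which is what allows it to process about 0.2B additional tokens within the same budget.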
To avoid introducing any inductive bias through BoV’s initialization, we initialize each target layer’s table by copying $\mathbf{x}_0\,\mathbf{W}_V$, evaluated at each token, into the corresponding row $\mathbf{E}_v[i]$, so that BoV’s first forward pass is numerically identical to an attention layer that computes its value vectors from the initial token embedding alone, as $\mathbf{x}_0\,\mathbf{W}_V$.
#### Evaluation setup\.
For the 24-layer variants, we evaluate on 21 benchmarks from the DCLM CORE suite (Li et al., [2024](https://arxiv.org/html/2606.02780#bib.bib24)): Jeopardy (Li et al., [2024](https://arxiv.org/html/2606.02780#bib.bib24)), ARC-Easy and ARC-Challenge (Clark et al., [2018](https://arxiv.org/html/2606.02780#bib.bib7)), COPA (Roemmele et al., [2011](https://arxiv.org/html/2606.02780#bib.bib35)), CommonsenseQA (Talmor et al., [2019](https://arxiv.org/html/2606.02780#bib.bib44)), PIQA (Bisk et al., [2020](https://arxiv.org/html/2606.02780#bib.bib3)), OpenBookQA (Mihaylov et al., [2018](https://arxiv.org/html/2606.02780#bib.bib27)), LAMBADA (Paperno et al., [2016](https://arxiv.org/html/2606.02780#bib.bib30)), HellaSwag (Zellers et al., [2019](https://arxiv.org/html/2606.02780#bib.bib49)), Winograd (Levesque et al., [2012](https://arxiv.org/html/2606.02780#bib.bib23)), Winogrande (Sakaguchi et al., [2019](https://arxiv.org/html/2606.02780#bib.bib36)), AGIEval LSAT-AR (Zhong et al., [2023](https://arxiv.org/html/2606.02780#bib.bib50)), SQuAD (Rajpurkar et al., [2016](https://arxiv.org/html/2606.02780#bib.bib33)), CoQA (Reddy et al., [2019](https://arxiv.org/html/2606.02780#bib.bib34)), BoolQ (Clark et al., [2019](https://arxiv.org/html/2606.02780#bib.bib6)), and the BIG-bench QA WikiData, Language Identification, Dyck Languages, CS Algorithms, Operators, and Repeat-Copy-Logic tasks (Srivastava et al., [2022](https://arxiv.org/html/2606.02780#bib.bib41)).

Following DCLM, we report a centered accuracy for each task, $(\text{acc}-r)/(1-r)$, where $r$ is the task’s random-guessing (or majority-class) baseline, and define the aggregate CORE score as the unweighted mean of the 21 centered accuracies. For both the 12-layer and 24-layer variants, we additionally report bits-per-byte validation loss on a held-out split of the ClimbMix pretraining corpus, computed over approximately 41.9M tokens. The main results, including scores on four representative benchmarks, are reported in Table [2](https://arxiv.org/html/2606.02780#S5.T2), and the full benchmark results in Table [4](https://arxiv.org/html/2606.02780#A1.T4).
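The centered accuracy and its unweighted mean are simple to compute. The sketch below uses three made-up (accuracy, baseline) pairs purely for illustration. They are not results from the paper, and the function names are ours.

```python
def centered_accuracy(acc, baseline):
    """(acc - r) / (1 - r): 0 at chance level, 1 at perfect accuracy."""
    return (acc - baseline) / (1.0 - baseline)

def core_score(results):
    """Unweighted mean of centered accuracies over (accuracy, baseline) pairs."""
    centered = [centered_accuracy(acc, r) for acc, r in results]
    return sum(centered) / len(centered)

# Hypothetical tasks: (raw accuracy, random-guessing baseline)
example_results = [
    (0.25, 0.25),   # 4-way multiple choice at chance level
    (0.625, 0.25),  # 4-way multiple choice, halfway from chance to perfect
    (1.0, 0.5),     # binary task answered perfectly
]
score = core_score(example_results)
```

Centering puts tasks with different numbers of answer options on a common scale, so a 4-way task at 62.5% and a binary task at 75% both contribute 0.5 to the mean.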
### 5.1 Main Results

As shown in Figure [4](https://arxiv.org/html/2606.02780#S4.F4)(a, b), BoV outperforms the standard attention baseline on validation loss at both the 135M and 780M scales. It also performs favorably on most benchmarks (Tables [2](https://arxiv.org/html/2606.02780#S5.T2) and [4](https://arxiv.org/html/2606.02780#A1.T4)), with the largest gains on the reading-comprehension (SQuAD, CoQA) and symbolic-problem-solving (BIG-bench) tasks, while matching the baseline on most commonsense-reasoning and language-understanding tasks. These results align with the intuition that computing attention with context-free value vectors particularly benefits tasks that rely on fewer contextual cues, while remaining competitive on the rest.
### 5.2 Comparison with Prior Methods

We compare BoV with the prior methods discussed in Section [3](https://arxiv.org/html/2606.02780#S3) that compute the value vector at least partially from a context-free component $\tilde{\mathbf{V}}$ (Table [2](https://arxiv.org/html/2606.02780#S5.T2)). At the 780M scale, none of these methods outperforms the standard attention baseline on the aggregate CORE score, indicating limited scalability. In particular, computing $\mathbf{x}_0\mathbf{W}_V$ per layer (Table [2](https://arxiv.org/html/2606.02780#S5.T2)), which has the same expressivity as BoV, lags behind on benchmark performance. We surmise that one contributor to this gap is BoV’s lower FLOPs per token, which lets it see more tokens under a FLOP-controlled setting.

In Table [3](https://arxiv.org/html/2606.02780#S5.T3), we also compare BoV with a popular additive variant (Jordan and contributors, [2024](https://arxiv.org/html/2606.02780#bib.bib17); snimu, [2025](https://arxiv.org/html/2606.02780#bib.bib40)) that adds a value-embedding lookup $\mathbf{E}_v[i]$ to the standard value vector computed from $\mathbf{x}_i$. We follow their original implementation, using input-dependent, bounded scaling coefficients and targeting every second layer. BoV performs favorably on most benchmarks while attaining a similar validation loss. The full comparison is reported in Table [5](https://arxiv.org/html/2606.02780#A1.T5).
### 5.3 Component Design

Finally, we ablate each key component of BoV on the 12-layer model and report the impact on validation loss in Figure [5](https://arxiv.org/html/2606.02780#S5.F5)(b).

#### Sharing a single table across layers.

Sharing a single $\mathbf{E}_v$ across all target layers increases validation loss, showing the importance of learning a layer-specific table of value vectors.

#### Fixing the coefficient at $1$.

Removing the learnable coefficient and fixing it at $1$ also increases validation loss, indicating the benefit of scaling the attention output of the target layers.

#### Targeting every layer.

Learning an $\mathbf{E}_v$ at every layer rather than only the last third increases validation loss significantly, corroborating the insight that only deep layers benefit substantially from context-free value vectors.

#### Retaining the standard value $\mathbf{V}$.

Preserving the standard value vector $\mathbf{V}$ computed from the residual stream yields only a marginal improvement over BoV, showing that a context-invariant value vector is sufficient in deeper layers.
## 6Related Work
#### Adding early\-layer information to value vectors\.
Several methods inject early\-layer information into the value vectors of deeper layers\. Value Residual Learning adds or shares the first layer’s values\(Zhou et al\.,[2024](https://arxiv.org/html/2606.02780#bib.bib51)\), SkipV1Former reuses its value heads\(Wu et al\.,[2025](https://arxiv.org/html/2606.02780#bib.bib47)\), and modded\-nanogpt\(Jordan and contributors,[2024](https://arxiv.org/html/2606.02780#bib.bib17)\)and MoVE\(Li,[2026](https://arxiv.org/html/2606.02780#bib.bib25)\)add a per\-layer token value\-embedding table to the value vector; earlier sequence models similarly let layers attend to combinations of previous representations\(Bapna et al\.,[2018](https://arxiv.org/html/2606.02780#bib.bib2)\)\. Preserving early\-layer information is also motivated by over\-smoothing in deep layers\(Nguyen et al\.,[2023](https://arxiv.org/html/2606.02780#bib.bib28)\)\. BoV instead*replaces*the value vector in deep layers with a layer\-specific lookup keyed by token identity\.
#### Depth\-aware transformer architectures\.
Another line of work changes how information flows across depth\. DenseFormer feeds each block a learned weighted average of all previous block outputs \(Pagliardini et al\., [2024](https://arxiv.org/html/2606.02780#bib.bib29)\), Hyper\-Connections replace the fixed residual connection with learnable connection weights \(Zhu et al\., [2025](https://arxiv.org/html/2606.02780#bib.bib52)\), and Attention Residuals replace the additive residual with a learned, input\-dependent weighted sum over previous layer outputs \(Kimi Team et al\., [2026](https://arxiv.org/html/2606.02780#bib.bib20)\)\. Mixture\-of\-Depths Attention further lets each attention head attend to key/value pairs from preceding layers, not just the current one \(Zhu et al\., [2026](https://arxiv.org/html/2606.02780#bib.bib53)\)\. BoV instead modifies only the value vectors, leaving the rest of attention unchanged\.
#### Parametric memory and lookup\-based computation\.
A related line of work replaces learned matrix computations with table lookups\. Persistent\-memory attention augments each layer with static, learnable key/value vectors \(Sukhbaatar et al\., [2019](https://arxiv.org/html/2606.02780#bib.bib42)\); product\-key memory adds a large key–value lookup layer retrieved via factored keys \(Lample et al\., [2019](https://arxiv.org/html/2606.02780#bib.bib22)\); MemoryFormer replaces a transformer’s linear projection layers with hashed embedding lookups \(Ding et al\., [2024](https://arxiv.org/html/2606.02780#bib.bib13)\); and Engram offloads static information into a sparse, conditional $O(1)$ lookup \(Cheng et al\., [2026](https://arxiv.org/html/2606.02780#bib.bib5)\)\. In theory, one of the query, key, or value projections can even be dropped without loss of expressivity \(Karbevski and Mijoski, [2025](https://arxiv.org/html/2606.02780#bib.bib18)\)\. BoV applies this idea to the value path, replacing the value projection in deep layers with a per\-token learned table\.
#### Efficient attention and KV\-cache reduction\.
Many methods improve efficiency by optimizing attention kernels \(Dao et al\., [2022](https://arxiv.org/html/2606.02780#bib.bib9); Dao, [2023](https://arxiv.org/html/2606.02780#bib.bib8); Shah et al\., [2024](https://arxiv.org/html/2606.02780#bib.bib37)\), improving cache layout and serving \(Kwon et al\., [2023](https://arxiv.org/html/2606.02780#bib.bib21)\), sparsifying attended tokens \(Yuan et al\., [2025](https://arxiv.org/html/2606.02780#bib.bib48); DeepSeek\-AI, [2025](https://arxiv.org/html/2606.02780#bib.bib11)\), sharing or compressing KV heads \(Shazeer, [2019](https://arxiv.org/html/2606.02780#bib.bib38); Ainslie et al\., [2023](https://arxiv.org/html/2606.02780#bib.bib1); DeepSeek\-AI, [2024](https://arxiv.org/html/2606.02780#bib.bib10); Brandon et al\., [2024](https://arxiv.org/html/2606.02780#bib.bib4)\), quantizing the cache \(Hooper et al\., [2024](https://arxiv.org/html/2606.02780#bib.bib16); Liu et al\., [2024](https://arxiv.org/html/2606.02780#bib.bib26)\), or pruning cached layers and states \(Sun et al\., [2024](https://arxiv.org/html/2606.02780#bib.bib43); Wu and Tu, [2024](https://arxiv.org/html/2606.02780#bib.bib46); Shen et al\., [2025](https://arxiv.org/html/2606.02780#bib.bib39)\); others eliminate the V cache by recomputing values from the residual stream on demand \(Qasim et al\., [2026](https://arxiv.org/html/2606.02780#bib.bib31)\)\. These methods change how attention is computed or which states are kept; BoV instead removes the need to compute or store deep\-layer values at all, and is composable with them\.
## 7 Conclusion
In this work, we systematically study the value of adding context\-free, original token information to value vectors\. We find that deeper layers benefit substantially from a context\-free value vector that does not draw on the residual stream, which leads us to propose Bank of Values, an attention variant that stores value vectors directly as model parameters\. Across downstream benchmarks and validation loss, Bank of Values outperforms the standard attention baseline and scales better than competing variants, while using less memory and fewer FLOPs\.
## Limitations
Our study has several limitations\. First, owing to compute constraints, we validate Bank of Values only at the 135M and 780M scales under FLOP\-controlled budgets; confirming that the gains persist for substantially larger models trained on more tokens is left to future work\. Second, our findings are empirical: we observe that only deeper layers benefit from a context\-free value vector, but we do not fully characterize the mechanism behind this depth dependence or the optimization objective it serves, which we believe is an important direction for understanding the role of value vectors in attention\. Finally, the memory benefit of Bank of Values is context\-length dependent: it stores a fixed value table of size $O(|\mathcal{V}|\,d)$ per target layer and only saves memory beyond a breakeven context length, so its advantage is realized primarily in long\-context settings\.
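
To make the breakeven point concrete, the following back\-of\-the\-envelope sketch compares the two storage costs for one target layer\. It rests on simplifying assumptions that are ours rather than the paper's: the table and the V cache use the same numeric precision, the cache holds one value vector of width `d_cache` per cached token, and the token IDs needed for the lookup are ignored\. The function names and the example sizes are hypothetical\.

```python
def bov_table_elements(vocab_size, d_table):
    """Elements stored by a per-layer value table: one row per vocabulary entry."""
    return vocab_size * d_table

def v_cache_elements(cached_tokens, d_cache):
    """Elements stored by a standard per-layer V cache: one row per cached token."""
    return cached_tokens * d_cache

def breakeven_cached_tokens(vocab_size, d_table, d_cache):
    """Number of cached tokens at which the V cache matches the table in size.

    Beyond this count the fixed-size table is the smaller of the two.
    """
    return vocab_size * d_table / d_cache

# Hypothetical sizes, chosen only for illustration.
vocab_size, d_table = 65_536, 1_024

# Equal widths: breakeven is exactly the vocabulary size.
same_width = breakeven_cached_tokens(vocab_size, d_table, d_cache=1_024)

# A narrower cache (for example with shared KV heads) pushes breakeven further out.
narrow_cache = breakeven_cached_tokens(vocab_size, d_table, d_cache=256)
```

Under these assumptions the breakeven is $|\mathcal{V}| \cdot d_{\text{table}} / d_{\text{cache}}$ cached tokens per target layer, counted across all sequences held in memory at once, which is why the saving appears mainly with long contexts or many concurrent sequences\.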
## References
- Ainslie et al\. \(2023\)Joshua Ainslie, James Lee\-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai\. 2023\.[GQA: Training generalized multi\-query transformer models from multi\-head checkpoints](https://arxiv.org/abs/2305.13245)\.In*Proceedings of EMNLP*\.
- Bapna et al\. \(2018\)Ankur Bapna, Mia Xu Chen, Orhan Firat, Yuan Cao, and Yonghui Wu\. 2018\.[Training deeper neural machine translation models with transparent attention](https://arxiv.org/abs/1808.07561)\.In*Proceedings of EMNLP*\.
- Bisk et al\. \(2020\)Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi\. 2020\.[PIQA: Reasoning about Physical Commonsense in Natural Language](https://arxiv.org/abs/1911.11641)\.In*Proceedings of the AAAI Conference on Artificial Intelligence*\.
- Brandon et al\. \(2024\)William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, and Jonathan Ragan\-Kelly\. 2024\.[Reducing transformer key\-value cache size with cross\-layer attention](https://arxiv.org/abs/2405.12981)\.In*Advances in Neural Information Processing Systems \(NeurIPS\)*\.
- Cheng et al\. \(2026\)Xin Cheng, Wangding Zeng, Damai Dai, Qinyu Chen, Bingxuan Wang, Zhenda Xie, Kezhao Huang, Xingkai Yu, Zhewen Hao, Yukun Li, Han Zhang, Huishuai Zhang, Dongyan Zhao, and Wenfeng Liang\. 2026\.[Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models](https://arxiv.org/abs/2601.07372)\.*Preprint*, arXiv:2601\.07372\.
- Clark et al\. \(2019\)Christopher Clark, Kenton Lee, Ming\-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova\. 2019\.[BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions](https://arxiv.org/abs/1905.10044)\.In*Proceedings of NAACL\-HLT*\.
- Clark et al\. \(2018\)Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord\. 2018\.[Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge](https://arxiv.org/abs/1803.05457)\.*Preprint*, arXiv:1803\.05457\.
- Dao \(2023\)Tri Dao\. 2023\.[FlashAttention\-2: Faster Attention with Better Parallelism and Work Partitioning](https://arxiv.org/abs/2307.08691)\.*Preprint*, arXiv:2307\.08691\.
- Dao et al\. \(2022\)Tri Dao, Daniel Y\. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré\. 2022\.[FlashAttention: Fast and Memory\-Efficient Exact Attention with IO\-Awareness](https://arxiv.org/abs/2205.14135)\.In*Advances in Neural Information Processing Systems*\.
- DeepSeek\-AI \(2024\)DeepSeek\-AI\. 2024\.[DeepSeek\-V2: A Strong, Economical, and Efficient Mixture\-of\-Experts Language Model](https://arxiv.org/abs/2405.04434)\.*Preprint*, arXiv:2405\.04434\.
- DeepSeek\-AI \(2025\)DeepSeek\-AI\. 2025\.[DeepSeek\-V3\.2: Pushing the Frontier of Open Large Language Models](https://arxiv.org/abs/2512.02556)\.*Preprint*, arXiv:2512\.02556\.
- Diao et al\. \(2025\)Shizhe Diao, Yu Yang, Yonggan Fu, Xin Dong, Dan Su, Markus Kliegl, Zijia Chen, Peter Belcak, Yoshi Suhara, Hongxu Yin, Mostofa Patwary, Yingyan Lin, Jan Kautz, and Pavlo Molchanov\. 2025\.[CLIMB: CLustering\-based Iterative Data Mixture Bootstrapping for Language Model Pre\-training](https://arxiv.org/abs/2504.13161)\.*Preprint*, arXiv:2504\.13161\.
- Ding et al\. \(2024\)Ning Ding, Yehui Tang, Haochen Qin, Zhenli Zhou, Chao Xu, Lin Li, Kai Han, Heng Liao, and Yunhe Wang\. 2024\.[MemoryFormer: Minimize transformer computation by removing fully\-connected layers](https://arxiv.org/abs/2411.12992)\.*Preprint*, arXiv:2411\.12992\.
- Ghandeharioun et al\. \(2024\)Asma Ghandeharioun, Avi Caciularu, Adam Pearce, Lucas Dixon, and Mor Geva\. 2024\.[Patchscopes: A unifying framework for inspecting hidden representations of language models](https://arxiv.org/abs/2401.06102)\.In*Proceedings of the International Conference on Machine Learning \(ICML\)*\.
- He et al\. \(2016\)Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun\. 2016\.[Deep residual learning for image recognition](https://arxiv.org/abs/1512.03385)\.In*Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition \(CVPR\)*\.
- Hooper et al\. \(2024\)Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W\. Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami\. 2024\.[KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization](https://arxiv.org/abs/2401.18079)\.In*Advances in Neural Information Processing Systems*\.
- Jordan and contributors \(2024\)Keller Jordan and contributors\. 2024\.[modded\-nanogpt](https://github.com/KellerJordan/modded-nanogpt)\.GitHub repository\.
- Karbevski and Mijoski \(2025\)Marko Karbevski and Antonij Mijoski\. 2025\.[Key and value weights are probably all you need: On the necessity of the query, key, value weight triplet in self\-attention transformers](https://arxiv.org/abs/2510.23912)\.*Preprint*, arXiv:2510\.23912\.
- Karpathy \(2025\)Andrej Karpathy\. 2025\.[nanochat: The best ChatGPT that $100 can buy](https://github.com/karpathy/nanochat)\.GitHub repository\.
- Kimi Team et al\. \(2026\)Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu, Siyuan Pan, Yaoyu Wang, Yucheng Wang, Guanduo Chen, Bohong Yin, and 1 others\. 2026\.[Attention Residuals](https://arxiv.org/abs/2603.15031)\.*Preprint*, arXiv:2603\.15031\.
- Kwon et al\. \(2023\)Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E\. Gonzalez, Hao Zhang, and Ion Stoica\. 2023\.[Efficient Memory Management for Large Language Model Serving with PagedAttention](https://arxiv.org/abs/2309.06180)\.In*Proceedings of the ACM Symposium on Operating Systems Principles*\.
- Lample et al\. \(2019\)Guillaume Lample, Alexandre Sablayrolles, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou\. 2019\.[Large memory layers with product keys](https://arxiv.org/abs/1907.05242)\.In*Advances in Neural Information Processing Systems \(NeurIPS\)*\.
- Levesque et al\. \(2012\)Hector J\. Levesque, Ernest Davis, and Leora Morgenstern\. 2012\.The Winograd Schema Challenge\.In*Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning*\.
- Li et al\. \(2024\)Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, and 1 others\. 2024\.[DataComp\-LM: In Search of the Next Generation of Training Sets for Language Models](https://arxiv.org/abs/2406.11794)\.In*Advances in Neural Information Processing Systems*\.
- Li \(2026\)Yangyan Li\. 2026\.[MoVE: Mixture of Value Embeddings – A New Axis for Scaling Parametric Memory in Autoregressive Models](https://arxiv.org/abs/2601.22887)\.*Preprint*, arXiv:2601\.22887\.
- Liu et al\. \(2024\)Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu\. 2024\.[KIVI: A Tuning\-Free Asymmetric 2bit Quantization for KV Cache](https://arxiv.org/abs/2402.02750)\.In*International Conference on Machine Learning*\.
- Mihaylov et al\. \(2018\)Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal\. 2018\.[Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering](https://arxiv.org/abs/1809.02789)\.In*Proceedings of the Conference on Empirical Methods in Natural Language Processing*\.
- Nguyen et al\. \(2023\)Tam Nguyen, Tan M\. Nguyen, and Stanley J\. Osher\. 2023\.[Mitigating Over\-smoothing in Transformers via Regularized Nonlocal Functionals](https://arxiv.org/abs/2312.00751)\.*Preprint*, arXiv:2312\.00751\.
- Pagliardini et al\. \(2024\)Matteo Pagliardini, Amirkeivan Mohtashami, François Fleuret, and Martin Jaggi\. 2024\.[DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging](https://arxiv.org/abs/2402.02622)\.*Preprint*, arXiv:2402\.02622\.
- Paperno et al\. \(2016\)Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández\. 2016\.[The LAMBADA Dataset: Word Prediction Requiring a Broad Discourse Context](https://arxiv.org/abs/1606.06031)\.In*Proceedings of the Annual Meeting of the Association for Computational Linguistics*\.
- Qasim et al\. \(2026\)Kaleem Ullah Qasim, Jiashu Zhang, Muhammad Kafeel Shaheen, Razan Alharith, and Heying Zhang\. 2026\.[The residual stream is all you need: On the redundancy of the KV cache in transformer inference](https://arxiv.org/abs/2603.19664)\.*Preprint*, arXiv:2603\.19664\.
- Radford et al\. \(2019\)Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever\. 2019\.[Language Models are Unsupervised Multitask Learners](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)\.*OpenAI Technical Report*\.
- Rajpurkar et al\. \(2016\)Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang\. 2016\.[SQuAD: 100,000\+ Questions for Machine Comprehension of Text](https://arxiv.org/abs/1606.05250)\.In*Proceedings of the Conference on Empirical Methods in Natural Language Processing*\.
- Reddy et al\. \(2019\)Siva Reddy, Danqi Chen, and Christopher D\. Manning\. 2019\.[CoQA: A Conversational Question Answering Challenge](https://arxiv.org/abs/1808.07042)\.*Transactions of the Association for Computational Linguistics*\.
- Roemmele et al\. \(2011\)Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S\. Gordon\. 2011\.Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning\.In*AAAI Spring Symposium on Logical Formalizations of Commonsense Reasoning*\.
- Sakaguchi et al\. \(2019\)Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi\. 2019\.[WinoGrande: An Adversarial Winograd Schema Challenge at Scale](https://arxiv.org/abs/1907.10641)\.*Communications of the ACM*\.
- Shah et al\. \(2024\)Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao\. 2024\.[FlashAttention\-3: Fast and Accurate Attention with Asynchrony and Low\-Precision](https://arxiv.org/abs/2407.08608)\.*Preprint*, arXiv:2407\.08608\.
- Shazeer \(2019\)Noam Shazeer\. 2019\.[Fast transformer decoding: One write\-head is all you need](https://arxiv.org/abs/1911.02150)\.*arXiv preprint arXiv:1911\.02150*\.
- Shen et al\. \(2025\)Yiqun Shen, Song Yuan, Zhengze Zhang, Xiaoliang Wang, Daxin Jiang, and Cam\-Tu Nguyen\. 2025\.[LAVa: Layer\-wise KV Cache Eviction with Dynamic Budget Allocation](https://arxiv.org/abs/2509.09754)\.*Preprint*, arXiv:2509\.09754\.
- snimu \(2025\)snimu\. 2025\.[modded\-nanogpt: Analyzing value\-embedding\-, UNet\-, and x0\-lambdas](https://snimu.github.io/2025/08/11/modded-nanogpt-lambdas.html)\.Blog post\.
- Srivastava et al\. \(2022\)Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, and 1 others\. 2022\.[Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models](https://arxiv.org/abs/2206.04615)\.*Preprint*, arXiv:2206\.04615\.
- Sukhbaatar et al\. \(2019\)Sainbayar Sukhbaatar, Edouard Grave, Guillaume Lample, Hervé Jégou, and Armand Joulin\. 2019\.[Augmenting self\-attention with persistent memory](https://arxiv.org/abs/1907.01470)\.*Preprint*, arXiv:1907\.01470\.
- Sun et al\. \(2024\)Yutao Sun, Li Dong, Yi Zhu, Shaohan Huang, Wenhui Wang, Shuming Ma, Quanlu Zhang, Jianyong Wang, and Furu Wei\. 2024\.[You only cache once: Decoder\-decoder architectures for language models](https://arxiv.org/abs/2405.05254)\.*Preprint*, arXiv:2405\.05254\.
- Talmor et al\. \(2019\)Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant\. 2019\.[CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge](https://arxiv.org/abs/1811.00937)\.In*Proceedings of NAACL\-HLT*\.
- Vaswani et al\. \(2017\)Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N\. Gomez, Łukasz Kaiser, and Illia Polosukhin\. 2017\.[Attention Is All You Need](https://arxiv.org/abs/1706.03762)\.In*Advances in Neural Information Processing Systems*\.
- Wu and Tu \(2024\)Haoyi Wu and Kewei Tu\. 2024\.[Layer\-condensed KV cache for efficient inference of large language models](https://arxiv.org/abs/2405.10637)\.In*Proceedings of ACL*\.
- Wu et al\. \(2025\)Zhoutong Wu, Yuan Zhang, Yiming Dong, Chenheng Zhang, Cong Fang, Kun Yuan, and Zhouchen Lin\. 2025\.[Improving Model Representation and Reducing KV Cache via Skip Connections with First Value Heads](https://arxiv.org/abs/2510.16807)\.*Preprint*, arXiv:2510\.16807\.
- Yuan et al\. \(2025\)Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y\. X\. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, and Wangding Zeng\. 2025\.[Native Sparse Attention: Hardware\-Aligned and Natively Trainable Sparse Attention](https://arxiv.org/abs/2502.11089)\.*Preprint*, arXiv:2502\.11089\.
- Zellers et al\. \(2019\)Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi\. 2019\.[HellaSwag: Can a Machine Really Finish Your Sentence?](https://arxiv.org/abs/1905.07830)In*Proceedings of the Annual Meeting of the Association for Computational Linguistics*\.
- Zhong et al\. \(2023\)Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan\. 2023\.[AGIEval: A Human\-Centric Benchmark for Evaluating Foundation Models](https://arxiv.org/abs/2304.06364)\.*Preprint*, arXiv:2304\.06364\.
- Zhou et al\. \(2024\)Zhanchao Zhou, Tianyi Wu, Zhiyun Jiang, and Zhenzhong Lan\. 2024\.[Value Residual Learning For Alleviating Attention Concentration In Transformers](https://arxiv.org/abs/2410.17897)\.*Preprint*, arXiv:2410\.17897\.
- Zhu et al\. \(2025\)Defa Zhu, Hongzhi Huang, Zihao Huang, Yutao Zeng, Yunyao Mao, Banggu Wu, Qiyang Min, and Xun Zhou\. 2025\.[Hyper\-Connections](https://arxiv.org/abs/2409.19606)\.In*International Conference on Learning Representations*\.
- Zhu et al\. \(2026\)Lianghui Zhu, Yuxin Fang, Bencheng Liao, Shijie Wang, Tianheng Cheng, Zilong Huang, Chen Chen, Lai Wei, Yutao Zeng, Ya Wang, Yi Lin, Yu Li, and Xinggang Wang\. 2026\.[Mixture\-of\-Depths Attention](https://arxiv.org/abs/2603.15619)\.*Preprint*, arXiv:2603\.15619\.
## Appendix A Additional Results
Tables [4](https://arxiv.org/html/2606.02780#A1.T4) and [5](https://arxiv.org/html/2606.02780#A1.T5) report the full benchmark results behind Tables [2](https://arxiv.org/html/2606.02780#S5.T2) and [3](https://arxiv.org/html/2606.02780#S5.T3)\.

Table 4: Full benchmark comparison of the Baseline and BoV on the 780M model across all 21 DCLM CORE tasks, grouped into the five DCLM categories, plus validation BPB and the aggregate CORE score\.

Table 5: Full benchmark comparison of the add form \($\mathbf{V}+\mathbf{E}_v[i]$\) and the replace form of BoV \($\mathbf{E}_v[i]$\) on the 780M model across all 21 DCLM CORE tasks, grouped into the five DCLM categories, plus validation BPB and the aggregate CORE score\.
## Appendix B Additional Experimental Details
Table [6](https://arxiv.org/html/2606.02780#A2.T6) reports the full training configuration for both model scales\.
Table 6: Training hyperparameters for the 135M \(12\-layer\) and 780M \(24\-layer\) models\. The 780M learning rates and weight decay are scaled from the 135M values following batch\-size \($\times\sqrt{2}$\) and width scaling rules\.