WAV: Multi-Resolution Block Residual Routing for Deep Decoder-Only Transformers

arXiv cs.LG 06/08/26, 04:00 AM Papers

deep-learning transformers residual-connections attention language-models gpt architecture

Summary

This paper introduces Multi-Resolution Residual Routing (WAV v1), an extension of Block Attention Residuals that augments block representations with directional detail bases, improving deep decoder-only Transformer training.

arXiv:2606.06564v1 Announce Type: new Abstract: Residual connections are central to training deep Transformers, but standard PreNorm residual streams aggregate sublayer updates with fixed unit weights. Recent Attention Residuals replace this fixed accumulation with content-dependent depth-wise routing, and Block Attention Residuals make the mechanism efficient by routing over block-level residual summaries. However, a single block summary stores only the low-frequency total residual displacement inside a block, discarding directional structure such as attention-vs-MLP imbalance and early-vs-late block dynamics. We propose WAV v1, a lightweight multi-resolution residual routing method for decoder-only Transformers. Instead of representing each block only by its accumulated residual sum, WAV v1 augments every block with two directional detail bases: a phase basis that contrasts attention and MLP updates, and a split basis that contrasts early and late sublayer updates. These bases are routed together with standard block summaries through the same depth-wise softmax mixer, while negative detail-source initialization and detached RMS matching stabilize training. On character-level TinyStories and Text8 language modeling, WAV v1 shows a clear depth-dependent benefit. Although it is not consistently beneficial at 12 layers, it becomes competitive at 24 layers and outperforms all baselines at 48 layers. At 48 layers, WAV v1 reduces validation loss relative to Block AttnRes from 0.4960 to 0.4738 on TinyStories and from 0.9363 to 0.9305 on Text8, with negligible additional parameters. These results suggest that directional residual details, not only block-level sums, are important for scaling residual routing in deeper Transformers.

# Multi-Resolution Residual Routing for Deep Decoder-Only Transformers
Source: [https://arxiv.org/html/2606.06564](https://arxiv.org/html/2606.06564)
###### Abstract

Residual connections are central to training deep Transformers, but standard PreNorm residual streams aggregate sublayer updates with fixed unit weights. Recent Attention Residuals replace this fixed accumulation with content-dependent depth-wise routing, and Block Attention Residuals make the mechanism efficient by routing over block-level residual summaries. However, a single block summary stores only the low-frequency, total residual displacement inside a block, discarding directional structure such as attention-vs-MLP imbalance and early-vs-late block dynamics. We propose *Multi-Resolution Residual Routing*, instantiated as WAV v1, a lightweight extension of Block Attention Residuals that augments every block representation with two zero-sum directional detail bases: a phase basis contrasting attention and MLP updates, and a split basis contrasting the first and second halves of a block. These detail bases are routed by the same depth-wise softmax mixer as the block summary, but are introduced with a negative bias and RMS matching to preserve early training stability. On character-level GPT decoder-only language modeling with TinyStories and Text8, WAV v1 shows a strong depth-dependent trend: it is not beneficial at 12 layers, becomes competitive at 24 layers, and achieves the best validation loss at 48 layers on both datasets. At 48 layers, WAV v1 improves over Block AttnRes by 0.0222 validation loss on TinyStories and 0.0057 on Text8, while leaving the attention and MLP modules unchanged. These preliminary results suggest that deep residual routing benefits not only from selecting *which block* to read, but also from routing the internal directional structure of each block.

## 1 Introduction

Modern decoder\-only Transformers are typically trained with PreNorm residual connections, where each sublayer contributes an additive update to the residual stream\. This simple structure has enabled stable training of very deep networks, but it also imposes a rigid aggregation rule: every residual update is accumulated with fixed unit weight\. As depth increases, this uniform accumulation can dilute individual layer contributions and make the residual stream increasingly redundant\. Attention Residuals address this issue by replacing fixed residual accumulation with learned softmax attention over previous layer outputs, enabling each layer to perform content\-dependent depth\-wise routing\[[3](https://arxiv.org/html/2606.06564#bib.bib3)\]\. Block Attention Residuals further reduce the memory and communication overhead by compressing multiple layers into block\-level residual summaries\.

The central hypothesis of this work is that a single block summary is still an incomplete representation of the residual trajectory inside a block. If a block contains a sequence of sublayer updates $\{u_{b,i}\}_{i=1}^{m}$, Block AttnRes stores only their sum, $C_b=\sum_i u_{b,i}$. This is analogous to preserving a low-frequency or DC component of the block trajectory. Yet the internal shape of the trajectory may carry useful information. For example, the block may be attention-dominant or MLP-dominant; its early updates may point in a direction different from its late updates. Such directional information is lost when the block is represented only by $C_b$.

We introduce WAV v1, a minimal multi-resolution extension of Block AttnRes. For every block, WAV v1 stores the original block sum $C_b$ together with two zero-sum detail bases: $D^{\mathrm{phase}}_b$, the difference between attention and MLP updates, and $D^{\mathrm{split}}_b$, the difference between the first-half and second-half updates. These bases are cheap to compute because they are accumulated online as residual updates are produced. They are also conservative: the main attention and MLP functions are unchanged, detail sources receive a negative initial bias, details are RMS-matched to the corresponding block summary, and the final prediction mixer does not use detail sources by default.

Our preliminary experiments evaluate 12\-, 24\-, and 48\-layer GPT decoder\-only models on TinyStories and Text8\. The results reveal a clear depth\-dependent pattern\. At 12 layers, WAV v1 underperforms Block AttnRes, suggesting that directional detail sources are not useful when residual depth is shallow\. At 24 layers, WAV v1 becomes competitive\. At 48 layers, WAV v1 is consistently best, improving over Block AttnRes on both datasets and substantially outperforming ReZero and LayerScale\. This pattern supports the view that multi\-resolution residual information becomes increasingly valuable as the residual trajectory length grows\.

This draft makes three contributions:

1. We formulate block residual routing as a multi-resolution representation problem: block summaries provide low-frequency state information, while directional detail bases provide intra-block trajectory information.
2. We propose WAV v1, a drop-in extension of Block AttnRes that augments each block with phase and split detail bases while preserving the original attention and MLP modules.
3. We provide preliminary scaling evidence on two character-level language modeling datasets, showing that WAV v1 becomes stronger with depth and achieves the best 48-layer validation losses among the evaluated residual mechanisms.

## 2 Related Work

#### Residual connections in deep networks\.

Residual learning was popularized by ResNets \[[1](https://arxiv.org/html/2606.06564#bib.bib1)\], where identity skip connections make optimization of very deep networks substantially easier. Transformers \[[2](https://arxiv.org/html/2606.06564#bib.bib2)\] inherit this residual principle and commonly use PreNorm variants for stability. However, standard residual addition fixes the aggregation coefficient of every update to one, which may become suboptimal in very deep architectures.

#### Learned residual scaling\.

Several methods improve deep training by scaling residual updates\. ReZero introduces a scalar residual gate initialized at zero, enabling stable signal propagation at large depth\[[5](https://arxiv.org/html/2606.06564#bib.bib5)\]\. LayerScale uses per\-channel learnable residual scaling and has been shown to improve deep vision transformers\[[6](https://arxiv.org/html/2606.06564#bib.bib6)\]\. These methods modulate residual magnitude, but they do not perform content\-dependent selection over previous residual states\.

#### Attention Residuals and block routing\.

Attention Residuals replace additive residual accumulation with learned attention over preceding representations\[[3](https://arxiv.org/html/2606.06564#bib.bib3)\]\. Block AttnRes compresses this mechanism by grouping layers into blocks and routing over block\-level representations\. This greatly reduces the cost relative to attending over every preceding layer\. A related recent direction, Delta Attention Residuals, routes over residual deltas instead of cumulative hidden states, emphasizing that the choice of routed representation is crucial\[[4](https://arxiv.org/html/2606.06564#bib.bib4)\]\. Our work is complementary: we keep the block\-level efficiency of Block AttnRes but enrich each block representation with structured directional details\.

## 3 Method

### 3\.1 Preliminaries: Block Residual Routing

Consider a decoder-only Transformer with $L$ layers. Each Transformer layer contains an attention sublayer and an MLP sublayer. We treat every sublayer output as a residual update. For a block $b$ containing $m$ sublayer updates $\{u_{b,i}\}_{i=1}^{m}$, Block AttnRes stores a block-level residual summary

$$C_b=\sum_{i=1}^{m}u_{b,i}. \tag{1}$$

At a later sublayer, the depth-wise mixer receives a source set such as

$$\mathcal{S}_{\mathrm{block}}=\{e,C_0,C_1,\ldots,C_{b-1},P_C\}, \tag{2}$$

where $e$ is the token embedding source and $P_C$ is the current partial block sum.

Given source tensors $\{s_j\}_{j=1}^{S}$, a query vector $q$, and source biases $\{\beta_j\}$, the mixer computes

$$\ell_j=q^{\top}\operatorname{RMSNorm}(s_j)+\beta_j, \tag{3}$$

$$\alpha_j=\frac{\exp(\ell_j)}{\sum_{k=1}^{S}\exp(\ell_k)}, \tag{4}$$

$$h=\sum_{j=1}^{S}\alpha_j s_j. \tag{5}$$

The resulting $h$ is used as the context input for either the attention or MLP sublayer.
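As a rough illustration of Eqs. (3)-(5), the following pure-Python sketch computes the mixer for a single token position. The names `rms_norm` and `depth_mixer`, the epsilon value, and the omission of a learnable RMSNorm gain are assumptions made for this sketch, not details taken from the paper.

```python
import math

def rms_norm(x, eps=1e-6):
    """Scale x to unit root-mean-square (no learnable gain; an assumption)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def depth_mixer(sources, query, biases):
    """Return (alphas, h) following Eqs. (3)-(5) for one token position."""
    # Eq. (3): logit_j = q^T RMSNorm(s_j) + beta_j
    logits = [
        sum(q * v for q, v in zip(query, rms_norm(s))) + b
        for s, b in zip(sources, biases)
    ]
    # Eq. (4): softmax over sources (max subtracted for numerical stability)
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Eq. (5): the un-normalized sources are mixed with the softmax weights
    h = [sum(a * s[k] for a, s in zip(alphas, sources)) for k in range(len(query))]
    return alphas, h

# Toy example: three 4-dimensional sources, zero query, zero biases.
sources = [[1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 3.0, 0.0]]
alphas, h = depth_mixer(sources, query=[0.0] * 4, biases=[0.0] * 3)
```

With a zero query and zero biases every logit is equal, so the mixer reduces to a uniform average of the sources; a nonzero query or bias shifts weight toward particular sources.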

### 3\.2 Multi\-Resolution Block Basis

WAV v1 preserves the original block summary $C_b$ but augments it with two zero-sum detail bases. Let $a_i\in\{+1,-1\}$ indicate whether update $u_{b,i}$ comes from an attention sublayer or an MLP sublayer:

$$a_i=\begin{cases}+1,& u_{b,i}\text{ is from attention},\\ -1,& u_{b,i}\text{ is from MLP}.\end{cases} \tag{6}$$

The phase basis is

$$D^{\mathrm{phase}}_b=\sum_{i=1}^{m}a_i u_{b,i}. \tag{7}$$

This basis captures whether the block's residual displacement is dominated by attention-like or MLP-like updates.

Similarly, let $r_i\in\{+1,-1\}$ indicate whether a sublayer update is in the first or second half of the block:

$$r_i=\begin{cases}+1,& i\leq m/2,\\ -1,& i>m/2.\end{cases} \tag{8}$$

The split basis is

$$D^{\mathrm{split}}_b=\sum_{i=1}^{m}r_i u_{b,i}. \tag{9}$$

This basis captures coarse early-vs-late movement inside the block.

The resulting source set is

$$
\begin{aligned}
\mathcal{S}_{\mathrm{WAV}}=\{e,\;&C_0,\tilde{D}^{\mathrm{phase}}_0,\tilde{D}^{\mathrm{split}}_0,\ldots,\\
&C_{b-1},\tilde{D}^{\mathrm{phase}}_{b-1},\tilde{D}^{\mathrm{split}}_{b-1},P_C,\tilde{P}^{\mathrm{phase}},\tilde{P}^{\mathrm{split}}\}.
\end{aligned}
\tag{10}
$$
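The definitions in Eqs. (1) and (6)-(9) can be sketched for one completed block as follows. The function name `block_bases`, the string labels `"attn"`/`"mlp"`, and the toy block of two Transformer layers are illustrative assumptions, not part of the paper.

```python
def block_bases(updates, kinds):
    """Compute (C, D_phase, D_split) for one block.

    updates: list of m residual-update vectors u_{b,1..m}
    kinds:   list of m strings, "attn" or "mlp", one per update
    """
    m, dim = len(updates), len(updates[0])
    C = [0.0] * dim
    D_phase = [0.0] * dim
    D_split = [0.0] * dim
    for i, (u, kind) in enumerate(zip(updates, kinds), start=1):
        a = 1.0 if kind == "attn" else -1.0   # Eq. (6)
        r = 1.0 if i <= m / 2 else -1.0       # Eq. (8)
        for k in range(dim):
            C[k] += u[k]                      # Eq. (1)
            D_phase[k] += a * u[k]            # Eq. (7)
            D_split[k] += r * u[k]            # Eq. (9)
    return C, D_phase, D_split

# Hypothetical block of two Transformer layers (attn, mlp, attn, mlp).
updates = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
kinds = ["attn", "mlp", "attn", "mlp"]
C, D_phase, D_split = block_bases(updates, kinds)
```

In this toy block the attention updates move only along the first coordinate and the MLP updates only along the second, so `C` is `[3.0, 4.0]` while `D_phase` is `[3.0, -4.0]`: the sign pattern of the phase basis exposes which sublayer type produced each part of the displacement, which the sum alone cannot show.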

### 3\.3 Stable Detail Injection

Directly adding detail sources to the depth mixer can destabilize early training\. We therefore use two conservative mechanisms\.

#### Negative detail bias\.

The two detail sources are initialized with a negative source bias, $\beta_D=-2.0$, while the embedding and $C$ sources use zero bias. This makes WAV v1 close to Block AttnRes at initialization and lets the model gradually increase detail usage only when beneficial.
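To get a feel for the effect of the bias, the sketch below computes the softmax weight that detail sources receive under one simplifying assumption that is not stated in the paper: all content logits $q^{\top}\operatorname{RMSNorm}(s_j)$ are equal (for example, a near-zero query at initialization), so only the biases matter. It also assumes the partial sum $P_C$ uses zero bias like the other $C$ sources. The function name `initial_detail_mass` is hypothetical.

```python
import math

def initial_detail_mass(num_completed_blocks, detail_bias=-2.0):
    """Total softmax weight on detail sources when all content logits are equal."""
    b = num_completed_blocks
    n_state = 1 + b + 1        # embedding e, C_0..C_{b-1}, partial sum P_C
    n_detail = 2 * (b + 1)     # phase and split for each completed or partial block
    w_detail = math.exp(detail_bias)  # weight of one detail source relative to a zero-bias source
    return n_detail * w_detail / (n_state + n_detail * w_detail)

# Relative weight of a single detail source versus a zero-bias source.
per_source_ratio = math.exp(-2.0)
mass_b3 = initial_detail_mass(num_completed_blocks=3)
```

Under this assumption each detail source starts with about 0.135 times the weight of a zero-bias source, and the total detail weight depends on how many blocks have been completed.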

#### Detached RMS matching\.

For a detail tensor $D$ associated with block summary $C$, we compute

$$\tilde{D}=D\cdot\operatorname{stopgrad}\left(\operatorname{clip}\left(\frac{\operatorname{RMS}(C)}{\operatorname{RMS}(D)+\epsilon},\frac{1}{\rho},\rho\right)\right), \tag{11}$$

where $\rho$ is a maximum scale factor. This prevents detail sources from becoming active only because their raw scale is larger than $C$. In our implementation, the final prediction mixer reads only embedding and $C$ sources, not detail sources.
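A minimal sketch of the scale computation in Eq. (11) is given below. The values `rho=4.0` and `eps=1e-6` are placeholders chosen for illustration, since the paper does not report them here, and the helper names are hypothetical. Plain Python has no gradients, so the stop-gradient is represented only by treating the scale as a constant; in an autodiff framework the scale would be explicitly detached.

```python
import math

def rms(x):
    """Root-mean-square of a vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def rms_match(D, C, rho=4.0, eps=1e-6):
    """Rescale detail D toward the RMS of C, following Eq. (11)."""
    scale = rms(C) / (rms(D) + eps)
    scale = min(max(scale, 1.0 / rho), rho)  # clip to [1/rho, rho]
    return [scale * v for v in D]

C = [1.0, -1.0, 1.0, -1.0]            # RMS = 1
D_small = [0.01, 0.0, 0.0, 0.0]       # raw ratio far above rho, clipped to rho
D_large = [10.0, 10.0, 10.0, 10.0]    # raw ratio below 1/rho, clipped to 1/rho
D_close = [2.0, 2.0, 2.0, 2.0]        # raw ratio inside the clip range

matched_small = rms_match(D_small, C)
matched_large = rms_match(D_large, C)
matched_close = rms_match(D_close, C)
```

Only `D_close` ends up with the same RMS as `C`; the other two are moved toward it but limited by the clip, which bounds how strongly a very small or very large detail can be rescaled.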

![Refer to caption](https://arxiv.org/html/2606.06564v1/x1.png)
Figure 1: Detailed overview of WAV v1. (a) Within each residual block, sublayer updates are accumulated into one state basis $C_b$ and two directional detail bases, $D^{\mathrm{phase}}_b$ and $D^{\mathrm{split}}_b$. (b) Compared with Block AttnRes, the depth-wise mixer receives an expanded source pool that includes completed-block and partial-block detail sources. (c) During a forward step, the MLP branch reads the partial basis after the attention update has been written; the final prediction readout uses only embedding and $C$ sources. Detail sources are stabilized by a negative initial bias and detached RMS matching.

### 3\.4 Computational Cost

WAV v1 leaves the attention, MLP, token embedding, and output head unchanged. Its additional cost comes from increasing the number of block-level routing sources. If a model has $N$ residual blocks, Block AttnRes routes over approximately $O(N)$ block summaries, while WAV v1 routes over approximately $O(3N)$ block-basis sources. The asymptotic cost remains block-level, rather than layer-level, because it does not attend over all previous sublayer states. The additional parameters are negligible: each layer has two detail biases for the attention mixer and two for the MLP mixer, i.e., four scalar parameters per Transformer layer.
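The source and parameter counts above can be made concrete with a short sketch that counts the entries of the source sets in Eqs. (2) and (10). The function names and the example value of five completed blocks are hypothetical.

```python
def num_sources_block(num_completed_blocks):
    """Sources in Eq. (2): e, C_0..C_{b-1}, and the partial sum P_C."""
    return 1 + num_completed_blocks + 1

def num_sources_wav(num_completed_blocks):
    """Sources in Eq. (10): e plus three bases per completed or partial block."""
    return 1 + 3 * (num_completed_blocks + 1)

def extra_bias_params(num_layers):
    """Two detail biases for each of the attention and MLP mixers per layer."""
    return 4 * num_layers

b = 5  # hypothetical number of completed blocks at some sublayer
counts = (num_sources_block(b), num_sources_wav(b))  # (7, 19)
extra_48 = extra_bias_params(48)                     # 192
```

The number of routed sources grows by roughly a factor of three, while the added parameter count stays at a few hundred scalars even for the 48-layer model.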

## 4 Experiments

### 4\.1 Setup

We evaluate character-level GPT decoder-only language models on TinyStories \[[9](https://arxiv.org/html/2606.06564#bib.bib9)\] and Text8 \[[10](https://arxiv.org/html/2606.06564#bib.bib10)\]. All models use PreNorm RMSNorm \[[7](https://arxiv.org/html/2606.06564#bib.bib7)\], causal self-attention, and SwiGLU MLPs \[[8](https://arxiv.org/html/2606.06564#bib.bib8)\]. We compare five residual mechanisms: Standard Residual, Block AttnRes, ReZero, LayerScale, and WAV v1. The current arXiv draft reports the validation losses consolidated in our experiment summary; error bars will be added after raw logs are fully consolidated.

Table 1: Experimental setup used for the preliminary arXiv draft.

### 4\.2 Main Results

Table 2: Final validation loss at 50k steps. Lower is better. $\Delta$ vs Block is WAV v1 minus Block AttnRes, so negative values indicate improvement. PPL reduction is computed from validation loss as $1-\exp(\mathcal{L}_{\mathrm{WAV}})/\exp(\mathcal{L}_{\mathrm{Block}})$.

Table [2](https://arxiv.org/html/2606.06564#S4.T2) shows the final validation losses. The most important result is the depth-dependent trend. At 12 layers, WAV v1 is worse than Block AttnRes on both datasets. At 24 layers, WAV v1 becomes competitive: it slightly improves TinyStories and is close on Text8. At 48 layers, WAV v1 is the best method on both datasets. On TinyStories, it reduces validation loss from 0.4960 to 0.4738 relative to Block AttnRes. On Text8, it reduces validation loss from 0.9363 to 0.9305.
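The PPL-reduction formula from the Table 2 caption can be applied to the 48-layer losses quoted in this paragraph. The sketch below uses the rounded four-decimal values from the text, so its results may differ slightly from numbers computed on unrounded losses; the function name `ppl_reduction` is hypothetical.

```python
import math

def ppl_reduction(loss_wav, loss_block):
    """Relative perplexity reduction: 1 - exp(L_WAV) / exp(L_Block)."""
    return 1.0 - math.exp(loss_wav - loss_block)

# 48-layer validation losses quoted in the text (rounded to four decimals).
tinystories = ppl_reduction(0.4738, 0.4960)
text8 = ppl_reduction(0.9305, 0.9363)
```

On these rounded values the reduction is roughly 2.2% for TinyStories and roughly 0.6% for Text8.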

![Refer to caption](https://arxiv.org/html/2606.06564v1/x2.png)
Figure 2: Depth-dependent gain of WAV v1 over Block AttnRes. Negative values mean WAV v1 is better. The method is not beneficial at shallow depth but becomes clearly stronger at 48 layers.

![Refer to caption](https://arxiv.org/html/2606.06564v1/x3.png)
Figure 3: Final validation loss across depths and datasets. WAV v1 is strongest in the 48-layer regime, while Block AttnRes is strongest or competitive at shallower depths.

### 4\.3 Training Dynamics at 48 Layers

Figure [4](https://arxiv.org/html/2606.06564#S4.F4) compares 48-layer training curves. On TinyStories, WAV v1 separates from Block AttnRes early and maintains a persistent advantage throughout training. On Text8, the advantage is smaller but consistent by the end of training. These curves suggest that the improvement is not only a late-stage artifact: multi-resolution sources can affect the optimization trajectory once the model is sufficiently deep.

![Refer to caption](https://arxiv.org/html/2606.06564v1/x4.png)
Figure 4: Validation loss curves for 48-layer models. WAV v1 obtains the best final validation loss on both TinyStories and Text8.

### 4\.4 Ranking Summary

Table 3: Best and second-best methods by final validation loss.

The ranking in Table [3](https://arxiv.org/html/2606.06564#S4.T3) highlights a key qualitative distinction. Block AttnRes is highly competitive at shallow and medium depth, but WAV v1 becomes strongest in the deepest evaluated configuration. This supports the interpretation that direction-aware block bases are most useful when each block summarizes a longer residual trajectory.

## 5 Analysis

### 5\.1 Why Does WAV Improve with Depth?

The detail bases in WAV v1 are zero-sum directional summaries. They do not replace the block state $C_b$; instead, they expose structured deviations around it. With very shallow models, there are few completed blocks and limited residual history, so detail sources may behave like noise or redundant features. With deeper models, each block contains more sublayer updates and later layers can access more completed multi-resolution block summaries. Under this regime, $D^{\mathrm{phase}}$ and $D^{\mathrm{split}}$ become informative signals for depth-wise routing.

This account is consistent with the observed scaling pattern: WAV v1 underperforms at 12 layers, is roughly tied at 24 layers, and clearly improves at 48 layers. The result also suggests that future work should not evaluate residual-routing methods only at shallow depth, because their benefits may be tied to the length and structure of the residual trajectory.

### 5\.2 Comparison to Residual Scaling

ReZero and LayerScale are designed to improve residual optimization by controlling update magnitude\. In our experiments, however, both are consistently weaker than Block AttnRes and WAV v1 at 48 layers\. This indicates that the central issue is not only how much residual signal to add, but which residual information to read\. WAV v1 directly changes the representation available to the depth\-wise mixer, while residual scaling methods preserve the standard sequential residual path\.

### 5\.3 Relation to Delta Routing

Delta Attention Residuals argue that cumulative hidden states can be redundant and that residual deltas are more structurally diverse \[[4](https://arxiv.org/html/2606.06564#bib.bib4)\]. WAV v1 is consistent with this perspective but uses a different efficiency trade-off. Instead of routing over every individual delta, WAV v1 compresses deltas inside a block into a low-frequency state $C_b$ and two coarse detail components. Thus it can be viewed as a block-level, multi-resolution approximation to delta routing.
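As a small worked example of this compression, assume for illustration only a block of two Transformer layers, so that $m=4$ and the updates are ordered attention, MLP, attention, MLP. Substituting the signs of Eqs. (6) and (8) into Eqs. (1), (7), and (9) gives

$$
\begin{aligned}
C_b &= u_{b,1}+u_{b,2}+u_{b,3}+u_{b,4},\\
D^{\mathrm{phase}}_b &= u_{b,1}-u_{b,2}+u_{b,3}-u_{b,4},\\
D^{\mathrm{split}}_b &= u_{b,1}+u_{b,2}-u_{b,3}-u_{b,4}.
\end{aligned}
$$

These are three of the four mutually orthogonal $\pm 1$ sign patterns on four elements. The remaining pattern, $u_{b,1}-u_{b,2}-u_{b,3}+u_{b,4}$, is not stored, so in this setting the individual deltas cannot be recovered exactly from the three bases, and the RMS matching of Eq. (11) further rescales the detail components. The block bases therefore approximate routing over the four deltas rather than reproducing it.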

## 6 Limitations and Future Work

This paper is an initial arXiv draft\. There are several limitations that should be addressed before formal conference submission\. First, the current draft reports consolidated validation losses but does not yet include standard deviations across seeds\. Second, the experiments are limited to small character\-level language models; larger\-scale token\-level language modeling should be evaluated\. Third, WAV v1 currently uses two hand\-designed detail bases\. Future versions should study soft orthogonal detail bases, polarity\-aligned detail routing, and learnable block wavelet bases\. Finally, computational measurements should be reported with diagnostics disabled and with a fused or cached implementation to separate method cost from Python overhead\.

## 7 Conclusion

We presented WAV v1, a lightweight multi-resolution extension of Block Attention Residuals for deep decoder-only Transformers. Instead of representing each residual block only by its total update $C_b$, WAV v1 additionally stores phase and split detail bases that expose attention-vs-MLP and early-vs-late directional structure. Preliminary experiments on TinyStories and Text8 show that WAV v1 has a strong depth-dependent benefit: it is not useful at 12 layers, becomes competitive at 24 layers, and achieves the best final validation loss at 48 layers. These results suggest that deep residual routing should preserve not only coarse block states but also structured residual directions.

## Appendix A Implementation Notes

#### Online basis update\.

For each sublayer update $u$, WAV v1 updates $C$, $D^{\mathrm{phase}}$, and $D^{\mathrm{split}}$ online. Attention updates are added with positive phase sign, while MLP updates are added with negative phase sign. Updates in the first half of a block use positive split sign, while those in the second half use negative split sign.
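The online update described above might be organized as a small accumulator whose running values play the role of the partial sources $P_C$, $\tilde{P}^{\mathrm{phase}}$, and $\tilde{P}^{\mathrm{split}}$ before RMS matching. The class name `OnlineBlockBasis`, its interface, and the assumption that the block size $m$ is known in advance are illustrative choices, not the authors' implementation.

```python
class OnlineBlockBasis:
    """Running C, D_phase, D_split for one block of known size m (a sketch)."""

    def __init__(self, dim, block_size):
        self.m = block_size   # number of sublayer updates per block
        self.i = 0            # updates seen so far in this block
        self.C = [0.0] * dim
        self.D_phase = [0.0] * dim
        self.D_split = [0.0] * dim

    def add(self, u, is_attention):
        """Fold one sublayer update into the three running bases."""
        self.i += 1
        a = 1.0 if is_attention else -1.0          # phase sign
        r = 1.0 if self.i <= self.m / 2 else -1.0  # split sign
        for k, v in enumerate(u):
            self.C[k] += v
            self.D_phase[k] += a * v
            self.D_split[k] += r * v

    def is_complete(self):
        return self.i == self.m

basis = OnlineBlockBasis(dim=2, block_size=4)
basis.add([1.0, 0.0], is_attention=True)
# Snapshot of the partial bases that a following MLP sublayer could read.
partial = (list(basis.C), list(basis.D_phase), list(basis.D_split))
basis.add([0.0, 1.0], is_attention=False)
```

Each update touches the three accumulators once, so the extra work per sublayer is three vector additions regardless of how many updates the block already contains.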

#### Default safety settings\.

The preliminary implementation uses detail bias $-2.0$, RMS matching with a clipped detached scale, and disables detail sources in the final output mixer. These choices make the model close to Block AttnRes at initialization.

#### Recommended additional ablations\.

The next arXiv update should include: (1) $D^{\mathrm{phase}}$ only, (2) $D^{\mathrm{split}}$ only, (3) both details without RMS matching, (4) detail bias sensitivity, and (5) varying the number of residual blocks.

## Appendix B Reproducibility Checklist

The draft package includes the parsed result CSV files and plotting scripts\. Raw training logs, seed\-wise standard deviations, and the exact training code should be released with the next version\.

## References

1. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*.
2. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In *Advances in Neural Information Processing Systems*.
3. Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su, and others. 2026. Attention residuals. *arXiv preprint arXiv:2603.15031*.
4. Cheng Luo, Zefan Cai, and Junjie Hu. 2026. Delta attention residuals. *arXiv preprint arXiv:2605.18855*.
5. Thomas Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W. Cottrell, and Julian McAuley. 2020. ReZero is all you need: Fast convergence at large depth. *arXiv preprint arXiv:2003.04887*.
6. Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. 2021. Going deeper with image transformers. In *Proceedings of the IEEE/CVF International Conference on Computer Vision*.
7. Biao Zhang and Rico Sennrich. 2019. Root mean square layer normalization. In *Advances in Neural Information Processing Systems*.
8. Noam Shazeer. 2020. GLU variants improve Transformer. *arXiv preprint arXiv:2002.05202*.
9. Ronen Eldan and Yuanzhi Li. 2023. TinyStories: How small can language models be and still speak coherent English? *arXiv preprint arXiv:2305.07759*.
10. Matt Mahoney. 2011. Large text compression benchmark. *http://mattmahoney.net/dc/textdata.html*.
