WAV: Multi-Resolution Block Residual Routing for Deep Decoder-Only Transformers
Summary
This paper introduces Multi-Resolution Residual Routing (WAV v1), an extension of Block Attention Residuals that augments block representations with directional detail bases, improving deep decoder-only Transformer training.
# Multi-Resolution Residual Routing for Deep Decoder-Only Transformers
Source: [https://arxiv.org/html/2606.06564](https://arxiv.org/html/2606.06564)
###### Abstract
Residual connections are central to training deep Transformers, but standard PreNorm residual streams aggregate sublayer updates with fixed unit weights. Recent Attention Residuals replace this fixed accumulation with content-dependent depth-wise routing, and Block Attention Residuals make the mechanism efficient by routing over block-level residual summaries. However, a single block summary stores only the low-frequency, total residual displacement inside a block, discarding directional structure such as attention-vs-MLP imbalance and early-vs-late block dynamics. We propose *Multi-Resolution Residual Routing*, instantiated as WAV v1, a lightweight extension of Block Attention Residuals that augments every block representation with two zero-sum directional detail bases: a phase basis contrasting attention and MLP updates, and a split basis contrasting the first and second halves of a block. These detail bases are routed by the same depth-wise softmax mixer as the block summary, but are introduced with a negative bias and RMS matching to preserve early training stability. On character-level GPT decoder-only language modeling with TinyStories and Text8, WAV v1 shows a strong depth-dependent trend: it is not beneficial at 12 layers, becomes competitive at 24 layers, and achieves the best validation loss at 48 layers on both datasets. At 48 layers, WAV v1 improves over Block AttnRes by 0.0222 validation loss on TinyStories and 0.0057 on Text8, while leaving the attention and MLP modules unchanged. These preliminary results suggest that deep residual routing benefits not only from selecting *which block* to read, but also from routing the internal directional structure of each block.
## 1 Introduction
Modern decoder\-only Transformers are typically trained with PreNorm residual connections, where each sublayer contributes an additive update to the residual stream\. This simple structure has enabled stable training of very deep networks, but it also imposes a rigid aggregation rule: every residual update is accumulated with fixed unit weight\. As depth increases, this uniform accumulation can dilute individual layer contributions and make the residual stream increasingly redundant\. Attention Residuals address this issue by replacing fixed residual accumulation with learned softmax attention over previous layer outputs, enabling each layer to perform content\-dependent depth\-wise routing\[[3](https://arxiv.org/html/2606.06564#bib.bib3)\]\. Block Attention Residuals further reduce the memory and communication overhead by compressing multiple layers into block\-level residual summaries\.
The central hypothesis of this work is that a single block summary is still an incomplete representation of the residual trajectory inside a block. If a block contains a sequence of sublayer updates $\{u_{b,i}\}_{i=1}^{m}$, Block AttnRes stores only their sum, $C_b = \sum_i u_{b,i}$. This is analogous to preserving a low-frequency or DC component of the block trajectory. Yet the internal shape of the trajectory may carry useful information. For example, the block may be attention-dominant or MLP-dominant; its early updates may point in a direction different from its late updates. Such directional information is lost when the block is represented only by $C_b$.
We introduce WAV v1, a minimal multi-resolution extension of Block AttnRes. For every block, WAV v1 stores the original block sum $C_b$ together with two zero-sum detail bases: $D^{\mathrm{phase}}_b$, the difference between attention and MLP updates, and $D^{\mathrm{split}}_b$, the difference between the first-half and second-half updates. These bases are cheap to compute because they are accumulated online as residual updates are produced. They are also conservative: the main attention and MLP functions are unchanged, detail sources receive a negative initial bias, details are RMS-matched to the corresponding block summary, and the final prediction mixer does not use detail sources by default.
Our preliminary experiments evaluate 12\-, 24\-, and 48\-layer GPT decoder\-only models on TinyStories and Text8\. The results reveal a clear depth\-dependent pattern\. At 12 layers, WAV v1 underperforms Block AttnRes, suggesting that directional detail sources are not useful when residual depth is shallow\. At 24 layers, WAV v1 becomes competitive\. At 48 layers, WAV v1 is consistently best, improving over Block AttnRes on both datasets and substantially outperforming ReZero and LayerScale\. This pattern supports the view that multi\-resolution residual information becomes increasingly valuable as the residual trajectory length grows\.
This draft makes three contributions:
1. We formulate block residual routing as a multi-resolution representation problem: block summaries provide low-frequency state information, while directional detail bases provide intra-block trajectory information.
2. We propose WAV v1, a drop-in extension of Block AttnRes that augments each block with phase and split detail bases while preserving the original attention and MLP modules.
3. We provide preliminary scaling evidence on two character-level language modeling datasets, showing that WAV v1 becomes stronger with depth and achieves the best 48-layer validation losses among the evaluated residual mechanisms.
## 2 Related Work
#### Residual connections in deep networks\.
Residual learning was popularized by ResNets \[[1](https://arxiv.org/html/2606.06564#bib.bib1)\], where identity skip connections make optimization of very deep networks substantially easier. Transformers \[[2](https://arxiv.org/html/2606.06564#bib.bib2)\] inherit this residual principle and commonly use PreNorm variants for stability. However, standard residual addition fixes the aggregation coefficient of every update to one, which may become suboptimal in very deep architectures.
#### Learned residual scaling\.
Several methods improve deep training by scaling residual updates\. ReZero introduces a scalar residual gate initialized at zero, enabling stable signal propagation at large depth\[[5](https://arxiv.org/html/2606.06564#bib.bib5)\]\. LayerScale uses per\-channel learnable residual scaling and has been shown to improve deep vision transformers\[[6](https://arxiv.org/html/2606.06564#bib.bib6)\]\. These methods modulate residual magnitude, but they do not perform content\-dependent selection over previous residual states\.
#### Attention Residuals and block routing\.
Attention Residuals replace additive residual accumulation with learned attention over preceding representations\[[3](https://arxiv.org/html/2606.06564#bib.bib3)\]\. Block AttnRes compresses this mechanism by grouping layers into blocks and routing over block\-level representations\. This greatly reduces the cost relative to attending over every preceding layer\. A related recent direction, Delta Attention Residuals, routes over residual deltas instead of cumulative hidden states, emphasizing that the choice of routed representation is crucial\[[4](https://arxiv.org/html/2606.06564#bib.bib4)\]\. Our work is complementary: we keep the block\-level efficiency of Block AttnRes but enrich each block representation with structured directional details\.
## 3 Method
### 3\.1 Preliminaries: Block Residual Routing
Consider a decoder-only Transformer with $L$ layers. Each Transformer layer contains an attention sublayer and an MLP sublayer. We treat every sublayer output as a residual update. For a block $b$ containing $m$ sublayer updates $\{u_{b,i}\}_{i=1}^{m}$, Block AttnRes stores a block-level residual summary
$$C_b = \sum_{i=1}^{m} u_{b,i}. \tag{1}$$
At a later sublayer, the depth-wise mixer receives a source set such as
$$\mathcal{S}_{\mathrm{block}} = \{e, C_0, C_1, \ldots, C_{b-1}, P_C\}, \tag{2}$$
where $e$ is the token embedding source and $P_C$ is the current partial block sum.
Given source tensors $\{s_j\}_{j=1}^{S}$, a query vector $q$, and source biases $\{\beta_j\}$, the mixer computes
$$\ell_j = q^{\top} \operatorname{RMSNorm}(s_j) + \beta_j, \tag{3}$$
$$\alpha_j = \frac{\exp(\ell_j)}{\sum_{k=1}^{S} \exp(\ell_k)}, \tag{4}$$
$$h = \sum_{j=1}^{S} \alpha_j s_j. \tag{5}$$
The resulting $h$ is used as the context input for either the attention or MLP sublayer.
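The mixer in Eqs. (3)-(5) can be sketched for a single token position as follows. This is a minimal illustration and not the authors' implementation: plain Python lists stand in for tensors, `rms_norm` is assumed to have no learned gain, and the names `rms_norm` and `depth_mix` are hypothetical.

```python
import math


def rms_norm(x, eps=1e-6):
    """Scale a vector to unit root-mean-square (no learned gain in this sketch)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / (rms + eps) for v in x]


def depth_mix(sources, query, biases):
    """Return (weights, mixed) for one token, following Eqs. (3)-(5)."""
    # Eq. (3): logit_j = q . RMSNorm(s_j) + beta_j
    logits = [
        sum(q * v for q, v in zip(query, rms_norm(s))) + b
        for s, b in zip(sources, biases)
    ]
    # Eq. (4): softmax over sources, shifted by the max for numerical stability
    top = max(logits)
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Eq. (5): weighted sum of the raw (un-normalized) sources
    dim = len(sources[0])
    mixed = [sum(w * s[d] for w, s in zip(weights, sources)) for d in range(dim)]
    return weights, mixed


# Toy source set {e, C_0, P_C} with a zero query, so every source gets weight 1/3.
e = [1.0, 0.0, 0.0, 0.0]
c0 = [0.0, 2.0, 0.0, 0.0]
p_c = [0.0, 0.0, 0.5, 0.0]
weights, h = depth_mix([e, c0, p_c], query=[0.0] * 4, biases=[0.0] * 3)
```

Note that normalization enters only the logits; the mixed output is a convex combination of the raw sources, as in Eq. (5).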
### 3\.2 Multi\-Resolution Block Basis
WAV v1 preserves the original block summary $C_b$ but augments it with two zero-sum detail bases. Let $a_i \in \{+1, -1\}$ indicate whether update $u_{b,i}$ comes from an attention sublayer or an MLP sublayer:
$$a_i = \begin{cases} +1, & u_{b,i} \text{ is from attention}, \\ -1, & u_{b,i} \text{ is from MLP}. \end{cases} \tag{6}$$
The phase basis is
$$D^{\mathrm{phase}}_b = \sum_{i=1}^{m} a_i u_{b,i}. \tag{7}$$
This basis captures whether the block's residual displacement is dominated by attention-like or MLP-like updates.
Similarly, let $r_i \in \{+1, -1\}$ indicate whether a sublayer update is in the first or second half of the block:
$$r_i = \begin{cases} +1, & i \leq m/2, \\ -1, & i > m/2. \end{cases} \tag{8}$$
The split basis is
$$D^{\mathrm{split}}_b = \sum_{i=1}^{m} r_i u_{b,i}. \tag{9}$$
This basis captures coarse early-vs-late movement inside the block.
The resulting source set is
$$\mathcal{S}_{\mathrm{WAV}} = \{e,\; C_0, \tilde{D}^{\mathrm{phase}}_0, \tilde{D}^{\mathrm{split}}_0, \ldots, C_{b-1}, \tilde{D}^{\mathrm{phase}}_{b-1}, \tilde{D}^{\mathrm{split}}_{b-1},\; P_C, \tilde{P}^{\mathrm{phase}}, \tilde{P}^{\mathrm{split}}\}. \tag{10}$$
Here tildes mark detail sources after the RMS matching of Section 3.3, and $\tilde{P}^{\mathrm{phase}}$ and $\tilde{P}^{\mathrm{split}}$ are the detail counterparts of the partial block sum $P_C$.
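To make concrete what the detail bases retain, the sketch below computes $C_b$, $D^{\mathrm{phase}}_b$, and $D^{\mathrm{split}}_b$ from Eqs. (1), (7), and (9) for two toy blocks that share the same block sum. The function name `block_bases`, the `"attn"`/`"mlp"` labels, and the two-dimensional toy updates are illustrative assumptions, not part of the paper.

```python
def block_bases(updates, kinds):
    """Return (C, D_phase, D_split) for one block.

    updates: list of equal-length vectors u_{b,1}, ..., u_{b,m}
    kinds:   list of "attn" / "mlp" labels, one per update
    """
    m = len(updates)
    dim = len(updates[0])
    c = [0.0] * dim
    d_phase = [0.0] * dim
    d_split = [0.0] * dim
    for i, (u, kind) in enumerate(zip(updates, kinds), start=1):
        a = 1.0 if kind == "attn" else -1.0  # phase sign, Eq. (6)
        r = 1.0 if i <= m / 2 else -1.0      # split sign, Eq. (8)
        for d in range(dim):
            c[d] += u[d]                     # Eq. (1)
            d_phase[d] += a * u[d]           # Eq. (7)
            d_split[d] += r * u[d]           # Eq. (9)
    return c, d_phase, d_split


kinds = ["attn", "mlp", "attn", "mlp"]  # a block of two Transformer layers

# Block X: attention and MLP write to different coordinates, evenly over time.
c_x, phase_x, split_x = block_bases(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]], kinds
)
# Block Y: the same total displacement, all of it written in the first half.
c_y, phase_y, split_y = block_bases(
    [[0.0, 2.0], [2.0, 0.0], [0.0, 0.0], [0.0, 0.0]], kinds
)
```

Both blocks have $C_b = (2, 2)$, so a summary-only mixer cannot tell them apart, while the phase and split bases differ. The signs sum to zero whenever a block holds equally many attention and MLP updates and $m$ is even, which is the sense in which the bases are zero-sum.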
### 3\.3 Stable Detail Injection
Directly adding detail sources to the depth mixer can destabilize early training\. We therefore use two conservative mechanisms\.
#### Negative detail bias\.
The two detail sources are initialized with a negative source bias, $\beta_D = -2.0$, while the embedding and $C$ sources use zero bias. This makes WAV v1 close to Block AttnRes at initialization and lets the model gradually increase detail usage only when beneficial.
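As a rough illustration of how strongly a bias of $-2.0$ suppresses detail sources, the sketch below evaluates the softmax weights under the simplifying assumption that all content logits $q^{\top}\operatorname{RMSNorm}(s_j)$ are zero, so only the biases matter. How the query is initialized is not stated in the text, so this is an idealized case; `initial_weights` is a hypothetical helper.

```python
import math


def initial_weights(n_state, n_detail, detail_bias=-2.0):
    """Per-source softmax weights when every content logit is zero.

    n_state:  number of zero-bias sources (embedding, C, partial sum)
    n_detail: number of detail sources sharing `detail_bias`
    """
    state = math.exp(0.0)
    detail = math.exp(detail_bias)
    total = n_state * state + n_detail * detail
    return state / total, detail / total


# One completed block plus the partial block: 3 state and 4 detail sources.
w_state, w_detail = initial_weights(n_state=3, n_detail=4)
detail_share = 4 * w_detail   # total routing mass on details, about 0.15
unbiased_share = 4.0 / 7.0    # the same share if details had zero bias
```

Under this assumption the four detail sources together receive about 15% of the routing mass instead of about 57%, so the mixer starts close to, but not exactly at, Block AttnRes.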
#### Detached RMS matching\.
For a detail tensor $D$ associated with block summary $C$, we compute
$$\tilde{D} = D \cdot \operatorname{stopgrad}\left(\operatorname{clip}\left(\frac{\operatorname{RMS}(C)}{\operatorname{RMS}(D) + \epsilon}, \frac{1}{\rho}, \rho\right)\right), \tag{11}$$
where $\rho$ is a maximum scale factor. This prevents detail sources from becoming active only because their raw scale is larger than $C$. In our implementation, the final prediction mixer reads only embedding and $C$ sources, not detail sources.
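A minimal sketch of the rescaling in Eq. (11) is given below. The values `rho=4.0` and `eps=1e-6` are illustrative assumptions, since the paper does not state them here, and the stop-gradient is implicit because plain Python carries no gradients; in an autograd framework the scale would be detached before the multiplication.

```python
import math


def rms(x):
    """Root-mean-square of a vector."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def rms_match(detail, summary, rho=4.0, eps=1e-6):
    """Rescale `detail` toward the RMS of `summary`, as in Eq. (11).

    rho and eps are placeholder values chosen for this sketch.
    """
    scale = rms(summary) / (rms(detail) + eps)
    scale = min(max(scale, 1.0 / rho), rho)  # clip to [1/rho, rho]
    return [v * scale for v in detail]


summary = [1.0, 1.0, 1.0, 1.0]  # RMS 1

# A detail twice as large as the summary is scaled down to match it.
matched = rms_match([2.0, -2.0, 2.0, -2.0], summary)

# A tiny detail would need a scale of about 100, which the clip caps at rho.
clipped = rms_match([0.01, -0.01, 0.01, -0.01], summary)
```

The clip means a near-zero detail is not amplified to full summary scale, which limits how much noise the matching step can inject.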
Figure 1: Detailed overview of WAV v1. (a) Within each residual block, sublayer updates are accumulated into one state basis $C_b$ and two directional detail bases, $D^{\mathrm{phase}}_b$ and $D^{\mathrm{split}}_b$. (b) Compared with Block AttnRes, the depth-wise mixer receives an expanded source pool that includes completed-block and partial-block detail sources. (c) During a forward step, the MLP branch reads the partial basis after the attention update has been written; the final prediction readout uses only embedding and $C$ sources. Detail sources are stabilized by a negative initial bias and detached RMS matching.
### 3\.4 Computational Cost
WAV v1 leaves the attention, MLP, token embedding, and output head unchanged. Its additional cost comes from increasing the number of block-level routing sources. If a model has $N$ residual blocks, Block AttnRes routes over roughly $N$ block summaries, while WAV v1 routes over roughly $3N$ block-basis sources, a constant-factor increase that remains $O(N)$. The cost stays at block level, rather than layer level, because the mixer does not attend over all previous sublayer states. The additional parameters are negligible: each layer has two detail biases for the attention mixer and two for the MLP mixer, i.e., four scalar parameters per Transformer layer.
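The source counts implied by Eqs. (2) and (10) and the bias parameter count can be tallied directly. The helper names below are hypothetical, and the sketch counts routing sources only; it says nothing about wall-clock cost.

```python
def source_counts(completed_blocks):
    """Number of mixer sources once `completed_blocks` blocks are finished."""
    # Eq. (2): e, C_0 .. C_{b-1}, P_C
    block_attnres = 1 + completed_blocks + 1
    # Eq. (10): each C and the partial sum P_C gain two detail sources
    wav = 1 + 3 * completed_blocks + 3
    return block_attnres, wav


def detail_bias_params(num_layers):
    """Four scalar detail biases per Transformer layer (two per mixer)."""
    return 4 * num_layers


block_sources, wav_sources = source_counts(completed_blocks=3)
extra_params_48 = detail_bias_params(48)
```

With three completed blocks this gives 5 sources for Block AttnRes and 13 for WAV v1, and a 48-layer model adds 192 scalar biases.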
## 4 Experiments
### 4\.1 Setup
We evaluate character-level GPT decoder-only language models on TinyStories \[[9](https://arxiv.org/html/2606.06564#bib.bib9)\] and Text8 \[[10](https://arxiv.org/html/2606.06564#bib.bib10)\]. All models use PreNorm RMSNorm \[[7](https://arxiv.org/html/2606.06564#bib.bib7)\], causal self-attention, and SwiGLU MLPs \[[8](https://arxiv.org/html/2606.06564#bib.bib8)\]. We compare five residual mechanisms: Standard Residual, Block AttnRes, ReZero, LayerScale, and WAV v1. The current arXiv draft reports the validation losses consolidated in our experiment summary; error bars will be added after raw logs are fully consolidated.
Table 1: Experimental setup used for the preliminary arXiv draft.
### 4\.2 Main Results
Table 2: Final validation loss at 50k steps. Lower is better. $\Delta$ vs Block is WAV v1 minus Block AttnRes, so negative values indicate improvement. PPL reduction is computed from validation loss as $1 - \exp(\mathcal{L}_{\mathrm{WAV}}) / \exp(\mathcal{L}_{\mathrm{Block}})$.
Table [2](https://arxiv.org/html/2606.06564#S4.T2) shows the final validation losses. The most important result is the depth-dependent trend. At 12 layers, WAV v1 is worse than Block AttnRes on both datasets. At 24 layers, WAV v1 becomes competitive: it slightly improves TinyStories and is close on Text8. At 48 layers, WAV v1 is the best method on both datasets. On TinyStories, it reduces validation loss from 0.4960 to 0.4738 relative to Block AttnRes. On Text8, it reduces validation loss from 0.9363 to 0.9305.
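The PPL reduction formula in the Table 2 caption can be applied to the 48-layer losses quoted above. The sketch uses the rounded losses from the text, so the last digit may differ from values computed on unrounded logs; `ppl_reduction` is a hypothetical helper name.

```python
import math


def ppl_reduction(loss_wav, loss_block):
    """Relative perplexity reduction: 1 - exp(L_WAV) / exp(L_Block)."""
    return 1.0 - math.exp(loss_wav - loss_block)


# 48-layer validation losses quoted in the text (WAV v1, Block AttnRes).
tinystories = ppl_reduction(0.4738, 0.4960)  # roughly 2.2%
text8 = ppl_reduction(0.9305, 0.9363)        # roughly 0.6%
```

Because the loss gaps are small, the relative perplexity reduction is close to the loss difference itself.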
Figure 2: Depth-dependent gain of WAV v1 over Block AttnRes. Negative values mean WAV v1 is better. The method is not beneficial at shallow depth but becomes clearly stronger at 48 layers.
Figure 3: Final validation loss across depths and datasets. WAV v1 is strongest in the 48-layer regime, while Block AttnRes is strongest or competitive at shallower depths.
### 4\.3 Training Dynamics at 48 Layers
Figure [4](https://arxiv.org/html/2606.06564#S4.F4) compares 48-layer training curves. On TinyStories, WAV v1 separates from Block AttnRes early and maintains a persistent advantage throughout training. On Text8, the advantage is smaller but consistent by the end of training. These curves suggest that the improvement is not only a late-stage artifact: multi-resolution sources can affect the optimization trajectory once the model is sufficiently deep.
Figure 4: Validation loss curves for 48-layer models. WAV v1 obtains the best final validation loss on both TinyStories and Text8.
### 4\.4 Ranking Summary
Table 3: Best and second-best methods by final validation loss.
The ranking in Table [3](https://arxiv.org/html/2606.06564#S4.T3) highlights a key qualitative distinction. Block AttnRes is highly competitive at shallow and medium depth, but WAV v1 becomes strongest in the deepest evaluated configuration. This supports the interpretation that direction-aware block bases are most useful when each block summarizes a longer residual trajectory.
## 5 Analysis
### 5\.1 Why Does WAV Improve with Depth?
The detail bases in WAV v1 are zero-sum directional summaries. They do not replace the block state $C_b$; instead, they expose structured deviations around it. With very shallow models, there are few completed blocks and limited residual history, so detail sources may behave like noise or redundant features. With deeper models, each block contains more sublayer updates and later layers can access more completed multi-resolution block summaries. Under this regime, $D^{\mathrm{phase}}$ and $D^{\mathrm{split}}$ become informative signals for depth-wise routing.
This account is consistent with the observed scaling pattern: WAV v1 underperforms at 12 layers, is roughly tied at 24 layers, and clearly improves at 48 layers. The result also suggests that future work should not evaluate residual-routing methods only at shallow depth, because their benefits may be tied to the length and structure of the residual trajectory.
### 5\.2 Comparison to Residual Scaling
ReZero and LayerScale are designed to improve residual optimization by controlling update magnitude\. In our experiments, however, both are consistently weaker than Block AttnRes and WAV v1 at 48 layers\. This indicates that the central issue is not only how much residual signal to add, but which residual information to read\. WAV v1 directly changes the representation available to the depth\-wise mixer, while residual scaling methods preserve the standard sequential residual path\.
### 5\.3 Relation to Delta Routing
Delta Attention Residuals argue that cumulative hidden states can be redundant and that residual deltas are more structurally diverse \[[4](https://arxiv.org/html/2606.06564#bib.bib4)\]. WAV v1 is consistent with this perspective but uses a different efficiency trade-off. Instead of routing over every individual delta, WAV v1 compresses deltas inside a block into a low-frequency state $C_b$ and two coarse detail components. Thus it can be viewed as a block-level, multi-resolution approximation to delta routing.
## 6 Limitations and Future Work
This paper is an initial arXiv draft\. There are several limitations that should be addressed before formal conference submission\. First, the current draft reports consolidated validation losses but does not yet include standard deviations across seeds\. Second, the experiments are limited to small character\-level language models; larger\-scale token\-level language modeling should be evaluated\. Third, WAV v1 currently uses two hand\-designed detail bases\. Future versions should study soft orthogonal detail bases, polarity\-aligned detail routing, and learnable block wavelet bases\. Finally, computational measurements should be reported with diagnostics disabled and with a fused or cached implementation to separate method cost from Python overhead\.
## 7 Conclusion
We presented WAV v1, a lightweight multi-resolution extension of Block Attention Residuals for deep decoder-only Transformers. Instead of representing each residual block only by its total update $C_b$, WAV v1 additionally stores phase and split detail bases that expose attention-vs-MLP and early-vs-late directional structure. Preliminary experiments on TinyStories and Text8 show that WAV v1 has a strong depth-dependent benefit: it is not useful at 12 layers, becomes competitive at 24 layers, and achieves the best final validation loss at 48 layers. These results suggest that deep residual routing should preserve not only coarse block states but also structured residual directions.
## Appendix A Implementation Notes
#### Online basis update\.
For each sublayer update $u$, WAV v1 updates $C$, $D^{\mathrm{phase}}$, and $D^{\mathrm{split}}$ online. Attention updates are added with positive phase sign, while MLP updates are added with negative phase sign. Updates in the first half of a block use positive split sign, while those in the second half use negative split sign.
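The online update described above can be sketched as a small accumulator that is fed one sublayer update at a time and can be read at any point as the partial basis. The class name `BlockBasis` and its fields are hypothetical, and the sketch assumes the block length `m` is known in advance so that the split sign can be assigned by position.

```python
class BlockBasis:
    """Running C, D_phase, and D_split for one block of `m` sublayer updates."""

    def __init__(self, dim, m):
        self.m = m
        self.count = 0
        self.c = [0.0] * dim
        self.d_phase = [0.0] * dim
        self.d_split = [0.0] * dim

    def add(self, update, is_attention):
        """Fold one sublayer update into all three running sums."""
        self.count += 1
        phase_sign = 1.0 if is_attention else -1.0
        split_sign = 1.0 if self.count <= self.m / 2 else -1.0
        for d, v in enumerate(update):
            self.c[d] += v
            self.d_phase[d] += phase_sign * v
            self.d_split[d] += split_sign * v


basis = BlockBasis(dim=2, m=4)
basis.add([1.0, 0.0], is_attention=True)
basis.add([0.0, 1.0], is_attention=False)
# Snapshot of the partial basis halfway through the block.
half = (list(basis.c), list(basis.d_phase), list(basis.d_split))
basis.add([1.0, 0.0], is_attention=True)
basis.add([0.0, 1.0], is_attention=False)
```

One consequence of the sign convention, under this reading of the split rule, is that during the first half of a block the partial split basis equals the partial sum, so it only starts to carry distinct information once second-half updates arrive.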
#### Default safety settings\.
The preliminary implementation uses detail bias $-2.0$, RMS matching with a clipped detached scale, and disables detail sources in the final output mixer. These choices make the model close to Block AttnRes at initialization.
#### Recommended additional ablations\.
The next arXiv update should include: (1) $D^{\mathrm{phase}}$ only, (2) $D^{\mathrm{split}}$ only, (3) both details without RMS matching, (4) detail bias sensitivity, and (5) varying the number of residual blocks.
## Appendix B Reproducibility Checklist
The draft package includes the parsed result CSV files and plotting scripts\. Raw training logs, seed\-wise standard deviations, and the exact training code should be released with the next version\.
## References
- \[1\] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*.
- \[2\] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In *Advances in Neural Information Processing Systems*.
- \[3\] Kimi Team, Guangyu Chen, Yu Zhang, Jianlin Su, and others. 2026. Attention residuals. *arXiv preprint arXiv:2603.15031*.
- \[4\] Cheng Luo, Zefan Cai, and Junjie Hu. 2026. Delta attention residuals. *arXiv preprint arXiv:2605.18855*.
- \[5\] Thomas Bachlechner, Bodhisattwa Prasad Majumder, Huanru Henry Mao, Garrison W. Cottrell, and Julian McAuley. 2020. ReZero is all you need: Fast convergence at large depth. *arXiv preprint arXiv:2003.04887*.
- \[6\] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. 2021. Going deeper with image transformers. In *Proceedings of the IEEE/CVF International Conference on Computer Vision*.
- \[7\] Biao Zhang and Rico Sennrich. 2019. Root mean square layer normalization. In *Advances in Neural Information Processing Systems*.
- \[8\] Noam Shazeer. 2020. GLU variants improve Transformer. *arXiv preprint arXiv:2002.05202*.
- \[9\] Ronen Eldan and Yuanzhi Li. 2023. TinyStories: How small can language models be and still speak coherent English? *arXiv preprint arXiv:2305.07759*.
- \[10\] Matt Mahoney. 2011. Large text compression benchmark. http://mattmahoney.net/dc/textdata.html