# Adaptive Computation Depth via Learned Token Routing in Transformers
Source: [https://arxiv.org/html/2605.05222](https://arxiv.org/html/2605.05222)
Ahmed Abdelmuniem Abdalla Mohammed, Independent Researcher, ahmed.abdelmuniem@gmail.com, [ORCID: 0009-0008-7410-6621](https://orcid.org/0009-0008-7410-6621)
###### Abstract
Standard transformer architectures apply the same number of layers to every token regardless of contextual difficulty. We present Token-Selective Attention (TSA), a learned per-token gate on residual updates between consecutive transformer blocks. Each gate is a lightweight two-layer multi-layer perceptron (MLP) that produces a continuous halting probability, making the mechanism end-to-end differentiable with 1.7% parameter overhead and no changes to the base architecture. Notably, TSA learns difficulty-proportional routing without any explicit depth pressure: even at $\lambda = 0$ (no depth regularisation), the task-loss gradient alone drives the router to skip 20% of token-layer operations. On character-level language modeling, TSA saved 14–23% of token-layer operations (TLOps) across Tiny-Shakespeare and enwik8 at <0.5% quality loss. At matched efficiency, TSA achieved 0.7% lower validation loss than early exit, and the learned routing transfers directly to inference-time sparse execution for real wall-clock speedup.
Keywords: adaptive computation, token routing, sparse transformers, efficient inference, depth regularisation
## 1 Introduction
Transformer language models (Vaswani et al., [2017](https://arxiv.org/html/2605.05222#bib.bib6)) apply a fixed number of layers to every token in every sequence. This design trades per-token adaptability for architectural simplicity. In practice, the trade-off is costly: a common token in a predictable context requires far less processing than a rare token in a novel construction, yet both receive identical compute at every layer of every forward pass.
The inefficiency is particularly consequential at inference scale. For large deployed models, the dominant cost is the forward pass through all layers for all tokens. If a significant fraction of tokens could exit early without quality loss, the savings would translate directly to reduced latency and throughput gains.
Several approaches have addressed this problem. Graves ([2016](https://arxiv.org/html/2605.05222#bib.bib1)) introduced Adaptive Computation Time (ACT) for recurrent neural networks (RNNs), accumulating a halting probability across recurrent steps. Dehghani et al. ([2019](https://arxiv.org/html/2605.05222#bib.bib2)) extended the idea to depth-shared transformer layers with the Universal Transformer. More recently, Raposo et al. ([2024](https://arxiv.org/html/2605.05222#bib.bib3)) proposed Mixture-of-Depths (MoD), which routes tokens through a fixed subset of layers using hard top-$k$ selection; Bae et al. ([2025](https://arxiv.org/html/2605.05222#bib.bib4)) introduced Mixture of Recursions, which applies recursive blocks for a learned number of steps per token; and Chen et al. ([2025](https://arxiv.org/html/2605.05222#bib.bib5)) presented the Inner Thinking Transformer, which inserts additional computation steps at high-stakes positions.
We present Token-Selective Attention (TSA): a continuous soft gate on residual updates, conditioned per token on its current hidden state. The mechanism is architecturally minimal (a two-layer MLP per inter-block gap) and fully differentiable, requiring no straight-through estimators, Gumbel sampling, or reinforcement learning. Our contributions are:
- A simple, differentiable token routing mechanism that gates residual updates softly per token per layer (§[2](https://arxiv.org/html/2605.05222#S2)).
- Evidence that routing emerges from the task-loss gradient alone: at $\lambda = 0$ (no depth regularisation), the router learns to skip 20% of token-layer operations without any explicit depth pressure (§[3.4](https://arxiv.org/html/2605.05222#S3.SS4)).
- Cross-dataset validation on character-level language modeling: 14–23% of token-layer operations saved across Tiny-Shakespeare and enwik8 at <0.5% quality loss (§[3.2](https://arxiv.org/html/2605.05222#S3.SS2), §[3.3](https://arxiv.org/html/2605.05222#S3.SS3)).
- Ablations showing robustness to $\lambda$ across two orders of magnitude (§[3.4](https://arxiv.org/html/2605.05222#S3.SS4)), a quality advantage over early exit at matched efficiency (§[3.5](https://arxiv.org/html/2605.05222#S3.SS5)), and real wall-clock speedup via sparse inference on commodity hardware (§[3.6](https://arxiv.org/html/2605.05222#S3.SS6)).
## 2 Method
### 2.1 Architecture
Let a pre-norm decoder-only transformer have blocks $f_0, f_1, \ldots, f_{L-1}$, where each block applies multi-head self-attention and a feed-forward network (FFN) with residual connections and LayerNorm (Ba et al., [2016](https://arxiv.org/html/2605.05222#bib.bib12)). In TSA, a lightweight router $r_l$ is inserted after each block $f_l$ for $l = 0, \ldots, L-2$.
Block $f_0$ is the *stem* and always executes unconditionally: a bare token embedding carries no contextual signal, making a routing decision at step zero uninformative and potentially degenerate. Routing begins after the stem:
$$h \leftarrow f_0(h), \qquad p_l = r_l(h), \qquad h \leftarrow f_{l+1}(h,\, p_l), \qquad l = 0, \ldots, L-2. \tag{1}$$
Figure [1](https://arxiv.org/html/2605.05222#S2.F1) illustrates the dual-mode mechanism: soft gating during training (differentiable) and hard-threshold sparse execution at inference (real FLOPs savings).
Figure 1: TSA dual-mode architecture. A router $r_l$ reads hidden state $h$ and produces a per-token halting probability $p_l$. Left (training): all tokens always pass through attention + FFN; the residual update is soft-scaled by $(1 - p_l)$, keeping the gate differentiable so the router learns. Right (inference): attention remains dense, but tokens with $p_l > 0.5$ skip the FFN entirely via gather/scatter, yielding real FLOPs savings. The stem block $f_0$ always executes unconditionally.
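A minimal PyTorch-style sketch of the forward pass in Eq. (1): the stem runs unconditionally and every later block receives the preceding router's halting probability. The class name `TSATransformer` and the interfaces of the router and gated block (sketched in §2.2 and §2.3) are illustrative assumptions, not the authors' implementation.

```python
import torch.nn as nn

class TSATransformer(nn.Module):
    """Eq. (1): the stem block f_0 always executes; blocks f_1..f_{L-1} are gated
    by the halting probabilities of routers r_0..r_{L-2}."""
    def __init__(self, blocks, routers):
        super().__init__()
        assert len(routers) == len(blocks) - 1   # one router per inter-block gap
        self.blocks = nn.ModuleList(blocks)
        self.routers = nn.ModuleList(routers)

    def forward(self, h):
        h = self.blocks[0](h)                    # stem: unconditional
        for l, router in enumerate(self.routers):
            p = router(h)                        # per-token halting probability, shape (B, T)
            h = self.blocks[l + 1](h, p)         # gated update of block f_{l+1}
        return h
```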
### 2.2 Router Architecture
Each router is a two-layer MLP with sigmoid output:
$$r_l(h) = \sigma\!\bigl(W_l^{(2)}\,\mathrm{ReLU}(W_l^{(1)} h + b_l^{(1)}) + b_l^{(2)}\bigr), \qquad r_l(h) \in (0,1)^{B \times T}, \tag{2}$$
where the hidden dimension is $d/4$ (floored at 16). Each router adds $d^2/4 + d/2 + 1$ parameters; at $d = 256$, $L = 6$, this totals ≈83K on a 4.78M-parameter base model (1.7% overhead).
The final bias $b_l^{(2)}$ is initialised to $-1.0$, giving $\sigma(-1) \approx 0.27$ at initialisation. This bias prevents early collapse to "halt everything" before the model has learned useful representations.
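The router in Eq. (2) is small enough to write out directly; the following is a PyTorch sketch under the stated hyperparameters (hidden width $\max(d/4, 16)$, final bias initialised to $-1.0$), illustrative rather than the authors' code.

```python
import torch
import torch.nn as nn

class Router(nn.Module):
    """Per-token halting probability p_l = r_l(h) from Eq. (2)."""
    def __init__(self, d_model: int):
        super().__init__()
        hidden = max(d_model // 4, 16)           # hidden dim d/4, floored at 16
        self.fc1 = nn.Linear(d_model, hidden)
        self.fc2 = nn.Linear(hidden, 1)
        nn.init.constant_(self.fc2.bias, -1.0)   # sigma(-1) ~= 0.27 at initialisation

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (B, T, d) -> p: (B, T), a continuous halting probability per token
        return torch.sigmoid(self.fc2(torch.relu(self.fc1(h)))).squeeze(-1)
```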
### 2.3 Gated Block Update
For each routing decision $l = 0, \ldots, L-2$, the gated update of block $f_{l+1}$ is:
$$h \leftarrow h + (1 - p_l) \odot \Delta_{l+1}^{\mathrm{attn}}(h), \tag{3}$$
$$h \leftarrow h + (1 - p_l) \odot \Delta_{l+1}^{\mathrm{ffn}}(h), \tag{4}$$
where $p_l \in (0,1)^{B \times T}$ is broadcast over the model dimension $d$, and $\Delta_{l+1}^{\mathrm{attn}}$, $\Delta_{l+1}^{\mathrm{ffn}}$ are the pre-norm attention and feed-forward residual deltas of block $f_{l+1}$, respectively. When $p_l = 0$, the update is identical to the standard transformer. When $p_l = 1$, the state is unchanged and the block is effectively skipped. The interpolation is smooth, preserving gradient flow through $p_l$ during training.
### 2.4 Depth Regularisation
Without any incentive to halt, routers default to $p_l \approx 0$ and TSA reduces to a standard transformer with extra parameters. We added a depth regularisation term that gently encourages early halting:
$$\mathcal{L}_{\mathrm{depth}} = \lambda \cdot \frac{1}{L-1} \sum_{l=0}^{L-2} \overline{1 - p_l}, \tag{5}$$
where $\overline{1 - p_l}$ is the mean active fraction at layer $l$ (averaged over batch and sequence position). The total training loss is $\mathcal{L} = \mathcal{L}_{\mathrm{task}} + \mathcal{L}_{\mathrm{depth}}$. We used $\lambda = 0.001$ for the language experiments; Section [3.4](https://arxiv.org/html/2605.05222#S3.SS4) demonstrates that TSA is robust across $\lambda \in [0, 0.1]$.
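Eq. (5) reduces to a few lines; a sketch assuming the routers' per-token halting probabilities are collected into a list during the forward pass:

```python
import torch

def depth_regularisation(p_list, lam: float = 1e-3) -> torch.Tensor:
    """Eq. (5): lambda times the active fraction (1 - p_l), averaged over
    routing decisions, batch, and sequence positions."""
    active = torch.stack([(1.0 - p).mean() for p in p_list])  # one scalar per router
    return lam * active.mean()

# Total training loss (task loss assumed to be token-level cross-entropy):
# loss = task_loss + depth_regularisation(p_list, lam=0.001)
```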
### 2.5 Compute Metric
We measure compute using token-layer operations (TLOps): for each block, TLOps equals the number of tokens processed at that block. The mean active fraction across routing decisions is:
$$\alpha = \frac{1}{L-1} \sum_{l=0}^{L-2} \overline{1 - p_l}. \tag{6}$$
TLOps savings relative to the fixed-depth baseline are:
$$\Delta = 1 - \frac{1 + (L-1)\,\alpha}{L}. \tag{7}$$
The stem block (always active) is included in both numerator and denominator, making $\Delta$ a conservative estimate.
*Note on training compute.* During training, all layers execute fully: the gate scales residual updates but does not skip computation. TLOps therefore measures the effective contribution of each layer to the final representation, not actual FLOPs saved. At inference, sparse-TSA (Section [3.6](https://arxiv.org/html/2605.05222#S3.SS6)) exploits low-contribution positions via gather/scatter to achieve real compute savings.
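As a quick check of Eq. (7), the short sketch below reproduces the headline Tiny-Shakespeare figure from the operating point reported in Section 3.2:

```python
def tlops_saved(alpha: float, L: int = 6) -> float:
    """Eq. (7): fraction of token-layer operations saved vs. a fixed-depth
    baseline, counting the always-active stem block."""
    return 1.0 - (1.0 + (L - 1) * alpha) / L

# Worked example at the Section 3.2 operating point (alpha = 0.726, L = 6):
print(f"{tlops_saved(0.726):.3f}")   # 0.228 -> 22.8% TLOps saved
```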
## 3 Experiments
### 3.1 Synthetic Algorithmic Tasks
#### Setup.
We used decoder-only transformers trained on copy and sort tasks over length-10 sequences from a 32-token vocabulary. Inputs followed the format `[BOS] src [SEP] tgt [EOS]` with the loss masked on source tokens. Both baseline and TSA used $d = 128$, $L = 6$, $H = 4$, $d_{\mathrm{ff}} = 512$ (baseline: 1.20M params; TSA: 1.22M params, +1.7%). Training employed AdamW (Loshchilov and Hutter, [2019](https://arxiv.org/html/2605.05222#bib.bib11)) with $\beta = (0.9, 0.95)$, $\mathrm{lr} = 3 \times 10^{-4}$, $\lambda_{\mathrm{wd}} = 0.1$ for 10K gradient steps on 10K training sequences. We report token-level sequence accuracy on 1K held-out sequences.
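For concreteness, a small sketch of how one copy-task example and its loss mask could be constructed under the stated format; the special-token ids and the exact extent of the mask (whether EOS is scored) are assumptions, not taken from the paper:

```python
import random

BOS, SEP, EOS = 32, 33, 34          # assumed special-token ids beyond the 32-token vocabulary

def make_copy_example(length: int = 10, vocab: int = 32):
    """One [BOS] src [SEP] tgt [EOS] sequence for the copy task, with a loss
    mask that is 1 only on target positions (source tokens are not scored)."""
    src = [random.randrange(vocab) for _ in range(length)]
    tgt = list(src)                                       # copy task: target equals source
    tokens = [BOS] + src + [SEP] + tgt + [EOS]
    mask = [0] * (len(src) + 2) + [1] * (len(tgt) + 1)    # score tgt and EOS only (assumed)
    return tokens, mask
```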
#### Results.
Table 1: Synthetic Task Results ($d = 128$, $L = 6$, toy vocabulary). †TLOps saved $= 1 - (1 + (L-1)\,\alpha)/L$, including the mandatory stem block.
The routing pattern directly reflected task difficulty. Copy is an identity mapping: the router learned that nearly all tokens were fully determined after the stem block ($\alpha = 0.341$; 54.9% overall TLOps saved). Sort requires comparison and permutation, yielding $\alpha = 0.730$: more compute where the task genuinely demanded it. This difficulty-proportional allocation emerged without any explicit supervision about task identity or difficulty.
### 3.2 Character-Level Language Modeling
#### Setup.
We trained on Tiny-Shakespeare (Karpathy, [2015](https://arxiv.org/html/2605.05222#bib.bib10)) (1.1M characters, 65-character vocabulary, 80/10/10 train/val/test split). Both models used $d = 256$, $L = 6$, $H = 8$, $d_{\mathrm{ff}} = 1024$, context length 128. Training employed AdamW with a cosine learning rate schedule, batch size 64, for 5,000 gradient steps (baseline: 4.78M params; TSA: 4.86M params, +1.7%). Token embeddings were initialised without a padding index: character index 0 is the newline character (≈8% of the corpus), whose embedding gradient must not be zeroed.
#### Results.
Table 2: Language Modeling Results ($d = 256$, $L = 6$, Tiny-Shakespeare). Val loss increase: +0.006 nats (+0.4% relative). BPC = bits per character.
TSA achieved $\alpha = 0.726$: 22.8% of token-layer operations saved at a cost of 0.006 nats (+0.4%) in validation loss. Both models reached all convergence thresholds at identical step counts, indicating TSA did not impede convergence (Figure [2](https://arxiv.org/html/2605.05222#S3.F2)). The TSA curve lies consistently to the left on the compute axis, confirming that the savings remained stable throughout training.
(a) Validation loss vs. training step.
(b) Validation loss vs. cumulative TLOps ($\times 10^9$).
Figure 2: TSA (red) and baseline (blue) on Tiny-Shakespeare. Left: equivalent convergence speed. Right: TSA reaches the same loss with 22.8% fewer token-layer operations.
### 3.3 Cross-Dataset Validation: enwik8
To test whether TSA generalises beyond a single corpus, we trained on enwik8 (Hutter, [2006](https://arxiv.org/html/2605.05222#bib.bib14)): the first $10^8$ bytes of English Wikipedia (raw XML, 6,064 unique characters). This corpus is substantially more diverse than Shakespeare: it contains markup, multilingual text, tables, and mathematical notation. We used $d = 256$, $L = 6$, $H = 8$, $d_{\mathrm{ff}} = 1024$, context length 256, batch size 64, for 5,000 steps (6.35M params baseline; 6.43M TSA). Experiments were conducted on an Apple M1 Pro using MLX (Apple Machine Learning Research, [2023](https://arxiv.org/html/2605.05222#bib.bib16)).
Table 3: enwik8 Results ($d = 256$, $L = 6$, Context 256). TSA quality vs. baseline: −0.4% (TSA is marginally better; within noise).
TSA achieved $\alpha = 0.833$ on enwik8, more conservative than Shakespeare's $\alpha = 0.726$. The router allocated more compute to the structurally diverse Wikipedia corpus while still saving 13.9% of TLOps at no quality cost. Both conditions reached all convergence thresholds (≤2.5, ≤2.0, ≤1.8 BPC) at identical steps (500, 750, 1,000). The cross-dataset result confirms that the routing mechanism learned a content-dependent signal rather than overfitting to corpus-specific patterns. Training curves are presented in Figure [5](https://arxiv.org/html/2605.05222#A2.F5) (Appendix).
### 3.4 Ablation: Depth Regularisation Sensitivity
We swept $\lambda \in \{0, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5\}$ on Tiny-Shakespeare with all other hyperparameters fixed at the values in §[3.2](https://arxiv.org/html/2605.05222#S3.SS2). Figure [3(a)](https://arxiv.org/html/2605.05222#S3.F3.sf1) shows the quality-efficiency Pareto curve; full results are presented in Table [6](https://arxiv.org/html/2605.05222#A3.T6) (Appendix).
(a) $\lambda$ sweep: val loss vs. active fraction.
(b) Early exit vs. TSA Pareto curves.
Figure 3: Ablation studies on Tiny-Shakespeare. Left: TSA is robust across $\lambda \in [0, 0.1]$; the quality range is 0.015 nats (1.04%). Even $\lambda = 0$ produces meaningful routing. Right: TSA (red star) dominates the early exit threshold sweep (blue) at matched $\alpha \approx 0.726$.
Three findings emerged. *(i) $\lambda = 0$ still routes:* without any explicit depth pressure, the router learned to save 20.4% of TLOps ($\alpha = 0.755$) via the task-loss gradient alone. This is the central finding: the gating multiplication $h \mathrel{+}= (1 - p_l) \odot \Delta$ provides an intrinsic learning signal. When a layer's residual update is noisy or redundant, the gradient favours increasing $p_l$ to attenuate the update, even without regularisation; the router thus acts as a learned noise gate. *(ii) Robustness:* across $\lambda \in [0, 0.1]$, the quality range was only 0.015 nats (1.04% relative), so TSA does not require precise tuning of $\lambda$. *(iii) $\lambda = 0.05$ is Pareto-optimal:* 50.4% of TLOps saved at <0.5% quality loss, 2.4 times as efficient as the default $\lambda = 0.001$ with negligible quality cost. The stability boundary is $\lambda < 0.5$; at $\lambda = 0.5$ the active fraction collapsed to 0.036 and quality degraded by 5.9%.
### 3.5 Ablation: Comparison With Early Exit
The early exit approach (Elbayad et al., [2020](https://arxiv.org/html/2605.05222#bib.bib15)) is the canonical inference-time baseline for adaptive-depth transformers. We trained an early exit model with $N$ auxiliary exit classifiers (one per block, tied-embedding output heads, uniform mean cross-entropy loss across all exits) on the same Shakespeare setup. At inference, a token exited when its maximum softmax probability exceeded a confidence threshold. We swept 13 thresholds and selected the one yielding $\alpha \approx 0.726$ to match the TSA operating point. Full threshold data are presented in Table [7](https://arxiv.org/html/2605.05222#A4.T7) (Appendix).
Table 4: Comparison at Matched Active Fraction ($\alpha \approx 0.726$, Shakespeare).
TSA's router operates at both training and inference time, learning routing decisions end-to-end via the task-loss gradient. Early exit trains identically to the baseline and applies routing only at inference via a separate confidence threshold. At matched active fraction, TSA achieved 0.71% lower validation loss than early exit (Table [4](https://arxiv.org/html/2605.05222#S3.T4)), suggesting that end-to-end learned routing produces better routing decisions than post-hoc confidence thresholding. At more conservative thresholds (higher $\alpha$, fewer tokens skipped), early exit quality improves, as expected, since fewer routing decisions are made; the comparison at matched $\alpha$ isolates routing quality from routing aggressiveness (Table [7](https://arxiv.org/html/2605.05222#A4.T7)).
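A sketch of the confidence-threshold exit rule used by this baseline, shown for a single position's hidden state; the module names are assumptions and the actual benchmark processes full batches:

```python
import torch

@torch.no_grad()
def early_exit_forward(h, blocks, exit_heads, threshold: float = 0.95):
    """Confidence-based early exit in the spirit of Elbayad et al. (2020):
    after each block, an auxiliary head predicts the output distribution; if the
    max softmax probability exceeds the threshold, the token exits at that depth."""
    probs = None
    for block, head in zip(blocks, exit_heads):
        h = block(h)
        probs = torch.softmax(head(h), dim=-1)
        if probs.max().item() >= threshold:
            break                                 # confident enough: skip the deeper layers
    return probs
```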
### 3.6 Wall-Clock Throughput
Soft gating (§[2](https://arxiv.org/html/2605.05222#S2)) multiplies all residuals by $(1 - p_l)$ but does not skip any computation. To translate TLOps savings into wall-clock speedup, we implemented sparse-TSA: at inference, tokens with $p_l > 0.5$ were excluded from the FFN via gather/scatter operations; attention computation remained dense to preserve exact key-value (KV) semantics, though the attention residual update was gated by the same binary mask as the FFN. We benchmarked on an Apple M1 Pro using MLX (Apple Machine Learning Research, [2023](https://arxiv.org/html/2605.05222#bib.bib16)) with batch 64, sequence length 256, and 30 warmup + 200 timed forward passes per configuration. Full data are presented in Table [8](https://arxiv.org/html/2605.05222#A5.T8) (Appendix).
Table 5: Wall-Clock Throughput (M1 Pro, MLX, Batch = 64, Seq = 256). Soft-TSA overhead (∼1%) is flat across all $\alpha$: the router is negligible cost. Sparse-TSA speedup requires batch ≥ 64.
Soft gating added ∼1% overhead regardless of $\alpha$: the router represented negligible cost. Sparse-TSA was faster than the baseline for $\alpha \leq 0.83$: a 2.3% speedup at $\alpha = 0.726$ (Shakespeare) and break-even at $\alpha = 0.833$ (enwik8). The speedup required batch ≥ 64; at batch = 1, CPU-GPU synchronisation dominated.
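A PyTorch sketch of the gather/scatter FFN skip described above (the paper's benchmarks use MLX). Whether surviving tokens keep the soft $(1 - p_l)$ scale or receive the full delta at inference is not fully specified; the soft scale is retained here as an assumption, and `ffn` is assumed to include its pre-norm.

```python
import torch

def sparse_ffn(h, p, ffn, threshold: float = 0.5):
    """Inference-time sparse execution: tokens with p > threshold skip the FFN;
    the remaining tokens are gathered, processed, and scattered back."""
    B, T, d = h.shape
    flat_h, flat_p = h.reshape(B * T, d), p.reshape(B * T)
    active = (flat_p <= threshold).nonzero(as_tuple=False).squeeze(-1)
    delta = torch.zeros_like(flat_h)
    if active.numel() > 0:
        # run the FFN only on active tokens; keep the soft (1 - p) scale (assumption)
        delta[active] = (1.0 - flat_p[active]).unsqueeze(-1) * ffn(flat_h[active])
    return (flat_h + delta).reshape(B, T, d)
```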
## 4 Analysis and Limitations
#### What did the router learn?
On synthetic tasks, routing was interpretable: copy halted maximally (identity requires no deep computation); sort halted moderately (comparison needs depth). On language, the router was more conservative on enwik8 ($\alpha = 0.833$) than on Shakespeare ($\alpha = 0.726$), consistent with Wikipedia's greater structural diversity.
Figure [4](https://arxiv.org/html/2605.05222#S4.F4) shows per-token routing decisions across a sample Shakespeare passage. Early routers ($r_0$, $r_1$) keep most tokens active, while later routers ($r_3$, $r_4$) exhibit selective gating: punctuation, spaces, and predictable characters are attenuated more aggressively than content-bearing characters in mid-word positions. This confirms that the router learns a difficulty-sensitive signal rather than uniform layer skipping.
Figure 4: Per-token active fraction $(1 - p_l)$ across five routing decisions on a Shakespeare passage. Green: fully active; red: nearly halted. Early routers stay permissive; later routers selectively gate predictable tokens (spaces, punctuation, common characters) while preserving computation for content-bearing positions.
#### Limitations.
- *Scale.* All experiments used ≈5–6M parameters; scaling behaviour at 10M–100M is unknown.
- *Batch-size dependence.* Sparse-TSA wall-clock speedup required batch ≥ 64; custom Metal kernels would likely eliminate this.
- *No attention sparsity.* Only the FFN was sparsified; block-sparse attention could yield larger savings but would require retraining.
- *Preliminary routing analysis.* Figure [4](https://arxiv.org/html/2605.05222#S4.F4) shows qualitative patterns; quantitative analysis by token type (e.g., frequency, entropy) remains future work.
## 5 Related Work
#### Adaptive Computation.
Graves ([2016](https://arxiv.org/html/2605.05222#bib.bib1)) introduced ACT for RNNs; Dehghani et al. ([2019](https://arxiv.org/html/2605.05222#bib.bib2)) extended the approach to weight-shared layers with the Universal Transformer. TSA differs in using separate blocks and per-layer soft gates rather than an accumulated budget. Raposo et al. ([2024](https://arxiv.org/html/2605.05222#bib.bib3)) proposed Mixture-of-Depths, which uses hard top-$k$ routing; Bae et al. ([2025](https://arxiv.org/html/2605.05222#bib.bib4)) introduced Mixture of Recursions, which applies recursive blocks for a learned number of steps. Both employ discrete routing; TSA uses a continuous gate.
#### Early Exit.
Elbayad et al. ([2020](https://arxiv.org/html/2605.05222#bib.bib15)) proposed the Depth-Adaptive Transformer, which trains auxiliary classifiers and exits tokens at inference based on output confidence. Its training cost equals the baseline's. Our ablation (§[3.5](https://arxiv.org/html/2605.05222#S3.SS5)) showed that TSA achieved better quality at matched efficiency, suggesting that end-to-end learned routing outperforms post-hoc confidence thresholding.
#### Other Approaches.
Chen et al. ([2025](https://arxiv.org/html/2605.05222#bib.bib5)) proposed the Inner Thinking Transformer, which augments compute at hard positions (complementary to TSA). Fedus et al. ([2022](https://arxiv.org/html/2605.05222#bib.bib7)) introduced Switch Transformers, which route tokens to different expert FFNs, varying width rather than depth.
## 6 Conclusion
TSA reduced token-layer operations by 14–23% on character-level language modeling and by up to 55% on synthetic tasks, at <0.5% quality cost across two language corpora. The router learns difficulty-proportional allocation from the task-loss gradient alone, producing meaningful routing even at $\lambda = 0$. The mechanism adds 1.7% parameters, proved robust to $\lambda$ across two orders of magnitude, achieved better quality than early exit at matched efficiency, and translated into real wall-clock speedup via sparse inference at batch ≥ 64. Scaling to 10M+ parameters and per-position routing analysis are ongoing.
## References
- Apple Machine Learning Research (2023). MLX: an array framework for Apple silicon. [https://github.com/ml-explore/mlx](https://github.com/ml-explore/mlx)
- J. L. Ba, J. R. Kiros, and G. E. Hinton (2016). Layer normalization. arXiv preprint arXiv:1607.06450. [Link](https://arxiv.org/abs/1607.06450)
- S. Bae, Y. Hwang, J. Noh, M. Kim, K. H. Yoo, and C. D. Yoo (2025). Mixture of recursions: learning dynamic recursive depths for adaptive token-level computation. In International Conference on Machine Learning (ICML). [Link](https://arxiv.org/abs/2507.10524)
- T. B. Brown et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems (NeurIPS) 33. [Link](https://arxiv.org/abs/2005.14165)
- Y. Chen, J. Wang, Z. Luo, X. Wang, G. Li, Z. Zhao, X. Zeng, T. Liu, B. Ding, and J. Zhou (2025). Inner thinking transformer: leveraging dynamic depth scaling to foster adaptive internal thinking. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). [Link](https://arxiv.org/abs/2502.13842)
- M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and L. Kaiser (2019). Universal transformers. In International Conference on Learning Representations (ICLR). [Link](https://arxiv.org/abs/1807.03819)
- M. Elbayad, J. Gu, E. Grave, and M. Auli (2020). Depth-adaptive transformer. In International Conference on Learning Representations (ICLR). [Link](https://arxiv.org/abs/1910.10073)
- W. Fedus, B. Zoph, and N. Shazeer (2022). Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research 23(120), pp. 1–39. [Link](https://arxiv.org/abs/2101.03961)
- A. Graves (2016). Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983. [Link](https://arxiv.org/abs/1603.08983)
- M. Hutter (2006). The Hutter Prize. [http://prize.hutter1.net/](http://prize.hutter1.net/)
- A. Karpathy (2015). char-rnn. [https://github.com/karpathy/char-rnn](https://github.com/karpathy/char-rnn)
- I. Loshchilov and F. Hutter (2019). Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR). [Link](https://arxiv.org/abs/1711.05101)
- O. Press and L. Wolf (2017). Using the output embedding to improve language models. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL). [Link](https://arxiv.org/abs/1608.05859)
- D. Raposo, S. Ritter, B. Richards, T. Lillicrap, P. C. Humphreys, and A. Santoro (2024). Mixture-of-depths: dynamically allocating compute in transformer-based language models. arXiv preprint arXiv:2404.02258. [Link](https://arxiv.org/abs/2404.02258)
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017). Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS) 30. [Link](https://arxiv.org/abs/1706.03762)
## Appendix A Implementation Details
#### Weight Initialisation.
Token and positional embeddings: $\mathcal{N}(0, 0.02^2)$. Residual projections (attention output and final feed-forward layer): $\mathcal{N}(0, (0.02/\sqrt{2L})^2)$, following GPT-3 (Brown et al., [2020](https://arxiv.org/html/2605.05222#bib.bib9)). Router final bias: $-1.0$. Weight tying between the token embedding and output head followed Press and Wolf ([2017](https://arxiv.org/html/2605.05222#bib.bib8)).
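A sketch of this scheme for a PyTorch module tree; the `is_residual_proj` tag and the 0.02 default for other linear layers are assumptions used only to make the example self-contained:

```python
import math
import torch.nn as nn

def init_weights(model: nn.Module, n_layers: int = 6):
    """Appendix A initialisation: N(0, 0.02^2) embeddings and scaled-down residual
    projections; the router's final bias (-1.0) is set separately in its module."""
    for module in model.modules():
        if isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
        elif isinstance(module, nn.Linear):
            # residual projections are assumed to be tagged with `is_residual_proj`
            std = 0.02 / math.sqrt(2 * n_layers) if getattr(module, "is_residual_proj", False) else 0.02
            nn.init.normal_(module.weight, mean=0.0, std=std)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
```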
#### Optimiser.
AdamW (Loshchilov and Hutter, [2019](https://arxiv.org/html/2605.05222#bib.bib11)) with $\beta = (0.9, 0.95)$ and $\lambda_{\mathrm{wd}} = 0.1$ on all parameters except biases, LayerNorm parameters, and embeddings (which used $\lambda_{\mathrm{wd}} = 0$). The MLX implementation applies weight decay uniformly to all parameters, as MLX does not support per-parameter decay groups. This affects the enwik8 experiments (Table [3](https://arxiv.org/html/2605.05222#S3.T3)) and the $\lambda$ sweep (Table [6](https://arxiv.org/html/2605.05222#A3.T6)).
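The decay grouping is straightforward to express with PyTorch parameter groups; the name-based embedding check below is an assumed naming convention, not the authors' code:

```python
import torch

def make_optimizer(model, lr=3e-4, weight_decay=0.1):
    """AdamW with decay only on weight matrices: biases, LayerNorm parameters,
    and embeddings are placed in a zero-decay group, mirroring Appendix A."""
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if param.ndim < 2 or "embed" in name.lower():   # biases, norms (1-D), embeddings
            no_decay.append(param)
        else:
            decay.append(param)
    return torch.optim.AdamW(
        [{"params": decay, "weight_decay": weight_decay},
         {"params": no_decay, "weight_decay": 0.0}],
        lr=lr, betas=(0.9, 0.95),
    )
```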
#### Causal Masking.
A standard upper-triangular causal attention mask was used. Routing decisions were computed from hidden states after the causal attention sublayer and did not depend on the mask structure.
#### Character-Level Tokenisation.
Characters were mapped to integer indices via a sorted vocabulary of the unique characters in the training corpus (65 characters in Tiny-Shakespeare; 6,064 unique characters in enwik8). Token index 0 mapped to the newline character in Shakespeare; the embedding for this token was initialised normally (no `padding_idx` was set) to avoid zeroing gradients for ≈8% of corpus tokens. enwik8 used the raw byte distribution with no preprocessing beyond vocabulary construction.
## Appendix B enwik8 Training Curves
(a) Validation loss vs. training step.
(b) Validation loss vs. cumulative TLOps.
Figure 5: TSA (red) and baseline (blue) on enwik8. Convergence speed is identical; TSA reaches the same loss with 13.9% fewer token-layer operations.
## Appendix C Full $\lambda$ Sweep Results
Table 6: Full $\lambda$ Sweep on Tiny-Shakespeare (5,000 Steps, Batch = 64, Ctx = 128). All $\lambda$-sweep experiments used MLX on an Apple M1 Pro; within-sweep comparisons are framework-consistent. The $\lambda = 0.001$ result differs from Table [2](https://arxiv.org/html/2605.05222#S3.T2) (PyTorch, MPS) by 0.63% in validation loss due to framework and RNG differences; cross-framework comparisons should use the sweep-internal baseline row, not Table [2](https://arxiv.org/html/2605.05222#S3.T2).
## Appendix D Full Early Exit Threshold Sweep
Table 7: Early Exit Threshold Sweep on Tiny-Shakespeare (after 5,000 training steps). The bold row matches the TSA operating point ($\alpha \approx 0.726$). Full model (no exit): val loss = 1.4450; training cost identical to the baseline.
## Appendix E Full Wall-Clock Data
Table 8: Active-Fraction Sweep (M1 Pro, MLX, Batch = 64, Seq = 256).
Figure 6: Wall-clock speedup vs. active fraction. Sparse-TSA is faster than the baseline for $\alpha \leq 0.83$. Vertical dashed lines mark the Shakespeare ($\alpha = 0.726$) and enwik8 ($\alpha = 0.833$) operating points.
Table 9: Batch-Size Scaling at $\alpha = 0.833$ (M1 Pro, MLX, Seq = 256). At $\alpha = 0.833$, sparse-TSA breaks even at batch 64 and achieves a speedup at batch 128; at batch 1, CPU-GPU synchronisation dominates.
This paper introduces TIDE, a method that addresses the Rare Token and Contextual Collapse problems in LLMs by injecting token identity into every layer via Embedding Memory. The authors demonstrate theoretical and empirical improvements across language modeling and downstream tasks.