Attention Drift: What Autoregressive Speculative Decoding Models Learn

Reddit r/LocalLLaMA 05/12/26, 07:10 PM Papers
Summary
This paper identifies 'attention drift' in autoregressive speculative decoding models, where drafters' attention shifts from the prompt to their own generated tokens. The authors propose architectural changes, such as post-norm and RMSNorm, which improve acceptance rates and robustness across various benchmarks.
Speculative decoding accelerates LLM inference by drafting future tokens with a small model, but drafter models degrade sharply under template perturbation and long-context inputs. We identify a previously-unreported phenomenon we call **attention drift**: as the drafter generates successive tokens within a speculation chain, attention progressively moves from the prompt onto its own recently-generated tokens. We observe this across both *EAGLE3* drafters and *MTP heads*, suggesting drift is a property of drafter designs. We trace this to the un-normalized residual path between chain steps: the drafter's hidden state magnitude grows monotonically with chain depth, which exhibits dynamics consistent with additional pre-norm transformer layers stacked on the target rather than as a standalone autoregressive predictor. In order to limit the growth, we propose two architectural changes: Post-norm on the drafter hidden states and per-hidden-state RMSNorm after capturing target hidden states. Our interventions improve acceptance length over the current leading model, pre-norm EAGLE3, by up to 2× under template perturbation, 1.18× on long-context tasks, and 1.10× on seven standard benchmarks spanning multi-turn chat, math, and coding. Our changes also allow shorter train-time-test depths to generalize over longer drafting sequences.
# Attention Drift: What Autoregressive Speculative Decoding Models Learn

Code: https://github.com/Dogacel/Attention-Drift
Source: [https://arxiv.org/html/2605.09992](https://arxiv.org/html/2605.09992)
Doğaç Eldenk (Northwestern University), Payal Mohapatra (Northwestern University), Yigitcan Comlek (GE Aerospace), Kaan Oktay (fal), Hongyang Zhang (University of Waterloo), Stephen Xia (Northwestern University)

###### Abstract

Speculative decoding accelerates LLM inference by drafting future tokens with a small model. However, drafter models degrade sharply under template perturbation and long-context inputs. We identify a previously-unreported phenomenon we call attention drift: as the drafter generates successive tokens within a speculation chain, attention progressively moves from the prompt onto its own recently-generated tokens. This phenomenon is observed across both *EAGLE3* drafters and *MTP heads*, suggesting drift is a property of drafter designs. We trace this to the unnormalized residual path between chain steps: the drafter’s hidden state magnitude grows monotonically with chain depth, which exhibits dynamics consistent with additional pre-norm transformer layers stacked on the target rather than as a standalone autoregressive predictor. In order to limit the growth, we propose two architectural changes: Post-norm on the drafter hidden states and per-hidden-state RMSNorm after capturing target hidden states. Our interventions improve acceptance length over the current leading model, pre-norm EAGLE3, by up to $2\times$ under template perturbation, $1.18\times$ on long-context tasks, and $1.10\times$ on seven standard benchmarks spanning multi-turn chat, math, and coding. Our changes also allow shorter train-time-test depths to generalize over longer drafting sequences.

![Refer to caption](https://arxiv.org/html/2605.09992v1/x1.png)

Figure 1: Attention drift. During speculation, the drafter’s attention moves from the prompt’s sink token onto its own recently-generated tokens. Left: Emergence of the attention sink demonstrated on the drafter’s attention heatmap, rows = query, columns = key; darker = higher attention. Center: Graphical visualization of attention drift on a drafter. Right: Attention per token position on the $x$ axis, with speculated tokens to the right of the dashed line. Panel labels: Speculation starts; Attention Drift; Amount of Attention; Sink Token; Recently Generated.

## 1 Introduction

Speculative decoding \[[8](https://arxiv.org/html/2605.09992#bib.bib13),[4](https://arxiv.org/html/2605.09992#bib.bib14)\] is a lossless acceleration technique for large language model (LLM) inference in which a lightweight drafter predicts future tokens that are later verified by a larger target or verifier model. In practice, speculative decoding systems must operate under a wide range of deployment conditions, including different inference engines, context lengths, system prompts, and chat templates. Robustness under varying scenarios is critical to maintaining high acceptance rates and consistent inference speedups. However, recent works have shown that drafters often degrade substantially under challenging deployment scenarios, such as template perturbations and long-context inputs \[[17](https://arxiv.org/html/2605.09992#bib.bib2),[20](https://arxiv.org/html/2605.09992#bib.bib3)\]. As a result, practitioners frequently retrain or specialize drafters for specific inference engines, prompts, or templates to maximize performance. Despite speculative decoding models being relatively cheap to train, this sensitivity points to an underlying issue. Our main contributions in this work are as follows:

(i) We identify a previously unreported phenomenon that helps explain this fragility, which we call attention drift ([Figure 1](https://arxiv.org/html/2605.09992#S0.F1)). As a drafter generates more tokens during speculation, its attention progressively shifts away from the sink toward recently generated tokens. We observe this phenomenon consistently across EAGLE-3 drafters \[[11](https://arxiv.org/html/2605.09992#bib.bib4)\] and Multi-Token Prediction (MTP) heads \[[6](https://arxiv.org/html/2605.09992#bib.bib7)\], suggesting that attention drift reflects a broader property of auto-regressive drafter designs.

Figure 2: Overview of Pre-norm (Left) and proposed Post-norm (Right) architecture. Diagram blocks in each panel: Transformer Layer, RMSNorm, LM Head, $h_{\text{out}}$.

(ii) To understand this behavior, we analyze the hidden-state dynamics of speculative drafters and find that the unnormalized residual connection between speculation steps causes hidden-state magnitudes to grow monotonically with chain depth, resembling additional transformer layers stacked on top of the verifier. The drafter implicitly learns a depth-dependent refinement process instead of a stable autoregressive token predictor, making it sensitive to template changes and long contexts ([Section 4.2](https://arxiv.org/html/2605.09992#S4.SS2)).

(iii) Motivated by this observation, we introduce a simple architectural modification based on post-normalization ([Figure 2](https://arxiv.org/html/2605.09992#S1.F2)) combined with a normalization before hidden-state fusion, that prevents hidden-state magnitude growth and stabilizes the drafting process. These changes substantially reduce attention drift and improve acceptance length over the current state-of-the-art pre-norm EAGLE3 architecture by up to $2\times$ under template perturbation, $1.18\times$ on long-context tasks, and $1.10\times$ across seven standard benchmarks spanning multi-turn chat, reasoning, coding, and mathematics. We further show that post-norm drafters generalize to inference depths beyond those seen during training, enabling shorter train-time-test depths and reducing training cost.

## 2 Preliminaries

Figure 3: Verification phase: green tokens are accepted, yellow is resampled and red ones are rejected. Diagram tokens: Drafter: How, can, I, help, me, today. Verifier: How, can, I, help, me, you, today.

Speculative Decoding alternates between two stages ([Figure 3](https://arxiv.org/html/2605.09992#S2.F3)). In the drafting phase, the drafter auto-regressively generates a sequence of $k$ candidate future tokens, where $k$ is the speculation depth. In the verification phase, the target model evaluates the drafted sequence in a single forward pass and applies rejection sampling against its own token distribution; the longest valid prefix is accepted. A key efficiency metric is the acceptance length ($\tau$), defined as the average number of drafted tokens accepted per verification round. Higher acceptance lengths generally translate directly into larger end-to-end inference speedups for a fixed drafter overhead.
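The verification rule can be illustrated with a minimal NumPy sketch of standard speculative rejection sampling. It is not taken from the paper's code: `verify_draft`, `q_draft`, and `p_target` are hypothetical names, the distributions are passed in as arrays instead of being produced by real models, and the bonus token that the target may sample after a fully accepted chain is omitted.

```python
import numpy as np

def verify_draft(draft_tokens, q_draft, p_target, rng):
    """Verify k drafted tokens; return (output_tokens, n_accepted_drafts).

    q_draft[i] and p_target[i] are the drafter and target distributions
    (shape [vocab]) at draft position i. All names are illustrative.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p, q = p_target[i], q_draft[i]
        # Accept the drafted token with probability min(1, p(tok) / q(tok)).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(int(tok))
            continue
        # First rejection: resample from the renormalized residual max(p - q, 0)
        # and discard every later drafted token.
        residual = np.maximum(p - q, 0.0)
        residual /= residual.sum()
        out.append(int(rng.choice(len(p), p=residual)))
        return out, i
    return out, len(draft_tokens)
```

Averaging the second return value over many verification rounds gives an estimate of the acceptance length $\tau$ (excluding the bonus token).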

EAGLE \[[10](https://arxiv.org/html/2605.09992#bib.bib6)\] and its successors EAGLE-2/EAGLE-3 \[[9](https://arxiv.org/html/2605.09992#bib.bib5),[11](https://arxiv.org/html/2605.09992#bib.bib4)\] feed the target’s fused internal hidden states through a fully-connected projection (fc in [Figure 13](https://arxiv.org/html/2605.09992#S5.F13)) into a single pre-norm transformer decoder layer \[[15](https://arxiv.org/html/2605.09992#bib.bib27)\] that serves as the drafter ([Figure 2](https://arxiv.org/html/2605.09992#S1.F2) Left). The drafter is trained with a token-level cross-entropy loss against the frozen target’s predicted distribution, with a fixed number of speculation steps (the train-time-test depth, TTT). EAGLE-3, our focus, is the dominant auto-regressive drafter design in production engines such as vLLM \[[7](https://arxiv.org/html/2605.09992#bib.bib8)\] and SGLang \[[22](https://arxiv.org/html/2605.09992#bib.bib9)\].

Attention Sinks. An attention sink \[[18](https://arxiv.org/html/2605.09992#bib.bib1)\] is a token (typically near the start of a sequence) that absorbs disproportionately large attention during inference. Sinks are widely observed in modern LLMs and are believed to act as anchors that stabilize attention, particularly under long-context inputs.

## 3 Attention Drift

![Refer to caption](https://arxiv.org/html/2605.09992v1/x2.png)
(a) Qwen3.5 35B

![Refer to caption](https://arxiv.org/html/2605.09992v1/x3.png)
(b) Llama3 8B

Figure 4: Attention heatmaps for visualizing attention drift on EAGLE-3 drafters. Aggregated over 200+ samples from varied prompts and sequence positions.

Looking at speculative decoding models’ attention, a pattern emerges: whenever the target model develops a sink, the drafter develops a sink at exactly the same place. This observation is consistent across MTP heads and EAGLE-3 drafters. Upon close inspection of three model families with different attention patterns, we observed Llama’s sink is on the first token, Qwen’s on the second, and GPT-oss has no sink. In every case, the drafter matched its target.

The drafter attention resembles the verifier attention as the drafter consumes verifier hidden states\. However, once drafting begins, the attention distribution progressively changes as the drafter transitions from consuming verifier hidden states to consuming its own generated hidden states\. During this process, attention increasingly shifts away from the sink and toward recently generated tokens\.

We refer to this phenomenon as attention drift, visualized in [Figures 4](https://arxiv.org/html/2605.09992#S3.F4) and [5](https://arxiv.org/html/2605.09992#S3.F5) across multiple EAGLE-3 drafters. In models with strong sinks, the sink progressively weakens during drafting while attention mass concentrates on recently drafted tokens. The drafter therefore operates in two modes: attending to the verifier’s hidden states at pre-fill and to its own hidden states during generation, with a gradual learned transition between them. We hypothesize that this transition makes the drafter more fragile to out-of-distribution inputs, most notably in long context and at deeper speculation steps.
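One way to quantify drift of this kind is to track, per speculation step, how much attention lands on the sink and how much lands on the most recent position. The sketch below is illustrative only: `drift_profile` is a hypothetical helper, it assumes a single head's causal attention matrix has already been extracted, and it treats the query's own position as the "latest token", which may differ from how the paper aggregates its plots.

```python
import numpy as np

def drift_profile(attn, sink_pos, prompt_len):
    """Per-step sink and latest-token attention for one head.

    attn: [num_queries, num_keys] causal attention weights (rows sum to 1).
    Queries at index >= prompt_len are treated as speculation steps.
    Returns two arrays: attention on the sink, attention on the newest key.
    """
    sink, latest = [], []
    for q in range(prompt_len, attn.shape[0]):
        sink.append(attn[q, sink_pos])   # mass on the sink token
        latest.append(attn[q, q])        # mass on the query's own position
    return np.array(sink), np.array(latest)
```

Under drift, the first array decreases with speculation depth while the second increases.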

![Refer to caption](https://arxiv.org/html/2605.09992v1/x4.png)

Figure 5: Percentage of attention concentrated on Sink token (Left) and Latest Token (Right) on EAGLE-3 drafters.

Here, we examine MTP Heads \[[6](https://arxiv.org/html/2605.09992#bib.bib7)\] as a cross-architecture check on whether attention drift is specific to EAGLE-3 or a more general property of drafters that consume their own previous outputs. MTP Heads are auxiliary prediction heads jointly trained with the target model during pretraining, sharing the target’s LM head and predicting multiple positions ahead. This differs from EAGLE-3, which is trained post-hoc as a separate drafter on top of a frozen target with its own LM head.

![Refer to caption](https://arxiv.org/html/2605.09992v1/x5.png)

Figure 6: Attention Drift on Qwen3.5 9B MTP head

We inspect the MTP head of Qwen3.5 9B ([Figure 6](https://arxiv.org/html/2605.09992#S3.F6)), a single transformer layer that follows the target model’s attention architecture and reuses the same weights across consecutive speculation steps. During MTP speculation, we observe similar trends ([Figure 7](https://arxiv.org/html/2605.09992#S3.F7)): attention to the sink token decreases substantially, while attention to the most recently drafted token increases.

![Refer to caption](https://arxiv.org/html/2605.09992v1/x6.png)

Figure 7: Sink and drafted token self attention on Qwen3.5 9B MTP heads (MT-Bench, 80 prompts).

![Refer to caption](https://arxiv.org/html/2605.09992v1/x7.png)

Figure 8: Attention Drift on GPT-oss 120B.

What if the verifier doesn’t have an attention sink? Some recent models, such as Qwen3-Next and GPT-oss, are designed to suppress attention sinks. Qwen3-Next uses gated attention, applying a per-head sigmoid gate to the SDPA output so that heads can multiplicatively suppress their contribution. GPT-oss instead introduces a per-head learnable bias logit in the softmax denominator, giving each head an explicit "attend to nothing" option that absorbs excess attention mass. Inspecting a GPT-oss drafter ([Figure 8](https://arxiv.org/html/2605.09992#S3.F8)), we observe no visible sink on the first token, but a weak yet consistent attention concentration on recurring template tokens. This suggests that sink-like behavior can arise not only from architectural inductive bias but also from repeated special tokens or template markers, especially in the absence of a dedicated architectural sink mechanism in the verifier.
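The two sink-suppression mechanisms described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the Qwen3-Next or GPT-oss implementation: `softmax_with_sink` and `gated_output` are hypothetical names, a single query and head are shown, and the gate logits are passed in directly instead of being computed by a learned projection.

```python
import numpy as np

def softmax_with_sink(scores, sink_logit):
    """Softmax over keys with an extra learnable logit in the denominator only.

    scores: [num_keys] attention logits for one query and head.
    The returned weights sum to less than 1; the missing mass is the
    "attend to nothing" share absorbed by the sink logit.
    """
    m = max(scores.max(), sink_logit)        # subtract the max for stability
    e = np.exp(scores - m)
    return e / (e.sum() + np.exp(sink_logit - m))

def gated_output(attn_out, gate_logits):
    """Scale the attention output by a sigmoid gate, element by element."""
    return attn_out / (1.0 + np.exp(-gate_logits))   # attn_out * sigmoid(gate)
```

With a very negative `sink_logit` the first function reduces to an ordinary softmax, which is one way to see that the extra logit only adds an escape route for excess attention mass.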

## 4 What Causes Attention Drift?

To understand why attention drift occurs and what it reveals about the underlying model, we trained drafters from three different verifier model families. We primarily evaluate on MT-Bench \[[21](https://arxiv.org/html/2605.09992#bib.bib28)\], whose eight categories of multi-turn instructions allow us to evaluate across diverse prompts. All architectural changes explored in this section required re-training of the drafter (details in Appendix [C](https://arxiv.org/html/2605.09992#A3)).

### 4.1 Hidden State Magnitudes

Attention drift occurs as we move away from the target’s hidden states towards consuming the drafter’s own hidden states during generation. Therefore, we first inspect the hidden states of the verifier and drafter. In [Table 1](https://arxiv.org/html/2605.09992#S4.T1), the magnitudes are compared between the verifier’s hidden states, the verifier’s fused hidden state, and the drafter’s own hidden states at different speculation steps, using root mean square (RMS), $\|x\|_2/\sqrt{d}$. We notice several patterns, described next.
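For reference, the RMS magnitude used here is straightforward to compute. The helper below is a small sketch with a hypothetical name (`rms`); it takes the norm over the last (hidden) axis so that a batch of token states yields one value per token, which can then be averaged over prompts and rounds.

```python
import numpy as np

def rms(x):
    """Root mean square over the hidden axis: ||x||_2 / sqrt(d)."""
    d = x.shape[-1]
    return np.linalg.norm(x, axis=-1) / np.sqrt(d)
```

This is equivalent to `np.sqrt(np.mean(x**2, axis=-1))`, the quantity that RMSNorm divides by.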

Table 1: Hidden state magnitudes (RMS) across model families, averaged over 80 MT-Bench prompts and $k=8$ speculation rounds. Hidden states from the verifier’s low, middle, and high layers ($h_{low}$, $h_{mid}$, $h_{high}$); the fused representation of the verifier’s hidden states $h_{FC}$; and the drafter’s output hidden states at speculation steps 1, 2, 3, and 8 ($h^*_1$ through $h^*_8$).

| Target Model | Verifier $h_{\text{low}}$ | Verifier $h_{\text{mid}}$ | Verifier $h_{\text{high}}$ | Verifier $h_{FC}$ | Drafter $h^*_t$, $k=1$ | Drafter $h^*_t$, $k=2$ | Drafter $h^*_t$, $k=3$ | Drafter $h^*_t$, $k=8$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama 3.1 8B | 0.56 | 0.58 | 0.78 | 12.46 | 3.92 | 4.87 | 5.86 | 14.02 |
| Qwen 3 30B | 0.33 | 2.17 | 2.71 | 2.56 | 1.00 | 1.13 | 1.25 | 1.67 |
| Qwen 3.5 35B | 0.03 | 0.12 | 0.21 | 0.89 | 3.47 | 4.03 | 4.42 | 5.92 |

GPT-oss 20B and GPT-oss 120B rows (column boundaries lost in extraction): 35232496878995138 and 31631455583497511537647.

Observation 1: Hidden-state magnitudes mismatch. The hidden states of the verifier, the fully connected layer, and the drafter have substantially different magnitudes. This is attributed to the positioning of the RMSNorm in the drafter and a norm component being shared between $h_{FC}$ and the drafter ($h^*$), shown in [Figure 13](https://arxiv.org/html/2605.09992#S5.F13). Despite the hidden state magnitudes being vastly different between the target and drafter, we did not see a correlation between the magnitude of hidden states and acceptance rates, suggesting that the model learns to compensate for it.

Observation 2: The verifier fused representation, $h_{FC}$, is imbalanced. All three target families we explored use the pre-norm architecture: RMSNorm is applied inside each transformer block (before attention and MLP), but never to the residual stream itself. As such, each layer’s magnitudes accumulate monotonically through depth (prenorm dilution \[[14](https://arxiv.org/html/2605.09992#bib.bib21)\]), and $\|h_1\| < \|h_2\| < \|h_3\|$ holds for every row of Table [1](https://arxiv.org/html/2605.09992#S4.T1). The target’s final pre-LM-head RMSNorm normally absorbs this growth when producing logits, but EAGLE-3 uses the verifier’s internal states captured before this norm ($h_{low}$, $h_{mid}$, $h_{high}$). Because of this, $h_{high}$ dominates the $h_{FC}$ signal with its larger magnitude. Thus, in future sections we add RMSNorm layers before each target hidden stream used as input to generate $h_{FC}$, to prevent this imbalance and generate a more stable verifier representation.
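The effect of normalizing each captured stream before fusion can be sketched as follows. This is an assumption-laden toy, not the paper's module: `rms_norm` omits the learnable gain, `fuse` and `w_fc` are hypothetical names, and fusion is modeled as concatenation followed by a single linear projection.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learnable gain (a simplification)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def fuse(h_low, h_mid, h_high, w_fc, normalize):
    """Concatenate the three captured streams and project with w_fc ([3d, d]).

    With normalize=True each stream is RMS-normalized separately first, so
    no single stream can dominate the fused input through its scale alone.
    """
    streams = [h_low, h_mid, h_high]
    if normalize:
        streams = [rms_norm(h) for h in streams]
    return np.concatenate(streams, axis=-1) @ w_fc
```

In this sketch, rescaling `h_high` by a constant leaves the normalized fusion essentially unchanged, whereas the unnormalized fusion shifts in proportion to the rescaling.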

Observation 3: Magnitude growth across speculation depth. The drafter-generated hidden states grow monotonically with speculation depth. Across all model families, $h^*_k$ accumulates magnitude away from the distribution the drafter was trained on at step 0. This means that the drafter does not operate on a depth-invariant hidden-state distribution. Instead, each speculation step changes the scale of the representation consumed by the next step.

How are drift and magnitude related? The growth in magnitude across speculation depth matters for attention, which depends on query-key similarities. As the drafter generates deeper into the speculation chain, its current hidden state increasingly resembles recently generated drafter states rather than verifier-conditioned prompt states. The empirical consequence, quantified in the next section, is a *redistribution* of attention mass away from the original input and onto the drafter’s own predictions.

### 4.2 Norm Position Changes: Layer-stacking vs Autoregression

The monotonic growth of the drafter hidden states suggests a simple interpretation of what a drafter *learns to be*. In a standard pre-norm drafter, the drafter’s chain hidden states $h^*_1, h^*_2, \dots, h^*_k$ accumulate through an unnormalized residual path, causing the hidden states to monotonically grow with speculation depth. This makes the drafter behave less like an independent autoregressive model and more like a sequence of additional transformer layers stacked on top of the target. At speculation step $k$, the drafter effectively approximates what would happen if the target had $N, N+1, \dots, N+k$ layers, rather than acting as a standalone depth-invariant model that autoregressively predicts the next token. The model takes the target’s representations and *keeps refining it* with another layer of attention+MLP compute, one per speculation step ([Figure 9](https://arxiv.org/html/2605.09992#S4.F9)).

(a) With Pre-norm. Diagram labels: Target / Verifier (Layer 1, $\cdots$, Layer $N$), Drafter ($N+1$, $N+2$, $\cdots$), input $(h, e)$, tokens How, can.

(b) With Post-norm. Diagram labels: Target / Verifier (Layer 1, $\vdots$, Layer $N$), Drafter, input $(h, e)$, fed-back output $(h^*, e^*)$, tokens How, can, ….

Figure 9: Two views of the EAGLE3 drafter at chain depth $>1$. With the standard pre-norm structure, the model acts more like additional layers stacked onto the target model, whereas post-norm acts more like an independent auto-regressive drafter that accepts its own output.
![Refer to caption](https://arxiv.org/html/2605.09992v1/x8.png)

Figure 10: Pre-norm vs. Post-norm at various TTT (max training depth). Magnitude and per-step acceptance on coding (HumanEval) and math (GSM8K), $k=20$ steps.

To test this interpretation, we compare standard pre-norm and the modified post-norm drafters trained with different train-time-test depths. Train-time-test depth denotes the maximum speculation depth used during training. If a drafter learns a depth-invariant autoregressive rule, then training with a small depth should still generalize reasonably to longer speculation chains at inference. In contrast, if the drafter learns depth-specific transformations resembling layers $N, N+1, \dots$, then it should fail beyond the depths observed during training. [Figure 10](https://arxiv.org/html/2605.09992#S4.F10) supports the latter interpretation for pre-norm drafters. A pre-norm drafter trained with train-time-test depth 2 performs well at the first few drafting steps but collapses beyond its training horizon. Its hidden-state magnitude grows rapidly, and its conditional acceptance probability drops sharply at deeper speculation steps.

By contrast, the post\-norm drafter trained with the same TTT depth remains stable beyond the training horizon\. Applying post\-normalization to the drafter hidden state between speculation steps prevents residual\-scale accumulation, making the drafter less able to encode depth through magnitude growth\. Rather than continuing the verifier as a depth\-dependent stack of additional pre\-norm transformer layers, the post\-norm drafter is regularized toward a stable autoregressive prediction function\.
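The contrast between the two residual paths can be reproduced in a toy chain. The sketch below is not the EAGLE-3 drafter: the attention+MLP block is replaced by a hypothetical skew-symmetric linear map, chosen only because its output is orthogonal to its input, which makes the pre-norm growth exact instead of statistical. `run_chain` and the other names are illustrative.

```python
import numpy as np

def rms(x):
    return float(np.sqrt(np.mean(x * x)))

def rms_norm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x * x) + eps)

def run_chain(h0, sublayer, steps, post_norm):
    """Feed the hidden state back into the same sublayer for `steps` steps."""
    h, trace = h0, []
    for _ in range(steps):
        if post_norm:
            # Normalize the state that is handed to the next speculation step.
            h = rms_norm(h + sublayer(h))
        else:
            # Pre-norm: only the branch input is normalized; the residual
            # stream itself is passed on unnormalized.
            h = h + sublayer(rms_norm(h))
        trace.append(rms(h))
    return trace

rng = np.random.default_rng(0)
d = 16
m = rng.standard_normal((d, d))
a = (m - m.T) / np.sqrt(2 * d)      # toy skew-symmetric stand-in for the block
h0 = rng.standard_normal(d)
pre = run_chain(h0, lambda x: a @ x, steps=8, post_norm=False)
post = run_chain(h0, lambda x: a @ x, steps=8, post_norm=True)
```

In this toy, `pre` increases at every step while `post` stays pinned near 1, mirroring the qualitative behavior reported for the two drafter variants; it says nothing about acceptance rates.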

### 4.3 Fixing the Attention Sink

The observation above suggests that hidden\-state magnitude accumulation is a major contributor to attention drift\. However, attention drift could also be interpreted as a consequence of attention\-sink collapse: if the sink weakens during drafting, attention mass must move elsewhere\. In this section, we test whether eliminating the sink is sufficient to solve drift as well as the magnitude growth\.

To test this, we compare several architectural variants on the same target: a standard pre-norm drafter, a gated-attention drafter, a post-norm drafter, and a drafter combining gated attention and post-norm. The gated-attention variant introduces a learned element-wise gate inside the attention layer ([Appendix A](https://arxiv.org/html/2605.09992#A1)), which strongly suppresses reliance on a fixed sink token. [Table 2](https://arxiv.org/html/2605.09992#S4.T2) summarizes the results.

Table 2: Gated attention vs. normalization: chain magnitude of hidden states $h^*_k$, entropy $H$, sink and recent token attention across EAGLE3 drafters for Llama-3.1 8B on MT-Bench (80 prompts over 8 categories and all drafting rounds).

Magnitude and attention are independent failure modes. Gated attention nearly eliminates sink attention and removes the visible drift pattern ($\leq 0.03$ throughout), but its hidden state magnitude grows $5.0\times$ during drafting and starts from a larger $h^*_1 \approx 8.4$. On the other hand, post-norm pins the hidden-state magnitude across speculation steps and stabilizes the attention pattern over the drafting chain. These results indicate that (i) fixing attention sink eliminates the drift, (ii) attention drift and magnitude accumulation are independent failure modes and that solving attention drift does not necessarily eliminate the magnitude issue.

Role of Entropy. We also examine whether attention entropy ($H$) explains drafter failures. Each model sits at a different base entropy level, and per-token entropy rises slightly as the speculation depth $k$ grows: the distribution becomes differently-*shaped*, not peakier. However, no consistent relationship between entropy and acceptance length emerges across our pre-norm, post-norm, and gated-attention variants, indicating that drafters learn to compensate for entropy shifts and that entropy alone is not a useful predictor of drafter quality across architectures.

Combining the Gated Attention with Post-norm creates a new pathology. The Post-norm + Gated variant combines both modifications, and it does what one would hope along each of the two individual axes: chain magnitude is flat, the sink is eliminated, and the recent token attention is stable at $0.50$. But entropy collapses to $H \approx 0.62$ already at pre-fill, about a third of the other drafters’ values, corresponding to an effective attention support of roughly two positions ($e^{0.62} \approx 1.85$). Applying both changes appears to over-regularize the drafter: the attention distribution collapses onto roughly two tokens.
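The link between entropy and "effective support" used above is the exponential of the Shannon entropy in nats. A small sketch, with hypothetical helper names, makes the conversion concrete for a single attention row.

```python
import numpy as np

def attention_entropy(weights):
    """Shannon entropy in nats of one attention row; zero weights are skipped."""
    w = weights[weights > 0]
    return float(-(w * np.log(w)).sum())

def effective_support(weights):
    """exp(H): the number of equally weighted positions with the same entropy."""
    return float(np.exp(attention_entropy(weights)))
```

A row that splits its mass evenly over two keys has $H = \ln 2 \approx 0.69$ and an effective support of exactly 2, so an entropy near 0.62 corresponds to slightly fewer than two positions.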

### 4.4 Noise and Error Accumulation

The drafter consumes its own predictions as inputs to subsequent speculation steps, and these self-generated hidden states are noisier than the clean verifier states seen at $t=1$. The $\alpha$-noise sweep therefore tests how gracefully each architecture absorbs imprecise hidden-state inputs, determining how quickly errors compound with depth. Post-norm tolerates an order of magnitude more perturbation than pre-norm on the hidden pathway (58% vs. 5% at $\alpha=0.5$), offering a mechanistic explanation for its better deep-chain behavior: it accumulates less error per speculation step. We further hypothesize that this tolerance may translate to robustness under other small hidden-state perturbations, such as those induced by verifier quantization or mild distribution shift.

Table 3: Drafter pathway reliance under noise injection. Each cell reports acceptance length as a percentage of the model’s no-perturbation baseline (parenthesized in the model column). Noise is scaled by the per-tensor RMS: $x \leftarrow x + \alpha \cdot \mathrm{rms}(x) \cdot \varepsilon$.

Does artificially shrinking the magnitude fix attention drift? We observed that fixing the drift does not necessarily fix the magnitude accumulation. However, it remains unclear whether fixing the magnitude alone would fix attention drift. We therefore ran a controlled inference-time experiment in which the drafter’s hidden states ($h_{out}$) were normalized to match the fc’s magnitude. [Table 4](https://arxiv.org/html/2605.09992#S4.T4) shows that normalizing the magnitude without matched training hurts accuracy in both cases, with pre-norm affected even more than post-norm. Interestingly, attention drift is significantly reduced on the pre-norm architecture, although it still occurs weakly. This shows that magnitude accumulation is one of the contributors to attention drift but not the only one. Note that the experiment isolates the magnitude-attention link at inference but does not establish what a magnitude-controlled drafter would learn end-to-end.
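Both perturbations in this subsection are simple tensor operations. The sketch below uses hypothetical helper names and assumes standard Gaussian noise for $\varepsilon$, which the text does not specify; `pin_rms` rescales a hidden state to a target RMS in the spirit of the inference-time experiment.

```python
import numpy as np

def tensor_rms(x):
    """RMS over the whole tensor."""
    return float(np.sqrt(np.mean(x * x)))

def inject_noise(x, alpha, rng):
    """x <- x + alpha * rms(x) * eps, with eps ~ N(0, I) (assumed Gaussian)."""
    eps = rng.standard_normal(x.shape)
    return x + alpha * tensor_rms(x) * eps

def pin_rms(h_out, target_rms):
    """Rescale h_out so that its RMS equals target_rms (e.g. the FC output RMS)."""
    return h_out * (target_rms / tensor_rms(h_out))
```

Scaling the noise by `tensor_rms(x)` keeps the relative perturbation comparable across drafters whose hidden states sit at very different magnitudes.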

Table 4: Effect of pinning the hidden-state RMS to the FC’s RMS on Llama 3.1 8B during test time.

Role of Training Window. A possible additional contributor to attention drift is the training procedure itself. Most EAGLE-style trainers use a fixed context window during train-time testing (TTT), with the oldest tokens dropped from the window as drafting proceeds. This means the drafter is trained on inputs where early prompt positions, including the sink token, are progressively pushed out of context as speculation depth increases. The drafter may therefore learn to reduce reliance on these positions over the chain. While our main results identify hidden-state magnitude accumulation as a key contributor to drift, this training-window effect may further amplify sink weakening. We leave a detailed study of how training-window construction interacts with attention drift to future work.

### 4.5 Attention Drift in Other Architectures

Sections 4.2 and 4.3 propose two fixes for EAGLE: post-norm and gated attention. However, our experiments so far cover only one model family. Here we examine whether these fixes generalize. [Figure 11](https://arxiv.org/html/2605.09992#S4.F11) shows that across drafter–target pairs, both architectural changes stabilize sink attention and self-token attention. The gated-attention models (Qwen 3.5 and the post-norm gated variant of Llama 3.1) settle at different absolute attention levels than the post-norm-only models, but they similarly do not exhibit progressive drift across the drafting chain.

![Refer to caption](https://arxiv.org/html/2605.09992v1/x9.png)

Figure 11: Attention Sink and Self-Token Attention for each model that fixes the attention drift.

MTP Heads. We observe a different drift profile in MTP heads compared to pre-norm EAGLE-3: drift sets in sharply within the first few speculation steps and then stabilizes at a new attention baseline, rather than progressing gradually across the entire chain ([Figure 7](https://arxiv.org/html/2605.09992#S3.F7)). Sink attention drops sharply at speculation start, and attention to the most recently drafted token rises and then plateaus after a few steps. This is notable because MTP uses a post-norm architecture, where we expected drift to be largely suppressed. We see a similar sharp-then-stabilize pattern in our gated-attention variants ([Figure 11](https://arxiv.org/html/2605.09992#S4.F11)), suggesting that post-norm combined with attention reweighting produces a distinct dynamic from gradual pre-norm drift.

![Refer to caption](https://arxiv.org/html/2605.09992v1/x10.png)

Figure 12: Magnitude drift on Qwen3.5 9B MTP heads

This is also consistent with our finding that magnitude accumulation is one contributor to drift but not the only mechanism. We focus our diagnosis and intervention on EAGLE-3, which trains a separate drafter with its own LM head post-hoc against the frozen target’s output distribution. MTP, in contrast, reuses the target’s LM head and is trained jointly with the target during pretraining, which may contribute to the observed effect. The exact training procedure for Qwen3.5’s MTP heads has not been publicly disclosed; characterizing the role of training loss versus architecture in the emergence of attention drift is left as future work.

## 5 Performance Impact

In this section, we evaluate different solutions to the attention and magnitude drift problem\. We report acceptance length excluding the bonus token except for SGLang benchmarks, to compare raw performance across speculative decoding models\. We trained drafters for four target models spanning different architectures: Llama 3\.1 8B \(dense\), Qwen 3 8B \(dense thinking\), Qwen 3\.5 9B \(GDN\-hybrid thinking\), and GPT\-OSS 20B \(sparse MoE thinking\)\.

Figure 13: Standard *Pre-norm* (Left) vs proposed *Post-norm* (Right) drafter architectures. Diagram blocks in both panels: $h_{\text{low}}$, $h_{\text{mid}}$, $h_{\text{high}}$, FC, RMSNorm: $h_{\text{in}}$, RMSNorm: $e_{\text{in}}$, $\oplus$, Attention, RMSNorm, MLP, RMSNorm, LM Head, $h_{\text{out}}$. The Post-norm panel additionally shows an RMS block on each of $h_{\text{low}}$, $h_{\text{mid}}$, $h_{\text{high}}$ before FC.

The proposed post-norm architecture places individual RMSNorms after each target hidden state $h_{low}, h_{mid}, h_{high}$ and accumulates the drafter’s hidden states *after* the RMSNorm ([Figure 13](https://arxiv.org/html/2605.09992#S5.F13)). By using post-norm, we were able to reduce the TTT length from 8 to 4, cutting training time by roughly one third without impacting performance. We also trained baseline models using the same training data and configuration with the regular pre-norm architecture for a fair comparison. Post-norm improved acceptance length consistently across all four models, with gains ranging up to $\mathbf{12\%}$ ([Figure 14](https://arxiv.org/html/2605.09992#S5.F14), full per-model results in Appendix [B](https://arxiv.org/html/2605.09992#A2)).

![Refer to caption](https://arxiv.org/html/2605.09992v1/x11.png)Figure 14: GPT\-OSS 20B results on SGLang; acceptance length includes the bonus token, temperature 0\.7\. The proposed post\-norm shows consistent improvements over standard pre\-norm\.

### 5\.1 Template Sensitivity

Performance of speculative decoding models varies heavily across different deployment settings, system prompts, and chat templates\. Since drafters are not trained on raw pre\-training data but on processed supervised fine\-tuning \(SFT\) conversations, we observed they heavily overfit to certain elements of the chat template\. Templates may be perturbed intentionally \(e\.g\., disabling reasoning tags to save tokens\) or unintentionally, as reported in inference engines and benchmarks\. We test how well our post\-norm architecture generalizes beyond the chat format and improves resilience to deployment conditions\.

![[Uncaptioned image]](https://arxiv.org/html/2605.09992v1/x12.png)

![Refer to caption](https://arxiv.org/html/2605.09992v1/x13.png)Figure 15: Template sensitivity of Llama 3\.1 8B drafter, pre\-norm and post\-norm\. Temp = 0\.7\.
![Refer to caption](https://arxiv.org/html/2605.09992v1/x14.png)Figure 16: System prompt length’s effect on accuracy for Llama 3\.1 8B\.

We designed two experiments to test how well our post\-norm architecture generalizes beyond the training\-time chat format and prompt length\. In the first, we removed the chat template and sink token to measure sensitivity \([Appendix D](https://arxiv.org/html/2605.09992#A4)\)\. Post\-norm was significantly more resilient to every kind of disruption, losing at most 5% accuracy in the worst case, whereas pre\-norm dropped 52% \([Figure 16](https://arxiv.org/html/2605.09992#S5.F16)\)\. We set the temperature to 0\.7 to prevent the token\-level repetition that we frequently observed under template perturbation\.

Is template sensitivity an artifact of attention sinks? To test this, we compared post\-norm against a Gated Attention model, which eliminates the attention sink and, thus, the sensitivity to the beginning\-of\-sequence \(BoS\) token\. The gated models still showed high template sensitivity and in fact developed *higher* sensitivity to system prompt length, with lower overall accuracy \([Figures 17 and 18](https://arxiv.org/html/2605.09992#S5.F18)\)\. This shows that removing the sink alone is not enough to reduce template fragility and that fixing hidden\-residual dynamics is required\. We observe similar template sensitivity patterns on Qwen3\-8B and GPT\-oss 20B \([Tables 5 and 6](https://arxiv.org/html/2605.09992#S5.T6)\), where post\-norm provides 5–35% gains under template perturbation, with the gap widening as templates become more disrupted\.

Table 5: Qwen3\-8B: Avg accepted draft tokens per verification round under different prompt\-template manipulations on MT\-Bench\.
Table 6: GPT\-oss 20B: Avg accepted draft tokens per verification round under different prompt\-template manipulations on MT\-Bench\.

Another factor in template sensitivity may lie in how the loss is constructed during training\. In the EAGLE series, the loss is computed only on assistant tokens; user tokens don’t contribute to the loss directly\. So while parameters are shared and trained on the full sequence, the model’s outputs at user positions are not directly constrained; only their utility for downstream assistant\-position predictions is\. This asymmetry could surface especially when the chat template is changed, so that the boundary between supervised and unsupervised positions shifts away from where the model was trained to expect it\.
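The asymmetry can be seen in a toy masked loss. The sketch below is an illustration with hard labels and hypothetical names, not the EAGLE training loss itself: it averages cross\-entropy only over positions flagged as assistant tokens, so whatever the model outputs at user positions has no direct effect on the loss value.

```python
import numpy as np


def masked_cross_entropy(logits: np.ndarray, targets: np.ndarray, loss_mask: np.ndarray) -> float:
    """Mean cross-entropy over positions where loss_mask == 1.

    logits: (T, V) drafter logits, targets: (T,) token ids,
    loss_mask: (T,) with 1 at assistant positions and 0 elsewhere.
    """
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    token_nll = -log_probs[np.arange(len(targets)), targets]
    return float((token_nll * loss_mask).sum() / max(loss_mask.sum(), 1))


# Hypothetical 4-token sequence: two user positions, then two assistant positions.
roles = ["user", "user", "assistant", "assistant"]
loss_mask = np.array([r == "assistant" for r in roles], dtype=float)
logits = np.zeros((4, 5))
targets = np.array([1, 2, 3, 4])

base = masked_cross_entropy(logits, targets, loss_mask)

# Changing the outputs at user positions leaves the loss untouched.
perturbed = logits.copy()
perturbed[:2] += np.random.default_rng(0).normal(size=(2, 5))
assert np.isclose(base, masked_cross_entropy(perturbed, targets, loss_mask))
```

In a real drafter the user positions still influence the loss indirectly through attention from later assistant positions, which this toy example does not model.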

![[Uncaptioned image]](https://arxiv.org/html/2605.09992v1/x15.png)

![Refer to caption](https://arxiv.org/html/2605.09992v1/x16.png)Figure 17: Gating is not sensitive to BoS, but it still overfits to other tokens in the template\.
![Refer to caption](https://arxiv.org/html/2605.09992v1/x17.png)Figure 18: Gated attention overfits to the system prompt, hurting accuracy badly\.

System Prompt Length\. In our second experiment, run at temperature 0, we varied the system prompt length \(Llama’s default prompt, trimmed to different lengths\)\. Post\-norm beat pre\-norm at every length tested \([Figure 16](https://arxiv.org/html/2605.09992#S5.F16)\)\. More importantly, post\-norm was substantially more robust to system prompt length variation: it degraded at most 5% from its peak, while pre\-norm dropped 13%\. Post\-norm’s worst case still beats pre\-norm’s best\.

The cross\-model differences in this sensitivity correlate with training\-time prompt distribution\. Llama’s default system prompt is around 500 characters, while Qwen3 and GPT\-OSS are trained with much shorter system prompts \(30–40 characters\)\. This may explain why Qwen3 and GPT\-OSS are more resilient to system\-prompt length changes overall, and why they perform best at prompts around 64 tokens \([Tables 7 and 8](https://arxiv.org/html/2605.09992#S5.T8)\)\.

Table 7: Qwen3\-8B: Avg accepted draft tokens per verification round as the system context grows from 0 to 256 tokens\.
Table 8: GPT\-oss 20B: Avg accepted draft tokens per verification round as the prepended system context grows from 0 to 256 tokens\.

### 5\.2 Long Context

Another challenge for speculative decoding models is long context\. Drafters are usually trained with a short context window, such as 4096 tokens, which keeps training cheap\. However, they fail catastrophically outside their trained context length\. LLMs have developed various techniques to handle this, one being sliding window attention \(SWA\), where the model’s effective context is a fixed window over the most recent tokens\. This makes intuitive sense for drafters in particular: they are weaker by design, and their predictions on easy tokens shouldn’t depend on long\-range context\. SWA has been studied for long\-context speculative decoding from both serving\-systems \[[13](https://arxiv.org/html/2605.09992#bib.bib10)\] and drafter\-training angles \[[20](https://arxiv.org/html/2605.09992#bib.bib3)\]\. We use SWA as a long\-context evaluation tool rather than as a method we propose; our post\-norm fix is orthogonal to these works and can be combined with existing long\-context speculative decoding techniques\.

![[Uncaptioned image]](https://arxiv.org/html/2605.09992v1/x18.png)

![Refer to caption](https://arxiv.org/html/2605.09992v1/x19.png)Figure 19: Accuracy of Llama 3\.1 8B drafter on various long\-context multi\-turn chat settings\.
![Refer to caption](https://arxiv.org/html/2605.09992v1/x20.png)Figure 20: Effect of window size on prediction accuracy\.

\[Figure 21 panels: SWA, SWA \+ BoS, SWA \+ System Prompt\.\]

Figure 21: Attended tokens \(marked in orange\) in different SWA implementations\.

We created a multi\-turn conversation benchmark with context length beyond our models’ trained length\. A multi\-turn setup isolates the model’s long\-context capability from degradation caused by unrelated filler text \(results summarized in [Figure 21](https://arxiv.org/html/2605.09992#S5.F21)\)\. Under full attention \(regular\), the pre\-norm model fails catastrophically, dropping to 0\.05 average acceptance length, while post\-norm drops to 0\.83, which is also unusable but 15× better\. With SWA, most of the accuracy is rescued: pre\-norm recovers 88% of its single\-turn baseline, while post\-norm matches and slightly exceeds its single\-turn performance\. Since pre\-norm develops a strong sink, carrying the BoS token \(SWA/BoS\) through the window helps it more \(up to 91% recovery\), with only a minor bump for post\-norm\. Following on this, we hypothesized that carrying the system prompt \(SWA/System\) would add additional context\. For post\-norm this added another 1%, but pre\-norm collapsed to near\-zero: it could not accommodate the wider range of positional embeddings introduced by the longer prefix\. Across all SWA conditions, post\-norm outperformed pre\-norm by 20%\.
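The three SWA variants differ only in which early positions stay visible. The following sketch builds the corresponding boolean attention masks; the function name and the `n_prefix` parameter are illustrative, and it assumes the carried prefix is simply the first `n_prefix` positions of the sequence \(with the system prompt starting at the BoS token\). How position ids are assigned to the carried tokens is not modeled here.

```python
import numpy as np


def swa_mask(seq_len: int, window: int, n_prefix: int = 0) -> np.ndarray:
    """Boolean causal mask; mask[q, k] is True when query q may attend to key k.

    Each query sees the `window` most recent positions (itself included) plus
    the first `n_prefix` positions of the sequence:
      n_prefix = 0                       -> plain SWA
      n_prefix = 1                       -> SWA + BoS
      n_prefix = length of system prompt -> SWA + System Prompt
    """
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    causal = k <= q
    in_window = (q - k) < window
    carried_prefix = k < n_prefix
    return causal & (in_window | carried_prefix)


# Keys visible to the last query of an 8-token sequence with a window of 3.
print(np.flatnonzero(swa_mask(8, 3)[7]).tolist())              # [5, 6, 7]
print(np.flatnonzero(swa_mask(8, 3, n_prefix=1)[7]).tolist())  # [0, 5, 6, 7]
print(np.flatnonzero(swa_mask(8, 3, n_prefix=3)[7]).tolist())  # [0, 1, 2, 5, 6, 7]
```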

True long\-context performance\. In addition to our multi\-turn benchmark, we evaluate on LongBench \[[2](https://arxiv.org/html/2605.09992#bib.bib29)\] across three task categories: summarization, few\-shot learning, and coding\. Absolute task scores are lower than in the multi\-turn setting due to the inherent difficulty of long\-context understanding\. Post\-norm outperformed pre\-norm by 20–25% across all categories and context lengths \([Table 9](https://arxiv.org/html/2605.09992#S5.T9)\)\. On the government report summarization task, adding the system prompt yielded an additional 8% gain, suggesting that informative prefixes can further help SWA\-based drafters recover context, though we leave a broader study of this effect to future work\.

Table 9: LongBench: avg accepted draft tokens per round, per task and SWA mode\. Window=1024\.

How much does the drafter rely on long\-range context vs\. recent tokens? To answer this, we varied the SWA window size \([Figure 20](https://arxiv.org/html/2605.09992#S5.F20)\)\. Across all three window sizes tested, post\-norm maintained a consistent gap over pre\-norm\. Even a small 256\-token window recovers 80% of the full\-context baseline; performance saturates around 1024 tokens, with diminishing returns beyond that\. This suggests the model mostly cares about recent tokens and SWA can substantially reduce compute without meaningfully hurting drafter accuracy\.

## 6 Related Work

Speculative decoding accelerates LLM inference by drafting candidate tokens and verifying them in parallel \[[8](https://arxiv.org/html/2605.09992#bib.bib13)\]\. Several drafter designs have been proposed: Medusa \[[3](https://arxiv.org/html/2605.09992#bib.bib15)\] attaches multiple prediction heads to the target, Hydra \[[1](https://arxiv.org/html/2605.09992#bib.bib16)\] introduces sequential dependence between these heads, and DFlash \[[5](https://arxiv.org/html/2605.09992#bib.bib11)\] uses diffusion models to predict sequences in one forward pass\. We focus on EAGLE\-3 as the dominant auto\-regressive drafter design\.

Attention sinks and patterns\. Xiao et al\. \[[18](https://arxiv.org/html/2605.09992#bib.bib1)\] were the first to observe attention sinks and show that preserving these sinks is essential for long\-context perplexity under sliding\-window attention\. Qiu et al\. \[[12](https://arxiv.org/html/2605.09992#bib.bib26)\] proposed gated attention as a mechanism to reduce the model’s reliance on a fixed sink token\. Both works study sink behavior in the target model itself; we are the first, to our knowledge, to characterize attention behavior in speculative decoding drafters and identify a distinct failure mode \(attention drift\) specific to the chain\-residual structure of these models\.

Normalization in transformers\. The placement of normalization relative to residual connections has been studied extensively in the standard transformer setting\. Xiong et al\. \[[19](https://arxiv.org/html/2605.09992#bib.bib24)\] and Wu et al\. \[[16](https://arxiv.org/html/2605.09992#bib.bib23)\] showed that pre\-norm enables more stable training than post\-norm, leading most modern LLMs to adopt pre\-norm\. Our work revisits this question for speculative decoding drafters, where iterative self\-feedback across speculation steps changes the dynamics\. We show that post\-norm becomes necessary to prevent magnitude accumulation that drives attention drift\.

## 7 Limitations

Our study has two main limitations\. First, we focus on EAGLE\-3 as the dominant post\-training auto\-regressive drafter in production\. We observe attention drift in MTP heads, suggesting the phenomenon extends beyond EAGLE\-3, but we do not investigate the alternative mechanisms at play there or evaluate our fix across other drafter designs\. Extending the analysis to MTP, Hydra, and other variants is a natural direction for future work\. Second, our experiments are limited to models below 120B parameters due to compute constraints; behavior at larger scale remains unverified\.

## 8 Conclusion

We identified attention drift in auto\-regressive speculative decoding drafters: as the drafter generates successive tokens during speculation, attention progressively migrates from the sink onto recently\-generated tokens\. We characterized this phenomenon across model families and showed that it is closely tied to monotonic magnitude growth in the unnormalized residual path between speculation steps, which causes the drafter to behave more like additional transformer layers stacked on the target than as a standalone auto\-regressive predictor\. Two simple architectural interventions, post\-norm on the drafter output and RMSNorm before the target hidden projection, address the underlying dynamics and yield consistent improvements, with the largest gains under deployment shift\. Crucially, our experiments show that fixing drift alone is not enough to improve performance; attention drift is merely a symptom of a deeper issue, not its cause\. Post\-norm helps because it addresses the underlying dynamics: it prevents magnitude accumulation and stops the drafter from learning depth\-specific transformations, which together improve both performance and robustness\. We hope that naming attention drift gives the community a useful diagnostic for drafter robustness, and that post\-norm proves a useful default for production systems\.

## Acknowledgments and Disclosure of Funding

We thank fal and Lambda for the compute grants that supported this research\.

## References

- \[1\] \(2024\) Hydra: sequentially\-dependent draft heads for medusa decoding\. In First Conference on Language Modeling, External Links: [Link](https://openreview.net/forum?id=FbhjirzvJG) Cited by: [§6](https://arxiv.org/html/2605.09992#S6.p1.1)\.
- \[2\] Y\. Bai, X\. Lv, J\. Zhang, H\. Lyu, J\. Tang, Z\. Huang, Z\. Du, X\. Liu, A\. Zeng, L\. Hou, et al\. \(2024\) Longbench: a bilingual, multitask benchmark for long context understanding\. In Proceedings of the 62nd annual meeting of the association for computational linguistics \(volume 1: Long papers\), pp\. 3119–3137\. Cited by: [§5\.2](https://arxiv.org/html/2605.09992#S5.SS2.p3.3)\.
- \[3\] T\. Cai, Y\. Li, Z\. Geng, H\. Peng, J\. D\. Lee, D\. Chen, and T\. Dao \(2024\) Medusa: simple LLM inference acceleration framework with multiple decoding heads\. In Forty\-first International Conference on Machine Learning, External Links: [Link](https://openreview.net/forum?id=PEpbUobfJv) Cited by: [§6](https://arxiv.org/html/2605.09992#S6.p1.1)\.
- \[4\] C\. Chen, S\. Borgeaud, G\. Irving, J\. Lespiau, L\. Sifre, and J\. Jumper \(2023\) Accelerating large language model decoding with speculative sampling\. arXiv preprint arXiv:2302\.01318\. Cited by: [§1](https://arxiv.org/html/2605.09992#S1.p1.1)\.
- \[5\] J\. Chen, Y\. Liang, and Z\. Liu \(2026\) DFlash: block diffusion for flash speculative decoding\. arXiv preprint arXiv:2602\.06036\. Cited by: [§6](https://arxiv.org/html/2605.09992#S6.p1.1)\.
- \[6\] F\. Gloeckle, B\. Y\. Idrissi, B\. Rozière, D\. Lopez\-Paz, and G\. Synnaeve \(2024\) Better & faster large language models via multi\-token prediction\. In Forty\-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21\-27, 2024, R\. Salakhutdinov, Z\. Kolter, K\. A\. Heller, A\. Weller, N\. Oliver, J\. Scarlett, and F\. Berkenkamp \(Eds\.\), Proceedings of Machine Learning Research, pp\. 15706–15734\. External Links: [Link](https://proceedings.mlr.press/v235/gloeckle24a.html) Cited by: [§1](https://arxiv.org/html/2605.09992#S1.p2.1), [§3](https://arxiv.org/html/2605.09992#S3.p4.1)\.
- \[7\] W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. Gonzalez, H\. Zhang, and I\. Stoica \(2023\) Efficient memory management for large language model serving with pagedattention\. In Proceedings of the 29th symposium on operating systems principles, pp\. 611–626\. Cited by: [§2](https://arxiv.org/html/2605.09992#S2.p2.1)\.
- \[8\] Y\. Leviathan, M\. Kalman, and Y\. Matias \(2023\) Fast inference from transformers via speculative decoding\. In International Conference on Machine Learning, pp\. 19274–19286\. Cited by: [§1](https://arxiv.org/html/2605.09992#S1.p1.1), [§6](https://arxiv.org/html/2605.09992#S6.p1.1)\.
- \[9\] Y\. Li, F\. Wei, C\. Zhang, and H\. Zhang \(2024\) Eagle\-2: faster inference of language models with dynamic draft trees\. In Proceedings of the 2024 conference on empirical methods in natural language processing, pp\. 7421–7432\. Cited by: [§2](https://arxiv.org/html/2605.09992#S2.p2.1)\.
- \[10\] Y\. Li, F\. Wei, C\. Zhang, and H\. Zhang \(2024\) EAGLE: speculative sampling requires rethinking feature uncertainty\. In Forty\-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21\-27, 2024, R\. Salakhutdinov, Z\. Kolter, K\. A\. Heller, A\. Weller, N\. Oliver, J\. Scarlett, and F\. Berkenkamp \(Eds\.\), Proceedings of Machine Learning Research, pp\. 28935–28948\. External Links: [Link](https://proceedings.mlr.press/v235/li24bt.html) Cited by: [§2](https://arxiv.org/html/2605.09992#S2.p2.1)\.
- \[11\] Y\. Li, F\. Wei, C\. Zhang, and H\. Zhang \(2026\) EAGLE\-3: scaling up inference acceleration of large language models via training\-time test\. In The Thirty\-ninth Annual Conference on Neural Information Processing Systems, External Links: [Link](https://openreview.net/forum?id=4exx1hUffq) Cited by: [§1](https://arxiv.org/html/2605.09992#S1.p2.1), [§2](https://arxiv.org/html/2605.09992#S2.p2.1)\.
- \[12\] Z\. Qiu, Z\. Wang, B\. Zheng, Z\. Huang, K\. Wen, S\. Yang, R\. Men, L\. Yu, F\. Huang, S\. Huang, D\. Liu, J\. Zhou, and J\. Lin \(2025\) Gated attention for large language models: non\-linearity, sparsity, and attention\-sink\-free\. In Advances in Neural Information Processing Systems, D\. Belgrave, C\. Zhang, H\. Lin, R\. Pascanu, P\. Koniusz, M\. Ghassemi, and N\. Chen \(Eds\.\), Vol\. 38, pp\. 100092–100118\. External Links: [Link](https://proceedings.neurips.cc/paper_files/paper/2025/file/904e89bb4e632e75fb47f093b620b257-Paper-Conference.pdf) Cited by: [§6](https://arxiv.org/html/2605.09992#S6.p2.1)\.
- \[13\] R\. Sadhukhan, J\. Chen, Z\. Chen, V\. Tiwari, R\. Lai, J\. Shi, I\. E\. Yen, A\. May, T\. Chen, and B\. Chen \(2025\) MagicDec: breaking the latency\-throughput tradeoff for long context generation with speculative decoding\. In ICLR, External Links: [Link](https://openreview.net/forum?id=CS2JWaziYr) Cited by: [§5\.2](https://arxiv.org/html/2605.09992#S5.SS2.p1.1)\.
- \[14\] K\. Team, G\. Chen, Y\. Zhang, J\. Su, W\. Xu, S\. Pan, Y\. Wang, Y\. Wang, G\. Chen, B\. Yin, et al\. \(2026\) Attention residuals\. arXiv preprint arXiv:2603\.15031\. Cited by: [§4\.1](https://arxiv.org/html/2605.09992#S4.SS1.p3.8)\.
- \[15\] A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\) Attention is all you need\. Advances in neural information processing systems 30\. Cited by: [§2](https://arxiv.org/html/2605.09992#S2.p2.1)\.
- \[16\] X\. Wu, A\. Ajorlou, Y\. Wang, S\. Jegelka, and A\. Jadbabaie \(2024\) On the role of attention masks and layernorm in transformers\. In The Thirty\-eighth Annual Conference on Neural Information Processing Systems, External Links: [Link](https://openreview.net/forum?id=lIH6oCdppg) Cited by: [§6](https://arxiv.org/html/2605.09992#S6.p3.1)\.
- \[17\] G\. Xiao, J\. Tang, J\. Zuo, junxian guo, S\. Yang, H\. Tang, Y\. Fu, and S\. Han \(2025\) DuoAttention: efficient long\-context LLM inference with retrieval and streaming heads\. In The Thirteenth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=cFu7ze7xUm) Cited by: [§1](https://arxiv.org/html/2605.09992#S1.p1.1)\.
- \[18\] G\. Xiao, Y\. Tian, B\. Chen, S\. Han, and M\. Lewis \(2024\) Efficient streaming language models with attention sinks\. In The Twelfth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=NG7sS51zVF) Cited by: [§2](https://arxiv.org/html/2605.09992#S2.p3.1), [§6](https://arxiv.org/html/2605.09992#S6.p2.1)\.
- \[19\] R\. Xiong, Y\. Yang, D\. He, K\. Zheng, S\. Zheng, C\. Xing, H\. Zhang, Y\. Lan, L\. Wang, and T\. Liu \(2020\) On layer normalization in the transformer architecture\. In International conference on machine learning, pp\. 10524–10533\. Cited by: [§6](https://arxiv.org/html/2605.09992#S6.p3.1)\.
- \[20\] P\. Yang, C\. Du, F\. Zhang, H\. Wang, T\. Pang, C\. Du, and B\. An \(2025\) LongSpec: long\-context lossless speculative decoding with efficient drafting and verification\. In ES\-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models, External Links: [Link](https://openreview.net/forum?id=GFN9PWbfHs) Cited by: [§1](https://arxiv.org/html/2605.09992#S1.p1.1), [§5\.2](https://arxiv.org/html/2605.09992#S5.SS2.p1.1)\.
- \[21\] L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. Xing, et al\. \(2023\) Judging llm\-as\-a\-judge with mt\-bench and chatbot arena\. Advances in neural information processing systems 36, pp\. 46595–46623\. Cited by: [§4](https://arxiv.org/html/2605.09992#S4.p1.1)\.
- \[22\] L\. Zheng, L\. Yin, Z\. Xie, C\. Sun, J\. Huang, C\. H\. Yu, S\. Cao, C\. Kozyrakis, I\. Stoica, J\. E\. Gonzalez, et al\. \(2024\) Sglang: efficient execution of structured language model programs\. Advances in neural information processing systems 37, pp\. 62557–62583\. Cited by: [§2](https://arxiv.org/html/2605.09992#S2.p2.1)\.

## Appendix A Gated attention

In the gated attention variant of our models, we add an optional per\-element sigmoid gate over the hidden dimension to the EAGLE3 draft self\-attention\. Let $\mathbf{x}\in\mathbb{R}^{T\times d_{\mathrm{in}}}$ denote the input stream to the attention block\. Because EAGLE3 fuses the verifier hidden state with the input embedding inside the decoder layer, $\mathbf{x}$ is the concatenation of the two and $d_{\mathrm{in}}=2d$, where $d$ is the draft hidden size\. The query, key, and value projections operate on $\mathbf{x}$ as in standard Llama, and we write the multi\-head attention output *before* the output projection as

$$\mathbf{A}\;=\;\mathrm{Attn}(\mathbf{x}\mathbf{W}_{Q},\mathbf{x}\mathbf{W}_{K},\mathbf{x}\mathbf{W}_{V})\;\in\;\mathbb{R}^{T\times h\,d_{h}},$$

with $h$ heads of dimension $d_{h}$\. Our modification introduces a learned gate $\mathbf{W}_{g}\in\mathbb{R}^{d_{\mathrm{in}}\times h\,d_{h}}$ that shares its input with the QKV projections,

$$\mathbf{g}\;=\;\sigma\!\big(\mathbf{x}\mathbf{W}_{g}\big),\tag{1}$$

where $\sigma(\cdot)$ denotes the element\-wise sigmoid, $\sigma(z)=\dfrac{1}{1+e^{-z}}$\. The gate is applied to the concatenated head output and *precedes* the output projection $\mathbf{W}_{O}$:

$$\mathbf{y}\;=\;\big(\mathbf{g}\odot\mathbf{A}\big)\,\mathbf{W}_{O}.\tag{2}$$
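Equations \(1\) and \(2\) can be sketched in NumPy as follows. This is a simplified illustration rather than the EAGLE3 code: rotary position embeddings, grouped\-query attention, and the KV cache are omitted, the function and variable names are hypothetical, and the weights are random.

```python
import numpy as np


def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def gated_attention(x, w_q, w_k, w_v, w_g, w_o, n_heads):
    """y = (sigmoid(x W_g) * Attn(x W_Q, x W_K, x W_V)) W_O with a causal mask."""
    T = x.shape[0]
    d_h = w_q.shape[1] // n_heads

    def split(m):  # (T, h*d_h) -> (h, T, d_h)
        return m.reshape(T, n_heads, d_h).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_h)        # (h, T, T)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)      # True above the diagonal
    scores = np.where(future, -np.inf, scores)
    heads = softmax(scores) @ v                             # (h, T, d_h)
    a = heads.transpose(1, 0, 2).reshape(T, n_heads * d_h)  # concatenated heads A
    g = 1.0 / (1.0 + np.exp(-(x @ w_g)))                    # Eq. (1)
    return (g * a) @ w_o                                    # Eq. (2)


rng = np.random.default_rng(0)
T, d, n_heads, d_h = 5, 8, 2, 4
d_in = 2 * d  # hidden state concatenated with the input embedding
x = rng.normal(size=(T, d_in))
w_q, w_k, w_v, w_g = (0.1 * rng.normal(size=(d_in, n_heads * d_h)) for _ in range(4))
w_o = 0.1 * rng.normal(size=(n_heads * d_h, d))
y = gated_attention(x, w_q, w_k, w_v, w_g, w_o, n_heads)
print(y.shape)  # (5, 8)
```

Because the gate is computed from the same position's input and multiplies the head output element\-wise, a position can suppress its attention output without routing probability mass to a sink token.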

## Appendix B Benchmark Results

We report SGLang acceptance length \(including the bonus token\) per benchmark for each target model trained in this work\. Tables 10–13 cover GPT\-oss 20B, Qwen3\.5 9B, Qwen3 8B, and Llama 3\.1 8B respectively, sweeping the relevant decoding modes for each \(low/medium reasoning effort for GPT\-oss; thinking/no\-thinking for Qwen3 and Qwen3\.5\)\. Across all four models and every decoding mode, post\-norm matches or improves on pre\-norm, with one exception: Llama 3\.1 8B on MATH\-500, where the regression falls within evaluation noise\. Gains are largest on math and code \(e\.g\., GSM8K \+10% on GPT\-oss low, HumanEval \+5% on Qwen3 8B no\-think\) and smallest on Alpaca\-style open\-ended chat, consistent with the harder, longer\-horizon completions benefiting more from a stable drafter\.

Table 10: SGLang acceptance length for GPT\-oss 20B \+ EAGLE3 across pre/post\-norm draft and low/medium reasoning effort\. Best per row in bold\.
Table 11: SGLang acceptance length for Qwen3\.5 9B \+ EAGLE3 across pre/post\-norm draft and thinking/no\-thinking decoding\. Best per row in bold\.
Table 12: SGLang acceptance length for Qwen3 8B \+ EAGLE3 across pre/post\-norm draft and thinking/no\-thinking decoding\. Best per row in bold\.
Table 13: SGLang acceptance length per benchmark for Llama 3\.1 8B with EAGLE3 \(steps=7, topk=1, draft=8\)\. Best per row in bold\. The single regression on MATH\-500 is within evaluation noise\.

### B\.1 Model Long\-Context SWA Sensitivity

To our surprise, the Gated Post\-Norm model did not suffer at all from out\-of\-distribution inference lengths and showed no accuracy loss in the full\-context attention case\. This could be attributed to its low attention entropy: it learns to attend to only a couple of tokens, and this sharpened attention pattern may filter the destructive signal via the sigmoid gate\. This also suggests the model relies less on absolute positional information than the baseline does\.

Table 14: Llama 3\.1 8B: Avg accepted draft tokens per round at 8k context, comparing full attention to sliding\-window variants with optional BoS / system\-prompt carry\.
Table 15: GPT\-oss 20B: Avg accepted draft tokens per round at 16k context, comparing full attention to sliding\-window variants\.
Table 16: Qwen3\-8B: Avg accepted draft tokens per round at 8k context, comparing full attention to sliding\-window variants with optional BoS / system\-prompt carry\.

### B\.2 SWA Size / Accuracy

Section 5\.2 reports the window\-size sweep for Llama 3\.1 8B at 8k context\. Table 18 confirms the same pattern holds for Qwen3 8B at a 32k target context: post\-norm leads pre\-norm at every window size, and gains saturate around w=1024 with only marginal improvement at w=2048\.

Table 17: Llama\-3\.1 8B: Avg accepted draft tokens per round under SWA \+ BoS carry as the window size grows from 256 to 2048 \(target context 8192\)\.
Table 18: Qwen3\-8B: Avg accepted draft tokens per round under SWA \+ BoS carry as the window size grows from 256 to 2048 \(target context 32768\)\.

### B\.3 LongBench Results

We further break down LongBench\-E results for Llama 3\.1 8B by prompt\-length bucket, using each architecture’s preferred SWA mode \(SWA\+BoS for pre\-norm, SWA\+System Prompt for post\-norm\)\. Post\-norm outperforms pre\-norm in every bucket from 0–4k all the way to 32–36k\. This indicates that the post\-norm advantage seen on the multi\-turn benchmark in Section 5\.2 transfers to true long\-context tasks rather than being specific to the multi\-turn setting\.

Table 19: Llama 3\.1 8B, LongBench\-E: avg accepted draft tokens per round vs\. prompt length, averaged over all tasks\. Each model uses its preferred SWA mode \(pre\-norm: SWA \+ BoS, post\-norm: SWA \+ Sys Prompt\)\. $n$ is the number of verification rounds in each bucket\.

| Ctx \(tokens\) | Pre\-norm \(SWA \+ BoS\): acceptance length \($\tau$\) | Pre\-norm: $n$ | Post\-norm \(SWA \+ Sys Prompt\): acceptance length \($\tau$\) | Post\-norm: $n$ |
| --- | --- | --- | --- | --- |
| 0\-4k | 1\.60 | 7838 | 1\.99 | 6802 |
| 4\-8k | 1\.63 | 11146 | 2\.07 | 9563 |
| 8\-12k | 1\.30 | 9927 | 1\.73 | 8354 |
| 12\-16k | 1\.28 | 6242 | 1\.60 | 5482 |
| 16\-20k | 1\.82 | 3000 | 1\.98 | 2839 |
| 20\-24k | 1\.94 | 3855 | 1\.96 | 3831 |
| 24\-28k | 1\.77 | 2744 | 2\.20 | 2374 |
| 28\-32k | 1\.73 | 1560 | 2\.25 | 1309 |
| 32\-36k | 1\.21 | 542 | 1\.80 | 428 |

## Appendix C Training

We used a modified instance of the SpecForge repository to train our models\. This repository implements the Train\-Time\-Test method for EAGLE\-3 training and is the framework used to train current state\-of\-the\-art EAGLE\-3 models\.

Llama and Qwen3 variants were trained on the Open\-PerfectBlend dataset, with answers regenerated by the target model\. The dataset consists of over 1\.4M samples\. We trained for 2 epochs with a learning rate of $1.5\times 10^{-4}$ and an effective batch size of 4\. On average, training took around 48 H200 hours for each model with 8–9B parameters\.

Qwen3\.5 and GPT\-oss variants were trained on the Nemotron post\-training dataset, with answers regenerated by the target model\. The dataset consists of over 1\.4M samples\. The maximum sequence length was 8K tokens\. We trained for 1 epoch with a learning rate of $1.5\times 10^{-4}$\. On average, training took around 36 to 48 H200 hours depending on model size \(20B and 120B\)\.

## Appendix D Template Perturbations

```
--- regular ---
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

--- no_bos ---
<|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|>...

--- no_template ---
<|begin_of_text|>Question: What is the capital of France?
Answer:

--- no_bos_no_template ---
Question: What is the capital of France?
Answer:
```
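The four variants above can be generated programmatically. The helper below is a hypothetical sketch, not the authors' evaluation script. It assumes that the elided part of the `no_bos` variant is identical to the regular prompt minus the BoS token, and that the assistant header is followed by a blank line, as in the regular listing.

```python
BOS = "<|begin_of_text|>"


def chat_prompt(system: str, user: str) -> str:
    """Llama 3 style chat prompt without the leading BoS token."""
    return (
        f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>"
        f"<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


def plain_prompt(user: str) -> str:
    """Template-free prompt in the Question/Answer form shown above."""
    return f"Question: {user}\nAnswer:"


def perturbations(system: str, user: str) -> dict:
    """Map each perturbation name to the prompt string fed to the model."""
    return {
        "regular": BOS + chat_prompt(system, user),
        "no_bos": chat_prompt(system, user),  # assumption: rest of prompt unchanged
        "no_template": BOS + plain_prompt(user),
        "no_bos_no_template": plain_prompt(user),
    }


variants = perturbations("You are a helpful assistant.", "What is the capital of France?")
for name, prompt in variants.items():
    print(f"--- {name} ---\n{prompt}\n")
```

If the tokenizer adds a BoS token automatically, that behavior would need to be disabled for the `no_bos` variants to take effect.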