P-Cast Precision in FP8 Attention: Sink-Induced Collapse and the Optimality of S=2^8

arXiv cs.AI Papers

Summary

This paper analyzes precision loss in FP8 attention caused by the attention sink phenomenon when the softmax output is cast to FP8 (E4M3). It shows that forward KV iteration causes non-sink attention values to underflow, and explains why reverse iteration and the static scaling factor S=256, both already deployed in FlashAttention-3/4, eliminate that underflow, with a 3-10x MSE improvement at moderate sink strengths.

arXiv:2606.06521v1 Announce Type: cross Abstract: FP8 (E4M3) acceleration for attention computation offers significant throughput gains, but the 3-bit mantissa introduces precision challenges when the softmax probability matrix P is cast to FP8 before the P*V matrix multiplication. We analyze two implementation choices that affect output precision under the Attention Sink phenomenon: (1) the KV block iteration order, and (2) the static scaling factor applied to P before casting. We show that forward KV iteration causes "P-collapse" -- to leading order, a fraction Phi(Delta + delta_k - 6.93 - ln S) of non-sink P values underflow to zero, where the small shift delta_k ~ 1 (for k_sink = 4) is the expected within-sink-block score maximum -- and that reverse iteration removes it, with a zero-underflow guarantee when reverse is combined with S = 256. We further give a constructive characterization of S = 256 = 2^8 as the static scale that simultaneously satisfies (i) bit-exact IEEE 754 scaling, (ii) the lower envelope of a sawtooth function dp(S) over the E4M3 number line (dp = 2^-4, the minimum worst-case quantization step), and (iii) the maximum normal-range coverage among bit-exact (2^k) scales (a non-bit-exact scale such as 448 attains slightly higher coverage). Both optimizations are already deployed in FlashAttention-3/4 on engineering grounds; our contribution is a quantitative account of why these choices are good and a closed-form threshold Delta_c = 6.93 + ln S - delta_k for predicting kernel-level precision loss. Kernel-faithful experiments (Q, K, V in FP32 to isolate the P-cast effect) show 3-10x MSE improvement at moderate sink strengths, and paired tests confirm both fixes saturate to the same precision floor when combined.

Cached at: 06/08/26, 09:15 AM

# Sink-Induced Collapse and the Optimality of 𝑆=2⁸
Source: [https://arxiv.org/html/2606.06521](https://arxiv.org/html/2606.06521)
## P-Cast Precision in FP8 Attention: Sink-Induced Collapse and the Optimality of $S=2^{8}$

###### Abstract

FP8 (E4M3) acceleration for attention computation offers significant throughput gains, but the 3-bit mantissa introduces precision challenges when the softmax probability matrix $P$ is cast to FP8 before the $P\cdot V$ matrix multiplication. We analyze two implementation choices that affect output precision under the *Attention Sink* phenomenon: (1) the KV block iteration order, and (2) the static scaling factor applied to $P$ before casting.

We show that forward KV iteration causes *P-collapse*: to leading order, a fraction $\Phi(\Delta+\delta_{k}-6.93-\ln S)$ of non-sink $P$ values underflow to zero, where the small shift $\delta_{k}\approx 1$ (for $k_{\text{sink}}=4$) is the expected within-sink-block score maximum. Reverse iteration removes this collapse, with a zero-underflow guarantee when reverse is combined with $S=256$. We further give a constructive characterization of $S=256=2^{8}$ as the static scale that simultaneously satisfies (i) bit-exact IEEE 754 scaling, (ii) the lower envelope of a sawtooth function $dp(S)$ over the E4M3 number line ($dp=2^{-4}$, the minimum worst-case quantization step), and (iii) the maximum normal-range coverage *among bit-exact ($2^{k}$) scales* (a non-bit-exact scale such as $448$ attains slightly higher coverage; §[5](https://arxiv.org/html/2606.06521#S5)). Both optimizations are already deployed in FlashAttention-3/4 on engineering grounds; our contribution is a quantitative account of *why* these choices are good and a closed-form threshold $\Delta_{c}=6.93+\ln S-\delta_{k}$ for predicting kernel-level precision loss. Kernel-faithful experiments ($Q,K,V$ in FP32 to isolate the P-cast effect) show $3$–$10\times$ MSE improvement at moderate sink strengths, and paired tests confirm both fixes saturate to the same precision floor when combined, which motivated updating the hpc-ops kernel from $S=1$ to $S=256$.

## 1 Introduction

### 1.1 Motivation

The relentless scaling of large language models has made low-precision arithmetic essential for both training and inference throughput. Modern GPU architectures (NVIDIA Hopper, Blackwell; AMD MI300) now provide native FP8 tensor core support, operating on two formats: E4M3 (4-bit exponent, 3-bit mantissa) for forward computation and E5M2 for gradients [[5](https://arxiv.org/html/2606.06521#bib.bib5)].

FlashAttention-3 [[6](https://arxiv.org/html/2606.06521#bib.bib6)] exploits this by casting the softmax output matrix $P$ to E4M3 before the $P\cdot V$ matmul, while keeping the output accumulator in FP32. This design creates a *precision bottleneck at the P-cast step*: E4M3's 3-bit mantissa provides only 8 representable values per binade, giving a relative precision of just $12.5\%$, roughly $16\times$ worse than BF16.

In this regime, two implementation “details” that are inconsequential in higher precision become first-order precision determinants:

1. KV block iteration order: whether the online softmax processes KV blocks forward ($0\to N$) or reverse ($N\to 0$).
2. P-scaling factor: the constant $S$ by which $P$ is multiplied before the E4M3 cast (and divided back in the epilogue).

### 1.2 The Attention Sink Problem

The *Attention Sink* [[8](https://arxiv.org/html/2606.06521#bib.bib8)] phenomenon, in which initial tokens receive disproportionately large attention weights, interacts destructively with FP8 quantization. Under forward iteration, the sink's high logit score inflates the running softmax maximum $m$, forcing all subsequent $P$ values below E4M3's representable range. This *P-collapse* is a threshold effect: it activates around sink strength $\Delta\approx 6$–$7$, where the cast zeroes the majority of non-sink $P$ values while those positions still carry roughly half of the total probability mass.

### 1.3 Contributions

This paper provides a quantitative account of both implementation choices:

1. P-collapse quantification (§[3](https://arxiv.org/html/2606.06521#S3)): We derive a closed-form expression $F(\Delta,S)=\Phi(\Delta+\delta_{k}-6.93-\ln S)$ (with $\delta_{k}$ a small within-sink extreme-value correction) for the fraction of non-sink P values that underflow, and a leading-order MSE estimate showing the effect peaks in a narrow transition region.
2. Reverse iteration sufficiency (§[4](https://arxiv.org/html/2606.06521#S4)): We show that reverse iteration (rigorously, combined with $S=256$) keeps all P values representable for any practical sequence length, with an explicit probabilistic guarantee.
3. $S=256$ characterization (§[5](https://arxiv.org/html/2606.06521#S5)): We introduce the $dp(S)$ function (the normalized maximum quantization step) and show it forms a sawtooth with power-of-two values on its lower envelope. This yields a constructive characterization of $S=256$ via three jointly imposed conditions (bit-exactness, minimum quantization step, maximum normal coverage).
4. Kernel-faithful validation (§[6](https://arxiv.org/html/2606.06521#S6)): Using a simulation that exactly matches production kernel semantics (gSum from FP32 pre-cast P), we measure $3$–$10\times$ MSE improvement and observe saturation when both fixes are applied.

### 1.4 Positioning

Both optimizations are already deployed in FlashAttention-3/4 [[6](https://arxiv.org/html/2606.06521#bib.bib6)] on engineering grounds (register savings, range utilization). This paper is therefore not a proposal of new techniques, but a quantitative explanation of *why* those choices work and a closed-form diagnostic ($\Delta_{c}=6.93+\ln S-\delta_{k}$) that practitioners can apply to predict P-cast failure regimes. The same analysis motivated updating the hpc-ops kernel from $S=1$ to $S=256$ and could inform corresponding changes in FlashInfer ($S=448$) and TensorRT-LLM XQA ($S=448$).

## 2 Background and Related Work

### 2.1 FP8 E4M3 Number Format

The E4M3 format [[5](https://arxiv.org/html/2606.06521#bib.bib5)] allocates 1 sign bit, 4 exponent bits (bias $b=7$), and 3 mantissa bits. The complete positive representable set contains 126 values:

Subnormals (exponent field $E=0$, 7 values):

$$v=2^{1-b}\cdot\frac{M}{2^{3}}=2^{-6}\cdot\frac{M}{8},\quad M\in\{1,\ldots,7\} \tag{1}$$

spanning $[2^{-9},\,7\cdot 2^{-9}]=[0.00195,\,0.01367]$.

Normals (exponent field $E=1,\ldots,15$; 119 values):

$$v=2^{E-b}\cdot\left(1+\frac{M}{2^{3}}\right),\quad M\in\{0,\ldots,7\} \tag{2}$$

with $E=15$, $M=7$ reserved as NaN, giving $\max=448=1.75\times 2^{8}$.

Key thresholds for P quantization:

Within any normal binade $[2^{n},2^{n+1})$, the spacing (LSB) is $2^{n-3}$, yielding exactly 8 uniformly spaced representable values. The subnormal region has uniform spacing $2^{-9}$ with only 7 values, so its relative precision is much coarser.
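The counts quoted above can be checked by brute-force enumeration of Eqs. (1) and (2). The following is a minimal sketch; the function name `e4m3_positive_values` is a hypothetical helper introduced here for illustration, not part of any library.

```python
def e4m3_positive_values(bias=7):
    """Enumerate every positive finite E4M3 value in ascending order."""
    values = []
    for exp_field in range(16):            # 4 exponent bits
        for mant in range(8):              # 3 mantissa bits
            if exp_field == 15 and mant == 7:
                continue                   # reserved NaN encoding
            if exp_field == 0:
                if mant == 0:
                    continue               # zero is not a positive value
                values.append(2.0 ** (1 - bias) * mant / 8.0)                   # Eq. (1)
            else:
                values.append(2.0 ** (exp_field - bias) * (1.0 + mant / 8.0))   # Eq. (2)
    return values


values = e4m3_positive_values()
subnormals = [v for v in values if v < 2.0 ** -6]
print(len(values), len(subnormals), min(values), max(values))
# prints: 126 7 0.001953125 448.0
```

The smallest value is $2^{-9}$, so under round-to-nearest any scaled probability below half of that, $2^{-10}$, is cast to zero.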

### 2.2 Online Softmax and FP8 FlashAttention

FlashAttention [[3](https://arxiv.org/html/2606.06521#bib.bib3), [2](https://arxiv.org/html/2606.06521#bib.bib2)] computes exact attention via tiled online softmax. Algorithm [1](https://arxiv.org/html/2606.06521#alg1) shows the FP8 variant, highlighting the P-cast step and the critical separation between $\ell$ (FP32 pre-cast) and $O$ (from cast P).

Algorithm 1: FP8 Online Softmax Attention (single query row)

1. Inputs: $Q\in\mathbb{R}^{1\times d}$, $K,V\in\mathbb{R}^{N\times d}$ (FP8), scale $S$, block size $B$
2. $m\leftarrow-\infty$, $\ell\leftarrow 0$, $O\leftarrow\mathbf{0}\in\mathbb{R}^{1\times d}$ $\triangleright$ FP32
3. for $j\in\text{BlockOrder}$ do $\triangleright$ Forward: $0..N/B$; Reverse: $N/B..0$
4. $\mathbf{Z}\leftarrow Q\cdot K_{j}^{T}/\sqrt{d_{k}}$ $\triangleright$ FP8 $\times$ FP8 $\to$ FP32 accumulator (scores)
5. $m_{\text{loc}}\leftarrow\max(\mathbf{Z})$
6. $m_{\text{new}}\leftarrow\max(m,\,m_{\text{loc}})$
7. $\alpha\leftarrow\exp(m-m_{\text{new}})$ $\triangleright$ FP32 correction
8. $P\leftarrow\exp(\mathbf{Z}-m_{\text{new}})$ $\triangleright$ FP32 local prob.
9. $\ell\leftarrow\alpha\cdot\ell+\text{sum}(P)$ $\triangleright$ FP32 pre-cast P
10. $P_{\text{fp8}}\leftarrow\text{cast\_E4M3}(P\cdot S)$ $\triangleright$ P-cast; $\times S$ is FP32, bit-exact for $S=2^{k}$
11. $O\leftarrow\alpha\cdot O+P_{\text{fp8}}\cdot V_{j}$ $\triangleright$ FP8 $\times$ FP8 $\to$ FP32 accum.
12. end for
13. return $O/(S\cdot\ell)$ $\triangleright$ Epilogue: unscale + normalize

The key design: $\ell$ (line 9) uses exact FP32 probabilities, while $O$ (line 11) uses the cast E4M3 values. This means *normalization is always exact*; precision loss appears only in the numerator.
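To make the separation between $\ell$ and $O$ concrete, here is a minimal pure-Python sketch of Algorithm 1 for one query row. It is illustrative only: it takes precomputed scores instead of $Q$ and $K$, uses Python doubles where the kernel uses FP32, assumes a saturating cast above 448, and the names `round_e4m3` and `fp8_attention_row` are hypothetical helpers rather than functions from any kernel library.

```python
import math


def round_e4m3(x):
    """Round a non-negative float to the nearest E4M3 value (ties to even).

    Assumption: values above 448 saturate to 448.
    """
    if x <= 0.0:
        return 0.0
    x = min(x, 448.0)
    _, e = math.frexp(x)              # x = m * 2**e with 0.5 <= m < 1
    n = max(e - 1, -6)                # binade exponent, clamped at the subnormal range
    step = 2.0 ** (n - 3)             # spacing of representable values in this binade
    return round(x / step) * step     # Python's round() is round-half-to-even


def fp8_attention_row(scores, values, S=256.0, block=2, reverse=False):
    """Algorithm 1 for one query row, starting from precomputed scores."""
    n_blocks = math.ceil(len(scores) / block)
    order = range(n_blocks - 1, -1, -1) if reverse else range(n_blocks)
    m, ell, out = float("-inf"), 0.0, [0.0] * len(values[0])
    for j in order:
        idx = range(j * block, min((j + 1) * block, len(scores)))
        m_new = max(m, max(scores[i] for i in idx))
        alpha = math.exp(m - m_new)                   # rescales earlier partial results
        p = [math.exp(scores[i] - m_new) for i in idx]
        ell = alpha * ell + sum(p)                    # normaliser from pre-cast P
        p_fp8 = [round_e4m3(pi * S) for pi in p]      # the P-cast
        out = [alpha * o + sum(q * values[i][c] for q, i in zip(p_fp8, idx))
               for c, o in enumerate(out)]
        m = m_new
    return [o / (S * ell) for o in out]


scores = [12.0, 0.0, 0.0, 0.0]                # one strong sink, then ordinary tokens
values = [[0.0], [1.0], [1.0], [1.0]]
weights = [math.exp(s - 12.0) for s in scores]
exact = sum(w * v[0] for w, v in zip(weights, values)) / sum(weights)
fwd_s1 = fp8_attention_row(scores, values, S=1.0, block=2, reverse=False)[0]
rev_s256 = fp8_attention_row(scores, values, S=256.0, block=2, reverse=True)[0]
print(exact, fwd_s1, rev_s256)
```

In this toy input, forward iteration with $S=1$ casts every non-sink weight to zero, so the non-sink contribution to the output vanishes, while reverse iteration with $S=256$ retains it up to quantization error.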

### 2.3 Attention Sink

Multiple studies (see, e.g., [[8](https://arxiv.org/html/2606.06521#bib.bib8), [7](https://arxiv.org/html/2606.06521#bib.bib7), [1](https://arxiv.org/html/2606.06521#bib.bib1), [4](https://arxiv.org/html/2606.06521#bib.bib4)]) document that pretrained LLMs allocate disproportionate attention to initial tokens (“sink tokens”), with sink-vs-normal logit gap $\Delta$ typically reported in the range $[6,13]$ at context lengths of several thousand. Training-side mitigations exist (learnable sink tokens, clipped softmax), but we focus on kernel-level solutions applicable to already-trained models.

### 2.4 Related Work

FP8 attention kernels. FlashAttention-3 [[6](https://arxiv.org/html/2606.06521#bib.bib6)] introduced E4M3 P-casting with $S=256$ and reverse iteration on Hopper (SM90); FlashAttention-4 extends the same choices to Blackwell (SM100). FlashInfer adopts $S=448$ (matching $\max_{\text{E4M3}}$). SageAttention2 [[9](https://arxiv.org/html/2606.06521#bib.bib9)] uses per-block $S=448$; SageAttention2++ [[10](https://arxiv.org/html/2606.06521#bib.bib10)] constrains $S=112$ for FP16 accumulation.

FP8 quantization theory. Micikevicius et al. [[5](https://arxiv.org/html/2606.06521#bib.bib5)] introduced the E4M3/E5M2 split. Per-tensor vs. per-channel scaling is well studied for weights/activations, but the specific structure of post-softmax $P\in[0,1]$ quantization has not received formal treatment.

Attention sink analysis. Xiao et al. [[8](https://arxiv.org/html/2606.06521#bib.bib8)] identified the phenomenon; Sun et al. [[7](https://arxiv.org/html/2606.06521#bib.bib7)] linked it to massive activations; Gu et al. [[4](https://arxiv.org/html/2606.06521#bib.bib4)] empirically measured sink strength distributions. None analyzed the interaction with FP8 P-casting.

## 3 P-Collapse Under Attention Sink

### 3.1 Setup and Notation

We consider a single attention head with query length $q$, KV length $N$, and head dimension $d$. KV blocks have size $B$ (typically 64 or 128). The first $k_{\text{sink}}$ positions are sink tokens with logit scores $\Delta$ above the mean. Non-sink scores follow $s\sim\mathcal{N}(0,1)$ (standard for well-trained transformers with $1/\sqrt{d_{k}}$ scaling).

### 3.2 Forward Iteration Failure Mode

In forward iteration, block 0 contains the sink tokens. After processing block 0, the running maximum is set by the largest sink-token logit. Writing the sink scores as $\Delta$ plus the same $\mathcal{N}(0,1)$ fluctuation carried by other tokens,

$$m_{\text{global}}=\Delta+\delta_{k},\qquad\delta_{k}\triangleq\mathbb{E}\big[\max\nolimits_{i\leq k_{\text{sink}}}s_{i}\big], \tag{3}$$

where $\delta_{k}$ is the expected maximum of $k_{\text{sink}}$ standard Gaussians ($\delta_{4}\approx 1.03$; the asymptotic $\sqrt{2\ln k_{\text{sink}}}\approx 1.67$ badly *over*estimates $\delta_{k}$ at the small $k_{\text{sink}}$ of interest, and we use the exact value).
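The value $\delta_{4}\approx 1.03$ can be reproduced numerically. The sketch below integrates the density of the maximum of $k$ i.i.d. standard normals, $k\,\varphi(x)\,\Phi(x)^{k-1}$, with a midpoint rule; the function name and the integration bounds are illustrative choices, not taken from the paper.

```python
import math


def expected_max_std_normal(k, lo=-10.0, hi=10.0, steps=20000):
    """E[max of k i.i.d. N(0,1)] by midpoint integration of x * k * phi(x) * Phi(x)**(k-1)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        cdf = 0.5 * math.erfc(-x / math.sqrt(2.0))
        total += x * k * pdf * cdf ** (k - 1) * h
    return total


delta_4 = expected_max_std_normal(4)
asymptotic_4 = math.sqrt(2.0 * math.log(4))
print(round(delta_4, 2), round(asymptotic_4, 2))
# prints: 1.03 1.67
```

The gap between the two numbers is why the exact order-statistic value, rather than the large-$k$ asymptotic, is used for small $k_{\text{sink}}$.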

For all subsequent blocks $j>0$, the local probability values are:

$$P_{j}(i)=\exp\big(s_{i}-m_{\text{global}}\big)=\exp\big(s_{i}-\Delta-\delta_{k}\big). \tag{4}$$

###### Proposition 1 (P-underflow condition).

A P value $p$ underflows to zero in E4M3 (with scale $S$) iff:

$$p\cdot S<2^{-10} \tag{5}$$

For $P_{j}(i)=\exp(s_{i}-\Delta-\delta_{k})$ with scale $S$, this occurs when:

$$s_{i}<\Delta+\delta_{k}-10\ln 2-\ln S=\Delta+\delta_{k}-6.93-\ln S \tag{6}$$

### 3.3 Underflow Fraction: Closed Form

For $s\sim\mathcal{N}(0,1)$:

###### Corollary 2 (P-collapse fraction).

To leading order, the fraction of non-sink $P$ values that underflow to zero under forward iteration with scale $S$ is:

$$F(\Delta,S)=\Phi\big(\Delta+\delta_{k}-6.93-\ln S\big) \tag{7}$$

where $\Phi$ is the standard normal CDF, $6.93=10\ln 2$, and $\delta_{k}$ is the within-sink extreme-value shift of §[3](https://arxiv.org/html/2606.06521#S3) (the naive $\delta_{k}=0$ form is a lower bound on the realized collapse).

###### Proof.

By Proposition [1](https://arxiv.org/html/2606.06521#Thmtheorem1), $P_{j}(i)$ underflows iff $s_{i}<\Delta+\delta_{k}-10\ln 2-\ln S$. For $s_{i}\sim\mathcal{N}(0,1)$, $\Pr[s_{i}<x]=\Phi(x)$, hence the underflow fraction equals $\Phi(\Delta+\delta_{k}-6.93-\ln S)$ (using $10\ln 2=6.9315$). This treats the per-row $m_{\text{global}}$ as the fixed mean $\Delta+\delta_{k}$; since $m_{\text{global}}$ has nonzero spread across query rows and $\Phi$ is convex in its lower tail, the realized fraction is slightly *higher* than this mean-shift estimate. Table [1](https://arxiv.org/html/2606.06521#S3.T1) therefore reports the exact simulated values, which Eq. ([7](https://arxiv.org/html/2606.06521#S3.E7)) reproduces to within a few percentage points. ∎

Table [1](https://arxiv.org/html/2606.06521#S3.T1) evaluates this for $S=1$ (direct cast) and $S=256$:

Table 1: P-collapse analysis ($N=4096$, $k_{\text{sink}}=4$, block size 64). *Both* the $S=1$ and $S=256$ “frac. zeroed” columns are measured from the *same* kernel-faithful forward simulation (averaged over 12 seeds), so they share one $m_{\text{global}}=\Delta+\delta_{k}$ convention. The measured shift is $\delta_{4}\approx 1.0$, consistent with $\mathbb{E}[\max$ of 4 $\mathcal{N}(0,1)]\approx 1.03$ (and *not* the asymptotic $\sqrt{2\ln 4}\approx 1.67$). “Eff. info loss” is non-sink mass $\times$ frac. zeroed ($S=1$). The closed form ([7](https://arxiv.org/html/2606.06521#S3.E7)) reproduces these columns to within a few points; the residual is the per-row spread of $m_{\text{global}}$ (Corollary [2](https://arxiv.org/html/2606.06521#Thmtheorem2) proof).

The *effective information loss* (non-sink mass $\times$ fraction zeroed) peaks at $\Delta\approx 6$–$7$ ($\sim$40%), where positions carrying roughly half to three-quarters of the probability mass have most of their P values zeroed.
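As a hedged illustration of Eq. (7) and the threshold $\Delta_{c}=6.93+\ln S-\delta_{k}$, the sketch below evaluates the closed form only; it does not reproduce the simulated columns of Table 1. The function names are hypothetical, and $\delta_{k}=1.03$ is the $k_{\text{sink}}=4$ value quoted in the text.

```python
import math

LN2 = math.log(2.0)


def std_normal_cdf(x):
    """Standard normal CDF, written with erfc so the lower tail stays accurate."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def collapse_fraction(delta, scale, delta_k=1.03):
    """Leading-order fraction of non-sink P values cast to zero, Eq. (7)."""
    return std_normal_cdf(delta + delta_k - 10.0 * LN2 - math.log(scale))


def critical_gap(scale, delta_k=1.03):
    """Sink strength at which half of the non-sink P values underflow."""
    return 10.0 * LN2 + math.log(scale) - delta_k


for scale in (1.0, 256.0):
    print(f"S={scale:g}: Delta_c={critical_gap(scale):.2f}, "
          f"F(7, S)={collapse_fraction(7.0, scale):.2e}")
```

Moving from $S=1$ to $S=256$ shifts the threshold by $\ln 256\approx 5.55$, which pushes it from roughly 5.9 to roughly 11.4 and so moves most of the commonly reported sink range below it.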

### 3.4 Output MSE Bound

###### Proposition 3 (MSE from P-collapse).

Let $\mathcal{Z}$ be the set of positions whose P values underflow. Assume $V_{j}$ are zero-mean random vectors with $\mathbb{E}[V_{j}V_{j}^{T}]=\sigma_{V}^{2}I_{d}$ and pairwise uncorrelated across positions. Then the expected per-dimension MSE of the kernel output vs. exact FP32 reference satisfies:

$$\mathbb{E}[\|\varepsilon\|^{2}/d]\;=\;\frac{\sigma_{V}^{2}}{\ell^{2}}\sum_{j\in\mathcal{Z}}P_{j}^{2}\;+\;\frac{1}{d\ell^{2}}\sum_{j\neq k\in\mathcal{Z}}P_{j}P_{k}\,\mathrm{tr}\,\mathbb{E}[V_{j}V_{k}^{T}], \tag{8}$$

where $\ell=\sum_{\text{all}}P_{j}$ is the (exact, FP32) running sum. Under the pairwise-uncorrelated assumption the cross term vanishes, yielding

$$\mathrm{MSE}_{\mathrm{collapse}}\;=\;\frac{\sigma_{V}^{2}}{\ell^{2}}\sum_{j\in\mathcal{Z}}P_{j}^{2}. \tag{9}$$

###### Proof.

The output error vector is $\varepsilon=\frac{1}{\ell}\sum_{j\in\mathcal{Z}}P_{j}V_{j}$ (the “missing” contribution). Expanding the squared norm and taking expectations:

$$\mathbb{E}[\|\varepsilon\|^{2}]\;=\;\frac{1}{\ell^{2}}\sum_{j,k\in\mathcal{Z}}P_{j}P_{k}\,\mathbb{E}[V_{j}^{T}V_{k}]. \tag{10}$$

With $\mathbb{E}[V_{j}^{T}V_{j}]=d\sigma_{V}^{2}$ and the pairwise-uncorrelated assumption $\mathbb{E}[V_{j}^{T}V_{k}]=0$ for $j\neq k$, dividing by $d$ gives the stated equality. ∎
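Eq. (9) can be sanity-checked with a small Monte Carlo experiment. The sketch below draws independent Gaussian value vectors (a stronger assumption than the pairwise-uncorrelated condition of Proposition 3) and compares the empirical per-dimension error with the closed form. The zeroed weights, `ell`, and all function names are made-up illustrative inputs, not data from the paper.

```python
import random


def mse_collapse(p_zeroed, ell, sigma_v=1.0):
    """Closed-form per-dimension MSE of Eq. (9)."""
    return sigma_v ** 2 / ell ** 2 * sum(p * p for p in p_zeroed)


def simulate_mse(p_zeroed, ell, d=16, trials=400, sigma_v=1.0, seed=0):
    """Empirical E[||eps||^2 / d] with eps = (1/ell) * sum_j P_j V_j over zeroed positions."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(trials):
        eps = [0.0] * d
        for p in p_zeroed:
            for c in range(d):
                eps[c] += p * rng.gauss(0.0, sigma_v)   # V_j drawn i.i.d. N(0, sigma_v^2 I)
        acc += sum(e * e for e in eps) / (ell * ell * d)
    return acc / trials


p_zeroed = [1e-4 * (1 + i % 5) for i in range(50)]   # hypothetical underflowed weights
ell = 1.5                                            # hypothetical exact softmax denominator
predicted = mse_collapse(p_zeroed, ell)
measured = simulate_mse(p_zeroed, ell)
print(predicted, measured)
```

With these settings the two numbers should agree to within a few percent; the residual is sampling noise from the finite number of trials.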

## 4 Optimization 1: Reverse KV Iteration

### 4.1 Mechanism

Reversing the iteration order to $(N-1,N-2,\ldots,0)$ defers the sink block to the *last* iteration. During all preceding iterations, only non-sink tokens contribute to the running maximum:

$$m_{\text{pre-sink}}\leq\max_{i=1}^{N-k_{\text{sink}}}s_{i}\approx\sqrt{2\ln(N-k_{\text{sink}})} \tag{11}$$

by extreme value theory for Gaussian order statistics. The P values during these iterations are:

$$P_{j}(i)=\exp(s_{i}-m_{\text{pre-sink}})\in\big[\exp(-m_{\text{pre-sink}}-3\sigma),\,1\big] \tag{12}$$

with high probability, where $\sigma=1$ is the per-token score standard deviation and the lower endpoint uses the standard $3\sigma$ tail bound (violated with probability $\Phi(-3)\approx 1.3\times 10^{-3}$). For $N=8192$, $\sqrt{2\ln N}\approx 4.25$, giving $P_{\min}\approx\exp(-7.25)\approx 7\times 10^{-4}$, comfortably above the $S=256$ round-to-zero boundary $2^{-10}/256=2^{-18}\approx 3.8\times 10^{-6}$, and only marginally *below* the bare $2^{-10}$ boundary (which is why $S=256$ adds a safety margin over reverse alone).
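The arithmetic in this paragraph is easy to reproduce. The sketch below evaluates the lower endpoint of Eq. (12) using the leading-order estimate of Eq. (11) and compares it with the two round-to-zero boundaries; `reverse_p_min` is a hypothetical helper name.

```python
import math


def reverse_p_min(n, k_sink=4, tail_sigmas=3.0):
    """Lower endpoint of Eq. (12), with m_pre-sink ~ sqrt(2 ln(N - k_sink))."""
    m_pre_sink = math.sqrt(2.0 * math.log(n - k_sink))
    return math.exp(-m_pre_sink - tail_sigmas)


p_min = reverse_p_min(8192)
boundary_s1 = 2.0 ** -10            # round-to-zero boundary in the P domain, S = 1
boundary_s256 = 2.0 ** -10 / 256.0  # same boundary with S = 256, i.e. 2**-18
print(f"P_min={p_min:.2e}  S=1 boundary={boundary_s1:.2e}  S=256 boundary={boundary_s256:.2e}")
```

The smallest typical P value sits between the two boundaries: slightly under the $S=1$ boundary and more than two orders of magnitude above the $S=256$ one.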

### 4.2 Formal Sufficiency Condition

###### Theorem 4 (Zero-underflow guarantee for reverse + $S=256$).

In reverse iteration with $S=256$, a P value $p=\exp(s-m)$ survives the E4M3 cast (i.e., $p\cdot 256\geq 2^{-10}$) whenever:

$$s>m-18\ln 2=m-12.48 \tag{13}$$

For $m\leq\sqrt{2\ln N}+\mathcal{O}(1)$ (pre-sink) with scores $s\sim\mathcal{N}(0,1)$:

$$\Pr[\text{underflow}]=\Phi(m-12.48)\leq\Phi(-7.18)<10^{-12} \tag{14}$$

for $N\leq 10^{6}$. Effectively zero P values underflow.

###### Proof.

The cast-to-zero condition is $p\cdot S<2^{-10}$, i.e., $\exp(s-m)\cdot 256<2^{-10}$. Taking logs: $s-m+8\ln 2<-10\ln 2$, hence $s<m-18\ln 2$. Numerically, $18\ln 2=12.4766$ (equivalently $10\ln 2+\ln 256=6.9315+5.5452=12.4766$), giving the threshold $s<m-12.48$. Before the sink block, $m\leq\sqrt{2\ln N}+\mathcal{O}(1)$. For $N=10^{6}$: $m\leq 5.3$, so the threshold is $s<5.3-12.48=-7.18$. For $s\sim\mathcal{N}(0,1)$: $\Phi(-7.18)\approx 3.5\times 10^{-13}<10^{-12}$. ∎
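The bound can be evaluated directly. The sketch below uses only the leading term $m\approx\sqrt{2\ln N}$ and drops the $\mathcal{O}(1)$ correction, so it is an illustration of the scaling rather than a restatement of the theorem; `underflow_probability` is a hypothetical helper name.

```python
import math


def std_normal_cdf(x):
    """Standard normal CDF via erfc, which keeps precision deep in the lower tail."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def underflow_probability(n, scale=256.0):
    """Per-value underflow probability before the sink block, taking m = sqrt(2 ln N)."""
    m = math.sqrt(2.0 * math.log(n))
    threshold = m - (10.0 * math.log(2.0) + math.log(scale))
    return std_normal_cdf(threshold)


for n in (4096, 65536, 10 ** 6):
    print(n, f"S=256: {underflow_probability(n):.1e}", f"S=1: {underflow_probability(n, 1.0):.1e}")
```

Under this estimate the $S=1$ column is not negligible at long sequence lengths, which matches the text's point that the rigorous guarantee needs reverse iteration combined with $S=256$.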

### 4.3 The Final $\alpha$-Correction

When the sink block is finally processed (last in reverse), the correction factor is:

$$\alpha=\exp(m_{\text{pre-sink}}-m_{\text{new}})\approx\exp\big(\sqrt{2\ln N}-\Delta\big) \tag{15}$$

For $\Delta=7$, $N=4096$: $\alpha\approx\exp(4.1-7)\approx 0.055$, which multiplies the FP32 accumulator (23-bit mantissa) with negligible precision loss. A paired $t$-test (Appendix [B](https://arxiv.org/html/2606.06521#A2)) confirms forward+$S=256$ and reverse+$S=256$ are indistinguishable where P-collapse dominates ($\Delta\leq 9$); at $\Delta\geq 10$ reverse is marginally (and inconsequentially) better ($\sim 10^{-8}$ vs. MSE of $10^{-5}$–$10^{-6}$).
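A quick numerical check of Eq. (15), as a sketch: `final_alpha` is a hypothetical helper, and the $2^{-24}$ figure is the standard relative rounding bound for one round-to-nearest FP32 operation, included only to show the scale of the error introduced by multiplying the accumulator by $\alpha$.

```python
import math


def final_alpha(delta, n):
    """Leading-order correction factor of Eq. (15) when the sink block is processed last."""
    return math.exp(math.sqrt(2.0 * math.log(n)) - delta)


alpha = final_alpha(7.0, 4096)
fp32_unit_roundoff = 2.0 ** -24   # relative error bound of one FP32 rounding
print(f"alpha={alpha:.3f}, relative rounding per multiply <= {fp32_unit_roundoff:.1e}")
```

The rescaling shrinks the accumulated non-sink contribution by a factor of roughly 18, but it does so in FP32, so the relative error it adds is around seven orders of magnitude below E4M3's 12.5% step.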

## 5 Optimization 2: Scale Factor $S=256$

The complete P-quantization pipeline is:

$$\tilde{P}=\frac{\operatorname{Round}_{\text{E4M3}}(P\cdot S)}{S},\qquad O=\frac{\tilde{P}\cdot V}{\ell} \tag{16}$$

where $\operatorname{Round}_{\text{E4M3}}$ denotes round-to-nearest in E4M3. The choice of $S$ controls the fidelity of $\tilde{P}$ as an approximation to $P$.

### 5.1 Condition 1: Bit-Exact Scaling

###### Proposition 5 (Power-of-two bit-exactness).

For $S=2^{k}$ and any IEEE 754 FP32 value $x$ (with result in representable range), both $x\times S$ and $x\times(1/S)$ are *exact*: no rounding occurs.

###### Proof.

In IEEE 754 binary32, $2^{k}$ has exponent $=127+k$ and zero mantissa. Multiplication by $2^{k}$ adds $k$ to the result exponent without modifying the 23-bit mantissa. Since $1/2^{k}=2^{-k}$ is also exactly representable, the reciprocal multiplication is likewise exponent-only. ∎

For non-power-of-two $S$ (e.g., 448): $1/448=0.002232\ldots$ is *not* exactly representable in IEEE 754. Every $\times(1/448)$ introduces $\sim 2^{-24}$ rounding per element. Over the $d$-dimensional output, this accumulates to an error of $\mathcal{O}(d\cdot 2^{-48})$, negligible individually but a systematic precision disadvantage nonetheless.
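Proposition 5 can be illustrated by emulating FP32 arithmetic in Python. In the sketch below, `f32` rounds a double to the nearest binary32 value via `struct`; because the product of two binary32 values is exact in a double, rounding it once reproduces a correctly rounded FP32 multiply. The helper names and the sample grid are illustrative assumptions.

```python
import struct


def f32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def round_trip(x, scale):
    """Scale by S, then by the FP32 reciprocal of S, rounding to FP32 after each multiply."""
    scaled = f32(x * scale)
    return f32(scaled * f32(1.0 / scale))


samples = [f32(k / 1000.0) for k in range(1, 1001)]   # FP32 values in (0, 1]
mismatch_256 = sum(round_trip(x, 256.0) != x for x in samples)
mismatch_448 = sum(round_trip(x, 448.0) != x for x in samples)
print(mismatch_256, mismatch_448)
```

The $S=256$ round trip returns every input unchanged, while the $S=448$ round trip perturbs a nonzero share of inputs in the last bit, in line with the $\sim 2^{-24}$ per-element rounding described above.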

### 5.2 Condition 2: The $dp(S)$ Sawtooth

###### Definition 1 (Normalized quantization step).

For scale factor $S\in(0,448]$:

$$dp(S)\triangleq\frac{\max_{x\in[0,\,S]}\operatorname{LSB}_{\text{E4M3}}(x)}{S} \tag{17}$$

where $\operatorname{LSB}_{\text{E4M3}}(x)$ is the spacing between consecutive representable E4M3 values in the binade containing $x$. For $S>448$, the cast saturates at $448$ and we extend the definition to also include the (one-sided) saturation step; see Remark [3](https://arxiv.org/html/2606.06521#Thmremark3).

$dp(S)$ represents the worst-case quantization *step* for any $P\in[0,1]$ through the pipeline ([16](https://arxiv.org/html/2606.06521#S5.E16)); the maximum pointwise error is $dp(S)/2$.

###### Theorem 6 ($dp(S)$ structure).

- (i) For all $S=2^{k}$ with integer $k\in\{0,1,\ldots,8\}$ (i.e. $S\in\{1,2,4,\ldots,256\}$): $dp(2^{k})=2^{-4}$.
- (ii) For all $S\in[2^{-6},448]$ with $S\neq 2^{k}$: $dp(S)>2^{-4}$.
- (iii) For $S>448$ (using the extended form of Remark [3](https://arxiv.org/html/2606.06521#Thmremark3)): $dp(S)>2^{-4}$.

###### Proof.

(i) For $S=2^{k}$ with $k\in\{0,\ldots,8\}$, we have $S\leq 256<448$, so $S$ itself is exactly representable in E4M3. The mapped range $[0,S]=[0,2^{k}]$ has its highest binade $[2^{k-1},2^{k})$ entirely in the normal region (since $k-1\geq-1>-6$), with $\operatorname{LSB}=2^{k-4}$. Thus $dp(2^{k})=2^{k-4}/2^{k}=2^{-4}$.

(ii) Let $n=\lfloor\log_{2}S\rfloor$, so $S\in[2^{n},2^{n+1})$ with $S\neq 2^{n}$ (since $S$ is not a power of two). Because $S\geq 2^{-6}$, the binade $[2^{n},2^{n+1})$ lies in the normal region, so it has $\operatorname{LSB}=2^{n-3}$. The range $[0,S]$ overlaps with this binade, hence $dp(S)=2^{n-3}/S$. Since $S<2^{n+1}$, $dp(S)>2^{n-3}/2^{n+1}=2^{-4}$. (Footnote 1: Equivalently, $dp(S)=2^{n-3}/S<2^{n-3}/2^{n}=2^{-3}$, so $dp(S)\in(2^{-4},\,2^{-3})$ for non-power-of-two $S\in[2^{-6},448]$. The range $S<2^{-6}$ (entirely subnormal) is excluded for cleanliness; the conclusion still holds there since the uniform subnormal LSB $2^{-9}$ gives $dp(S)=2^{-9}/S>2^{-3}$, but the practical scope of this paper is $S\geq 1$.)

(iii) For $S>448$, at least one of the two terms in $\max(32/S,\,2(1-448/S))$ exceeds $2^{-4}$ at every $S$: $32/S>2^{-4}$ for $S<512$, and $2(1-448/S)>2^{-4}$ for $S>448\cdot\frac{1}{1-2^{-5}}\approx 462.5$. The two sub-ranges $(448,512)$ and $(462.5,\infty)$ overlap on $(462.5,512)$, so their union covers the full $S>448$ regime. ∎

Figure [1](https://arxiv.org/html/2606.06521#S5.F1) shows $dp(S)$ computed by exact enumeration of all 126 positive E4M3 values, confirming the sawtooth structure and power-of-two lower envelope.

![Refer to caption](https://arxiv.org/html/2606.06521v1/x1.png)

Figure 1: The normalized quantization step $dp(S)$ for E4M3. All $2^{k}$ values (blue) sit on the lower envelope at $dp=2^{-4}$. $S=256$ (green star) is the rightmost before overflow. $S=448$ (red) sits $14\%$ above the envelope.
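A minimal sketch of the enumeration behind the sawtooth, restricted to $S\le 448$ (the saturating extension for $S>448$ depends on Remark 3 and is not reproduced here). It reads Definition 1 the way the proof of Theorem 6(i) does, with the top of the range $[0,S]$ belonging to the binade just below $S$ when $S$ is a power of two. The helper names `e4m3_grid` and `dp` are hypothetical.

```python
def e4m3_grid():
    """Zero followed by all 126 positive finite E4M3 values, ascending."""
    grid = [0.0]
    for e in range(16):
        for m in range(8):
            if (e, m) in ((0, 0), (15, 7)):
                continue   # skip zero (already added) and the NaN encoding
            grid.append(2.0 ** -6 * m / 8 if e == 0 else 2.0 ** (e - 7) * (1 + m / 8))
    return grid


def dp(scale):
    """Normalized worst-case quantization step of Eq. (17) for 0 < S <= 448."""
    if not 0.0 < scale <= 448.0:
        raise ValueError("this sketch only covers 0 < S <= 448")
    grid = e4m3_grid()
    # Largest gap between neighbouring grid points whose lower end lies below S.
    worst_step = max(hi - lo for lo, hi in zip(grid, grid[1:]) if lo < scale)
    return worst_step / scale


for scale in (1.0, 64.0, 256.0, 300.0, 448.0):
    print(f"S={scale:g}: dp={dp(scale):.4f}")
```

Powers of two land exactly on $2^{-4}=0.0625$, while $S=448$ gives $32/448\approx 0.0714$, about 14% higher.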
### 5.3 Condition 3: Maximum Normal Coverage

Among the power-of-two candidates satisfying $S\leq 448$ (i.e., $S\in\{1,2,4,\ldots,256\}$), larger $S$ is better because the E4M3 *normal-region lower bound* in the P-domain scales as $2^{-6}/S$.

This monotonicity does *not* stop at $256$: a non-power-of-two scale such as $S=448$ pushes the normal threshold down further to $2^{-6}/448\approx 3.5\times 10^{-5}$, i.e. *strictly better* coverage than $S=256$ ($6.1\times 10^{-5}$). Coverage alone therefore favors $448$; $256$ wins only once we additionally require bit-exactness (C1) and the minimum quantization step (C2), both of which $448$ fails. The $256$-vs-$448$ choice is thus a genuine trade-off (smaller worst-case step vs. slightly better deep-tail coverage), resolved empirically in §[6](https://arxiv.org/html/2606.06521#S6), and is not a clean domination.
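The coverage thresholds follow directly from the E4M3 format: the smallest normal value is $2^{-6}$ and the round-to-zero boundary is $2^{-10}$, both divided by $S$ in the P domain. A short sketch with hypothetical helper names:

```python
def normal_threshold(scale):
    """Smallest P that still lands in the E4M3 normal range after scaling by S."""
    return 2.0 ** -6 / scale


def zero_threshold(scale):
    """P values below this are cast to zero after scaling by S."""
    return 2.0 ** -10 / scale


for scale in (1.0, 16.0, 256.0, 448.0):
    print(f"S={scale:>5g}  normal >= {normal_threshold(scale):.2e}  zero < {zero_threshold(scale):.2e}")
```

The thresholds fall monotonically with $S$, so $448$ covers slightly more of the deep tail than $256$, as stated above.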

### 5.4 Characterization of $S=256$

###### Theorem 7 (Characterization of $S=256$).

Restricted to scales $S\geq 1$ (so that the cast *expands* the dynamic range of $P\in[0,1]$ rather than contracting it), $S=256=2^{8}$ is the unique value satisfying all three of:

- (C1) $S=2^{k}$ (bit-exact scaling; Proposition [5](https://arxiv.org/html/2606.06521#Thmtheorem5))
- (C2) $dp(S)=2^{-4}$ (minimum quantization step; Theorem [6](https://arxiv.org/html/2606.06521#Thmtheorem6))
- (C3) $S=\max\{2^{k}:2^{k}\leq 448\}$ (maximum normal coverage *among $2^{k}$ scales*)

###### Proof.

By (C1), $S=2^{k}$ for some integer $k\geq 0$ (using the standing $S\geq 1$ assumption). By Theorem [6](https://arxiv.org/html/2606.06521#Thmtheorem6), the set $\{2^{k}:k\geq 0\}$ partitions into $\{2^{0},\ldots,2^{8}\}$ where $dp=2^{-4}$ and $\{2^{k}:k\geq 9\}$ where $dp>2^{-4}$ (Theorem [6](https://arxiv.org/html/2606.06521#Thmtheorem6)(iii)). Thus (C2) admits exactly the nine candidates $\{1,2,4,\ldots,256\}$. (C3) selects the maximum element of this set, $k=8$, giving $S=256$. ∎

#### Comparison: $S=256$ vs $S=448$.

$dp(448)=32/448\approx 0.0714$ exceeds $dp(256)=0.0625$ by $14\%$. In MSE terms: $(0.0714/0.0625)^{2}\approx 1.30$, predicting $\sim$30% higher MSE for $S=448$. Our experiments (§[6](https://arxiv.org/html/2606.06521#S6)) measure 10–15% MSE difference, consistent with the bound (the bound is worst-case; average-case is milder, and $448$'s better deep-tail coverage partly offsets its larger worst-case step).

## 6 Experimental Validation

### 6.1 Kernel-Faithful Simulation

We simulate the FP8 attention kernel with semantics matched to production code (specifically, the `attention_with_kvcache_prefill_fp8` kernel in hpc-ops and FlashAttention-3's Hopper/Blackwell backend):

- $\ell$ (running sum): FP32, from *pre-cast* P (line 8 of Alg. [1](https://arxiv.org/html/2606.06521#alg1)).
- $O$ (output): FP32 accumulator, from *post-cast* $P_{\text{fp8}} \times V$.
- P-cast: round-to-nearest E4M3 (values below $2^{-10}$ $\to$ 0).
- Epilogue: $O_{\text{final}} = O / (S \cdot \ell)$.

To *isolate* the P-cast effect, we keep $Q, K, V$ in FP32. The QKV FP8 quantization is an orthogonal concern (handled by separate qscale/kscale/vscale) and does not interact with the P-cast precision loss.
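The four semantics above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the hpc-ops or FlashAttention-3 code: the names `cast_e4m3`, `attention_pcast`, and `mean_mse` are hypothetical, float64 arithmetic stands in for the FP32 accumulators, a single query row is simulated, and the synthetic scores are assumed to be standard normal with the first `k_sink` entries shifted up by `delta`.

```python
import numpy as np

def cast_e4m3(x):
    """Round non-negative values to the nearest E4M3 value (ties to even), saturating at 448."""
    x = np.minimum(np.asarray(x, dtype=np.float64), 448.0)
    _, ex = np.frexp(x)                    # x = mantissa * 2**ex with mantissa in [0.5, 1)
    e = np.maximum(ex - 1, -6)             # binade exponent, clamped to the subnormal range
    step = np.ldexp(1.0, e - 3)            # 3 mantissa bits: spacing 2**(e-3), 2**-9 for subnormals
    return np.round(x / step) * step       # values below 2**-10 round to 0

def attention_pcast(scores, V, S=1.0, block=64, reverse=False):
    """Online-softmax attention for one query row, casting S*P to E4M3 before P @ V."""
    m, l, O = -np.inf, 0.0, np.zeros(V.shape[1])
    starts = range(0, len(scores), block)
    for b in (reversed(starts) if reverse else starts):
        s = scores[b:b + block]
        m_new = max(m, s.max())
        alpha = np.exp(m - m_new)          # rescales earlier partial results (0 on the first block)
        P = np.exp(s - m_new)
        l = l * alpha + P.sum()            # running sum uses pre-cast P
        O = O * alpha + cast_e4m3(S * P) @ V[b:b + block]   # accumulator uses post-cast P
        m = m_new
    return O / (S * l)                     # epilogue

def mean_mse(delta, S, reverse, n=4096, d=64, k_sink=4, trials=8):
    """Average output MSE against an unquantized reference over a few synthetic trials."""
    errs = []
    for seed in range(trials):
        rng = np.random.default_rng(seed)
        scores = rng.standard_normal(n)    # assumed score model, not the paper's exact generator
        scores[:k_sink] += delta           # sink cluster at position 0
        V = rng.standard_normal((n, d))
        p = np.exp(scores - scores.max())
        ref = (p / p.sum()) @ V
        out = attention_pcast(scores, V, S=S, reverse=reverse)
        errs.append(np.mean((out - ref) ** 2))
    return float(np.mean(errs))
```

Calling `mean_mse(7.0, 1.0, False)`, `mean_mse(7.0, 256.0, False)`, and `mean_mse(7.0, 1.0, True)` compares forward order without scaling against each fix in isolation. The absolute values depend on the assumed score model and should not be read as reproductions of the paper's tables.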

### 6.2 Configurations

We compare five configurations spanning the design space:

### 6.3 Experiment 1: Sink Strength Sweep

We sweep $\Delta \in [4, 13]$ with $N = 4096$, $d = 128$, $q_{\text{len}} = 32$, $B = 64$, $k_{\text{sink}} = 4$, over 20 seeds. Figure [2](https://arxiv.org/html/2606.06521#S6.F2) shows both the MSE results and the underlying physics (P-zeroing fraction and mass at risk).

![Refer to caption](https://arxiv.org/html/2606.06521v1/x2.png)

Figure 2: Top: MSE vs. sink strength. Forward+S=1 exhibits a peak at $\Delta = 7$–$8$ (the transition region). All other configurations remain at the quantization-noise floor. Bottom: Diagnostic showing non-sink probability mass (blue bars) and the fraction of P values zeroed by the S=1 cast (red bars) vs. the S=256 cast (green line).

#### Key findings:

1. Forward+S=1 is $3.4\times$ worse than optimized configurations at $\Delta = 7$.
2. Both reverse (any $S$) and $S = 256$ (any direction) independently fix the issue.
3. $S = 256$ gives $\sim$10–15% lower MSE than $S = 448$, consistent with the $dp$ ratio.
4. At $\Delta \geq 11$, all configurations converge (non-sink mass $< 2\%$).

### 6.4 Experiment 2: Sequence Length Sweep

At $\Delta = 7$, we sweep $N \in [512, 16384]$. As $N$ grows, the non-sink probability mass increases, amplifying the P-collapse effect. Figure [3](https://arxiv.org/html/2606.06521#S6.F3) shows the results.

![Refer to caption](https://arxiv.org/html/2606.06521v1/x3.png)

Figure 3: MSE vs. sequence length at $\Delta = 7$. The improvement of Forward+S=256 over Forward+S=1 grows from $1.3\times$ ($N = 512$) to $10\times$ ($N = 16384$) as non-sink probability mass increases.

### 6.5 Numerical Results

Table [2](https://arxiv.org/html/2606.06521#S6.T2) reports exact MSE values for key operating points:

Table 2: MSE ($\times 10^{-5}$) at selected configurations.

### 6.6 Comparison with Mainstream Implementations

Table [3](https://arxiv.org/html/2606.06521#S6.T3) positions the implementations in terms of both design choices.

Table 3: Design choices in production FP8 attention kernels. “Optimal” denotes joint satisfaction of (C1)–(C3) of Theorem [7](https://arxiv.org/html/2606.06521#Thmtheorem7) *and* reverse iteration (which is redundant given $S = 256$ but offers belt-and-suspenders robustness against multi-sink patterns). “Near-optimal” denotes one fix applied (here, reverse iteration eliminates P-collapse) but a non-power-of-two scale incurs the $dp(448)/dp(256) = 1.14$ residual penalty (§[5](https://arxiv.org/html/2606.06521#S5)). “Suboptimal scale” denotes the same scale issue *without* the iteration-order safety net.

Based on this analysis, we have adopted $S = 256$ in the hpc-ops codebase. FlashAttention-3/4 and the updated hpc-ops both satisfy the optimality conditions; FlashInfer and TensorRT-LLM XQA could reduce P-cast MSE by a further 10–15% by switching from $S = 448$ to $S = 256$.

## 7 Discussion

### 7.1 Saturation: Same Mechanism, Same Floor

Both optimizations address the *identical* failure mode: P values falling below E4M3's representable range. Once either fix is applied, P-collapse is eliminated and the residual MSE is set by the *inherent* E4M3 quantization noise on representable values. Paired $t$-tests over 100 instances confirm that Forward+S=256 and Reverse+S=256 are statistically indistinguishable wherever P-collapse is active ($\Delta \leq 9$); for $\Delta \geq 10$ the residuals diverge with reverse marginally better, but the absolute gap is $\sim 10^{-8}$, three orders of magnitude below the MSE itself, so the practical conclusion is unchanged (Appendix [B](https://arxiv.org/html/2606.06521#A2)).

This has a practical implication:*either optimization alone suffices*\. The choice between them should be driven by engineering constraints \(reverse may simplify causal mask handling; forward may offer better memory access patterns\)\.

### 7.2 Structural Interpretation

The role of $S = 2^{k}$ admits a simple geometric reading: every E4M3 normal binade $[2^{n}, 2^{n+1})$ contains 8 equispaced points, differing only in absolute scale. The quantity $dp(S) = \operatorname{LSB}(S)/S$ measures the *normalized* granularity; at $S = 2^{k}$, both numerator and denominator double simultaneously, locking their ratio at $2^{-4}$. Non-power-of-two scales break this alignment, leaving $dp(S)$ on the rising portion of the sawtooth (Figure [1](https://arxiv.org/html/2606.06521#S5.F1)). The P-collapse threshold $\Delta_{c} = 6.93 + \ln S - \delta_{k}$ corresponds to the sink strength at which the median non-sink $P$ value falls below the round-to-zero boundary $2^{-10}/S$; below this threshold the cast is well-conditioned, above it most of the non-sink mass is silently zeroed.
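The threshold and the leading-order underflow fraction $F = \Phi(\Delta + \delta_k - 6.93 - \ln S)$ are easy to evaluate numerically. The sketch below is illustrative only: the function names are hypothetical, the constant 6.93 is read as $\ln 2^{10}$ (matching the $2^{-10}$ round-to-zero boundary), and the default `delta_k = 1.0` is the approximate value the abstract quotes for $k_{\text{sink}} = 4$.

```python
import math

LN_2_10 = 10 * math.log(2)   # about 6.93: ln(2**10), from the 2**-10 round-to-zero boundary

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def collapse_threshold(S, delta_k=1.0):
    """Delta_c = 6.93 + ln S - delta_k (delta_k ~ 1 is an assumed default for k_sink = 4)."""
    return LN_2_10 + math.log(S) - delta_k

def underflow_fraction(delta, S, delta_k=1.0):
    """Leading-order fraction of non-sink P values that the cast rounds to zero."""
    return phi(delta + delta_k - LN_2_10 - math.log(S))

if __name__ == "__main__":
    for S in (1, 256):
        print(S, collapse_threshold(S), underflow_fraction(7.0, S))
```

By construction the fraction equals one half exactly at $\Delta = \Delta_c$, and moving from $S = 1$ to $S = 256$ shifts the threshold up by $\ln 256$.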

### 7.3 Practical Recommendations

1. For forward-order kernels (e.g., TRT-LLM XQA): Add $S = 256$ P-scaling. Implementation: multiply P by 256 before cast; divide output by 256 in epilogue. Both operations are bit-exact (§[5](https://arxiv.org/html/2606.06521#S5)) and add only two FMAs whose latency is negligible against the tensor-core matmul. We have applied this change to the hpc-ops kernel.
2. For kernels already using $S = 448$ (FlashInfer, SageAttention2): Switch to $S = 256$ for 10–15% MSE reduction at negligible cost.
3. For kernels already using reverse + $S = 256$ (FA3/4): No change needed; this is already optimal among static-scale strategies.
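The bit-exactness claim behind the first two recommendations can be checked empirically. The sketch below (hypothetical helper name, uniform random values standing in for softmax probabilities) counts how often the FP32 product `p * scale` differs from the exact real product. A power-of-two scale only changes the exponent, while 448 = 7 · 64 multiplies the significand by 7 and generally forces a rounding before the E4M3 cast is even applied.

```python
import numpy as np

def inexact_fraction(scale, n=100_000, seed=0):
    """Fraction of FP32 products p * scale that differ from the exact real product."""
    rng = np.random.default_rng(seed)
    p = rng.random(n, dtype=np.float32)          # stand-in for P values in [0, 1)
    prod32 = p * np.float32(scale)               # what an FP32 kernel would compute
    # A 24-bit significand times a small integer fits in float64, so this product is exact.
    exact = p.astype(np.float64) * float(scale)
    return float(np.mean(prod32.astype(np.float64) != exact))

if __name__ == "__main__":
    print("S=256:", inexact_fraction(256))
    print("S=448:", inexact_fraction(448))
```

The FP32 rounding introduced by a non-power-of-two scale is tiny compared with the E4M3 step, so this illustrates condition (C1) rather than the 10–15% MSE gap, which comes from the $dp$ ratio.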

### 7.4 Limitations and Future Work

Worst-case vs. average-case. The $dp(S)$ analysis provides the minimax-optimal scale assuming $P$ uniformly spans $[0,1]$. In practice, the post-softmax distribution is highly non-uniform (a few large values, many small). A *dynamic* per-block scale (still constrained to $2^{k}$ for bit-exactness) could yield better average-case precision by adapting to the per-block $P$ distribution. Even so, for typical small $P$ ($\sim$0.01–0.05) the static $S = 256$ maps into mid-range binades with competitive relative precision, and its no-underflow guarantee remains the main advantage.
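One possible shape for such a dynamic scale is sketched below. It is a hypothetical illustration of the future-work idea, not something the paper implements or evaluates: the function name, the `max_k` cap, and the rule "largest power of two that keeps the block maximum within the E4M3 range" are all assumptions.

```python
import math

def pow2_block_scale(p_max, fmt_max=448.0, max_k=16):
    """Largest 2**k (k <= max_k) such that p_max * 2**k stays within the E4M3 range."""
    if not 0.0 < p_max <= 1.0:
        raise ValueError("p_max must be a softmax-style value in (0, 1]")
    k = math.floor(math.log2(fmt_max) - math.log2(p_max))
    while p_max * 2.0 ** k > fmt_max:      # guard against rounding in the log2 difference
        k -= 1
    return 2.0 ** min(k, max_k)

if __name__ == "__main__":
    for p_max in (1.0, 0.05, 0.01):
        print(p_max, pow2_block_scale(p_max))
```

For a block whose maximum is 1 this rule reduces to the static choice $S = 256$. A real kernel would also have to track the per-block scale when rescaling the accumulator, which this sketch does not address.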

Interaction with QKV quantization. We isolated the P-cast effect by keeping $Q, K, V$ in FP32. In production, these are also in FP8, contributing an additive noise floor. Since both reference and kernel use identical dequantized QKV values, the P-cast MSE signal is preserved, but the *relative* improvement ratio is compressed at high $\Delta$, where QKV noise dominates.

No end-to-end metric. All our numbers are isolated output MSE against an FP32 reference; we report no perplexity or task accuracy. The $3$–$10\times$ figures show the P-cast error is real and removable, but *not* that removing it shifts a downstream metric; where the FP8 QKV floor dominates, the end-to-end benefit may be small. We thus frame the $S = 256$ switch as zero-cost and strictly-no-worse rather than a guaranteed quality win, to be confirmed per model.

Scope of the $dp(S)$ analysis. The $dp(S)$ construction assumes (a) a bounded range (here $P \in [0,1]$), (b) one static scale per cast, and (c) a bit-exactness requirement. These break down elsewhere: MXFP4's E8M0 block exponent already forces scales to $2^{k}$ (removing the freedom $dp(S)$ addresses), and outlier-heavy activations need per-channel smoothing first. We thus make no claims beyond E4M3 P-cast.

Multi-sink patterns. Our analysis assumes a single sink cluster at position 0. Some models exhibit distributed sink patterns across the sequence; the P-collapse analysis generalizes straightforwardly by replacing $\Delta$ with the per-block maximum gap.

## 8 Conclusion

We have analyzed two implementation-level precision considerations for FP8 E4M3 attention: (1) P-collapse under Attention Sink is a quantifiable threshold effect, activating around $\Delta \approx 6$–$7$ with leading-order underflow fraction $F = \Phi(\Delta + \delta_{k} - 6.93 - \ln S)$ (where $\delta_{k}$ is the within-sink extreme-value shift), that both reverse iteration and $S = 256$ scaling independently eliminate; (2) the $dp(S)$ sawtooth, defined over the E4M3 number line, characterizes $S = 256$ via three conditions: bit-exact arithmetic, the lower-envelope minimum step ($dp = 2^{-4}$), and the largest normal-range coverage among $2^{k}$ scales. Kernel-faithful experiments measure $3$–$10\times$ MSE improvement at the critical transition region ($\Delta = 5$–$9$, $N = 4096$–$16384$), and paired statistical tests show that both optimizations reach the same precision floor when combined.

For practitioners, the actionable recommendation is simple: for any FP8 attention kernel currently using $S = 1$ (direct cast) or $S = 448$ (max-normal), switching to $S = 256$ is a single-constant, bit-exact change that removes P-collapse and is never worse on P-cast MSE; whether it moves an end-to-end metric should be confirmed per model (§[7](https://arxiv.org/html/2606.06521#S7)). We make no transfer claims to other formats (MXFP4, FP4) or per-channel quantization, where the preconditions differ (§[7](https://arxiv.org/html/2606.06521#S7)).

## References

- Barbero et al. [2025] Federico Barbero, Álvaro Arroyo, Xiangming Gu, Christos Perivolaropoulos, Michael Bronstein, Petar Veličković, and Razvan Pascanu. Why do LLMs attend to the first token? *arXiv preprint arXiv:2504.02732*, 2025.
- Dao [2024] Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. *International Conference on Learning Representations*, 2024.
- Dao et al. [2022] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. *Advances in Neural Information Processing Systems*, 35, 2022.
- Gu et al. [2024] Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, and Min Lin. When attention sink emerges in language models: An empirical view. *arXiv preprint arXiv:2410.10781*, 2024.
- Micikevicius et al. [2022] Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, et al. FP8 formats for deep learning. *arXiv preprint arXiv:2209.05433*, 2022.
- Shah et al. [2024] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, and Tri Dao. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. *arXiv preprint arXiv:2407.08608*, 2024.
- Sun et al. [2024] Mingjie Sun, Xinlei Chen, J. Zico Kolter, and Zhuang Liu. Massive activations in large language models. *arXiv preprint arXiv:2402.17762*, 2024.
- Xiao et al. [2024] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. *International Conference on Learning Representations*, 2024.
- Zhang et al. [2024] Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, and Jianfei Chen. SageAttention2: Efficient attention with thorough outlier smoothing and per-thread INT4 quantization. *arXiv preprint arXiv:2411.10958*, 2024.
- Zhang et al. [2025] Jintao Zhang, Xiaoming Xu, Jia Wei, Haofeng Huang, Pengle Zhang, Chendong Xiang, Jun Zhu, and Jianfei Chen. SageAttention2++: A more efficient implementation of SageAttention2. *arXiv preprint arXiv:2505.21136*, 2025.

## Appendix A Complete $dp(S)$ Verification

We verify Theorem [6](https://arxiv.org/html/2606.06521#Thmtheorem6) by exhaustive computation over all 126 positive E4M3 values: for each $S$, take the binade with maximum LSB overlapping $[0, S]$ and compute $dp(S) = \text{LSB}_{\max}/S$.

Power-of-two scales ($k = 0, 1, \ldots, 8$): all give $dp = 2^{-4} = 0.0625$ exactly. The top binade is $[2^{k-1}, 2^{k})$ with LSB $= 2^{k-4}$. For $k = 9$ ($S = 512$): overflow, $dp = 0.25$.

Non-power-of-two scales: $dp(3) = 0.083$, $dp(100) = 0.080$, $dp(250) = 0.064$, $dp(300) = 0.107$, $dp(448) = 0.071$. All strictly exceed $2^{-4}$. See `experiments/proof_mathematical.py` for full enumeration.
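The enumeration can be approximated with a short script. The sketch below is not the authors' `experiments/proof_mathematical.py`; it is an independent illustration with hypothetical names. It reads "the binade with maximum LSB overlapping $[0, S]$" as the binade containing the largest E4M3 value strictly below $S$, which reproduces the values listed above, and it covers only $1 \leq S \leq 448$ because the overflow convention behind the $k = 9$ entry is not spelled out here.

```python
from bisect import bisect_left

def e4m3_positive_values():
    """All positive finite E4M3 values: 4 exponent bits (bias 7), 3 mantissa bits."""
    vals = []
    for e in range(16):
        for m in range(8):
            if e == 15 and m == 7:
                continue                          # the single NaN encoding
            if e == 0:
                if m > 0:
                    vals.append(m * 2.0 ** -9)    # subnormals: multiples of 2**-9
            else:
                vals.append((1 + m / 8) * 2.0 ** (e - 7))
    return sorted(vals)

VALUES = e4m3_positive_values()

def dp(S):
    """Normalized step LSB_max / S, using the spacing just below S on the E4M3 number line."""
    if not 1 <= S <= 448:
        raise ValueError("this sketch covers 1 <= S <= 448 only")
    i = bisect_left(VALUES, S) - 1                # largest representable value strictly below S
    lsb = VALUES[i + 1] - VALUES[i]               # grid spacing in that value's binade
    return lsb / S

if __name__ == "__main__":
    for S in (1, 3, 100, 250, 256, 300, 448):
        print(S, round(dp(S), 4))
```

Every power of two from $2^{0}$ to $2^{8}$ lands on $2^{-4}$ under this reading, and every other scale in range sits strictly above it, which is the sawtooth's lower envelope.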

## Appendix B Saturation Statistical Verification

We test $H_{0}$: MSE(Fwd+$S{=}256$) $=$ MSE(Rev+$S{=}256$) via paired $t$-test (100 instances per condition).

∗ Where significant, reverse is marginally *better* (not worse). All differences are $\sim 10^{-8}$, negligible vs. MSE values of $10^{-5}$–$10^{-6}$, confirming algorithmic equivalence.
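For reference, the paired $t$ statistic used in such a test can be computed as follows. This is a generic textbook sketch with a hypothetical function name; it is not the authors' analysis script, and the inputs would be the per-instance MSE values of the two configurations.

```python
import math

def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a[i], b[i]."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)   # unbiased sample variance
    if var == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean / math.sqrt(var / n), n - 1
```

Pairing matters here because both configurations are run on the same random instances, so instance-to-instance variation cancels in the differences.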
