FAIR-Calib: Frontier-Aware Instability-Reweighted Calibration for Post-Training Quantization of Diffusion Large Language Models
Summary
This paper proposes FAIR-Calib, a two-stage post-training quantization framework for diffusion large language models that addresses the instability of token commitments during iterative refinement. It achieves state-of-the-art results on LLaDA and Dream models under low-bit quantization.
# FAIR-Calib: Frontier-Aware Instability-Reweighted Calibration for Post-Training Quantization of Diffusion Large Language Models
Source: [https://arxiv.org/html/2606.06547](https://arxiv.org/html/2606.06547)
Linlin Yang, Sheng Xu, Boyu Liu, Guodong Guo, Zhongqian Fu, Hang Zhou, Baochang Zhang
###### Abstract
Diffusion Large Language Models (dLLMs) refine tokens iteratively but commit them irreversibly, leading to a “stability lag” where early decisions remain fragile even after being written. We reveal that Post-Training Quantization (PTQ) error easily flips these borderline decisions at the write frontier, which are then permanently locked in and amplified. To address this, we propose *Frontier-Aware Instability-Reweighted Calibration* (*FAIR-Calib*), a two-stage PTQ framework for dLLMs. Stage I probes a full-precision teacher to estimate a position prior that combines frontier hits and masked-stage reliability. Stage II performs off-policy, layer-wise calibration by minimizing a reweighted hidden-state MSE, effectively prioritizing the protection of fragile frontier states without requiring expensive end-to-end diffusion rollouts. We further theoretically justify our weighted objective as a surrogate for output KL divergence. Empirically, FAIR-Calib consistently outperforms state-of-the-art baselines on LLaDA and Dream (W4A4), significantly reducing frontier decision flips and suppressing post-commit mismatches across diverse benchmarks.
Post\-Training Quantization, Diffusion Language Models, Large Language Models
## 1 Introduction
Figure 1: (a) Schematic: naive quantization perturbs relative logits and permanently commits the wrong token, highlighting a key failure mode of current quantization methods under diffusion decoding. (b) Complementary cumulative distribution function (CCDF) of the stability lag $\delta_{\text{lag}}$ in the generation region ($N_{\text{samples}}=32$). Although most positions stabilize shortly after commit, the heavy tail indicates *fragile commit states* that keep oscillating post-commit, showing that commitment $\neq$ stabilization. The calibrated baseline exhibits an even heavier tail than FP, implying that standard calibration does not remove fragility. (c) Decode divergence w.r.t. the FP model (metric: $\mathrm{mse\_prob}$) as a function of diffusion step. The baseline shows progressive, step-wise amplification of error once a false commit occurs (red marker), indicating that small local perturbations can trigger sustained divergence across subsequent steps.
Transformer-based large language models have achieved remarkable generalization and instruction-following abilities at the scale of tens to hundreds of billions of parameters, as exemplified by the LLaMA (Touvron et al., [2023](https://arxiv.org/html/2606.06547#bib.bib16)) and Qwen (Yang et al., [2025](https://arxiv.org/html/2606.06547#bib.bib17)) model families.
Recently, diffusion large language models (dLLMs) have emerged as a promising alternative to autoregressive decoding, offering iterative refinement and flexible infilling by initializing an entire response sequence upfront and denoising it with bidirectional attention over multiple steps (Nie et al., [2025](https://arxiv.org/html/2606.06547#bib.bib1); Zhu et al., [2025](https://arxiv.org/html/2606.06547#bib.bib3); Ye et al., [2025](https://arxiv.org/html/2606.06547#bib.bib2)). This iterative masked refinement is conceptually related to earlier non-left-to-right decoding paradigms (Ghazvininejad et al., [2019](https://arxiv.org/html/2606.06547#bib.bib26); Stern et al., [2019](https://arxiv.org/html/2606.06547#bib.bib27); Chang et al., [2022](https://arxiv.org/html/2606.06547#bib.bib28)). However, such multi-step global refinement substantially increases inference-time compute and memory footprints, making post-training quantization (PTQ) crucial for practical deployment (Frantar et al., [2022](https://arxiv.org/html/2606.06547#bib.bib19); Frantar and Alistarh, [2023](https://arxiv.org/html/2606.06547#bib.bib18); Xiao et al., [2023](https://arxiv.org/html/2606.06547#bib.bib20); Lin et al., [2024](https://arxiv.org/html/2606.06547#bib.bib29); Ashkboos et al., [2024](https://arxiv.org/html/2606.06547#bib.bib5); Sun et al., [2024](https://arxiv.org/html/2606.06547#bib.bib4)).
However, quantizing dLLMs is not a straightforward extension of autoregressive PTQ: Lin et al. ([2025](https://arxiv.org/html/2606.06547#bib.bib33)) systematically transferred classic low-bit PTQ from autoregressive LLMs to dLLMs and found that naive transfer degrades notably on challenging reasoning tasks. We attribute this brittleness to a diffusion-specific inference mechanism: dLLM decoding proceeds by repeatedly predicting token distributions for all positions and *unmasking* a subset of mask positions into concrete tokens, reducing the mask set step by step. We refer to this irreversible write as a *commit*. This *irreversibility* means that once a token is written, it becomes part of the conditioning context and cannot be revised, even if the model’s posterior belief about that position continues to evolve. Consequently, the decoding process becomes particularly brittle under perturbations: as illustrated in Figure [1](https://arxiv.org/html/2606.06547#S1.F1)(a), quantization perturbations can easily flip a borderline decision at the write frontier, creating an error that is permanently locked in.
We trace this brittleness to a fundamental mismatch: commitment $\neq$ stabilization. As visualized in Figure [1](https://arxiv.org/html/2606.06547#S1.F1)(b), even in full precision, many positions exhibit a significant *stability lag* $\delta_{\text{lag}}$. We define $\delta_{\text{lag}}$ as the number of diffusion steps after the first irreversible commit until the model’s top-1 prediction becomes consistent with the final decoded token for all subsequent steps. In other words, many positions continue to oscillate in their top-1 prediction long after being committed. The heavy tail of this distribution reveals a non-negligible subset of *fragile commit states*, where decisions remain context-sensitive after commit. Standard calibration in PTQ methods exacerbates this issue, prolonging the instability and exposing more positions to the irreversible flips described above. Crucially, these locked-in flips do not remain isolated: because the incorrect token is fixed as context, it forces the model to refine subsequent tokens based on the error. Figure [1](https://arxiv.org/html/2606.06547#S1.F1)(c) confirms this trajectory: once a false commit occurs at a fragile frontier (red marker), the divergence from the teacher does not vanish but undergoes a progressive, step-wise amplification across subsequent refinement steps, severely degrading generation quality.
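The stability-lag definition above can be made concrete with a small sketch. The function name `stability_lag`, the chronological step index (step 0 is the first decoding step), and the sentinel used when a position never stabilizes are illustrative assumptions, not details taken from the paper.

```python
def stability_lag(top1_trace, commit_step, final_token):
    """Steps between the commit and the first step from which the top-1
    prediction equals the final token for all remaining steps.

    top1_trace[k] is the model's top-1 token for one position at
    chronological decoding step k (assumed indexing, not the paper's t).
    """
    n = len(top1_trace)
    stable_from = n  # assumed sentinel: never stable within the trace
    # Walk backwards to find the longest agreeing suffix after the commit.
    for k in range(n - 1, commit_step - 1, -1):
        if top1_trace[k] != final_token:
            break
        stable_from = k
    return stable_from - commit_step


# Position committed at step 1 as token 7, but the top-1 flips at step 3.
lag = stability_lag([5, 7, 7, 3, 7, 7], commit_step=1, final_token=7)
```

Here the prediction only agrees with the final token from step 4 onward, so the lag is 3 steps even though the token was written at step 1.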
To address these challenges, we propose the *Frontier-Aware Instability-Reweighted Calibration* (*FAIR-Calib*) framework for dLLM quantization. Our framework consists of two synergistic stages: (i) *Teacher Probing*: We utilize the full-precision teacher to estimate a position-aware prior. This prior integrates frontier irreversibility (upweighting positions at commit time) and masked-stage reliability (accounting for teacher confidence). We show that this prior is largely mechanism-driven and exhibits robust cross-corpus transferability. (ii) *Off-policy Weighted Calibration*: We perform efficient layer-wise hidden-state alignment using the estimated weights. By employing a teacher-forcing surrogate, FAIR-Calib avoids expensive end-to-end diffusion rollouts while effectively stabilizing the write frontier. Empirically, FAIR-Calib significantly reduces write-step decision flips and post-commit mismatches, including both “mean-disagree” and “never-agree” cases. Furthermore, our method mitigates the sequential error amplification typically triggered by false commits, as evidenced by improved probability-MSE traces. Our major contributions are summarized as follows:
- We identify and quantify brittleness in dLLM decoding induced by *irreversible commit* under *fragile commit states*, where low-bit quantization flips borderline write decisions and the resulting errors are locked in and amplified across refinement steps.
- We propose FAIR-Calib, a two-stage PTQ framework for dLLMs (Figure [2](https://arxiv.org/html/2606.06547#S1.F2)): Stage I probes an FP teacher to estimate a *frontier-aware, reliability-gated* position prior, and Stage II performs *off-policy* layer-wise teacher-forcing calibration via a weighted hidden-state MSE, avoiding expensive diffusion rollouts.
- We justify an additive time×position weighting and its weighted hidden-state MSE surrogate under mild assumptions, and empirically show consistent W4A4 gains on Dream/LLaDA across diverse benchmarks, with fewer teacher-forced commit-step flips, reduced post-commit mismatch, and suppressed error amplification.
Figure 2: FAIR-Calib overview. Stage I probes the FP teacher to estimate a fixed position prior $\bar{w}$ that highlights irreversible commit positions and masked-stage reliability. Stage II performs layer-wise teacher-forcing calibration with a $\bar{w}$-weighted hidden-state MSE to obtain a W4A4 model without diffusion rollouts.
## 2 Related Works
### 2.1 Diffusion Language Models
Diffusion models were generalized to discrete state spaces via denoising diffusion over categorical variables (Austin et al., [2021](https://arxiv.org/html/2606.06547#bib.bib23)). Subsequent works explored diffusion-style text generation by iterative denoising of token sequences or latent representations, enabling non-left-to-right generation with global revision (Li et al., [2022](https://arxiv.org/html/2606.06547#bib.bib24); Gong et al., [2022](https://arxiv.org/html/2606.06547#bib.bib25)). This refinement view is also related to earlier iterative decoding paradigms that repeatedly revise low-confidence positions (Ghazvininejad et al., [2019](https://arxiv.org/html/2606.06547#bib.bib26); Stern et al., [2019](https://arxiv.org/html/2606.06547#bib.bib27); Chang et al., [2022](https://arxiv.org/html/2606.06547#bib.bib28)). More recently, diffusion *large* language models scale masked refinement to Transformer LLMs by initializing an answer window with masks and denoising it with bidirectional attention over multiple steps (Nie et al., [2025](https://arxiv.org/html/2606.06547#bib.bib1); Zhu et al., [2025](https://arxiv.org/html/2606.06547#bib.bib3); Ye et al., [2025](https://arxiv.org/html/2606.06547#bib.bib2)). While enabling flexible infilling, their stepwise *commit* and long-horizon refinement increase inference cost and introduce new brittleness modes for compression such as PTQ.
### 2.2 Post-training Quantization for Large Language Models
Post-training quantization (PTQ) compresses pretrained LLMs by quantizing weights and/or activations with a small calibration set (Zhu et al., [2024](https://arxiv.org/html/2606.06547#bib.bib34)). Reconstruction-based PTQ explicitly minimizes layer-wise output/hidden discrepancies, e.g., GPTQ-style second-order updates (Frantar et al., [2022](https://arxiv.org/html/2606.06547#bib.bib19)). For low-bit joint weight–activation quantization (e.g., W4A4), distribution mismatch and outliers are key bottlenecks, motivating activation smoothing (Xiao et al., [2023](https://arxiv.org/html/2606.06547#bib.bib20)), rotation-based conditioning (Ashkboos et al., [2024](https://arxiv.org/html/2606.06547#bib.bib5)), and affine flattening transforms (Sun et al., [2024](https://arxiv.org/html/2606.06547#bib.bib4)), along with complementary advances such as activation-aware scaling/clipping and system-oriented stacks (Lin et al., [2024](https://arxiv.org/html/2606.06547#bib.bib29); Shao et al., [2023](https://arxiv.org/html/2606.06547#bib.bib30); Yao et al., [2022](https://arxiv.org/html/2606.06547#bib.bib31); Tseng et al., [2024](https://arxiv.org/html/2606.06547#bib.bib32)). Most PTQ methods are developed for autoregressive decoding and do not account for diffusion decoding with *irreversible commits*; our work targets this gap.
## 3 Method
### 3.1 Preliminaries
#### 3.1.1 PTQ with Affine Flattening Transforms
We follow a standard post-training quantization (PTQ) setup, quantizing weights and activations to low bit-width without updating pretrained FP weights. In addition, we adopt the same *layer-wise learnable affine transformation* design as FlatQuant (Sun et al., [2024](https://arxiv.org/html/2606.06547#bib.bib4)) to flatten weight/activation distributions before applying uniform quantization.
Let $\mathcal{Q}=\{q_{\min},\ldots,q_{\max}\}$ be the integer grid determined by bit-width $b$. Given a scale $s>0$ and an (optional) zero-point $z$, the quantizer maps a real-valued variable $u$ to
$$\bar{u}=\mathrm{clip}\!\left(\left\lfloor\tfrac{u}{s}+z\right\rceil,\;q_{\min},q_{\max}\right),\qquad Q(u)=s(\bar{u}-z),\tag{1}$$
where $\lfloor\cdot\rceil$ denotes rounding-to-nearest and $\mathrm{clip}$ clamps to the valid integer range. In this paper, we use symmetric quantization and set $z=0$.
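As a minimal sketch of Eq. (1) with $z=0$, the snippet below quantizes a small vector to a signed 4-bit grid. The per-tensor max-abs scale is an assumption made for illustration (the paper calibrates its quantization parameters), and `quantize_symmetric` is a hypothetical helper name.

```python
import numpy as np


def quantize_symmetric(u, bits=4):
    """Symmetric uniform quantizer: u_bar = clip(round(u / s), q_min, q_max)."""
    q_min, q_max = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    # Assumption: scale chosen from the max-abs value of the tensor.
    s = float(np.max(np.abs(u))) / q_max
    if s == 0.0:
        s = 1.0  # all-zero input; any positive scale works
    # np.rint rounds to nearest, resolving exact ties to the even integer.
    u_bar = np.clip(np.rint(u / s), q_min, q_max)
    return s * u_bar, u_bar.astype(np.int32), s


u = np.array([-1.0, -0.26, 0.0, 0.31, 0.7])
u_hat, u_int, scale = quantize_symmetric(u, bits=4)
```

With a max-abs scale no value is clipped, so the reconstruction error of every entry is bounded by half a quantization step, `scale / 2`.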
For each linear layer $y=Wx$, we introduce an invertible affine reparameterization
$$y=U^{-1}\,\widetilde{W}\,\widetilde{x},\qquad\widetilde{W}=UWV^{-1},\;\;\widetilde{x}=Vx,\tag{2}$$
and compute the quantized output as
$$y^{q}=U^{-1}\Big(Q_{W}(\widetilde{W})\,Q_{X}(\widetilde{x})\Big),\tag{3}$$
where $Q_{W}$ and $Q_{X}$ are uniform affine quantizers (with their own scales and, if used, zero-points).
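The reparameterization in Eqs. (2) and (3) can be checked numerically. In this sketch, $U$ and $V$ are random orthogonal matrices, chosen only because they are guaranteed invertible (FlatQuant learns its transforms), and `fake_quant` is a hypothetical per-tensor max-abs quantizer standing in for $Q_W$ and $Q_X$.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in = 4, 6
W = rng.normal(size=(d_out, d_in))
x = rng.normal(size=d_in)

# Assumption: orthogonal U, V from a QR factorization (always invertible).
U = np.linalg.qr(rng.normal(size=(d_out, d_out)))[0]
V = np.linalg.qr(rng.normal(size=(d_in, d_in)))[0]

W_t = U @ W @ np.linalg.inv(V)   # W~ = U W V^{-1}
x_t = V @ x                      # x~ = V x

y_fp = W @ x
y_reparam = np.linalg.solve(U, W_t @ x_t)   # U^{-1} (W~ x~), Eq. (2)


def fake_quant(u, bits=4):
    """Hypothetical symmetric quantize-dequantize with a max-abs scale."""
    q_max = 2 ** (bits - 1) - 1
    s = np.max(np.abs(u)) / q_max
    return s * np.clip(np.rint(u / s), -q_max - 1, q_max)


# Eq. (3): quantize in the transformed space, then map back with U^{-1}.
y_q = np.linalg.solve(U, fake_quant(W_t) @ fake_quant(x_t))
quant_error = float(np.linalg.norm(y_q - y_fp))
```

Without quantization the reparameterized output matches $Wx$ up to floating-point error; all approximation error in `y_q` comes from the two quantizers.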
We calibrate the quantization-related parameters for each layer on a small calibration set $\mathcal{D}$, while keeping pretrained weights frozen. Denoting the FP and quantized layer mappings by $F_{\ell}^{\star}(\cdot)$ and $F_{\ell}^{q}(\cdot;\theta_{\ell})$, respectively, we solve a sequential layer-wise reconstruction problem:
$$\min_{\theta_{\ell}}\ \mathbb{E}_{x\sim\mathcal{D}}\left[\left\|F_{\ell}^{q}(x;\theta_{\ell})-F_{\ell}^{\star}(x)\right\|_{2}^{2}\right].\tag{4}$$
In our method, this generic reconstruction loss is instantiated as a *position-weighted hidden-state MSE* that emphasizes fragile commit positions (Section [4](https://arxiv.org/html/2606.06547#S4)).
#### 3.1.2 Masked Diffusion Decoding as a Markov Chain
Let $\mathcal{V}$ be the vocabulary and $N$ the full sequence length. We denote the prompt length by $N_{p}$ and the generation window (answer region) by indices $\mathcal{G}=\{N_{p}+1,\dots,N\}$. We index reverse diffusion from $t=T$ (fully masked) to $t=0$ (final). A masked diffusion decoder maintains a partially observed state
$$S_{t}\equiv X^{(t)}\in(\mathcal{V}\cup\{\texttt{[MASK]}\})^{N},\tag{5}$$
where some positions are concrete tokens and others are $\texttt{[MASK]}$. Let $\mathcal{M}(S_{t})=\{i:S_{t}[i]=\texttt{[MASK]}\}$ be the set of masked positions.
Given state $S_{t}$, a model $M$ (FP teacher $M^{\star}$ or quantized model $M^{q}$) produces per-position logits $z_{M,i}(S_{t})\in\mathbb{R}^{|\mathcal{V}|}$ and categorical distributions
$$p_{M,i}(\cdot\mid S_{t})=\mathrm{softmax}(z_{M,i}(S_{t})).\tag{6}$$
A decoding policy selects a commit set $C_{t}\subseteq\mathcal{M}(S_{t})$ and writes tokens at committed positions. We emphasize that dLLM decoding is not necessarily “posterior sampling”; it is an algorithmic transition rule.
Initialization and $p(S_{T})$. The initial state $S_{T}$ is typically the prompt followed by a fully masked generation window:
$$S_{T}=[x_{1:N_{p}},\underbrace{\texttt{[MASK]},\dots,\texttt{[MASK]}}_{N-N_{p}}],\tag{7}$$
which induces a degenerate initial distribution
$$p(S_{T})=\delta(S_{T}=\bar{S}_{T}),\tag{8}$$
i.e., a Dirac delta at the deterministic initialization $\bar{S}_{T}$.
Quantization\-induced local noise propagates and scales within the Markov decoding chain, demanding spatio\-temporally aware calibration\.
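The decoding chain of this subsection can be sketched as a loop over $t=T,\dots,1$ that commits a random subset of masked positions at each step and never remasks them. Everything model-specific here is a stand-in: `toy_logits` returns random logits instead of calling a dLLM, `MASK = -1` is a hypothetical sentinel, and the even split of commits across steps is an assumed schedule.

```python
import numpy as np

MASK = -1  # hypothetical sentinel for the [MASK] token


def toy_logits(state, vocab_size, rng):
    """Stand-in for the model: random logits for every position."""
    return rng.normal(size=(len(state), vocab_size))


def decode_random_commit(prompt, gen_len, steps, vocab_size, seed=0):
    rng = np.random.default_rng(seed)
    # S_T: prompt followed by a fully masked generation window (Eq. 7).
    state = np.concatenate([np.asarray(prompt), np.full(gen_len, MASK)])
    commit_log = []
    for t in range(steps, 0, -1):
        masked = np.flatnonzero(state == MASK)       # M(S_t)
        if masked.size == 0:
            break
        k = int(np.ceil(masked.size / t))            # assumed even schedule
        commit = rng.choice(masked, size=k, replace=False)  # C_t, model-independent
        logits = toy_logits(state, vocab_size, rng)
        state[commit] = logits[commit].argmax(axis=-1)      # irreversible write
        commit_log.append((t, np.sort(commit)))
    return state, commit_log


final, log = decode_random_commit([3, 1, 4], gen_len=8, steps=4, vocab_size=10)
```

Each position in the generation window appears in exactly one commit set, which is the monotone-mask behavior the later analysis relies on.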
### 3.2 Overview of FAIR-Calib
We propose a two\-stage post\-training quantization framework tailored for diffusion language models with irreversible commits\. It separates*where errors are amplified*from*how calibration is performed*\.
Stage I: teacher probing. We run a small number of full-precision teacher rollouts under stochastic (random) commits and construct a fixed position prior $\bar{w}$ over the generation window. This choice of random commits is deliberate: it aligns the probing process with the randomized masking mechanisms used during dLLM training (Nie et al., [2025](https://arxiv.org/html/2606.06547#bib.bib1)) and provides policy-agnostic coverage of the masked state space, ensuring that $\bar{w}$ reflects the model’s intrinsic structural sensitivity. The prior combines (i) a *frontier-hit* signal that marks when a position is committed (irreversible), and (ii) a *masked-stage teacher-reliability* signal computed from the teacher distribution while the position remains masked. Weights are additively accumulated across diffusion steps and then window-aligned and normalized.
Stage II: static weighted calibration. Using $\bar{w}$, we calibrate the quantized model with standard layer-wise teacher-forcing on fully observed tokens, minimizing a position-weighted hidden-state MSE. This avoids expensive end-to-end diffusion rollouts during calibration while prioritizing high-impact frontier commits and using masked-stage reliability to obtain a robust, transferable prior.
Crucially, this design stems from the insight that positional vulnerability is an intrinsic structural property dictated by the model weights and decoding dynamics\. Therefore, the importance prior estimated from the masked probing phase can be effectively reused in a full\-text calibration setting without loss of relevance\.
### 3.3 Frontier-Aware Time×Position Weights
We design a position-wise weight $w_{i}$ by probing the FP teacher decoding dynamics. At each step $t$ along a teacher rollout, we accumulate two additive components in the answer region:
$$w_{i}\leftarrow w_{i}+\lambda_{0}(t)\,\mathbf{1}\{i\in\widehat{C}_{t}\}+\lambda_{1}\,\widetilde{c}_{t,i}\,\mathbf{1}\{i\in\mathcal{M}(S_{t})\},\tag{9}$$
where: (i) $\widehat{C}_{t}$ is the realized write frontier (a sampled $C_{t}$) in the teacher rollout; $\mathbf{1}\{i\in\widehat{C}_{t}\}$ is the *frontier-hit* indicator at step $t$. (ii) $\widetilde{c}_{t,i}$ is a row-wise normalized *masked-state teacher sharpness* score (e.g., token probability, negative entropy, or margin), used as a *reliability gate* when aggregating a transferable static prior for off-policy Stage II calibration. (iii) $\lambda_{0}(t)$ follows an early-boost schedule to emphasize earlier commits that influence more subsequent steps; $\lambda_{1}$ scales the fragility term. Importantly, the weight is *additively accumulated* across steps, which will be theoretically justified in Section [4](https://arxiv.org/html/2606.06547#S4). We probe with random commits to obtain *training-aligned, policy-agnostic* coverage of partially-masked states; all accuracy evaluations still follow each model’s default inference-time policy.
Window alignment and floor. In prompt-conditioned dLLM generation, diffusion acts primarily on the answer window $\mathcal{G}$ while the prompt is a fixed condition. We therefore align $w$ to the last-$K$ generation window (e.g., $K=256$) and apply a small floor weight outside $\mathcal{G}$ for numerical stability in layer-wise calibration.
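A minimal sketch of the Eq. (9) accumulation followed by window alignment and a floor is given below. The rollout record format, the schedule `early_boost`, the mean-one normalization over the generation window, and the floor value are all illustrative assumptions; the text only specifies additive accumulation, an early-boost $\lambda_0(t)$, window alignment, normalization, and a small floor. Positions are 0-based in the code.

```python
import numpy as np


def accumulate_prior(rollouts, seq_len, prompt_len, lam0, lam1=1.0, floor=1e-3):
    """Accumulate Eq. (9) over teacher rollouts, then align and floor.

    Each rollout is a list of (t, commit_idx, masked_idx, sharpness) records,
    where sharpness holds an already normalized score c~_{t,i} per position.
    """
    w = np.zeros(seq_len)
    for rollout in rollouts:
        for t, commit, masked, sharp in rollout:
            w[commit] += lam0(t)                 # frontier-hit term
            w[masked] += lam1 * sharp[masked]    # masked-stage sharpness term
    gen = w[prompt_len:]
    total = gen.sum()
    if total > 0:
        gen = gen * gen.size / total             # assumed: mean 1 over the window
    w_bar = np.full(seq_len, floor)              # small floor on the prompt
    w_bar[prompt_len:] = np.maximum(gen, floor)
    return w_bar


T = 2


def early_boost(t):
    """Hypothetical schedule: larger t (earlier commit) gets a larger weight."""
    return 1.0 + t / T


sharp = np.full(6, 0.5)
rollout = [
    (2, np.array([2, 3]), np.array([2, 3, 4, 5]), sharp),  # first step, t = T
    (1, np.array([4, 5]), np.array([4, 5]), sharp),        # last step, t = 1
]
w_bar = accumulate_prior([rollout], seq_len=6, prompt_len=2,
                         lam0=early_boost, lam1=0.2)
```

In this toy rollout the positions committed at $t=2$ end up with a larger weight than those committed at $t=1$, even though the latter collect the sharpness term for one extra step.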
### 3.4 Off-Policy Static Teacher-Forcing Calibration with Weighted Hidden MSE
Direct end-to-end optimization over diffusion trajectories during calibration would require rolling out the quantized model for all steps and updating quantization parameters iteratively, which is prohibitively expensive and incompatible with standard layer-wise PTQ calibration. Instead, we use an off-policy surrogate: rather than calibrating on the on-policy masked states induced by a commit policy, we feed fully observed real tokens (no masks) and align hidden representations between the quantized model and the FP teacher. For each layer/block $\ell$ in sequential order, we calibrate only $\theta_{\ell}$ while keeping other layers fixed:
$$\arg\min_{\theta_{\ell}}\ \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\sum_{i=1}^{N}\bar{w}_{i}\,\left\|h_{\ell,i}^{q}(x,y;\theta_{\leq\ell})-h_{\ell,i}^{\star}(x,y)\right\|_{2}^{2}\right],\tag{10}$$
where $\theta_{\leq\ell}$ collects the quantization parameters of layers up to $\ell$, with layers $<\ell$ already calibrated and frozen. Section [4](https://arxiv.org/html/2606.06547#S4) shows that this objective is a principled surrogate for reducing $\mathrm{KL}(\mu^{\star}\|\mu^{q})$ under mild assumptions, where $\mu^{\star}$ and $\mu^{q}$ are the final decoded output distributions of the teacher and the quantized model (formalized in Section [4.1](https://arxiv.org/html/2606.06547#S4.SS1)).
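The loss inside Eq. (10) is straightforward to write down. The sketch below evaluates it for one layer on a batch of hidden states; the array shapes and the name `weighted_hidden_mse` are assumptions, and a real calibration loop would differentiate this quantity with respect to $\theta_\ell$ in an autodiff framework.

```python
import numpy as np


def weighted_hidden_mse(h_q, h_fp, w_bar):
    """Position-weighted hidden-state MSE for one layer.

    h_q, h_fp: arrays of shape (batch, N, d) with quantized / FP hidden states.
    w_bar:     array of shape (N,) with the static position prior.
    """
    per_pos = np.sum((h_q - h_fp) ** 2, axis=-1)   # squared L2 norm per position
    return float(np.mean(per_pos @ w_bar))         # weighted sum over i, mean over batch


h_fp = np.zeros((2, 3, 4))
w_bar = np.array([0.0, 1.0, 2.0])

h_q_all = np.ones((2, 3, 4))          # unit error at every position
loss_all = weighted_hidden_mse(h_q_all, h_fp, w_bar)

h_q_low = np.zeros((2, 3, 4))
h_q_low[:, 1, :] = 1.0                # error only at the weight-1 position
h_q_high = np.zeros((2, 3, 4))
h_q_high[:, 2, :] = 1.0               # same error at the weight-2 position
loss_low = weighted_hidden_mse(h_q_low, h_fp, w_bar)
loss_high = weighted_hidden_mse(h_q_high, h_fp, w_bar)
```

The same hidden-state error costs twice as much at the position whose prior weight is twice as large, which is how the prior steers calibration toward frontier positions.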
## 4 Theoretical Analysis
Takeaway. Under model-independent random commits and mild smoothness, $\mathrm{KL}(\mu^{\star}\|\mu^{q})$ admits an additive time×position upper bound with contributions only from committed positions, and each term is controlled by a squared hidden-state discrepancy, which justifies our Stage II $\bar{w}$-weighted hidden-state MSE surrogate.
Our analysis proceeds in three steps: \(i\) upper bound the output KL by a trajectory KL and decompose it across timesteps; \(ii\) show each per\-step divergence only involves committed positions; \(iii\) bound token\-level KL by squared logit error and then by hidden\-state MSE, yielding a tractable weighted surrogate\. All necessary proofs and additional remarks are provided in Appendix[B](https://arxiv.org/html/2606.06547#A2)\.
### 4.1 Output Divergence Objective
Let $\tau=(S_{T},S_{T-1},\dots,S_{0})$ denote a decoding trajectory. Under policy $\pi$ and model $M$, the induced trajectory distribution is
$$\mathbb{P}^{M}(\tau)=p(S_{T})\prod_{t=1}^{T}K_{t}^{M}(S_{t-1}\mid S_{t}),\tag{11}$$
where $K_{t}^{M}$ is the one-step transition kernel. The output distribution is the marginal of $S_{0}$:
$$\mu^{M}(S_{0})=\sum_{\tau:S_{0}(\tau)=S_{0}}\mathbb{P}^{M}(\tau).\tag{12}$$
Our ultimate theoretical objective for calibration is distribution alignment between FP and quantized outputs:
$$\min\ \mathrm{KL}(\mu^{\star}\|\mu^{q}),\tag{13}$$
which is intractable to compute directly, because evaluating $\mu^{M}$ requires marginalizing over all possible commit-set choices and token assignments across $T$ steps.
### 4.2 From Output Divergence to Trajectory Divergence
The objective $\mathrm{KL}(\mu^{\star}\|\mu^{q})$ compares the *final* decoded outputs, but it is difficult to evaluate because $\mu^{M}$ marginalizes over all intermediate commit decisions across $T$ steps. We therefore upper bound the output divergence by the divergence between the *entire decoding trajectories*, which admits a Markovian decomposition into per-step terms.
###### Lemma 4.1 (Data processing upper bound).
Let $g(\tau)=S_{0}$ be the mapping from a trajectory to its final state. Then
$$\mathrm{KL}(\mu^{\star}\|\mu^{q})\leq\mathrm{KL}(\mathbb{P}^{\star}(\tau)\|\mathbb{P}^{q}(\tau)).\tag{14}$$
Lemma [4.1](https://arxiv.org/html/2606.06547#S4.Thmtheorem1) formalizes that matching the full trajectory distribution is sufficient: any mismatch in outputs must originate from mismatches along the trajectory.
Next, we use the chain rule for KL on Markov path measures, where $d_{t}^{\star}$ denotes the teacher’s *occupancy measure*, i.e., the marginal distribution of $S_{t}$ when rolling out the teacher from $p(S_{T})$.
###### Lemma 4.2 (Markov chain KL decomposition).
Assume both chains share the same initial distribution $p(S_{T})$. Let $\mathbb{P}^{M}(\tau)=p(S_{T})\prod_{t=1}^{T}K_{t}^{M}(S_{t-1}\mid S_{t})$ for $M\in\{\star,q\}$. Then
$$\mathrm{KL}(\mathbb{P}^{\star}(\tau)\|\mathbb{P}^{q}(\tau))=\sum_{t=1}^{T}\mathbb{E}_{S_{t}\sim d_{t}^{\star}}\left[\mathrm{KL}\!\left(K_{t}^{\star}(\cdot\mid S_{t})\,\|\,K_{t}^{q}(\cdot\mid S_{t})\right)\right],\tag{15}$$
where $d_{t}^{\star}$ is the teacher occupancy measure at step $t$.
Together, Lemma [4.1](https://arxiv.org/html/2606.06547#S4.Thmtheorem1) and Lemma [4.2](https://arxiv.org/html/2606.06547#S4.Thmtheorem2) reduce the problem to bounding the one-step kernel divergence $\mathrm{KL}(K_{t}^{\star}(\cdot\mid S_{t})\|K_{t}^{q}(\cdot\mid S_{t}))$ under teacher-visited states. We next show this per-step divergence is *sparse* and only involves the committed positions.
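Both lemmas can be checked by brute force on a tiny Markov chain. The sketch below uses an abstract 3-state chain with $T=2$ and random positive kernels as stand-ins for the teacher and quantized decoders; none of these numbers come from the paper. It enumerates all trajectories to compute the path KL, compares it with the per-step sum of Eq. (15), and confirms that the output KL of Eq. (14) is no larger.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_states, T = 3, 2


def random_kernel():
    """Random row-stochastic matrix with strictly positive entries."""
    k = rng.random((n_states, n_states)) + 0.1
    return k / k.sum(axis=1, keepdims=True)


def kl(p, q):
    keep = p > 0
    return float(np.sum(p[keep] * np.log(p[keep] / q[keep])))


p_init = np.array([1.0, 0.0, 0.0])            # shared deterministic start
K_fp = [random_kernel() for _ in range(T)]    # teacher kernels, first step first
K_q = [random_kernel() for _ in range(T)]     # quantized kernels

paths = list(itertools.product(range(n_states), repeat=T + 1))


def path_probs(kernels):
    return np.array([
        p_init[path[0]]
        * np.prod([K[path[k], path[k + 1]] for k, K in enumerate(kernels)])
        for path in paths
    ])


def output_marginal(probs):
    mu = np.zeros(n_states)
    for path, p in zip(paths, probs):
        mu[path[-1]] += p                     # marginalize onto the final state
    return mu


P_fp, P_q = path_probs(K_fp), path_probs(K_q)
traj_kl = kl(P_fp, P_q)

# Right-hand side of Eq. (15): expected per-step kernel KL under the
# teacher occupancy measure d_t.
step_sum, d = 0.0, p_init.copy()
for Kf, Kq in zip(K_fp, K_q):
    step_sum += sum(d[s] * kl(Kf[s], Kq[s]) for s in range(n_states))
    d = d @ Kf

out_kl = kl(output_marginal(P_fp), output_marginal(P_q))
```

The equality in Eq. (15) holds up to floating-point error, while Eq. (14) is in general a strict inequality because distinct trajectories can end in the same final state.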
### 4.3 Per-Step KL Decomposition under Commit Policies
To derive a *structural, policy-agnostic* time×position prior, we analyze the Markov kernel divergence under *model-independent random commits*, where it admits an exact sparse decomposition over committed positions. We formalize the intuition that *only committed positions contribute to the per-step KL*.
Let $\pi$ be a commit-set distribution and assume it is model-independent under random commit. Given $S_{t}$, we sample $C_{t}\sim\pi(\cdot\mid S_{t})$ and then sample tokens for $i\in C_{t}$ from $p_{M,i}(\cdot\mid S_{t})$, while copying all other positions deterministically. The transition kernel can be expressed as:
$$K_{t}^{M}(S_{t-1}\mid S_{t})=\sum_{C_{t}\subseteq\mathcal{M}(S_{t})}\pi(C_{t}\mid S_{t})\left[\prod_{i\in C_{t}}p_{M,i}(S_{t-1}[i]\mid S_{t})\right]\cdot\mathbf{1}\{S_{t-1}[\neg C_{t}]=S_{t}[\neg C_{t}]\}.\tag{16}$$
###### Assumption 4.3 (Disjoint-support monotone-mask transitions).
For any state $S_{t}$, committed positions always emit tokens in $\mathcal{V}$ (excluding $\texttt{[MASK]}$), and uncommitted positions deterministically remain $\texttt{[MASK]}$. Equivalently, the mask pattern of $S_{t-1}$ uniquely determines the realized commit set $C_{t}$ via $\mathcal{M}(S_{t-1})=\mathcal{M}(S_{t})\setminus C_{t}$. Hence, conditional next-state distributions induced by different commit sets have disjoint support. Notably, the decoding of both Dream and LLaDA is *monotone by design*: once a position is committed to a vocabulary token, it is never remasked in later steps.
###### Proposition 4.4 (Per-step KL reduces to committed positions).
Assume the commit-set distribution $\pi(\cdot\mid S_{t})$ is model-independent (random commit) and Assumption [4.3](https://arxiv.org/html/2606.06547#S4.Thmtheorem3) holds. Then for any fixed $S_{t}$,
$$\mathrm{KL}(K_{t}^{\star}(\cdot\mid S_{t})\|K_{t}^{q}(\cdot\mid S_{t}))=\mathbb{E}_{C_{t}\sim\pi(\cdot\mid S_{t})}\left[\sum_{i\in C_{t}}\mathrm{KL}\!\left(p_{\star,i}(\cdot\mid S_{t})\,\|\,p_{q,i}(\cdot\mid S_{t})\right)\right].\tag{17}$$
Combining Lemmas [4.1](https://arxiv.org/html/2606.06547#S4.Thmtheorem1)–[4.2](https://arxiv.org/html/2606.06547#S4.Thmtheorem2) and Proposition [4.4](https://arxiv.org/html/2606.06547#S4.Thmtheorem4) yields an upper bound:
$$\mathrm{KL}(\mu^{\star}\|\mu^{q})\leq\sum_{t=1}^{T}\mathbb{E}_{S_{t}\sim d_{t}^{\star}}\mathbb{E}_{C_{t}}\left[\sum_{i\in C_{t}}\mathrm{KL}(p_{\star,i}(\cdot\mid S_{t})\|p_{q,i}(\cdot\mid S_{t}))\right].\tag{18}$$
Proposition [4.4](https://arxiv.org/html/2606.06547#S4.Thmtheorem4) shows that, under model-independent commits, the per-step divergence decomposes into a sum of token-level KL terms *only on committed positions*. This yields a “sum over time, then sum over positions” structure, which directly motivates additive time×position accumulation.
With Proposition [4.4](https://arxiv.org/html/2606.06547#S4.Thmtheorem4), Eq. ([18](https://arxiv.org/html/2606.06547#S4.E18)) makes explicit which $(t,i)$ updates can contribute to the trajectory divergence. We therefore construct a static position prior by (i) Monte Carlo frontier-hit accumulation with time reweighting $\lambda_{0}(t)$, and (ii) a masked-stage reliability gate to stabilize the estimate for off-policy reuse in Stage II; see Appendix [B.9](https://arxiv.org/html/2606.06547#A2.SS9).
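Proposition 4.4 can be verified by enumeration on a toy state with three masked positions and a two-token vocabulary. The uniform policy over non-empty commit sets and the random per-position distributions are assumptions made for the check. Next states are keyed by tuples in which `None` marks a position that stays masked, so the mask pattern identifies the commit set, mirroring Assumption 4.3.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n_masked, vocab = 3, 2


def random_dists():
    """One categorical distribution per masked position (softmax of noise)."""
    e = np.exp(rng.normal(size=(n_masked, vocab)))
    return e / e.sum(axis=1, keepdims=True)


def kl(p, q):
    return float(np.sum(p * np.log(p / q)))


p_fp, p_q = random_dists(), random_dists()

# Assumed model-independent policy: uniform over all non-empty commit sets.
subsets = [c for r in range(1, n_masked + 1)
           for c in itertools.combinations(range(n_masked), r)]
pi = {c: 1.0 / len(subsets) for c in subsets}


def kernel(p):
    """Next-state distribution of Eq. (16) for one fixed state S_t."""
    out = {}
    for c, pc in pi.items():
        for toks in itertools.product(range(vocab), repeat=len(c)):
            nxt = [None] * n_masked           # None = position stays masked
            prob = pc
            for i, v in zip(c, toks):
                nxt[i] = v
                prob *= p[i, v]
            out[tuple(nxt)] = prob            # keys never collide (disjoint support)
    return out


k_fp, k_q = kernel(p_fp), kernel(p_q)
lhs = sum(pf * np.log(pf / k_q[s]) for s, pf in k_fp.items())          # kernel KL
rhs = sum(pc * sum(kl(p_fp[i], p_q[i]) for i in c) for c, pc in pi.items())
```

Because the policy term $\pi(C_t\mid S_t)$ is identical for both models, it cancels inside the logarithm, leaving only the token-level KL terms of the committed positions.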
Table 1: W4A4 results on the LLaDA family. Accuracy (%) on PIQA, BoolQ, WinoGrande, ARC-E, ARC-C, HellaSwag, TruthfulQA-MC2, MMLU, HumanEval, and GSM8K. Higher is better.

| Model | Method | PIQA | BoolQ | Wino. | ARC-E | ARC-C | Hella. | Truth | MMLU | Human. | GSM8K | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaDA-Base | FP | 74.84 | 63.73 | 73.64 | 75.08 | 47.44 | 72.90 | 45.30 | 65.80 | 33.54 | 68.92 | 62.12 |
| | RTN | 69.26 | 64.01 | 61.80 | 64.31 | 34.47 | 60.41 | 38.80 | 51.05 | 12.20 | 35.03 | 49.13 |
| | QuaRot | 73.61 | 65.69 | 72.22 | 72.35 | 46.16 | 69.96 | 43.35 | 62.04 | 25.00 | 57.39 | 58.78 |
| | FlatQuant | 74.16 | 62.51 | 72.16 | 73.23 | 46.84 | 71.15 | 42.90 | 63.80 | 29.70 | 57.24 | 59.37 |
| | FAIR-Calib | 74.92 | 62.69 | 72.93 | 74.45 | 48.38 | 71.27 | 43.91 | 64.11 | 33.54 | 64.75 | 61.09 |
| LLaDA-Instruct | FP | 82.86 | 88.38 | 77.35 | 94.00 | 88.47 | 76.89 | 48.47 | 64.36 | 46.95 | 70.36 | 73.81 |
| | RTN | 77.26 | 84.68 | 70.09 | 88.71 | 79.66 | 65.93 | 45.17 | 57.27 | 37.80 | 61.71 | 66.83 |
| | QuaRot | 81.23 | 87.98 | 75.06 | 93.30 | 85.42 | 73.27 | 47.77 | 61.99 | 40.24 | 67.10 | 71.34 |
| | FlatQuant | 81.83 | 87.98 | 75.53 | 92.55 | 87.31 | 74.15 | 46.39 | 62.58 | 39.02 | 66.45 | 71.38 |
| | FAIR-Calib | 82.10 | 88.29 | 76.01 | 92.77 | 87.46 | 74.50 | 47.44 | 62.58 | 43.29 | 69.60 | 72.40 |
| LLaDA-1.5 | FP | 82.97 | 88.47 | 77.27 | 93.12 | 87.12 | 76.91 | 48.76 | 64.51 | 46.95 | 69.22 | 73.53 |
| | RTN | 76.28 | 85.99 | 70.09 | 88.71 | 82.37 | 66.53 | 46.39 | 57.78 | 38.41 | 61.49 | 67.40 |
| | QuaRot | 80.74 | 87.07 | 75.53 | 91.89 | 85.82 | 73.96 | 47.46 | 61.97 | 42.07 | 63.99 | 71.05 |
| | FlatQuant | 81.28 | 86.92 | 76.01 | 92.95 | 85.12 | 73.97 | 46.48 | 62.96 | 46.34 | 67.40 | 71.94 |
| | FAIR-Calib | 82.70 | 87.89 | 76.01 | 94.00 | 86.10 | 74.29 | 48.01 | 63.12 | 46.95 | 68.46 | 72.75 |

Table 2: W4A4 results on the Dream family. Accuracy (%) on the same benchmark suite. Higher is better.

| Model | Method | PIQA | BoolQ | Wino. | ARC-E | ARC-C | Hella. | Truth | MMLU | Human. | GSM8K | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Dream-Base | FP | 75.41 | 84.46 | 73.32 | 83.84 | 59.13 | 73.65 | 44.17 | 71.36 | 58.54 | 76.19 | 70.01 |
| | RTN | 61.70 | 59.63 | 55.64 | 51.60 | 34.30 | 55.20 | 41.42 | 36.94 | 6.71 | 16.83 | 42.00 |
| | QuaRot | 72.52 | 74.77 | 63.46 | 75.72 | 48.72 | 66.92 | 40.40 | 59.74 | 23.17 | 49.51 | 57.49 |
| | FlatQuant | 71.65 | 79.14 | 65.98 | 77.19 | 50.51 | 69.54 | 43.23 | 64.96 | 38.41 | 60.20 | 62.08 |
| | FAIR-Calib | 73.07 | 80.31 | 69.53 | 80.51 | 53.92 | 70.60 | 43.40 | 66.91 | 41.46 | 66.64 | 64.64 |
| Dream-Instruct | FP | 75.79 | 85.66 | 72.69 | 84.68 | 61.43 | 73.89 | 47.12 | 69.79 | 57.93 | 81.10 | 71.01 |
| | RTN | 63.33 | 59.94 | 56.04 | 59.51 | 40.53 | 56.54 | 41.12 | 41.28 | 13.05 | 26.29 | 45.76 |
| | QuaRot | 71.11 | 76.91 | 64.96 | 78.20 | 52.30 | 67.28 | 39.92 | 64.17 | 32.32 | 64.25 | 61.14 |
| | FlatQuant | 71.82 | 79.66 | 66.38 | 80.89 | 55.29 | 69.82 | 44.27 | 64.18 | 41.46 | 66.03 | 63.98 |
| | FAIR-Calib | 73.12 | 81.77 | 70.80 | 83.46 | 58.02 | 70.98 | 46.99 | 64.04 | 44.51 | 72.86 | 66.66 |

### 4.4 From Token KL to Squared Logit Error via Smoothness

Define the log-sum-exp function $f(z)=\log\sum_{v}\exp(z_{v})$. Its gradient is $\nabla f(z)=\mathrm{softmax}(z)$ and its Hessian is

$$
\nabla^{2}f(z)=\mathrm{Diag}(p)-pp^{\top},\quad p=\mathrm{softmax}(z), \tag{19}
$$

whose operator norm satisfies $\|\nabla^{2}f(z)\|_{2}\leq\frac{1}{2}$ for all $z$. Hence $f$ is $(1/2)$-smooth. This property holds because the maximum eigenvalue of the covariance matrix of a categorical distribution is at most $1/2$.
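The smoothness constant can be sanity-checked numerically. The sketch below is an illustration rather than part of the paper's method: it uses the identity $v^{\top}(\mathrm{Diag}(p)-pp^{\top})v=\mathrm{Var}_{p}(v)$ and evaluates this quadratic form for random logits and random unit vectors. The helper names and the sampling ranges are assumptions chosen for the example.

```python
import math
import random

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def hessian_quadratic_form(z, v):
    """Return v^T (Diag(p) - p p^T) v, i.e. the variance of v under p = softmax(z)."""
    p = softmax(z)
    mean = sum(pi * vi for pi, vi in zip(p, v))
    second_moment = sum(pi * vi * vi for pi, vi in zip(p, v))
    return second_moment - mean * mean

rng = random.Random(0)
worst = 0.0
for _ in range(2000):
    dim = rng.randint(2, 8)
    z = [rng.uniform(-5, 5) for _ in range(dim)]
    v = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]  # unit vector, so the form is bounded by the top eigenvalue
    worst = max(worst, hessian_quadratic_form(z, v))

# Extremal case: two equally likely tokens and v = (1, -1) / sqrt(2) attain the value 1/2.
tight = hessian_quadratic_form([0.0, 0.0], [1 / math.sqrt(2), -1 / math.sqrt(2)])
```

In this check `worst` never exceeds $1/2$, and the two-token uniform case shows the constant cannot be improved.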
###### Lemma 4.6 (Softmax KL is a Bregman divergence bounded by squared logit error).

Let $p=\mathrm{softmax}(z)$ and $q=\mathrm{softmax}(z')$. Then

$$
\mathrm{KL}(p\,\|\,q)=D_{f}(z'\,\|\,z)\leq\frac{1}{4}\|z'-z\|_{2}^{2}, \tag{20}
$$

where $D_{f}(u\,\|\,v)=f(u)-f(v)-\langle\nabla f(v),u-v\rangle$ is the Bregman divergence of $f$.
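As an illustrative check of Lemma 4.6 (not the authors' code), the following sketch compares the softmax KL with the log-sum-exp Bregman divergence on random logit pairs and verifies the squared-logit-error bound. The perturbation scale and dimensions are arbitrary assumptions.

```python
import math
import random

def log_sum_exp(z):
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

def softmax(z):
    lse = log_sum_exp(z)
    return [math.exp(v - lse) for v in z]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def bregman_lse(u, v):
    """D_f(u || v) = f(u) - f(v) - <grad f(v), u - v> with f = log-sum-exp."""
    grad_v = softmax(v)
    inner = sum(g * (ui - vi) for g, ui, vi in zip(grad_v, u, v))
    return log_sum_exp(u) - log_sum_exp(v) - inner

rng = random.Random(0)
max_identity_gap = 0.0
bound_holds = True
for _ in range(1000):
    dim = rng.randint(2, 10)
    z = [rng.uniform(-4, 4) for _ in range(dim)]
    z_prime = [zi + rng.gauss(0, 1) for zi in z]
    kl_value = kl(softmax(z), softmax(z_prime))          # KL(p || q)
    bregman = bregman_lse(z_prime, z)                    # D_f(z' || z)
    sq_err = sum((a - b) ** 2 for a, b in zip(z_prime, z))
    max_identity_gap = max(max_identity_gap, abs(kl_value - bregman))
    bound_holds = bound_holds and kl_value <= 0.25 * sq_err + 1e-12
```

Note the argument order: the KL from $p$ to $q$ corresponds to the Bregman divergence from $z'$ to $z$, matching Eq. (20).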
### 4.5 Bridging to Weighted Hidden-State MSE

Define the *suffix network* $g_{\ell}$ as the mapping from layer-$\ell$ hidden states to logits at the same position (i.e., the remaining blocks, final normalization, and output head), so that $z_{M,i}(S_{t})=g_{\ell}(h_{\ell,i}^{M}(S_{t}))$ for $M\in\{\star,q\}$. Assume $g_{\ell}$ is $L_{\ell}$-Lipschitz on the calibration domain: $\|g_{\ell}(u)-g_{\ell}(v)\|_{2}\leq L_{\ell}\|u-v\|_{2}$. Then $\|z_{q,i}(S_{t})-z_{\star,i}(S_{t})\|_{2}^{2}\leq L_{\ell}^{2}\|h_{\ell,i}^{q}(S_{t})-h_{\ell,i}^{\star}(S_{t})\|_{2}^{2}$. Combining Lemma [4.6](https://arxiv.org/html/2606.06547#S4.Thmtheorem6) with Proposition [4.4](https://arxiv.org/html/2606.06547#S4.Thmtheorem4) yields

$$
\mathrm{KL}(\mu^{\star}\,\|\,\mu^{q}) \leq \frac{L_{\ell}^{2}}{4}\sum_{t=1}^{T}\mathbb{E}_{S_{t}\sim d_{t}^{\star}}\mathbb{E}_{C_{t}}\left[\sum_{i\in C_{t}}\|h_{\ell,i}^{q}(S_{t})-h_{\ell,i}^{\star}(S_{t})\|_{2}^{2}\right]. \tag{21}
$$
Details of the suffix\-network Lipschitz bridge are provided in Appendix[B\.3](https://arxiv.org/html/2606.06547#A2.SS3)\. This provides a principled justification for minimizing a*weighted hidden\-state MSE*as a KL\-consistent surrogate, and explains why directly applying softmax\-KL to hidden features is unnecessary\.
## 5 Experiments

### 5.1 Implementation Details

We evaluate W4A4 quantization on LLaDA and Dream, comparing against RTN, QuaRot, and FlatQuant under matched PTQ settings. FAIR-Calib probes a fixed prior from the FP teacher and performs weighted layer-wise calibration; details are in Appendix [C](https://arxiv.org/html/2606.06547#A3).
### 5.2 Main Results

We report W4A4 accuracy on a broad benchmark suite for LLaDA and Dream, comparing FP, RTN, QuaRot, FlatQuant, and FAIR-Calib under the implementation details described above. Tables [1](https://arxiv.org/html/2606.06547#S4.T1) and [2](https://arxiv.org/html/2606.06547#S4.T2) summarize the results for the LLaDA and Dream families, respectively. Overall, FAIR-Calib consistently improves over baselines while using a shorter calibration sequence length (1024) by default.
### 5.3 Ablation Study

Ablation on components. Table [3](https://arxiv.org/html/2606.06547#S5.T3) shows that using either frontier-hit only or masked-stage reliability only improves over the uniform PTQ baseline, while combining them (FAIR-Calib) achieves the best average accuracy on Dream-Base across the 10-benchmark suite. This suggests the two signals are complementary: frontier hits prioritize positions with high downstream impact due to irreversible commits, while masked-stage reliability provides an offline reliability prior that, in expectation, downweights positions where the teacher is frequently ambiguous during masking, thereby reducing the finite-sample noise when estimating and reusing a static prior across corpora in Stage II.

Table 3: Ablation on components. Average W4A4 accuracy (%) on Dream-Base over the same 10-benchmark suite. "frontier-hit only" keeps only the write-frontier indicator term with $\lambda_{0}(t)$; "masked-stage reliability only" keeps only the masked-stage reliability term with $\lambda_{1}$; FAIR-Calib combines both.

| Method | Avg. |
|---|---|
| baseline | 61.76 |
| frontier-hit only | 63.12 |
| masked-stage reliability only | 62.89 |
| FAIR-Calib | 64.64 |

Sensitivity to probing budget. Table [4(a)](https://arxiv.org/html/2606.06547#S5.T4.st1) varies the probing sample size $N_{\mathrm{probe}}$ used in Stage I. Accuracy improves with $N_{\mathrm{probe}}$ and saturates around 512–1024 samples, suggesting that $\bar{w}$ can be estimated reliably with a moderate probing budget. Unless stated otherwise, we use $N_{\mathrm{probe}}=512$.
Effect of time weighting. Table [4(b)](https://arxiv.org/html/2606.06547#S5.T4.st2) compares different $\lambda_{0}(t)$ schedules for the frontier-hit term. Early-boost performs best, consistent with the intuition that earlier commits influence more subsequent refinement steps and thus have higher downstream impact. Late-boost underperforms, suggesting that emphasizing late commits is less effective at mitigating irreversible error amplification.

Table 4: Probing and time-weighting ablations. Average W4A4 accuracy (%) on Dream-Base over the same 10-benchmark suite. (a) varies the probing budget $N_{\mathrm{probe}}$ for estimating the prior $\bar{w}$; (b) varies the frontier-hit time-weight schedule $\lambda_{0}(t)$.

(a) Probing budget $N_{\mathrm{probe}}$.

| $N_{\mathrm{probe}}$ | Avg. |
|---|---|
| 128 | 63.15 |
| 256 | 63.56 |
| 512 | 64.64 |
| 1024 | 64.63 |

(b) Time weighting $\lambda_{0}(t)$.

| $\lambda_{0}$ schedule | Avg. |
|---|---|
| w/o $\lambda_{0}$ | 62.89 |
| uniform | 63.21 |
| late-boost | 62.58 |
| early-boost | 64.64 |
### 5.4 Mechanistic Diagnostics

False-commit error amplification. We first measure the end-to-end consequence under the *inference-time* commit policy. We track the per-step discrepancy w.r.t. the FP teacher (probability MSE) and mark the first *false commit*, where the written token disagrees with the teacher. Figure [3](https://arxiv.org/html/2606.06547#S5.F3) shows that, for the uniformly calibrated baseline, a single false commit is typically followed by an abrupt rise in discrepancy that persists over subsequent refinement steps, consistent with irreversibility at the write frontier. In contrast, FAIR-Calib substantially suppresses this error growth, yielding a flatter discrepancy trajectory after the first wrong write. This motivates a controlled test that isolates whether quantization increases *write-decision flips* at the frontier.

Figure 3: Probability MSE w.r.t. the FP teacher ($\mathrm{mse\_prob}$) as a function of diffusion step. A wrong commit amplifies downstream error, while FAIR-Calib suppresses this error.

Teacher-forced commit-step flips. To isolate write-decision errors from state-distribution shift, we evaluate quantized models on teacher-forced intermediate states along diffusion steps. We compare the token a quantized model would write at the teacher's commit positions against the teacher's write decision, and count mismatches per sequence (a flip at $(t,i)$). FAIR-Calib reduces the average flip count from $2.9\pm 1.4$ to $1.9\pm 0.9$ over $N=32$ samples (Figure [4](https://arxiv.org/html/2606.06547#S5.F4)), indicating improved alignment at the write frontier.
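The flip count described above can be sketched as follows. This is a minimal illustration under stated assumptions: the write decision is taken to be the greedy argmax over logits, both models are evaluated on the same teacher-forced state at each step, and the function name, data layout, and toy logits are hypothetical.

```python
def argmax(values):
    return max(range(len(values)), key=lambda k: values[k])

def count_commit_flips(teacher_logits, quant_logits, commit_sets):
    """Count (t, i) pairs where the quantized write decision differs from the teacher's.

    teacher_logits[t][i] and quant_logits[t][i] are per-position logit lists evaluated
    on the SAME teacher-forced state; commit_sets[t] lists the teacher's commit positions.
    Assumption: the written token is the greedy argmax of the logits.
    """
    flips = 0
    for t, positions in enumerate(commit_sets):
        for i in positions:
            if argmax(teacher_logits[t][i]) != argmax(quant_logits[t][i]):
                flips += 1
    return flips

# Toy example: 2 steps, 3 positions, vocabulary of size 3 (made-up numbers).
teacher = [
    [[2.0, 0.1, 0.0], [0.0, 1.0, 0.9], [0.3, 0.2, 0.1]],
    [[2.0, 0.1, 0.0], [0.0, 1.0, 0.9], [0.1, 0.2, 3.0]],
]
quant = [
    [[1.9, 0.2, 0.0], [0.0, 0.9, 1.0], [0.3, 0.2, 0.1]],  # borderline position 1 flips
    [[2.0, 0.1, 0.0], [0.0, 1.0, 0.9], [0.1, 0.3, 2.8]],
]
commit_sets = [[1], [0, 2]]
flips = count_commit_flips(teacher, quant, commit_sets)
```

In the toy data only the borderline decision at position 1 in the first step changes, which mirrors the claim that small quantization error flips near-tied commits.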
Figure 4: Teacher-forced commit-step flips. On teacher-forced states, we count write-decision mismatches w.r.t. the FP teacher at commit steps (mean ± std over $N=32$). FAIR-Calib reduces flips, showing better alignment at the irreversible write frontier.

Cross-corpus rank consistency of position weights. We find that the probed prior $\bar{w}$ is largely *mechanism-driven* rather than corpus-specific: mean normalized $\bar{w}$ curves computed on GSM8K-CoT and WikiText2 exhibit substantial rank consistency (Spearman). Notably, the masked-stage reliability gate is important for transferability, while the frontier-hit-only variant shows near-zero cross-corpus agreement. Detailed diagnostics are provided in Appendix [D.2](https://arxiv.org/html/2606.06547#A4.SS2).
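Rank consistency between two probed priors can be measured with Spearman's correlation, i.e. the Pearson correlation of the ranks. The sketch below is a self-contained illustration; the two prior curves are made-up numbers, not values from the paper.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman(a, b):
    ra, rb = average_ranks(a), average_ranks(b)
    mean_a, mean_b = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical normalized priors over five positions, probed on two different corpora.
prior_corpus_a = [0.10, 0.35, 0.80, 1.00, 0.60]
prior_corpus_b = [0.05, 0.40, 0.70, 0.95, 0.75]
rho = spearman(prior_corpus_a, prior_corpus_b)
```

A value of `rho` close to 1 indicates that the two corpora rank positions similarly even if the raw weights differ.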
## 6 Conclusion
We propose FAIR\-Calib to tackle the instability in dLLM quantization caused by irreversible commit decisions\. By integrating a frontier\-aware reliability prior with off\-policy hidden\-state alignment, our framework significantly reduces commit\-step flips and downstream error propagation\. Empirically, FAIR\-Calib outperforms existing PTQ methods \(e\.g\., QuaRot, FlatQuant\) on Dream and LLaDA at W4A4 precision, providing an efficient and robust solution for compressing diffusion\-based language models\.
## Acknowledgements
The work was supported by the National Key Research and Development Program of China (No. 2023YFC3306401) and the National Natural Science Foundation of China (No. 62576018). This research was also supported by the Zhejiang Provincial Natural Science Foundation of China under Grant No. LD24F020007 and the Beijing Natural Science Foundation (L244043). The experimental part was supported by the Ministry of Economic Development of the Russian Federation (agreement identifier 000000C313925P3X0002, grant No. 139-15-2025-004 dated 17.04.2025).
## Impact Statement
This work studies post\-training quantization for diffusion large language models to reduce memory footprint and inference compute, which can lower deployment cost and energy consumption and improve accessibility on resource\-constrained hardware\. The proposed method does not introduce new model capabilities; it focuses on calibration procedures and diagnostic metrics for quantized inference\. As with many compression techniques, quantization may alter model outputs in subtle ways, so we recommend task\-relevant evaluation and monitoring before deployment in user\-facing settings\. Our experiments use standard public benchmarks and do not involve collecting or inferring sensitive personal information\. Overall, we expect the primary impact of this work to be improved efficiency and reproducibility for deploying diffusion\-style LLMs\.
## References
- S. Ashkboos, A. Mohtashami, M. L. Croci, B. Li, P. Cameron, M. Jaggi, D. Alistarh, T. Hoefler, and J. Hensman (2024) Quarot: outlier-free 4-bit inference in rotated llms. Adv. Neural Inform. Process. Syst. 37, pp. 100213–100240.
- J. Austin, D. D. Johnson, J. Ho, D. Tarlow, and R. Van Den Berg (2021) Structured denoising diffusion models in discrete state-spaces. Adv. Neural Inform. Process. Syst. 34, pp. 17981–17993.
- Y. Bisk, R. Zellers, J. Gao, Y. Choi, et al. (2020) Piqa: reasoning about physical commonsense in natural language. In AAAI Conf. Artif. Intell., Vol. 34, pp. 7432–7439.
- H. Chang, H. Zhang, L. Jiang, C. Liu, and W. T. Freeman (2022) Maskgit: masked generative image transformer. In IEEE/CVF Conf. Comput. Vis. Pattern Recog., pp. 11315–11325.
- M. Chen (2021) Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
- C. Clark, K. Lee, M. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova (2019) Boolq: exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044.
- P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018) Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457.
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, et al. (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- E. Frantar and D. Alistarh (2023) Sparsegpt: massive language models can be accurately pruned in one-shot. In Int. Conf. Mach. Learn., pp. 10323–10337.
- E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh (2022) Gptq: accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323.
- M. Ghazvininejad, O. Levy, Y. Liu, and L. Zettlemoyer (2019) Mask-predict: parallel decoding of conditional masked language models. arXiv preprint arXiv:1904.09324.
- S. Gong, M. Li, J. Feng, Z. Wu, and L. Kong (2022) Diffuseq: sequence to sequence text generation with diffusion models. arXiv preprint arXiv:2210.08933.
- D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2020) Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
- D. Kingma and R. Gao (2023) Understanding diffusion objectives as the elbo with simple data augmentation. Adv. Neural Inform. Process. Syst. 36, pp. 65484–65516.
- X. Li, J. Thickstun, I. Gulrajani, P. S. Liang, and T. B. Hashimoto (2022) Diffusion-lm improves controllable text generation. Adv. Neural Inform. Process. Syst. 35, pp. 4328–4343.
- H. Lin, H. Xu, Y. Wu, Z. Guo, R. Zhang, Z. Lu, Y. Wei, Q. Zhang, and Z. Sun (2025) Quantization meets dllms: a systematic study of post-training quantization for diffusion llms. arXiv preprint arXiv:2508.14896.
- J. Lin, J. Tang, H. Tang, S. Yang, W. Chen, W. Wang, G. Xiao, X. Dang, C. Gan, and S. Han (2024) Awq: activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems 6, pp. 87–100.
- S. Lin, J. Hilton, and O. Evans (2022) Truthfulqa: measuring how models mimic human falsehoods. In Proceedings of the 60th annual meeting of the association for computational linguistics (volume 1: long papers), pp. 3214–3252.
- S. Merity, C. Xiong, J. Bradbury, and R. Socher (2016) Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.
- S. Nie, F. Zhu, Z. You, X. Zhang, J. Ou, J. Hu, J. Zhou, Y. Lin, J. Wen, and C. Li (2025) Large language diffusion models. arXiv preprint arXiv:2502.09992.
- K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi (2021) Winogrande: an adversarial winograd schema challenge at scale. Communications of the ACM 64(9), pp. 99–106.
- W. Shao, M. Chen, Z. Zhang, P. Xu, L. Zhao, Z. Li, K. Zhang, P. Gao, Y. Qiao, and P. Luo (2023) Omniquant: omnidirectionally calibrated quantization for large language models. arXiv preprint arXiv:2308.13137.
- J. Shi and M. K. Titsias (2025) Demystifying diffusion objectives: reweighted losses are better variational bounds. arXiv preprint arXiv:2511.19664.
- M. Stern, W. Chan, J. Kiros, and J. Uszkoreit (2019) Insertion transformer: flexible sequence generation via insertion operations. In Int. Conf. Mach. Learn., pp. 5976–5985.
- Y. Sun, R. Liu, H. Bai, H. Bao, K. Zhao, Y. Li, J. Hu, X. Yu, L. Hou, C. Yuan, et al. (2024) Flatquant: flatness matters for llm quantization. arXiv preprint arXiv:2410.09426.
- H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. (2023) Llama: open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- A. Tseng, J. Chee, Q. Sun, V. Kuleshov, and C. De Sa (2024) Quip#: even better llm quantization with hadamard incoherence and lattice codebooks. arXiv preprint arXiv:2402.04396.
- G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han (2023) Smoothquant: accurate and efficient post-training quantization for large language models. In Int. Conf. Mach. Learn., pp. 38087–38099.
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. (2025) Qwen3 technical report. arXiv preprint arXiv:2505.09388.
- Z. Yao, R. Yazdani Aminabadi, M. Zhang, X. Wu, C. Li, and Y. He (2022) Zeroquant: efficient and affordable post-training quantization for large-scale transformers. Adv. Neural Inform. Process. Syst. 35, pp. 27168–27183.
- J. Ye, Z. Xie, L. Zheng, J. Gao, Z. Wu, X. Jiang, Z. Li, and L. Kong (2025) Dream 7b: diffusion large language models. arXiv preprint arXiv:2508.15487.
- R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi (2019) Hellaswag: can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.
- F. Zhu, R. Wang, S. Nie, X. Zhang, C. Wu, J. Hu, J. Zhou, J. Chen, Y. Lin, J. Wen, et al. (2025) LLaDA 1.5: variance-reduced preference optimization for large language diffusion models. arXiv preprint arXiv:2505.19223.
- X. Zhu, J. Li, Y. Liu, C. Ma, and W. Wang (2024) A survey on model compression for large language models. Transactions of the Association for Computational Linguistics 12, pp. 1556–1577.
## Appendix A Algorithmic Details of FAIR-Calib

For completeness, we provide the full pseudo-code of FAIR-Calib in Algorithm [A](https://arxiv.org/html/2606.06547#alg1), including teacher-probed stability weight construction and the subsequent static weighted calibration.

Algorithm A: FAIR-Calib: teacher-probed stability weights + static weighted calibration

Input: FP teacher $M^{\star}$, quant model $M^{q}$, probing set $\mathcal{D}_{\mathrm{probe}}$, calibration set $\mathcal{D}_{\mathrm{cal}}$, steps $T$, gen window $\mathcal{G}$, schedules $\lambda_{0}(t),\lambda_{1}$

Output: Fixed prior $\bar{w}$ and calibrated quant parameters of $M^{q}$

Stage I: compute a global prior $\bar{w}$ by teacher probing

- for sample $(x,y)\sim\mathcal{D}_{\mathrm{probe}}$ do
  - Run FP teacher rollout with *random commit* to obtain $\{S_{t}\}_{t=0}^{T}$, $\{\widehat{C}_{t}\}$, and $\{\widetilde{c}_{t,i}\}$
  - Accumulate $w^{(x,y)}_{i}\leftarrow\sum_{t=1}^{T}\left[\lambda_{0}(t)\,\mathbf{1}\{i\in\widehat{C}_{t}\}+\lambda_{1}\,\widetilde{c}_{t,i}\,\mathbf{1}\{i\in\mathcal{M}(S_{t})\}\right]$
  - Normalize $w^{(x,y)}$ (e.g., divide by $\max_{i}w^{(x,y)}_{i}$) and add to global sum
- end for
- Window-align and normalize the aggregated weights to obtain $\bar{w}$

Stage II: layer-wise static calibration with fixed $\bar{w}$

- for layer/block $\ell$ in a sequential layer-wise order do
  - for sample $(x,y)\sim\mathcal{D}_{\mathrm{cal}}$ do
    - Teacher-forcing forward (full real tokens) through $M^{\star}$ and current $M^{q}$
    - $\mathcal{L}_{\ell}\leftarrow\sum_{i=1}^{N}\bar{w}_{i}\,\|h_{\ell,i}^{q}(x,y)-h_{\ell,i}^{\star}(x,y)\|_{2}^{2}$
    - Update only quantization/calibration parameters of layer/block $\ell$ by minimizing $\mathcal{L}_{\ell}$
  - end for
- end for
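The Stage II objective in Algorithm A is a position-weighted squared error between quantized and teacher hidden states. The sketch below computes it for one layer and one sample on toy data; the function name and the numbers are illustrative assumptions, and the parameter update step (which depends on the underlying quantizer) is intentionally left out.

```python
def weighted_hidden_mse(h_quant, h_teacher, prior):
    """L_ell = sum_i w_bar_i * ||h_q[i] - h_star[i]||_2^2 for one layer and one sample."""
    loss = 0.0
    for w_i, hq_i, hs_i in zip(prior, h_quant, h_teacher):
        loss += w_i * sum((a - b) ** 2 for a, b in zip(hq_i, hs_i))
    return loss

# Toy hidden states: 3 positions, hidden size 2 (made-up numbers).
h_teacher = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
h_quant = [[1.0, 0.0], [0.5, 0.0], [0.2, 1.0]]
prior = [0.2, 1.0, 0.5]      # hypothetical probed prior w_bar
uniform = [1.0, 1.0, 1.0]    # uniform weighting, i.e. plain hidden-state squared error

loss_weighted = weighted_hidden_mse(h_quant, h_teacher, prior)
loss_uniform = weighted_hidden_mse(h_quant, h_teacher, uniform)
```

With the prior above, the error at position 1 keeps its full weight while the error at position 2 is halved, so the calibration signal concentrates on the positions the prior marks as fragile.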
## Appendix B Additional proofs and remarks

### B.1 Remark: model-dependent commit policies

For confidence-driven commit policies (e.g., top-$k$ on confidence), the commit-set distribution $\pi(C_{t}\mid S_{t},M)$ depends on model outputs, and the teacher and the quantized model may induce different $\pi^{\star}$ and $\pi^{q}$. In that case, the per-step kernel KL admits an additional policy-shift term:

$$
\mathrm{KL}(K_{t}^{\star}\,\|\,K_{t}^{q})=\mathrm{KL}\big(\pi^{\star}(\cdot\mid S_{t})\,\|\,\pi^{q}(\cdot\mid S_{t})\big)+\mathbb{E}_{C_{t}\sim\pi^{\star}}\left[\sum_{i\in C_{t}}\mathrm{KL}(p_{\star,i}\,\|\,p_{q,i})\right],
$$

under Assumption [4.3](https://arxiv.org/html/2606.06547#S4.Thmtheorem3). Controlling $\mathrm{KL}(\pi^{\star}\,\|\,\pi^{q})$ would require on-policy rollouts or reasoning through discrete commit selection, which is typically expensive and not aligned with layer-wise PTQ. We therefore use *model-independent random commits* in Stage I to fix $\pi$ and obtain the clean per-position decomposition in Proposition [4.4](https://arxiv.org/html/2606.06547#S4.Thmtheorem4), yielding a stable, policy-agnostic importance prior. Random commits also provide broad coverage of partially masked states without coupling the probe to a specific inference policy. Stage II then focuses on reducing the token-level divergence term via the weighted hidden-state MSE surrogate.
### B.2 Remark: why additive $\lambda_{0}+\lambda_{1}$ is natural
The Markov KL decomposition yields a sum over timesteps of per\-step divergences; per\-step divergences further sum over updated positions\. Thus any importance\-weighted surrogate has a canonical linear \(additive\) accumulation form\. Multiplicative coupling is not implied by the bound and may introduce unnecessary variance\.
### B.3 Remark: suffix-network Lipschitz bridge (details)

For any intermediate layer index $\ell$, we can write the model as a composition of a prefix up to layer $\ell$ and a *suffix network* $g_{\ell}$ that maps $h_{\ell,i}$ to the final logits at position $i$ (i.e., blocks $\ell+1{:}L$, final layer normalization, and the output projection). This avoids treating intermediate-layer features as "logits" and makes the bridge precise: $z_{M,i}(S_{t})=g_{\ell}(h_{\ell,i}^{M}(S_{t}))$. Assuming $g_{\ell}$ is $L_{\ell}$-Lipschitz on the teacher-forced calibration domain is standard in PTQ analyses and can be viewed as a local smoothness condition of the remaining network around the in-domain hidden states. Then the token-level KL at committed positions admits the bound $\mathrm{KL}(p_{\star,i}\,\|\,p_{q,i})\leq\frac{1}{4}\|z_{q,i}-z_{\star,i}\|_{2}^{2}\leq\frac{L_{\ell}^{2}}{4}\|h_{\ell,i}^{q}-h_{\ell,i}^{\star}\|_{2}^{2}$, and plugging this into Proposition [4.4](https://arxiv.org/html/2606.06547#S4.Thmtheorem4) yields Eq. ([21](https://arxiv.org/html/2606.06547#S4.E21)). In Stage II, we calibrate blocks layer-wise by directly minimizing a position-weighted hidden-state MSE, which reduces these per-position hidden discrepancies and thus controls an explicit upper bound on the accumulated token divergences on the write frontier.
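The chain of inequalities above can be illustrated with a toy suffix network. The sketch below assumes a purely linear suffix $g(h)=Wh$, which is a hypothetical stand-in for the remaining blocks, normalization, and output head; it uses the Frobenius norm of $W$ as a valid (loose) Lipschitz constant because it upper-bounds the spectral norm.

```python
import math
import random

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def suffix_logits(W, h):
    """Hypothetical linear suffix network g(h) = W h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

rng = random.Random(1)
hidden_dim, vocab = 4, 6
W = [[rng.gauss(0, 0.5) for _ in range(hidden_dim)] for _ in range(vocab)]
# Frobenius norm >= spectral norm, so this is a valid Lipschitz constant for g.
lipschitz = math.sqrt(sum(w * w for row in W for w in row))

bound_holds = True
for _ in range(500):
    h_star = [rng.gauss(0, 1) for _ in range(hidden_dim)]
    h_q = [x + rng.gauss(0, 0.1) for x in h_star]  # simulated quantization perturbation
    token_kl = kl(softmax(suffix_logits(W, h_star)), softmax(suffix_logits(W, h_q)))
    hidden_sq_err = sum((a - b) ** 2 for a, b in zip(h_q, h_star))
    bound_holds = bound_holds and token_kl <= 0.25 * lipschitz ** 2 * hidden_sq_err + 1e-12
```

For a real transformer suffix the Lipschitz constant is only assumed on the calibration domain, so this check illustrates the form of the bound rather than its tightness.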
### B.4 Remark: Off-policy Stage-II as a practical surrogate

The bound above is stated on teacher-visited masked states $S_{t}\sim d_{t}^{\star}$ under the probing policy (random commit), while Stage II calibration uses teacher-forced fully observed tokens for efficiency. Our goal here is to motivate the *additive time $\times$ position* structure and a *KL-consistent* feature-space surrogate: reducing hidden/logit discrepancies at high-$\bar{w}$ positions decreases a principled upper bound on token-level divergences on the write frontier, up to a domain-mismatch residual. We empirically validate that this surrogate indeed improves frontier behavior via reduced teacher-forced commit-step flips and suppressed post-commit error amplification (Section [5.4](https://arxiv.org/html/2606.06547#S5.SS4)).
### B.5 Proof of Lemma [4.1](https://arxiv.org/html/2606.06547#S4.Thmtheorem1)

###### Proof.

By definition, $\mu^{M}=g_{\#}\mathbb{P}^{M}$ is the pushforward measure of $\mathbb{P}^{M}$ through $g$. KL divergence contracts under measurable mappings (data processing inequality), hence $\mathrm{KL}(g_{\#}\mathbb{P}^{\star}\,\|\,g_{\#}\mathbb{P}^{q})\leq\mathrm{KL}(\mathbb{P}^{\star}\,\|\,\mathbb{P}^{q})$. ∎
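A small discrete example makes the data processing inequality concrete. In the sketch below, the distributions and the mapping $g$ are made-up values: $g$ merges four outcomes into two, and the KL between the pushforward distributions cannot exceed the KL between the originals.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def pushforward(dist, mapping, num_outputs):
    """Distribution of g(X) when X ~ dist and g is given as a lookup table."""
    out = [0.0] * num_outputs
    for x, mass in enumerate(dist):
        out[mapping[x]] += mass
    return out

p_paths = [0.4, 0.1, 0.3, 0.2]      # stand-in for the teacher path distribution
q_paths = [0.25, 0.25, 0.25, 0.25]  # stand-in for the quantized path distribution
g = [0, 0, 1, 1]                    # hypothetical map from paths to final outputs

kl_paths = kl(p_paths, q_paths)
kl_outputs = kl(pushforward(p_paths, g, 2), pushforward(q_paths, g, 2))
```

Here the two pushforwards coincide, so the output KL is zero while the path KL is strictly positive, which is the extreme case of the contraction.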
### B.6 Proof of Lemma [4.2](https://arxiv.org/html/2606.06547#S4.Thmtheorem2)

###### Proof.

Using the chain rule for KL on path distributions:

$$
\mathrm{KL}(\mathbb{P}^{\star}\,\|\,\mathbb{P}^{q})=\mathbb{E}_{\tau\sim\mathbb{P}^{\star}}\left[\log\frac{p(S_{T})\prod_{t}K_{t}^{\star}(S_{t-1}\mid S_{t})}{p(S_{T})\prod_{t}K_{t}^{q}(S_{t-1}\mid S_{t})}\right]=\sum_{t=1}^{T}\mathbb{E}_{\tau\sim\mathbb{P}^{\star}}\left[\log\frac{K_{t}^{\star}(S_{t-1}\mid S_{t})}{K_{t}^{q}(S_{t-1}\mid S_{t})}\right]. \tag{B.1}
$$

Conditioning on $S_{t}$ under $\mathbb{P}^{\star}$ yields

$$
\mathbb{E}_{\tau\sim\mathbb{P}^{\star}}\left[\log\frac{K_{t}^{\star}(S_{t-1}\mid S_{t})}{K_{t}^{q}(S_{t-1}\mid S_{t})}\right]=\mathbb{E}_{S_{t}\sim d_{t}^{\star}}\left[\mathbb{E}_{S_{t-1}\sim K_{t}^{\star}(\cdot\mid S_{t})}\log\frac{K_{t}^{\star}(S_{t-1}\mid S_{t})}{K_{t}^{q}(S_{t-1}\mid S_{t})}\right] \tag{B.2}
$$

$$
=\mathbb{E}_{S_{t}\sim d_{t}^{\star}}\left[\mathrm{KL}\big(K_{t}^{\star}(\cdot\mid S_{t})\,\|\,K_{t}^{q}(\cdot\mid S_{t})\big)\right]. \tag{B.3}
$$

Summing over $t$ completes the proof. ∎
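The chain-rule decomposition can be verified by brute force on a tiny reverse-time Markov chain. The sketch below uses two binary steps ($T=2$) with a shared initial distribution; all transition probabilities are made-up values for illustration.

```python
import math
from itertools import product

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

init = [0.6, 0.4]  # shared p(S_T) with T = 2
# kernels[t][s] is the distribution of S_{t-1} given S_t = s
teacher = {2: [[0.9, 0.1], [0.2, 0.8]], 1: [[0.7, 0.3], [0.4, 0.6]]}
quant = {2: [[0.8, 0.2], [0.3, 0.7]], 1: [[0.6, 0.4], [0.5, 0.5]]}

def path_prob(kernels, s2, s1, s0):
    return init[s2] * kernels[2][s2][s1] * kernels[1][s1][s0]

# Left-hand side: KL between the full path distributions, by enumeration.
path_kl = 0.0
for s2, s1, s0 in product(range(2), repeat=3):
    p_star = path_prob(teacher, s2, s1, s0)
    p_q = path_prob(quant, s2, s1, s0)
    path_kl += p_star * math.log(p_star / p_q)

# Right-hand side: per-step kernel KLs averaged under the teacher's state visitation.
d2 = init
d1 = [sum(d2[s2] * teacher[2][s2][s1] for s2 in range(2)) for s1 in range(2)]
stepwise_kl = sum(d2[s] * kl(teacher[2][s], quant[2][s]) for s in range(2))
stepwise_kl += sum(d1[s] * kl(teacher[1][s], quant[1][s]) for s in range(2))
```

Both quantities agree up to floating-point error, and the visitation distribution `d1` is computed from the teacher kernels only, matching $S_{t}\sim d_{t}^{\star}$ in Eq. (B.3).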
### B.7 Proof of Proposition [4.4](https://arxiv.org/html/2606.06547#S4.Thmtheorem4)

###### Proof.

Condition on a commit set $C_{t}$. Under random commit, both models share the same mixing weight $\pi(C_{t}\mid S_{t})$, and the conditional kernel factorizes over committed positions:

$$
K_{t,C_{t}}^{M}(S_{t-1}\mid S_{t})=\Big[\prod_{i\in C_{t}}p_{M,i}(S_{t-1}[i]\mid S_{t})\Big]\cdot\mathbf{1}\{S_{t-1}[\neg C_{t}]=S_{t}[\neg C_{t}]\}. \tag{B.4}
$$

All uncommitted positions are deterministic copies and thus contribute zero KL, yielding

$$
\mathrm{KL}\left(K_{t,C_{t}}^{\star}(\cdot\mid S_{t})\,\|\,K_{t,C_{t}}^{q}(\cdot\mid S_{t})\right)=\sum_{i\in C_{t}}\mathrm{KL}\left(p_{\star,i}(\cdot\mid S_{t})\,\|\,p_{q,i}(\cdot\mid S_{t})\right). \tag{B.5}
$$

Finally, by Assumption [4.3](https://arxiv.org/html/2606.06547#S4.Thmtheorem3), the supports of $\{K_{t,C_{t}}^{M}(\cdot\mid S_{t})\}_{C_{t}}$ are disjoint across different $C_{t}$. Therefore, the KL between the mixtures equals the mixture of KLs (no log-sum slack), giving the desired equality after taking expectation over $C_{t}$. ∎
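The two steps of this proof (factorization over committed positions, then equality for mixtures with disjoint supports) can be checked by enumeration. In the sketch below, an outcome is keyed by the pair (commit set, written tokens); encoding the commit set in the key is an assumption that models the disjoint-support condition, and all probabilities are made-up values.

```python
import math
from itertools import product

def kl_list(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def kl_dict(p, q):
    return sum(pv * math.log(pv / q[key]) for key, pv in p.items() if pv > 0.0)

# Token distributions at three masked positions (vocabulary of size 2).
p_star = {0: [0.9, 0.1], 1: [0.6, 0.4], 2: [0.3, 0.7]}
p_q = {0: [0.8, 0.2], 1: [0.5, 0.5], 2: [0.4, 0.6]}
# Shared, model-independent commit policy pi(C_t | S_t).
commit_policy = {(0,): 0.5, (1, 2): 0.5}

def kernel(token_dists):
    """Mixture kernel over next states; keys (C, tokens) have disjoint supports across C."""
    out = {}
    for commit_set, weight in commit_policy.items():
        for tokens in product(range(2), repeat=len(commit_set)):
            prob = weight
            for i, tok in zip(commit_set, tokens):
                prob *= token_dists[i][tok]
            out[(commit_set, tokens)] = prob
    return out

kernel_kl = kl_dict(kernel(p_star), kernel(p_q))
token_kl_sum = sum(
    weight * sum(kl_list(p_star[i], p_q[i]) for i in commit_set)
    for commit_set, weight in commit_policy.items()
)
```

The kernel-level KL equals the policy-weighted sum of token-level KLs on committed positions, which is the per-step form used in Eq. (18).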
### B.8 Proof of Lemma [4.6](https://arxiv.org/html/2606.06547#S4.Thmtheorem6)

###### Proof.

A standard identity gives $\mathrm{KL}(\mathrm{softmax}(z)\,\|\,\mathrm{softmax}(z'))=D_{f}(z'\,\|\,z)$. For an $L$-smooth convex function, $D_{f}(u\,\|\,v)\leq\frac{L}{2}\|u-v\|_{2}^{2}$. Here $L=1/2$, thus $D_{f}(z'\,\|\,z)\leq\frac{1}{4}\|z'-z\|_{2}^{2}$. ∎
### B.9 Why these weights: reweighting the KL upper bound
The bound (Eq. [18](https://arxiv.org/html/2606.06547#S4.E18)) suggests an additive time $\times$ position structure in which different positions contribute *unequally* to the output divergence: a position matters only when it is updated, and its impact depends on both when it is updated and how sensitive its token distribution is to perturbations. This motivates estimating *structural importance* over $(t,i)$ and then aggregating it into a static position prior. More broadly, reweighting across diffusion timesteps admits a variational interpretation: reweighted diffusion losses correspond to tighter time-dependent variational bounds and can reduce data–model KL, providing theoretical support for principled (monotone) time reweighting beyond uniform ELBO weighting (Kingma and Gao, [2023](https://arxiv.org/html/2606.06547#bib.bib22); Shi and Titsias, [2025](https://arxiv.org/html/2606.06547#bib.bib21)).
(A) Structural importance via frontier hits. Under model-independent random commits, $\mathbf{1}\{i\in\widehat{C}_t\}$ is an unbiased indicator of whether position $i$ contributes to the per-step kernel divergence at time $t$. Thus, accumulating

$$w_i^{\mathrm{hit}}\triangleq\sum_{t=1}^{T}\lambda_0(t)\,\mathbf{1}\{i\in\widehat{C}_t\} \tag{B.6}$$

can be viewed as a Monte Carlo estimator of the *structural* time $\times$ position contribution suggested by Eq. ([18](https://arxiv.org/html/2606.06547#S4.E18)), where $\lambda_0(t)$ implements a monotone time reweighting to reflect downstream amplification of early irreversible writes.
(B) Transferable prior estimation via masked-stage reliability. Stage II performs *off-policy* teacher-forcing calibration without masks, so $\bar{w}$ must be a *static* prior that is robust to corpus shift and finite probing budget. We therefore add a masked-stage reliability gate

$$w_i^{\mathrm{rel}}\triangleq\sum_{t=1}^{T}\lambda_1\,\widetilde{c}_{t,i}\,\mathbf{1}\{i\in\mathcal{M}(S_t)\}, \tag{B.7}$$

where $\widetilde{c}_{t,i}$ is high when the teacher distribution is sharp while $i$ is masked. This term is not a structural coefficient of the KL decomposition; rather, it reduces the variance of the probed prior by downweighting intrinsically ambiguous masked contexts, which empirically improves cross-corpus rank consistency of $\bar{w}$ and stabilizes its reuse in Stage II.
Final weight. We set $w_i\triangleq w_i^{\mathrm{hit}}+w_i^{\mathrm{rel}}$ and window-align/normalize it to obtain the fixed prior $\bar{w}$ used in Stage II.
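The accumulation in Eqs. (B.6) and (B.7) can be sketched for a single probed trajectory as follows. The function name `position_prior`, the array layout (rows are steps, columns are window positions), and the min–max normalization are assumptions for illustration; the text says only "window-align/normalize", so the exact normalization and alignment used by the authors may differ.

```python
import numpy as np

def position_prior(commit_hits, masked, reliability, lam0_t, lam1=1.0, eps=1e-12):
    """
    commit_hits: (T, K) 0/1, 1 where position i is in the commit set at step t.
    masked:      (T, K) 0/1, 1 where position i is still masked in S_t.
    reliability: (T, K) float, reliability score c~_{t,i}.
    lam0_t:      (T,) float, time weights lambda_0(t), aligned with the rows.
    """
    commit_hits = np.asarray(commit_hits, dtype=float)
    masked = np.asarray(masked, dtype=float)
    reliability = np.asarray(reliability, dtype=float)
    lam0_t = np.asarray(lam0_t, dtype=float)

    w_hit = (lam0_t[:, None] * commit_hits).sum(axis=0)   # Eq. (B.6)
    w_rel = (lam1 * reliability * masked).sum(axis=0)     # Eq. (B.7)
    w = w_hit + w_rel
    # ASSUMPTION: min-max normalization over the window; not specified in the text.
    w_bar = (w - w.min()) / (w.max() - w.min() + eps)
    return w_hit, w_rel, w_bar

# Toy trajectory with T=3 steps and K=2 positions (made-up numbers).
commit_hits = [[1, 0], [0, 1], [0, 0]]
masked = [[1, 1], [0, 1], [0, 0]]
reliability = [[0.2, 0.4], [0.9, 0.6], [1.0, 1.0]]
lam0_t = [1.0, 0.5, 0.1]

w_hit, w_rel, w_bar = position_prior(commit_hits, masked, reliability, lam0_t)
print(w_hit, w_rel, w_bar)
```

In practice the per-trajectory weights would be averaged over all probing samples before being reused in Stage II.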
## Appendix C Implementation Details
Models and datasets. We evaluate W4A4 post-training quantization on two diffusion LLM families: LLaDA (Base/Instruct/1.5) (Nie et al., [2025](https://arxiv.org/html/2606.06547#bib.bib1); Zhu et al., [2025](https://arxiv.org/html/2606.06547#bib.bib3)) and Dream (Base/Instruct) (Ye et al., [2025](https://arxiv.org/html/2606.06547#bib.bib2)). Unless stated otherwise, we use each model’s default inference-time commit policy for all accuracy evaluations (Dream: entropy-driven; LLaDA: confidence-driven). We report performance on a diverse benchmark suite covering commonsense reasoning and NLU (PIQA (Bisk et al., [2020](https://arxiv.org/html/2606.06547#bib.bib10)), BoolQ (Clark et al., [2019](https://arxiv.org/html/2606.06547#bib.bib14)), WinoGrande (Sakaguchi et al., [2021](https://arxiv.org/html/2606.06547#bib.bib9)), ARC-E/C (Clark et al., [2018](https://arxiv.org/html/2606.06547#bib.bib7)), HellaSwag (Zellers et al., [2019](https://arxiv.org/html/2606.06547#bib.bib8))), truthfulness (TruthfulQA-MC2 (Lin et al., [2022](https://arxiv.org/html/2606.06547#bib.bib6))), broad knowledge and multi-task understanding (MMLU (Hendrycks et al., [2020](https://arxiv.org/html/2606.06547#bib.bib11))), code generation (HumanEval (Chen, [2021](https://arxiv.org/html/2606.06547#bib.bib13))), and mathematical reasoning (GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2606.06547#bib.bib12))). Following the official evaluation scripts provided in the released repositories, we adopt the repository-default scoring mode for each benchmark.
Baselines. We compare FAIR-Calib against vanilla RTN, QuaRot (Ashkboos et al., [2024](https://arxiv.org/html/2606.06547#bib.bib5)), and FlatQuant (Sun et al., [2024](https://arxiv.org/html/2606.06547#bib.bib4)) under the same W4A4 PTQ setting. All baselines use 128 calibration sequences from WikiText2 (Merity et al., [2016](https://arxiv.org/html/2606.06547#bib.bib15)). QuaRot is calibrated with GPTQ, while FlatQuant and RTN use RTN-style reconstruction. Following the default settings of each method, QuaRot and FlatQuant calibrate with sequence length 2048 for Dream and 4096 for LLaDA; unless stated otherwise, FAIR-Calib uses a shorter calibration length of 1024 for both model families.
FAIR-Calib specifics. We adopt per-channel and per-token symmetric quantization for weights and activations, respectively. Stage I (probing) runs the FP teacher for $T=256$ diffusion steps with *random commit* to estimate a fixed prior $\bar{w}$ over a $K$-token generation window (default $K=256$). We probe on 512 GSM8K questions from the *train* split, formatting each prompt with a standard zero-shot CoT instruction (e.g., “Let’s think step by step.”). We then compute masked-stage reliability scores $\widetilde{c}_{t,i}$ from token probability (row-wise min–max normalized over masked positions). For the frontier-hit term, we use a polynomial early-boost schedule

$$\lambda_0(t)=\lambda_0\cdot\max\!\left(\left(\frac{t-1}{T-1}\right)^{\alpha},\ \rho\right),\qquad t=1,\dots,T, \tag{C.8}$$

where larger $t$ corresponds to earlier steps; we set $\lambda_0=1.0$, $\alpha=1.5$, and $\rho=0.1$. Stage II (calibration) performs standard teacher-forcing layer-wise calibration on WikiText2, minimizing the $\bar{w}$-weighted hidden-state MSE, with $\lambda_1=1.0$.
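For concreteness, the schedule in Eq. (C.8) can be evaluated with the stated defaults. The function name `lambda0_schedule` is hypothetical; only the formula and the parameter values come from the text.

```python
import numpy as np

def lambda0_schedule(T=256, lam0=1.0, alpha=1.5, rho=0.1):
    """Polynomial early-boost schedule of Eq. (C.8); entry t-1 holds lambda_0(t)."""
    t = np.arange(1, T + 1)
    ramp = ((t - 1) / (T - 1)) ** alpha   # 0 at t=1, 1 at t=T
    return lam0 * np.maximum(ramp, rho)   # floor at rho so late steps keep weight

sched = lambda0_schedule()
print(sched[0], sched[-1])  # weight at t=1 (latest step) and t=T (earliest step)
```

With these defaults the floor $\rho$ is active for $t\leq 55$, so roughly the last fifth of the decoding steps share the minimum weight 0.1, while the earliest step ($t=T$) receives weight 1.0.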
## Appendix D More Mechanistic Diagnostics
### D.1 Post-commit mismatch diagnostics
We measure *post-commit self-consistency* along a decoding trajectory. For a position $i$ that is first written at step $t_{\mathrm{write}}(i)$ with token $\hat{x}_i$, we re-evaluate the model’s top-1 prediction at the same position $i$ for all later steps while keeping the committed token fixed in the state, and record whether the current top-1 prediction matches $\hat{x}_i$. We report: (i) `mean_disagree_rate`, the average fraction of post-commit steps where the top-1 prediction disagrees with $\hat{x}_i$, averaged over committed positions; (ii) `never_agree_rate`, the fraction of committed positions whose top-1 prediction never matches $\hat{x}_i$ at any subsequent step. Figure [A](https://arxiv.org/html/2606.06547#A4.F1) shows that quantization increases post-commit mismatch under a uniform baseline, while FAIR-Calib consistently reduces both metrics, with a larger reduction on `never_agree_rate`.

Figure A: Post-commit mismatch metrics. We report `never_agree_rate` (left) and `mean_disagree_rate` (right), averaged over evaluation samples (mean $\pm$ std), for the FP teacher, a uniformly calibrated W4A4 baseline, and FAIR-Calib. Quantization exacerbates post-commit mismatch, while FAIR-Calib reduces both metrics, especially the “never-agree” failures.
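The two metrics defined above can be computed from a logged trajectory roughly as follows. This is a sketch under assumptions: the function name `post_commit_metrics`, the array layout (rows in decoding order, row 0 being the first step), and the choice to skip positions written at the final step are illustrative and not taken from the authors' code.

```python
import numpy as np

def post_commit_metrics(top1, committed_token, write_idx):
    """
    top1:            (S, K) int, top-1 prediction per decoding step and position.
    committed_token: (K,) int, token written at each position.
    write_idx:       (K,) int, row index at which each position was first written.
    """
    top1 = np.asarray(top1)
    disagree_fracs, never_agree = [], []
    for i in range(top1.shape[1]):
        later = top1[write_idx[i] + 1:, i]       # predictions after the commit
        if later.size == 0:                      # ASSUMPTION: skip if no later step
            continue
        mismatch = later != committed_token[i]
        disagree_fracs.append(mismatch.mean())   # fraction of disagreeing steps
        never_agree.append(mismatch.all())       # True if no later step agrees
    return {
        "mean_disagree_rate": float(np.mean(disagree_fracs)),
        "never_agree_rate": float(np.mean(never_agree)),
    }

# Toy trajectory: 4 decoding steps, 3 positions (made-up token ids).
top1 = [[5, 7, 9],
        [5, 8, 9],
        [6, 8, 1],
        [5, 8, 2]]
metrics = post_commit_metrics(top1, committed_token=[5, 8, 9], write_idx=[0, 1, 1])
print(metrics)
```

In the toy example, position 0 disagrees at one of three later steps, position 1 always agrees, and position 2 never agrees, giving a mean disagree rate of 4/9 and a never-agree rate of 1/3.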
### D.2 Cross-corpus rank consistency of position weights
To test whether the probed weight prior reflects corpus statistics or decoding dynamics, we compute mean normalized weight curves over the generation window on two corpora (GSM8K-CoT vs. WikiText2) and report Spearman (and Pearson) correlations between the two mean curves. Since per-sample weights are normalized before averaging, the absolute scale of the mean curve (often centered around $\sim 0.5$) is not informative; we focus on the relative ordering across positions (Spearman). Figure [B](https://arxiv.org/html/2606.06547#A4.F2) compares four probing regimes:
- (a) *fixed-trajectory random commit* (16 samples; fixed seed), where both corpora share the same commit-set sampling trajectory, yielding high consistency;
- (b) *less-stochastic commits* (256 samples; entropy score with top-$k$ commit), which reduces within-corpus variability but introduces a policy-dependent frontier-update pattern and yields lower cross-corpus agreement;
- (c) *independent-trajectory random commit* (256 samples), where stochastic trajectory variation reduces mean-curve consistency relative to the fixed-trajectory setting, yet substantial correlation remains, indicating a persistent mechanism-driven component.
- (d) Under the same setting as (c), *frontier-hit only* yields near-zero cross-corpus agreement (Spearman $\approx -0.056$), suggesting that frontier-hit statistics alone are dominated by trajectory noise and corpus-specific effects, and motivating the masked-stage reliability gate for constructing a transferable static prior.
Figure B: Cross-corpus consistency of the time $\times$ position weight prior. Mean normalized weight curves over the 256-token generation window for GSM8K-CoT (blue) and WikiText2 (orange) under four probing settings (top to bottom). Titles report Spearman and Pearson correlations between the two mean curves; shaded bands visualize across-sample variability.
### D.3 Decision-Margin Preservation at Commit Steps
To further substantiate whether FAIR-Calib better preserves write-relevant decisions after quantization, we analyze the frontier decision margin at commit steps. We first run the full-precision teacher to obtain a reference decoding trajectory. For each diffusion step $t$ and position $i$, let $x_{\mathrm{fin}}(i)$ denote the token finally committed at position $i$ by the teacher decoding trajectory. We then evaluate each quantized model on the same pre-commit state from the teacher trajectory and compute its logit margin with respect to this teacher-committed target token:

$$\operatorname{margin}(t,i)=\operatorname{logit}_{t,i}\!\left(x_{\mathrm{fin}}(i)\right)-\max_{v\neq x_{\mathrm{fin}}(i)}\operatorname{logit}_{t,i}(v), \tag{D.9}$$

where the logits are produced by the quantized student model. In other words, the teacher trajectory provides the reference target token $x_{\mathrm{fin}}(i)$, while the margin measures how strongly the quantized model ranks this teacher-committed token over its strongest competing alternative under the same write-frontier state.
Figure C: Average frontier margin at commit steps. FAIR-Calib better preserves the decision margin of fragile write-frontier states after quantization. The average frontier margin increases from $6.56\pm0.62$ for the unweighted baseline to $7.23\pm0.64$ under FAIR-Calib.

As shown in Figure [C](https://arxiv.org/html/2606.06547#A4.F3), FAIR-Calib achieves a higher average frontier margin at commit steps than the unweighted baseline. Since both methods are evaluated against the same teacher-committed target token $x_{\mathrm{fin}}(i)$, this diagnostic directly measures whether quantization changes the model’s preference away from the teacher’s write decision. The average margin increases from $6.56\pm0.62$ to $7.23\pm0.64$, providing more direct evidence that the proposed frontier-aware weighting better preserves fragile write-relevant decisions after quantization.
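The margin in Eq. (D.9) is straightforward to compute for a single position. The sketch below uses a hypothetical helper `frontier_margin` and made-up logits; a negative value means the quantized model prefers a different token than the teacher committed, i.e. a decision flip.

```python
import numpy as np

def frontier_margin(logits, target):
    """Logit of the teacher-committed token minus the best competing logit."""
    logits = np.asarray(logits, dtype=float)
    competitors = np.delete(logits, target)   # all tokens except the target
    return float(logits[target] - competitors.max())

student_logits = [2.0, 5.0, 1.0]   # toy logits from the quantized model

m_preserved = frontier_margin(student_logits, target=1)   # teacher token is top-1
m_flipped = frontier_margin(student_logits, target=0)     # teacher token is outranked
print(m_preserved, m_flipped)
```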
### D.4 Stability Lag Analysis for Dream and LLaDA
We further analyze why FAIR\-Calib yields larger gains on Dream than on LLaDA\. We attribute this difference to their distinct decoding dynamics\. In particular, we measure the*stability lag*of commit decisions, which characterizes how long a position remains fragile after it is selected for commitment\. A heavier tail in the stability\-lag distribution indicates that the model’s decisions stay unstable for more steps, making them more vulnerable to quantization\-induced perturbations\.
Figure D: Stability-lag analysis for Dream and LLaDA. Dream exhibits a heavier tail in instability, suggesting that its commit decisions remain fragile for longer after the commit step. This explains why FAIR-Calib tends to yield larger gains on Dream: the proposed frontier-aware objective is designed to protect precisely these fragile write-relevant states.

As shown in Figure [D](https://arxiv.org/html/2606.06547#A4.F4), Dream exhibits a heavier tail in instability, which means its decisions stay fragile for longer after the commit step compared to LLaDA. Since FAIR-Calib is specifically designed to protect these fragile states, it naturally yields higher gains where the “stability lag” issue is more acute. LLaDA is inherently more stable, yet FAIR-Calib still provides consistent, non-trivial improvements.
## Appendix E Additional Generalization Results
In this section, we provide two additional studies to further examine the generality of the proposed frontier\-aware calibration objective\. First, we evaluate FAIR\-Calib under a more challenging W3A4 quantization setting, where quantization noise is stronger and commitment\-flip errors are expected to be more severe\. Second, we integrate the proposed position\-weighted objective into a QDrop\-style calibration framework that directly optimizes quantization parameters, in order to verify that the proposed objective is not tied to a specific PTQ pipeline\.
### E.1 Lower-Bit Quantization under W3A4
We first evaluate FAIR\-Calib under the more challenging W3A4 setting\. We use the same LLaDA\-Instruct model, PTQ pipeline, calibration setup, and evaluation protocol as in the main experiments, and only change the quantization precision from W4A4 to W3A4\. We compare FAIR\-Calib with FlatQuant under this lower\-bit setting\.
Table A: Comparison on LLaDA-Instruct under W3A4 quantization. Compared with FlatQuant, FAIR-Calib improves the overall average from 69.58 to 70.41, yielding a gain of +0.83 points.

| Method | PIQA | BoolQ | WinoGrande | ARC-e | ARC-c | HellaSwag | TruthfulQA | MMLU | HumanEval | GSM8K-CoT | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Real-valued | 82.86 | 88.38 | 77.35 | 94.00 | 88.47 | 76.89 | 48.47 | 64.36 | 46.95 | 70.36 | 73.81 |
| FlatQuant | 79.54 | 87.65 | 74.06 | 91.71 | 83.05 | 71.12 | 46.15 | 60.60 | 36.59 | 65.34 | 69.58 |
| FAIR-Calib | 79.88 | 87.52 | 74.90 | 91.71 | 85.42 | 71.29 | 47.52 | 60.78 | 38.41 | 66.66 | 70.41 |
As shown in Table[A](https://arxiv.org/html/2606.06547#A5.T1), FAIR\-Calib remains effective under the lower\-bit W3A4 setting\. Compared with W3A4 FlatQuant, FAIR\-Calib improves the overall average from 69\.58 to 70\.41, with a gain of \+0\.83 points\. The improvements are especially clear on ARC\-c, HumanEval, TruthfulQA, GSM8K\-CoT, and WinoGrande, where FAIR\-Calib improves the corresponding scores by \+2\.37, \+1\.82, \+1\.37, \+1\.32, and \+0\.84 points, respectively\. These results suggest that the proposed frontier\-aware weighting remains beneficial in the more challenging low\-bit regime, where quantization\-induced perturbations are stronger and write\-frontier states are more vulnerable\.
### E.2 Extension to QDrop-Style Calibration
We further examine whether the proposed weighting strategy can be applied beyond the specific calibration pipeline used in FAIR\-Calib\. To this end, we integrate the proposed position\-weighted hidden\-state reconstruction objective into a QDrop\-style calibration framework\. Unlike the main FAIR\-Calib pipeline, this setting directly optimizes quantization parameters, including scale and zero\-point, rather than relying on affine transformation learning\. For a fair comparison, we keep the same W8A8 quantization setting, model, calibration data, and evaluation protocol, and only replace the vanilla calibration objective with the proposed position\-weighted version\.
Table B: Comparison with the QDrop-style baseline under W8A8 quantization. The proposed position-weighted objective substantially improves the QDrop-style baseline, increasing the average score from 70.66 to 73.17.

| Method | PIQA | BoolQ | WinoGrande | ARC-e | ARC-c | HellaSwag | TruthfulQA | MMLU | HumanEval | GSM8K-CoT | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Real-valued | 82.86 | 88.38 | 77.35 | 94.00 | 88.47 | 76.89 | 48.47 | 64.36 | 46.95 | 70.36 | 73.81 |
| QDrop-style | 80.69 | 88.20 | 75.45 | 92.06 | 87.46 | 72.21 | 47.14 | 61.78 | 36.59 | 65.05 | 70.66 |
| QDrop-style + position-weighted MSE | 82.93 | 88.44 | 77.19 | 93.30 | 88.47 | 76.74 | 47.14 | 64.06 | 43.29 | 70.13 | 73.17 |
| FAIR-Calib | 82.64 | 88.35 | 77.19 | 93.30 | 88.47 | 76.93 | 48.58 | 64.43 | 46.95 | 71.27 | 73.81 |
As shown in Table[B](https://arxiv.org/html/2606.06547#A5.T2), the proposed position\-weighted objective substantially improves the QDrop\-style baseline\. The average score increases from 70\.66 to 73\.17, yielding a gain of \+2\.51 points\. This brings the QDrop\-style pipeline much closer to the real\-valued model, whose average score is 73\.81\. In addition, the full FAIR\-Calib pipeline achieves an average score of 73\.81 under W8A8 quantization, matching the real\-valued model in overall average performance\. These results indicate that the proposed frontier\-aware objective is not specific to a single PTQ implementation\. Instead, it can serve as a general calibration principle for protecting fragile write\-frontier states during post\-training quantization\.
## Appendix F Efficiency Discussion
We discuss efficiency from two aspects: (i) calibration efficiency, and (ii) compression efficiency.
### F.1 Calibration efficiency
Stage-I probing cost. Stage I runs a full-precision teacher for $T$ diffusion steps under random commits, using $N_{\mathrm{probe}}$ samples, and only records lightweight statistics (frontier-hit indicators and masked-stage sharpness) over the $K$-token generation window. Thus the probing complexity is

$$\mathcal{C}_{\mathrm{probe}}=\mathcal{O}\!\left(N_{\mathrm{probe}}\cdot T\cdot\mathrm{Cost}(\text{FP forward per step})\right), \tag{F.10}$$

with a small constant factor because no gradients are stored and only a window-aligned weight vector is accumulated. In our default setting ($T=256$, $N_{\mathrm{probe}}=512$, $K=256$), the probing budget is moderate and independent of any layer-wise calibration iterations.
Stage-II off-policy calibration avoids diffusion rollouts. A key efficiency feature of FAIR-Calib is that Stage II *does not* perform end-to-end diffusion rollouts for the quantized model during calibration. Instead, we use teacher-forcing on fully observed sequences and apply sequential layer-wise reconstruction with the fixed prior $\bar{w}$ (Eq. ([10](https://arxiv.org/html/2606.06547#S3.E10))). This makes the calibration cost comparable to conventional layer-wise PTQ:

$$\mathcal{C}_{\mathrm{cal}}=\mathcal{O}\!\left(L\cdot N_{\mathrm{cal}}\cdot\mathrm{Cost}(\text{forward/backward on a single layer/block})\right), \tag{F.11}$$

where $L$ is the number of blocks and $N_{\mathrm{cal}}$ is the number of calibration sequences (we use 128 sequences from WikiText2 by default). Crucially, this avoids the multiplicative factor of $T$ that would arise if one calibrated by differentiating through diffusion trajectories.
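To make the per-layer objective concrete, a position-weighted hidden-state MSE might look like the sketch below. This is an assumed form: the exact objective is Eq. (10) in the main text, which is not reproduced in this appendix, so the averaging over the hidden dimension, the division by the total weight, and the name `weighted_hidden_mse` are illustrative choices rather than the authors' definition.

```python
import numpy as np

def weighted_hidden_mse(h_teacher, h_quant, w_bar):
    """
    h_teacher, h_quant: (K, D) hidden states of one block for K positions.
    w_bar:              (K,) fixed position prior from Stage I.
    """
    h_teacher = np.asarray(h_teacher, dtype=float)
    h_quant = np.asarray(h_quant, dtype=float)
    w_bar = np.asarray(w_bar, dtype=float)
    per_position = ((h_teacher - h_quant) ** 2).mean(axis=1)   # MSE per position
    # ASSUMPTION: normalize by the total weight so the scale matches plain MSE.
    return float((w_bar * per_position).sum() / w_bar.sum())

h_t = np.zeros((2, 2))                     # toy teacher hidden states
h_q = np.array([[1.0, 1.0], [2.0, 2.0]])   # toy quantized hidden states

loss_uniform = weighted_hidden_mse(h_t, h_q, [1.0, 1.0])
loss_weighted = weighted_hidden_mse(h_t, h_q, [3.0, 1.0])
print(loss_uniform, loss_weighted)
```

Each calibration step needs only one block's teacher and quantized activations, which is why the cost carries no factor of $T$.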
Shorter sequences and no on-policy state collection. FAIR-Calib uses a shorter calibration sequence length by default (1024 for both Dream and LLaDA), while some strong baselines calibrate with longer sequences (e.g., 2048/4096 depending on the model family). Moreover, because $\bar{w}$ is computed once in Stage I and reused statically, Stage II does not require collecting on-policy masked states induced by a model-dependent commit policy, which would otherwise increase calibration complexity and engineering overhead.
Memory footprint during calibration. Stage II requires comparing teacher and quantized hidden states, but this can be implemented in a streaming manner: teacher activations can be computed under `no_grad` and consumed immediately for the corresponding layer/block calibration. Thus, the peak memory overhead over standard layer-wise PTQ is typically limited to holding (i) the current layer’s teacher hidden states and (ii) the quantized layer’s activations needed for local reconstruction, rather than storing full diffusion trajectories. The substantially reduced calibration-time memory footprint (Table [C](https://arxiv.org/html/2606.06547#A6.T3)) makes FAIR-Calib practical beyond server-class GPUs. In particular, the peak memory during calibration is reduced to $\sim$12–15 GB for LLaDA/Dream in our setup, which falls within the budget of widely available consumer-grade GPUs (e.g., 16–24 GB). This lower footprint allows practitioners to run calibration locally, and also leaves additional headroom to increase the calibration batch size, sequence length, or the number of calibration steps when needed.
Table C: Calibration GPU memory footprint. Peak GPU memory usage during calibration (reported memory in MB; lower is better). Reduction is the ratio Baseline / FAIR-Calib.

| Model | Baseline (MB) | FAIR-Calib (MB) | Reduction |
| --- | --- | --- | --- |
| LLaDA-Base | 32923 | 12121 | 2.72× |
| LLaDA-Instruct | 32923 | 12121 | 2.72× |
| LLaDA-1.5 | 32923 | 12121 | 2.72× |
| Dream-Base | 29943 | 14869 | 2.01× |
| Dream-Instruct | 29943 | 14869 | 2.01× |

### F.2 Compression efficiency
Weight memory reduction under W4A4\.At deployment, low\-bit PTQ primarily reduces the storage of model weights\. Ignoring small metadata terms, the weight memory scales approximately as
$$\mathrm{Mem}(\text{weights})\approx\frac{b_w}{16}\cdot\mathrm{Mem}(\text{FP16 weights}), \tag{F.12}$$

suggesting an ideal $\sim 4\times$ reduction when $b_w=4$. In practice, the realized memory footprint also includes auxiliary tensors such as per-channel scales, as well as runtime buffers and framework overhead, so the end-to-end reduction is typically smaller than the ideal ratio. Table [D](https://arxiv.org/html/2606.06547#A6.T4) summarizes the deployment-time memory footprint after quantization. Across both model families, W4A4 quantization calibrated by FAIR-Calib reduces memory by $\sim$3.1–3.2$\times$ compared with FP, while keeping the inference-time decoding procedure unchanged (same number of diffusion steps and commit policy).
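The gap between the ideal ratio of Eq. (F.12) and the realized savings can be reproduced from the reported footprints. The helper name `ideal_weight_ratio` is hypothetical; the GB figures are the LLaDA-Base and Dream-Base entries of Table D.

```python
def ideal_weight_ratio(b_w, fp_bits=16):
    """Ideal weight-memory reduction when going from fp_bits to b_w bits."""
    return fp_bits / b_w

# (FP footprint, W4A4 footprint) in GB, as reported in Table D.
reported_gb = {
    "LLaDA-Base": (15.89, 4.91),
    "Dream-Base": (13.95, 4.44),
}

ideal = ideal_weight_ratio(4)
realized = {name: round(fp / q, 2) for name, (fp, q) in reported_gb.items()}
print(ideal, realized)
```

The realized ratios fall below the ideal 4× because scales, runtime buffers, and framework overhead are not reduced by weight quantization.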
Table D: Deployment memory footprint after quantization. Memory usage in GB for FP vs. W4A4 (FAIR-Calib). Memory saving is the ratio FP / FAIR-Calib.

| Model | FP (GB) | FAIR-Calib (GB) | Memory saving |
| --- | --- | --- | --- |
| LLaDA-Base | 15.89 | 4.91 | 3.24× |
| LLaDA-Instruct | 15.89 | 4.91 | 3.24× |
| LLaDA-1.5 | 15.88 | 4.90 | 3.24× |
| Dream-Base | 13.95 | 4.44 | 3.14× |
| Dream-Instruct | 13.95 | 4.44 | 3.14× |