ScaleSweep: Accurate NVFP4 Post-Training Quantization of LLMs via Block Scale Initialization

arXiv cs.LG Papers

Summary

ScaleSweep proposes a new block scale initialization method for NVFP4 post-training quantization of LLMs, achieving improved accuracy by sweeping over feasible block scale candidates. Experiments on Llama and Qwen models show it preserves over 93% of full-precision performance under aggressive quantization.

arXiv:2606.07618v1 Announce Type: new Abstract: NVFP4 is a recently introduced hardware-supported FP4 format that improves the fidelity of 4-bit quantization through fine-grained block scales. However, existing NVFP4 scale initialization methods still primarily rely on AbsMax initialization, which leaves a noticeable gap to the optimal solution. To address this, we propose ScaleSweep, a simple and efficient scale optimization method that sweeps over feasible block scale candidates and selects the candidate that minimizes a target objective. We further provide a theoretical analysis of NVFP4 quantization and derive both lower and upper bounds for the required sweep range under mean square error (MSE) and weighted mean square error (WMSE) between the original tensor and the quantized reconstructed tensor. The proposed bounds substantially reduce the sweep space while preserving the optimal candidate, enabling negligible overhead compared with the baseline quantization operators. Experiments on Llama and Qwen models demonstrate that ScaleSweep consistently improves quantization performance over existing initialization methods and further narrows the gap to full precision. In particular, under aggressive end-to-end quantization of weights, activations, KV cache, and query states, ScaleSweep preserves more than 93% of the full-precision performance.


# ScaleSweep: Accurate NVFP4 Post-Training Quantization of LLMs via Block Scale Initialization
Source: [https://arxiv.org/html/2606.07618](https://arxiv.org/html/2606.07618)
Li Lin, Xiaojun Wan, Wangxuan Institute of Computer Technology, Peking University, efsotr\_l@stu\.pku\.edu\.cn, wanxiaojun@pku\.edu\.cn

###### Abstract

NVFP4 is a recently introduced hardware\-supported FP4 format that improves the fidelity of 4\-bit quantization through fine\-grained block scales\. However, existing NVFP4 scale initialization methods still primarily rely on AbsMax initialization, which leaves a noticeable gap to the optimal solution\. To address this, we propose ScaleSweep, a simple and efficient scale optimization method that sweeps over feasible block scale candidates and selects the candidate that minimizes a target objective\. We further provide a theoretical analysis of NVFP4 quantization and derive both lower and upper bounds for the required sweep range under mean square error \(MSE\) and weighted mean square error \(WMSE\) between the original tensor and the quantized reconstructed tensor\. The proposed bounds substantially reduce the sweep space while preserving the optimal candidate, enabling negligible overhead compared with the baseline quantization operators\. Experiments on Llama and Qwen models demonstrate that ScaleSweep consistently improves quantization performance over existing initialization methods and further narrows the gap to full precision\. In particular, under aggressive end\-to\-end quantization of weights, activations, KV cache, and query states, ScaleSweep preserves more than 93% of the full\-precision performance\.

## 1 Introduction

Recent advances in large language models \(LLMs\) have substantially increased memory footprint, bandwidth demand, and computational cost during deployment\. Post\-training quantization \(PTQ\) \(Krishnamoorthi, [2018](https://arxiv.org/html/2606.07618#bib.bib28)\) has therefore become a key approach for efficient inference, enabling model compression without retraining or full fine\-tuning \(Frantar et al\., [2023](https://arxiv.org/html/2606.07618#bib.bib4); Xiao et al\., [2023](https://arxiv.org/html/2606.07618#bib.bib11); Ashkboos et al\., [2024](https://arxiv.org/html/2606.07618#bib.bib5); Liu et al\., [2025](https://arxiv.org/html/2606.07618#bib.bib6); Hu et al\., [2025](https://arxiv.org/html/2606.07618#bib.bib7)\)\. Among low\-precision quantization schemes, NVFP4 is particularly notable for combining an FP4 E2M1 format with both FP8 micro\-block scales and a tensor\-level global scale, with native support on NVIDIA Blackwell GPUs \(Alvarez et al\., [2025](https://arxiv.org/html/2606.07618#bib.bib9)\)\. This combination reduces memory and bandwidth requirements while preserving greater numerical flexibility than integer\-only formats \(Chen et al\., [2025](https://arxiv.org/html/2606.07618#bib.bib31); Egiazarian et al\., [2026](https://arxiv.org/html/2606.07618#bib.bib8)\)\. The FP8 micro\-block scaling design of NVFP4 enables practical low\-bit LLM inference under aggressive compression, making scale optimization increasingly critical in fine\-grained low\-precision quantization\.

Despite the advantages of NVFP4, existing PTQ methods exhibit different behaviors under this format\. Some methods, such as GPTQ \(Frantar et al\., [2023](https://arxiv.org/html/2606.07618#bib.bib4)\) and SmoothQuant \(Xiao et al\., [2023](https://arxiv.org/html/2606.07618#bib.bib11)\), remain applicable to NVFP4, whereas rotation\-based approaches \(Ashkboos et al\., [2024](https://arxiv.org/html/2606.07618#bib.bib5); Liu et al\., [2025](https://arxiv.org/html/2606.07618#bib.bib6)\) may degrade performance \(Egiazarian et al\., [2026](https://arxiv.org/html/2606.07618#bib.bib8)\)\. This difference stems from two key distinctions between NVFP4 and conventional INT4 quantization: the use of micro\-block scaling and FP4 data types\. Various scale initialization techniques have been proposed for INT quantization \(Zhang and Shrivastava, [2025](https://arxiv.org/html/2606.07618#bib.bib29); Lin et al\., [2026](https://arxiv.org/html/2606.07618#bib.bib30)\), but they are not directly applicable to NVFP4 due to its two\-level scaling structure\. Existing NVFP4 initialization methods still primarily rely on AbsMax\-based heuristics, including the 4/6 strategy \(Cook et al\., [2026](https://arxiv.org/html/2606.07618#bib.bib3)\), where a noticeable gap to the optimal solution remains\. These characteristics make scale optimization and error distribution in NVFP4 fundamentally different from those in both INT4 and single\-level FP quantization, thereby motivating dedicated scale optimization methods for NVFP4\.

![Refer to caption](https://arxiv.org/html/2606.07618v1/x1.png)

Figure 1: Normalized MSE and WMSE, together with their relative gaps to the optimal FP32 block scale, between the original tensor and the quantized reconstructed tensor under different NVFP4 block sizes using AbsMax, 4/6, ScaleSweep, and the FP8\-quantized optimal FP32 scale\. Definitions are provided in Section [3](https://arxiv.org/html/2606.07618#S3)\.

Existing NVFP4 scale initialization strategies such as AbsMax and 4/6 rely on simple heuristics based on the maximum representable FP4 values\. However, as shown in Figure [1](https://arxiv.org/html/2606.07618#S1.F1), they still exhibit a noticeable gap compared with the FP8\-quantized optimal FP32 block scale across different block sizes \(for both the MSE and WMSE objectives, the optimal FP32 block scale can be solved exactly with low computational complexity; details are provided in Appendix [C](https://arxiv.org/html/2606.07618#A3)\)\. This observation suggests that there remains substantial room for improving FP8 block scale selection\. Since the number of representable FP8 scales is very limited, exhaustive scale sweep becomes computationally feasible\. To this end, we propose ScaleSweep, an NVFP4\-specific scale sweep method designed for FP4 quantization with FP8 block scales\. For FP4 quantization, we further provide a theoretical analysis of block scale optimization under both MSE and WMSE objectives\. In particular, through theoretical and computer\-assisted analysis, we derive theoretically justified lower and upper bounds for the optimal FP8 block scale, thereby reducing the feasible sweep range to a compact local neighborhood in the FP8 bit\-pattern space and enabling efficient scale sweep\.

We evaluate ScaleSweep under increasingly challenging quantization settings, including weight\-activation quantization, weight\-activation quantization with KV\-cache quantization, and weight\-activation quantization with both KV\-cache and query state quantization\. Across all settings, ScaleSweep generally achieves stronger recovery than existing initialization methods for NVFP4\.

Our main contributions are summarized below:

- We analyze FP4 quantization with FP8 block scales and derive lower and upper bounds for the optimal block scale under MSE and WMSE objectives\.
- Based on the derived bounds, we propose ScaleSweep, an NVFP4\-specific calibration method that restricts FP8 block scale optimization to a compact interval in the bit\-pattern space, enabling efficient scale selection for RTN and GPTQ pipelines\.
- We validate the effectiveness of ScaleSweep on Llama and Qwen models across weight\-activation, KV\-cache, and query\-state quantization settings, where ScaleSweep generally outperforms existing initialization methods and recovers up to 93–95% of BF16 performance in the most aggressive setting, while introducing negligible operator overhead compared with the default NVFP4 quantization operators in vLLM\.

## 2 Related Work

##### Integer quantization\.

Post\-training quantization for integer quantization has been extensively studied for efficient large language model \(LLM\) inference\. GPTQ \(Frantar et al\., [2023](https://arxiv.org/html/2606.07618#bib.bib4)\) improves low\-bit quantization through layer\-wise reconstruction with approximate second\-order information, while SmoothQuant \(Xiao et al\., [2023](https://arxiv.org/html/2606.07618#bib.bib11)\) mitigates activation outliers through smoothing transformations between weights and activations\. More recent work further improves low\-bit quantization by reshaping tensor distributions before quantization\. QuaRot \(Ashkboos et al\., [2024](https://arxiv.org/html/2606.07618#bib.bib5)\) applies randomized Hadamard rotations to remove activation outliers and support 4\-bit inference in rotated LLMs; SpinQuant \(Liu et al\., [2025](https://arxiv.org/html/2606.07618#bib.bib6)\) learns rotation transformations to better align tensors with low\-bit quantization grids; and OSTQuant \(Hu et al\., [2025](https://arxiv.org/html/2606.07618#bib.bib7)\) combines orthogonal and scaling transformations to refine quantization through improved distribution fitting\. Together, these methods show that reducing outliers and smoothing quantization\-unfriendly distributions are central to accurate low\-bit INT PTQ, where Hadamard, rotation, and orthogonal transformations have become increasingly important techniques\.

##### FP4 quantization\.

FP4 quantization has recently emerged as an important direction for efficient low\-precision LLM inference, particularly with the introduction of NVIDIA’s NVFP4 format \(Alvarez et al\., [2025](https://arxiv.org/html/2606.07618#bib.bib9)\)\. Recent studies have begun to investigate FP4 quantization for both pretraining and post\-training settings\. NVIDIA’s NVFP4 pretraining work demonstrates the feasibility of training large language models with NVFP4 precision \(NVIDIA et al\., [2026](https://arxiv.org/html/2606.07618#bib.bib2)\), while TetraJet\-v2 \(Chen et al\., [2026](https://arxiv.org/html/2606.07618#bib.bib10)\) further improves NVFP4 training accuracy by addressing weight oscillation and outlier issues during low\-precision training\. In terms of scale initialization, 4/6 \(Cook et al\., [2026](https://arxiv.org/html/2606.07618#bib.bib3)\) extends AbsMax scaling by additionally evaluating a scale that maps the block maximum to 4 instead of 6 and selecting the lower\-error quantization\. MR\-GPTQ \(Egiazarian et al\., [2026](https://arxiv.org/html/2606.07618#bib.bib8)\) shows that directly applying rotation transformations such as QuaRot and SpinQuant can degrade performance under NVFP4 quantization, and instead proposes micro\-rotation on top of GPTQ for hardware\-supported FP4 formats\. These results suggest that FP4 quantization requires techniques tailored to floating\-point codebooks and microscaling structures, particularly for scale selection and format\-aware reconstruction, rather than a direct reuse of INT4 quantization methods\.

## 3 Preliminary

### 3\.1 Notation

Let $[N]$ denote the index set $\{0, 1, 2, \dots, N-1\}$\. For an original tensor $x$ and its reconstructed tensor $\hat{x}$ of size $n$, the mean square error \(MSE\) and normalized mean square error \(NMSE\) are defined as $\mathrm{MSE}(x, \hat{x}) = \|x - \hat{x}\|_2^2 / n$ and $\mathrm{NMSE}(x, \hat{x}) = \|x - \hat{x}\|_2^2 / \|x\|_2^2$, respectively\. Given an element\-wise weight tensor $w$, the weighted mean square error \(WMSE\) and normalized weighted mean square error \(NWMSE\) are defined as $\mathrm{WMSE}(x, \hat{x}, w) = \left(\sum_i w_i (x_i - \hat{x}_i)^2\right) / \left(\sum_i w_i\right)$ and $\mathrm{NWMSE}(x, \hat{x}, w) = \left(\sum_i w_i (x_i - \hat{x}_i)^2\right) / \left(\sum_i w_i x_i^2\right)$, respectively\.
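The four error metrics defined above can be written down directly. The following is a minimal sketch in plain Python; the function names are illustrative and not taken from the paper, and tensors are assumed to be flat lists of floats.

```python
def mse(x, x_hat):
    """Mean square error: ||x - x_hat||_2^2 / n."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def nmse(x, x_hat):
    """Normalized MSE: ||x - x_hat||_2^2 / ||x||_2^2."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / sum(a * a for a in x)


def wmse(x, x_hat, w):
    """Weighted MSE: sum_i w_i (x_i - x_hat_i)^2 / sum_i w_i."""
    return sum(wi * (a - b) ** 2 for a, b, wi in zip(x, x_hat, w)) / sum(w)


def nwmse(x, x_hat, w):
    """Normalized weighted MSE: sum_i w_i (x_i - x_hat_i)^2 / sum_i w_i x_i^2."""
    num = sum(wi * (a - b) ** 2 for a, b, wi in zip(x, x_hat, w))
    return num / sum(wi * a * a for a, wi in zip(x, w))
```

With all weights equal to one, `wmse` and `nwmse` reduce to `mse` and `nmse`, mirroring how the weighted loss reduces to the unweighted one.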

For a numeric format $\mathcal{F}$, let $\mathcal{G}_{\mathcal{F}}$ denote its representable value set\. In particular, $\mathcal{G}_{\mathrm{FP4}}$, $\mathcal{G}_{\mathrm{FP8}}$, and $\mathcal{G}_{\mathrm{FP32}}$ denote the representable value sets of FP4 E2M1, FP8 E4M3, and FP32, respectively, where $\mathcal{G}_{\mathrm{FP4}}$ and $\mathcal{G}_{\mathrm{FP8}}$ are also referred to as quantization grids\.

For $x \in \mathbb{R}$, let $\left\lfloor x \right\rceil_{\mathcal{F}}$ denote rounding $x$ to the nearest value in $\mathcal{G}_{\mathcal{F}}$, let $\left\lfloor x \right\rfloor_{\mathcal{F}}$ denote rounding $x$ downward to the largest value in $\mathcal{G}_{\mathcal{F}}$ not exceeding $x$, and let $\left\lceil x \right\rceil_{\mathcal{F}}$ denote rounding $x$ upward to the smallest value in $\mathcal{G}_{\mathcal{F}}$ not smaller than $x$\.

For FP4 quantization with scale $s$, define the scaled format $\mathrm{FP4}(s)$ as the standard FP4 format scaled by $s$\. Its representable value set is

$$\mathcal{G}_{\mathrm{FP4}(s)} = \{ s \cdot v \mid v \in \mathcal{G}_{\mathrm{FP4}} \}. \tag{1}$$

Let $\mathbf{w} \in \mathbb{R}_{\geq 0}^{n}$ denote non\-negative weights\. The weighted quantization loss under scale $s$ is

$$\mathcal{L}(s; \mathbf{x}, \mathbf{w}) = \sum_{i=0}^{n-1} w_i \left( x_i - \left\lfloor x_i \right\rceil_{\mathrm{FP4}(s)} \right)^2. \tag{2}$$

When $\mathbf{w} = \mathbf{1}$, the objective reduces to the unweighted loss

$$\mathcal{L}(s; \mathbf{x}) = \sum_{i=0}^{n-1} \left( x_i - \left\lfloor x_i \right\rceil_{\mathrm{FP4}(s)} \right)^2. \tag{3}$$

For simplicity, both formulations are denoted by $\mathcal{L}$ when clear from context\.

### 3\.2 NVFP4 Quantization

NVFP4 combines FP4 values with two\-level scaling to improve quantization fidelity under low bit\-width constraints\. The FP4 format used in NVFP4 follows the OCP Microscaling Formats Specification \(Open Compute Project, [2023](https://arxiv.org/html/2606.07618#bib.bib1)\) with representable values

$$\mathcal{G}_{\mathrm{FP4}} = \{0, \pm 0.5, \pm 1, \pm 1.5, \pm 2, \pm 3, \pm 4, \pm 6\}. \tag{4}$$

Given an input tensor $\mathbf{x} \in \mathbb{R}^{N}$, NVFP4 represents it as $(\mathbf{q}, \mathbf{s}, S)$, where $\mathbf{q} \in \mathcal{G}_{\mathrm{FP4}}^{N}$ denotes FP4 values, $\mathbf{s} \in \mathcal{G}_{\mathrm{FP8}}^{N/16}$ denotes FP8 E4M3 micro\-block scales shared across every 16 elements, and $S \in \mathcal{G}_{\mathrm{FP32}}$ denotes a global FP32 scale\. The reconstructed tensor $\hat{\mathbf{x}}$ is computed as

$$\hat{x}_i = q_i \cdot s_{\lfloor i/16 \rfloor} \cdot S. \tag{5}$$

For NVFP4 quantization, a widely adopted initialization strategy is AbsMax \(NVIDIA et al\., [2026](https://arxiv.org/html/2606.07618#bib.bib2)\):

$$S = \frac{\max_i |x_i|}{448 \cdot 6}, \tag{6}$$

$$s_k = \left\lfloor \frac{\max_{\lfloor i/16 \rfloor = k} |x_i|}{S \cdot 6} \right\rceil_{\mathrm{FP8}}, \quad k \in [N/16], \tag{7}$$

$$q_i = \left\lfloor \frac{x_i}{S \cdot s_{\lfloor i/16 \rfloor}} \right\rceil_{\mathrm{FP4}}, \quad i \in [N], \tag{8}$$

where $448$ and $6$ are the maximum representable magnitudes of FP8 E4M3 and FP4 E2M1, respectively\.
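To make Eqs. (5) to (8) concrete, here is a minimal pure-Python sketch of AbsMax NVFP4 quantization and reconstruction. It is an illustration under stated assumptions, not the paper's or vLLM's implementation: the helper names are hypothetical, E4M3 is decoded with exponent bias 7 and bit pattern 127 treated as reserved (giving the maximum of 448), and ties in nearest rounding are broken toward the smaller grid value rather than by hardware round-to-nearest-even.

```python
import math

# Non-negative FP4 E2M1 magnitudes from Eq. (4); the sign is handled separately.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def e4m3_value(code):
    """Decode a non-negative E4M3 bit pattern (0..126) into its real value."""
    exp, man = code >> 3, code & 0b111
    if exp == 0:                                   # zero and subnormals
        return (man / 8.0) * 2.0 ** -6
    return (1.0 + man / 8.0) * 2.0 ** (exp - 7)    # normal values, bias 7


# Zero plus the 126 positive finite E4M3 values (largest is 448).
E4M3_GRID = [e4m3_value(c) for c in range(127)]


def nearest(grid, v):
    """Round v to the nearest grid value (simplified tie-breaking)."""
    return min(grid, key=lambda g: abs(g - v))


def absmax_nvfp4(x, block=16):
    """AbsMax initialization: returns FP4 values q, block scales s, global scale S."""
    amax = max(abs(v) for v in x)
    S = amax / (448.0 * 6.0)                                        # Eq. (6)
    q, s = [], []
    for start in range(0, len(x), block):
        blk = x[start:start + block]
        blk_max = max(abs(v) for v in blk)
        s_k = nearest(E4M3_GRID, blk_max / (S * 6.0)) if S > 0 else 0.0   # Eq. (7)
        s.append(s_k)
        for v in blk:
            mag = nearest(FP4_GRID, abs(v) / (S * s_k)) if s_k > 0 else 0.0  # Eq. (8)
            q.append(math.copysign(mag, v))
    return q, s, S


def dequantize(q, s, S, block=16):
    """Reconstruction from Eq. (5): x_hat_i = q_i * s_{floor(i/block)} * S."""
    return [qi * s[i // block] * S for i, qi in enumerate(q)]
```

For a single block whose entries already lie on the FP4 grid with maximum magnitude 6, the block scale rounds to 448 and the reconstruction matches the input up to floating-point error.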

## 4 Method

### 4\.1 Optimization Objective

##### MSE Objective

For a weight tensor $W \in \mathbb{R}^{d_{\mathrm{in}} \times d_{\mathrm{out}}}$, reconstruction fidelity under NVFP4 quantization is naturally measured by the MSE objective\. The resulting optimization objective is

$$\mathcal{L}_{\mathrm{NVFP4}}^{\mathrm{MSE}}(\mathbf{s}, S; W) = \sum_{i=0}^{d_{\mathrm{in}}/16-1} \sum_{j=0}^{d_{\mathrm{out}}-1} \mathcal{L}\!\left(s_{i,j} \cdot S;\, W_{16i:16(i+1),\,j}\right). \tag{9}$$

##### WMSE Objective

For weight quantization, reconstruction error depends not only on the weight tensor itself, but also on the input activations propagated through the linear layer\. Consider the local reconstruction error of the linear layer

$$\|XW - X\hat{W}\|_F^2 = \mathrm{tr}\!\left( (W - \hat{W})^{T} X^{T} X (W - \hat{W}) \right), \tag{10}$$

where $X \in \mathbb{R}^{T \times d_{\mathrm{in}}}$ is the input activation matrix and $\hat{W} \in \mathbb{R}^{d_{\mathrm{in}} \times d_{\mathrm{out}}}$ is the quantized weight tensor\.

Directly optimizing Eq\. \([10](https://arxiv.org/html/2606.07618#S4.Ex2)\) is impractical for quantization parameter optimization because the full Gram matrix $X^{T}X$ couples quantization errors across different input channels\. Therefore, we adopt the diagonal approximation commonly used in model compression and quantization \(LeCun et al\., [1989](https://arxiv.org/html/2606.07618#bib.bib22); Guo et al\., [2022](https://arxiv.org/html/2606.07618#bib.bib23); Liu et al\., [2024](https://arxiv.org/html/2606.07618#bib.bib32)\), retaining only the diagonal entries of $X^{T}X$\. Under this approximation, the reconstruction objective becomes

$$\sum_{i=0}^{d_{\mathrm{in}}-1} \sum_{j=0}^{d_{\mathrm{out}}-1} \mathrm{Imp}_i \left( W_{i,j} - \hat{W}_{i,j} \right)^2, \tag{11}$$

where the importance score of the $i$\-th input channel is defined as

$$\mathrm{Imp}_i = (X_{:,i})^{T} X_{:,i} = \|X_{:,i}\|_2^2. \tag{12}$$

The resulting formulation naturally induces the following WMSE objective for NVFP4 quantization:

$$\mathcal{L}_{\mathrm{NVFP4}}^{\mathrm{WMSE}}(\mathbf{s}, S; W, \mathrm{Imp}) = \sum_{i=0}^{d_{\mathrm{in}}/16-1} \sum_{j=0}^{d_{\mathrm{out}}-1} \mathcal{L}\!\left(s_{i,j} \cdot S;\, W_{16i:16(i+1),\,j},\, \mathrm{Imp}_{16i:16(i+1)}\right). \tag{13}$$
Similarly, for activation quantization in linear layers, the importance score of each input channel can be derived from the squared norm of the corresponding input channel in the associated weight matrix\. For query state and KV cache quantization in attention operators, the attention output is computed as $\mathrm{softmax}(QK^{T})VW^{O}$, where the inputs are the query states $Q$, key cache $K$, value cache $V$, and output projection matrix $W^{O}$\. Accordingly, the importance scores for query states $Q$, key cache $K$, and value cache $V$ are derived from the input\-channel squared norms of $K$, $Q$, and $W^{O}$, respectively\.
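The per-channel importance score and the diagonally approximated reconstruction objective for weights can be sketched as follows. This is a plain-Python illustration with hypothetical function names; matrices are assumed to be lists of rows, with `X` of shape T x d_in and `W` of shape d_in x d_out.

```python
def channel_importance(X):
    """Imp_i = ||X[:, i]||_2^2: squared norm of each input channel of X."""
    d_in = len(X[0])
    return [sum(row[i] ** 2 for row in X) for i in range(d_in)]


def diag_reconstruction_error(W, W_hat, imp):
    """Diagonal approximation: sum_i sum_j Imp_i * (W[i][j] - W_hat[i][j])^2."""
    return sum(
        imp[i] * (W[i][j] - W_hat[i][j]) ** 2
        for i in range(len(W))
        for j in range(len(W[0]))
    )
```

Because the score depends only on the input channel index, every weight in row `i` of `W` shares the same weight `Imp_i`, which is what allows each 16-element micro-block to be optimized independently with its own slice of importance scores.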

### 4\.2 ScaleSweep

#### 4\.2\.1 Global Scale Selection

As shown in Eq\. \([9](https://arxiv.org/html/2606.07618#S4.Ex1)\) and Eq\. \([13](https://arxiv.org/html/2606.07618#S4.Ex3)\), once the global scale is fixed, the optimization of different micro\-blocks becomes independent\. The two\-level scaling mechanism in NVFP4 can be interpreted as a form of double quantization, where block scales are quantized into FP8 values and the global scale acts as the corresponding FP8 quantization scale\. Prior analyses of FP8 quantization \(Kuzmin et al\., [2022](https://arxiv.org/html/2606.07618#bib.bib25); Micikevicius et al\., [2022](https://arxiv.org/html/2606.07618#bib.bib24)\) show that the quantization error introduced by FP8 is relatively small\. Therefore, it is sufficient to select a global scale that preserves adequate dynamic range for block scale optimization\. Following Cook et al\. \([2026](https://arxiv.org/html/2606.07618#bib.bib3)\), we set

$$S = \frac{\max_i |x_i|}{256 \cdot 6}. \tag{14}$$

As illustrated in Figure [2](https://arxiv.org/html/2606.07618#S4.F2), different choices of global scale produce only marginal differences in normalized MSE for both AbsMax and ScaleSweep\.

![Refer to caption](https://arxiv.org/html/2606.07618v1/x2.png)

Figure 2: Normalized MSE between the original tensor and the quantized reconstructed tensor under different global scales for NVFP4, using AbsMax and ScaleSweep\.

Table 1: Bit\-pattern upper bounds for $\alpha = \frac{12}{7}$ in E4M3\. Here $b$ denotes the mantissa bits of $s_{\mathrm{base}}^{\mathrm{FP8}}$, $\mathrm{sup}$ denotes the carry bit concatenated with the mantissa bits of the largest possible $\left\lfloor \frac{12}{7} s_{\mathrm{base}} \right\rfloor_{\mathrm{E4M3}}$, and $\Delta = \mathrm{sup} - b$ denotes the upper bound on the bit difference relative to $b$\.

#### 4\.2\.2 Block Scale Optimization

For a single micro\-block $\mathbf{x} = \{x_i\}_{i=0}^{n-1}$ with $n = 16$, once the global scale $S$ is fixed, optimization can be equivalently performed on the normalized tensor $\mathbf{x} = \{x_i / S\}_{i=0}^{n-1}$, since the quantization loss differs only by a constant factor $S^2$\.

Since the block scale is represented in FP8 format, the number of positive finite representable values is limited to only $126$ candidates\. A natural strategy is therefore to sweep all candidate scales and select the one minimizing Eq\. \([2](https://arxiv.org/html/2606.07618#S3.E2)\)\. However, sweeping over the entire FP8 scale space is inefficient\. To improve optimization efficiency, it is desirable to derive tighter lower and upper bounds for the sweep range\. To this end, define the base scale as

$$s_{\mathrm{base}} = \frac{\max_i |x_i|}{6}, \quad s_{\mathrm{base}}^{\mathrm{FP8}} = \left\lfloor s_{\mathrm{base}} \right\rfloor_{\mathrm{FP8}}. \tag{15}$$

##### Block Scale Upper Bound

###### Lemma 4\.1\.

If $s > \frac{\max_i |x_i|}{3.5}$, then

$$\mathcal{L}(s; \mathbf{x}, \mathbf{w}) \geq \mathcal{L}(s/2; \mathbf{x}, \mathbf{w}). \tag{16}$$

Since $s_{\mathrm{base}} = \max_i |x_i| / 6$, the condition $s > \max_i |x_i| / 3.5$ is equivalent to $s > \frac{12}{7} s_{\mathrm{base}}$\. Based on Lemma [4\.1](https://arxiv.org/html/2606.07618#S4.Thmlemma1), when $s > \frac{12}{7} s_{\mathrm{base}}$, the scale $s$ cannot achieve a lower quantization loss than $s/2$\. Therefore, the upper bound of ScaleSweep is set to $\frac{12}{7} s_{\mathrm{base}}$\. Using the same technique, the corresponding upper bound for non\-FP4 formats can be derived as $2\times$ the baseline scale\.
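One informal way to see why Lemma 4.1 is plausible (an illustration only, not the proof given in the appendix): when s exceeds max|x_i|/3.5, every |x_i|/s is below 3.5 and so rounds to one of the levels {0, 0.5, 1, 1.5, 2, 3}; each of those levels multiplied by s is also a level of the FP4 grid scaled by s/2, so halving the scale cannot increase any element's error. The sketch below checks the inequality numerically; the helper names and the small tolerance for floating-point noise are assumptions of this example.

```python
# Non-negative FP4 E2M1 magnitudes; signs do not affect the squared error.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fp4_loss(x, s, w=None):
    """Weighted squared error of rounding x to the FP4 grid scaled by s."""
    w = w if w is not None else [1.0] * len(x)
    total = 0.0
    for xi, wi in zip(x, w):
        # Nearest FP4 magnitude to |x_i| / s (values above 6 clamp to 6).
        q = min(FP4_GRID, key=lambda g: abs(abs(xi) / s - g))
        total += wi * (abs(xi) - q * s) ** 2
    return total


def halving_never_hurts(x, s, w=None, tol=1e-12):
    """Check L(s) >= L(s/2), allowing a tiny tolerance for rounding noise."""
    return fp4_loss(x, s, w) >= fp4_loss(x, s / 2.0, w) - tol
```

The check is only expected to hold in the regime of the lemma, that is, for scales strictly larger than max|x_i|/3.5.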

###### Lemma 4\.2\.

Under the EeMm floating\-point format introduced in Appendix [A](https://arxiv.org/html/2606.07618#A1), for any $1 \leq \alpha \leq 2$ and any positive $x$, the following inequality holds:

$$I_{\mathrm{EeMm}}\!\left( \lfloor \alpha x \rfloor_{\mathrm{EeMm}} \right) - I_{\mathrm{EeMm}}\!\left( \lfloor x \rfloor_{\mathrm{EeMm}} \right) \leq \left\lceil \left(1 - \frac{1}{\alpha}\right) 2^{m+1} \right\rceil, \tag{17}$$

where $I_{\mathrm{EeMm}}(\cdot)$ denotes the integer interpretation of the floating\-point bit pattern of the EeMm value\.

When the sweep upper bound is set to $\frac{12}{7} s_{\mathrm{base}}$, the corresponding upper bound in the bit\-pattern space becomes $I_{\mathrm{E4M3}}\!\left(s_{\mathrm{base}}^{\mathrm{FP8}}\right) + 7$\. This upper bound is obtained from Lemma [4\.2](https://arxiv.org/html/2606.07618#S4.Thmlemma2) by setting $\alpha = \frac{12}{7}$, and the resulting upper bound for each mantissa field is summarized in Table [1](https://arxiv.org/html/2606.07618#S4.T1)\.
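As a quick check of the +7 offset, the bound in Lemma 4.2 can be instantiated for E4M3, which has m = 3 mantissa bits, with α = 12/7. The substitution below is worked arithmetic only and does not reproduce the per-mantissa entries of Table 1:

```latex
\left\lceil \left(1 - \frac{1}{\alpha}\right) 2^{m+1} \right\rceil
= \left\lceil \left(1 - \frac{7}{12}\right) \cdot 2^{4} \right\rceil
= \left\lceil \frac{5}{12} \cdot 16 \right\rceil
= \left\lceil \frac{20}{3} \right\rceil
= 7 .
```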

Table 2: Bit\-pattern lower bounds for FP8 E4M3 micro\-block scales when $n = 16$\. Here $b$ denotes the mantissa bits of $s_{\mathrm{base}}^{\mathrm{FP8}}$, $\mathrm{inf}$ denotes the borrow bit concatenated with the mantissa bits of the smallest possible $\left\lceil \frac{4}{5} s_{\mathrm{base}}^{\mathrm{FP8}} \right\rceil_{\mathrm{E4M3}}$, and $\Delta$ denotes the upper bound on the bit difference relative to $b$\.

##### Block Scale Lower Bound

###### Lemma 4\.3\.

For block size $16$, namely $n = 16$, when $\frac{11}{7} s_{\mathrm{base}}^{\mathrm{FP8}} \leq 448$, it follows that

$$\arg\min_{s \in \mathcal{G}_{\mathrm{FP8}}} \mathcal{L}(s; \mathbf{x}) \geq \frac{4}{5} s_{\mathrm{base}}^{\mathrm{FP8}}. \tag{18}$$

For the MSE objective, based on Lemma [4\.3](https://arxiv.org/html/2606.07618#S4.Thmlemma3), the lower bound of ScaleSweep can be set to $\frac{4}{5} s_{\mathrm{base}}^{\mathrm{FP8}}$\. Accordingly, the corresponding lower bound in the bit\-pattern space is $I_{\mathrm{E4M3}}\!\left(s_{\mathrm{base}}^{\mathrm{FP8}}\right) - 3$, and the resulting lower bound for each mantissa field is summarized in Table [2](https://arxiv.org/html/2606.07618#S4.T2)\.

For the WMSE objective, the optimal scale can become arbitrarily small if the weight associated with $\max_i |x_i|$ is sufficiently small\. Therefore, ScaleSweep empirically sets the lower bound to $\frac{1}{2} s_{\mathrm{base}}$, and the corresponding lower bound in the bit\-pattern space becomes $I_{\mathrm{E4M3}}\!\left(s_{\mathrm{base}}^{\mathrm{FP8}}\right) - 8$\.

In summary, for the MSE objective in Eq\. \([3](https://arxiv.org/html/2606.07618#S3.E3)\), the bit\-pattern sweep space uses $I_{\mathrm{E4M3}}\!\left(s_{\mathrm{base}}^{\mathrm{FP8}}\right)$ as the zero point with offset range $[-3, 7]$\. For the WMSE objective in Eq\. \([2](https://arxiv.org/html/2606.07618#S3.E2)\), the bit\-pattern sweep space also uses $I_{\mathrm{E4M3}}\!\left(s_{\mathrm{base}}^{\mathrm{FP8}}\right)$ as the zero point, but with offset range $[-8, 7]$\. The proofs of all lemmas are provided in Appendix [B](https://arxiv.org/html/2606.07618#A2)\.
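The bounded sweep summarized above can be sketched for one micro-block as follows. This is a minimal pure-Python illustration, not the authors' kernel: the function names are hypothetical, the block is assumed to be already normalized by the global scale, E4M3 is decoded with bias 7 and positive finite bit patterns 1 to 126, candidate codes are clamped to that range as an assumption, and ties in FP4 rounding are broken toward the smaller level.

```python
# Non-negative FP4 E2M1 magnitudes; signs do not affect the squared error.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def e4m3_value(code):
    """Decode a non-negative E4M3 bit pattern (0..126) into its real value."""
    exp, man = code >> 3, code & 0b111
    if exp == 0:                                   # zero and subnormals
        return (man / 8.0) * 2.0 ** -6
    return (1.0 + man / 8.0) * 2.0 ** (exp - 7)    # normal values, bias 7


def floor_e4m3_code(v):
    """Bit pattern of the largest E4M3 value not exceeding v (0 if none)."""
    code = 0
    for c in range(1, 127):
        if e4m3_value(c) <= v:
            code = c
    return code


def fp4_loss(x, s, w=None):
    """Weighted squared error of rounding x to the FP4 grid scaled by s."""
    w = w if w is not None else [1.0] * len(x)
    total = 0.0
    for xi, wi in zip(x, w):
        q = min(FP4_GRID, key=lambda g: abs(abs(xi) / s - g))
        total += wi * (abs(xi) - q * s) ** 2
    return total


def scale_sweep(x, w=None):
    """Sweep bit-pattern offsets around the base scale and keep the best one.

    Offsets are [-3, 7] for the unweighted (MSE) loss and [-8, 7] when
    weights are supplied (WMSE), following the ranges stated in the text.
    """
    lo_offset = -3 if w is None else -8
    base = floor_e4m3_code(max(abs(v) for v in x) / 6.0)   # code of s_base^FP8
    lo = max(1, base + lo_offset)                          # stay on positive codes
    hi = min(126, base + 7)                                # 126 decodes to 448
    best = min(range(lo, hi + 1), key=lambda c: fp4_loss(x, e4m3_value(c), w))
    return best, e4m3_value(best)
```

Because offset 0 is always among the candidates, the selected scale can never have a higher loss than the floor-rounded base scale; at most 11 candidates (MSE) or 16 candidates (WMSE) are evaluated per block instead of all 126.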

## 5 Experiment

| Setting | Method | Init Method | MMLU Pro | GSM8k | IFEval | HellaS | WinoG | Avg | Recovery (%) |
|---|---|---|---|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct | | | | | | | | | |
| - | BF16 | - | 46.59 | 84.69 | 73.94 | 79.53 | 73.80 | 71.71 | 100 |
| WA | RTN | AbsMax | 41.48 | 78.32 | 69.69 | 77.91 | 72.53 | 67.99 | 94.81 |
| WA | RTN | 4/6 | 42.29 | 80.59 | 67.28 | 78.52 | 73.01 | 68.34 | 95.30 |
| WA | RTN | ScaleSweep$_{\mathrm{MSE}}$ | 42.77 | 82.03 | 73.20 | 78.17 | 73.09 | 69.85 | 97.41 |
| WA | RTN | ScaleSweep | 43.09 | 79.68 | 70.79 | 78.27 | 72.45 | 68.86 | 96.03 |
| WA | GPTQ | AbsMax | 41.94 | 81.58 | 69.13 | 77.94 | 73.32 | 68.78 | 95.92 |
| WA | GPTQ | 4/6 | 42.46 | 80.67 | 70.98 | 78.01 | 73.24 | 69.07 | 96.33 |
| WA | GPTQ | ScaleSweep$_{\mathrm{MSE}}$ | 43.03 | 81.43 | 69.13 | 78.19 | 74.11 | 69.18 | 96.47 |
| WA | GPTQ | ScaleSweep | 42.70 | 81.50 | 74.49 | 78.35 | 72.93 | 69.99 | 97.61 |
| WAKV | RTN | AbsMax | 38.21 | 76.95 | 68.95 | 77.31 | 71.35 | 66.55 | 92.81 |
| WAKV | RTN | 4/6 | 39.29 | 78.70 | 66.54 | 77.58 | 72.14 | 66.85 | 93.23 |
| WAKV | RTN | ScaleSweep$_{\mathrm{MSE}}$ | 41.06 | 78.85 | 70.24 | 78.09 | 72.85 | 68.22 | 95.13 |
| WAKV | RTN | ScaleSweep | 41.29 | 79.91 | 69.69 | 78.12 | 72.53 | 68.31 | 95.26 |
| WAKV | GPTQ | AbsMax | 38.40 | 77.94 | 70.61 | 77.95 | 71.35 | 67.25 | 93.78 |
| WAKV | GPTQ | 4/6 | 40.26 | 80.06 | 69.13 | 77.78 | 72.22 | 67.89 | 94.68 |
| WAKV | GPTQ | ScaleSweep$_{\mathrm{MSE}}$ | 40.86 | 79.45 | 70.61 | 77.71 | 72.22 | 68.17 | 95.07 |
| WAKV | GPTQ | ScaleSweep | 41.26 | 79.98 | 71.90 | 77.54 | 72.77 | 68.69 | 95.80 |
| WAKVQ | RTN | AbsMax | 34.04 | 69.90 | 70.61 | 76.99 | 70.88 | 64.48 | 89.93 |
| WAKVQ | RTN | 4/6 | 36.69 | 74.22 | 68.02 | 77.43 | 72.14 | 65.70 | 91.62 |
| WAKVQ | RTN | ScaleSweep$_{\mathrm{MSE}}$ | 37.86 | 74.98 | 69.32 | 77.38 | 71.59 | 66.23 | 92.35 |
| WAKVQ | RTN | ScaleSweep | 39.72 | 77.33 | 69.13 | 77.43 | 71.27 | 66.98 | 93.40 |
| WAKVQ | GPTQ | AbsMax | 34.53 | 73.16 | 68.21 | 76.92 | 71.51 | 64.87 | 90.46 |
| WAKVQ | GPTQ | 4/6 | 36.45 | 76.57 | 67.65 | 77.62 | 72.14 | 66.09 | 92.16 |
| WAKVQ | GPTQ | ScaleSweep$_{\mathrm{MSE}}$ | 38.60 | 78.24 | 68.58 | 77.42 | 72.06 | 66.98 | 93.41 |
| WAKVQ | GPTQ | ScaleSweep | 39.67 | 77.33 | 70.24 | 77.67 | 71.82 | 67.35 | 93.92 |

Table 3: Comparison results of ScaleSweep and baseline initialization methods under RTN and GPTQ across different settings on Llama\-3\.1\-8B\-Instruct\.

### 5\.1 Experiment Setup

##### Models and metrics\.

We evaluate four instruction-tuned large language models: Llama-3.1-8B-Instruct, Llama-3.2-3B-Instruct (Meta AI, [2024a](https://arxiv.org/html/2606.07618#bib.bib20), [b](https://arxiv.org/html/2606.07618#bib.bib19)), Qwen3-8B, and Qwen3-4B (Yang et al., [2025](https://arxiv.org/html/2606.07618#bib.bib18)). All Qwen3 evaluations are conducted in non-thinking mode. We report results on five benchmarks to cover different aspects of model capability: MMLU Pro for world knowledge and reasoning with 5-shot prompting (Wang et al., [2024](https://arxiv.org/html/2606.07618#bib.bib12)), GSM8K for mathematical reasoning with 8-shot chain-of-thought prompting (Cobbe et al., [2021](https://arxiv.org/html/2606.07618#bib.bib16)), IFEval for instruction following in the 0-shot setting (Zhou et al., [2023](https://arxiv.org/html/2606.07618#bib.bib15)), and HellaSwag (HellaS) (Zellers et al., [2019](https://arxiv.org/html/2606.07618#bib.bib14)) together with WinoGrande (WinoG) (Sakaguchi et al., [2021](https://arxiv.org/html/2606.07618#bib.bib13)) for commonsense reasoning and language understanding in the 0-shot setting. Avg denotes the average score across the five benchmarks, and Recovery (%) denotes the performance recovery relative to the BF16 baseline.
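The two summary columns can be reproduced from the reported scores. The sketch below assumes that Recovery (%) is the ratio of the quantized Avg to the BF16 Avg, which the text does not state explicitly; the function names are illustrative and the numbers are copied from Table 3.

```python
def average(scores):
    """Arithmetic mean of the five benchmark scores."""
    return sum(scores) / len(scores)


def recovery(avg_quantized, avg_bf16):
    """Recovery (%) as a ratio of averages (assumed definition, not confirmed by the text)."""
    return 100.0 * avg_quantized / avg_bf16


# WA / RTN / AbsMax row of Table 3: MMLU Pro, GSM8K, IFEval, HellaS, WinoG.
wa_rtn_absmax = [41.48, 78.32, 69.69, 77.91, 72.53]
BF16_AVG = 71.71  # Avg reported for the BF16 baseline in Table 3

avg = average(wa_rtn_absmax)
print(round(avg, 2), round(recovery(avg, BF16_AVG), 2))
```

Under this assumption the computed values match the Avg and Recovery entries reported for that row.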

##### Quantization settings\.

We evaluate three quantization configurations. The first, denoted WA, applies NVFP4 quantization to both weights and activations. The second, denoted WAKV, further extends NVFP4 quantization to the KV cache. The third, denoted WAKVQ, additionally quantizes the query states using NVFP4. Together, these configurations enable a progressive evaluation of increasingly comprehensive NVFP4 quantization across model components. All experiments are conducted with simulated quantization on NVIDIA L40 GPUs.

##### Baselines and initialization methods\.

We evaluate ScaleSweep under the RTN and GPTQ (Frantar et al., [2023](https://arxiv.org/html/2606.07618#bib.bib4)) post-training quantization frameworks, where static activation reordering is applied for GPTQ. Calibration uses 128 sequences of length 2048 sampled from RedPajama (Weber et al., [2024](https://arxiv.org/html/2606.07618#bib.bib17)). For scale initialization and optimization baselines, we compare against AbsMax and 4/6 (Cook et al., [2026](https://arxiv.org/html/2606.07618#bib.bib3)). Additional implementation details are provided in Appendix [D](https://arxiv.org/html/2606.07618#A4).

![Refer to caption](https://arxiv.org/html/2606.07618v1/x3.png) Figure 3: Heatmaps of ScaleSweep under different lower and upper bounds in the bit-pattern space for the MSE and WMSE objectives. $\mathrm{MSE}[l,r]$ and $\mathrm{WMSE}[l,r]$ denote ScaleSweep on Llama-3.1-8B-Instruct using the MSE and WMSE objectives, respectively, with bit-pattern sweep bounds set to $[l,r]$.

### 5.2 Results

As shown in Table [3](https://arxiv.org/html/2606.07618#S5.T3) and Tables [8](https://arxiv.org/html/2606.07618#A5.T8) and [9](https://arxiv.org/html/2606.07618#A5.T9) in Appendix [E](https://arxiv.org/html/2606.07618#A5), ScaleSweep and ScaleSweep$_{\mathrm{MSE}}$ achieve competitive performance across different NVFP4 quantization settings and, in most cases, further narrow the gap to the BF16 baseline compared with AbsMax and 4/6 initialization. Moreover, ScaleSweep improves over ScaleSweep$_{\mathrm{MSE}}$ in most settings. Under the WA setting, the baseline initialization methods already achieve strong performance, while ScaleSweep still provides additional improvements for both RTN and GPTQ quantization. In particular, on Qwen3-8B, ScaleSweep further improves the recovery rate to $99.50\%$. Under the WAKV setting, ScaleSweep continues to provide incremental gains on top of already competitive baselines, improving the recovery rate by around $1\%$ to $2\%$ across different models and quantization backends in most cases. Under the more challenging WAKVQ setting, ScaleSweep demonstrates stronger robustness as the quantization difficulty increases, while maintaining high recovery rates across different models. In particular, Llama-3.2-3B-Instruct maintains a recovery rate of at least $93\%$ under both RTN and GPTQ quantization, while Qwen3-4B consistently achieves at least $94.5\%$. These results demonstrate that ScaleSweep remains effective under increasingly aggressive quantization settings.

##### Scale Sweep Range

We further evaluate the impact of different lower and upper bounds in ScaleSweep under both MSE and WMSE objectives\. As shown in Figure[3](https://arxiv.org/html/2606.07618#S5.F3), a narrower sweep range does not necessarily lead to worse performance and can occasionally achieve slightly better results\. Noticeable degradation appears only under more aggressive quantization settings\. This phenomenon suggests that the local surrogate metric used in ScaleSweep cannot perfectly characterize the final downstream performance, which remains an important direction for future research\.
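To make the role of the sweep bounds concrete, the following is a minimal NumPy sketch of a bit-pattern sweep for one 16-element block under the MSE objective. It is not the authors' implementation: it assumes that $l$ and $r$ are offsets relative to the bit pattern of the rounded-down AbsMax scale, it omits the per-tensor FP32 scale that NVFP4 also uses, it breaks rounding ties toward the smaller magnitude, and the default window `lo=-2, hi=4` is chosen purely for illustration. All function names are hypothetical.

```python
import numpy as np

# Nonnegative FP4 E2M1 magnitudes.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def e4m3_value(code: int) -> float:
    """Decode a positive FP8 E4M3 bit pattern (sign bit 0) with exponent bias 7."""
    a, b = code >> 3, code & 0b111
    if a == 0:  # subnormal
        return 2.0 ** (1 - 7) * (b / 8.0)
    return 2.0 ** (a - 7) * (1.0 + b / 8.0)


# Finite positive E4M3 values indexed by bit pattern. Code 127 is skipped on the
# assumption that it encodes NaN, as in the common E4M3 "FN" variant.
E4M3 = np.array([e4m3_value(c) for c in range(127)])


def fp4_fake_quant(x, scale):
    """Simulated quantization: snap x to the nearest point of scale * (+/- FP4_GRID)."""
    idx = np.abs(np.abs(x)[:, None] / scale - FP4_GRID[None, :]).argmin(axis=1)
    return np.sign(x) * FP4_GRID[idx] * scale


def sweep_block_scale(x, lo=-2, hi=4):
    """Return (best_code, best_mse, base_code) for one block x."""
    target = np.abs(x).max() / 6.0
    # Largest E4M3 value not exceeding AbsMax / 6 (round-down), kept nonzero.
    base = max(int(np.searchsorted(E4M3, target, side="right")) - 1, 1)
    best_code, best_mse = base, np.inf
    for code in range(max(base + lo, 1), min(base + hi, 126) + 1):
        mse = float(np.mean((x - fp4_fake_quant(x, E4M3[code])) ** 2))
        if mse < best_mse:
            best_code, best_mse = code, mse
    return best_code, best_mse, base


rng = np.random.default_rng(0)
x = rng.standard_normal(16)
code, mse, base = sweep_block_scale(x)
base_mse = float(np.mean((x - fp4_fake_quant(x, E4M3[base])) ** 2))
print(code - base, mse <= base_mse)
```

Because offset 0 lies inside the window, the selected candidate can never be worse than the base scale on the block-local objective; whether that translates into better downstream accuracy is exactly the question the heatmaps address.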

Table 4: Comparison results of ScaleSweep under different settings on Llama-3.1-8B-Instruct using RedPajama and the GSM8K training split as calibration datasets.
##### Calibration Data

We further assess the impact of using the GSM8K training split rather than RedPajama as the calibration dataset for GPTQ with ScaleSweep. As reported in Table [4](https://arxiv.org/html/2606.07618#S5.T4), the performance advantage on GSM8K becomes more pronounced under increasingly aggressive quantization, yielding an improvement of approximately $3$ points in the WAKVQ setting. Across the five evaluated benchmarks, the average performance also exhibits a modest improvement, primarily attributable to gains on GSM8K, while the results on the remaining four benchmarks remain largely consistent.

![Refer to caption](https://arxiv.org/html/2606.07618v1/x4.png) Figure 4: Performance gap (%) of ScaleSweep, initialization baselines, and FP8-quantized optimal FP32 scales relative to the optimal FP32 block scale in terms of MSE and WMSE for weight quantization and activation quantization on Llama-3.1-8B-Instruct. For both weight and activation quantization, results are averaged over all quantized tensors within each Transformer block.
##### Quantization Error Analysis

We evaluate the quantization error of different initialization methods against the optimal FP32 block scales across layers. As shown in Figure [4](https://arxiv.org/html/2606.07618#S5.F4) and Figure [6](https://arxiv.org/html/2606.07618#A5.F6) in Appendix [E](https://arxiv.org/html/2606.07618#A5), ScaleSweep achieves substantial improvements over AbsMax and 4/6, and in most cases even outperforms the FP8-quantized optimal FP32 scale. For the MSE objective, ScaleSweep achieves a relative gap below $10\%$ in nearly all cases. For the WMSE objective, the relative gap also remains below $10\%$ in nearly all cases for weight quantization, activation quantization, and value cache quantization.

![Refer to caption](https://arxiv.org/html/2606.07618v1/x5.png) Figure 5: NVFP4 quantization operator latency comparison across different batch sizes with hidden dimension 8192 on NVIDIA RTX PRO 6000 Blackwell GPUs.
##### Overhead Analysis

We further implemented preliminary Triton kernels for ScaleSweep and ScaleSweep$_{\mathrm{MSE}}$ and evaluated them under different batch sizes with hidden dimension 8192 against the default NVFP4 quantization operator in vLLM on NVIDIA RTX PRO 6000 Blackwell GPUs. The comparison results are presented in Figure [5](https://arxiv.org/html/2606.07618#S5.F5) and Table [7](https://arxiv.org/html/2606.07618#A5.T7) in Appendix [E](https://arxiv.org/html/2606.07618#A5). ScaleSweep$_{\mathrm{MSE}}$ achieves nearly identical latency to the default vLLM operator, while ScaleSweep incurs only approximately $1.37\times$ higher latency, indicating negligible overhead during practical inference. Since the operators are memory-bound, further optimization through kernel fusion remains feasible.

## 6 Conclusion

In this paper, we propose ScaleSweep, a scale sweep method for FP8 block scales in NVFP4 quantization that can be seamlessly integrated with existing PTQ methods such as RTN and GPTQ\. To the best of our knowledge, our study is the first to derive lower and upper bounds for block scale optimization under both the MSE and WMSE objectives in NVFP4 quantization\. Based on these bounds, ScaleSweep restricts the sweep range to a theoretically justified local neighborhood in the FP8 bit\-pattern space\. Experimental results demonstrate that ScaleSweep consistently outperforms existing initialization baselines in weight\-activation quantization, and further improves performance under more aggressive settings involving KV cache quantization and query state quantization\. Further experiments demonstrate that ScaleSweep introduces negligible overhead during inference\.

## Limitations

In this work, we focus on scale optimization for NVFP4 quantization\. Accordingly, the evaluation is restricted to NVFP4 settings and may not directly generalize to other low\-precision formats or quantization schemes with different scaling structures\. In addition, the effectiveness of quantization can vary depending on model characteristics and calibration data distributions, and the behavior of ScaleSweep in broader application scenarios remains to be further investigated\.

## Ethical considerations

This work does not present ethical concerns\. All large language models and datasets used in this study are publicly available and utilized strictly for research purposes in accordance with their licenses\. The datasets are carefully examined to avoid personally identifiable information and offensive content\. AI assistants are used only for polishing descriptive text and generating plotting code\.

## References

- Introducing NVFP4 for efficient and accurate low-precision inference. NVIDIA Technical Blog. Accessed: 2026-05-18. [Link](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/)
- S. Ashkboos, A. Mohtashami, M. L. Croci, B. Li, P. Cameron, M. Jaggi, D. Alistarh, T. Hoefler, and J. Hensman (2024) Quarot: outlier-free 4-bit inference in rotated llms. Advances in Neural Information Processing Systems 37, pp. 100213–100240.
- J. Chee, Y. Cai, V. Kuleshov, and C. M. De Sa (2023) Quip: 2-bit quantization of large language models with guarantees. Advances in Neural Information Processing Systems 36, pp. 4396–4429.
- M. Chen, M. Wu, H. Jin, Z. Yuan, J. Liu, C. Zhang, Y. Li, J. Huang, J. Ma, Z. Xue, Z. Liu, X. Bin, and P. Luo (2025) INT v.s. fp: a comprehensive study of fine-grained low-bit quantization formats. arXiv:2510.25602. [Link](https://arxiv.org/abs/2510.25602)
- Y. Chen, Y. Liu, X. Xu, P. Zhang, M. Beyer, M. Rapp, J. Zhu, and J. Chen (2026) TetraJet-v2: accurate nvfp4 training for large language models with oscillation suppression and outlier control. arXiv:2510.27527. [Link](https://arxiv.org/abs/2510.27527)
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman (2021) Training verifiers to solve math word problems. arXiv:2110.14168. [Link](https://arxiv.org/abs/2110.14168)
- J. Cook, J. Guo, G. Xiao, Y. Lin, K. Wyss, M. Nazemi, A. Mishra, C. del Mundo, T. Blankevoort, and S. Han (2026) Four over six: more accurate nvfp4 quantization with adaptive block scaling. arXiv:2512.02010. [Link](https://arxiv.org/abs/2512.02010)
- V. Egiazarian, R. L. Castro, D. Kuznedelev, A. Panferov, E. Kurtic, S. Pandit, A. N. Marques, M. Kurtz, S. Ashkboos, T. Hoefler, and D. Alistarh (2026) Bridging the gap between promise and performance for microscaling FP4 quantization. In The Fourteenth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=zCBGe9AqJZ)
- E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh (2023) GPTQ: accurate post-training quantization for generative pre-trained transformers. arXiv:2210.17323. [Link](https://arxiv.org/abs/2210.17323)
- L. Gao, J. Tow, B. Abbasi, S. Biderman, S. Black, A. DiPofi, C. Foster, L. Golding, J. Hsu, A. Le Noac’h, H. Li, K. McDonell, N. Muennighoff, C. Ociepa, J. Phang, L. Reynolds, H. Schoelkopf, A. Skowron, L. Sutawika, E. Tang, A. Thite, B. Wang, K. Wang, and A. Zou (2024) The language model evaluation harness. Zenodo. [Document](https://dx.doi.org/10.5281/zenodo.12608602), [Link](https://zenodo.org/records/12608602)
- C. Guo, Y. Qiu, J. Leng, X. Gao, C. Zhang, Y. Liu, F. Yang, Y. Zhu, and M. Guo (2022) SQuant: on-the-fly data-free quantization via diagonal hessian approximation. In International Conference on Learning Representations. [Link](https://openreview.net/forum?id=JXhROKNZzOc)
- X. Hu, Y. Cheng, D. Yang, Z. Chen, Z. Xu, JiangyongYu, XUCHEN, Z. Yuan, Z. jiang, and S. Zhou (2025) OSTQuant: refining large language model quantization with orthogonal and scaling transformations for better distribution fitting. In The Thirteenth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=rAcgDBdKnP)
- R. Krishnamoorthi (2018) Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv:1806.08342. [Link](https://arxiv.org/abs/1806.08342)
- A. Kuzmin, M. Van Baalen, Y. Ren, M. Nagel, J. Peters, and T. Blankevoort (2022) Fp8 quantization: the power of the exponent. Advances in Neural Information Processing Systems 35, pp. 14651–14662.
- Y. LeCun, J. Denker, and S. Solla (1989) Optimal brain damage. Advances in neural information processing systems 2.
- L. Lin, X. Hu, and X. Wan (2026) NeUQI: near-optimal uniform quantization parameter initialization for low-bit llms. arXiv:2505.17595. [Link](https://arxiv.org/abs/2505.17595)
- Y. Liu, J. Wen, Y. Wang, S. Ye, L. L. Zhang, T. Cao, C. Li, and M. Yang (2024) Vptq: extreme low-bit vector post-training quantization for large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 8181–8196.
- Z. Liu, C. Zhao, I. Fedorov, B. Soran, D. Choudhary, R. Krishnamoorthi, V. Chandra, Y. Tian, and T. Blankevoort (2025) Spinquant: llm quantization with learned rotations. In International Conference on Learning Representations, Vol. 2025, pp. 92009–92032.
- V. Malinovskii, D. Mazur, I. Ilin, D. Kuznedelev, K. Burlachenko, K. Yi, D. Alistarh, and P. Richtarik (2024) Pv-tuning: beyond straight-through estimation for extreme llm compression. Advances in Neural Information Processing Systems 37, pp. 5074–5121.
- Meta AI (2024a) Introducing Llama 3.1: our most capable models to date. Accessed: 2026-05-18. [Link](https://ai.meta.com/blog/meta-llama-3-1/)
- Meta AI (2024b) Llama 3.2: revolutionizing edge ai and vision with open, customizable models. Accessed: 2026-05-18. [Link](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices)
- P. Micikevicius, D. Stosic, N. Burgess, M. Cornea, P. Dubey, R. Grisenthwaite, S. Ha, A. Heinecke, P. Judd, J. Kamalu, N. Mellempudi, S. Oberman, M. Shoeybi, M. Siu, and H. Wu (2022) FP8 formats for deep learning. arXiv:2209.05433. [Link](https://arxiv.org/abs/2209.05433)
- NVIDIA, F. Abecassis, A. Agrusa, D. Ahn, J. Alben, S. Alborghetti, M. Andersch, S. Arayandi, A. Bjorlin, A. Blakeman, E. Briones, I. Buck, B. Catanzaro, M. Chang, J. Choi, M. Chrzanowski, E. Chung, V. Cui, S. Dai, B. D. Rouhani, C. del Mundo, D. Donia, B. Eryilmaz, H. Estela, A. Goel, O. Goncharov, Y. Guvvala, R. Hesse, R. Hewett, H. Hum, U. Kapasi, B. Khailany, M. Khona, N. Knight, A. Kondratenko, R. Krashinsky, B. Lanir, S. Layton, M. Lightstone, D. Lo, P. Micikevicius, A. Mishra, T. Moon, D. Narayanan, C. Ni, A. Paithankar, S. Pasumarthi, A. Patel, M. Patwary, A. Poojary, G. Prasad, S. Priyadarshi, Y. Qin, X. Ren, O. Rybakov, C. Sakr, S. Satheesh, S. Sergienko, P. Shamis, K. Shankar, N. Sharma, M. Shoeybi, M. Siu, M. Smelyanskiy, D. Stosic, D. Stosic, B. Su, F. Sun, N. Tajbakhsh, S. Thomas, P. Tredak, E. Tsykunov, G. Vaithilingam, A. Vavre, R. Venkatesan, R. Waleffe, Q. Wan, H. Wang, M. Wang, L. Wei, H. Wu, E. Wu, K. Wyss, N. Xu, J. Xue, C. Yang, Y. Zhai, R. Zhang, J. Zhu, and Z. Zhu (2026) Pretraining large language models with nvfp4. arXiv:2509.25149. [Link](https://arxiv.org/abs/2509.25149)
- Open Compute Project (2023) OCP microscaling formats (MX) specification. Open Compute Project. Version 1.0; contributed by AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm. [Link](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf)
- K. Sakaguchi, R. Le Bras, C. Bhagavatula, and Y. Choi (2021) WinoGrande: an adversarial winograd schema challenge at scale. Commun. ACM 64(9), pp. 99–106. ISSN 0001-0782. [Link](https://doi.org/10.1145/3474381), [Document](https://dx.doi.org/10.1145/3474381)
- Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, et al. (2024) Mmlu-pro: a more robust and challenging multi-task language understanding benchmark. Advances in Neural Information Processing Systems 37, pp. 95266–95290.
- M. Weber, D. Y. Fu, Q. Anthony, Y. Oren, S. Adams, A. Alexandrov, X. Lyu, H. Nguyen, X. Yao, V. Adams, et al. (2024) Redpajama: an open dataset for training large language models. Advances in neural information processing systems 37, pp. 116462–116492.
- G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han (2023) Smoothquant: accurate and efficient post-training quantization for large language models. In International conference on machine learning, pp. 38087–38099.
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. Li, T. Tang, W. Yin, X. Ren, X. Wang, X. Zhang, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Zhang, Y. Wan, Y. Liu, Z. Wang, Z. Cui, Z. Zhang, Z. Zhou, and Z. Qiu (2025) Qwen3 technical report. arXiv:2505.09388. [Link](https://arxiv.org/abs/2505.09388)
- R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi (2019) HellaSwag: can a machine really finish your sentence?. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, A. Korhonen, D. Traum, and L. Màrquez (Eds.), Florence, Italy, pp. 4791–4800. [Link](https://aclanthology.org/P19-1472/), [Document](https://dx.doi.org/10.18653/v1/P19-1472)
- T. Zhang and A. Shrivastava (2025) LeanQuant: accurate and scalable large language model quantization with loss-error-aware grid. In The Thirteenth International Conference on Learning Representations. [Link](https://openreview.net/forum?id=ISqx8giekS)
- J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou (2023) Instruction-following evaluation for large language models. arXiv:2311.07911. [Link](https://arxiv.org/abs/2311.07911)

## Appendix A Floating-Point Representation

EeMm denotes a floating-point format consisting of one sign bit, $e$ exponent bits, and $m$ mantissa bits. Unless otherwise specified, the exponent bias is defined as

$$B = 2^{e-1} - 1. \tag{19}$$

Only finite floating-point numbers are considered throughout this work. Therefore, the special encodings corresponding to infinity and NaN are excluded.

For a floating-point number with sign bit $c$, exponent field $a$, and mantissa field $b$, the represented finite value is denoted by $v(c,a,b)$ and defined as

$$v(c,a,b) = \begin{cases} (-1)^{c}\, 2^{a-B} \left(1 + \dfrac{b}{2^{m}}\right), & a \neq 0, \\[6pt] (-1)^{c}\, 2^{1-B}\, \dfrac{b}{2^{m}}, & a = 0, \end{cases} \tag{20}$$

where $c \in \{0,1\}$, $a \in [2^{e}]$, and $b \in [2^{m}]$.

The set of all representable finite values in the EeMm format is denoted by

$$\mathcal{G}_{\mathrm{EeMm}} = \{ v(c,a,b) \mid c \in \{0,1\},\ a \in [2^{e}],\ b \in [2^{m}] \}. \tag{21}$$

The corresponding integer interpretation of the bit pattern of the value is defined as

$$I_{\mathrm{EeMm}}(v(c,a,b)) = c \cdot 2^{e+m} + a \cdot 2^{m} + b. \tag{22}$$
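As a worked instance of Eqs. (19), (20), and (22), the sketch below enumerates the FP4 E2M1 format. It assumes that $[n]$ denotes $\{0, \dots, n-1\}$; the function names are illustrative. E2M1 is used because it has no infinity or NaN encodings, so no special cases need to be filtered out.

```python
from itertools import product


def ee_mm_value(c: int, a: int, b: int, e: int, m: int) -> float:
    """Value v(c, a, b) of Eq. (20) for an EeMm format with bias B = 2**(e-1) - 1."""
    bias = 2 ** (e - 1) - 1
    sign = -1.0 if c else 1.0
    if a == 0:  # subnormal: no implicit leading one
        return sign * 2.0 ** (1 - bias) * (b / 2 ** m)
    return sign * 2.0 ** (a - bias) * (1 + b / 2 ** m)


def bit_pattern(c: int, a: int, b: int, e: int, m: int) -> int:
    """Integer interpretation I_EeMm of Eq. (22)."""
    return c * 2 ** (e + m) + a * 2 ** m + b


# Enumerate FP4 E2M1: every (c, a, b) combination is a finite value.
e, m = 2, 1
fp4 = {
    bit_pattern(c, a, b, e, m): ee_mm_value(c, a, b, e, m)
    for c, a, b in product(range(2), range(2 ** e), range(2 ** m))
}
# Bit patterns below 2**(e+m) have sign bit 0.
positive = sorted(v for code, v in fp4.items() if code < 2 ** (e + m))
print(positive)  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

For nonnegative values the bit pattern increases monotonically with the represented value, which is what makes a sweep over consecutive bit patterns a sweep over neighboring scales.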

## Appendix B Proofs

### B.1 Proof of Lemma [4.1](https://arxiv.org/html/2606.07618#S4.Thmlemma1)

###### Proof.

Since $s > \max_{i} |x_i| / 3.5$, every element satisfies $|x_i| < 3.5s$. Under nearest-point FP4 quantization with scale $s$, no element can therefore be rounded to $\pm 4s$, whose nearest decision boundary begins at $3.5s$. Consequently, every activated quantized value under $\mathrm{FP4}(s)$ belongs to

$$\{0, \pm 0.5s, \pm s, \pm 1.5s, \pm 2s, \pm 3s\}. \tag{23}$$

This set is entirely contained in $\mathcal{G}_{\mathrm{FP4}(s/2)}$. Hence, for each $x_i$, the quantized value $\left\lfloor x_i \right\rceil_{\mathrm{FP4}(s)}$ is also feasible under $\mathrm{FP4}(s/2)$. By the optimality of nearest-point quantization,

$$\left| x_i - \left\lfloor x_i \right\rceil_{\mathrm{FP4}(s/2)} \right|^{2} \leq \left| x_i - \left\lfloor x_i \right\rceil_{\mathrm{FP4}(s)} \right|^{2}. \tag{24}$$

Multiplying both sides by $w_i \geq 0$ and summing over all elements yields

$$\mathcal{L}(s/2; \mathbf{x}, \mathbf{w}) \leq \mathcal{L}(s; \mathbf{x}, \mathbf{w}). \tag{25}$$

∎
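The inequality in Eq. (25) is easy to check numerically. The sketch below is an illustration rather than part of the proof: it draws random blocks and nonnegative weights, picks a real-valued scale $s$ above $\max_i |x_i| / 3.5$ (ignoring the FP8 rounding of the scale, which the argument above does not rely on), and compares the weighted losses at $s$ and $s/2$. The helper name `weighted_loss` is hypothetical.

```python
import numpy as np

# Nonnegative FP4 E2M1 magnitudes; signs are handled through |x|.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def weighted_loss(x, w, scale):
    """Weighted squared error of nearest-point FP4 quantization with the given scale."""
    grid = scale * FP4_GRID
    err = np.abs(np.abs(x)[:, None] - grid[None, :]).min(axis=1)
    return float(np.sum(w * err ** 2))


rng = np.random.default_rng(0)
worst_gap = -np.inf
for _ in range(1000):
    x = rng.standard_normal(16)
    w = rng.random(16)  # nonnegative weights
    # Any scale strictly above max|x| / 3.5 satisfies the lemma's condition.
    s = np.abs(x).max() / 3.5 * rng.uniform(1.001, 4.0)
    gap = weighted_loss(x, w, s / 2) - weighted_loss(x, w, s)
    worst_gap = max(worst_gap, gap)

print(worst_gap <= 1e-12)
```

Halving never hurts in this regime because the grid of $\mathrm{FP4}(s/2)$ contains every point of $\mathrm{FP4}(s)$ that can actually be selected.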

### B.2 Proof of Lemma [4.2](https://arxiv.org/html/2606.07618#S4.Thmlemma2)

###### Proof.

Let the exponent field have $e$ bits and the mantissa field have $m$ bits. Define $M = 2^{m}$.

For any positive EeMm number, let $a$ denote the exponent field and let $b$ denote the mantissa field, where $0 \leq b < M$. If the number is exactly representable in EeMm, then $b$ is an integer. Since the proof also considers arbitrary positive real numbers $x$, this notation is extended to non-representable values by allowing $b$ to be real.

##### Subnormal case.

Assume that $x$ lies in the subnormal range. Then

$$I_{\mathrm{EeMm}}\!\left(\lfloor x \rfloor_{\mathrm{EeMm}}\right) = \lfloor b \rfloor. \tag{26}$$

Since $\alpha \leq 2$, we have $\alpha b < 2M$, and therefore

$$I_{\mathrm{EeMm}}\!\left(\lfloor \alpha x \rfloor_{\mathrm{EeMm}}\right) \leq \lfloor \alpha b \rfloor. \tag{27}$$

Hence,

$$I_{\mathrm{EeMm}}\!\left(\lfloor \alpha x \rfloor_{\mathrm{EeMm}}\right) - I_{\mathrm{EeMm}}\!\left(\lfloor x \rfloor_{\mathrm{EeMm}}\right) \leq \lfloor \alpha b \rfloor - \lfloor b \rfloor. \tag{28}$$

Using

$$\lfloor \alpha b \rfloor - \lfloor b \rfloor \leq \lceil (\alpha - 1) b \rceil, \tag{29}$$

together with $b < M$, we obtain

$$\lfloor \alpha b \rfloor - \lfloor b \rfloor \leq \left\lceil (\alpha - 1) M \right\rceil. \tag{30}$$

Since $1 \leq \alpha \leq 2$,

$$(\alpha - 1) M \leq \left(1 - \frac{1}{\alpha}\right) 2M. \tag{31}$$

Therefore,

$$I_{\mathrm{EeMm}}\!\left(\lfloor \alpha x \rfloor_{\mathrm{EeMm}}\right) - I_{\mathrm{EeMm}}\!\left(\lfloor x \rfloor_{\mathrm{EeMm}}\right) \leq \left\lceil \left(1 - \frac{1}{\alpha}\right) 2M \right\rceil. \tag{32}$$

##### Normal case below the binade boundary.

Assume that $x$ lies in the normal range. Then

$$I_{\mathrm{EeMm}}\!\left(\lfloor x \rfloor_{\mathrm{EeMm}}\right) = aM + \lfloor b \rfloor. \tag{33}$$

Suppose that multiplying by $\alpha$ does not cross into the next exponent interval:

$$\alpha (M + b) \leq 2M. \tag{34}$$

Then

$$I_{\mathrm{EeMm}}\!\left(\lfloor \alpha x \rfloor_{\mathrm{EeMm}}\right) \leq aM + \left\lfloor \alpha (M + b) - M \right\rfloor. \tag{35}$$

Therefore, since $\alpha (M + b) - M = b + (\alpha - 1)(M + b)$,

$$I_{\mathrm{EeMm}}\!\left(\lfloor \alpha x \rfloor_{\mathrm{EeMm}}\right) - I_{\mathrm{EeMm}}\!\left(\lfloor x \rfloor_{\mathrm{EeMm}}\right) \leq \left\lfloor b + (\alpha - 1)(M + b) \right\rfloor - \lfloor b \rfloor. \tag{36}$$

Using

$$\left\lfloor b + (\alpha - 1)(M + b) \right\rfloor - \lfloor b \rfloor \leq \left\lceil (\alpha - 1)(M + b) \right\rceil, \tag{37}$$

together with $\alpha (M + b) \leq 2M$, which gives $(\alpha - 1)(M + b) \leq \left(1 - \frac{1}{\alpha}\right) 2M$, we obtain

$$\left\lfloor b + (\alpha - 1)(M + b) \right\rfloor - \lfloor b \rfloor \leq \left\lceil \left(1 - \frac{1}{\alpha}\right) 2M \right\rceil. \tag{38}$$

Hence,

$$I_{\mathrm{EeMm}}\!\left(\lfloor \alpha x \rfloor_{\mathrm{EeMm}}\right) - I_{\mathrm{EeMm}}\!\left(\lfloor x \rfloor_{\mathrm{EeMm}}\right) \leq \left\lceil \left(1 - \frac{1}{\alpha}\right) 2M \right\rceil. \tag{39}$$

##### Normal case beyond the binade boundary.

Now assume $\alpha (M + b) > 2M$. At the binade boundary $\alpha (M + b) = 2M$, the desired upper bound has already been obtained in the previous case. Moreover, $\lfloor \alpha x \rfloor_{\mathrm{EeMm}}$ is exactly the first representable value on this boundary. After this point, increasing $I_{\mathrm{EeMm}}\!\left(\lfloor \alpha x \rfloor_{\mathrm{EeMm}}\right)$ by one requires an increase of $2/\alpha$ in the mantissa field of $x$, while increasing $I_{\mathrm{EeMm}}\!\left(\lfloor x \rfloor_{\mathrm{EeMm}}\right)$ by one requires an increase of $1$. Since $2/\alpha \geq 1$, the difference cannot exceed its value at the boundary. Hence,

$$I_{\mathrm{EeMm}}\!\left(\lfloor \alpha x \rfloor_{\mathrm{EeMm}}\right) - I_{\mathrm{EeMm}}\!\left(\lfloor x \rfloor_{\mathrm{EeMm}}\right) \leq \left\lceil \left(1 - \frac{1}{\alpha}\right) 2M \right\rceil. \tag{40}$$

∎
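The bound proved above can be spot-checked by brute force. The following sketch is an illustration under stated assumptions, not part of the proof: it uses FP8 E4M3 with bias 7 (so $M = 8$), restricts to the finite positive bit patterns 0 to 126, and samples $x$ so that $\alpha x$ stays below the largest finite value. The helper names are hypothetical.

```python
import math

import numpy as np

M = 8  # 2**m with m = 3 mantissa bits (E4M3)


def e4m3_value(code: int) -> float:
    """Decode a positive E4M3 bit pattern with exponent bias 7."""
    a, b = code >> 3, code & 7
    if a == 0:
        return 2.0 ** (1 - 7) * b / M
    return 2.0 ** (a - 7) * (1 + b / M)


E4M3 = np.array([e4m3_value(c) for c in range(127)])


def floor_code(x: float) -> int:
    """Bit pattern of the largest E4M3 value that does not exceed x."""
    return int(np.searchsorted(E4M3, x, side="right")) - 1


rng = np.random.default_rng(0)
max_excess = -math.inf
for _ in range(20000):
    alpha = rng.uniform(1.0, 2.0)
    x = 2.0 ** rng.uniform(-10.0, 7.0)  # alpha * x <= 256, below the maximum 448
    gap = floor_code(alpha * x) - floor_code(x)
    bound = math.ceil((1.0 - 1.0 / alpha) * 2 * M)
    max_excess = max(max_excess, gap - bound)

print(max_excess <= 0)
```

The sampled range covers subnormal inputs, normal inputs that stay inside their binade, and inputs that cross a binade boundary, which are the three cases treated in the proof.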

### B.3 Proof of Lemma [4.3](https://arxiv.org/html/2606.07618#S4.Thmlemma3)

###### Proof.

The proof consists of first excluding all ratios satisfying $r \leq \tfrac{3}{4}$, and then eliminating the remaining ratios in the interval $(\tfrac{3}{4}, \tfrac{4}{5})$ one by one.

##### Notation

Let

$$s_{0} = s_{\mathrm{base}}^{\mathrm{FP8}} = \left\lfloor \frac{\max_{i} |x_i|}{6} \right\rfloor_{\mathrm{FP8}}. \tag{41}$$

By symmetry of the FP4 value set, it suffices to consider only nonnegative values. Define the nonnegative FP4 E2M1 grid as

$$\mathcal{G} = \left\{0, \tfrac{1}{2}, 1, \tfrac{3}{2}, 2, 3, 4, 6\right\}. \tag{42}$$

The normalized single-coordinate quantization loss is defined by

$$d_{r}(y) = \min_{g \in \mathcal{G}} (y - r g)^{2}. \tag{43}$$

Let $y_i = |x_i| / s_0$ and define the scale ratio $r = s / s_0$. Then minimizing $\mathcal{L}(s; \mathbf{x})$ over FP8 scales $s$ is equivalent to minimizing

$$\bar{\mathcal{L}}(r; \mathbf{y}) = \sum_{i=0}^{15} d_{r}(y_i) \tag{44}$$

over the corresponding FP8 ratios $r$. Furthermore, suppose that

$$s_{0} = \frac{m}{8}\, 2^{a}, \qquad m \in \{8, \dots, 15\}. \tag{45}$$

Then the normalized coordinates satisfy

$$\max_{i} y_i = 6\alpha, \qquad 1 \leq \alpha < \rho_{m}, \tag{46}$$

where $\rho_{m} = \frac{m+1}{m}$.

Table 5: Convex certificates used to exclude the remaining FP8 E4M3 scale ratios in $(3/4,4/5)$. For each candidate ratio $r$, the listed FP8 ratios $\{u_{j}\}$ and weights $\{\lambda_{j}\}$ satisfy $\sum_{j}\lambda_{j}=1$. The positive margin $A_{r}-15B_{r}$ proves that the weighted average loss over $\{u_{j}\}$ is strictly smaller than the loss at $r$ for any block of size $16$.

##### Proof sketch.

For each candidate ratio $r<\tfrac{5}{4}$, we find another valid ratio $r'$ satisfying

$$r<r'\leq\tfrac{11}{4}, \tag{47}$$

and show that

$$\bar{\mathcal{L}}(r';\mathbf{y})\leq\bar{\mathcal{L}}(r;\mathbf{y}) \tag{48}$$

for all admissible normalized coordinates $\mathbf{y}$. The worst-case increase of

$$\bar{\mathcal{L}}(r';\mathbf{y})-\bar{\mathcal{L}}(r;\mathbf{y}) \tag{49}$$

is obtained by combining the largest possible contribution from the maximal coordinate with the largest possible contribution from each of the remaining $15$ coordinates. Therefore, it suffices to verify

$$\begin{aligned}
A_{r',r}&=\sup_{6\leq y\leq 6\rho_{m}}\left(d_{r'}(y)-d_{r}(y)\right),\\
B_{r',r}&=\sup_{0\leq y\leq 6\rho_{m}}\left(d_{r'}(y)-d_{r}(y)\right),\\
A_{r',r}+15B_{r',r}&\leq 0.
\end{aligned} \tag{50}$$

##### Excluding $r\leq\tfrac{3}{4}$.

We first consider ratios satisfying $r\leq\tfrac{3}{4}$, for which we choose $r'=2r$. The grids induced by $r$ and $2r$ share most reconstruction points, while the grid associated with $r$ additionally contains the intermediate points $\tfrac{1}{2}r$ and $\tfrac{3}{2}r$. Therefore,

$$d_{2r}(y)-d_{r}(y)\leq\frac{r^{2}}{4},\qquad 0\leq y\leq 6\rho_{m}, \tag{51}$$

where equality is attained at $y=\tfrac{1}{2}r$ and $y=\tfrac{3}{2}r$. Consequently,

$$B_{2r,r}=\sup_{0\leq y\leq 6\rho_{m}}\left(d_{2r}(y)-d_{r}(y)\right)=\frac{r^{2}}{4}. \tag{52}$$

Therefore, replacing $r$ with $2r$ increases the total loss on the remaining $15$ coordinates by at most $15r^{2}/4$.

Now consider the maximal coordinate $y_{\max}=6\alpha$. Since $r\leq\tfrac{3}{4}$ and $\alpha\geq 1$, the largest reconstruction value achievable under scale $r$ is $6r\leq\tfrac{9}{2}<6\leq 6\alpha$, so it is the grid point nearest to $y_{\max}$, which gives

$$d_{r}(6\alpha)=(6\alpha-6r)^{2}. \tag{53}$$

Under scale $2r$, selecting the FP4 grid point $4$ yields reconstruction value $8r$, and hence

$$d_{2r}(6\alpha)\leq(6\alpha-8r)^{2}. \tag{54}$$

Therefore,

$$\begin{aligned}
d_{2r}(6\alpha)-d_{r}(6\alpha)&\leq(6\alpha-8r)^{2}-(6\alpha-6r)^{2}\\
&=-24\alpha r+28r^{2}.
\end{aligned} \tag{55}$$

Using $\alpha\geq 1$ and $r\leq\tfrac{3}{4}$ yields

$$\begin{aligned}
A_{2r,r}&=\sup_{6\leq y\leq 6\rho_{m}}\left(d_{2r}(y)-d_{r}(y)\right)\\
&\leq-24\alpha r+28r^{2}\\
&\leq-24r+28r^{2}.
\end{aligned} \tag{56}$$

Combining this bound with

$$B_{2r,r}=\frac{r^{2}}{4} \tag{57}$$

gives

$$\begin{aligned}
A_{2r,r}+15B_{2r,r}&\leq-24r+28r^{2}+\frac{15r^{2}}{4}\\
&=-r\left(24-\frac{127}{4}r\right).
\end{aligned} \tag{58}$$

Since $r\leq\tfrac{3}{4}$,

$$24-\frac{127}{4}r\geq 24-\frac{127}{4}\cdot\frac{3}{4}=\frac{3}{16}>0. \tag{59}$$

Hence,

$$A_{2r,r}+15B_{2r,r}<0, \tag{60}$$

which implies

$$\bar{\mathcal{L}}(2r;\mathbf{y})<\bar{\mathcal{L}}(r;\mathbf{y}). \tag{61}$$

Therefore, no ratio $r\leq\tfrac{3}{4}$ can be optimal.
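The bound above can be checked with exact rational arithmetic. The following is a minimal sketch, not the authors' code: the helper names `d`, `breakpoints`, `sup_diff`, and `margin` are introduced here for illustration. It relies on the fact that $d_{2r}(y)-d_{r}(y)$ is continuous and piecewise affine in $y$ (the $y^{2}$ terms cancel), so each supremum in Eq. (50) is attained at an interval endpoint or at a midpoint between neighbouring reconstruction points.

```python
from fractions import Fraction as F

# Nonnegative FP4 E2M1 grid from Eq. (42).
GRID = [F(0), F(1, 2), F(1), F(3, 2), F(2), F(3), F(4), F(6)]


def d(r, y):
    """Normalized single-coordinate loss d_r(y) from Eq. (43)."""
    return min((y - r * g) ** 2 for g in GRID)


def breakpoints(r):
    """Midpoints between neighbouring reconstruction points r*g."""
    pts = [r * g for g in GRID]
    return [(a + b) / 2 for a, b in zip(pts, pts[1:])]


def sup_diff(r_new, r_old, lo, hi):
    """Exact sup of d_{r_new}(y) - d_{r_old}(y) over lo <= y <= hi.

    The difference is piecewise affine, so only the endpoints and the
    breakpoints of either grid need to be evaluated.
    """
    candidates = {lo, hi}
    for y in breakpoints(r_new) + breakpoints(r_old):
        if lo <= y <= hi:
            candidates.add(y)
    return max(d(r_new, y) - d(r_old, y) for y in candidates)


def margin(r, m):
    """A_{2r,r} + 15 * B_{2r,r} from Eq. (50), with rho_m = (m + 1) / m."""
    rho = F(m + 1, m)
    a = sup_diff(2 * r, r, F(6), 6 * rho)
    b = sup_diff(2 * r, r, F(0), 6 * rho)
    return a + 15 * b


if __name__ == "__main__":
    for r in (F(1, 2), F(5, 8), F(3, 4)):
        print(r, [margin(r, m) for m in range(8, 16)])
```

At $r=\tfrac{3}{4}$ the exact margin is $-\tfrac{9}{64}$, which coincides with $-r\left(24-\tfrac{127}{4}r\right)$ from Eq. (58), so the analytic bound is tight at the edge of the excluded range.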

##### Excluding $r\in(3/4,4/5)$

It remains to consider FP8 ratios in the interval $(3/4,4/5)$. Since a positive normal E4M3 value has significand $m/8$ with $m\in\{8,\dots,15\}$, every ratio between two such FP8 values has the form

$$r=\frac{j}{m}2^{\Delta a},\qquad j,m\in\{8,\dots,15\}. \tag{62}$$

A finite enumeration shows that the only legal ratios in $(3/4,4/5)$ are

$$\frac{7}{9},\qquad\frac{10}{13},\qquad\frac{11}{14}.$$
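The finite enumeration can be reproduced in a few lines. This is a sketch under the assumptions stated in the text (positive normal E4M3 values only, significands $j/8$ and $m/8$ with $j,m\in\{8,\dots,15\}$); the function name `e4m3_ratios_in` and the `max_shift` parameter are illustrative. Because $j/m$ lies between $8/15$ and $15/8$, exponent differences beyond $\pm 1$ cannot land in the interval, so a small shift range is enough.

```python
from fractions import Fraction


def e4m3_ratios_in(lo, hi, max_shift=2):
    """Ratios (j/m) * 2**shift with j, m in {8, ..., 15} inside (lo, hi)."""
    found = set()
    for j in range(8, 16):
        for m in range(8, 16):
            for shift in range(-max_shift, max_shift + 1):
                r = Fraction(j, m) * Fraction(2) ** shift
                if lo < r < hi:
                    found.add(r)
    return sorted(found)


if __name__ == "__main__":
    print(e4m3_ratios_in(Fraction(3, 4), Fraction(4, 5)))
```

On the open interval $(3/4,4/5)$ this returns $10/13$, $7/9$, and $11/14$ in increasing order, where $7/9$ arises as $\tfrac{14}{9}\cdot 2^{-1}$.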
We exclude the remaining cases using convex certificates. For a fixed candidate ratio $r$, consider valid FP8 ratios $u_{j}$ and nonnegative weights $\lambda_{j}$ satisfying $\sum_{j}\lambda_{j}=1$, and define

$$\Delta_{\{u_{j}\},r}(y)=\sum_{j}\lambda_{j}d_{u_{j}}(y)-d_{r}(y). \tag{63}$$

This construction extends the previous argument by allowing comparison against a convex combination of multiple ratios instead of a single ratio $r'$. Define

$$\begin{aligned}
A_{\{u_{j}\},r}&=\sup_{6\leq y\leq 6\rho_{m}}\Delta_{\{u_{j}\},r}(y),\\
B_{\{u_{j}\},r}&=\sup_{0\leq y\leq 6\rho_{m}}\Delta_{\{u_{j}\},r}(y).
\end{aligned} \tag{64}$$

If

$$A_{\{u_{j}\},r}+15B_{\{u_{j}\},r}\leq 0, \tag{65}$$

then for any block containing a maximal coordinate,

$$\begin{aligned}
\sum_{j}\lambda_{j}\bar{\mathcal{L}}(u_{j};\mathbf{y})-\bar{\mathcal{L}}(r;\mathbf{y})&=\sum_{i=0}^{15}\Delta_{\{u_{j}\},r}(y_{i})\\
&\leq A_{\{u_{j}\},r}+15B_{\{u_{j}\},r}\\
&\leq 0.
\end{aligned} \tag{66--68}$$

Therefore, at least one ratio $u_{j}$ achieves strictly smaller loss than $r$, implying that $r$ cannot be optimal.

The required certificates are summarized in Table [5](https://arxiv.org/html/2606.07618#A2.T5). All listed certificates satisfy

$$A_{\{u_{j}\},r}+15B_{\{u_{j}\},r}\leq 0. \tag{69}$$

Since $\sum_{j}\lambda_{j}=1$, the quadratic $y^{2}$ terms cancel in $\Delta_{\{u_{j}\},r}(y)$. Therefore, $\Delta_{\{u_{j}\},r}(y)$ is piecewise affine in $y$, and all extrema are attained at breakpoints. Consequently, both $A_{\{u_{j}\},r}$ and $B_{\{u_{j}\},r}$ can be computed by enumerating finitely many breakpoints. The coefficients $\lambda_{j}$ are obtained either by exhaustive search or by solving a linear program.
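The cancellation of the quadratic terms can be written out explicitly. As a worked step, fix an interval of $y$ on which every nearest grid point is constant, and write $g$ for the grid point selected under ratio $r$ and $g_{j}$ for the one selected under $u_{j}$ (these two symbols are introduced here for illustration only). Expanding the squares and using $\sum_{j}\lambda_{j}=1$ gives an affine function of $y$ on that interval:

```latex
\begin{aligned}
\Delta_{\{u_j\},r}(y)
  &= \sum_j \lambda_j \,(y - u_j g_j)^2 - (y - r g)^2 \\
  &= \Bigl(\sum_j \lambda_j - 1\Bigr) y^2
     - 2\Bigl(\sum_j \lambda_j u_j g_j - r g\Bigr) y
     + \sum_j \lambda_j u_j^2 g_j^2 - r^2 g^2 \\
  &= -2\Bigl(\sum_j \lambda_j u_j g_j - r g\Bigr) y
     + \Bigl(\sum_j \lambda_j u_j^2 g_j^2 - r^2 g^2\Bigr).
\end{aligned}
```

The selected grid points change only at midpoints between neighbouring reconstruction points of $r$ or of some $u_{j}$, which is why a finite list of breakpoints suffices.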

##### Summary

In summary, for every ratio satisfying $r\leq\tfrac{4}{5}$, there exists a valid FP8 ratio $r'\leq\tfrac{11}{4}$, or more generally a convex combination of valid FP8 ratios $\{u_{j}\}$ with $u_{j}\leq\tfrac{11}{4}$, such that

$$\bar{\mathcal{L}}(r';\mathbf{y})\leq\bar{\mathcal{L}}(r;\mathbf{y}), \tag{70}$$

or

$$\min_{j}\bar{\mathcal{L}}(u_{j};\mathbf{y})\leq\sum_{j}\lambda_{j}\bar{\mathcal{L}}(u_{j};\mathbf{y})\leq\bar{\mathcal{L}}(r;\mathbf{y}), \tag{71}$$

for all admissible normalized coordinates $\mathbf{y}$. Therefore, no ratio $r\leq\tfrac{4}{5}$ can be optimal. ∎

Table 6: Dataset size and domain for MMLU Pro, GSM8K, IFEval, HellaSwag, and WinoGrande.

## Appendix C Optimal FP32 Scale for FP4 Quantization

##### Optimal FP32 Block Scale

The WMSE objective in Eq. ([2](https://arxiv.org/html/2606.07618#S3.E2)) can be equivalently written as

$$\mathcal{L}(s;\mathbf{x},\mathbf{w})=\sum_{i=0}^{n-1}w_{i}\left(x_{i}-\left\lfloor x_{i}/s\right\rceil_{\mathrm{FP4}}\cdot s\right)^{2}. \tag{72}$$

Each term $w_{i}\left(x_{i}-\left\lfloor x_{i}/s\right\rceil_{\mathrm{FP4}}\cdot s\right)^{2}$ is a piecewise quadratic function of $s$, with 15 segments corresponding to the breakpoints at which $x_{i}/s$ aligns with the midpoints between consecutive FP4 representable values. Consequently, $\mathcal{L}(s;\mathbf{x},\mathbf{w})$ is a piecewise quadratic function of $s$, with breakpoints given by the union of all individual breakpoints, yielding at most $14n+1$ segments. The exact global minimum of $\mathcal{L}(s;\mathbf{x},\mathbf{w})$ can be computed efficiently by enumerating all segments, determining the quadratic coefficients within each, and evaluating the minimum of each quadratic segment.
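The segment enumeration can be sketched in plain Python. This is an illustrative implementation under stated assumptions, not the authors' code: the names `fp4_round`, `wmse`, and `optimal_fp32_scale` are hypothetical, ties at rounding thresholds are broken toward the smaller magnitude, and only positive scales are searched. For $s>0$ each $x_{i}/s$ keeps the sign of $x_{i}$, so only the seven thresholds of the nonnegative grid matter per element, which stays within the $14n+1$ segment bound. Inside one segment the FP4 codes $q_{i}$ are fixed, so the loss is $\sum_{i}w_{i}(x_{i}-q_{i}s)^{2}$ with unconstrained minimizer $\sum_{i}w_{i}x_{i}q_{i}/\sum_{i}w_{i}q_{i}^{2}$, which is then clamped to the segment.

```python
from bisect import bisect_left

# Nonnegative FP4 E2M1 values and the rounding thresholds between them.
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
FP4_MIDS = tuple((a + b) / 2 for a, b in zip(FP4_GRID, FP4_GRID[1:]))


def fp4_round(v):
    """Round v to the nearest FP4 E2M1 value, saturating at +-6."""
    mag = FP4_GRID[bisect_left(FP4_MIDS, abs(v))]
    return mag if v >= 0 else -mag


def wmse(s, x, w):
    """Weighted MSE of Eq. (72) for a positive scale s."""
    return sum(wi * (xi - fp4_round(xi / s) * s) ** 2 for xi, wi in zip(x, w))


def optimal_fp32_scale(x, w):
    """Exact minimizer of wmse over s > 0 by enumerating quadratic segments."""
    mags = [abs(xi) for xi in x if xi != 0]
    if not mags:
        return 1.0, 0.0  # all-zero block: every scale gives zero loss
    # A code changes exactly where |x_i| / s crosses a rounding threshold.
    cuts = sorted({m / t for m in mags for t in FP4_MIDS})
    edges = [0.0] + cuts + [2.0 * cuts[-1]]
    best_s, best_loss = None, float("inf")
    for lo, hi in zip(edges, edges[1:]):
        mid = 0.5 * (lo + hi)
        q = [fp4_round(xi / mid) for xi in x]  # codes are constant on (lo, hi)
        num = sum(wi * xi * qi for xi, wi, qi in zip(x, w, q))
        den = sum(wi * qi * qi for wi, qi in zip(w, q))
        s = min(max(num / den, lo), hi) if den > 0 else hi
        if s <= 0:
            continue
        loss = sum(wi * (xi - qi * s) ** 2 for xi, wi, qi in zip(x, w, q))
        if loss < best_loss:
            best_s, best_loss = s, loss
    return best_s, best_loss
```

Since AbsMax initialization, $s=\max_{i}|x_{i}|/6$, is one point of the search space, the returned loss is never larger than the AbsMax loss.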

##### FP8-quantized Optimal FP32 Scale

The FP8-quantized optimal FP32 scale $s_{\mathrm{fp8}}$ is obtained by quantizing the optimal FP32 scale $s$ with the global scale $S$, where $s_{\mathrm{fp8}}$ is selected between $\lfloor s/S\rfloor_{\mathrm{FP8}}$ and $\lceil s/S\rceil_{\mathrm{FP8}}$, whichever minimizes $\mathcal{L}(s_{\mathrm{fp8}}\cdot S;\mathbf{x},\mathbf{w})$.
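The floor/ceil selection can be sketched as follows. The sketch assumes the FP8 format is E4M3 with bias 7, subnormals, and a maximum finite value of 448 (the variant without infinities); the names `e4m3_positive_values`, `fp8_floor_ceil`, and `fp8_quantized_scale` are hypothetical, and `loss` is any callable that maps a full-precision scale to the objective $\mathcal{L}(\cdot;\mathbf{x},\mathbf{w})$. Values outside the representable range are clamped to the nearest end.

```python
from bisect import bisect_right


def e4m3_positive_values():
    """All positive finite E4M3 values (assumed variant with maximum 448)."""
    subnormals = [k * 2.0 ** -9 for k in range(1, 8)]
    normals = [(8 + m) / 8 * 2.0 ** e for e in range(-6, 9) for m in range(8)]
    return sorted(v for v in subnormals + normals if v <= 448.0)


E4M3 = e4m3_positive_values()


def fp8_floor_ceil(v):
    """Neighbouring E4M3 values below and above v, clamped to the range."""
    i = bisect_right(E4M3, v)  # E4M3[:i] <= v < E4M3[i:]
    floor = E4M3[max(i - 1, 0)]
    ceil = floor if floor == v else E4M3[min(i, len(E4M3) - 1)]
    return floor, ceil


def fp8_quantized_scale(s_opt, global_scale, loss):
    """Pick the FP8 neighbour of s_opt / S whose rescaled value has lower loss."""
    candidates = fp8_floor_ceil(s_opt / global_scale)
    return min(candidates, key=lambda c: loss(c * global_scale))


if __name__ == "__main__":
    # Toy objective with its minimum at 0.3, used only to exercise the selection.
    print(fp8_quantized_scale(0.3, 1.0, lambda s: (s - 0.3) ** 2))
```

With the toy objective, the two candidates around $0.3$ are $0.28125$ and $0.3125$, and the latter is selected because it lies closer to the minimum.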

## Appendix D Additional Implementation Details

### D.1 Calibration Details

##### Calibration

We follow PV-Tuning (Malinovskii et al., [2024](https://arxiv.org/html/2606.07618#bib.bib27)) and sample 128 sequences of length 2048 from the 1B samples released in the official RedPajama dataset as calibration samples. Calibration is performed only once using a fixed seed of 0.

##### Quantization

For activation quantization, KV cache quantization, and query states quantization in NVFP4, we follow the practices presented in LLM-Compressor ([LLM-Compressor W4A4 FP4 quantization example](https://docs.vllm.ai/projects/llm-compressor/en/latest/examples/quantization_w4a4_fp4/#2-prepare-calibration-data)) and TensorRT ([TensorRT dynamic double quantization guide](https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-quantized-types.html#dynamic-double-quantization)), employing a static global scale. The global scale is determined during the calibration stage and remains fixed throughout the inference stage.

##### GPTQ

For GPTQ (Frantar et al., [2023](https://arxiv.org/html/2606.07618#bib.bib4)), we employ LDLQ (Chee et al., [2023](https://arxiv.org/html/2606.07618#bib.bib21)), which has been shown to be algorithmically equivalent to GPTQ. The damping factor is set to $0.01$ times the mean of the Hessian diagonal. Following MR-GPTQ (Egiazarian et al., [2026](https://arxiv.org/html/2606.07618#bib.bib8)), we apply static reordering: the block scales are determined prior to GPTQ, and the input channels for GPTQ quantization are ordered according to the Hessian diagonal in descending order.
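The damping and channel ordering described above can be illustrated with a small dependency-free sketch. It assumes the usual GPTQ convention that the damping value is added to the Hessian diagonal; the function name `damp_and_order` and its `damp_ratio` parameter are hypothetical, and a real implementation would operate on tensors rather than nested lists.

```python
def damp_and_order(hessian, damp_ratio=0.01):
    """Add diagonal damping, then permute channels by descending diagonal.

    hessian: square matrix as a list of rows.
    Returns (permuted damped Hessian, channel order, damping value).
    """
    n = len(hessian)
    diag = [hessian[i][i] for i in range(n)]
    damp = damp_ratio * sum(diag) / n  # 0.01 times the mean of the diagonal
    damped = [
        [hessian[i][j] + (damp if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    # Channels with the largest Hessian diagonal are quantized first.
    order = sorted(range(n), key=lambda i: diag[i], reverse=True)
    permuted = [[damped[i][j] for j in order] for i in order]
    return permuted, order, damp


if __name__ == "__main__":
    h = [[1.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 1.0]]
    permuted, order, damp = damp_and_order(h)
    print(order, damp)
```

Because the damping adds the same constant to every diagonal entry, the ordering is the same whether it is computed before or after damping.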

### D.2 Evaluation Details

For MMLU Pro, GSM8K, IFEval, HellaSwag, and WinoGrande, we use lm-evaluation-harness (Gao et al., [2024](https://arxiv.org/html/2606.07618#bib.bib26)) for evaluation, corresponding to the tasks `mmlu_pro_llama`, `gsm8k_llama`, `ifeval`, `hellaswag`, and `winogrande`, respectively. Among these, `apply_chat_template` is enabled for MMLU Pro, GSM8K, and IFEval. The size and evaluation domain of the five benchmarks are summarized in Table [6](https://arxiv.org/html/2606.07618#A2.T6).

### D.3 Operator Benchmark Details

We conduct operator-level benchmarks using PyTorch 2.9.0, Triton 3.5.0, and vLLM 0.13.0 on NVIDIA RTX PRO 6000 Blackwell GPUs. Since the default NVFP4 operators in vLLM employ a swizzle layout, the batch dimension is padded to multiples of 128. Consequently, latency results for batch sizes smaller than 128 may not be accurate.
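The effect of the padding on small batches can be seen with a one-line calculation. This sketch only assumes that padding rounds the batch dimension up to the next multiple of 128; `padded_batch` is a hypothetical helper and is not taken from vLLM.

```python
def padded_batch(batch_size, multiple=128):
    """Smallest multiple of `multiple` that is at least batch_size."""
    return -(-batch_size // multiple) * multiple


if __name__ == "__main__":
    for b in (1, 16, 127, 128, 129):
        print(b, "->", padded_batch(b))
```

Under this assumption, batch sizes 1, 16, and 127 all occupy the same 128 padded rows as batch size 128, so the operator works on the same padded shape for each of them.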

## Appendix E Full Result

Table [7](https://arxiv.org/html/2606.07618#A5.T7) presents the latency comparison and relative latency ratios under different batch sizes with hidden dimension 8192 on NVIDIA RTX PRO 6000 Blackwell GPUs. Figure [6](https://arxiv.org/html/2606.07618#A5.F6) presents the MSE and WMSE gap (%) of different initialization methods relative to the optimal FP32 block scale for key cache, value cache, and query state quantization on all layers of Llama-3.1-8B-Instruct. Table [8](https://arxiv.org/html/2606.07618#A5.T8) presents the results on Llama-3.2-3B-Instruct, while Table [9](https://arxiv.org/html/2606.07618#A5.T9) presents the results on Qwen3-4B and Qwen3-8B.

Table 7: Comparison of latency and relative latency ratio of the NVFP4 quantization operator across different batch sizes with hidden dimension 8192 on NVIDIA RTX PRO 6000 Blackwell GPUs. "Rel. Latency" refers to the relative latency ratio, which denotes the latency ratio relative to the default vLLM operator.

![Refer to caption](https://arxiv.org/html/2606.07618v1/x6.png)

Figure 6: Performance gap (%) of ScaleSweep, initialization baselines, and FP8-quantized optimal FP32 scales relative to the optimal FP32 block scale in terms of MSE and WMSE for KV cache quantization and query states quantization on Llama-3.1-8B-Instruct.

| Setting | Method | Init Method | MMLU Pro | GSM8k | IFEval | HellaS | WinoG | Avg | Recovery (%) |
|---|---|---|---|---|---|---|---|---|---|
| **Llama-3.2-3B-Instruct** | | | | | | | | | |
| - | BF16 | - | 37.30 | 78.54 | 70.24 | 71.75 | 68.51 | 65.27 | 100 |
| WA | RTN | AbsMax | 33.43 | 71.49 | 63.96 | 70.30 | 68.27 | 61.49 | 94.21 |
| WA | RTN | 4/6 | 34.10 | 71.95 | 67.28 | 70.11 | 65.51 | 61.79 | 94.67 |
| WA | RTN | ScaleSweep$_{\mathrm{MSE}}$ | 33.96 | 73.46 | 68.02 | 70.39 | 67.09 | 62.59 | 95.89 |
| WA | RTN | ScaleSweep | 34.12 | 73.16 | 68.76 | 70.18 | 68.98 | 63.04 | 96.58 |
| WA | GPTQ | AbsMax | 33.29 | 71.57 | 65.43 | 69.85 | 68.75 | 61.78 | 94.65 |
| WA | GPTQ | 4/6 | 34.23 | 72.78 | 69.69 | 70.53 | 66.06 | 62.66 | 96.00 |
| WA | GPTQ | ScaleSweep$_{\mathrm{MSE}}$ | 35.09 | 74.68 | 65.62 | 70.26 | 68.75 | 62.88 | 96.34 |
| WA | GPTQ | ScaleSweep | 34.97 | 73.69 | 70.61 | 70.37 | 66.46 | 63.22 | 96.86 |
| WAKV | RTN | AbsMax | 30.93 | 67.17 | 66.36 | 69.74 | 68.03 | 60.45 | 92.61 |
| WAKV | RTN | 4/6 | 31.99 | 69.83 | 67.84 | 69.63 | 66.69 | 61.19 | 93.76 |
| WAKV | RTN | ScaleSweep$_{\mathrm{MSE}}$ | 33.08 | 72.33 | 67.47 | 69.79 | 66.30 | 61.79 | 94.67 |
| WAKV | RTN | ScaleSweep | 33.10 | 72.33 | 67.84 | 70.08 | 67.80 | 62.23 | 95.34 |
| WAKV | GPTQ | AbsMax | 31.26 | 67.63 | 66.73 | 69.81 | 66.85 | 60.45 | 92.62 |
| WAKV | GPTQ | 4/6 | 32.41 | 71.11 | 66.17 | 69.50 | 67.40 | 61.32 | 93.95 |
| WAKV | GPTQ | ScaleSweep$_{\mathrm{MSE}}$ | 33.44 | 72.33 | 65.62 | 70.22 | 66.93 | 61.71 | 94.54 |
| WAKV | GPTQ | ScaleSweep | 33.75 | 71.80 | 67.65 | 70.07 | 67.25 | 62.10 | 95.15 |
| WAKVQ | RTN | AbsMax | 28.00 | 61.87 | 63.96 | 68.71 | 65.43 | 57.59 | 88.24 |
| WAKVQ | RTN | 4/6 | 30.44 | 67.32 | 60.44 | 69.46 | 67.32 | 59.00 | 90.39 |
| WAKVQ | RTN | ScaleSweep$_{\mathrm{MSE}}$ | 31.33 | 68.31 | 66.17 | 69.13 | 66.69 | 60.33 | 92.43 |
| WAKVQ | RTN | ScaleSweep | 31.93 | 69.90 | 66.91 | 69.73 | 66.38 | 60.97 | 93.41 |
| WAKVQ | GPTQ | AbsMax | 29.81 | 62.62 | 61.74 | 69.05 | 65.35 | 57.71 | 88.43 |
| WAKVQ | GPTQ | 4/6 | 30.73 | 66.11 | 65.80 | 69.01 | 64.80 | 59.29 | 90.84 |
| WAKVQ | GPTQ | ScaleSweep$_{\mathrm{MSE}}$ | 32.21 | 68.76 | 62.66 | 69.25 | 65.82 | 59.74 | 91.53 |
| WAKVQ | GPTQ | ScaleSweep | 33.02 | 71.11 | 65.99 | 69.43 | 66.30 | 61.17 | 93.72 |

Table 8: Comparison results of ScaleSweep and baseline initialization methods under RTN and GPTQ across different settings on Llama-3.2-3B-Instruct.

Table 9: Comparison results of ScaleSweep and baseline initialization methods under RTN and GPTQ across different settings on Qwen3-8B and Qwen3-4B in the non-thinking mode.
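The Avg and Recovery (%) columns of Table 8 can be sanity-checked from the per-task scores. The sketch assumes that Avg is the unweighted mean of the five benchmarks and that Recovery is the quantized Avg divided by the BF16 Avg, which is an inference from the reported numbers rather than a definition quoted from the paper; the helper names are illustrative. Because the table entries are rounded to two decimals, the recomputed values agree with the reported ones only to about 0.01.

```python
def average(scores):
    """Unweighted mean of the benchmark scores."""
    return sum(scores) / len(scores)


def recovery(avg, baseline_avg):
    """Quantized average as a percentage of the full-precision average."""
    return 100.0 * avg / baseline_avg


# Rows copied from Table 8 (Llama-3.2-3B-Instruct):
# MMLU Pro, GSM8k, IFEval, HellaSwag, WinoGrande.
BF16 = [37.30, 78.54, 70.24, 71.75, 68.51]
WA_RTN_ABSMAX = [33.43, 71.49, 63.96, 70.30, 68.27]
WA_RTN_SCALESWEEP = [34.12, 73.16, 68.76, 70.18, 68.98]

if __name__ == "__main__":
    base = average(BF16)
    for name, row in (("AbsMax", WA_RTN_ABSMAX), ("ScaleSweep", WA_RTN_SCALESWEEP)):
        avg = average(row)
        print(name, round(avg, 2), round(recovery(avg, base), 2))
```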
