Saliency-Aware Regularized Quantization Calibration for Large Language Models

arXiv cs.AI 05/08/26, 04:00 AM Papers
Summary
This paper proposes Saliency-Aware Regularized Quantization Calibration (SARQC), a unified framework that improves Post-Training Quantization (PTQ) for LLMs by adding a regularization term to preserve weight proximity, enhancing generalization and performance.
arXiv:2605.05693v1 Announce Type: new
# Saliency-Aware Regularized Quantization Calibration for Large Language Models
Source: [https://arxiv.org/html/2605.05693](https://arxiv.org/html/2605.05693)
Authors: Yanlong Zhao (1)\*, Xiaoyuan Cheng (2)\*, Huihang Liu (3)\*, Baihua He (1), Xinyu Zhang (1, 4), Harrison Bo Hua Zhu (5, 6, 7), Wenlong Chen (6), Li Zeng (8), Zhuo Sun (3, 6)

Affiliations: (1) University of Science and Technology of China; (2) University College London; (3) Shanghai University of Finance and Economics; (4) Academy of Mathematics and Systems Science, Chinese Academy of Sciences; (5) University of Copenhagen; (6) Imperial College London; (7) Technical University of Denmark; (8) Peking University

\* Equal contribution: Yanlong Zhao, Xiaoyuan Cheng, and Huihang Liu. Corresponding author: Zhuo Sun.

###### Abstract

Post-training quantization (PTQ) is an effective approach for deploying large language models (LLMs) under memory and latency constraints. Most existing PTQ methods determine quantization parameters by minimizing a layer-wise reconstruction error on a predetermined calibration dataset, usually optimized via either scale search or Gram-based methods. However, from the perspective of generalization risk, existing PTQ calibration objectives, which rely only on the empirical reconstruction error over limited or unrepresentative calibration data, can move the quantized weights away from the original weights. This may cause the generalization risk to diverge, potentially degrading downstream performance. To address this issue, we propose *Saliency-Aware Regularized Quantization Calibration* (SARQC), a unified framework that augments the standard PTQ objective with a saliency-aware regularization term. This term encourages quantized weights to stay close to the original weights during calibration, leading to improved generalization during inference. SARQC integrates seamlessly into existing PTQ pipelines, enhancing both scale search and Gram-based methods under a unified formulation. Extensive experiments on dense and Mixture-of-Experts LLMs demonstrate consistent improvements in perplexity and zero-shot accuracy, without additional computational overhead during inference.

Project page: [https://github.com/Riceormice/SARQC](https://github.com/Riceormice/SARQC).

## 1 Introduction

Large language models (LLMs) have scaled to tens of billions of parameters and beyond, delivering strong capabilities across tasks ranging from instruction following and knowledge-intensive QA to code generation and multi-step reasoning (OpenAI et al., [2023](https://arxiv.org/html/2605.05693#bib.bib25); Jiang et al., [2023](https://arxiv.org/html/2605.05693#bib.bib31); Grattafiori et al., [2024](https://arxiv.org/html/2605.05693#bib.bib43); Yang et al., [2025](https://arxiv.org/html/2605.05693#bib.bib55)). However, deploying these models remains costly. Their enormous parameter count leads to high memory usage, and autoregressive decoding is often limited by memory bandwidth, since weights must be repeatedly loaded from memory for each generated token.

To deploy LLMs efficiently, post-training quantization (PTQ) has become a widely adopted technique for memory-constrained scenarios. Existing PTQ algorithms aim to convert a floating-point (FP) LLM into a quantized model (Banner et al., [2019](https://arxiv.org/html/2605.05693#bib.bib5); Gholami et al., [2021](https://arxiv.org/html/2605.05693#bib.bib24)). Such a compressed model typically stores its weights in low-bit integer formats (e.g., INT4). Most PTQ pipelines determine quantization parameters, such as scales, zero points, clipping thresholds, and rounding decisions, by minimizing layer- or block-level output reconstruction error between the FP model and the quantized model over a small calibration dataset (Nagel et al., [2020](https://arxiv.org/html/2605.05693#bib.bib49); Li et al., [2021](https://arxiv.org/html/2605.05693#bib.bib35); Frantar et al., [2022](https://arxiv.org/html/2605.05693#bib.bib19)). In contrast to quantization-aware training (QAT), which typically incurs substantial data and training cost (Liu et al., [2024](https://arxiv.org/html/2605.05693#bib.bib39)), post-training quantization provides a more efficient alternative by quantizing LLMs using only a small calibration set and minimal computation (Wei et al., [2022](https://arxiv.org/html/2605.05693#bib.bib60); Xiao et al., [2023](https://arxiv.org/html/2605.05693#bib.bib65); Frantar et al., [2023](https://arxiv.org/html/2605.05693#bib.bib20); Lin et al., [2024](https://arxiv.org/html/2605.05693#bib.bib38); Tian et al., [2025](https://arxiv.org/html/2605.05693#bib.bib57)). In this work, we mainly focus on weight-only PTQ techniques, which quantize FP weights to low-bit formats while keeping activations in floating-point precision. These methods significantly reduce memory usage and improve throughput in memory-bound serving scenarios (Lin et al., [2024](https://arxiv.org/html/2605.05693#bib.bib38); Liang et al., [2026](https://arxiv.org/html/2605.05693#bib.bib54)).

![Refer to caption](https://arxiv.org/html/2605.05693v1/x1.png)\(a\)
![Refer to caption](https://arxiv.org/html/2605.05693v1/x2.png)\(b\)

Figure 1: Illustration and validation of our motivation. *(a) Conceptual illustration*: A smaller $\mathbb{E}_{X}[\|\widehat{\mathbf{W}}_{l}X-\mathbf{W}_{l}X\|_{2}^{2}]$ generally implies better downstream performance. Vanilla calibration minimizes only the reconstruction loss, which can induce *weight drift* and degrade performance. The optimal solution lies in a *sweet spot* that balances *output mismatch* and *weight drift*. *(b) Empirical evidence of our motivation*: For SARQC-GBS on LLaMA2-7B under W4A16, the best downstream accuracy is achieved with a moderate regularization toward the original FP weights. The average accuracy over the eight evaluation tasks described in [Section 4.1](https://arxiv.org/html/2605.05693#S4.SS1) is reported here. $\mathcal{L}_{\mathrm{recon}}$ and $\mathcal{L}_{\mathrm{sar}}$ are the reconstruction error and saliency-aware regularizer defined in [Equation 8](https://arxiv.org/html/2605.05693#S3.E8). See [Figure 4](https://arxiv.org/html/2605.05693#A6.F4) (in Appendix [F.2](https://arxiv.org/html/2605.05693#A6.SS2)) for the visualization of weight drift. See Appendix [F.1](https://arxiv.org/html/2605.05693#A6.SS1) for more details.

#### Motivation

Recent weight-only PTQ methods for LLMs select quantized weights by minimizing layer-wise reconstruction error on a predetermined, often small, calibration set (Lin et al., [2024](https://arxiv.org/html/2605.05693#bib.bib38); Frantar et al., [2023](https://arxiv.org/html/2605.05693#bib.bib20); Li et al., [2025](https://arxiv.org/html/2605.05693#bib.bib36)). Specifically, for a linear layer with input activations $\mathbf{X}_{l}$ and FP weights $\mathbf{W}_{l}$, a typical PTQ calibration process selects quantized weights $\widehat{\mathbf{W}}_{l}$ by minimizing $\|\mathbf{W}_{l}\mathbf{X}_{l}-\widehat{\mathbf{W}}_{l}\mathbf{X}_{l}\|_{\mathrm{F}}^{2}$. However, reconstruction-based PTQ calibration does not explicitly constrain the dequantized weights to remain close to the original FP weights, which can lead to undesired weight drift. During inference, weight-only quantized LLMs still rely on dequantized weights to interact with floating-point activations, so substantial deviation from the original FP weights may degrade downstream performance. As illustrated in [Figure 1](https://arxiv.org/html/2605.05693#S1.F1) and [Figure 4](https://arxiv.org/html/2605.05693#A6.F4) (in Appendix [F.2](https://arxiv.org/html/2605.05693#A6.SS2)), calibration that solely minimizes reconstruction error can induce undesired weight drift even when the calibration loss is small. In particular, a smaller reconstruction error does not guarantee a smaller weight drift $\|\mathbf{W}_{l}-\widehat{\mathbf{W}}_{l}\|_{\mathrm{F}}^{2}$; in fact, the two objectives can conflict. As formalized later in [Theorem 3.1](https://arxiv.org/html/2605.05693#S3.Thmtheorem1), this may place the model in a regime with an enlarged generalization risk. Introducing an explicit discrepancy regularizer makes this trade-off explicit and controllable. As the regularization strength increases, weight drift decreases while reconstruction error rises, and the best downstream performance is achieved at an intermediate point along this trade-off curve, as shown in [Figure 1](https://arxiv.org/html/2605.05693#S1.F1) and formalized later in Corollary [3.2](https://arxiv.org/html/2605.05693#S3.Thmtheorem2).

#### Our Approach

Motivated by this observation, we propose *Saliency-Aware Regularized Quantization Calibration* (SARQC), a general calibration framework for post-training quantization that augments the standard layer-wise reconstruction objective with an explicit regularizer on weight drift. The original FP model serves as a natural reference point, and the added regularizer constrains the quantized solution from drifting too far away from it. This leads to a more balanced calibration objective that better preserves the behavior of the original FP model and is more robust to limited or unrepresentative calibration data. Importantly, SARQC can be seamlessly integrated into existing PTQ algorithms and applies to the two dominant PTQ paradigms, namely grid-search methods (Xiao et al., [2023](https://arxiv.org/html/2605.05693#bib.bib65); Lin et al., [2024](https://arxiv.org/html/2605.05693#bib.bib38)) and Gram-based methods (Frantar et al., [2023](https://arxiv.org/html/2605.05693#bib.bib20); Li et al., [2025](https://arxiv.org/html/2605.05693#bib.bib36)).

#### Main Contributions

The main contributions of this work are as follows: (1) We propose *Saliency-Aware Regularized Quantization Calibration* (SARQC), a regularized calibration method for weight-only post-training quantization (PTQ) of large language models, which explicitly controls weight drift from the original FP weights. (2) We provide a theoretical analysis from the perspective of generalization risk and constrained optimization. (3) Through extensive experiments, we demonstrate that the proposed method is broadly applicable across various PTQ paradigms and consistently achieves superior performance across a wide range of LLM families and model sizes.

## 2 Background

In this section, we briefly review quantization and post-training quantization for LLMs. See Appendix [B](https://arxiv.org/html/2605.05693#A2) and Appendix [C](https://arxiv.org/html/2605.05693#A3) for more preliminaries.

#### Quantization and Dequantization

Quantization maps high-precision floating-point values (e.g., BF16/FP16) to discrete integer values (e.g., INT2/INT4), thereby reducing memory cost and improving throughput in memory-bound scenarios. For weight-only quantization considered in this work, a floating-point weight value $w$ is quantized as follows:

$$w_{\text{INT-}N}=\mathrm{round}\!\left(\frac{w_{\text{FP16}}}{\eta}\right),\qquad\eta=\frac{\max|w|}{2^{N-1}-1},\tag{1}$$

where $N$ is the number of bits (e.g., $N=4$ for INT4) and $\eta$ is the quantization step size. The corresponding dequantization process reconstructs a floating-point approximation of the quantized value as $\widehat{w}=\eta\cdot w_{\text{INT-}N}$. Such dequantization operations are performed during inference for weight-only quantized LLMs. To simplify notation, we assume a symmetric quantization scheme centered at zero. The asymmetric case can be handled similarly by introducing a zero-point; further details are provided in Appendix [C.1](https://arxiv.org/html/2605.05693#A3.SS1). Weight quantization can be applied at various granularities, including per-tensor, group-wise, and per-channel quantization; see Xiao et al. ([2023](https://arxiv.org/html/2605.05693#bib.bib65)) for a detailed discussion.
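To make Equation (1) concrete, the following is a minimal sketch in plain Python of symmetric quantization and dequantization for a single group of weights. The function names `quantize_symmetric` and `dequantize` and the toy values are illustrative assumptions, not code from the paper or its repository.

```python
def quantize_symmetric(weights, n_bits=4):
    """Symmetric round-to-nearest quantization in the spirit of Equation (1).

    Assumes at least one non-zero weight, otherwise the step size would be zero.
    """
    q_max = 2 ** (n_bits - 1) - 1                 # 7 for INT4
    eta = max(abs(w) for w in weights) / q_max    # quantization step size
    # Python's built-in round() sends exact halves to the nearest even integer.
    codes = [round(w / eta) for w in weights]
    return codes, eta


def dequantize(codes, eta):
    """Reconstruct floating-point approximations: w_hat = eta * w_int."""
    return [eta * q for q in codes]


weights = [0.70, -0.33, 0.12, -0.04]   # toy values forming one quantization group
codes, eta = quantize_symmetric(weights, n_bits=4)
w_hat = dequantize(codes, eta)
max_err = max(abs(w - wh) for w, wh in zip(weights, w_hat))
print(codes, eta, max_err)
```

With round-to-nearest, the per-weight error is at most half a step, $\eta/2$, so a single large-magnitude weight inflates $\eta$ and with it the error on every other weight in the group.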

#### Challenges in PTQ of LLMs

Although quantization is appealing for memory-bound deployment, naively applying it as described above can lead to significant performance degradation in downstream tasks. This is due to the "outliers" in activations and weights of LLMs. These "outliers" are (potentially extremely) large values that can dominate the limited dynamic range of low-bit formats and degrade quantization fidelity (Xi et al., [2023](https://arxiv.org/html/2605.05693#bib.bib1); Xiao et al., [2023](https://arxiv.org/html/2605.05693#bib.bib65); Lin et al., [2024](https://arxiv.org/html/2605.05693#bib.bib38)). To alleviate these challenges, one line of work applies extra scaling factors to both weights and activations (Xiao et al., [2023](https://arxiv.org/html/2605.05693#bib.bib65); Lin et al., [2024](https://arxiv.org/html/2605.05693#bib.bib38)). That is, for a linear layer with weight $\mathbf{W}_{l}\in\mathbb{R}^{d_{\text{out}}\times d_{\text{in}}}$, given a scaling factor matrix $\widetilde{\mathbf{S}}_{l}\in\mathbb{R}^{d_{\text{in}}\times d_{\text{in}}}$ and the input $\mathbf{X}_{l}\in\mathbb{R}^{d_{\text{in}}\times n}$ conditioned on the calibration data, the output of this layer can be rewritten as

$$\mathbf{Y}_{l}:=\mathbf{W}_{l}\mathbf{X}_{l}=\left(\mathbf{W}_{l}\widetilde{\mathbf{S}}_{l}\right)\left(\widetilde{\mathbf{S}}_{l}^{-1}\mathbf{X}_{l}\right),\tag{2}$$

where $\widetilde{\mathbf{S}}_{l}^{-1}$ denotes the inverse of $\widetilde{\mathbf{S}}_{l}$. This transformation is function-preserving, since it leaves the layer output unchanged in the absence of discretization error. A common choice is to take $\widetilde{\mathbf{S}}_{l}:=\mathrm{diag}(\tilde{s}_{l})$, where $\tilde{s}_{l}\in\mathbb{R}^{d_{\text{in}}}$ is a channel-wise scaling vector constructed from summary statistics of the weights and activations. The optimal quantization operation $Q$, including $\widetilde{\mathbf{S}}_{l}$ and the associated quantized weights $\widehat{\mathbf{W}}_{l}$, is selected by minimizing $\|\widehat{\mathbf{W}}_{l}\mathbf{X}_{l}-\mathbf{W}_{l}\mathbf{X}_{l}\|_{\mathrm{F}}^{2}$. For instance, AWQ (Lin et al., [2024](https://arxiv.org/html/2605.05693#bib.bib38)) uses activation statistics to guide per-channel scaling factors $\tilde{s}_{l}$ to better preserve salient channels for weight-only PTQ. This approach alleviates the impact of activation outliers by applying $\mathrm{diag}(\tilde{s}_{l})^{-1}$ to activations, while the scaled weights $\mathbf{W}_{l}\mathrm{diag}(\tilde{s}_{l})$ can be absorbed into the offline weights. During inference, the inverse scaling can be fused into adjacent operators, avoiding additional runtime overhead.
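The function-preserving property of Equation (2) can be checked numerically. The sketch below uses small hand-picked matrices and a diagonal scaling with power-of-two entries (so the floating-point arithmetic is exact). All values and the helper `matmul` are illustrative assumptions.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


W = [[0.5, -1.0, 2.0],
     [1.5, 0.25, -0.75]]            # d_out = 2, d_in = 3
X = [[1.0, 2.0],
     [-3.0, 0.5],
     [40.0, -20.0]]                 # d_in = 3, n = 2; channel 2 is an activation outlier
s = [1.0, 2.0, 8.0]                 # one scale per input channel (hypothetical values)

# W diag(s) scales the columns of W; diag(s)^{-1} X scales the rows of X.
W_scaled = [[w * sj for w, sj in zip(row, s)] for row in W]
X_scaled = [[x / sj for x in row] for row, sj in zip(X, s)]

Y = matmul(W, X)
Y_scaled = matmul(W_scaled, X_scaled)

max_diff = max(abs(a - b) for ra, rb in zip(Y, Y_scaled) for a, b in zip(ra, rb))
max_abs_before = max(abs(x) for row in X for x in row)
max_abs_after = max(abs(x) for row in X_scaled for x in row)
print(max_diff, max_abs_before, max_abs_after)
```

The layer output is unchanged, while the largest activation magnitude shrinks and the corresponding weight column grows. The scaling therefore moves quantization difficulty between activations and weights without altering the function being approximated.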

An alternative line of work applies orthogonal or structured rotations to redistribute outliers across dimensions before quantization, aiming to make both activations and weights more quantization-friendly. For instance, QuIP (Chee et al., [2023](https://arxiv.org/html/2605.05693#bib.bib10)) introduces orthogonal transformations to reduce the impact of outliers, QuaRot (Ashkboos et al., [2024](https://arxiv.org/html/2605.05693#bib.bib4)) uses Hadamard-style transformations to suppress outliers while preserving the model function, and SpinQuant (Liu et al., [2025](https://arxiv.org/html/2605.05693#bib.bib41)) further optimizes the rotation parameters to better match quantization constraints. Despite their effectiveness, rotation-based methods may introduce additional calibration, optimization, or inference costs, especially when the transformations are learned (Xi et al., [2023](https://arxiv.org/html/2605.05693#bib.bib1); Liu et al., [2025](https://arxiv.org/html/2605.05693#bib.bib41)).

#### Grid Search over Scaling Factors

As discussed above, a common strategy in PTQ is to apply tensor-, group-, or channel-wise scaling, which rebalances the quantization difficulty between activations and weights and often improves downstream performance. In practice, the scaling factors $\tilde{s}_{l}$ in [Equation 2](https://arxiv.org/html/2605.05693#S2.E2) are derived from simple tensor-, group-, or channel-level statistics that serve as heuristic proxies for activation or weight saliency. Different scaling-based PTQ methods instantiate these factors in different ways. A widely adopted class of scaling factors is

$$\tilde{s}_{l}^{(j)}\propto\frac{\mathrm{stat}_{X}(|\mathbf{X}_{l}^{(j)}|)^{\alpha}}{\mathrm{stat}_{W}(|\mathbf{W}_{l}^{(j)}|)^{1-\alpha}},\tag{3}$$

where $\alpha\in[0,1]$, and $\mathrm{stat}_{X}(\cdot)$ and $\mathrm{stat}_{W}(\cdot)$ denote pre-defined activation and weight summary statistics. This form covers the scaling factors used in Xiao et al. ([2023](https://arxiv.org/html/2605.05693#bib.bib65)), where maximum statistics are used to migrate activation outliers into weights. In contrast, Lin et al. ([2024](https://arxiv.org/html/2605.05693#bib.bib38)) use only channel-wise activation summary statistics as scaling factors, e.g., $\tilde{s}_{l}^{(j)}\propto\mathrm{mean}(|\mathbf{X}_{l}^{(j)}|)^{\alpha}$. Under such parameterizations, PTQ can be simplified to a lightweight grid search over scaling parameters such as $\alpha$, with the goal of minimizing the reconstruction error between the original FP layer output and the quantized layer output, conditional on the calibration data.
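The grid search described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of any cited method. It uses the activation-only scale $\mathrm{mean}(|\mathbf{X}_l^{(j)}|)^{\alpha}$, per-output-row symmetric quantization from Equation (1), synthetic Gaussian data, and hypothetical helper names (`fake_quantize_rows`, `grid_search_alpha`).

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fake_quantize_rows(w, n_bits=4):
    """Quantize then dequantize each output row with the symmetric rule of Equation (1)."""
    q_max = 2 ** (n_bits - 1) - 1
    out = []
    for row in w:
        eta = max(abs(v) for v in row) / q_max
        if eta == 0.0:              # guard against an all-zero row
            eta = 1.0
        out.append([eta * round(v / eta) for v in row])
    return out


def recon_error(w, w_hat, x):
    """Squared Frobenius norm of W X - W_hat X."""
    y, y_hat = matmul(w, x), matmul(w_hat, x)
    return sum((a - b) ** 2 for ra, rb in zip(y, y_hat) for a, b in zip(ra, rb))


def grid_search_alpha(w, x, alphas, n_bits=4):
    """Return (alpha, error, effective W_hat) with the lowest reconstruction error."""
    act_stat = [sum(abs(v) for v in row) / len(row) for row in x]   # mean |X^(j)| per channel
    best = None
    for alpha in alphas:
        s = [max(m, 1e-8) ** alpha for m in act_stat]
        w_scaled = [[v * sj for v, sj in zip(row, s)] for row in w]
        # Effective weights seen by the unscaled input: Q(W diag(s)) diag(s)^{-1}.
        w_hat = [[v / sj for v, sj in zip(row, s)]
                 for row in fake_quantize_rows(w_scaled, n_bits)]
        err = recon_error(w, w_hat, x)
        if best is None or err < best[1]:
            best = (alpha, err, w_hat)
    return best


random.seed(0)
d_out, d_in, n = 4, 6, 16
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
X = [[random.gauss(0, 1) for _ in range(n)] for _ in range(d_in)]
X[0] = [20.0 * v for v in X[0]]     # make input channel 0 an activation outlier

alphas = [i / 10 for i in range(11)]
best_alpha, best_err, _ = grid_search_alpha(W, X, alphas)
baseline_err = grid_search_alpha(W, X, [0.0])[1]   # alpha = 0 means no scaling
print(best_alpha, best_err, baseline_err)
```

Because $\alpha=0$ (no scaling) is part of the grid, the selected error can never exceed the unscaled baseline on the calibration data. Whether the gain carries over to unseen inputs is exactly the question raised in Section 3.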

#### Gram\-based PTQ via Optimal Brain Compression

As an alternative to grid search over scaling factors, Gram-based methods are also widely adopted in practice (Frantar et al., [2023](https://arxiv.org/html/2605.05693#bib.bib20); Li et al., [2025](https://arxiv.org/html/2605.05693#bib.bib36)). Second-order compression dates back to Optimal Brain Surgeon (OBS) (LeCun et al., [1989](https://arxiv.org/html/2605.05693#bib.bib66); Hassibi et al., [1993](https://arxiv.org/html/2605.05693#bib.bib51)), which uses curvature information to estimate how parameter perturbations affect the loss. Optimal Brain Compression (OBC) connects this idea to calibration objectives by introducing a tractable curvature proxy. When mean squared error is used as the layer-wise calibration objective with inputs $\mathbf{X}_{l}$, the resulting quadratic form involves the Gram matrix $\mathbf{X}_{l}\mathbf{X}_{l}^{\top}$, which serves as a surrogate curvature and penalizes output distortion induced by weight perturbations (Frantar et al., [2022](https://arxiv.org/html/2605.05693#bib.bib19)). Frantar et al. ([2023](https://arxiv.org/html/2605.05693#bib.bib20)) further adapt OBC to weight-only PTQ by performing efficient sequential or block-wise updates using cached activation statistics, which do not require end-to-end backpropagation. That is,

$$\min_{\Delta\mathbf{W}_{l}}\|(\mathbf{W}_{l}+\Delta\mathbf{W}_{l})\mathbf{X}_{l}-\mathbf{W}_{l}\mathbf{X}_{l}\|_{\mathrm{F}}^{2}\quad\mathrm{s.t.}\quad\Delta\mathbf{W}_{l}=\widehat{\mathbf{W}}_{l}-\mathbf{W}_{l},\tag{4}$$

where $\widehat{\mathbf{W}}_{l}$ denotes the quantized weights. Note that this is equivalent to selecting $\widehat{\mathbf{W}}_{l}$ by minimizing $\|\widehat{\mathbf{W}}_{l}\mathbf{X}_{l}-\mathbf{W}_{l}\mathbf{X}_{l}\|_{\mathrm{F}}^{2}$. Several follow-up works further improve the effectiveness of OBC-style methods (Li et al., [2025](https://arxiv.org/html/2605.05693#bib.bib36); Van Baalen et al., [2024](https://arxiv.org/html/2605.05693#bib.bib59)).
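As a worked step, the objective in Equation (4) can be expanded to show where the Gram matrix enters. This is standard linear algebra rather than a derivation quoted from the paper, and the row notation $\Delta\mathbf{w}_{l,i}$ (the $i$-th row of $\Delta\mathbf{W}_{l}$) is introduced here only for illustration:

$$\|\Delta\mathbf{W}_{l}\mathbf{X}_{l}\|_{\mathrm{F}}^{2}=\mathrm{tr}\!\left(\Delta\mathbf{W}_{l}\,\mathbf{X}_{l}\mathbf{X}_{l}^{\top}\,\Delta\mathbf{W}_{l}^{\top}\right)=\sum_{i=1}^{d_{\text{out}}}\Delta\mathbf{w}_{l,i}\left(\mathbf{X}_{l}\mathbf{X}_{l}^{\top}\right)\Delta\mathbf{w}_{l,i}^{\top}.$$

The loss therefore decouples across output rows, and every row shares the same $d_{\text{in}}\times d_{\text{in}}$ matrix $\mathbf{X}_{l}\mathbf{X}_{l}^{\top}$. That matrix can be accumulated once from cached activations and reused for all rows, which is consistent with the absence of end-to-end backpropagation noted above.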

## 3 Methodology

In this section, we formally introduce *Saliency-Aware Regularized Quantization Calibration* (SARQC) in [Section 3.1](https://arxiv.org/html/2605.05693#S3.SS1). We then show how the resulting SARQC objective can be optimized by both scaling-based grid search and Gram-based methods in [Section 3.2](https://arxiv.org/html/2605.05693#S3.SS2).

### 3.1 Saliency-Aware Regularized Quantization Calibration

We begin by reviewing the standard calibration objective used in post-training quantization for LLMs, which measures the output discrepancy induced by quantization. For the $l$-th linear layer with FP weight $\mathbf{W}_{l}\in\mathbb{R}^{d_{\text{out}}\times d_{\text{in}}}$ and cached calibration input $\mathbf{X}_{l}\in\mathbb{R}^{d_{\text{in}}\times n}$ conditioned on the calibration data, PTQ seeks the quantized weights $\widehat{\mathbf{W}}_{l}$ by minimizing the reconstruction error

$$\min_{\widehat{\mathbf{W}}_{l}\in\mathcal{Q}}\;\;\|\mathbf{W}_{l}\mathbf{X}_{l}-\widehat{\mathbf{W}}_{l}\mathbf{X}_{l}\|_{\mathrm{F}}^{2},\tag{5}$$

where $\mathcal{Q}$ denotes the feasible set of dequantized-quantized weights induced by the underlying quantization scheme, possibly together with additional reparameterizations such as scaling by $\widetilde{\mathbf{S}}_{l}$. Note that each quantization operation $Q$ induces a Dirac measure supported at the associated quantized weight $\widehat{\mathbf{W}}_{l}$. The collection of all quantization operations therefore induces a family of Dirac measures $\mathcal{Q}$ over the space of quantized weights.

#### Why It May Fail

Existing calibration-based quantization methods aim to minimize the reconstruction error $\|\mathbf{W}_{l}\mathbf{X}_{l}-\widehat{\mathbf{W}}_{l}\mathbf{X}_{l}\|_{\mathrm{F}}^{2}$, conditional on the input $\mathbf{X}_{l}$ from a predetermined calibration dataset, to implicitly preserve the task-relevant information $I\bigl(\widehat{\mathbf{W}}_{l}X;\,\mathbf{W}_{l}X\bigr)$, where $X\sim p_{X}$ is drawn from the downstream data distribution and $I$ denotes the mutual information (Definition [C.4](https://arxiv.org/html/2605.05693#A3.Thmtheorem4)). However, minimizing the reconstruction error using only the calibration data does not explicitly control the deviation between $\widehat{\mathbf{W}}_{l}$ and $\mathbf{W}_{l}$, which is critical to the generalization performance of quantized LLMs on downstream tasks, as shown in [Theorem 3.1](https://arxiv.org/html/2605.05693#S3.Thmtheorem1).

###### Theorem 3\.1\.

Consider a restricted finite quantization class $\mathcal{Q}_{R}:=\{\widehat{\mathbf{W}}_{l}\in\mathcal{Q}:\|\widehat{\mathbf{W}}_{l}-\mathbf{W}_{l}\|_{\mathrm{F}}\leq R\}$. Define $\Delta\mathbf{W}_{l}:=\widehat{\mathbf{W}}_{l}-\mathbf{W}_{l}$, and assume that $\|X\|_{2}\leq M_{X}$ almost surely. Let the true downstream reconstruction risk be $\mathcal{R}(\widehat{\mathbf{W}}_{l}):=\mathbb{E}_{X\sim p_{X}}\!\left[\|\Delta\mathbf{W}_{l}X\|_{2}^{2}\right]$. Let the calibration set at layer $l$ be $\mathcal{D}_{\mathrm{cal},l}:=\{X_{l,i}\}_{i=1}^{n}$ with $X_{l,1},\dots,X_{l,n}\stackrel{\mathrm{i.i.d.}}{\sim}p_{X}$, and define the empirical calibration risk as $\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}_{l}):=\frac{1}{n}\sum_{i=1}^{n}\|\Delta\mathbf{W}_{l}X_{l,i}\|_{2}^{2}$. Then, for any $\delta\in(0,1)$, with probability at least $1-\delta$, the following holds simultaneously for all $\widehat{\mathbf{W}}_{l}\in\mathcal{Q}_{R}$:

$$\left|\mathcal{R}(\widehat{\mathbf{W}}_{l})-\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}_{l})\right|\leq R^{2}M_{X}^{2}\sqrt{\frac{\log\frac{2|\mathcal{Q}_{R}|}{\delta}}{2n}}.$$

See Appendix [D.2](https://arxiv.org/html/2605.05693#A4.SS2) for proof. The above theorem shows that the true downstream risk can be controlled by two terms: the empirical calibration risk and a generalization term depending on the radius $R$ of the weight drift. Large weight drift tends to increase the generalization risk. Therefore, minimizing only $\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}_{l})$ may lead to poor downstream generalization if the selected quantized weights have large drift from the original FP weights.
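To get a feel for how the right-hand side of Theorem 3.1 scales, the sketch below evaluates it for a few hypothetical settings. The function name and all numeric inputs are assumptions chosen for illustration. In particular, $|\mathcal{Q}_{R}|$ is held fixed here, whereas in practice it would typically grow with $R$.

```python
import math


def generalization_gap_bound(radius, m_x, class_size, n, delta):
    """Right-hand side of Theorem 3.1: R^2 M_X^2 sqrt(log(2|Q_R|/delta) / (2n))."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return radius ** 2 * m_x ** 2 * math.sqrt(math.log(2 * class_size / delta) / (2 * n))


# Hypothetical settings: only the ratios between them are of interest.
base = generalization_gap_bound(radius=1.0, m_x=2.0, class_size=1000, n=128, delta=0.05)
double_radius = generalization_gap_bound(radius=2.0, m_x=2.0, class_size=1000, n=128, delta=0.05)
four_times_data = generalization_gap_bound(radius=1.0, m_x=2.0, class_size=1000, n=512, delta=0.05)

print(double_radius / base)      # doubling R multiplies the bound by 4
print(four_times_data / base)    # quadrupling n halves the bound
```

The quadratic dependence on $R$ against the $1/\sqrt{n}$ dependence on the calibration set size suggests why limiting drift matters most when calibration data is scarce.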

This is also supported by the empirical results in [Figure 1](https://arxiv.org/html/2605.05693#S1.F1) and [Figure 4](https://arxiv.org/html/2605.05693#A6.F4) (in Appendix [F.2](https://arxiv.org/html/2605.05693#A6.SS2)): undesired deviation from the original FP weights can lower the performance of weight-only quantized models. In this sense, the performance degradation of vanilla calibration can be understood as moving away from the *sweet spot* that balances output mismatch and weight drift. These observations motivate the following regularized calibration objective for weight-only post-training quantization.

#### Regularized Quantization Calibration

[Theorem 3.1](https://arxiv.org/html/2605.05693#S3.Thmtheorem1) suggests that it is beneficial to control the difference between $\mathbf{W}_{l}$ and $\widehat{\mathbf{W}}_{l}$ for each quantized layer indexed by $l$. To control this deviation explicitly, we add a regularization term $\mathcal{D}(\widehat{\mathbf{W}}_{l},\mathbf{W}_{l})$ to the calibration objective, where $\mathcal{D}(\cdot,\cdot)$ is a discrepancy one may specify. That is,

$$\min_{\widehat{\mathbf{W}}_{l}\in\mathcal{Q}}\|\mathbf{W}_{l}\mathbf{X}_{l}-\widehat{\mathbf{W}}_{l}\mathbf{X}_{l}\|_{\mathrm{F}}^{2}+\lambda\,\mathcal{D}(\widehat{\mathbf{W}}_{l},\mathbf{W}_{l}),\tag{6}$$

where $\lambda>0$ controls the strength of regularization. Note that here we regard $\widehat{\mathbf{W}}_{l}$ and $\mathbf{W}_{l}$ as vectors in $\mathcal{D}(\widehat{\mathbf{W}}_{l},\mathbf{W}_{l})$, i.e., $\mathcal{D}(\mathrm{vec}(\widehat{\mathbf{W}}_{l}),\mathrm{vec}(\mathbf{W}_{l}))$ for some discrepancy or distance metric $\mathcal{D}$. To simplify notation, we drop $\mathrm{vec}(\cdot)$ when it is clear from the context. A natural choice for $\mathcal{D}$ is the Kullback–Leibler (KL) divergence. However, it leads to an ill-posed discrepancy measure, since $\mathrm{KL}(\delta_{\mathbf{W}_{l}},\delta_{\widehat{\mathbf{W}}_{l}})=0$ if $\mathbf{W}_{l}=\widehat{\mathbf{W}}_{l}$ and $+\infty$ otherwise, where $\delta_{\mathbf{W}_{l}}$ and $\delta_{\widehat{\mathbf{W}}_{l}}$ denote the Dirac measures supported at $\mathbf{W}_{l}$ and $\widehat{\mathbf{W}}_{l}$, respectively. In contrast, the Wasserstein-2 distance remains well-defined in this setting and reduces to the squared distance between the weights: $W_{2}^{2}(\delta_{\mathbf{W}_{l}},\delta_{\widehat{\mathbf{W}}_{l}})=\|\mathbf{W}_{l}-\widehat{\mathbf{W}}_{l}\|_{\mathrm{F}}^{2}=\|\mathrm{vec}(\mathbf{W}_{l})-\mathrm{vec}(\widehat{\mathbf{W}}_{l})\|_{2}^{2}$. That is, we select the quantized weights $\widehat{\mathbf{W}}_{l}\in\mathcal{Q}$ by minimizing:

$$\min_{\widehat{\mathbf{W}}_{l}\in\mathcal{Q}}\|\mathbf{W}_{l}\mathbf{X}_{l}-\widehat{\mathbf{W}}_{l}\mathbf{X}_{l}\|_{\mathrm{F}}^{2}+\lambda\|\widehat{\mathbf{W}}_{l}-\mathbf{W}_{l}\|_{\mathrm{F}}^{2}.\tag{7}$$

The reconstruction error encourages the quantized layer to preserve output-relevant behavior on inputs drawn from the calibration distribution, while the regularization term discourages unnecessary deviation from the original FP model. The following corollary establishes the connection between the regularized objective and [Theorem 3.1](https://arxiv.org/html/2605.05693#S3.Thmtheorem1).

###### Corollary 3\.2\.

Fix a layer indexed by $l$, and assume that $\mathcal{Q}$ is finite and the constrained feasible set $\mathcal{Q}_{R}:=\{\widehat{\mathbf{W}}^{\prime}\in\mathcal{Q}:\|\widehat{\mathbf{W}}^{\prime}-\mathbf{W}_{l}\|_{\mathrm{F}}^{2}\leq R^{2}\}$ is nonempty. Let $\widehat{\mathbf{W}}_{l}\in\arg\min_{\widehat{\mathbf{W}}^{\prime}\in\mathcal{Q}_{R}}\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}^{\prime})$. If the supportedness condition $\lambda_{\min}\leq\lambda_{\max}$ holds, then for every finite $\lambda_{R}$ satisfying $\lambda_{\min}\leq\lambda_{R}\leq\lambda_{\max}$,

$$\widehat{\mathbf{W}}_{l}\in\arg\min_{\widehat{\mathbf{W}}^{\prime}\in\mathcal{Q}}\left\{\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}^{\prime})+\lambda_{R}\|\widehat{\mathbf{W}}^{\prime}-\mathbf{W}_{l}\|_{\mathrm{F}}^{2}\right\},$$

where $\lambda_{\min}$ and $\lambda_{\max}$ are defined in Appendix [D.3](https://arxiv.org/html/2605.05693#A4.SS3). Conversely, if there exists a finite $\lambda_{R}\geq 0$ such that $\widehat{\mathbf{W}}_{l}\in\arg\min_{\widehat{\mathbf{W}}^{\prime}\in\mathcal{Q}}\{\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}^{\prime})+\lambda_{R}\|\widehat{\mathbf{W}}^{\prime}-\mathbf{W}_{l}\|_{\mathrm{F}}^{2}\}$, then necessarily $\lambda_{\min}\leq\lambda_{R}\leq\lambda_{\max}$.

See Appendix [D.3](https://arxiv.org/html/2605.05693#A4.SS3) for the proof. Corollary [3.2](https://arxiv.org/html/2605.05693#S3.Thmtheorem2) clarifies that the constrained calibration problem can be represented by a regularized calibration objective over a finite quantization set. In particular, when $\lambda_{\min}\le\lambda_{\max}$ holds, one can choose a penalty strength $\lambda_R\in[\lambda_{\min},\lambda_{\max}]$ such that $\widehat{\mathbf{W}}_l$ remains optimal after replacing the hard radius constraint $D(\widehat{\mathbf{W}}')\le R^{2}$ by the soft penalty $\lambda_R D(\widehat{\mathbf{W}}')$. The condition $\lambda_{\min}\le\lambda_{\max}$ simply states that the penalty must be large enough to suppress farther, possibly infeasible, low-risk candidates, but not so large that it favors closer, higher-risk feasible candidates. This is the exact condition under which the bias introduced by the penalty does not change the selected minimizer. See Appendix [D.3](https://arxiv.org/html/2605.05693#A4.SS3) for more details, the interpretation of $\lambda_{\min}$ and $\lambda_{\max}$, the reasonableness of the supportedness condition, and the connection to Lagrangian relaxation.
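The trade-off described above can be illustrated on a toy finite candidate set. The sketch below is not the construction from Appendix D.3, which is not reproduced here. It derives bounds on the penalty strength directly from the pairwise optimality inequality $\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}_l)+\lambda D(\widehat{\mathbf{W}}_l)\le\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}')+\lambda D(\widehat{\mathbf{W}}')$, so the exact form of the paper's $\lambda_{\min}$ and $\lambda_{\max}$ may differ. The function name `supported_lambda_interval` and the toy risk and distance values are hypothetical.

```python
import numpy as np

def supported_lambda_interval(risk, dist, radius_sq):
    """Return (index of the constrained minimizer, lower bound, upper bound).

    risk[i] stands in for the calibration risk of candidate i and dist[i]
    for its squared Frobenius distance to the original weights (toy values).
    """
    risk = np.asarray(risk, dtype=float)
    dist = np.asarray(dist, dtype=float)
    feasible = np.flatnonzero(dist <= radius_sq)
    if feasible.size == 0:
        raise ValueError("the feasible set Q_R is empty")
    star = int(feasible[np.argmin(risk[feasible])])

    lo, hi = 0.0, np.inf
    for r, d in zip(risk, dist):
        if d > dist[star]:
            # Farther candidate: the penalty must be large enough to reject it.
            lo = max(lo, (risk[star] - r) / (d - dist[star]))
        elif d < dist[star]:
            # Closer candidate: the penalty must stay small enough not to prefer it.
            hi = min(hi, (r - risk[star]) / (dist[star] - d))
    return star, lo, hi

def penalized_argmin(risk, dist, lam):
    """Index minimizing risk + lam * dist over the whole candidate set."""
    return int(np.argmin(np.asarray(risk) + lam * np.asarray(dist)))

# Toy candidates: index 0 has the lowest risk but lies outside the radius.
risk = [1.0, 2.0, 4.0, 7.0]
dist = [9.0, 4.0, 1.0, 0.0]
star, lo, hi = supported_lambda_interval(risk, dist, radius_sq=5.0)
print(star, lo, hi)
print(penalized_argmin(risk, dist, 0.5))  # inside [lo, hi]
print(penalized_argmin(risk, dist, 0.1))  # below lo: the far candidate wins
print(penalized_argmin(risk, dist, 1.0))  # above hi: a closer candidate wins
```

If the computed lower bound exceeds the upper bound, no single penalty strength reproduces the constrained minimizer in this toy setting, which is the situation the supportedness condition rules out.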

#### Saliency\-Aware Regularized Quantization Calibration

As discussed in [Section 2](https://arxiv.org/html/2605.05693#S2), a key observation in previous works is that introducing scaling factors $\widetilde{\mathbf{S}}_l$ is essential for maintaining performance, as it effectively transfers the difficulty of quantization from activations to weights (Xiao et al., [2023](https://arxiv.org/html/2605.05693#bib.bib65); Lin et al., [2024](https://arxiv.org/html/2605.05693#bib.bib38)). These scaling factors also quantify the saliency of the weights, which naturally leads to saliency-aware constraints on the weights and motivates augmenting the reconstruction error with a saliency-weighted penalty. We therefore quantize each layer of an LLM by minimizing the following regularized quantization calibration objective, namely *Saliency-Aware Regularized Quantization Calibration* (SARQC):

$$\min_{\widehat{\mathbf{W}}_l\in\mathcal{Q}}\ \underbrace{\|\mathbf{W}_l\mathbf{X}_l-\widehat{\mathbf{W}}_l\mathbf{X}_l\|_{\mathrm{F}}^{2}}_{\mathcal{L}_{\text{recon}}\,:=\,\text{Reconstruction Error}}+\lambda\,\underbrace{\|(\widehat{\mathbf{W}}_l-\mathbf{W}_l)\mathbf{S}_l\|_{\mathrm{F}}^{2}}_{\mathcal{L}_{\text{sar}}\,:=\,\text{Saliency-Aware Regularization}}, \tag{8}$$

where $\mathbf{S}_l$ encodes the saliency of the weights; see also [Equation 3](https://arxiv.org/html/2605.05693#S2.E3). Note that $\widetilde{\mathbf{S}}_l$ and $\mathbf{S}_l$ may both be present, in which case additional handling is required. In this setting, we find that encouraging $\mathbf{S}_l$ to be close to $\widetilde{\mathbf{S}}_l$ and removing the tuning hyperparameter associated with $\mathbf{S}_l$ works well in practice; see [Section 3.2](https://arxiv.org/html/2605.05693#S3.SS2) for details. The first term $\mathcal{L}_{\text{recon}}$ matches layer outputs between FP weights and quantized weights conditioned on the calibration data, while the second term $\mathcal{L}_{\text{sar}}$ penalizes deviations from the original FP weights in a saliency-aware manner. Since $\mathbf{S}_l=\mathrm{diag}(s_l)$ reweights different channels, larger saliency values impose stronger penalties on the corresponding weight deviations. In this way, the regularizer more strongly protects important channels that are more influential to the layer output under low-bit quantization.
When $\mathbf{S}_l=I$, the penalty reduces to a uniformly weighted regularizer and recovers the objective of RQC in [Equation 7](https://arxiv.org/html/2605.05693#S3.E7). *SARQC constrains information loss primarily along directions that are most relevant to the saliency.* We later show that this saliency-aware regularizer $\mathcal{L}_{\text{sar}}$ with $\mathbf{S}_l\neq I$ further helps maintain the performance of LLMs after post-training quantization, as supported by the experimental results in [Section 4.2](https://arxiv.org/html/2605.05693#S4.SS2).
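As a rough illustration of the objective in Equation 8, the following NumPy sketch evaluates both terms for a given pair of original and quantized weights. It assumes $\mathbf{W}_l$ has shape `(d_out, d_in)` and $\mathbf{X}_l$ has shape `(d_in, n)`, and it uses the channel-wise saliency $\mathrm{mean}(|\mathbf{X}_l^{(j)}|)/\mathrm{mean}(|\mathbf{W}_l^{(j)}|)$ that the paper reports for its grid-search variant. The function names, the small `eps` guard, and the crude rounding used to produce a stand-in for the quantized weights are assumptions made for this example only.

```python
import numpy as np

def saliency_vector(W, X, eps=1e-8):
    """Per-input-channel saliency: mean|X^(j)| / mean|W^(j)| (eps is an added guard)."""
    act = np.abs(X).mean(axis=1)  # row j of X holds input channel j
    wgt = np.abs(W).mean(axis=0)  # column j of W multiplies input channel j
    return act / (wgt + eps)

def sarqc_objective(W, W_hat, X, s, lam):
    """Return (total, recon, sar) for the regularized calibration objective."""
    delta = W_hat - W
    recon = np.linalg.norm(delta @ X, "fro") ** 2
    # Right-multiplying by diag(s) scales column j of delta by s[j].
    sar = np.linalg.norm(delta * s[None, :], "fro") ** 2
    return recon + lam * sar, recon, sar

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
X = rng.normal(size=(6, 32))
W_hat = np.round(W * 4) / 4  # crude stand-in for a quantized weight matrix

s = saliency_vector(W, X)
total, recon, sar = sarqc_objective(W, W_hat, X, s, lam=0.5)
# With s = 1 (S_l = I) the penalty is the plain squared Frobenius distance.
total_id, recon_id, sar_id = sarqc_objective(W, W_hat, X, np.ones(6), lam=0.5)
print(total, total_id)
```

Passing a vector of ones for `s` corresponds to the identity case, so the same function covers both the saliency-aware and the uniformly weighted penalty.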

### 3\.2Practical Strategies to Optimize SARQC

We now describe two practical strategies to optimize the proposed SARQC objective in [Equation 8](https://arxiv.org/html/2605.05693#S3.E8).

#### SARQC via Grid Search

Grid search over scaling factors $\widetilde{\mathbf{S}}_l$ is commonly used in PTQ methods to capture the saliency of weights. Selecting an optimal quantization operation for each layer therefore amounts to choosing a small set of parameters associated with these scaling factors. Let $\widetilde{\mathbf{S}}_l:=\mathrm{diag}(\tilde{s}_l(\alpha))$, where $\tilde{s}_l(\alpha)\in\mathbb{R}_{+}^{d_{\text{in}}}$ is a scaling vector constructed from (e.g., channel-wise) activation/weight statistics and parameterized by $\alpha$; see [Equation 3](https://arxiv.org/html/2605.05693#S2.E3) for the widely adopted choices. In this case, the quantized weights $\widehat{\mathbf{W}}_l(\alpha)$ are functions of $\alpha$ through the scaling matrix $\widetilde{\mathbf{S}}_l(\alpha)$. This naturally reframes the optimization of SARQC as a grid search over $\alpha\in[0,1]$ in the scaling factor $\widetilde{\mathbf{S}}_l(\alpha)$. Therefore, to preserve LLM performance after PTQ, SARQC selects $\widehat{\mathbf{W}}_l(\alpha)$ by optimizing

$$\min_{\alpha\in\{\alpha_k\}_{k=0}^{K-1}}\ \|\mathbf{W}_l\mathbf{X}_l-\widehat{\mathbf{W}}_l(\alpha)\mathbf{X}_l\|_{\mathrm{F}}^{2}+\lambda\,\|(\widehat{\mathbf{W}}_l(\alpha)-\mathbf{W}_l)\mathbf{S}_l\|_{\mathrm{F}}^{2}, \tag{9}$$

where, following the grid-search design in (Lin et al., [2024](https://arxiv.org/html/2605.05693#bib.bib38)), we consider a discrete grid $\{\alpha_k\}_{k=0}^{K-1}\subset[0,1]$, with $K\in\mathbb{N}^{+}$, $K\ge 2$, and $\alpha_k=k/(K-1)$ for $k=0,1,\dots,K-1$. In the first term, the candidate quantized weights $\widehat{\mathbf{W}}_l(\alpha)$ are generated through the scaling matrix $\widetilde{\mathbf{S}}_l(\alpha)$, for which we use the widely adopted choice in [Equation 3](https://arxiv.org/html/2605.05693#S2.E3). We find that setting $\mathbf{S}_l:=\mathrm{diag}(s_l)$, with the $j$-th element $s_l^{(j)}:=\mathrm{mean}(|\mathbf{X}_l^{(j)}|)/\mathrm{mean}(|\mathbf{W}_l^{(j)}|)$, performs well in practice; this choice is also inspired by (Xiao et al., [2023](https://arxiv.org/html/2605.05693#bib.bib65)). To balance the scale between the two terms, we apply min–max normalization and select the best candidate by minimizing the normalized joint objective. This normalization is also in line with Corollary [3.2](https://arxiv.org/html/2605.05693#S3.Thmtheorem2), as it keeps $\lambda$ in a proper range. It enables the seamless integration of SARQC into grid-search-based PTQ methods without requiring backpropagation through quantization operations or scaling factors. We refer to this approach as SARQC-GS (Saliency).
When $\mathbf{S}_l:=I$, we denote the resulting variant as SARQC-GS (Identity). See [Algorithm 1](https://arxiv.org/html/2605.05693#alg1) (in Appendix [E.1](https://arxiv.org/html/2605.05693#A5.SS1)) and Appendix [E.2](https://arxiv.org/html/2605.05693#A5.SS2) for more details.
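The grid-search procedure can be sketched as follows. This is only an illustration under stated assumptions, not the paper's Algorithm 1. The per-row round-to-nearest quantizer `rtn_quantize` is a hypothetical stand-in, the scaling vector $\tilde{s}_l(\alpha)=\mathrm{mean}(|\mathbf{X}_l^{(j)}|)^{\alpha}$ is an assumed AWQ-style form because Equation 3 is not reproduced in this section, and the normalized joint objective is taken to be the min–max normalized reconstruction term plus $\lambda$ times the min–max normalized penalty across the $K$ candidates.

```python
import numpy as np

def rtn_quantize(W, bits=4):
    """Hypothetical per-row symmetric round-to-nearest quantizer (illustration only)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(W / scale), -qmax - 1, qmax) * scale

def minmax(v):
    """Min-max normalize a vector of candidate scores to [0, 1]."""
    span = v.max() - v.min()
    return np.zeros_like(v) if span == 0 else (v - v.min()) / span

def sarqc_grid_search(W, X, lam=0.5, K=11, bits=4, eps=1e-8):
    act = np.abs(X).mean(axis=1)              # per-input-channel activation statistic
    s = act / (np.abs(W).mean(axis=0) + eps)  # saliency used in the penalty
    alphas = np.arange(K) / (K - 1)           # alpha_k = k / (K - 1)
    candidates, recon, sar = [], [], []
    for alpha in alphas:
        s_tilde = np.maximum(act, eps) ** alpha            # assumed scaling form
        W_hat = rtn_quantize(W * s_tilde, bits) / s_tilde  # quantize scaled weights, undo scaling
        delta = W_hat - W
        candidates.append(W_hat)
        recon.append(np.linalg.norm(delta @ X) ** 2)
        sar.append(np.linalg.norm(delta * s) ** 2)
    joint = minmax(np.array(recon)) + lam * minmax(np.array(sar))
    k = int(np.argmin(joint))
    return alphas[k], candidates[k], joint

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
# Give input channels different magnitudes so the scaling has something to exploit.
X = rng.normal(size=(16, 64)) * rng.uniform(0.1, 5.0, size=(16, 1))
alpha, W_hat, joint = sarqc_grid_search(W, X)
print(alpha, joint.min())
```

No gradients are needed at any point: each candidate is produced by a forward quantization pass and scored, which matches the paper's remark that the method avoids backpropagation through quantization operations or scaling factors.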

#### SARQC via Gram\-Based Solvers

An alternative approach for selecting the optimal quantized weights $\widehat{\mathbf{W}}_l$ is to employ second-order Gram-based solvers to minimize the SARQC objective, as in (Frantar et al., [2023](https://arxiv.org/html/2605.05693#bib.bib20); Li et al., [2025](https://arxiv.org/html/2605.05693#bib.bib36)). Let $\Delta\mathbf{W}_l:=\widehat{\mathbf{W}}_l-\mathbf{W}_l$. Since $\|\mathbf{W}_l\mathbf{X}_l-\widehat{\mathbf{W}}_l\mathbf{X}_l\|_{\mathrm{F}}^{2}=\|\Delta\mathbf{W}_l\mathbf{X}_l\|_{\mathrm{F}}^{2}=\mathrm{Tr}(\Delta\mathbf{W}_l\mathbf{X}_l\mathbf{X}_l^{\top}\Delta\mathbf{W}_l^{\top})$ and, likewise, $\|\Delta\mathbf{W}_l\mathbf{S}_l\|_{\mathrm{F}}^{2}=\mathrm{Tr}(\Delta\mathbf{W}_l\mathbf{S}_l\mathbf{S}_l^{\top}\Delta\mathbf{W}_l^{\top})$, we can rewrite ([8](https://arxiv.org/html/2605.05693#S3.E8)) as

$$\min_{\widehat{\mathbf{W}}_l\in\mathcal{Q}}\ \mathrm{Tr}\Big(\Delta\mathbf{W}_l\,\mathbf{G}_l\,\Delta\mathbf{W}_l^{\top}\Big), \tag{10}$$

where the regularized curvature matrix used in SARQC-GBS is $\mathbf{G}_l:=\mathbf{X}_l\mathbf{X}_l^{\top}+\lambda\mathbf{S}_l\mathbf{S}_l^{\top}$. To match the practical scale of the empirical Gram matrix across layers, we use

$$\mathbf{S}_l:=\mathbf{S}_l(\gamma)=\sqrt{\bar{h}_l}\,\operatorname{diag}\!\left(\frac{s_l(\gamma)}{\sqrt{\operatorname{mean}(s_l(\gamma)^{2})}}\right),\qquad \bar{h}_l:=\operatorname{mean}\big(\operatorname{diag}(\mathbf{X}_l\mathbf{X}_l^{\top})\big),$$

where $s_l(\gamma)$ is defined channel-wise by $s_l^{(j)}(\gamma):=\mathrm{mean}(|\mathbf{X}_l^{(j)}|)^{\gamma}/\mathrm{mean}(|\mathbf{W}_l^{(j)}|)^{1-\gamma}$ and $\gamma$ controls the relative contribution of activation and weight statistics. The factor $\bar{h}_l$ aligns the regularization scale with the empirical Gram matrix, while the normalization by $\operatorname{mean}(s_l(\gamma)^{2})$ keeps the saliency factors comparable across layers. This scaling is also in line with Corollary [3.2](https://arxiv.org/html/2605.05693#S3.Thmtheorem2), as it keeps $\lambda$ in a proper range. Compared with GPTQ (Frantar et al., [2023](https://arxiv.org/html/2605.05693#bib.bib20)), SARQC preserves the same layer-wise calibration structure while replacing the Gram matrix $\mathbf{X}_l\mathbf{X}_l^{\top}$ with the modified curvature $\mathbf{G}_l$.
This allows SARQC to be optimized in a GPTQ-style sequential manner by using $\mathbf{G}_l$ and its inverse or Cholesky factor in place of $\mathbf{X}_l\mathbf{X}_l^{\top}$, while retaining the same compensation mechanism. We refer to this approach as SARQC-GBS (Saliency). If we set $\mathbf{S}_l:=I$, we denote the variant as SARQC-GBS (Identity). See Appendix [D.1](https://arxiv.org/html/2605.05693#A4.SS1) for the detailed derivations and further discussion, and see [Algorithm 2](https://arxiv.org/html/2605.05693#alg2) (in Appendix [E.1](https://arxiv.org/html/2605.05693#A5.SS1)) and Appendix [E.2](https://arxiv.org/html/2605.05693#A5.SS2) for more details.
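The construction of the regularized curvature matrix can be checked numerically. The sketch below builds $\mathbf{G}_l$ from the formulas above and verifies that the trace form in Equation 10 equals the sum of the reconstruction term and $\lambda$ times the saliency penalty. It covers only the construction of $\mathbf{G}_l$, not the sequential GPTQ-style solver, and the function name `sarqc_gram` and the `eps` guard are assumptions of this example.

```python
import numpy as np

def sarqc_gram(W, X, lam=0.5, gamma=0.5, eps=1e-8):
    """Build G = X X^T + lam * S S^T with the normalized saliency S(gamma)."""
    H = X @ X.T                                  # empirical Gram matrix, shape (d_in, d_in)
    h_bar = np.mean(np.diag(H))
    act = np.abs(X).mean(axis=1)                 # mean |X^(j)| per input channel
    wgt = np.abs(W).mean(axis=0)                 # mean |W^(j)| per input channel
    s = (act + eps) ** gamma / (wgt + eps) ** (1.0 - gamma)
    s_norm = np.sqrt(h_bar) * s / np.sqrt(np.mean(s ** 2))
    G = H + lam * np.diag(s_norm ** 2)           # S is diagonal, so S S^T = diag(s_norm^2)
    return G, s_norm

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
X = rng.normal(size=(6, 4))                      # fewer samples than input channels
lam = 0.5
G, s_norm = sarqc_gram(W, X, lam=lam, gamma=0.5)

delta = 0.01 * rng.normal(size=(4, 6))           # stand-in for W_hat - W
trace_form = np.trace(delta @ G @ delta.T)
two_terms = np.linalg.norm(delta @ X) ** 2 + lam * np.linalg.norm(delta * s_norm) ** 2
print(trace_form, two_terms)

# The diagonal term makes G positive definite here even though X X^T is singular.
L = np.linalg.cholesky(G)
```

By construction the mean of the diagonal of $\mathbf{S}_l\mathbf{S}_l^{\top}$ equals $\bar{h}_l$, so in the identity case the matrix reduces to $\mathbf{X}_l\mathbf{X}_l^{\top}+\lambda\bar{h}_l I$.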

## 4Experimental Results

In this section, we compare the proposed method, SARQC, with widely adopted weight-only post-training quantization methods across a diverse range of LLMs and downstream tasks. We describe the experimental setup in [Section 4.1](https://arxiv.org/html/2605.05693#S4.SS1) and analyze the results in [Section 4.2](https://arxiv.org/html/2605.05693#S4.SS2). Extra details are given in Appendix [E](https://arxiv.org/html/2605.05693#A5) and additional experimental results in Appendix [F](https://arxiv.org/html/2605.05693#A6).

### 4\.1Experimental Setup

#### Baselines

We compare our method with representative layer-wise PTQ baselines based on linear quantization. Specifically, we consider AWQ (Lin et al., [2024](https://arxiv.org/html/2605.05693#bib.bib38)), which uses statistics-guided scaling with grid search, as the baseline for SARQC-GS. As baselines for SARQC-GBS, we consider GPTQ (Frantar et al., [2023](https://arxiv.org/html/2605.05693#bib.bib20)), which performs layer-wise reconstruction using second-order Hessian information, and GPTAQ (Li et al., [2025](https://arxiv.org/html/2605.05693#bib.bib36)), a refined variant of GPTQ. The GPTAQ baseline results are omitted for all MoE models because its implementation lacks support for MoE architectures.

#### Models

We consider both *dense LLMs* and *Mixture-of-Experts (MoE) LLMs*. For dense models, we use LLaMA2 (7B and 13B) (Touvron et al., [2023](https://arxiv.org/html/2605.05693#bib.bib42)) and LLaMA (7B, 13B and 30B, reported in Appendix [F.4](https://arxiv.org/html/2605.05693#A6.SS4) due to the page limit). For MoE models, we use DeepSeek-MoE-16B-Base (Dai et al., [2024](https://arxiv.org/html/2605.05693#bib.bib11)), Qwen3-MoE-30B (Yang et al., [2025](https://arxiv.org/html/2605.05693#bib.bib55)), and Mixtral-8x7B (Jiang et al., [2024](https://arxiv.org/html/2605.05693#bib.bib47)).

#### Evaluations

We evaluate quantization performance using two types of metrics: (i) *perplexity* on the test split of WikiText2 (Merity et al., [2017](https://arxiv.org/html/2605.05693#bib.bib62)); (ii) *zero-shot accuracy* on a suite of commonsense reasoning and knowledge benchmarks including PIQA, HellaSwag (Zellers et al., [2019](https://arxiv.org/html/2605.05693#bib.bib28)), MMLU (Hendrycks et al., [2021](https://arxiv.org/html/2605.05693#bib.bib48)), HumanEval (Chen et al., [2021](https://arxiv.org/html/2605.05693#bib.bib30)), BoolQ (Clark et al., [2019](https://arxiv.org/html/2605.05693#bib.bib6)), ARC-Challenge, ARC-Easy (Clark et al., [2018](https://arxiv.org/html/2605.05693#bib.bib2)), and WinoGrande (Sakaguchi et al., [2021](https://arxiv.org/html/2605.05693#bib.bib63)).

#### Implementation

We investigate weight-only post-training quantization at low bit widths: 4-bit, 3-bit, and 2-bit. We apply group-wise quantization to all linear layers, with a group size of 64 for MoE models and 128 for dense models. Unless otherwise stated in the ablation studies, all methods use the same fixed set of 128 calibration samples from the training split of WikiText2 (Merity et al., [2017](https://arxiv.org/html/2605.05693#bib.bib62)) to ensure a controlled comparison. We do not introduce additional transformations or reordering of weights or activations, so that the results reflect the intrinsic performance of each method. For hyperparameter selection, SARQC-GS searches $\lambda$ over $\{0.1,0.2,\ldots,1.0\}$ on the calibration set. SARQC-GBS performs layer-wise selection with $\lambda\in\{0.25,0.5,0.75\}$ and $\gamma\in\{0.1,0.15,0.35,0.5\}$ on the calibration set. All experiments are conducted on NVIDIA A100 80GB GPUs. See Appendix [E.2](https://arxiv.org/html/2605.05693#A5.SS2) for extra details.
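For readers unfamiliar with group-wise quantization, the sketch below shows one common form: each row of the weight matrix is split into contiguous groups along the input dimension, and each group gets its own scale and offset. The paper does not specify its quantizer in this section, so the asymmetric min–max scheme and the function name `groupwise_quantize` are assumptions made for illustration.

```python
import numpy as np

def groupwise_quantize(W, bits=4, group_size=128):
    """Assumed asymmetric min-max uniform quantizer applied per group of input channels."""
    d_out, d_in = W.shape
    if d_in % group_size != 0:
        raise ValueError("d_in must be divisible by group_size")
    groups = W.reshape(d_out, d_in // group_size, group_size)
    w_min = groups.min(axis=-1, keepdims=True)
    w_max = groups.max(axis=-1, keepdims=True)
    levels = 2 ** bits - 1                       # number of quantization steps per group
    scale = (w_max - w_min) / levels
    scale = np.where(scale == 0, 1.0, scale)     # guard against constant groups
    q = np.clip(np.round((groups - w_min) / scale), 0, levels)
    return (q * scale + w_min).reshape(d_out, d_in)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 256))
W_hat = groupwise_quantize(W, bits=4, group_size=128)
print(np.abs(W_hat - W).max())
```

A smaller group size gives each group a tighter range and hence a finer step, at the cost of storing more scales and offsets.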

Table 1: Perplexity of quantized models under different bit-widths.

### 4\.2Results Analysis

*(1) When does vanilla reconstruction error-based calibration become unreliable?* Vanilla calibration based solely on reconstruction error becomes increasingly unreliable in challenging quantization regimes, especially under more aggressive low-bit settings such as W2A16 and W3A16. As shown in [Table 1](https://arxiv.org/html/2605.05693#S4.T1), this failure can manifest as a dramatic increase in perplexity, indicating that preserving calibration-set reconstruction alone is insufficient to maintain the original generation behavior of LLMs. Even under the milder W4A16 setting, the same tendency is reflected in downstream zero-shot performance in [Table 2](https://arxiv.org/html/2605.05693#S4.T2), where reconstruction-only calibration often leads to noticeable accuracy degradation relative to the original LLMs. This issue becomes even more pronounced when the calibration set is small. As illustrated in [Figure 2](https://arxiv.org/html/2605.05693#S4.F2)*(b)*, GPTQ degrades substantially under both W3A16 and W2A16 when only limited calibration data are available. Taken together, these results suggest that standard calibration objectives are vulnerable to limited or unrepresentative calibration data.

Table 2: Zero-shot accuracy (%) on multiple benchmarks under W4A16.

*(2) Does SARQC improve robustness and overall performance, especially when vanilla calibration degrades?* Yes. SARQC consistently delivers stronger robustness and better task performance precisely in the regimes where vanilla calibration becomes unstable. In terms of perplexity, SARQC remains much more stable than standard calibration methods across both dense and MoE LLMs, particularly in the hardest low-bit settings. For example, under W2A16 in [Table 1](https://arxiv.org/html/2605.05693#S4.T1), SARQC reduces the perplexity of LLaMA2-13B to 117.67, compared with 792.32 for GPTQ and 176.53 for GPTAQ. A similar pattern also holds for MoE models. On Qwen3-MoE-30B, SARQC reduces perplexity to 16.69, compared with 38.03 for GPTQ. This robustness further translates into performance gains on downstream tasks. Under W4A16, SARQC achieves higher average zero-shot accuracy than the baselines on both dense and MoE models in [Table 2](https://arxiv.org/html/2605.05693#S4.T2). For instance, SARQC-GS (Saliency) attains the best average accuracy on both LLaMA2-7B and LLaMA2-13B, and SARQC-GBS (Saliency) achieves the best average accuracy on both DeepSeek-MoE-16B and Qwen3-MoE-30B. Similar performance is also observed under W3A16, as reported in [Table 6](https://arxiv.org/html/2605.05693#A6.T6). Moreover, the ablation study on the calibration sample size in [Figure 2](https://arxiv.org/html/2605.05693#S4.F2)*(b)* shows that SARQC-GBS consistently outperforms GPTQ under both W3A16 and W2A16, with a larger gap at smaller sample sizes. This further confirms that SARQC performs well not only under low-bit scenarios but also under calibration data scarcity.

![Refer to caption](https://arxiv.org/html/2605.05693v1/x3.png)\(a\)
![Refer to caption](https://arxiv.org/html/2605.05693v1/x4.png)\(b\)

Figure 2: Ablation study. *(a) Extension to OmniQuant*: the y-axis shows downstream task accuracy, and the bars compare FP16/BF16, OmniQuant, and OmniQuant augmented with SARQC on the reported tasks for LLaMA2-7B and Mixtral-8x7B under W4A16. *(b) Effect of calibration size*: the x-axis shows the number of calibration samples, and the y-axis shows average downstream accuracy. The vertical dashed segments indicate the accuracy gap between GPTQ and SARQC-GBS.

*(3) Is saliency-aware regularization more effective than its non-saliency-aware counterpart?* Overall, yes. The saliency-aware variant consistently performs better in practice than the corresponding identity-based variant. From [Table 2](https://arxiv.org/html/2605.05693#S4.T2), the saliency-aware regularizer yields higher average accuracy than its identity-based counterpart on all five evaluated models for the grid-search solver, and on four out of five models for the Gram-based solver. The same trend is also supported by the perplexity results in [Table 1](https://arxiv.org/html/2605.05693#S4.T1), where the saliency-aware variants are generally more competitive, especially in the more difficult low-bit settings. These results indicate that explicitly accounting for weight saliency helps preserve information better during calibration and leads to better performance.

*(4) Can SARQC be extended further?* Yes. SARQC is not tied to a particular solver or calibration pipeline, but instead acts as a general regularization principle that can be incorporated into other post-training quantization frameworks. This is evidenced by the results in [Figure 2](https://arxiv.org/html/2605.05693#S4.F2)*(a)*, where integrating SARQC into OmniQuant (Shao et al., [2024](https://arxiv.org/html/2605.05693#bib.bib53)) consistently improves performance across different architectures and downstream tasks. On both LLaMA2-7B and Mixtral-8x7B, SARQC consistently outperforms the original OmniQuant baseline, and the saliency-aware variant achieves the best overall performance in all cases. These findings demonstrate that SARQC can serve as a practical plug-in module for other PTQ methods without introducing additional computational overhead during inference; see also [Table 10](https://arxiv.org/html/2605.05693#A6.T10) (in Appendix [F.7](https://arxiv.org/html/2605.05693#A6.SS7)) for the ablation study on the impact of inference speedup.

## 5Conclusion

In this work, we propose SARQC, a general post-training quantization calibration framework for LLMs. SARQC augments standard quantization calibration objectives with a penalty that explicitly balances reconstruction fidelity and weight drift from the original floating-point weights, motivated by generalization-risk analysis and constrained optimization principles. Two practical optimization algorithms are proposed for SARQC: one built on grid search over scaling factors, and the other built on Gram-based solvers. SARQC is easy to implement, incurs no extra computational overhead during inference, and can be seamlessly integrated into existing PTQ pipelines. Experiments on dense and MoE LLMs show consistent improvements in perplexity and zero-shot accuracy over reconstruction-only baselines.

This work also has several limitations. For instance, we do not evaluate SARQC on extremely large LLMs due to limited computational resources. The current design of the saliency factors may also not be optimal, as it is inspired by previous work. There are a number of possible extensions. First, it would be valuable to evaluate the performance of SARQC on extremely large LLMs. Second, it may be worthwhile to extend our method to account for error propagation in the PTQ calibration process, since both the proposed method and the theoretical analysis are conducted in a layer-wise manner. Third, it would be interesting to investigate how well the method generalizes to weight–activation quantization scenarios.

## References

- \[1\]\(1990\)Information theory\.Dover Publications,New York\.External Links:ISBN 9780486665214,[Link](https://store.doverpublications.com/products/9780486665214)Cited by:[Definition C\.4](https://arxiv.org/html/2605.05693#A3.Thmtheorem4)\.
- \[2\]S\. Ashkboos, A\. Mohtashami, M\. L\. Croci, B\. Li, P\. Cameron, M\. Jaggi, D\. Alistarh, T\. Hoefler, and J\. Hensman\(2024\)QuaRot: outlier\-free 4\-bit inference in rotated LLMs\.InAdvances in Neural Information Processing Systems,Vol\.37,pp\. 100213–100240\.External Links:[Document](https://dx.doi.org/10.52202/079017-3180),[Link](https://proceedings.neurips.cc/paper_files/paper/2024/hash/b5b939436789f76f08b9d0da5e81af7c-Abstract-Conference.html)Cited by:[§2](https://arxiv.org/html/2605.05693#S2.SS0.SSS0.Px2.p2.1)\.
- \[3\]R\. Banner, Y\. Nahshan, and D\. Soudry\(2019\)Post training 4\-bit quantization of convolutional networks for rapid\-deployment\.InAdvances in Neural Information Processing Systems,Vol\.32,pp\. 7950–7958\.External Links:[Link](https://proceedings.neurips.cc/paper/2019/hash/c0a62e133894cdce435bcb4a5df1db2d-Abstract.html)Cited by:[§1](https://arxiv.org/html/2605.05693#S1.p2.1)\.
- \[4\]S\. Boyd and L\. Vandenberghe\(2004\)Convex optimization\.Cambridge University Press,Cambridge\.External Links:[Document](https://dx.doi.org/10.1017/CBO9780511804441),[Link](https://doi.org/10.1017/CBO9780511804441)Cited by:[Appendix B](https://arxiv.org/html/2605.05693#A2.SS0.SSS0.Px2.p1.1),[§C\.4](https://arxiv.org/html/2605.05693#A3.SS4.p2.2),[§D\.3](https://arxiv.org/html/2605.05693#A4.SS3.SSS0.Px3.p1.8)\.
- \[5\]M\. A\. Bragin\(2024\)Survey on Lagrangian relaxation for MILP: importance, challenges, historical review, recent advancements, and opportunities\.Annals of Operations Research333\(1\),pp\. 29–45\.External Links:[Document](https://dx.doi.org/10.1007/s10479-023-05499-9),[Link](https://doi.org/10.1007/s10479-023-05499-9)Cited by:[Appendix B](https://arxiv.org/html/2605.05693#A2.SS0.SSS0.Px2.p1.1),[§C\.4](https://arxiv.org/html/2605.05693#A3.SS4.p3.1)\.
- \[6\]S\. Cha, H\. Chen, D\. Kim, H\. Zhang, K\. Chan, G\. de Veciana, and H\. Vikalo\(2026\)Regularized calibration with successive rounding for post\-training quantization\.arXiv preprint arXiv:2602\.05902\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2602.05902),2602\.05902Cited by:[Appendix B](https://arxiv.org/html/2605.05693#A2.SS0.SSS0.Px1.p2.1)\.
- \[7\]J\. Chee, Y\. Cai, V\. Kuleshov, and C\. De Sa\(2023\)QuIP: 2\-bit quantization of large language models with guarantees\.InAdvances in Neural Information Processing Systems,Vol\.36,pp\. 4396–4429\.External Links:[Document](https://dx.doi.org/10.52202/075280-0196),2307\.13304,[Link](https://arxiv.org/abs/2307.13304)Cited by:[§2](https://arxiv.org/html/2605.05693#S2.SS0.SSS0.Px2.p2.1)\.
- \[8\]M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. de Oliveira Pinto,et al\.\(2021\)Evaluating large language models trained on code\.External Links:2107\.03374,[Document](https://dx.doi.org/10.48550/arXiv.2107.03374)Cited by:[§4\.1](https://arxiv.org/html/2605.05693#S4.SS1.SSS0.Px3.p1.1)\.
- \[9\]C\. Clark, K\. Lee, M\. Chang, T\. Kwiatkowski, M\. Collins, and K\. Toutanova\(2019\)BoolQ: exploring the surprising difficulty of natural yes/no questions\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\),pp\. 2924–2936\.External Links:[Document](https://dx.doi.org/10.18653/v1/N19-1300),[Link](https://aclanthology.org/N19-1300/)Cited by:[§4\.1](https://arxiv.org/html/2605.05693#S4.SS1.SSS0.Px3.p1.1)\.
- \[10\]P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord\(2018\)Think you have solved question answering? try ARC, the AI2 reasoning challenge\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.1803.05457),1803\.05457,[Link](https://arxiv.org/abs/1803.05457)Cited by:[§4\.1](https://arxiv.org/html/2605.05693#S4.SS1.SSS0.Px3.p1.1)\.
- \[11\]D\. Dai, C\. Deng, C\. Zhao, R\. X\. Xu, H\. Gao,et al\.\(2024\)DeepSeekMoE: towards ultimate expert specialization in mixture\-of\-experts language models\.External Links:2401\.06066,[Document](https://dx.doi.org/10.48550/arXiv.2401.06066)Cited by:[§4\.1](https://arxiv.org/html/2605.05693#S4.SS1.SSS0.Px2.p1.1)\.
- \[12\]T\. Dettmers, R\. A\. Svirschevski, V\. Egiazarian, D\. Kuznedelev, E\. Frantar, S\. Ashkboos, A\. Borzunov, T\. Hoefler, and D\. Alistarh\(2024\)SpQR: a sparse\-quantized representation for near\-lossless LLM weight compression\.InThe Twelfth International Conference on Learning Representations,External Links:[Document](https://dx.doi.org/10.48550/arXiv.2306.03078),2306\.03078,[Link](https://openreview.net/forum?id=Q1u25ahSuy)Cited by:[Appendix B](https://arxiv.org/html/2605.05693#A2.SS0.SSS0.Px1.p2.1)\.
- \[13\]G\. Di Pillo and L\. Grippo\(1989\)Exact penalty functions in constrained optimization\.SIAM Journal on Control and Optimization27\(6\),pp\. 1333–1360\.External Links:[Document](https://dx.doi.org/10.1137/0327068)Cited by:[§D\.3](https://arxiv.org/html/2605.05693#A4.SS3.SSS0.Px2.p3.1)\.
- \[14\]Y\. Diouane, M\. Gollier, and D\. Orban\(2026\)Nonsmooth exact penalty methods for equality\-constrained optimization: complexity and implementation\.SIAM Journal on Optimization36\(2\),pp\. 626–650\.External Links:[Document](https://dx.doi.org/10.1137/24M1705974)Cited by:[Appendix B](https://arxiv.org/html/2605.05693#A2.SS0.SSS0.Px2.p1.1),[§C\.4](https://arxiv.org/html/2605.05693#A3.SS4.p3.1)\.
- \[15\]A\. Edalati, A\. Ghaffari, M\. G\. Nejad, L\. Hou, B\. Chen, M\. Asgharian, and V\. P\. Nia\(2025\)OAC: output\-adaptive calibration for accurate post\-training quantization\.InProceedings of the Thirty\-Ninth AAAI Conference on Artificial Intelligence and Thirty\-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifteenth Symposium on Educational Advances in Artificial Intelligence,AAAI’25/IAAI’25/EAAI’25\.External Links:ISBN 978\-1\-57735\-897\-8,[Link](https://doi.org/10.1609/aaai.v39i16.33807),[Document](https://dx.doi.org/10.1609/aaai.v39i16.33807)Cited by:[Appendix B](https://arxiv.org/html/2605.05693#A2.SS0.SSS0.Px1.p2.1)\.
- \[16\]M\. Ehrgott\(2005\)Multicriteria optimization\.2 edition,Springer,Berlin, Heidelberg\.External Links:[Document](https://dx.doi.org/10.1007/3-540-27659-9)Cited by:[§D\.3](https://arxiv.org/html/2605.05693#A4.SS3.SSS0.Px2.p2.2),[§D\.3](https://arxiv.org/html/2605.05693#A4.SS3.p2.6)\.
- \[17\]H\. Everett\(1963\)Generalized lagrange multiplier method for solving problems of optimum allocation of resources\.Operations Research11\(3\),pp\. 399–417\.External Links:[Document](https://dx.doi.org/10.1287/opre.11.3.399)Cited by:[§D\.3](https://arxiv.org/html/2605.05693#A4.SS3.SSS0.Px3.p1.8)\.
- \[18\]M\. L\. Fisher\(1981\)The Lagrangian relaxation method for solving integer programming problems\.Management Science27\(1\),pp\. 1–18\.External Links:[Document](https://dx.doi.org/10.1287/mnsc.27.1.1)Cited by:[§D\.3](https://arxiv.org/html/2605.05693#A4.SS3.SSS0.Px2.p3.1),[§D\.3](https://arxiv.org/html/2605.05693#A4.SS3.SSS0.Px3.p1.8)\.
- \[19\]E\. Frantar, S\. Ashkboos, T\. Hoefler, and D\. Alistarh\(2023\)GPTQ: accurate post\-training quantization for generative pre\-trained transformers\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2210.17323),2210\.17323Cited by:[Appendix B](https://arxiv.org/html/2605.05693#A2.SS0.SSS0.Px1.p2.1),[§C\.2](https://arxiv.org/html/2605.05693#A3.SS2.SSS0.Px2.p1.2),[§C\.2](https://arxiv.org/html/2605.05693#A3.SS2.p1.1),[§1](https://arxiv.org/html/2605.05693#S1.SS0.SSS0.Px1.p1.5),[§1](https://arxiv.org/html/2605.05693#S1.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.05693#S1.p2.1),[§2](https://arxiv.org/html/2605.05693#S2.SS0.SSS0.Px4.p1.2),[§3\.2](https://arxiv.org/html/2605.05693#S3.SS2.SSS0.Px2.p1.17),[§3\.2](https://arxiv.org/html/2605.05693#S3.SS2.SSS0.Px2.p1.3),[§4\.1](https://arxiv.org/html/2605.05693#S4.SS1.SSS0.Px1.p1.1)\.
- \[20\]E\. Frantar, S\. P\. Singh, and D\. Alistarh\(2022\)Optimal brain compression: a framework for accurate post\-training quantization and pruning\.InAdvances in Neural Information Processing Systems,Vol\.35,pp\. 4475–4488\.External Links:[Document](https://dx.doi.org/10.52202/068431-0323),[Link](https://proceedings.neurips.cc/paper_files/paper/2022/hash/1caf09c9f4e6b0150b06a07e77f2710c-Abstract-Conference.html)Cited by:[Appendix B](https://arxiv.org/html/2605.05693#A2.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.05693#S1.p2.1),[§2](https://arxiv.org/html/2605.05693#S2.SS0.SSS0.Px4.p1.2)\.
- \[21\]A\. M\. Geoffrion\(1968\)Proper efficiency and the theory of vector maximization\.Journal of Mathematical Analysis and Applications22\(3\),pp\. 618–630\.External Links:[Document](https://dx.doi.org/10.1016/0022-247X%2868%2990201-1)Cited by:[§D\.3](https://arxiv.org/html/2605.05693#A4.SS3.SSS0.Px2.p2.2),[§D\.3](https://arxiv.org/html/2605.05693#A4.SS3.p2.6)\.
- \[22\]A\. M\. Geoffrion\(1974\)Lagrangean relaxation for integer programming\.InApproaches to Integer Programming,M\. L\. Balinski \(Ed\.\),Mathematical Programming Studies, Vol\.2,pp\. 82–114\.External Links:[Document](https://dx.doi.org/10.1007/BFb0120690),[Link](https://doi.org/10.1007/BFb0120690)Cited by:[§D\.3](https://arxiv.org/html/2605.05693#A4.SS3.SSS0.Px2.p3.1),[§D\.3](https://arxiv.org/html/2605.05693#A4.SS3.SSS0.Px3.p1.8)\.
- \[23\]A\. Ghaffari, S\. Younesian, B\. Chen, V\. Partovi Nia, and M\. Asgharian\(2025\)Rethinking post\-training quantization: introducing a statistical pre\-calibration approach\.InProceedings of the 14th International Conference on Pattern Recognition Applications and Methods,pp\. 159–169\.External Links:ISBN 978\-989\-758\-730\-6,[Document](https://dx.doi.org/10.5220/0013348800003905)Cited by:[Appendix B](https://arxiv.org/html/2605.05693#A2.SS0.SSS0.Px1.p2.1)\.
- \[24\]A\. Gholami, S\. Kim, Z\. Dong, Z\. Yao, M\. W\. Mahoney, and K\. Keutzer\(2021\)A survey of quantization methods for efficient neural network inference\.External Links:2103\.13630,[Document](https://dx.doi.org/10.48550/arXiv.2103.13630)Cited by:[§1](https://arxiv.org/html/2605.05693#S1.p2.1)\.
- \[25\]A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian,et al\.\(2024\)The Llama 3 herd of models\.External Links:2407\.21783,[Document](https://dx.doi.org/10.48550/arXiv.2407.21783)Cited by:[§1](https://arxiv.org/html/2605.05693#S1.p1.1)\.
- \[26\]S\.\-P\. Han and O\. L\. Mangasarian\(1979\)Exact penalty functions in nonlinear programming\.Mathematical Programming17,pp\. 251–269\.External Links:[Document](https://dx.doi.org/10.1007/BF01588250)Cited by:[§D\.3](https://arxiv.org/html/2605.05693#A4.SS3.SSS0.Px2.p3.1)\.
- \[27\]B\. Hassibi, D\. G\. Stork, and G\. Wolff\(1993\)Optimal brain surgeon: extensions and performance comparisons\.InAdvances in Neural Information Processing Systems,Vol\.6\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/1993/file/b056eb1587586b71e2da9acfe4fbd19e-Paper.pdf)Cited by:[§2](https://arxiv.org/html/2605.05693#S2.SS0.SSS0.Px4.p1.2)\.
- \[28\]S\. Helfrich, A\. Herzel, S\. Ruzika, and C\. Thielen\(2024\)Using scalarizations for the approximation of multiobjective optimization problems: towards a general theory\.Mathematical Methods of Operations Research100,pp\. 27–63\.External Links:[Document](https://dx.doi.org/10.1007/s00186-023-00823-2)Cited by:[Appendix B](https://arxiv.org/html/2605.05693#A2.SS0.SSS0.Px2.p2.1),[§C\.4](https://arxiv.org/html/2605.05693#A3.SS4.p5.1)\.
- \[29\]D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt\(2021\)Measuring massive multitask language understanding\.InInternational Conference on Learning Representations,External Links:[Document](https://dx.doi.org/10.48550/arXiv.2009.03300),2009\.03300,[Link](https://openreview.net/forum?id=d7KBjmI3GmQ)Cited by:[§4\.1](https://arxiv.org/html/2605.05693#S4.SS1.SSS0.Px3.p1.1)\.
- \[30\]A\. E\. Hoerl and R\. W\. Kennard\(1970\)Ridge regression: biased estimation for nonorthogonal problems\.Technometrics12\(1\),pp\. 55–67\.External Links:[Document](https://dx.doi.org/10.1080/00401706.1970.10488634)Cited by:[Appendix B](https://arxiv.org/html/2605.05693#A2.SS0.SSS0.Px2.p1.1),[§C\.4](https://arxiv.org/html/2605.05693#A3.SS4.p2.1)\.
- \[31\]A\. Q\. Jiang, A\. Sablayrolles, A\. Mensch, C\. Bamford, D\. S\. Chaplot,et al\.\(2023\)Mistral 7B\.External Links:2310\.06825,[Document](https://dx.doi.org/10.48550/arXiv.2310.06825)Cited by:[§1](https://arxiv.org/html/2605.05693#S1.p1.1)\.
- \[32\]A\. Q\. Jiang, A\. Sablayrolles, A\. Roux, A\. Mensch, B\. Savary,et al\.\(2024\)Mixtral of experts\.External Links:2401\.04088,[Document](https://dx.doi.org/10.48550/arXiv.2401.04088)Cited by:[§4\.1](https://arxiv.org/html/2605.05693#S4.SS1.SSS0.Px2.p1.1)\.
- \[33\]J\. Kim, M\. El Halabi, W\. Park, C\. J\. S\. Schaefer, D\. Lee, Y\. Park, J\. W\. Lee, and H\. O\. Song\(2025\)GuidedQuant: large language model quantization via exploiting end loss guidance\.InProceedings of the 42nd International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.267,pp\. 30011–30037\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2505.07004),2505\.07004,[Link](https://proceedings.mlr.press/v267/kim25d.html)Cited by:[Appendix B](https://arxiv.org/html/2605.05693#A2.SS0.SSS0.Px1.p2.1)\.
- \[34\]D\. Könen and M\. Stiglmayr\(2025\)On supportedness in multi\-objective integer linear programming\.Journal of Multi\-Criteria Decision Analysis32\(3\),pp\. e70024\.External Links:[Document](https://dx.doi.org/10.1002/mcda.70024)Cited by:[Appendix B](https://arxiv.org/html/2605.05693#A2.SS0.SSS0.Px2.p2.1),[§C\.4](https://arxiv.org/html/2605.05693#A3.SS4.p5.1)\.
- \[35\]Y\. LeCun, J\. S\. Denker, and S\. A\. Solla\(1989\)Optimal brain damage\.InAdvances in Neural Information Processing Systems,Vol\.2,pp\. 598–605\.External Links:[Link](https://proceedings.neurips.cc/paper/1989/hash/6c9882bbac1c7093bd25041881277658-Abstract.html)Cited by:[§2](https://arxiv.org/html/2605.05693#S2.SS0.SSS0.Px4.p1.2)\.
- \[36\]C\. Lee, J\. Jin, T\. Kim, H\. Kim, and E\. Park\(2024\)OWQ: outlier\-aware weight quantization for efficient fine\-tuning and inference of large language models\.Proceedings of the AAAI Conference on Artificial Intelligence38\(12\),pp\. 13355–13364\.External Links:[Document](https://dx.doi.org/10.1609/aaai.v38i12.29237),[Link](https://ojs.aaai.org/index.php/AAAI/article/view/29237)Cited by:[Appendix B](https://arxiv.org/html/2605.05693#A2.SS0.SSS0.Px1.p2.1)\.
- \[37\]Y\. Li, R\. Gong, X\. Tan, Y\. Yang, P\. Hu, Q\. Zhang, F\. Yu, W\. Wang, and S\. Gu\(2021\)BRECQ: pushing the limit of post\-training quantization by block reconstruction\.InInternational Conference on Learning Representations,External Links:[Document](https://dx.doi.org/10.48550/arXiv.2102.05426),2102\.05426,[Link](https://openreview.net/forum?id=POWv6hDd9XH)Cited by:[Appendix B](https://arxiv.org/html/2605.05693#A2.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.05693#S1.p2.1)\.
- \[38\]Y\. Li, R\. Yin, D\. Lee, S\. Xiao, and P\. Panda\(2025\)GPTAQ: efficient finetuning\-free quantization for asymmetric calibration\.InProceedings of the 42nd International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.267,pp\. 36690–36706\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2504.02692),2504\.02692,[Link](https://proceedings.mlr.press/v267/li25dn.html)Cited by:[§1](https://arxiv.org/html/2605.05693#S1.SS0.SSS0.Px1.p1.5),[§1](https://arxiv.org/html/2605.05693#S1.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2605.05693#S2.SS0.SSS0.Px4.p1.2),[§2](https://arxiv.org/html/2605.05693#S2.SS0.SSS0.Px4.p1.5),[§3\.2](https://arxiv.org/html/2605.05693#S3.SS2.SSS0.Px2.p1.3),[§4\.1](https://arxiv.org/html/2605.05693#S4.SS1.SSS0.Px1.p1.1)\.
- \[39\]Y\. Liang, H\. Chen, S\. Han, and Z\. Liu\(2026\)ParoQuant: pairwise rotation quantization for efficient reasoning LLM inference\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=1USeVjsKau)Cited by:[§1](https://arxiv.org/html/2605.05693#S1.p2.1)\.
- \[40\]H\. Liao, X\. Yuan, and R\. Gao\(2024\)An exact penalty function optimization method and its application in stress constrained topology optimization and scenario based reliability design problems\.Applied Mathematical Modelling125,pp\. 260–292\.External Links:[Document](https://dx.doi.org/10.1016/j.apm.2023.10.014)Cited by:[Appendix B](https://arxiv.org/html/2605.05693#A2.SS0.SSS0.Px2.p1.1),[§C\.4](https://arxiv.org/html/2605.05693#A3.SS4.p3.1)\.
- \[41\]J\. Lin, J\. Tang, H\. Tang, S\. Yang, W\. Chen, W\. Wang, G\. Xiao, X\. Dang, C\. Gan, and S\. Han\(2024\)AWQ: activation\-aware weight quantization for on\-device llm compression and acceleration\.InProceedings of Machine Learning and Systems,Vol\.6,pp\. 87–100\.External Links:[Link](https://proceedings.mlsys.org/paper_files/paper/2024/file/42a452cbafa9dd64e9ba4aa95cc1ef21-Paper-Conference.pdf)Cited by:[Appendix B](https://arxiv.org/html/2605.05693#A2.SS0.SSS0.Px1.p2.1),[§C\.2](https://arxiv.org/html/2605.05693#A3.SS2.SSS0.Px1.p1.4),[§C\.2](https://arxiv.org/html/2605.05693#A3.SS2.p1.1),[§E\.2](https://arxiv.org/html/2605.05693#A5.SS2.SSS0.Px1.p1.4),[§1](https://arxiv.org/html/2605.05693#S1.SS0.SSS0.Px1.p1.5),[§1](https://arxiv.org/html/2605.05693#S1.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.05693#S1.p2.1),[§2](https://arxiv.org/html/2605.05693#S2.SS0.SSS0.Px2.p1.14),[§2](https://arxiv.org/html/2605.05693#S2.SS0.SSS0.Px2.p1.3),[§2](https://arxiv.org/html/2605.05693#S2.SS0.SSS0.Px3.p1.6),[§3\.1](https://arxiv.org/html/2605.05693#S3.SS1.SSS0.Px3.p1.1),[§3\.2](https://arxiv.org/html/2605.05693#S3.SS2.SSS0.Px1.p1.22),[§4\.1](https://arxiv.org/html/2605.05693#S4.SS1.SSS0.Px1.p1.1)\.
- \[42\]J\. Liu, L\. Niu, Z\. Yuan, D\. Yang, X\. Wang, and W\. Liu\(2023\)PD\-Quant: post\-training quantization based on prediction difference metric\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 24427–24437\.External Links:[Document](https://dx.doi.org/10.1109/CVPR52729.2023.02340)Cited by:[Appendix B](https://arxiv.org/html/2605.05693#A2.SS0.SSS0.Px1.p2.1)\.
- \[43\]Z\. Liu, B\. Oguz, C\. Zhao, E\. Chang, P\. Stock, Y\. Mehdad, Y\. Shi, R\. Krishnamoorthi, and V\. Chandra\(2024\)LLM\-QAT: data\-free quantization aware training for large language models\.InFindings of the Association for Computational Linguistics: ACL 2024,Bangkok, Thailand,pp\. 467–484\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.26),[Link](https://aclanthology.org/2024.findings-acl.26/)Cited by:[§1](https://arxiv.org/html/2605.05693#S1.p2.1)\.
- \[44\]Z\. Liu, C\. Zhao, I\. Fedorov, B\. Soran, D\. Choudhary, R\. Krishnamoorthi, V\. Chandra, Y\. Tian, and T\. Blankevoort\(2025\)SpinQuant: LLM quantization with learned rotations\.InThe Thirteenth International Conference on Learning Representations,External Links:[Document](https://dx.doi.org/10.48550/arXiv.2405.16406),2405\.16406,[Link](https://openreview.net/forum?id=ogO6DGE6FZ)Cited by:[§2](https://arxiv.org/html/2605.05693#S2.SS0.SSS0.Px2.p2.1)\.
- \[45\]S\. Lucidi and F\. Rinaldi\(2010\)Exact penalty functions for nonlinear integer programming problems\.Journal of Optimization Theory and Applications145\(3\),pp\. 479–488\.External Links:[Document](https://dx.doi.org/10.1007/s10957-010-9700-7)Cited by:[§D\.3](https://arxiv.org/html/2605.05693#A4.SS3.SSS0.Px2.p3.1)\.
- \[46\]R\. T\. Marler and J\. S\. Arora\(2010\)The weighted sum method for multi\-objective optimization: new insights\.Structural and Multidisciplinary Optimization41\(6\),pp\. 853–862\.External Links:[Document](https://dx.doi.org/10.1007/s00158-009-0460-7)Cited by:[§D\.3](https://arxiv.org/html/2605.05693#A4.SS3.SSS0.Px2.p2.2),[§D\.3](https://arxiv.org/html/2605.05693#A4.SS3.p2.6)\.
- \[47\]S\. Merity, C\. Xiong, J\. Bradbury, and R\. Socher\(2017\)Pointer sentinel mixture models\.InInternational Conference on Learning Representations,External Links:[Document](https://dx.doi.org/10.48550/arXiv.1609.07843),1609\.07843,[Link](https://openreview.net/forum?id=Byj72udxe)Cited by:[§4\.1](https://arxiv.org/html/2605.05693#S4.SS1.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2605.05693#S4.SS1.SSS0.Px4.p1.10)\.
- \[48\]K\. Miettinen\(1998\)Nonlinear multiobjective optimization\.International Series in Operations Research & Management Science, Vol\.12,Springer,New York\.External Links:[Document](https://dx.doi.org/10.1007/978-1-4615-5563-6)Cited by:[§D\.3](https://arxiv.org/html/2605.05693#A4.SS3.SSS0.Px2.p2.2),[§D\.3](https://arxiv.org/html/2605.05693#A4.SS3.p2.6)\.
- \[49\]M\. Nagel, R\. A\. Amjad, M\. Van Baalen, C\. Louizos, and T\. Blankevoort\(2020\)Up or down? Adaptive rounding for post\-training quantization\.InProceedings of the 37th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.119,pp\. 7197–7206\.External Links:[Link](https://proceedings.mlr.press/v119/nagel20a.html)Cited by:[Appendix B](https://arxiv.org/html/2605.05693#A2.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.05693#S1.p2.1)\.
- \[50\]M\. Nagel, M\. Fournarakis, R\. A\. Amjad, Y\. Bondarenko, M\. van Baalen, and T\. Blankevoort\(2021\)A white paper on neural network quantization\.External Links:2106\.08295,[Document](https://dx.doi.org/10.48550/arXiv.2106.08295)Cited by:[Appendix B](https://arxiv.org/html/2605.05693#A2.SS0.SSS0.Px1.p1.1)\.
- \[51\]J\. Nocedal and S\. J\. Wright\(2006\)Numerical optimization\.2 edition,Springer Series in Operations Research and Financial Engineering,Springer,New York\.External Links:[Document](https://dx.doi.org/10.1007/978-0-387-40065-5)Cited by:[Appendix B](https://arxiv.org/html/2605.05693#A2.SS0.SSS0.Px2.p1.1),[§C\.4](https://arxiv.org/html/2605.05693#A3.SS4.p2.2)\.
- \[52\]OpenAI, J\. Achiam, S\. Adler, S\. Agarwal, L\. Ahmad,et al\.\(2023\)GPT\-4 technical report\.External Links:2303\.08774,[Document](https://dx.doi.org/10.48550/arXiv.2303.08774)Cited by:[§1](https://arxiv.org/html/2605.05693#S1.p1.1)\.
- \[53\]K\. Sakaguchi, R\. L\. Bras, C\. Bhagavatula, and Y\. Choi\(2021\)WinoGrande: an adversarial winograd schema challenge at scale\.Communications of the ACM64\(9\),pp\. 99–106\.External Links:ISSN 0001\-0782,[Document](https://dx.doi.org/10.1145/3474381)Cited by:[§4\.1](https://arxiv.org/html/2605.05693#S4.SS1.SSS0.Px3.p1.1)\.
- \[54\]Y\. Shang, G\. Liu, R\. R\. Kompella, and Y\. Yan\(2024\)Enhancing post\-training quantization calibration through contrastive learning\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 15921–15930\.External Links:[Document](https://dx.doi.org/10.1109/CVPR52733.2024.01507)Cited by:[Appendix B](https://arxiv.org/html/2605.05693#A2.SS0.SSS0.Px1.p2.1)\.
- \[55\]W\. Shao, M\. Chen, Z\. Zhang, P\. Xu, L\. Zhao, Z\. Li, K\. Zhang, P\. Gao, Y\. Qiao, and P\. Luo\(2024\)OmniQuant: omnidirectionally calibrated quantization for large language models\.InThe Twelfth International Conference on Learning Representations,External Links:[Document](https://dx.doi.org/10.48550/arXiv.2308.13137),2308\.13137,[Link](https://openreview.net/forum?id=8Wuvhh0LYW)Cited by:[§E\.2](https://arxiv.org/html/2605.05693#A5.SS2.SSS0.Px3.p1.2),[§4\.2](https://arxiv.org/html/2605.05693#S4.SS2.p4.1)\.
- \[56\]Y\. Tian, C\. Wang, J\. Han, Y\. Tang, and K\. Han\(2025\)PocketLLM: ultimate compression of large language models via meta networks\.External Links:2511\.17637,[Document](https://dx.doi.org/10.48550/arXiv.2511.17637)Cited by:[§1](https://arxiv.org/html/2605.05693#S1.p2.1)\.
- \[57\]R\. Tibshirani\(1996\)Regression shrinkage and selection via the lasso\.Journal of the Royal Statistical Society: Series B \(Methodological\)58\(1\),pp\. 267–288\.External Links:[Document](https://dx.doi.org/10.1111/j.2517-6161.1996.tb02080.x)Cited by:[Appendix B](https://arxiv.org/html/2605.05693#A2.SS0.SSS0.Px2.p1.1),[§C\.4](https://arxiv.org/html/2605.05693#A3.SS4.p2.1)\.
- \[58\]H\. Touvron, L\. Martin, K\. Stone, P\. Albert, A\. Almahairi,et al\.\(2023\)Llama 2: open foundation and fine\-tuned chat models\.External Links:2307\.09288,[Document](https://dx.doi.org/10.48550/arXiv.2307.09288)Cited by:[§4\.1](https://arxiv.org/html/2605.05693#S4.SS1.SSS0.Px2.p1.1)\.
- \[59\]M\. Van Baalen, A\. Kuzmin, M\. Nagel, P\. Couperus, A\. Bolshakov, C\. Bastoul, E\. Mahurin, T\. Blankevoort, and P\. Whatmough\(2024\)GPTVQ: the blessing of dimensionality for LLM quantization\.InWorkshop on Efficient Systems for Foundation Models II @ ICML 2024,External Links:[Link](https://openreview.net/forum?id=sFWttLybEu)Cited by:[§2](https://arxiv.org/html/2605.05693#S2.SS0.SSS0.Px4.p1.5)\.
- \[60\]M\. Wei, Y\. Yan, and D\. Wang\(2025\)MPPQ: enhancing post\-training quantization for LLMs via mixed supervision, proxy rounding, and pre\-searching\.InProceedings of the Thirty\-Fourth International Joint Conference on Artificial Intelligence,pp\. 8277–8285\.External Links:[Document](https://dx.doi.org/10.24963/ijcai.2025/920)Cited by:[Appendix B](https://arxiv.org/html/2605.05693#A2.SS0.SSS0.Px1.p2.1)\.
- \[61\]X\. Wei, Y\. Zhang, X\. Zhang, R\. Gong, S\. Zhang, Q\. Zhang, F\. Yu, and X\. Liu\(2022\)Outlier suppression: pushing the limit of low\-bit transformer language models\.InAdvances in Neural Information Processing Systems,Vol\.35,pp\. 17402–17414\.External Links:[Document](https://dx.doi.org/10.52202/068431-1265),[Link](https://proceedings.neurips.cc/paper_files/paper/2022/hash/6f6db140de9c9f111b12ef8a216320a9-Abstract-Conference.html)Cited by:[§1](https://arxiv.org/html/2605.05693#S1.p2.1)\.
- \[62\]D\. Wu, Q\. Tang, Y\. Zhao, M\. Zhang, Y\. Fu, and D\. Zhang\(2020\)EasyQuant: post\-training quantization via scale optimization\.External Links:2006\.16669,[Document](https://dx.doi.org/10.48550/arXiv.2006.16669)Cited by:[Appendix B](https://arxiv.org/html/2605.05693#A2.SS0.SSS0.Px1.p1.1)\.
- \[63\]H\. Xi, C\. Li, J\. Chen, and J\. Zhu\(2023\)Training transformers with 4\-bit integers\.InAdvances in Neural Information Processing Systems,Vol\.36,pp\. 49146–49168\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2023/hash/99fc8bc48b917c301a80cb74d91c0c06-Abstract-Conference.html)Cited by:[§2](https://arxiv.org/html/2605.05693#S2.SS0.SSS0.Px2.p1.3),[§2](https://arxiv.org/html/2605.05693#S2.SS0.SSS0.Px2.p2.1)\.
- \[64\]G\. Xiao, J\. Lin, M\. Seznec, H\. Wu, J\. Demouth, and S\. Han\(2023\)SmoothQuant: accurate and efficient post\-training quantization for large language models\.InProceedings of the 40th International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol\.202,pp\. 38087–38099\.External Links:[Link](https://proceedings.mlr.press/v202/xiao23c.html)Cited by:[§1](https://arxiv.org/html/2605.05693#S1.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.05693#S1.p2.1),[§2](https://arxiv.org/html/2605.05693#S2.SS0.SSS0.Px1.p1.5),[§2](https://arxiv.org/html/2605.05693#S2.SS0.SSS0.Px2.p1.3),[§2](https://arxiv.org/html/2605.05693#S2.SS0.SSS0.Px3.p1.6),[§3\.1](https://arxiv.org/html/2605.05693#S3.SS1.SSS0.Px3.p1.1),[§3\.2](https://arxiv.org/html/2605.05693#S3.SS2.SSS0.Px1.p1.22)\.
- \[65\]A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui,et al\.\(2025\)Qwen3 technical report\.External Links:2505\.09388,[Document](https://dx.doi.org/10.48550/arXiv.2505.09388)Cited by:[§1](https://arxiv.org/html/2605.05693#S1.p1.1),[§4\.1](https://arxiv.org/html/2605.05693#S4.SS1.SSS0.Px2.p1.1)\.
- \[66\]L\. A\. Zadeh\(1963\)Optimality and non\-scalar\-valued performance criteria\.IEEE Transactions on Automatic Control8\(1\),pp\. 59–60\.External Links:[Document](https://dx.doi.org/10.1109/TAC.1963.1105511)Cited by:[§D\.3](https://arxiv.org/html/2605.05693#A4.SS3.p2.6)\.
- \[67\]R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi\(2019\-07\)HellaSwag: can a machine really finish your sentence?\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics,Florence, Italy,pp\. 4791–4800\.External Links:[Link](https://aclanthology.org/P19-1472/),[Document](https://dx.doi.org/10.18653/v1/P19-1472)Cited by:[§4\.1](https://arxiv.org/html/2605.05693#S4.SS1.SSS0.Px3.p1.1)\.

Appendix

## Appendix Overview

This appendix offers a complete, self\-contained elaboration of the theoretical underpinnings, proofs, implementation details, algorithms, and supplementary experimental results for the proposed SARQC framework\.

Appendix [A](https://arxiv.org/html/2605.05693#A1) summarizes the main notation used throughout the paper and appendix.

Appendix [B](https://arxiv.org/html/2605.05693#A2) presents a more comprehensive literature review on post-training quantization for LLMs and constrained optimization, complementing the discussion in [Section 2](https://arxiv.org/html/2605.05693#S2).

Appendix [C](https://arxiv.org/html/2605.05693#A3) provides additional preliminaries on quantization, the most relevant scaling-based and Gram-based PTQ methods, and basic notions from information theory and constrained optimization.

Appendix [D](https://arxiv.org/html/2605.05693#A4) presents additional theoretical results. In particular, Appendix [D.1](https://arxiv.org/html/2605.05693#A4.SS1) derives the quadratic form of the SARQC objective and shows how it leads to a GPTQ-style row-wise compensation rule with a regularized curvature matrix. Appendix [D.2](https://arxiv.org/html/2605.05693#A4.SS2) gives the proof of the generalization-risk bound in [Theorem 3.1](https://arxiv.org/html/2605.05693#S3.Thmtheorem1). Appendix [D.3](https://arxiv.org/html/2605.05693#A4.SS3) discusses the relationship between hard drift constraints and quadratic penalty formulations for supported quantized solutions.

Appendix [E](https://arxiv.org/html/2605.05693#A5) reports the pseudo-algorithms and implementation details of SARQC. We describe both SARQC-GS, which performs regularized model selection over scaling factors, and SARQC-GBS, which modifies the Gram-based sequential quantization curvature using saliency-aware regularization. We also describe how SARQC is incorporated into OmniQuant in our ablation study.

Appendix [F](https://arxiv.org/html/2605.05693#A6) provides supplementary experimental results. Appendix [F.1](https://arxiv.org/html/2605.05693#A6.SS1) gives the detailed results underlying [Figure 1](https://arxiv.org/html/2605.05693#S1.F1). Appendix [F.2](https://arxiv.org/html/2605.05693#A6.SS2) visualizes weight drift under different regularization strengths. The remaining subsections report additional results under W3A16, results on the LLaMA model family, sensitivity analyses with respect to calibration size and calibration corpus, and practical inference speedup comparisons.

## Appendix A Notation

[Table 3](https://arxiv.org/html/2605.05693#A1.T3) summarizes the main notation used throughout the paper and appendix.

Table 3: Summary of main notation.

## Appendix B More Comprehensive Literature Review

#### Post\-training quantization

Post-training quantization (PTQ) [[50](https://arxiv.org/html/2605.05693#bib.bib50), [62](https://arxiv.org/html/2605.05693#bib.bib64)] has evolved from a lightweight deployment technique into a central compression paradigm for large language models (LLMs). Early optimization-based PTQ methods showed that naive round-to-nearest quantization is often suboptimal. AdaRound [[49](https://arxiv.org/html/2605.05693#bib.bib49)] formulates rounding as a data-dependent local optimization, while BRECQ [[37](https://arxiv.org/html/2605.05693#bib.bib35)] extends this view from individual layers to block reconstruction, balancing local reconstruction fidelity with cross-layer dependency. Second-order methods further strengthened this line of work. Optimal Brain Compression revisits the Optimal Brain Surgeon framework for one-shot pruning and quantization [[20](https://arxiv.org/html/2605.05693#bib.bib19)].

For LLMs, however, quantization is complicated by activation and weight outliers, large inter-layer dependencies, and the distinction between weight-only and weight-activation quantization. Weight-only methods such as GPTQ [[19](https://arxiv.org/html/2605.05693#bib.bib20)], AWQ [[41](https://arxiv.org/html/2605.05693#bib.bib38)], SpQR [[12](https://arxiv.org/html/2605.05693#bib.bib12)], and OWQ [[36](https://arxiv.org/html/2605.05693#bib.bib34)] mainly reduce memory bandwidth while keeping activations in higher precision. Recent PTQ methods have also moved beyond plain layer-wise MSE by incorporating additional optimization criteria, including prediction-difference calibration [[42](https://arxiv.org/html/2605.05693#bib.bib40)], contrastive or mutual-information-based calibration [[54](https://arxiv.org/html/2605.05693#bib.bib56)], output-adaptive Hessian or end-loss-guided objectives [[15](https://arxiv.org/html/2605.05693#bib.bib15), [33](https://arxiv.org/html/2605.05693#bib.bib32)], statistical pre-calibration based on distributional discrepancy [[23](https://arxiv.org/html/2605.05693#bib.bib23)], mixed-metric reconstruction regularization [[60](https://arxiv.org/html/2605.05693#bib.bib61)], and explicit regularized calibration with successive rounding [[6](https://arxiv.org/html/2605.05693#bib.bib9)].

#### Constrained optimization

Constrained and penalized formulations are a classical mechanism for balancing data fidelity against structural preference or model complexity. In statistics, ridge regression and the lasso are canonical examples in which a constrained estimator can be expressed through a penalized objective with a tuning parameter controlling the trade-off between empirical fit and regularity [[30](https://arxiv.org/html/2605.05693#bib.bib29), [57](https://arxiv.org/html/2605.05693#bib.bib58)]. In convex optimization, this connection is typically formalized through Lagrange multipliers, KKT conditions, and duality theory [[4](https://arxiv.org/html/2605.05693#bib.bib7), [51](https://arxiv.org/html/2605.05693#bib.bib52)]. Recent constrained-optimization literature emphasizes that replacing constraints by penalties is an exact reformulation only when a suitable multiplier or exactness condition exists; otherwise, the penalized problem should be interpreted as a relaxation or scalarized surrogate. This distinction is particularly relevant for post-training quantization, since the feasible set $\mathcal{Q}$ is discrete and generally nonconvex. Recent work on exact penalties and Lagrangian relaxation in constrained and mixed-integer optimization makes this multiplier-existence issue explicit [[5](https://arxiv.org/html/2605.05693#bib.bib8), [40](https://arxiv.org/html/2605.05693#bib.bib37), [14](https://arxiv.org/html/2605.05693#bib.bib13)].

Our supportedness condition follows the same principle from multiobjective optimization. The reconstruction error and the discrepancy from the floating-point weights define two competing objectives, and the penalized objective is a weighted-sum scalarization whose exact recovery requires the selected point to be supported [[28](https://arxiv.org/html/2605.05693#bib.bib27), [34](https://arxiv.org/html/2605.05693#bib.bib33)]. This viewpoint is natural for quantization calibration, where one balances output fidelity against proximity to the original FP weights.

## Appendix C Additional Preliminaries

This section reviews several preliminaries used throughout the paper. We first summarize basic quantization formulations, together with representative scaling-based and Gram-based PTQ methods, since the proposed SARQC framework extends these standard calibration objectives and optimization strategies in [Sections 2](https://arxiv.org/html/2605.05693#S2) and [3](https://arxiv.org/html/2605.05693#S3). We then review the information-theoretic and constrained-optimization basics used in the theoretical discussion, which provide background for part of the motivation and analysis in [Section 3.1](https://arxiv.org/html/2605.05693#S3.SS1). These preliminaries are included to establish notation for the derivations, proofs, and implementation details in the appendix.

### C.1 Quantization and Dequantization

Quantization maps high-precision weights or activations to a discrete low-bit set to reduce memory and improve inference efficiency. Given a floating-point (FP) weight matrix $\mathbf{W}\in\mathbb{R}^{d_{\mathrm{out}}\times d_{\mathrm{in}}}$, we denote its quantized–dequantized approximation as $\widehat{\mathbf{W}}=Q(\mathbf{W})$, where $Q(\cdot)$ represents the quantization operator together with the corresponding dequantization map.

For a scalar value $w$, a standard formulation is uniform affine quantization:

$$
q=\mathrm{clip}\!\left(\left\lfloor\frac{w}{\eta}+z\right\rceil,\;q_{\min},\;q_{\max}\right),\qquad\widehat{w}=\eta\,(q-z),\tag{11}
$$

where $\eta>0$ is the quantization step size, $z$ is the zero-point, and $\lfloor\cdot\rceil$ denotes rounding to the nearest integer. The integer $q$ lies in a finite range $[q_{\min},q_{\max}]$ determined by the bit width. In practice, quantization is applied elementwise to vectors, matrices, or tensors, with parameters shared at different granularities such as per-tensor, per-channel, or per-group.
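As an illustration of Eq. (11), the following minimal sketch quantizes and dequantizes a few scalars in plain Python. The helper `minmax_params`, which derives $\eta$ and $z$ from the minimum and maximum of the values, is a hypothetical calibration rule chosen for this example and is not taken from the paper. Python's built-in `round` is used for $\lfloor\cdot\rceil$; it resolves exact `.5` ties toward the even integer, which may differ from a given implementation.

```python
def quantize(w, eta, z, q_min, q_max):
    """Integer code for a scalar w: q = clip(round(w / eta + z), q_min, q_max)."""
    q = round(w / eta + z)  # exact .5 ties go to the even integer
    return max(q_min, min(q_max, q))


def dequantize(q, eta, z):
    """Reconstructed value: w_hat = eta * (q - z)."""
    return eta * (q - z)


def minmax_params(values, n_bits):
    """Hypothetical helper: choose (eta, z) so [min, max] maps onto [0, 2^b - 1]."""
    q_min, q_max = 0, 2 ** n_bits - 1
    lo, hi = min(values), max(values)
    eta = (hi - lo) / (q_max - q_min)   # step size
    z = q_min - round(lo / eta)         # zero-point: code that represents w = 0
    return eta, z, q_min, q_max


weights = [-1.0, -0.33, 0.0, 0.42, 2.0]
eta, z, q_min, q_max = minmax_params(weights, n_bits=4)

codes = [quantize(w, eta, z, q_min, q_max) for w in weights]
recon = [dequantize(q, eta, z) for q in codes]
max_err = max(abs(w - r) for w, r in zip(weights, recon))

print("step size:", eta, "zero-point:", z)
print("codes:", codes)
print("max abs error:", max_err)
```

For values inside the representable range, the rounding error is at most half a step, $\eta/2$; values outside the range are clipped to $q_{\min}$ or $q_{\max}$ and can incur a larger error.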

### C.2 Post-Training Quantization: AWQ and GPTQ

As our method is closely related to AWQ [[41](https://arxiv.org/html/2605.05693#bib.bib38)] and GPTQ [[19](https://arxiv.org/html/2605.05693#bib.bib20)], we now present details of these two methods.

#### AWQ:

Activation-Aware Weight Quantization (AWQ) is a weight-only PTQ method that improves quantization by applying channel-wise scaling prior to quantization [[41](https://arxiv.org/html/2605.05693#bib.bib38)]. For the $l$-th layer with weights $\mathbf{W}_l$ and calibration inputs $\mathbf{X}_l$, AWQ rescales input channels using a positive vector $\tilde{s}_l(\alpha)$, quantizes the rescaled weights, and compensates the scaling on the input side. This leads to the reconstruction objective

$$
\min_{\alpha}\;\;\bigl\|\mathbf{W}_l\mathbf{X}_l-Q\!\bigl(\mathbf{W}_l\,\mathrm{diag}(\tilde{s}_l(\alpha))\bigr)\,\mathrm{diag}(\tilde{s}_l(\alpha))^{-1}\mathbf{X}_l\bigr\|_{\mathrm{F}}^{2}.\tag{12}
$$

The scaling factors are parameterized using simple calibration statistics:

$$
\tilde{s}_l^{(j)}(\alpha)=\frac{\mathrm{mean}(|\mathbf{X}_l^{(j)}|)^{\alpha}}{\mathrm{mean}(|\mathbf{W}_l^{(j)}|)^{1-\alpha}},
$$

and the final quantized weights are selected via a lightweight grid search over $\alpha$. This mechanism effectively protects activation-salient channels.
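The grid search over $\alpha$ can be sketched in NumPy as follows. This is an illustrative toy, not the reference AWQ implementation: the quantizer `rtn_quantize` (per-row asymmetric round-to-nearest), the grid resolution, the `eps` clamp, and the synthetic data are all assumptions made for the example. Shapes follow the text, with `W` of size `(d_out, d_in)` and `X` of size `(d_in, n_tokens)`.

```python
import numpy as np


def rtn_quantize(W, n_bits=4):
    """Stand-in for Q: per-output-row asymmetric round-to-nearest quantize-dequantize."""
    q_max = 2 ** n_bits - 1
    w_min = W.min(axis=1, keepdims=True)
    w_max = W.max(axis=1, keepdims=True)
    eta = np.maximum(w_max - w_min, 1e-8) / q_max
    z = np.rint(-w_min / eta)
    q = np.clip(np.rint(W / eta + z), 0, q_max)
    return eta * (q - z)


def awq_scales(W, X, alpha, eps=1e-8):
    """Per-input-channel scale: mean|X_j|^alpha / mean|W_j|^(1 - alpha)."""
    act = np.maximum(np.abs(X).mean(axis=1), eps)  # one statistic per input channel
    wgt = np.maximum(np.abs(W).mean(axis=0), eps)  # one statistic per weight column
    return act ** alpha / wgt ** (1.0 - alpha)


def awq_error(W, X, alpha, n_bits=4):
    """Squared Frobenius reconstruction error of the scaled-quantized layer."""
    s = awq_scales(W, X, alpha)
    W_eff = rtn_quantize(W * s, n_bits) / s  # Q(W diag(s)) diag(s)^{-1}
    return float(np.linalg.norm(W @ X - W_eff @ X) ** 2)


def awq_grid_search(W, X, n_grid=20, n_bits=4):
    """Return (best_alpha, best_error) over an evenly spaced grid on [0, 1]."""
    alphas = np.linspace(0.0, 1.0, n_grid + 1)
    errors = [awq_error(W, X, a, n_bits) for a in alphas]
    best = int(np.argmin(errors))
    return float(alphas[best]), errors[best]


rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32))
X = rng.normal(size=(32, 64))
X[:4] *= 20.0  # make a few input channels activation-salient

best_alpha, best_err = awq_grid_search(W, X)
print("selected alpha:", best_alpha, "reconstruction error:", best_err)
```

Because multiplying column $j$ of the weights by $\tilde{s}^{(j)}$ and dividing row $j$ of the inputs by the same factor leaves the full-precision product unchanged, only the quantization error depends on $\alpha$, which is what the search exploits.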

#### GPTQ:

GPTQ formulates PTQ as a Gram-based reconstruction problem [[19](https://arxiv.org/html/2605.05693#bib.bib20)]. It minimizes the layer-wise error

$$
\bigl\|\mathbf{W}_l\mathbf{X}_l-\widehat{\mathbf{W}}_l\mathbf{X}_l\bigr\|_{\mathrm{F}}^{2},
$$

which admits a quadratic form governed by the Gram matrix $\mathbf{X}_l\mathbf{X}_l^{\top}$. This matrix serves as a surrogate curvature, capturing the sensitivity of the reconstruction loss.

GPTQ proceeds by sequentially quantizing weights and compensating the induced error using inverse Gram\-based information\. Combined with efficient implementations \(e\.g\., blockwise updates and Cholesky\-based inversion\), this yields a scalable and accurate PTQ method for large language models\.
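The two ingredients above can be checked numerically on toy data. The sketch below first verifies that the layer-wise error equals the row-wise quadratic form $\sum_i \boldsymbol{\delta}_i^{\top}\mathbf{H}\,\boldsymbol{\delta}_i$ with $\mathbf{H}=\mathbf{X}\mathbf{X}^{\top}$ and $\boldsymbol{\delta}_i$ the $i$-th row of $\mathbf{W}-\widehat{\mathbf{W}}$. It then applies a single Optimal-Brain-Surgeon-style compensation step to one coordinate of one row. This is only one step of the idea, not the full GPTQ procedure (no blockwise updates, no Cholesky factorization, no update of the inverse after each coordinate), and the quantizer `to_grid` is a crude stand-in assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n_tokens = 6, 8, 64
W = rng.normal(size=(d_out, d_in))
X = rng.normal(size=(d_in, n_tokens))
H = X @ X.T  # Gram matrix, positive definite here since n_tokens > d_in


def to_grid(w, step=0.25):
    """Crude stand-in quantizer: round to a uniform grid with the given step."""
    return np.round(w / step) * step


# (1) Layer-wise error as a sum of row-wise quadratic forms in H.
delta = W - to_grid(W)
direct = float(np.linalg.norm(delta @ X) ** 2)
quadratic = sum(float(d @ H @ d) for d in delta)

# (2) Quantize coordinate 0 of the first row, with and without compensation.
H_inv = np.linalg.inv(H)
w = W[0].copy()
e = w[0] - to_grid(w[0])                       # rounding error on coordinate 0

w_plain = w.copy()
w_plain[0] = to_grid(w[0])                     # no compensation

w_comp = w - (e / H_inv[0, 0]) * H_inv[:, 0]   # shift the other coordinates too


def row_error(w_new):
    """Reconstruction error of one row, measured through the Gram matrix."""
    d = w - w_new
    return float(d @ H @ d)


print("direct vs quadratic:", direct, quadratic)
print("error without compensation:", row_error(w_plain))
print("error with compensation:   ", row_error(w_comp))
```

In this step the compensated error equals $e^2/[\mathbf{H}^{-1}]_{00}$, whereas the uncompensated error equals $e^2\,\mathbf{H}_{00}$; for a positive definite $\mathbf{H}$ the former is never larger, since $\mathbf{H}_{00}[\mathbf{H}^{-1}]_{00}\ge 1$.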

### C.3 Information-theoretic Basics

We briefly review several standard notions from information theory\.

###### Definition C\.1 \(Entropy\)\.

Let $U$ be a random variable with distribution $p(u)$. If $U$ is discrete, its entropy is defined as

$$H(U)=-\sum_{u}p(u)\log p(u).$$

If $U$ is continuous, the corresponding notion is the differential entropy

$$h(U)=-\int p(u)\log p(u)\,du.$$

Entropy quantifies the uncertainty of a random variable.

###### Definition C\.2 \(Conditional entropy\)\.

The conditional entropy of $V$ given $U$ is defined as

$$H(V\mid U)=-\mathbb{E}_{U,V}\bigl[\log p(V\mid U)\bigr]$$

in the discrete case, and analogously via differential entropy in the continuous case. It measures the remaining uncertainty of $V$ after observing $U$.

###### Definition C\.3 \(Kullback–Leibler divergence\)\.

Let $p$ and $q$ be two probability distributions defined on the same space. The Kullback–Leibler (KL) divergence is defined as

$$\mathrm{KL}(p\,\|\,q)=\mathbb{E}_{u\sim p}\left[\log\frac{p(u)}{q(u)}\right].$$

It is always nonnegative and equals zero if and only if $p=q$ almost everywhere.

###### Definition C\.4 \(Mutual information \[[1](https://arxiv.org/html/2605.05693#bib.bib3)\]\)\.

Let $U$ and $V$ be two random variables with joint distribution $p(u,v)$ and marginals $p(u)$ and $p(v)$. The mutual information between $U$ and $V$ is defined as

$$I(U;V)=\mathbb{E}_{U,V}\left[\log\frac{p(U,V)}{p(U)p(V)}\right]=\mathbb{E}_{U,V}\left[\log\frac{p(V\mid U)}{p(V)}\right].$$

By Definition [C\.4](https://arxiv.org/html/2605.05693#A3.Thmtheorem4), mutual information admits several equivalent forms:

$$I(U;V)=\mathrm{KL}\bigl(p(U,V)\,\|\,p(U)p(V)\bigr)=\mathbb{E}_{V}\left[\mathrm{KL}\bigl(p(U\mid V)\,\|\,p(U)\bigr)\right],$$

and the entropy decompositions

$$I(U;V)=H(V)-H(V\mid U)=H(U)-H(U\mid V).$$

Thus, mutual information quantifies the reduction in uncertainty about one variable after observing the other.
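The equivalence between the KL form and the entropy decompositions can be checked numerically on a small discrete joint distribution. The sketch below is illustrative only: the 2×2 table `joint` and the helper names are made up for this example.

```python
import numpy as np

def entropy(p):
    """H(U) = -sum_u p(u) log p(u), ignoring zero-probability outcomes."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def conditional_entropy(joint):
    """H(V|U) = -E[log p(V|U)] for a table joint[u, v] = p(u, v) with p(u) > 0."""
    pu = joint.sum(axis=1, keepdims=True)
    mask = joint > 0
    return -np.sum(joint[mask] * np.log((joint / pu)[mask]))

def mutual_information(joint):
    """I(U;V) = KL(p(u,v) || p(u)p(v))."""
    pu = joint.sum(axis=1, keepdims=True)
    pv = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return np.sum(joint[mask] * np.log((joint / (pu @ pv))[mask]))

# Hypothetical joint distribution over two binary variables.
joint = np.array([[0.30, 0.10],
                  [0.05, 0.55]])
pu, pv = joint.sum(axis=1), joint.sum(axis=0)
mi = mutual_information(joint)
mi_from_v = entropy(pv) - conditional_entropy(joint)     # H(V) - H(V|U)
mi_from_u = entropy(pu) - conditional_entropy(joint.T)   # H(U) - H(U|V)
```

All three quantities agree up to floating-point error, and each is nonnegative, as the KL form requires.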

### C\.4 Optimization: Constraints and Penalties

We briefly review the relationship between constrained and penalized formulations, which is a classical tool for expressing trade\-offs between empirical fidelity and structural preference\.

###### Definition C\.5 \(Constrained formulation\)\.

Let $F(x)$ be a primary objective and let $R(x)$ measure the complexity, regularity, or deviation of $x$. A constrained formulation takes the form

$$\min_{x\in\mathcal{X}}F(x)\quad\mathrm{s.t.}\quad R(x)\leq\tau,$$

where $\tau$ specifies the admissible level of regularity or complexity.

###### Definition C\.6 \(Penalized formulation\)\.

The corresponding penalized formulation is

$$\min_{x\in\mathcal{X}}F(x)+\lambda R(x),$$

where $\lambda\geq 0$ controls the trade-off between the primary objective and the penalty term.

Such constrained–penalized pairs are standard in statistics\. For example, ridge regression and the lasso can be viewed either as constrained estimators or as penalized estimators, where the tuning parameter controls the trade\-off between data fitting and regularity\[[30](https://arxiv.org/html/2605.05693#bib.bib29),[57](https://arxiv.org/html/2605.05693#bib.bib58)\]\. In convex optimization, this relationship is usually formalized through the Lagrangian

$$\mathcal{L}(x,\lambda)=F(x)+\lambda\bigl(R(x)-\tau\bigr),\qquad\lambda\geq 0.$$

Under suitable regularity conditions, KKT conditions, and strong duality, a solution of the constrained problem can be recovered by minimizing a penalized Lagrangian objective with an appropriate multiplier \[[4](https://arxiv.org/html/2605.05693#bib.bib7),[51](https://arxiv.org/html/2605.05693#bib.bib52)\].

However, this equivalence is not automatic\. A penalized problem is an exact reformulation of the constrained problem only when a suitable multiplier or exactness condition exists\. Otherwise, the penalized objective should be interpreted as a relaxation or scalarized surrogate rather than a fully equivalent problem\. This issue is particularly important in nonconvex, nonsmooth, or mixed\-integer optimization, where strong duality may fail and a nonzero duality gap can appear\. Recent work on exact penalties and Lagrangian relaxation makes this multiplier\-existence issue explicit\[[5](https://arxiv.org/html/2605.05693#bib.bib8),[40](https://arxiv.org/html/2605.05693#bib.bib37),[14](https://arxiv.org/html/2605.05693#bib.bib13)\]\.

The same phenomenon can also be understood from the perspective of multiobjective optimization. Consider two competing objectives,

$$F_{1}(x)\quad\text{and}\quad F_{2}(x).$$

A weighted-sum scalarization solves

$$\min_{x\in\mathcal{X}}F_{1}(x)+\lambda F_{2}(x),\qquad\lambda\geq 0.$$

This scalarized problem does not necessarily recover every Pareto-optimal solution.

###### Definition C\.7 \(Supported solution\)\.

A feasible point $x^{\star}\in\mathcal{X}$ is called supported if there exists a weight $\lambda\geq 0$ such that

$$x^{\star}\in\arg\min_{x\in\mathcal{X}}F_{1}(x)+\lambda F_{2}(x).$$

Equivalently, the objective vector $(F_{1}(x^{\star}),F_{2}(x^{\star}))$ lies on a part of the attainable objective frontier that can be touched by a supporting hyperplane.

Weighted\-sum scalarization can recover supported Pareto solutions, but it can miss unsupported Pareto solutions, especially when the attainable objective set is discrete or nonconvex\[[28](https://arxiv.org/html/2605.05693#bib.bib27),[34](https://arxiv.org/html/2605.05693#bib.bib33)\]\. Therefore, the existence of a penalty parameter that exactly recovers a constrained solution is closely related to whether the target solution is supported\.

In summary, constrained and penalized formulations express the same trade\-off from two different viewpoints\. Their exact equivalence requires additional structural conditions, such as the existence of a suitable Lagrange multiplier or supportedness of the selected trade\-off point\. Without such conditions, the penalized objective remains a useful and tunable surrogate, but not necessarily an exact reformulation of the constrained problem\.
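A three-point toy example makes the gap between Pareto optimality and supportedness concrete. The objective vectors below are invented for illustration; the point $(3,3)$ is Pareto-optimal, yet being a weighted-sum minimizer would require both $\lambda\geq 3$ and $\lambda\leq 1/3$, so no weight selects it.

```python
import numpy as np

def pareto_mask(F):
    """F: (m, 2) objective vectors to minimize. True where no other point dominates."""
    m = len(F)
    mask = np.ones(m, dtype=bool)
    for i in range(m):
        for j in range(m):
            if j != i and np.all(F[j] <= F[i]) and np.any(F[j] < F[i]):
                mask[i] = False
    return mask

def weighted_sum_winners(F, lambdas):
    """Indices that minimize F1 + lambda * F2 for at least one lambda on the grid."""
    winners = set()
    for lam in lambdas:
        scores = F[:, 0] + lam * F[:, 1]
        winners.update(np.flatnonzero(np.isclose(scores, scores.min())).tolist())
    return winners

# Hypothetical discrete attainable set: columns are (F1, F2).
F = np.array([[0.0, 4.0],
              [3.0, 3.0],   # Pareto-optimal but unsupported
              [4.0, 0.0]])
pareto = pareto_mask(F)
winners = weighted_sum_winners(F, np.linspace(0.0, 10.0, 1001))
```

Here `pareto` marks all three points as efficient, while `winners` contains only the two extreme points. The grid sweep is a convenience for the demonstration; the pairwise inequalities give the same conclusion exactly.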

## Appendix D Additional Theoretical Results

### D\.1 Derivations of the Update for SARQC\-GBS

In this subsection, we derive the quadratic form underlying SARQC and show that it leads to a GPTQ\-style row\-wise compensation rule under a regularized curvature matrix\.

#### Quadratic reformulation of the SARQC objective

Recall the layer-wise SARQC objective

$$\min_{\widehat{\mathbf{W}}_{l}\in\mathcal{Q}}\;\|\mathbf{W}_{l}\mathbf{X}_{l}-\widehat{\mathbf{W}}_{l}\mathbf{X}_{l}\|_{\mathrm{F}}^{2}+\lambda\|(\widehat{\mathbf{W}}_{l}-\mathbf{W}_{l})\mathbf{S}_{l}\|_{\mathrm{F}}^{2}.$$

Let

$$\Delta\mathbf{W}_{l}:=\widehat{\mathbf{W}}_{l}-\mathbf{W}_{l}.$$

Then the reconstruction term can be written as

$$\|\mathbf{W}_{l}\mathbf{X}_{l}-\widehat{\mathbf{W}}_{l}\mathbf{X}_{l}\|_{\mathrm{F}}^{2}=\|\Delta\mathbf{W}_{l}\mathbf{X}_{l}\|_{\mathrm{F}}^{2}=\mathrm{Tr}\left(\Delta\mathbf{W}_{l}\mathbf{X}_{l}\mathbf{X}_{l}^{\top}\Delta\mathbf{W}_{l}^{\top}\right),$$

and the regularization term becomes

$$\|(\widehat{\mathbf{W}}_{l}-\mathbf{W}_{l})\mathbf{S}_{l}\|_{\mathrm{F}}^{2}=\|\Delta\mathbf{W}_{l}\mathbf{S}_{l}\|_{\mathrm{F}}^{2}=\mathrm{Tr}\left(\Delta\mathbf{W}_{l}\mathbf{S}_{l}\mathbf{S}_{l}^{\top}\Delta\mathbf{W}_{l}^{\top}\right).$$

Therefore, the SARQC objective reduces to the quadratic form

$$\mathrm{Tr}\left(\Delta\mathbf{W}_{l}\mathbf{G}_{l}\Delta\mathbf{W}_{l}^{\top}\right),\qquad\mathbf{G}_{l}:=\mathbf{X}_{l}\mathbf{X}_{l}^{\top}+\lambda\mathbf{S}_{l}\mathbf{S}_{l}^{\top}.$$

#### Row\-wise decomposition\.

Since the trace is additive over rows, writing $\Delta\mathbf{W}_{l}$ row-wise yields

$$\mathrm{Tr}\left(\Delta\mathbf{W}_{l}\mathbf{G}_{l}\Delta\mathbf{W}_{l}^{\top}\right)=\sum_{r=1}^{d_{\mathrm{out}}}\Delta\mathbf{w}_{l,r}^{\top}\mathbf{G}_{l}\Delta\mathbf{w}_{l,r},$$

where $\Delta\mathbf{w}_{l,r}$ is the $r$-th row of $\Delta\mathbf{W}_{l}$. Hence, the optimization decomposes across output rows, as in GPTQ, with the curvature matrix replaced by $\mathbf{G}_{l}$.
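The quadratic reformulation and its row-wise decomposition can be verified numerically. In the sketch below, the dimensions, the random perturbation standing in for a quantized matrix, and the diagonal form of the saliency matrix `S` are all assumptions made for illustration; the identity itself holds for any $\mathbf{S}_{l}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, n = 4, 6, 32
W = rng.normal(size=(d_out, d_in))
W_hat = W + 0.05 * rng.normal(size=W.shape)     # stand-in for a quantized weight matrix
X = rng.normal(size=(d_in, n))                  # calibration inputs, one column per sample
S = np.diag(rng.uniform(0.5, 2.0, size=d_in))   # hypothetical diagonal saliency matrix
lam = 0.3

dW = W_hat - W
# Original objective: reconstruction error plus saliency-weighted proximity penalty.
objective = np.linalg.norm(W @ X - W_hat @ X) ** 2 + lam * np.linalg.norm(dW @ S) ** 2
# Quadratic form under the regularized curvature matrix G.
G = X @ X.T + lam * (S @ S.T)
trace_form = np.trace(dW @ G @ dW.T)
# Row-wise decomposition: one quadratic form per output row.
row_sum = sum(dW[r] @ G @ dW[r] for r in range(d_out))
```

The three scalars `objective`, `trace_form`, and `row_sum` coincide up to floating-point error.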

#### Closed\-form coordinate compensation

Consider a single output row, written as a column vector $\mathbf{w}$, with quantized-dequantized counterpart $\widehat{\mathbf{w}}$. Define the row-wise perturbation

$$\Delta\mathbf{w}:=\widehat{\mathbf{w}}-\mathbf{w}.$$

Let $\mathbf{G}:=\mathbf{G}_{l}$, and assume that $\mathbf{G}$ is invertible. The row-wise quadratic objective is

$$\min_{\Delta\mathbf{w}}\;\frac{1}{2}\Delta\mathbf{w}^{\top}\mathbf{G}\Delta\mathbf{w}.$$

Now suppose the $j$-th coordinate is committed to its quantized-dequantized value $\widehat{w}_{j}$, and define the induced quantization error by

$$e_{j}:=w_{j}-\widehat{w}_{j},\qquad\Delta w_{j}=-e_{j}.$$

This leads to the constrained problem

$$\min_{\Delta\mathbf{w}}\;\frac{1}{2}\Delta\mathbf{w}^{\top}\mathbf{G}\Delta\mathbf{w}\quad\text{s.t.}\quad\Delta w_{j}=-e_{j}.$$
Introducing a Lagrange multiplier $\nu$, the Lagrangian is

$$\mathcal{J}(\Delta\mathbf{w},\nu)=\frac{1}{2}\Delta\mathbf{w}^{\top}\mathbf{G}\Delta\mathbf{w}+\nu(\Delta w_{j}+e_{j}).$$

The first-order optimality condition gives

$$\mathbf{G}\Delta\mathbf{w}+\nu\mathbf{e}_{j}=0\qquad\Longrightarrow\qquad\Delta\mathbf{w}=-\nu\mathbf{G}^{-1}\mathbf{e}_{j},$$

where $\mathbf{e}_{j}$ denotes the $j$-th standard basis vector. Let

$$\mathbf{M}:=\mathbf{G}^{-1}.$$

Enforcing the constraint yields

$$-e_{j}=\Delta w_{j}=-\nu M_{jj}\qquad\Longrightarrow\qquad\nu=\frac{e_{j}}{M_{jj}}.$$

Therefore, the optimal perturbation update is

$$\Delta\mathbf{w}^{\star}=-\frac{e_{j}}{M_{jj}}\,\mathbf{M}_{:,j}.$$

Substituting this back into the quadratic objective, the minimal objective value induced by fixing coordinate $j$ is

$$\Delta\mathcal{L}^{\star}=\frac{1}{2}\frac{e_{j}^{2}}{M_{jj}}.$$
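The closed-form update and its objective value can be sanity-checked on a random positive definite matrix. This is a minimal sketch: the function name `compensate`, the matrix `G`, and the values of `j` and `e_j` are arbitrary choices made for illustration, not quantities from the paper.

```python
import numpy as np

def compensate(G, j, e_j):
    """Optimal perturbation with coordinate j fixed to -e_j, and its objective value."""
    M = np.linalg.inv(G)
    dw = -(e_j / M[j, j]) * M[:, j]        # Delta w* = -(e_j / M_jj) M_{:, j}
    value = 0.5 * e_j ** 2 / M[j, j]       # Delta L* = e_j^2 / (2 M_jj)
    return dw, value

rng = np.random.default_rng(1)
d = 5
A = rng.normal(size=(d, 3 * d))
G = A @ A.T + 0.1 * np.eye(d)              # symmetric positive definite stand-in for G_l
j, e_j = 2, 0.07
dw, value = compensate(G, j, e_j)

# The constraint holds, and the gradient G @ dw vanishes off coordinate j,
# which is exactly the first-order condition G dw + nu e_j = 0.
grad = G @ dw
```

Explicitly inverting `G` keeps the sketch close to the derivation; the text notes that practical GPTQ implementations use blockwise updates and Cholesky-based inversion instead.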

#### Relation to GPTQ

The above derivation shows that SARQC-GBS is a strictly minimal modification of GPTQ under the same Gram-based sequential framework. In standard GPTQ, the quadratic form is governed by the activation Gram matrix $\mathbf{X}_{l}\mathbf{X}_{l}^{\top}$. SARQC replaces it with the regularized matrix

$$\mathbf{G}_{l}=\mathbf{X}_{l}\mathbf{X}_{l}^{\top}+\lambda\mathbf{S}_{l}\mathbf{S}_{l}^{\top},$$

while leaving the row-wise decomposition, coordinate-wise commitment, and inverse-curvature compensation unchanged.

Consequently, SARQC-GBS preserves the implementation structure and computational pattern of GPTQ, but alters the effective curvature to incorporate saliency-aware regularization. When $\lambda=0$, the objective reduces to the undamped GPTQ-style quadratic form governed only by $\mathbf{X}_{l}\mathbf{X}_{l}^{\top}$. In practical GPTQ implementations, however, an additional heuristic damping term is often added to the Gram matrix for numerical stability. This damping is an implementation-level stabilization and can be viewed as an isotropic curvature regularization. In contrast, SARQC introduces a saliency-aware structured regularization term through $\lambda\mathbf{S}_{l}\mathbf{S}_{l}^{\top}$. Therefore, the $\lambda=0$ case of SARQC-GBS should be understood as the undamped GPTQ-style objective. When $\mathbf{S}_{l}\mathbf{S}_{l}^{\top}=\mathbf{I}$, SARQC reduces to an isotropically regularized variant, which is closely related to the damping used in practical GPTQ implementations.
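The isotropic special case can be illustrated with a rank-deficient Gram matrix. In this sketch the sizes and the value of `lam` are arbitrary assumptions; with fewer calibration samples than input channels, $\mathbf{X}_{l}\mathbf{X}_{l}^{\top}$ is singular, whereas adding $\lambda\mathbf{I}$ makes it positive definite, so a Cholesky-based inverse exists.

```python
import numpy as np

rng = np.random.default_rng(2)
d_in, n = 8, 4                        # fewer samples than channels: singular Gram matrix
X = rng.normal(size=(d_in, n))
H = X @ X.T                           # undamped Gram matrix (lambda = 0)
lam = 0.5
G_iso = H + lam * np.eye(d_in)        # isotropic case S S^T = I

rank_H = np.linalg.matrix_rank(H)
min_eig = np.linalg.eigvalsh(G_iso).min()

# Cholesky-based inverse: G = L L^T, so G^{-1} = L^{-T} L^{-1}.
L = np.linalg.cholesky(G_iso)         # raises LinAlgError if G_iso is not positive definite
G_inv = np.linalg.solve(L.T, np.linalg.solve(L, np.eye(d_in)))
```

Every eigenvalue of `G_iso` is at least `lam`, which is the sense in which the regularizer acts like damping here. A non-identity $\mathbf{S}_{l}\mathbf{S}_{l}^{\top}$ would shift the spectrum anisotropically instead.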

### D\.2 Proof of [Theorem 3\.1](https://arxiv.org/html/2605.05693#S3.Thmtheorem1)

Proof. Consider a single linear layer and define $\Delta\mathbf{W}_{l}:=\widehat{\mathbf{W}}_{l}-\mathbf{W}_{l}$. Let the true downstream reconstruction risk be

$$\mathcal{R}(\widehat{\mathbf{W}}_{l}):=\mathbb{E}_{X\sim p_{X}}\left[\|\Delta\mathbf{W}_{l}X\|_{2}^{2}\right],$$

and let the empirical calibration risk on a calibration set $\mathcal{D}_{\mathrm{cal},l}:=\mathbf{X}_{l}:=\{\mathbf{X}_{l,i}\}_{i=1}^{n}$ be

$$\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}_{l}):=\frac{1}{n}\sum_{i=1}^{n}\|\Delta\mathbf{W}_{l}\mathbf{X}_{l,i}\|_{2}^{2}.$$

Assume that $\|X\|_{2}\leq M_{X}$. We consider a restricted quantization hypothesis class

$$\mathcal{Q}_{R}:=\left\{\widehat{\mathbf{W}}_{l}\in\mathcal{Q}:\|\widehat{\mathbf{W}}_{l}-\mathbf{W}_{l}\|_{\mathrm{F}}\leq R\right\}.$$

For any $\widehat{\mathbf{W}}_{l}\in\mathcal{Q}_{R}$, we have

$$\|\Delta\mathbf{W}_{l}X\|_{2}\leq\|\Delta\mathbf{W}_{l}\|_{\mathrm{F}}\|X\|_{2}\leq RM_{X}.$$

Therefore,

$$0\leq\|\Delta\mathbf{W}_{l}X\|_{2}^{2}\leq R^{2}M_{X}^{2}.$$

We assume that $\mathcal{Q}_{R}$ is finite. This finite-class assumption is natural for quantization, since admissible quantized weights are selected from a discrete codebook. The resulting bound may be loose because $|\mathcal{Q}_{R}|$ can be very large, but it cleanly illustrates the role of the weight drift radius $R$.

Define the loss $\ell_{\widehat{\mathbf{W}}_{l}}(X):=\|\Delta\mathbf{W}_{l}X\|_{2}^{2}$. Then for any fixed $\widehat{\mathbf{W}}_{l}\in\mathcal{Q}_{R}$, $\mathcal{R}(\widehat{\mathbf{W}}_{l})=\mathbb{E}[\ell_{\widehat{\mathbf{W}}_{l}}(X)]$, and $\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}_{l})=\frac{1}{n}\sum_{i=1}^{n}\ell_{\widehat{\mathbf{W}}_{l}}(\mathbf{X}_{l,i})$.

Since $0\leq\ell_{\widehat{\mathbf{W}}_{l}}(X)\leq R^{2}M_{X}^{2}$, and assuming, as the inequality requires, that the calibration samples $\mathbf{X}_{l,i}$ are i.i.d. draws from $p_{X}$, Hoeffding's inequality gives, for any fixed $\widehat{\mathbf{W}}_{l}\in\mathcal{Q}_{R}$,

$$\Pr\left(\left|\mathcal{R}(\widehat{\mathbf{W}}_{l})-\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}_{l})\right|\geq t\right)\leq 2\exp\left(-\frac{2nt^{2}}{R^{4}M_{X}^{4}}\right).$$
Since $\mathcal{Q}_{R}$ is finite, we apply the union bound:

$$\Pr\left(\exists\,\widehat{\mathbf{W}}_{l}\in\mathcal{Q}_{R}:\left|\mathcal{R}(\widehat{\mathbf{W}}_{l})-\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}_{l})\right|\geq t\right)\leq 2|\mathcal{Q}_{R}|\exp\left(-\frac{2nt^{2}}{R^{4}M_{X}^{4}}\right).$$

Setting the right-hand side to $\delta$, we have

$$2|\mathcal{Q}_{R}|\exp\left(-\frac{2nt^{2}}{R^{4}M_{X}^{4}}\right)=\delta.$$

Taking logarithms gives

$$\frac{2nt^{2}}{R^{4}M_{X}^{4}}=\log\frac{2|\mathcal{Q}_{R}|}{\delta}.$$

Thus,

$$t=R^{2}M_{X}^{2}\sqrt{\frac{\log\frac{2|\mathcal{Q}_{R}|}{\delta}}{2n}}.$$
Therefore, with probability at least $1-\delta$, for all $\widehat{\mathbf{W}}_{l}\in\mathcal{Q}_{R}$, we have

$$\left|\mathcal{R}(\widehat{\mathbf{W}}_{l})-\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}_{l})\right|\leq R^{2}M_{X}^{2}\sqrt{\frac{\log\frac{2|\mathcal{Q}_{R}|}{\delta}}{2n}}.$$

In particular,

$$\mathcal{R}(\widehat{\mathbf{W}}_{l})\leq\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}_{l})+R^{2}M_{X}^{2}\sqrt{\frac{\log\frac{2|\mathcal{Q}_{R}|}{\delta}}{2n}}.$$

This completes the proof. ∎

This bound shows that the true downstream risk can be controlled by two terms: the empirical calibration risk (i.e., reconstruction error in this case) and a generalization term depending on the weight drift $R$. A large weight drift increases the generalization term. Therefore, minimizing only $\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}_{l})$ may lead to poor downstream generalization if the selected quantized weights have large drift from the original FP weights.
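To see how the generalization term scales, it can be evaluated directly. The helper below is an illustrative sketch: it takes $\log|\mathcal{Q}_{R}|$ as input (named `log_card` here) to avoid overflow, and the numbers in the loop are hypothetical. The value `1000 * log(16)` is the log-cardinality of all 4-bit assignments to 1000 weights, which only upper-bounds $\log|\mathcal{Q}_{R}|$.

```python
import math

def generalization_gap(R, M_X, log_card, n, delta):
    """R^2 M_X^2 sqrt(log(2 |Q_R| / delta) / (2 n)), with log_card = log |Q_R|."""
    log_term = math.log(2.0) + log_card - math.log(delta)
    return (R ** 2) * (M_X ** 2) * math.sqrt(log_term / (2.0 * n))

# Hypothetical setting: 1000 weights on a 4-bit grid, 128 calibration samples.
log_card = 1000 * math.log(16)
for R in (0.5, 1.0, 2.0):
    gap = generalization_gap(R, M_X=1.0, log_card=log_card, n=128, delta=0.05)
    print(f"R = {R}: gap bound = {gap:.3f}")
```

Doubling the drift radius $R$ quadruples the term, while quadrupling the number of calibration samples $n$ only halves it, which is the asymmetry the discussion above points to.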

### D\.3 Proof of Corollary [3\.2](https://arxiv.org/html/2605.05693#S3.Thmtheorem2)

Fix a layer $l$, and define $D(\widehat{\mathbf{W}}'):=\|\widehat{\mathbf{W}}'-\mathbf{W}_{l}\|_{\mathrm{F}}^{2}$. For this fixed constrained minimizer $\widehat{\mathbf{W}}_{l}$, define $\mathcal{S}_{+}:=\{\widehat{\mathbf{W}}'\in\mathcal{Q}:D(\widehat{\mathbf{W}}')>D(\widehat{\mathbf{W}}_{l})\}$ and $\mathcal{S}_{-}:=\{\widehat{\mathbf{W}}'\in\mathcal{Q}:D(\widehat{\mathbf{W}}')<D(\widehat{\mathbf{W}}_{l})\}$. Set

$$\lambda_{\min}:=\max\left\{0,\;\max_{\widehat{\mathbf{W}}'\in\mathcal{S}_{+}}\frac{\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}_{l})-\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}')}{D(\widehat{\mathbf{W}}')-D(\widehat{\mathbf{W}}_{l})}\right\},\quad\text{and}\quad\lambda_{\max}:=\min_{\widehat{\mathbf{W}}'\in\mathcal{S}_{-}}\frac{\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}')-\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}_{l})}{D(\widehat{\mathbf{W}}_{l})-D(\widehat{\mathbf{W}}')}.$$

Here we use the conventions $\max_{\emptyset}(\cdot):=-\infty$ and $\min_{\emptyset}(\cdot):=+\infty$.

###### Proof\.

For simplicity, write $\widehat{\mathbf{W}}:=\widehat{\mathbf{W}}_{l}$ and let $\widehat{\mathbf{W}}'\in\mathcal{Q}$ be arbitrary. We also write $L$ and $D$ without the layer subscript; in this proof, $L$ is used interchangeably with the empirical calibration risk $\widehat{\mathcal{R}}_{\mathrm{cal}}$. For a fixed finite $\lambda\geq 0$, define $J_{\lambda}(\mathbf{W}):=\widehat{\mathcal{R}}_{\mathrm{cal}}(\mathbf{W})+\lambda D(\mathbf{W})$. The condition $\widehat{\mathbf{W}}\in\arg\min_{\mathbf{W}\in\mathcal{Q}}J_{\lambda}(\mathbf{W})$ is equivalent to the pairwise inequalities $J_{\lambda}(\widehat{\mathbf{W}})\leq J_{\lambda}(\widehat{\mathbf{W}}')$ for all $\widehat{\mathbf{W}}'\in\mathcal{Q}$.

For each $\widehat{\mathbf{W}}'\in\mathcal{Q}$, define $\Delta\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}'):=\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}')-\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}})$ and $\Delta D(\widehat{\mathbf{W}}'):=D(\widehat{\mathbf{W}}')-D(\widehat{\mathbf{W}})$. Then $J_{\lambda}(\widehat{\mathbf{W}})\leq J_{\lambda}(\widehat{\mathbf{W}}')$ is equivalent to $\Delta\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}')+\lambda\Delta D(\widehat{\mathbf{W}}')\geq 0$.

We first note that infeasible points are not omitted. Since $\widehat{\mathbf{W}}\in\mathcal{Q}_{R}$, we have $D(\widehat{\mathbf{W}})\leq R^{2}$. If $\widehat{\mathbf{W}}'\notin\mathcal{Q}_{R}$, then $D(\widehat{\mathbf{W}}')>R^{2}$, and hence $D(\widehat{\mathbf{W}}')>D(\widehat{\mathbf{W}})$. Therefore every infeasible point belongs to $\mathcal{S}_{+}$. In particular, any infeasible point with a smaller value of $L$ contributes to the lower-bound requirement encoded in $\lambda_{\min}$.

We now prove the sufficiency. Assume $\lambda_{\min}\leq\lambda_{\max}$ and fix a finite $\lambda_{R}$ such that $\lambda_{\min}\leq\lambda_{R}\leq\lambda_{\max}$. We verify $J_{\lambda_{R}}(\widehat{\mathbf{W}})\leq J_{\lambda_{R}}(\widehat{\mathbf{W}}')$ by considering three cases.

First, suppose $D(\widehat{\mathbf{W}}')>D(\widehat{\mathbf{W}})$. Then $\widehat{\mathbf{W}}'\in\mathcal{S}_{+}$. By the definition of $\lambda_{\min}$ and the choice $\lambda_{R}\geq\lambda_{\min}$, we have

$$\lambda_{R}\geq\frac{\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}})-\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}')}{D(\widehat{\mathbf{W}}')-D(\widehat{\mathbf{W}})}.$$

The denominator $D(\widehat{\mathbf{W}}')-D(\widehat{\mathbf{W}})$ is positive, so multiplying by it preserves the inequality and gives $\lambda_{R}\{D(\widehat{\mathbf{W}}')-D(\widehat{\mathbf{W}})\}\geq\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}})-\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}')$. This is equivalent to $\Delta\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}')+\lambda_{R}\Delta D(\widehat{\mathbf{W}}')\geq 0$, and hence $J_{\lambda_{R}}(\widehat{\mathbf{W}})\leq J_{\lambda_{R}}(\widehat{\mathbf{W}}')$. This case includes all infeasible candidates.

Second, suppose $D(\widehat{\mathbf{W}}')<D(\widehat{\mathbf{W}})$. Then $\widehat{\mathbf{W}}'\in\mathcal{S}_{-}$. Moreover, $\widehat{\mathbf{W}}'$ is feasible because $D(\widehat{\mathbf{W}}')<D(\widehat{\mathbf{W}})\leq R^{2}$. Since $\widehat{\mathbf{W}}$ minimizes $L$ over $\mathcal{Q}_{R}$, we have $\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}})\leq\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}')$. Thus the ratio defining the upper bound is nonnegative. By the definition of $\lambda_{\max}$ and the choice $\lambda_{R}\leq\lambda_{\max}$, we have

$$\lambda_{R}\leq\frac{\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}')-\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}})}{D(\widehat{\mathbf{W}})-D(\widehat{\mathbf{W}}')}.$$

The denominator $D(\widehat{\mathbf{W}})-D(\widehat{\mathbf{W}}')$ is positive, so multiplying by it preserves the inequality and gives $\lambda_{R}\{D(\widehat{\mathbf{W}})-D(\widehat{\mathbf{W}}')\}\leq\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}')-\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}})$. Equivalently, $\Delta\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}')+\lambda_{R}\Delta D(\widehat{\mathbf{W}}')\geq 0$, and hence $J_{\lambda_{R}}(\widehat{\mathbf{W}})\leq J_{\lambda_{R}}(\widehat{\mathbf{W}}')$. This case explains why the penalty parameter cannot be chosen too large: otherwise a feasible point with smaller distance penalty could dominate $\widehat{\mathbf{W}}$.

Third, suppose $D(\widehat{\mathbf{W}}')=D(\widehat{\mathbf{W}})$. Since $D(\widehat{\mathbf{W}})\leq R^{2}$, this implies $D(\widehat{\mathbf{W}}')\leq R^{2}$, so $\widehat{\mathbf{W}}'\in\mathcal{Q}_{R}$. By constrained optimality, $\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}})\leq\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}')$. Since the penalty terms are equal in this case, $J_{\lambda_{R}}(\widehat{\mathbf{W}})\leq J_{\lambda_{R}}(\widehat{\mathbf{W}}')$ follows immediately.

The three cases exhaust all $\widehat{\mathbf{W}}'\in\mathcal{Q}$. Therefore $J_{\lambda_{R}}(\widehat{\mathbf{W}})\leq J_{\lambda_{R}}(\widehat{\mathbf{W}}')$ for every $\widehat{\mathbf{W}}'\in\mathcal{Q}$, which proves

$$\widehat{\mathbf{W}}\in\arg\min_{\widehat{\mathbf{W}}'\in\mathcal{Q}}\left\{\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}')+\lambda_{R}D(\widehat{\mathbf{W}}')\right\}.$$
It remains to prove the converse. Suppose there exists a finite $\lambda_{R}\geq 0$ such that $\widehat{\mathbf{W}}\in\arg\min_{\widehat{\mathbf{W}}'\in\mathcal{Q}}J_{\lambda_{R}}(\widehat{\mathbf{W}}')$. Then for every $\widehat{\mathbf{W}}'\in\mathcal{Q}$, we have $\Delta\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}')+\lambda_{R}\Delta D(\widehat{\mathbf{W}}')\geq 0$.

If $D(\widehat{\mathbf{W}}')>D(\widehat{\mathbf{W}})$, then $\Delta D(\widehat{\mathbf{W}}')>0$, and the preceding inequality implies

$$\lambda_{R}\geq\frac{\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}})-\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}')}{D(\widehat{\mathbf{W}}')-D(\widehat{\mathbf{W}})}.$$

Taking the maximum over all such $\widehat{\mathbf{W}}'$ and using $\lambda_{R}\geq 0$ gives $\lambda_{R}\geq\lambda_{\min}$, with the convention that this is trivial when $\mathcal{S}_{+}=\emptyset$.

If $D(\widehat{\mathbf{W}}')<D(\widehat{\mathbf{W}})$, then $\Delta D(\widehat{\mathbf{W}}')<0$, and dividing by this negative quantity reverses the inequality. Thus

$$\lambda_{R}\leq\frac{\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}}')-\widehat{\mathcal{R}}_{\mathrm{cal}}(\widehat{\mathbf{W}})}{D(\widehat{\mathbf{W}})-D(\widehat{\mathbf{W}}')}.$$

Taking the minimum over all such $\widehat{\mathbf{W}}'$ gives $\lambda_{R}\leq\lambda_{\max}$, with the convention that this is trivial when $\mathcal{S}_{-}=\emptyset$. Hence $\lambda_{\min}\leq\lambda_{R}\leq\lambda_{\max}$.

This completes the proof\. ∎

Corollary [3\.2](https://arxiv.org/html/2605.05693#S3.Thmtheorem2) clarifies that the constrained calibration problem can be represented by a penalized calibration objective over a finite quantization set. It gives a recovery condition for a fixed constrained minimizer $\widehat{\mathbf{W}}_{l}$. In particular, if the supportedness condition $\lambda_{\min}\leq\lambda_{\max}$ holds, then one can choose a penalty strength $\lambda_{R}\in[\lambda_{\min},\lambda_{\max}]$ such that $\widehat{\mathbf{W}}_{l}$ remains optimal after replacing the hard radius constraint $D(\widehat{\mathbf{W}}')\leq R^{2}$ by the soft penalty $\lambda_{R}D(\widehat{\mathbf{W}}')$. This is the standard supported-solution viewpoint in multiobjective optimization: weighted-sum scalarizations recover supported efficient points, while finite objective sets may also contain efficient but unsupported points \[[66](https://arxiv.org/html/2605.05693#bib.bib67),[21](https://arxiv.org/html/2605.05693#bib.bib21),[48](https://arxiv.org/html/2605.05693#bib.bib46),[16](https://arxiv.org/html/2605.05693#bib.bib16),[46](https://arxiv.org/html/2605.05693#bib.bib45)\].

#### Interpretation of $\lambda_{\min}$ and $\lambda_{\max}$

The quantities $\lambda_{\min}$ and $\lambda_{\max}$ have a direct pairwise-comparison interpretation. The lower threshold $\lambda_{\min}$ is the smallest penalty strength needed to prevent candidates with larger distance $D(\widehat{\mathbf{W}}^{\prime}) > D(\widehat{\mathbf{W}}_l)$ from improving upon $\widehat{\mathbf{W}}_l$ under the penalized objective. This class includes all infeasible candidates, since $\widehat{\mathbf{W}}_l \in \mathcal{Q}_R$ implies $D(\widehat{\mathbf{W}}_l) \leq R^2$, whereas any $\widehat{\mathbf{W}}^{\prime} \notin \mathcal{Q}_R$ satisfies $D(\widehat{\mathbf{W}}^{\prime}) > R^2 \geq D(\widehat{\mathbf{W}}_l)$. Thus infeasible quantized weights are explicitly accounted for in $\lambda_{\min}$. In contrast, $\lambda_{\max}$ is the largest penalty strength allowed before a closer feasible candidate with $D(\widehat{\mathbf{W}}^{\prime}) < D(\widehat{\mathbf{W}}_l)$ can overtake $\widehat{\mathbf{W}}_l$ due to its smaller distance penalty. Such candidates are automatically feasible because $D(\widehat{\mathbf{W}}^{\prime}) < D(\widehat{\mathbf{W}}_l) \leq R^2$, and the constrained optimality of $\widehat{\mathbf{W}}_l$ ensures $L(\widehat{\mathbf{W}}_l) \leq L(\widehat{\mathbf{W}}^{\prime})$ for them. Hence $\lambda_{\min} \leq \lambda_{\max}$ simply states that the penalty must be large enough to suppress farther, possibly infeasible, low-risk candidates, but not so large that it favors closer, higher-risk feasible candidates. This is the exact condition under which the bias introduced by the penalty does not change the selected minimizer.

#### Reasonableness of the supportedness condition

We note that the condition $\lambda_{\min} \leq \lambda_{\max}$ is not an ad hoc regularity assumption; it is the finite-set feasibility condition for a common penalty multiplier. Indeed, every competitor $\widehat{\mathbf{W}}^{\prime} \in \mathcal{Q}$ imposes one admissible interval for $\lambda_R$: candidates with $D(\widehat{\mathbf{W}}^{\prime}) > D(\widehat{\mathbf{W}}_l)$ impose lower bounds, candidates with $D(\widehat{\mathbf{W}}^{\prime}) < D(\widehat{\mathbf{W}}_l)$ impose upper bounds, and candidates with $D(\widehat{\mathbf{W}}^{\prime}) = D(\widehat{\mathbf{W}}_l)$ impose no additional constraint because constrained optimality already gives $L(\widehat{\mathbf{W}}_l) \leq L(\widehat{\mathbf{W}}^{\prime})$. Hence $[\lambda_{\min}, \lambda_{\max}]$ is precisely the intersection of all pairwise admissible intervals. Its nonemptiness means that the penalty can be chosen large enough to suppress farther, possibly infeasible, low-risk candidates, but not so large that it favors closer, higher-risk feasible candidates.
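The pairwise intervals described above can be checked numerically on a toy candidate set. The following is a minimal sketch, not the paper's code: the function name `penalty_interval`, the risk and distance values, and the radius are made up for illustration, and it assumes that $\lambda_{\min}$ is the largest pairwise lower bound (floored at zero) and $\lambda_{\max}$ the smallest pairwise upper bound.

```python
import numpy as np

def penalty_interval(risk, dist, k):
    """Intersect the pairwise admissible intervals for candidate k.

    risk[i] plays the role of L(W'_i) and dist[i] the role of D(W'_i).
    Returns (lam_min, lam_max); the interval is empty if lam_min > lam_max.
    """
    risk = np.asarray(risk, dtype=float)
    dist = np.asarray(dist, dtype=float)
    farther = dist > dist[k]   # these candidates impose lower bounds
    closer = dist < dist[k]    # these candidates impose upper bounds
    lam_min = 0.0
    if farther.any():
        lower = (risk[k] - risk[farther]) / (dist[farther] - dist[k])
        lam_min = max(0.0, float(lower.max()))
    lam_max = float("inf")
    if closer.any():
        upper = (risk[closer] - risk[k]) / (dist[k] - dist[closer])
        lam_max = float(upper.min())
    return lam_min, lam_max

# Made-up toy values: four candidates and a hypothetical radius R^2 = 3.5.
risk = np.array([1.0, 2.0, 4.0, 0.5])
dist = np.array([3.0, 2.0, 1.0, 5.0])
feasible = dist <= 3.5
# Constrained minimizer: lowest risk among feasible candidates.
k = int(np.flatnonzero(feasible)[np.argmin(risk[feasible])])
lam_min, lam_max = penalty_interval(risk, dist, k)
lam = 0.5 * (lam_min + lam_max)             # any value inside the interval
recovered = int(np.argmin(risk + lam * dist))
print(k, (lam_min, lam_max), recovered)
```

In this toy case a penalty inside the interval recovers the constrained minimizer, a smaller penalty lets the infeasible low-risk candidate win, and a larger penalty hands the minimum to a closer, higher-risk candidate.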

This condition is closely aligned with the supported-solution viewpoint in multiobjective optimization. Weighted-sum scalarizations recover supported efficient solutions, whereas finite objective sets may contain efficient but unsupported solutions that cannot be recovered by any nonnegative linear weighting \[[21](https://arxiv.org/html/2605.05693#bib.bib21), [48](https://arxiv.org/html/2605.05693#bib.bib46), [16](https://arxiv.org/html/2605.05693#bib.bib16), [46](https://arxiv.org/html/2605.05693#bib.bib45)\]. In this sense, $\lambda_{\min} \leq \lambda_{\max}$ says precisely that the constrained minimizer $\widehat{\mathbf{W}}_l$ is supported by some linear scalarization of the distance-risk trade-off.

Similar conditions also appear in exact-penalty and Lagrangian-relaxation theory. Exact-penalty methods typically require the existence of a finite penalty parameter, often under constraint qualifications, multiplier bounds, or other exactness conditions \[[26](https://arxiv.org/html/2605.05693#bib.bib26), [13](https://arxiv.org/html/2605.05693#bib.bib14)\]. Similar exact-penalty ideas have also been developed for nonlinear integer programming, where additional conditions are imposed to guarantee equivalence between the original discrete problem and its penalized reformulation \[[45](https://arxiv.org/html/2605.05693#bib.bib44)\]. Likewise, in Lagrangian relaxation for integer programming, a multiplier generally provides a relaxation or bound, and exact primal recovery is not automatic without an additional exactness or zero-gap condition \[[22](https://arxiv.org/html/2605.05693#bib.bib22), [18](https://arxiv.org/html/2605.05693#bib.bib18)\].

#### Link to Lagrangian relaxation

This result is closely related to classical Lagrangian relaxation. The constrained problem $\min_{\widehat{\mathbf{W}}^{\prime} \in \mathcal{Q}} L(\widehat{\mathbf{W}}^{\prime})$ subject to $D(\widehat{\mathbf{W}}^{\prime}) \leq R^2$ has the Lagrangian form $L(\widehat{\mathbf{W}}^{\prime}) + \lambda\{D(\widehat{\mathbf{W}}^{\prime}) - R^2\}$ with multiplier $\lambda \geq 0$. Since the term $-\lambda R^2$ is constant in $\widehat{\mathbf{W}}^{\prime}$, minimizing the Lagrangian over $\mathcal{Q}$ is equivalent to minimizing $L(\widehat{\mathbf{W}}^{\prime}) + \lambda D(\widehat{\mathbf{W}}^{\prime})$. In convex optimization, suitable constraint qualifications and strong duality often justify recovery of constrained optima from Lagrange multipliers \[[4](https://arxiv.org/html/2605.05693#bib.bib7)\]. In discrete or integer optimization, however, Lagrangian relaxation is generally a relaxation or bounding device, and exact primal recovery is not automatic \[[17](https://arxiv.org/html/2605.05693#bib.bib17), [22](https://arxiv.org/html/2605.05693#bib.bib22), [18](https://arxiv.org/html/2605.05693#bib.bib18)\]. Corollary [3.2](https://arxiv.org/html/2605.05693#S3.Thmtheorem2) can therefore be viewed as a finite-set analogue of Lagrangian primal recovery. It identifies when a Lagrangian-style scalarization recovers the selected constrained minimizer.

## Appendix E Algorithms and Implementation Details

### E.1 Algorithms

In this section, we summarize the algorithms used to optimize the proposed SARQC objective\.

Algorithm 1 SARQC with Grid Search over Scaling Factors (SARQC-GS)

1. Input: FP weight matrix $\mathbf{W}_l \in \mathbb{R}^{d_{\mathrm{out}} \times d_{\mathrm{in}}}$, calibration input matrix $\mathbf{X}_l \in \mathbb{R}^{d_{\mathrm{in}} \times n}$, search set $\mathcal{A}$, saliency matrix $\mathbf{S}_l$, regularization weight $\lambda > 0$
2. for each $\alpha \in \mathcal{A}$ do
3. Construct the channel-wise scaling vector $\tilde{s}_l(\alpha)$
4. $\widetilde{\mathbf{S}}_l(\alpha) \leftarrow \mathrm{diag}(\tilde{s}_l(\alpha))$
5. $\widehat{\mathbf{W}}_l(\alpha) \leftarrow Q\bigl(\mathbf{W}_l\,\widetilde{\mathbf{S}}_l(\alpha)\bigr)\,\widetilde{\mathbf{S}}_l(\alpha)^{-1}$
6. $\mathcal{L}_{\mathrm{recon}}(\alpha) \leftarrow \|\mathbf{W}_l\mathbf{X}_l - \widehat{\mathbf{W}}_l(\alpha)\mathbf{X}_l\|_{\mathrm{F}}^2$
7. $\mathcal{L}_{\mathrm{sar}}(\alpha) \leftarrow \|(\widehat{\mathbf{W}}_l(\alpha) - \mathbf{W}_l)\mathbf{S}_l\|_{\mathrm{F}}^2$
8. end for
9. Compute $\mathcal{L}_{\mathrm{recon}}^{\min}, \mathcal{L}_{\mathrm{recon}}^{\max} \leftarrow \min_{\alpha \in \mathcal{A}} \mathcal{L}_{\mathrm{recon}}(\alpha),\; \max_{\alpha \in \mathcal{A}} \mathcal{L}_{\mathrm{recon}}(\alpha)$
10. Compute $\mathcal{L}_{\mathrm{sar}}^{\min}, \mathcal{L}_{\mathrm{sar}}^{\max} \leftarrow \min_{\alpha \in \mathcal{A}} \mathcal{L}_{\mathrm{sar}}(\alpha),\; \max_{\alpha \in \mathcal{A}} \mathcal{L}_{\mathrm{sar}}(\alpha)$
11. $\mathcal{L}^{\star} \leftarrow +\infty$
12. for each $\alpha \in \mathcal{A}$ do
13. $\widetilde{\mathcal{L}}_{\mathrm{recon}}(\alpha) \leftarrow \dfrac{\mathcal{L}_{\mathrm{recon}}(\alpha) - \mathcal{L}_{\mathrm{recon}}^{\min}}{\mathcal{L}_{\mathrm{recon}}^{\max} - \mathcal{L}_{\mathrm{recon}}^{\min}}$, $\quad \widetilde{\mathcal{L}}_{\mathrm{sar}}(\alpha) \leftarrow \dfrac{\mathcal{L}_{\mathrm{sar}}(\alpha) - \mathcal{L}_{\mathrm{sar}}^{\min}}{\mathcal{L}_{\mathrm{sar}}^{\max} - \mathcal{L}_{\mathrm{sar}}^{\min}}$
14. $\mathcal{L}(\alpha) \leftarrow \widetilde{\mathcal{L}}_{\mathrm{recon}}(\alpha) + \lambda\,\widetilde{\mathcal{L}}_{\mathrm{sar}}(\alpha)$
15. if $\mathcal{L}(\alpha) < \mathcal{L}^{\star}$ then
16. $\mathcal{L}^{\star} \leftarrow \mathcal{L}(\alpha)$, $\quad \widehat{\mathbf{W}}_l \leftarrow \widehat{\mathbf{W}}_l(\alpha)$
17. end if
18. end for
19. return $\widehat{\mathbf{W}}_l$
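Algorithm 1 can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: `rtn_quantize` is a hypothetical stand-in for the quantizer $Q$ (symmetric per-row round-to-nearest), channel $j$ is taken to mean input channel $j$ (row $j$ of $\mathbf{X}_l$, column $j$ of $\mathbf{W}_l$), and the zero-span guard in `minmax` is an added safeguard that the algorithm does not specify.

```python
import numpy as np

def rtn_quantize(W, bits=4):
    """Hypothetical stand-in for Q: symmetric per-row round-to-nearest."""
    qmax = 2 ** (bits - 1) - 1
    step = np.abs(W).max(axis=1, keepdims=True) / qmax
    step = np.where(step == 0, 1.0, step)  # avoid dividing by zero on all-zero rows
    return np.clip(np.round(W / step), -qmax - 1, qmax) * step

def minmax(v):
    """Min-max normalize to [0, 1]; a constant vector maps to zeros (assumed guard)."""
    span = v.max() - v.min()
    return (v - v.min()) / span if span > 0 else np.zeros_like(v)

def sarqc_gs(W, X, lam, alphas=None, bits=4):
    """Grid search over alpha with the normalized joint objective (Algorithm 1)."""
    if alphas is None:
        alphas = np.arange(21) / 20.0        # uniform grid k/20, k = 0..20
    x_mag = np.abs(X).mean(axis=1)           # mean |X^(j)| per input channel
    w_mag = np.abs(W).mean(axis=0)           # mean |W^(j)| per input channel
    s_sal = x_mag / w_mag                    # fixed saliency, diagonal of S_l
    cands, l_recon, l_sar = [], [], []
    for a in alphas:
        s = x_mag ** a / w_mag ** (1.0 - a)  # candidate scaling vector
        s = s / np.sqrt(s.max() * s.min())   # geometric mean of the extrema
        W_hat = rtn_quantize(W * s, bits) / s            # Q(W S~) S~^{-1}
        cands.append(W_hat)
        l_recon.append(np.sum(((W - W_hat) @ X) ** 2))   # ||W X - W_hat X||_F^2
        l_sar.append(np.sum(((W_hat - W) * s_sal) ** 2)) # ||(W_hat - W) S||_F^2
    joint = minmax(np.array(l_recon)) + lam * minmax(np.array(l_sar))
    best = int(np.argmin(joint))             # first minimum, like the strict "<" test
    return cands[best], float(alphas[best])

# Usage on random data (shapes are arbitrary).
rng = np.random.default_rng(0)
W_demo = rng.normal(size=(8, 16))
X_demo = rng.normal(size=(16, 32))
W_hat_demo, alpha_demo = sarqc_gs(W_demo, X_demo, lam=0.5)
print(alpha_demo, W_hat_demo.shape)
```

Both losses are collected for every candidate before normalization, since the min-max statistics depend on the whole grid.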

Algorithm 2 SARQC with Gram-based Sequential Quantization (SARQC-GBS)

1. Input: FP weight matrix $\mathbf{W}_l \in \mathbb{R}^{d_{\mathrm{out}} \times d_{\mathrm{in}}}$, calibration input matrix $\mathbf{X}_l \in \mathbb{R}^{d_{\mathrm{in}} \times n}$, saliency profile $s_l(\gamma)$, regularization weight $\lambda > 0$, block size $B$
2. $\mathbf{G}_l \leftarrow \mathbf{X}_l\mathbf{X}_l^{\top} + \lambda\bar{h}_l\,\mathrm{diag}\!\left(\dfrac{s_l(\gamma)^2}{\mathrm{mean}(s_l(\gamma)^2)}\right)$, where $\bar{h}_l \leftarrow \mathrm{mean}\!\left(\mathrm{diag}(\mathbf{X}_l\mathbf{X}_l^{\top})\right)$
3. $\mathbf{M}_l \leftarrow \mathrm{chol}(\mathbf{G}_l^{-1})^{\top}$
4. $\widehat{\mathbf{W}}_l \leftarrow \mathbf{0}_{d_{\mathrm{out}} \times d_{\mathrm{in}}}$
5. $\mathbf{U} \leftarrow \mathbf{W}_l$
6. for $i = 1, 1+B, 1+2B, \dots, d_{\mathrm{in}}$ do
7. $i_{\mathrm{end}} \leftarrow \min(i+B-1,\; d_{\mathrm{in}})$
8. $\mathbf{E} \leftarrow \mathbf{0}_{d_{\mathrm{out}} \times (i_{\mathrm{end}}-i+1)}$
9. for $j = i, i+1, \dots, i_{\mathrm{end}}$ do
10. $\widehat{\mathbf{W}}_{l,:,j} \leftarrow Q(\mathbf{U}_{:,j})$
11. $\mathbf{e} \leftarrow \dfrac{\mathbf{U}_{:,j} - \widehat{\mathbf{W}}_{l,:,j}}{(\mathbf{M}_l)_{jj}}$
12. $\mathbf{E}_{:,\,j-i+1} \leftarrow \mathbf{e}$
13. $\mathbf{U}_{:,\,j:i_{\mathrm{end}}} \leftarrow \mathbf{U}_{:,\,j:i_{\mathrm{end}}} - \mathbf{e}\,(\mathbf{M}_l)_{j,\,j:i_{\mathrm{end}}}$
14. end for
15. if $i_{\mathrm{end}} < d_{\mathrm{in}}$ then
16. $\mathbf{U}_{:,\,i_{\mathrm{end}}+1:d_{\mathrm{in}}} \leftarrow \mathbf{U}_{:,\,i_{\mathrm{end}}+1:d_{\mathrm{in}}} - \mathbf{E}\,(\mathbf{M}_l)_{i:i_{\mathrm{end}},\,i_{\mathrm{end}}+1:d_{\mathrm{in}}}$
17. end if
18. end for
19. return $\widehat{\mathbf{W}}_l$
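A NumPy sketch of Algorithm 2 is given below, under stated assumptions rather than as the authors' code. The quantizer $Q$ is replaced by a hypothetical symmetric round-to-nearest grid whose per-row step is fixed from $\mathbf{W}_l$ up front, indices are 0-based with an exclusive block end, and $\mathrm{chol}(\cdot)^{\top}$ is taken to be the transpose of NumPy's lower-triangular Cholesky factor, so that $\mathbf{G}_l^{-1} = \mathbf{M}_l^{\top}\mathbf{M}_l$ with $\mathbf{M}_l$ upper triangular.

```python
import numpy as np

def sarqc_gbs(W, X, s, lam, block=4, bits=4):
    """Blockwise sequential quantization with a saliency-regularized Gram matrix."""
    W = np.array(W, dtype=float)
    d_out, d_in = W.shape
    H = X @ X.T
    h_bar = np.mean(np.diag(H))
    G = H + lam * h_bar * np.diag(s ** 2 / np.mean(s ** 2))
    M = np.linalg.cholesky(np.linalg.inv(G)).T   # upper triangular factor of G^{-1}

    qmax = 2 ** (bits - 1) - 1
    step = np.abs(W).max(axis=1) / qmax          # assumed per-row grid, fixed from W

    W_hat = np.zeros_like(W)
    U = W.copy()                                 # working copy, updated in place
    for i in range(0, d_in, block):
        i_end = min(i + block, d_in)             # exclusive end of the block
        E = np.zeros((d_out, i_end - i))
        for j in range(i, i_end):
            W_hat[:, j] = np.clip(np.round(U[:, j] / step), -qmax - 1, qmax) * step
            e = (U[:, j] - W_hat[:, j]) / M[j, j]
            E[:, j - i] = e
            # Propagate the scaled error to the remaining columns of the block.
            U[:, j:i_end] -= np.outer(e, M[j, j:i_end])
        if i_end < d_in:
            # Lazy batched update of all columns after the block.
            U[:, i_end:] -= E @ M[i:i_end, i_end:]
    return W_hat

# Usage on random data with a saliency profile at gamma = 0.5.
rng = np.random.default_rng(0)
W_demo = rng.normal(size=(6, 10))
X_demo = rng.normal(size=(10, 40))
s_demo = np.abs(X_demo).mean(axis=1) ** 0.5 / np.abs(W_demo).mean(axis=0) ** 0.5
W_hat_demo = sarqc_gbs(W_demo, X_demo, s_demo, lam=0.5)
print(W_hat_demo.shape)
```

Because $\lambda > 0$ and the saliency entries are positive, the regularized matrix is positive definite, so the inverse and the Cholesky factorization are well defined even when there are fewer calibration samples than input channels.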

### E.2 Implementation Details

#### Implementation Details for SARQC-GS

SARQC-GS is implemented as a scaling-based post-training quantization procedure with a regularized model selection rule over candidate scaling factors. For each layer, we perform a grid search over a scalar parameter $\alpha$ that controls a channel-wise reparameterization of the weight matrix. Specifically, we consider a uniform grid

$$
\mathcal{A} = \left\{ \frac{k}{20} : k = 0, 1, \dots, 20 \right\}.
$$

For each $\alpha \in \mathcal{A}$, we construct a channel-wise scaling vector $\tilde{s}_l(\alpha) \in \mathbb{R}_+^{d_{\mathrm{in}}}$ and the corresponding diagonal matrix

$$
\widetilde{\mathbf{S}}_l(\alpha) = \mathrm{diag}\bigl(\tilde{s}_l(\alpha)\bigr).
$$

Following the design of AWQ \[[41](https://arxiv.org/html/2605.05693#bib.bib38)\], the scaling factors are defined as

$$
\tilde{s}_l^{(j)}(\alpha) = \frac{\mathrm{mean}(|\mathbf{X}_l^{(j)}|)^{\alpha}}{\mathrm{mean}(|\mathbf{W}_l^{(j)}|)^{1-\alpha}},
$$

which interpolates between activation-driven and weight-driven normalization. The resulting vector is further normalized by the geometric mean of its extrema to improve numerical stability.

Given $\widetilde{\mathbf{S}}_l(\alpha)$, we generate a candidate quantized weight via

$$
\widehat{\mathbf{W}}_l(\alpha) = Q\bigl(\mathbf{W}_l\,\widetilde{\mathbf{S}}_l(\alpha)\bigr)\,\widetilde{\mathbf{S}}_l(\alpha)^{-1}.
$$

In addition, SARQC-GS introduces a fixed saliency matrix $\mathbf{S}_l = \mathrm{diag}(s_l)$ for discrepancy regularization, defined as

$$
s_l^{(j)} = \frac{\mathrm{mean}(|\mathbf{X}_l^{(j)}|)}{\mathrm{mean}(|\mathbf{W}_l^{(j)}|)}.
$$

Importantly, $\widetilde{\mathbf{S}}_l(\alpha)$ and $\mathbf{S}_l$ serve distinct roles: the former generates candidate solutions, while the latter weights channel-wise deviations in the regularization term.

For each candidate $\alpha$, we evaluate the reconstruction loss

$$
\mathcal{L}_{\mathrm{recon}}(\alpha) = \|\mathbf{W}_l\mathbf{X}_l - \widehat{\mathbf{W}}_l(\alpha)\mathbf{X}_l\|_{\mathrm{F}}^2,
$$

and the saliency-weighted discrepancy

$$
\mathcal{L}_{\mathrm{sar}}(\alpha) = \|(\widehat{\mathbf{W}}_l(\alpha) - \mathbf{W}_l)\mathbf{S}_l\|_{\mathrm{F}}^2.
$$

To balance the scale mismatch between the two terms, we apply min-max normalization across $\alpha \in \mathcal{A}$ and select the best candidate by minimizing the normalized joint objective.

At the outer level, the regularization weight $\lambda$ is selected from $\{0.1, 0.2, \dots, 1.0\}$ using a held-out validation split of the calibration data. For each $\lambda$, we run layer-wise SARQC-GS on the training subset and choose the value that minimizes the validation perplexity.

#### Implementation Details for SARQC-GBS

SARQC-GBS is implemented as a second-order Gram-based sequential quantization method, where the curvature matrix is regularized using saliency-aware statistics.

We perform layer-wise hyperparameter selection over

$$
\lambda \in \{0.25, 0.5, 0.75\}, \qquad \gamma \in \{0.1, 0.15, 0.35, 0.5\}.
$$

Here, $\lambda$ controls the strength of curvature regularization, while $\gamma$ determines the relative contribution of activation and weight statistics in the saliency profile.

For each candidate $(\lambda, \gamma)$, we construct a saliency vector

$$
s_l^{(j)}(\gamma) = \frac{\mathrm{mean}(|\mathbf{X}_l^{(j)}|)^{\gamma}}{\mathrm{mean}(|\mathbf{W}_l^{(j)}|)^{1-\gamma}}.
$$

We then construct the regularized curvature matrix as

$$
\mathbf{G}_l := \mathbf{X}_l\mathbf{X}_l^{\top} + \lambda\,\mathbf{S}_l\mathbf{S}_l^{\top}.
$$

To match the practical scale of the empirical Gram matrix across layers, we use

$$
\mathbf{S}_l := \mathbf{S}_l(\gamma) = \sqrt{\bar{h}_l}\,\operatorname{diag}\!\left(\frac{s_l(\gamma)}{\sqrt{\operatorname{mean}(s_l(\gamma)^2)}}\right),
$$

with $\bar{h}_l := \operatorname{mean}(\operatorname{diag}(\mathbf{X}_l\mathbf{X}_l^{\top}))$ and $s_l(\gamma)$ defined channel-wise by $s_l^{(j)}(\gamma) := \mathrm{mean}(|\mathbf{X}_l^{(j)}|)^{\gamma} / \mathrm{mean}(|\mathbf{W}_l^{(j)}|)^{1-\gamma}$, where $\gamma$ controls the relative contribution of activation and weight statistics. This construction can be interpreted as a saliency-aware and scale-normalized curvature regularization of the Gram-based objective. It preserves the computational structure of GPTQ while modulating the relative importance of different channels.
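As a quick consistency check (a short derivation added for illustration, not taken from the source), squaring the diagonal matrix $\mathbf{S}_l$ shows that this definition reproduces the Gram update in Algorithm 2 and that the regularizer has the same mean diagonal as the empirical Gram matrix:

$$
\mathbf{S}_l\mathbf{S}_l^{\top} = \bar{h}_l\,\operatorname{diag}\!\left(\frac{s_l(\gamma)^2}{\operatorname{mean}(s_l(\gamma)^2)}\right),
\qquad
\operatorname{mean}\bigl(\operatorname{diag}(\mathbf{S}_l\mathbf{S}_l^{\top})\bigr) = \bar{h}_l\,\frac{\operatorname{mean}(s_l(\gamma)^2)}{\operatorname{mean}(s_l(\gamma)^2)} = \bar{h}_l = \operatorname{mean}\bigl(\operatorname{diag}(\mathbf{X}_l\mathbf{X}_l^{\top})\bigr).
$$

Under this reading, $\lambda$ acts as a relative weight: the average diagonal entry added by the regularizer is $\lambda\bar{h}_l$, regardless of the layer's activation scale, while $s_l(\gamma)$ only redistributes that mass across channels.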

For efficiency, hyperparameter selection is performed on a reduced subset of input channels. The calibration data are split into training and validation subsets. The training subset is used to construct curvature statistics and perform sequential quantization, while the validation subset is used to evaluate candidate configurations. The best $(\lambda, \gamma)$ pair is then applied to the full layer.

#### Implementation Details for SARQC with OmniQuant

In the ablation study, we extend OmniQuant \[[55](https://arxiv.org/html/2605.05693#bib.bib53)\] by incorporating the SARQC objective into its differentiable quantization framework. Unlike the layer-wise linear form used in the main method, OmniQuant operates on transformer-block outputs. Accordingly, let $\mathcal{F}(\cdot, \cdot)$ denote the forward mapping of the corresponding transformer block. In the special case of a single linear layer, this reduces to the familiar form $\mathbf{W}_l\mathbf{X}_l$.

Let $\widehat{\mathbf{W}}_l(\Theta_1, \Theta_2)$ and $\widehat{\mathbf{X}}_l(\Theta_2)$ denote the differentiable quantized-dequantized weights and activations induced by the OmniQuant parameters $(\Theta_1, \Theta_2)$. The original OmniQuant objective can be written as

$$
\min_{\Theta_1, \Theta_2}\; \bigl\|\mathcal{F}(\mathbf{W}_l, \mathbf{X}_l) - \mathcal{F}\bigl(\widehat{\mathbf{W}}_l(\Theta_1, \Theta_2),\, \widehat{\mathbf{X}}_l(\Theta_2)\bigr)\bigr\|_{\mathrm{F}}^2.
$$
To incorporate SARQC, we augment this objective with the saliency-aware regularization term in [Equation 8](https://arxiv.org/html/2605.05693#S3.E8). Specifically, for each layer or block $l$, we optimize

$$
\min_{\Theta_1, \Theta_2} \quad \underbrace{\bigl\|\mathcal{F}(\mathbf{W}_l, \mathbf{X}_l) - \mathcal{F}\bigl(\widehat{\mathbf{W}}_l(\Theta_1, \Theta_2),\, \widehat{\mathbf{X}}_l(\Theta_2)\bigr)\bigr\|_{\mathrm{F}}^2}_{\mathcal{L}_{\mathrm{recon}}} + \lambda \underbrace{\bigl\|\bigl(\widehat{\mathbf{W}}_l(\Theta_1, \Theta_2) - \mathbf{W}_l\bigr)\mathbf{S}_l\bigr\|_{\mathrm{F}}^2}_{\mathcal{L}_{\mathrm{sar}}}.
$$

Here, $\mathbf{S}_l = \mathrm{diag}(s_l)$ is the saliency matrix, with channel-wise saliency weights defined by

$$
s_l^{(j)} = \frac{\mathrm{mean}(|\mathbf{X}_l^{(j)}|)^{1/2}}{\mathrm{mean}(|\mathbf{W}_l^{(j)}|)^{1/2}}.
$$

The reconstruction term follows the original OmniQuant formulation and remains differentiable with respect to $(\Theta_1, \Theta_2)$ via straight-through estimators. The SARQC regularization term is likewise differentiable, since it is applied directly to the dequantized weights $\widehat{\mathbf{W}}_l(\Theta_1, \Theta_2)$. In practice, we optimize $(\Theta_1, \Theta_2)$ using Adam with the same learning-rate schedule as OmniQuant. The regularization weight $\lambda$ is selected from $\{0.01, 0.05, 0.1, 0.15\}$ based on validation perplexity.

## Appendix F Extra Experimental Results

### F.1 Details of [Figure 1](https://arxiv.org/html/2605.05693#S1.F1) *(b)*

[Table 4](https://arxiv.org/html/2605.05693#A6.T4) reports the exact zero-shot accuracies underlying [Figure 1](https://arxiv.org/html/2605.05693#S1.F1). The two accuracy curves in [Figure 1](https://arxiv.org/html/2605.05693#S1.F1) are obtained from the average accuracies in the last column for SARQC-GBS(Identity) and SARQC-GBS(Saliency), respectively. The figure also overlays the reconstruction term and the regularization term to illustrate how the trade-off controlled by $\lambda$ affects downstream performance. For visualization, $\mathcal{L}_{\mathrm{recon}}$ and $\mathcal{L}_{\mathrm{sar}}$ are min-max normalized to $[0, 1]$ over $\lambda \in \{0.1, 0.2, \dots, 1.0\}$. Since the raw reconstruction loss at $\lambda = 0$ is much larger than the others, directly including it would compress the remaining curve; we therefore plot the point at $\lambda = 0$ using the same normalization rule determined from $\lambda \in \{0.1, \dots, 1.0\}$. This preserves the outlier nature of $\lambda = 0$ while keeping the overall trend visually interpretable. We also conduct the same study for SARQC-GS. The corresponding curve is shown in [Figure 3](https://arxiv.org/html/2605.05693#A6.F3), and the exact zero-shot accuracies for different $\lambda$ are reported in [Table 5](https://arxiv.org/html/2605.05693#A6.T5).
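The visualization rule described above, where the normalization range is fixed on a reference subset and then applied to an outlier, can be sketched as follows. The helper name `normalize_with_reference` and the loss values are made up for illustration; they are not numbers from the paper.

```python
import numpy as np

def normalize_with_reference(values, ref_mask):
    """Min-max normalize using only the entries selected by ref_mask.

    Entries outside the reference subset are mapped with the same affine
    rule, so they may fall outside [0, 1].
    """
    values = np.asarray(values, dtype=float)
    ref = values[np.asarray(ref_mask, dtype=bool)]
    lo, hi = ref.min(), ref.max()
    return (values - lo) / (hi - lo)

# Hypothetical losses for lambda = 0, 0.1, 0.2, 0.3 (illustrative only).
losses = [10.0, 2.0, 3.0, 4.0]
is_reference = [False, True, True, True]   # lambda = 0 is excluded from the range
normalized = normalize_with_reference(losses, is_reference)
print(normalized)
```

The excluded point keeps its outlier status (it lands above 1) while the reference points still span the full $[0, 1]$ range.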

![Refer to caption](https://arxiv.org/html/2605.05693v1/x5.png)

Figure 3: Effect of the regularization strength $\lambda$ for SARQC-GS on LLaMA2-7B under INT4 weight-only quantization.

Table 4: Zero-shot accuracy (%) of SARQC-GBS on LLaMA2-7B under W4A16 for different $\lambda$. For SARQC-GBS(Saliency), $\gamma$ is fixed at $0.5$.

Table 5: Zero-shot accuracy (%) of SARQC-GS on LLaMA2-7B under W4A16 for different $\lambda$.

### F.2 Visualization of Weight Drift

To further support the motivation in [Figure 1](https://arxiv.org/html/2605.05693#S1.F1), [Figure 4](https://arxiv.org/html/2605.05693#A6.F4) visualizes the absolute differences, $|\mathbf{W}_l - \widehat{\mathbf{W}}_l|$, for a representative layer of the INT4-quantized LLaMA2-7B model. The two panels correspond to the $\lambda = 0$ and $\lambda = 0.3$ cases in [Figure 1](https://arxiv.org/html/2605.05693#S1.F1) *(b)*. When $\lambda = 0$, corresponding to vanilla calibration, the quantized weights deviate more from the original FP weights, indicating more severe weight drift. With moderate regularization at $\lambda = 0.3$, the discrepancy is visibly smaller. This illustrates directly in weight space that the proposed regularizer keeps $\widehat{\mathbf{W}}_l$ from drifting too far from $\mathbf{W}_l$.

![Refer to caption](https://arxiv.org/html/2605.05693v1/figures/w_devivation_without_regularization.png)
(a) $\lambda = 0$

![Refer to caption](https://arxiv.org/html/2605.05693v1/figures/w_devivation_under_regularization.png)
(b) $\lambda = 0.3$

Figure 4: Visualization of the element-wise weight discrepancy $|\mathbf{W}_l - \widehat{\mathbf{W}}_l|$ for a representative layer of INT4-quantized LLaMA2-7B under different regularization strengths. The two panels correspond to the $\lambda = 0$ and $\lambda = 0.3$ cases in [Figure 1](https://arxiv.org/html/2605.05693#S1.F1) *(b)*.

### F.3 Results under W3A16

[Table 6](https://arxiv.org/html/2605.05693#A6.T6) reports zero-shot accuracy under W3A16 for both dense and MoE models. For W3A16, we focus on seven non-code zero-shot benchmarks and exclude HumanEval from the averaged score, since code-generation performance becomes highly unstable and substantially degraded under this aggressive quantization setting. Overall, SARQC-GBS consistently improves over strong Gram-based PTQ baselines in this setting. In particular, SARQC-GBS(Saliency) achieves the best average accuracy among quantized methods on four out of the five evaluated models, while SARQC-GBS(Identity) performs best on the remaining model. Comparing the two SARQC variants, the saliency-aware version tends to outperform the identity-based one on most dense models and on two of the three MoE models, indicating that saliency-aware regularization is generally helpful under low-bit weight-only quantization.

Table 6: Zero-shot accuracy (%) on multiple benchmarks under W3A16.

### F.4 Results on LLaMA-7B, LLaMA-13B, and LLaMA-30B

[Table 7](https://arxiv.org/html/2605.05693#A6.T7) reports zero-shot accuracy for LLaMA-7B, LLaMA-13B, and LLaMA-30B under both W4A16 and W3A16. Overall, SARQC yields consistent improvements over standard PTQ baselines across model scales, with the gains being more pronounced under the more challenging W3A16 setting.

Table 7: Zero-shot accuracy (%) on multiple benchmarks for LLaMA models under W4A16 and W3A16.

### F.5 Ablation Study on Calibration Size

[Table 8](https://arxiv.org/html/2605.05693#A6.T8) examines the sensitivity of SARQC to the calibration set size. Overall, SARQC-GBS(Saliency) remains stable across different calibration sizes and consistently achieves strong zero-shot performance, even when only a small number of calibration samples are available. This suggests that the proposed method is relatively robust to limited calibration data.

### F.6 Ablation Study on Calibration Corpus

[Table 9](https://arxiv.org/html/2605.05693#A6.T9) studies the sensitivity of the proposed methods to the choice of calibration corpus for LLaMA2-7B under W4A16. We report zero-shot accuracy on eight downstream benchmarks using two calibration datasets, C4 and Pile. Overall, the SARQC variants consistently outperform their corresponding baselines across both calibration corpora. Comparing identity-based and saliency-aware regularization, the saliency-aware variants are generally better in all four SARQC settings reported in the table. This supports the view that saliency-aware weighting helps preserve more important channels and improves generalization across different calibration data distributions.

Table 8: Sensitivity to the sample size of the calibration set: zero-shot accuracy (%) on multiple benchmarks for LLaMA2-13B.

Table 9: Sensitivity to the choice of the calibration set: zero-shot accuracy (%) on multiple benchmarks under W4A16 for LLaMA2-7B models.

### F.7 Speedup

To evaluate the practical efficiency of different quantization methods, we measure both prefill and decoding speedup on representative dense and MoE language models. All experiments are conducted on a single NVIDIA A100 80GB GPU. We report speedup relative to the corresponding FP16/BF16 baseline under the same setup. Specifically, both prefill and decoding speedup are computed as the throughput of the quantized model divided by that of the FP model, where larger values indicate better efficiency. As shown in [Table 10](https://arxiv.org/html/2605.05693#A6.T10), SARQC-GS and SARQC-GBS deliver speedups that are broadly comparable to those of AWQ and GPTQ, with only minor variation across methods. This suggests that SARQC preserves the practical efficiency benefits of weight-only quantization while improving accuracy without additional runtime overhead, since equivalent inference efficiency is guaranteed by design.

Table 10: Prefill and decoding speedup of quantized models on a sequence length of 2048.