InfoQuant: Shaping Activation Distributions for Low-Bit LLM Quantization


Summary

InfoQuant introduces a train-free method, Peak Suppression Orthogonal Transformation (PSOT), to reshape activation distributions for low-bit LLM quantization, preserving 97% floating-point accuracy under W4A4KV4 and outperforming prior PTQ methods.

arXiv:2605.26175v1 Announce Type: new Abstract: Low-bit activation quantization remains a major bottleneck in efficient large language model (LLM) deployment. The difficulty is not only that activations contain outliers, but that their distributions are often poorly matched to a low-bit uniform quantizer. Existing post-training quantization (PTQ) methods suppress peaks, balance channels, or minimize reconstruction error, yet they rarely specify what activation distribution is actually easy to discretize. As a result, activations may appear numerically smoother while still incurring large quantization error because the quantization range remains wide or most values collapse into a few levels near the mean. We recast activation transformation as quantizer-facing distribution design and analyze quantization error from an information-theoretic perspective. Our analysis shows that quantization-friendly activations should jointly have a smaller numerical range and sufficient dispersion within that range. Guided by this analysis, we propose InfoQuant, a train-free method that employs Peak Suppression Orthogonal Transformation (PSOT) to shape activations into more quantization-friendly distributions. We further introduce adaptive outlier-token selection to improve the robustness of PSOT during optimization. Across multiple LLM families, InfoQuant consistently outperforms prior PTQ and end-to-end training baselines. Under W4A4KV4, it preserves 97% of floating-point accuracy on average and reduces the LLaMA-2 13B performance gap by 42% over the previous state of the art. Code is available at [https://github.com/LLIKKE/InfoQuant](https://github.com/LLIKKE/InfoQuant)

# InfoQuant: Shaping Activation Distributions for Low-Bit LLM Quantization
Source: [https://arxiv.org/html/2605.26175](https://arxiv.org/html/2605.26175)
Ke Li¹, Dong An², Xiaoling Zang², Can Ye², Liang Xie³, Qibo Qiu⁴, Chen Shen⁵, Xiaofei He⁶, Wenxiao Wang¹,\*

¹School of Software Technology, Zhejiang University; ²Ant Group; ³College of Computer Science and Technology, Zhejiang University of Technology; ⁴China Mobile (Zhejiang) Research & Innovation Institute; ⁵Alibaba Cloud Computing; ⁶State Key Lab of CAD&CG, Zhejiang University. \*Corresponding author. {like2248, wenxiaowang}@zju.edu.cn

###### Abstract

Low-bit activation quantization remains a major bottleneck in efficient large language model (LLM) deployment. The difficulty is not only that activations contain outliers, but that their distributions are often poorly matched to a low-bit uniform quantizer. Existing post-training quantization (PTQ) methods suppress peaks, balance channels, or minimize reconstruction error, yet they rarely specify what activation distribution is actually easy to discretize. As a result, activations may appear numerically smoother while still incurring large quantization error because the quantization range remains wide or most values collapse into a few levels near the mean. We recast activation transformation as quantizer-facing distribution design and analyze quantization error from an information-theoretic perspective. Our analysis shows that quantization-friendly activations should jointly have a smaller numerical range and sufficient dispersion within that range. Guided by this analysis, we propose InfoQuant, a train-free method that employs Peak Suppression Orthogonal Transformation (PSOT) to shape activations into more quantization-friendly distributions. We further introduce adaptive outlier-token selection to improve the robustness of PSOT during optimization. Across multiple LLM families, InfoQuant consistently outperforms prior PTQ and end-to-end training baselines. Under W4A4KV4, it preserves 97% of floating-point accuracy on average and reduces the LLaMA-2 13B performance gap by 42% over the previous state of the art. Code is available at [github.com/LLIKKE/InfoQuant](https://github.com/LLIKKE/InfoQuant).


## 1 Introduction

Post\-training quantization \(PTQ\) is one of the most practical ways to reduce the memory and compute cost of large language model \(LLM\) inference\. Its main challenge, however, lies in low\-bit activation quantization\. Unlike weights, LLM activations often contain a small number of dominant coordinates that enlarge the quantization range and force round\-to\-nearest quantization to map many normal values to the same few levels\. This mismatch becomes particularly severe in 4\-bit settings, where limited quantization levels leave little room to preserve both rare extremes and dense central values\.

Recent PTQ methods increasingly address this problem through activation transformations before quantization. SmoothQuant Xiao et al. ([2023](https://arxiv.org/html/2605.26175#bib.bib41)) migrates activation difficulty into weights through diagonal scaling, while QuaRot Ashkboos et al. ([2024b](https://arxiv.org/html/2605.26175#bib.bib17)) and SpinQuant Liu et al. ([2025b](https://arxiv.org/html/2605.26175#bib.bib23)) use orthogonal rotations to redistribute activation energy; other methods further introduce more flexible affine transformations or reconstruction objectives Ma et al. ([2024](https://arxiv.org/html/2605.26175#bib.bib18)); Sun et al. ([2025](https://arxiv.org/html/2605.26175#bib.bib22)). Although these methods differ in form, they share the same practical role: they change the activation distribution seen by the quantizer. Yet most of them are motivated by suppressing outliers, balancing channels, or reducing reconstruction error, rather than by defining what transformed distribution a low-bit quantizer can represent well. Consequently, as illustrated in Figure [1](https://arxiv.org/html/2605.26175#S1.F1), they may reduce visible peaks without fully improving discretizability, or preserve small numerical error while still collapsing distributional resolution. The central question, therefore, is not only how to transform activations, but what transformed activation distribution is actually quantization-friendly.

![Refer to caption](https://arxiv.org/html/2605.26175v1/figures/oho2.png)

Figure 1: Activation distributions of the LLaMA-2 7B, layer 4 `q/k/v_proj` input under three transformations: the original activations (left), Hadamard rotation (center), and the learned rotation from PSOT (right). Compared with the original and Hadamard-rotated activations, PSOT produces a more quantization-friendly distribution with a narrower numerical range and larger normalized dispersion. Here, $b_n$ denotes the standard deviation after infinity-norm normalization, and a smaller $\lambda=\bar{s}/b_n$ indicates a lower normalized quantization-error bound, where $\bar{s}$ is the range-normalized quantization step size.

We address this gap by recasting activation transformation as quantizer-facing distribution design. From an information-theoretic perspective, we analyze how quantization error depends on the activation distribution after transformation. Our theoretical and empirical study shows that lower quantization error is associated with two complementary properties: a smaller numerical range and greater dispersion within that range. This result reframes the role of activation transformation. Rather than merely suppressing outliers or minimizing heuristic reconstruction losses, a PTQ method should explicitly shape activations into distributions that are easier for a low-bit quantizer to preserve. Since LLM activations are typically bell-shaped and often contain outliers Liu et al. ([2025a](https://arxiv.org/html/2605.26175#bib.bib48)), they are naturally misaligned with this target, which explains why low-bit activation quantization remains difficult.

Guided by this principle, we introduce InfoQuant, a train-free PTQ method that learns orthogonal transformations to produce more quantization-friendly activation distributions. Its core component is Peak Suppression Orthogonal Transformation (PSOT), which applies an activation-wise peak suppression objective to reduce the numerical range while increasing normalized dispersion. We further introduce adaptive outlier-token selection to improve optimization robustness, and learn activation clipping parameters to refine the final quantization range after the distribution has been reshaped. Although InfoQuant is designed around activation optimization, it remains compatible with standard weight quantization pipelines. Overall, our contributions can be summarized as follows:

- We introduce an information-theoretic framework for understanding activation quantization error and show, both theoretically and empirically, that quantization-friendly distributions should have a smaller numerical range and greater dispersion.
- We introduce InfoQuant, a hardware-efficient and train-free PTQ method centered on learned orthogonal activation shaping, together with adaptive outlier-token selection and learnable activation clipping for robust calibration.
- We demonstrate that activation-distribution optimization yields strong empirical and practical gains. In the W4A4KV4 setting, LLaMA-2 (7B, 13B, 70B) and LLaMA-3 (8B, 70B) retain an average of 97% of their original performance, and the 70B model can be quantized using only 24 GB of GPU memory.

## 2 Related Work

#### Post Training Quantization for LLMs\.

PTQ is an efficient and widely used approach for compressing LLMs. Because LLM weights are relatively flat and evenly distributed, weight-only quantization typically results in minimal performance degradation. GPTQ Frantar et al. ([2023](https://arxiv.org/html/2605.26175#bib.bib29)) uses Hessian-based error compensation to enable high compression with low accuracy loss. AWQ Lin et al. ([2024b](https://arxiv.org/html/2605.26175#bib.bib49)) and OWQ Lee et al. ([2024](https://arxiv.org/html/2605.26175#bib.bib52)) further improve performance by mitigating the effects of activation outliers. QuIP (Chee et al., [2023](https://arxiv.org/html/2605.26175#bib.bib46)) and QuIP# (Tseng et al., [2024](https://arxiv.org/html/2605.26175#bib.bib45)) apply random Hadamard transforms for incoherence processing and employ vector quantization on weights, achieving better performance. In contrast, activation quantization remains more challenging due to the presence of rare but extreme outliers (Wei et al., [2023](https://arxiv.org/html/2605.26175#bib.bib54); Xiao et al., [2023](https://arxiv.org/html/2605.26175#bib.bib41)), which can disproportionately affect accuracy.

#### Transformation\-based Methods\.

These methods more effectively redistribute activation outliers across channels. Channel scaling (Xiao et al., [2023](https://arxiv.org/html/2605.26175#bib.bib41)) shifts part of this burden to weights, while OmniQuant (Shao et al., [2024](https://arxiv.org/html/2605.26175#bib.bib50)) and LRQuant (Zhao et al., [2024](https://arxiv.org/html/2605.26175#bib.bib43)) optimize scaling parameters via MSE minimization. However, recent work (Yi et al., [2025](https://arxiv.org/html/2605.26175#bib.bib51)) shows that channel scaling alone fails under 4-bit settings, leading to notable degradation. AffineQuant (Ma et al., [2024](https://arxiv.org/html/2605.26175#bib.bib18)) learns affine transformations to precondition activations. However, due to the significant overhead of full-size matrix multiplication, AffineQuant can only apply affine transformations to a small fraction of linear layers. FlatQuant (Sun et al., [2025](https://arxiv.org/html/2605.26175#bib.bib22)) reduces this cost via Kronecker decomposition, applying an affine transformation to every linear layer. Leveraging computational invariance (Ashkboos et al., [2024a](https://arxiv.org/html/2605.26175#bib.bib53)), orthogonal transforms can be applied to weights and between-block activations without extra inference overhead. QuaRot (Ashkboos et al., [2024b](https://arxiv.org/html/2605.26175#bib.bib17)) uses randomized Hadamard transforms to remove outliers. SpinQuant (Liu et al., [2025b](https://arxiv.org/html/2605.26175#bib.bib23)) further optimizes learnable orthogonal matrices on the Stiefel manifold with a task loss (e.g., cross-entropy) to find stable transformations. OSTQuant Hu et al. ([2025](https://arxiv.org/html/2605.26175#bib.bib24)) combines channel scaling with orthogonal transforms and uses end-to-end distillation from the original outputs to improve quantization. Kurtail (Akhondzadeh et al., [2025](https://arxiv.org/html/2605.26175#bib.bib44)) facilitates quantization by controlling the kurtosis to make the distribution more uniform. BASE-Q (He et al., [2025](https://arxiv.org/html/2605.26175#bib.bib63)) introduces an additional bias term to balance the mean values of different channels after rotation.

## 3 Motivation

### 3.1 Quantization Preliminaries

Quantization maps high\-precision values to a set of discrete levels\. The process is detailed as follows:

$$\mathcal{Q}(\mathbf{X})=\operatorname{clamp}\left(\left\lfloor\frac{\mathbf{X}}{s}\right\rceil+z,\ 0,\ 2^{N}-1\right) \tag{1}$$

Here, the quantization step size is $s=\frac{\mathbf{X}_{\max}-\mathbf{X}_{\min}}{2^{N}-1}$, $z=-\left\lfloor\frac{\mathbf{X}_{\min}}{s}\right\rceil$ is the corresponding zero-point, $\lfloor\cdot\rceil$ denotes the rounding operation, and $N$ is the target bit-width. Given a floating-point tensor $\mathbf{X}$, the quantization function $\mathcal{Q}(\cdot)$ produces its integer-valued representation. Quantization error primarily arises from the rounding operation, which collapses all values within a single interval of size $s$ into the same discrete level.
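As a rough illustration of Eq. (1), the following NumPy sketch quantizes a tensor with min-max calibration and shows how a single dominant coordinate widens the step size. The function names `quantize` and `dequantize`, the synthetic Gaussian data, and the planted outlier value are assumptions made for this example, not part of the paper's code. Note that `np.round` rounds ties to even, which may differ from the rounding convention intended in the paper.

```python
import numpy as np

def quantize(x, n_bits=4):
    """Asymmetric uniform quantization in the form of Eq. (1).

    Returns integer codes, the step size s, and the zero-point z.
    Assumes x is not constant (otherwise s would be zero).
    """
    x = np.asarray(x, dtype=np.float64)
    x_min, x_max = x.min(), x.max()
    s = (x_max - x_min) / (2**n_bits - 1)       # step size
    z = -np.round(x_min / s)                    # zero-point
    q = np.clip(np.round(x / s) + z, 0, 2**n_bits - 1)
    return q.astype(np.int64), s, z

def dequantize(q, s, z):
    """Map integer codes back to floating-point values."""
    return (q - z) * s

# Hypothetical data: Gaussian values, then the same values with one outlier.
rng = np.random.default_rng(0)
x = rng.normal(size=4096)
x_outlier = x.copy()
x_outlier[0] = 60.0

for name, values in [("no outlier", x), ("one outlier", x_outlier)]:
    q, s, z = quantize(values)
    err = np.abs(values - dequantize(q, s, z)).mean()
    print(f"{name}: step={s:.3f}, levels used={np.unique(q).size}, mean |err|={err:.3f}")
```

With the outlier present, the step size grows and most values fall into only a few of the 16 available levels, which is the collapse described in the introduction.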

### 3.2 A Distributional View of Quantization Error

Recent PTQ methods often optimize activation transformations with MSE-based objectives (Shao et al., [2024](https://arxiv.org/html/2605.26175#bib.bib50); Zhao et al., [2024](https://arxiv.org/html/2605.26175#bib.bib43); Sun et al., [2025](https://arxiv.org/html/2605.26175#bib.bib22)). While MSE is a useful measure of numerical distortion, it does not fully capture the distributional mismatch introduced by low-bit quantization. This limitation is especially important for activation quantization under round-to-nearest (RTN), where many values may incur only small pointwise errors yet still be mapped to a small number of discrete levels. In such cases, the quantized activations can remain close in value to the original ones while losing substantial distributional resolution, which is not well reflected by MSE alone.

![Refer to caption](https://arxiv.org/html/2605.26175v1/x1.png)

Figure 2: Distributional effect of quantizing the LLaMA-2 7B, layer 4 `q/k/v_proj` input. Top: KL divergence between activation histograms before and after quantization, evaluated over different quantization step sizes $s$ and dispersion values $b_n$ with 15,000 histogram bins. Center/Bottom: activation histograms before and after quantization for low-error and high-error cases, respectively. Low-bit quantization is most destructive when a wide range and low normalized dispersion force dense activation values into too few discrete levels.

Prior work (Liu et al., [2025a](https://arxiv.org/html/2605.26175#bib.bib48)) has shown that activation distributions in LLMs are typically bell-shaped (e.g., Gaussian or Laplace). When such activations are quantized with a low-bit uniform RTN quantizer, a few large values can determine the quantization range, forcing most normal values to collapse toward levels near the mean (Figure [2](https://arxiv.org/html/2605.26175#S3.F2), bottom). The resulting error is therefore not only a matter of local rounding distortion, but also of how poorly the available quantization levels match the underlying activation distribution. Although non-uniform quantizers can in principle better adapt to dense regions, they usually introduce additional hardware complexity and are less attractive in practical low-bit deployment. These observations motivate an analytical metric that reflects both numerical deviation and quantization-induced distribution shift. To address this, we use a smoothed KL divergence as an analytical lens for the distributional distortion caused by low-bit quantization. Let $\mathbf{x}$ denote an activation token, and let each entry be denoted by a scalar $x\in\mathbf{x}$ with distribution $P(x)$. We consider a centered finite-bit clamped quantizer

$$\hat{x}=Q_{s,c}(x)=\operatorname{clip}\left(s\left\lfloor\frac{x}{s}\right\rceil,\,-c,\,c\right), \tag{2}$$

where $s$ is the quantization step size and $c$ is the clipping scale. Directly comparing $P(x)$ with the quantized distribution is ill-posed as a density-to-density KL, because quantization turns a continuous density into probability masses on discrete centroids. We therefore spread each centroid into a narrow continuous kernel. Specifically, let $q_i$ denote a quantization centroid, let $I_i=\{x\mid Q_{s,c}(x)=q_i\}$ be its quantization cell, and define the corresponding probability mass $p_i=\int_{I_i}P(x)\,dx$. The quantized distribution is relaxed as

$$Q_{\theta}^{(c)}(x)\approx\sum_{i}p_i\cdot\delta(x-q_i;\theta), \tag{3}$$

where $\delta(x;\theta)=\frac{1}{2\theta}\exp\left(-\frac{|x|}{\theta}\right)$ is a Laplace kernel that approaches a Dirac delta as $\theta\to 0$. Under the standard separation assumption $\theta\ll s$, the smoothed KL objective admits the approximation:

$$\begin{aligned} D_{\mathrm{KL}}\left(P\,\|\,Q_{\theta}^{(c)}\right) &=\int P(x)\,\log\left(\frac{P(x)}{Q_{\theta}^{(c)}(x)}\right)dx \\ &\approx -H(P)+H(\{p_i\})+\log(2\theta)+\frac{1}{\theta}\,\mathcal{E}_{\mathrm{clip}}, \end{aligned} \tag{4}$$

where $H(P)$ denotes the entropy of the original distribution, $H(\{p_i\})$ is the entropy of the quantized probability masses, and $\mathcal{E}_{\mathrm{clip}}=\mathbb{E}|x-Q_{s,c}(x)|$ is the expected absolute error of the finite-bit clamped quantizer. This decomposition exposes the key mechanism: after smoothing the discrete outputs, the distributional KL surrogate contains a direct quantization-error term. Thus, a transformation that makes activations easier to quantize should not only suppress extreme values, but also reduce the normalized error induced by the finite set of quantization levels.
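To make the two quantizer-dependent terms of Eq. (4) concrete, the sketch below estimates $H(\{p_i\})$ and $\mathcal{E}_{\mathrm{clip}}/\theta$ from samples for a tight and a wide clipping scale. The helper names, the Gaussian samples, the 4-bit setting with $c = Ms$ and $M = 7$, and the value of `theta` are illustrative assumptions; the terms $-H(P)$ and $\log(2\theta)$ are omitted because they do not change between the two settings.

```python
import numpy as np

def clamped_quantize(x, s, c):
    """Centered finite-bit clamped quantizer of Eq. (2)."""
    return np.clip(s * np.round(x / s), -c, c)

def quantizer_terms(x, s, c, theta):
    """Empirical H({p_i}) and E_clip / theta from Eq. (4)."""
    x_hat = clamped_quantize(x, s, c)
    _, counts = np.unique(x_hat, return_counts=True)
    p = counts / counts.sum()                    # empirical cell masses p_i
    mass_entropy = -(p * np.log(p)).sum()        # H({p_i}) in nats
    e_clip = np.abs(x - x_hat).mean()            # E|x - Q(x)|
    return mass_entropy, e_clip / theta

# Hypothetical setup: unit-variance Gaussian samples, 4-bit levels (M = 7).
rng = np.random.default_rng(0)
samples = rng.normal(size=100_000)
m = 7
theta = 0.01                                     # assumed kernel width, theta << s

c_tight, c_wide = 4.0, 40.0                      # the wide range mimics an outlier-driven scale
s_tight, s_wide = c_tight / m, c_wide / m

for name, s, c in [("tight", s_tight, c_tight), ("wide", s_wide, c_wide)]:
    h, e = quantizer_terms(samples, s, c, theta)
    print(f"{name}: H(p)={h:.3f}, E_clip/theta={e:.2f}")
```

In this toy setting the wide range collapses almost all mass into one cell, so the mass entropy drops while the error term grows sharply, and the error term dominates the surrogate.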

Since activations typically have bounded yet varying ranges, we normalize the clamped quantization error by the standard deviation $\sigma$:

$$\begin{split}\mathcal{E}_{\mathrm{clip}}^{\prime}&=\frac{1}{\sigma}\,\mathbb{E}|x-Q_{s,c}(x)|\\ &\leq\frac{s}{2\sigma}+\frac{1}{\sigma}\,\mathbb{E}\left[(|x|-c)_{+}\right].\end{split} \tag{5}$$

The two terms reveal the trade-off that ordinary outlier suppression does not fully describe. The first term is the in-range rounding error, which decreases when the step size is small relative to the activation spread. The second term is the clipping-tail error, which measures the mass left outside the finite range. Let $\lambda=s/\sigma$ and $\kappa=c/\sigma$. Then

$$\mathcal{E}_{\mathrm{clip}}^{\prime}\leq\frac{\lambda}{2}+\tau_{P}(\kappa), \tag{6}$$

where $\tau_{P}(\kappa)=\mathbb{E}\left[(|Y|-\kappa)_{+}\right]$ for the normalized variable $Y=x/\sigma$. For common bell-shaped activation distributions such as Gaussian and Laplace, $\tau_{P}$ admits closed forms, and for a $B$-bit quantizer with $c=Ms$ and $M=2^{B-1}-1$, the resulting bound decreases as $\kappa$ decreases in the tail-controlled regime relevant to calibrated PTQ. (Proofs and the Gaussian/Laplace closed forms can be found in Appendix [A](https://arxiv.org/html/2605.26175#A1).) Since $\kappa=c/\sigma=1/b_n$ and $\lambda=s/\sigma=\bar{s}/b_n$ with $b_n=\sigma/c$, this analysis turns the vague goal of “making activations smoother” into a concrete distributional target: reduce the clipped numerical range while keeping the normalized activation values well dispersed inside that range. Empirically, as shown in the top of Figure [2](https://arxiv.org/html/2605.26175#S3.F2), KL divergence decreases consistently with smaller $s$ and larger $b_n$. Together, the analysis and observation suggest a simple design principle for low-bit activation quantization: a good transformation should compress the effective range and spread useful activation mass across more available quantization levels.
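The bound in Eq. (6) can be checked numerically for the Gaussian case. The sketch below uses the standard closed form $\tau_P(\kappa)=2\left[\varphi(\kappa)-\kappa\,(1-\Phi(\kappa))\right]$ for $Y\sim\mathcal{N}(0,1)$, which is a textbook identity and not copied from the paper's appendix, and compares the bound with a Monte Carlo estimate of the normalized error. The function names and sample size are assumptions for illustration.

```python
import math
import numpy as np

def gaussian_tail(kappa):
    """tau_P(kappa) = E[(|Y| - kappa)_+] for Y ~ N(0, 1)."""
    pdf = math.exp(-0.5 * kappa**2) / math.sqrt(2.0 * math.pi)
    upper = 0.5 * math.erfc(kappa / math.sqrt(2.0))    # P(Y > kappa)
    return 2.0 * (pdf - kappa * upper)

def bound(kappa, n_bits=4):
    """Right-hand side of Eq. (6); with c = M s we have lambda = kappa / M."""
    m = 2 ** (n_bits - 1) - 1
    return kappa / (2.0 * m) + gaussian_tail(kappa)

def empirical_error(kappa, sigma=1.0, n_bits=4, n=200_000, seed=0):
    """Monte Carlo estimate of E|x - Q_{s,c}(x)| / sigma for Gaussian x."""
    m = 2 ** (n_bits - 1) - 1
    c = kappa * sigma                                   # clipping scale, so b_n = 1 / kappa
    s = c / m
    x = np.random.default_rng(seed).normal(scale=sigma, size=n)
    x_hat = np.clip(s * np.round(x / s), -c, c)
    return np.abs(x - x_hat).mean() / sigma

for kappa in (2.0, 3.0, 5.0, 8.0):
    print(f"kappa={kappa}: empirical={empirical_error(kappa):.4f}, bound={bound(kappa):.4f}")
```

For moderate $\kappa$ the tail term is already negligible, so the bound is dominated by $\lambda/2=\kappa/(2M)$ and shrinks as the clipping scale tightens relative to $\sigma$, that is, as $b_n$ grows.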

## 4 Method

InfoQuant is a train-free PTQ framework that reshapes activations into distributions better matched to low-bit quantizers. It consists of three components: Peak Suppression Orthogonal Transformation (PSOT) learns an orthogonal activation transformation, adaptive outlier-token selection (ASOT) emphasizes informative calibration tokens, and learnable activation clipping (LAC) refines the final quantization interval.

### 4.1 Peak Suppression Orthogonal Transformation

PSOT learns a quantizer-facing orthogonal transformation by directly penalizing peak-dominated activation tokens. For each target activation stream, we optimize a block-diagonal orthogonal rotation on calibration activations and initialize it from a Hadamard transform, preserving the efficient rotation-based deployment path used by existing PTQ systems. Let $\mathbf{x}\in\mathbb{R}^{d}$ denote one activation token, and let $\mathbf{R}\in\mathbb{R}^{d\times d}$ be a learnable orthogonal matrix with $\mathbf{R}^{\top}\mathbf{R}=\mathbf{I}$. We remove the coordinate-wise mean after rotation using the centering projector $\mathbf{P}_{\perp}=\mathbf{I}-\frac{1}{d}\mathbf{1}\mathbf{1}^{\top}$, and define $\boldsymbol{\pi}_{T}(\mathbf{y})=\operatorname{softmax}(|\mathbf{y}|/T)$:

$$\begin{aligned}\mathbf{y}(\mathbf{x};\mathbf{R})&=\mathbf{x}\mathbf{R}\mathbf{P}_{\perp},\\ \ell_{\mathrm{ps}}(\mathbf{x};\mathbf{R})&=\left\|\boldsymbol{\pi}_{T}(\mathbf{y})\odot\mathbf{y}\right\|_{2},\end{aligned} \tag{7}$$

where $T$ is the temperature and $\odot$ denotes element-wise multiplication. The softmax weights concentrate the objective on high-magnitude coordinates, so minimizing $\ell_{\mathrm{ps}}$ suppresses the coordinates that dominate the quantization range.

This peak-suppression objective promotes the two distributional properties identified by the KL analysis. Since $\mathbf{R}$ is orthogonal, the rotation preserves token energy before centering. Reducing the largest centered coordinate therefore pushes the remaining energy to spread across more dimensions, which lowers the clipping scale and increases the max-normalized dispersion $b_n$. As illustrated in Figure [1](https://arxiv.org/html/2605.26175#S1.F1), the learned rotation yields a narrower range and a more dispersed normalized distribution than both the original activations and a fixed Hadamard rotation. (Appendix [A](https://arxiv.org/html/2605.26175#A1) gives the corresponding formal analysis.)
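The peak-suppression loss of Eq. (7) can be evaluated for a single token as in the sketch below. It only computes the loss for two fixed rotations (identity and a normalized Hadamard matrix) and does not perform the orthogonal optimization that PSOT uses; the function names, the synthetic spiky token, and the temperature value are assumptions for illustration.

```python
import numpy as np

def hadamard(d):
    """Normalized Sylvester Hadamard matrix; d must be a power of two."""
    h = np.array([[1.0]])
    while h.shape[0] < d:
        h = np.kron(h, np.array([[1.0, 1.0], [1.0, -1.0]]))
    return h / np.sqrt(d)

def peak_suppression_loss(x, rot, temperature=1.0):
    """Eq. (7): y = x R P_perp, loss = || softmax(|y| / T) * y ||_2."""
    d = x.shape[-1]
    centering = np.eye(d) - np.ones((d, d)) / d     # P_perp removes the coordinate mean
    y = x @ rot @ centering
    logits = np.abs(y) / temperature
    weights = np.exp(logits - logits.max())         # numerically stable softmax
    weights /= weights.sum()
    return np.linalg.norm(weights * y)

# Hypothetical token: small Gaussian noise plus one dominant coordinate.
rng = np.random.default_rng(0)
token = 0.1 * rng.normal(size=64)
token[3] = 8.0

loss_identity = peak_suppression_loss(token, np.eye(64))
loss_hadamard = peak_suppression_loss(token, hadamard(64))
print(f"identity: {loss_identity:.3f}, hadamard: {loss_hadamard:.3f}")
```

Because the softmax weights concentrate on the largest coordinates, the loss is close to the peak magnitude for the unrotated token and much smaller once the rotation spreads the same energy across all 64 coordinates.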

### 4.2 Adaptive Outlier-Token Selection

PSOT obtains the strongest learning signal from tokens with quantization-sensitive outlier structure. Uniformly optimizing all calibration tokens can dilute this signal because many tokens have weak or noisy peaks. We therefore use ASOT to select reliable outlier tokens and reweight them during rotation optimization.

Let $\mathbf{X}^{(r)}\in\mathbb{R}^{n_r\times d}$ be the activation matrix of the $r$-th calibration sample, where $\mathbf{x}^{(r)}_{i}$ is its $i$-th token. For a threshold coefficient $k$, we define the selected token-index set as

$$\begin{aligned}o_{i}^{(r)}&=\left\|\frac{\mathbf{x}^{(r)}_{i}-\mu\mathbf{1}}{\sigma}\right\|_{\infty},\\ \mathcal{T}^{(r)}(k)&=\left\{i\mid o_{i}^{(r)}>k\right\},\end{aligned} \tag{8}$$

where $\mu$ and $\sigma$ are estimated from the corresponding calibration activations. This criterion keeps tokens that contain at least one statistically extreme coordinate.
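A minimal sketch of the selection rule in Eq. (8) follows. It assumes that $\mu$ and $\sigma$ are scalars estimated over the whole activation matrix of one calibration sample, which is one reading of the text and may differ from the authors' implementation; the function names and the synthetic data are also hypothetical.

```python
import numpy as np

def outlier_scores(x, mu, sigma):
    """o_i = || (x_i - mu * 1) / sigma ||_inf for each token (row) of x."""
    return np.abs((x - mu) / sigma).max(axis=1)

def select_tokens(x, mu, sigma, k):
    """Indices of tokens whose outlier score exceeds the threshold coefficient k."""
    return np.flatnonzero(outlier_scores(x, mu, sigma) > k)

# Hypothetical calibration sample: 32 tokens, 16 channels, two planted peaks.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(32, 16))
tokens[5, 2] = 50.0
tokens[20, 7] = -50.0

mu, sigma = tokens.mean(), tokens.std()     # assumed: scalar statistics
selected = select_tokens(tokens, mu, sigma, k=6.0)
print(selected)
```

Only the two tokens with a planted extreme coordinate pass the threshold here, since the remaining entries stay well within a few standard deviations.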

The threshold should select sparse outlier tokens without overfitting to fixed sequence positions. Fixed-position peaks can reflect prompt layout or calibration artifacts, whereas sample-dependent outliers provide a more useful signal for learning a rotation that generalizes across inputs. Following the observation that activation outliers are more closely tied to token identity than absolute sequence position (Liu et al., [2024](https://arxiv.org/html/2605.26175#bib.bib61); Chen et al., [2025](https://arxiv.org/html/2605.26175#bib.bib60)), we measure whether the selected positions vary across $m$ calibration samples:

$$\eta_{m}(k)=1-\frac{\frac{1}{m}\sum_{r=1}^{m}|\mathcal{T}^{(r)}(k)|}{\left|\bigcup_{r=1}^{m}\mathcal{T}^{(r)}(k)\right|}. \tag{9}$$

Here, a larger $\eta_{m}(k)$ indicates that the selected outlier positions are less tied to fixed sequence locations. We choose the smallest threshold at which the inconsistency curve stabilizes while the selected tokens remain sparse:

$$k^{\star}=\min\left\{k\in\mathcal{K}\,\middle|\,\begin{aligned}&|\nabla\eta_{m}(k)|<\delta,\\ &\frac{1}{m}\sum_{r=1}^{m}|\mathcal{T}^{(r)}(k)|<\tau|\mathcal{I}|\end{aligned}\right\}, \tag{10}$$

where $\mathcal{K}$ is the ordered threshold grid, $\mathcal{I}$ is the token-index universe, $\delta$ controls the stabilization tolerance, and $\tau$ limits the selected-token ratio. With the selected sets $\mathcal{T}^{(r)}(k^{\star})$, the final rotation objective is

$$\begin{aligned}\min_{\mathbf{R}^{\top}\mathbf{R}=\mathbf{I}}\ &\sum_{r=1}^{m}\sum_{i\in\mathcal{I}}w_{i}^{(r)}\,\ell_{\mathrm{ps}}(\mathbf{x}^{(r)}_{i};\mathbf{R}),\\ w_{i}^{(r)}&=\begin{cases}\gamma,&i\in\mathcal{T}^{(r)}(k^{\star}),\\ 1,&\text{otherwise}.\end{cases}\end{aligned} \tag{11}$$

Here, $\gamma>1$ emphasizes outlier tokens while retaining normal tokens as regularizing calibration samples. All ASOT hyperparameters are selected by grid search on the calibration set, and the effect of $\gamma$ is studied in Table [5](https://arxiv.org/html/2605.26175#A3.T5).
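The threshold selection and reweighting in Eqs. (9) to (11) can be sketched as follows. Two choices here are assumptions rather than details from the paper: $\nabla\eta_m(k)$ is approximated by finite differences over the threshold grid via `np.gradient`, and $\eta_m$ is defined as 0 when no token is selected. All function names and the toy index sets are hypothetical.

```python
import numpy as np

def inconsistency(selected_sets):
    """eta_m(k) of Eq. (9): 1 - (mean set size) / (size of the union)."""
    union = set().union(*selected_sets)
    if not union:
        return 0.0          # assumption: eta is taken as 0 when nothing is selected
    mean_size = sum(len(s) for s in selected_sets) / len(selected_sets)
    return 1.0 - mean_size / len(union)

def choose_threshold(grid, sets_per_k, n_tokens, delta, tau):
    """Smallest k on the ordered grid meeting both conditions of Eq. (10), else None."""
    eta = np.array([inconsistency(sets_per_k[k]) for k in grid])
    slope = np.abs(np.gradient(eta, np.asarray(grid, dtype=float)))
    for k, g in zip(grid, slope):
        mean_size = np.mean([len(s) for s in sets_per_k[k]])
        if g < delta and mean_size < tau * n_tokens:
            return k
    return None

def token_weights(n_tokens, selected, gamma):
    """w_i of Eq. (11): gamma for selected tokens, 1 otherwise."""
    w = np.ones(n_tokens)
    w[np.array(sorted(selected), dtype=int)] = gamma
    return w

# Toy example with m = 2 calibration samples of 10 tokens each.
grid = [2, 4, 6, 8]
sets_per_k = {
    2: [{0, 1, 2, 3, 4, 5}, {0, 1, 2, 3, 4, 5}],   # dense, identical positions
    4: [{0, 1, 2}, {0, 1, 3}],
    6: [{1}, {3}],                                   # sparse, sample-dependent positions
    8: [{1}, {3}],
}
k_star = choose_threshold(grid, sets_per_k, n_tokens=10, delta=0.05, tau=0.3)
print(k_star, token_weights(10, sets_per_k[k_star][0], gamma=2.0))
```

In this toy grid the inconsistency rises from 0 to 0.5 and then flattens, so the rule picks the first threshold where the slope falls below `delta` and the selection is sparse enough.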

![Refer to caption](https://arxiv.org/html/2605.26175v1/x2.png)

Figure 3: Adaptive threshold selection for ASOT. The positional inconsistency $\eta_{10}(k)$ increases with $k$ and then plateaus, while the average number of selected outlier tokens $p(k)=\frac{1}{m}\sum_{r=1}^{m}|\mathcal{T}^{(r)}(k)|$ decreases. We choose $k^{\star}$ when $\eta_{10}(k)$ stabilizes and the selected token ratio satisfies the sparsity constraint in Eq. ([10](https://arxiv.org/html/2605.26175#S4.E10)).

### 4.3 Learnable Activation Clipping

After PSOT reshapes the activation distribution, we apply learnable activation clipping as a final calibration step. Following prior work (Zhao et al., [2024](https://arxiv.org/html/2605.26175#bib.bib43); Sun et al., [2025](https://arxiv.org/html/2605.26175#bib.bib22)), two bounded parameters $\alpha$ and $\beta$ refine the activation step size:

$$s(\alpha,\beta)=\frac{\alpha X_{\max}-\beta X_{\min}}{2^{N}-1}, \tag{12}$$

where $N$ is the bit-width, and $X_{\max}$ and $X_{\min}$ are the observed activation bounds. The clipping parameters are optimized by matching the quantized block output to the full-precision block output:

$$\min_{\alpha,\beta}\left\|\mathcal{F}_{l}(\mathbf{X}_{l})-\widehat{\mathcal{F}}_{l}(\mathbf{X}_{l};\alpha,\beta)\right\|_{2}^{2}, \tag{13}$$

where $\mathcal{F}_{l}$ denotes the full-precision Transformer block and $\widehat{\mathcal{F}}_{l}$ denotes the same block evaluated with activation quantization under Eq. ([12](https://arxiv.org/html/2605.26175#S4.E12)).
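The sketch below illustrates the objective of Eqs. (12) and (13) under strong simplifications. A single linear map stands in for the Transformer block, a small grid search replaces whatever optimizer the authors use to learn $\alpha$ and $\beta$, and the quantizer uses an offset grid starting at $\beta X_{\min}$ instead of a rounded zero-point. All names, the data, and the grid values are assumptions for illustration only.

```python
import numpy as np

def fake_quant(x, x_min, x_max, alpha, beta, n_bits=4):
    """Quantize-dequantize x on the clipped range [beta * x_min, alpha * x_max].

    Assumes x_min < 0 < x_max so that scaling both bounds shrinks the range.
    """
    lo, hi = beta * x_min, alpha * x_max
    s = (hi - lo) / (2**n_bits - 1)                  # step size of Eq. (12)
    q = np.clip(np.round((x - lo) / s), 0, 2**n_bits - 1)
    return q * s + lo

def search_clipping(x, weight, grid, n_bits=4):
    """Grid search for (alpha, beta) minimizing the output error of Eq. (13)."""
    x_min, x_max = x.min(), x.max()
    target = x @ weight                              # stand-in for the full-precision block
    best = None
    for alpha in grid:
        for beta in grid:
            out = fake_quant(x, x_min, x_max, alpha, beta, n_bits) @ weight
            err = float(np.sum((target - out) ** 2))
            if best is None or err < best[0]:
                best = (err, alpha, beta)
    return best

# Hypothetical activations with one large positive value, and a random linear layer.
rng = np.random.default_rng(0)
acts = rng.normal(size=(256, 16))
acts[0, 0] = 30.0
w = rng.normal(size=(16, 8)) / 4.0
grid = [0.3, 0.5, 0.7, 0.9, 1.0]

best_err, best_alpha, best_beta = search_clipping(acts, w, grid)
print(best_err, best_alpha, best_beta)
```

Because the grid contains $\alpha=\beta=1$, the selected pair can never be worse on this objective than plain min-max calibration; whether tighter clipping helps depends on how much mass sits in the tails.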

## 5 Experiments

| #Bits (W-A-KV) | Method | LLaMA-3 8B 0-shot⁹ Avg. (↑) | LLaMA-3 8B Wiki (↓) | LLaMA-3 70B 0-shot⁹ Avg. (↑) | LLaMA-3 70B Wiki (↓) | LLaMA-2 7B 0-shot⁹ Avg. (↑) | LLaMA-2 7B Wiki (↓) | LLaMA-2 13B 0-shot⁹ Avg. (↑) | LLaMA-2 13B Wiki (↓) | LLaMA-2 70B 0-shot⁹ Avg. (↑) | LLaMA-2 70B Wiki (↓) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 16-16-16 | FloatingPoint | 68.09 | 6.14 | 73.81 | 2.86 | 65.21 | 5.47 | 67.61 | 4.88 | 71.59 | 3.32 |
| 4-16-16 | GPTQ | 61.03 | 7.43 | 31.45 | 9e3 | 60.86 | 9.84 | 64.71 | 5.79 | 70.96 | 3.94 |
| 4-16-16 | AWQ | 67.03 | 7.36 | 68.92 | 5.92 | 63.89 | 5.83 | 66.25 | 5.07 | 70.88 | 4.03 |
| 4-16-16 | QuaRot | 67.27 | 6.53 | 72.93 | 3.53 | 64.30 | 5.62 | 66.95 | 5.00 | 71.21 | 3.41 |
| 4-16-16 | SpinQuant | 66.54 | 6.49 | 72.90 | 3.49 | 63.59 | 5.58 | 67.14 | 5.00 | 71.12 | 3.43 |
| 4-16-16 | OSTQuant | 67.80 | 6.53 | 73.69 | 3.19 | 64.37 | 5.64 | 67.31 | 4.94 | 71.48 | 3.41 |
| 4-16-16 | InfoQuant | 67.36 | 6.48 | 73.25 | 3.50 | 64.34 | 5.60 | 67.27 | 4.99 | 71.25 | 3.40 |
| 4-4-16 | QuaRot | 61.69 | 8.02 | 65.56 | 6.35 | 61.87 | 6.05 | 65.13 | 5.35 | 69.96 | 3.78 |
| 4-4-16 | SpinQuant | 64.11 | 7.28 | 66.99 | 6.10 | 57.37 | 6.78 | 63.23 | 5.24 | 70.58 | 3.68 |
| 4-4-16 | OSTQuant | 65.14 | 7.24 | 72.21 | 3.97 | 63.90 | 5.60 | 66.24 | 5.14 | 70.92 | 3.57 |
| 4-4-16 | InfoQuant | 65.74 | 7.07 | 70.71 | 5.24 | 62.84 | 5.86 | 66.71 | 5.15 | 70.82 | 3.62 |
| 4-4-4 | QuaRot | 61.38 | 8.18 | 65.33 | 6.60 | 61.48 | 6.11 | 65.16 | 5.39 | 70.30 | 3.80 |
| 4-4-4 | SpinQuant | 64.10 | 7.35 | 66.31 | 6.24 | 62.01 | 5.96 | 64.13 | 5.74 | 70.57 | 3.61 |
| 4-4-4 | Kurtail | - | 7.20 | - | 7.20 | - | 5.90 | - | 5.20 | - | - |
| 4-4-4 | OSTQuant | 65.37 | 7.29 | 71.69 | 4.01 | 63.18 | 5.91 | 65.41 | 5.25 | 70.84 | 3.59 |
| 4-4-4 | OSTQuant† | 65.13 | 6.80 | - | - | 62.45 | 5.38 | - | - | - | - |
| 4-4-4 | BASE-Q | 65.39 | 7.17 | OOM | OOM | 62.50 | 5.85 | 65.48 | 5.21 | OOM | OOM |
| 4-4-4 | InfoQuant\* | 64.79 | 7.21 | 70.01 | 5.57 | 62.62 | 5.93 | 66.12 | 5.22 | 70.10 | 3.84 |
| 4-4-4 | InfoQuant | 65.57 | 7.16 | 70.21 | 5.39 | 63.16 | 5.89 | 66.33 | 5.18 | 70.35 | 3.64 |

Table 1: Comparison of perplexity on WikiText2 and averaged accuracy across nine diverse zero-shot tasks. Results for GPTQ, AWQ, QuaRot, SpinQuant, and OSTQuant are reported from the OSTQuant paper, while BASE-Q results are based on the official code ('OOM' denotes out of memory on our device). OSTQuant entries (shaded gray in the original table) use distillation and are included as a strong supervised reference. InfoQuant\* denotes the application of a complete global orthogonal rotation. OSTQuant† refers to OSTQuant without distillation.

#### Models and Datasets\.

We evaluate whether activation-distribution optimization transfers across model families, scales, and evaluation metrics. The main comparison covers LLaMA-2 (7B–70B) (Touvron et al., [2023](https://arxiv.org/html/2605.26175#bib.bib13)) and LLaMA-3 (8B–70B) (Grattafiori et al., [2024](https://arxiv.org/html/2605.26175#bib.bib14)); additional Qwen2.5 (14B/32B) (Qwen et al., [2025](https://arxiv.org/html/2605.26175#bib.bib62)) results are reported in Appendix [D](https://arxiv.org/html/2605.26175#A4.SS0.SSS0.Px1). We report WikiText2 perplexity (PPL) (Merity et al., [2016](https://arxiv.org/html/2605.26175#bib.bib30)) as a sensitive language-modeling metric and use nine zero-shot tasks from lm-evaluation-harness (version 0.4.7) (Gao et al., [2024](https://arxiv.org/html/2605.26175#bib.bib31)) to check whether lower quantization error translates to task-level behavior. The zero-shot suite includes BoolQ (Clark et al., [2019](https://arxiv.org/html/2605.26175#bib.bib32)), HellaSwag (Zellers et al., [2019](https://arxiv.org/html/2605.26175#bib.bib34)), LAMBADA (OpenAI) (Radford et al., [2019](https://arxiv.org/html/2605.26175#bib.bib40)), OpenBookQA (OBQA) (Mihaylov et al., [2018](https://arxiv.org/html/2605.26175#bib.bib37)), PIQA (Bisk et al., [2020](https://arxiv.org/html/2605.26175#bib.bib38)), SIQA (Sap et al., [2019](https://arxiv.org/html/2605.26175#bib.bib33)), WinoGrande (Sakaguchi et al., [2021](https://arxiv.org/html/2605.26175#bib.bib35)), ARC-Easy, and ARC-Challenge (Boratko et al., [2018](https://arxiv.org/html/2605.26175#bib.bib39)).

#### Baselines\.

We compare with representative quantization methods that stress different parts of the design space: weight reconstruction methods GPTQ (Frantar et al., [2023](https://arxiv.org/html/2605.26175#bib.bib29)) and AWQ (Lin et al., [2024b](https://arxiv.org/html/2605.26175#bib.bib49)), rotation-based methods QuaRot (Ashkboos et al., [2024b](https://arxiv.org/html/2605.26175#bib.bib17)) and SpinQuant (Liu et al., [2025b](https://arxiv.org/html/2605.26175#bib.bib23)), and recent low-bit LLM quantizers Kurtail (Akhondzadeh et al., [2025](https://arxiv.org/html/2605.26175#bib.bib44)), BASE-Q (He et al., [2025](https://arxiv.org/html/2605.26175#bib.bib63)), and OSTQuant (Hu et al., [2025](https://arxiv.org/html/2605.26175#bib.bib24)). This comparison is useful because InfoQuant changes the activation distribution before quantization, whereas several baselines mainly reconstruct weights, use fixed rotations, or rely on stronger supervision; in particular, the distilled OSTQuant results are included as a strong supervised reference.

#### Implementation Details\.

Calibration uses 128 samples from WikiText2, each with a sequence length of 2048. Activations are quantized using per-token asymmetric quantization, while weights are quantized using asymmetric per-channel quantization with GPTQ (Frantar et al., [2023](https://arxiv.org/html/2605.26175#bib.bib29)), applying a group size of 128 for key-value matrices. During the PSOT phase, we optimize block-diagonal orthogonal matrices initialized with Hadamard matrices via Cayley SGD (Li et al., [2020](https://arxiv.org/html/2605.26175#bib.bib42)). The ASOT hyperparameters are selected by grid search on the calibration set, and the final search space and chosen values are reported in Appendix [B](https://arxiv.org/html/2605.26175#A2) and Appendix [C](https://arxiv.org/html/2605.26175#A3). We report two variants to separate accuracy and deployment considerations. InfoQuant uses block-diagonal rotations to reduce transformation overhead, while InfoQuant\* uses a full global orthogonal rotation similar to SpinQuant. More implementation details are provided in Appendix [B](https://arxiv.org/html/2605.26175#A2).
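Per-token asymmetric quantization, as used for activations above, can be sketched as follows. This is a minimal NumPy illustration rather than the authors' implementation: the function name, the min/max range estimate, and the toy data are assumptions made for the example.

```python
import numpy as np

def quantize_per_token_asymmetric(x: np.ndarray, bits: int = 4) -> np.ndarray:
    """Fake-quantize each row (token) with its own scale and zero point."""
    levels = 2 ** bits - 1                      # 15 steps between 16 levels
    x_min = x.min(axis=-1, keepdims=True)
    x_max = x.max(axis=-1, keepdims=True)
    # Guard against constant rows, where max == min.
    scale = np.maximum(x_max - x_min, 1e-8) / levels
    zero_point = np.round(-x_min / scale)
    q = np.clip(np.round(x / scale) + zero_point, 0, levels)
    return (q - zero_point) * scale             # dequantized values

rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 64))                 # 4 tokens, 64 channels (toy data)
acts[0, 3] = 25.0                               # inject one outlier into token 0
deq = quantize_per_token_asymmetric(acts)
err = np.abs(acts - deq).mean(axis=-1)          # mean absolute error per token
```

In this toy setup the token that carries the outlier has a much wider range, so its step size and its mean error are larger than those of the other tokens. That is the failure mode the rotation is meant to reduce.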

### 5.1 Main Results

#### Quantization Performance\.

Table [1](https://arxiv.org/html/2605.26175#S5.T1) shows that the value of activation-distribution optimization becomes visible when activations are quantized. In the weight-only 4-16-16 setting, InfoQuant preserves 98.7%–99.5% of the floating-point zero-shot accuracy across the evaluated LLaMA models, but the gap among strong rotation-based methods is relatively small. This pattern is informative: when activations remain in high precision, reshaping them is not the dominant bottleneck. The setting mainly verifies that the learned rotation does not damage the weight-only quantization path.

Once activations are quantized, the table reveals different behavior. Under 4-4-16, InfoQuant improves the average zero-shot accuracy over SpinQuant by 2.91 points across the five LLaMA settings, with larger gains on LLaMA-2 7B (+5.47) and LLaMA-3 70B (+3.72). Under W4A4KV4, where weights, activations, and KV cache are all quantized to 4 bits, InfoQuant preserves 96.9% of floating-point accuracy on average and improves over SpinQuant by 1.70 points. These results support the main design intuition of the paper: fixed or generic rotations are often sufficient to avoid catastrophic outliers, but low-bit activation quantization benefits from learning a distribution that uses the available quantization levels more evenly.
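The averages quoted above can be recomputed from the zero-shot columns of Table 1. The short script below is a sanity check only. The numbers are copied from the table, and the retention figure is computed as a ratio of means, which is an assumption about how the paper aggregates it.

```python
# Zero-shot averages from Table 1.
# Model order: LLaMA-3 8B, LLaMA-3 70B, LLaMA-2 7B, LLaMA-2 13B, LLaMA-2 70B.
fp           = [68.09, 73.81, 65.21, 67.61, 71.59]
spin_w4a4    = [64.11, 66.99, 57.37, 63.23, 70.58]
info_w4a4    = [65.74, 70.71, 62.84, 66.71, 70.82]
spin_w4a4kv4 = [64.10, 66.31, 62.01, 64.13, 70.57]
info_w4a4kv4 = [65.57, 70.21, 63.16, 66.33, 70.35]

def mean(values):
    return sum(values) / len(values)

# Average gain of InfoQuant over SpinQuant in each activation-quantized setting.
gain_w4a4 = mean(info_w4a4) - mean(spin_w4a4)
gain_w4a4kv4 = mean(info_w4a4kv4) - mean(spin_w4a4kv4)

# Share of floating-point accuracy retained under W4A4KV4 (ratio of means).
retention_w4a4kv4 = mean(info_w4a4kv4) / mean(fp)

# Per-model gains under 4-4-16, in the model order listed above.
per_model_gain_w4a4 = [round(a - b, 2) for a, b in zip(info_w4a4, spin_w4a4)]
```

Rounded to the precision used in the text, these reproduce the 2.91-point and 1.70-point gains and the 96.9% retention figure.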

The comparison with OSTQuant also clarifies the boundary of the method. InfoQuant outperforms the distillation-based OSTQuant baseline on LLaMA-3 8B and LLaMA-2 13B, while OSTQuant remains competitive on several 70B entries. This mixed pattern is informative: it suggests that distribution shaping can recover a large part of the activation-quantization loss without full-precision supervision, but supervision and scale-specific calibration may still help in some large-model regimes. Full per-task results are reported in Appendix [D](https://arxiv.org/html/2605.26175#A4.SS0.SSS0.Px2).

#### Speedup and Memory Savings\.

| Method | Prefill Speedup (2048) | Prefill Speedup (4096) | Prefill Speedup (8192) | Memory Saving (2048) | Memory Saving (4096) | Memory Saving (8192) |
|---|---|---|---|---|---|---|
| 2-70B-InfoQuant | 2.46 | 2.11 | 1.97 | 2.91 | 2.59 | 2.22 |
| 2-70B-InfoQuant\* | 2.61 | 2.26 | 2.10 | 3.25 | 2.84 | 2.36 |
| 3-8B-InfoQuant | 1.57 | 1.43 | 1.28 | 2.46 | 2.12 | 1.86 |
| 3-8B-InfoQuant\* | 1.74 | 1.55 | 1.42 | 2.78 | 2.37 | 2.00 |

Table 2: Speedup and memory savings factors for LLaMA models of different sizes and sequence lengths, comparing 4-bit quantized implementations to FP16.

We evaluate inference efficiency using the W4A4 kernel from Ashkboos et al. ([2024b](https://arxiv.org/html/2605.26175#bib.bib17)). Table [2](https://arxiv.org/html/2605.26175#S5.T2) reports prefill speedup and decoding memory savings relative to FP16 on a single Transformer block, using batch size 4 on an NVIDIA RTX 4090. On LLaMA-2 70B with sequence length 2048, InfoQuant\* achieves a 2.61× prefill speedup and 3.25× memory saving, while InfoQuant achieves a 2.46× prefill speedup and 2.91× memory saving. The difference between InfoQuant and InfoQuant\* exposes the main deployment trade-off: full rotations can be slightly faster in this implementation path, whereas block-diagonal rotations provide a more flexible per-layer optimization structure.

We further compare quantization-time memory overhead and inference-time transformation FLOPs on LLaMA-3 70B. As shown in Figure [4](https://arxiv.org/html/2605.26175#S5.F4), InfoQuant\* requires less than 24GB of GPU memory during quantization and introduces low additional transformation cost. The broader insight is that activation shaping should not be evaluated only by accuracy: a transformation that must be repeatedly applied online can erase part of the benefit of low-bit inference. FlatQuant, for example, performs multiple dynamic activation transformations during inference, which increases transform FLOPs and depends on specialized kernels. InfoQuant instead keeps the rotation-based deployment path lightweight, making the method more compatible with existing low-bit kernels and consumer-grade quantization hardware.

![Refer to caption](https://arxiv.org/html/2605.26175v1/x3.png)

Figure 4: Comparison of memory overhead in quantization and transform FLOPs during inference across different methods. During inference, FlatQuant performs transformations on three activation values online and dynamically, while other methods apply the fast Hadamard transform only to the KV-cache efficiently.

### 5.2 Ablation Study

#### Module\-wise Impact\.

Table [3](https://arxiv.org/html/2605.26175#S5.T3) isolates how each component contributes to InfoQuant under the W4A4KV4 setting. Replacing a fixed Hadamard rotation with the learned PSOT rotation reduces WikiText2 perplexity from 10.90 to 8.84 on LLaMA-3 8B and from 8.99 to 7.03 on LLaMA-2 7B. This is the largest single change in the ablation, which suggests that the core gain comes from learning where activation energy should be redistributed, not from simply adding more calibration stages.

The later components produce smaller but more diagnostic changes. Adding GPTQ further reduces perplexity to 7.53 and 6.01, indicating that activation-oriented rotation remains compatible with standard weight reconstruction. ASOT gives a modest improvement, which is consistent with its role as a signal reweighting mechanism rather than a separate quantizer. LAC then gives the best perplexity on both models (7.16 and 5.89). The ordering of these gains is important: first reshape the distribution, then reconstruct weights, then refine which calibration tokens and clipping ranges deserve attention.

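| Hadamard | PSOT | GPTQ | ASOT | LAC | WikiText2 (3-8B) | WikiText2 (2-7B) |
|---|---|---|---|---|---|---|
| ✓ | | | | | 10.9 | 8.99 |
| | ✓ | | | | 8.84 | 7.03 |
| | ✓ | ✓ | | | 7.53 | 6.01 |
| | ✓ | ✓ | ✓ | | 7.42 | 5.97 |
| | ✓ | ✓ | ✓ | ✓ | 7.16 | 5.89 |

Table 3: Component ablation of InfoQuant under the W4A4KV4 configuration. Lower WikiText2 perplexity indicates better quantization quality.

#### More Ablations.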

Additional ablations in Appendix [C](https://arxiv.org/html/2605.26175#A3) study the weighting factor $\gamma$, the temperature $T$, initialization robustness, clipping-ratio sensitivity, and the block size of block-diagonal orthogonal matrices. These experiments provide practical guidance for using InfoQuant: objective reweighting and temperature control the stability of peak suppression, while block size and clipping range determine how much distribution-shaping flexibility can be traded for lower inference cost and easier calibration.
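The block-size trade-off can be made concrete with a fixed Hadamard rotation, which the paper uses as the initialization of its block-diagonal orthogonal matrices. The sketch below is illustrative only: it applies the same unlearned Hadamard block to every channel group, and the dimensions, block size, and outlier scale are assumptions chosen for the example.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Normalized Sylvester Hadamard matrix; n must be a power of two."""
    assert n > 0 and n & (n - 1) == 0
    h = np.ones((1, 1))
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def rotate_block_diagonal(x: np.ndarray, block: np.ndarray) -> np.ndarray:
    """Apply one orthogonal block to each contiguous group of channels."""
    tokens, dim = x.shape
    b = block.shape[0]
    assert dim % b == 0
    return (x.reshape(tokens, dim // b, b) @ block).reshape(tokens, dim)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 256))
x[:, 5] *= 50.0                                   # one outlier channel (toy data)

full = x @ hadamard(256)                          # global rotation
blocked = rotate_block_diagonal(x, hadamard(32))  # block-diagonal rotation

# Dense multiply-adds per token: dim * dim versus dim * block_size.
cost_full, cost_blocked = 256 * 256, 256 * 32
```

Both rotations preserve each token's norm. The block-diagonal version spreads the outlier over only 32 channels instead of 256, so it reduces the peak less, but it needs one eighth of the dense multiply-adds in this configuration.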

## 6 Conclusion

This work studies low-bit LLM quantization from the perspective of quantizer-facing activation distribution design. By analyzing the distributional error introduced by quantization, we show that quantization-friendly activations should jointly have a smaller numerical range and sufficient dispersion within that range. Guided by this principle, InfoQuant reshapes activations with Peak Suppression Orthogonal Transformation, emphasizes informative calibration tokens with ASOT, and refines the quantization interval with learnable activation clipping. Our results suggest that activation quantization should be evaluated by both outlier reduction and effective use of available low-bit levels. A natural next step is to extend this analysis beyond bell-shaped activation assumptions and develop transformations for broader distributional regimes.

## Limitations

The findings of this paper should be interpreted within the scope of the evaluated settings. We validate InfoQuant on a limited set of model families, tasks, and quantization configurations, and do not claim that the same gains will automatically transfer to substantially different architectures, activation regimes, or deployment scenarios without additional study. Our method is motivated by activation patterns that have been widely observed in prior studies and are also present in the models evaluated in this paper. While these observations are sufficient to support the improvements reported here, broader validation would still be useful to determine how consistently the same behavior holds across other model families and quantization settings. From a practical perspective, the method still relies on calibration data and implementation choices such as transformation and clipping settings. Although the approach is intended for practical post-training quantization, its effectiveness may therefore vary when calibration conditions differ substantially from those used in evaluation.

## References

- KURTAIL: KURTOSIS-BASED LLM QUANTIZATION. In Sparsity in LLMs (SLLM): Deep Dive into Mixture of Experts, Quantization, Hardware, and Inference, External Links: [Link](https://openreview.net/forum?id=GYVIWuazp5) Cited by: [§2](https://arxiv.org/html/2605.26175#S2.SS0.SSS0.Px2.p1.1), [§5](https://arxiv.org/html/2605.26175#S5.SS0.SSS0.Px2.p1.1).
- S. Ashkboos, M. L. Croci, M. G. do Nascimento, T. Hoefler, and J. Hensman (2024a) SliceGPT: compress large language models by deleting rows and columns. In The Twelfth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=vXxardq6db) Cited by: [§2](https://arxiv.org/html/2605.26175#S2.SS0.SSS0.Px2.p1.1).
- S. Ashkboos, A. Mohtashami, M. Croci, B. Li, P. Cameron, M. Jaggi, D. Alistarh, T. Hoefler, and J. Hensman (2024b) Quarot: outlier-free 4-bit inference in rotated llms. Advances in Neural Information Processing Systems 37, pp. 100213–100240. Cited by: [Appendix C](https://arxiv.org/html/2605.26175#A3.SS0.SSS0.Px4.p1.1), [§1](https://arxiv.org/html/2605.26175#S1.p2.1), [§2](https://arxiv.org/html/2605.26175#S2.SS0.SSS0.Px2.p1.1), [§5](https://arxiv.org/html/2605.26175#S5.SS0.SSS0.Px2.p1.1), [§5.1](https://arxiv.org/html/2605.26175#S5.SS1.SSS0.Px2.p1.4).
- Y. Bisk, R. Zellers, J. Gao, Y. Choi, et al. (2020) Piqa: reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34, pp. 7432–7439. Cited by: [Appendix D](https://arxiv.org/html/2605.26175#A4.SS0.SSS0.Px2.p1.1), [§5](https://arxiv.org/html/2605.26175#S5.SS0.SSS0.Px1.p1.1).
- M. Boratko, H. Padigela, D. Mikkilineni, P. Yuvraj, R. Das, A. McCallum, M. Chang, A. Fokoue-Nkoutche, P. Kapanipathi, N. Mattei, et al. (2018) A systematic classification of knowledge, reasoning, and context within the arc dataset. arXiv preprint arXiv:1806.00358. Cited by: [Appendix D](https://arxiv.org/html/2605.26175#A4.SS0.SSS0.Px2.p1.1), [§5](https://arxiv.org/html/2605.26175#S5.SS0.SSS0.Px1.p1.1).
- J. Chee, Y. Cai, V. Kuleshov, and C. M. De Sa (2023) Quip: 2-bit quantization of large language models with guarantees. Advances in Neural Information Processing Systems 36, pp. 4396–4429. Cited by: [§2](https://arxiv.org/html/2605.26175#S2.SS0.SSS0.Px1.p1.1).
- M. Chen, Y. Liu, J. Wang, Y. Bin, W. Shao, and P. Luo (2025) PrefixQuant: eliminating outliers by prefixed tokens for large language models quantization. External Links: 2410.05265, [Link](https://arxiv.org/abs/2410.05265) Cited by: [§4.2](https://arxiv.org/html/2605.26175#S4.SS2.p3.1).
- C. Clark, K. Lee, M. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova (2019) BoolQ: exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044. Cited by: [Appendix D](https://arxiv.org/html/2605.26175#A4.SS0.SSS0.Px2.p1.1), [§5](https://arxiv.org/html/2605.26175#S5.SS0.SSS0.Px1.p1.1).
- E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh (2023) OPTQ: accurate post-training quantization for generative pre-trained transformers. In 11th International Conference on Learning Representations, Cited by: [§2](https://arxiv.org/html/2605.26175#S2.SS0.SSS0.Px1.p1.1), [§5](https://arxiv.org/html/2605.26175#S5.SS0.SSS0.Px2.p1.1), [§5](https://arxiv.org/html/2605.26175#S5.SS0.SSS0.Px3.p1.1).
- L. Gao, J. Tow, B. Abbasi, S. Biderman, S. Black, A. DiPofi, C. Foster, L. Golding, J. Hsu, A. Le Noac’h, H. Li, K. McDonell, N. Muennighoff, C. Ociepa, J. Phang, L. Reynolds, H. Schoelkopf, A. Skowron, L. Sutawika, E. Tang, A. Thite, B. Wang, K. Wang, and A. Zou (2024) A framework for few-shot language model evaluation. Zenodo. External Links: [Document](https://dx.doi.org/10.5281/zenodo.12608602), [Link](https://zenodo.org/records/12608602) Cited by: [Appendix D](https://arxiv.org/html/2605.26175#A4.SS0.SSS0.Px2.p1.1), [§5](https://arxiv.org/html/2605.26175#S5.SS0.SSS0.Px1.p1.1).
- A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, et al. (2024) The llama 3 herd of models. arXiv preprint arXiv:2407.21783. Cited by: [§5](https://arxiv.org/html/2605.26175#S5.SS0.SSS0.Px1.p1.1).
- L. He, S. Zheng, K. Sun, Y. Liu, Y. Zhao, C. Tan, H. Yang, Y. Du, and L. Du (2025) BASE-q: bias and asymmetric scaling enhanced rotational quantization for large language models. arXiv preprint arXiv:2506.15689. Cited by: [§2](https://arxiv.org/html/2605.26175#S2.SS0.SSS0.Px2.p1.1), [§5](https://arxiv.org/html/2605.26175#S5.SS0.SSS0.Px2.p1.1).
- X. Hu, Y. Cheng, D. Yang, Z. Chen, Z. Xu, JiangyongYu, XUCHEN, Z. Yuan, Z. jiang, and S. Zhou (2025) OSTQuant: refining large language model quantization with orthogonal and scaling transformations for better distribution fitting. In The Thirteenth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=rAcgDBdKnP) Cited by: [§2](https://arxiv.org/html/2605.26175#S2.SS0.SSS0.Px2.p1.1), [§5](https://arxiv.org/html/2605.26175#S5.SS0.SSS0.Px2.p1.1).
- C. Lee, J. Jin, T. Kim, H. Kim, and E. Park (2024) Owq: outlier-aware weight quantization for efficient fine-tuning and inference of large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 38, pp. 13355–13364. Cited by: [§2](https://arxiv.org/html/2605.26175#S2.SS0.SSS0.Px1.p1.1).
- J. Li, F. Li, and S. Todorovic (2020) Efficient riemannian optimization on the stiefel manifold via the cayley transform. In International Conference on Learning Representations, Cited by: [§5](https://arxiv.org/html/2605.26175#S5.SS0.SSS0.Px3.p1.1).
- H. Lin, H. Xu, Y. Wu, J. Cui, Y. Zhang, L. Mou, L. Song, Z. Sun, and Y. Wei (2024a) Duquant: distributing outliers via dual transformation makes stronger quantized llms. Advances in Neural Information Processing Systems 37, pp. 87766–87800. Cited by: [Appendix C](https://arxiv.org/html/2605.26175#A3.SS0.SSS0.Px3.p1.1).
- J. Lin, J. Tang, H. Tang, S. Yang, W. Chen, W. Wang, G. Xiao, X. Dang, C. Gan, and S. Han (2024b) Awq: activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of Machine Learning and Systems 6, pp. 87–100. Cited by: [§2](https://arxiv.org/html/2605.26175#S2.SS0.SSS0.Px1.p1.1), [§5](https://arxiv.org/html/2605.26175#S5.SS0.SSS0.Px2.p1.1).
- J. Liu, P. Ponnusamy, T. Cai, H. Guo, Y. Kim, and B. Athiwaratkun (2025a) Training-free activation sparsity in large language models. In The Thirteenth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=dGVZwyq5tV) Cited by: [§1](https://arxiv.org/html/2605.26175#S1.p3.1), [§3.2](https://arxiv.org/html/2605.26175#S3.SS2.p2.3).
- R. Liu, H. Bai, H. Lin, Y. Li, H. Gao, Z. Xu, L. Hou, J. Yao, and C. Yuan (2024) IntactKV: improving large language model quantization by keeping pivot tokens intact. In ACL (Findings), Cited by: [§4.2](https://arxiv.org/html/2605.26175#S4.SS2.p3.1).
- Z. Liu, C. Zhao, I. Fedorov, B. Soran, D. Choudhary, R. Krishnamoorthi, V. Chandra, Y. Tian, and T. Blankevoort (2025b) SpinQuant: LLM quantization with learned rotations. In The Thirteenth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=ogO6DGE6FZ) Cited by: [Appendix B](https://arxiv.org/html/2605.26175#A2.SS0.SSS0.Px2.p1.1), [§1](https://arxiv.org/html/2605.26175#S1.p2.1), [§2](https://arxiv.org/html/2605.26175#S2.SS0.SSS0.Px2.p1.1), [§5](https://arxiv.org/html/2605.26175#S5.SS0.SSS0.Px2.p1.1).
- Y. Ma, H. Li, X. Zheng, F. Ling, X. Xiao, R. Wang, S. Wen, F. Chao, and R. Ji (2024) AffineQuant: affine transformation quantization for large language models. In ICLR, External Links: [Link](https://openreview.net/forum?id=of2rhALq8l) Cited by: [§1](https://arxiv.org/html/2605.26175#S1.p2.1), [§2](https://arxiv.org/html/2605.26175#S2.SS0.SSS0.Px2.p1.1).
- S. Merity, C. Xiong, J. Bradbury, and R. Socher (2016) Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843. Cited by: [§5](https://arxiv.org/html/2605.26175#S5.SS0.SSS0.Px1.p1.1).
- T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal (2018) Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789. Cited by: [Appendix D](https://arxiv.org/html/2605.26175#A4.SS0.SSS0.Px2.p1.1), [§5](https://arxiv.org/html/2605.26175#S5.SS0.SSS0.Px1.p1.1).
- Qwen, A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Tang, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and Z. Qiu (2025) Qwen2.5 technical report. External Links: 2412.15115, [Link](https://arxiv.org/abs/2412.15115) Cited by: [§5](https://arxiv.org/html/2605.26175#S5.SS0.SSS0.Px1.p1.1).
- A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. (2019) Language models are unsupervised multitask learners. OpenAI blog 1(8), pp. 9. Cited by: [Appendix D](https://arxiv.org/html/2605.26175#A4.SS0.SSS0.Px2.p1.1), [§5](https://arxiv.org/html/2605.26175#S5.SS0.SSS0.Px1.p1.1).
- K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi (2021) Winogrande: an adversarial winograd schema challenge at scale. Communications of the ACM. Cited by: [Appendix D](https://arxiv.org/html/2605.26175#A4.SS0.SSS0.Px2.p1.1), [§5](https://arxiv.org/html/2605.26175#S5.SS0.SSS0.Px1.p1.1).
- M. Sap, H. Rashkin, D. Chen, R. LeBras, and Y. Choi (2019) Socialiqa: commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728. Cited by: [Appendix D](https://arxiv.org/html/2605.26175#A4.SS0.SSS0.Px2.p1.1), [§5](https://arxiv.org/html/2605.26175#S5.SS0.SSS0.Px1.p1.1).
- W. Shao, M. Chen, Z. Zhang, P. Xu, L. Zhao, Z. Li, K. Zhang, P. Gao, Y. Qiao, and P. Luo (2024) OmniQuant: omnidirectionally calibrated quantization for large language models. In The Twelfth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=8Wuvhh0LYW) Cited by: [§2](https://arxiv.org/html/2605.26175#S2.SS0.SSS0.Px2.p1.1), [§3.2](https://arxiv.org/html/2605.26175#S3.SS2.p1.1).
- Y. Sun, R. Liu, H. Bai, H. Bao, K. Zhao, Y. Li, J. Hu, X. Yu, L. Hou, C. Yuan, X. Jiang, W. Liu, and J. Yao (2025) FlatQuant: flatness matters for llm quantization. External Links: 2410.09426, [Link](https://arxiv.org/abs/2410.09426) Cited by: [§1](https://arxiv.org/html/2605.26175#S1.p2.1), [§2](https://arxiv.org/html/2605.26175#S2.SS0.SSS0.Px2.p1.1), [§3.2](https://arxiv.org/html/2605.26175#S3.SS2.p1.1), [§4.3](https://arxiv.org/html/2605.26175#S4.SS3.p1.2).
- H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al. (2023) Llama 2: open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288. Cited by: [§5](https://arxiv.org/html/2605.26175#S5.SS0.SSS0.Px1.p1.1).
- A. Tseng, J. Chee, Q. Sun, V. Kuleshov, and C. D. Sa (2024) QuIP#: even better LLM quantization with hadamard incoherence and lattice codebooks. In Forty-first International Conference on Machine Learning, External Links: [Link](https://openreview.net/forum?id=9BrydUVcoe) Cited by: [§2](https://arxiv.org/html/2605.26175#S2.SS0.SSS0.Px1.p1.1).
- X. Wei, Y. Zhang, Y. Li, X. Zhang, R. Gong, J. Guo, and X. Liu (2023) Outlier suppression+: accurate quantization of large language models by equivalent and effective shifting and scaling. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 1648–1665. Cited by: [§2](https://arxiv.org/html/2605.26175#S2.SS0.SSS0.Px1.p1.1).
- G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han (2023) Smoothquant: accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pp. 38087–38099. Cited by: [§1](https://arxiv.org/html/2605.26175#S1.p2.1), [§2](https://arxiv.org/html/2605.26175#S2.SS0.SSS0.Px1.p1.1), [§2](https://arxiv.org/html/2605.26175#S2.SS0.SSS0.Px2.p1.1).
- K. Yi, Z. Liu, jianwei zhang, C. Li, T. Zhang, J. Lin, and J. Zhou (2025) Rotated runtime smooth: training-free activation smoother for accurate INT4 inference. In The Thirteenth International Conference on Learning Representations, External Links: [Link](https://openreview.net/forum?id=WG7GzGx3G9) Cited by: [§2](https://arxiv.org/html/2605.26175#S2.SS0.SSS0.Px2.p1.1).
- R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi (2019) Hellaswag: can a machine really finish your sentence?. arXiv preprint arXiv:1905.07830. Cited by: [Appendix D](https://arxiv.org/html/2605.26175#A4.SS0.SSS0.Px2.p1.1), [§5](https://arxiv.org/html/2605.26175#S5.SS0.SSS0.Px1.p1.1).
- J. Zhao, M. Zhang, C. Zeng, M. Wang, X. Liu, and L. Nie (2024) LRQuant: learnable and robust post-training quantization for large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2240–2255. Cited by: [§2](https://arxiv.org/html/2605.26175#S2.SS0.SSS0.Px2.p1.1), [§3.2](https://arxiv.org/html/2605.26175#S3.SS2.p1.1), [§4.3](https://arxiv.org/html/2605.26175#S4.SS3.p1.2).

## Appendix Overview

- Section [A](https://arxiv.org/html/2605.26175#A1): Theory proofs.
- Section [B](https://arxiv.org/html/2605.26175#A2): Additional implementation details.
- Section [C](https://arxiv.org/html/2605.26175#A3): More ablations.
- Section [D](https://arxiv.org/html/2605.26175#A4.SS0.SSS0.Px2): Full results.
- Section [E](https://arxiv.org/html/2605.26175#A5): Visualization results.

## Appendix A Theory Proofs

###### Theorem 1

Let $X\sim P$ be a centered symmetric bell-shaped activation variable with standard deviation $\sigma$, and let

$$Q_{s,c}(x)=\operatorname{clip}\!\left(s\left\lfloor\frac{x}{s}\right\rceil,\,-c,\,c\right)$$

be a finite-bit uniform round-to-nearest quantizer with step size $s$, clipping scale $c=Ms$, and $M=2^{B-1}-1$. Define the quantization centroids and cells by

$$q_{i}=is,\qquad i=-M,\ldots,M,\qquad I_{i}=\{x\mid Q_{s,c}(x)=q_{i}\},$$

and the smoothed quantized density

$$Q_{\theta}^{(c)}(x)=\sum_{i=-M}^{M}p_{i}\,\delta(x-q_{i};\theta),\qquad p_{i}=\int_{I_{i}}P(x)\,dx,$$

where $\delta(x;\theta)=\frac{1}{2\theta}\exp(-|x|/\theta)$ and $\theta\ll s$. Then the smoothed KL objective admits the approximation

$$D_{\mathrm{KL}}(P\|Q_{\theta}^{(c)})\approx-H(P)+H(\{p_{i}\})+\log(2\theta)+\frac{1}{\theta}\mathcal{E}_{\mathrm{clip}},$$

where $\mathcal{E}_{\mathrm{clip}}=\mathbb{E}|X-Q_{s,c}(X)|$ is the expected absolute error of the clamped quantizer. Moreover, the normalized clipped error

$$\mathcal{E}_{\mathrm{clip}}^{\prime}=\frac{1}{\sigma}\mathbb{E}|X-Q_{s,c}(X)|$$

satisfies

$$\mathcal{E}_{\mathrm{clip}}^{\prime}\leq\frac{\lambda}{2}+\tau_{P}(\kappa),\qquad\lambda=\frac{s}{\sigma},\qquad\kappa=\frac{c}{\sigma},$$

where $\tau_{P}(\kappa)=\mathbb{E}[(|Y|-\kappa)_{+}]$ for $Y=X/\sigma$. With fixed bit-width $B$, this bound can be written as

$$F_{P,B}(\kappa)=\frac{\kappa}{2M}+\tau_{P}(\kappa),\qquad\kappa=\frac{1}{b_{n}}.$$

For Gaussian and Laplace activations, $\tau_{P}$ has the closed forms given below, and $F_{P,B}(\kappa)$ is increasing in $\kappa$ once the clipping tail is sufficiently controlled. Equivalently, in this tail-controlled regime, a smaller clipped range relative to the activation spread and a larger max-normalized dispersion $b_{n}=\sigma/c$ tighten the bound.
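To see the tail-controlled regime numerically, the bound $F_{P,B}(\kappa)$ can be evaluated for a Gaussian $P$ using the closed form $\tau_{G}(\kappa)=2(\phi(\kappa)-\kappa\bar{\Phi}(\kappa))$ from Eq. (19) in the proof. The sketch below is illustrative: the grid over $\kappa$ and the choice $B=4$ are assumptions made for the example.

```python
import math

def gaussian_tail(kappa: float) -> float:
    """tau_G(kappa) = 2 * (phi(kappa) - kappa * (1 - Phi(kappa)))."""
    phi = math.exp(-0.5 * kappa * kappa) / math.sqrt(2.0 * math.pi)
    survival = 0.5 * math.erfc(kappa / math.sqrt(2.0))
    return 2.0 * (phi - kappa * survival)

def bound(kappa: float, bits: int = 4) -> float:
    """F_{P,B}(kappa) = kappa / (2M) + tau_G(kappa), with M = 2^(B-1) - 1."""
    m = 2 ** (bits - 1) - 1
    return kappa / (2 * m) + gaussian_tail(kappa)

# Evaluate the bound on a grid of normalized clipping scales kappa = c / sigma.
kappas = [0.5 + 0.01 * i for i in range(751)]   # 0.5 ... 8.0
values = [bound(k) for k in kappas]
best = min(range(len(kappas)), key=values.__getitem__)
kappa_star = kappas[best]
```

For $B=4$ the minimizer lies near $\kappa\approx1.8$. Past that point the rounding term $\kappa/(2M)$ dominates and the bound grows with $\kappa$, which matches the statement that a smaller clipped range relative to $\sigma$ tightens the bound once the tail is controlled.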

### A.1 Proof of Theorem 1

For the finite-bit clamped quantizer, the interior cells are

$$I_{i}=\left[\left(i-\frac{1}{2}\right)s,\left(i+\frac{1}{2}\right)s\right),\qquad|i|<M,$$

while the boundary cells absorb the clipped tails:

$$I_{M}=\left[c-\frac{s}{2},\infty\right),\qquad I_{-M}=\left(-\infty,-c+\frac{s}{2}\right].$$

For $x\in I_{i}$, the quantizer outputs the centroid $q_{i}=is$. Under the kernel-separation assumption $\theta\ll s$, neighboring kernels contribute negligibly around each cell, and we may approximate

$$Q_{\theta}^{(c)}(x)\approx p_{i}\,\delta(x-q_{i};\theta)=\frac{p_{i}}{2\theta}\exp\!\left(-\frac{|x-q_{i}|}{\theta}\right),\qquad x\in I_{i}.$$

Substituting this expression into the KL divergence gives

$$\begin{aligned}D_{\mathrm{KL}}(P\|Q_{\theta}^{(c)})&=\sum_{i=-M}^{M}\int_{I_{i}}P(x)\log\!\left(\frac{P(x)}{Q_{\theta}^{(c)}(x)}\right)dx\\&\approx\sum_{i=-M}^{M}\int_{I_{i}}P(x)\log P(x)\,dx-\sum_{i=-M}^{M}\log(p_{i})\int_{I_{i}}P(x)\,dx\\&\quad+\frac{1}{\theta}\sum_{i=-M}^{M}\int_{I_{i}}|x-q_{i}|P(x)\,dx+\log(2\theta).\end{aligned}\tag{14}$$

Using

$$\sum_{i}\int_{I_{i}}P(x)\log P(x)\,dx=-H(P),\qquad\sum_{i}\log(p_{i})\int_{I_{i}}P(x)\,dx=\sum_{i}p_{i}\log p_{i}=-H(\{p_{i}\}),$$

we obtain

$$D_{\mathrm{KL}}(P\|Q_{\theta}^{(c)})\approx-H(P)+H(\{p_{i}\})+\log(2\theta)+\frac{1}{\theta}\mathcal{E}_{\mathrm{clip}},\tag{15}$$

where

$$\mathcal{E}_{\mathrm{clip}}=\sum_{i=-M}^{M}\int_{I_{i}}|x-q_{i}|P(x)\,dx=\mathbb{E}|X-Q_{s,c}(X)|.$$
To upper bound the practical clamped-quantization error, define the clipping projection $\Pi_c(x) = \operatorname{clip}(x, -c, c)$. Then

$$
\begin{aligned}
|x - Q_{s,c}(x)| &\leq |x - \Pi_c(x)| + |\Pi_c(x) - Q_{s,c}(x)| \\
&\leq (|x| - c)_{+} + \frac{s}{2}.
\end{aligned}
\tag{16}
$$

The second term is the standard in-range round-to-nearest error, while the first term is the clipping-tail error. Taking expectations and dividing by $\sigma$ gives

$$
\mathcal{E}_{\mathrm{clip}}^{\prime} := \frac{1}{\sigma}\mathbb{E}\,|X - Q_{s,c}(X)| \leq \frac{s}{2\sigma} + \frac{1}{\sigma}\mathbb{E}\!\left[(|X| - c)_{+}\right].
\tag{17}
$$

With the normalized variable $Y = X/\sigma$, $\lambda = s/\sigma$, and $\kappa = c/\sigma$, this becomes

$$
\mathcal{E}_{\mathrm{clip}}^{\prime} \leq \frac{\lambda}{2} + \tau_P(\kappa), \qquad \tau_P(\kappa) := \mathbb{E}\!\left[(|Y| - \kappa)_{+}\right].
\tag{18}
$$

Because $c = Ms$ for a fixed $B$-bit quantizer, we have $\kappa = M\lambda$ and therefore

$$
F_{P,B}(\kappa) = \frac{\kappa}{2M} + \tau_P(\kappa).
$$

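
The bound in Eq. (18) can be checked numerically. The following is a minimal sketch rather than the authors' code: it assumes a symmetric grid with $M = 2^{B-1} - 1$ levels per side ($M = 7$ for 4 bits) and Gaussian samples, and the helper names `quantize` and `bound_terms` are hypothetical.

```python
import random

def quantize(x, s, M):
    """Round-to-nearest onto the grid {-M*s, ..., M*s}; out-of-range values saturate."""
    i = round(x / s)
    i = max(-M, min(M, i))
    return i * s

def bound_terms(samples, sigma, s, M):
    """Return (normalized error, F_{P,B}(kappa)), both estimated from the samples."""
    c = M * s
    n = len(samples)
    err = sum(abs(x - quantize(x, s, M)) for x in samples) / (n * sigma)
    tail = sum(max(abs(x) - c, 0.0) for x in samples) / (n * sigma)
    kappa = c / sigma
    return err, kappa / (2 * M) + tail

random.seed(0)
sigma, M = 1.0, 7  # assumption: M = 2**(4 - 1) - 1 for a symmetric 4-bit grid
samples = [random.gauss(0.0, sigma) for _ in range(20000)]
for kappa in (1.0, 2.0, 3.0, 4.0):
    s = kappa * sigma / M  # step size implied by c = M * s
    err, bound = bound_terms(samples, sigma, s, M)
    print(f"kappa={kappa:.1f}  error={err:.4f}  bound={bound:.4f}")
```

Because Eq. (16) holds pointwise, the measured error never exceeds the bound for any choice of $\kappa$; the gap shows how loose the $s/2$ worst-case rounding term is in practice.
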
#### Gaussian activations\.

If $X \sim \mathcal{N}(0, \sigma^2)$, then $Y \sim \mathcal{N}(0, 1)$ with density $\phi(y) = \frac{1}{\sqrt{2\pi}} e^{-y^2/2}$ and survival function $\bar{\Phi}(y) = 1 - \Phi(y)$. The clipping-tail term is

$$
\tau_G(\kappa) = 2\int_{\kappa}^{\infty} (y - \kappa)\,\phi(y)\,dy = 2\big(\phi(\kappa) - \kappa\bar{\Phi}(\kappa)\big).
\tag{19}
$$

Hence

$$
F_{G,B}(\kappa) = \frac{\kappa}{2M} + 2\big(\phi(\kappa) - \kappa\bar{\Phi}(\kappa)\big),
$$

and

$$
F_{G,B}^{\prime}(\kappa) = \frac{1}{2M} - 2\bar{\Phi}(\kappa).
$$

Therefore $F_{G,B}$ is increasing whenever

$$
\kappa \geq \kappa_G^{\star} := \Phi^{-1}\!\left(1 - \frac{1}{4M}\right).
$$

In this tail-controlled regime, decreasing $\kappa = c/\sigma$ reduces the bound.
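
As a quick sanity check of the Gaussian case, the sketch below evaluates $\tau_G$, $F_{G,B}$, and $\kappa_G^{\star}$ with Python's standard library. It is illustrative only: the value $M = 7$ assumes a symmetric 4-bit grid, and the function names are hypothetical.

```python
from statistics import NormalDist
import math

STD = NormalDist()  # standard normal distribution

def tau_gauss(kappa):
    """Clipping-tail term 2 * (phi(kappa) - kappa * (1 - Phi(kappa))) from Eq. (19)."""
    return 2.0 * (STD.pdf(kappa) - kappa * (1.0 - STD.cdf(kappa)))

def F_gauss(kappa, M):
    """Bound F_{G,B}(kappa) = kappa / (2M) + tau_G(kappa)."""
    return kappa / (2 * M) + tau_gauss(kappa)

def kappa_star_gauss(M):
    """Threshold Phi^{-1}(1 - 1/(4M)) beyond which F_{G,B} is increasing."""
    return STD.inv_cdf(1.0 - 1.0 / (4 * M))

M = 7  # assumption: symmetric 4-bit grid
k = kappa_star_gauss(M)
slope = 1.0 / (2 * M) - 2.0 * (1.0 - STD.cdf(k))  # F'_{G,B} at the threshold
print(f"kappa_G* = {k:.3f}  F = {F_gauss(k, M):.4f}  F' = {slope:.2e}")
```

Since $F_{G,B}^{\prime\prime}(\kappa) = 2\phi(\kappa) > 0$, the bound is convex in $\kappa$, so $\kappa_G^{\star}$ is its unique minimizer.
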

#### Laplace activations\.

If $X \sim \mathrm{Laplace}(0, b)$, then $\sigma = \sqrt{2}\,b$. We keep the same normalized error definition as above,

$$
\mathcal{E}_{\mathrm{clip}}^{\prime} = \frac{1}{\sigma}\mathbb{E}\,|X - Q_{s,c}(X)|,
$$

so the Laplace case is parameterized by the scale $b$ only through the identity $\sigma = \sqrt{2}\,b$. With the normalized variable $Y = X/\sigma$, the density becomes

$$
f_Y(y) = \frac{1}{\sqrt{2}} e^{-\sqrt{2}|y|}.
$$

The clipping-tail term becomes

$$
\tau_L(\kappa) = 2\int_{\kappa}^{\infty} (y - \kappa)\frac{1}{\sqrt{2}} e^{-\sqrt{2}y}\,dy = \frac{1}{\sqrt{2}} e^{-\sqrt{2}\kappa}.
\tag{20}
$$

Hence

$$
F_{L,B}(\kappa) = \frac{\kappa}{2M} + \frac{1}{\sqrt{2}} e^{-\sqrt{2}\kappa}, \qquad
F_{L,B}^{\prime}(\kappa) = \frac{1}{2M} - e^{-\sqrt{2}\kappa}.
$$

Therefore $F_{L,B}$ is increasing whenever

$$
\kappa \geq \kappa_L^{\star} := \frac{\log(2M)}{\sqrt{2}}.
$$

Again, in this regime, decreasing $\kappa$ tightens the bound. Equivalently, one may express the same condition in terms of $b$ via $\kappa = c/(\sqrt{2}\,b)$, but the normalized error itself remains defined with respect to $\sigma$.
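
The closed form in Eq. (20) and the threshold $\kappa_L^{\star}$ can be verified with a short numerical sketch. This is an illustration under assumptions: the integral is approximated by a midpoint rule truncated at a finite upper limit, $M = 7$ assumes a symmetric 4-bit grid, and all function names are hypothetical.

```python
import math

R2 = math.sqrt(2.0)

def tau_laplace(kappa):
    """Closed-form clipping-tail term from Eq. (20)."""
    return math.exp(-R2 * kappa) / R2

def tau_laplace_numeric(kappa, upper=40.0, n=200000):
    """Midpoint-rule estimate of 2 * int_kappa^upper (y - kappa) f_Y(y) dy."""
    h = (upper - kappa) / n
    total = 0.0
    for j in range(n):
        y = kappa + (j + 0.5) * h
        total += (y - kappa) * math.exp(-R2 * y) / R2
    return 2.0 * total * h

def kappa_star_laplace(M):
    """Threshold log(2M) / sqrt(2) beyond which F_{L,B} is increasing."""
    return math.log(2 * M) / R2

M = 7  # assumption: symmetric 4-bit grid
k = kappa_star_laplace(M)
print(f"kappa_L* = {k:.3f}")
print(f"closed form = {tau_laplace(1.0):.6f}  numeric = {tau_laplace_numeric(1.0):.6f}")
```
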

Finally, the normalized dispersion metric used in the main text is

$$
b_n = \frac{\sigma}{c} = \frac{1}{\kappa}, \qquad \lambda = \frac{s}{\sigma} = \frac{s/c}{\sigma/c} = \frac{\bar{s}}{b_n}.
$$

Therefore, for a fixed bit-width and a controlled clipping tail, a smaller clipped range relative to the activation spread and a larger max-normalized dispersion $b_n$ both correspond to a smaller $\kappa$ and a tighter error bound.

### A.2 Why $b_n$ Tends to Increase Under PSOT

Based on the analysis in the motivation section, the normalized surrogate error is controlled by the ratio between the quantization step size and the distributional spread. Since practical activation quantization uses a clipped numerical range, we normalize both quantities by the clipping scale. This yields the dispersion metric $b_n$ used in the main text.

Suppose we apply a rotation to an activation token $\mathbf{t}$ with $d$ dimensions using an orthogonal matrix $\mathbf{A}$, and then center it as in Eq. ([7](https://arxiv.org/html/2605.26175#S4.E7)), yielding $\mathbf{t}^{\prime} = \mathbf{t}\mathbf{A} - \mathbb{E}[\mathbf{t}\mathbf{A}]$. The optimization objective aims to reduce $\lVert\mathbf{t}^{\prime}\rVert_{\infty}$, which lowers the clipping scale that determines the quantization step size. Since $\mathbf{A}$ is orthogonal, the Euclidean norm remains invariant under rotation, and centering only removes the mean component:

$$
\lVert\mathbf{t}^{\prime}\rVert_2 \leq \lVert\mathbf{t}\mathbf{A}\rVert_2 = \lVert\mathbf{t}\rVert_2
\tag{21}
$$

For a centered token, the range-normalized dispersion satisfies

$$
b_n(\mathbf{t}^{\prime}) = \sqrt{\frac{1}{d}}\,\frac{\lVert\mathbf{t}^{\prime}\rVert_2}{\lVert\mathbf{t}^{\prime}\rVert_{\infty}}.
\tag{22}
$$

Comparing the transformed and original tokens gives

$$
\frac{b_n(\mathbf{t}^{\prime})}{b_n(\mathbf{t})} = \frac{\lVert\mathbf{t}^{\prime}\rVert_2}{\lVert\mathbf{t}\rVert_2} \cdot \frac{\lVert\mathbf{t}\rVert_{\infty}}{\lVert\mathbf{t}^{\prime}\rVert_{\infty}}.
\tag{23}
$$

Therefore, when PSOT suppresses the peak value while preserving most of the centered token energy, the decrease in $\lVert\mathbf{t}^{\prime}\rVert_{\infty}$ can dominate the mild change in $\lVert\mathbf{t}^{\prime}\rVert_2$, causing $b_n(\mathbf{t}^{\prime})$ to increase. This is a conditional mechanism rather than a universal guarantee, but it explains why peak suppression empirically tends to produce both a smaller effective clipping range and a more dispersed normalized distribution.
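
The mechanism in Eqs. (21)–(23) can be illustrated on a toy token with a single outlier channel. The sketch below is an assumption-laden example, not the PSOT procedure: it uses a fixed normalized Hadamard matrix in place of a learned orthogonal matrix $\mathbf{A}$, and the token values and helper names are made up for illustration.

```python
import math

def hadamard(d):
    """Normalized d x d Hadamard matrix via the Sylvester construction (d a power of two)."""
    H = [[1.0]]
    while len(H) < d:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    scale = 1.0 / math.sqrt(d)
    return [[v * scale for v in row] for row in H]

def rotate_and_center(t, A):
    """Compute t' = tA - mean(tA) for a row vector t."""
    d = len(t)
    tA = [sum(t[i] * A[i][j] for i in range(d)) for j in range(d)]
    mean = sum(tA) / d
    return [v - mean for v in tA]

def l2(t):
    return math.sqrt(sum(v * v for v in t))

def linf(t):
    return max(abs(v) for v in t)

def b_n(t):
    """Range-normalized dispersion sqrt(1/d) * ||t||_2 / ||t||_inf, as in Eq. (22)."""
    return l2(t) / (math.sqrt(len(t)) * linf(t))

d = 16
t = [0.1 * ((-1) ** i) * (1 + i % 3) for i in range(d)]  # small background values
t[5] = 8.0  # one outlier channel (illustrative)
t_rot = rotate_and_center(t, hadamard(d))  # Hadamard is a stand-in for a learned A

print(f"l2:   {l2(t):.3f} -> {l2(t_rot):.3f}")
print(f"linf: {linf(t):.3f} -> {linf(t_rot):.3f}")
print(f"b_n:  {b_n(t):.3f} -> {b_n(t_rot):.3f}")
```

In this example the rotation spreads the outlier's energy across all coordinates, so the $\ell_\infty$ norm drops sharply while the $\ell_2$ norm barely changes, which is exactly the case in which Eq. (23) yields a ratio above one.
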

## Appendix B Additional Implementation Details

#### Additional Setup\.

To improve efficiency, block-diagonal matrices are used, partitioned into two blocks for the 7B–13B models and four blocks for the 70B model. The PSOT temperature is set to $T=2$, with a batch size of 4 and an initial learning rate of 2, linearly decayed over 15 epochs. For ASOT, we use $m=10$ calibration samples and select the hyperparameters by grid search on the calibration set; the final setting uses $\delta=0.02$, $\tau=0.4$, and $\gamma=30$. In the subsequent LAC phase, the initial clipping ratio is set to 0.95 and constrained to the interval $[0.5, 1]$. We use the AdamW optimizer with an initial learning rate of 0.05, applying a cosine annealing decay schedule over 5 epochs with a batch size of 4.

#### Computational Graph\.

To better accommodate the varying degrees of discrepancy across activations at different layers, our proposed InfoQuant method employs PSOT to optimize the distribution of each layer's activations individually. This per-layer optimization yields improved quantization performance. However, to maintain computational consistency in the presence of residual connections, an additional matrix multiplication is required at these connection points, as illustrated in Figure [6](https://arxiv.org/html/2605.26175#A2.F6). To mitigate the overhead introduced by this operation, we adopt block-diagonal orthogonal matrices as the rotation matrices, which significantly reduce inference cost and improve the efficiency of the PSOT process. For a fair comparison, the InfoQuant\* variant employs a global rotation matrix, as shown in Figure [5](https://arxiv.org/html/2605.26175#A2.F5). This approach eliminates the need for additional computation at residual connections and results in a computation graph equivalent to that of SpinQuant (Liu et al., [2025b](https://arxiv.org/html/2605.26175#bib.bib23)).

![Refer to caption](https://arxiv.org/html/2605.26175v1/x4.png)

Figure 5: Overall rotation diagram of InfoQuant\*.

![Refer to caption](https://arxiv.org/html/2605.26175v1/x5.png)

Figure 6: Overall rotation diagram of InfoQuant.

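
To make the cost argument for block-diagonal rotations concrete, the following sketch applies a block-diagonal orthogonal matrix to a token without materializing the full matrix and counts scalar multiplications. It is a simplified illustration: the blocks are fixed Hadamard matrices standing in for learned PSOT blocks, and the function names are hypothetical.

```python
import math

def hadamard(n):
    """Normalized n x n Hadamard matrix via the Sylvester construction (n a power of two)."""
    H = [[1.0]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return [[v / math.sqrt(n) for v in row] for row in H]

def apply_block_diagonal(t, blocks):
    """Multiply row vector t by diag(blocks), one block at a time."""
    out, offset = [], 0
    for B in blocks:
        n = len(B)
        seg = t[offset:offset + n]
        out.extend(sum(seg[i] * B[i][j] for i in range(n)) for j in range(n))
        offset += n
    return out

def multiply_count(d, num_blocks):
    """Scalar multiplications per d-dimensional token: num_blocks * (d / num_blocks)**2."""
    n = d // num_blocks
    return num_blocks * n * n

def norm(v):
    return math.sqrt(sum(x * x for x in v))

d, num_blocks = 8, 2
blocks = [hadamard(d // num_blocks) for _ in range(num_blocks)]  # placeholder blocks
t = [3.0, -1.0, 0.5, 2.0, -4.0, 0.0, 1.5, -0.5]
t_rot = apply_block_diagonal(t, blocks)

print(f"norm before = {norm(t):.4f}, after = {norm(t_rot):.4f}")
for k in (1, 2, 4):
    print(f"{k} block(s): {multiply_count(4096, k)} multiplications per token")
```

The multiplication count scales as $d^2/k$ for $k$ equal blocks, so two blocks halve the cost of a dense rotation. The trade-off is that coordinates in different blocks never mix.
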
#### Runtime and Memory Overhead\.

| Model | PSOT Time | PSOT Memory | LAC Time | LAC Memory |
|---|---|---|---|---|
| LLaMA-2 7B | $\sim$6.0 | $\sim$6 | $\sim$3.0 | $\sim$10 |
| LLaMA-2 13B | $\sim$7.5 | $\sim$7 | $\sim$2.0 | $\sim$12 |
| LLaMA-2 70B | $\sim$13.0 | $\sim$11 | $\sim$3.5 | $\sim$23 |

Table 4: Per-layer optimization time (min) and GPU memory usage (GB) on NVIDIA RTX 4090.

Table [4](https://arxiv.org/html/2605.26175#A2.T4) presents the optimization runtime and memory consumption for a single Transformer layer of the LLaMA-2 models (7B, 13B, and 70B), measured on an NVIDIA RTX 4090 GPU. Notably, our method quantizes the 70B model within 24 GB of GPU memory, demonstrating compatibility with consumer-grade hardware. GPUs with larger memory capacities can reduce optimization time by enabling greater parallelism, potentially achieving multi-fold speedups.

## Appendix C More Ablations

#### Influence of Weighting Factor $\gamma$.

| Weighting factor $\gamma$ | WikiText2 PPL |
|---|---|
| 1 | 7.03 |
| 30 | 6.89 |
| 50 | 6.97 |
| 80 | 6.93 |
| $+\infty$ | 7.02 |

Table 5: Ablation study of the weighting factor $\gamma$ on WikiText2 PPL using RTN quantization under the W4A4KV4 configuration for LLaMA-2 7B.

The weighting factor $\gamma$ controls how strongly ASOT emphasizes outlier tokens in Eq. ([11](https://arxiv.org/html/2605.26175#S4.E11)). Table [5](https://arxiv.org/html/2605.26175#A3.T5) shows that a moderate emphasis works best: $\gamma=30$ achieves the lowest perplexity, while both weak emphasis ($\gamma=1$) and exclusive focus on outlier tokens ($\gamma=+\infty$) are worse. This trend suggests that outlier tokens provide high-value learning signals for peak suppression, but normal tokens are still necessary to preserve the overall activation distribution. In other words, ASOT should rebalance the optimization objective rather than collapse it into an outlier-only objective.

#### Influence of Temperature $T$ in PSOT.

The temperature $T$ controls how selectively PSOT suppresses high-magnitude activation coordinates. As shown in Table [6](https://arxiv.org/html/2605.26175#A3.T6), $T=2$ achieves the best perplexity, while both sharper weighting ($T=0.3$) and flatter weighting ($T=8$) degrade performance. This pattern reveals a useful design trade-off. If $T$ is too small, the objective concentrates on only a few extreme coordinates and may overfit the calibration activations. If $T$ is too large, the softmax weights become nearly uniform and the objective loses its ability to target peaks. A moderate temperature therefore provides the best balance between peak suppression and distribution-level stability.

| Temperature $T$ | WikiText2 PPL |
|---|---|
| 0.3 | 7.20 |
| 1 | 7.11 |
| 2 | 6.89 |
| 4 | 7.04 |
| 8 | 7.22 |

Table 6: Effect of the PSOT temperature $T$ on WikiText2 perplexity using RTN quantization under the W4A4KV4 configuration for LLaMA-2 7B.

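
The effect of temperature on softmax weights can be seen in a small sketch. This is only an illustration of the general behavior described above: the weighting form `softmax(|t_i| / T)` is an assumption made here for demonstration (the exact PSOT objective is defined in the main text), and the token values are made up.

```python
import math

def softmax_weights(t, T):
    """Softmax over |t_i| / T; an assumed stand-in for the PSOT coordinate weighting."""
    z = [abs(v) / T for v in t]
    m = max(z)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]

t = [0.2, -0.5, 1.0, -0.3, 6.0, 0.8, -4.0, 0.1]  # illustrative token with two peaks
for T in (0.3, 2.0, 8.0):
    w = softmax_weights(t, T)
    print(f"T={T}: largest weight = {max(w):.3f}")
```

With a small $T$ nearly all weight falls on the single largest coordinate, whereas a large $T$ drives the weights toward the uniform value $1/d$, matching the two failure modes discussed above.
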
#### Block Sizes of Block\-diagonal Orthogonal Matrices\.

To enhance computational and memory efficiency during activation rotation and inference, we adopt block-diagonal orthogonal matrices. However, this structure inherently limits interaction to within individual blocks, thereby restricting global information aggregation. As shown in Table [7](https://arxiv.org/html/2605.26175#A3.T7), increasing the number of blocks, which corresponds to a reduction in block size, results in a decline in quantization performance. This observation aligns with the findings of Lin et al. ([2024a](https://arxiv.org/html/2605.26175#bib.bib47)), who attribute the degradation to inter-block shifts in activation means that hinder quantization efficiency. We further hypothesize that full orthogonal matrices enable more effective redistribution of activation values due to the cumulative contribution of their unit vectors across higher dimensions. In contrast, block-diagonal matrices reduce this capability, thereby diminishing the range compression essential for effective quantization. To balance these competing considerations, we select an intermediate block size, as detailed in the implementation details.

Table 7: Impact of block number in block-diagonal orthogonal matrices on quantization performance under the W4A4KV4 configuration with GPTQ.

| Block number | LLaMA-3 8B | LLaMA-2 13B |
|---|---|---|
| 1 | 7.13 | 5.16 |
| 2 | 7.18 | 5.18 |
| 4 | 7.17 | 5.32 |
| 8 | 7.39 | 5.37 |

#### Robustness of Initialization $\mathbf{A}$.

To investigate the robustness of the proposed PSOT algorithm with respect to orthogonal initialization strategies, we conduct an ablation study comparing Hadamard and randomly generated orthogonal matrices. For the latter, we start from uniformly sampled random matrices and apply QR decomposition to ensure orthogonality. As shown in Table [8](https://arxiv.org/html/2605.26175#A3.T8), we evaluate both initialization methods on the LLaMA-2 13B and LLaMA-3 8B models. While previous work (Ashkboos et al., [2024b](https://arxiv.org/html/2605.26175#bib.bib17)) reported a notable performance gap in favor of Hadamard-based initialization, our findings demonstrate that, after optimization via PSOT, both initialization schemes yield similarly stable and effective results. This suggests that PSOT is robust to the choice of orthogonal basis at initialization.

Table 8: Impact of initialization of PSOT under the W4A4KV4 configuration with GPTQ.

| Model | Initialization | WikiText2 PPL |
|---|---|---|
| LLaMA-3 8B | Hadamard | 7.18 |
| LLaMA-3 8B | Random | 7.18 |
| LLaMA-2 13B | Hadamard | 5.18 |
| LLaMA-2 13B | Random | 5.19 |
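
The random orthogonal initialization described above can be sketched without external libraries. This is not the authors' code: instead of a library QR routine it uses modified Gram-Schmidt, which produces the same kind of orthonormal Q factor up to column signs, and the sampling range $[-1, 1]$, the seed, and the function name are assumptions.

```python
import math
import random

def random_orthogonal(d, seed=0):
    """Orthonormalize d uniformly sampled vectors (the Q factor of a QR decomposition)."""
    rng = random.Random(seed)
    vectors = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(d)]
    Q = []
    for v in vectors:
        w = v[:]
        for q in Q:  # modified Gram-Schmidt: remove one projection at a time
            dot = sum(a * b for a, b in zip(w, q))
            w = [a - dot * b for a, b in zip(w, q)]
        norm = math.sqrt(sum(a * a for a in w))
        Q.append([a / norm for a in w])
    return Q  # list of d mutually orthonormal vectors

d = 8
Q = random_orthogonal(d)
# Largest deviation of the Gram matrix from the identity.
deviation = max(
    abs(sum(a * b for a, b in zip(Q[i], Q[j])) - (1.0 if i == j else 0.0))
    for i in range(d) for j in range(d)
)
print(f"max |Q Q^T - I| = {deviation:.2e}")
```
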

#### Clipping Ratio Ablation\.

The orthogonal matrix obtained by PSOT reduces quantization error through rotation, but this benefit comes at the cost of increased sensitivity to clipping thresholds. As shown in Table [9](https://arxiv.org/html/2605.26175#A3.T9), small changes in the clipping thresholds ($\alpha, \beta$) can lead to significant variations in the model's performance, highlighting a trade-off between reduced quantization error and the difficulty of tuning the clipping parameters for optimal results.

Table 9: WikiText2 perplexity of LLaMA-2 7B after PSOT, evaluated with different clipping ratios. To assess the sensitivity to the clipping ratio, all results were obtained using RTN quantization with W4A4KV4.

| Clip ratio $\alpha, \beta$ | WikiText2 PPL |
|---|---|
| 1 | 7.21 |
| 0.95 | 7.06 |
| 0.9 | 6.98 |
| 0.85 | 7.00 |
| 0.8 | 8.02 |
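
The role of the clipping ratio can be illustrated with a toy round-to-nearest (RTN) quantizer. The sketch below is an assumption-based example, not the evaluation pipeline behind Table 9: it quantizes synthetic Gaussian values per token on a symmetric 4-bit grid, uses a single ratio rather than separate $\alpha$ and $\beta$, and its function names are hypothetical.

```python
import random

def rtn_quantize(t, bits=4, clip_ratio=1.0):
    """Per-token symmetric RTN with clipping scale c = clip_ratio * max|t|."""
    M = 2 ** (bits - 1) - 1  # assumption: symmetric grid with M levels per side
    c = clip_ratio * max(abs(v) for v in t)
    s = c / M
    return [max(-M, min(M, round(v / s))) * s for v in t]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
token = [random.gauss(0.0, 1.0) for _ in range(4096)]  # synthetic activations
for ratio in (1.0, 0.95, 0.9, 0.85, 0.8):
    err = mse(token, rtn_quantize(token, clip_ratio=ratio))
    print(f"clip ratio {ratio:.2f}: MSE = {err:.5f}")
```

Lowering the ratio shrinks the step size for in-range values but saturates more of the tail, which is the same tension between rounding error and clipping error captured by the bound in Appendix A.
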

## Appendix D More Results

#### Results for Qwen family

For a more comprehensive evaluation, we further test our method on the Qwen-2.5 models in Table [10](https://arxiv.org/html/2605.26175#A4.T10). Under 4-4-4 quantization, our method achieves the lowest Wiki perplexity among the compared methods on both the 14B and 32B models and the highest zero-shot average on the 32B model, highlighting its robustness and scalability.

| #Bits (W-A-KV) | Method | Qwen-2.5 14B 0-shot$^9$ Avg. (↑) | Qwen-2.5 14B Wiki (↓) | Qwen-2.5 32B 0-shot$^9$ Avg. (↑) | Qwen-2.5 32B Wiki (↓) |
|---|---|---|---|---|---|
| 16-16-16 | FloatingPoint | 70.95 | 5.29 | 71.11 | 5.02 |
| 4-4-4 | QuaRot | 67.23 | 6.77 | 68.14 | 6.04 |
| 4-4-4 | SpinQuant | 67.29 | 6.55 | 68.51 | 5.88 |
| 4-4-4 | OSTQuant | 67.81 | 6.37 | OOM | OOM |
| 4-4-4 | InfoQuant | 67.65 | 6.30 | 69.90 | 5.69 |

Table 10: Evaluation results on Qwen-2.5 models. The results for QuaRot and SpinQuant are reproduced using their respective official open-source implementations. Due to memory limitations, the Qwen-2.5 32B model was not evaluated using the OSTQuant codebase.

#### Full Results

In Table [11](https://arxiv.org/html/2605.26175#A4.T11), we report the complete InfoQuant results for the experimental section. We compare WikiText2 perplexity and accuracy on nine zero-shot tasks using the lm-evaluation-harness (version 0.4.7) (Gao et al., [2024](https://arxiv.org/html/2605.26175#bib.bib31)), including BoolQ (Clark et al., [2019](https://arxiv.org/html/2605.26175#bib.bib32)), HellaSwag (Zellers et al., [2019](https://arxiv.org/html/2605.26175#bib.bib34)), LAMBADA (OpenAI) (Radford et al., [2019](https://arxiv.org/html/2605.26175#bib.bib40)), OpenBookQA (OBQA) (Mihaylov et al., [2018](https://arxiv.org/html/2605.26175#bib.bib37)), PIQA (Bisk et al., [2020](https://arxiv.org/html/2605.26175#bib.bib38)), SIQA (Sap et al., [2019](https://arxiv.org/html/2605.26175#bib.bib33)), WinoGrande (Sakaguchi et al., [2021](https://arxiv.org/html/2605.26175#bib.bib35)), ARC-Easy, and ARC-Challenge (Boratko et al., [2018](https://arxiv.org/html/2605.26175#bib.bib39)).

Table 11: Full InfoQuant results on LLaMA-2 and LLaMA-3: WikiText2 perplexity and zero-shot accuracy on all tasks, with the average accuracy.

| Model | #Bits (W-A-KV) | ARC-c (↑) | ARC-e (↑) | BoolQ (↑) | HellaS. (↑) | Lam. (↑) | OBQA (↑) | PIQA (↑) | SIQA (↑) | WinoG. (↑) | Avg. (↑) | Wiki2 (↓) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2-7B | 16-16-16 | 46.42 | 74.33 | 77.71 | 75.94 | 73.69 | 44.20 | 79.16 | 45.91 | 69.53 | 65.21 | 5.47 |
| 2-7B | 4-16-16 | 44.62 | 73.86 | 76.67 | 75.22 | 72.87 | 43.60 | 78.02 | 45.65 | 68.43 | 64.34 | 5.60 |
| 2-7B | 4-4-16 | 43.34 | 71.25 | 75.44 | 74.10 | 71.90 | 40.80 | 77.09 | 44.98 | 66.61 | 62.84 | 5.86 |
| 2-7B | 4-4-4 | 43.17 | 71.59 | 75.60 | 73.92 | 72.39 | 42.20 | 77.69 | 45.14 | 66.77 | 63.16 | 5.89 |
| 2-13B | 16-16-16 | 49.15 | 77.53 | 80.58 | 79.39 | 76.62 | 45.20 | 80.63 | 47.49 | 71.90 | 67.61 | 4.88 |
| 2-13B | 4-16-16 | 48.89 | 77.31 | 79.82 | 78.81 | 76.40 | 45.20 | 79.92 | 46.72 | 72.38 | 67.27 | 4.99 |
| 2-13B | 4-4-16 | 48.98 | 75.72 | 80.06 | 78.40 | 75.39 | 44.80 | 79.43 | 46.32 | 71.19 | 66.71 | 5.15 |
| 2-13B | 4-4-4 | 48.04 | 75.34 | 79.36 | 77.97 | 75.47 | 44.40 | 79.54 | 45.60 | 71.27 | 66.33 | 5.18 |
| 2-70B | 16-16-16 | 57.42 | 81.02 | 83.79 | 83.81 | 79.60 | 48.80 | 82.70 | 49.18 | 77.98 | 71.59 | 3.32 |
| 2-70B | 4-16-16 | 57.25 | 80.86 | 82.96 | 83.37 | 79.66 | 48.20 | 82.92 | 48.82 | 77.27 | 71.25 | 3.40 |
| 2-70B | 4-4-16 | 55.72 | 80.01 | 82.35 | 82.86 | 79.64 | 49.00 | 82.05 | 48.52 | 77.27 | 70.82 | 3.62 |
| 2-70B | 4-4-4 | 55.89 | 80.05 | 81.99 | 82.53 | 78.92 | 47.60 | 82.15 | 48.67 | 75.37 | 70.35 | 3.64 |
| 3-8B | 16-16-16 | 53.50 | 77.74 | 81.10 | 79.18 | 75.74 | 44.80 | 80.63 | 47.08 | 73.01 | 68.09 | 6.14 |
| 3-8B | 4-16-16 | 53.07 | 77.44 | 78.44 | 78.18 | 74.77 | 44.00 | 80.30 | 46.62 | 73.40 | 67.36 | 6.48 |
| 3-8B | 4-4-16 | 48.81 | 77.02 | 77.77 | 76.67 | 73.30 | 43.80 | 78.45 | 44.52 | 71.35 | 65.74 | 7.07 |
| 3-8B | 4-4-4 | 50.00 | 75.98 | 75.65 | 76.28 | 72.46 | 44.60 | 78.67 | 45.50 | 71.03 | 65.57 | 7.16 |
| 3-70B | 16-16-16 | 64.42 | 85.98 | 85.14 | 84.95 | 79.47 | 48.46 | 84.39 | 50.82 | 80.66 | 73.81 | 2.86 |
| 3-70B | 4-16-16 | 62.63 | 85.19 | 86.21 | 84.35 | 78.27 | 47.00 | 84.49 | 50.73 | 80.43 | 73.25 | 3.50 |
| 3-70B | 4-4-16 | 57.34 | 81.14 | 84.92 | 82.88 | 77.94 | 45.80 | 81.66 | 48.41 | 76.32 | 70.71 | 5.24 |
| 3-70B | 4-4-4 | 56.66 | 81.31 | 83.61 | 82.24 | 76.23 | 45.60 | 82.05 | 47.65 | 76.56 | 70.21 | 5.39 |

## Appendix E Visualization results

Figure [7](https://arxiv.org/html/2605.26175#A5.F7) and Figure [8](https://arxiv.org/html/2605.26175#A5.F8) show the activation distributions of different layers in LLaMA-2 7B and LLaMA-3 8B.

![Refer to caption](https://arxiv.org/html/2605.26175v1/figures/app_oho_2_7b.png)

Figure 7: The rotated activation distribution of different layers in LLaMA-2 7B with Hadamard and PSOT.

![Refer to caption](https://arxiv.org/html/2605.26175v1/figures/app_oho_3_8b.png)

Figure 8: The rotated activation distribution of different layers in LLaMA-3 8B with Hadamard and PSOT.
