Theory-optimal Quantization Based on Flatness

arXiv cs.LG Papers

Summary

Introduces Flatness metric and Bidirectional Diagonal Quantization (BDQ) for post-training quantization of large language models, achieving near-lossless 4-bit weight and activation quantization and substantial improvements at extreme low-bit settings.

arXiv:2605.18800v1

Cached at: 05/20/26, 08:36 AM

# Theory-optimal Quantization Based on Flatness
Source: [https://arxiv.org/html/2605.18800](https://arxiv.org/html/2605.18800)
Xiusheng Huang1,2,3, Zhe Li4, Xuanwu Yin4, Lu Wang5, Yequan Wang3†, Dong Li4, Emad Barsoum4, Kang Liu1,2†

1 The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; 2 School of Artificial Intelligence, University of Chinese Academy of Sciences; 3 Beijing Academy of Artificial Intelligence; 4 AMD; 5 Ritzz-AI

huangxiusheng2020@ia.ac.cn, {wangluloveslezhi,tshwangyequan}@gmail.com, {z.li,Xuanwu.Yin,d.li,Emad.Barsoum}@amd.com, kliu@nlpr.ia.ac.cn

###### Abstract

Post-training quantization has emerged as a widely adopted technique for compressing and accelerating the inference of Large Language Models (LLMs). The primary challenges in LLM quantization stem from activation outliers, which significantly degrade model performance, especially at lower bit precision. While recent approaches attempt to mitigate outliers through linear transformations across feature dimensions, our analysis reveals that the transformed weights and activations still exhibit persistent outlier patterns with concentrated magnitude distributions. In this paper, we first model the mathematical relationship between quantization error and outliers, and then introduce a new metric, Flatness, to quantify the distribution of outliers. Based on this, we derive the theoretical optimal solution with respect to Flatness. Building on these insights, we propose Bidirectional Diagonal Quantization (BDQ), a novel post-training quantization framework that effectively disperses outlier patterns through optimized matrix transformations. BDQ strategically distributes outlier magnitudes across matrix dimensions via learned diagonal operations. Extensive experiments demonstrate that BDQ establishes a new quantization benchmark. It achieves less than 1% accuracy drop in W4A4 quantization on the LLaMA-3-8B model. In the more challenging W2A4KV16 experiment, compared to state-of-the-art approaches, BDQ reduces the performance gap by 39.1% on the DeepSeek-R1-Distill-LLaMA-70B model.


† Corresponding authors

## 1 Introduction

Recent Large Language Models (LLMs) have achieved superior performance on many natural language processing tasks as their parameter counts grow (Yang et al., [2024](https://arxiv.org/html/2605.18800#bib.bib74); Grattafiori et al., [2024](https://arxiv.org/html/2605.18800#bib.bib75)). However, scaling up the parameters significantly increases computational and storage costs (Xiao et al., [2023](https://arxiv.org/html/2605.18800#bib.bib20)). The efficient, low-cost deployment of LLMs has therefore become an urgent research direction (Ashkboos et al., [2025](https://arxiv.org/html/2605.18800#bib.bib23)). Previous research can be divided into architecture-changing and architecture-preserving techniques.

Architecture-changing methods such as distillation (Han et al., [2015](https://arxiv.org/html/2605.18800#bib.bib66); Chen et al., [2020](https://arxiv.org/html/2605.18800#bib.bib68)) and pruning (Zhu et al., [2024](https://arxiv.org/html/2605.18800#bib.bib76)) reduce the size of the model by transferring knowledge or removing unimportant parameters, but they require significant data and computation, making them impractical for LLMs. In contrast, architecture-preserving methods such as quantization (Frantar et al., [2022](https://arxiv.org/html/2605.18800#bib.bib15)) and low-rank decomposition (Yuan et al., [2023](https://arxiv.org/html/2605.18800#bib.bib13)) keep the model structure: quantization lowers weight precision, while low-rank methods approximate weight matrices. Quantization is especially popular in LLM deployment due to its efficiency and strong performance.

Post-Training Quantization (PTQ) has become a widely adopted technique for compressing and accelerating LLMs. During quantization, as shown in Figure [1(a)](https://arxiv.org/html/2605.18800#S1.F1.sf1), outliers in the original data pose a major challenge: the limited quantization space cannot adequately represent the original data space, and most data accumulates in a few regions. Recent research has adopted linear transformations to address this. The rotation transformation (Ashkboos et al., [2025](https://arxiv.org/html/2605.18800#bib.bib23); Liu et al., [2024](https://arxiv.org/html/2605.18800#bib.bib24)) alleviates the phenomenon, as shown in Figure [1(b)](https://arxiv.org/html/2605.18800#S1.F1.sf2). However, because outliers remain, most of the data still accumulates in the Blue region. Existing methods are heuristic: they neither establish a direct mathematical relationship between outliers and quantization error nor optimize the distribution over the entire quantization space.

![Refer to caption](https://arxiv.org/html/2605.18800v1/x1.png) (a) Original
![Refer to caption](https://arxiv.org/html/2605.18800v1/x2.png) (b) Rotation Transformation
![Refer to caption](https://arxiv.org/html/2605.18800v1/x3.png) (c) Ours

Fig. 1: Activation distributions under different transformations for LLaMA3-8B. After quantization, values from various ranges are mapped to corresponding integer levels. The number of points within each box reflects the frequency of quantized values. A more uniform distribution of points indicates higher quantization quality. Blue dots represent values near zero, Orange dots indicate mid-range values, and Red dots correspond to large-magnitude values.

In this paper, we first establish the mathematical relationship between outliers and quantization errors, demonstrating that outliers influence the quantization error quadratically. Furthermore, we introduce the concept of Flatness as an effective indicator for quantifying the distribution of outliers. Inspired by information entropy (Tsai et al., [2008](https://arxiv.org/html/2605.18800#bib.bib77)), we define Flatness by evaluating how flat each element is within its row and column, and aggregating over all elements of the matrix. Importantly, through mathematical derivation, we obtain the optimal solution with respect to Flatness and show clear advantages over previous methods, laying the foundation for more effective quantization methods.

Based on the above findings, we propose the Bidirectional Diagonal Quantization (BDQ) method. BDQ allocates two learnable diagonal transformation pairs for each fully connected layer in LLMs, applying simultaneous row-wise and column-wise scaling to redistribute outliers along both dimensions. We theoretically demonstrate that this formulation can achieve the optimal solution with respect to Flatness. In addition, a Hadamard orthogonal transformation is employed to further disperse outliers across the entire matrix. Quantization typically uses only a small calibration set (e.g., 128 samples), which can cause the model to overfit to a limited set of features, an effect shown experimentally to hinder outlier mitigation. To address this, BDQ introduces a Recursive Cross-Entropy loss that captures the state from previous iterations, thereby reducing overfitting and improving generalization. BDQ is a highly effective PTQ method for LLMs, consistently outperforming existing techniques across various models and benchmarks. In the W4A4KV4 setting, BDQ maintains over 99.1% of full-precision accuracy. Furthermore, in the W2A4KV16 setting, BDQ reduces the performance gap of DeepSeek-R1-Distill-LLaMA-70B (Guo et al., [2025](https://arxiv.org/html/2605.18800#bib.bib82)) by 39.1% compared to the latest methods.

To our knowledge, we are the first to model the mathematical relationship between outliers and quantization errors, discovering that outliers are key factors affecting quantization accuracy. We also propose the Flatness metric, which reflects the presence of outliers in the model, and provide the optimal solution through mathematical derivation. The contributions of this work are summarized as follows:

- We first model the mathematical relationship between outliers and quantization errors, discovering that outliers are key factors that affect quantization accuracy.
- To quantify the outlier distribution, we propose the Flatness metric and provide the optimal solution through mathematical derivation. Compared to previous methods, this optimal solution demonstrates significant advantages.
- We propose Bidirectional Diagonal Quantization (BDQ), which effectively reduces quantization errors. Extensive experiments show that BDQ significantly outperforms existing quantization methods.

## 2 Related Work

### 2.1 Architecture-Changing Methods

Recent model compression research has focused on structural modifications to reduce complexity and size. Pruning techniques have progressed from early weight pruning (Han et al., [2015](https://arxiv.org/html/2605.18800#bib.bib66)) to dynamic strategies that remove unimportant parameters during training (Chen et al., [2020](https://arxiv.org/html/2605.18800#bib.bib68)), and to neural architecture search-based methods for optimal network structures (Zhang et al., [2021](https://arxiv.org/html/2605.18800#bib.bib69)). Knowledge distillation has also advanced, from foundational teacher-student frameworks (Kim and Rush, [2016](https://arxiv.org/html/2605.18800#bib.bib67)) to approaches combining self-supervised learning (Yang et al., [2022](https://arxiv.org/html/2605.18800#bib.bib70)) and multi-modal distillation for preserving semantics (Zhao et al., [2024](https://arxiv.org/html/2605.18800#bib.bib71)). However, these methods often entail high computational costs and slow processing, limiting their practical deployment.

### 2.2 Architecture-Preserving Methods

Post-training quantization (PTQ) is popular for LLMs because of its efficiency, with methods mainly divided into weight-only and weight-activation quantization. FWSVD ([Hsu et al.,](https://arxiv.org/html/2605.18800#bib.bib12)) and ASVD (Yuan et al., [2023](https://arxiv.org/html/2605.18800#bib.bib13)) assess parameter or channel importance, while GPTQ (Frantar et al., [2022](https://arxiv.org/html/2605.18800#bib.bib15)) and AWQ (Lin et al., [2024](https://arxiv.org/html/2605.18800#bib.bib16); Lee et al., [2023](https://arxiv.org/html/2605.18800#bib.bib17)) reduce quantization error and address activation outliers. QuIP (Chee et al., [2023](https://arxiv.org/html/2605.18800#bib.bib18)), QuIP# (Tseng et al., [2024](https://arxiv.org/html/2605.18800#bib.bib19)), SmoothQuant (Xiao et al., [2023](https://arxiv.org/html/2605.18800#bib.bib20)), and OmniQuant (Shao et al., [2023](https://arxiv.org/html/2605.18800#bib.bib21)) further improve quantization with various techniques. I-LLM (Hu et al., [2024](https://arxiv.org/html/2605.18800#bib.bib22)) supports integer-only inference, QuaRot (Ashkboos et al., [2025](https://arxiv.org/html/2605.18800#bib.bib23)) uses random rotations, and SpinQuant learns rotations for 4-bit quantization (Liu et al., [2024](https://arxiv.org/html/2605.18800#bib.bib24)). Quantization stands out over low-rank decomposition for its high accuracy and low cost.

## 3 Motivation

In the model quantization process, let the weight or activation be $W\in\mathbb{R}^{m\times n}$, and assume the outlier value satisfies $|w_{\text{outlier}}|\gg\mathbb{E}[|W|]$, where $\mathbb{E}[|W|]$ represents the statistical expectation of the element magnitudes. The quantization process is determined by the scale $\Delta\in\mathbb{R}^{+}$ and the zero point $z\in\mathbb{Z}$, mapping floating-point values to the integer space as follows:

$$Q(w)=\text{round}\left(\frac{w}{\Delta}\right)+z,\quad \Delta=\frac{\max(|w|)}{2^{b}-1} \tag{1}$$

where $w$ is the original weight and $Q(w)-z\in\{0,1,\ldots,2^{b}-1\}$ is the integer value after $b$-bit quantization. Let $x$ denote the input to the matrix and $w'$ the quantized weight; the quantization error is defined as:

$$\epsilon=wx-w'x \tag{2}$$
### 3.1 The Quantization Error of Single Outlier

To see the effect of an outlier, let $\Delta$ be the scale factor selected for $b$-bit integer quantization, and assume the quantization range is initially set to $[-c,c]$:

$$\Delta=\frac{c}{2^{b}-1} \tag{3}$$

If $|w_{\text{outlier}}|$ is large, the scale must be adjusted to cover it:

$$\Delta'=\frac{|w_{\text{outlier}}|}{2^{b}-1} \tag{4}$$

The outlier thus determines the width of the quantization bin: $\Delta=\frac{c}{2^{b}-1}$ expands to $\Delta'=\frac{|w_{\text{outlier}}|}{2^{b}-1}$. For any non-outlier $w_{i}\in[-c,c]$, the upper bound on the quantization error increases from $\frac{\Delta}{2}$ to $\frac{\Delta'}{2}$, that is:

$$|\epsilon_{i}|\leq\frac{\Delta x}{2}\ \xrightarrow{\text{outlier}}\ |\epsilon_{i}|\leq\frac{|w_{\text{outlier}}|\,x}{2^{b}-1} \tag{5}$$

When $|w_{\text{outlier}}|\gg c$, the quantization error due to outliers can be significant. The bound on the quantization error $\epsilon_{i}$ grows in proportion to the outlier magnitude $|w_{\text{outlier}}|$.
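The effect described by Eqs. (3) to (5) can be reproduced numerically. The following is a minimal sketch, assuming a symmetric quantizer with zero point $z=0$, weights drawn uniformly from $[-1,1]$, and a single hypothetical outlier of value 50; none of these choices come from the paper. The function names are illustrative.

```python
import random

def quantize_dequantize(weights, bits):
    """Eq. (1) with z = 0: scale = max|w| / (2^b - 1), then round and rescale."""
    scale = max(abs(w) for w in weights) / (2 ** bits - 1)
    return [round(w / scale) * scale for w in weights], scale

def max_abs_error(original, reconstructed):
    """Largest per-element reconstruction error."""
    return max(abs(a - b) for a, b in zip(original, reconstructed))

random.seed(0)
normal = [random.uniform(-1.0, 1.0) for _ in range(1000)]

# Without an outlier, the scale is set by the ordinary weights.
deq, scale = quantize_dequantize(normal, bits=4)
err_clean = max_abs_error(normal, deq)

# One hypothetical outlier stretches the scale for every other weight.
with_outlier = normal + [50.0]
deq_o, scale_o = quantize_dequantize(with_outlier, bits=4)
err_outlier = max_abs_error(normal, deq_o[:-1])  # error on the ordinary weights only

print(f"scale: {scale:.4f} -> {scale_o:.4f}")
print(f"max error on ordinary weights: {err_clean:.4f} -> {err_outlier:.4f}")
```

In both cases the error on ordinary weights stays within half a bin width, but the bin width itself is dictated by the largest magnitude, which is the mechanism behind Eq. (5).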

### 3.2 The Quantization Error of Entire Matrix

To analyze the quantization error statistically, the weights can be assumed to follow a normal distribution $N(0,k^{2}\sigma^{2})$ (where $k\gg 1$) (Ashkboos et al., [2025](https://arxiv.org/html/2605.18800#bib.bib23)). The total quantization error can be expressed as:

$$\begin{aligned}E[\epsilon^{2}]&=\frac{x}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}(w_{ij}-w'_{ij})^{2}\\&=(1-p)E[\epsilon_{\text{normal}}^{2}]x+pE[\epsilon_{\text{outlier}}^{2}]x\end{aligned} \tag{6}$$

where $(1-p)E[\epsilon_{\text{normal}}^{2}]$ is the normal contribution, $pE[\epsilon_{\text{outlier}}^{2}]$ is the outlier contribution, and $p$ is a coefficient related to the number of outliers. Because the outlier increases the scale factor to $\Delta'$, the variance of the normal term becomes:

$$E[\epsilon_{\text{normal}}^{2}]\approx\frac{\Delta'^{2}}{12}=\frac{k^{2}\sigma^{2}}{12(2^{b}-1)^{2}} \tag{7}$$

The outlier itself is truncated to the boundary of the quantization range, so its mean error is:

$$E[\epsilon_{\text{outlier}}]=w_{\text{outlier}}-\text{sign}(w_{\text{outlier}})\cdot(2^{b}-1)\Delta' \tag{8}$$

when $|w_{\text{outlier}}|>(2^{b}-1)\Delta'$, where $\text{sign}(\cdot)$ is the sign function. The average squared error is given by:

$$E[\epsilon_{\text{outlier}}^{2}]=\left(\left|w_{\text{outlier}}-(2^{b}-1)\Delta'\right|\right)^{2} \tag{9}$$

When the outlier value is significantly larger than the quantization range (i.e., $|w_{\text{outlier}}|\gg(2^{b}-1)\Delta'$), outliers dominate the total quantization error (where $E[\epsilon_{\text{outlier}}^{2}]\gg E[\epsilon_{\text{normal}}^{2}]$). At this point:

$$E[\epsilon^{2}]\approx p\cdot w_{\text{outlier}}^{2}x \tag{10}$$

The total quantization error $E[\epsilon^{2}]$ therefore grows quadratically with the outlier magnitude $w_{\text{outlier}}$.
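The $\Delta'^{2}/12$ term in Eq. (7) is the variance of a rounding error that is uniformly distributed over one bin. A small Monte Carlo sketch can check this approximation and the resulting quadratic dependence on the step size. It assumes no clipping and uniformly distributed samples, and the step sizes are arbitrary illustrative values rather than settings from the paper.

```python
import random

def rounding_mse(step, samples):
    """Mean squared error of rounding each sample to the nearest multiple of `step`."""
    return sum((v - round(v / step) * step) ** 2 for v in samples) / len(samples)

random.seed(0)
samples = [random.uniform(-1.0, 1.0) for _ in range(50_000)]

results = {}
for step in (0.01, 0.02, 0.04):
    measured = rounding_mse(step, samples)
    predicted = step ** 2 / 12  # variance of a uniform error on [-step/2, step/2]
    results[step] = (measured, predicted)
    print(f"step={step:.2f}  measured={measured:.3e}  predicted={predicted:.3e}")
```

Doubling the step size roughly quadruples the measured error, so any outlier that inflates the scale raises the error of all ordinary values quadratically.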

## 4 The Optimal Solution for Flatness

In model quantization and compression, the original weight or activation matrix $W\in\mathbb{R}^{m\times n}$ often contains a few extremely large values that significantly exceed the magnitude of other elements. We refer to these values as outliers. The presence of outliers reduces the distinguishability of full-precision values within the limited quantization space, resulting in increased quantization error, which is one of the core challenges in quantization research. Existing studies primarily focus on mitigating outliers through scaling or linear transformations, and have achieved promising results. However, there remains a lack of a unified metric to evaluate the flatness of a matrix, making it difficult to assess or determine an optimal transformation strategy.

### 4.1 Flatness of Matrix

In information theory, entropy quantifies the uncertainty or randomness associated with a random variable or probability distribution. Higher entropy indicates greater uncertainty, lower information content, and a flatter probability distribution $P(x_{i})$. The information entropy is defined as follows:

$$H(X)=-\sum_{i=1}^{z}P(x_{i})\log P(x_{i}) \tag{11}$$

Inspired by information entropy (Tsai et al., [2008](https://arxiv.org/html/2605.18800#bib.bib77)), we propose an evaluation metric called Flatness, which quantifies the uniformity of the data distribution across the entire matrix. In this context, the elements of the matrix $W$ are treated as probability values similar to $P(x_{i})$ in the information entropy formulation. Importantly, the outliers in $W$ are distributed across different rows and columns of the model, so the Flatness metric must properly evaluate the distributions of the rows and columns that contain outliers. The expression $\frac{W_{ij}^{2}}{\alpha_{i}\beta_{j}}$ plays a role analogous to $P(x_{i})$ in Eq. [11](https://arxiv.org/html/2605.18800#S4.E11). Specifically, Flatness is formalized as:

$$F=\sum_{i=1}^{m}\sum_{j=1}^{n}\left(\frac{W_{ij}^{2}}{\alpha_{i}\beta_{j}}\ln\frac{W_{ij}^{2}}{\alpha_{i}\beta_{j}}\right) \tag{12}$$

where $\alpha_{i}>0$ is the energy weight factor for the $i$-th row and $\beta_{j}>0$ is the energy weight factor for the $j$-th column. The objective is to minimize the combined dispersion $F$, subject to the energy constraint:

$$\min_{\alpha_{i},\beta_{j}}F\quad\text{s.t.}\quad\sum_{i,j}p_{ij}=\sum_{i,j}\frac{W_{ij}^{2}}{\alpha_{i}\beta_{j}}=1 \tag{13}$$

An additional energy constraint avoids trivial solutions:

$$\sum_{i,j}\alpha_{i}W_{ij}^{2}\beta_{j}=C,\quad(C>0) \tag{14}$$

We consider $\frac{W_{ij}^{2}}{\alpha_{i}\beta_{j}}$ as a probability distribution from two perspectives.

Non-negativity and normalization: since $W_{ij}^{2}\geq 0$, $\alpha_{i}>0$, and $\beta_{j}>0$, we have $p_{ij}=\frac{W_{ij}^{2}}{\alpha_{i}\beta_{j}}\geq 0$. The constraint $\sum_{i,j}\frac{W_{ij}^{2}}{\alpha_{i}\beta_{j}}=1$ ensures that $\sum_{i,j}p_{ij}=1$. Together, these conditions make $p_{ij}$ a valid probability distribution.

Information entropy: the entropy $H(p)=-\sum_{i,j}p_{ij}\ln p_{ij}$ measures the uncertainty of the distribution. The flatter the distribution $p_{ij}$, the larger $H(p)$. In this problem, we aim to maximize $H(p)$, which pushes $p_{ij}$ toward the uniform distribution. Because $F=\sum_{i,j}\frac{W_{ij}^{2}}{\alpha_{i}\beta_{j}}\ln\left(\frac{W_{ij}^{2}}{\alpha_{i}\beta_{j}}\right)=\sum_{i,j}p_{ij}\ln p_{ij}=-H(p)$, minimizing $F$ is equivalent to maximizing the entropy.

Additionally, we consider the constraints from two perspectives.

Summation requirement: for $p_{ij}=\frac{W_{ij}^{2}}{\alpha_{i}\beta_{j}}$ to serve as a probability distribution, its entries must sum to 1. Imposing $\sum_{i,j}\frac{W_{ij}^{2}}{\alpha_{i}\beta_{j}}=1$ directly guarantees $\sum p_{ij}=1$.

Physical meaning: the total energy of $W$ is $\sum W_{ij}^{2}$, and the role of the parameters $\alpha_{i}$ and $\beta_{j}$ is to redistribute this energy so that the distribution becomes more uniform. The additional energy constraint $\sum\alpha_{i}W_{ij}^{2}\beta_{j}=C$ is used to control the overall scale of the factors, preventing $\alpha_{i},\beta_{j}\to 0$ or $\infty$ in the solution.
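To make the metric concrete, the sketch below evaluates $F$ from Eq. (12) for small matrices. It is an illustration under stated assumptions, not the paper's implementation: the function name `flatness` is hypothetical, the normalization of Eq. (13) is enforced by dividing by the total energy (equivalent to a uniform rescaling of the factors), the second constraint of Eq. (14) is ignored, and the damping factors are hand-picked rather than optimized.

```python
import math

def flatness(W, alpha=None, beta=None):
    """F = sum_ij p_ij ln p_ij with p_ij proportional to W_ij^2 / (alpha_i * beta_j)."""
    m, n = len(W), len(W[0])
    alpha = alpha or [1.0] * m
    beta = beta or [1.0] * n
    energy = [[W[i][j] ** 2 / (alpha[i] * beta[j]) for j in range(n)] for i in range(m)]
    total = sum(sum(row) for row in energy)  # rescale so that sum_ij p_ij = 1 (Eq. 13)
    F = 0.0
    for row in energy:
        for e in row:
            p = e / total
            if p > 0.0:  # convention: 0 * ln 0 = 0
                F += p * math.log(p)
    return F

flat = [[1.0] * 4 for _ in range(4)]
spiky = [row[:] for row in flat]
spiky[0][0] = 100.0  # one hypothetical outlier

F_flat = flatness(flat)        # equals -ln(16), the minimum for 16 entries
F_spiky = flatness(spiky)      # close to 0: the energy sits in a single entry
damp = [100.0, 1.0, 1.0, 1.0]  # hand-picked factors for row 0 and column 0
F_scaled = flatness(spiky, alpha=damp, beta=damp)

print(f"flat: {F_flat:.4f}  spiky: {F_spiky:.4f}  spiky with scaling: {F_scaled:.4f}")
```

A perfectly uniform matrix attains the minimum $-\ln(mn)$, a single dominant outlier pushes $F$ toward 0, and row and column factors that damp the outlier's row and column move $F$ back toward the minimum.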

### 4.2 Finding the Optimal Solution

Introducing the Lagrange multipliers $\lambda_{1}$ and $\lambda_{2}$, the Lagrangian is constructed as:

$$\begin{aligned}\mathcal{L}={}&\sum_{i,j}\left(\frac{W_{ij}^{2}}{\alpha_{i}\beta_{j}}\ln\frac{W_{ij}^{2}}{\alpha_{i}\beta_{j}}\right)+\lambda_{1}\left(1-\sum_{i,j}\frac{W_{ij}^{2}}{\alpha_{i}\beta_{j}}\right)\\&+\lambda_{2}\left(C-\sum_{i,j}\alpha_{i}W_{ij}^{2}\beta_{j}\right).\end{aligned} \tag{15}$$

We now derive the optimality conditions by taking partial derivatives with respect to $\alpha_{k}$ and $\beta_{l}$ and setting them to zero.

With respect to $\alpha_{k}$:

$$\begin{aligned}\frac{\partial\mathcal{L}}{\partial\alpha_{k}}={}&-\sum_{j}\frac{W_{kj}^{2}}{\alpha_{k}^{2}\beta_{j}}\left(\ln\frac{W_{kj}^{2}}{\alpha_{k}\beta_{j}}+1+\lambda_{1}\right)\\&-\lambda_{2}\sum_{j}W_{kj}^{2}\beta_{j}=0.\end{aligned} \tag{16}$$

Reorganizing:

$$\sum_{j}\frac{W_{kj}^{2}}{\alpha_{k}^{2}\beta_{j}}\left(\ln\frac{W_{kj}^{2}}{\alpha_{k}\beta_{j}}+1+\lambda_{1}\right)=-\lambda_{2}\sum_{j}W_{kj}^{2}\beta_{j}. \tag{17}$$

Similarly, with respect to $\beta_{l}$:

$$\sum_{i}\frac{W_{il}^{2}}{\alpha_{i}\beta_{l}^{2}}\left(\ln\frac{W_{il}^{2}}{\alpha_{i}\beta_{l}}+1+\lambda_{1}\right)=-\lambda_{2}\sum_{i}\alpha_{i}W_{il}^{2}. \tag{18}$$
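The step from the Lagrangian to the stationarity condition in $\alpha_{k}$ follows from the chain rule. The intermediate algebra below is a sketch added for readability and is not part of the original derivation; $p_{kj}$ is shorthand for $W_{kj}^{2}/(\alpha_{k}\beta_{j})$.

```latex
\frac{\partial p_{kj}}{\partial \alpha_k}
  = -\frac{W_{kj}^{2}}{\alpha_k^{2}\beta_j},
\qquad
\frac{\partial}{\partial \alpha_k}\bigl(p_{kj}\ln p_{kj}\bigr)
  = \bigl(\ln p_{kj} + 1\bigr)\frac{\partial p_{kj}}{\partial \alpha_k}
  = -\frac{W_{kj}^{2}}{\alpha_k^{2}\beta_j}\bigl(\ln p_{kj} + 1\bigr),
\qquad
\frac{\partial}{\partial \alpha_k}\sum_{i,j}\alpha_i W_{ij}^{2}\beta_j
  = \sum_{j} W_{kj}^{2}\beta_j .
```

Only the terms with row index $k$ survive the differentiation, which is why the sums run over $j$ alone. Differentiating the $\lambda_{1}$ term exactly as written in the Lagrangian contributes $+\lambda_{1}W_{kj}^{2}/(\alpha_{k}^{2}\beta_{j})$; the form with $+\lambda_{1}$ inside the bracket is recovered by absorbing the sign into the free multiplier ($\lambda_{1}\to-\lambda_{1}$), which leaves the stationarity conditions unchanged.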

### 4.3 Structural Analysis of the Solution

From the above conditions, we identify two properties: Row Independence and Column Independence. Row Independence means that each $\alpha_{k}$ in the optimization process is determined only by the data in row $k$. Column Independence means that each $\beta_{l}$ is determined only by the data in column $l$. In other words, $\alpha_{i}$ is a function of the data in row $i$, independent of other rows, and $\beta_{j}$ is a function of the data in column $j$, independent of other columns.

Therefore, in the optimal solution $\alpha_{i}$ is given by a function of the data in row $i$, and $\beta_{j}$ is given by a function of the data in column $j$. By defining the diagonal matrices $d_{1}=\operatorname{diag}(\sqrt{\alpha_{i}})$ and $d_{2}=\operatorname{diag}(\sqrt{\beta_{j}})$, we obtain $V=d_{1}Wd_{2}$ as the unique optimal form. Notably, $V$ not only represents the theoretical optimal solution with respect to Flatness but also, according to Eq. [10](https://arxiv.org/html/2605.18800#S3.E10), is the optimal form for reducing quantization error.

![Refer to caption](https://arxiv.org/html/2605.18800v1/x4.png)

Fig. 2: The transformation results of different methods. The Rotation Matrix is a learnable random Hadamard matrix. The diagonal matrix is obtained by optimization to convergence using deep neural networks.

Table 1: Experimental results of the overfitting phenomenon on LLaMA3-8B.

## 5 Method

Based on the theoretical solution $V$ obtained in Section [4](https://arxiv.org/html/2605.18800#S4), we propose Bidirectional Diagonal Quantization (BDQ) along with a Recursive Cross-Entropy loss, which together theoretically yield the optimal Flatness of the matrix.

### 5.1 Bidirectional Diagonal Quantization

We propose Bidirectional Diagonal Quantization (BDQ), a novel framework designed to mitigate the impact of outliers and enhance quantization performance. The key idea behind BDQ is to distribute the burden of outlier elimination across the entire matrix, as detailed in Section [3](https://arxiv.org/html/2605.18800#S3).

As illustrated in Figure [2](https://arxiv.org/html/2605.18800#S4.F2), BDQ applies multiple transformation pairs both within and across LLM blocks globally. Specifically, based on the transformer architecture, each block learns four equivalent transformation pairs, with each pair consisting of two learnable diagonal matrices and one learnable rotation matrix. These transformations collaboratively reshape the distribution of weights and activations, making them more amenable to quantization. BDQ preserves equivalent transformations at the global network level. Therefore, when quantization is not applied, the network's output remains identical to that of the original model. More details are provided in Appendix [C](https://arxiv.org/html/2605.18800#A3).

We define the equivalent transformation pair as $E$, where $E$ consists of two diagonal matrices $\langle\Lambda_{1},\Lambda_{2}\rangle$ and a rotation matrix $R$. The forward inference process $y=xW$ is therefore reformulated as:

$$y=Q(\Lambda_{1}x\Lambda_{2}R)\cdot Q(R^{T}\Lambda_{2}^{-1}W\Lambda_{1}^{-1}) \tag{19}$$

where $Q(\cdot)$ represents the quantization function and $W$ refers to the weight or activation. $\Lambda$ is a diagonal matrix, and $\Lambda^{-1}$ is obtained by inverting its diagonal elements. The rotation matrix $R$ is composed of a Hadamard matrix and an additional orthogonal matrix. Appendix [A](https://arxiv.org/html/2605.18800#A1) provides a detailed theoretical comparison with previous rotation-based methods, and the results demonstrate that our method has significant advantages in complexity and outlier elimination.
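The claim that the transformation is output-preserving when quantization is switched off can be checked on a toy layer. The sketch below covers only the inner pair $(\Lambda_{2},R)$ acting on the shared feature dimension, because the handling of the outer factor $\Lambda_{1}$ is deferred to the appendix and is not reproduced here. The channel scales are hypothetical fixed values rather than learned ones, $R$ is a plain normalized Hadamard matrix without the additional orthogonal factor, and all helper names are illustrative.

```python
import math
import random

def matmul(A, B):
    """Dense matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def diag(values):
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

def hadamard(n):
    """Sylvester construction (n must be a power of two), scaled to be orthogonal."""
    H = [[1.0]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in H]

random.seed(0)
tokens, d_in, d_out = 3, 4, 2
x = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(tokens)]
W = [[random.gauss(0, 1) for _ in range(d_out)] for _ in range(d_in)]

lam2 = [0.5, 2.0, 4.0, 0.25]  # hypothetical channel scales (learned in BDQ)
L2, L2_inv = diag(lam2), diag([1.0 / v for v in lam2])
R = hadamard(d_in)

x_t = matmul(matmul(x, L2), R)                 # x Λ2 R
W_t = matmul(matmul(transpose(R), L2_inv), W)  # Rᵀ Λ2⁻¹ W

y_ref = matmul(x, W)
y_new = matmul(x_t, W_t)
max_diff = max(abs(a - b) for ra, rb in zip(y_ref, y_new) for a, b in zip(ra, rb))
print(f"max |xW - (x Λ2 R)(Rᵀ Λ2⁻¹ W)| = {max_diff:.2e}")
```

Because $RR^{T}=I$ and $\Lambda_{2}\Lambda_{2}^{-1}=I$, the two products agree up to floating-point rounding; the transformation only changes what the quantizer sees, not the full-precision result.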

The optimization objective for the entire network can be formalized as follows:

$$\arg\min_{\Lambda_{i},R_{i}}\mathcal{L}(\hat{y},y;\Lambda_{i},R_{i},\theta) \tag{20}$$

Here $\mathcal{L}(\hat{y},y)$ represents the loss between the quantized network output $\hat{y}$ and the full-precision network output $y$, and $\theta$ denotes the parameters of the frozen network.

| #Bits (W-A-KV) | Model | Method | WikiText2 PPL (↓) | C4 PPL (↓) | ARC-C (↑) | ARC-E (↑) | Hellaswag (↑) | LAMBADA (↑) | PIQA (↑) | Winogrande (↑) | Avg. (↑) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4-4-4 | LLaMA3-8B | FP16 | 6.14 | 9.45 | 53.50 | 77.57 | 79.12 | 75.51 | 80.74 | 72.93 | 73.23 |
| | | QuaRot | 8.16 | 13.38 | 45.73 | 70.83 | 72.97 | 62.70 | 75.35 | 67.17 | 65.79 |
| | | SpinQuant | 7.39 | 12.19 | 47.27 | 74.20 | 74.55 | 70.29 | 77.37 | 68.51 | 68.70 |
| | | FlatQuant | 6.90 | 11.21 | 50.51 | 75.88 | 76.49 | 73.20 | 79.00 | 72.93 | 71.33 |
| | | Ours | 6.84 | 10.97 | 51.03 | 76.10 | 76.77 | 73.42 | 78.57 | 72.88 | 71.46 |
| | LLaMA3-70B | FP16 | 2.86 | 7.17 | 64.25 | 85.94 | 84.93 | 79.37 | 84.44 | 80.74 | 79.95 |
| | | QuaRot | 6.60 | 12.87 | 49.49 | 74.37 | 77.22 | 71.69 | 78.89 | 71.03 | 70.45 |
| | | SpinQuant | 6.21 | 12.82 | 51.96 | 77.40 | 77.29 | 71.90 | 79.33 | 72.06 | 71.66 |
| | | FlatQuant | 3.77 | 7.93 | 61.95 | 84.47 | 83.87 | 77.99 | 83.95 | 79.24 | 78.58 |
| | | Ours | 3.52 | 7.63 | 62.83 | 84.88 | 84.07 | 79.42 | 84.01 | 79.56 | 79.19 |
| 2-4-16 | LLaMA3-8B | FP16 | 6.14 | 9.45 | 53.50 | 77.57 | 79.12 | 75.51 | 80.74 | 72.93 | 73.23 |
| | | QuaRot | 24.36 | 29.88 | 28.59 | 54.76 | 54.62 | 41.90 | 60.03 | 51.33 | 48.53 |
| | | SpinQuant | 20.77 | 24.71 | 31.77 | 58.93 | 61.29 | 47.36 | 66.25 | 55.42 | 53.50 |
| | | FlatQuant | 18.67 | 23.66 | 33.37 | 61.26 | 62.55 | 49.07 | 68.59 | 56.89 | 55.28 |
| | | Ours | 16.52 | 20.09 | 36.69 | 64.89 | 64.39 | 52.88 | 72.71 | 60.33 | 58.65 |
| | LLaMA3-70B | FP16 | 2.86 | 7.17 | 64.25 | 85.94 | 84.93 | 79.37 | 84.44 | 80.74 | 79.95 |
| | | QuaRot | 19.47 | 28.95 | 42.76 | 72.07 | 68.62 | 62.57 | 68.94 | 59.76 | 62.45 |
| | | SpinQuant | 13.76 | 22.76 | 48.74 | 76.74 | 63.01 | 67.79 | 73.82 | 65.93 | 66.01 |
| | | FlatQuant | 11.53 | 19.64 | 50.71 | 78.61 | 75.82 | 70.93 | 75.42 | 68.68 | 70.02 |
| | | Ours | 10.07 | 16.39 | 53.26 | 80.06 | 77.49 | 72.57 | 78.39 | 71.53 | 72.22 |
| 4-4-4 | DeepSeek-R1-Distill-LLaMA-8B | FP16 | 6.03 | 9.28 | 64.51 | 82.63 | 83.42 | 79.44 | 83.76 | 75.63 | 78.23 |
| | | QuaRot | 8.08 | 13.17 | 55.67 | 73.64 | 72.97 | 71.93 | 76.37 | 68.18 | 69.79 |
| | | SpinQuant | 7.27 | 11.89 | 57.38 | 74.20 | 75.55 | 74.36 | 78.83 | 70.76 | 71.84 |
| | | FlatQuant | 6.81 | 11.07 | 58.64 | 76.88 | 76.49 | 75.31 | 79.38 | 73.42 | 73.35 |
| | | Ours | 6.74 | 10.78 | 59.76 | 78.81 | 77.89 | 76.63 | 79.64 | 73.98 | 74.45 |
| | DeepSeek-R1-Distill-LLaMA-70B | FP16 | 2.73 | 7.06 | 73.39 | 87.42 | 87.89 | 84.62 | 87.32 | 84.76 | 84.23 |
| | | QuaRot | 6.51 | 12.06 | 61.08 | 78.64 | 78.32 | 73.62 | 79.62 | 74.42 | 74.28 |
| | | SpinQuant | 6.18 | 11.27 | 63.76 | 81.03 | 81.27 | 75.86 | 81.36 | 77.20 | 76.74 |
| | | FlatQuant | 3.65 | 7.64 | 65.98 | 84.87 | 84.08 | 78.32 | 84.87 | 80.39 | 79.75 |
| | | Ours | 3.46 | 7.41 | 67.41 | 85.97 | 85.21 | 80.17 | 85.62 | 81.49 | 80.97 |
| 2-4-16 | DeepSeek-R1-Distill-LLaMA-8B | FP16 | 6.03 | 9.28 | 64.51 | 82.63 | 83.42 | 79.44 | 83.76 | 75.63 | 78.23 |
| | | QuaRot | 22.63 | 27.43 | 34.78 | 58.75 | 57.46 | 46.87 | 64.31 | 54.38 | 52.75 |
| | | SpinQuant | 18.46 | 22.06 | 38.77 | 62.72 | 59.87 | 51.33 | 67.73 | 57.42 | 56.31 |
| | | FlatQuant | 15.27 | 20.36 | 40.76 | 64.64 | 62.34 | 53.16 | 69.35 | 59.15 | 58.23 |
| | | Ours | 12.36 | 17.46 | 43.43 | 66.83 | 65.61 | 57.98 | 72.62 | 62.38 | 61.47 |
| | DeepSeek-R1-Distill-LLaMA-70B | FP16 | 2.73 | 7.06 | 73.39 | 87.42 | 87.89 | 84.62 | 87.32 | 84.76 | 84.23 |
| | | QuaRot | 17.46 | 25.43 | 46.37 | 74.30 | 70.05 | 64.07 | 62.07 | 61.87 | 63.12 |
| | | SpinQuant | 12.08 | 21.36 | 50.37 | 78.09 | 72.46 | 69.10 | 64.52 | 64.57 | 66.51 |
| | | FlatQuant | 10.43 | 18.09 | 52.78 | 80.12 | 74.03 | 71.77 | 66.73 | 66.74 | 68.69 |
| | | Ours | 7.42 | 15.3
4\\cellcolorgreen\!1054\.76\\cellcolorgreen\!1082\.07\\cellcolorgreen\!1076\.64\\cellcolorgreen\!1073\.52\\cellcolorgreen\!1068\.93\\cellcolorgreen\!1068\.92\\cellcolorgreen\!1070\.81

Table 2: Overall quantization results across different models and bit settings\.

| #Bits (W-A-KV) | Model | Method | WikiText2 PPL ↓ | C4 PPL ↓ | ARC-C ↑ | ARC-E ↑ | Hellaswag ↑ | LAMBADA ↑ | PIQA ↑ | Winogrande ↑ | Avg. ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4-4-4 | LLaMA2-7B | FP16 | 5.47 | 7.26 | 46.16 | 74.54 | 75.98 | 73.92 | 79.05 | 69.06 | 69.79 |
| 4-4-4 | LLaMA2-7B | Only-BDQ | 5.83 | 7.86 | 42.63 | 72.69 | 73.03 | 71.57 | 77.21 | 67.42 | 67.42 |
| 4-4-4 | LLaMA2-7B | Ours (BDQ + RCE) | 5.76 | 7.64 | 43.07 | 73.09 | 73.36 | 72.06 | 77.57 | 67.90 | 67.84 |

Table 3: Results of the ablation experiment\. "Only\-BDQ" uses cross\-entropy as the loss function\.

### 5\.2 Recursive Cross\-Entropy Loss

To achieve low\-cost model compression, a small number of alignment samples \(128 samples\) are typically used to optimize the learnable parameters during quantization\. However, as shown in Table [1](https://arxiv.org/html/2605.18800#S4.T1), traditional cross\-entropy leads to overfitting\. This is a common problem in post\-training quantization \(both OstQuant \(Hu et al\., [2025](https://arxiv.org/html/2605.18800#bib.bib81)\) and SpinQuant \(Liu et al\., [2024](https://arxiv.org/html/2605.18800#bib.bib24)\) suffer from overfitting\)\. Specifically, as training steps increase, perplexity on the alignment data decreases, but performance on zero\-shot tasks declines and Flatness increases\. This holds when either WikiText2 \(Merity et al\., [2016](https://arxiv.org/html/2605.18800#bib.bib56)\) or C4 \(Raffel et al\., [2023](https://arxiv.org/html/2605.18800#bib.bib83)\) is used as alignment data\. Therefore, cross\-entropy causes the network to overfit to the alignment data and hinders the elimination of outliers, which poses a significant challenge for low\-cost quantization of LLMs\.

Inspired by regularization for noisy labels \(Liu et al\., [2020](https://arxiv.org/html/2605.18800#bib.bib73)\), we observe that, besides the label distribution $q$, the model prediction distribution $p$ is also highly reliable\. Table [4](https://arxiv.org/html/2605.18800#A2.T4) shows that after applying the quantization function, the top\-50 token hit rate of the model’s predicted distribution $p$ reaches 99\.36%\. To address the aforementioned challenges, we propose a Recursive Cross\-Entropy \(RCE\) loss\. RCE aims to fit the label distribution $q$ and the model prediction distribution $p$ simultaneously, preventing the model from falling into local optima and steering it toward a global optimum\. RCE is formalized as:

$$\mathcal{L}_{RCE}=-\sum_{i=0}^{n}\left(q_{i}\log p_{i}-p_{i}\log\left(\delta p_{i}+(1-\delta)q_{i}\right)\right) \tag{21}$$

where $\delta$ is a hyperparameter\. The larger its value, the more it favors the label distribution during optimization; the smaller its value, the more it favors the predicted distribution\.
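To make Eq\. \(21\) concrete, the following is a minimal sketch of the RCE loss for a single token position, written in plain Python\. The function name `rce_loss`, the small constant `eps` \(added only to keep the logarithms finite\), and the toy distributions are illustrative assumptions rather than part of the original method description\.

```python
import math

def rce_loss(q, p, delta, eps=1e-12):
    """Eq. (21): -sum_i ( q_i*log(p_i) - p_i*log(delta*p_i + (1-delta)*q_i) ).

    q: label distribution, p: model prediction distribution (same length).
    eps is an assumption of this sketch, used only to avoid log(0).
    """
    total = 0.0
    for q_i, p_i in zip(q, p):
        label_term = q_i * math.log(p_i + eps)
        mix_term = p_i * math.log(delta * p_i + (1.0 - delta) * q_i + eps)
        total += label_term - mix_term
    return -total

# Toy example: a one-hot label and a softmax-like prediction.
q = [0.0, 1.0, 0.0]
p = [0.1, 0.7, 0.2]

cross_entropy = -sum(q_i * math.log(p_i) for q_i, p_i in zip(q, p))
neg_entropy = sum(p_i * math.log(p_i) for p_i in p)

# With delta = 1 the second term no longer depends on q, so the loss
# reduces to cross-entropy plus the negative entropy of p.
loss_delta_1 = rce_loss(q, p, delta=1.0)
loss_delta_half = rce_loss(q, p, delta=0.5)
print(loss_delta_1, cross_entropy + neg_entropy, loss_delta_half)
```

In this sketch, setting `delta=1.0` removes $q$ from the second term, which is one way to read the statement that larger $\delta$ favors the label distribution\.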

## 6 Experiments

##### Models and Datasets\.

We evaluate the models on up to six zero\-shot tasks using the lm\-evaluation\-harness \(Gao et al\., [2024b](https://arxiv.org/html/2605.18800#bib.bib80)\), including HellaSwag \(Zellers et al\., [2019](https://arxiv.org/html/2605.18800#bib.bib59)\), LAMBADA \(Radford et al\., [2019](https://arxiv.org/html/2605.18800#bib.bib60)\), PIQA \(Bisk et al\., [2020](https://arxiv.org/html/2605.18800#bib.bib62)\), WinoGrande \(Sakaguchi et al\., [2021](https://arxiv.org/html/2605.18800#bib.bib64)\), ARC\-Easy, and ARC\-Challenge \(Boratko et al\., [2018](https://arxiv.org/html/2605.18800#bib.bib65)\)\. The models include the LLaMA \(Touvron et al\., [2023a](https://arxiv.org/html/2605.18800#bib.bib51)\) and DeepSeek\-R1\-Distill \(Guo et al\., [2025](https://arxiv.org/html/2605.18800#bib.bib82)\) families\. The complete experimental details are in Appendix [D](https://arxiv.org/html/2605.18800#A4)\.

Meanwhile, we apply our method to the entire LLaMA family, including LLaMA\-2 \(7B\-70B\) \(Touvron et al\., [2023b](https://arxiv.org/html/2605.18800#bib.bib33)\) and LLaMA\-3 \(8B\-70B\)\. We report perplexity \(PPL\) scores on the WikiText2 \(Merity et al\., [2016](https://arxiv.org/html/2605.18800#bib.bib56)\) and C4 test sets\. All experiments use the GPTQ method for quantization\. The baselines are QuaRot \(Ashkboos et al\., [2025](https://arxiv.org/html/2605.18800#bib.bib23)\), SpinQuant \(Liu et al\., [2024](https://arxiv.org/html/2605.18800#bib.bib24)\), and FlatQuant \(Sun et al\., [2025](https://arxiv.org/html/2605.18800#bib.bib78)\)\.

### 6\.1 Overall Results

##### Results on Generation Tasks\.

Table [2](https://arxiv.org/html/2605.18800#S5.T2) shows the quantization results of BDQ and previous methods\. We provide experimental results under the commonly used W4A4KV4 quantization setting, while also exploring lower\-bit settings \(such as W3A3KV3 and W2A4KV16\)\. Compared to the previous SOTA method FlatQuant, BDQ achieves superior performance across the experimental settings\. For the LLaMA3\-70B model under W2A4KV16, BDQ reduces the PPL on the C4 dataset from 19\.64 to 16\.39\. Notably, the LLaMA3\-70B model under W4A4KV4 performs comparably to the full\-precision model, offering substantial cost savings in practical deployment\. These results highlight the effectiveness of our BDQ method in distributing outlier pressure across the entire matrix\. Detailed experimental results are provided in Appendix [F](https://arxiv.org/html/2605.18800#A6)\.

##### Results on Zero\-shot QA Tasks\.

Table [2](https://arxiv.org/html/2605.18800#S5.T2) shows the performance of quantization methods on downstream tasks\. For fairness, all experiments were conducted with the lm\-eval\-harness framework \(Gao et al\., [2024a](https://arxiv.org/html/2605.18800#bib.bib57)\)\. As can be seen, BDQ achieves the highest average accuracy among the quantized methods in every setting\. Under the W4A4KV4 setting, the BDQ\-quantized model demonstrates performance comparable to FP16\. Under W3A3KV3 and W2A4KV16 settings, BDQ achieves superior performance compared to previous methods\. Specifically, for the LLaMA3\-8B model under the W2A4KV16 setting, the average accuracy is 3\.37 percentage points higher than that of FlatQuant, the strongest baseline \(58\.65 vs\. 55\.28\)\. The experimental results demonstrate that after mitigating the outlier problem, BDQ can still achieve excellent performance under low\-bit settings\.
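As a worked check of the arithmetic behind the LLaMA3\-8B W2A4KV16 comparison, the sketch below recomputes the Avg column from the six per\-task accuracies in Table 2 and derives the gain over FlatQuant\. The variable names are illustrative, and the relative\-gain figure is a derived quantity that the paper itself does not report\.

```python
# Per-task accuracies copied from Table 2 (LLaMA3-8B, W2A4KV16), in the order
# ARC-C, ARC-E, Hellaswag, LAMBADA, PIQA, Winogrande.
flatquant = [33.37, 61.26, 62.55, 49.07, 68.59, 56.89]
bdq = [36.69, 64.89, 64.39, 52.88, 72.71, 60.33]

def mean(xs):
    return sum(xs) / len(xs)

# Averages as reported in the Avg. column of Table 2.
reported_flatquant_avg = 55.28
reported_bdq_avg = 58.65

# Absolute gain in percentage points, and the same gain relative to FlatQuant.
gain_points = reported_bdq_avg - reported_flatquant_avg
relative_gain = gain_points / reported_flatquant_avg

print(f"recomputed averages: {mean(flatquant):.2f} vs {mean(bdq):.2f}")
print(f"gain: {gain_points:.2f} points ({100 * relative_gain:.1f}% relative)")
```

The recomputed means agree with the reported averages to within rounding, which suggests that Avg is the unweighted mean of the six tasks and that the 3\.37 figure is an absolute difference in accuracy points\.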

### 6\.2 Results of Ablation Experiment

As shown in Table [3](https://arxiv.org/html/2605.18800#S5.T3), we conducted ablation experiments comparing Only\-BDQ with our full method \(BDQ \+ RCE loss\)\. The results show that, on top of the SOTA quantization results achieved by the BDQ method, the RCE loss further improves quantization performance\. Specifically, Only\-BDQ achieved state\-of\-the\-art results on ARC\-E and LAMBADA tasks\. On the Avg metric, our method improves by 0\.42 percentage points over Only\-BDQ \(67\.84 vs\. 67\.42\), which validates the effectiveness of the RCE loss\.

### 6\.3 Inference Efficiency and Quantization Overhead

We conducted inference efficiency and quantization overhead experiments on both NVIDIA A100 80GB and AMD MI250 GPUs\. The evaluation metrics include Prefill Speedup and Memory Savings\. Experimental results demonstrate that our method offers significant efficiency gains in both metrics compared to full\-precision models\. Specifically, on the NVIDIA A100 80GB, the LLaMA2\-70B model achieved up to a 3\.44× speedup during the prefill phase, while on the AMD MI250, memory usage was reduced by up to 3\.74×\. Detailed results are provided in Appendix[E](https://arxiv.org/html/2605.18800#A5)\.

## 7 Conclusion

In this paper, we propose Bidirectional Diagonal Quantization \(BDQ\), a state\-of\-the\-art post\-training quantization method\. Existing quantization approaches often suffer from significant performance degradation due to the presence of outliers\. We first establish a mathematical relationship between quantization error and outliers, and analyze the effectiveness and limitations of prior methods in mitigating outlier impact\. To better assess outlier distribution, we introduce the Flatness metric, which quantifies outlier dispersion across the matrix, and we mathematically prove that the bidirectional diagonal structure is the optimal solution for outlier elimination\. Based on these insights, we develop the BDQ framework, which not only mitigates the adverse effects of outliers but also prevents overfitting to the alignment data\. Extensive experiments validate that BDQ significantly enhances the performance of quantized models\.

## Limitations

Our work has several limitations\. First, due to limited computing resources, we did not conduct experiments on larger language models\. Second, for the same reason, we did not run experiments on a wider range of GPU types to verify the broad practicality of the method\.

## Acknowledgments

This work was supported by the Beijing Natural Science Foundation \(L243006\), the National Natural Science Foundation of China \(No\. U24A20335\), and the independent research project of the Key Laboratory of Cognition and Decision Intelligence for Complex Systems\.

## References

- S\. Ashkboos, A\. Mohtashami, M\. Croci, B\. Li, P\. Cameron, M\. Jaggi, D\. Alistarh, T\. Hoefler, and J\. Hensman \(2025\) Quarot: outlier\-free 4\-bit inference in rotated llms\. Advances in Neural Information Processing Systems 37, pp\. 100213–100240\.
- Y\. Bisk, R\. Zellers, J\. Gao, Y\. Choi, et al\. \(2020\) Piqa: reasoning about physical commonsense in natural language\. In Proceedings of the AAAI conference on artificial intelligence, Vol\. 34, pp\. 7432–7439\.
- M\. Boratko, H\. Padigela, D\. Mikkilineni, P\. Yuvraj, R\. Das, A\. McCallum, M\. Chang, A\. Fokoue\-Nkoutche, P\. Kapanipathi, N\. Mattei, et al\. \(2018\) A systematic classification of knowledge, reasoning, and context within the arc dataset\. arXiv preprint arXiv:1806\.00358\.
- J\. Chee, Y\. Cai, V\. Kuleshov, and C\. M\. De Sa \(2023\) Quip: 2\-bit quantization of large language models with guarantees\. Advances in Neural Information Processing Systems 36, pp\. 4396–4429\.
- Z\. Chen, T\. Xu, C\. Du, C\. Liu, and H\. He \(2020\) Dynamical channel pruning by conditional accuracy change for deep neural networks\. IEEE transactions on neural networks and learning systems 32 \(2\), pp\. 799–813\.
- E\. Frantar, S\. Ashkboos, T\. Hoefler, and D\. Alistarh \(2022\) Gptq: accurate post\-training quantization for generative pre\-trained transformers\. arXiv preprint arXiv:2210\.17323\.
- L\. Gao, J\. Tow, B\. Abbasi, S\. Biderman, S\. Black, A\. DiPofi, C\. Foster, L\. Golding, J\. Hsu, A\. Le Noac’h, H\. Li, K\. McDonell, N\. Muennighoff, C\. Ociepa, J\. Phang, L\. Reynolds, H\. Schoelkopf, A\. Skowron, L\. Sutawika, E\. Tang, A\. Thite, B\. Wang, K\. Wang, and A\. Zou \(2024a\) A framework for few\-shot language model evaluation\. Zenodo\. External Links: [Document](https://dx.doi.org/10.5281/zenodo.12608602), [Link](https://zenodo.org/records/12608602)\.
- L\. Gao, J\. Tow, B\. Abbasi, S\. Biderman, S\. Black, A\. DiPofi, C\. Foster, L\. Golding, J\. Hsu, A\. Le Noac’h, H\. Li, K\. McDonell, N\. Muennighoff, C\. Ociepa, J\. Phang, L\. Reynolds, H\. Schoelkopf, A\. Skowron, L\. Sutawika, E\. Tang, A\. Thite, B\. Wang, K\. Wang, and A\. Zou \(2024b\) The language model evaluation harness\. https://zenodo\.org/records/12608602\. External Links: [Link](https://zenodo.org/records/12608602)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, et al\. \(2024\) The llama 3 herd of models\. arXiv preprint arXiv:2407\.21783\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, R\. Zhang, R\. Xu, Q\. Zhu, S\. Ma, P\. Wang, X\. Bi, et al\. \(2025\) Deepseek\-r1: incentivizing reasoning capability in llms via reinforcement learning\. arXiv preprint arXiv:2501\.12948\.
- S\. Han, H\. Mao, and W\. J\. Dally \(2015\) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding\. arXiv preprint arXiv:1510\.00149\.
- Y\. Hsu, T\. Hua, S\. Chang, Q\. Lou, Y\. Shen, and H\. Jin\. Language model compression with weighted low\-rank factorization\. In International Conference on Learning Representations\.
- X\. Hu, Y\. Cheng, D\. Yang, Z\. Xu, Z\. Yuan, J\. Yu, C\. Xu, Z\. Jiang, and S\. Zhou \(2025\) OstQuant: refining large language model quantization with orthogonal and scaling transformations for better distribution fitting\. arXiv preprint arXiv:2501\.13987\.
- X\. Hu, Y\. Cheng, D\. Yang, Z\. Yuan, J\. Yu, C\. Xu, and S\. Zhou \(2024\) I\-llm: efficient integer\-only inference for fully\-quantized low\-bit large language models\. arXiv preprint arXiv:2405\.17849\.
- Y\. Kim and A\. M\. Rush \(2016\) Sequence\-level knowledge distillation\. In Proceedings of the 2016 conference on empirical methods in natural language processing, pp\. 1317–1327\.
- C\. Lee, J\. Jin, T\. Kim, H\. Kim, and E\. Park \(2023\) Owq: lessons learned from activation outliers for weight quantization in large language models\. arXiv preprint arXiv:2306\.02272\.
- J\. Lin, J\. Tang, H\. Tang, S\. Yang, W\. Chen, W\. Wang, G\. Xiao, X\. Dang, C\. Gan, and S\. Han \(2024\) Awq: activation\-aware weight quantization for on\-device llm compression and acceleration\. Proceedings of Machine Learning and Systems 6, pp\. 87–100\.
- S\. Liu, J\. Niles\-Weed, N\. Razavian, and C\. Fernandez\-Granda \(2020\) Early\-learning regularization prevents memorization of noisy labels\. Advances in neural information processing systems 33, pp\. 20331–20342\.
- Z\. Liu, C\. Zhao, I\. Fedorov, B\. Soran, D\. Choudhary, R\. Krishnamoorthi, V\. Chandra, Y\. Tian, and T\. Blankevoort \(2024\) Spinquant: llm quantization with learned rotations\. arXiv preprint arXiv:2405\.16406\.
- I\. Loshchilov, F\. Hutter, et al\. \(2017\) Fixing weight decay regularization in adam\. arXiv preprint arXiv:1711\.05101\.
- S\. Merity, C\. Xiong, J\. Bradbury, and R\. Socher \(2016\) Pointer sentinel mixture models\. arXiv preprint arXiv:1609\.07843\.
- A\. Radford, J\. Wu, R\. Child, D\. Luan, D\. Amodei, I\. Sutskever, et al\. \(2019\) Language models are unsupervised multitask learners\. OpenAI blog 1 \(8\), pp\. 9\.
- C\. Raffel, N\. Shazeer, A\. Roberts, K\. Lee, S\. Narang, M\. Matena, Y\. Zhou, W\. Li, and P\. J\. Liu \(2023\) Exploring the limits of transfer learning with a unified text\-to\-text transformer\. External Links: 1910\.10683, [Link](https://arxiv.org/abs/1910.10683)\.
- K\. Sakaguchi, R\. L\. Bras, C\. Bhagavatula, and Y\. Choi \(2021\) Winogrande: an adversarial winograd schema challenge at scale\. Communications of the ACM 64 \(9\), pp\. 99–106\.
- W\. Shao, M\. Chen, Z\. Zhang, P\. Xu, L\. Zhao, Z\. Li, K\. Zhang, P\. Gao, Y\. Qiao, and P\. Luo \(2023\) Omniquant: omnidirectionally calibrated quantization for large language models\. arXiv preprint arXiv:2308\.13137\.
- Y\. Sun, R\. Liu, H\. Bai, H\. Bao, K\. Zhao, Y\. Li, J\. Hu, X\. Yu, L\. Hou, C\. Yuan, X\. Jiang, W\. Liu, and J\. Yao \(2025\) FlatQuant: flatness matters for llm quantization\. External Links: 2410\.09426, [Link](https://arxiv.org/abs/2410.09426)\.
- H\. Touvron, T\. Lavril, G\. Izacard, X\. Martinet, M\. Lachaux, T\. Lacroix, B\. Rozière, N\. Goyal, E\. Hambro, F\. Azhar, et al\. \(2023a\) Llama: open and efficient foundation language models\. arXiv preprint arXiv:2302\.13971\.
- H\. Touvron, L\. Martin, K\. Stone, P\. Albert, A\. Almahairi, Y\. Babaei, N\. Bashlykov, S\. Batra, P\. Bhargava, S\. Bhosale, et al\. \(2023b\) Llama 2: open foundation and fine\-tuned chat models\. arXiv preprint arXiv:2307\.09288\.
- D\. Tsai, Y\. Lee, and E\. Matsuyama \(2008\) Information entropy measure for evaluation of image quality\. Journal of digital imaging 21, pp\. 338–347\.
- A\. Tseng, J\. Chee, Q\. Sun, V\. Kuleshov, and C\. De Sa \(2024\) Quip\#: even better llm quantization with hadamard incoherence and lattice codebooks\. arXiv preprint arXiv:2402\.04396\.
- G\. Xiao, J\. Lin, M\. Seznec, H\. Wu, J\. Demouth, and S\. Han \(2023\) Smoothquant: accurate and efficient post\-training quantization for large language models\. In International Conference on Machine Learning, pp\. 38087–38099\.
- A\. Yang, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Li, D\. Liu, F\. Huang, H\. Wei, et al\. \(2024\) Qwen2\.5 technical report\. arXiv preprint arXiv:2412\.15115\.
- C\. Yang, Z\. An, L\. Cai, and Y\. Xu \(2022\) Knowledge distillation using hierarchical self\-supervision augmented distribution\. IEEE transactions on neural networks and learning systems 35 \(2\), pp\. 2094–2108\.
- Z\. Yuan, Y\. Shang, Y\. Song, Q\. Wu, Y\. Yan, and G\. Sun \(2023\) Asvd: activation\-aware singular value decomposition for compressing large language models\. arXiv preprint arXiv:2312\.05821\.
- R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi \(2019\) Hellaswag: can a machine really finish your sentence? arXiv preprint arXiv:1905\.07830\.
- K\. Zhang, J\. Chen, S\. He, E\. Xu, F\. Li, and Z\. Zhou \(2021\) Differentiable neural architecture search augmented with pruning and multi\-objective optimization for time\-efficient intelligent fault diagnosis of machinery\. Mechanical Systems and Signal Processing 158, pp\. 107773\.
- H\. Zhao, Q\. Zhang, S\. Zhao, Z\. Chen, J\. Zhang, and D\. Tao \(2024\) Simdistill: simulated multi\-modal distillation for bev 3d object detection\. In Proceedings of the AAAI conference on artificial intelligence, Vol\. 38, pp\. 7460–7468\.
- X\. Zhu, J\. Li, Y\. Liu, C\. Ma, and W\. Wang \(2024\) A survey on model compression for large language models\. Transactions of the Association for Computational Linguistics 12, pp\. 1556–1577\.

## Appendix A Appendix: Difference from Previous Rotation Based Methods

To make the difference concrete, we consider an illustrative example\. Let $W\in\mathbb{R}^{4096\times 4096}$ be an original weight matrix that contains a few entries significantly larger than the other values in the matrix; we refer to these entries as outliers\. Our method requires two learnable diagonal matrices, parameterized by $d_{1}\in\mathbb{R}^{4096}$ and $d_{2}\in\mathbb{R}^{4096}$, to remove the outliers in $d_{1}Wd_{2}$\. The previous SOTA method FlatQuant \(Sun et al\., [2025](https://arxiv.org/html/2605.18800#bib.bib78)\) requires a matrix $P\in\mathbb{R}^{4096\times 4096}$ that is Kronecker\-decomposed into two matrices $p_{1}\in\mathbb{R}^{64\times 64}$ and $p_{2}\in\mathbb{R}^{64\times 64}$ \(written $P_{1}$ and $P_{2}$ in the equations of this appendix\)\. Both $p_{1}$ and $p_{2}$ are learnable and can remove the outliers in $p_{1}Wp_{2}$\. We compare the two methods from the following perspectives:

### A\.1 Mathematical Analysis of Parameter Freedom and Adjustment Capability

#### A\.1\.1 Our Method \(Scaling the Matrix Element\-wise\):

The adjusted matrix is:

$$\widehat{W}=D_{1}WD_{2},\quad D_{1}=\text{diag}(d_{1}),\quad D_{2}=\text{diag}(d_{2}) \tag{22}$$

Here, $d_{1},d_{2}\in\mathbb{R}^{4096}$\. The adjustment for each element can be expressed as:

$$\widehat{W}_{i,j}=d_{1}[i]\cdot d_{2}[j]\cdot W_{i,j} \tag{23}$$

##### Degrees of Freedom:

- Total number of parameters: $4096+4096=8192$\.
- Each element is independently controlled by parameters: The scaling of a single element $W_{i,j}$ depends only on $d_{1}[i]$ and $d_{2}[j]$, allowing precise adjustment of outliers by tuning these two parameters\.
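The element\-wise form in Eq\. \(23\) can be checked against the matrix form in Eq\. \(22\) on a small example\. The sketch below uses plain Python lists; the $3\times 3$ matrix, the scale vectors, and the helper names `diag_scale`, `matmul`, and `diag` are illustrative assumptions, not values or code from the paper\.

```python
def diag_scale(W, d1, d2):
    """Element-wise form of Eq. (23): W_hat[i][j] = d1[i] * d2[j] * W[i][j]."""
    return [[d1[i] * d2[j] * W[i][j] for j in range(len(d2))]
            for i in range(len(d1))]

def matmul(A, B):
    """Plain dense matrix product of two lists of lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def diag(v):
    """Build the diagonal matrix diag(v)."""
    return [[v[i] if i == j else 0.0 for j in range(len(v))]
            for i in range(len(v))]

W = [[1.0, 2.0, 3.0],
     [4.0, 100.0, 6.0],   # 100.0 plays the role of an outlier at (1, 1)
     [7.0, 8.0, 9.0]]
d1 = [1.0, 0.1, 1.0]      # row scales
d2 = [1.0, 0.5, 1.0]      # column scales

W_hat = diag_scale(W, d1, d2)                              # Eq. (23)
W_hat_matrix_form = matmul(matmul(diag(d1), W), diag(d2))  # Eq. (22)
print(W_hat[1][1])   # the outlier 100.0 is scaled by 0.1 * 0.5
```

Both forms give the same matrix\. In this toy case `d1[1]` and `d2[1]` also rescale the other entries of row 1 and column 1, so only 8192 scalars are available to shape all $4096^{2}$ entries in the full\-size setting\.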

#### A\.1\.2 FlatQuant \(Kronecker Decomposition\):

The adjusted matrix is:

$$\widehat{W}=P_{1}WP_{2},\quad\text{where }P=P_{1}\otimes P_{2} \tag{24}$$

Here, $P_{1},P_{2}\in\mathbb{R}^{64\times 64}$, and the number of parameters is the same as in our method \($64\times 64\times 2=8192$\)\. However, the adjusted matrix elements are:

$$\widehat{W}_{i,j}=\sum_{k=1}^{64}\sum_{l=1}^{64}P_{1}[k,l]\cdot P_{2}[m,n]\cdot W_{i,j} \tag{25}$$

##### Analysis of Degrees of Freedom:

- Coupling effect: Each parameter $P_{1}[k,l]$ and $P_{2}[m,n]$ influences $64\times 64=4096$ positions\. For example, adjusting a single row of $P_{1}$ affects all positions associated with that row, leading to parameter coupling \(see the definition of the Kronecker product\)\.
- Independent adjustment not possible: If an outlier is located at a specific position $(i,j)$, adjusting multiple parameters may be required to suppress the outlier, making independent control impossible\.
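The coupling count can be reproduced on small matrices\. The sketch below builds a Kronecker product in plain Python and counts how many entries change when a single entry of the first factor changes; the helper names `kron` and `positions_touched` and the small sizes are illustrative assumptions, chosen because a full $4096\times 4096$ product would be slow in pure Python\.

```python
def kron(A, B):
    """Kronecker product of two square matrices given as lists of lists."""
    a, b = len(A), len(B)
    return [[A[i // b][j // b] * B[i % b][j % b] for j in range(a * b)]
            for i in range(a * b)]

def positions_touched(a, b, k, l):
    """Count entries of kron(A, B) that change when A[k][l] changes.

    A is a x a and B is b x b; both start as all-ones matrices.
    """
    A0 = [[1.0] * a for _ in range(a)]
    B = [[1.0] * b for _ in range(b)]
    A1 = [row[:] for row in A0]
    A1[k][l] = 2.0
    K0, K1 = kron(A0, B), kron(A1, B)
    return sum(K0[i][j] != K1[i][j]
               for i in range(a * b) for j in range(a * b))

# One entry of the first factor touches a full b x b block of the product.
count_small = positions_touched(3, 4, 1, 2)
count_medium = positions_touched(2, 8, 0, 0)
print(count_small, count_medium)
```

By the same block structure, with $64\times 64$ factors a single entry of $P_{1}$ touches $64\times 64=4096$ positions of $P_{1}\otimes P_{2}$, which is the count quoted above\.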

### A\.2 Convexity and Complexity Analysis of the Optimization Process

#### A\.2\.1 Our Method’s Optimization Objective

Define the loss function as the sum of squared magnitudes of outliers in the adjusted matrix:

$$L_{\text{ours}}=\sum_{(i,j)\in S}\left(d_{1}[i]\cdot d_{2}[j]\cdot W_{i,j}\right)^{2} \tag{26}$$

Here, $S$ represents the set of positions of outliers\. The optimization variables are $d_{1}$ and $d_{2}$\.

##### Convexity Explanation:

- For $d_{1}[i]$ and $d_{2}[j]$, $L_{\text{ours}}$ is a quadratic function \(non\-negative and convex\)\. For example, fixing $d_{2}[j]$, the loss function with respect to $d_{1}[i]$ is:

  $$\begin{aligned}L_{\text{ours}}^{(i)}&=\sum_{j\in S_{i}}\left(d_{1}[i]\cdot d_{2}[j]\cdot W_{i,j}\right)^{2}\\&=d_{1}[i]^{2}\cdot\sum_{j\in S_{i}}\left(d_{2}[j]W_{i,j}\right)^{2}\end{aligned} \tag{27}$$

  where $S_{i}=\{j:(i,j)\in S\}$ denotes the outlier columns in row $i$\.
- Clearly, this is a convex function\. Similarly, fixing $d_{1}[i]$, the loss function with respect to $d_{2}[j]$ is also convex\. Therefore, the overall optimization problem is multiconvex and converges easily\.
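The factorization in Eq\. \(27\) and the resulting convexity in $d_{1}[i]$ can be checked numerically\. The sketch below works on a single row with made\-up values; the names `row_loss` and `row_coefficient` and all numbers are illustrative assumptions\.

```python
def row_loss(d1_i, d2, W_row, S_i):
    """First line of Eq. (27): sum over outlier columns j in S_i."""
    return sum((d1_i * d2[j] * W_row[j]) ** 2 for j in S_i)

def row_coefficient(d2, W_row, S_i):
    """The factor multiplying d1[i]^2 in the second line of Eq. (27)."""
    return sum((d2[j] * W_row[j]) ** 2 for j in S_i)

W_row = [0.5, 40.0, -0.2, 25.0]   # one row of W with two large entries
S_i = [1, 3]                      # columns of the outliers in this row
d2 = [1.0, 0.3, 1.0, 0.8]         # fixed column scales

# With d2 fixed, the loss is c * d1[i]^2 with c >= 0, a convex parabola.
c = row_coefficient(d2, W_row, S_i)
factored = 0.7 ** 2 * c
direct = row_loss(0.7, d2, W_row, S_i)

# Convexity check at one point: f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y).
x, y, t = 0.2, 1.5, 0.25
lhs = row_loss(t * x + (1 - t) * y, d2, W_row, S_i)
rhs = t * row_loss(x, d2, W_row, S_i) + (1 - t) * row_loss(y, d2, W_row, S_i)
print(direct, factored, lhs <= rhs)
```

The check covers one coordinate with the other block held fixed, which is what multiconvexity asserts; it says nothing about joint convexity in $(d_{1},d_{2})$\.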

#### A\.2\.2 FlatQuant’s Optimization Objective

Define a similar loss function:

$$L_{\text{Flatquant}}=\sum_{(i,j)\in S}\left(\left(P_{1}\otimes P_{2}\right)\circ W\right)_{i,j}^{2} \tag{28}$$

Here, $\circ$ denotes the element\-wise product\. The parameters are $P_{1},P_{2}\in\mathbb{R}^{64\times 64}$\.

##### Non\-Convexity Analysis:

- The non\-linear structure of the Kronecker product makes the loss function highly coupled with respect to $P_{1}$ and $P_{2}$\. For example, calculating $\frac{\partial L_{\text{Flatquant}}}{\partial P_{1}[k,l]}$ requires considering all 4096 positions affected by $P_{1}[k,l]$\.
- Specifically:

  $$\frac{\partial L_{\text{Flatquant}}}{\partial P_{1}[k,l]}=2\sum_{(i,j)\in S}\left(\left(P_{1}\otimes P_{2}\right)\circ W\right)_{i,j}\cdot\frac{\partial\left(P_{1}\otimes P_{2}\right)_{i,j}}{\partial P_{1}[k,l]}\cdot W_{i,j} \tag{29}$$

This requires a large number of nested computations, making the optimization process slower and more complex\.

### A\.3 Theoretical Error Bound Comparison

#### Assumptions:

- Outliers are sparse, i\.e\., $|S|=k\ll 4096^{2}$\.
- The objective is to minimize the magnitude of outliers after adjustment:

  $$\min\sum_{(i,j)\in S}\widehat{W}_{i,j}^{2}. \tag{30}$$

#### Error Bound for Our Method:

For each outlier position $(i,j)$, choose $d_{1}[i]=d_{2}[j]=\frac{1}{\sqrt{W_{i,j}}}$ \(assuming other parameters are set to 1\)\. Then, after adjustment:

$$\widehat{W}_{i,j}=\frac{1}{\sqrt{W_{i,j}}}\cdot\frac{1}{\sqrt{W_{i,j}}}\cdot W_{i,j}=1. \tag{31}$$

The total error is:

$$L_{\text{ours}}=\sum_{(i,j)\in S}1^{2}=k. \tag{32}$$

This means the error grows linearly with the number of outliers, $O(k)$\.
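The construction in Eqs\. \(31\) and \(32\) can be traced on a toy matrix\. The sketch below adds two assumptions that the text leaves implicit: the outlier entries are positive \(so the square root is real\), and no two outliers share a row or a column \(so the assignments to `d1` and `d2` do not conflict\)\. The matrix and function names are illustrative\.

```python
import math

def rescale_outliers(W, S):
    """Set d1[i] = d2[j] = 1/sqrt(W[i][j]) for each (i, j) in S, others 1.

    Assumes W[i][j] > 0 on S and that outliers occupy distinct rows
    and distinct columns (assumptions of this sketch).
    """
    n, m = len(W), len(W[0])
    d1, d2 = [1.0] * n, [1.0] * m
    for i, j in S:
        d1[i] = d2[j] = 1.0 / math.sqrt(W[i][j])
    return d1, d2

def outlier_loss(W, d1, d2, S):
    """Eq. (26): sum of squared adjusted outlier magnitudes."""
    return sum((d1[i] * d2[j] * W[i][j]) ** 2 for i, j in S)

W = [[64.0, 0.3, 0.1],
     [0.2, 100.0, 0.4],
     [0.5, 0.1, 0.7]]
S = [(0, 0), (1, 1)]          # positions of the k = 2 outliers

d1, d2 = rescale_outliers(W, S)
loss = outlier_loss(W, d1, d2, S)
print(loss, len(S))
```

Under these assumptions each outlier is mapped to 1, so the loss equals $k$\. When outliers share rows or columns, a single pair of scales can no longer normalize each of them separately, and the choice of $d_{1}$ and $d_{2}$ becomes an optimization problem\.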

#### A\.3\.1 Error Bound for FlatQuant:

Due to the global coupling of the Kronecker decomposition, adjusting a single outlier requires modifying multiple parameters in $P_{1}$ or $P_{2}$\. For example, adjusting one element of $P_{1}$ affects $64\times 64=4096$ positions\. The following condition must hold:

$$\exists\,(k,l),\,P_{1}[k,l]\neq 1\implies\sum_{(i,j)\in S}\left(\left(P_{1}\otimes P_{2}\circ W\right)_{i,j}\right)^{2}\geq\sum_{(i,j)\in S}\epsilon^{2}, \tag{33}$$

where $\epsilon$ is the residual error\. Based on parameter coupling, the minimum error bound is $\Omega(k\cdot 64^{2})$, meaning the error increases with the number of outliers and the square of the matrix dimensions\.

### A\.4 Information Loss Analysis

#### A\.4\.1 Information Loss of Our Method:

The adjusted matrix is defined as:

$$\widehat{W}=D_{1}WD_{2}, \tag{34}$$

where $D_{1}$ and $D_{2}$ are diagonal matrices\. The adjusted matrix retains the sparsity and structure of the original matrix $W$ \(its rank and angular structure remain unchanged\)\.

#### A\.4\.2 Information Loss of FlatQuant:

For the Kronecker decomposition, the adjusted matrix satisfies:

$$\text{rank}(P_{1}\otimes P_{2})=\text{rank}(P_{1})\cdot\text{rank}(P_{2})\leq 64\times 64=4096.$$

In contrast, the rank of the original matrix $W$ may approach 4096 \(full rank\)\. In practice, if $P_{1}$ and $P_{2}$ are low\-rank matrices, the rank of the adjusted matrix $\widehat{W}$ will be further reduced, leading to information loss\.
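The rank identity for Kronecker products, and the rank preservation of diagonal scaling, can be verified exactly on small integer matrices\. The sketch below uses exact rational Gaussian elimination from the Python standard library; the helper names and example matrices are illustrative assumptions, and the diagonal check assumes all scales are nonzero\.

```python
from fractions import Fraction

def rank(M):
    """Rank via Gaussian elimination over exact rationals."""
    A = [[Fraction(x) for x in row] for row in M]
    r = 0
    for c in range(len(A[0])):
        pivot = next((i for i in range(r, len(A)) if A[i][c] != 0), None)
        if pivot is None:
            continue
        A[r], A[pivot] = A[pivot], A[r]
        for i in range(r + 1, len(A)):
            f = A[i][c] / A[r][c]
            A[i] = [x - f * y for x, y in zip(A[i], A[r])]
        r += 1
        if r == len(A):
            break
    return r

def kron(A, B):
    """Kronecker product of two square matrices given as lists of lists."""
    a, b = len(A), len(B)
    return [[A[i // b][j // b] * B[i % b][j % b] for j in range(a * b)]
            for i in range(a * b)]

P1_low = [[1, 2], [2, 4]]                  # rank 1
P1_full = [[1, 2], [3, 4]]                 # rank 2
P2 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]     # rank 3

rank_low = rank(kron(P1_low, P2))          # rank(P1) * rank(P2) = 1 * 3
rank_full = rank(kron(P1_full, P2))        # 2 * 3

# Diagonal scaling with nonzero scales leaves the rank unchanged.
W = [[1, 2], [3, 4]]
d1, d2 = [2, 3], [5, 7]
W_scaled = [[d1[i] * d2[j] * W[i][j] for j in range(2)] for i in range(2)]
print(rank_low, rank_full, rank(W_scaled) == rank(W))
```

The low\-rank factor drags the rank of the Kronecker product down, whereas a full\-rank factor does not, so the information\-loss argument applies specifically when $P_{1}$ or $P_{2}$ is rank\-deficient\.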

In summary, the analysis of parameter independence, optimization convexity, error bounds, and information loss yields the following comparison. Independence: our method adjusts its two sets of scaling parameters independently, while FlatQuant suffers from parameter coupling, which makes local adjustments difficult. Optimization efficiency: the non-convexity of FlatQuant's loss function leads to slower convergence, while our method's loss function is multiconvex and easier to optimize. Error bound: the error bound of our method grows as $O(k)$, while FlatQuant's grows as $\Omega(k\cdot 64^{2})$, a significant gap. Information retention: our method preserves the rank and structure of the original matrix, while FlatQuant's low-rank decomposition leads to information loss. In conclusion, our method is theoretically and practically superior to FlatQuant.

## Appendix B: The Hit Rate Results of Candidate Tokens

Table [4](https://arxiv.org/html/2605.18800#A2.T4) shows the corresponding results.

Table 4: The hit rate of candidate tokens predicted by the model.

## Appendix C: The Position of Equivalent Transformation Pairs

For our BDQ method, each transformer block learns four equivalent transformation pairs, with each pair consisting of two learnable diagonal matrices and one learnable rotation matrix. Similarly to (Ashkboos et al., [2025](https://arxiv.org/html/2605.18800#bib.bib23)) and (Liu et al., [2024](https://arxiv.org/html/2605.18800#bib.bib24)), these four transformation pairs are placed at the $\langle W_{q},W_{k},W_{v}\rangle$ matrices of Self-Attention, the $\langle W_{output}\rangle$ matrix of Self-Attention, the $\langle W_{gate},W_{up}\rangle$ matrices of the Feed-Forward Network, and the $\langle W_{down}\rangle$ matrix of the Feed-Forward Network.

## Appendix D: Complete Experimental Details

##### Experimental Setup.

We apply our method to the entire LLaMA family, including LLaMA-2 (7B-70B) (Touvron et al., [2023b](https://arxiv.org/html/2605.18800#bib.bib33)) and LLaMA-3 (8B-70B). We also conduct experiments on the DeepSeek-R1-Distill family of reasoning models (Guo et al., [2025](https://arxiv.org/html/2605.18800#bib.bib82)). We report perplexity (PPL) scores on the WikiText2 (Merity et al., [2016](https://arxiv.org/html/2605.18800#bib.bib56)) and C4 test sets. All experiments use the GPTQ method for quantization. The quantization baselines are QuaRot (Ashkboos et al., [2025](https://arxiv.org/html/2605.18800#bib.bib23)), SpinQuant (Liu et al., [2024](https://arxiv.org/html/2605.18800#bib.bib24)), and FlatQuant (Sun et al., [2025](https://arxiv.org/html/2605.18800#bib.bib78)).

##### Implementation Details.

We use the AdamW optimizer (Loshchilov et al., [2017](https://arxiv.org/html/2605.18800#bib.bib79)) with an initial learning rate of 5e-3 and adopt a cosine annealing schedule for learning rate decay. BDQ is trained on an alignment dataset for 150 epochs, with the calibration set containing 128 sentences from WikiText2, each containing 2048 tokens. The batch size is set to 4 and $\delta$ is set to 0.5. All diagonal matrices are initialized as identity matrices, while orthogonal matrices are initialized with random affine transformations.
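To make the schedule concrete, the following is a minimal sketch of a cosine annealing curve starting at 5e-3. The paper does not state the final learning rate or whether decay is applied per step or per epoch, so `min_lr = 0.0` and per-step decay are assumptions, as is the helper name `cosine_annealed_lr`. The step count simply combines the stated 128 calibration samples, batch size 4, and 150 epochs.

```python
import math

def cosine_annealed_lr(step, total_steps, base_lr=5e-3, min_lr=0.0):
    """Learning rate at `step` under cosine annealing from base_lr to min_lr."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Assumed step count: 128 samples / batch size 4 = 32 steps per epoch, 150 epochs.
steps_per_epoch = 128 // 4
total_steps = steps_per_epoch * 150

for step in (0, total_steps // 2, total_steps):
    print(step, cosine_annealed_lr(step, total_steps))
```

In a PyTorch training loop the same curve would typically come from a built-in scheduler rather than a hand-written function; the pure-Python version is used here only to keep the example dependency-free.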

## Appendix E: Inference Efficiency and Quantization Overhead Experimental Results

Table [5](https://arxiv.org/html/2605.18800#A5.T5) shows the corresponding results.

Table 5: The overall results of the Speedup and Memory experiments.

## Appendix F: More Quantization Experimental Results

### F.1 Supplementary experimental results

Table [6](https://arxiv.org/html/2605.18800#A6.T6) shows the corresponding results.

| #Bits (W-A-KV) | Model | Method | WikiText2 PPL (↓) | C4 PPL (↓) | ARC-C (↑) | ARC-E (↑) | Hellaswag (↑) | LAMBADA (↑) | PIQA (↑) | Winogrande (↑) | Avg. (↑) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4-4-4 | LLaMA2-7B | FP16 | 5.47 | 7.26 | 46.16 | 74.54 | 75.98 | 73.92 | 79.05 | 69.06 | 69.79 |
| | | QuaRot | 6.10 | 8.32 | 42.32 | 68.35 | 72.53 | 65.40 | 76.33 | 65.11 | 65.01 |
| | | SpinQuant | 5.96 | 8.28 | 41.72 | 69.28 | 72.90 | 71.28 | 76.17 | 66.06 | 66.23 |
| | | FlatQuant | 5.78 | 7.86 | 43.00 | 71.21 | 73.31 | 72.06 | 77.53 | 67.72 | 67.47 |
| | | Ours | 5.76 | 7.64 | 43.07 | 73.09 | 73.36 | 72.06 | 77.57 | 67.90 | 67.84 |
| 4-4-4 | LLaMA2-13B | FP16 | 4.88 | 6.73 | 49.15 | 77.44 | 79.39 | 76.73 | 80.47 | 72.14 | 72.55 |
| | | QuaRot | 5.40 | 7.54 | 42.83 | 69.95 | 73.54 | 65.62 | 77.69 | 67.88 | 66.25 |
| | | SpinQuant | 5.24 | 7.48 | 43.69 | 72.43 | 75.52 | 72.42 | 78.40 | 68.90 | 68.56 |
| | | FlatQuant | 5.11 | 7.11 | 48.38 | 76.94 | 77.88 | 76.40 | 79.65 | 70.56 | 71.64 |
| | | Ours | 5.08 | 7.07 | 48.52 | 76.87 | 77.90 | 76.47 | 79.83 | 70.77 | 71.73 |
| 4-4-4 | LLaMA2-70B | FP16 | 3.32 | 5.72 | 57.71 | 81.02 | 83.81 | 79.60 | 82.70 | 77.98 | 77.05 |
| | | QuaRot | 3.79 | 6.12 | 55.46 | 79.76 | 81.58 | 79.35 | 81.83 | 76.09 | 75.68 |
| | | SpinQuant | 3.70 | 6.07 | 55.38 | 79.04 | 82.57 | 78.75 | 82.37 | 78.22 | 76.06 |
| | | FlatQuant | 3.54 | 5.92 | 56.40 | 80.09 | 82.91 | 80.01 | 82.92 | 76.87 | 76.53 |
| | | Ours | 3.50 | 5.88 | 56.60 | 80.32 | 82.97 | 79.84 | 82.90 | 77.03 | 76.61 |
| 4-4-4 | LLaMA3-8B | FP16 | 6.14 | 9.45 | 53.50 | 77.57 | 79.12 | 75.51 | 80.74 | 72.93 | 73.23 |
| | | QuaRot | 8.16 | 13.38 | 45.73 | 70.83 | 72.97 | 62.70 | 75.35 | 67.17 | 65.79 |
| | | SpinQuant | 7.39 | 12.19 | 47.27 | 74.20 | 74.55 | 70.29 | 77.37 | 68.51 | 68.70 |
| | | FlatQuant | 6.90 | 11.21 | 50.51 | 75.88 | 76.49 | 73.20 | 79.00 | 72.93 | 71.33 |
| | | Ours | 6.84 | 10.97 | 51.03 | 76.10 | 76.77 | 73.42 | 78.57 | 72.88 | 71.46 |
| 4-4-4 | LLaMA3-70B | FP16 | 2.86 | 7.17 | 64.25 | 85.94 | 84.93 | 79.37 | 84.44 | 80.74 | 79.95 |
| | | QuaRot | 6.60 | 12.87 | 49.49 | 74.37 | 77.22 | 71.69 | 78.89 | 71.03 | 70.45 |
| | | SpinQuant | 6.21 | 12.82 | 51.96 | 77.40 | 77.29 | 71.90 | 79.33 | 72.06 | 71.66 |
| | | FlatQuant | 3.77 | 7.93 | 61.95 | 84.47 | 83.87 | 77.99 | 83.95 | 79.24 | 78.58 |
| | | Ours | 3.52 | 7.63 | 62.83 | 84.88 | 84.07 | 79.42 | 84.01 | 79.56 | 79.13 |
| 3-3-3 | LLaMA3-8B | FP16 | 6.14 | 9.45 | 53.50 | 77.57 | 79.12 | 75.51 | 80.74 | 72.93 | 73.23 |
| | | QuaRot | 15.73 | 27.38 | 28.93 | 57.42 | 60.33 | 45.81 | 66.34 | 54.25 | 52.18 |
| | | SpinQuant | 12.37 | 22.35 | 32.55 | 61.03 | 63.59 | 49.80 | 71.29 | 57.93 | 56.03 |
| | | FlatQuant | 10.82 | 19.03 | 35.41 | 63.26 | 65.30 | 52.49 | 73.56 | 60.69 | 58.45 |
| | | Ours | 9.87 | 18.5 | 37.4 | 65.3 | 65.3 | 53.6 | 73.89 | 61.42 | 59.48 |
| 3-3-3 | LLaMA3-70B | FP16 | 2.86 | 7.17 | 64.25 | 85.94 | 84.93 | 79.37 | 84.44 | 80.74 | 79.95 |
| | | QuaRot | 13.44 | 23.39 | 47.86 | 74.31 | 70.53 | 67.57 | 72.09 | 67.53 | 66.64 |
| | | SpinQuant | 10.35 | 18.77 | 52.28 | 78.24 | 76.61 | 72.18 | 77.37 | 70.78 | 71.24 |
| | | FlatQuant | 8.72 | 15.74 | 54.37 | 80.31 | 78.67 | 73.57 | 79.03 | 73.37 | 73.22 |
| | | Ours | 6.67 | 13.21 | 56.12 | 81.22 | 79.63 | 74.79 | 80.14 | 75.67 | 74.59 |
| 2-4-16 | LLaMA3-8B | FP16 | 6.14 | 9.45 | 53.50 | 77.57 | 79.12 | 75.51 | 80.74 | 72.93 | 73.23 |
| | | QuaRot | 24.36 | 29.88 | 28.59 | 54.76 | 54.62 | 41.90 | 60.03 | 51.33 | 48.53 |
| | | SpinQuant | 20.77 | 24.71 | 31.77 | 58.93 | 61.29 | 47.36 | 66.25 | 55.42 | 53.50 |
| | | FlatQuant | 18.67 | 23.66 | 33.37 | 61.26 | 62.55 | 49.07 | 68.59 | 56.89 | 55.28 |
| | | Ours | 16.52 | 20.09 | 36.69 | 64.89 | 64.39 | 52.88 | 72.71 | 60.33 | 58.65 |
| 2-4-16 | LLaMA3-70B | FP16 | 2.86 | 7.17 | 64.25 | 85.94 | 84.93 | 79.37 | 84.44 | 80.74 | 79.95 |
| | | QuaRot | 19.47 | 28.95 | 42.76 | 72.07 | 68.62 | 62.57 | 68.94 | 59.76 | 62.45 |
| | | SpinQuant | 13.76 | 22.76 | 48.74 | 76.74 | 63.01 | 67.79 | 73.82 | 65.93 | 66.01 |
| | | FlatQuant | 11.53 | 19.64 | 50.71 | 78.61 | 75.82 | 70.93 | 75.42 | 68.68 | 70.02 |
| | | Ours | 10.07 | 16.39 | 53.26 | 80.06 | 77.49 | 72.57 | 78.39 | 71.53 | 72.22 |

Table 6: Overall quantization results across different models and settings.

### F.2 Experimental results of downstream tasks

We provide experimental results on MMLU and MATH. For MATH, we report the average of the GSM8K (8-shot) and MATH (4-shot) benchmarks.

Table 7: Performance of different methods on LLaMA-2-7B.

The experimental results in Table [7](https://arxiv.org/html/2605.18800#A6.T7) show that our method achieves superior performance on these benchmark datasets.

### F.3 Experimental results of hyperparameter ablation on $\delta$

Table 8: Results for different $\delta$ values on the WikiText2 and C4 datasets.

The experimental results in Table [8](https://arxiv.org/html/2605.18800#A6.T8) show that $\delta=0.5$ achieves the best performance.

## Appendix G: The Reason for Adding the Rotation Matrix

As mentioned in Section [4.3](https://arxiv.org/html/2605.18800#S4.SS3), the optimal solution for Flatness is $V=d_{1}Wd_{2}$. The rotation matrix $R$ is added to handle the special case where the matrix $W$ has strong column correlations. While retaining the ability of diagonal scaling to eliminate outliers, the rotation matrix further enhances the Flatness of the matrix element distribution through an orthogonal transformation. It also uses the special structure of the Hadamard matrix to address the limitations of relying solely on diagonal scaling in the first step. The following gives a theoretical proof of the rationality of this addition.

Proof from the perspective of information entropy: Introducing $R$ enhances the distribution uniformity of the matrix. Define the probability distribution of matrix elements as $p_{ij}=\frac{V_{ij}^{2}}{\sum_{i,j}V_{ij}^{2}}$ (energy normalization), whose information entropy is given by $H(\mathbf{V})=-\sum_{i,j}p_{ij}\log p_{ij}$. A higher entropy value indicates a more uniform distribution of matrix elements (with reduced influence from outliers).
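The entropy defined above is straightforward to compute. The sketch below is a minimal NumPy version; the function name `energy_entropy` and the two test matrices are illustrative, and zero-energy entries are skipped on the usual convention that $0\log 0=0$.

```python
import numpy as np

def energy_entropy(V):
    """Entropy H(V) of the energy-normalized distribution p_ij = V_ij^2 / sum(V^2)."""
    energy = np.asarray(V, dtype=float) ** 2
    p = energy / energy.sum()
    p = p[p > 0]  # convention: 0 * log(0) = 0
    return float(-(p * np.log(p)).sum())

# A perfectly flat matrix versus one dominated by a single outlier.
flat = np.ones((4, 4))
spiky = np.full((4, 4), 0.01)
spiky[0, 0] = 10.0

print(energy_entropy(flat), energy_entropy(spiky))
```

The flat matrix attains the maximum value $\log 16$ for a $4\times 4$ matrix, while the outlier-dominated one is close to zero.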

Step 1 (Diagonal Scaling Only): For $V_{1}=d_{1}Wd_{2}$, its elements are $V_{1,ij}=a_{i}W_{ij}b_{j}$. Since $d_1$ and $d_2$ are diagonal matrices, they only adjust the magnitude ratio of elements but do not alter the correlation structure between elements. If the original matrix $W$ exhibits strong inter-column correlations (e.g., $W_{i1}\approx W_{i2}$ for all $i$), then $V_{1,i1}\approx\frac{a_{i}b_{1}}{a_{i}b_{2}}V_{1,i2}$ will retain such strong correlations, leading to energy concentration in specific columns (and thus lower entropy).
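Step 1 can be illustrated with two identical columns. The sketch below assumes positive scaling factors $a_i$ and $b_j$ drawn at random; the matrix sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 4))
W[:, 1] = W[:, 0]  # columns 0 and 1 are perfectly correlated

# Positive diagonal scalings; V1 has entries a_i * W_ij * b_j.
a = rng.uniform(0.5, 2.0, size=8)
b = rng.uniform(0.5, 2.0, size=4)
V1 = a[:, None] * W * b[None, :]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(W[:, 0], W[:, 1]), cosine(V1[:, 0], V1[:, 1]))
```

After scaling, the two columns differ only by the constant factor $b_1/b_2$, so their cosine similarity stays at 1.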

Step 2 (Incorporating $R$): For $V_{2}=V_{1}R$, its elements are $V_{2,ik}=\sum_{j}V_{1,ij}R_{jk}$ (linear combinations of columns, with $R_{jk}=\pm 1$ representing signed weighted sums). Due to the orthogonality of Hadamard matrices, the column vectors $V_{2}^{(k)}=\sum_{j}R_{jk}V_{1}^{(j)}$ are mutually orthogonal, i.e., $\langle V_{2}^{(k)},V_{2}^{(l)}\rangle=\sum_{j}R_{jk}R_{jl}\langle V_{1}^{(j)},V_{1}^{(l)}\rangle=0$ ($k\neq l$). This implies that inter-column correlations are completely eliminated, with energy dispersed from originally correlated columns to orthogonal columns.
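The effect of the rotation on the energy entropy can be seen on a small hand-picked example. The sketch below builds a Hadamard matrix with Sylvester's construction and divides it by $\sqrt{n}$ so that it is exactly orthogonal; this normalization, the $4\times 4$ test matrix, and the helper names are assumptions, and the example checks only the entropy and norm behaviour on this one matrix.

```python
import numpy as np

def sylvester_hadamard(n):
    """n x n Hadamard matrix via Sylvester's construction (n must be a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def energy_entropy(V):
    """Entropy of the energy-normalized distribution p_ij = V_ij^2 / sum(V^2)."""
    p = V ** 2 / np.sum(V ** 2)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

n = 4
R = sylvester_hadamard(n) / np.sqrt(n)  # normalized so that R @ R.T = I

# Energy concentrated in column 0, as after diagonal scaling of correlated columns.
V1 = np.full((n, n), 0.1)
V1[:, 0] = 10.0

V2 = V1 @ R
print(energy_entropy(V1), energy_entropy(V2))
```

The rotation leaves the Frobenius norm unchanged and spreads the energy of the dominant column across all columns, which raises the entropy.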

The proof above shows that, after the orthogonal transformation, the more uniform the energy distribution, the lower the Flatness. This does not affect the optimality of $V=d_{1}Wd_{2}$, and the rotation matrix acts as an external gain on $V$.

In addition, we conducted ablation experiments on the rotation matrix, reported in Table [9](https://arxiv.org/html/2605.18800#A7.T9).

Table 9: Ablation experiments on the rotation matrix.
