UniSVQ: 2-bit Unified Scalar-Vector Quantization


Summary

UniSVQ proposes a unified 2-bit quantization framework that bridges scalar and vector quantization by parameterizing codewords as an affine transform of integer lattices, achieving state-of-the-art performance among scalar methods and matching vector methods with higher throughput.

arXiv:2606.10520v1 Announce Type: new Abstract: Post-training quantization at the 2-bit level enables low-cost deployment and inference acceleration for large language models (LLMs). Scalar quantization (SQ) and vector quantization (VQ) are two primary quantization methods, however, the former suffers from significant performance degradation, and the latter incurs computational and storage overhead. We propose UniSVQ, a unified 2-bit quantization framework that bridges scalar and vector quantization by parameterizing codewords as an affine transform of integer lattices. This structure preserves compatibility with optimized integer kernels while retaining much of VQ's flexibility. We further introduce a data-driven block-wise fine-tuning strategy to directly minimize quantization reconstruction error. Extensive experiments across multiple LLM families and zero-shot benchmarks demonstrate that UniSVQ consistently outperforms state-of-the-art SQ methods and achieves performance comparable to advanced VQ methods, while providing higher inference throughput.

Cached at: 06/10/26, 06:11 AM

# UniSVQ: 2-bit Unified Scalar-Vector Quantization
Source: [https://arxiv.org/html/2606.10520](https://arxiv.org/html/2606.10520)
Haiyan Zhao†, Xingyu Yu, Zhangyang Yao, Xu Han†, Zhiyuan Liu, Maosong Sun

###### Abstract

Post-training quantization at the 2-bit level enables low-cost deployment and inference acceleration for large language models (LLMs). Scalar quantization (SQ) and vector quantization (VQ) are the two primary quantization methods; however, the former suffers from significant performance degradation, and the latter incurs computational and storage overhead. We propose UniSVQ, a unified 2-bit quantization framework that bridges scalar and vector quantization by parameterizing codewords as an affine transform of integer lattices. This structure preserves compatibility with optimized integer kernels while retaining much of VQ's flexibility. We further introduce a data-driven block-wise fine-tuning strategy to directly minimize quantization reconstruction error. Extensive experiments across multiple LLM families and zero-shot benchmarks demonstrate that UniSVQ consistently outperforms state-of-the-art SQ methods and achieves performance comparable to advanced VQ methods, while providing higher inference throughput.

Machine Learning, ICML

## 1 Introduction

Large language models (LLMs) require substantial computational resources, which creates a barrier to their real-world applications. Various model compression techniques, such as quantization (Frantar et al., [2022](https://arxiv.org/html/2606.10520#bib.bib16); Lin et al., [2024](https://arxiv.org/html/2606.10520#bib.bib6)), pruning (Sun et al., [2023](https://arxiv.org/html/2606.10520#bib.bib10); Ma et al., [2023](https://arxiv.org/html/2606.10520#bib.bib8)), knowledge distillation (Gu et al., [2023](https://arxiv.org/html/2606.10520#bib.bib5); Wang et al., [2020](https://arxiv.org/html/2606.10520#bib.bib13)), and matrix decomposition (Qinsi et al., [2024](https://arxiv.org/html/2606.10520#bib.bib9); Wang et al., [2025](https://arxiv.org/html/2606.10520#bib.bib14)), have been adopted to address this challenge and compress large-scale models. Post-training quantization (PTQ) is one of the main quantization approaches. Due to its lower computational cost and minimal performance degradation, PTQ is currently widely used in LLM compression (Hao et al., [2025](https://arxiv.org/html/2606.10520#bib.bib17)).

As quantization methods have improved, the performance degradation of PTQ at 4 bits or higher has become relatively minor (Xing et al., [2025](https://arxiv.org/html/2606.10520#bib.bib18); Liu et al., [2025a](https://arxiv.org/html/2606.10520#bib.bib19)). Recently, research has begun to focus on extremely low-bit quantization at 2 bits or below (Chee et al., [2023](https://arxiv.org/html/2606.10520#bib.bib2); Tseng et al., [2024a](https://arxiv.org/html/2606.10520#bib.bib11); Baalen et al., [2024](https://arxiv.org/html/2606.10520#bib.bib1); Egiazarian et al., [2024](https://arxiv.org/html/2606.10520#bib.bib3)), and these methods can be classified as scalar or vector quantization.

Scalar quantization (SQ) maps each floating-point weight to one of a finite set of discrete values. The quantization and dequantization processes of SQ are simple. Furthermore, well-optimized tensor cores can substantially increase the computational speed of SQ models (Frantar et al., [2025](https://arxiv.org/html/2606.10520#bib.bib20)). However, traditional 2-bit SQ methods typically experience severe performance degradation. SQ offers limited flexibility for fine-tuning, and the min-max projection and independent processing of each dimension make SQ susceptible to outliers. Consequently, the performance degradation of state-of-the-art (SOTA) 2-bit scalar-quantized models can exceed 30% on various zero-shot tasks (Liu et al., [2025b](https://arxiv.org/html/2606.10520#bib.bib7)).

Vector quantization (VQ), on the other hand, quantizes groups of contiguous weights together. It maps each fixed-length vector of floating-point weights to one of a finite set of floating-point vectors. This set of vectors is called a codebook, and each vector in the codebook is called a codeword. VQ exhibits superior performance at 2 bits (Egiazarian et al., [2024](https://arxiv.org/html/2606.10520#bib.bib3)). However, it requires additional storage for the codebook. If the size of the codebook exceeds the GPU's L1 cache, frequently transferring the codebook between the GPU's memory and the L1 cache can significantly slow down computation (Tseng et al., [2024b](https://arxiv.org/html/2606.10520#bib.bib31)). Thus, many VQ studies focus on reducing the codebook size, which usually leads to decreased performance or a more complex decoding process (Egiazarian et al., [2024](https://arxiv.org/html/2606.10520#bib.bib3); Tseng et al., [2024a](https://arxiv.org/html/2606.10520#bib.bib11)).

In this work, we show that the unstructured codebook is the main cause of the computational and storage overhead of VQ methods. Based on this insight, this paper proposes UniSVQ, a 2-bit quantization method that leverages the advantages of SQ and VQ, aiming to minimize performance degradation while reducing codebook storage overhead and decoding complexity. The central idea concerns the spatial structure of the quantization grid, i.e., the set of all possible values in the quantized weight matrix. When using a linear-constrained quantization grid, where all discrete values can be obtained via an affine transformation of a group of integer-coordinate vectors, a model structure that lies between scalar and vector quantization can be obtained. During quantization, UniSVQ is equivalent to VQ using a linear-constrained codebook and can achieve similar performance. On the other hand, as shown in [Figure 1](https://arxiv.org/html/2606.10520#S1.F1), UniSVQ replaces the VQ codebook with an affine transformation, reducing the number of additional parameters, and maintains a structure similar to SQ. This enables the reuse of well-optimized SQ matrix multiplication kernels. Our experiments demonstrate that UniSVQ achieves performance superior to SOTA SQ methods and comparable to or better than VQ methods. Additionally, we introduce an adaptive fine-tuning strategy for the affine transformation that directly minimizes the quantization objective function in a data-driven manner. This further improves the quantized model's performance across a wide range of tasks.

![Refer to caption](https://arxiv.org/html/2606.10520v1/x1.png)

Figure 1: Architecture of the UniSVQ method. UniSVQ introduces only 20 additional parameters ($4\times 4$ for the affine matrix and $4$ for the bias vector) per weight matrix, which are significantly fewer than VQ. Besides, the block-wise affine transformation can be pre-applied to the activations, enabling the reuse of well-optimized scalar quantization Matmul kernels. The operation $R^{U}_{Q}=US_{U}$ denotes the inverse Randomized Hadamard Transform, utilized to suppress outliers and ensure a more uniform weight distribution.

In summary, this paper's contributions are as follows:

- We introduce UniSVQ, a 2-bit quantization framework that bridges scalar and vector quantization via an affine lattice parameterization, achieving VQ-like flexibility with only a small number of auxiliary parameters.
- We show that UniSVQ consistently outperforms SOTA SQ baselines and is competitive with strong VQ methods across models and zero-shot benchmarks.
- We demonstrate that UniSVQ improves inference efficiency by reducing codebook-related memory traffic, leading to higher throughput in practice.

## 2 Background and Related Work

This section begins with a review of representative works of vector and scalar PTQ\. We then introduce the concepts of UniSVQ and analyze the connections among these methods\.

### 2.1 Weight-Only PTQ

PTQ directly converts the weights of a pre-trained model into a low-bit representation. This paper focuses primarily on weight-only PTQ, in which only the weight matrices are quantized. The objective is to minimize the discrepancy of activations after applying the quantization-dequantization function $\Phi(\cdot)$. For each linear projection, let the input activations be $X\in\mathbb{R}^{N\times n}$, the weight matrix be $W\in\mathbb{R}^{m\times n}$ (so that $XW^{T}$ is well defined), and the output be $Y=XW^{T}$. We have

$$\mathcal{L}(\Phi)=\left\|XW^{T}-X\Phi(W^{T})\right\|_{2}^{2}. \tag{1}$$

[Equation 1](https://arxiv.org/html/2606.10520#S2.E1) is the same for vector and scalar quantization; the primary distinction lies in the quantization grids.

### 2.2 Scalar Quantization

SQ methods treat each weight individually, where the quantization result is determined by the weight's value. For the integer quantization considered in this paper, the function $\Phi(\cdot)$ can be defined point-wise:

$$\Phi_{\text{SQ}}(w)=s\cdot\left(\text{clamp}\left(\left\lceil\frac{w}{s}\right\rfloor+z,\;q_{\min},\;q_{\max}\right)-z\right). \tag{2}$$

In [Equation 2](https://arxiv.org/html/2606.10520#S2.E2), $s$ and $z$ denote the scaling factor and the zero point, which are typically shared across the entire weight matrix or across several contiguous columns, referred to as a group. $q_{\min}$ and $q_{\max}$ denote the minimum and maximum representable integer values for a given bit width.
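As a minimal sketch of Equation 2, the snippet below quantizes and dequantizes a small group of weights at 2 bits. The helper names (`quantize_dequantize`, `minmax_params`) and the toy weights are illustrative assumptions, and the min-max choice of $s$ and $z$ is only one common way to set them. The example also illustrates how a single outlier stretches the grid so that most weights collapse onto the same value.

```python
def quantize_dequantize(w, s, z, q_min=0, q_max=3):
    """Equation 2 for a single weight: round, shift by the zero point, clamp, map back."""
    q = round(w / s) + z               # nearest integer, shifted by the zero point
    q = max(q_min, min(q_max, q))      # clamp to the representable integer range
    return s * (q - z)                 # dequantize back to a floating-point value


def minmax_params(weights, bits=2):
    """Hypothetical helper: min-max scale and zero point shared by one group."""
    q_max = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    s = (hi - lo) / q_max
    z = round(-lo / s)
    return s, z


# Toy group: the last entry plays the role of an outlier.
group = [-0.9, -0.2, 0.1, 0.4, 3.0]
s, z = minmax_params(group)
dequant = [quantize_dequantize(w, s, z) for w in group]

print("scale:", s, "zero point:", z)
print("dequantized:", dequant)
print("distinct values used:", sorted(set(dequant)))
```

With only four levels available, the three middle weights all map to the same grid point here, which is the outlier sensitivity discussed in the text.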

SQ is the earliest PTQ method for LLMs. While there is minimal performance degradation above 4 bits (Shao et al., [2024](https://arxiv.org/html/2606.10520#bib.bib23); Ashkboos et al., [2024](https://arxiv.org/html/2606.10520#bib.bib21)), typical SQ methods like GPTQ struggle as the bit width decreases (Kumar et al., [2025](https://arxiv.org/html/2606.10520#bib.bib22)). In 2-bit quantization, $\Phi_{\text{SQ}}(w)$ within a group can only take 4 distinct quantization values, so selecting these values becomes critical. To address this issue, methods such as OmniQuant (Shao et al., [2024](https://arxiv.org/html/2606.10520#bib.bib23)) and SignRound (Cheng et al., [2024](https://arxiv.org/html/2606.10520#bib.bib24)) use data-driven approaches to determine suitable values by treating $s$ and $z$ as trainable parameters.

Other works focus on outliers in weight matrices. Previous research indicates that outliers are the primary cause of performance degradation in low-bit quantization (An et al., [2025](https://arxiv.org/html/2606.10520#bib.bib25)). The magnitudes of these outliers far exceed those of other values. When quantization grids are set in regions of high probability density, outliers incur significant clipping errors, which greatly reduce model performance. To solve this, SqueezeLLM (Kim et al., [2024](https://arxiv.org/html/2606.10520#bib.bib26)) and ICQuant (Li et al., [2025](https://arxiv.org/html/2606.10520#bib.bib27)) represent outliers as additional sparse matrices. Other studies, such as PB-LLM (Yuan et al., [2024](https://arxiv.org/html/2606.10520#bib.bib29)) and Bi-LLM (Huang et al., [2024](https://arxiv.org/html/2606.10520#bib.bib28)), have confirmed that outlier-aware quantization achieves reasonable performance even below 2 bits. These studies show that, when weight distributions are concentrated, 2-bit quantization can retain substantial model capability with only 4 distinct values. Additional evidence for this assertion comes from quantization methods based on orthogonal transformations. For example, QuIP (Chee et al., [2023](https://arxiv.org/html/2606.10520#bib.bib2)) adjusts weights to a Gaussian distribution via orthogonal transformations, thereby eliminating the impact of outliers. Follow-up work such as SpinQuant (Liu et al., [2025a](https://arxiv.org/html/2606.10520#bib.bib19)), FlatQuant (Sun et al., [2025](https://arxiv.org/html/2606.10520#bib.bib30)), and OSTQuant (Xing et al., [2025](https://arxiv.org/html/2606.10520#bib.bib18)) further reduces performance degradation by optimizing orthogonal matrices through learnable approaches.

However, despite these advancements, such methods can still result in a performance drop exceeding 50% on challenging tasks, particularly for smaller models\. Conversely, vector quantization methods have demonstrated greater potential in 2\-bit quantization\.

### 2.3 Vector Quantization

VQ quantizes a contiguous group of weights and represents them as a specific codeword from a fixed codebook $C$. For $b$-bit quantization applied to a vector of $d$ contiguous weights, given a codebook $C\in\mathbb{R}^{d\times 2^{bd}}$, we have

$$\Phi([w_{1},w_{2},\dots,w_{d}],C)=C_{i}. \tag{3}$$

In [Equation 3](https://arxiv.org/html/2606.10520#S2.E3), $C_{i}$ is the $i$-th codeword in the codebook $C$. For VQ, the quantization process is not performed point-wise. Additionally, the quantization grid is stored within the codebook, which provides a significantly higher degree of freedom. These advantages lead to superior performance at the 2-bit level. Current VQ methods primarily utilize two approaches for quantization grid selection: clustering-based methods and lattice-based methods. Clustering-based methods use algorithms such as K-means to identify representative quantization grids. Lattice-based methods, on the other hand, derive optimal quantization grids specifically for Gaussian distributions.
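The lookup in Equation 3 can be sketched as a nearest-codeword search. The snippet below is a toy illustration rather than any particular VQ method: the codebook values, the squared-Euclidean criterion, and the helper names are assumptions. It also counts how many floating-point entries an unstructured codebook holds for a given $b$ and $d$.

```python
def nearest_codeword(w, codebook):
    """Return (index, codeword) of the codebook entry closest to w in squared distance."""
    def sq_dist(c):
        return sum((wi - ci) ** 2 for wi, ci in zip(w, c))

    idx = min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))
    return idx, codebook[idx]


def codebook_entries(bits, d):
    """Number of floats in an unstructured codebook of shape d x 2^(bits * d)."""
    return d * 2 ** (bits * d)


# Toy case with d = 2 and b = 1, so the codebook holds 2^(1*2) = 4 codewords.
toy_codebook = [(-0.8, -0.5), (0.9, 0.1), (-0.1, 1.0), (0.2, -0.9)]
idx, word = nearest_codeword((0.7, 0.3), toy_codebook)

print("selected index:", idx, "codeword:", word)
print("floats in a 2-bit, d=4 codebook:", codebook_entries(2, 4))
print("floats in a 2-bit, d=8 codebook:", codebook_entries(2, 8))
```

Only the index needs to be stored per weight group, but the codebook itself grows exponentially in $bd$, which is the storage overhead discussed next.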

Regardless of the approach, however, additional storage and more complex decoding processes are inevitable for VQ. A common solution in clustering-based methods is to adopt multi-level codebooks. For example, AQLM (Egiazarian et al., [2024](https://arxiv.org/html/2606.10520#bib.bib3)) uses two 1-bit codebooks instead of one 2-bit codebook, reducing the codebook size from $\mathbb{R}^{d\times 2^{bd}}$ to $2\,\mathbb{R}^{d\times 2^{(b/2)\cdot d}}$. Building upon this, GPTVQ (Baalen et al., [2024](https://arxiv.org/html/2606.10520#bib.bib1)) incorporates outlier-aware quantization. Among lattice-based methods, QuIP# (Tseng et al., [2024a](https://arxiv.org/html/2606.10520#bib.bib11)) and NestQuant (Savkin et al., [2025](https://arxiv.org/html/2606.10520#bib.bib33)) use codebooks with high spatial symmetry. Meanwhile, Qtip (Tseng et al., [2024b](https://arxiv.org/html/2606.10520#bib.bib31)) and CCQ (Zhou et al., [2025](https://arxiv.org/html/2606.10520#bib.bib32)) introduce trellis-coded and convolutional codes, respectively. These methods compress codewords and codebooks in various ways, necessitating a decompression step during inference.

## 3 Methods

In this section, we will analyze the differences and connections between scalar and vector quantization\. We will also introduce the basic concept of UniSVQ and its details\.

### 3.1 The Differences and Connection between Scalar Quantization and Vector Quantization

The differences in quantization grids between VQ and SQ lead to trade\-offs in model performance and computational efficiency\. In terms of performance, VQ’s quantization grids can be positioned in regions of higher probability density, which better leverages the distribution of weight vectors and leads to lower degradation\. In terms of efficiency, SQ’s quantization grid is derived through simple scaling and translation of integer weights\.

![Refer to caption](https://arxiv.org/html/2606.10520v1/x2.png)

Figure 2: Comparison of the 2-dimensional 2-bit quantization grids of scalar quantization, vector quantization, and UniSVQ for an isotropic Gaussian weight. MinMax scalar quantization uses a highly structured grid, which is easier for dequantization but harder to fit the distribution due to boundary values. Vector quantization achieves lower error, but lacks structure. UniSVQ maintains a structured grid while providing higher degrees of freedom.

In this paper, we propose UniSVQ, a unified representation that combines vector and scalar quantization to leverage the strengths of both. The key insight is selecting a quantization grid that aligns with the weight distribution, while maintaining a structured form that simplifies decoding. We demonstrate that this can be achieved by a linear-constrained quantization grid. Specifically, it is defined as [Equation 4](https://arxiv.org/html/2606.10520#S3.E4):

$$\Phi([w_{1},w_{2},\dots,w_{d}]^{T})=A[\bar{w}_{1},\bar{w}_{2},\dots,\bar{w}_{d}]^{T}+B. \tag{4}$$

In [Equation 4](https://arxiv.org/html/2606.10520#S3.E4), $\bar{w}_{i}\in\mathbb{Z}$. This unified representation can be interpreted as either scalar quantization with higher degrees of freedom or vector quantization with constrained codewords. [Figure 2](https://arxiv.org/html/2606.10520#S3.F2) shows the quantization grids of VQ, SQ, and UniSVQ. SQ has fixed quantization grids, and the commonly used min-max quantization (Frantar et al., [2022](https://arxiv.org/html/2606.10520#bib.bib16); Xing et al., [2025](https://arxiv.org/html/2606.10520#bib.bib18); Liu et al., [2025a](https://arxiv.org/html/2606.10520#bib.bib19)) makes it highly sensitive to the boundary values. Compared to SQ, UniSVQ treats multiple weights as a single entity, replacing the scaling and translation of individual integers with an affine transformation to offer more flexibility. Compared to VQ, UniSVQ has a regular structure, thus reducing storage overhead. For 2-bit quantization with $d=4$ and 2-byte parameters, the required storage can be reduced from $2^{4\cdot 2}\cdot 4\cdot 2=2048$ bytes to $(4\cdot 4+4)\cdot 2=40$ bytes, i.e., to roughly 1/51 of the additional storage of the VQ codebook.
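To make the linear-constrained grid concrete, the sketch below enumerates every codeword $A\bar{w}+B$ for a toy 2-dimensional case and recomputes the byte counts quoted above for $d=4$. The particular matrix `A_toy`, offset `B_toy`, and the 2-byte (FP16) parameter size are illustrative assumptions.

```python
from itertools import product


def affine_codewords(A, B, bits=2):
    """Enumerate all codewords A @ w_bar + B for integer vectors w_bar in {0..2^bits-1}^d."""
    d = len(B)
    levels = range(2 ** bits)
    words = []
    for w_bar in product(levels, repeat=d):
        words.append(tuple(sum(A[r][c] * w_bar[c] for c in range(d)) + B[r]
                           for r in range(d)))
    return words


# Toy 2-D grid: any invertible A gives 4^2 = 16 distinct codewords.
A_toy = [[1.0, 0.5],
         [-0.5, 1.0]]
B_toy = [-2.25, -0.75]
words = affine_codewords(A_toy, B_toy)

# Storage for 2-bit quantization with d = 4, assuming 2 bytes per parameter.
bits, d, bytes_per_param = 2, 4, 2
vq_bytes = 2 ** (bits * d) * d * bytes_per_param      # unstructured codebook
uni_bytes = (d * d + d) * bytes_per_param             # affine matrix A plus offset B

print("codewords in the toy grid:", len(words))
print("VQ codebook bytes:", vq_bytes)
print("UniSVQ affine bytes:", uni_bytes)
print("reduction factor:", vq_bytes / uni_bytes)
```

The grid is fully described by `A` and `B`, so no codeword table has to be stored or fetched at inference time.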

Furthermore, in UniSVQ, the introduced affine transformation can be moved across the matrix multiplication in linear layers: it can be applied to the activations instead of to the integer weights. Therefore, the computational workflow is similar to that of SQ. This allows us to use existing, highly optimized scalar quantization kernels. By imposing specific linear constraints on the quantization grid, we can establish a connection between vector and scalar quantization.

### 3.2 Implementation of UniSVQ

This section presents the details of the UniSVQ method, consisting of three primary stages\.

1. Preprocessing: We apply a Randomized Hadamard Transform (RHT) to the weights. This step eliminates outliers and ensures that the linear-constrained quantization grid is sufficiently accurate.
2. Weight Quantization: We construct the linear-constrained quantization grid and perform weight quantization using the LDLQ method.
3. Fine-Tuning: We perform layer-wise fine-tuning of the affine transformations to minimize reconstruction error further.

#### 3.2.1 Preprocessing

We first apply a Randomized Hadamard Transform to the weight matrices for preprocessing. Intuitively, outliers can be viewed as vectors with large projections onto a single coordinate axis. The RHT acts as a random rotation of these vectors, redistributing their magnitudes across all axes and eliminating the outliers. The RHT of $W$ is given in [Equation 5](https://arxiv.org/html/2606.10520#S3.E5):

$$R(W)=US_{U}WS_{V}V. \tag{5}$$

In [Equation 5](https://arxiv.org/html/2606.10520#S3.E5), $S_{U}$ and $S_{V}$ are diagonal matrices with elements sampled randomly from $\{1,-1\}$, and $U$ and $V$ are Hadamard matrices. Since Hadamard matrices are orthogonal, the RHT is fully reversible. This allows us to perform an inverse transform on the quantized $W_{\text{RHT}}$ during inference. Furthermore, computing $U(S_{U}W)$ using the Fast Walsh–Hadamard Transform (FWHT) takes $O(n\log_{2}n)$ time. After RHT, $W_{\text{RHT}}$ satisfies the incoherence property $\max(W_{\text{RHT}})\leq\mu\|W\|_{F}/\sqrt{mn}$. Moreover, Tseng et al. ([2024a](https://arxiv.org/html/2606.10520#bib.bib11)) demonstrate that $W_{\text{RHT}}$ approximately follows a standard multi-dimensional Gaussian distribution. Under this condition, the optimal codebook can have a regular structure and be approximated by our linear-constrained quantization grid.
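The following sketch illustrates the one-sided part of Equation 5, $US_{U}w$, on a single vector using a normalized fast Walsh–Hadamard transform. It is a simplified illustration under stated assumptions: the function names are hypothetical, the transform is applied to one vector rather than to both sides of a matrix, and the vector length must be a power of two.

```python
import math
import random


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform in O(n log n); len(vec) must be a power of two."""
    x = list(vec)
    n = len(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = x[i], x[i + h]
                x[i], x[i + h] = a + b, a - b   # butterfly step
        h *= 2
    scale = 1.0 / math.sqrt(n)                  # normalization makes the transform orthogonal
    return [v * scale for v in x]


def rht(vec, signs):
    """Apply random sign flips, then the Hadamard transform."""
    return fwht([s * v for s, v in zip(signs, vec)])


def inverse_rht(vec, signs):
    """The normalized Hadamard matrix and the sign matrix are each their own inverse."""
    return [s * v for s, v in zip(signs, fwht(vec))]


rng = random.Random(0)
n = 8
signs = [rng.choice((1, -1)) for _ in range(n)]

w = [0.0] * n
w[3] = 8.0                                      # a single large outlier
w_rht = rht(w, signs)
w_back = inverse_rht(w_rht, signs)

print("max |w| before:", max(abs(v) for v in w))
print("max |w| after :", max(abs(v) for v in w_rht))
print("round trip ok :", all(abs(a - b) < 1e-9 for a, b in zip(w, w_back)))
```

The outlier's energy is spread evenly over all coordinates while the vector norm is preserved, and the inverse transform recovers the original vector exactly up to floating-point error.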

#### 3.2.2 Quantization

Initialization of the linear-constrained quantization grid. During quantization, UniSVQ maps a group of weights to a vector by passing an integer weight vector through an affine transformation. The parameters of this transformation determine the structure of the quantization grid, directly impacting the quantization error. First, we must select reasonable initial values for the transformation parameters. This transformation should yield a centrally symmetric quantization grid, and the coordinates of the codewords should approximately follow a Gaussian distribution to match the standard Gaussian distribution of $W_{\text{RHT}}$. Specifically, for $b$-bit quantization, the codeword $C_{i}$ is obtained through the affine transformation:

$$C_{i}=A\bar{W}_{i}+B=sG\bar{W}_{i}-sGb\mathbf{1}. \tag{6}$$

In [Equation 6](https://arxiv.org/html/2606.10520#S3.E6), $\mathbf{1}$ is an all-ones vector, $G$ is a random orthogonal matrix, $b=(2^{b}-1)/2$ is a centering constant (the symbol $b$ is overloaded here: on the right-hand side it denotes the bit width), and $s=\sqrt{12/(2^{2b}-1)}$ is a scaling factor. We prove in [Section A.1](https://arxiv.org/html/2606.10520#A1.SS1) that this grid satisfies the above requirements.
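The initialization in Equation 6 can be checked numerically. The sketch below builds $A=sG$ and $B=-sG\,b\,\mathbf{1}$ for 2 bits and $d=4$, enumerates all 256 codewords, and computes their per-coordinate mean and variance. Generating $G$ by Gram-Schmidt on Gaussian vectors is an assumption made to keep the example dependency-free, and the function names are hypothetical.

```python
import math
import random
from itertools import product


def random_orthogonal(d, rng):
    """Orthonormal rows via Gram-Schmidt on Gaussian vectors (stand-in for a library routine)."""
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:                            # remove components along earlier rows
            dot = sum(a * b for a, b in zip(u, v))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:
            rows.append([a / norm for a in v])
    return rows


def init_grid(bits, d, rng):
    """Equation 6: A = s * G and B = -s * G * center * 1."""
    center = (2 ** bits - 1) / 2                  # centering constant
    s = math.sqrt(12 / (2 ** (2 * bits) - 1))     # scale that yields unit variance
    G = random_orthogonal(d, rng)
    A = [[s * g for g in row] for row in G]
    B = [-center * sum(row) for row in A]
    return A, B


def codewords(A, B, bits):
    d = len(B)
    return [[sum(A[r][c] * w[c] for c in range(d)) + B[r] for r in range(d)]
            for w in product(range(2 ** bits), repeat=d)]


A, B = init_grid(bits=2, d=4, rng=random.Random(0))
C = codewords(A, B, bits=2)
means = [sum(c[r] for c in C) / len(C) for r in range(4)]
variances = [sum(c[r] ** 2 for c in C) / len(C) for r in range(4)]

print("number of codewords:", len(C))
print("per-coordinate means:", means)
print("per-coordinate variances:", variances)
```

Each coordinate of the grid has zero mean and unit variance, matching the first two moments of the standard Gaussian that $W_{\text{RHT}}$ is assumed to follow.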

Quantization with LDLQ. After establishing the quantization grid, we perform quantization and calibration using the LDLQ method (Tseng et al., [2024a](https://arxiv.org/html/2606.10520#bib.bib11)), which uses the LDL decomposition for the quantization calibration. The reason for choosing LDLQ is its outstanding performance in various VQ methods (Tseng et al., [2024a](https://arxiv.org/html/2606.10520#bib.bib11); Savkin et al., [2025](https://arxiv.org/html/2606.10520#bib.bib33)), and UniSVQ is equivalent to VQ using a linear-constrained codebook during quantization. The LDLQ approach performs quantization on weight groups column-by-column, sequentially adjusting the remaining unquantized weights to compensate for the quantization error. After quantizing the $i$-th group of weights, the adjusted weights for the $j$-th column are given by:

$$\hat{W}_{j}=\Phi\left(W_{j}+\sum_{k=1}^{j-1}(W_{k}-\hat{W}_{k})a_{kj}\right). \tag{7}$$

In [Equation 7](https://arxiv.org/html/2606.10520#S3.E7), $j\in\{(i-1)d,\dots,n\}$, $n$ is the total number of columns, and $a_{kj}$ represents the adjustment coefficients. LDLQ computes these adjustment coefficients through the LDL decomposition of the Hessian matrix $H$. Let the Hessian matrix be decomposed as $H=LDL^{T}$; the adjustment coefficients $a_{kj}$ correspond to the off-diagonal elements of the decomposition.
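The error-feedback recursion of Equation 7 can be sketched for a single weight row. This is a simplified illustration under several assumptions: columns are quantized one at a time ($d=1$) with plain rounding standing in for $\Phi$, the factorization is written with a unit upper-triangular factor $M$ so that $H=MDM^{T}$ (one indexing convention for the LDL decomposition that matches feedback from earlier columns), and the names `udu_decompose` and `ldlq_row` are hypothetical.

```python
import random


def udu_decompose(H):
    """Factor a symmetric positive-definite H as M D M^T with M unit upper triangular."""
    n = len(H)
    M = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    D = [0.0] * n
    for j in range(n - 1, -1, -1):                # work from the last column backwards
        D[j] = H[j][j] - sum(M[j][k] ** 2 * D[k] for k in range(j + 1, n))
        for i in range(j):
            acc = sum(M[i][k] * M[j][k] * D[k] for k in range(j + 1, n))
            M[i][j] = (H[i][j] - acc) / D[j]
    return M, D


def ldlq_row(w, H, quantize):
    """Quantize one row column by column, feeding earlier errors forward (Equation 7)."""
    M, D = udu_decompose(H)
    w_hat = []
    for j in range(len(w)):
        # a_kj = M[k][j] for k < j: the off-diagonal entries of the factor
        feedback = sum((w[k] - w_hat[k]) * M[k][j] for k in range(j))
        w_hat.append(quantize(w[j] + feedback))
    return w_hat, M, D


rng = random.Random(0)
n, N = 6, 32
X = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(N)]
# Proxy Hessian X^T X with a small Tikhonov term on the diagonal.
H = [[sum(X[t][i] * X[t][j] for t in range(N)) + (0.01 if i == j else 0.0)
      for j in range(n)] for i in range(n)]
w = [rng.gauss(0.0, 1.0) for _ in range(n)]

w_hat, M, D = ldlq_row(w, H, quantize=lambda v: float(round(v)))
e = [a - b for a, b in zip(w_hat, w)]
proxy_loss = sum(e[i] * H[i][j] * e[j] for i in range(n) for j in range(n))

print("quantized row:", w_hat)
print("proxy loss e H e^T:", proxy_loss)
```

With this convention the proxy loss equals $\sum_{j}\eta_{j}^{2}D_{j}$, where $\eta_{j}$ is the rounding error made at column $j$, so each column's error enters the loss independently of the others.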

#### 3.2.3 Finetuning

The quantization grid is initialized using a random orthogonal matrix, which may not be optimal for minimizing reconstruction error. Although RHT transforms the weights into a nearly isotropic Gaussian distribution, the varying importance of the weights, the distribution of the activations, and the non-flatness inherent to the RHT results (Sun et al., [2025](https://arxiv.org/html/2606.10520#bib.bib30)) all influence the optimal configuration of the quantization grid.

To address these issues, we propose a data-driven approach to fine-tune the quantization grid. In UniSVQ, the dequantization process is reformulated as a matrix multiplication, making optimization via backpropagation feasible. Specifically, for a quantized integer matrix $W_{\text{int}}$, the dequantized weight matrix $\hat{W}^{T}$ is structured as [Equation 8](https://arxiv.org/html/2606.10520#S3.E8):

$$\hat{W}^{T}=\begin{bmatrix}AW_{\mathrm{int},1}^{T}+B\mathbf{1}^{T}\\ AW_{\mathrm{int},2}^{T}+B\mathbf{1}^{T}\\ \vdots\\ AW_{\mathrm{int},n/d}^{T}+B\mathbf{1}^{T}\end{bmatrix}. \tag{8}$$

This formulation allows the linear operation $Y=X\hat{W}^{T}$ to be rewritten as:

$$\begin{aligned}Y&=\sum_{i=1}^{n/d}X_{i}\left(AW_{\mathrm{int},i}^{T}+B\mathbf{1}^{T}\right)\\&=\sum_{i=1}^{n/d}(X_{i}A)W_{\mathrm{int},i}^{T}+\sum_{i=1}^{n/d}(X_{i}B)\mathbf{1}^{T}.\end{aligned} \tag{9}$$

In [Equation 9](https://arxiv.org/html/2606.10520#S3.E9), $X_{i}A$ is a floating-point activation, and $W_{\mathrm{int},i}$ remains an integer matrix. For $X\in\mathbb{R}^{N\times n}$ and $W_{\mathrm{int}}\in\mathbb{Z}^{m\times n}$, the computational complexity of the dequantization transformation is $O(Nnd)+O(Nn)$. Since $d$ usually takes the value of 4 or 8, the additional computational complexity is much smaller than that of the $O(Nnm)$ matrix multiplication.
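The equivalence in Equation 9 can be verified numerically on a small example. The sketch below compares the direct product $X\hat{W}^{T}$, with $\hat{W}^{T}$ built as in Equation 8, against the rewritten form in which $A$ and $B$ are applied to the activations and the integer matrix is used as-is. All sizes, values, and function names are illustrative assumptions.

```python
import random


def dequantize_T(W_int, A, B, d):
    """Build W_hat^T (n x m) block by block as in Equation 8."""
    m, n = len(W_int), len(W_int[0])
    W_hat_T = [[0.0] * m for _ in range(n)]
    for i in range(n // d):
        for r in range(d):
            for j in range(m):
                W_hat_T[i * d + r][j] = (
                    sum(A[r][c] * W_int[j][i * d + c] for c in range(d)) + B[r])
    return W_hat_T


def matmul(P, Q):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def forward_fused(X, W_int, A, B, d):
    """Equation 9: transform each activation block by A, then multiply by the integer weights."""
    n = len(X[0])
    Y = []
    for x in X:
        x_tilde = [0.0] * n
        bias = 0.0
        for i in range(n // d):
            block = x[i * d:(i + 1) * d]
            for c in range(d):
                x_tilde[i * d + c] = sum(block[r] * A[r][c] for r in range(d))   # X_i A
            bias += sum(block[r] * B[r] for r in range(d))                       # X_i B
        Y.append([sum(xt * wq for xt, wq in zip(x_tilde, w_row)) + bias
                  for w_row in W_int])
    return Y


rng = random.Random(0)
N, n, m, d = 3, 8, 5, 4
X = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(N)]
W_int = [[rng.randrange(4) for _ in range(n)] for _ in range(m)]       # 2-bit integer codes
A = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]
B = [rng.gauss(0.0, 1.0) for _ in range(d)]

Y_direct = matmul(X, dequantize_T(W_int, A, B, d))
Y_fused = forward_fused(X, W_int, A, B, d)
max_diff = max(abs(a - b) for ra, rb in zip(Y_direct, Y_fused) for a, b in zip(ra, rb))

print("max absolute difference:", max_diff)
```

In the fused path the only extra work is the small $d\times d$ transform per activation block and one scalar bias per row, while the large product still operates on the integer weights.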

This enables us to optimize A and B using a layer\-wise mean squared error \(MSE\) loss\. We fine\-tune the floating\-point parameters layer by layer\. Specifically, after quantizing all matrices within a Transformer block, we fine\-tune the block’s quantization grid using the activations from the previous quantized layer as input and the output of the original FP16 model as target, given the same inputs\. We use MSE loss to minimize reconstruction error, allowing the affine parameters to adaptively compensate for quantization error\.

Table 1: PPL and 0-shot QA accuracy of UniSVQ compared with scalar and vector baselines. With only 20 additional parameters and an $O(n)$ computational overhead, UniSVQ outperforms scalar quantization baselines consistently and achieves comparable results compared to vector quantization using the linear-constrained quantization grid with much less storage and a simpler model structure. AE, AC, BQ, HS, WG, and PQ stand for ARC-Easy, ARC-Challenge, BoolQ, HellaSwag, WinoGrande, and PIQA, respectively.

| Model | Type | Method | Wiki↓ | C4↓ | AC↑ | AE↑ | BQ↑ | HS↑ | PQ↑ | WG↑ | Avg.↑ | Per.↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen-3-32B | N/A | FP16 | 7.61 | 12.45 | 60.92 | 83.16 | 86.36 | 82.58 | 82.10 | 72.92 | 78.01 | 1.00 |
| | Scalar | GPTQ | 1.38e4 | 6.04e3 | 25.34 | 25.46 | 40.58 | 25.19 | 50.59 | 49.40 | 36.09 | 0.46 |
| | Scalar | QuIP | 16.72 | 21.74 | 34.89 | 44.57 | 53.14 | 62.24 | 69.20 | 53.35 | 52.90 | 0.68 |
| | Scalar | SpinQuant | 10.90 | 22.01 | 43.34 | 63.89 | 84.40 | 67.40 | 72.91 | 67.96 | 66.65 | 0.85 |
| | Scalar | OSTQuant | 14.79 | 22.46 | 46.84 | 67.55 | 80.80 | 68.37 | 76.50 | 69.69 | 68.29 | 0.88 |
| | Vector | QuIP# | 9.04 | 14.13 | 58.10 | 80.00 | 87.89 | 78.46 | 79.70 | 73.63 | 76.30 | 0.98 |
| | Vector | AQLM | 10.56 | 15.13 | 58.27 | 80.85 | 86.94 | 76.21 | 77.80 | 72.13 | 75.37 | 0.97 |
| | UniSVQ | Proposed | 9.26 | 14.42 | 58.44 | 80.81 | 87.89 | 77.07 | 78.83 | 73.88 | 76.15 | 0.98 |
| Qwen-3-14B | N/A | FP16 | 8.64 | 13.81 | 60.15 | 82.83 | 89.30 | 78.84 | 79.87 | 72.85 | 77.31 | 1.00 |
| | Scalar | GPTQ | 1.90e3 | 8.49e2 | 25.94 | 25.76 | 42.02 | 24.76 | 50.27 | 47.20 | 37.29 | 0.48 |
| | Scalar | QuIP | 17.34 | 23.19 | 29.86 | 46.42 | 61.22 | 57.47 | 66.87 | 57.14 | 53.16 | 0.69 |
| | Scalar | SpinQuant | 13.75 | 20.35 | 38.99 | 63.97 | 79.42 | 56.28 | 70.40 | 65.51 | 62.43 | 0.81 |
| | Scalar | OSTQuant | 20.96 | 30.52 | 43.26 | 69.61 | 84.74 | 59.17 | 73.72 | 66.93 | 66.24 | 0.86 |
| | Vector | QuIP# | 10.76 | 16.43 | 53.67 | 76.89 | 87.95 | 71.91 | 76.99 | 71.67 | 73.18 | 0.95 |
| | Vector | AQLM | 14.80 | 17.51 | 50.60 | 75.29 | 87.34 | 69.17 | 76.17 | 70.80 | 71.56 | 0.93 |
| | UniSVQ | Proposed | 11.41 | 16.85 | 51.79 | 78.49 | 86.82 | 69.96 | 76.50 | 71.27 | 72.47 | 0.94 |
| Qwen-3-8B | N/A | FP16 | 9.72 | 15.42 | 56.40 | 80.89 | 86.64 | 74.96 | 77.48 | 68.35 | 74.12 | 1.00 |
| | Scalar | GPTQ | 4.68e4 | 1.68e4 | 26.79 | 25.67 | 42.87 | 25.84 | 52.50 | 50.04 | 37.29 | 0.50 |
| | Scalar | QuIP | 27.61 | 34.42 | 24.91 | 31.69 | 54.89 | 41.41 | 59.90 | 50.12 | 43.82 | 0.59 |
| | Scalar | SpinQuant | 17.82 | 37.28 | 30.46 | 48.40 | 67.06 | 47.94 | 63.17 | 58.64 | 52.61 | 0.71 |
| | Scalar | OSTQuant | 26.08 | 39.71 | 34.64 | 60.02 | 73.21 | 50.29 | 68.50 | 58.17 | 57.47 | 0.78 |
| | Vector | QuIP# | 12.37 | 18.45 | 46.50 | 68.43 | 83.12 | 66.62 | 74.32 | 66.30 | 67.55 | 0.91 |
| | Vector | AQLM | 18.26 | 20.73 | 45.22 | 72.31 | 73.73 | 60.99 | 73.07 | 64.33 | 64.94 | 0.88 |
| | UniSVQ | Proposed | 14.82 | 19.96 | 45.82 | 72.35 | 85.07 | 63.18 | 74.16 | 67.09 | 67.95 | 0.92 |
| Qwen-3-4B | N/A | FP16 | 10.04 | 16.81 | 58.36 | 81.23 | 84.68 | 69.06 | 75.84 | 68.11 | 72.88 | 1.00 |
| | Scalar | GPTQ | 1.96e5 | 1.16e5 | 26.62 | 26.05 | 43.30 | 26.12 | 51.52 | 48.30 | 36.99 | 0.51 |
| | Scalar | QuIP | 37.88 | 46.59 | 23.98 | 35.82 | 48.29 | 37.38 | 57.67 | 49.41 | 42.09 | 0.58 |
| | Scalar | SpinQuant | 25.88 | 73.35 | 28.92 | 41.05 | 64.34 | 40.74 | 59.19 | 51.78 | 47.67 | 0.65 |
| | Scalar | OSTQuant | 58.85 | 75.13 | 29.35 | 32.78 | 62.17 | 34.66 | 57.62 | 53.43 | 45.00 | 0.62 |
| | Vector | QuIP# | 14.78 | 21.25 | 45.82 | 70.41 | 79.72 | 58.55 | 73.01 | 62.75 | 65.04 | 0.89 |
| | Vector | AQLM | 38.16 | 26.99 | 40.96 | 66.88 | 77.22 | 53.02 | 68.61 | 60.85 | 61.26 | 0.84 |
| | UniSVQ | Proposed | 20.04 | 23.44 | 43.34 | 65.82 | 82.26 | 55.27 | 70.89 | 62.35 | 63.32 | 0.87 |

## 4 Experiment Settings

### 4.1 Models and Evaluations

We evaluate the effectiveness of UniSVQ across the Qwen-3 (Yang et al., [2025](https://arxiv.org/html/2606.10520#bib.bib37)) and Llama-3 (Llama Team, [2024](https://arxiv.org/html/2606.10520#bib.bib35)) model families, covering model scales from 4 billion to 32 billion parameters. We randomly sampled 1,024 sequences from the RedPajama dataset (Weber et al., [2024](https://arxiv.org/html/2606.10520#bib.bib36)), each with a length of 2,048. We use these samples to compute $H$ for the LDLQ stage and to fine-tune the quantization grids.

We report the PPL on the WikiText 2 (Merity et al., [2016](https://arxiv.org/html/2606.10520#bib.bib43)) and C4 (Raffel et al., [2020](https://arxiv.org/html/2606.10520#bib.bib44)) datasets and 0-shot accuracy on several downstream benchmarks, including the ARC-Easy, ARC-Challenge (Clark et al., [2018](https://arxiv.org/html/2606.10520#bib.bib40)), BoolQ (Clark et al., [2019](https://arxiv.org/html/2606.10520#bib.bib39)), HellaSwag (Zellers et al., [2019](https://arxiv.org/html/2606.10520#bib.bib42)), PIQA (Bisk et al., [2020](https://arxiv.org/html/2606.10520#bib.bib38)), and WinoGrande (Sakaguchi et al., [2021](https://arxiv.org/html/2606.10520#bib.bib41)) datasets. 0-shot accuracy is evaluated using the LM-Evaluation-Harness toolkit (Gao et al., [2024](https://arxiv.org/html/2606.10520#bib.bib34)).

### 4.2 Quantization Settings

The quantization grid is initialized using random orthogonal matrices generated by the SciPy library (https://github.com/scipy/scipy). To minimize the additional computational overhead introduced by the affine transformation, the vector quantization dimension $d$ is set to 4, so each linear layer introduces only 20 additional floating-point parameters: $A\in\mathbb{R}^{4\times 4}$ and $B\in\mathbb{R}^{4}$. During quantization, all of the Hessian matrices $H$ are calculated offline, and quantization is performed layer-wise on every linear layer in all the transformer blocks. To ensure positive definiteness of $H$ during LDL decomposition, a Tikhonov regularization term $\mu I$ is applied, where $\mu=0.01$. We fine-tune the linear-constrained quantization grid using the same 1,024 samples, partitioned into training and validation sets at a ratio of 7:1. We use a batch size of 16 and a learning rate of 5e-5. Each layer is optimized for up to 5 epochs, and the early stop threshold is set to 3. The entire quantization and fine-tuning process takes approximately 6 hours on an Nvidia A100 GPU for an 8B model.

### 4.3 Baselines

Since UniSVQ has a similar structure to orthogonal-transformation-based scalar quantization, we select SOTA methods in this field as our primary baselines. Furthermore, we include representative VQ methods in our evaluation. Compared to UniSVQ, these VQ methods have more flexible, unconstrained quantization grids, thereby providing an empirical lower bound for degradation.

All methods were evaluated at the 2-bit level. For the scalar quantization baselines, we selected GPTQ (Frantar et al., [2022](https://arxiv.org/html/2606.10520#bib.bib16)), QuIP (Chee et al., [2023](https://arxiv.org/html/2606.10520#bib.bib2)), SpinQuant (Liu et al., [2025a](https://arxiv.org/html/2606.10520#bib.bib19)), and OSTQuant (Xing et al., [2025](https://arxiv.org/html/2606.10520#bib.bib18)). GPTQ is a classical PTQ strategy that performs quantization and calibration directly on the original weight matrices. QuIP, SpinQuant, and OSTQuant are orthogonal-transformation-based methods. QuIP uses the RHT, and SpinQuant uses learnable orthogonal transformations. OSTQuant introduces the Quantization Space Utilization Rate (QSUR) metric to optimize the quantization grid. For the vector quantization baselines, we selected AQLM (Egiazarian et al., [2024](https://arxiv.org/html/2606.10520#bib.bib3)) and QuIP# (Tseng et al., [2024a](https://arxiv.org/html/2606.10520#bib.bib11)). AQLM is a representative clustering-based VQ method. We use the standard 2×8 configuration, which uses two 8-bit codewords for an 8-dimensional vector. For QuIP#, we use the E8P codebook proposed by the authors without additional residual quantization. To ensure a fair comparison, all baseline methods use the same calibration data as our proposed method.
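To make the "2-bit" budget concrete, the index cost per weight can be worked out from the configurations named above. The sketch below is only arithmetic; the helper `bits_per_weight` is a hypothetical name, and it ignores codebook, scale, and affine-parameter overhead.

```python
def bits_per_weight(num_codes, bits_per_code, vector_dim):
    """Index bits stored per weight, ignoring codebook and scale overhead."""
    return num_codes * bits_per_code / vector_dim

# AQLM 2x8: two 8-bit codewords for each 8-dimensional vector.
aqlm_2x8 = bits_per_weight(num_codes=2, bits_per_code=8, vector_dim=8)
# UniSVQ with d = 4: one 2-bit integer per coordinate, viewed as four 2-bit codes.
unisvq_d4 = bits_per_weight(num_codes=4, bits_per_code=2, vector_dim=4)
print(aqlm_2x8, unisvq_d4)
```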

## 5 Results

### 5.1 Main Results

[Table 1](https://arxiv.org/html/2606.10520#S3.T1) compares the PPL and accuracy of UniSVQ with those of the baseline methods.

Comparison with Scalar Quantization\. As the table shows, UniSVQ outperforms all SQ baselines across various model sizes\. Under the challenging 2\-bit quantization setting, traditional scalar methods experience significant performance degradation\. In particular, the unoptimized GPTQ method often yields extremely high PPL and random results\. In contrast, UniSVQ maintains high accuracy, retaining up to 98% of the full\-precision FP16 model’s performance\.

Furthermore, UniSVQ achieves consistently superior performance compared to OSTQuant and SpinQuant, which also use orthogonal transformations and fine\-tuning, while introducing only 20 additional parameters per weight matrix\. This demonstrates the effectiveness of the flexible quantization grid and the strategy of quantizing a group of weights to minimize quantization error\. Notably, the 2\-bit Qwen\-3\-32B model quantized by UniSVQ outperforms the FP16 Qwen\-3\-4B model in terms of average QA accuracy\. This suggests that, when GPU memory is limited, deploying a larger 2\-bit UniSVQ model is more effective than using a smaller FP16 model\. This cannot be achieved by SQ baselines\.

Comparison with Vector Quantization\. Compared to VQ methods, UniSVQ achieves comparable or even superior accuracy while using a linear\-constrained quantization grid\. Additionally, UniSVQ requires only 1/64 of the codebook storage required by baseline VQ methods\. Specifically, UniSVQ yields higher accuracy than the clustering\-based AQLM\. When compared to QuIP\#, which uses a mathematically optimal E8P grid, UniSVQ achieves comparable results, surpassing QuIP\# on the Qwen\-3\-8B model\. These results suggest that the performance trade\-off introduced by linear constraints is minimal\. QuIP\# uses a highly flexible E8P lattice codebook that minimizes quantization error without structural constraints, at the cost of complex decoding and higher memory traffic\. UniSVQ instead imposes a linear constraint on the codebook, which introduces a modest accuracy trade\-off but enables significantly simpler inference\. Comparison with other VQ baselines is provided in Appendix[A\.2\.3](https://arxiv.org/html/2606.10520#A1.SS2.SSS3)\.

Table 2: Ablation study on the fine-tuning and initialization of the quantization grid for Qwen-3-8B. Disabling fine-tuning noticeably degrades zero-shot QA performance. Additionally, replacing the random orthogonal matrix with the D4 lattice generator matrix during initialization results in an even greater performance drop. These results validate the effectiveness of fine-tuning and the rationale behind the proposed initialization strategy.

| Method | Avg. | Per. |
|---|---|---|
| Proposed | 67.95 | 0.92 |
| w/o fine-tuning | 66.99 | 0.90 |
| w/o orthogonal init | 61.00 | 0.82 |

Table 3: 0-shot QA performance on Llama-3-8B-Instruct. UniSVQ outperforms all scalar quantization baselines and is comparable to other vector quantization methods.

| Type | Method | Wiki ↓ | C4 ↓ | AC ↑ | AE ↑ | BQ ↑ | HS ↑ | PQ ↑ | WG ↑ | Avg. ↑ | Per. ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| N/A | FP16 | 7.72 | 11.39 | 56.97 | 80.93 | 86.61 | 74.94 | 77.80 | 67.88 | 74.12 | 1.00 |
| Scalar | GPTQ | 2.55e6 | 6.78e5 | 25.93 | 25.08 | 46.20 | 26.27 | 51.63 | 49.88 | 37.50 | 0.51 |
| Scalar | QuIP | 79.63 | 82.83 | 23.54 | 28.66 | 44.55 | 34.71 | 51.57 | 49.17 | 38.70 | 0.52 |
| Scalar | SpinQuant | 27.60 | 96.62 | 21.41 | 33.45 | 60.73 | 39.93 | 56.20 | 55.16 | 44.48 | 0.60 |
| Scalar | OSTQuant | 37.35 | 72.33 | 25.26 | 39.90 | 61.99 | 38.59 | 61.21 | 53.28 | 46.71 | 0.63 |
| Vector | QuIP# | 9.42 | 14.28 | 47.44 | 73.27 | 80.21 | 70.70 | 78.13 | 69.53 | 69.72 | 0.94 |
| Vector | AQLM | 27.60 | 96.92 | 32.76 | 53.41 | 78.38 | 63.60 | 67.41 | 63.46 | 59.84 | 0.81 |
| UniSVQ | Proposed | 10.70 | 15.43 | 44.79 | 69.52 | 81.77 | 67.20 | 76.27 | 66.14 | 67.62 | 0.91 |

### 5.2 Ablation Studies

#### 5.2.1 Influence of the Quantization Grid

To isolate the contribution of the linear-constrained codebook from other pipeline components, we evaluate each quantizer independently by measuring the weight-level MSE and SNR on Qwen-3-4B at 2-bit. The results in [Table 4](https://arxiv.org/html/2606.10520#S5.T4) show that UniSVQ achieves substantially lower MSE and a markedly higher SNR than both GPTQ and SpinQuant (by 11.93 dB and 11.54 dB, respectively), directly demonstrating the advantage of the linear-constrained codebook. This result is consistent with the PPL and zero-shot accuracy improvements reported in [Table 1](https://arxiv.org/html/2606.10520#S3.T1).
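For readers who want to reproduce this kind of weight-level measurement, the sketch below computes MSE and SNR for a simple round-to-nearest 2-bit quantizer on synthetic Gaussian weights. The SNR definition $10 \log_{10}(\lVert W \rVert^2 / \lVert W - \hat{W} \rVert^2)$ is an assumption, since this section does not state the formula, and the helper names and the per-tensor RTN grid are illustrative only.

```python
import numpy as np

def mse_and_snr_db(w, w_hat):
    """Return (MSE, SNR in dB) between original weights w and reconstruction w_hat."""
    err = w - w_hat
    mse = float(np.mean(err ** 2))
    snr_db = float(10.0 * np.log10(np.sum(w ** 2) / np.sum(err ** 2)))
    return mse, snr_db

def rtn_2bit(w):
    """Round-to-nearest onto 4 uniform levels spanning [min, max] of the whole tensor."""
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / 3.0
    q = np.clip(np.round((w - lo) / scale), 0, 3)
    return q * scale + lo

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256))       # synthetic stand-in for a weight matrix
mse, snr = mse_and_snr_db(w, rtn_2bit(w))
print(f"RTN 2-bit: MSE={mse:.4f}, SNR={snr:.2f} dB")
```

A negative SNR, as reported for RTN, GPTQ, and SpinQuant in Table 4, means the reconstruction error carries more energy than the weights themselves under this definition.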

Table 4: Quantizer-level isolation on Qwen-3-4B at 2-bit. $H$: Hessian compensation; $R$: Randomized Hadamard Transform; $C$: linear-constrained codebook. "Comp." is the abbreviation of "Components".

| Method | Comp. | MSE ↓ | SNR (dB) ↑ |
|---|---|---|---|
| RTN | none | $9.81 \times 10^{-2}$ | $-12.72$ |
| GPTQ | $H$ | $1.12 \times 10^{-3}$ | $-3.45$ |
| SpinQuant | $H+R$ | $1.06 \times 10^{-3}$ | $-3.06$ |
| UniSVQ | $H+R+C$ | $7.40 \times 10^{-5}$ | $8.48$ |

Table 5: Ablation on the Randomized Hadamard Transform on Llama-3-8B at 2-bit. Removing RHT leads to near-random performance.

| Method | Wiki ↓ | Avg. | Per. |
|---|---|---|---|
| FP16 | 10.04 | 72.88 | 1.00 |
| GPTQ | 19603.88 | 36.99 | 0.51 |
| UniSVQ | 20.04 | 63.42 | 0.87 |
| UniSVQ w/o RHT | 8314.12 | 39.06 | 0.54 |

Table 6: 0-shot QA accuracy for UniSVQ with different $d$ on Qwen-3-8B. Increasing $d$ provides marginal accuracy improvements but requires higher computational complexity.

| $d$ | Fine-Tuning | Cost | Avg. | Per. |
|---|---|---|---|---|
| 4 | w/o | $O(4Nn)$ | 66.99 | 0.90 |
| 8 | w/o | $O(8Nn)$ | 67.34 | 0.91 |
| 4 | w/ | $O(4Nn)$ | 67.95 | 0.92 |

We further analyze the influence of fine-tuning the linear-constrained quantization grid and the soundness of using an orthogonal matrix for initialization. [Table 2](https://arxiv.org/html/2606.10520#S5.T2) shows the average accuracy over the six 0-shot QA tasks for the Qwen-3-8B model.

These results demonstrate that fine\-tuning the affine matrix during quantization can increase the 0\-shot accuracy\. This suggests that optimizing the affine parameters could compensate for performance degradation introduced by the heterogeneity of weight and activation distributions\.

To demonstrate the soundness of using a random orthogonal matrix for initialization, we replace the orthogonal matrix with the generator matrix of the $D_4$ lattice, which has been shown to be mathematically optimal for 4-dimensional VQ (Tseng et al., [2024a](https://arxiv.org/html/2606.10520#bib.bib11)). However, our results show that using the $D_4$ lattice generator actually leads to poorer performance. This degradation is likely due to the linear constraints. Using the orthogonal matrix to map the integer codebook $\{\bar{W} \mid \bar{W}_i \in \{0,1,2,3\}\}$ results in a diagonal covariance matrix of the codewords, namely $\Sigma_{\hat{W}} = s^2 A \Sigma_{\bar{W}} A^T = s^2 \sigma_{\bar{w}}^2 I$. This preserves the isotropic nature of the grid, making it suitable for quantizing the approximately Gaussian and incoherent weights in $W_{\text{RHT}}$. However, when the codewords are generated from the set of integers, the $D_4$ generator matrix results in a non-diagonal covariance matrix, which introduces undesirable correlations. These findings indicate that a random orthogonal matrix is a more suitable initial value for a linear-constrained quantization grid than the mathematically optimal choice.
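The covariance argument can be checked numerically by enumerating all $4^4 = 256$ integer vectors and mapping them through each candidate matrix. The sketch below is illustrative: the helper names are hypothetical, the scale $s$ is taken as 1, and the $D_4$ generator is the one listed in Appendix A.2.1.

```python
import itertools
import numpy as np

def codeword_covariance(transform, levels=(0, 1, 2, 3)):
    """Population covariance of {transform @ z} over all integer vectors z in levels^d."""
    d = transform.shape[0]
    z = np.array(list(itertools.product(levels, repeat=d)), dtype=float)  # (4^d, d)
    codewords = z @ transform.T
    return np.cov(codewords, rowvar=False, bias=True)

def max_off_diagonal(m):
    """Largest absolute off-diagonal entry of a square matrix."""
    return float(np.abs(m - np.diag(np.diag(m))).max())

rng = np.random.default_rng(0)
q, _ = np.linalg.qr(rng.standard_normal((4, 4)))    # an orthogonal matrix
d4 = np.array([[1, -1, 0, 0],
               [0, 1, -1, 0],
               [0, 0, 1, -1],
               [0, 0, 1, 1]], dtype=float)           # D4 generator (Appendix A.2.1)

cov_orth = codeword_covariance(q)
cov_d4 = codeword_covariance(d4)
print(max_off_diagonal(cov_orth), max_off_diagonal(cov_d4))
```

Each integer coordinate has variance $(4^2 - 1)/12 = 1.25$, so the orthogonal map yields $1.25\,I$, while the $D_4$ generator yields $1.25\,AA^T$, whose off-diagonal entries are nonzero.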

#### 5.2.2 Generalization on Model Architecture

[Table 3](https://arxiv.org/html/2606.10520#S5.T3) presents results on Llama-3-8B-Instruct. UniSVQ achieves a WikiText PPL of 10.70 and an average zero-shot accuracy of 67.62%, outperforming all SQ baselines by a large margin. Specifically, all SQ methods (GPTQ, QuIP, SpinQuant, OSTQuant) suffer severe accuracy degradation at 2-bit, while UniSVQ maintains stable performance and also surpasses AQLM among VQ methods, remaining competitive with QuIP#. These results indicate that UniSVQ generalizes reliably across different Transformer-based model families and training paradigms.
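The "Per." column in the tables appears to be the ratio of each method's average accuracy to the FP16 average; this is an inference from the reported numbers rather than a definition given in this section. Under that assumption, the Table 3 values can be reproduced as follows (the helper name is hypothetical).

```python
# Average zero-shot accuracy (Avg.) as reported in Table 3.
avg_acc = {
    "FP16": 74.12, "GPTQ": 37.50, "QuIP": 38.70, "SpinQuant": 44.48,
    "OSTQuant": 46.71, "QuIP#": 69.72, "AQLM": 59.84, "UniSVQ": 67.62,
}

def relative_performance(avg, baseline):
    """Fraction of the FP16 average accuracy that a quantized model retains."""
    return round(avg / baseline, 2)

per = {name: relative_performance(a, avg_acc["FP16"]) for name, a in avg_acc.items()}
print(per)
```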

#### 5.2.3 Influence of Vector Dimension

In our primary experiments, we set $d=4$ to minimize the additional parameters and computational overhead introduced by the affine operations. Theoretically, higher dimensions typically yield lower errors (Savkin et al., [2025](https://arxiv.org/html/2606.10520#bib.bib33)). This is further supported by the findings of Tseng et al. ([2024a](https://arxiv.org/html/2606.10520#bib.bib11)), where the 8-dimensional $E_8$ codebook outperforms the 4-dimensional $D_4$ codebook.

[Table 6](https://arxiv.org/html/2606.10520#S5.T6) shows the performance difference between 4-dimensional and 8-dimensional UniSVQ. To isolate the direct impact of dimensionality, the two dimensions are compared without fine-tuning. We observe that for Qwen-3-8B, increasing the vector dimension provides only marginal performance gains (66.99 to 67.34). Notably, this gain is smaller than the improvement obtained through fine-tuning at $d=4$ (66.99 to 67.95), while the higher dimension also incurs greater computational cost. Consequently, we conclude that $d=4$ offers a better balance between precision and efficiency for the UniSVQ framework.
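The overhead side of this trade-off is simple arithmetic. The sketch below counts the affine parameters ($A \in \mathbb{R}^{d \times d}$ plus $B \in \mathbb{R}^{d}$) and the relative $O(dNn)$ cost from Table 6; the function names are hypothetical.

```python
def affine_param_count(d):
    """Parameters in A (d x d) plus B (d) for one linear layer."""
    return d * d + d

def relative_affine_cost(d, base_d=4):
    """Ratio of the O(d * N * n) cost for dimension d versus base_d."""
    return d / base_d

for d in (4, 8):
    print(d, affine_param_count(d), relative_affine_cost(d))
```

Moving from $d=4$ to $d=8$ therefore raises the per-layer affine parameters from 20 to 72 and doubles the affine cost term.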

#### 5.2.4 Influence of the Randomized Hadamard Transform

Our codebook design assumes weights after the Randomized Hadamard Transform follow an approximately Gaussian distribution\. To evaluate its necessity, we ablate by removing RHT entirely\.

As shown in[Table5](https://arxiv.org/html/2606.10520#S5.T5), without RHT, performance collapses to near\-random levels comparable to uncalibrated GPTQ\. This confirms that RHT is a necessary prerequisite for the linear\-constrained codebook: without it, weight outliers violate the near\-Gaussian assumption underlying the lattice initialization and make the affine parameterization ineffective\.
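To illustrate why the transform matters for outliers, the sketch below applies a randomized Hadamard transform to a vector containing a single large entry. It is a toy, dense-matrix illustration under stated assumptions: the helper names are hypothetical, the Hadamard matrix is built by the Sylvester construction, and the transform is applied to one vector rather than to both sides of a weight matrix with a fast kernel.

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

def randomized_hadamard(x, signs):
    """Apply y = (H / sqrt(n)) @ (signs * x), which is an orthogonal transform."""
    n = x.shape[0]
    return hadamard(n) @ (signs * x) / np.sqrt(n)

rng = np.random.default_rng(0)
n = 64
signs = rng.choice([-1.0, 1.0], size=n)   # random diagonal sign flips
x = np.zeros(n)
x[0] = 8.0                                # a single large outlier
y = randomized_hadamard(x, signs)
print(np.abs(x).max(), np.abs(y).max(), np.linalg.norm(x), np.linalg.norm(y))
```

The norm is preserved while the outlier's energy is spread evenly across all coordinates, which is the behavior the near-Gaussian assumption relies on.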

Table 7: PPL and 0-shot QA accuracy of UniSVQ at different bit-widths on Llama-3-8B.

| Bits | Wiki PPL ↓ | Avg. | Per. |
|---|---|---|---|
| FP16 | 10.04 | 72.88 | 1.00 |
| 2-bit | 20.04 | 63.42 | 0.87 |
| 3-bit | 21.48 | 67.71 | 0.93 |

Table 8: Inference throughput and peak GPU memory (GMem) on Llama-3-8B (single A100, batch size 1, 1024 tokens). UniSVQ can achieve 1.68× faster inference and over 75% GMem reduction compared to the FP16 baseline.

| Model | Throughput (Tok/s) ↑ | Peak GMem (GB) ↓ |
|---|---|---|
| FP16 | 60.38 | 15.72 |
| GPTQ | 130.70 | 4.14 |
| AQLM 2×8 | 70.97 | 4.44 |
| QuIP# | 79.60 | 4.13 |
| UniSVQ | 101.65 | 3.87 |

#### 5.2.5 Bit-Width Extensibility

To demonstrate that UniSVQ generalizes beyond the 2\-bit setting, we evaluate it under a 3\-bit configuration on Llama\-3\-8B\. As shown in[Table7](https://arxiv.org/html/2606.10520#S5.T7), moving from 2\-bit to 3\-bit yields consistent gains across all six zero\-shot tasks, confirming that UniSVQ effectively utilizes additional bit budget\. Furthermore, this scaling requires minimal structural change: the affine parameterization only requires widening the integer vectors, with no modifications to the codebook design\. This is a direct advantage over methods such as QuIP\# and AQLM, where extending to higher bits requires redesigning the codebook structure\.
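The claim that only the integer range changes between bit-widths can be illustrated with a small affine quantize/dequantize pair. This is a sketch under stated assumptions, not the paper's pipeline: it ignores Hessian-based error compensation, uses hypothetical function names, and assumes $A = sG$ with $G$ orthogonal (as at initialization), in which case rounding and clipping in the rotated frame gives the nearest grid point. After fine-tuning, $A$ need not stay orthogonal, and this shortcut would only be approximate.

```python
import numpy as np

def quantize(w, A, B, bits):
    """Nearest grid point when A = s * G with G orthogonal: round and clip in the rotated frame."""
    s = np.linalg.norm(A[:, 0])                  # common column scale of A
    z = (w - B) @ A / (s * s)                    # row-wise G^T (w - B) / s
    return np.clip(np.round(z), 0, 2 ** bits - 1).astype(np.int64)

def dequantize(z, A, B):
    """Affine map from integer vectors back to weights: w_hat = A z + B (row-wise)."""
    return z @ A.T + B

rng = np.random.default_rng(0)
d = 4
G, _ = np.linalg.qr(rng.standard_normal((d, d)))
w = rng.standard_normal((1024, d))               # synthetic weights grouped into 4-vectors

mse_by_bits = {}
for bits in (2, 3):
    levels = 2 ** bits
    s = np.sqrt(12.0 / (levels ** 2 - 1))        # unit-variance scale from Appendix A.1.2
    A = s * G
    B = -A @ np.full(d, (levels - 1) / 2.0)      # offset that centers the grid at zero
    z = quantize(w, A, B, bits)
    mse_by_bits[bits] = float(np.mean((w - dequantize(z, A, B)) ** 2))
    print(bits, A.shape, B.shape, int(z.max()), round(mse_by_bits[bits], 4))
```

The shapes of `A` and `B` are identical at both bit-widths; only the clipping range of the integer vectors differs.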

#### 5.2.6 Inference Speed

[Table 8](https://arxiv.org/html/2606.10520#S5.T8) reports the end-to-end inference throughput and peak GPU memory usage on the Llama-3-8B model. Experiments are conducted on a single NVIDIA A100 GPU, with a batch size of 1 and a generation length of 1024 tokens, averaged over three independent runs. We implement custom fused CUDA kernels for autoregressive generation based on the open-sourced code of Dao-AILab (https://github.com/Dao-AILab/fast-hadamard-transform).

UniSVQ achieves approximately a 1.68× speedup and over 75% reduction in memory usage compared to the FP16 baseline. Furthermore, UniSVQ maintains a lower memory footprint and higher throughput than AQLM and QuIP#. This is possibly attributable to UniSVQ's simple structure and minimal auxiliary parameters. GPTQ achieves the highest throughput (130.70 tok/s). However, the severe accuracy degradation of all SQ methods, including GPTQ, makes them less suitable for such extreme low-bit quantization scenarios.
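The headline ratios follow directly from Table 8. The short script below recomputes the speedup and memory saving for each method relative to FP16; the helper name is hypothetical and the numbers are copied from the table.

```python
# Throughput (tokens/s) and peak GPU memory (GB) as reported in Table 8.
table8 = {
    "FP16": (60.38, 15.72),
    "GPTQ": (130.70, 4.14),
    "AQLM 2x8": (70.97, 4.44),
    "QuIP#": (79.60, 4.13),
    "UniSVQ": (101.65, 3.87),
}

def speedup_and_saving(name, baseline="FP16"):
    """Return (throughput ratio, fractional memory reduction) versus the baseline."""
    tps, mem = table8[name]
    base_tps, base_mem = table8[baseline]
    return tps / base_tps, 1.0 - mem / base_mem

for name in table8:
    speedup, saving = speedup_and_saving(name)
    print(f"{name:9s} speedup={speedup:.2f}x  memory saving={saving:.1%}")
```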

## 6 Conclusion

In this paper, we introduce UniSVQ, a unified representation that bridges scalar and vector quantization\. We demonstrate that the key to linking these methods is the linear\-constrained quantization grid\. With this representation, the dequantization process becomes an affine transformation of integer weights\. This allows the model to maintain an architecture nearly as simple as that of SQ and achieve task performance comparable to VQ\. Furthermore, we use block\-wise fine\-tuning to optimize the quantization grid\. Experiments on multiple 0\-shot QA benchmarks demonstrate that, compared to scalar quantization, UniSVQ introduces only 20 additional parameters per weight matrix yet achieves superior performance\. Compared to vector quantization, UniSVQ achieves comparable accuracy while significantly reducing the number of auxiliary parameters\.

## Limitations

Our current study is limited to weight\-only quantization and does not consider activation or KV\-cache quantization, which are also important components for end\-to\-end inference efficiency in LLMs\.

Moreover, while the linear-constrained quantization grid brings modest overhead, its interaction with highly optimized GEMM kernels remains to be explored.

## Acknowledgements

This work is supported by the National Key Research and Development Program of China \(2024YFB4505603\) and the National Natural Science Foundation of China \(No\. 62576186\)\. This work is also supported by Tsinghua KA Excellence Center\. This work is partially supported by Tsinghua University \(Department of Computer Science and Technology\) \- Sinopec Joint Research Center for Artificial Intelligence\.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## References

- Y. An, X. Zhao, T. Yu, M. Tang, and J. Wang (2025) Systematic outliers in large language models. In Proceedings of the International Conference on Learning Representations.
- S. Ashkboos, A. Mohtashami, M. L. Croci, B. Li, P. Cameron, M. Jaggi, D. Alistarh, T. Hoefler, and J. Hensman (2024) QuaRot: outlier-free 4-bit inference in rotated llms. In Proceedings of the International Conference on Neural Information Processing Systems, pp. 100213–100240.
- M. V. Baalen, A. Kuzmin, M. Nagel, P. Couperus, A. Bolshakov, C. Bastoul, E. Mahurin, T. Blankevoort, and P. Whatmough (2024) GPTVQ: the blessing of dimensionality for llm quantization. In Workshop on Efficient Systems for Foundation Models II at the International Conference on Machine Learning.
- Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi (2020) PIQA: reasoning about physical commonsense in natural language. In The AAAI Conference on Artificial Intelligence, pp. 7432–7439.
- J. Chee, Y. Cai, V. Kuleshov, and C. D. Sa (2023) QuIP: 2-bit quantization of large language models with guarantees. In Proceedings of the International Conference on Neural Information Processing Systems, pp. 4396–4429.
- W. Cheng, W. Zhang, H. Shen, Y. Cai, X. He, L. Kaokao, and Y. Liu (2024) Optimize weight rounding via signed gradient descent for the quantization of llms. In Findings of the Association for Computational Linguistics, pp. 11332–11350.
- C. Clark, K. Lee, M. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova (2019) BoolQ: exploring the surprising difficulty of natural yes/no questions. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2924–2936.
- P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018) Think you have solved question answering? try arc, the ai2 reasoning challenge. CoRR abs/1803.05457.
- V. Egiazarian, A. Panferov, D. Kuznedelev, E. Frantar, A. Babenko, and D. Alistarh (2024) Extreme compression of large language models via additive quantization. In Proceedings of the 41st International Conference on Machine Learning, pp. 12284–12303. ISSN 2640-3498.
- E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh (2022) GPTQ: accurate post-training compression for generative pretrained transformers. CoRR abs/2210.17323.
- E. Frantar, R. L. Castro, J. Chen, T. Hoefler, and D. Alistarh (2025) MARLIN: mixed-precision auto-regressive parallel inference on large language models. In Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pp. 239–251.
- L. Gao, J. Tow, B. Abbasi, S. Biderman, S. Black, A. DiPofi, C. Foster, L. Golding, J. Hsu, A. Le Noac'h, H. Li, K. McDonell, N. Muennighoff, C. Ociepa, J. Phang, L. Reynolds, H. Schoelkopf, A. Skowron, L. Sutawika, E. Tang, A. Thite, B. Wang, K. Wang, and A. Zou (2024) The language model evaluation harness. Zenodo.
- Y. Gu, L. Dong, F. Wei, and M. Huang (2023) MiniLLM: knowledge distillation of large language models. In Proceedings of the International Conference on Learning Representations.
- Z. Hao, J. Guo, L. Shen, Y. Luo, H. Hu, G. Wang, D. Yu, Y. Wen, and D. Tao (2025) Low-precision training of large language models: methods, challenges, and opportunities. CoRR abs/2505.01043.
- W. Huang, Y. Liu, H. Qin, Y. Li, S. Zhang, X. Liu, M. Magno, and X. Qi (2024) BiLLM: pushing the limit of post-training quantization for llms. In Proceedings of the International Conference on Machine Learning, pp. 20023–20042.
- S. Kim, C. Hooper, A. Gholami, Z. Dong, X. Li, S. Shen, M. W. Mahoney, and K. Keutzer (2024) SqueezeLLM: dense-and-sparse quantization. In Proceedings of the International Conference on Machine Learning, pp. 23901–23923.
- T. Kumar, Z. Ankner, B. F. Spector, B. Bordelon, N. Muennighoff, M. Paul, C. Pehlevan, C. Ré, and A. Raghunathan (2025) Scaling laws for precision. In Proceedings of the International Conference on Learning Representations.
- X. Li, O. A. Hanna, C. Fragouli, and S. N. Diggavi (2025) ICQuant: index coding enables low-bit llm quantization. CoRR abs/2505.00850.
- J. Lin, J. Tang, H. Tang, S. Yang, W. Chen, W. Wang, G. Xiao, X. Dang, C. Gan, and S. Han (2024) AWQ: activation-aware weight quantization for on-device llm compression and acceleration. In Proceedings of Machine Learning and Systems, pp. 87–100.
- Z. Liu, C. Zhao, I. Fedorov, B. Soran, D. Choudhary, R. Krishnamoorthi, V. Chandra, Y. Tian, and T. Blankevoort (2025a) SpinQuant: llm quantization with learned rotations. CoRR abs/2405.16406.
- Z. Liu, C. Zhao, H. Huang, S. Chen, J. Zhang, J. Zhao, S. Roy, L. Jin, Y. Xiong, Y. Shi, L. Xiao, Y. Tian, B. Soran, R. Krishnamoorthi, T. Blankevoort, and V. Chandra (2025b) ParetoQ: improving scaling laws in extremely low-bit llm quantization. In Proceedings of the International Conference on Neural Information Processing Systems.
- Llama Team (2024) The llama 3 herd of models. CoRR abs/2407.21783.
- X. Ma, G. Fang, and X. Wang (2023) LLM-Pruner: on the structural pruning of large language models. In Proceedings of the International Conference on Neural Information Processing Systems, pp. 21702–21720.
- S. Merity, C. Xiong, J. Bradbury, and R. Socher (2016) Pointer sentinel mixture models. CoRR abs/1609.07843.
- W. Qinsi, J. Ke, M. Tomizuka, K. Keutzer, and C. Xu (2024) Dobi-SVD: differentiable svd for llm compression and some new perspectives. In Proceedings of the International Conference on Learning Representations.
- C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 21(140), pp. 1–67.
- K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi (2021) WinoGrande: an adversarial winograd schema challenge at scale. Communications of the ACM 64(9), pp. 99–106.
- S. Savkin, E. Porat, O. Ordentlich, and Y. Polyanskiy (2025) NestQuant: nested lattice quantization for matrix products and llms. In Proceedings of the International Conference on Machine Learning.
- W. Shao, M. Chen, Z. Zhang, P. Xu, L. Zhao, Z. Li, K. Zhang, P. Gao, Y. Qiao, and P. Luo (2024) OmniQuant: omnidirectionally calibrated quantization for large language models. In Proceedings of the International Conference on Learning Representations.
- M. Sun, Z. Liu, A. Bair, and J. Z. Kolter (2023) A simple and effective pruning approach for large language models. In Proceedings of the International Conference on Learning Representations.
- Y. Sun, R. Liu, H. Bai, H. Bao, K. Zhao, Y. Li, J. Hu, X. Yu, L. Hou, C. Yuan, X. Jiang, W. Liu, and J. Yao (2025) FlatQuant: flatness matters for llm quantization. In Proceedings of the International Conference on Machine Learning.
- A. Tseng, J. Chee, Q. Sun, V. Kuleshov, and C. D. Sa (2024a) QuIP#: even better llm quantization with hadamard incoherence and lattice codebooks. In Proceedings of the International Conference on Machine Learning.
- A. Tseng, Q. Sun, D. Hou, and C. De Sa (2024b) QTIP: quantization with trellises and incoherence processing. In Proceedings of the International Conference on Neural Information Processing Systems, pp. 59597–59620.
- W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, and M. Zhou (2020) MiniLM: deep self-attention distillation for task-agnostic compression of pre-trained transformers. In Proceedings of the International Conference on Neural Information Processing Systems, pp. 5776–5788.
- X. Wang, S. Alam, Z. Wan, H. Shen, and M. Zhang (2025) SVD-LLM V2: optimizing singular value truncation for large language model compression. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4287–4296.
- M. Weber, D. Y. Fu, Q. Anthony, Y. Oren, S. Adams, A. Alexandrov, X. Lyu, H. Nguyen, X. Yao, V. Adams, B. Athiwaratkun, R. Chalamala, K. Chen, M. Ryabinin, T. Dao, P. Liang, C. Ré, I. Rish, and C. Zhang (2024) RedPajama: an open dataset for training large language models. In Proceedings of the International Conference on Neural Information Processing Systems, pp. 116462–116492.
- H. Xing, C. Yuan, Y. Dawei, C. Zhixuan, X. Zukang, Y. Jiangyong, X. Chen, Y. Zhihang, J. Zhe, and Z. Sifan (2025) OSTQuant: refining large language model quantization with orthogonal and scaling transformations for better distribution fitting. In The Thirteenth International Conference on Learning Representations.
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. Li, T. Tang, W. Yin, X. Ren, X. Wang, X. Zhang, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Zhang, Y. Wan, Y. Liu, Z. Wang, Z. Cui, Z. Zhang, Z. Zhou, and Z. Qiu (2025) Qwen3 technical report. CoRR abs/2505.09388.
- Z. Yuan, Y. Shang, and Z. Dong (2024) PB-LLM: partially binarized large language models. In Proceedings of the International Conference on Learning Representations.
- R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi (2019) HellaSwag: can a machine really finish your sentence?. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4791–4800.
- Z. Zhou, X. Li, M. Li, H. Zhang, H. Wang, W. Chang, Y. Liu, Q. Dang, D. Yu, Y. Ma, and H. Wang (2025) CCQ: convolutional code for extreme low-bit quantization in llms. CoRR abs/2507.07145.

## Appendix A

### A.1 The initialization of $C_i$

We derive the mean and covariance of $C_i$ and argue that its elements are approximately Gaussian, which ensures that the resulting quantization grid is a reasonable initialization for a standard-Gaussian-distributed weight $W_{\text{RHT}}$.

#### A.1.1 Mean of $C_i$

Assuming $\bar{W}_i^j$ is uniformly distributed over $\{0, 1, \dots, 2^b - 1\}$, its expectation is $E[\bar{W}_i^j] = (2^b - 1)/2 = b$ (the symbol $b$ is reused here: in the exponent it is the bit-width, while on the right-hand side it denotes the centering offset). Thus, $E[\bar{W}_i] = b\mathbf{1}$. As a result, we have:

$$E[C_i] = sG\,E[\bar{W}_i] - sGb\mathbf{1} = sGb\mathbf{1} - sGb\mathbf{1} = 0$$

The grid is thus centered at zero.

#### A.1.2 Covariance of $C_i$

The variance of the discrete uniform variable $\bar{W}_i^j$ is $\text{Var}(\bar{W}_i^j) = (2^{2b} - 1)/12$. Let $\sigma_{\bar{W}}^2$ denote this variance; then, treating the components as independent, the covariance matrix of $\bar{W}_i$ is $\Sigma_{\bar{W}} = \sigma_{\bar{W}}^2 I$. The covariance of $C_i$ is:

$$\Sigma_{C_i} = \text{Cov}(sG\bar{W}_i) = s^2 G \Sigma_{\bar{W}} G^T = s^2 \sigma_{\bar{W}}^2 G G^T$$

Since $G$ is an orthogonal matrix, $GG^T = I$, so $\Sigma_{C_i} = s^2 \sigma_{\bar{W}}^2 I$. Choosing

$$s = 1/\sigma_{\bar{W}} = \sqrt{\frac{12}{2^{2b} - 1}}$$

therefore gives $\Sigma_{C_i} = I$.
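As a quick numerical check of the two results above, the sketch below computes the centering offset and scale for a few bit-widths and verifies by brute force that a single scaled, centered coordinate has zero mean and unit variance. The function names are hypothetical; the orthogonal matrix $G$ is omitted because it preserves an identity covariance.

```python
import math

def grid_offset_and_scale(bits):
    """Centering offset (2^b - 1) / 2 and scale sqrt(12 / (2^(2b) - 1)) for a b-bit grid."""
    levels = 2 ** bits
    offset = (levels - 1) / 2.0
    scale = math.sqrt(12.0 / (levels ** 2 - 1))
    return offset, scale

def grid_mean_and_variance(bits):
    """Brute-force mean and variance of scale * (level - offset) over all levels."""
    offset, scale = grid_offset_and_scale(bits)
    values = [scale * (k - offset) for k in range(2 ** bits)]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var

for bits in (2, 3, 4):
    print(bits, grid_offset_and_scale(bits), grid_mean_and_variance(bits))
```

At 2 bits the offset is 1.5 and the scale is $\sqrt{12/15} \approx 0.894$.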

#### A\.1\.3Approximate Gaussianity ofCiC\_\{i\}

According to results on the Haar measure on orthogonal groups, the entries of a large $n \times n$ random orthogonal matrix $\mathbf{G}$ are approximately Gaussian with mean $0$ and variance $1/n$. Let $\mathbf{g} = \mathbf{G}\bar{\mathbf{W}}_i$ be the product of the matrix and a fixed integer vector. The $k$-th element of $\mathbf{g}$ is given by the inner product:

$$g_k = \sum_{j=1}^{n} G_{kj} \bar{W}_i^j$$

Since each $G_{kj}$ is approximately Gaussian and the sum is a linear combination of these entries, the resulting element $g_k$ is also approximately Gaussian.
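The argument above can be probed empirically. The sketch below samples one orthogonal matrix, multiplies it by a fixed 2-bit integer vector, and reports the second moments and the excess kurtosis of the result. The sampler `haar_orthogonal` (QR of a Gaussian matrix with a sign correction), the dimension `n = 256`, and the seed are assumptions chosen for illustration; the paper does not specify how $\mathbf{G}$ is drawn.

```python
import numpy as np


def haar_orthogonal(n, rng):
    """Draw an orthogonal matrix via QR with sign correction (assumed sampler)."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))


rng = np.random.default_rng(0)
n = 256
G = haar_orthogonal(n, rng)

# A fixed integer vector with entries in {0, 1, 2, 3}, standing in for W_bar_i.
w_bar = rng.integers(0, 4, size=n).astype(float)
g = G @ w_bar

# For any orthogonal G the squared entries sum to n, so their mean is exactly 1/n.
entry_second_moment = float(np.mean(G ** 2))
# Orthogonal maps preserve norms, so mean(g^2) equals ||w_bar||^2 / n.
g_second_moment = float(np.mean(g ** 2))

# Excess kurtosis is 0 for a Gaussian; values near 0 suggest a Gaussian-like shape.
z = (g - g.mean()) / g.std()
excess_kurtosis = float(np.mean(z ** 4) - 3.0)

print("mean(G^2) * n:", entry_second_moment * n)
print("mean(g^2) / (||w||^2 / n):", g_second_moment / (w_bar @ w_bar / n))
print("excess kurtosis of g:", excess_kurtosis)
```

The two second-moment identities hold exactly for any orthogonal matrix. The kurtosis is a sample statistic that varies with the seed and with $n$, so it is an informal diagnostic and not a proof of Gaussianity.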

### A.2 Detailed Results of the Ablation Studies

In this section, we present the detailed results of the ablation experiments on the six selected evaluation sets, covering fine-tuning, orthogonal initialization, and vector dimension, as described in [Section 5.2](https://arxiv.org/html/2606.10520#S5.SS2).

#### A.2.1 The Effectiveness of Fine-tuning

[Table 9](https://arxiv.org/html/2606.10520#A1.T9) presents the detailed accuracy on the selected 0-shot QA datasets. The results show that block-wise fine-tuning of the quantization grid improves performance consistently across all tasks, confirming the effectiveness of our optimization strategy. Furthermore, in [Section 5.2.1](https://arxiv.org/html/2606.10520#S5.SS2.SSS1), we replace the random orthogonal initialization with the generator matrix of $D_4$, which is defined below:

$$A = \begin{bmatrix} 1 & -1 & 0 & 0 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 1 & -1 \\ 0 & 0 & 1 & 1 \end{bmatrix}$$

This results in consistent performance degradation, particularly on more challenging benchmarks such as ARC\-Challenge\. These findings further validate that the random orthogonal matrix provides a robust initialization for UniSVQ\.
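To make the contrast with an orthogonal initialization concrete, the following sketch inspects the matrix $A$ given above. It treats the rows of $A$ as lattice basis vectors, which is the usual convention for generator matrices but an assumption about how the paper applies $A$. It is an illustration of one structural difference, not the authors' analysis of the accuracy gap.

```python
import itertools
import numpy as np

# Generator matrix of D4 as given in the text.
A = np.array([
    [1, -1, 0, 0],
    [0, 1, -1, 0],
    [0, 0, 1, -1],
    [0, 0, 1, 1],
])

gram_rows = A @ A.T          # would equal the identity if A were orthogonal
gram_cols = A.T @ A          # the same check under the column convention
det = float(np.linalg.det(A))

# Integer combinations of the rows of A, with coefficients in {0, 1, 2, 3}.
coeffs = np.array(list(itertools.product(range(4), repeat=4)))
points = coeffs @ A
# Every D4 lattice point has an even coordinate sum.
all_even = bool(np.all(points.sum(axis=1) % 2 == 0))

print("A A^T =\n", gram_rows)
print("det(A) =", round(det, 6))
print("all coordinate sums even:", all_even)
```

Neither Gram matrix is the identity, so substituting $A$ for $G$ in the covariance formula of A.1.2 would no longer give an isotropic grid. Whether this accounts for the accuracy drop reported in Table 9 is not established here.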

Table 9: Detailed 0-shot QA accuracy of the baselines and of UniSVQ variants without fine-tuning or without orthogonal initialization. Removing either step results in consistent performance degradation.

| Quantization Type | Method | AC | AE | BQ | HS | PQ | WG | Avg. | Per. |
|---|---|---|---|---|---|---|---|---|---|
| N/A | FP16 | 56.40 | 80.89 | 86.64 | 74.96 | 77.48 | 68.35 | 74.12 | 1.00 |
| Scalar | GPTQ | 26.79 | 25.67 | 42.87 | 25.84 | 52.50 | 50.04 | 37.29 | 0.50 |
| Scalar | Quip | 24.91 | 31.69 | 54.89 | 41.41 | 59.90 | 50.12 | 43.82 | 0.59 |
| Scalar | SpinQuant | 30.46 | 48.40 | 67.06 | 47.94 | 63.17 | 58.64 | 52.61 | 0.71 |
| Scalar | OSTQuant | 34.64 | 60.02 | 73.21 | 50.29 | 68.50 | 58.17 | 57.47 | 0.78 |
| Vector | Quip# | 46.50 | 68.43 | 83.12 | 66.62 | 74.32 | 66.30 | 67.55 | 0.91 |
| Vector | AQLM 2×8 | 45.22 | 72.31 | 73.73 | 60.99 | 73.07 | 64.33 | 64.94 | 0.88 |
| UniSVQ | Proposed | 45.82 | 72.35 | 85.07 | 63.18 | 74.16 | 67.09 | 67.95 | 0.92 |
| UniSVQ | w/o Tuning | 45.56 | 69.36 | 83.85 | 62.94 | 73.94 | 66.30 | 66.99 | 0.90 |
| UniSVQ | w/o Orthogonal Init | 36.52 | 67.55 | 74.19 | 56.05 | 71.12 | 60.54 | 61.00 | 0.82 |

#### A.2.2 The Influence of Vector Dimension $d$

[Table 10](https://arxiv.org/html/2606.10520#A1.T10) shows the detailed accuracy on the selected 0-shot QA datasets for different vector dimensions $d$. Using a larger $d$ yields a small improvement in average accuracy, but the effect is not consistent across tasks and is weaker than that of fine-tuning. Considering that the additional computational complexity is $O(Nnd)$, this further indicates that $d = 4$ is the more reasonable choice.

Table 10: Detailed 0-shot QA accuracy of UniSVQ with different quantization dimensions $d$. Using a larger $d$ results in inconsistent improvement.

| $d$ | Fine-Tuning | AC | AE | BQ | HS | PQ | WG | Avg. | Per. |
|---|---|---|---|---|---|---|---|---|---|
| 4 | w/o | 45.56 | 69.36 | 83.85 | 62.94 | 73.94 | 66.30 | 66.99 | 0.90 |
| 8 | w/o | 46.07 | 72.94 | 80.86 | 62.65 | 73.07 | 68.43 | 67.34 | 0.91 |
| 4 | w | 45.82 | 72.35 | 85.07 | 63.18 | 74.16 | 67.09 | 67.95 | 0.92 |

#### A.2.3 Comparison with SOTA VQ Baselines

UniSVQ’s goal is not to maximize quantization precision at the cost of complexity\. Instead, it targets a practical trade\-off: the affine parameterization preserves compatibility with scalar quantization matmul kernels, avoiding the decompression overhead typical of advanced VQ methods\.
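One way to see why an affine parameterization can stay compatible with scalar-quantization kernels is sketched below. Assuming each $d$-dimensional weight block is reconstructed as $s\,G(\bar{W}_i - b\mathbf{1})$, the dot product with an activation block $x_i$ equals $s\,(G^T x_i)^T(\bar{W}_i - b\mathbf{1})$: the activations are rotated once, and the remaining work is an integer-weight dot product with a scale and a zero-point. The block layout, variable names, and sizes here are hypothetical; the paper's actual kernel may be organized differently.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_blocks, bits = 4, 8, 2
levels = 2 ** bits
offset = (levels - 1) / 2.0                      # zero-point b
s = np.sqrt(12.0 / (levels ** 2 - 1))            # scale s

# Random orthogonal G (assumed sampler: QR of a Gaussian matrix).
q, r = np.linalg.qr(rng.standard_normal((d, d)))
G = q * np.sign(np.diag(r))

codes = rng.integers(0, levels, size=(n_blocks, d))   # integer weights, one row per block
x = rng.standard_normal((n_blocks, d))                # activations split into matching blocks

# Path 1: dequantize every block to floating point, then take the dot product.
w_hat = s * (codes - offset) @ G.T                    # row i is s * G @ (codes[i] - offset)
y_dequant = float(np.sum(x * w_hat))

# Path 2: rotate the activations, then use the integer codes directly.
x_rot = x @ G                                         # row i is G^T @ x[i]
int_dot = float(np.sum(x_rot * codes))                # dot product against integer weights
y_integer = float(s * (int_dot - offset * np.sum(x_rot)))

print(y_dequant, y_integer)
```

Both paths return the same value up to floating-point error, which is the sense in which the weights never need to be decompressed through a codebook lookup in this sketch.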

To further situate UniSVQ among more advanced VQ baselines, we include Qtip (Tseng et al., [2024b](https://arxiv.org/html/2606.10520#bib.bib31)) as an additional comparison on Qwen-3-8B at 2-bit quantization. As shown in [Table 11](https://arxiv.org/html/2606.10520#A1.T11), Qtip achieves the highest accuracy among all VQ methods, likely due to its trellis-coded codebook, while UniSVQ remains competitive and retains kernel-level compatibility.

Table 11: Comparison with SOTA VQ methods on Qwen-3-8B at 2-bit. Per. denotes performance relative to FP16.

| Quantization Type | Method | Wiki↓ | C4↓ | AC | AE | BQ | HS | PQ | WG | Avg. | Per. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| N/A | FP16 | 9.72 | 15.42 | 56.40 | 80.89 | 86.64 | 74.96 | 77.48 | 68.35 | 74.12 | 1.00 |
| Scalar | GPTQ | 46810.97 | 16797.36 | 26.79 | 25.67 | 42.87 | 25.84 | 52.50 | 50.04 | 37.29 | 0.50 |
| Scalar | Quip | 27.61 | 34.42 | 24.91 | 31.69 | 54.89 | 41.41 | 59.90 | 50.12 | 43.82 | 0.59 |
| Scalar | SpinQuant | 17.82 | 37.28 | 30.46 | 48.40 | 67.06 | 47.94 | 63.17 | 58.64 | 52.61 | 0.71 |
| Scalar | OSTQuant | 26.08 | 39.71 | 34.64 | 60.02 | 73.21 | 50.29 | 68.50 | 58.17 | 57.47 | 0.78 |
| Vector | Quip# | 12.37 | 18.45 | 46.50 | 68.43 | 83.12 | 66.62 | 74.32 | 66.30 | 67.55 | 0.91 |
| Vector | AQLM 2×8 | 18.26 | 20.73 | 45.22 | 72.31 | 73.73 | 60.99 | 73.07 | 64.33 | 64.94 | 0.88 |
| Vector | Qtip | 11.55 | 17.59 | 52.05 | 77.06 | 84.95 | 68.14 | 77.04 | 67.25 | 71.08 | 0.96 |
| UniSVQ | Proposed | 14.82 | 19.96 | 45.82 | 72.35 | 85.07 | 63.18 | 74.16 | 67.09 | 67.95 | 0.92 |
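The Avg. and Per. columns can be recomputed from the six task accuracies. The short script below does this for the vector-quantization rows and UniSVQ, under the assumption that Avg. is the unweighted mean of the six tasks and Per. is that mean divided by the FP16 mean. The caption only states that Per. is relative to FP16, so the exact formula is an inference that happens to match the reported numbers.

```python
# Task accuracies (AC, AE, BQ, HS, PQ, WG) copied from Table 11.
fp16 = [56.40, 80.89, 86.64, 74.96, 77.48, 68.35]

# name -> (task accuracies, reported Avg., reported Per.)
reported = {
    "Quip#": ([46.50, 68.43, 83.12, 66.62, 74.32, 66.30], 67.55, 0.91),
    "AQLM 2x8": ([45.22, 72.31, 73.73, 60.99, 73.07, 64.33], 64.94, 0.88),
    "Qtip": ([52.05, 77.06, 84.95, 68.14, 77.04, 67.25], 71.08, 0.96),
    "UniSVQ": ([45.82, 72.35, 85.07, 63.18, 74.16, 67.09], 67.95, 0.92),
}


def average(scores):
    """Unweighted mean of a list of accuracies."""
    return sum(scores) / len(scores)


fp16_avg = average(fp16)

# name -> (recomputed Avg., recomputed Per.)
recomputed = {
    name: (average(scores), average(scores) / fp16_avg)
    for name, (scores, _, _) in reported.items()
}

for name, (avg, per) in recomputed.items():
    print(f"{name:10s} avg={avg:.2f} per={per:.2f}")
```

The recomputed values agree with the table to within rounding of the last printed digit.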
