LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization

arXiv cs.CL 06/10/26, 04:00 AM Papers
Summary
Proposes LC-QAT, a 2-bit weight-only vector quantization aware training framework for LLMs that uses a learned affine mapping to enable end-to-end training, achieving state-of-the-art results with only 0.1%-10% of training data.
arXiv:2606.10531v1 Announce Type: new
# LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization
Source: [https://arxiv.org/html/2606.10531](https://arxiv.org/html/2606.10531)
Xingyu Yu, Haiyan Zhao†, Fengxiang Wang, Xu Han†

###### Abstract

Quantization-aware training (QAT) is essential for extremely low-bit large language models (LLMs). Current QAT methods are mainly based on scalar quantization (SQ), which enables efficient optimization but suffers from severe performance degradation at 2-bit precision. On the other hand, vector quantization (VQ) provides substantially higher representational capacity, but its discrete codebook lookup prevents end-to-end training. We propose LC-QAT, a 2-bit weight-only VQ-QAT framework that represents quantized weights via a learned affine mapping over discrete vectors, which yields a high-quality PTQ initialization and enables fully differentiable end-to-end optimization without explicit codebook lookup in the training forward pass. This strong post-training initialization makes LC-QAT highly data-efficient. Experiments across diverse LLMs demonstrate that LC-QAT consistently outperforms state-of-the-art QAT methods while using only 0.1%–10% of the training data. Our results establish LC-QAT as a practical and scalable solution for extreme low-bit model deployment.

Machine Learning, ICML

## 1 Introduction

Large language models (LLMs) have achieved remarkable success across various tasks, yet their substantial memory and computational requirements pose challenges for deployment on resource-constrained devices. Model quantization (Frantar et al., [2022](https://arxiv.org/html/2606.10531#bib.bib25); Egiazarian et al., [2024](https://arxiv.org/html/2606.10531#bib.bib11)) has therefore become a key technique for enabling efficient inference, especially in extremely low-bit regimes such as 1–2 bits (Hao et al., [2025](https://arxiv.org/html/2606.10531#bib.bib26); Chee et al., [2023](https://arxiv.org/html/2606.10531#bib.bib10); Baalen et al., [2024](https://arxiv.org/html/2606.10531#bib.bib9); Zhou et al., [2025](https://arxiv.org/html/2606.10531#bib.bib41); Tseng et al., [2024b](https://arxiv.org/html/2606.10531#bib.bib40)).

Existing quantization methods are commonly divided into Post-Training Quantization (PTQ) (Frantar et al., [2022](https://arxiv.org/html/2606.10531#bib.bib25); Egiazarian et al., [2024](https://arxiv.org/html/2606.10531#bib.bib11)) and Quantization-Aware Training (QAT) (Ma et al., [2025](https://arxiv.org/html/2606.10531#bib.bib53); Liu et al., [2023](https://arxiv.org/html/2606.10531#bib.bib55)). In the aggressive 2-bit setting, QAT consistently outperforms PTQ by adapting model parameters to compensate for quantization errors (Liu et al., [2025](https://arxiv.org/html/2606.10531#bib.bib15)). However, most existing QAT frameworks rely on scalar quantization (Ma et al., [2025](https://arxiv.org/html/2606.10531#bib.bib53); Egiazarian et al., [2024](https://arxiv.org/html/2606.10531#bib.bib11)). These SQ-based QAT methods typically quantize and dequantize each weight independently so that the quantization error is present during training. Although easy to optimize, SQ-based QAT suffers from severe information loss at ultra-low precision, leading to weak initializations and a heavy dependence on massive training data for recovery.

In contrast, vector quantization (VQ) represents groups of weights using entries from a shared codebook and offers substantially stronger representational capacity under 2-bit constraints. By assigning each group of weights to one of the possible codewords, VQ-based methods preserve significantly more information than SQ and achieve much higher post-quantization accuracy (Egiazarian et al., [2024](https://arxiv.org/html/2606.10531#bib.bib11); Baalen et al., [2024](https://arxiv.org/html/2606.10531#bib.bib9); Zhou et al., [2025](https://arxiv.org/html/2606.10531#bib.bib41)). However, incorporating VQ into end-to-end QAT remains highly challenging. The core difficulty lies in the time-consuming nearest-neighbor search during quantization and the discrete codebook lookup during dequantization. Existing attempts to address this issue rely on expensive coordinate descent or beam search procedures (Malinovskii et al., [2024](https://arxiv.org/html/2606.10531#bib.bib54)), resulting in inefficient and unsynchronized updates of model parameters.

In this work, we propose LC-QAT, a novel end-to-end vector quantization-aware training framework for 2-bit weight-only quantization that overcomes this fundamental limitation. Our key idea is to replace unconstrained codebooks with a linear-constrained parameterization. As shown in [Figure 1](https://arxiv.org/html/2606.10531#S1.F1), each codeword in a weight matrix is generated by applying a shared linear mapping to a discrete quaternary vector. This reformulation transforms discrete index selection into simple rounding and clamping operations followed by linear projection, enabling gradients to propagate through the quantization process without explicit index search. As a result, LC-QAT makes vector-quantized weights trainable under standard backpropagation.

![Refer to caption](https://arxiv.org/html/2606.10531v1/x1.png)

Figure 1: LC-QAT training pipeline with a linear-constrained parameterization. By replacing discrete codebook lookup with an SQ-style round/clip discretization followed by an affine projection, LC-QAT makes VQ-QAT lookup-free in the forward pass and compatible with standard end-to-end backpropagation.

Building upon a high-quality 2-bit initialization, LC-QAT enables efficient fine-tuning while significantly reducing the data requirements. Experiments on multiple large-scale LLMs and benchmarks demonstrate that our method matches or surpasses state-of-the-art SQ- and VQ-based QAT approaches using only 0.1%–10% of training data.

In summary, our main contributions are as follows:

- We propose a lookup-free parameterization for 2-bit VQ-QAT that removes explicit codebook lookup in training and enables end-to-end optimization with standard backpropagation.
- We provide empirical evidence that LC-QAT starts in a substantially more favorable optimization region than SQ-based 2-bit QAT, which helps explain its improved trainability and data efficiency.
- Extensive experiments show that LC-QAT achieves higher final accuracy than prior VQ-QAT methods and substantially better data efficiency than SQ-QAT methods, with consistent gains as the training budget increases across diverse benchmarks.

## 2 Preliminaries and Initialization

In this section, we first introduce the formulation of vector quantization for neural network weights, and then describe the proposed linear-constrained codebook and the initialization procedure adopted in LC-QAT. Finally, we present a preliminary analysis of the resulting optimization landscape.

### 2.1 Vector Quantization for Neural Network Weights

Consider a weight matrix $W \in \mathbb{R}^{m \times n}$ in a linear layer. In vector quantization, the matrix is partitioned into groups of size $d$ along a predefined dimension $n$. Let $w_g \in \mathbb{R}^{d}$ denote the $g$-th weight group. A shared codebook

$$C = \{c_1, c_2, \dots, c_K\}, \quad c_k \in \mathbb{R}^{d},$$

is used to represent all groups, where $K$ denotes the number of codewords.

For each group $w_g$, VQ assigns an index by solving the nearest-neighbor problem

$$i_g = \arg\min_{k \in \{1, \dots, K\}} \|w_g - c_k\|_2^2, \tag{1}$$

and the quantized weight is given by

$$\hat{w}_g = c_{i_g}. \tag{2}$$

In the 2-bit setting considered in this work, each dimension takes 4 discrete values, resulting in $K = 4^{d}$ possible codewords. Compared with scalar quantization, which restricts each weight to only 4 values, VQ provides substantially higher representational capacity.

Despite its strong expressive power, the discrete assignment in [Equation 1](https://arxiv.org/html/2606.10531#S2.E1) is inherently non-differentiable with respect to the index $i_g$. As a result, standard gradient-based optimization cannot directly update the codebook indices during training, posing a fundamental obstacle to end-to-end quantization-aware training.
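As an illustration of Equations 1 and 2, the following minimal sketch performs the nearest-neighbor assignment for a single weight group in plain Python. The function name `nearest_codeword` and the toy grid codebook are assumptions made for this example and are not taken from the paper.

```python
import itertools

def nearest_codeword(w_g, codebook):
    """Return (index, codeword) minimizing the squared L2 distance to w_g."""
    def sq_dist(c):
        return sum((wi - ci) ** 2 for wi, ci in zip(w_g, c))
    i_g = min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))
    return i_g, codebook[i_g]

# Toy codebook (an assumption for illustration): every point of the grid
# {0, 1, 2, 3}^d with d = 2, which gives K = 4^d = 16 codewords.
d = 2
codebook = [tuple(float(v) for v in z)
            for z in itertools.product(range(4), repeat=d)]

i_g, w_hat = nearest_codeword((0.4, 2.7), codebook)
print(len(codebook), i_g, w_hat)
```

The `min` over indices is the discrete step that has no useful gradient: a small change in `w_g` either leaves `i_g` unchanged or makes it jump to another codeword.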

### 2.2 Initialization with the Linear-Constrained Codebook

To overcome the non-differentiability of conventional codebook lookup, we extend the PTQ method in (Wang et al., [2026](https://arxiv.org/html/2606.10531#bib.bib68)) to QAT and parameterize codewords using a linear transformation that is shared within a weight matrix.

Specifically, each codeword is generated as

$$c = Az + B, \tag{3}$$

where $z \in \{0,1,2,3\}^{d}$ is a discrete quaternary vector, and $A \in \mathbb{R}^{d \times d}$ and $B \in \mathbb{R}^{d}$ are floating-point parameters.

This formulation defines a structured codebook in which all codewords lie in an affine subspace determined by $A$ and $B$. By enumerating all possible values of $z$, the resulting codebook implicitly contains $4^{d}$ entries.

Compared with unconstrained codebooks, the proposed linear-constrained parameterization eliminates the need for nearest-neighbor search. Instead, the discrete vector $z$ can be obtained through rounding and clamping operations, and gradients can propagate through the linear mapping in [Equation 3](https://arxiv.org/html/2606.10531#S2.E3). The main difference between the proposed linear-constrained method and lattice-based methods, such as QuIP# (Tseng et al., [2024a](https://arxiv.org/html/2606.10531#bib.bib20)) and NestQuant (Savkin et al., [2025](https://arxiv.org/html/2606.10531#bib.bib42)), is that the former generates all codewords from a single linear transformation, whereas the padding codewords in QuIP# and the nested codebooks in NestQuant do not satisfy this condition. Lattice-based methods offer more flexibility, but they require codebook lookup.

In [Equation 3](https://arxiv.org/html/2606.10531#S2.E3), $A$ and $B$ are initialized as follows:

$$A = sG, \quad B = -sG\mu\mathbf{1}, \tag{4}$$

where $\mathbf{1}$ denotes an all-ones vector, $G \in \mathbb{R}^{d \times d}$ is a random orthogonal matrix, $\mu = (2^{b} - 1)/2$ is a centering constant, and $s = \sqrt{12/(2^{2b} - 1)}$ is a scaling factor. This initialization ensures that the codebook has approximately zero mean, unit variance, and weak inter-dimensional correlation.
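The statistics claimed for this initialization can be checked numerically. The sketch below enumerates all $4^{d}$ codewords for $b = 2$ and $d = 4$ and computes their mean and covariance, under the idealized assumption that every quaternary vector is equally likely. A normalized Hadamard matrix stands in for the random orthogonal $G$; that substitution, and the helper name `decode`, are assumptions of this example.

```python
import itertools
import math

b, d = 2, 4                                # bit-width and group size
mu = (2 ** b - 1) / 2                      # centering constant (1.5)
s = math.sqrt(12 / (2 ** (2 * b) - 1))     # scaling factor sqrt(12 / 15)

# Assumption: a normalized 4x4 Hadamard matrix replaces the random
# orthogonal G; any orthogonal matrix gives the same statistics.
H = [[1, 1, 1, 1], [1, -1, 1, -1], [1, 1, -1, -1], [1, -1, -1, 1]]
G = [[h / 2 for h in row] for row in H]

A = [[s * g for g in row] for row in G]    # A = s G
B = [-s * mu * sum(row) for row in G]      # B = -s G mu 1

def decode(z):
    """Codeword A z + B for a quaternary vector z."""
    return [sum(A[i][j] * z[j] for j in range(d)) + B[i] for i in range(d)]

codebook = [decode(z) for z in itertools.product(range(2 ** b), repeat=d)]
K = len(codebook)
mean = [sum(c[i] for c in codebook) / K for i in range(d)]
cov = [[sum(c[i] * c[j] for c in codebook) / K - mean[i] * mean[j]
        for j in range(d)] for i in range(d)]
print(K, mean, [cov[i][i] for i in range(d)])
```

With uniformly distributed $z$, each coordinate has mean $\mu$ and variance $(2^{2b} - 1)/12$, so the centering by $\mu$ and the scale $s$ give zero mean and identity covariance; on real weights the assignment is not uniform, which is presumably why the text says "approximately".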

We adopt the LDLQ algorithm (Chee et al., [2023](https://arxiv.org/html/2606.10531#bib.bib10)) to generate the initial quantized model, and apply the Hadamard transformation to suppress outliers, following QuIP# (Tseng et al., [2024a](https://arxiv.org/html/2606.10531#bib.bib20)), QTIP (Tseng et al., [2024b](https://arxiv.org/html/2606.10531#bib.bib40)), YAQA (Tseng et al., [2025](https://arxiv.org/html/2606.10531#bib.bib73)), and NestQuant (Savkin et al., [2025](https://arxiv.org/html/2606.10531#bib.bib42)). LDLQ performs column-wise group quantization and sequentially compensates for reconstruction errors.

### 2.3 Preliminary Optimization Analysis

We empirically analyze the quality of the proposed initialization by examining the loss landscape around the initial point.

Following Chen et al. ([2025](https://arxiv.org/html/2606.10531#bib.bib67)), we project the loss surface of the Qwen-3-1.7B model onto two given directions and evaluate the cross-entropy loss on WikiText-2 (Merity et al., [2016](https://arxiv.org/html/2606.10531#bib.bib70)). [Figure 2(a)](https://arxiv.org/html/2606.10531#S2.F2.sf1) illustrates the resulting landscapes for LC-QAT and SQ-based QAT. The LC-QAT initialization is close to the FP16 baseline, indicating significantly lower performance degradation. In contrast, the SQ-based initialization deviates substantially from the global minimum. We further validate the zero-shot QA accuracy of the VQ and SQ initial models in [Section A.3](https://arxiv.org/html/2606.10531#A1.SS3).

[Figure 2(b)](https://arxiv.org/html/2606.10531#S2.F2.sf2) shows that the LC-QAT initialization lies in the low-loss basin and exhibits a saddle-point structure similar to that of the full-precision model. In contrast, [Figure 2(c)](https://arxiv.org/html/2606.10531#S2.F2.sf3) shows that the SQ-based initialization deviates substantially from the optimal region and lacks a nearby local minimum.

This phenomenon can be attributed to the fact that vector quantization preserves more information during post-training compression. As a result, LC-QAT begins optimization from a more favorable region of the parameter space, leading to reduced optimization difficulty and improved data efficiency in subsequent fine-tuning.

![Refer to caption](https://arxiv.org/html/2606.10531v1/x2.png) (a) Loss landscape of the FP16 model, with the initial loss of LC-QAT and an SQ-based QAT model.

![Refer to caption](https://arxiv.org/html/2606.10531v1/x3.png) (b) Loss of LC-QAT

![Refer to caption](https://arxiv.org/html/2606.10531v1/x4.png) (c) Loss of SQ-based QAT

Figure 2: (a) The loss landscape of an FP16 Qwen-3-1.7B model and the initial points of the LC-QAT and SQ-based QAT models. The loss is measured by cross entropy on the WikiText-2 dataset. LC-QAT closely approaches the local minimum, whereas the SQ model remains distant from the optimal region. (b) The loss landscape for the LC-QAT model. The surface exhibits a distinct saddle-point structure, with the training starting point $(0,0)$ positioned significantly closer to a local minimum. (c) The loss landscape for the SQ-based QAT model. The surface has higher overall loss values and lacks a well-defined local minimum, indicating a more challenging optimization landscape compared to LC-QAT.

## 3 Method

![Refer to caption](https://arxiv.org/html/2606.10531v1/x5.png)

Figure 3: Overview of the forward and backward pass of LC-QAT. During the forward pass, proxy weights are discretized into integer weights to incorporate quantization errors. The computational workflow is reformulated to leverage Int2-FP16 MatMul kernels, which are well-optimized for SQ models. In the backward pass, by bypassing the traditional codebook lookup operation, LC-QAT enables end-to-end optimization via approximate gradients. LC-QAT utilizes a Differentiable Gradient Estimator (DGE) to facilitate stable gradient flow for the integer weights.

In this section, we present the proposed LC-QAT framework for end-to-end training of VQ LLMs. We first provide an overview of the training pipeline. We then introduce the forward pass and the differentiable gradient estimator. Finally, we describe the preprocessing step that improves training stability.

### 3.1 Overview of LC-QAT

As shown in [Figure 3](https://arxiv.org/html/2606.10531#S3.F3), LC-QAT represents quantized weights using discrete integer variables and a linear transformation. During training, these integer variables are obtained by discretizing continuous proxy weights. Gradients are propagated through the discretization using differentiable approximations.

Specifically, LC-QAT maintains a set of floating-point proxy weights $W_p \in \mathbb{R}^{m \times n}$. In the forward pass, $W_p$ is converted into integer weights $W_z$ via rounding and clamping. The integer weights are then decoded into effective weights $\hat{W}$ through a linear-constrained codebook. The resulting weights are used to compute outputs and the training loss.

During backpropagation, approximate gradient estimators are employed to propagate gradients through the discretization operation, enabling end-to-end optimization of $W_p$.

### 3.2 The forward pass of LC-QAT

Given proxy weights $W_p \in \mathbb{R}^{m \times n}$, the corresponding integer weights are computed as

$$W_z = \mathrm{clip}\left(\mathrm{round}(W_p),\; 0,\; 2^{b} - 1\right), \tag{5}$$

where $b$ denotes the quantization bit-width, and $\mathrm{clip}(\cdot)$ denotes element-wise clamping.

This step is essentially the same as in SQ-based QAT. However, LC-QAT does not treat the elements of $W_z$ as integer approximations of floating-point weights, but rather regards them as codewords in the linear-constrained codebook and performs differentiable dequantization through a linear mapping. We partition $W_z$ into $G = n/d$ groups along the column dimension:

$$W_z = \left[W_{z,1}, W_{z,2}, \dots, W_{z,G}\right], \tag{6}$$

where each block $W_{z,i} \in \mathbb{Z}^{m \times d}$ contains $d$ consecutive columns. As a result, the decoding procedure can be conducted as follows:

$$\hat{W}^{T} = \begin{bmatrix} A W_{z,1}^{T} + B\mathbf{1}^{T} \\ A W_{z,2}^{T} + B\mathbf{1}^{T} \\ \vdots \\ A W_{z,G}^{T} + B\mathbf{1}^{T} \end{bmatrix} \tag{7}$$
[Equation 7](https://arxiv.org/html/2606.10531#S3.E7) eliminates explicit nearest-neighbor search and enables direct gradient propagation through the linear mapping. Moreover, it can be reformulated to reduce computational overhead. Given input activations $X \in \mathbb{R}^{N \times n}$, partitioned into column blocks $X_i \in \mathbb{R}^{N \times d}$ that match the groups of $W_z$, the output $Y = X\hat{W}^{T}$ is computed as:

$$\begin{aligned} Y &= \sum_{i=1}^{G} X_i \left(A W_{z,i}^{T} + B\mathbf{1}^{T}\right) \\ &= \sum_{i=1}^{G} (X_i A)\, W_{z,i}^{T} + \sum_{i=1}^{G} (X_i B)\, \mathbf{1}^{T} \end{aligned} \tag{8}$$

In [Equation 8](https://arxiv.org/html/2606.10531#S3.E8), $X_i A$ corresponds to a floating-point activation, whereas $W_{z,i}$ remains an integer matrix. This formulation allows LC-QAT to reuse optimized integer matrix multiplication kernels designed for scalar quantization.

While the linear mapping in LC-QAT incurs additional computation, the overhead remains modest. For $X \in \mathbb{R}^{N \times n}$ and $W_z \in \mathbb{Z}^{m \times n}$, the dequantization incurs a computational cost of $O(Nnd) + O(Nn)$. Given that the group size $d$ is typically small (e.g., 4 or 8), this additional overhead is minimal relative to the standard $O(Nnm)$ matrix multiplication. Using our custom CUDA kernel, we achieve a 1.68× speedup over the FP16 baseline and a 1.43× speedup over the AQLM-quantized model. Detailed throughput measurements are provided in Appendix [A.2](https://arxiv.org/html/2606.10531#A1.SS2).
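The equivalence between decoding the weights first (Equation 7) and transforming the activations instead (Equation 8) can be checked on a toy example. The sketch below uses plain Python lists rather than a tensor library or the paper's CUDA kernel; the helper names `matmul`, `column_block`, and `transpose`, the random test data, and the toy sizes are all assumptions of this illustration.

```python
import random

def matmul(P, Q):
    """Plain-Python product of two nested-list matrices."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)]
            for row in P]

def column_block(M, i, d):
    """Columns i*d .. (i+1)*d - 1 of M."""
    return [row[i * d:(i + 1) * d] for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

random.seed(0)
N, m, n, d, b = 3, 5, 8, 4, 2      # toy sizes; n must be divisible by d
num_groups = n // d

X = [[random.gauss(0, 1) for _ in range(n)] for _ in range(N)]
W_p = [[random.uniform(-1, 4) for _ in range(n)] for _ in range(m)]
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
B = [random.gauss(0, 1) for _ in range(d)]

# Equation 5: round, then clamp to {0, ..., 2^b - 1}.
W_z = [[min(max(round(w), 0), 2 ** b - 1) for w in row] for row in W_p]

# Equation 7: decode each block of d columns, then Y = X W_hat^T.
W_hat_T = []
for i in range(num_groups):
    decoded = matmul(A, transpose(column_block(W_z, i, d)))   # d x m
    W_hat_T += [[v + B[r] for v in decoded[r]] for r in range(d)]
Y_decode = matmul(X, W_hat_T)

# Equation 8: transform the activations and keep W_z as integers.
Y_fused = [[0.0] * m for _ in range(N)]
for i in range(num_groups):
    X_i = column_block(X, i, d)                                # N x d
    term = matmul(matmul(X_i, A), transpose(column_block(W_z, i, d)))
    bias = [sum(x * v for x, v in zip(row, B)) for row in X_i] # X_i B
    for r in range(N):
        for c in range(m):
            Y_fused[r][c] += term[r][c] + bias[r]

max_diff = max(abs(Y_decode[r][c] - Y_fused[r][c])
               for r in range(N) for c in range(m))
print(max_diff)
```

In the fused form the only product that involves the full $m \times n$ weight matrix is between floating-point activations and integer weights, which matches the shape of the Int2-FP16 kernels the text refers to.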

### 3.3 Differentiable Gradient Estimation

The discretization operation in [Equation 5](https://arxiv.org/html/2606.10531#S3.E5) is non-differentiable, as the derivative of the rounding function is zero almost everywhere. A common approach is to employ the Straight-Through Estimator (STE), which approximates $\partial W_z / \partial W_p \approx 1$.

However, this estimation is inherently imprecise. The resulting gradient noise can introduce significant instability near local minima (Yin et al., [2019](https://arxiv.org/html/2606.10531#bib.bib57)), posing a severe threat to the convergence of LLMs (Wang et al., [2025](https://arxiv.org/html/2606.10531#bib.bib58)). While prior SQ-based QAT methods have successfully employed STE (Ma et al., [2025](https://arxiv.org/html/2606.10531#bib.bib53); Liu et al., [2025](https://arxiv.org/html/2606.10531#bib.bib15)), we hypothesize this is because the substantial parameter modification of 2-bit SQ can effectively push the model away from its initial local minima. In contrast, LC-QAT begins with a better initialization and may be more strongly affected by the instability of STE near the local minimum. To address this issue, we adopt the Differentiable Gradient Estimator (DGE) proposed in (Wang et al., [2025](https://arxiv.org/html/2606.10531#bib.bib58)). Specifically, the discretization function is approximated by

$$f(x) = \frac{\delta}{2}\left(1 + \mathrm{sign}\left(\frac{2x}{\delta} - 1\right)\left|\frac{2x}{\delta} - 1\right|^{1/k}\right), \tag{9}$$

where $\delta$ denotes the quantization interval and $k$ controls the sharpness of the approximation.

The derivative is given by

$$f'(x) = \frac{1}{k} \cdot \left|\frac{2x}{\delta} - 1\right|^{1/k - 1}. \tag{10}$$

As $k$ increases, $f(x)$ approaches the hard clamp function, while maintaining smooth gradients for backpropagation.
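A minimal sketch of Equations 9 and 10 for a single quantization interval $[0, \delta]$ is given below, with the defaults $\delta = 1$ and $k = 5$. Since the exponent $1/k - 1$ is negative, the derivative is unbounded at the interval midpoint; the `clip` argument that caps it is an assumption added for this example, not a value stated in the text, and the function names are hypothetical.

```python
import math

def dge(x, delta=1.0, k=5.0):
    """Smooth surrogate of Equation 9 on one interval [0, delta]."""
    u = 2.0 * x / delta - 1.0
    return delta / 2.0 * (1.0 + math.copysign(abs(u) ** (1.0 / k), u))

def dge_grad(x, delta=1.0, k=5.0, clip=3.0):
    """Derivative from Equation 10, capped at `clip` (an assumption)."""
    u = abs(2.0 * x / delta - 1.0)
    if u == 0.0:
        return clip                      # the formula diverges at the midpoint
    return min(u ** (1.0 / k - 1.0) / k, clip)

# Compare the analytic derivative with a central finite difference.
x, h = 0.9, 1e-6
numeric = (dge(x + h) - dge(x - h)) / (2 * h)
print(dge(0.0), dge(0.5), dge(1.0), dge_grad(x), numeric)
```

Compared with the STE, which passes a constant gradient of 1, this estimator assigns larger gradients to proxy values near the midpoint of an interval, where a small update can change the rounded integer, and smaller gradients near the interval ends.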

### 3.4 Integer Weight Preprocessing

In standard SQ, the floating-point weights $W_p$ in [Equation 5](https://arxiv.org/html/2606.10531#S3.E5) are typically initialized randomly (Ma et al., [2025](https://arxiv.org/html/2606.10531#bib.bib53)) or from a pre-trained model (Liu et al., [2023](https://arxiv.org/html/2606.10531#bib.bib55); Chen et al., [2024](https://arxiv.org/html/2606.10531#bib.bib56); Liu et al., [2025](https://arxiv.org/html/2606.10531#bib.bib15)), with $W_z$ naturally being the quantization of $W_p$. However, this does not hold for LC-QAT. Instead, $W_z$ is derived from a preceding PTQ process. Consequently, we must reverse this logic: $W_z$ determines the initialization of $W_p$. As a result, we use the following affine transformation for initialization:

$$W_p = s(W_z - t), \tag{11}$$

where

$$s = \frac{2a}{2^{b} - 1}, \quad t = \frac{2^{b} - 1}{2}, \quad a = \sqrt{\frac{6}{m + n}}. \tag{12}$$

[Equation 11](https://arxiv.org/html/2606.10531#S3.E11) aligns the statistics of $W_p$ with Xavier initialization, ensuring approximately zero mean and layer-dependent variance. This design preserves inference consistency and improves training stability.
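The mapping in Equations 11 and 12 can be written in a few lines. The sketch below is an illustration on a toy integer matrix; the function name `init_proxy` and the example values are assumptions.

```python
import math

def init_proxy(W_z, b=2):
    """Map integer weights to proxy weights: W_p = s (W_z - t)."""
    m, n = len(W_z), len(W_z[0])
    a = math.sqrt(6.0 / (m + n))     # Xavier-uniform bound for an m x n layer
    s = 2.0 * a / (2 ** b - 1)
    t = (2 ** b - 1) / 2.0
    return [[s * (z - t) for z in row] for row in W_z], a

W_z = [[0, 1, 2, 3], [3, 2, 1, 0]]   # toy 2 x 4 integer weights
W_p, a = init_proxy(W_z)
print(a, W_p[0])
```

The integer levels $0$ and $2^{b} - 1$ land on $-a$ and $+a$, so the proxy weights span the same symmetric range as a Xavier-uniform draw for a layer of that shape.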

## 4 Experimental Setup

### 4.1 Experimental Protocols

We evaluate LC-QAT under two experimental protocols designed to assess its model capacity and data efficiency.

(1) Model Capacity Experiments. We compare LC-QAT with existing VQ-QAT methods using the same amount of data. These experiments focus on assessing LC-QAT's ability to improve performance by scaling the dataset through end-to-end training and synchronous updates of all VQ-quantized parameters.

(2) Data Efficiency Experiments. We compare LC-QAT with state-of-the-art scalar-quantized QAT methods. Since most of these methods target foundation (base) models, we run a set of experiments to evaluate the performance degradation of LC-QAT on these foundation models. We then evaluate LC-QAT in the instruction-tuning setting and compare it with instruction-following low-bit models. These experiments aim to evaluate the performance of quantized LLMs in real-world applications and assess whether LC-QAT preserves instruction-following and reasoning abilities under 2-bit quantization.

### 4.2 Models and Baselines

We select Qwen-3-1.7B and 8B (Yang et al., [2025](https://arxiv.org/html/2606.10531#bib.bib47)) along with LLaMA-3-3B and 8B (Llama Team, [2024](https://arxiv.org/html/2606.10531#bib.bib45)) as our primary foundation models.

For the model capacity experiments, we compare LC-QAT with PV-Tuning (Malinovskii et al., [2024](https://arxiv.org/html/2606.10531#bib.bib54)), which, to our knowledge, is the only end-to-end QAT framework for VQ.

For the data efficiency experiments, we also include SQ-based QAT foundation models such as LLM-QAT (Liu et al., [2023](https://arxiv.org/html/2606.10531#bib.bib55)), EfficientQAT (Chen et al., [2024](https://arxiv.org/html/2606.10531#bib.bib56)), and ParetoQ (Liu et al., [2025](https://arxiv.org/html/2606.10531#bib.bib15)), which is, to our knowledge, the state-of-the-art 2-bit QAT method. Since the results of ParetoQ are reported on LLaMA-3, we conduct a controlled experiment on the same model family to ensure a fair comparison. For the instruction-tuned QAT models, we compare our model with BitNet 2B4T (Ma et al., [2025](https://arxiv.org/html/2606.10531#bib.bib53)), which is a high-performance instruction-following QAT model. Unlike LC-QAT, BitNet involves training from scratch using a massive 4T-token corpus to achieve low-bit quantization.

### 4.3 Datasets and Evaluation Benchmarks

Training Data. For VQ-QAT and standard QAT on foundation models, we randomly sample approximately 4 billion tokens from the FineWeb dataset (Penedo et al., [2024](https://arxiv.org/html/2606.10531#bib.bib72)). Instruction-tuned QAT experiments are conducted on the AM-Qwen3-Distilled dataset with approximately 2B tokens. Example training samples are provided in [Section A.1](https://arxiv.org/html/2606.10531#A1.SS1).

Calibration Data. For post-training VQ, Hessian statistics required by LDLQ are computed using 1,024 samples from the RedPajama dataset (Weber et al., [2024](https://arxiv.org/html/2606.10531#bib.bib46)).

Evaluation Benchmarks. We categorize our evaluation into two groups to align with the baselines:

- Zero-Shot QA: These benchmarks are used to evaluate the VQ-QAT models and foundation models, and include ARC-Easy, ARC-Challenge (Clark et al., [2018](https://arxiv.org/html/2606.10531#bib.bib50)), BoolQ (Clark et al., [2019](https://arxiv.org/html/2606.10531#bib.bib49)), HellaSwag (Zellers et al., [2019](https://arxiv.org/html/2606.10531#bib.bib52)), PIQA (Bisk et al., [2020](https://arxiv.org/html/2606.10531#bib.bib48)), and WinoGrande (Sakaguchi et al., [2021](https://arxiv.org/html/2606.10531#bib.bib51)). The setting is similar to the zero-shot evaluation of ParetoQ (Liu et al., [2025](https://arxiv.org/html/2606.10531#bib.bib15)) and PV-Tuning (Malinovskii et al., [2024](https://arxiv.org/html/2606.10531#bib.bib54)).
- Instruction-Following Tasks: These benchmarks are used to assess knowledge capabilities, instruction-following ability, as well as mathematical and coding problem-solving skills. We select OpenBookQA (Mihaylov et al., [2018](https://arxiv.org/html/2606.10531#bib.bib59)), IFEval (Zhou et al., [2023](https://arxiv.org/html/2606.10531#bib.bib61)), MMLU (Hendrycks et al., [2021](https://arxiv.org/html/2606.10531#bib.bib62)), GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2606.10531#bib.bib63)), MATH (Lightman et al., [2023](https://arxiv.org/html/2606.10531#bib.bib64)), and HumanEval (Chen et al., [2021](https://arxiv.org/html/2606.10531#bib.bib60)) for evaluation.

### 4.4 Implementation Details

Quantization Configuration. We set the vector dimension to $d = 4$, resulting in $A \in \mathbb{R}^{4 \times 4}$ and $B \in \mathbb{R}^{4}$. This configuration introduces only 20 additional parameters per weight matrix. Quantization is applied to all the linear layers in all the transformer blocks.

Training Setup. All experiments are conducted on 16 NVIDIA A100 GPUs. For foundation models and VQ-QAT, we use a per-GPU batch size of 8 with gradient accumulation of 8, resulting in an effective batch size of 1024 sequences (8 × 8 × 16). Models are trained for approximately 1,900 steps.

The learning rate is set to $1 \times 10^{-4}$ for Qwen-3-1.7B and LLaMA-3-3B, and $3 \times 10^{-5}$ for Qwen-3-8B and LLaMA-3-8B. For instruction-tuned QAT, we use a learning rate of $1 \times 10^{-5}$ and gradient accumulation of 8.

Optimization. We use AdamW with standard hyperparameters and the warmup-stable-decay learning rate scheduler (Hu et al., [2024](https://arxiv.org/html/2606.10531#bib.bib69)). All experiments employ gradient clipping with a threshold of $1.0$. We set $\delta = 1$ and $k = 5$ in [Equation 9](https://arxiv.org/html/2606.10531#S3.E9).

## 5 Results

This section analyzes the experiments described in the previous section to demonstrate that our approach offers superior learning capability compared to conventional VQ-QAT methods and a significant data efficiency advantage over state-of-the-art SQ-QAT methods. Additionally, we include two ablation studies to validate the effectiveness of the preprocessing method used before training and the gradient approximation technique employed during training.

Table 1: Comparison of the perplexity (PPL) on the WikiText-2 test set and zero-shot QA accuracy between LC-QAT and other baselines. Under the same amount of training data, LC-QAT consistently outperforms vector-quantization-aware training baselines. AE, AC, BQ, HS, WG, and PQ denote ARC-Easy, ARC-Challenge, BoolQ, HellaSwag, WinoGrande, and PIQA, respectively. Avg stands for Average, and Per stands for Performance. AQLM is under the commonly used 2×8 configuration.

| Model | Type | Method | PPL↓ | AC↑ | AE↑ | BQ↑ | HS↑ | PQ↑ | WG↑ | Avg.↑ | Per.↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen-3-8B | N/A | FP16 | 9.72 | 56.40 | 80.89 | 86.64 | 74.96 | 77.48 | 68.35 | 74.12 | 1.00 |
| Qwen-3-8B | SQ | GPTQ | 4.68e4 | 26.79 | 25.67 | 42.87 | 25.84 | 52.50 | 50.04 | 37.29 | 0.50 |
| Qwen-3-8B | SQ | QuIP | 27.61 | 24.91 | 31.69 | 54.89 | 41.41 | 59.90 | 50.12 | 43.82 | 0.59 |
| Qwen-3-8B | SQ | OSTQuant | 21.33 | 34.64 | 60.02 | 73.21 | 50.29 | 68.50 | 58.17 | 57.47 | 0.78 |
| Qwen-3-8B | VQ | QuIP# | 12.38 | 46.50 | 68.43 | 83.12 | 66.62 | 74.32 | 66.30 | 67.55 | 0.91 |
| Qwen-3-8B | VQ | QTIP | 10.39 | 52.82 | 77.31 | 85.24 | 70.85 | 77.26 | 68.75 | 72.04 | 0.97 |
| Qwen-3-8B | VQ | YAQA | 10.50 | 53.16 | 78.41 | 83.27 | 71.41 | 77.42 | 69.14 | 72.14 | 0.97 |
| Qwen-3-8B | VQ | AQLM | 18.26 | 45.22 | 72.31 | 73.73 | 60.99 | 73.07 | 64.33 | 64.94 | 0.88 |
| Qwen-3-8B | VQ | LC-PTQ | 14.95 | 45.82 | 72.35 | 85.07 | 63.18 | 74.16 | 67.09 | 67.95 | 0.92 |
| Qwen-3-8B | VQ-QAT | PV-Tuning | 10.65 | 51.79 | 75.46 | 83.82 | 70.44 | 76.88 | 67.56 | 70.99 | 0.96 |
| Qwen-3-8B | VQ-QAT | LC-QAT | 10.23 | 53.75 | 78.82 | 82.29 | 71.37 | 76.99 | 69.85 | 72.18 | 0.97 |
| Qwen-3-1.7B | N/A | FP16 | 16.72 | 43.08 | 69.69 | 77.52 | 60.37 | 72.14 | 61.80 | 64.10 | 1.00 |
| Qwen-3-1.7B | SQ | GPTQ | 2.80e4 | 25.51 | 25.88 | 45.38 | 25.49 | 50.60 | 48.86 | 36.95 | 0.58 |
| Qwen-3-1.7B | SQ | QuIP | 133.16 | 24.06 | 32.45 | 46.97 | 30.92 | 52.61 | 49.25 | 39.38 | 0.61 |
| Qwen-3-1.7B | SQ | OSTQuant | 149.80 | 25.68 | 27.86 | 61.95 | 28.13 | 52.56 | 41.37 | 39.59 | 0.62 |
| Qwen-3-1.7B | VQ | QuIP# | 26.17 | 29.86 | 46.42 | 61.22 | 57.47 | 66.87 | 57.14 | 53.16 | 0.83 |
| Qwen-3-1.7B | VQ | AQLM | 40.25 | 28.24 | 47.72 | 69.48 | 42.42 | 63.43 | 56.03 | 51.22 | 0.80 |
| Qwen-3-1.7B | VQ | LC-PTQ | 29.11 | 31.91 | 53.87 | 67.46 | 44.99 | 61.80 | 54.38 | 52.40 | 0.82 |
| Qwen-3-1.7B | VQ | QTIP | 18.21 | 36.09 | 59.13 | 76.39 | 53.36 | 69.64 | 58.96 | 58.93 | 0.92 |
| Qwen-3-1.7B | VQ | YAQA | 17.14 | 37.03 | 61.99 | 78.41 | 54.83 | 70.24 | 59.27 | 60.30 | 0.94 |
| Qwen-3-1.7B | VQ-QAT | PV-Tuning | 17.43 | 38.99 | 66.08 | 58.78 | 53.79 | 71.06 | 59.59 | 58.05 | 0.91 |
| Qwen-3-1.7B | VQ-QAT | LC-QAT | 13.44 | 41.72 | 68.18 | 69.85 | 58.23 | 72.69 | 57.93 | 61.43 | 0.96 |

### 5.1 Model Capability

[Table 1](https://arxiv.org/html/2606.10531#S5.T1) presents the accuracy of LC-QAT compared against baseline VQ-QAT models, as well as SQ- and VQ-based PTQ methods, across multiple zero-shot QA tasks. As illustrated, our proposed LC-QAT achieves significantly higher accuracy than both the PTQ methods (SQ and VQ) and the VQ-QAT baseline, PV-Tuning. This performance gap demonstrates the superior capability of our model to leverage training data to recover quantization degradation.
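The Avg. and Per. columns of Table 1 can be reproduced from the per-task accuracies. The sketch below assumes that Per. is the ratio of a method's average accuracy to the FP16 average; the caption only states that "Per stands for Performance", so this reading is an inference that happens to match the tabulated rows. The function name `summarize` is hypothetical.

```python
def summarize(accs, fp16_avg):
    """Average the six task accuracies and normalize by the FP16 average."""
    avg = sum(accs) / len(accs)
    return round(avg, 2), round(avg / fp16_avg, 2)

# Rows copied from Table 1 (Qwen-3-8B), in the order AC, AE, BQ, HS, PQ, WG.
fp16 = [56.40, 80.89, 86.64, 74.96, 77.48, 68.35]
lc_qat = [53.75, 78.82, 82.29, 71.37, 76.99, 69.85]

fp16_avg, _ = summarize(fp16, 1.0)
lc_avg, lc_per = summarize(lc_qat, fp16_avg)
print(fp16_avg, lc_avg, lc_per)
```

For the Qwen-3-8B LC-QAT row this gives the tabulated 72.18 and 0.97.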

A detailed analysis reveals that under the aggressive 2\-bit quantization setting, SQ models suffer from substantial performance degradation, suggesting they are suboptimal starting points for high\-performance QAT\. In contrast, VQ\-based models generally exhibit superior initial performance, validating the feasibility of using VQ as a foundation for subsequent optimization\. Compared to AQLM, our LC\-PTQ method using a linear\-constrained codebook achieves higher initial performance\. Furthermore, on the 8B model, LC\-QAT performs slightly better than trellis\-based methods such as QTIP and YAQA, while on the 1\.7B model, which presents a greater compression challenge, our method brings significantly higher performance improvement after training, surpassing state\-of\-the\-art PTQ/QAT methods\.

[Figure 4](https://arxiv.org/html/2606.10531#S5.F4) shows the superior model capacity of our approach. As observed, the PV-Tuning method exhibits a performance plateau in the later stage of training, suggesting that the model reaches saturation within the initial steps. This limitation may stem from its asynchronous update strategy. In contrast, our method achieves consistent performance improvements and superior final accuracy as the volume of training data increases. This trend demonstrates the robust scaling capability of our framework, confirming that LC-QAT can effectively internalize more information from larger datasets.

![Refer to caption](https://arxiv.org/html/2606.10531v1/x6.png)

Figure 4: Average zero-shot task performance over training steps. LC-QAT steadily improves, while PV-Tuning saturates quickly.

### 5.2 Data Efficiency

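Table 2: Comparison of the perplexity (PPL) on the Wikitext-2 test set and zero-shot QA accuracy between LC-QAT and other baselines. Our method outperforms the current state-of-the-art ParetoQ, achieving higher performance using only around 10% of the training data. Results of all the baselines are from the evaluation of ParetoQ.

| Model | Type | Method | #Toks | PPL↓ | AC↑ | AE↑ | BQ↑ | HS↑ | PQ↑ | WG↑ | Avg.↑ | Per.↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-3-8B | N/A | FP16 | - | 6.2 | 57.7 | 81 | 83.6 | 79.5 | 81 | 73.9 | 76.12 | 1.00 |
| LLaMA-3-8B | PTQ | RTN | N/A | 1.20e6 | 25.10 | 27.20 | 37.80 | 26.10 | 49.70 | 50.50 | 36.07 | 0.47 |
| LLaMA-3-8B | PTQ | GPTQ | N/A | 160.00 | 26.10 | 27.00 | 61.60 | 26.00 | 50.50 | 49.70 | 40.15 | 0.53 |
| LLaMA-3-8B | PTQ | AWQ | N/A | 1.10e6 | 27.10 | 26.00 | 58.30 | 26.10 | 51.40 | 49.80 | 39.78 | 0.52 |
| LLaMA-3-8B | PTQ | OmniQ | N/A | 7.60e4 | 22.80 | 27.30 | 37.90 | 25.30 | 49.50 | 49.40 | 35.37 | 0.46 |
| LLaMA-3-8B | PTQ | SpinQuant | N/A | 31.20 | 22.00 | 32.40 | 59.00 | 31.90 | 53.20 | 49.90 | 41.40 | 0.54 |
| LLaMA-3-8B | QAT | LLM-QAT | 100M | 29.50 | 35.90 | 54.80 | 64.80 | 58.00 | 68.00 | 54.70 | 56.03 | 0.74 |
| LLaMA-3-8B | QAT | EfficientQAT | 24M | 9.60 | 46.80 | 69.30 | 75.05 | 69.00 | 76.40 | 66.30 | 67.14 | 0.88 |
| LLaMA-3-8B | QAT | ParetoQ | 30B | 8.00 | 54.50 | 78.50 | 76.40 | 73.80 | 79.20 | 70.00 | 72.07 | 0.95 |
| LLaMA-3-8B | QAT | LC-QAT | 4B | 9.38 | 57.94 | 82.95 | 77.46 | 76.67 | 78.62 | 66.85 | 73.42 | 0.96 |
| LLaMA-3-3B | N/A | FP16 | - | 7.70 | 50.70 | 72.60 | 74.60 | 74.30 | 78.20 | 69.20 | 69.93 | 1.00 |
| LLaMA-3-3B | PTQ | RTN | N/A | 7.80e5 | 25.10 | 26.90 | 37.80 | 25.70 | 50.10 | 49.60 | 35.87 | 0.51 |
| LLaMA-3-3B | PTQ | GPTQ | N/A | 270.00 | 22.90 | 28.60 | 46.40 | 27.10 | 50.00 | 50.10 | 37.52 | 0.54 |
| LLaMA-3-3B | PTQ | AWQ | N/A | 6.20e5 | 27.50 | 27.30 | 38.20 | 26.10 | 51.10 | 50.70 | 36.82 | 0.53 |
| LLaMA-3-3B | PTQ | OmniQ | N/A | 6.50e3 | 24.60 | 28.30 | 37.80 | 25.30 | 50.50 | 50.20 | 36.12 | 0.52 |
| LLaMA-3-3B | PTQ | SpinQuant | N/A | 57.40 | 23.70 | 28.30 | 53.20 | 26.10 | 51.10 | 49.00 | 38.57 | 0.55 |
| LLaMA-3-3B | QAT | LLM-QAT | 100M | 2.90e5 | 33.30 | 49.30 | 63.50 | 48.90 | 65.20 | 52.20 | 52.07 | 0.74 |
| LLaMA-3-3B | QAT | ParetoQ | 30B | 9.10 | 49.00 | 73.09 | 68.80 | 69.20 | 76.40 | 64.40 | 66.82 | 0.96 |
| LLaMA-3-3B | QAT | LC-QAT | 4B | 13.20 | 52.39 | 78.75 | 66.51 | 69.83 | 76.22 | 66.30 | 68.33 | 0.98 |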

[Table 2](https://arxiv.org/html/2606.10531#S5.T2) presents the zero-shot QA performance of LC-QAT and other baselines. Although LLM-QAT and EfficientQAT are trained with smaller amounts of data, their performance remains substantially inferior. In contrast, our method surpasses ParetoQ in average zero-shot accuracy while using only around 10% of its training data (4B vs. 30B tokens), although ParetoQ retains a lower Wikitext-2 perplexity. Moreover, on several tasks, LC-QAT even exceeds the FP16 baseline. These results validate our choice of vector quantization as the initialization for QAT: it significantly reduces the required training data and improves the efficiency of quantization-aware training when recovering performance under low-bit quantization.

Table 3: Performance comparison with BitNet 2B4T on math and code tasks. Our method achieves higher performance than BitNet 2B4T on most tasks and also surpasses it in the average score.

| Metric | BitNet 2B | LC-QAT 1.7B |
|---|---|---|
| OpenBookQA | 41.60 | 55.20 |
| IFEval | 53.48 | 58.63 |
| MMLU | 53.17 | 46.77 |
| GSM8K | 58.38 | 58.61 |
| MATH | 43.40 | 39.60 |
| HumanEval | 38.40 | 43.29 |
| Avg. | 48.07 | 50.35 |
| #Tokens | 4T | 4B |

[Table 3](https://arxiv.org/html/2606.10531#S5.T3) reports a comparison with BitNet. Our model achieves higher average accuracy than BitNet 2B4T and outperforms it on four of the six tasks. These results indicate that our approach generalizes well across both reasoning and code benchmarks. Notably, while BitNet 2B4T attains strong performance using 4T tokens of training data, our method requires only approximately 0.1% of that data (4B tokens) for fine-tuning to achieve comparable performance, further corroborating the substantial data efficiency advantage of our approach.
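The data-budget ratios quoted in this section follow directly from the token counts in Tables 2 and 3. A quick check, with illustrative helper names that are not part of the paper:

```python
def parse_tokens(text):
    """Convert a token count such as '24M', '4B', or '4T' to an integer."""
    units = {"M": 10**6, "B": 10**9, "T": 10**12}
    return int(float(text[:-1]) * units[text[-1]])


def data_fraction(used, reference):
    """Fraction of the reference training budget that was actually used."""
    return parse_tokens(used) / parse_tokens(reference)


print(f"{data_fraction('4B', '4T'):.1%}")   # LC-QAT vs. BitNet 2B4T: 0.1%
print(f"{data_fraction('4B', '30B'):.1%}")  # LC-QAT vs. ParetoQ: 13.3%
```

The second ratio is what the text rounds to "around 10%".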

### 5.3 Ablation Studies

#### 5.3.1 Effect of PTQ Initialization

We first ablate the role of the PTQ starting point. Specifically, we replace the LC-PTQ initialization with either random or GPTQ initialization; both lead to severe degradation and fail to provide a usable starting point for subsequent training. Table [4](https://arxiv.org/html/2606.10531#S5.T4) confirms that a high-quality PTQ initialization is essential for data-efficient 2-bit QAT. Experiments are performed on a Qwen3-8B model.

Table 4: Ablation on the PTQ initialization. Replacing the LDLQ-based initialization with random initialization severely degrades the zero-shot QA accuracy.

| Init. Method | AC↑ | AE↑ | BQ↑ | HS↑ | PQ↑ | WG↑ | Avg.↑ | Per.↑ |
|---|---|---|---|---|---|---|---|---|
| LC-PTQ | 45.82 | 72.35 | 85.07 | 63.18 | 74.16 | 67.09 | 67.95 | 0.92 |
| GPTQ Init. | 19.62 | 32.49 | 39.72 | 27.28 | 54.79 | 51.30 | 37.53 | 0.51 |
| Random Init. | 27.13 | 25.55 | 56.88 | 26.22 | 51.69 | 49.57 | 39.51 | 0.54 |

#### 5.3.2 Value of End-to-End Training

We then evaluate whether the gains of LC-QAT come from the end-to-end training rather than a better initialization alone. We run PV-Tuning’s coordinate-descent optimization starting from the same LC-PTQ initialization used by LC-QAT. As shown in Table [5](https://arxiv.org/html/2606.10531#S5.T5), although PV-Tuning improves upon LC-PTQ, LC-QAT achieves substantially higher accuracy under the same initialization.

Table 5: Comparison of PV-Tuning and end-to-end training of a Qwen-3-1.7B LC-PTQ model. LC-QAT consistently achieves higher accuracy, demonstrating the value of end-to-end training.

| Method | AC↑ | AE↑ | BQ↑ | HS↑ | PQ↑ | WG↑ | Avg.↑ |
|---|---|---|---|---|---|---|---|
| LC-PTQ | 31.91 | 53.87 | 67.46 | 44.99 | 61.80 | 54.38 | 52.40 |
| LC-PTQ + PV-Tuning | 36.69 | 61.95 | 57.43 | 49.95 | 69.31 | 56.75 | 55.35 |
| LC-QAT | 39.51 | 64.35 | 70.43 | 55.01 | 71.65 | 57.54 | 59.75 |

![Refer to caption](https://arxiv.org/html/2606.10531v1/x7.png)

(a) Ablation study on integer weight preprocessing

![Refer to caption](https://arxiv.org/html/2606.10531v1/x8.png)

(b) Ablation study on gradient estimation

Figure 5: (a) Without preprocessing, the training loss remains nearly constant. With preprocessing, the loss decreases continuously, demonstrating that aligning integer weights with a Xavier-initialized distribution is essential for stable training and effective gradient propagation. (b) When using the STE, the spikes are extremely large and difficult to recover. In contrast, using the DGE results in significantly smaller spikes and more stable convergence.

#### 5.3.3 Ablations of Auxiliary Components

Differentiable gradient estimator. As [Figure 5(b)](https://arxiv.org/html/2606.10531#S5.F5.sf2) shows, replacing the standard STE with the differentiable gradient estimator (DGE) significantly mitigates the initial loss spike. With the STE, the model exhibits a pronounced loss spike during early training steps; with DGE, the loss curve is stabilized, allowing the model to descend smoothly from the start and maintain consistent learning throughout training. This highlights the effectiveness of DGE in ensuring robust gradient flow and reliable optimization under low-bit settings.
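For context, the straight-through estimator that DGE is compared against can be sketched in a few lines. This is the textbook STE (round in the forward pass, pass the gradient through unchanged in the backward pass), written in plain Python with hypothetical function names. It does not reproduce the DGE used by LC-QAT, nor LC-QAT's actual quantizer.

```python
def quantize_forward(latent):
    """Forward pass: round each latent weight to the nearest integer."""
    return [float(round(v)) for v in latent]


def ste_backward(grad_wrt_quantized):
    """Backward pass under STE: rounding is treated as the identity,
    so the gradient reaches the latent weights unchanged."""
    return list(grad_wrt_quantized)


def sgd_step(latent, grad_wrt_quantized, lr):
    """One plain SGD update of the latent weights using the STE gradient."""
    grad = ste_backward(grad_wrt_quantized)
    return [w - lr * g for w, g in zip(latent, grad)]


latent = [0.2, 1.7, -0.6]
print(quantize_forward(latent))  # [0.0, 2.0, -1.0]
```

The true derivative of rounding is zero almost everywhere, so the identity gradient used here is only a surrogate; smoother surrogates such as DGE aim to reduce the mismatch between the forward and backward passes.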

Integer weight preprocessing. We conduct an ablation study to assess the role of integer weight preprocessing in LC-QAT. As shown in [Figure 5(a)](https://arxiv.org/html/2606.10531#S5.F5.sf1), removing this preprocessing prevents effective training: the training loss remains nearly constant. This demonstrates that integer weight preprocessing is a necessary component for the stable training of LC-QAT.

#### 5.3.4 Sensitivity and Scaling Analyses

We find that LC-QAT continues to improve as the training budget increases from 1B to 10B tokens, which indicates that LC-QAT can benefit from continuous scaling and achieve higher performance compared to the 4B token training used in Table [1](https://arxiv.org/html/2606.10531#S5.T1) (Appendix [A.5.1](https://arxiv.org/html/2606.10531#A1.SS5.SSS1), Table [10](https://arxiv.org/html/2606.10531#A1.T10)). Furthermore, we extend our experiments to 14B models and observe that LC-QAT still achieves notable performance gains at this larger scale (Appendix [A.5.2](https://arxiv.org/html/2606.10531#A1.SS5.SSS2), Table [11](https://arxiv.org/html/2606.10531#A1.T11)).

We also find that the PTQ initialization is insensitive to the dimension $d$ of the matrix $A$ ($d=4$ vs. $d=8$), and given that a larger $d$ introduces more additional parameters, we consider $d=4$ to be a suitable choice (Appendix [A.4](https://arxiv.org/html/2606.10531#A1.SS4)). Additionally, the calibration data may affect the outcome of PTQ initialization; we find that using the commonly adopted RedPajama dataset yields favorable results on zero-shot QA tasks (Appendix [A.4](https://arxiv.org/html/2606.10531#A1.SS4)).

## 6 Conclusion

We presented LC\-QAT, a novel framework for end\-to\-end training of 2\-bit VQ models\. By introducing linear\-constrained codebooks, LC\-QAT enables synchronized optimization of discrete and continuous parameters\.

Combined with a strong post\-training initialization and smooth gradient estimation, LC\-QAT achieves superior accuracy and data efficiency compared with existing scalar\- and vector\-quantized QAT methods\. Extensive experiments demonstrate that LC\-QAT exhibits stronger model capability than conventional VQ\-QAT methods, and achieves competitive performance using significantly less training data compared to scalar\-quantization\-based QAT methods\.

Our work offers an alternative for 2-bit QAT that combines higher data efficiency with strong accuracy. Future work will focus on the scalability of LC-QAT and on optimizing its inference performance.

## Limitations

Our experiments primarily focus on data\-efficient fine\-tuning under limited training budgets, and the scalability to large\-scale training remains an open question\.

Moreover, LC\-QAT introduces an additional linear projection in the forward pass\. Although this operation has limited complexity, its practical interaction with highly optimized GEMM kernels in modern deep learning frameworks has not been systematically studied\. The impact of this decoding process on training efficiency and decoding throughput under different conditions remains to be fully explored\.

## Acknowledgements

This work is supported by the National Key Research and Development Program of China \(2024YFB4505603\) and the National Natural Science Foundation of China \(No\. 62576186\)\. This work is also supported by Tsinghua KA Excellence Center\. This work is partially supported by Tsinghua University \(Department of Computer Science and Technology\) \- Sinopec Joint Research Center for Artificial Intelligence\.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

## References

- M. V. Baalen, A. Kuzmin, M. Nagel, P. Couperus, A. Bolshakov, C. Bastoul, E. Mahurin, T. Blankevoort, and P. Whatmough (2024). GPTVQ: the blessing of dimensionality for LLM quantization. In Workshop on Efficient Systems for Foundation Models II at the International Conference on Machine Learning.
- Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi (2020). PIQA: reasoning about physical commonsense in natural language. In The AAAI Conference on Artificial Intelligence, pp. 7432–7439.
- J. Chee, Y. Cai, V. Kuleshov, and C. D. Sa (2023). QuIP: 2-bit quantization of large language models with guarantees. In Proceedings of the International Conference on Neural Information Processing Systems, pp. 4396–4429.
- H. Chen, Y. Dong, Z. Wei, Y. Huang, Y. Zhang, H. Su, and J. Zhu (2025). Unveiling the basin-like loss landscape in large language models. CoRR abs/2505.17646.
- M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba (2021). Evaluating large language models trained on code. CoRR abs/2107.03374.
- M. Chen, W. Shao, P. Xu, J. Wang, P. Gao, K. Zhang, and P. Luo (2024). EfficientQAT: efficient quantization-aware training for large language models. CoRR abs/2407.11062.
- C. Clark, K. Lee, M. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova (2019). BoolQ: exploring the surprising difficulty of natural yes/no questions. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2924–2936.
- P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018). Think you have solved question answering? Try ARC, the AI2 reasoning challenge. CoRR abs/1803.05457.
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman (2021). Training verifiers to solve math word problems. CoRR abs/2110.14168.
- V. Egiazarian, A. Panferov, D. Kuznedelev, E. Frantar, A. Babenko, and D. Alistarh (2024). Extreme compression of large language models via additive quantization. In Proceedings of the 41st International Conference on Machine Learning, pp. 12284–12303. ISSN 2640-3498.
- E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh (2022). GPTQ: accurate post-training compression for generative pretrained transformers. CoRR abs/2210.17323.
- Z. Hao, J. Guo, L. Shen, Y. Luo, H. Hu, G. Wang, D. Yu, Y. Wen, and D. Tao (2025). Low-precision training of large language models: methods, challenges, and opportunities. CoRR abs/2505.01043.
- D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2021). Measuring massive multitask language understanding. CoRR abs/2009.03300.
- S. Hu, Y. Tu, X. Han, C. He, G. Cui, X. Long, Z. Zheng, Y. Fang, Y. Huang, W. Zhao, X. Zhang, Z. L. Thai, K. Zhang, C. Wang, Y. Yao, C. Zhao, J. Zhou, J. Cai, Z. Zhai, N. Ding, C. Jia, G. Zeng, D. Li, Z. Liu, and M. Sun (2024). MiniCPM: unveiling the potential of small language models with scalable training strategies. CoRR abs/2404.06395.
- H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2023). Let’s verify step by step. CoRR abs/2305.20050.
- Z. Liu, B. Oguz, C. Zhao, E. Chang, P. Stock, Y. Mehdad, Y. Shi, R. Krishnamoorthi, and V. Chandra (2023). LLM-QAT: data-free quantization aware training for large language models. CoRR abs/2305.17888.
- Z. Liu, C. Zhao, H. Huang, S. Chen, J. Zhang, J. Zhao, S. Roy, L. Jin, Y. Xiong, Y. Shi, L. Xiao, Y. Tian, B. Soran, R. Krishnamoorthi, T. Blankevoort, and V. Chandra (2025). ParetoQ: improving scaling laws in extremely low-bit LLM quantization. In Proceedings of the International Conference on Neural Information Processing Systems.
- Llama Team (2024). The Llama 3 herd of models. CoRR abs/2407.21783.
- S. Ma, H. Wang, S. Huang, X. Zhang, Y. Hu, T. Song, Y. Xia, and F. Wei (2025). BitNet b1.58 2B4T technical report. CoRR abs/2504.12285.
- V. Malinovskii, D. Mazur, I. Ilin, D. Kuznedelev, K. Burlachenko, K. Yi, D. Alistarh, and P. Richtarik (2024). PV-Tuning: beyond straight-through estimation for extreme LLM compression. In Proceedings of the International Conference on Neural Information Processing Systems.
- S. Merity, C. Xiong, J. Bradbury, and R. Socher (2016). Pointer sentinel mixture models. CoRR abs/1609.07843.
- T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal (2018). Can a suit of armor conduct electricity? A new dataset for open book question answering. CoRR abs/1809.02789.
- G. Penedo, H. Kydlíček, L. B. Allal, A. Lozhkov, M. Mitchell, C. Raffel, L. V. Werra, and T. Wolf (2024). The FineWeb datasets: decanting the web for the finest text data at scale. CoRR abs/2406.17557.
- K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi (2021). WinoGrande: an adversarial Winograd schema challenge at scale. Communications of the ACM 64(9), pp. 99–106.
- S. Savkin, E. Porat, O. Ordentlich, and Y. Polyanskiy (2025). NestQuant: nested lattice quantization for matrix products and LLMs. In Proceedings of the International Conference on Machine Learning.
- A. Tseng, J. Chee, Q. Sun, V. Kuleshov, and C. D. Sa (2024a). QuIP#: even better LLM quantization with Hadamard incoherence and lattice codebooks. In Proceedings of the International Conference on Machine Learning.
- A. Tseng, Q. Sun, D. Hou, and C. De Sa (2024b). QTIP: quantization with trellises and incoherence processing. In Proceedings of the International Conference on Neural Information Processing Systems, pp. 59597–59620.
- A. Tseng, Z. Sun, and C. D. Sa (2025). Model-preserving adaptive rounding. CoRR abs/2505.22988.
- H. Wang, H. Zhao, X. Yu, Z. Yao, X. Han, Z. Liu, and M. Sun (2026). UniSVQ: 2-bit unified scalar-vector quantization. In Proceedings of the International Conference on Neural Information Processing Systems.
- R. Wang, Y. Gong, X. Liu, G. Zhao, Z. Yang, B. Guo, Z. Zha, and P. Cheng (2025). Optimizing large language model training using FP4 quantization. CoRR abs/2501.17116.
- M. Weber, D. Y. Fu, Q. Anthony, Y. Oren, S. Adams, A. Alexandrov, X. Lyu, H. Nguyen, X. Yao, V. Adams, B. Athiwaratkun, R. Chalamala, K. Chen, M. Ryabinin, T. Dao, P. Liang, C. Ré, I. Rish, and C. Zhang (2024). RedPajama: an open dataset for training large language models. In Proceedings of the International Conference on Neural Information Processing Systems, pp. 116462–116492.
- H. Xing, C. Yuan, Y. Dawei, C. Zhixuan, X. Zukang, Y. Jiangyong, X. Chen, Y. Zhihang, J. Zhe, and Z. Sifan (2025). OSTQuant: refining large language model quantization with orthogonal and scaling transformations for better distribution fitting. In The Thirteenth International Conference on Learning Representations.
- A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, C. Zheng, D. Liu, F. Zhou, F. Huang, F. Hu, H. Ge, H. Wei, H. Lin, J. Tang, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Bao, K. Yang, L. Yu, L. Deng, M. Li, M. Xue, M. Li, P. Zhang, P. Wang, Q. Zhu, R. Men, R. Gao, S. Liu, S. Luo, T. Li, T. Tang, W. Yin, X. Ren, X. Wang, X. Zhang, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Zhang, Y. Wan, Y. Liu, Z. Wang, Z. Cui, Z. Zhang, Z. Zhou, and Z. Qiu (2025). Qwen3 technical report. CoRR abs/2505.09388.
- P. Yin, J. Lyu, S. Zhang, S. Osher, Y. Qi, and J. Xin (2019). Understanding straight-through estimator in training activation quantized neural nets. CoRR abs/1903.05662.
- R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi (2019). HellaSwag: can a machine really finish your sentence? In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4791–4800.
- J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou (2023). Instruction-following evaluation for large language models. CoRR abs/2311.07911.
- Z. Zhou, X. Li, M. Li, H. Zhang, H. Wang, W. Chang, Y. Liu, Q. Dang, D. Yu, Y. Ma, and H. Wang (2025). CCQ: convolutional code for extreme low-bit quantization in LLMs. CoRR abs/2507.07145.

## Appendix A

### A.1 Samples of Training Data

This section presents samples from the two datasets used for training: FineWeb and AM-Qwen3-Distilled. As [Figure 6](https://arxiv.org/html/2606.10531#A1.F6) shows, FineWeb is a large-scale, high-quality web text dataset curated for language model pretraining. As [Figure 7](https://arxiv.org/html/2606.10531#A1.F7) shows, AM-Qwen3-Distilled is a distilled dataset generated from the outputs of Qwen3 models, designed to provide high-quality supervised data for improving instruction-following and reasoning capabilities. All training is conducted with a sequence length of 2048.

Sample 1: The Net Neutrality repeal vote is coming. Tell these Dems to vote Yes. The House of Representatives is likely to vote tomorrow, Thursday, on the repeal of the FCC’s Net Neutrality power grab. Using the Congressional Review Act, the repeal of the Net Neutrality order can be accomplished in an expedited way. In particular this means the bill cannot be filibustered in the Senate, so passing it means something. As Seton Motley said: This is our first opportunity . Read More.

Sample 2: Tips for Preventing Medicare Fraud: The Department of Health and Human Services (HHS) Office of the Inspector General (OIG) has created a new web site to provide tips for preventing Medicare fraud and medical identity theft. See HHS news release. OIG’s new web site includes a brochure containing tips, where to report fraud, and other resources. In addition, CMS issued a Medicare Fraud & Abuse Fact Sheet earlier this year.

Sample 3: Great decorating addition.I have a grape/Italian theme in my kitchen. I purchased 5 of these. I decided to use them to put around my pull knobs on my overhead cabinets. Now I am ordering more to sprinkle around in other places in the kitchen - even to hang up via suction cups on my white kitchen tile. September 20, 2012.

Figure 6: Examples of FineWeb.

Sample 1:

- human: Write a python function to reverse the strings in a given list of strings. For example, given the list ["hello", "world"], the function should return ["olleh", "dlrow"].
- assistant: `def reverse_strings(list_of_strings): return [s[::-1] for s in list_of_strings]`

Sample 2:

- human: Write a python function that takes in two integers, a and b, and returns the sum of the two integers.
- assistant: `def add_numbers(a, b): return a + b`

Sample 3:

- human: I need to create a Python function that converts a given string to uppercase. How can I achieve this?
- assistant: `def to_uppercase(input_string): return input_string.upper()`

Figure 7: Examples of AM-Qwen3-Distilled showing human instructions and assistant responses.

### A.2 Inference Speed

We report inference throughput on a single NVIDIA A100 GPU with batch size 1 and sequence length 1024 (CUDA Graph enabled). As shown in Table [6](https://arxiv.org/html/2606.10531#A1.T6), LC-QAT achieves the highest throughput among VQ baselines while supporting both $d=4$ and $d=8$ with nearly identical speed.

Table 6: Throughput comparison on LLaMA-3-8B (single A100, bs=1, 1024 tokens, CUDA Graph).

| Method | $d$ | Throughput (tok/s)↑ |
|---|---|---|
| FP16 | – | 60.38 |
| AQLM 2×8 | 8 | 70.97 |
| QuIP# E8P | 8 | 79.70 |
| QuIP# D4 | 4 | 23.80 |
| LC-QAT | 4 | 101.65 |
| LC-QAT 2×8 | 8 | 101.25 |

Our implementation consists of (1) the Hadamard transformation using the publicly available fast-hadamard-transform library, and (2) a custom fused CUDA kernel implementing the reformulated forward pass in Equation [8](https://arxiv.org/html/2606.10531#S3.E8), which fuses affine dequantization ($A \cdot w_{\mathrm{int}} + B$), dot-product accumulation, and block reduction into a single kernel launch to minimize memory traffic.
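The affine dequantization step that the fused kernel performs, $A \cdot w_{\mathrm{int}} + B$, can be illustrated outside CUDA. The pure-Python sketch below assumes that each group of $d$ integer codes is mapped by a shared $d \times d$ matrix `A` and a length-$d$ offset `B`, and that the dequantized row is then dotted with the input. These shapes, the sharing of `A` and `B` across groups, and all function names are assumptions for illustration; the Hadamard transform and the kernel fusion itself are omitted.

```python
def dequantize_group(A, B, w_int):
    """Return A @ w_int + B for one group of d integer codes."""
    d = len(w_int)
    return [sum(A[i][j] * w_int[j] for j in range(d)) + B[i] for i in range(d)]


def dequantize_row(A, B, groups):
    """Concatenate the dequantized groups that make up one weight row."""
    row = []
    for w_int in groups:
        row.extend(dequantize_group(A, B, w_int))
    return row


def dot_with_input(A, B, groups, x):
    """Dequantize one weight row and accumulate its dot product with x."""
    weights = dequantize_row(A, B, groups)
    return sum(w * xi for w, xi in zip(weights, x))


# Toy example with d = 2 for brevity (the paper evaluates d = 4 and d = 8).
A = [[2, 0], [1, 1]]
B = [1, -1]
groups = [[3, -2], [0, 1]]
print(dequantize_row(A, B, groups))                # [7, 0, 1, 0]
print(dot_with_input(A, B, groups, [1, 1, 1, 1]))  # 8
```

Because the mapping is a small dense matrix-vector product rather than a table lookup, the same code path works for any group size, which is consistent with the near-identical throughput reported for $d=4$ and $d=8$.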

Currently, only QuIP# provides an official CUDA kernel supporting $d=4$, but it is not well optimized for bs=1 inference, and AQLM/QTIP hardcode $d$ in their kernels. In contrast, LC-QAT’s affine dequantization is structurally simple and generalizes across group sizes with the same fused kernel, which explains the strong throughput and the negligible overhead when increasing $d$.

Table 7: Total wall-clock time comparison including PTQ initialization (estimated on 8 A800 GPUs).

| Method | PTQ time (h) | QAT time (h) | Total time (h) |
|---|---|---|---|
| LC-QAT | 6 | 55 | 61 |
| ParetoQ | N/A | 417 | 417 |

### A.3 Detailed Results of Preliminary Optimization Analysis

[Table 8](https://arxiv.org/html/2606.10531#A1.T8) shows the performance discrepancy between the initialization point used by LC-QAT and that of scalar quantization. In this experiment, we select OSTQuant (Xing et al., [2025](https://arxiv.org/html/2606.10531#bib.bib27)), which is a state-of-the-art SQ method, as the baseline. Similar to our approach, OSTQuant utilizes the Hadamard transform to mitigate the impact of outliers.

While OSTQuant performs excellently at 4 bits and above, the results in [Table 8](https://arxiv.org/html/2606.10531#A1.T8) reveal that it suffers a performance loss of roughly 40% under an aggressive 2-bit quantization strategy. Notably, on more complex tasks such as ARC-C, its performance degrades to a level close to random guessing. In contrast, the LC-QAT initialization preserves about 84% or more of the zero-shot QA accuracy. These findings, consistent with the results in [Section 2.3](https://arxiv.org/html/2606.10531#S2.SS3), validate the superior quality of our initialization. Given that OSTQuant also incorporates an online Hadamard transform, the only distinction between the two lies in the choice of vector versus scalar quantization. This comparison underscores the ability of VQ to maintain model performance in the ultra-low bit-width regime.

Table 8: Zero-shot performance comparison on LLaMA3 models. We report the initial point of LC-QAT (LC-PTQ) and scalar quantization baselines on zero-shot commonsense reasoning benchmarks.

| Model | Method | AC | AE | BQ | HS | PQ | WG | Avg. | Per. |
|---|---|---|---|---|---|---|---|---|---|
| LLaMA3-3B | FP16 | 50.07 | 72.60 | 74.60 | 74.30 | 78.20 | 69.20 | 69.83 | 1.00 |
| LLaMA3-3B | OSTQuant | 23.81 | 34.34 | 59.51 | 33.23 | 55.82 | 50.99 | 42.95 | 0.62 |
| LLaMA3-3B | LC-PTQ | 33.28 | 58.08 | 69.14 | 57.96 | 71.06 | 60.85 | 58.40 | 0.84 |
| LLaMA3-8B | FP16 | 56.57 | 80.93 | 86.61 | 74.94 | 77.80 | 67.88 | 74.12 | 1.00 |
| LLaMA3-8B | OSTQuant | 25.17 | 39.27 | 60.67 | 37.79 | 60.39 | 52.41 | 42.95 | 0.58 |
| LLaMA3-8B | LC-PTQ | 38.65 | 66.04 | 74.74 | 67.02 | 75.90 | 65.59 | 64.66 | 0.87 |

### A.4 Sensitivity to Group Size

We study the sensitivity of LC-PTQ to the VQ group size $d$ and calibration data. On Qwen-3-1.7B at the PTQ stage, changing the group size from $d=4$ to $d=8$ results in only marginal differences in zero-shot accuracy, while switching the calibration dataset from RedPajama to an in-house dataset introduces a modest PTQ-level gap.

Table 9: Sensitivity of LC-PTQ to group size $d$ and calibration data (Qwen-3-1.7B, PTQ stage).

| $d$ | Calibration | ARC-C | ARC-E | BoolQ | HellaSwag | PIQA | WinoGrande | Avg. |
|---|---|---|---|---|---|---|---|---|
| 4 | RedPajama | 29.35 | 44.87 | 63.79 | 43.88 | 64.96 | 54.46 | 50.22 |
| 8 | RedPajama | 29.10 | 45.66 | 63.49 | 43.77 | 64.85 | 55.33 | 50.37 |
| 4 | In-house | 27.47 | 43.06 | 62.26 | 37.71 | 62.46 | 54.78 | 47.96 |

### A.5 Scaling Analysis

#### A.5.1 Data Scaling

We conduct a data-scaling analysis on Qwen-3-1.7B, where performance continues to improve beyond 4B tokens without saturation. This supports our primary claim of *data efficiency*: LC-QAT achieves competitive downstream accuracy with a substantially smaller data budget. A symmetric scaling analysis for ParetoQ is not feasible since its training data is not publicly available.

Table 10: Data scaling analysis on Qwen-3-1.7B.

| #Tokens | ARC-C | ARC-E | BoolQ | HellaSwag | PIQA | WinoGrande | Avg. |
|---|---|---|---|---|---|---|---|
| 1B | 39.51 | 64.35 | 70.43 | 55.01 | 71.65 | 57.54 | 59.75 |
| 4B | 41.72 | 68.18 | 69.85 | 58.23 | 72.69 | 57.93 | 61.43 |
| 10B | 42.41 | 68.90 | 69.02 | 59.60 | 73.45 | 61.33 | 62.45 |

#### A.5.2 Parameter Scaling

We further evaluate LC\-QAT on a 14B\-parameter model using approximately 58M tokens\. The advantage of LC\-QAT over the PTQ initialization remains consistent at this larger scale\.

Table 11: Results on a 14B model (58M tokens).

| Method | Wiki | C4 | ARC-C | ARC-E | BoolQ | HellaSwag | PIQA | WinoGrande | Avg. | Per. |
|---|---|---|---|---|---|---|---|---|---|---|
| FP16 | 8.64 | 13.81 | 60.15 | 82.83 | 89.30 | 78.84 | 79.87 | 72.85 | 77.31 | 1.00 |
| LC-PTQ | 11.41 | 16.85 | 51.62 | 77.69 | 87.16 | 69.82 | 77.42 | 71.59 | 72.55 | 0.94 |
| LC-QAT | 10.08 | 15.79 | 54.69 | 79.25 | 85.90 | 73.66 | 78.51 | 73.88 | 74.32 | 0.96 |

### A.6 Additional Reasoning Benchmarks of Base Models

We evaluate LC\-QAT and PV\-Tuning on harder reasoning benchmarks\. LC\-QAT consistently achieves higher accuracy\.

Table 12: Additional reasoning benchmarks.

| Task | FP16 | LC-QAT | PV-Tuning |
|---|---|---|---|
| MMLU | 55.26 | 45.92 | 44.70 |
| CEval | 58.25 | 36.03 | 34.70 |
| CMMLU | 56.95 | 36.46 | 35.56 |