LiftQuant: Continuous Bit-Width LLM via Dimensional Lifting and Projection


Summary

LiftQuant introduces a 'lift-then-project' mechanism enabling continuous (non-integer) bit-width quantization for LLMs, allowing precise fitting to hardware memory budgets. The framework compresses a 70B LLM to 2.4-bit to fit a 24GB GPU, outperforming state-of-the-art 2-bit models.

arXiv:2606.04050v1 Announce Type: new

Cached at: 06/05/26, 02:18 AM

# LiftQuant: Continuous Bit-Width LLM via Dimensional Lifting and Projection
Source: [https://arxiv.org/html/2606.04050](https://arxiv.org/html/2606.04050)
XuanAng Liu, Juntao Liu, Taolue Feng, Ting Lu, Chunsheng Gan, Zhiyv Peng, Yuan Du, Huanrui Yang, Yijiang Liu, Li Du

###### Abstract

Existing quantization methods are fundamentally limited by rigid, integer-based bit-widths (e.g., 2-bit or 3-bit), resulting in a “deployment gap” where Large Language Models cannot be optimally fitted to specific memory budgets. To bridge this gap, we introduce LiftQuant, a novel framework that enables continuous bit-width control for true Pareto-optimal deployment. The core innovation is a “lift-then-project” mechanism, which approximates low-dimensional weight vectors by projecting a simple 1-bit lattice from a higher-dimensional “lifted” space. Crucially, the effective bit-width is determined simply by the ratio of the lifted dimension to the original dimension, so the bit-width can be tuned quasi-continuously because the dimension is a flexible structural parameter. This projection generates a structured yet non-uniform codebook, capturing the expressive power of Vector Quantization (VQ). Unlike VQ, LiftQuant’s decoding path relies solely on linear transformations and 1-bit uniform quantizers, which keeps it hardware-friendly. This flexibility is transformative: LiftQuant enables a 70B LLM to be compressed to 2.4 bits to precisely fit a 24GB GPU, where its performance significantly surpasses state-of-the-art 2-bit models fitted on the same device. Our code and checkpoints are available at [https://github.com/Heliulu/LiftQuant](https://github.com/Heliulu/LiftQuant).

Machine Learning, ICML

## 1 Introduction

Large Language Models \(LLMs\) have demonstrated unprecedented capabilities across a wide range of tasks, but their massive parameter counts pose a severe challenge for deployment\. The “memory wall” remains the primary bottleneck: running state\-of\-the\-art models \(e\.g\., 70B parameters\) typically requires high\-end, multi\-GPU clusters, making them inaccessible for deployment on commodity hardware or edge devices\. Consequently, weight\-only quantization has emerged as a standard practice to compress these models into manageable footprints\.

However, a fundamental inefficiency plagues current quantization paradigms: the rigidity of integer bit-widths. Existing methods, whether based on Uniform Quantization (UQ) or Vector Quantization (VQ), force users to choose between discrete compression levels (e.g., 2-bit, 3-bit, or 4-bit). This creates a significant “deployment gap” between model size and hardware capacity. For instance, consider deploying a Llama-3-70B model on a consumer-grade GPU with 24GB of VRAM: a 3-bit quantization is too large to fit, while a 2-bit quantization, though small enough, suffers from a catastrophic drop in reasoning capability. The memory headroom between the 2-bit and 3-bit footprints is effectively wasted, and the model’s potential performance is capped by the coarse granularity of the quantization scheme.
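The arithmetic behind this example can be sketched in a few lines. This is a back-of-the-envelope estimate only: it assumes a nominal 70 billion quantized weights, counts GB as 10^9 bytes, and ignores everything else that occupies VRAM (KV cache, embeddings, `lm.head`, quantization scales). The function name `weight_memory_gb` is introduced here for illustration.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory taken by the quantized weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

NUM_PARAMS = 70e9   # assumption: nominal parameter count, all weights quantized
BUDGET_GB = 24.0    # nominal VRAM of the GPU in the example

for bits in (2.0, 2.4, 3.0):
    gb = weight_memory_gb(NUM_PARAMS, bits)
    print(f"{bits:.1f}-bit: {gb:.2f} GB, fits in budget: {gb <= BUDGET_GB}")
```

Under these assumptions the 3-bit weights need about 26 GB, the 2-bit weights about 17.5 GB, and a 2.4-bit model about 21 GB, which leaves a few GB for the runtime buffers mentioned in the Figure 1 caption.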

![Refer to caption](https://arxiv.org/html/2606.04050v1/x1.png)

Figure 1: Pareto-Optimal Deployment on a 24GB GPU. Perplexity (WikiText-2 and C4) vs. Memory Footprint for Llama-3-70B. While advanced integer-based methods like QTIP and EfficientQAT leave memory wasted or exceed the limit, LiftQuant enables a 2.4-bit model that fully utilizes the available VRAM, significantly outperforming 2-bit baselines. Note that the reserved memory buffer (red zone) is dynamic, varying with deployment scenarios (e.g., KV cache length and precision, batch size, lm.head precision). LiftQuant allows for flexible bit-width tuning to precisely match the remaining available memory.

To bridge this gap, we introduce LiftQuant, a novel quantization framework that transforms the rigid selection of bit-widths into a continuous design space. LiftQuant is, to the best of our knowledge, the first framework to enable arbitrary fractional bit-widths (e.g., 2.4-bit) for LLMs, allowing for true Pareto-optimal deployment under strict memory constraints. Our approach departs from traditional scalar or vector codebook learning. Instead, we employ a “lift-then-project” mechanism: we construct each weight vector as a learned linear combination of elements from a simple 1-bit lattice defined in a higher-dimensional space. This approach effectively decouples the quantization rate from the coding format: the equivalent bit-width is simply the ratio between the dimension of the high-dimensional lifted space and that of the target weight space. By marginally adjusting the size of this lifted dimension, LiftQuant can modulate the compression rate with fine-grained precision, naturally yielding continuous, fractional bit-widths without altering the underlying quantization operator.

This paradigm shift offers a “best-of-both-worlds” solution. The projection generates a structured, non-uniform quantization that rivals the expressive power of a vector codebook, yet the decoding process relies solely on a low-complexity linear transformation and a 1-bit uniform quantizer. Our extensive experiments demonstrate that LiftQuant not only matches state-of-the-art integer quantization methods but, more importantly, dominates the Pareto frontier in practical deployment scenarios. For instance, LiftQuant enables a 70B model to be compressed to 2.4 bits to fit precisely within a 24GB GPU ([Figure 1](https://arxiv.org/html/2606.04050#S1.F1)), and similarly allows a 32B model to be deployed at 2.5 bits on a widely available 12GB GPU.

Our main contributions are summarized as follows:

- Continuous Bit-Width Control for Pareto Optimality: We propose LiftQuant, the first framework to enable continuous bit-width adjustment by decoupling quantization from integer grids. This flexibility allows models to fully utilize available hardware memory, achieving true Pareto-optimal deployment.
- High-Dimensional Non-Uniformity: We introduce a “lift-then-project” mechanism that procedurally generates structured, non-uniform codebooks from a high-dimensional space. This approach captures the expressive power of Vector Quantization (VQ), enabling LiftQuant to match or surpass the accuracy of state-of-the-art VQ methods.
- Unified, Hardware-Friendly Inference Architecture: We achieve this high accuracy with a decoding path that relies solely on low-complexity linear transformations and Int1 uniform quantizers. This provides a single, unified operator that supports arbitrary precision configurations, simplifying engineering deployment.

## 2 Related Work

Weight\-only quantization has emerged as one of the most effective strategies for deploying large language models \(LLMs\) under strict memory and latency constraints\.

Uniform Scalar Quantization (UQ) is the most widely used approach, where a floating-point weight vector $w$ is represented as $w_{q}\cdot s$, with $w_{q}$ storing low-bit integer values and $s$ being a floating-point scaling factor. Due to the non-uniform value distribution of LLM weights, recent UQ methods introduce lightweight preprocessing to make weights more amenable to quantization. These include group-wise quantization to preserve important channels (e.g., AWQ (Lin et al., [2024](https://arxiv.org/html/2606.04050#bib.bib46))), low-rank error compensation (e.g., QLoRA (Dettmers et al., [2023](https://arxiv.org/html/2606.04050#bib.bib57)), (Liu et al., [2025](https://arxiv.org/html/2606.04050#bib.bib28))), and matrix-based transforms to reshape weight distributions (e.g., QuIP# (Tseng et al., [2024a](https://arxiv.org/html/2606.04050#bib.bib48)), Quarot (Ashkboos et al., [2024](https://arxiv.org/html/2606.04050#bib.bib31)), SpinQuant (Liu et al., [2024b](https://arxiv.org/html/2606.04050#bib.bib32)), FlatQuant (Sun et al., [2024](https://arxiv.org/html/2606.04050#bib.bib45))).
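For readers less familiar with the $w \approx w_{q}\cdot s$ representation, the following is a minimal sketch of a uniform scalar quantizer. It assumes a symmetric scheme with one scale per vector, which is an illustrative choice and not the configuration of any method cited above; the helper names are hypothetical.

```python
def quantize_uniform(weights, bits):
    """Symmetric uniform quantizer: returns integer codes w_q and a scale s."""
    qmax = 2 ** (bits - 1) - 1          # largest positive code, e.g. 7 for 4-bit
    qmin = -(2 ** (bits - 1))           # most negative code, e.g. -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest grid point and clamp to the representable range.
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale

def dequantize(codes, scale):
    """Reconstruct approximate weights as w_q * s."""
    return [c * scale for c in codes]

codes, scale = quantize_uniform([0.5, -1.0, 0.25, 1.0], bits=4)
print(codes, scale, dequantize(codes, scale))
```

The grid is evenly spaced, so values that cluster near zero (as LLM weights tend to) share few levels, which is the limitation the preprocessing methods above try to soften.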

Non-uniform Quantization methods improve performance by creating specialized codebooks. These can be scalar-based, using data-driven levels (e.g., NF4 (Dettmers et al., [2023](https://arxiv.org/html/2606.04050#bib.bib57))) or additive basis vectors (e.g., BCQ (Xu et al., [2018](https://arxiv.org/html/2606.04050#bib.bib64); Park et al., [2025](https://arxiv.org/html/2606.04050#bib.bib65))), but they miss inter-dimensional correlations. Vector Quantization (VQ) addresses this by mapping weight vectors to a learned codebook, exploiting inter-element correlations for superior accuracy in ultra-low-bit regimes (e.g., AQLM (Egiazarian et al., [2024](https://arxiv.org/html/2606.04050#bib.bib55)), VPTQ (Liu et al., [2024a](https://arxiv.org/html/2606.04050#bib.bib54)), QTIP (Tseng et al., [2024b](https://arxiv.org/html/2606.04050#bib.bib62))). However, VQ’s reliance on large, hardware-unfriendly lookup tables imposes significant decoding overhead.

The Inflexibility of Integer Bit-Widths. Despite their diversity, a critical limitation unites these methods: their reliance on rigid, integer-based bit-widths (e.g., 2, 3, 4-bit). This inflexibility prevents models from being optimally fitted to specific hardware memory budgets. While some workarounds exist, they are fundamentally constrained. For instance, UQ methods can coarsely modulate the effective bit-width by varying the group size (e.g., from 128 to 64 in EfficientQAT (Chen et al., [2024](https://arxiv.org/html/2606.04050#bib.bib56))), but this provides only a few discrete “gears” to shift between, not a continuous spectrum. Other approaches achieve specific fractional bit-widths by using non-power-of-two codebooks (e.g., ternary quantization at $\sim$1.58 bit (Wang et al., [2025](https://arxiv.org/html/2606.04050#bib.bib66))), but require specialized, non-standard kernels. Most notably, Q-Palette (Lee and Song, [2025](https://arxiv.org/html/2606.04050#bib.bib67)) recently proposed a collection of fractional-bit quantizers. However, it achieves fractional bits by assembling a heterogeneous mix of different quantizers (scalar, vector, trellis), which necessitates maintaining a complex library of specialized kernels for each configuration. In contrast, our LiftQuant enables continuous bit-width control through a single, unified, and hardware-friendly architecture. By simply tuning the projection dimension, we achieve arbitrary bit-widths without changing the underlying operator.
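The discrete “gears” can be made concrete with a short calculation. The sketch below assumes the simplest accounting, one 16-bit scale per group and no zero-point, which is an assumption made here rather than the documented storage format of any cited method; `groupwise_bits` is a hypothetical helper.

```python
import math

def groupwise_bits(base_bits, group_size, scale_bits=16):
    """Average bits per weight when each group stores one extra scale."""
    return base_bits + scale_bits / group_size

print(groupwise_bits(2, 128))   # 2.125
print(groupwise_bits(2, 64))    # 2.25
print(math.log2(3))             # bits per ternary symbol, about 1.585
```

Halving the group size moves the effective bit-width in coarse steps (2.125, 2.25, 2.5, ...), and a ternary codebook lands on a single fixed point, so neither covers the range between integer bit-widths continuously.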

![Refer to caption](https://arxiv.org/html/2606.04050v1/fig2-1.png) (a) $[\pm 1]^{2}\to\mathbb{R}^{1}$ = 2 bit
![Refer to caption](https://arxiv.org/html/2606.04050v1/fig2-2.png) (b) $[\pm 1]^{4}\to\mathbb{R}^{2}$ = 2 bit
![Refer to caption](https://arxiv.org/html/2606.04050v1/fig2-3.png) (c) $[\pm 1]^{5}\to\mathbb{R}^{2}$ = 2.5 bit
![Refer to caption](https://arxiv.org/html/2606.04050v1/fig2-4.png) (d) $[\pm 1]^{8}\to\mathbb{R}^{3}$ = 2.67 bit

Figure 2: Visualization of Codeword Generation in LiftQuant. Our method generates a structured, non-uniform codebook by projecting a simple, uniform lattice from a high-dimensional “lifted” space onto a lower-dimensional target subspace.

## 3 LiftQuant: Continuous Bit Width Control Via Lifted Projection

Current quantization paradigms are trapped in a rigid coupling between representation capacity and integer bit-widths. Whether using scalar grids (UQ) or vector codebooks (VQ), the effective bit-rate is determined by discrete design choices (such as the number of grid points or the codebook size) that cannot be smoothly adjusted. This rigidity creates the “deployment gap” discussed in [Section 1](https://arxiv.org/html/2606.04050#S1) and [Figure 1](https://arxiv.org/html/2606.04050#S1.F1), preventing models from optimally utilizing available hardware memory. Furthermore, while VQ offers superior accuracy through non-uniform quantization, its reliance on lookup tables (LUTs) introduces significant latency and engineering complexity, making it difficult to deploy efficiently.

Key Insight: Decoupling Bit-Width from Coding Format. Our key insight is that we can decouple the effective bit-width from the coding format by shifting the quantization process to a higher-dimensional space. Instead of quantizing directly in the target weight space $\mathbb{R}^{d}$, we propose to represent weights as the projection of a simple, 1-bit uniform lattice from a higher-dimensional “lifted” space $\mathbb{R}^{D}$.

Crucially, this “lift-then-project” mechanism transforms the bit-width from a discrete architectural constant into a continuous, tunable ratio $D/d$. By simply adjusting the dimension $D$, we can achieve any desired fractional bit-width (e.g., 24/10 = 2.4-bit) without changing the underlying 1-bit quantization operator. As visualized in [Figure 2](https://arxiv.org/html/2606.04050#S2.F2), this linear projection naturally generates a dense, Gaussian-like codebook in the target space, thereby capturing the expressive power of VQ while retaining the hardware efficiency of a simple matrix multiplication.
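To make the $D/d$ ratio tangible, the sketch below enumerates the codebook produced by projecting every sign vector in $\{-1,+1\}^{D}$ through a small matrix. The matrix values are hypothetical and chosen only for illustration (the paper learns $\bm{M}$), and `codebook` is a helper name introduced here.

```python
import itertools

def codebook(M):
    """All 2^D codewords M @ w_q for w_q in {-1,+1}^D; M has d rows of length D."""
    D = len(M[0])
    words = []
    for signs in itertools.product((-1, 1), repeat=D):
        words.append(tuple(sum(m * s for m, s in zip(row, signs)) for row in M))
    return words

# Hypothetical 2 x 5 mapping matrix (d = 2, D = 5), as in panel (c) of Figure 2.
M = [[1.0, 0.5, 0.25, 0.0, 0.7],
     [0.0, 0.3, -0.6, 1.0, 0.2]]
d, D = len(M), len(M[0])
cb = codebook(M)
print(f"bit-width = {D}/{d} = {D / d}")
print(f"{len(cb)} codewords in R^{d}, {len(set(cb))} of them distinct")
```

Each block of $d$ weights is stored as $D$ sign bits, so changing $D$ by one moves the bit-width by $1/d$ while the stored symbols remain single bits.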

Based on these principles, LiftQuant operates in three phases. In [Section 3.1](https://arxiv.org/html/2606.04050#S3.SS1), we learn the global projection matrix $\bm{M}$ that defines the fractional bit-width and codebook structure. In [Section 3.2](https://arxiv.org/html/2606.04050#S3.SS2), we introduce a lightweight, layer-wise whitening transform $\bm{T}$ to reshape weights into the i.i.d. Gaussian distribution required by our projection. Finally, in [Section 3.3](https://arxiv.org/html/2606.04050#S3.SS3), we detail the quantization and decoding pipeline, where the fused operation $\bm{o}=\text{diag}(\bm{s})\bm{W}_{q}(\bm{M}\bm{T}^{-1}\bm{a}^{T})$ enables efficient inference.

### 3.1 Projection from Lifted-Space to Subspace

Our approach is grounded in the asymptotic properties of high-dimensional geometry. Specifically, a corollary of the Central Limit Theorem (CLT) states that the linear projection of a high-dimensional hypercubic lattice (i.e., independent Bernoulli variables) onto a lower-dimensional subspace converges to a Gaussian distribution as the dimension increases (Diaconis and Freedman, [1984](https://arxiv.org/html/2606.04050#bib.bib68)).

Formally, for a weight vector $\bm{w}\simeq\bm{M}\bm{w}_{q}$, where $\bm{w}\in\mathbb{R}^{d}$ and $\bm{w}_{q}\in\{+1,-1\}^{D}$, each element $w_{i}=\sum_{j=1}^{D}\bm{M}_{ij}\,w_{q,j}$ is a sum of independent random variables. We call $\bm{M}$ the mapping matrix. Consequently, the resulting codebook naturally forms a dense, Gaussian-like cloud in the target space. This provides a strong theoretical justification for our method: LiftQuant does not merely “learn” to fit the Gaussian weights of LLMs; it structurally generates a Gaussian prior by design.
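One property of this construction can be checked exactly on a toy case: when the sign vectors are uniform over the hypercube, the codewords have zero mean and covariance $\bm{M}\bm{M}^{T}$, so orthonormal rows give the identity covariance of a standard Gaussian. The sketch below verifies only these first two moments on a hypothetical $2\times 4$ matrix; the Gaussian shape itself is the asymptotic claim cited above and is not tested here.

```python
import itertools

def codeword_moments(M):
    """Exact mean and second-moment matrix of M @ w_q over all w_q in {-1,+1}^D."""
    d, D = len(M), len(M[0])
    n = 2 ** D
    mean = [0.0] * d
    cov = [[0.0] * d for _ in range(d)]
    for signs in itertools.product((-1, 1), repeat=D):
        w = [sum(m * s for m, s in zip(row, signs)) for row in M]
        for i in range(d):
            mean[i] += w[i] / n
            for j in range(d):
                cov[i][j] += w[i] * w[j] / n
    return mean, cov

# Illustrative matrix with orthonormal rows (scaled Hadamard rows), d = 2, D = 4.
h = 0.5
M = [[h, h, h, h],
     [h, -h, h, -h]]
mean, cov = codeword_moments(M)
print(mean, cov)
```

Because the mean is zero, the second-moment matrix computed here equals the covariance.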

![Refer to caption](https://arxiv.org/html/2606.04050v1/dequant.png)

Figure 3: The LiftQuant Dequant Mechanism. A 1-bit quantized tensor in the high-dimensional lifted space is projected via the mapping matrix $\bm{M}$ to generate the de-quantized weight tensor.

Optimization of $\bm{M}$. While the CLT guarantees asymptotic Gaussianity, practical deployment requires a finite and relatively small lifted dimension $D$ to maintain computational efficiency. In this regime, the convergence to a perfect Gaussian is incomplete. To bridge this gap, we explicitly optimize $\bm{M}$ to minimize the quantization error on a standard Gaussian distribution. We initialize the rows of $\bm{M}$ as orthonormal vectors to encourage uncorrelated projections. The matrix is then trained on $\mathcal{W}_{\mathcal{N}}\sim\mathcal{N}(0,\bm{I}_{d})$:

$$\bm{M}^{*}=\arg\min_{\bm{M}}\;\mathbb{E}_{\bm{w}\sim\mathcal{N}}\left[\min_{\bm{w}_{q}\in\{-1,+1\}^{D}}\big\|\bm{w}-\bm{M}\bm{w}_{q}\big\|\right], \tag{1}$$

where $D=d\cdot b$ for a target bit-width $b$, and the inner minimization represents the nearest-neighbor search (quantization). During training, we approximate the non-differentiable $\arg\min$ operation with a relaxation at a temperature of 10 to enable gradient-based optimization.
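The text does not spell out the relaxation, so the following is only one plausible reading: a softmin over all codewords, weighted by $\exp(-\tau\cdot\text{error})$ with $\tau=10$. The use of squared error, the multiplicative role of the temperature, and the name `soft_quant_error` are all assumptions made for this sketch, which evaluates the forward pass only and is practical only for small $D$.

```python
import itertools
import math

def soft_quant_error(M, w, temperature=10.0):
    """Softmin-weighted squared error between w and every codeword M @ w_q.

    Returns (soft_error, hard_error), where hard_error is the exact minimum.
    """
    D = len(M[0])
    errs = []
    for signs in itertools.product((-1, 1), repeat=D):
        recon = [sum(m * s for m, s in zip(row, signs)) for row in M]
        errs.append(sum((a - b) ** 2 for a, b in zip(w, recon)))
    lo = min(errs)
    # Subtracting the minimum keeps the exponentials numerically stable.
    weights = [math.exp(-temperature * (e - lo)) for e in errs]
    soft = sum(wt * e for wt, e in zip(weights, errs)) / sum(weights)
    return soft, lo

M = [[0.6, 0.3, 0.1], [0.2, -0.5, 0.4]]   # hypothetical 2 x 3 mapping matrix
soft, hard = soft_quant_error(M, [0.3, -0.2])
print(soft, hard)
```

The soft error is a smooth upper bound on the hard nearest-neighbor error and approaches it as the temperature grows, which is what makes it usable as a training surrogate in an autodiff framework.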

Accelerated Nearest-Neighbor Search. A key challenge in training $\bm{M}$ is the exponential complexity of the exact nearest-neighbor search, which scales with $2^{D}$. To accelerate this, we employ a heuristic search strategy based on the generalized inverse. Specifically, we pad the target vector $\bm{w}$ with a $(D-d)$-dimensional auxiliary vector $z\in\{+1,-1\}^{D-d}$, and project it back to the lifted space $\mathbb{R}^{D}$ using the pseudo-inverse of $\bm{M}$ (derived via SVD). This provides a high-quality initialization for local search, effectively reducing the search space to $2^{D-d}$. This heuristic is also applied during the quantization process (mapping $\bm{w}$ to $\bm{w}_{q}$).
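The size of the reduction is easy to quantify. The sketch below simply evaluates the two counts stated above, $2^{D}$ and $2^{D-d}$, for a few dimension pairs; it does not implement the pseudo-inverse heuristic itself, and `search_space` is a name introduced here.

```python
def search_space(D, d):
    """Candidate counts: exhaustive search (2^D) vs. the reduced space (2^(D-d))."""
    return 2 ** D, 2 ** (D - d)

for D, d in [(16, 8), (32, 16), (30, 14), (24, 10)]:
    full, reduced = search_space(D, d)
    print(f"D={D}, d={d}: 2^{D} = {full:,} vs 2^{D - d} = {reduced:,}")
```

For $D=32$, $d=16$ this is roughly 4.3 billion candidates against 65,536 per block, which is why the exhaustive search is impractical while the reduced one remains feasible.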

Table 1: Comparison of quantization efficiency on a standard Gaussian source and Llama-2-7B perplexity on wikitext2 (CTX=2048). Info. bits are derived from $R(D)=0.5\log_{2}(1/\mathrm{MSE})$. Search time (S.T.) denotes the nearest-neighbor search time per 1M parameters. Although LiftQuant’s coding efficiency at strictly 2.0 bits is slightly lower than the complex Trellis Coding Quantization used in QTIP, a marginal increase in bit-width (from 32/16 to 30/14) allows LiftQuant to surpass QTIP’s performance. Since such minor increments are permissible under most hardware constraints, larger increases yield even greater gains, demonstrating the practical superiority of LiftQuant.

| Coding | bits | MSE | Info. | PPL. | S.T. |
|---|---|---|---|---|---|
| LQ-32/20 | 1.60 | 0.146 | 1.39 | 7.71 | 0.3s |
| LQ-16/8 | 2.00 | 0.089 | 1.75 | 6.60 | $\ll$0.1s |
| LQ-32/16 | 2.00 | 0.082 | 1.79 | 6.53 | 4s |
| LQ-30/14 | 2.14 | 0.070 | 1.92 | 6.30 | 4s |
| LQ-24/10 | 2.40 | 0.053 | 2.12 | 6.10 | 1s |
| Int2 | 2.00 | 0.119 | 1.53 | 7.62 | - |
| E8 (QuIP#) | 2.00 | 0.089 | 1.75 | 6.60 | - |
| TCQ (QTIP) | 2.00 | 0.073 | 1.89 | 6.28 | - |

Dimensionality Constraints and Operating Range. The choice of dimensions $D$ and $d$ involves a trade-off between representational capacity and search complexity. Generally, higher coding dimensions yield better coding efficiency; for instance, as shown in [Table 1](https://arxiv.org/html/2606.04050#S3.T1), LiftQuant-16/8 is comparable to the E8 lattice, while scaling up to LiftQuant-32/16 surpasses it. However, this comes at the cost of exponentially increased search overhead ($2^{D-d}$). Consequently, under practical search constraints ($D-d\leq 20$), LiftQuant cannot strictly match the theoretical efficiency of QTIP at the same bit-width, as QTIP leverages Trellis Codes to handle significantly larger dimensions (64).

Crucially, however, LiftQuant’s flexibility offers a superior practical solution: by merely increasing the bit-width from 2 bits to 2.14 bits, its coding efficiency surpasses that of the complex Trellis Codes employed in QTIP. This demonstrates the unique benefit of continuous modulation: we can trade a negligible amount of memory for a simpler, faster, and more flexible decoding architecture, ultimately achieving better Pareto-optimal results.
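The Info. column of Table 1 follows from the stated formula $R(D)=0.5\log_{2}(1/\mathrm{MSE})$, and recomputing it makes the comparison with QTIP explicit. The sketch below copies the MSE values from the table; because those values are rounded to three decimals, the recomputed bits agree with the printed column only to within about 0.02 bits.

```python
import math

def info_bits(mse):
    """Rate-distortion bits for a unit-variance Gaussian: 0.5 * log2(1 / MSE)."""
    return 0.5 * math.log2(1.0 / mse)

# (label, nominal bits, MSE) copied from Table 1
rows = [("LQ-32/16", 2.00, 0.082), ("LQ-30/14", 2.14, 0.070),
        ("LQ-24/10", 2.40, 0.053), ("TCQ(QTIP)", 2.00, 0.073)]
for name, bits, mse in rows:
    r = info_bits(mse)
    print(f"{name}: {r:.2f} info bits (nominal {bits:.2f}, gap {bits - r:.2f})")
```

Read this way, LQ-30/14 spends 0.14 more nominal bits than TCQ and ends up with a lower MSE, which is the trade described in the paragraph above.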

Discussion on High-Bit Regimes ($\approx$4-bit): As the target bit-width increases (e.g., 4-bit), maintaining a computationally feasible search space ($D-d\leq 20$) requires reducing the block dimension $d$ significantly (e.g., $d\leq 6$). While technically possible, such small block sizes limit the ability to exploit high-dimensional inter-channel correlations. This is also a fundamental geometric constraint shared by all VQ methods; for example, AQLM and QuIP# address this by splitting codebooks, which degrades the theoretical coding gain. More importantly, we argue that fine-grained modulation is less critical in this regime. Since recent 4-bit quantization methods already achieve near-lossless performance, the marginal utility of fractional adjustments (e.g., 4.2-bit) is negligible; if needed, coarse-grained schemes (e.g., group-wise quantization) are sufficient. Therefore, LiftQuant strategically focuses on the 2-to-3-bit “deployment gap”, where its continuous modulation capability delivers the highest practical value. Even when compared to the complex trellis-coded scheme QTIP, LiftQuant maintains a competitive rate-distortion performance (within a 0.1-bit gap) while offering superior flexibility for Pareto-optimal deployment.
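As a worked check of the bound above (a sketch that only rearranges quantities already defined in the text, writing $b=D/d$ for the target bit-width), the search constraint limits the block dimension as follows:

$$
D-d=d\,(b-1)\leq 20\;\Longrightarrow\; d\leq\frac{20}{b-1},\qquad
b=4:\ d\leq 6.67\Rightarrow d\leq 6,\qquad
b=2.4:\ d\leq 14.3,\qquad
b=2:\ d\leq 20 .
$$

The $b=4$ case reproduces the $d\leq 6$ figure quoted above, and the configurations in Table 1 (for example LQ-24/10, with $D-d=14$) sit inside the bound.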

Notation Convention. Since the performance and computational characteristics of LiftQuant are intrinsically tied to $D$ and $d$, we adopt the notation LQ-$D/d$ (e.g., LQ-24/10) throughout the remainder of this paper.

### 3.2 Whitening Transformation

In [Section 3.1](https://arxiv.org/html/2606.04050#S3.SS1), we established that our mapping matrix $\bm{M}$ is optimized to quantize an ideal i.i.d. Gaussian source: $\bm{w}_{\mathcal{N}}\simeq\bm{M}\bm{w}_{q}$. However, real-world LLM weights deviate significantly from this assumption. They often exhibit heavy-tailed distributions with outliers (Dettmers et al., [2022](https://arxiv.org/html/2606.04050#bib.bib71)) and channel-wise variations in importance due to activation magnitudes (Lin et al., [2024](https://arxiv.org/html/2606.04050#bib.bib46)).

To address this mismatch, two primary strategies exist. One is mixed-scheme quantization, which isolates and independently stores salient weights using higher-precision formats (e.g., GPT.int8 (Dettmers et al., [2022](https://arxiv.org/html/2606.04050#bib.bib71)), VPTQ (Liu et al., [2024a](https://arxiv.org/html/2606.04050#bib.bib54)), QLoRA (Dettmers et al., [2023](https://arxiv.org/html/2606.04050#bib.bib57))). While theoretically efficient in bit-rate, this approach introduces irregular memory access patterns that significantly degrade inference latency. The alternative is distribution reshaping, which applies low-complexity linear transforms to smooth or flatten the weight distribution (e.g., SmoothQuant (Xiao et al., [2022](https://arxiv.org/html/2606.04050#bib.bib30)), FlatQuant (Sun et al., [2024](https://arxiv.org/html/2606.04050#bib.bib45))) or to decorrelate channels (e.g., QuIP (Chee et al., [2023](https://arxiv.org/html/2606.04050#bib.bib49)), QTIP (Tseng et al., [2024b](https://arxiv.org/html/2606.04050#bib.bib62))). We adopt the latter strategy for its superior hardware efficiency, as linear transforms can be fused or executed in parallel. We term this process “Whitening”, borrowing from information theory where source data is transformed to resemble a Gaussian channel input for optimal coding.

Whitening Transform Design. To achieve both efficiency and representational power, we parameterize the layer-wise whitening transform $\bm{T}$ in a decomposed form:

$$\bm{T}=\text{diag}(\bm{s}_{1})(\bm{P}_{1}\otimes\bm{P}_{2})\text{diag}(\bm{s}_{2}) \tag{2}$$

where activation multiplication by $\bm{T}^{-1}$ scales as $\mathcal{O}(n\sqrt{n})$, significantly lower than the $\mathcal{O}(n^{2})$ cost of dense matrix multiplication. Here, $\text{diag}(\bm{s}_{1,2})$ are diagonal scaling matrices, and $\bm{P}_{1,2}$ are $\sqrt{n}\times\sqrt{n}$ matrices whose Kronecker product provides channel intermixing and whitening capability. This decomposition is extremely parameter-efficient. For a Llama-70B model, storing these transformation parameters in FP16 adds only a negligible overhead of 0.008–0.011 bits per parameter (under 1.6-bit to 3-bit quantization settings).
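The $\mathcal{O}(n\sqrt{n})$ cost comes from a standard identity: with row-major reshaping of $x$ into a matrix $X$, $(\bm{P}_{1}\otimes\bm{P}_{2})\,x$ equals the flattened $\bm{P}_{1}X\bm{P}_{2}^{T}$, so the large Kronecker matrix never has to be formed. The sketch below checks this on a tiny integer example; the function names are introduced here, and the diagonal scalings and the inverse are left out (the inverse of a Kronecker product is the Kronecker product of the inverses, so it has the same structure).

```python
def kron_matvec(P1, P2, x):
    """Compute (P1 kron P2) @ x without forming the Kronecker product."""
    p, q = len(P1), len(P2)
    X = [x[i * q:(i + 1) * q] for i in range(p)]   # reshape x to p x q, row-major
    # XP = X @ P2^T costs p*q*q multiplications
    XP = [[sum(X[j][l] * P2[k][l] for l in range(q)) for k in range(q)]
          for j in range(p)]
    # Y = P1 @ XP costs p*p*q multiplications
    Y = [[sum(P1[i][j] * XP[j][k] for j in range(p)) for k in range(q)]
         for i in range(p)]
    return [v for row in Y for v in row]

def kron_dense(P1, P2):
    """Explicit Kronecker product, used only to check the result."""
    p, q = len(P1), len(P2)
    return [[P1[i][j] * P2[k][l] for j in range(p) for l in range(q)]
            for i in range(p) for k in range(q)]

P1, P2, x = [[1, 2], [3, 4]], [[0, 1], [1, 1]], [1, 2, 3, 4]
fast = kron_matvec(P1, P2, x)
slow = [sum(a * b for a, b in zip(row, x)) for row in kron_dense(P1, P2)]
print(fast, slow)
```

With $p=q=\sqrt{n}$ the two small products cost about $2n\sqrt{n}$ multiplications, against $n^{2}$ for the dense matrix-vector product.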

This design serves three specific functions: 1) Importance-Aware Scaling $\bm{s}_{1}$. Inspired by AWQ, $\bm{s}_{1}$ redistributes quantization error based on activation magnitudes, down-scaling channels with large activations to reduce their relative error. 2) Decorrelation and Mixing transformation $\bm{P}_{1,2}$. We initialize them as orthogonal matrices (e.g., Hadamard) to preserve energy and diffuse outliers across dimensions, similar to QuIP#. 3) Isotropy Refinement $\bm{s}_{2}$. The final scaling $\bm{s}_{2}$ further normalizes per-channel variance, ensuring stronger isotropy with respect to the LiftQuant lattice. Crucially, because $\bm{s}_{2}$ is constrained to be block-wise constant, it can be fused with the projection matrix $\bm{M}$ during inference (see [Section 3.3](https://arxiv.org/html/2606.04050#S3.SS3)).

The optimization objective for $\bm{T}$ is to minimize the reconstruction error under the reversible transformation:

$$\arg\min_{\bm{s}_{1,2},\bm{P}_{1,2}}\mathcal{L}\big\|\bm{W}\bm{a}^{T}-\text{Quant}_{\text{ste}}(\bm{W}\bm{T})\bm{T}^{-1}\bm{a}^{T}\big\|^{2} \tag{3}$$

where $\text{Quant}_{\text{ste}}$ denotes a standard uniform quantizer with Straight-Through Estimator (STE). We employ this standard quantizer as a proxy because distributions amenable to uniform quantization (specifically, those that are decorrelated and outlier-free) are inherently compatible with our Gaussian-optimized LiftQuant projection. This proxy approach not only ensures high computational efficiency during training but also enhances robustness by avoiding the optimization instability often associated with complex non-uniform quantizers. This decomposed transform effectively reshapes arbitrary weight distributions into well-approximated i.i.d. Gaussians, enabling the seamless application of our LiftQuant projection.

### 3.3 Quantization in Lifted-Space and Intra-Block Correction

Having obtained the projection matrix $\bm{M}$ and the whitening transform $\bm{T}$, we integrate them into a unified quantization pipeline. First, the layer weights $\bm{W}$ are whitened and standardized to match the distribution expected by the LiftQuant lattice. These pre-processed weights are then quantized by finding their nearest neighbors in the lifted space, yielding a binary matrix $\bm{W}_{q}\in\{+1,-1\}^{OC\times(\lceil\frac{IC}{d}\rceil\cdot D)}$.
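The shape of $\bm{W}_{q}$ determines the stored bits per original weight. The sketch below reads $OC$ and $IC$ as the output and input channel counts (an interpretation, since the text does not define them) and uses an illustrative 4096 x 4096 layer; it counts only the sign bits and leaves out the scales and the transform parameters.

```python
import math

def lifted_shape(out_channels, in_channels, D, d):
    """Shape of the binary matrix W_q and the resulting bits per original weight."""
    blocks = math.ceil(in_channels / d)      # blocks of d input channels, last one padded
    cols = blocks * D                        # D one-bit symbols per block
    bits_per_weight = cols / in_channels     # each entry of W_q costs one bit
    return (out_channels, cols), bits_per_weight

shape, bpw = lifted_shape(4096, 4096, D=24, d=10)   # illustrative layer size
print(shape, round(bpw, 4))   # (4096, 9840) 2.4023
```

The padding of the last block is why the effective rate sits slightly above the nominal $D/d=2.4$ when $d$ does not divide the input dimension.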

Crucially, for efficient inference, the projection matrix $\bm{M}$ and the inverse whitening transform $\bm{T}^{-1}$ can be mathematically fused into a single linear operator. The layer output is thus computed as:

$$\bm{o}=\text{diag}(\bm{s})\cdot\bm{W}_{q}(\bm{M}\bm{T}^{-1}\bm{a}^{T})=\text{diag}(\bm{s})\cdot\bm{W}_{q}(\bm{T}^{*}\bm{a}^{T}) \tag{4}$$

where $\bm{s}$ is the quantization scale factor, and $\bm{T}^{*}=\bm{M}\bm{T}^{-1}$ represents the fused decoding matrix. This formulation reveals that the entire decoding process is reduced to a low-complexity linear projection followed by an Int1-Float matrix multiplication, ensuring high throughput and hardware-friendly deployment.
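The reordering behind Eq. (4) can be illustrated on a single block. The sketch below assumes one output row, one block of $d=2$ weights encoded by $D=3$ sign bits, and an activation that has already been inverse-whitened; the activation is lifted with the transpose of $\bm{M}$, which is this sketch's reading of the dimensions rather than notation taken from the paper. All numbers are hypothetical.

```python
def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

M = [[0.5, 0.25, -1.0],
     [1.0, -0.5, 0.75]]      # hypothetical mapping matrix, d x D
w_q = [1, -1, 1]             # 1-bit codes for one output row
a = [2.0, -4.0]              # activation block of length d (whitening omitted)
s = 0.5                      # per-row scale

# Path A: dequantize the weights first, then take the dot product.
w_hat = matvec(M, w_q)
out_dequant = s * sum(w * x for w, x in zip(w_hat, a))

# Path B: lift the activation once, then use only sign flips and additions.
a_lifted = matvec(transpose(M), a)
out_fused = s * sum(x if bit > 0 else -x for bit, x in zip(w_q, a_lifted))

print(out_dequant, out_fused)
```

Both paths give the same output, but in Path B the lifted activation is computed once and shared by every output row, so the per-row work involves only the 1-bit codes.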

A key advantage of this fused formulation is that it renders the entire quantization-decoding path fully differentiable. This allows us to further refine the model performance through block-wise fine-tuning. Specifically, we treat the binary weights $\bm{W}_{q}$ (via the Straight-Through Estimator) and the fused matrix $\bm{T}^{*}$ as trainable parameters. We optimize them to minimize the reconstruction error of the layer output using a small calibration dataset:

$$\min_{\bm{W}_{q},\bm{T}^{*}}\mathbb{E}_{\bm{a}\sim\mathcal{T}_{\text{calib}}}\big\|\bm{W}_{\text{fp}}\bm{a}^{T}-\text{diag}(\bm{s})\bm{W}_{q}\bm{T}^{*}\bm{a}^{T}\big\|^{2} \tag{5}$$

This local adaptation step is critical for compensating for any residual quantization errors and aligning the components for optimal end\-to\-end performance\.

## 4 Experiments

Setting. We evaluate LiftQuant on a comprehensive suite of LLMs, including the Llama-2/3 (Touvron et al., [2023](https://arxiv.org/html/2606.04050#bib.bib19)) and Qwen-2.5/3 (Yang et al., [2025](https://arxiv.org/html/2606.04050#bib.bib70)) families. We report perplexity (PPL) on the WikiText-2 (Merity et al., [2016](https://arxiv.org/html/2606.04050#bib.bib36)) and C4 (Raffel et al., [2020](https://arxiv.org/html/2606.04050#bib.bib60)) validation sets, zero-shot accuracy on five common-sense reasoning benchmarks (ARC-c, ARC-e (Clark et al., [2018](https://arxiv.org/html/2606.04050#bib.bib37)), HellaSwag (Zellers et al., [2019](https://arxiv.org/html/2606.04050#bib.bib39)), PIQA (Bisk et al., [2020](https://arxiv.org/html/2606.04050#bib.bib42)), WinoGrande (Sakaguchi et al., [2021](https://arxiv.org/html/2606.04050#bib.bib44))), and MMLU (Hendrycks et al., [2020](https://arxiv.org/html/2606.04050#bib.bib69)). We compare against state-of-the-art methods, including leading UQ approaches (GPTQ, Quarot, EfficientQAT) and advanced VQ frameworks (QuIP#, AQLM, VPTQ, QTIP). For calibration, we sample 4,096 segments of sequence length 2,048 from the RedPajama dataset. More calibration details are provided in [Appendix A](https://arxiv.org/html/2606.04050#A1).

End-to-End Fine-Tuning. Following the standard protocol adopted by recent top-performing methods (e.g., AQLM, QuIP#, EfficientQAT), we perform a lightweight end-to-end fine-tuning step after quantization. This process optimizes the continuous quantization parameters (e.g., scales, transformation matrices) by minimizing the cross-entropy loss on a small calibration set, ensuring fair comparison. Further details are provided in [Appendix B](https://arxiv.org/html/2606.04050#A2). The sensitivity of our method to the calibration data is discussed in [Appendix C](https://arxiv.org/html/2606.04050#A3).

### 4\.1 Main Results

Table 2: Perplexity \(↓\) on WikiText\-2 \(W2\) and C4, with context length 2048 for Llama\-2 and 8192 for Llama\-3\. Column groups 2\-7, 2\-13, 2\-70, 3\-8, and 3\-70 denote the Llama\-2 7B/13B/70B and Llama\-3 8B/70B models\.

| Method | Type | Bits | 2-7 W2 | 2-7 C4 | 2-13 W2 | 2-13 C4 | 2-70 W2 | 2-70 C4 | 3-8 W2 | 3-8 C4 | 3-70 W2 | 3-70 C4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FP16 | - | - | 5.47 | 6.97 | 4.88 | 6.47 | 3.32 | 5.52 | 5.54 | 7.10 | 2.59 | 5.78 |
| GPTQ-g128 | UQ | 2.13 | 50.75 | 36.76 | 43.84 | 23.07 | NaN | NaN | - | - | - | - |
| QuaRot | UQ | 2.00 | 22.07 | - | 10.41 | - | 5.60 | - | - | - | - | - |
| OmniQ-g64 | UQ | 2.25 | 9.62 | 12.72 | 7.56 | 10.05 | 6.11 | 7.68 | - | - | - | - |
| EQAT-g64 | UQ | 2.25 | 6.86 | 8.50 | 5.96 | 7.59 | 4.52 | 6.38 | 8.31 | 10.09 | 5.36 | 7.43 |
| AQLM | VQ | 1.97-2.07 | 6.61 | 8.28 | 5.72 | 7.44 | 4.19 | 6.13 | - | - | - | - |
| VPTQ | VQ | 2.02-2.08 | 6.57 | 8.27 | 5.69 | 7.41 | 4.17 | 6.13 | - | - | 5.55 | 7.30 |
| QuIP# | VQ | 2.00 | 6.66 | 8.35 | 5.74 | 7.45 | 4.16 | 6.12 | 7.84 | 9.06 | 5.77 | 7.46 |
| QTIP | VQ | 2.00 | 6.28 | 7.94 | 5.45 | 7.16 | 3.94 | 5.93 | 7.33 | 8.62 | 4.97 | 6.80 |
| LiftQuant | LQ-32/16 | 2.01-2.02 | 6.52 | 8.21 | 5.61 | 7.34 | 4.08 | 6.06 | 7.65 | 8.95 | 4.69 | 6.73 |
| LiftQuant | LQ-24/10 | 2.41-2.42 | 6.10 | 7.70 | 5.30 | 6.98 | 3.78 | 5.83 | 6.94 | 8.31 | 4.10 | 6.47 |
| GPTQ | UQ | 3.00 | 8.37 | 9.81 | 6.44 | 8.02 | 4.82 | 6.57 | - | - | - | - |
| GPTQ-g128 | UQ | 3.13 | 6.29 | 7.89 | 5.42 | 7.00 | 3.85 | 5.85 | 8.81 | 9.35 | 4.82 | 7.37 |
| QuaRot | UQ | 3.00 | 6.09 | - | 5.37 | - | 3.72 | - | - | - | - | - |
| EQAT-g128 | UQ | 3.13 | 5.81 | 7.34 | 5.12 | 6.73 | 3.61 | 5.71 | 6.35 | 7.79 | 3.78 | 6.30 |
| VPTQ# | VQ | 3.01-3.03 | 5.82 | 7.33 | 5.12 | 6.70 | 3.55 | 5.67 | - | - | - | - |
| QuIP# | VQ | 3.00 | 5.79 | 7.32 | 5.10 | 6.72 | 3.56 | 5.67 | 6.27 | 7.71 | 3.59 | 6.18 |
| LiftQuant | LQ-24/8 | 3.01-3.02 | 5.75 | 7.31 | 5.09 | 6.71 | 3.35 | 5.67 | 6.22 | 7.70 | 3.45 | 6.18 |

Table 3: Llama accuracy \(↑\) on 2\-bit quantization \(LQ\-32/16\)\. The FP16 row of each model is the unquantized baseline\.

| Model | Method | Bits | ArcC | ArcE | Hella | PiQA | Wino | Avg. |
|---|---|---|---|---|---|---|---|---|
| 2-7 | FP16 | - | 43.52 | 76.26 | 57.16 | 78.07 | 69.22 | 64.85 |
| 2-7 | EQAT-g64 | 2.25 | 36.86 | 70.96 | 51.58 | 75.30 | 65.98 | 60.14 |
| 2-7 | QuIP# | 2.00 | 37.88 | 71.84 | 50.84 | 74.16 | 65.67 | 60.61 |
| 2-7 | QTIP | 2.00 | 39.76 | 73.32 | 53.68 | 76.28 | 67.25 | 62.06 |
| 2-7 | LiftQuant | 2.02 | 36.77 | 70.66 | 53.05 | 76.55 | 68.27 | 61.06 |
| 2-7 | LiftQuant | 2.42 | 39.42 | 72.90 | 55.10 | 76.99 | 67.72 | 62.42 |
| 2-13 | FP16 | - | 48.29 | 79.42 | 60.07 | 79.05 | 72.22 | 67.81 |
| 2-13 | EQAT-g64 | 2.25 | 41.89 | 74.83 | 55.27 | 77.04 | 68.36 | 63.48 |
| 2-13 | QuIP# | 2.00 | 42.92 | 75.72 | 56.53 | 77.97 | 69.06 | 64.44 |
| 2-13 | QTIP | 2.00 | 43.94 | 76.60 | 58.13 | 77.69 | 70.40 | 65.35 |
| 2-13 | LiftQuant | 2.02 | 43.69 | 76.30 | 57.09 | 77.91 | 70.01 | 65.00 |
| 2-13 | LiftQuant | 2.42 | 44.80 | 77.48 | 58.64 | 77.48 | 71.27 | 65.93 |
| 2-70 | FP16 | - | 54.44 | 82.70 | 64.77 | 82.15 | 77.98 | 72.41 |
| 2-70 | EQAT-g64 | 2.26 | 50.77 | 80.13 | 61.78 | 80.14 | 74.59 | 69.48 |
| 2-70 | QuIP# | 2.00 | 52.65 | 81.90 | 62.86 | 81.39 | 75.77 | 70.91 |
| 2-70 | QTIP | 2.00 | 53.05 | 81.12 | 63.11 | 82.16 | 76.19 | 71.12 |
| 2-70 | LiftQuant | 2.01 | 53.58 | 81.36 | 63.08 | 81.28 | 75.69 | 71.00 |
| 2-70 | LiftQuant | 2.41 | 51.62 | 81.52 | 64.31 | 82.32 | 77.19 | 71.39 |
| 3-8 | FP16 | - | 50.43 | 80.09 | 60.17 | 79.60 | 72.61 | 68.58 |
| 3-8 | EQAT-g64 | 2.25 | 37.03 | 71.17 | 51.86 | 76.03 | 67.72 | 60.76 |
| 3-8 | VPTQ | 2.07 | 36.91 | 71.03 | 52.12 | 75.12 | 65.92 | 60.22 |
| 3-8 | LiftQuant | 2.02 | 40.87 | 74.33 | 53.87 | 76.55 | 68.03 | 62.73 |
| 3-70 | FP16 | - | 60.41 | 86.99 | 66.36 | 82.37 | 80.51 | 75.33 |
| 3-70 | EQAT-g64 | 2.25 | 49.06 | 77.40 | 61.60 | 77.37 | 74.03 | 67.89 |
| 3-70 | VPTQ | 2.02 | 52.65 | 81.86 | 61.71 | 80.36 | 77.90 | 70.90 |
| 3-70 | LiftQuant | 2.01 | 56.14 | 84.30 | 62.31 | 81.72 | 78.53 | 72.60 |

As shown in [Tables 2](https://arxiv.org/html/2606.04050#S4.T2) and [3](https://arxiv.org/html/2606.04050#S4.T3), LiftQuant demonstrates a substantial performance advantage over all UQ baselines\. Even compared to methods that employ fine\-grained grouping \(e\.g\., EfficientQAT\-g64\), LiftQuant consistently achieves lower perplexity and higher average accuracy\. This validates that our “lift\-then\-project” mechanism successfully overcomes the inherent limitations of the uniform grid, capturing the non\-uniform weight distribution far more effectively than scalar approaches\.

Against state\-of\-the\-art VQ methods under standard integer bit\-widths, LiftQuant delivers highly competitive performance\. On most models, LiftQuant matches the accuracy of leading VQ frameworks like QuIP\# and AQLM\. While QTIP achieves slightly better theoretical coding efficiency due to its trellis\-coded structure, LiftQuant remains within a narrow margin\. On the Llama\-3\-70B model, however, LiftQuant achieves a 2\-bit WikiText\-2 perplexity of 4\.69 in [Table 2](https://arxiv.org/html/2606.04050#S4.T2), slightly outperforming QTIP \(4\.97\)\. We attribute this to LiftQuant’s fully differentiable architecture, which compensates for the theoretical gap in quantization error\.

Most critically, LiftQuant unlocks a new performance tier through fractional quantization\. As shown in the results, simply increasing the bit\-width to 2\.4 bits allows LiftQuant to outperform all 2\-bit baselines \(both UQ and VQ\) by a large margin\. For example, on Llama\-3\-70B, the 2\.4\-bit LiftQuant model achieves a WikiText\-2 perplexity of 4\.10 in [Table 2](https://arxiv.org/html/2606.04050#S4.T2), well below the best 2\-bit result of 4\.69\. This demonstrates that in practical scenarios where the memory budget allows for slightly more than 2 bits \(e\.g\., 24GB of VRAM\), LiftQuant’s ability to use that extra capacity provides an advantage that integer\-constrained methods cannot match\.

### 4\.2 Fractional Bit Widths and Pareto Optimality

![Refer to caption](https://arxiv.org/html/2606.04050v1/x2.png)\(a\)Llama2 Family
![Refer to caption](https://arxiv.org/html/2606.04050v1/x3.png)\(b\)Qwen2\.5 Family
![Refer to caption](https://arxiv.org/html/2606.04050v1/x4.png)\(c\)Qwen3 Family

Figure 4: Performance \(PPL, Zero\-shot, MMLU\) vs\. Memory Footprint across multiple model families\. LiftQuant’s fractional bit widths fill the gaps between integer steps, creating a dense frontier that enables customized, optimal quantization for arbitrary memory constraints\.

To demonstrate the full potential of LiftQuant, we conducted an extensive evaluation across the Llama\-2, Qwen\-2\.5, and Qwen\-3 families, covering model sizes from 3B to 70B\. We measured zero\-shot accuracy and MMLU performance across a continuous spectrum of bit\-widths, plotting these against memory footprint to construct the Pareto frontier\.

To contextualize our results, we introduce an “Ideal 4\-Bit” reference point\. Recent foundational models, such as GPT\-oss\-120B and Kimi\-K2, have adopted the MXFP4 format \(effective bit\-width 4\.25 bits\) as their native representation, achieving performance indistinguishable from FP16\. We thus assume a hypothetical 4\-bit quantization that matches FP16 accuracy to serve as the upper bound for compression efficiency\.

[Figure 4](https://arxiv.org/html/2606.04050#S4.F4) illustrates the performance\-memory trade\-off curves\. Our findings reveal three critical insights:

1\. Alignment with the Ideal Frontier: We observe that the Pareto frontier formed by LiftQuant models across various fractional bit\-widths closely aligns with the curve formed by the “Ideal 4\-Bit” reference points\. This suggests that a LiftQuant\-compressed model approximates the performance profile of a native model of that effective size\. In this sense, LiftQuant can be viewed as a flexible “parameter scaler”, offering a practical means to instantiate models of intermediate effective sizes to match hardware constraints, potentially reducing the need to pre\-train dense models for every specific memory budget\.

2\. Effective Deployment Strategies: This flexibility facilitates better hardware matching\. As highlighted in [Figure 1](https://arxiv.org/html/2606.04050#S1.F1), we demonstrate the feasibility of deploying a 70B model at 2\.4 bits \(24/10\) on a 24GB GPU, and a 32B model at 2\.5 bits \(25/10\) on a 12GB GPU\. In our experiments, these configurations maximized VRAM utilization and delivered performance that consistently exceeded that of smaller models quantized to standard integer bit\-widths\.

3\. The “2x Scaling \+ Quantization” Heuristic: Our data indicates that performance tends to deviate noticeably from the Pareto line when the bit\-width drops below 2 bits\. However, the 2\-to\-4\-bit range generally adheres to the optimal frontier\. This observation points towards a potential deployment heuristic: by scaling native model sizes in steps of roughly 2x \(e\.g\., 7B → 14B → 32B\) and using LiftQuant to bridge the 2\-to\-4\-bit gap, it may be possible to construct a dense and near\-optimal deployment continuum\.
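The hardware\-matching arithmetic behind these deployment points can be sketched as follows\. This is a rough, weight\-only estimate under stated assumptions: the nominal parameter counts \(70e9, 32e9\), the decimal gigabyte \(1e9 bytes\), and the helper names are illustrative choices, and the estimate ignores scales, transformation matrices, embeddings, activations, and the KV cache, which is why the bit\-widths reported in the tables are slightly above the nominal ratio\.

```python
def effective_bits(lifted_dim, original_dim):
    """Nominal bits per weight: ratio of lifted dimension to original dimension."""
    return lifted_dim / original_dim

def weight_memory_gb(n_params, bits_per_weight):
    """Weight-only storage in gigabytes, assuming 1 GB = 1e9 bytes."""
    return n_params * bits_per_weight / 8 / 1e9

# (label, assumed parameter count, lifted dim, original dim, GPU budget in GB)
configs = [
    ("70B at LQ-24/10", 70e9, 24, 10, 24.0),
    ("32B at LQ-25/10", 32e9, 25, 10, 12.0),
    ("70B at LQ-32/16", 70e9, 32, 16, 24.0),
]
for label, n_params, lifted, original, budget_gb in configs:
    bits = effective_bits(lifted, original)
    gb = weight_memory_gb(n_params, bits)
    print(f"{label}: {bits:.2f} bits/weight, {gb:.1f} GB of a {budget_gb:.0f} GB budget")
```

Under these assumptions the 2\.4\-bit 70B configuration leaves only a few gigabytes of headroom on a 24GB device, whereas the 2\-bit configuration leaves part of the budget unused\.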

### 4\.3 Inference Efficiency

Table 4: E2E decoding throughput for Llama\-2\-70B on RTX 4090D\-48G\. Context length = 512, batch size = 1\.

| Method | Configuration | Throughput (tk/s) |
|---|---|---|
| LiftQuant | LQ-32/16 | 31.3 |
| LiftQuant | LQ-24/10 | 25.7 |
| LiftQuant | LQ-24/8 | 20.8 |
| QTIP | 2bit | 24.5 |
| QTIP | 3bit | 17.6 |
| AWQ | 2bit | 36.1 |

To validate the practical efficiency of LiftQuant, we evaluated the decoding throughput of the Llama\-2\-70B model on an NVIDIA RTX 4090D \(48GB\) GPU \([Table 4](https://arxiv.org/html/2606.04050#S4.T4)\)\. A significant advantage of LiftQuant is its architectural simplicity\. We implemented the linear operation $\bm{o}=\mathrm{diag}(\bm{s})\,\bm{W}(\bm{T}^{*}\bm{a})$ efficiently using torch\.compile and BitBLAS \(Wang et al\., [2024](https://arxiv.org/html/2606.04050#bib.bib61)\) for UINT1\-FP16 GEMV operations\.
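The structure of this linear operation can be sketched in NumPy\. This is a reference illustration under stated assumptions, not the torch\.compile and BitBLAS implementation: the shapes are hypothetical, $\bm{T}^{*}$ is treated as a dense matrix mapping the input activation into the lifted space, and the 1\-bit weights are stored as floating\-point signs rather than packed bits\.

```python
import numpy as np

def liftquant_linear(a, w_bits, t_star, s):
    """Compute o = diag(s) @ W @ (T* @ a), with W holding only +1/-1 entries."""
    lifted = t_star @ a      # transform the activation into the lifted space first
    acc = w_bits @ lifted    # 1-bit GEMV: every weight is a sign
    return s * acc           # per-output-channel scaling, equivalent to diag(s)

rng = np.random.default_rng(0)
n_out, n_in = 8, 20            # hypothetical layer size
d, lifted_d = 10, 24           # hypothetical LQ-24/10 blocks: 24 / 10 = 2.4 bits per weight
n_lift = n_in // d * lifted_d  # width of the 1-bit weight matrix

a = rng.standard_normal(n_in)
w_bits = rng.choice([-1.0, 1.0], size=(n_out, n_lift))
t_star = rng.standard_normal((n_lift, n_in))
s = rng.uniform(0.5, 1.5, size=n_out)

o = liftquant_linear(a, w_bits, t_star, s)

# Equivalent dense weight, materialized here only to check the factored path.
w_dense = (s[:, None] * w_bits) @ t_star
print(np.allclose(o, w_dense @ a))
```

Applying the transformation to the activation first keeps the large matrix product in 1\-bit form, which matches the decoding order listed for batch size 1 in Appendix D\.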

To the best of our knowledge, QTIP employs specialized kernels heavily optimized for batch size 1 but often lacks native support for other settings \(e\.g\., batch size 8\), so substantial engineering effort is required to develop dedicated kernels for each scenario\. In contrast, LiftQuant achieves ideal acceleration across diverse settings using only standard open\-source libraries\.

### 4\.4 Ablation Study

To validate the contribution of each component in the LiftQuant framework, we conducted a progressive ablation study on the Llama\-2\-7B model\. We evaluated the impact of the whitening transform parameters \($\bm{s}_1$, $\bm{P}_1$, $\bm{P}_2$, $\bm{s}_2$ in $\bm{T}$\) and the projection matrix $\bm{M}$\. The results are reported as perplexity on WikiText\-2 in [Table 5](https://arxiv.org/html/2606.04050#S4.T5)\.

When $\bm{M}$ is not used, we default to a standard 2\-bit symmetric uniform quantizer\. Notably, we found that the matrices $\bm{P}_1$ and $\bm{P}_2$ \(initialized as random orthogonal matrices\) are critical for stability; omitting them leads directly to training collapse\. Thus, our ablation starts with $\bm{P}_1$ and $\bm{P}_2$ enabled\. As shown in [Table 5](https://arxiv.org/html/2606.04050#S4.T5), adding the scaling factors $\bm{s}_1$ and $\bm{s}_2$ progressively improves performance\. Introducing the LiftQuant projection matrix $\bm{M}$ yields a significant accuracy boost, confirming the benefit of high\-dimensional quantization\. Finally, end\-to\-end fine\-tuning further refines the model, delivering the best overall results\.
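To make the role of $\bm{M}$ concrete, the sketch below builds the codebook obtained by projecting every point of a 1\-bit lattice $\{-1,+1\}^{D}$ through a $d \times D$ matrix and quantizes one weight vector to its nearest codeword\. It is an illustration under stated assumptions: the matrix here is random rather than learned, the sizes are hypothetical, and the brute\-force nearest\-neighbor search is a stand\-in chosen for clarity, not the encoding procedure used in the paper\.

```python
import itertools
import numpy as np

def lifted_codebook(m):
    """Return all codewords M @ b for b in {-1, +1}^D, plus the sign patterns."""
    _, lifted_dim = m.shape
    signs = np.array(list(itertools.product([-1.0, 1.0], repeat=lifted_dim)))
    return signs @ m.T, signs          # shapes: (2**D, d) and (2**D, D)

def quantize(w, m):
    """Brute-force nearest codeword for a single d-dimensional weight vector."""
    codebook, signs = lifted_codebook(m)
    idx = np.argmin(np.sum((codebook - w) ** 2, axis=1))
    return signs[idx], codebook[idx]

rng = np.random.default_rng(0)
d, lifted_dim = 5, 12                  # hypothetical sizes: 12 / 5 = 2.4 bits per weight
m = rng.standard_normal((d, lifted_dim)) / np.sqrt(lifted_dim)
w = rng.standard_normal(d)

bits, w_hat = quantize(w, m)
codebook, _ = lifted_codebook(m)
print(codebook.shape, float(np.sum((w - w_hat) ** 2)))
```

Only the sign pattern `bits` needs to be stored, so the storage cost is $D/d$ bits per weight, while the reconstruction $\bm{M}\bm{b}$ is a single linear map\. Changing `lifted_dim` changes the bit\-width in steps of $1/d$\.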

Table 5: Ablation Study on Llama\-2\-7B

| Component | $\bm{P}_1+\bm{P}_2$ | $+\bm{s}_1$ | $+\bm{s}_2$ | $+\bm{M}$ | + F.T. |
|---|---|---|---|---|---|
| WikiText-2 PPL (↓) | 8.76 | 8.28 | 7.77 | 6.79 | 6.53 |

## 5 Conclusion

We introduced LiftQuant, a framework that breaks the rigidity of integer quantization by enabling continuous bit\-width control\. Through a “lift\-then\-project” mechanism, LiftQuant decouples quantization rate from coding format, allowing for true Pareto\-optimal deployment\. Our experiments confirm that LiftQuant matches the accuracy of state\-of\-the\-art vector quantization methods, while uniquely enabling fractional configurations that maximize performance on constrained hardware\. LiftQuant establishes a practical, hardware\-aware paradigm for customized LLM compression\.

## Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning\. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here\.

## References

- S\. Ashkboos, A\. Mohtashami, M\. L\. Croci, B\. Li, M\. Jaggi, D\. Alistarh, T\. Hoefler, and J\. Hensman \(2024\) QuaRot: outlier\-free 4\-bit inference in rotated llms\. arXiv preprint arXiv:2404\.00456\. [Link](https://api.semanticscholar.org/CorpusID:268819214)
- Y\. Bisk, R\. Zellers, J\. Gao, Y\. Choi, et al\. \(2020\) Piqa: reasoning about physical commonsense in natural language\. In Proceedings of the AAAI conference on artificial intelligence, Vol\. 34, pp\. 7432–7439\.
- J\. Chee, Y\. Cai, V\. Kuleshov, and C\. M\. De Sa \(2023\) Quip: 2\-bit quantization of large language models with guarantees\. Advances in Neural Information Processing Systems 36, pp\. 4396–4429\.
- M\. Chen, W\. Shao, P\. Xu, J\. Wang, P\. Gao, K\. Zhang, and P\. Luo \(2024\) Efficientqat: efficient quantization\-aware training for large language models\. arXiv preprint arXiv:2407\.11062\.
- P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord \(2018\) Think you have solved question answering? try arc, the ai2 reasoning challenge\. arXiv preprint arXiv:1803\.05457\.
- T\. Dettmers, M\. Lewis, Y\. Belkada, and L\. Zettlemoyer \(2022\) GPT3\.int8\(\): 8\-bit matrix multiplication for transformers at scale\. Advances in Neural Information Processing Systems 35, pp\. 30318–30332\.
- T\. Dettmers, A\. Pagnoni, A\. Holtzman, and L\. Zettlemoyer \(2023\) Qlora: efficient finetuning of quantized llms\. Advances in Neural Information Processing Systems 36, pp\. 10088–10115\.
- P\. Diaconis and D\. Freedman \(1984\) Asymptotics of graphical projection pursuit\. The Annals of Statistics, pp\. 793–815\.
- V\. Egiazarian, A\. Panferov, D\. Kuznedelev, E\. Frantar, A\. Babenko, and D\. Alistarh \(2024\) Extreme compression of large language models via additive quantization\. arXiv preprint arXiv:2401\.06118\.
- E\. Frantar, S\. Ashkboos, T\. Hoefler, and D\. Alistarh \(2022\) GPTQ: accurate post\-training quantization for generative pre\-trained transformers\. arXiv preprint arXiv:2210\.17323\. [Link](https://api.semanticscholar.org/CorpusID:253237200)
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2020\) Measuring massive multitask language understanding\. arXiv preprint arXiv:2009\.03300\.
- P\. Langley \(2000\) Crafting papers on machine learning\. In Proceedings of the 17th International Conference on Machine Learning \(ICML 2000\), P\. Langley \(Ed\.\), Stanford, CA, pp\. 1207–1216\.
- D\. Lee and H\. O\. Song \(2025\) Q\-palette: fractional\-bit quantizers toward optimal bit allocation for efficient llm deployment\. arXiv preprint arXiv:2509\.20214\.
- J\. Lin, J\. Tang, H\. Tang, S\. Yang, W\. Chen, W\. Wang, G\. Xiao, X\. Dang, C\. Gan, and S\. Han \(2024\) Awq: activation\-aware weight quantization for on\-device llm compression and acceleration\. Proceedings of Machine Learning and Systems 6, pp\. 87–100\.
- Y\. Liu, J\. Wen, Y\. Wang, S\. Ye, L\. L\. Zhang, T\. Cao, C\. Li, and M\. Yang \(2024a\) Vptq: extreme low\-bit vector post\-training quantization for large language models\. arXiv preprint arXiv:2409\.17066\.
- Y\. Liu, H\. Fang, L\. He, R\. Zhang, Y\. Bai, Y\. Du, and L\. Du \(2025\) FBQuant: feedback quantization for large language models\. arXiv preprint arXiv:2501\.16385\. [Link](https://api.semanticscholar.org/CorpusID:275932200)
- Z\. Liu, C\. Zhao, I\. Fedorov, B\. Soran, D\. Choudhary, R\. Krishnamoorthi, V\. Chandra, Y\. Tian, and T\. Blankevoort \(2024b\) SpinQuant: llm quantization with learned rotations\. arXiv preprint arXiv:2405\.16406\. [Link](https://api.semanticscholar.org/CorpusID:270062819)
- S\. Merity, C\. Xiong, J\. Bradbury, and R\. Socher \(2016\) Pointer sentinel mixture models\. arXiv preprint arXiv:1609\.07843\.
- S\. Park, J\. Bae, B\. Kwon, M\. Kim, B\. Kim, S\. J\. Kwon, U\. Kang, and D\. Lee \(2025\) Unifying uniform and binary\-coding quantization for accurate compression of large language models\. arXiv preprint arXiv:2506\.03781\.
- C\. Raffel, N\. Shazeer, A\. Roberts, K\. Lee, S\. Narang, M\. Matena, Y\. Zhou, W\. Li, and P\. J\. Liu \(2020\) Exploring the limits of transfer learning with a unified text\-to\-text transformer\. Journal of Machine Learning Research 21 \(140\), pp\. 1–67\.
- K\. Sakaguchi, R\. L\. Bras, C\. Bhagavatula, and Y\. Choi \(2021\) Winogrande: an adversarial winograd schema challenge at scale\. Communications of the ACM 64 \(9\), pp\. 99–106\.
- Y\. Sun, R\. Liu, H\. Bai, H\. Bao, K\. Zhao, Y\. Li, J\. Hu, X\. Yu, L\. Hou, C\. Yuan, et al\. \(2024\) Flatquant: flatness matters for llm quantization\. arXiv preprint arXiv:2410\.09426\.
- H\. Touvron, T\. Lavril, G\. Izacard, X\. Martinet, M\. Lachaux, T\. Lacroix, B\. Rozière, N\. Goyal, E\. Hambro, F\. Azhar, et al\. \(2023\) Llama: open and efficient foundation language models\. arXiv preprint arXiv:2302\.13971\.
- A\. Tseng, J\. Chee, Q\. Sun, V\. Kuleshov, and C\. De Sa \(2024a\) Quip\#: even better llm quantization with hadamard incoherence and lattice codebooks\. arXiv preprint arXiv:2402\.04396\.
- A\. Tseng, Q\. Sun, D\. Hou, and C\. M\. De Sa \(2024b\) Qtip: quantization with trellises and incoherence processing\. Advances in Neural Information Processing Systems 37, pp\. 59597–59620\.
- H\. Wang, S\. Ma, L\. Ma, L\. Wang, W\. Wang, L\. Dong, S\. Huang, H\. Wang, J\. Xue, R\. Wang, et al\. \(2025\) Bitnet: 1\-bit pre\-training for large language models\. Journal of Machine Learning Research 26 \(125\), pp\. 1–29\.
- L\. Wang, L\. Ma, S\. Cao, Q\. Zhang, J\. Xue, Y\. Shi, N\. Zheng, Z\. Miao, F\. Yang, T\. Cao, et al\. \(2024\) Ladder: enabling efficient low\-precision deep learning computing through hardware\-aware tensor transformation\. In 18th USENIX Symposium on Operating Systems Design and Implementation \(OSDI 24\), pp\. 307–323\.
- G\. Xiao, J\. Lin, M\. Seznec, J\. Demouth, and S\. Han \(2022\) SmoothQuant: accurate and efficient post\-training quantization for large language models\. arXiv preprint arXiv:2211\.10438\. [Link](https://api.semanticscholar.org/CorpusID:253708271)
- C\. Xu, J\. Yao, Z\. Lin, W\. Ou, Y\. Cao, Z\. Wang, and H\. Zha \(2018\) Alternating multi\-bit quantization for recurrent neural networks\. arXiv preprint arXiv:1802\.00150\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, et al\. \(2025\) Qwen3 technical report\. arXiv preprint arXiv:2505\.09388\.
- R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi \(2019\) Hellaswag: can a machine really finish your sentence? arXiv preprint arXiv:1905\.07830\.

## Appendix A Training Details for Intra\-Block Correction

This appendix details the training procedure for the block\-wise correction phase described in [Section 3\.3](https://arxiv.org/html/2606.04050#S3.SS3)\. The goal of this phase is to correct for quantization errors by jointly optimizing the low\-bit weights $\bm{W}_q$ and the transformation matrix $\bm{T}^{*}$\.

Two primary strategies exist for post\-quantization correction\. The first, based on the Hessian matrix, involves adaptively rounding weight vectors \(Frantar et al\., [2022](https://arxiv.org/html/2606.04050#bib.bib50); Tseng et al\., [2024a](https://arxiv.org/html/2606.04050#bib.bib48)\)\. However, this class of methods is impractical for our framework due to the prohibitive computational cost of the nearest\-neighbor search required to determine the set of valid rounding candidates for each vector in our lattice\.

Consequently, we adopt a more practical and effective approach: direct fine\-tuning using gradient descent \(Egiazarian et al\., [2024](https://arxiv.org/html/2606.04050#bib.bib55); Chen et al\., [2024](https://arxiv.org/html/2606.04050#bib.bib56)\)\. This method, proven viable in prior work, allows us to optimize both $\bm{T}^{*}$ and $\bm{W}_q$ simultaneously\. Since $\bm{W}_q$ consists of discrete values, we employ the Straight\-Through Estimator \(STE\) to approximate gradients during backpropagation\.
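The Straight\-Through Estimator can be illustrated with a hand\-written gradient step on a single 1\-bit linear layer\. This is a simplified sketch under stated assumptions: it keeps a real\-valued latent copy of the weights \(`w_latent`, a hypothetical name\), uses plain gradient descent instead of Adam, and omits the transformation $\bm{T}^{*}$ and the scales\.

```python
import numpy as np

def binarize(w_latent):
    """Forward quantizer: map latent weights to 1-bit values in {-1, +1}."""
    return np.where(w_latent >= 0, 1.0, -1.0)

def ste_step(w_latent, x, y_ref, lr):
    """One gradient-descent step on the layer-output MSE using the STE."""
    w_q = binarize(w_latent)                  # forward pass uses the discrete weights
    err = w_q @ x - y_ref                     # (out, batch) residual vs. the reference output
    grad_w_q = 2.0 * err @ x.T / x.shape[1]   # gradient of the batch-mean squared error w.r.t. w_q
    return w_latent - lr * grad_w_q           # STE: treat d(w_q)/d(w_latent) as the identity

# Tiny worked case: the reference output asks for a negative weight.
w_latent = np.array([[0.1]])
x = np.array([[1.0]])
y_ref = np.array([[-1.0]])

w_next = ste_step(w_latent, x, y_ref, lr=0.05)
print(binarize(w_latent), binarize(w_next))
```

The rounding function itself has zero gradient almost everywhere, so the estimator copies the gradient of the quantized weights onto the latent weights; once enough updates accumulate, the sign flips, as it does in this toy case\.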

For the correction process, we constructed a calibration dataset by randomly selecting 4,096 samples from the RedPajama dataset, with each sample having a sequence length of 2048 tokens\. From this set, 128 samples were held out as a validation set\. We used the Adam optimizer to minimize the Mean Squared Error \(MSE\) loss between the outputs of the quantized layer and the original full\-precision layer\. The learning rate for the transformation parameters $\bm{T}^{*}$ was set to $1\times 10^{-3}$ across all models\. For $\bm{W}_q$, we used a learning rate of $2\times 10^{-5}$ for models between 3B and 14B parameters, and a reduced rate of $1\times 10^{-5}$ for the 70B model\. The entire training process was conducted for 2 epochs\.

## Appendix B Training Details for End\-to\-End Fine\-Tuning

To further enhance model performance and globally align the quantization parameters, we perform an optional end\-to\-end fine\-tuning step\. The effectiveness of this approach for adjusting quantization parameters has been validated by several prior works \(Tseng et al\., [2024a](https://arxiv.org/html/2606.04050#bib.bib48); Egiazarian et al\., [2024](https://arxiv.org/html/2606.04050#bib.bib55); Chen et al\., [2024](https://arxiv.org/html/2606.04050#bib.bib56); Liu et al\., [2024a](https://arxiv.org/html/2606.04050#bib.bib54)\)\.

This fine\-tuning process optimizes the continuous parameters of our framework \(specifically, the scaling parameters and the components of the transformation matrix $\bm{D}$\) across all layers simultaneously\. Unlike the layer\-wise correction phase, this step minimizes the standard language modeling loss \(i\.e\., Cross\-Entropy\) over the entire model\.

For training, we used a dataset of 4,096 samples from RedPajama, each with a sequence length of 4096\. We employed the Adam optimizer and trained for a single epoch\. A differential learning rate scheme was applied: the learning rate for the quantization scaling parameters was set to $1\times 10^{-5}$, while the transformation parameters used a higher rate of $3\times 10^{-4}$\. A significant advantage of this approach is its memory efficiency\. Since the fine\-tuning is performed on the already quantized model, the weights remain in their low\-bit format throughout the process\. This dramatically reduces the memory footprint, enabling us to fine\-tune the entire 70B model on a single 80GB A100 GPU, a task that is infeasible for its full\-precision counterpart\.

## Appendix C Sensitivity to Calibration Data

To investigate the sensitivity of our fine\-tuning process to the choice of calibration data, we conducted a comprehensive ablation study\. We varied the calibration dataset’s size, domain, and sequence length, and evaluated the impact on Llama\-2\-7B\. For these experiments, we used a 10×20 $\bm{M}$\-matrix and only performed the intra\-block correction \(without end\-to\-end fine\-tuning\) to isolate the specific effect of the calibration data\. The results are presented in [Table 6](https://arxiv.org/html/2606.04050#A3.T6)\.

Table 6: Ablation study on the calibration data for 2\-bit Llama\-2\-7B\. The default configuration used in our main experiments is highlighted in bold\.

| Calibration Set | Config. (Samples × SeqLen) | WikiText-2 PPL (↓) | C4 PPL (↓) | Avg. 0-shot Acc. (↑) |
|---|---|---|---|---|
| RedPajama (Small) | 512 × 2048 (1M tokens) | 7.08 | 8.66 | 60.03 |
| RedPajama (Medium) | 1024 × 2048 (2M tokens) | 7.00 | 8.59 | 60.55 |
| RedPajama (Large) | 2048 × 2048 (4M tokens) | 6.96 | 8.53 | 60.68 |
| **RedPajama (Default)** | **4096 × 2048 (8M tokens)** | **6.97** | **8.53** | **60.70** |
| RedPajama (Short Seq) | 4096 × 512 (2M tokens) | 6.98 | 8.53 | 60.67 |
| WikiText-2 (In-Domain) | 2048 × 2048 (4M tokens) | 6.72 | 8.65 | 60.24 |

Our findings from this study provide two key insights:

#### Robustness to Data Size and Sequence Length\.

The results indicate that while performance improves as the calibration data size increases from 1M to 8M tokens, there are clear diminishing returns beyond approximately 4M tokens\. Similarly, reducing the sequence length from 2048 to 512 while keeping the total token count constant has a minimal impact on the final performance\. Our choice of 8M tokens \(4096 samples × 2048 sequence length\) for the main experiments was made to ensure a fair comparison with other methods, such as AQLM and EfficientQAT\.

#### Impact of Domain Shift\.

As expected, calibrating on a domain\-matched dataset \(WikiText\-2\) yields the best perplexity on that specific in\-domain benchmark \(6\.72 PPL\), as shown in [Table 6](https://arxiv.org/html/2606.04050#A3.T6)\. This specialization, however, comes at the cost of slightly degraded performance on out\-of\-domain benchmarks like the C4 dataset and zero\-shot tasks\. Using a large, general\-purpose corpus like RedPajama provides a more balanced and robust performance across all evaluation metrics\.

## Appendix D Decoding Overhead

Table 7: Asymptotic complexity and storage analysis per layer of size $N\times M$ at 2\-bit quantization\.

| Method | Main GEMM FLOPs | Additional FLOPs | Weight Storage | Additional Storage |
|---|---|---|---|---|
| FP16 | $2NM$ | - | $16NM$ | - |
| LiftUQ (decoding; $\bm{T}^{*}\bm{A}$ first, batch = 1) | $2NM$ | $O(d_s N + d_s^2 N)$ | $bNM$ | $O(d_s^2 N)$ |
| LiftUQ (prefill; $\bm{W}_q\bm{T}^{*}$ first, batch = $k$) | $2kNM$ | $O((d_s N + d_s^2 N)k)$ | $bNM$ | $O(d_s^2 N)$ |

Note: $k$ is batch size, $b$ is bitwidth, $d_s$ is subspace dimension\.

## Appendix E Comparison on 1\.58\-bit Baseline

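Table 8: Comparison on 1\.58\-bit Baseline\.

| Method | Type | Bits | 2-7 W2↓ | 2-7 C4↓ | 2-7 Avg.Acc↑ | 2-13 W2↓ | 2-13 C4↓ | 2-13 Avg.Acc↑ | 3-8 W2↓ | 3-8 C4↓ | 3-8 Avg.Acc↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FP16 | - | - | 5.47 | 6.97 | 64.85 | 4.88 | 6.47 | 67.81 | 6.14 | 8.88 | 68.58 |
| PTQ1.61 | UQ | 1.61 | 12.70 | 17.73 | 44.14 | 9.74 | 13.64 | 49.21 | 22.90 | 33.82 | 43.99 |
| LiftUQ | LQ-16/10 | 1.62 | 7.71 | 9.55 | 56.19 | 6.47 | 8.27 | 60.54 | 11.43 | 15.13 | 56.66 |

## Appendix F Other Quantization Results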

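Table 9: Llama\-2 and Llama\-3 accuracy \(↑\) on 3\-bit quantization\.

| Model | Method | Type | Bits | ArcC | ArcE | HellaSwag | PiQA | WinoGrande | Avg.Acc |
|---|---|---|---|---|---|---|---|---|---|
| 2-7 | FP16 | - | - | 43.52 | 76.26 | 57.16 | 78.07 | 69.22 | 64.85 |
| 2-7 | QuIP# | VQ | 3.00 | 41.89 | 74.62 | 55.85 | 77.04 | 68.19 | 63.52 |
| 2-7 | VPTQ | VQ | 3.02 | 39.3 | 69.1 | 54.9 | 77.3 | 68.0 | 61.70 |
| 2-7 | LiftUQ | UQ | 3.02 | 41.02 | 75.07 | 56.57 | 77.89 | 67.97 | 63.71 |
| 2-13 | FP16 | - | - | 48.29 | 79.42 | 60.07 | 79.05 | 72.22 | 67.81 |
| 2-13 | QuIP# | VQ | 3.00 | 44.62 | 77.90 | 58.26 | 78.07 | 72.45 | 66.26 |
| 2-13 | VPTQ | VQ | 3.03 | 46.50 | 78.83 | 58.50 | 78.18 | 69.85 | 66.37 |
| 2-13 | LiftUQ | UQ | 3.02 | 46.25 | 77.99 | 59.16 | 78.84 | 71.11 | 66.67 |
| 2-70 | FP16 | - | - | 54.44 | 82.70 | 64.77 | 82.15 | 77.98 | 72.41 |
| 2-70 | QuIP# | VQ | 3.00 | 55.89 | 82.11 | 64.22 | 82.21 | 76.24 | 72.13 |
| 2-70 | LiftUQ | UQ | 3.02 | 54.61 | 82.58 | 63.98 | 81.50 | 77.11 | 71.96 |
| 3-8 | FP16 | - | - | 50.43 | 80.09 | 60.17 | 79.60 | 72.61 | 68.58 |
| 3-8 | VPTQ | VQ | 3.03 | 44.80 | 78.45 | 57.85 | 78.78 | 71.74 | 66.32 |
| 3-8 | LiftUQ | UQ | 3.02 | 46.59 | 78.83 | 58.42 | 78.73 | 73.95 | 67.30 |
| 3-70 | FP16 | - | - | 60.41 | 86.99 | 66.36 | 82.37 | 80.51 | 75.33 |
| 3-70 | AWQ-g128 | UQ | 3.13 | 58.36 | 84.51 | 64.26 | 82.26 | 78.85 | 73.65 |
| 3-70 | EPTQ-g128 | UQ | 3.13 | 55.12 | 83.12 | 65.53 | 80.52 | 77.82 | 72.42 |
| 3-70 | LiftUQ | UQ | 3.02 | 58.87 | 85.86 | 65.32 | 82.43 | 78.77 | 74.25 |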
