Qift: Shift-Friendly No-Zero W2 Post-Training Quantization for Rotated W2A4/KV4 LLM Inference

arXiv cs.LG 06/03/26, 04:00 AM Papers
quantization llm post-training-quantization low-bit efficient-inference hadamard-rotation weight-compression
Summary
This paper introduces Qift, a fixed no-zero two-bit weight quantization level set designed for Hadamard-rotated LLMs, achieving improved W2A4/KV4 inference by leveraging the near-zero-centered Gaussian-like distribution of rotated weights. Experiments on LLaMA-2-7B and LLaMA-3.1-8B show consistent perplexity gains over standard W2 quantization.
arXiv:2606.02823v1 Announce Type: new Abstract: Two-bit weight quantization is attractive for memory-efficient LLM inference, but the standard W2 level set {-2,-1,0,+1} often collapses under aggressive W2A4/KV4 settings. We study the scalar level-set geometry of two-bit weights in a Hadamard-rotated quantization pipeline. Conventional asymmetric W2 substantially improves over the standard level set, indicating that W2A4 failure is not only a bit-width problem but also a reconstruction-level problem. Across all 224 linear modules in each of LLaMA-2-7B and LLaMA-3.1-8B, pretrained weights are already nearly zero-centered, while Hadamard rotation primarily Gaussianizes their standardized shape: excess kurtosis and Q-Q error drop by orders of magnitude. Based on this approximate zero-centered Gaussian-like source model, we propose Qift, a fixed no-zero W2 level set for rotated W2A4/KV4 inference. The main level set is {+/-0.5, +/-1.5}, equivalently {+/-1, +/-3} under a half-scale reparameterization; a power-of-two variant uses {+/-1, +/-4} for sign-and-shift decoded weight application. Qift redesigns the fixed two-bit code-to-level mapping and is training-free, learned-codebook-free, group-grid-free, and zero-point-free, retaining the standard per-channel scale. A scale-invariant ratio analysis identifies an effective inner/outer centroid ratio range of 0.25 to 0.33, explaining why mirror no-zero (MNZ), Lloyd, NF2, and PoT-MNZ perform well while {+/-1, +/-2} does not. On both models, the no-zero level sets consistently improve pure W2A4 perplexity, L-layer mixed W2/W4 perplexity, downstream accuracy, and GPTQ residual behavior over the standard W2 level set. At L=16 mixed precision, they substantially narrow the gap to W3A4 while keeping half of the transformer layers at two-bit precision, giving a simple, source-aware, and deployment-friendly alternative to more complex learned W2 codebooks.
Cached at: 06/03/26, 09:40 AM
# Qift: Shift-Friendly No-Zero W2 Post-Training Quantization for Rotated W2A4/KV4 LLM Inference
Source: [https://arxiv.org/html/2606.02823](https://arxiv.org/html/2606.02823)
Chia\-Chi Tsai National Cheng Kung University cctsai@gs\.ncku\.edu\.tw

###### Abstract

Two-bit weight quantization is attractive for memory-efficient large language model (LLM) inference, but the standard W2 level set $\{-2,-1,0,+1\}$ often collapses under aggressive W2A4/KV4 settings. We study the scalar level-set geometry of two-bit weights in a Hadamard-rotated quantization pipeline. Conventional asymmetric W2 substantially improves over the standard level set, indicating that W2A4 failure is not only a bit-width problem but also a reconstruction-level problem. Across all 224 linear modules in each of LLaMA-2-7B and LLaMA-3.1-8B, we show that pretrained weights are already nearly zero-centered, while Hadamard rotation primarily Gaussianizes their standardized shape: excess kurtosis and Q–Q error drop by orders of magnitude, skewness also decreases substantially, and the per-channel mean stays close to zero relative to the per-channel standard deviation.

Based on this approximate zero-centered Gaussian-like source model, we propose Qift, a fixed no-zero W2 level set for rotated W2A4/KV4 inference. The main level set is $\{\pm 0.5,\pm 1.5\}$, equivalently $\{\pm 1,\pm 3\}$ under a half-scale reparameterization; a power-of-two variant uses $\{\pm 1,\pm 4\}$ for sign-and-shift decoded weight application. Qift redesigns the fixed two-bit code-to-level mapping and is training-free, learned-codebook-free, group-grid-free, and zero-point-free, retaining the standard per-channel scale. A scale-invariant ratio analysis identifies an effective inner/outer centroid ratio range of $0.25$–$0.33$, explaining why mirror no-zero (MNZ), Lloyd, NF2, and PoT-MNZ perform well while $\{\pm 1,\pm 2\}$ does not.

Experiments on LLaMA-2-7B and LLaMA-3.1-8B show that the proposed no-zero level sets consistently improve pure W2A4 perplexity, L-layer mixed W2/W4 perplexity, downstream accuracy, and GPTQ residual behavior over the standard W2 level set. As the ratio analysis indicates, this improvement requires an appropriate inner/outer ratio rather than the mere absence of a zero level: a no-zero set with too large a ratio, such as $\{\pm 1,\pm 2\}$, does not improve over the standard level set. At $L=16$ mixed precision, no-zero level sets substantially narrow the gap to W3A4 while keeping half of the transformer layers at two-bit precision. Fixed no-zero scalar level sets provide a simple, source-aware, and hardware-aligned alternative to more complex learned W2 codebooks for rotated W2A4/KV4 inference.

## 1 Introduction

Decode-phase large language model (LLM) inference is highly memory-bound: during autoregressive decoding, throughput is limited primarily by the cost of moving weights and the key–value (KV) cache from memory rather than by arithmetic. Weight and KV-cache bit-width therefore directly bound achievable throughput on constrained hardware, which makes low-bit quantization a central lever for efficient inference. Rotation-based post-training quantization (PTQ) has made W4A4/KV4 inference increasingly practical and close to lossless, and W3A4 is an aggressive but workable operating point. The more extreme W2A4/KV4 regime, an $8\times$ weight compression target relative to FP16, remains far less explored and is highly sensitive to quantizer design. In particular, its scalar reconstruction level set has largely been inherited from the standard symmetric integer quantizer rather than designed for the rotated weight distribution.

Existing low\-bit LLM quantization methods combine several complementary ingredients\. Smoothing shifts activation outlier difficulty into weights, rotations diffuse channel\-wise outlier energy, calibration and compensation methods such as GPTQ and GPTAQ reduce the remaining discretization error, mixed precision protects sensitive layers with additional bits, and weight\-only quantization keeps activations in high precision to avoid compounding weight and activation quantization error\. These techniques are crucial for practical low\-bit inference and also form the foundation of the rotated W2A4/KV4 setting studied here\.

However, once weights are reduced to two bits, a different error source becomes first\-order\. With only four reconstruction levels, the dense central bulk of the rotated weight distribution is no longer represented almost for free\. Outlier handling and compensation are necessary but no longer sufficient: they operate with respect to a chosen W2 level set, but they do not determine whether those four levels represent the dense rotated bulk well\. Thus, to push beyond practical W4A4/KV4 toward the more extreme W2A4/KV4 regime, this work treats the W2 reconstruction level set itself as a first\-class design variable\.

This work focuses on the scalar level set used by two-bit weight quantization inside a rotated LLM quantization pipeline. In many symmetric integer quantizers, a $b$-bit weight is represented by a signed integer code

$$q=\mathrm{clip}\!\left(\left\lfloor\frac{w}{s}\right\rceil,\,-2^{b-1},\,2^{b-1}-1\right),\qquad \hat{w}=sq, \tag{1}$$

where $s$ is a per-output-channel scale. For $b=2$, this gives the standard W2 reconstruction level set

$$\mathcal{G}_{\mathrm{sym}}=\{-2,-1,0,+1\}. \tag{2}$$

Equivalently, the four reconstruction levels are $s\,\mathcal{G}_{\mathrm{sym}}$. In practice, $s$ may be chosen by max-range or clipping-based MSE search, but the scalar reconstruction geometry is still determined by $\mathcal{G}_{\mathrm{sym}}$.
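As a rough illustration of Eq. (1), the sketch below applies the signed-integer quantizer to scalar inputs and enumerates the codes reachable for $b=2$. The function name `sym_int_quantize` is hypothetical, and the use of Python's built-in `round` (which sends exact ties to the nearest even integer) is an assumption of this example rather than a detail taken from the paper.

```python
def sym_int_quantize(w: float, s: float, b: int = 2) -> tuple[int, float]:
    """Return (integer code q, reconstruction s*q) for a signed b-bit quantizer."""
    q_min, q_max = -(2 ** (b - 1)), 2 ** (b - 1) - 1
    # Round to the nearest integer, then clip into the signed b-bit range.
    q = max(q_min, min(q_max, round(w / s)))
    return q, s * q


# Sweep inputs in [-5, 5] with unit scale and collect the reachable codes.
levels = sorted({sym_int_quantize(x / 10, 1.0)[0] for x in range(-50, 51)})
print(levels)  # [-2, -1, 0, 1]
```

The sweep makes the asymmetry of the level set visible: the negative side reaches $-2$, the positive side stops at $+1$, and one of the four codes is spent on zero.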

We use "reconstruction level set" for the four scalar values used by a W2 quantizer, and "grid" only as an informal synonym. In scalar-quantization terminology, a mid-tread quantizer includes zero as a reconstruction level, whereas a mid-rise quantizer places zero between two reconstruction levels. The standard signed-integer W2 grid, denoted SYM-INT, is mid-tread-style: it is simple, but it spends one of only four reconstruction levels exactly at zero, which is not a neutral choice for a two-bit quantizer.

A direct indication that the grid, and not merely the bit-width, is responsible comes from conventional asymmetric W2 quantization: simply replacing the standard SYM-INT grid with an asymmetric W2 quantizer substantially reduces pure-W2A4 perplexity (on LLaMA-2-7B with KV4 and GPTQ, from 53.849 to 33.533; on LLaMA-3.1-8B, from 3005.556 to 113.747). We treat this as evidence, rather than as the central result, that the placement of the four reconstruction levels is a dominant factor in W2A4 failure, and we use it only to motivate a principled, source-aware grid design.

From a quantization perspective, the tensor distribution determines how efficiently a small number of reconstruction levels can represent values\. A compact and approximately symmetric source is easier to quantize because most values lie near the center and the quantizer can allocate levels to high\-density regions\. In contrast, skewed or heavy\-tailed sources waste range on rare extremes and reduce effective precision for the dense central mass\. Thus, Gaussian\-like is not the final goal by itself; it is a useful reference for a centered, symmetric, low\-outlier source\.

This motivates the central question of this work: after Hadamard rotation, what scalar level set should a two\-bit weight quantizer use? We show that rotated weights are better modeled as approximately zero\-centered and more Gaussian\-like in standardized shape\. For such a source, a four\-level scalar quantizer should place two inner centroids around zero and two outer centroids in the tails, instead of spending a centroid exactly at zero\.

We call the proposed design Qift, short for quantization with shift-friendly no-zero W2 grids. The main practical level set is mirror no-zero (MNZ),

$$\mathcal{G}_{\mathrm{MNZ}}=\{-1.5,-0.5,+0.5,+1.5\}, \tag{3}$$

a uniform four-level mid-rise level set, equivalent to odd integer levels $\{\pm 1,\pm 3\}$ under a half-scale reparameterization. We also study a power-of-two variant, PoT-MNZ,

$$\mathcal{G}_{\mathrm{pot}}=\{-4,-1,+1,+4\}, \tag{4}$$

which keeps the mirror no-zero structure but is not a uniform mid-rise grid; its power-of-two magnitudes support sign-and-shift decoded weight application. In both cases, the per-channel scale can be applied separately in the epilogue, as in standard quantized linear layers.

The proposed grids are deliberately minimal: they use the same four reconstruction levels globally, retain only the standard per\-channel scale, and introduce no quantization\-aware training, learned partitions, per\-layer or group\-wise grid assignment \(we call this group\-grid\-free\), learned codebooks, or asymmetric zero\-points\. Thus, Qift treats W2 quantization as an accuracy\-aware level\-set design problem while remaining a drop\-in replacement for the standard W2 grid rather than a new learned quantizer\.

##### Contributions\.

Our contributions are summarized as follows\.

- We introduce Qift as a modular reconstruction-level redesign for rotated W2A4/KV4 inference. It isolates the fixed W2 level set as the design intervention, keeping the surrounding rotation, scaling, and PTQ compensation pipeline unchanged and requiring no quantization-aware training, learned codebooks, group-wise grid assignment, or zero-point metadata.
- We propose mirror no-zero (MNZ), a fixed no-zero W2 scalar level set for approximately zero-centered Gaussian-like rotated weights. MNZ provides a simple integer approximation to the four-level Gaussian Lloyd-Max structure, together with a power-of-two PoT-MNZ variant for sign-and-shift decoded weight application.
- We validate Qift across two LLaMA models. The proposed level sets consistently improve over the standard W2 grid in pure W2A4, mixed W2/W4, and downstream tasks, while ablations show that the gain comes from an effective inner/outer centroid ratio rather than zero removal alone.

## 2 Related Work

We organize prior low-bit LLM quantization by which component is primarily changed. Some methods transform the tensor distribution seen by the quantizer; others improve calibration or error compensation after discretization; and a third group changes the quantizer, reconstruction levels, partitions, or codebooks themselves. Table [1](https://arxiv.org/html/2606.02823#S2.T1) summarizes this view and locates Qift within it.

##### Equivalent transformations\.

A major line of work improves quantization by applying mathematically equivalent transformations to weights and activations so that the resulting tensors are easier to quantize. SmoothQuant \[[4](https://arxiv.org/html/2606.02823#bib.bib4)\] migrates activation outlier difficulty into weights through channel-wise scaling, while rotation-based methods reduce outliers with orthogonal or learned transforms: QuaRot \[[3](https://arxiv.org/html/2606.02823#bib.bib3)\] uses Hadamard rotations for end-to-end W4A4/KV4 inference, SpinQuant \[[6](https://arxiv.org/html/2606.02823#bib.bib6)\] learns the rotation matrices, and FlatQuant \[[7](https://arxiv.org/html/2606.02823#bib.bib7)\] learns affine transforms that flatten weight and activation statistics. These methods make tensors more quantization-friendly but keep the scalar integer reconstruction levels fixed. Qift is complementary: it assumes such a rotated pipeline and redesigns the W2 reconstruction levels used after the transformation.

##### Calibration and error compensation\.

Another line of work uses calibration data to reduce the effect of discretization rather than changing the reconstruction levels. GPTQ \[[1](https://arxiv.org/html/2606.02823#bib.bib1)\] uses a Hessian approximation to compensate weight quantization error, and GPTAQ \[[2](https://arxiv.org/html/2606.02823#bib.bib2)\] adds activation-aware asymmetric calibration. AWQ \[[11](https://arxiv.org/html/2606.02823#bib.bib11)\] uses activation statistics to identify and protect salient weights, and OmniQuant \[[12](https://arxiv.org/html/2606.02823#bib.bib12)\] learns equivalent transformations and clipping parameters for post-training quantization. These methods improve the calibration or compensation procedure; Qift instead redesigns the fixed W2 reconstruction level set and can be combined with such pipelines, as in our GPTQ/GPTAQ experiments.

##### Quantizer and level\-set design\.

A third line changes the quantizer itself: its reconstruction levels, partitions, or codebooks. RCP \[[8](https://arxiv.org/html/2606.02823#bib.bib8)\] is the closest prior work, as it also targets W2A4/KV4: it integrates rotation, clipping, and a learnable non-uniform W2 quantizer trained with quantization-aware training (QAT). NF4 \[[5](https://arxiv.org/html/2606.02823#bib.bib5)\] uses a normal-distribution motivation for a 4-bit datatype, conceptually close to our Gaussian Lloyd-Max \[[14](https://arxiv.org/html/2606.02823#bib.bib14),[15](https://arxiv.org/html/2606.02823#bib.bib15)\] and NF2 references. QuIP# \[[9](https://arxiv.org/html/2606.02823#bib.bib9)\] and AQLM \[[10](https://arxiv.org/html/2606.02823#bib.bib10)\] move beyond scalar quantization with lattice or additive vector codebooks, and LeanQuant \[[13](https://arxiv.org/html/2606.02823#bib.bib13)\] learns loss-error-aware adaptive grids. Qift takes a different point in this design space: it keeps the quantizer scalar, fixed, training-free, and zero-point-free, and redesigns the four W2 reconstruction levels themselves. The contrast with RCP is direct: RCP learns non-uniform W2 partitions through QAT, whereas Qift uses a fixed, source-aware, post-training no-zero scalar level set.

##### Weight\-only compression versus W2A4/KV4 inference\.

Many extreme low-bit methods, including QuIP# \[[9](https://arxiv.org/html/2606.02823#bib.bib9)\] and AQLM \[[10](https://arxiv.org/html/2606.02823#bib.bib10)\], primarily target weight-only compression, where the main benefit is reduced parameter storage and weight memory traffic. In contrast, this work studies a rotated W2A4/KV4 inference setting, where the W2 levels must interact with four-bit activations and KV-cache quantization. This makes the scalar W2 level-set design more constrained than in weight-only compression. The proposed no-zero level sets keep the quantizer fixed and scalar while avoiding learned codebook lookup and asymmetric zero-point metadata.

Table 1: Taxonomy of related LLM quantization techniques and the position of Qift. The table groups methods by which component they primarily change; it is not a cross-paper accuracy leaderboard.

Table 2: Design-level positioning of W2 quantization choices. The table compares decoding and metadata properties rather than reporting cross-paper accuracy. CB-free = no learned codebook; ZP-free = no zero-point; Group-free = reconstruction levels, learned centroids, or lookup tables are not assigned per weight group (standard per-channel scaling is still used).

Overall, existing methods mainly improve the tensor distribution, the calibration or compensation procedure, or the expressiveness of the quantizer. Qift focuses on a smaller but underexplored design variable: the fixed four-level W2 reconstruction level set used inside a rotated W2A4/KV4 PTQ pipeline. Table [2](https://arxiv.org/html/2606.02823#S2.T2) compares these methods by their decoding and metadata properties (whether they are codebook-free, zero-point-free, and group-free) and places Qift at the fixed, scalar, no-zero corner of this space.

## 3 Qift: No-Zero W2 Reconstruction-Level Design

### 3.1 Overview and Base Pipeline

Qift is a quantizer-level replacement for the W2 reconstruction level set in a fixed Hadamard-rotated W2A4/KV4 pipeline. The base pipeline follows the rotation-based PTQ setting \[[3](https://arxiv.org/html/2606.02823#bib.bib3)\]: Hadamard transformations redistribute concentrated channel-wise outlier energy before low-bit quantization, which balances the activation distribution and yields the rotated weight source studied in this work (Figure [1](https://arxiv.org/html/2606.02823#S3.F1)). This is complementary to activation smoothing such as SmoothQuant \[[4](https://arxiv.org/html/2606.02823#bib.bib4)\], which instead migrates activation outlier difficulty into weights.

![Refer to caption](https://arxiv.org/html/2606.02823v1/figures/fig_activation_hadamard_B_perchannel_max.png)

Figure 1: Channel-wise activation outlier reduction after Hadamard rotation for selected LLaMA-2-7B `down_proj` inputs. We report the maximum activation magnitude for each channel, rather than the single global maximum over the entire activation tensor. The largest channel-wise spike in layer 30 drops from 1295.0 to 57.4, and the plotted layers show roughly 11.7–22.6$\times$ reduction in dominant channel-wise spikes. This activation-side outlier diffusion motivates the fixed rotated W2A4/KV4 pipeline used by Qift; the rest of this section studies the resulting weight-side reconstruction-level design.

After rotation, weights may be quantized by RTN-style nearest-level rounding or by compensation methods. RTN selects a per-output-channel scale and rounds each weight to the nearest reconstruction level. GPTQ \[[1](https://arxiv.org/html/2606.02823#bib.bib1)\] additionally compensates weight quantization error using a Hessian approximation from calibration activations, and GPTAQ \[[2](https://arxiv.org/html/2606.02823#bib.bib2)\] adds activation-aware asymmetric calibration by matching quantized layer outputs to their full-precision counterparts. These methods reduce the error induced by a chosen four-level codebook, but they do not by themselves determine which four reconstruction levels should be used.

In Qift, the fixed W2 reconstruction level set is isolated as the design intervention. It keeps the standard per-output-channel scale and introduces no group-wise grid assignment, learned codebooks, or asymmetric zero-point metadata. The main MNZ level set is the uniform four-level mid-rise grid introduced earlier, while PoT-MNZ keeps the same mirror no-zero structure with power-of-two magnitudes for sign-and-shift decoded weight application. The same level-set replacement can be used with RTN, GPTQ, or GPTAQ, keeping compensation and reconstruction-level geometry as separate design choices. Both level sets are motivated by the source model developed next: a centered, symmetric, Gaussian-like rotated weight distribution should use two inner centroids around zero and two outer centroids in the tails, rather than spending a centroid exactly at zero.

### 3.2 Design Principle: A Zero-Centered Gaussian-Like Source

Reducing weight precision from 16 bits to 2 bits can dramatically reduce memory footprint\. However, W2 quantization is qualitatively different from W4 quantization: with only four reconstruction levels, every centroid placement decision matters\. In W4, a zero centroid can coexist with many other centroids; in W2, using one centroid at zero consumes 25% of the representational capacity\.

For low\-bit quantization, the source distribution strongly affects reconstruction error\. If the source is skewed, a sign\-symmetric level set wastes levels on the low\-density side; if the source is heavy\-tailed, the scale or clipping range must cover rare extremes, reducing precision for the dense central region\. A zero\-centered Gaussian\-like source is a useful design reference because it is symmetric, concentrated near the center, and has limited tail dominance\. However, exact Gaussianity is not required\. The important properties are centeredness, low skewness, moderate kurtosis, and low quantile mismatch to a Gaussian reference\.

Concretely, the source hypothesis has two parts:

$$\text{pretrained weights} \approx \text{near-zero centered}, \tag{5}$$

$$\text{Hadamard mixing} \Rightarrow \text{more Gaussian-like standardized shape}. \tag{6}$$

If both statements hold, post-rotation weights can be modeled as an approximately zero-centered Gaussian-like source for scalar level-set design. We adopt this model as the design assumption in this section and verify it empirically on real rotated weights in §[4.2](https://arxiv.org/html/2606.02823#S4.SS2), so that the method can be understood independently of the supporting diagnostics.
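The second part of the hypothesis can be illustrated on synthetic data. The sketch below is not the paper's diagnostic: it draws heavy-tailed Laplace samples as a stand-in for a weight vector, applies an orthonormal fast Walsh-Hadamard transform, and compares excess kurtosis before and after. The helper names `fwht` and `excess_kurtosis`, the Laplace source, and the vector length are all assumptions of this example.

```python
import random


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    x = list(x)
    n, h = len(x), 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    norm = n ** -0.5  # scale so the transform preserves the L2 norm
    return [v * norm for v in x]


def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (zero for an exact Gaussian)."""
    n = len(x)
    mu = sum(x) / n
    m2 = sum((v - mu) ** 2 for v in x) / n
    m4 = sum((v - mu) ** 4 for v in x) / n
    return m4 / (m2 * m2) - 3.0


rng = random.Random(0)
# Synthetic heavy-tailed stand-in (Laplace), NOT real LLaMA weights.
w = [rng.expovariate(1.0) * rng.choice((-1.0, 1.0)) for _ in range(1024)]
w_rot = fwht(w)
k_before, k_after = excess_kurtosis(w), excess_kurtosis(w_rot)
print(f"excess kurtosis before={k_before:.3f} after={k_after:.3f}")
```

Each rotated coordinate is a signed average of many inputs, so its distribution moves toward a Gaussian shape while the overall energy is unchanged; the paper's measurements on real weights are the ones reported in its §4.2.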

For a zero-centered Gaussian source, the Lloyd-Max scalar quantization principle \[[14](https://arxiv.org/html/2606.02823#bib.bib14),[15](https://arxiv.org/html/2606.02823#bib.bib15)\] implies that the optimal four-level scalar quantizer places two inner centroids around zero and two outer centroids in the tails. It does not allocate a centroid exactly at zero. Since the measured rotated weights are not exactly Gaussian, we use this Lloyd-Max solution as a design prior and scalar reference rather than as a strict generative model.
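The four-level Lloyd-Max structure for a standard Gaussian can be reproduced with a short fixed-point iteration. The sketch below is a minimal version that exploits symmetry, so only the positive decision threshold and the two positive centroids are tracked; the function name `lloyd_max_4level`, the starting threshold, and the iteration count are assumptions of this example.

```python
import math


def pdf(x: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def cdf(x: float) -> float:
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def lloyd_max_4level(iters: int = 200):
    """Return (inner, outer, threshold) for a symmetric 4-level Gaussian quantizer."""
    t = 1.0  # boundary between the inner and outer positive cells
    for _ in range(iters):
        inner = (pdf(0.0) - pdf(t)) / (cdf(t) - 0.5)  # E[x | 0 <= x < t]
        outer = pdf(t) / (1.0 - cdf(t))               # E[x | x >= t]
        t = 0.5 * (inner + outer)                     # nearest-neighbor boundary
    return inner, outer, t


inner, outer, threshold = lloyd_max_4level()
print(f"inner={inner:.4f} outer={outer:.4f} ratio={inner / outer:.3f}")
```

Both positive centroids are strictly greater than zero, so the mirrored four-level solution contains no zero level, which is the property the design prior relies on.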

![Refer to caption](https://arxiv.org/html/2606.02823v1/figures/fig_core_grid_concept.png)

Figure 2: Core intuition behind the proposed W2 grid redesign. The figure illustrates the source model studied in this work: after Hadamard rotation, weights remain approximately zero-centered and become more Gaussian-like in standardized shape. The standard SYM-INT grid spends one reconstruction level exactly at zero, which only benefits very-near-zero weights. Far-MNZ removes the zero level but places the inner centroids too far from zero. MNZ places two inner centroids around the dense middle region, yielding a better centroid allocation for the rotated source.

The preceding design space suggests that a W2 grid for rotated W2A4/KV4 inference should satisfy four requirements. First, it should match the post-rotation weight source rather than inherit the standard symmetric integer grid by default. Second, it should keep the quantizer scalar and globally fixed, avoiding learned partitions, learned codebooks, and group-wise grid assignment. Third, it should retain the standard per-output-channel scale without introducing asymmetric zero-point metadata. Fourth, it should remain compatible with simple decoded integer weight application, so that the grid remains accurate and deployment-friendly. These requirements motivate the concrete no-zero level sets introduced next, whose source assumption is validated empirically in §[4.2](https://arxiv.org/html/2606.02823#S4.SS2).

### 3.3 Qift Reconstruction Levels: MNZ and PoT-MNZ

The proposed level sets share a common geometry: every level has an equal and opposite mirror level and no level sits at zero. We call this family mirror no-zero (MNZ). Figure [2](https://arxiv.org/html/2606.02823#S3.F2) illustrates the intuition: the standard SYM-INT grid spends one reconstruction level at zero, Far-MNZ removes the zero level but places its inner centroids too far from the dense bulk, and MNZ puts its two inner centroids in the dense center of the rotated distribution. We group the grids into four roles: standard baselines, proposed Qift grids, scalar reference grids, and a negative diagnostic grid. The proposed deployable grids are Qift-MNZ and Qift-PoT-MNZ. Lloyd-Max and NF2 are used as scalar reconstruction references, while Far-MNZ is included to show that removing zero is not sufficient when the inner centroids are placed too far from zero. Throughout, we use MNZ and PoT-MNZ to denote the level sets themselves, and Qift-MNZ and Qift-PoT-MNZ to denote the full Qift method instantiated with the corresponding level set.

We evaluate the following W2 grids:

$$\text{SYM-INT}: \{-2,-1,0,+1\}, \tag{7}$$

$$\text{MNZ}: \{-1.5,-0.5,+0.5,+1.5\}, \tag{8}$$

$$\text{Lloyd}: \{\pm 0.4528,\pm 1.5104\}, \tag{9}$$

$$\text{NF2}: \{-1.0,-0.2525685,+0.2525685,+1.0\}, \tag{10}$$

$$\text{PoT-MNZ}: \{-4,-1,+1,+4\}, \tag{11}$$

$$\text{Far-MNZ}: \{-2,-1,+1,+2\},\quad \text{equivalently } \{\pm 1,\pm 2\}\ \text{up to scale}\ (r=0.5). \tag{12}$$

The proposed MNZ grid is a drop-in replacement for the reconstruction levels associated with four two-bit codes:

$$\mathcal{G}_{\mathrm{MNZ}}=\{-1.5,-0.5,+0.5,+1.5\}. \tag{13}$$

The grid itself is globally fixed. The same four scalar levels are shared across all layers, modules, channels, and weight groups. Each output channel still uses the standard per-channel scale, but the grid is not learned, tuned, searched, or selected per group. Therefore, MNZ introduces no group-wise codebook, no group-wise grid assignment, no per-layer grid search, and no zero-point metadata; the only per-channel quantity is the ordinary scale already used by standard weight quantization.

The same grid can be written as

$$\{-1.5,-0.5,+0.5,+1.5\}\cdot s=\{-3,-1,+1,+3\}\cdot\frac{s}{2}. \tag{14}$$

This representation is useful for implementation because the decoded W2 value is an odd integer level, while the half-scale can be folded into the ordinary channel scale. The assignment from two-bit codes to these levels can be chosen by the implementation; the proposed method specifies the reconstruction level set rather than a unique code ordering.

Conventional asymmetric $b$-bit quantization represents a weight by an unsigned integer code

$$q=\mathrm{clip}\!\left(\left\lfloor\frac{w}{s}\right\rceil+z,\,0,\,2^{b}-1\right),\qquad \hat{w}=s(q-z), \tag{15}$$

where $z$ is a zero-point. For $b=2$, the unsigned code set is $q\in\{0,1,2,3\}$. For nearly symmetric rotated weights, the useful midpoint between the four integer codes is $z=1.5$. Keeping this fractional midpoint gives

$$s\{0-1.5,\,1-1.5,\,2-1.5,\,3-1.5\}=s\{-1.5,-0.5,+0.5,+1.5\}, \tag{16}$$

which is exactly MNZ. Thus, MNZ can be viewed as a zero-point-free realization of the useful fractional-midpoint geometry of asymmetric W2. It preserves the favorable no-zero level placement without storing per-channel zero-points or performing zero-point subtraction during dequantization.
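The two identities in Eqs. (14) and (16) can be checked numerically. In the sketch below the scale value `S` is an arbitrary placeholder chosen for illustration; it does not come from the paper.

```python
import math

S = 0.8  # hypothetical per-channel scale, chosen only for illustration

# Eq. (16): asymmetric W2 dequantization s * (q - z) with the fractional midpoint z = 1.5.
asym_levels = [S * (q - 1.5) for q in range(4)]

# The MNZ level set scaled by the same per-channel scale.
mnz_levels = [S * g for g in (-1.5, -0.5, 0.5, 1.5)]

# Eq. (14): odd integer levels with the half-scale folded into the channel scale.
odd_levels = [(S / 2) * k for k in (-3, -1, 1, 3)]

same_as_asym = all(math.isclose(a, m) for a, m in zip(asym_levels, mnz_levels))
same_as_odd = all(math.isclose(o, m) for o, m in zip(odd_levels, mnz_levels))
print(same_as_asym, same_as_odd)  # True True
```

All three lists describe the same four reconstruction values, so no zero-point needs to be stored to obtain the midpoint geometry.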

To further reduce arithmetic cost, we also study the power\-of\-two no\-zero grid

$$\mathcal{G}_{\mathrm{pot}}=\{\pm 1,\pm 4\}. \tag{17}$$

### 3.4 Inner/Outer Ratio as a Design Knob

Beyond removing the zero centroid, the main scale\-invariant degree of freedom is the inner/outer centroid ratio\. Normalizing the outer magnitude to one, the mirror no\-zero family is

$$\mathcal{G}(r)=\{-1,-r,+r,+1\},\quad 0 < r < 1. \tag{18}$$

The grids studied here correspond to different ratios: Qift-MNZ to $r=1/3$, Qift-PoT-MNZ to $r=1/4$, the Lloyd-Max and NF2 references to $r\approx 0.30$ and $r\approx 0.25$, and the negative diagnostic Far-MNZ to $r=1/2$. A scale-invariant reconstruction analysis on real rotated weights (§[4.5.3](https://arxiv.org/html/2606.02823#S4.SS5.SSS3)) shows that effective grids cluster in the band $r\approx 0.25$–$0.33$, which contains MNZ, PoT-MNZ, Lloyd-Max, and NF2 but excludes Far-MNZ. We therefore treat $r$ as the primary geometric design knob and choose grids inside this band.
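The ratios quoted above follow directly from the level sets in Eqs. (7) to (12). The sketch below recomputes them; the helper names are hypothetical, and the hard band edges of $0.25$ and $1/3$ (with a small tolerance) are a simplification of the approximate band stated in the text.

```python
GRIDS = {
    "SYM-INT": (-2.0, -1.0, 0.0, 1.0),
    "MNZ": (-1.5, -0.5, 0.5, 1.5),
    "Lloyd": (-1.5104, -0.4528, 0.4528, 1.5104),
    "NF2": (-1.0, -0.2525685, 0.2525685, 1.0),
    "PoT-MNZ": (-4.0, -1.0, 1.0, 4.0),
    "Far-MNZ": (-2.0, -1.0, 1.0, 2.0),
}


def inner_outer_ratio(levels):
    """Return r = inner/outer magnitude, or None if the set is not mirror no-zero."""
    if 0.0 in levels or sorted(levels) != sorted(-v for v in levels):
        return None
    inner, outer = sorted({abs(v) for v in levels})
    return inner / outer


def in_band(r, lo=0.25, hi=1 / 3, tol=1e-9):
    """Illustrative membership test for the approximate 0.25 to 0.33 band."""
    return r is not None and lo - tol <= r <= hi + tol


for name, levels in GRIDS.items():
    r = inner_outer_ratio(levels)
    label = "n/a" if r is None else f"{r:.3f}"
    print(f"{name:8s} r={label:6s} in_band={in_band(r)}")
```

SYM-INT has no ratio under this definition because it is neither mirrored nor zero-free, which is why it sits outside the $\mathcal{G}(r)$ family altogether.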

### 3.5 Quantization Objective and Integration with GPTQ/GPTAQ

Since Qift is a post\-training grid replacement, it introduces no trainable objective\. The only local objective is the standard per\-output\-channel scale selection that minimizes nearest\-level reconstruction error for the chosen grid, identical to the scale search used by the baseline quantizer\. When GPTQ or GPTAQ is enabled, Qift uses the same Hessian\-based or asymmetric\-calibration compensation as the baseline pipeline; only the fixed reconstruction level set changes, and no additional parameters, learned centroids, or zero\-points are introduced\.

Operationally, the standard W2 grid is replaced by a fixed no\-zero grid while the surrounding rotated PTQ pipeline is unchanged\. For each output channel, the quantizer selects an MSE\-optimal scale for the chosen grid, maps weights to the nearest scaled level, and applies the same GPTQ or GPTAQ compensation as the baseline pipeline\.
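The per-channel step described above can be sketched as RTN-style nearest-level rounding combined with a simple scale search. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the linear search over clipping fractions, the step count, and the synthetic Gaussian channel are all hypothetical, and GPTQ/GPTAQ compensation is omitted.

```python
import random

SYM_INT = (-2.0, -1.0, 0.0, 1.0)
MNZ = (-1.5, -0.5, 0.5, 1.5)


def quantize_channel(weights, grid, scale):
    """Round each weight to the nearest scaled level (RTN-style)."""
    return [scale * min(grid, key=lambda g: abs(w - scale * g)) for w in weights]


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def search_scale(weights, grid, num_steps=60):
    """Grid-search a per-channel scale that minimizes reconstruction MSE."""
    max_abs = max(abs(w) for w in weights)
    outer = max(abs(g) for g in grid)
    best_scale, best_err = None, float("inf")
    for k in range(1, num_steps + 1):
        scale = (k / num_steps) * max_abs / outer  # shrink the clipping range
        err = mse(weights, quantize_channel(weights, grid, scale))
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale, best_err


rng = random.Random(0)
# Synthetic stand-in for one rotated output channel, NOT real model weights.
channel = [rng.gauss(0.0, 1.0) for _ in range(2000)]
for name, grid in (("SYM-INT", SYM_INT), ("MNZ", MNZ)):
    scale, err = search_scale(channel, grid)
    print(f"{name}: scale={scale:.3f} mse={err:.4f}")
```

Only the `grid` argument differs between the two calls, which mirrors the claim that the level set is swapped while the scale search stays the same. On a synthetic Gaussian channel the mirror no-zero grid would be expected to reach the lower error, but this toy comparison is not a reproduction of the paper's measurements.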

### 3.6 Hardware-Friendly Decoded Levels

The proposed level sets also have a simple decoded arithmetic structure. Qift-MNZ can be represented as odd integer levels $\{\pm 1,\pm 3\}$ with a half-scale reparameterization, while Qift-PoT-MNZ uses power-of-two levels $\{\pm 1,\pm 4\}$. Thus, after decoding, the weight values are small fixed signed integers rather than learned or irregular lookup-table values. This regular integer structure makes the level sets hardware-friendly.
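One way to picture the decoded arithmetic is with plain integer operations. The sketch below assumes a specific code layout (high bit as sign, low bit as magnitude selector); the text leaves the code ordering to the implementation, so this layout and the function names are hypothetical.

```python
def decode_mnz(code: int) -> int:
    """Map a 2-bit code to an odd integer level in {-3, -1, +1, +3}."""
    sign_bit, mag_bit = (code >> 1) & 1, code & 1
    magnitude = (mag_bit << 1) | 1  # 1 when mag_bit is 0, 3 when it is 1
    return -magnitude if sign_bit else magnitude


def apply_pot(activation: int, code: int) -> int:
    """Multiply an integer activation by a PoT-MNZ level using sign and shift only."""
    sign_bit, mag_bit = (code >> 1) & 1, code & 1
    shifted = activation << (mag_bit << 1)  # shift by 0 (times 1) or 2 (times 4)
    return -shifted if sign_bit else shifted


def decode_pot(code: int) -> int:
    """Map a 2-bit code to a level in {-4, -1, +1, +4}."""
    return apply_pot(1, code)


print(sorted(decode_mnz(c) for c in range(4)))  # [-3, -1, 1, 3]
print(sorted(decode_pot(c) for c in range(4)))  # [-4, -1, 1, 4]
```

In this layout the PoT-MNZ product needs no multiplier at all, while the MNZ product needs at most one add (three times a value is the value plus its doubled copy); the per-channel scale is applied afterwards in both cases.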

## 4Experiments

### 4\.1Experimental Setup

We evaluate LLaMA-2-7B and LLaMA-3.1-8B. The main stress test is pure W2A4, where all linear layers use two-bit weights and four-bit activations. We also evaluate the L-layer W2/W4 mixed-precision heuristic described in the mixed-L section, especially the $L=16$ iso-bit point where the nominal average weight bit-width is three bits. We report WikiText-2 perplexity and downstream accuracy over ARC-Challenge, ARC-Easy, HellaSwag, PIQA, and WinoGrande.

##### Quantization configuration\.

Unless otherwise stated, all experiments use a QuaRot-style Hadamard-rotated pipeline with weight, activation, and KV-cache quantization enabled. For the W2A4/KV4 setting, weights are quantized to two bits while activations and the KV-cache use four bits. Weights use per-output-channel scaling without grouping, and the per-channel scale is selected by a clipping-based search. The conventional asymmetric baseline additionally enables asymmetric weight quantization with a stored per-channel zero-point, whereas Qift level sets change only the fixed two-bit code-to-level mapping and remain zero-point-free. Activations use symmetric four-bit quantization with a clipping ratio of $0.9$, and the KV-cache uses asymmetric key and value quantization with clipping ratios of $0.95$. We use the default Hadamard rotation and do not learn or sample random rotation matrices. GPTQ uses a small Hessian damping factor ($0.01$), and GPTAQ uses the same quantization configuration and calibration set with activation-aware asymmetric correction additionally enabled.

##### Calibration, evaluation, and software\.

We use 128 WikiText\-2 calibration samples with sequence length 2048 and sampling seed 0\. Perplexity is evaluated on WikiText\-2, and downstream accuracy is evaluated on ARC\-Challenge, ARC\-Easy, HellaSwag, PIQA, and WinoGrande using the same task suite across grid variants\. All reported results are single\-seed unless otherwise specified\. FP16 reference rows use W16A16/KV16 without weight, activation, or KV\-cache quantization; they provide the full\-precision upper\-bound reference for the same WikiText\-2 and downstream task suite\. Experiments are run in the project Docker environment with Python 3\.10\.13, PyTorch 2\.2\.1, CUDA 12\.1, and NVIDIA RTX A6000 48GB GPUs\.

### 4\.2Source\-Model Validation

Before evaluating the proposed grids, we empirically verify the source assumption adopted in §[3\.2](https://arxiv.org/html/2606.02823#S3.SS2): that Hadamard\-rotated weights are approximately zero\-centered and become more Gaussian\-like in standardized shape\. We use the term Gaussian\-like operationally—a distribution is more Gaussian\-like if its standardized shape has lower absolute skewness, lower absolute excess kurtosis, and lower Q–Q error against a standard normal reference—rather than as a claim that the weights exactly follow a normal law\.

For each output channel $c$ with weights $W_c$, define

$$r_c=\frac{|\mu_c|}{\sigma_c},\qquad \mu_c=\mathbb{E}[W_c],\quad \sigma_c=\sqrt{\mathrm{Var}(W_c)}. \tag{19}$$

This measures centeredness. To measure shape, we standardize weights as $z=(w-\mu)/\sigma$ and compute:

$$\text{excess kurtosis}=\mathbb{E}[z^{4}]-3, \tag{20}$$

$$\text{skewness}=\mathbb{E}[z^{3}], \tag{21}$$

$$\text{Q–Q error}=\frac{1}{m}\sum_{i=1}^{m}\left(q_i^{\mathrm{emp}}-q_i^{\mathcal{N}(0,1)}\right)^{2}. \tag{22}$$

For a standard Gaussian, excess kurtosis and skewness are zero, and the Q–Q error is low. We use these metrics as diagnostics of Gaussian-likeness, not as a claim that the empirical distribution exactly follows a normal law.
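As a rough illustration of Eqs. (19)–(22), the sketch below computes the four diagnostics for one list of weights. The function name `gaussianity_metrics` is hypothetical, the Q–Q error uses one quantile per sample with plotting positions $(i+0.5)/n$ (an assumption, since the choice of $m$ and of quantile positions is not stated here), and the two synthetic samples merely stand in for real weight channels.

```python
import math
import random
from statistics import NormalDist


def gaussianity_metrics(weights):
    """Centeredness, excess kurtosis, skewness and Q-Q error for one channel."""
    n = len(weights)
    mu = sum(weights) / n
    sigma = math.sqrt(sum((w - mu) ** 2 for w in weights) / n)
    # Standardize, then sort so z[i] is the i-th empirical quantile.
    z = sorted((w - mu) / sigma for w in weights)
    normal = NormalDist()
    # Assumed plotting positions (i + 0.5) / n for the reference quantiles.
    qq = sum((z[i] - normal.inv_cdf((i + 0.5) / n)) ** 2 for i in range(n)) / n
    return {
        "centeredness": abs(mu) / sigma,
        "excess_kurtosis": sum(v ** 4 for v in z) / n - 3.0,
        "skewness": sum(v ** 3 for v in z) / n,
        "qq_error": qq,
    }


random.seed(0)
# Synthetic stand-ins: a Gaussian sample and a heavier-tailed Laplace sample.
gauss_like = [random.gauss(0.0, 1.0) for _ in range(5000)]
heavy_tailed = [random.choice((-1.0, 1.0)) * random.expovariate(1.0)
                for _ in range(5000)]
for name, sample in (("gaussian", gauss_like), ("laplace", heavy_tailed)):
    print(name, gaussianity_metrics(sample))
```

The heavier-tailed sample should show clearly larger excess kurtosis and Q–Q error, which is the direction of change the diagnostics are meant to detect before and after rotation.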

Figure[3](https://arxiv.org/html/2606.02823#S4.F3)and Table[3](https://arxiv.org/html/2606.02823#S4.T3)show model\-wide pre/post statistics across 224 linear modules\. The centeredness metric remains nearly unchanged, confirming that Hadamard rotation does not create zero\-centeredness\. In contrast, excess kurtosis, skewness, and Q–Q error drop sharply, showing that rotation primarily makes standardized shape more Gaussian\-like\.

![Refer to caption](https://arxiv.org/html/2606.02823v1/figures/fig_gaussian_summary_pre_post.png)

Figure 3: Hadamard rotation improves kurtosis and Q–Q shape metrics by orders of magnitude (skewness more modestly) while leaving mean-centeredness essentially unchanged.

Table 3: Model-wide Gaussianity summary across 224 linear modules. Lower is better for all metrics shown.

The improvement is not caused by a small number of selected modules: across modules the post-rotation shape metrics improve for the large majority, and the improvement is consistent across all 32 layers.

### 4\.3Main Performance Results

We first establish the failure of the standard grid under pure W2A4 and then show that source\-aware no\-zero grids recover most of the lost accuracy\. Table[4](https://arxiv.org/html/2606.02823#S4.T4)summarizes the headline comparison across both models and the three main operating points; the remainder of this section reports the full per\-grid and per\-task breakdowns that support it\.

Table 4: Headline summary of Qift versus baselines under rotated KV4 inference. “Down. Avg” is the mean zero-shot accuracy over ARC-C, ARC-E, HellaSwag, PIQA, and WinoGrande. *Pure W2A4* keeps all linear layers at two-bit weights; *$L=16$* upgrades the 16 most sensitive layers to W4A4, giving $b_{\mathrm{avg}}=3$, iso-bit with uniform W3A4. Lower PPL and higher accuracy are better. Detailed per-grid and per-task results appear in the tables that follow.

Under pure quantization, where all linear layers are quantized to W2A4 with KV4 and GPTQ, the degradation of the standard SYM-INT W2 grid is severe: LLaMA-2-7B degrades from an FP16 reference of 5.471 to 53.849 PPL, while LLaMA-3.1-8B collapses completely (FP16 6.277 $\rightarrow$ 3005.556 PPL).

Tables [5](https://arxiv.org/html/2606.02823#S4.T5) and [6](https://arxiv.org/html/2606.02823#S4.T6) summarize pure W2A4 results. Figures [4](https://arxiv.org/html/2606.02823#S4.F4) and [5](https://arxiv.org/html/2606.02823#S4.F5) visualize the KV4 matrix. We read these results in three steps: asymmetric W2 improves over the standard SYM-INT baseline, confirming that scalar level-set geometry matters; Gaussian-aware no-zero grids further improve over asymmetric W2, showing that source-aware centroid placement is more effective than generic zero-point correction; and PoT-MNZ remains competitive while using power-of-two reconstruction levels that map naturally to sign-and-shift weight application. Lloyd-Max is often the strongest scalar reference, as expected from the Gaussian source model, while Qift-MNZ and Qift-PoT-MNZ remain close with simpler integer or power-of-two decoded levels. These improvements are obtained without changing the calibration procedure, learning a codebook, introducing per-group grids, or storing zero-points; the only grid-level change is the fixed W2 code-to-level mapping.

Table 5: Pure W2A4 LLaMA-2-7B corrected matrix. Lower PPL is better. Qift-MNZ and Qift-PoT-MNZ are the proposed grids; Lloyd-Max and NF2 are scalar reference grids. Bold marks the best non-reference result in each column.

![Refer to caption](https://arxiv.org/html/2606.02823v1/figures/fig_pure_w2a4_llama2_bar.png)

Figure 4: Pure W2A4 KV4 results on LLaMA-2-7B. No-zero grids substantially outperform standard W2 and W-ASYM.

Table 6: Pure W2A4 LLaMA-3.1-8B corrected matrix. Lower PPL is better. Qift-MNZ and Qift-PoT-MNZ are the proposed grids; Lloyd-Max and NF2 are scalar reference grids.

![Refer to caption](https://arxiv.org/html/2606.02823v1/figures/fig_pure_w2a4_llama31_log_bar.png)

Figure 5: Pure W2A4 KV4 results on LLaMA-3.1-8B. The standard SYM-INT grid collapses, while no-zero grids recover the model to a usable range.

Table [7](https://arxiv.org/html/2606.02823#S4.T7) shows pure W2A4 downstream results on LLaMA-2-7B under KV4. No-zero grids improve substantially over both standard SYM-INT W2 and asymmetric W2.

Table 7: Pure W2A4 KV4 downstream accuracy on LLaMA-2-7B. Qift-MNZ and Qift-PoT-MNZ are the proposed grids; Lloyd-Max and NF2 are scalar reference grids. Higher is better. Bold marks the best non-reference result.

Table [8](https://arxiv.org/html/2606.02823#S4.T8) reports the same downstream task suite on LLaMA-3.1-8B under KV4. The trend from LLaMA-2 transfers to the newer model: Gaussian-aware W2 grids improve both pure W2A4 GPTAQ and L=16 GPTAQ. PoT-MNZ is the best average point in both LLaMA-3.1 W2 groups, supporting the ratio-sensitivity argument that its $r=0.25$ geometry is not merely hardware-convenient.

Table 8: LLaMA-3.1-8B KV4 downstream accuracy on the evaluation task subset. Qift-MNZ and Qift-PoT-MNZ are the proposed grids; Lloyd-Max is a scalar reference grid. Higher is better. Bold marks the best non-reference result within each comparison block.

On pure W2A4, PoT-MNZ improves LLaMA-3.1 average accuracy from 0.3683 to 0.4140. In L=16 mixed precision, PoT-MNZ improves from 0.5431 to 0.5619 and remains close to the W3A4 reference at 0.5972. These results strengthen the cross-model claim: the grid redesign is not specific to LLaMA-2, and the hardware-friendly PoT-MNZ variant remains competitive under downstream metrics.

### 4\.4Constraint\-Specific Comparison: $L=16$ vs\. W3A4

Before reporting mixed\-L results, we describe the simple layer\-selection heuristic used in this experiment\. The mixed\-L setting is a deployment\-oriented stress test for the proposed W2 level\-set redesign, not a general mixed\-precision allocation algorithm\. The goal is to ask whether W2A4 can retain much of its storage advantage while protecting a small number of highly sensitive layers\.

Starting from a pure W2A4 baseline, we independently upgrade one transformer layer at a time to W4A4 while keeping all other layers at W2A4. Let $P_0$ denote the perplexity of the pure W2A4 baseline and $P_\ell$ denote the perplexity after upgrading only layer $\ell$ to W4A4. We define the single-layer gain as

$$\Delta_\ell=P_0-P_\ell. \tag{23}$$

Layers are sorted by $\Delta_\ell$ in descending order, and the top-$L$ layers are upgraded to W4A4 in the final mixed-precision model; all remaining layers stay at W2A4.
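The ranking step in Eq. (23) amounts to a sort on single-layer gains. The sketch below is illustrative only: `select_layers_to_upgrade` is a hypothetical helper, and the perplexity values in the usage example are made-up toy numbers, not results from the paper.

```python
def select_layers_to_upgrade(base_ppl, single_layer_ppl, num_upgrades):
    """Rank layers by gain = base_ppl - ppl_after_upgrading_that_layer.

    single_layer_ppl maps layer index -> perplexity measured with only that
    layer at W4A4. Returns the indices of the top-`num_upgrades` layers.
    """
    gains = {layer: base_ppl - ppl for layer, ppl in single_layer_ppl.items()}
    ranked = sorted(gains, key=gains.get, reverse=True)  # largest gain first
    return sorted(ranked[:num_upgrades])


# Toy numbers for a 4-layer model (hypothetical, for illustration only).
toy_single_layer_ppl = {0: 30.0, 1: 49.0, 2: 45.0, 3: 20.0}
print(select_layers_to_upgrade(50.0, toy_single_layer_ppl, num_upgrades=2))
```

Because each gain is measured with a single layer upgraded, the ranking ignores interactions between upgraded layers.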

For a 32-layer model, upgrading one full layer from W2 to W4 increases the average weight bit-width by approximately $2/32$. Thus the nominal average weight bit-width is

$$b_{\mathrm{avg}}=2+\frac{2L}{32}=2+\frac{L}{16}. \tag{24}$$

At $L=16$, this gives $b_{\mathrm{avg}}=3$, matching the nominal average weight budget of uniform W3A4.
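Eq. (24) can be checked numerically with a small helper. This is a sketch under the same nominal assumption as the text, namely that every transformer layer holds the same number of weights; the function name is hypothetical.

```python
def average_weight_bits(num_layers, num_upgraded, low_bits=2, high_bits=4):
    """Nominal average weight bit-width when num_upgraded layers use high_bits.

    Assumes equal weight counts per layer, as in the nominal formula.
    """
    low_layers = num_layers - num_upgraded
    return (low_bits * low_layers + high_bits * num_upgraded) / num_layers


for num_upgraded in (0, 4, 8, 12, 16):
    print(num_upgraded, average_weight_bits(32, num_upgraded))
```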

This layer\-wise mixed\-precision setting also keeps the quantization layout regular\. Sub\-tensor mixed precision can protect outlier channels or elements, but then a single matrix multiplication may contain multiple weight precisions\. Our policy is deliberately layer\-wise: selected sensitive layers are assigned W4A4, while the remaining layers stay at uniform W2A4\. Thus every quantized matrix multiplication keeps a single weight precision\.

This heuristic is intentionally simple\. It ignores joint interactions between upgraded layers and is not claimed to be optimal\. We use it only to construct a transparent W2/W4 Pareto curve and to evaluate whether the proposed no\-zero W2 grids remain useful when only part of the model stays at two\-bit precision\.

Figure[6](https://arxiv.org/html/2606.02823#S4.F6)shows the W2/W4 mixed\-precision Pareto curve obtained from the single\-layer sensitivity ranking\. Most of the recovery occurs in the first few selected layers: L=4 already reduces perplexity from 53\.849 to 10\.739, and the curve has a knee around L=4–6\. This indicates that standard W2A4 failure is concentrated in a small number of functionally sensitive layers\.

![Refer to caption](https://arxiv.org/html/2606.02823v1/figures/fig_k_mixed_pareto.png)

Figure 6: Simple W2/W4 mixed-precision Pareto curve on LLaMA-2-7B. Starting from pure W2A4, layers are ranked by the perplexity gain from independently upgrading each layer to W4A4, and then the top-$L$ layers are upgraded. The average weight bit-width is $2+L/16$. At $L=16$, the model reaches the same nominal average weight bit-width as W3A4. The purple marker shows the iso-method L=16 MNZ point at 7.318 PPL under the same GPTQ calibration as the standard-grid curve. Under iso-method comparison, the level-set redesign alone (SYM-INT $\to$ MNZ) closes about 55% of the W3A4 gap under GPTQ (7.825 $\to$ 7.318 against 6.897) and about 49% under GPTAQ (7.719 $\to$ 7.122 against 6.509); in both cases L=16 approaches but does not surpass W3A4.

Tables [9](https://arxiv.org/html/2606.02823#S4.T9) and [10](https://arxiv.org/html/2606.02823#S4.T10) report the corresponding mixed-L WikiText-2 perplexity matrix under KV4 GPTQ. After rerunning corrected NF2 for L=4/8/12, the LLaMA-2 table is complete for all six grids: standard SYM-INT W2, conventional asymmetric W2, MNZ, PoT-MNZ, Lloyd, and corrected NF2. The LLaMA-3.1 table is also complete for all six grids. For LLaMA-3.1, we intentionally reuse the same static L-layer ranking from the LLaMA-2 sweep rather than performing a model-specific L-search; this treats mixed-L selection as a fixed deployment heuristic and tests whether the W2 grid trend transfers across models.

Table 9: LLaMA-2-7B mixed-L KV4 GPTQ WikiText-2 perplexity. Qift-MNZ and Qift-PoT-MNZ are the proposed grids; Lloyd-Max and NF2 are scalar reference grids. Lower is better.

Table 10: LLaMA-3.1-8B mixed-L KV4 GPTQ WikiText-2 perplexity using the same static L-layer ranking as the LLaMA-2 sweep. Qift-MNZ and Qift-PoT-MNZ are the proposed grids; Lloyd-Max and NF2 are scalar reference grids. Lower is better.

The mixed-L results show that level-set redesign remains useful beyond pure W2A4. On LLaMA-2, MNZ, PoT-MNZ, Lloyd, and corrected NF2 form a tight cluster for L=4–16, with Lloyd slightly best at L=4/16 and MNZ best at L=8/12. Corrected NF2 is competitive but not the best LLaMA-2 mixed-L grid. On LLaMA-3.1, the same trend transfers: no-zero and Gaussian-aware grids consistently improve over the standard SYM-INT grid, and PoT-MNZ becomes the best L=16 point despite being the hardware-friendly power-of-two variant.

Table [11](https://arxiv.org/html/2606.02823#S4.T11) further evaluates pure W2A4 and mixed-L under GPTAQ, including the conventional asymmetric W2 baseline. L=0 denotes the pure W2A4 setting under GPTAQ. This should not be confused with the earlier pure W2A4 GPTQ collapse baseline: for example, LLaMA-2-7B pure W2A4 symmetric quantization is 53.849 PPL with GPTQ but 12.118 PPL with GPTAQ in this table. The main conclusion is not that GPTAQ is universally beneficial. Rather, the interaction is model-dependent: on LLaMA-2-7B, GPTAQ lowers the L=16 endpoints for all shown grids relative to the GPTQ table, while on LLaMA-3.1-8B the same GPTAQ sweep worsens the L=16 endpoints. In both models, however, the no-zero/Gaussian-aware grids remain consistently better than the standard SYM-INT grid and the conventional asymmetric baseline across all L. This separates the robust level-set-geometry effect from the less stable GPTAQ interaction.

Table 11: Pure W2A4 and mixed-L KV4 GPTAQ WikiText-2 perplexity on LLaMA-2-7B and LLaMA-3.1-8B. Qift-MNZ and Qift-PoT-MNZ are the proposed grids; Lloyd-Max is a scalar reference grid. Lower is better. L=0 denotes pure W2A4 with no W4A4 protected layers, evaluated with GPTAQ rather than GPTQ.

Under WikiText-2 perplexity, L=16 mixed precision approaches but does not fully surpass the uniform W3A4 reference under the strict KV4 iso-bit comparison. To isolate the level-set effect from the calibration-algorithm effect, we compare each grid against the W3A4 reference under the *same* calibration method. Under GPTQ, the standard SYM-INT L=16 point is 7.825 PPL versus a W3A4 GPTQ reference of 6.897 PPL, a 0.928 PPL gap; replacing SYM-INT with MNZ lowers the L=16 point to 7.318 PPL (Table [9](https://arxiv.org/html/2606.02823#S4.T9)), closing 0.507 PPL or about 55% of the gap and leaving a residual of 0.421 PPL. Under GPTAQ, the standard SYM-INT L=16 point is 7.719 PPL versus a W3A4 GPTAQ reference of 6.509 PPL, a 1.210 PPL gap; replacing SYM-INT with MNZ lowers the L=16 point to 7.122 PPL (Table [11](https://arxiv.org/html/2606.02823#S4.T11)), closing 0.597 PPL or about 49% of the gap and leaving a residual of 0.613 PPL. Both iso-method comparisons agree that the level-set redesign alone closes roughly half of the W3A4 gap, while the remaining residual reflects the genuine accuracy cost of keeping half of the transformer layers at two-bit precision rather than using uniform three-bit weights. This mirrors the tail-versus-bulk framing in the introduction: spending a full extra bit of average weight budget to protect sensitive layers still does not reach uniform W3A4, whereas redesigning the four W2 reconstruction levels, at no additional bit cost, recovers roughly half of the same gap. The corresponding mixed-L W-ASYM point is 7.572, confirming that asymmetric W2 remains a useful diagnostic baseline but not the best mixed-L solution. The qualitative conclusion is stable: the simple L-layer heuristic remains a minor deployment-oriented extension, while the main algorithmic contribution is the fixed no-zero W2 grid redesign.
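The gap-closure percentages quoted above follow from simple arithmetic on the reported perplexities. The helper below reproduces them; the function name `gap_closure` is an illustrative choice, while the input values are the L=16 and W3A4 perplexities stated in the text.

```python
def gap_closure(baseline_ppl, improved_ppl, reference_ppl):
    """Return (fraction of the gap closed, residual gap to the reference)."""
    gap = baseline_ppl - reference_ppl       # SYM-INT L=16 minus W3A4
    closed = baseline_ppl - improved_ppl     # SYM-INT L=16 minus MNZ L=16
    return closed / gap, improved_ppl - reference_ppl


# Perplexities quoted in the text for LLaMA-2-7B at L=16.
gptq_fraction, gptq_residual = gap_closure(7.825, 7.318, 6.897)
gptaq_fraction, gptaq_residual = gap_closure(7.719, 7.122, 6.509)
print(f"GPTQ:  closed {gptq_fraction:.1%}, residual {gptq_residual:.3f} PPL")
print(f"GPTAQ: closed {gptaq_fraction:.1%}, residual {gptaq_residual:.3f} PPL")
```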

Table[12](https://arxiv.org/html/2606.02823#S4.T12)shows downstream accuracy for L=16 and W3A4 references on LLaMA\-2\-7B\.

Table 12: LLaMA-2-7B KV4 downstream accuracy for W3A4 and L=16 W2A4. Qift-MNZ and Qift-PoT-MNZ are the proposed grids; Lloyd-Max and NF2 are scalar reference grids. Values are averages over five tasks unless otherwise noted. Higher is better. Bold marks the best non-reference result within each comparison block.

Although L=16 remains slightly worse than W3A4 in WikiText-2 perplexity under the strict KV4 setting, downstream accuracy shows a much smaller gap. L16 MNZ GPTAQ reaches 0.6157 average accuracy, compared with 0.6179 for W3A4 GPTAQ and 0.6200 for W3A4 GPTQ. Under GPTQ, L16 Lloyd reaches 0.6141, only 0.0059 below W3A4 GPTQ. These results suggest that the simple L-layer heuristic can approach the W3A4 reference on LLaMA-2-7B, while on LLaMA-3.1-8B it still trails W3A4 but substantially improves over the pure W2A4 and standard W2 baselines.

### 4\.5Ablation Study

The RTN bucket diagnostic, ratio sensitivity scan, GPTQ residual analysis, and final PPL/accuracy experiments serve different roles\. The bucket diagnostic measures the intrinsic per\-output\-channel reconstruction behavior of each grid on post\-rotation weights\. The ratio scan isolates the scale\-invariant inner/outer centroid geometry after per\-channel normalization using pooled samples and a global scale search\. The GPTQ residual analysis checks whether the reconstruction advantage transfers after Hessian\-based compensation\. Finally, perplexity and downstream accuracy measure the full W2A4/KV4 pipeline\. This separation is intentional: RTN diagnostics expose the level\-set geometry directly, while GPTQ residuals and final task metrics confirm that the grid advantage survives the full quantization pipeline\.

#### 4\.5\.1RTN Bucket Reconstruction Diagnostic

For each weight sample $w$, we quantize it to a grid level, dequantize it back to $\hat{w}$, and compute squared reconstruction error $(w-\hat{w})^{2}$. We then aggregate two quantities by assigned bucket:

$$\mathrm{count\%}(g)=\frac{\#\{w: q(w)=g\}}{\#\{w\}}, \tag{25}$$

$$\mathrm{err\%}(g)=\frac{\sum_{w: q(w)=g}(w-\hat{w})^{2}}{\sum_{w}(w-\hat{w})^{2}}. \tag{26}$$

This diagnostic is computed on real Hadamard-rotated LLaMA-2-7B weights across all 224 linear modules. It is not a Gaussian simulation and not a downstream metric; it isolates the scalar grid’s reconstruction behavior.

This bucket diagnostic uses RTN\-style nearest\-level quantization rather than GPTQ\. For each output channel, we select an MSE\-optimized scalar scale by sweeping candidate clipping ratios and then assign each weight to the nearest reconstruction level\. Thus, the diagnostic uses per\-output\-channel scaling, without group\-wise quantization, asymmetric zero\-points, or Hessian\-based GPTQ compensation\. The resulting squared error measures the intrinsic reconstruction behavior of the scalar grid on post\-rotation weights, rather than the final GPTQ\-compensated task error\.
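A compact sketch of the bucket diagnostic for a single channel follows. It is an illustration under stated assumptions: the helper names are hypothetical, the scale sweep is a simple grid over clipping fractions, and the input is a synthetic Gaussian sample rather than the real rotated LLaMA-2-7B weights used in the paper.

```python
import random

SYM_INT = (-2.0, -1.0, 0.0, 1.0)
MNZ = (-1.5, -0.5, 0.5, 1.5)


def nearest(w, scale, levels):
    """RTN-style nearest-level assignment; returns the unscaled level."""
    return min(levels, key=lambda g: abs(w - scale * g))


def squared_error(weights, scale, levels):
    return sum((w - scale * nearest(w, scale, levels)) ** 2 for w in weights)


def bucket_diagnostic(weights, levels, num_candidates=48):
    """Return ({level: (count_share, error_share)}, total_squared_error)."""
    max_abs = max(abs(w) for w in weights)
    max_level = max(abs(g) for g in levels)
    candidates = [(k / num_candidates) * max_abs / max_level
                  for k in range(1, num_candidates + 1)]
    # MSE-optimized scale among the swept clipping candidates.
    scale = min(candidates, key=lambda s: squared_error(weights, s, levels))
    counts = {g: 0 for g in levels}
    errors = {g: 0.0 for g in levels}
    for w in weights:
        g = nearest(w, scale, levels)
        counts[g] += 1
        errors[g] += (w - scale * g) ** 2
    total = sum(errors.values())
    shares = {g: (counts[g] / len(weights),
                  errors[g] / total if total else 0.0) for g in levels}
    return shares, total


random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(2000)]  # synthetic stand-in
for name, levels in (("SYM-INT", SYM_INT), ("MNZ", MNZ)):
    shares, total = bucket_diagnostic(sample, levels)
    print(name, f"total={total:.2f}", shares)
```

The error shares are normalized within each grid, so comparing grids requires the returned total squared error as well.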

The asymmetric baseline provides the first indication that modifying reconstruction geometry matters. Bucket-level diagnostics reveal the structural problem in the standard SYM-INT grid: Figure [7](https://arxiv.org/html/2606.02823#S4.F7) shows that the standard W2 grid uses its four levels unevenly. The $-2$ level is rarely used, the zero bucket captures a large fraction of weights, and the $+1$ bucket carries most of the reconstruction error. Importantly, bucket-level error percentages are normalized within each grid and therefore describe where each grid spends its error, not whether the grid has lower total error. Table [13](https://arxiv.org/html/2606.02823#S4.T13) therefore also reports the aggregate squared reconstruction error over all 6.48B rotated LLaMA-2-7B weights. MNZ reduces total squared error from $1.2890\times 10^{5}$ to $1.0228\times 10^{5}$ ($-20.7\%$), and PoT-MNZ reduces it to $1.0363\times 10^{5}$ ($-19.6\%$), while also preserving the more balanced no-zero bucket allocation.

![Refer to caption](https://arxiv.org/html/2606.02823v1/figures/fig_B_bucket_dist.png)

Figure 7: Bucket count and reconstruction-error allocation for standard W2 and MNZ on real rotated LLaMA-2-7B weights. Standard W2 underuses $-2$, overuses zero, and concentrates error in the most positive bucket.

Table 13: Bucket diagnostic summary over real Hadamard-rotated LLaMA-2-7B weights using RTN-style per-output-channel quantization. Bucket error share is normalized within each grid; total squared error is the aggregate nearest-level reconstruction error over all weights. Lower total squared error is better.

#### 4\.5\.2Where No\-Zero Grids Win

The improvement of MNZ is not uniform across all magnitudes: a region-wise error decomposition shows that standard W2 can win very close to zero because of its zero centroid and Far-MNZ loses because its inner centroids sit too far from zero, while MNZ wins primarily in the middle high-density region where $\pm 0.5$ are closer than either zero or $\pm 1$. A zero-bucket counterfactual confirms that MNZ is not better on the weights standard W2 snaps to zero, so the total-error reduction in Table [13](https://arxiv.org/html/2606.02823#S4.T13) is a global centroid-allocation gain rather than a claim that MNZ wins on every local region.

#### 4\.5\.3Scale\-Invariant Ratio Sensitivity

A previous equal-spacing analysis over $\{-3a,-a,+a,+3a\}$ is useful as a fixed-normalization diagnostic, but if the production quantizer freely optimizes a scale $s$, then

$$s\{-3a,-a,+a,+3a\}=(sa)\{-3,-1,+1,+3\}, \tag{27}$$

so $a$ is absorbed into scale and is not a scale-invariant geometry parameter.

We therefore analyze the non\-uniform mirror no\-zero family

$$\mathcal{G}(r)=\{-1,-r,+r,+1\},\quad 0<r<1. \tag{28}$$

For each $r$, we search the best global scale $\beta$:

$$\mathrm{NMSE}(r)=\min_{\beta}\frac{\sum_{i}\left(x_i-Q_{\beta\mathcal{G}(r)}(x_i)\right)^{2}}{\sum_{i}x_i^{2}}. \tag{29}$$

The input samples are real rotated LLaMA-2-7B weights normalized by the standard W2 per-channel scale. After this per-channel normalization, samples from all modules and output channels are pooled. For each ratio $r$, the scan optimizes a single global scale $\beta$ on the pooled samples, rather than re-optimizing a separate scale for every output channel. Therefore, this experiment is a scale-invariant grid-shape diagnostic, not a full production-granularity quantization experiment.
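The best-scale NMSE in Eq. (29) can be approximated with a two-level sweep: an outer loop over ratios and an inner loop over candidate global scales. The sketch below assumes a synthetic Gaussian sample in place of the pooled, normalized weights, and a coarse linear grid for $\beta$; the function name and the grid resolution are illustrative choices.

```python
import random


def nmse_for_ratio(samples, r, num_scales=60):
    """Best-scale NMSE of the grid {-1, -r, +r, +1} on pooled samples."""
    grid = (-1.0, -r, r, 1.0)
    energy = sum(x * x for x in samples)
    max_abs = max(abs(x) for x in samples)
    best = float("inf")
    for k in range(1, num_scales + 1):
        beta = max_abs * k / num_scales  # candidate global scale
        # Nearest-level quantization error under the scaled grid.
        err = sum(min((x - beta * g) ** 2 for g in grid) for x in samples)
        best = min(best, err / energy)
    return best


random.seed(0)
pooled = [random.gauss(0.0, 1.0) for _ in range(2000)]  # synthetic stand-in
for r in (0.25, 0.30, 1 / 3, 0.50):
    print(f"r={r:.3f} NMSE={nmse_for_ratio(pooled, r):.4f}")
```

On a Gaussian-like source, ratios inside the 0.25–0.33 band should give noticeably lower NMSE than $r=0.5$, which is the qualitative pattern the ratio scan is designed to expose.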

Figure [8](https://arxiv.org/html/2606.02823#S4.F8) sweeps $r$ over the pooled samples and reports the best-scale NMSE for each ratio; reference grids are placed by fixing their corresponding $r$ under the same best-scale search, so the curve isolates the geometry of the four reconstruction levels from arbitrary scale choices. Table [14](https://arxiv.org/html/2606.02823#S4.T14) lists the exact inner/outer ratio and best-scale NMSE for each grid in this comparison.

![Refer to caption](https://arxiv.org/html/2606.02823v1/figures/fig_grid_ratio_sensitivity.png)

Figure 8: Scale-invariant ratio sensitivity over $\{-1,-r,+r,+1\}$ on 100K pooled samples from real rotated LLaMA-2-7B weights after standard W2 per-channel normalization. For each ratio $r$, a single global scale $\beta$ is optimized on the pooled samples before computing NMSE. Good practical grids cluster around $r\approx 0.25$–$0.33$, while Far-MNZ at $r=0.5$ places the inner centroids too far from zero.

Table 14: Ratio sensitivity reference grids. The discrete scan uses a step of $0.005$, so its best grid point ($r=0.3000$) can be marginally exceeded by a reference grid evaluated at its exact ratio (Lloyd-Max, $r=0.2998$); the two are NMSE-equivalent to within $10^{-6}$, and the best ratio is approximately $r=0.30$. Qift-MNZ and Qift-PoT-MNZ are the proposed deployable grids; Lloyd-Max and NF2 are scalar references; Far-MNZ is a negative diagnostic.

NF2 follows the NormalFloat motivation used by QLoRA \[[5](https://arxiv.org/html/2606.02823#bib.bib5)\], while the ratio scan shows that no-zero alone is insufficient: the inner centroids must not be placed too far from zero. Good practical grids have $r\approx 0.25$–$0.33$, explaining why MNZ, Lloyd, NF2, and PoT-MNZ perform well, while Far-MNZ is weaker.

#### 4\.5\.4GPTQ Residual Analysis

To understand why no-zero grids improve GPTQ results, we record the cumulative residual accumulated during GPTQ quantization. Table [15](https://arxiv.org/html/2606.02823#S4.T15) shows that the effective no-zero and scalar-reference grids all reduce the cumulative residual relative to standard SYM-INT W2. The residual ratio follows the rough PPL trend but does not strictly order the grids by final perplexity: MNZ has the lowest residual, whereas Lloyd-Max reaches a slightly lower perplexity.

Table 15: GPTQ residual accumulation ratio and PPL in an L=16 setting. Qift-MNZ and Qift-PoT-MNZ are the proposed grids; Lloyd-Max is a scalar reference grid and Far-MNZ is a negative diagnostic. Ratios are normalized by the standard SYM-INT grid.

This analysis does not prove that residual ratio uniquely determines perplexity, but it provides mechanism-level evidence consistent with the overall PPL trend.

### 4\.6Summary of Empirical Findings

Across pure W2A4 perplexity, mixed-L perplexity, downstream accuracy, and GPTQ residual analysis, the consistent trend is that no-zero and Gaussian-aware grids outperform the standard SYM-INT W2 grid. Conventional asymmetric W2 improves over the standard grid in many settings, confirming that scalar level-set geometry matters, but it remains worse than source-aware no-zero grids. Lloyd often acts as the strongest scalar reconstruction reference, MNZ provides a simple default no-zero grid, and PoT-MNZ provides the most hardware-oriented grid while remaining competitive. The L-layer mixed-precision study further shows that the level-set advantage persists under an iso-bit deployment constraint. Overall, the experiments support the central claim that the W2 scalar grid is a major design variable in rotated W2A4/KV4 inference.

## 5Conclusion

This work argues that W2A4 quantization is not only a bit-width problem but also a scalar level-set design problem. Treating the four reconstruction levels of a two-bit weight quantizer as a design choice rather than a default reveals that the standard $\{-2,-1,0,+1\}$ grid is geometrically mismatched to Hadamard-rotated weights, which are approximately zero-centered and Gaussian-like in standardized shape. Fixed no-zero grids, MNZ $\{\pm 0.5,\pm 1.5\}$ and PoT-MNZ $\{\pm 1,\pm 4\}$, match this source by placing two inner centroids around the dense middle region and two outer centroids in the tails, and consistently improve perplexity, downstream accuracy, and GPTQ residual behavior on LLaMA-2-7B and LLaMA-3.1-8B.

Qift is intentionally minimal: its design intervention is confined to the fixed two\-bit code\-to\-level mapping, leaving rotation, GPTQ/GPTAQ, activation and KV\-cache quantization, and mixed\-precision policy unchanged\. This makes it a quantizer\-level plug\-in component for rotated PTQ pipelines\. Because the decoded levels are small fixed signed integers, Qift is also hardware\-friendly: a simple, source\-aware step toward accuracy\-aware design for extreme low\-bit LLM quantization\. This modularity allows it to be combined with future improvements in calibration, rotation, low\-bit kernel design, and hardware\-aware mixed\-precision policies\.

## References

- \[1\] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. In *International Conference on Learning Representations*, 2023. [https://arxiv.org/abs/2210.17323](https://arxiv.org/abs/2210.17323).
- \[2\] Yuhang Li, Ruokai Yin, Donghyun Lee, Shiting Xiao, and Priyadarshini Panda. GPTAQ: Efficient Finetuning-Free Quantization for Asymmetric Calibration. In *International Conference on Machine Learning*, 2025. [https://arxiv.org/abs/2504.02692](https://arxiv.org/abs/2504.02692).
- \[3\] Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs. *arXiv preprint arXiv:2404.00456*, 2024. [https://arxiv.org/abs/2404.00456](https://arxiv.org/abs/2404.00456).
- \[4\] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. In *Proceedings of the 40th International Conference on Machine Learning*, volume 202 of *Proceedings of Machine Learning Research*, pages 38087–38099. PMLR, 2023. [https://proceedings.mlr.press/v202/xiao23c.html](https://proceedings.mlr.press/v202/xiao23c.html).
- \[5\] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. In *Advances in Neural Information Processing Systems*, volume 36, 2023. [https://papers.neurips.cc/paper_files/paper/2023/hash/1feb87871436031bdc0f2beaa62a049b-Abstract-Conference.html](https://papers.neurips.cc/paper_files/paper/2023/hash/1feb87871436031bdc0f2beaa62a049b-Abstract-Conference.html).
- \[6\] Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. SpinQuant: LLM Quantization with Learned Rotations. In *International Conference on Learning Representations*, 2025. [https://proceedings.iclr.cc/paper_files/paper/2025/hash/e5b1c0d4866f72393c522c8a00eed4eb-Abstract-Conference.html](https://proceedings.iclr.cc/paper_files/paper/2025/hash/e5b1c0d4866f72393c522c8a00eed4eb-Abstract-Conference.html).
- \[7\] Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, Xin Jiang, Wulong Liu, and Jun Yao. FlatQuant: Flatness Matters for LLM Quantization. *arXiv preprint arXiv:2410.09426*, 2024. [https://arxiv.org/abs/2410.09426](https://arxiv.org/abs/2410.09426).
- \[8\] Euntae Choi, Sumin Song, Woosang Lim, and Sungjoo Yoo. Rotate, Clip, and Partition: Towards W2A4KV4 Quantization by Integrating Rotation and Learnable Non-uniform Quantizer. In *Findings of the Association for Computational Linguistics: EMNLP 2025*, pages 7568–7590, 2025. [https://aclanthology.org/2025.findings-emnlp.400/](https://aclanthology.org/2025.findings-emnlp.400/).
- \[9\] Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks. In *International Conference on Machine Learning*, 2024. [https://arxiv.org/abs/2402.04396](https://arxiv.org/abs/2402.04396).
- \[10\] Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, and Dan Alistarh. Extreme Compression of Large Language Models via Additive Quantization. In *International Conference on Machine Learning*, 2024. [https://arxiv.org/abs/2401.06118](https://arxiv.org/abs/2401.06118).
- \[11\] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. In *Proceedings of Machine Learning and Systems (MLSys)*, 2024. [https://arxiv.org/abs/2306.00978](https://arxiv.org/abs/2306.00978).
- \[12\] Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models. In *International Conference on Learning Representations (ICLR)*, 2024. [https://arxiv.org/abs/2308.13137](https://arxiv.org/abs/2308.13137).
- \[13\]Tianyi Zhang and Anshumali Shrivastava\.LeanQuant: Accurate and Scalable Large Language Model Quantization with Loss\-Error\-Aware Grid\.In*International Conference on Learning Representations*, 2025\.[https://arxiv\.org/abs/2407\.10032](https://arxiv.org/abs/2407.10032)\.
- \[14\]S\. P\. Lloyd\.Least squares quantization in PCM\.*IEEE Transactions on Information Theory*, 28\(2\):129–137, 1982\.
- \[15\]J\. Max\.Quantizing for minimum distortion\.*IRE Transactions on Information Theory*, IT\-6\(1\):7–12, 1960\.doi:10\.1109/TIT\.1960\.1057548\.