# Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs
Source: [https://arxiv.org/html/2605.08894](https://arxiv.org/html/2605.08894)
Yuzhuang Xu¹, Xu Han²·∗, Yuxuan Li², Pengzhan Li¹, Wanxiang Che¹
¹Harbin Institute of Technology, Harbin, China; ²Tsinghua University, Beijing, China
{xyz, car}@ir.hit.edu.cn, han-xu@mail.tsinghua.edu.cn
###### Abstract
Large language models (LLMs) achieve strong performance but incur high deployment costs, motivating extremely low-bit but lossy quantization. Existing quantization algorithms mainly focus on improving the numerical accuracy of forward computation to eliminate performance degradation. In this paper, we show that extremely quantized LLMs suffer from systematic smoothness degradation beyond numerical precision loss. Through a smoothness proxy, we observe that such degradation becomes increasingly severe as the quantization bit-width decreases. Furthermore, based on sequence neighborhood modeling, we find that quantized models exhibit a rapid reduction of effective token candidates within the prediction neighborhood, which directly leads to a sparser decoding tree and degraded generation quality. To validate this, we introduce a simple smoothness-preserving principle in both post-training quantization and quantization-aware training, and demonstrate that preserving smoothness brings additional gains beyond numerical accuracy. The core goal of this paper is to highlight smoothness preservation as an important design consideration for future extreme quantization methods. Code is available at [https://github.com/xuyuzhuang11/FINE](https://github.com/xuyuzhuang11/FINE).
“When you reach the edge, hidden forces take over.”
## 1 Introduction
Large language models (LLMs) face substantial deployment costs, posing a major bottleneck for real-world adoption. To unlock model capability under constrained budgets, model quantization is widely used, replacing high-bit value representations with low-bit counterparts [[14](https://arxiv.org/html/2605.08894#bib.bib2), [23](https://arxiv.org/html/2605.08894#bib.bib3)]. Model quantization has now been pushed to extremely low bit-widths, such as 1-bit [[41](https://arxiv.org/html/2605.08894#bib.bib1), [21](https://arxiv.org/html/2605.08894#bib.bib5)] or even sub-1-bit [[10](https://arxiv.org/html/2605.08894#bib.bib6)]. Under such extreme bit-width compression, models often suffer from substantial performance degradation. Existing studies attribute this degradation primarily to numerical precision loss and consequently focus on preserving the numerical precision of forward computation as much as possible [[16](https://arxiv.org/html/2605.08894#bib.bib7), [21](https://arxiv.org/html/2605.08894#bib.bib5)]. A natural question then arises: is the collapse of model performance solely caused by numerical precision loss in extremely low-bit quantization?
Figure 1: Smoothness degradation in extreme quantization. Smoothness score distributions of GPTQ-quantized LLaMA-2-7B under different bit-widths. Higher scores indicate worse smoothness.

The answer may be “no”. We empirically observe that such extremely quantized LLMs also suffer from pronounced smoothness degradation beyond fitting precision, which may constitute an additional source of capability loss, as shown in Figure [1](https://arxiv.org/html/2605.08894#S1.F1). Prior studies in machine learning have long connected smoothness to generalization, robustness, and training stability [[2](https://arxiv.org/html/2605.08894#bib.bib9), [8](https://arxiv.org/html/2605.08894#bib.bib10), [24](https://arxiv.org/html/2605.08894#bib.bib8)]. Models with poor smoothness are known to exhibit high sensitivity to small perturbations, resulting in unstable outputs and amplified generalization errors. Extensive evidence from classical machine learning and vision tasks further suggests that smoothness degradation is often accompanied by reduced performance and reliability [[11](https://arxiv.org/html/2605.08894#bib.bib13), [19](https://arxiv.org/html/2605.08894#bib.bib14), [20](https://arxiv.org/html/2605.08894#bib.bib12), [5](https://arxiv.org/html/2605.08894#bib.bib11)]. Despite its well-established importance, however, smoothness remains largely underexplored in transformer-based LLMs, particularly under extreme quantization, motivating us to investigate its role in quantized LLM capability degradation.
Furthermore, by introducing a neighborhood model over text sequences together with reverse perplexity (rPPL), we theoretically show that the next-token probability distribution of quantized models collapses “more rapidly” than that of the original FP16 model. This phenomenon suggests that quantized models produce lower-quality next-token rankings at each decoding step, leaving a substantially narrower range of effective candidates for sampling. From a macroscopic perspective, this manifests as a significantly sparser effective decoding tree generated by quantized models. More importantly, we find that this neighborhood collapse effect becomes increasingly severe as the quantization bit-width decreases. Additional analysis suggests that a direct way to mitigate such neighborhood collapse is to preserve model smoothness during quantization as much as possible, which aligns closely with our empirical observations.
Motivated by our empirical and theoretical findings, we propose simple strategies to mitigate smoothness degradation in extremely quantized LLMs. For post-training quantization (PTQ), we argue that existing methods rely on incomplete optimization objectives that focus mainly on reconstruction error while overlooking smoothness preservation. We therefore introduce learnable gradient preservation (LGP) to explicitly maintain the original gradients during quantization. For quantization-aware training (QAT), we find that smoothness degradation mainly appears in the intermediate hidden-state gradients, and thus introduce a loss of gradient regularization (LGR) during training. Experiments on both PTQ and QAT show that smoothness preservation yields additional performance gains.
Rather than proposing an algorithm that competes with existing methods, we highlight smoothness preservation as a key design principle for extreme quantization. Our analysis suggests that in extremely low-bit settings, forward fitting and backward preservation are hard to optimize jointly, with the latter being more sensitive to bit-width reduction. Moreover, solution-space analysis shows that low-bit quantized weights capable of preserving both forward and backward behaviors do not disappear, but the space of such weights becomes increasingly narrow. Therefore, the objective of extreme quantization should no longer be to seek a lossless critical point in hidden-state reconstruction error or perplexity, but rather to achieve a principled trade-off between fitting precision and smoothness under limited bit-width budgets. Overall, this paper makes the following three contributions:
- **Discovery.** We establish a feasible smoothness proxy for transformer-based LLMs and reveal the smoothness degradation problem in extremely quantized LLMs. Moreover, through sequence neighborhood modeling and rPPL, we uncover the decoding-tree sparsification effect caused by smoothness degradation.
- **Validation.** We design simple yet effective methods, including LGP for PTQ and LGR for QAT, to verify that smoothness enhancement brings positive performance gains to prediction distributions under extreme quantization.
- **Guidance.** We identify limitations in existing quantization objectives and provide an in-depth analysis of the feasibility and necessity of incorporating smoothness into extreme quantization. Our findings offer practical guidance for future algorithm design.
## 2 Preliminary
### 2.1 Network Smoothness
Lipschitzness is the most relevant metric of smoothness in neural networks. Generally, let $f:\mathcal{D}\subseteq\mathbb{R}^{n}\to\mathbb{R}^{m}$ be a function defined on a domain $\mathcal{D}$. The function $f$ is said to be $C$-Lipschitz continuous with respect to the $\alpha$-norm if there exists a constant $C>0$ such that for all $\mathbf{x},\mathbf{y}\in\mathcal{D}$:
$$\|f(\mathbf{x})-f(\mathbf{y})\|_{\alpha}\leq C\|\mathbf{x}-\mathbf{y}\|_{\alpha}. \quad (1)$$

The Lipschitz constant can also be given by $C=\sup_{\mathbf{x}\in\mathcal{D}}\|\nabla_{\mathbf{x}}f\|_{\tilde{\alpha}}$, where $\nabla_{\mathbf{x}}f$ is the Jacobian of $f$ with respect to the input $\mathbf{x}$. Here, $\tilde{\alpha}$ denotes the dual norm of $\alpha$ if $m=1$; otherwise, $\tilde{\alpha}=\alpha$. For simplicity, we fix $\alpha=\tilde{\alpha}=2$ in this paper. A smaller constant $C$ implies limited output variation under small input perturbations, and it is closely associated with generalization and robustness. Unfortunately, computing the exact value of $C$ is NP-hard [[36](https://arxiv.org/html/2605.08894#bib.bib23)]. Therefore, we approximately characterize smoothness by estimating upper and lower bounds of $C$.
The simplest, yet significantly loose, upper bound is the product of the per-layer Lipschitz constants. The lower bound is estimated by sampling a small subset $S$ from the domain $\mathcal{D}$. Consider an $L$-layer network $f_{\boldsymbol{\theta}}$ parameterized by $\boldsymbol{\theta}$, defined as $f_{\boldsymbol{\theta}}=f^{(L)}\circ f^{(L-1)}\circ\cdots\circ f^{(1)}$. Upper and lower bounds on its Lipschitz constant are given by
$$C_{\text{lower}}\leq\sup_{\mathbf{x}\in S}\|\nabla_{\mathbf{x}}f_{\boldsymbol{\theta}}\|_{2}\leq C\leq\prod_{i=1}^{L}\sup_{\mathbf{x}^{(i-1)}\in\text{dom}(f^{(i)})}\|\nabla_{\mathbf{x}^{(i-1)}}f^{(i)}\|_{2}=C_{\text{upper}}, \quad (2)$$

where $\mathbf{x}^{(i-1)}$ denotes the input to the $i$-th layer and $\text{dom}\,f$ is the domain of $f$. $C_{\text{lower}}$ is more accurate and avoids the complexity of estimating $C$ exactly. A drawback, however, is that it merely reflects the local gradient magnitude at specific points rather than the global landscape. An alternative from computer vision is to compute the expected input gradient $C_{\text{avg}}=\mathbb{E}_{\mathbf{x}\in S}\|\nabla_{\mathbf{x}}f_{\boldsymbol{\theta}}\|_{2}$. Although this metric does not strictly satisfy the definition of the Lipschitz constant, it provides a valid and often more practical estimate of $C$. For more details, please refer to Khromov and Singh [[17](https://arxiv.org/html/2605.08894#bib.bib15)].
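Both point-based estimates reduce to computing input-gradient norms over a sample set, which automatic differentiation makes cheap. Below is a minimal sketch in PyTorch, assuming a scalar-valued callable `model_loss` standing in for $f$ and a list `samples` playing the role of $S$; both names are illustrative, not part of our released code.

```python
import torch

def input_grad_norm(model_loss, x):
    """L2 norm of the gradient of the scalar-valued f at input x."""
    x = x.detach().requires_grad_(True)
    loss = model_loss(x)                      # f(x), a scalar tensor
    (grad,) = torch.autograd.grad(loss, x)
    return grad.norm(p=2).item()

def smoothness_estimates(model_loss, samples):
    """C_lower (sup over S) and C_avg (mean over S) from Eq. (2)."""
    norms = [input_grad_norm(model_loss, x) for x in samples]
    return max(norms), sum(norms) / len(norms)
```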
### 2.2 Model Quantization
Model quantization converts the 16-bit floating-point formats commonly used in LLMs into low-bit representations [[14](https://arxiv.org/html/2605.08894#bib.bib2), [40](https://arxiv.org/html/2605.08894#bib.bib18)]. While most studies focus on low-bit integer representations, floating-point formats [[38](https://arxiv.org/html/2605.08894#bib.bib33)] and codebook-based encodings [[35](https://arxiv.org/html/2605.08894#bib.bib4), [12](https://arxiv.org/html/2605.08894#bib.bib46)] have also been investigated. In this work, we focus on fixed-point quantization for our analysis.

Quantization converts values between precisions through scaling and shifting. The process is formulated as
$$Q(w)=\text{clamp}\left(\left\lfloor\frac{w}{h}\right\rceil+z,\,0,\,2^{N}-1\right), \quad (3)$$

where $\lfloor\cdot\rceil$ denotes rounding-to-nearest and $\text{clamp}$ represents the clipping operation. The parameters $h$ and $z$ are obtained as follows:
h\\displaystyle h=max\(w\)−min\(w\)2N−1,\\displaystyle=\\frac\{\\max\(w\)\-\\min\(w\)\}\{2^\{N\}\-1\},\(4\)z\\displaystyle z=−⌊min\(w\)/h⌉\.\\displaystyle=\-\\lfloor\\min\(w\)/h\\rceil\.Here,NNdenotes the bit\-width of the integer\. For memory efficiency, matrix rows or columns are often quantized in groups, each sharing the samehhandzz, typically of size 64 or 128\.
Correspondingly, the dequantization process maps the discrete integer values back to the floating-point domain to enable subsequent computation. It can be defined as
$$\hat{w}=Q^{-1}(Q(w))=h\cdot(Q(w)-z). \quad (5)$$

The recovered value $\hat{w}$ serves as an approximation of the original weight $w$, with the quantization error determined by the bit-width $N$, the clipping range, and the group-wise statistics.
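For reference, the whole round trip of Eqs. (3)–(5) fits in a few lines. The sketch below applies the quantizer group-wise along the flattened weight; the function name and the assumption that the weight size divides evenly into groups are ours for illustration.

```python
import torch

def quantize_dequantize(w, n_bits=4, group_size=128):
    """Fake-quantize w per Eqs. (3)-(5); assumes w.numel() % group_size == 0."""
    shape = w.shape
    w = w.reshape(-1, group_size)
    w_max = w.amax(dim=-1, keepdim=True)
    w_min = w.amin(dim=-1, keepdim=True)
    h = (w_max - w_min) / (2 ** n_bits - 1)        # Eq. (4): scale
    h = h.clamp(min=1e-8)                          # guard against constant groups
    z = torch.round(-w_min / h)                    # Eq. (4): zero point
    q = torch.clamp(torch.round(w / h) + z, 0, 2 ** n_bits - 1)  # Eq. (3)
    return (h * (q - z)).reshape(shape)            # Eq. (5): dequantization
```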
Figure 2: Approximate Lipschitz constant, i.e., expected input gradient $C_{\text{avg}}$, of LLaMA-2-7B on one input sequence under different precisions. (a) BF16 (original); (b) INT3 (GPTQ); (c) input gradient distribution.
## 3 Empirics: Proxy and Smoothness
While Lipschitz constant bounds and the expected input gradient effectively characterize model smoothness, existing conclusions are predominantly derived from simple architectures or vision tasks, such as MLPs [[33](https://arxiv.org/html/2605.08894#bib.bib34)], ResNets [[5](https://arxiv.org/html/2605.08894#bib.bib11)], or ViTs [[17](https://arxiv.org/html/2605.08894#bib.bib15)]. It remains unclear whether these findings hold for transformer-based LLMs and quantized LLMs. In this section, we discuss the smoothness degradation problem in quantized LLMs after first validating the applicability of the proxy.
$C_{\text{avg}}$ is a differentiable metric, yet it deviates from the strict definition of $C$. We now need to address two pivotal questions: (a) can $C_{\text{avg}}$ approximate $C$ in transformer-based LLMs? and (b) does this approximation hold for quantized LLMs?
To facilitate measuring the impact of input perturbations on LLM outputs, we define $f$ as the language modeling objective combined with the cross-entropy loss. Consequently, $f$ takes the form $f:\mathbb{R}^{n}\to\mathbb{R}$. A key benefit is that $\nabla_{\mathbf{x}}f$ simplifies to a gradient vector instead of a Jacobian matrix, thereby avoiding the excessive computation and memory overhead associated with Jacobians.
We provide a formal definition of $\nabla_{\mathbf{x}}f$ in the context of LLMs. For an input sequence $(w_{1},\dots,w_{T})$, let $\mathbf{x}_{t}^{(i)}$ denote the input hidden state of token $w_{t}$ at layer $i=1,\dots,L$. The smoothness proxy, referred to as the “input gradient”, is defined as $\nabla_{\mathbf{x}^{(0)}}f$, where $\mathbf{x}^{(0)}$ is the token embedding of $w_{t}$.
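In practice, the proxy takes one forward and one backward pass. A minimal sketch, assuming a Hugging Face causal LM; the checkpoint name and the short prompt are placeholders rather than the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"   # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(name)
tok = AutoTokenizer.from_pretrained(name)

ids = tok("The quick brown fox", return_tensors="pt").input_ids
emb = model.get_input_embeddings()(ids).detach().requires_grad_(True)  # x^(0)
out = model(inputs_embeds=emb, labels=ids)   # f: LM objective + cross-entropy
out.loss.backward()
proxy = emb.grad.norm(p=2)                    # ||grad_{x^(0)} f||_2
```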
Since $C_{\text{lower}}$ is shown to be a reasonably accurate estimate of the Lipschitz constant [[17](https://arxiv.org/html/2605.08894#bib.bib15)], we investigate the relationship between $C_{\text{avg}}$ and $C_{\text{lower}}$ on LLaMA-2-7B. As shown in Figure [2(a)](https://arxiv.org/html/2605.08894#S2.F2.sf1), $C_{\text{lower}}$ converges as the number of input tokens increases. Although $C_{\text{lower}}$ exhibits a few distinct spikes, these are attributed to a very small number of tokens with large gradients. As illustrated in Figure [2(c)](https://arxiv.org/html/2605.08894#S2.F2.sf3), the gradients of nearly all tokens remain below 0.02. In the absence of these outliers, the distribution of $C_{\text{lower}}$ would be flat. Meanwhile, despite a gap in magnitude between the two, $C_{\text{avg}}$ exhibits a trend similar to that of $C_{\text{lower}}$. Therefore, although $C_{\text{avg}}$ does not strictly adhere to the definition of $C$, it remains a valid metric for estimating trends in LLM smoothness.
This conclusion also holds for the quantized models shown in Figure [2(b)](https://arxiv.org/html/2605.08894#S2.F2.sf2). Additionally, Figure [2(c)](https://arxiv.org/html/2605.08894#S2.F2.sf3) illustrates the shift in input gradients for quantized models. We observe that at 3-bit quantization, the input gradients exhibit a noticeable increase even though the quantization loss remains minimal. Figures [1](https://arxiv.org/html/2605.08894#S1.F1) and [2(c)](https://arxiv.org/html/2605.08894#S2.F2.sf3) clearly show that there exists a metric independent of forward fitting accuracy in quantization, namely the smoothness captured by the input gradient, which degrades sharply as the quantization bit-width decreases.
## 4 Theory: Sequence Neighborhood Modeling
Building on the empirical findings, we further investigate from a theoretical perspective how the behavior of quantized models changes. A widely adopted approach for evaluating quantized model capability is to measure perplexity on a validation dataset. This metric essentially reflects how close the predicted probability $p(w_{T+1}\mid w_{1:T})$ of the next token $w_{T+1}$ is to 1, given a context $w_{1:T}$ from the dataset. However, this evaluation paradigm overlooks a fundamental fact: in practical usage, LLMs rely on top-k or top-p sampling, where multiple candidate tokens beyond $w_{T+1}$ must be assigned relatively high probabilities.
The ability of an autoregressive language model is reflected in step-by-step decoding. We define a context $c=w_{1:T}$ as a point in the sequence space and view a newly added token $w$ as a small perturbation $\delta=1$. The $\delta$-neighborhood of the sequence space centered at $c$ can then be defined as:
$$N_{\delta}(c)=\{c\cup\{w\}\mid w\in\mathcal{V}\}, \quad (6)$$

where $\mathcal{V}$ denotes the model vocabulary, as shown in Figure [3(a)](https://arxiv.org/html/2605.08894#S4.F3.sf1). The function $f$, i.e., language modeling combined with cross-entropy loss, is defined over this sequence space and maps a sequence to a real-valued score. The directional derivative of $f$ at point $c$ in direction $w$ is defined as:
$$\nabla_{c\rightarrow c+w}f=f(c+w)-f(c). \quad (7)$$

Since $f(c)$ is a constant, we redefine the directional derivative as $\nabla_{c\rightarrow c+w}f=f(c+w)$. $\nabla_{c\rightarrow c+w}f$ measures the quality of perturbing $c$ along different directions: if a perturbation leads to worse sequence quality, $\nabla_{c\rightarrow c+w}f$ will be larger.
Figure 3: Key definitions in sequence neighborhood modeling. (a) Definition of $N_{\delta}(w_{1:T})$; (b) definition of rPPL-$k$.

The decoding of an LLM can be viewed as a sequence perturbation process starting from a given point. For a context $c=w_{1:T}$, a quantized model $\mathcal{M}_{Q}$ determines possible directions for the next perturbation by computing $p(\cdot\mid w_{1:T})$. To evaluate the quality of such perturbations, we compute the perplexity of $c+w$ using the original full-precision model $\mathcal{M}_{\text{FP16}}$, a metric we refer to as reverse perplexity (rPPL). We denote the perplexity computed by perturbing $c$ along the direction of the token ranked $k$-th by $p(\cdot\mid w_{1:T})$ as rPPL-$k$, as shown in Figure [3(b)](https://arxiv.org/html/2605.08894#S4.F3.sf2). In particular, perturbing along the most probable token (greedy decoding) direction yields rPPL-1. We do not compute perplexity over all vocabulary directions; instead, we focus on the most common feasible sampling range, such as perturbations corresponding to the top-40 predicted tokens.
Figure 4: Results of rPPL-1 to rPPL-40 on original and quantized LLaMA-2-7B models. (a) Quantized by GPTQ: 4-bit, 3-bit, and 2-bit quantization with group size 128. (b) Quantized by the vector-quantization method AQLM: codebook settings of 4×8, 2×8, and 1×16 with group size 8.

rPPL reflects the quality of different perturbation directions guided by quantized models. We compare rPPL results on C4 across models and quantization algorithms, as shown in Figure [4](https://arxiv.org/html/2605.08894#S4.F4). For both the original and quantized models, rPPL-$k$ increases as $k$ becomes larger, indicating that only top-ranked predictions are effective at each decoding step, which aligns with intuition. However, our focus is neither the absolute value of rPPL nor this monotonically increasing trend itself. More importantly, the growth becomes substantially steeper as the quantization bit-width decreases. This rapid collapse shows little correlation with the starting point rPPL-1, which mainly reflects the quality of single-point predictions.
Figure 5: Illustrative decoding trees of original and quantized models. (a) Decoding tree of the FP16 model; (b) decoding tree of the INT2 model. For any given node, darker child nodes indicate higher predictive quality in that direction, while gray nodes represent near-invalid tokens with extremely poor quality.

The lower the quantization bit-width, the faster the rPPL collapse becomes. This implies that as quantization approaches the extreme regime, more perturbation directions around any point in the sequence space become ineffective under the guidance of the quantized model. We illustrate this intuition in Figure [5](https://arxiv.org/html/2605.08894#S4.F5). After quantization, the number of effective child nodes at each node of the decoding tree decreases, and this effect becomes increasingly severe at lower bit-widths. While the original FP16 model may provide around 10 effective predictions, the number of effective tokens in a quantized model may shrink to 5 or even fewer. Such local shrinkage naturally accumulates as decoding proceeds. A direct consequence is that the decoding tree becomes progressively sparser. Therefore, the closer quantization moves toward the extreme low-bit regime, the sparser the decoding tree becomes, and the less likely the model is to decode high-quality generation paths. Importantly, this phenomenon cannot be revealed by conventional perplexity evaluation, since perplexity only measures the likelihood of a single path within the decoding tree.
rPPL can be viewed as the directional derivative of a quantized model along different prediction directions. Figure [4](https://arxiv.org/html/2605.08894#S4.F4) shows that models with lower quantization bit-widths exhibit much more drastic variation in directional derivatives around points in the sequence space. This variation can be expanded as:
$$\nabla_{c\rightarrow c+w}f=f(c+w)-f(c)\approx\nabla_{c}f^{\top}w+\frac{1}{2}w^{\top}\mathbf{H}w. \quad (8)$$

Ignoring the second-order term involving the Hessian $\mathbf{H}$, we obtain $\nabla_{c\rightarrow c+w}f\approx\nabla_{c}f^{\top}w$. Here, $\nabla_{c\rightarrow c+w}f$ and $\nabla_{c}f$ are different quantities: the former denotes the directional derivative of $f$ at point $c$ along a specific direction $w$, while the latter denotes the gradient of $f$ with respect to the input $c$. To suppress the rapid growth of $\nabla_{c\rightarrow c+w}f$ caused by quantization, we can control $\nabla_{c}f^{\top}w$. Since $\|\nabla_{c}f^{\top}w\|\leq\|\nabla_{c}f\|\cdot\|w\|$, and $\|w\|$ depends on the specific token direction and cannot be controlled, the practical way to bound this expression is to keep $\|\nabla_{c}f\|$ as stable or as small as possible during quantization. This not only explains our earlier empirical observations, but also provides a feasible direction for optimizing extreme quantization algorithms.
## 5 Simple Smoothness Preservation Method
In this section, we propose simple smoothness-preserving methods for both PTQ and QAT to examine whether smoothness brings performance gains to quantized models. Due to space limitations, we provide all experimental settings and ablations in Appendices [A.4](https://arxiv.org/html/2605.08894#A1.SS4) and [A.5](https://arxiv.org/html/2605.08894#A1.SS5).
### 5.1 Learnable Gradient Preservation for PTQ
PTQ quantizes LLMs in a layer-wise manner by aligning the output activations of quantized modules with those of the original model. For a linear layer $\mathbf{Y}=\mathbf{W}\mathbf{X}$, existing PTQ methods mainly optimize the forward reconstruction objective $\|\mathbf{W}\mathbf{X}-\hat{\mathbf{W}}\mathbf{X}\|_{F}^{2}$, which preserves forward fitting accuracy but does not guarantee backward gradient preservation after quantization. Considering backward propagation, let $\nabla_{\mathbf{Y}}f=\mathbf{G}$; the input gradient is then given by $\nabla_{\mathbf{X}}f=\mathbf{W}^{\top}\mathbf{G}$, indicating that smoothness preservation additionally requires maintaining gradient propagation, i.e., minimizing $\|\mathbf{W}^{\top}\mathbf{G}-\hat{\mathbf{W}}^{\top}\mathbf{G}\|_{F}^{2}$. Therefore, preserving both model accuracy and smoothness requires jointly optimizing forward fitting and backward preservation:
$$\min_{\hat{\mathbf{W}}}\underbrace{\|\mathbf{W}\mathbf{X}-\hat{\mathbf{W}}\mathbf{X}\|^{2}_{F}}_{\text{accuracy}}+\underbrace{\|\mathbf{W}^{\top}\mathbf{G}-\hat{\mathbf{W}}^{\top}\mathbf{G}\|^{2}_{F}}_{\text{smoothness}}. \quad (9)$$

As illustrated in Figure [7](https://arxiv.org/html/2605.08894#S5.F7), these two objectives are largely orthogonal, suggesting that classical PTQ objectives are inherently incomplete.
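For a single linear layer, both terms of Eq. (9) can be evaluated directly from calibration data. The sketch below assumes activations `X` and output gradients `G` have already been collected from a forward and backward pass on calibration samples; the function name is ours.

```python
import torch

def joint_objective(W, W_hat, X, G):
    """The two terms of Eq. (9) for a linear layer Y = W X."""
    accuracy = (W @ X - W_hat @ X).pow(2).sum()        # ||WX - W_hat X||_F^2
    smoothness = (W.T @ G - W_hat.T @ G).pow(2).sum()  # ||W^T G - W_hat^T G||_F^2
    return accuracy, smoothness
```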
Figure 6: The row-wise forward pass and the column-wise backward pass.

Figure 7: Layer-wise gradients of BF16 and B158 models. The spike denotes the “gradient ridge”.
The objective $\|\mathbf{W}\mathbf{X}-\hat{\mathbf{W}}\mathbf{X}\|_{F}^{2}$ admits a high-precision approximate closed-form solution, and methods such as GPTQ exploit second-order information (i.e., the Hessian $\mathbf{H}=\mathbf{X}\mathbf{X}^{\top}$) to iteratively quantize weights while compensating for induced errors, forming the backbone of many PTQ approaches. However, the joint smoothness objective in Equation [9](https://arxiv.org/html/2605.08894#S5.E9) is incompatible with such GPTQ-style solvers due to their orthogonal optimization structures: the forward reconstruction term follows a row-wise update scheme, whereas smoothness preservation in $\|\mathbf{W}^{\top}\mathbf{G}-\hat{\mathbf{W}}^{\top}\mathbf{G}\|_{F}^{2}$ requires column-wise consistency in gradient propagation. To alleviate this issue, we choose to build upon OmniQuant [[32](https://arxiv.org/html/2605.08894#bib.bib35)], which introduces learnable weight clipping parameters $\gamma$ and $\beta$,
$$h=\frac{\gamma\max(w)-\beta\min(w)}{2^{N}-1},\qquad z=-\lfloor\beta\min(w)/h\rceil, \quad (10)$$

and optimizes them via layer-wise distillation. Based on this framework, we introduce learnable gradient preservation (LGP) to explicitly maintain smoothness, yielding the joint objective:
$$\min_{\hat{\boldsymbol{\theta}}}\|z_{\boldsymbol{\theta}}-z_{\hat{\boldsymbol{\theta}}}\|_{F}^{2}+\alpha_{1}\|\nabla_{\mathbf{X}}f_{z_{\boldsymbol{\theta}}}-\nabla_{\mathbf{X}}f_{z_{\hat{\boldsymbol{\theta}}}}\|_{F}^{2}, \quad (11)$$

where $\alpha_{1}$ balances fitting and smoothness. This formulation helps effectively preserve smoothness in quantization, even though exact closed-form solutions for the joint objective remain an open problem.
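As a sketch of how the learnable clipping of Eq. (10) enters gradient-based optimization, the rounding operation can be handled with a straight-through estimator; the STE and the per-tensor (rather than group-wise) statistics here are simplifying assumptions in the spirit of OmniQuant-style training, not the exact released implementation.

```python
import torch

def ste_round(x):
    """Rounding with a straight-through gradient."""
    return (torch.round(x) - x).detach() + x

def lwc_quantize(w, gamma, beta, n_bits=2):
    """Eq. (10) with learnable clipping; per-tensor statistics for brevity."""
    h = (gamma * w.max() - beta * w.min()) / (2 ** n_bits - 1)
    z = ste_round(-beta * w.min() / h)
    q = torch.clamp(ste_round(w / h) + z, 0, 2 ** n_bits - 1)
    return h * (q - z)   # dequantized weight, differentiable in gamma and beta
```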
Table 1: Main W2A16 quantization results of the evaluation experiment. The best scores are in bold.

Table [1](https://arxiv.org/html/2605.08894#S5.T1) presents the evaluation results of incorporating LGP into PTQ. Compared to GPTQ, OmniQuant achieves substantial accuracy gains in 2-bit weight quantization, although a noticeable performance gap remains relative to the FP16 baseline. After integrating LGP, we first observe that language modeling capability is maintained or even slightly enhanced. Furthermore, LGP demonstrates improvements across the majority of accuracy tasks for each model. Notably, although LGP does not explicitly optimize for fitting accuracy, it yields further improvements in both perplexity and accuracy. We attribute this success to the inherent advantages of smoothness, which mitigates erratic output shifts. We discuss the reasons behind this change in depth later in the paper.
### 5.2 Loss of Gradient Regularization for QAT
QAT typically relies on the language modeling loss and therefore also overlooks smoothness preservation. While one may expect $\nabla_{\mathbf{x}^{(0)}}f$ to serve as a direct constraint on model smoothness, we find that it is ineffective due to the existence of a gradient ridge at the 0-th layer, as shown in Figure [7](https://arxiv.org/html/2605.08894#S5.F7). To investigate this phenomenon, we compare layer-wise gradients between a BF16 model and its 1.58-bit counterpart [[25](https://arxiv.org/html/2605.08894#bib.bib22)], monitoring input and output gradients at the `input_layernorm` and `post_attention_layernorm` modules across layers. We observe that the quantized model exhibits significantly larger intermediate gradients, indicating reduced smoothness in hidden states, particularly in early and middle layers. However, the input gradients at the 0-th layer remain consistently high and nearly identical across both models, suggesting that this behavior is not induced by quantization. We hypothesize that this stems from the nature of embedding inputs, which lack semantic structure and lead to sparse representations, causing optimization to concentrate on ridge-like regions in the latent space rather than smooth valleys. We refer to this phenomenon as the “gradient ridge”, an intrinsic property independent of weight quantization, which explains why $\nabla_{\mathbf{x}^{(0)}}f$ cannot reliably reflect smoothness degradation.
Table 2: Main quantization results on our trained FP16 and B158 models. The best scores are in bold.

To enable extremely low-bit models to automatically acquire smoothness during training, we propose a gradient regularization loss (LGR). Because of the gradient ridge, we avoid using 0-th layer inputs and instead apply regularization on the 1-st layer hidden states. Specifically, we define the smoothness objective as $\mathcal{L}_{\text{smooth}}=\frac{1}{N}\sum_{i=1}^{N}\|\nabla_{\mathbf{x}_{i}^{(1)}}f\|_{F}^{2}$, where $N$ is the sequence length. The overall training objective is then $\mathcal{L}=\mathcal{L}_{\text{lm}}+\alpha_{2}\mathcal{L}_{\text{smooth}}$, with $\alpha_{2}$ controlling the strength of smoothness regularization. This formulation allows smoothness to be explicitly encouraged in extremely low-bit training while avoiding the instability introduced by 0-th layer gradients.
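Since $\mathcal{L}_{\text{smooth}}$ penalizes a gradient norm, it requires differentiating through the gradient itself (double backward). A minimal sketch, assuming a LLaMA-style module layout where `model.model.layers[1]` receives the 1-st layer hidden states; the hook-based capture and the default $\alpha_{2}$ are implementation assumptions.

```python
import torch

def lgr_loss(model, ids, alpha2=1e-3):
    """L = L_lm + alpha2 * (1/N) sum_i ||grad_{x_i^(1)} f||^2 (double backward)."""
    cache = {}

    def grab(module, inputs, output):
        cache["x1"] = inputs[0]           # hidden states entering layer 1

    handle = model.model.layers[1].register_forward_hook(grab)
    lm_loss = model(ids, labels=ids).loss
    handle.remove()
    (g,) = torch.autograd.grad(lm_loss, cache["x1"], create_graph=True)
    l_smooth = g.pow(2).sum(dim=-1).mean()   # mean per-token squared grad norm
    return lm_loss + alpha2 * l_smooth
```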
Table [2](https://arxiv.org/html/2605.08894#S5.T2) presents the performance of B158 models and their LGR-enhanced counterparts. For reference, the results of the FP16 models are also included. It is evident that the models trained with LGR perform better on the majority of benchmarks. Although lower scores on some tasks reduce the overall average, this does not negate the effectiveness on most benchmarks. We also observe that the 0.4B model outperforms the 1.7B model on certain tasks, which is likely attributable to the sufficiency of its training.
## 6 Discussion
### 6.1 Quantized Weight Space is Anisotropic
Although we are still unable to derive a high-quality closed-form solution to Equation [9](https://arxiv.org/html/2605.08894#S5.E9), we can study how the properties of the solution change between two extremes: fitting accuracy and smoothness preservation. Specifically, we investigate the quantization of the `q_proj` weights in the 0-th layer of LLaMA-2-7B and separately optimize the two components of the objective in Equation [9](https://arxiv.org/html/2605.08894#S5.E9). We optimize $\|\mathbf{W}\mathbf{X}-\hat{\mathbf{W}}\mathbf{X}\|_{F}^{2}$ to obtain $\hat{\mathbf{W}}_{a}$ using the GPTQ algorithm, and optimize $\|\mathbf{W}^{\top}\mathbf{G}-\hat{\mathbf{W}}^{\top}\mathbf{G}\|_{F}^{2}$ to obtain $\hat{\mathbf{W}}_{s}$. By constructing $\hat{\mathbf{W}}=(1-\alpha)\hat{\mathbf{W}}_{a}+\alpha\hat{\mathbf{W}}_{s}$ and varying the value of $\alpha$, we obtain a series of intermediate points between $\hat{\mathbf{W}}_{a}$ and $\hat{\mathbf{W}}_{s}$. Even if these intermediate points are not solutions to Equation [9](https://arxiv.org/html/2605.08894#S5.E9), their trends clearly reveal how the properties of the solution space change along different directions. We use cosine similarity to measure how well the quantized outputs $\hat{\mathbf{W}}\mathbf{X}$ and $\hat{\mathbf{W}}^{\top}\mathbf{G}$ preserve the original outputs $\mathbf{W}\mathbf{X}$ and $\mathbf{W}^{\top}\mathbf{G}$, as shown in Figure [9](https://arxiv.org/html/2605.08894#S6.F9). At 4-bit quantization, the difference between the solutions $\hat{\mathbf{W}}_{a}$ and $\hat{\mathbf{W}}_{s}$ is negligible, with no clear trade-off between fitting accuracy and smoothness preservation. However, as the quantization bit-width decreases to 3-bit and even 2-bit, the Pareto frontier shifts rapidly, making it difficult to optimize fitting accuracy and smoothness simultaneously. In addition, backward smoothness is more sensitive to low-bit quantization than forward accuracy: at 2-bit, when only fitting accuracy is considered (i.e., $\hat{\mathbf{W}}=\hat{\mathbf{W}}_{a}$), the backward preservation score drops sharply to 0.70. The shift of the Pareto frontier and the different sensitivities along the two directions suggest that an appropriate trade-off exists between the two extreme solutions.
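The interpolation study itself is easy to reproduce once $\hat{\mathbf{W}}_{a}$ and $\hat{\mathbf{W}}_{s}$ are available. The sketch below sweeps $\alpha$ and reports the two cosine-similarity preservation scores; function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def preservation_curve(W, W_a, W_s, X, G, steps=11):
    """Cosine-similarity preservation scores along the W_a -> W_s segment."""
    scores = []
    for a in torch.linspace(0, 1, steps):
        W_hat = (1 - a) * W_a + a * W_s
        fwd = F.cosine_similarity((W @ X).flatten(), (W_hat @ X).flatten(), dim=0)
        bwd = F.cosine_similarity((W.T @ G).flatten(), (W_hat.T @ G).flatten(), dim=0)
        scores.append((a.item(), fwd.item(), bwd.item()))
    return scores
```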
Model quantization can be viewed as shifting the original weights along specific directions. The above analysis suggests that quantization performance depends on the choice of weight shift directions. As illustrated in Figure [9](https://arxiv.org/html/2605.08894#S6.F9), in high-precision regimes, both accuracy and smoothness can simultaneously reach optimality. However, under large deviations between $\hat{\boldsymbol{\theta}}$ and $\boldsymbol{\theta}$ at lower bit-widths, optimizing fitting accuracy alone leads to a loss of smoothness. This weight anisotropy in extreme quantization suggests that smoothness must be considered explicitly.
Figure 8: Forward / backward preservation.

Figure 9: Anisotropy of the weight space.

### 6.2 Accuracy + Smoothness = $\varnothing$?
Let $\Delta\mathbf{W}=\mathbf{W}-\hat{\mathbf{W}}$; then $\|\mathbf{W}\mathbf{X}-\hat{\mathbf{W}}\mathbf{X}\|_{F}^{2}$ simplifies to $\min\|\Delta\mathbf{W}\mathbf{X}\|_{F}^{2}$. Given that $\mathbf{X}$ adheres to a certain distribution, a $\Delta\mathbf{W}$ exists that minimizes this term, which partially explains the viability of quantization. If we subsequently impose the stronger constraint $\|\mathbf{W}^{\top}\mathbf{G}-\hat{\mathbf{W}}^{\top}\mathbf{G}\|_{F}^{2}$ (i.e., $\min\|\Delta\mathbf{W}^{\top}\mathbf{G}\|_{F}^{2}$), does a valid $\Delta\mathbf{W}$ still persist? Does this imply that the original FP16 model is the sole candidate capable of simultaneously achieving both accuracy and smoothness? The answer seems to be no. The two optimization processes essentially require the rows of $\Delta\mathbf{W}$ to lie in the null space $\mathcal{N}(\mathbf{X}^{\top})$ and its columns to lie in the null space $\mathcal{N}(\mathbf{G}^{\top})$. A joint solution exists provided that $\mathcal{N}(\mathbf{X}^{\top})\cap\mathcal{N}(\mathbf{G}^{\top})\neq\{\mathbf{0}\}$. Indeed, the extreme sparsity of LLMs ensures that $\mathbf{X}$ and $\mathbf{G}$ are low-rank, making the condition:
$$\mathrm{rank}(\mathbf{X})+\mathrm{rank}(\mathbf{G})<\min(d_{\text{in}},d_{\text{out}}) \quad (12)$$

generally valid, though the feasible solution space becomes much narrower. This observation highlights the challenges that dual constraints impose on developing algorithms for extreme quantization.
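The feasibility argument can even be made constructive: any outer product of a left-null direction of $\mathbf{G}$ with a left-null direction of $\mathbf{X}$ satisfies both constraints at once. A minimal sketch, under the assumption that full SVDs are affordable for the layer at hand:

```python
import torch

def joint_delta(X, G):
    """Return a nonzero DeltaW with DeltaW @ X = 0 and DeltaW.T @ G = 0,
    or None if N(X^T) or N(G^T) is trivial. X: (d_in, T), G: (d_out, T)."""
    Ux = torch.linalg.svd(X).U            # full SVD: Ux is (d_in, d_in)
    Ug = torch.linalg.svd(G).U            # Ug is (d_out, d_out)
    rx = int(torch.linalg.matrix_rank(X))
    rg = int(torch.linalg.matrix_rank(G))
    if rx == X.shape[0] or rg == G.shape[0]:
        return None                       # one of the null spaces is {0}
    v = Ux[:, rx:rx + 1]                  # a direction in N(X^T), for the rows
    u = Ug[:, rg:rg + 1]                  # a direction in N(G^T), for the columns
    return u @ v.T                        # (d_out, d_in): u (v^T X) = 0, v (u^T G) = 0
```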
### 6.3 How Does Smoothness Help Quantized Models?
Fitting accuracy improves model capability by maximizing the predicted probabilities of high-quality tokens. However, for a fixed weight bit-width, the achievable fitting accuracy is inherently limited. In contrast, the smoothness objective does not directly increase the probabilities of high-quality tokens, but instead improves model performance by influencing token rankings. This can be understood from the trend of the rPPL-$k$ curves: to slow the growth of rPPL and make the curve smoother, the model must rank high-quality tokens higher in the prediction order, even if their probability gains are not substantial.

For example, the nine zero-shot tasks used in our evaluation are all multiple-choice benchmarks, where correctness is determined by the relative ranking among options A, B, C, and D. Suppose the correct answer is D. After introducing smoothness regularization, the model may change the prediction order from “BCDA” to “DBAC”, as shown in Figure [10](https://arxiv.org/html/2605.08894#S6.F10). Although none of the four options ranks highly in the overall ranking, this change is still sufficient to turn an incorrect prediction into a correct one.

Therefore, what we would like to emphasize in this paper is that fitting accuracy and smoothness play different roles in model performance, and this distinction becomes particularly important under extreme quantization. However, one should not expect smoothness optimization alone to yield substantial performance gains. Fitting accuracy remains the primary factor determining quantization performance, while smoothness serves as a complementary enhancement.

Figure 10: Comparison of fitting optimization and smoothness optimization. Given a prompt, the ideal ranking of predicted tokens is “DBAC”.
### 6.4 The Role of LayerNorm in Sub-2-bit Models
Our training experience with sub-2-bit models suggests that introducing `LayerNorm` into linear layers is crucial, a practice also emphasized in several previous studies [[41](https://arxiv.org/html/2605.08894#bib.bib1), [25](https://arxiv.org/html/2605.08894#bib.bib22)]. The discussion in this paper helps explain the necessity of this practice: beyond improving training stability, `LayerNorm` also smooths the backward gradients during training. Therefore, `LayerNorm` contributes indispensably to the performance of sub-2-bit models.
## 7 Related Work
### 7.1 Model Quantization
Model quantization compresses high-precision weights into low-bit representations. Post-training quantization (PTQ) converts a trained model to low-bit precision using optimization solvers [[14](https://arxiv.org/html/2605.08894#bib.bib2), [9](https://arxiv.org/html/2605.08894#bib.bib20)]. Quantization-aware training (QAT) integrates quantization into the training process to mitigate the adverse effects of reduced precision [[22](https://arxiv.org/html/2605.08894#bib.bib19), [41](https://arxiv.org/html/2605.08894#bib.bib1)]. Some studies also explore simultaneous weight and activation quantization [[23](https://arxiv.org/html/2605.08894#bib.bib3), [40](https://arxiv.org/html/2605.08894#bib.bib18)]. Currently, the widely used lossless quantization level is INT4. Quantization below this level is considered extreme and often leads to degraded model performance [[35](https://arxiv.org/html/2605.08894#bib.bib4), [16](https://arxiv.org/html/2605.08894#bib.bib7), [25](https://arxiv.org/html/2605.08894#bib.bib22), [21](https://arxiv.org/html/2605.08894#bib.bib5), [42](https://arxiv.org/html/2605.08894#bib.bib21)]. Nearly all existing extreme quantization work focuses primarily on improving representation accuracy to reduce forward computation loss.
### 7.2 Network Smoothness
Lipschitzness is the most direct smoothness metric in neural networks [[3](https://arxiv.org/html/2605.08894#bib.bib24)]. However, despite its concise definition, accurately estimating the Lipschitz constant of a neural network is highly challenging [[36](https://arxiv.org/html/2605.08894#bib.bib23)]. Numerous studies attempt to approximate network Lipschitz constants, yet they typically suffer from two limitations. First, they fail to effectively address the exponential growth of estimated bounds with network depth [[36](https://arxiv.org/html/2605.08894#bib.bib23), [39](https://arxiv.org/html/2605.08894#bib.bib29)]. Second, most prior work relies on early architectures [[15](https://arxiv.org/html/2605.08894#bib.bib27), [34](https://arxiv.org/html/2605.08894#bib.bib28), [13](https://arxiv.org/html/2605.08894#bib.bib25), [44](https://arxiv.org/html/2605.08894#bib.bib26), [17](https://arxiv.org/html/2605.08894#bib.bib15)], such as ResNet. In contrast, the Lipschitzness of transformers [[18](https://arxiv.org/html/2605.08894#bib.bib30), [29](https://arxiv.org/html/2605.08894#bib.bib31)] remains relatively underexplored. Although early studies in vision [[15](https://arxiv.org/html/2605.08894#bib.bib27)] and robustness [[28](https://arxiv.org/html/2605.08894#bib.bib32)] investigate the impact of smoothness on model performance via proxies, conclusions on smoothness in LLMs remain scarce.
## 8 Conclusion
In this work, we highlight smoothness as an important but overlooked objective in extreme quantization. Building on input-gradient analysis and sequence neighborhood modeling, we introduce LGP for PTQ and LGR for QAT as simple smoothness-preserving instantiations, and advocate explicitly incorporating smoothness into future quantization design.
## References
- [1] A. Amini, S. Gabriel, S. Lin, R. Koncel-Kedziorski, Y. Choi, and H. Hajishirzi (2019). MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of NAACL-HLT, pp. 2357–2367. [Link](https://doi.org/10.18653/v1/N19-1245)
- [2] D. Bahri, H. Mobahi, and Y. Tay (2022). Sharpness-aware minimization improves language model generalization. In Proceedings of ACL, pp. 7360–7371. [Link](https://doi.org/10.18653/v1/2022.acl-long.508)
- [3] P. L. Bartlett, D. J. Foster, and M. J. Telgarsky (2017). Spectrally-normalized margin bounds for neural networks. In NeurIPS 30, pp. 6240–6249. [Link](https://proceedings.neurips.cc/paper_files/paper/2017/hash/b22b257ad0519d4500539da3c8bcf4dd-Abstract.html)
- [4] Y. Bisk, R. Zellers, J. Gao, Y. Choi, et al. (2020). PIQA: Reasoning about physical commonsense in natural language. In Proceedings of AAAI, pp. 7432–7439. [Link](https://doi.org/10.48550/arXiv.1911.11641)
- [5] A. Chan, Y. Tay, and Y. Ong (2020). What it thinks is important is important: Robustness transfers through input gradients. In Proceedings of CVPR, pp. 332–341. [Link](https://openaccess.thecvf.com/content_CVPR_2020/html/Chan_What_It_Thinks_Is_Important_Is_Important_Robustness_Transfers_Through_CVPR_2020_paper.html)
- [6] C. Clark, K. Lee, M. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova (2019). BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of NAACL-HLT, pp. 2924–2936. [Link](https://doi.org/10.18653/v1/N19-1300)
- [7] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018). Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457. [Link](https://doi.org/10.48550/arXiv.1803.05457)
- [8] J. Cohen, E. Rosenfeld, and Z. Kolter (2019). Certified adversarial robustness via randomized smoothing. In Proceedings of ICML, pp. 1310–1320. [Link](https://doi.org/10.48550/arXiv.1902.02918)
- [9] T. Dettmers, R. Svirschevski, V. Egiazarian, D. Kuznedelev, E. Frantar, S. Ashkboos, A. Borzunov, T. Hoefler, and D. Alistarh (2024). SpQR: A sparse-quantized representation for near-lossless LLM weight compression. In Proceedings of ICLR. [Link](https://openreview.net/forum?id=Q1u25ahSuy)
- [10] P. Dong, L. Li, Y. Zhong, D. Du, R. Fan, Y. Chen, Z. Tang, Q. Wang, W. Xue, Y. Guo, et al. (2025). STBLLM: Breaking the 1-bit barrier with structured binary LLMs. In Proceedings of ICLR. [Link](https://openreview.net/forum?id=6XUSDvBFkV)
- [11] P. Dwivedi, B. Islam, and M. Kajal (2025). Smooth gradient loss: A loss function for gradient regularization in deep learning optimization. The Journal of Supercomputing 81, pp. 1–45. [Link](https://doi.org/10.1007/s11227-025-07954-9)
- [12] V. Egiazarian, A. Panferov, D. Kuznedelev, E. Frantar, A. Babenko, and D. Alistarh (2024). Extreme compression of large language models via additive quantization. In Proceedings of ICML, pp. 12284–12303. [Link](https://proceedings.mlr.press/v235/egiazarian24a.html)
- [13] N. B. Erichson, O. Azencot, A. Queiruga, L. Hodgkinson, and M. W. Mahoney (2021). Lipschitz recurrent neural networks. In Proceedings of ICLR. [Link](https://openreview.net/forum?id=-N7PBXqOUJZ)
- [14] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh (2022). GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323. [Link](https://doi.org/10.48550/arXiv.2210.17323)
- [15] H. Gouk, E. Frank, B. Pfahringer, and M. J. Cree (2021). Regularisation of neural networks by enforcing Lipschitz continuity. Machine Learning 110, pp. 393–416. [Link](https://doi.org/10.1007/s10994-020-05929-w)
- [16] W. Huang, Y. Liu, H. Qin, Y. Li, S. Zhang, X. Liu, M. Magno, and X. Qi (2024). BiLLM: Pushing the limit of post-training quantization for LLMs. In Proceedings of ICML, pp. 20023–20042. [Link](https://proceedings.mlr.press/v235/huang24q.html)
- [17] G. Khromov and S. P. Singh (2024). Some fundamental aspects about Lipschitz continuity of neural networks. In Proceedings of ICLR. [Link](https://openreview.net/forum?id=5jWsW08zUh)
- [18] H. Kim, G. Papamakarios, and A. Mnih (2021). The Lipschitz constant of self-attention. In Proceedings of ICML, pp. 5562–5571. [Link](https://proceedings.mlr.press/v139/kim21i.html)
- [19] H. Lee, S. Cho, and C. Kim (2025). Indirect gradient matching for adversarial robust distillation. In Proceedings of ICLR. [Link](https://doi.org/10.48550/arXiv.2312.03286)
- [20] T. Li, P. Zhou, Z. He, X. Cheng, and X. Huang (2024). Friendly sharpness-aware minimization. In Proceedings of CVPR, pp. 5631–5640. [Link](https://openaccess.thecvf.com/content/CVPR2024/html/Li_Friendly_Sharpness-Aware_Minimization_CVPR_2024_paper.html)
- [21] Z. Li, X. Yan, T. Zhang, H. Qin, D. Xie, J. Tian, L. Kong, Y. Zhang, X. Yang, et al. (2025). ARB-LLM: Alternating refined binarizations for large language models. In Proceedings of ICLR. [Link](https://openreview.net/forum?id=ZU8OdDLTts)
- [22] Z. Liu, B. Oguz, C. Zhao, E. Chang, P. Stock, Y. Mehdad, Y. Shi, R. Krishnamoorthi, and V. Chandra (2024). LLM-QAT: Data-free quantization aware training for large language models. In Findings of ACL, pp. 467–484. [Link](https://doi.org/10.18653/v1/2024.findings-acl.26)
- [23] Z. Liu, C. Zhao, I. Fedorov, B. Soran, D. Choudhary, R. Krishnamoorthi, V. Chandra, Y. Tian, and T. Blankevoort (2025). SpinQuant: LLM quantization with learned rotations. In Proceedings of ICLR. [Link](https://doi.org/10.48550/arXiv.2405.16406)
- [24] K. Lyu, Z. Li, and S. Arora (2022). Understanding the generalization benefit of normalization layers: Sharpness reduction. In NeurIPS 35, pp. 34689–34708. [Link](https://papers.nips.cc/paper_files/paper/2022/hash/dffd1c523512e557f4e75e8309049213-Abstract-Conference.html)
- [25] S. Ma, H. Wang, L. Ma, L. Wang, W. Wang, S. Huang, L. Dong, R. Wang, J. Xue, and F. Wei (2024). The era of 1-bit LLMs: All large language models are in 1.58 bits. arXiv preprint arXiv:2402.17764. [Link](https://doi.org/10.48550/arXiv.2402.17764)
- [26] S. Merity, C. Xiong, J. Bradbury, and R. Socher (2017). Pointer sentinel mixture models. In Proceedings of ICLR. [Link](https://openreview.net/forum?id=Byj72udxe)
- [27] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal (2018). Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of EMNLP, pp. 2381–2391. [Link](https://doi.org/10.48550/arXiv.1809.02789)
- [28] P. Pauli, A. Koch, J. Berberich, P. Kohler, and F. Allgöwer (2021). Training robust neural networks using Lipschitz bounds. IEEE Control Systems Letters 6, pp. 121–126. [Link](https://doi.org/10.1109/LCSYS.2021.3050444)
- [29] X. Qi, J. Wang, Y. Chen, Y. Shi, and L. Zhang (2023). LipsFormer: Introducing Lipschitz continuity to vision transformers. In Proceedings of ICLR. [Link](https://openreview.net/forum?id=cHf1DcCwcH3)
- [30] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21, pp. 5485–5551. [Link](https://jmlr.org/papers/volume21/20-074/20-074.pdf)
- [31] K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi (2021). WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM 64(9), pp. 99–106. [Link](https://doi.org/10.48550/arXiv.1907.10641)
- [32] W. Shao, M. Chen, Z. Zhang, P. Xu, L. Zhao, Z. Li, K. Zhang, P. Gao, Y. Qiao, and P. Luo (2024). OmniQuant: Omnidirectionally calibrated quantization for large language models. In Proceedings of ICLR. [Link](https://openreview.net/forum?id=8Wuvhh0LYW)
- [33] Z. Shi, Y. Wang, H. Zhang, J. Z. Kolter, and C. Hsieh (2022). Efficiently computing local Lipschitz constants of neural networks via bound propagation. In NeurIPS 35, pp. 2350–2364. [Link](https://proceedings.neurips.cc/paper_files/paper/2022/hash/0ff54b4ec4f70b3ae12c8621ca8a49f4-Abstract-Conference.html)
- [34] S. Singla and S. Feizi (2021). Skew orthogonal convolutions. In Proceedings of ICML, pp. 9756–9766. [Link](https://proceedings.mlr.press/v139/singla21a.html)
- [35] A. Tseng, J. Chee, Q. Sun, V. Kuleshov, and C. De Sa (2024). QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks. In Proceedings of ICML, pp. 48630–48656. [Link](https://proceedings.mlr.press/v235/tseng24a.html)
- [36] A. Virmaux and K. Scaman (2018). Lipschitz regularity of deep neural networks: Analysis and efficient estimation. In NeurIPS 31, pp. 3835–3844. [Link](https://proceedings.neurips.cc/paper_files/paper/2018/hash/d54e99a6c03704e95e6965532dec148b-Abstract.html)
- [37] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP, pp. 353–355. [Link](https://doi.org/10.18653/v1/W18-5446)
- [38] R. Wang, Y. Gong, X. Liu, G. Zhao, Z. Yang, B. Guo, Z. Zha, and P. Cheng (2025). Optimizing large language model training using FP4 quantization. In Proceedings of ICML, pp. 62937–62957. [Link](https://proceedings.mlr.press/v267)
- [39] Z. Wang, G. Prakriya, and S. Jha (2022). A quantitative geometric approach to neural-network smoothness. In NeurIPS 35, pp. 34201–34215. [Link](https://proceedings.neurips.cc/paper_files/paper/2022/hash/dd1322ce23cbbdd9d7ebb0ad1223c27a-Abstract-Conference.html)
- [40] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han (2023). SmoothQuant: Accurate and efficient post-training quantization for large language models. In Proceedings of ICML, pp. 38087–38099. [Link](https://proceedings.mlr.press/v202/xiao23c.html)
- [41] Y. Xu, X. Han, Z. Yang, S. Wang, Q. Zhu, Z. Liu, W. Liu, and W. Che (2024). OneBit: Towards extremely low-bit large language models. In NeurIPS 37, pp. 66357–66382. [Link](https://doi.org/10.52202/079017-2122)
- [42] Y. Xu, S. Ji, Q. Zhu, and W. Che (2025). CRVQ: Channel-relaxed vector quantization for extreme compression of LLMs. Transactions of the Association for Computational Linguistics 13, pp. 1488–1506. [Link](https://doi.org/10.1162/TACL.a.45)
- [43] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi (2019). HellaSwag: Can a machine really finish your sentence? In Proceedings of ACL, pp. 4791–4800. [Link](https://doi.org/10.18653/v1/P19-1472)
- [44] Z. Zhou, J. Liang, Y. Song, L. Yu, H. Wang, W. Zhang, Y. Yu, and Z. Zhang (2019). Lipschitz generative adversarial nets. In Proceedings of ICML, pp. 7584–7593. [Link](https://proceedings.mlr.press/v97/zhou19c.html)
## Appendix A

### A.1 Smoothness in Training Process
Figure 11: Evolution of $C_{\text{avg}}$ during training across different models and settings ((a) BF16 model vs. B158 model; (b) B158 model under different LGR settings). We use the 0.4B model in this experiment.

In this section, we discuss several topics related to smoothness-aware training, reflecting on insights and conjectures derived from our experiments. We primarily address the following three questions:
a) Why is RMSNorm excluded from the BitNet b1.58 model in our QAT experiments?

In the standard BitNet b1.58 architecture [[25](https://arxiv.org/html/2605.08894#bib.bib22)], the activations entering the linear layer are first normalized via RMSNorm before performing matrix multiplication with the ternary weights. This process is defined as follows:

$$\mathbf{X}' = \mathrm{RMSNorm}(\mathbf{X}), \qquad \mathbf{W}' = \mathrm{Ternarize}(\mathbf{W}), \qquad \mathbf{Y} = \mathbf{W}'\mathbf{X}'.$$

In our LGR implementation, we deliberately excluded RMSNorm as a controlled-variable strategy. It is well established that BitNet models incorporating RMSNorm achieve performance comparable to full-precision baselines. This effectiveness stems from two factors: first, RMSNorm normalizes activations during the forward pass, improving computational accuracy; second, it significantly flattens backward gradients, substantially improving ternary-model smoothness.

However, this dual impact makes it difficult to isolate the specific benefits of smoothness. By removing RMSNorm, we align the forward computation with that of the full-precision model, so any observed gains can be attributed solely to smoothness. Furthermore, the success of RMSNorm itself implicitly validates the accuracy-smoothness trade-off proposed in this paper.
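As a concrete illustration, below is a minimal PyTorch sketch of this forward pass with an optional RMSNorm, so the controlled setting corresponds to `use_rmsnorm=False`. The absmean ternarization and straight-through estimator follow the common BitNet b1.58 recipe; `TernaryLinear` is our name for illustration, and `nn.RMSNorm` requires a recent PyTorch release.

```python
import torch
import torch.nn as nn

def ternarize(w: torch.Tensor) -> torch.Tensor:
    """Absmean ternarization: round scaled weights to {-1, 0, +1}."""
    scale = w.abs().mean().clamp(min=1e-5)
    w_t = (w / scale).round().clamp(-1, 1) * scale
    # Straight-through estimator: forward uses ternary values,
    # backward treats ternarization as the identity map.
    return w + (w_t - w).detach()

class TernaryLinear(nn.Module):
    """Y = Ternarize(W) X', where X' is RMSNorm(X) or X itself."""

    def __init__(self, d_in: int, d_out: int, use_rmsnorm: bool = True):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) * 0.02)
        # Our controlled QAT experiments drop RMSNorm so that the
        # forward computation matches the full-precision model.
        self.norm = nn.RMSNorm(d_in) if use_rmsnorm else nn.Identity()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.norm(x)                   # X' = RMSNorm(X) (or X)
        w = ternarize(self.weight)         # W' = Ternarize(W)
        return nn.functional.linear(x, w)  # Y = W' X'
```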
b) Is smoother training always better?

In QAT, maximizing smoothness is not unconditionally beneficial. Moreover, it is unrealistic to expect a ternary model to match a full-precision model in both smoothness and performance simultaneously. The core idea of this work is to identify the smoothest candidate on the manifold of quantized solutions with equivalent fitting accuracy.

The guiding principle is moderate smoothing: enhancing smoothness without compromising accuracy. We select the hyperparameter $\alpha_2 = 0.01$. As shown in Figure [11](https://arxiv.org/html/2605.08894#A1.F11), increasing $\alpha_2$ significantly improves model smoothness; at $\alpha_2 = 100$, the ternary model exhibits smoothness comparable to the full-precision baseline. However, this comes at a cost: over-smoothing degrades accuracy, causing perplexity to deteriorate from 97.4 to 130.6, much as a random regularizer would. Consequently, excessive smoothing must be avoided.
c) How can smoothness be incorporated into industrial training practices?

Since the norm of input gradients is negligible relative to the training loss, introducing smoothness constraints in the early stages of training proves ineffective. Furthermore, adhering to the principle of moderate smoothing, LGR should not drastically alter the intrinsic smoothness of the quantized model. Consequently, we recommend incorporating smoothness training only during the mid-to-late stages of training. This strategy may preserve pre-trained knowledge while conserving computational resources.
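A hedged sketch of this staged recipe in PyTorch follows. The hooks `model.embed` and `model.loss_from_hidden` and the switch-on step `lgr_start_step` are hypothetical names for illustration, and we take the penalty to be the norm of the loss gradient with respect to the hidden states entering the transformer stack.

```python
import torch

def training_step(model, batch, step, alpha2=0.01, lgr_start_step=200_000):
    """One QAT step with LGR enabled only in the mid-to-late stage.

    `model.embed` / `model.loss_from_hidden` are assumed hooks exposing
    the hidden states on which the input gradient is measured.
    """
    hidden = model.embed(batch["input_ids"])
    if not hidden.requires_grad:      # leaf case, e.g. frozen embedding
        hidden.requires_grad_(True)
    loss = model.loss_from_hidden(hidden, batch["labels"])

    if step >= lgr_start_step:        # skip the early stage (see above)
        # create_graph=True makes the gradient-norm penalty itself
        # differentiable, so it contributes to the weight update.
        (g,) = torch.autograd.grad(loss, hidden, create_graph=True)
        loss = loss + alpha2 * g.norm()

    loss.backward()
    return loss.detach()
```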
Figure 12: Evolution of $C_{\text{avg}}$ during training when integrating RMSNorm ((a) BitNet b1.58 model) or freezing the word embedding ((b) B158 model with frozen embedding).
### A.2 Gradient Ridge
We observe that the “Gradient Ridge” is not sporadic: it is independent of subtle architectural modifications and quantization settings. In Section [5.2](https://arxiv.org/html/2605.08894#S5.SS2), we initially demonstrate its presence in both FP16 and B158 models. To reinforce the universality of this observation, we conducted similar tests on the BitNet b1.58 architecture with RMSNorm.

As illustrated in Figure [12(a)](https://arxiv.org/html/2605.08894#A1.F12.sf1), even though the RMSNorm-equipped model exhibits smoothness comparable to the FP16 baseline, the input gradient at the 0th layer still spikes to a very high magnitude. Note that all results reported thus far, including those in Section [5.2](https://arxiv.org/html/2605.08894#S5.SS2), are derived under full-weight training settings. Furthermore, we tested frozen FP16 embeddings within the B158 model. As depicted in Figure [12(b)](https://arxiv.org/html/2605.08894#A1.F12.sf2), the conclusion holds firm.

Based on this evidence, we conclude that the Gradient Ridge is by no means coincidental. Although its root cause remains elusive, we believe that exploring it could offer significant benefits for the interpretability of LLMs.
### A.3 Residual Quantization and Smoothness

Section [5.1](https://arxiv.org/html/2605.08894#S5.SS1) points out the challenge of optimizing the complete objective (Equation [9](https://arxiv.org/html/2605.08894#S5.E9)), as the two sub-objectives are orthogonal. Yet Section [6.2](https://arxiv.org/html/2605.08894#S6.SS2) clarifies that this goal is not impossible; it merely constrains the solution space.
Figure 13: Important/abnormal gradient channels and relaxed optimization.

While deriving a closed-form solution for Equation [9](https://arxiv.org/html/2605.08894#S5.E9) is challenging, it remains feasible to partially fulfill the smoothness objective while prioritizing accuracy. The extreme sparsity of LLM gradients underpins this feasibility. As illustrated in Figure [13](https://arxiv.org/html/2605.08894#A1.F13), input and output gradients across linear layers remain negligible in the vast majority of channels. This implies that by prioritizing the precision of a select few columns, we can ensure that gradient propagation remains unimpaired. As a result, selecting key column vectors and applying high precision to them is an intuitively viable strategy. Here, the objective in $\min_{\hat{\mathbf{W}}} \|\mathbf{W}^{\top}\mathbf{G} - \hat{\mathbf{W}}^{\top}\mathbf{G}\|_F^2$ is relaxed, effectively yielding an approximate solution.
In fact, similar ideas have already emerged in the field of extreme quantization [[16](https://arxiv.org/html/2605.08894#bib.bib7), [42](https://arxiv.org/html/2605.08894#bib.bib21), [21](https://arxiv.org/html/2605.08894#bib.bib5)]. Methods such as residual quantization select a subset of pivotal column vectors and fit their quantization errors, significantly enhancing extreme-compression performance. The distinction between our proposal and such methods lies in the selection criterion. Residual quantization evaluates column importance based on activation error, defined as $I_i = (w_i - \hat{w}_i) x_i$. In contrast, our relaxed formulation uses the magnitude of the backward gradient $\nabla_{x_i} f$ as the importance metric. Combining these two indicators may pave the way for even more potent extreme-compression algorithms, a direction we leave for future research.
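To make the contrast concrete, here is a small sketch of the two selection criteria side by side. The function name, shape conventions, and the exact scoring formulas are our assumptions for illustration, not the paper's implementation.

```python
import torch

def select_hi_precision_cols(W, W_hat, X, G, k=8, criterion="grad"):
    """Pick k input channels (columns of W) to keep in high precision.

    criterion="act":  activation-error importance I_i = (w_i - w_hat_i) x_i,
                      as in residual-quantization methods.
    criterion="grad": magnitude of the backward gradient at each input
                      channel, |grad_{x_i} f|, as in our relaxed objective.

    Shapes (assumed): W, W_hat are (d_out, d_in); X is (n, d_in)
    calibration inputs; G is (n, d_in) loss gradients w.r.t. the inputs.
    """
    if criterion == "act":
        err = (W - W_hat).abs().sum(dim=0)   # per-column weight error
        score = err * X.abs().mean(dim=0)    # scaled by activation size
    else:
        score = G.abs().mean(dim=0)          # per-channel gradient size
    return torch.topk(score, k).indices      # columns to keep in FP16
```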
### A.4 Experimental Settings

#### A.4.1 LGP

We evaluate the effectiveness of introducing LGP on two mainstream models, LLaMA-2 and Qwen-3. The model sizes range from 0.6B to 13B.
##### Baselines
We select the original model, 2-bit GPTQ, and 2-bit OmniQuant as our baselines. The original model serves as a reference to quantify the performance gap induced by quantization. By comparing OmniQuant with GPTQ, we highlight the accuracy gains achieved by learnable weight clipping. Our method is designed to demonstrate the effectiveness of introducing LGP for joint optimization. For calibration, all methods utilize 128 sequences from the C4 dataset, each with a length of 2048. The quantization group size is set to 128.
##### LGP
For all models, regardless of whether LGP is incorporated, the layer-wise distillation process runs for 40 epochs. The coefficient $\alpha_1$ is typically set to $10^K$, where $K$ is a positive integer, as listed in Table [1](https://arxiv.org/html/2605.08894#S5.T1).
##### Evaluation
To assess the performance of the baselines and LGP, we calculate perplexity on sequences randomly sampled from WikiText2 [[26](https://arxiv.org/html/2605.08894#bib.bib36)] and C4 [[30](https://arxiv.org/html/2605.08894#bib.bib37)]. Additionally, we report zero-shot accuracy across a range of tasks, including WinoGrande [[31](https://arxiv.org/html/2605.08894#bib.bib38)], HellaSwag [[43](https://arxiv.org/html/2605.08894#bib.bib39)], PIQA [[4](https://arxiv.org/html/2605.08894#bib.bib40)], BoolQ [[6](https://arxiv.org/html/2605.08894#bib.bib41)], ARC [[7](https://arxiv.org/html/2605.08894#bib.bib42)], OBQA [[27](https://arxiv.org/html/2605.08894#bib.bib43)], MathQA [[1](https://arxiv.org/html/2605.08894#bib.bib44)], and RTE [[37](https://arxiv.org/html/2605.08894#bib.bib45)].
#### A.4.2 LGR

We evaluate the FP16 baseline against the standard B158 model and its smoothed counterpart. We train models of three different sizes on the OpenWebText2 dataset; their configurations are listed in Table [3](https://arxiv.org/html/2605.08894#A1.T3), with detailed training settings provided in Section [A.6](https://arxiv.org/html/2605.08894#A1.SS6). We fix $\alpha_2 = 0.01$ for all runs. The evaluation protocol aligns with that of Section [A.4.1](https://arxiv.org/html/2605.08894#A1.SS4.SSS1).

Table 3: Configurations of different model sizes.
### A.5 Ablation

LGP incorporates a parameter $\alpha_1$ to balance the objectives of fitting fidelity and smoothness. An overly small $\alpha_1$ renders LGP ineffective, while an overly large value disrupts the fitting process and degrades accuracy. To avoid both failure modes, we employ a heuristic for selecting $\alpha_1$. Given that the norm of the backward gradients is smaller than the fitting loss by several orders of magnitude, we choose an $\alpha_1$ large enough to bring both terms to a comparable scale. For example, for LLaMA-2-7B, we search across magnitudes ranging from $10^4$ to $10^8$. The effect of varying $\alpha_1$ is shown in Figure [14](https://arxiv.org/html/2605.08894#A1.F14).
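A minimal sketch of this scale-matching heuristic, assuming access to measured magnitudes of the two loss terms; the function name and rounding rule are ours, and in practice the resulting $K$ would seed the magnitude search described above.

```python
import math

def suggest_alpha1(fit_loss: float, grad_norm_term: float) -> float:
    """Pick alpha_1 = 10^K so that alpha_1 * (gradient term) lands on
    the same order of magnitude as the fitting loss."""
    ratio = fit_loss / max(grad_norm_term, 1e-12)
    K = max(1, round(math.log10(ratio)))  # e.g. search 1e4..1e8 for 7B
    return 10.0 ** K
```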
Figure 14: Comparison of zero-shot accuracy under different values of the coefficient $\alpha_1$.

Table 4: Ablation of different input gradients. We report the 0.4B model trained for 400k steps with $\alpha_2 = 0.1$.

Table [4](https://arxiv.org/html/2605.08894#A1.T4) shows the superiority of utilizing input gradients from the 1st layer. In contrast, relying on the 0th layer may have a detrimental impact on training results. We attribute this potential failure to the corruption of embedding representations.
### A.6 Details on Baselines

In this section, we provide supplementary implementation details for the evaluated methods. All code will be released upon de-anonymization of the paper.
##### GPTQ
We adhere to the default configurations provided in the official open-source repositories. Specifically, we set the random seed for data sampling to 0 and employ a Hessian damping factor of 0.01. Moreover, we utilize asymmetric quantization, activation-aware channel reordering, and true sequential quantization.
##### OmniQuant
We set the random seed for data sampling to 0. The learning rate for LWC is configured at 0.01, and the model is trained for alignment for 40 epochs using the AdamW optimizer. Additionally, we employ asymmetric quantization and the auxiliary loss from their official code.
##### LGP
Following OmniQuant, we quantize layers sequentially in a shallow-to-deep manner. For each layer, we first perform a full forward and backward pass through the entire model to compute the layer's input and output gradients. Subsequently, we calculate the smoothness loss of the quantized model and add it to the original OmniQuant loss. Other configurations remain identical to those of OmniQuant.
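A hedged sketch of how such a joint per-layer objective could be assembled is given below. The MSE form for both terms and the argument names are our assumptions; the paper only states that the smoothness loss is added to the OmniQuant loss.

```python
import torch.nn.functional as F

def lgp_layer_loss(y_fp, y_q, g_in_fp, g_in_q, g_out_fp, g_out_q, alpha1):
    """Per-layer joint objective: fitting fidelity plus smoothness.

    y_*:  layer outputs of the full-precision / quantized model;
    g_*:  input and output gradients of this layer, obtained from a
          full forward-backward pass through the whole model.
    """
    fit = F.mse_loss(y_q, y_fp)  # OmniQuant-style distillation term
    smooth = F.mse_loss(g_in_q, g_in_fp) + F.mse_loss(g_out_q, g_out_fp)
    return fit + alpha1 * smooth
```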
##### LGR
We conduct training on 8 NVIDIA A800 GPUs, with a per-device batch size of 1 and 16 gradient-accumulation steps. We employ the AdamW optimizer with hyperparameters $\beta_1 = 0.9$, $\beta_2 = 0.95$ and a weight decay of 0.1. The learning rate follows a cosine schedule with a 1000-step warmup. To aid reproducibility, the random seeds for Python, NumPy, PyTorch, and CUDA are all fixed at 1234.
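For concreteness, a minimal sketch of this setup follows; the peak learning rate and total step count are placeholders of our choosing, while the betas, weight decay, warmup, and seeds are as stated above.

```python
import math
import random
import numpy as np
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def set_seed(seed: int = 1234):
    """Fix Python, NumPy, PyTorch, and CUDA seeds, as in our runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def build_optimizer(model, lr=3e-4, total_steps=400_000, warmup=1000):
    """AdamW with cosine decay after a 1000-step linear warmup."""
    opt = AdamW(model.parameters(), lr=lr,
                betas=(0.9, 0.95), weight_decay=0.1)

    def schedule(step):
        if step < warmup:
            return step / warmup
        t = (step - warmup) / max(1, total_steps - warmup)
        return 0.5 * (1 + math.cos(math.pi * t))

    return opt, LambdaLR(opt, schedule)
```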
### A.7 Limitations

Although we propose a novel smoothness-aware perspective to complete the optimization objectives for extreme compression, we acknowledge three main limitations that point toward future directions.

Firstly, our proposed methods (LGP and LGR) function as foundational baselines. We prioritized simplicity and versatility to demonstrate the validity of the smoothness hypothesis. Consequently, these methods may not represent the optimal mathematical solution to the joint optimization problem we formulated. We anticipate that future work can build upon our dual-constraint analysis to design more advanced solvers.

Secondly, while we provide intuitive explanations such as the “Gradient Ridge”, its underlying causes are not yet fully understood. We regard this work as a stepping stone that invites the community to delve deeper into its underpinnings, which may contribute to the explainability of LLMs.
Lastly, constrained by available computational resources, our experiments are confined to models of moderate scale, trained for a limited number of steps on a restricted dataset. While we observe consistent improvements within this scope, the behavior of smoothness constraints under massive-scale pre-training remains an open question. We hope future work will scale these experiments further.