Joint Structural Pruning and Mixed-Precision Quantization for LLM Compression
Summary
A novel end-to-end framework for LLM compression that jointly optimizes structural pruning and mixed-precision quantization, achieving significant perplexity reductions and speedups over state-of-the-art methods, especially at ultra-low bit precisions.
# Joint Structural Pruning and Mixed-Precision Quantization for LLM Compression
Source: [https://arxiv.org/html/2606.07819](https://arxiv.org/html/2606.07819)
1 UiT The Arctic University of Norway; 2 University of Oslo, Norway

Email: {hoang.l.la,phuong.hoai.ha}@uit.no, {truongl,amirhost}@ifi.uio.no

###### Abstract
Recently, the efficiency of Large Language Models (LLMs) deployment has become a critical concern in practical applications. While post-training quantization (PTQ) and structural pruning are established techniques for reducing memory footprint and inference latency, most existing PTQ approaches optimize quantization errors on a per-layer basis, overlooking how errors accumulate and propagate through the network, often resulting in suboptimal solutions. Traditional pipelines also tend to apply pruning and quantization in isolation or sequentially, further compounding sub-optimality. We introduce a novel end-to-end framework that addresses these limitations in two key ways. First, we propose a novel mixed-precision PTQ strategy that directly minimizes global error propagation across the entire model, rather than isolating layer-wise errors. Building on this, we develop a novel joint optimization approach that simultaneously learns structural pruning decisions and mixed-precision quantization policies within a unified search space. Extensive experiments show that, at ultra-low precisions (1–3 bits), our quantization method reduces WikiText perplexity by up to 21% compared to state-of-the-art (SoTA) weight-activation quantization baselines. Against leading weight-only quantization methods, it achieves up to 59% and 85% lower perplexity on WikiText and C4, respectively. Compared to the SoTA joint pruning-and-quantization techniques, our proposed method delivers superior perplexity and reasoning performance at ultra-low bits. Furthermore, in mainstream mixed-precision settings (e.g., 4-bit/8-bit), our compressed models remain highly competitive in terms of perplexity on WikiText and zero-shot reasoning accuracy, while delivering up to $2\times$ faster prefill, $6.5\times$ peak memory reduction during decoding (vs. FP16), up to 30% faster inference, and an additional 10% memory savings compared to models compressed by SoTA methods.
## 1 Introduction
The efficiency of large language models \(LLMs\) deployment has become a major research focus in recent years\. To enable deployment on resource\-constrained devices, prior works have extensively investigated model compression for pre\-trained LLMs\. Pruning and quantization are the two most common techniques, as they reduce memory footprint and computational requirements while preserving reasoning performance for efficient edge inference\.
Pruning eliminates redundant parameters to reduce resource usage\. However, compared to quantization at the same high compression ratio \(e\.g\., 75% \), pruned models often exhibit inferior reasoning performance\[kuzmin2023pruning\]\. Quantization maps weights \(and optionally activations\) to lower\-bit representations and typically preserves better performance for a given compression ratio\. It encompasses quantization\-aware training \(QAT\), which requires fine\-tuning, and post\-training quantization \(PTQ\), which operates directly on pre\-trained weights without retraining\. This work focuses on PTQ, as it avoids the costly retraining or fine\-tuning steps associated with QAT\.
Recent mixed\-precision PTQ methods\[zhao2024atom,zhao2025ptq1,huang2024slim,huang2024billm\]for ultra\-low\-bit LLMs typically partition linear\-layer weights into salient and non\-salient channels\. Non\-salient channels are quantized to very low precisions \(e\.g\., 1\-4 bits\), while salient ones retain higher precision \(e\.g\., 8 bits\)\. These methods generally rely on layer\-wise greedy search guided by local quantization loss, which overlooks global error propagation and applies uniform saliency criteria across layers despite varying inter\-layer sensitivities\. This often results in suboptimal compression\[arai2025quantization\]\. Our work introduces a novel mixed\-precision PTQ approach to address these limitations\.
To further improve reasoning and generation abilities under aggressive compression, recent studies also explore joint pruning and quantization, showing the two are non\-orthogonal both theoretically and empirically\[harmaeffective\]\. Methods like SparseGPT\+GPTQ\[frantar2023sparsegpt\]and OBR\[guo2025optimal\]support joint unstructured/semi\-structured compression with error compensation, but offer limited speedup compared to structured methods\. We introduce a joint structured pruning and mixed\-precision PTQ framework that combines our novel quantization approach with DISP\-LLM\[gao2024disp\], enabling co\-optimization over the joint search space\.
Our main contributions are as follows:
- We introduce a novel mixed-precision post-training quantization (PTQ) framework that reformulates bit-width allocation as a binary mask optimization problem. A hypernetwork, trained directly on the end-to-end task loss, learns and optimizes these binary masks. Unlike prior methods that rely on fixed, hand-crafted thresholds to separate salient from non-salient weights, our approach dynamically identifies the most important (salient) weights per linear layer using the global end-to-end loss. This adaptive, loss-driven strategy overcomes the local greedy nature and suboptimality of earlier techniques. Compared to state-of-the-art (SoTA) weight-activation PTQ baselines, our method reduces WikiText-2 perplexity by up to 21% and improves average zero-shot accuracy across six reasoning benchmarks by up to 4.5%. Against SoTA weight-only PTQ baselines, it achieves up to 59% and 85% lower perplexity on WikiText-2 and C4, respectively, along with up to 5.4% higher average zero-shot accuracy on the reasoning tasks.
- An integrated structured pruning and mixed-precision quantization framework, named train-once-get-all (TOGA), that delivers SoTA perplexity and zero-shot reasoning performance across diverse tasks.
- Custom CUDA kernels for efficient mixed-precision matrix multiplication (e.g., W4A4 + W8A8). Leveraging these kernels, our compressed models achieve up to $2\times$ prefill speedup and $6.5\times$ peak memory reduction compared to FP16, while outperforming the strongest prior 2:4 semi-structured sparsity techniques by 30% faster prefill and 10% peak memory savings.
## 2 Related Work and Our Advancements beyond SoTA
### 2.1 Mixed-precision Post-Training Quantization techniques
Recent mixed\-precision quantization \(MPQ\) methods assign higher bit\-widths to salient \(sensitive\) weight channels according to their quantization error sensitivity, typically estimated from local layer\-level information such as Hessian information, gradients, or activation statistics\[huang2024slim,huang2024billm,zhao2024atom,zhao2025ptq1\]\. Weight\-only MPQ has substantially improved memory efficiency in memory\-bound inference scenarios, while activations remain at 16\-bit precision\. For instance, PTQ\-1\.61\[zhao2025ptq1\]and BiLLM\[huang2024billm\]use per\-layer Hessian\-based metrics to distinguish salient from non\-salient weights\. Slim\-LLM\[huang2024slim\]applies a local greedy search within each linear layer to determine optimal bit\-width allocations for its weight matrix\.
Weight\-activation MPQ directly addresses activation outliers for greater efficiency gains\. Atom\[zhao2024atom\]preserves critical activations by identifying them based on their magnitude, reordering them \(along with corresponding weight channels\) to the end of the matrices, then quantizing salient channels to higher precision \(8\-bit\) while applying lower precision \(3\- or 4\-bit\) to non\-salient ones\. In another approach, ResQ\[saxena2024resq\]employs principal component analysis \(PCA\) to separate sensitive and non\-sensitive components, applying W4A4 quantization to non\-sensitive weight\-activation channels and W8A8 to the rest\.
Advancement beyond SoTA: These methods share two key limitations. First, they typically rely on fixed, uniform saliency thresholds to identify sensitive weights, thereby overlooking differences in layer-wise sensitivity across models and architectures. Second, they allocate bits using only local, layer-level sensitivity metrics. As a result, they fail to account for the quantization error that accumulates as signals propagate through the entire network [arai2025quantization]. Our approach addresses both issues by optimizing binary masks to identify salient weights across the entire model architecture, thereby enabling flexible, non-uniform bit allocation across linear layers, while directly optimizing the global language modeling loss.
### 2.2 Pruning+Quantization methods
Harma et al\.\[harmaeffective\]demonstrate that pruning and quantization are non\-orthogonal, and that the order of applying them significantly impacts performance\. In LLMs, a prune\-then\-quantize sequence consistently yields lower perplexity than quantize\-then\-prune or applying either technique in isolation\. Similarly, in vision models, Kuzmin et al\.\[kuzmin2023pruning\]show that quantized models generally preserve higher accuracy than pruned models at equivalent compression ratios\. Their findings further reveal that combining mild pruning with higher\-precision quantization produces a superior accuracy compared to aggressive low\-bit quantization alone\.
SparseGPT\[frantar2023sparsegpt\]builds on the Optimal Brain Surgeon \(OBS\) framework\[hassibi1993optimal\]to perform unstructured pruning and quantization of LLMs\. The authors show that applying both techniques jointly by combining SparseGPT with GPTQ\[frantar2022gptq\]outperforms using either method alone\. More recently, OBR\[guo2025optimal\]introduces an explicit error\-compensation step between pruning and quantization and uses the OBS principles to better align their conflicting effects on weight distributions\.
Advancement beyond the SoTA: Both SparseGPT and OBR rely on unstructured or semi-structured sparsity patterns. While unstructured sparsity offers strong theoretical compression, it provides limited practical inference speedups on standard GPU hardware [hoefler2021sparsity]. On the other hand, semi-structured sparsity can deliver practical speedups, but it is constrained by current hardware: NVIDIA GPUs primarily support only the 2:4 pattern (50% sparsity with a specific 2-out-of-4 non-zero layout). When targeting other sparsity ratios (e.g., 40% or 60%), these methods typically fall back to fully unstructured pruning, which remains highly inefficient on existing hardware and yields little to no real speedup. In contrast, to the best of our knowledge, this work introduces the first joint *structured pruning* and *mixed-precision quantization* framework for large language models. By producing hardware-friendly, dense-compatible matrices, our approach enables efficient execution on standard GPUs while consistently achieving superior accuracy–efficiency trade-offs compared to prior joint unstructured or semi-structured methods. Moreover, the structurally pruned models produced by our method achieve up to 30% faster inference and 10% lower peak memory usage, with better perplexity and reasoning performance, than 2:4 semi-structured pruned models from previous works. Please see Section [4](https://arxiv.org/html/2606.07819#S4) for detailed experimental results.
### 2.3 Structural Pruning for LLM with Binary Masks
Transformers: LLMs are predominantly based on the decoder-only transformer architecture [vaswani2017attention]. Specifically, each transformer block comprises two main submodules: multi-head self-attention (MHA) and a feed-forward network (FFN), each followed by a residual connection and layer normalization. Hence, given input $X$ to a transformer block, the core computations are described as:

$$\text{Attention}(X) = \text{MHA}(XW_q, XW_k, XW_v)\,W_o, \tag{1}$$

$$\text{MLP}(X) = \bigl(\sigma(XW_{\text{gate}}) \odot (XW_{\text{up}})\bigr)\,W_{\text{down}}, \tag{2}$$

where MHA captures position-wise dependencies using multiple attention heads, each defined by linear projections for queries ($W_q$), keys ($W_k$), values ($W_v$), and output ($W_o$). The FFN (also called MLP) applies gate ($W_{\text{gate}}$), up-projection ($W_{\text{up}}$), and down-projection ($W_{\text{down}}$) matrices, with non-linearity $\sigma$ after the gate.
Binary Masks: Our proposed method draws inspiration from DISP-LLM [gao2024disp], which frames structured pruning as a learnable binary mask optimization problem. Let $L$ denote the total number of binary masks across the model, and let $\mathcal{P} = \{P_i\}_{i=1}^{L}$ be the set of learnable binary masks, where each $P_i \in \{0,1\}^{d_i}$ indicates which channels (input or output) of the corresponding weight matrix are retained ($1$) or pruned ($0$). For a linear layer with full-precision weight matrix $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$, structured pruning is applied using an input mask $P_{\text{in}} \in \{0,1\}^{d_{\text{in}}}$ and an output mask $P_{\text{out}} \in \{0,1\}^{d_{\text{out}}}$. The pruned weight matrix is obtained as:

$$F_{\text{prune}}(W, P_{\text{in}}, P_{\text{out}}) = \operatorname{diag}(P_{\text{in}})\, W\, \operatorname{diag}(P_{\text{out}}) = P_{\text{in}}^{T} W P_{\text{out}}. \tag{3}$$
Applying this to the attention and feed\-forward modules yields:
$$\text{Attention}(X) = \text{MHA}(XP_1, XP_1, XP_1)\,(W_o P_2), \tag{4}$$

$$\text{MLP}(X) = \bigl(\sigma(XP_3 W_{\text{gate}}) \odot (XP_3 W_{\text{up}})\bigr)\,(P_4^{T} W_{\text{down}} P_5), \tag{5}$$

where $\{P_i\}_{i=1}^{5}$ are the pruning masks for linear layers in a transformer block.
Following DISP\-LLM, in this work, we only prune the input and output dimensions of attention modules and the input, intermediate, and output dimensions of the MLP module, while keeping the number of attention heads and head dimension fixed\.
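As an illustration of Equation (3), the following is a minimal NumPy sketch rather than the authors' implementation. It assumes the weight is stored as a `(d_in, d_out)` array so that a layer computes `X @ W`, matching the `XW` convention of Equations (1) and (2). The helper names `prune_weight` and `compact_weight` are hypothetical.

```python
import numpy as np

def prune_weight(W, p_in, p_out):
    """Masked form of Eq. (3): diag(p_in) @ W @ diag(p_out), same shape as W."""
    return (p_in[:, None] * W) * p_out[None, :]

def compact_weight(W, p_in, p_out):
    """Physically drop pruned rows/columns, leaving a smaller dense matrix."""
    return W[p_in.astype(bool)][:, p_out.astype(bool)]

rng = np.random.default_rng(0)
d_in, d_out = 6, 4
W = rng.standard_normal((d_in, d_out))   # assumed layout: (d_in, d_out)
X = rng.standard_normal((3, d_in))       # three token activations

p_in = np.array([1, 0, 1, 1, 0, 1])      # keep 4 of 6 input channels
p_out = np.array([1, 1, 0, 1])           # keep 3 of 4 output channels

masked = prune_weight(W, p_in, p_out)
dense = compact_weight(W, p_in, p_out)

# The masked layer and the compact dense layer agree on the kept outputs.
y_masked = X @ masked
y_dense = X[:, p_in.astype(bool)] @ dense
```

The compact form is what makes structured pruning hardware-friendly: the pruned layer remains an ordinary dense matrix multiplication, only smaller.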
Search for optimal binary masks: Let $S$ denote a set of $L$ binary vectors $s_l$ that govern the joint pruning and quantization of the model, which consists of $L$ linear layers. While the search for the optimal configuration $S^{*}$ can be carried out with computationally intensive techniques such as evolutionary algorithms [tang2025darwinlm] and reinforcement learning, motivated by DISP-LLM, we advocate for a more efficient hypernetwork-based approach. To guide the hypernetwork towards configurations that satisfy a desired budget $b$, let $B(S)$ be a differentiable function that estimates the expected budget (e.g., effective sparsity, bit-width-averaged memory usage, or memory savings) induced by configuration $S$. The budget regularization term is defined as follows:

$$R(b, B(S)) = \log\left(\frac{\max(b, B(S))}{\min(b, B(S))}\right) \tag{6}$$

This term regularizes the expected budget $B(S)$ to match the target $b$. Following DISP-LLM [gao2024disp], the hypernetwork is trained using a loss function that combines the standard cross-entropy language modeling objective $L_{\text{CE}}$ with the budget regularization $R$. Let $\lambda$ denote a hyperparameter controlling the magnitude of the regularization term. The final training loss is a weighted sum of these components:

$$\min_{\theta}\; L_{\text{CE}}(X, W, S) + \lambda R(b, B(S)) \tag{7}$$
In this work, we use a similar hypernetwork architecture designed in DISP\-LLM, which is described in more detail in Section 5 in the Appendix\.
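The budget regularizer in Equation (6) and the combined objective in Equation (7) can be sketched in plain Python as below. This is only a scalar illustration: in training, $B(S)$ would be a differentiable tensor. The function names and example values are hypothetical.

```python
import math

def budget_regularizer(target, expected):
    """Eq. (6): log(max(b, B(S)) / min(b, B(S))).

    Equivalent to |log b - log B(S)|, so it is zero only when the expected
    budget equals the target and grows symmetrically on either side.
    """
    if target <= 0 or expected <= 0:
        raise ValueError("budgets must be positive")
    return math.log(max(target, expected) / min(target, expected))

def total_loss(ce_loss, target, expected, lam):
    """Eq. (7): cross-entropy loss plus the weighted budget regularizer."""
    return ce_loss + lam * budget_regularizer(target, expected)

# Illustrative values only: target budget 0.5, current expected budget 0.25.
penalty = budget_regularizer(0.5, 0.25)
loss = total_loss(ce_loss=2.0, target=0.5, expected=0.25, lam=3.0)
```

Because the penalty depends only on the ratio between $b$ and $B(S)$, overshooting and undershooting the budget by the same factor are penalized equally.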
Regarding the training of the hypernetwork, let $o_l$ denote the continuous output produced by the hypernetwork. To obtain the final discrete binary masks $s_l \in \{0,1\}^{d_l}$, we follow a procedure similar to DISP-LLM [gao2024disp]: we apply the Gumbel-Softmax trick [jang2016categorical] combined with the Straight-Through Estimator [bengio2013estimating] to generate differentiable yet discrete binary vectors $s_l$ from the continuous vectors $o_l$. This enables end-to-end gradient propagation through the discrete mask selection during training. Full details of this differentiable binarization step are provided in Section 5.2 of the Appendix.
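The binarization step can be sketched in NumPy as follows. The sketch assumes the two-class special case of Gumbel-Softmax, in which the difference of two Gumbel samples is logistic noise added to the logits. Since NumPy has no autograd, the straight-through gradient is returned explicitly. The exact procedure used in the paper is in its appendix and may differ; all names here are hypothetical.

```python
import numpy as np

def sample_logistic(shape, rng, eps=1e-6):
    """Logistic(0, 1) noise, i.e. the difference of two Gumbel(0, 1) samples."""
    u = rng.uniform(low=eps, high=1.0 - eps, size=shape)
    return np.log(u) - np.log1p(-u)

def gumbel_sigmoid_ste(logits, tau, rng):
    """Return (hard_mask, soft_mask, surrogate_grad).

    Forward pass: the hard 0/1 mask is what the masked model sees.
    Backward pass (straight-through): gradients flow as if the soft
    relaxation had been used, i.e. d soft / d logits. In an autograd
    framework this is commonly written hard + soft - stop_gradient(soft).
    """
    noise = sample_logistic(logits.shape, rng)
    soft = 1.0 / (1.0 + np.exp(-(logits + noise) / tau))
    hard = (soft > 0.5).astype(logits.dtype)
    surrogate_grad = soft * (1.0 - soft) / tau
    return hard, soft, surrogate_grad

rng = np.random.default_rng(0)
o_l = np.array([4.0, -4.0, 0.5, -0.5, 8.0, -8.0])  # continuous hypernetwork output
s_l, soft, grad = gumbel_sigmoid_ste(o_l, tau=0.5, rng=rng)
```

Lower temperatures `tau` push the soft relaxation closer to the hard mask, at the cost of higher-variance gradients.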
Advancement beyond SoTA: While DISP-LLM targets only structured pruning, our method extends the hypernetwork paradigm to first support MPQ and then enable joint structured pruning and MPQ for LLMs. Additionally, unlike prior layer-wise MPQ methods that consider only local layer-level errors, motivated by DISP-LLM [gao2024disp], our approach trains the hypernetwork directly on end-to-end language modeling loss. This global perspective yields superior quantization quality compared to the SoTA layer-wise baselines (see Section [4.3](https://arxiv.org/html/2606.07819#S4.SS3)).
## 3 Method
### 3.1 Quantization with binary masks
This section outlines the general framework for mixed-precision quantization (MPQ) in LLMs. We define a binary mask $M$ to distinguish between sensitive and non-sensitive weight channels. Specifically, an element $m_i \in \{0,1\}$ within the mask is set to 1 if the $i$-th channel is identified as salient, and 0 otherwise. Let $Q_h(\cdot)$ and $Q_l(\cdot)$ denote the quantization functions that map weights to high precision (e.g., 4-bit or 8-bit) and low precision (e.g., 1-bit, 2-bit, or 3-bit), respectively. The mixed-precision quantized weight matrix is then formulated as:

$$F_{\text{quant}}(W, M) = Q_h(W) \cdot M + Q_l(W) \cdot (1 - M) \tag{8}$$

Similarly, let $X$ denote the activation input to a linear layer. For activation quantization, we apply the same formulation:

$$F_{\text{quant}}(X, M) = Q_h(X) \cdot M + Q_l(X) \cdot (1 - M) \tag{9}$$
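Equation (8) can be illustrated with a small NumPy sketch. It makes several assumptions not stated in the text: $Q_h$ and $Q_l$ are stood in for by a symmetric uniform quantize-dequantize with one scale per input channel, the weight is stored as `(d_in, d_out)`, and the mask selects input channels (rows). The 1-bit binarization case is not covered, and all names are hypothetical.

```python
import numpy as np

def fake_quantize(W, bits):
    """Symmetric uniform quantize-dequantize, one scale per row (bits >= 2)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)   # guard all-zero rows
    return np.clip(np.round(W / scale), -qmax, qmax) * scale

def mixed_precision_quantize(W, mask, high_bits=8, low_bits=3):
    """Eq. (8): Q_h(W) * M + Q_l(W) * (1 - M), with M broadcast over rows."""
    m = mask[:, None].astype(W.dtype)
    return fake_quantize(W, high_bits) * m + fake_quantize(W, low_bits) * (1.0 - m)

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))             # assumed layout: (d_in, d_out)
mask = np.array([1, 0, 0, 0, 0, 0, 0, 1])    # two salient input channels

W_q = mixed_precision_quantize(W, mask)
err = np.abs(W_q - W).max(axis=1)            # worst-case error per channel
```

Salient rows inherit the tighter 8-bit rounding bound (half a step of `max|w| / 127`), while the remaining rows carry the coarser 3-bit bound (half a step of `max|w| / 3`).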
Previous MPQ methods [zhao2024atom,huang2024billm,zhao2025ptq1,huang2024slim] derive quantization masks for each layer based on the layer's quantization error. These approaches typically rely on a calibration dataset to evaluate weight importance via activation magnitudes or Hessian-based sensitivity analysis. By applying a greedy search at the layer level, they assume that the quantization choices between layers are independent and overlook cumulative errors and inter-layer dependencies, which leads to sub-optimal solutions. In contrast, we resolve this search problem by considering the actual final loss of the model in an end-to-end manner, using the hypernetwork-based mask search described in Section [2.3](https://arxiv.org/html/2606.07819#S2.SS3).
Notably, we reorder the input channels of each weight matrix based on their corresponding activation magnitudes\. Following prior work\[zhao2024atom\], these magnitudes are estimated as the mean absolute activation value computed over a calibration set of 128 samples from WikiText\-2\. This reordering strategy clusters high\-magnitude outlier channels into the same groups, preventing a few extreme values from inflating the quantization scales of more sensitive weights\. Additionally, consistent with previous MPQ methods\[zhao2024atom,saxena2024resq,liuspinquant\], we adapt GPTQ\[frantar2022gptq\]to further refine the quantized weights and reduce residual quantization error\. Ablation studies on Llama\-2\-7B \(Table[3](https://arxiv.org/html/2606.07819#S4.T3)\) confirm that incorporating both activation\-based channel reordering and GPTQ refinement consistently improves the perplexity of the quantized model\.
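The activation-based reordering can be sketched as follows, assuming a `(d_in, d_out)` weight layout and an ascending sort so that the highest-magnitude channels end up last. The calibration data here is synthetic and the function name is hypothetical. The sketch only shows that permuting input channels consistently in the activations and the weights leaves the layer output unchanged.

```python
import numpy as np

def reorder_by_activation(X_calib, W):
    """Sort input channels by mean |activation| so outlier channels come last."""
    magnitude = np.abs(X_calib).mean(axis=0)       # one value per input channel
    perm = np.argsort(magnitude, kind="stable")    # ascending order
    return perm, W[perm, :]

rng = np.random.default_rng(0)
n, d_in, d_out = 128, 8, 4                         # 128 calibration samples
X = rng.standard_normal((n, d_in))
X[:, 2] *= 50.0                                    # synthetic outlier channels
X[:, 5] *= 20.0
W = rng.standard_normal((d_in, d_out))

perm, W_perm = reorder_by_activation(X, W)

# Applying the same permutation to activations and weights preserves X @ W.
Y_ref = X @ W
Y_perm = X[:, perm] @ W_perm
```

With the outlier channels grouped at the end, they can share quantization groups with each other instead of inflating the scales of ordinary channels.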
### 3.2 Combining quantization and pruning
When combining quantization and pruning, from Equation [3](https://arxiv.org/html/2606.07819#S2.E3) and Equation [8](https://arxiv.org/html/2606.07819#S3.E8), we have two possible options, namely:
1. Quantization-then-pruning: $F_{\text{prune}}(F_{\text{quant}}(W, M), P_{\text{in}}, P_{\text{out}})$
2. Pruning-then-quantization: $F_{\text{quant}}(F_{\text{prune}}(W, P_{\text{in}}, P_{\text{out}}), M)$
In the literature, structured pruning is generally employed as an initial phase to extract an optimized subnetwork. This subnetwork is subsequently quantized, yielding a more compact model with improved inference throughput and accuracy on target hardware [han2016deep] and better perplexity for LLMs [harmaeffective]. Based on these observations from previous works, we choose the second option for joint pruning and quantization: the weight matrix is first pruned, and the resulting sparse weights are subsequently quantized. The final pruned-then-quantized weight matrix is obtained by combining Equation [8](https://arxiv.org/html/2606.07819#S3.E8) and Equation [3](https://arxiv.org/html/2606.07819#S2.E3) as follows:

$$F_{\text{quant}}(P_{\text{in}}^{T} W P_{\text{out}}, M) = Q_h(P_{\text{in}}^{T} W P_{\text{out}}) \cdot M + Q_l(P_{\text{in}}^{T} W P_{\text{out}}) \cdot (1 - M) \tag{10}$$

where $P_{\text{in}}^{T} W P_{\text{out}}$ is the weight matrix $W$ pruned by the binary vectors $P_{\text{in}}$ (input dimension) and $P_{\text{out}}$ (output dimension). Similarly, the activation matrix can also be compressed as follows:

$$F_{\text{quant}}(X P_{\text{in}}, M) = Q_h(X P_{\text{in}}) \cdot M + Q_l(X P_{\text{in}}) \cdot (1 - M) \tag{11}$$

where $X P_{\text{in}}$ is the activation matrix $X$ pruned by the binary vector $P_{\text{in}}$ for the input dimension. It is noteworthy that, unlike the weight matrix, the activation matrix can only be pruned in the input dimension.
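A toy example shows why the order of the two operations matters. It is a constructed illustration, not evidence from the paper. It assumes a uniform 3-bit symmetric quantizer with one scale per input channel (the mask $M$ is omitted for brevity), a `(d_in, d_out)` weight layout, and hypothetical helper names.

```python
import numpy as np

def fake_quantize(W, bits):
    """Symmetric uniform quantize-dequantize, one scale per row."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(W / scale), -qmax, qmax) * scale

def prune(W, p_in, p_out):
    """Structured pruning: drop masked-out rows and columns."""
    return W[p_in.astype(bool)][:, p_out.astype(bool)]

W = np.array([[1.0, 0.3, 8.0],
              [0.5, -2.0, 0.1]])
p_in = np.array([1, 1])
p_out = np.array([1, 1, 0])   # prune the output channel holding the outlier 8.0

# Option 2 (used in the paper): scales are computed after the outlier is gone.
prune_then_quant = fake_quantize(prune(W, p_in, p_out), bits=3)

# Option 1: the outlier sets the scale first, crushing the first row to zero.
quant_then_prune = prune(fake_quantize(W, bits=3), p_in, p_out)
```

In this example the first row survives as roughly `[1.0, 0.33]` under pruning-then-quantization but collapses to `[0, 0]` under quantization-then-pruning, because the pruned outlier still dictated the quantization scale.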
By applying the pruning and quantization operations (as defined in the above equations) to the weights of linear layers and their corresponding activations in LLMs, we construct a masked compressible supernet. This supernet encodes all feasible combinations of structured pruning and MPQ policies within a single model architecture. Unlike conventional sequential approaches that first prune LLMs to a target sparsity level and then quantize the pruned model to a desired precision, our method searches this joint space directly: the hypernetwork is trained end-to-end to progressively identify the optimal joint configuration. This integrated search enables mutual awareness during optimization: when determining pruning decisions, the hypernetwork accounts for the current quantization choices, and vice versa. Figure [1](https://arxiv.org/html/2606.07819#S3.F1) depicts an overview of the masked compressible supernet proposed in this paper. We name our joint pruning-and-quantization method TOGA.
Figure 1: Overview of the masked compressible supernet in TOGA. Gray modules are frozen; colored modules are trainable.
## 4 Experiments
### 4.1 Settings
Datasets. We trained our hypernetworks using the WikiText-2 [merity2016pointer] dataset. To ensure a fair comparison, other calibration-based methods were evaluated using the same datasets employed for our hypernetwork training.
Evaluation Benchmarks. Our evaluation follows the standard pipeline established in prior literature [zhao2025ptq1,gao2024disp]. In particular, we assess model performance using the following metrics and datasets:
- Perplexity: Measured on the WikiText-2 [merity2016pointer] and C4 [raffel2020exploring] datasets.
- Zero-shot accuracy on Common Reasoning: Directly evaluating compressed models without further fine-tuning steps (zero-shot) on common reasoning benchmarks, including ARC Easy/Challenge [allenaiarc], BoolQ [clark2019boolq], Winogrande [ai2_winogrande], HellaSwag [zellers2019hellaswag], and MMLU [hendrycks2021ethics].
Models. We conducted experiments across several widely-used models, including Llama-3.2-1B, Llama-3.2-3B, Llama-2-7B, Llama-2-13B, Llama-3-8B, Llama-3.1-8B, Mistral-7B-v0.3 (Mistral-7B), and Qwen-3-8B.
Experimental Setup. We trained the hypernetworks using a single NVIDIA A100 GPU with 80GB of VRAM. The training duration varied by task: 2,000 steps were allocated for quantization-only experiments, whereas 10,000 steps were used for joint pruning and quantization. All experiments utilized a batch size of 1 and were repeated five times to report the average results. For detailed hyperparameter configurations, see Table 1 of the Appendix.
Baselines. We compare our proposed method against the following baselines:
- Post-Training Quantization (PTQ) methods: To isolate the effect of quantization in our approach, we disable pruning by setting all entries in the pruning vectors to ones. We refer to this quantization-only variant as TOGA-q. In this quantization-only setting, the desired budget $b$ is defined as the percentage of salient weights. We then compare TOGA-q against SoTA PTQ techniques, including:
  - Weight-activation PTQ: We compare against SoTA mixed-precision methods, namely Atom [zhao2024atom] and ResQ [saxena2024resq], as well as a strong uniform-precision baseline, SpinQuant [liuspinquant]. Notably, all weight-activation quantization baselines, including our proposed TOGA-q, employ the GPTQ quantizer [frantar2022gptq] to further improve the quality of the quantized models. For GPTQ, we use a group size of 64 for Llama-3.2-1B and a group size of 128 for all other models.
  - Weight-only PTQ: We include PTQ-1.61 [zhao2025ptq1] (evaluated both with and without LoRA preprocessing), SliM-LLM [huang2024slim], and BiLLM [huang2024billm]. For 1-bit quantization, we adopt a binarization approach consistent with that used in PTQ-1.61 [zhao2025ptq1].
- Sequential pruning + quantization: We compare TOGA (joint pruning and quantization) against a sequential baseline: DISP-LLM pruning followed by PTQ (Atom, ResQ, BiLLM, SliM-LLM). For a fair comparison, we also evaluate a variant of TOGA by adding a sparsity constraint to the loss function in Equation [7](https://arxiv.org/html/2606.07819#S2.E7), and denote it as TOGA-fixed-sparsity. We also include two strong joint baselines: SparseGPT+GPTQ [frantar2023sparsegpt] and OBR [guo2025optimal]. In this joint pruning-and-quantization setting, the desired budget $b$ is the compression budget defined in Section [4.2](https://arxiv.org/html/2606.07819#S4.SS2).
### 4.2 Definitions
For clarity and consistency in the experiments, we define the following key terms used throughout the paper\.
Compression Budget: The compression budget (or compression ratio) is defined as the ratio of the theoretical memory footprint of the compressed model to that of the original FP16 model. For instance, 50% sparsity combined with W4A4 quantization gives a budget of $0.5 \times \frac{4}{16} = 0.125$ (i.e., the model retains 12.5% of the original memory, or achieves 87.5% memory reduction).
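The worked example above can be reproduced with a short helper. This is a sketch of the stated definition only: the function name is hypothetical and the calculation ignores metadata such as scales and zero-points.

```python
def compression_budget(sparsity, weight_bits, baseline_bits=16):
    """Fraction of the FP16 memory footprint kept after pruning + quantization."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must lie in [0, 1)")
    return (1.0 - sparsity) * weight_bits / baseline_bits

budget = compression_budget(sparsity=0.5, weight_bits=4)   # 0.5 * 4/16
memory_reduction = 1.0 - budget
```

For the 50% sparsity and 4-bit case, `budget` evaluates to 0.125 and `memory_reduction` to 0.875, matching the figures in the definition.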
Precision Format: In mixed-precision quantization, the effective average bit-width depends on the chosen precisions for salient and non-salient portions and their respective proportions. For instance, Atom quantizes 97% of weight and activation channels to 3 bits while preserving 3% of salient weights at 8 bits, giving an average of $\approx 3.2$ bits (denoted W(3.2)A(3.2)). Likewise, PTQ-1.61 quantizes 80% of weights to 1 bit and 20% to 4 bits (activations remain 16-bit), yielding an average weight bit-width of 1.6 (denoted W(1.6)A16).
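The average bit-widths quoted above follow from a weighted mean over the salient and non-salient portions. A minimal sketch, with a hypothetical function name and no metadata overhead included:

```python
def average_bits(salient_fraction, high_bits, low_bits):
    """Average bit-width when a fraction of channels is kept at higher precision."""
    return salient_fraction * high_bits + (1.0 - salient_fraction) * low_bits

atom_bits = average_bits(0.03, high_bits=8, low_bits=3)      # 0.24 + 2.91 = 3.15
ptq161_bits = average_bits(0.20, high_bits=4, low_bits=1)    # 0.80 + 0.80 = 1.6
int4_int8_bits = average_bits(0.125, high_bits=8, low_bits=4)  # 1.0 + 3.5 = 4.5
```

The Atom configuration works out to 3.15 bits, which the text reports as approximately 3.2. The last line applies the same formula to a 12.5% INT8 and 87.5% INT4 split.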
### 4.3 Comparison with other mixed-precision PTQ methods
We first evaluate TOGA-q against SoTA PTQ methods at ultra-low precisions (sub-2-bit weight-only and sub-4-bit weight-activation) before comparing all approaches at the more hardware-friendly INT4 and INT8 settings.
Ultra-low precision: We compare TOGA-q with several SoTA mixed-precision PTQ methods. Atom [zhao2024atom] preserves ~3% of salient weights (the last 128 channels of each weight matrix) at 8-bit precision, yielding an average bit-width of 3.2 bits. We adopt the same setup for ResQ to ensure fairness. For TOGA-q, we match this budget by setting the desired budget $b$ to 3% and quantizing salient weights to 8-bit and the rest to 3-bit, directly following the Atom configuration. In the weight-only setting, BiLLM achieves a theoretical average of 1.1 bits but requires additional unstructured binary masks to group salient weights, incurring non-negligible metadata overhead and resulting in an effective average closer to 2.1 bits [zhao2025ptq1]. In contrast, both PTQ-1.61 and TOGA-q allocate 80% of each weight matrix to 1-bit and 20% to 4-bit, with virtually no metadata overhead; this exact combination reproduces the PTQ-1.61 results faithfully.
Table [1](https://arxiv.org/html/2606.07819#S4.T1) presents perplexity results on WikiText-2 along with the average zero-shot accuracy across six standard reasoning benchmarks for the quantized models. Further results for weight-only PTQ methods are provided in Appendix Section 3.1. Overall, TOGA-q consistently surpasses all other SoTA approaches evaluated. Compared to ResQ, the strongest baseline among weight-activation PTQ methods, TOGA-q reduces WikiText-2 perplexity by 5–21% and improves average zero-shot accuracy on the reasoning tasks by 1.6–4.5%. In particular, the leading uniform-precision method SpinQuant experiences total perplexity collapse at 3-bit precision, underscoring the importance of mixed-precision quantization to maintain model coherence in sub-4-bit settings by safeguarding critical weights.
Additionally, although PTQ-1.61†, one of the most competitive weight-only PTQ baselines, relies on a costly preprocessing step (10,000 LoRA fine-tuning iterations on RedPajama), TOGA-q requires neither external datasets nor any fine-tuning, yet it achieves 9–59% lower perplexity on WikiText-2 and 2.3–5.4% higher average zero-shot accuracy on the reasoning tasks. While SliM-LLM remains reasonably competitive, its approach of using three different bit-widths per linear layer introduces extra kernel-launch overhead during inference, frequently resulting in no speedup, or even performance regression, compared to the FP16 baseline [huang2024slim]. By contrast, TOGA-q uses only two bit-widths per layer, delivers 3–50% lower perplexity on WikiText-2 and 16–85% lower perplexity on C4 (see Tables 3 and 4 in the Appendix), and achieves 2–4.21% higher average zero-shot accuracy on the six reasoning tasks.
Table 1: Perplexity (lower is better) on the WikiText-2 dataset and average accuracy (higher is better) on six reasoning tasks for models quantized by different mixed-precision PTQ methods. PTQ-1.61† denotes PTQ-1.61 applied with a preprocessing step that uses the RedPajama dataset.

| Method | Precision | Perplexity (↓): 2-7 | 3-8 | Mistral-7B | Qwen3-8B | Zero-shot Accuracy (↑): 2-7 | 3-8 | Mistral-7B | Qwen3-8B |
|---|---|---|---|---|---|---|---|---|---|
| PTQ-1.61 | W(1.6)A16 | 22.65 | 805.63 | 45.69 | 160.02 | 34.22 | 33.76 | 37.15 | 34.02 |
| BiLLM | W(2.1)A16 | 32.58 | 36.01 | 29.90 | 47.32 | 41.16 | 39.16 | 39.11 | 37.87 |
| SliM-LLM | W2A16 | 16.01 | 40.60 | 16.37 | 23.93 | 43.94 | 36.12 | 40.03 | 43.00 |
| PTQ-1.61† | W(1.6)A16 | 12.70 | 22.35 | 38.51 | 28.34 | 41.43 | 35.59 | 39.77 | 43.53 |
| TOGA-q | W(1.6)A16 | 11.00 | 20.28 | 15.93 | 20.29 | 46.80 | 41.33 | 42.07 | 47.58 |
| SpinQuant | W3A3 | 438.11 | 205.43 | 21.63 | - | 33.06 | 34.26 | 33.96 | - |
| Atom | W(3.2)A(3.2) | 12.12 | 48.53 | 10.34 | 38.78 | 48.13 | 39.57 | 51.71 | 38.36 |
| ResQ | W(3.2)A(3.2) | 7.35 | 15.53 | 6.53 | 15.60 | 52.63 | 46.33 | 58.43 | 42.45 |
| TOGA-q | W(3.2)A(3.2) | 7.30 | 12.24 | 6.40 | 14.16 | 54.20 | 50.81 | 62.60 | 56.03 |

INT4+INT8 Mixed Precision: In this experiment, we employ a combination of INT4 and INT8 precisions for mixed-precision PTQ methods, as both precisions are hardware-friendly and natively supported by NVIDIA GPUs. We evaluate our TOGA-q method against leading mixed-precision PTQ approaches, specifically Atom and ResQ. To further demonstrate the advantages of mixed-precision quantization, we also include SpinQuant, a strong representative of SoTA uniform-precision PTQ methods. For the mixed-precision PTQ baselines, we adopt the same configuration as ResQ: 12.5% of the weights are identified as salient and quantized to INT8, while the remaining weights are quantized to INT4. The same setting is also applied to Atom in this experiment.
Additionally, since the KV cache constitutes one of the largest contributors to memory usage during LLM inference and deployment, we further reduce its memory footprint by quantizing the KV cache to INT4 precision. Table [2](https://arxiv.org/html/2606.07819#S4.T2) reports the average zero-shot accuracy of Llama-family models quantized via these PTQ methods on six reasoning tasks. Overall, consistent with trends observed in ultra-low-bit settings, all MPQ methods outperform SpinQuant, the leading uniform-precision baseline. More importantly, TOGA-q surpasses Atom, which simply identifies and preserves the top 12.5% most sensitive activation channels (along with their corresponding weight channels) at higher precision while aggressively quantizing the rest. This result underscores the effectiveness of TOGA-q's adaptive, loss-driven bit-width allocation across layers. Finally, TOGA-q also outperforms ResQ, the strongest prior MPQ baseline.
Table 2: Zero-shot accuracy (higher is better) of the Llama-family models under different post-training quantization (PTQ) methods across various reasoning tasks. SpinQuant applies uniform W4A4 quantization. All mixed-precision methods quantize 12.5% of the weights (salient weights) to INT8, with the remaining weights quantized to INT4. For all methods, the KV cache is quantized to INT4 precision.

| Model | Method | ARC_e | ARC_c | BoolQ | Hellaswag | WinoGrande | MMLU | Average |
|---|---|---|---|---|---|---|---|---|
| Llama-3.2-1B | 16-bit baseline | 60.6 | 36.5 | 63.4 | 63.6 | 60.1 | 32.8 | 52.8 |
| | SpinQuant | 51.8 | 32.3 | 59.3 | 55.4 | 54.7 | 24.9 | 46.4 |
| | Atom | 58.8 | 33.7 | 58.9 | 56.7 | 53.0 | 28.6 | 48.3 |
| | ResQ | 56.6 | 33.6 | 58.9 | 58.2 | 55.3 | 26.6 | 48.2 |
| | TOGA-q | 59.7 | 33.8 | 59.9 | 57.7 | 57.4 | 26.8 | 49.2 |
| Llama-3.2-3B | 16-bit baseline | 71.7 | 46.2 | 73.1 | 73.7 | 69.1 | 49.5 | 63.9 |
| | SpinQuant | 64.8 | 38.9 | 68.0 | 69.1 | 62.9 | 37.2 | 56.8 |
| | Atom | 69.0 | 40.0 | 64.2 | 69.1 | 62.4 | 44.0 | 58.1 |
| | ResQ | 65.6 | 43.1 | 68.8 | 70.5 | 64.8 | 47.8 | 60.1 |
| | TOGA-q | 70.5 | 41.1 | 72.6 | 70.2 | 63.2 | 46.5 | 60.7 |
| Llama-2-7B | 16-bit baseline | 74.6 | 46.3 | 77.8 | 75.9 | 69.1 | 39.5 | 63.9 |
| | SpinQuant | 71.3 | 43.6 | 73.8 | 73.2 | 65.4 | 33.5 | 60.1 |
| | Atom | 74.0 | 42.7 | 74.9 | 73.5 | 66.9 | 34.8 | 61.1 |
| | ResQ | 72.0 | 43.3 | 75.9 | 73.9 | 66.8 | 37.3 | 61.5 |
| | TOGA-q | 74.5 | 43.5 | 75.6 | 74.3 | 69.3 | 37.6 | 62.5 |
| Llama-3-8B | 16-bit baseline | 77.1 | 53.2 | 81.1 | 79.2 | 73.4 | 56.7 | 70.1 |
| | SpinQuant | 75.4 | 48.0 | 75.8 | 75.4 | 69.2 | 51.2 | 65.8 |
| | Atom | 75.9 | 47.4 | 73.5 | 75.5 | 68.6 | 54.0 | 65.8 |
| | ResQ | 75.0 | 49.2 | 72.5 | 76.5 | 71.0 | 56.6 | 66.8 |
| | TOGA-q | 76.3 | 47.1 | 76.9 | 76.8 | 70.1 | 56.3 | 67.2 |
### 4.4 Comparison with other pruning+quantization methods
In this section, we compare TOGA with other joint pruning and quantization baselines. Figure [2](https://arxiv.org/html/2606.07819#S4.F2) compares our joint structured pruning and mixed-precision quantization method (TOGA) against sequential baselines: DISP-LLM structured pruning (sparsity 0.2–0.9) followed by PTQ via Atom, ResQ, BiLLM, or SliM-LLM. In contrast, TOGA jointly optimizes pruning and quantization in a single unified framework. This allows TOGA to flexibly balance sparsity and precision for any target compression ratio (e.g., at a ratio of ∼0.103, it can yield either 45% sparsity + W3A3 or 59% sparsity + W4A4). For a fair comparison, we add a regularization term (Equation [7](https://arxiv.org/html/2606.07819#S2.E7)) to constrain TOGA to the exact sparsity levels of the baselines; we refer to this variant as TOGA-fixed-sparsity. Furthermore, we exclude OBR due to its failure at ultra-low precision (e.g., Llama-2-7B has a perplexity above 1500 at 10% sparsity with W3A3 precision, or 222.67 with W2A16 precision); a further comparison with the OBR baseline at INT4 precision is included in Table 7 in the Appendix.
The left panel of Figure [2](https://arxiv.org/html/2606.07819#S4.F2) shows weight-activation PTQ results at effective W(3.2)A(3.2) across compression ratios from 0.04 (80% sparsity) to 0.18 (10% sparsity), including the unstructured SparseGPT+GPTQ baseline (pruned, then quantized to W3A3). The right panel evaluates weight-only PTQ baselines. We omit PTQ-1.61 (which requires costly LoRA fine-tuning on an external dataset) and SparseGPT+GPTQ (which fails catastrophically at W2A16, e.g., perplexity above 7000 for Llama-2-7B at 10% sparsity).
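The compression ratios quoted in this section are consistent with a simple weight-size model relative to dense FP16: ratio = (1 − sparsity) × average bits / 16. This formula is an assumption inferred from the reported numbers, not a definition taken from the paper, and the helper names below are hypothetical. A minimal sketch:

```python
FP16_BITS = 16  # bit-width of the uncompressed reference weights


def compression_ratio(sparsity: float, avg_bits: float) -> float:
    """Assumed model: fraction of weights kept times the bit-width ratio vs. FP16."""
    return (1.0 - sparsity) * avg_bits / FP16_BITS


def sparsity_for_ratio(target_ratio: float, avg_bits: float) -> float:
    """Sparsity needed to hit target_ratio at a given average bit-width."""
    return 1.0 - target_ratio * FP16_BITS / avg_bits


# Two ways to land near a ratio of ~0.103, as in the text.
print(compression_ratio(0.45, 3), compression_ratio(0.59, 4))

# End points of the weight-activation sweep at W(3.2)A(3.2).
print(compression_ratio(0.80, 3.2), compression_ratio(0.10, 3.2))
```

Under this assumed model, 45% sparsity at 3 bits and 59% sparsity at 4 bits both give roughly 0.103, and the W(3.2) sweep spans roughly 0.04 to 0.18, which is why a joint search can trade sparsity against precision at a fixed budget.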
Overall, TOGA consistently outperforms both SoTA joint and sequential (DISP-LLM + other PTQ methods) pipelines. Unconstrained TOGA surpasses even TOGA-fixed-sparsity, particularly at aggressive compression levels, confirming that simultaneous optimization and free sparsity-precision trade-offs deliver superior accuracy-efficiency Pareto fronts. Please refer to Appendix Section 3.2 for more experiments.
Figure 2: Perplexity on the WikiText-2 dataset of Llama-2-7B compressed by (a) weight-activation (left) and (b) weight-only (right) quantization methods at different compression budgets. Left: all methods are quantized to W(3.2)A(3.2) format. Right: DISP+BiLLM is quantized to W(2.1)A16 format, as explained in [zhao2025ptq1]; the other methods are quantized to W(1.6)A16 format.
### 4.5 Performance
In this experiment, we evaluate the practical performance gains of TOGA relative to OBR and FP16 baselines using the Llama-2-7B model. For TOGA, we developed custom CUDA kernels based on the CUTLASS library to support mixed-precision INT4/INT8 GEMM operations. We further optimized the inference pipeline by fusing the RMS Normalization layer with the reordering and quantization stages to minimize computational overhead (see Section 4 in the Appendix for implementation details). We also quantize the KV cache to reduce memory overhead during inference.
For the OBR method, to ensure a fair and reproducible comparison, we use the CUTLASS-based kernels from the OBR paper, which support INT4 2:4 semi-structured sparse GEMM operations. These kernels were integrated into the QuaRot inference pipeline [ashkboos2024quarot], as in the original OBR paper [guo2025optimal]. Note that OBR also quantizes the KV cache to INT4. For the FP16 baseline, we use the standard HuggingFace model without any modifications. All performance benchmarks were run on an NVIDIA L40 GPU, with results averaged over 100 runs using a fixed context length of 2048 tokens and varying batch sizes. The performance evaluation follows a protocol similar to that described in [ashkboos2024quarot] and is summarized as follows:
- Compute-bound prefill stage: The left side of Figure [3](https://arxiv.org/html/2606.07819#S4.F3) shows inference speedups for OBR and TOGA across different batch sizes at a context length of 2048. Notably, the FP16 baseline encounters out-of-memory (OOM) errors starting at batch size 12. In contrast, OBR and TOGA can sustain significantly higher batch sizes without memory issues. Moreover, TOGA delivers superior prefill speedups compared to the 2:4 semi-structured OBR baseline. In particular, TOGA achieves up to 2× speedup over the FP16 baseline and approximately 1.3× speedup over the OBR baseline in the prefill phase.
- Memory-bound decoding phase: Since memory consumption is the primary bottleneck during autoregressive decoding, we focus on peak memory usage. TOGA reduces peak memory by up to 6.5× compared to the FP16 baseline and by approximately 10% compared to the OBR baseline.
These results demonstrate the substantial practical advantages of TOGA's joint structured pruning and mixed-precision quantization approach over both the uncompressed FP16 baseline and the 2:4 semi-structured sparsity method employed by OBR. For accuracy on reasoning tasks and perplexity evaluation of models compressed by TOGA and OBR, please see Table 7 in the Appendix.
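To give a sense of why KV-cache quantization matters in the memory-bound decoding phase, the sketch below estimates the KV-cache size for Llama-2-7B at the 2048-token context used above. It assumes the publicly documented Llama-2-7B shape (32 layers, 32 key/value heads, head dimension 128) and ignores quantization scales and any pruning of attention heads, so it approximates only one component of peak memory and does not reproduce the measured numbers.

```python
def kv_cache_bytes(batch: int, seq_len: int, n_layers: int = 32,
                   n_kv_heads: int = 32, head_dim: int = 128,
                   bits: int = 16) -> int:
    """Bytes needed to store keys and values for all layers.

    Defaults assume the dense Llama-2-7B shape; quantization scales are ignored.
    """
    elements = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch  # 2 = K and V
    return elements * bits // 8


GIB = 2 ** 30
for batch in (1, 8, 12):
    fp16 = kv_cache_bytes(batch, 2048, bits=16) / GIB
    int4 = kv_cache_bytes(batch, 2048, bits=4) / GIB
    print(f"batch={batch:2d}  FP16 KV cache: {fp16:5.2f} GiB  INT4 KV cache: {int4:5.2f} GiB")
```

Under these assumptions, the FP16 cache takes 1 GiB per sequence at 2048 tokens and grows linearly with batch size, while INT4 storage cuts that by 4× before scale overhead.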
Figure 3: Inference latency during the prefill stage and peak memory usage during the decode stage for the Llama-2-7B model, compressed using the baselines and TOGA. All measurements were conducted with a fixed context length of 2048 tokens and varying batch sizes. The FP16 baseline encounters out-of-memory errors at batch sizes ≥ 12.
### 4.6 Ablation Study
Quantization techniques with TOGA-q: In this experiment, we first apply round-to-nearest (RTN) with per-channel quantization for weights and per-token quantization for activations, which is the standard quantization recipe [xiao2023smoothquant], to uniformly quantize the model to W4A4. Then, we successively apply the quantization techniques used in TOGA-q, namely quantizing 12.5% of weights to INT8, reordering weight channels (see Section [3.1](https://arxiv.org/html/2606.07819#S3.SS1)), GPTQ [frantar2022gptq], and quantizing the KV cache to INT4. As shown in Table [3](https://arxiv.org/html/2606.07819#S4.T3) for Llama-2-7B on WikiText-2 and C4, retaining 12.5% of sensitive weights at INT8 already yields a large perplexity reduction compared to uniform W4A4. Adding channel reordering and GPTQ further reduces WikiText-2 perplexity from 6.03 to 5.38, while INT4 KV-cache quantization causes only a small increase to 5.48, demonstrating that the overall method remains highly effective even under KV-cache quantization.
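As a rough illustration of the RTN baseline and the first ablation step, the sketch below applies symmetric round-to-nearest quantization with one scale per row (a row stands for a weight output channel, or for a token in the activation case) and optionally keeps selected rows at 8-bit. All function names are hypothetical, the salient rows are passed in by hand, and this is not the paper's selection procedure or kernel implementation.

```python
def rtn_quantize_row(row, bits):
    """Symmetric round-to-nearest quantization of one row with its own scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(x) for x in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Python's round() resolves exact ties to the nearest even integer.
    codes = [max(-qmax - 1, min(qmax, round(x / scale))) for x in row]
    return codes, scale


def fake_quantize(matrix, salient_rows=(), base_bits=4, salient_bits=8):
    """Quantize then dequantize each row; rows in salient_rows use salient_bits."""
    result = []
    for index, row in enumerate(matrix):
        bits = salient_bits if index in salient_rows else base_bits
        codes, scale = rtn_quantize_row(row, bits)
        result.append([code * scale for code in codes])
    return result


def max_abs_error(original, approx):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y)
               for row_o, row_a in zip(original, approx)
               for x, y in zip(row_o, row_a))


weights = [[0.9, -0.31, 0.05, 0.47], [1.3, -0.2, 0.77, -1.1]]
uniform_w4 = fake_quantize(weights)                 # every row at 4-bit
mixed = fake_quantize(weights, salient_rows={0})    # row 0 kept at 8-bit
print(max_abs_error([weights[0]], [uniform_w4[0]]),
      max_abs_error([weights[0]], [mixed[0]]))
```

The two printed values are the worst-case reconstruction error of row 0 at 4-bit and at 8-bit; the gap hints at why keeping a small set of sensitive rows at INT8 helps, although the toy matrix says nothing about the perplexity figures in the ablation.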
Table 3: Ablation study of the quantization techniques used in TOGA-q for Llama-2-7B with a mixture of W4A4 and W8A8 precision. On average, 12.5% of the rows of the weight and activation matrices are kept in W8A8 format.

| Quantization Techniques | Perplexity: WikiText-2 | Perplexity: C4 |
|---|---|---|
| 16-bit baseline | 5.12 | 7.10 |
| W4A4 RTN | 1753 | 2301 |
| + Quantizing 12.5% of weights to INT8 | 6.03 | 8.10 |
| + Reordering | 5.78 | 8.01 |
| + GPTQ | 5.38 | 7.47 |
| + Quantizing KV cache to INT4 | 5.48 | 7.68 |

Salient Weight Distribution: In this experiment, we investigate the quantization choices made by TOGA-q. In contrast to previous mixed-precision methods such as Atom and ResQ, which apply a uniform, fixed fraction of salient weights (quantized to INT8) across all layers while quantizing the rest to INT4, TOGA-q adopts a more flexible and adaptive approach. It uses a hypernetwork to automatically search for the optimal number of salient weights to keep at INT8 in each individual layer, with the remainder quantized to INT4. This enables layer-specific mixed-precision assignments that more effectively account for the varying redundancy and sensitivity across different layers.
Figure [4](https://arxiv.org/html/2606.07819#S4.F4) illustrates the distribution of salient channels as determined by TOGA-q and other baselines on Llama-2-7B. Notably, TOGA-q tends to allocate a larger number of salient weights to the first 16 transformer blocks and to a few of the final blocks, while assigning significantly fewer salient weights to the intermediate blocks. This pattern aligns well with existing empirical findings on layer importance in large language models [sreenivas2024llm], which show that removing the early layers or the last few layers causes substantially larger accuracy degradation than pruning the middle layers. In contrast, ResQ and Atom apply the same threshold for all linear layers to identify salient weights.
Figure 4: Distribution of salient channels suggested by TOGA-q, Atom, and ResQ when quantizing non-salient/salient weights to INT4/INT8 for Llama-2-7B.
## 5 Conclusions
In this work, we present a novel mixed-precision post-training quantization (PTQ) method coupled with the first end-to-end framework for joint structured pruning and quantization of LLMs. By optimizing bit-width allocation via a hypernetwork trained on end-to-end language modeling loss and integrating it with structural pruning in a unified search space, our method achieves substantial inference acceleration over the uncompressed FP16 baseline. Compared to SoTA approaches that combine semi-structured pruning with quantization, it also delivers superior perplexity on language modeling datasets, higher zero-shot accuracy on downstream reasoning benchmarks, and improved real-world throughput (prefill and decode phases). These gains enable practical deployment of LLMs on severely resource-constrained hardware.
## 6 Future Works
A major limitation of our proposed method is that it requires loading and running the entire LLM on GPUs, which can easily lead to out-of-memory (OOM) errors with very large models (e.g., 70B parameters). For instance, under the current implementation, a GPU with 80GB of memory can only accommodate up to a 32B-parameter LLM. In the future, we plan to leverage advanced distributed training or offloading techniques to significantly reduce memory consumption during hypernetwork training.
## References