Output-Space Allocation Costs for Calibration-Guided LLM Compression: An Empirical Study
Summary
This paper empirically investigates whether aligning the allocation cost with the output-space objective improves compressed model fidelity in ROCKET, a training-free LLM compression method. Results show a trade-off between accuracy and perplexity, with effects more pronounced at higher compression ratios.
# Output-Space Allocation Costs for Calibration-Guided LLM Compression: An Empirical Study
Source: [https://arxiv.org/html/2606.27785](https://arxiv.org/html/2606.27785)
FARS, Qiong Tang†, Xiangkun Hu†, Xiangyang Liu†, Yiran Chen†, Yunfan Shao†

† Equal contribution; human authors listed in alphabetical order\.

Analemma, fars@analemma\.ai
###### Abstract
Training\-free compression methods for large language models \(LLMs\) often use calibration data to guide compression decisions\. ROCKET, a recent method combining sparse\-dictionary factorization with multi\-choice knapsack problem \(MCKP\) allocation, derives its per\-layer factorization from an output reconstruction objective but uses weight\-space Frobenius error as the MCKP allocation cost\. We investigate whether aligning the allocation cost with the output\-space objective improves compressed model fidelity\. On Qwen3\-8B at 50% compression, our ROCKET\-ActCost achieves \+0\.8 percentage points higher average accuracy across 8 zero\-shot benchmarks \(53\.1% vs 52\.3%\), but increases WikiText perplexity by 16% \(61\.46 vs 52\.98\)\. This accuracy\-perplexity tradeoff reveals that different allocation objectives favor different downstream metrics\. The high correlation \(\>0\.99\) between weight\-space and output\-space errors limits allocation divergence, explaining the modest effect size\. On Llama\-3\.2\-1B at 20% compression, the two methods produce near\-identical results \(53\.3% vs 53\.5% accuracy, 14\.45 vs 14\.66 PPL\), suggesting that the effect of the cost function is minor at lower compression ratios\.
> Disclosure: This paper was produced by FARS \(Fully Automated Research System, [https://analemma\.ai/fars/](https://analemma.ai/fars/)\), which autonomously performed the ideation, literature review, experiment design and execution, result analysis, and manuscript composition\. The accompanying code is publicly available at [https://gitlab\.com/fars\-a/rocket\-activation\-aware\-knapsack](https://gitlab.com/fars-a/rocket-activation-aware-knapsack)\. The human authors contributed review and minor editorial revisions\. They have verified the authenticity of all cited references and confirmed that all reported experimental results originate from actual code execution\. Readers should be aware that the prose and presentation of this manuscript are primarily machine\-generated and may not meet the standards of fully human\-authored work\.
## 1 Introduction
Large language models \(LLMs\) have achieved remarkable capabilities across diverse tasks, but their deployment is constrained by substantial memory and computational requirements \(Zhu et al\., [2024](https://arxiv.org/html/2606.27785#bib.bib14)\)\. Post\-training compression methods address this challenge by reducing model size without retraining, with low\-rank factorization emerging as a promising approach that approximates weight matrices using structured factors \(Yuan et al\., [2023](https://arxiv.org/html/2606.27785#bib.bib20); Wang et al\., [2025](https://arxiv.org/html/2606.27785#bib.bib19)\)\.
ROCKET \(Ali et al\., [2026](https://arxiv.org/html/2606.27785#bib.bib4)\) is a recent training\-free compression method that combines sparse\-dictionary factorization with global budget allocation via a multi\-choice knapsack problem \(MCKP\)\. For each layer, ROCKET derives its factorization from an output reconstruction objective, operating in a whitened activation space where output error equals Frobenius error in the transformed weight space\. However, when allocating compression budgets across layers, ROCKET uses weight\-space Frobenius error as the MCKP cost rather than the output\-space error that motivated the factorization\.
This design choice raises a natural question: should the global allocation objective be aligned with the per\-layer factorization objective? Activation\-aware methods such as AWQ \(Lin et al\., [2024](https://arxiv.org/html/2606.27785#bib.bib16)\) and ASVD \(Yuan et al\., [2023](https://arxiv.org/html/2606.27785#bib.bib20)\) have demonstrated that accounting for activation statistics improves compression quality\. We hypothesize that using output\-space error as the MCKP allocation cost, which directly measures the impact of compression on layer outputs for the calibration distribution, may yield better downstream performance than weight\-space error\.
We investigate this hypothesis empirically by proposing ROCKET\-ActCost, which replaces the weight\-space allocation cost with an output\-space equivalent and selects per\-layer sparsity configurations optimized for output\-space error, using matrices already available during profiling\. On Qwen3\-8B at 50% compression, ROCKET\-ActCost achieves \+0\.8 percentage points higher average accuracy \(53\.1% vs 52\.3%\) but increases perplexity by 16%, revealing an accuracy\-perplexity tradeoff\. Analysis shows that the high correlation \(\>0\.99\) between weight\-space and output\-space errors limits allocation divergence, with only 70 of 252 layers receiving different allocations\. On Llama\-3\.2\-1B at 20% compression, the two methods produce near\-identical results, suggesting the effect is minor at lower compression ratios\.
Our contributions are:
- An empirical study of output\-space MCKP allocation cost for calibration\-guided LLM compression, testing whether aligning the allocation objective with the factorization objective improves model fidelity\.
- Discovery of an accuracy\-perplexity tradeoff: output\-space cost improves task accuracy but worsens language modeling perplexity under aggressive compression\.
- Analysis showing that high error correlation \(\>0\.99\) between weight\-space and output\-space metrics fundamentally limits allocation divergence, explaining the modest effect size\.
## 2 Method
We investigate whether using an output\-space error as the allocation cost in ROCKET’s multi\-choice knapsack problem \(MCKP\) improves compressed model fidelity compared to the original weight\-space error\.
### 2\.1 Background: ROCKET’s MCKP Formulation
ROCKET \(Ali et al\., [2026](https://arxiv.org/html/2606.27785#bib.bib4)\) is a training\-free compression method that combines a fast sparse\-dictionary factorization with global budget allocation via MCKP\. For each linear layer with weight $W \in \mathbb{R}^{d_1 \times d_2}$ and calibration activations $X \in \mathbb{R}^{N \times d_1}$, ROCKET operates in a whitened activation space to derive a data\-adaptive factorization\.
Given the Gram matrix $A = X^{\top} X$ and its upper Cholesky factor $L$ \(where $A = L^{\top} L$\), ROCKET forms the whitened weight $W_L = LW$\. The key insight is that output reconstruction error in the original space equals Frobenius error in the whitened space:

$$\|XW - X\hat{W}\|_F = \|LW - L\hat{W}\|_F = \|W_L - \hat{W}_L\|_F. \tag{1}$$

This transformation reweights errors by activation energy, so errors along rarely\-used activation directions contribute less\.
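The identity in Equation 1 can be checked numerically. The sketch below is illustrative only and is not taken from the ROCKET code base: the matrix sizes, the random data, and the perturbed `W_hat` standing in for a compressed weight are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d1, d2 = 64, 8, 5  # hypothetical sizes: N calibration rows, W is d1 x d2

X = rng.standard_normal((N, d1))                  # stand-in calibration activations
W = rng.standard_normal((d1, d2))                 # stand-in original weight
W_hat = W + 0.1 * rng.standard_normal((d1, d2))   # stand-in compressed weight

A = X.T @ X  # Gram matrix, d1 x d1
# np.linalg.cholesky returns a lower-triangular C with A = C @ C.T,
# so the upper factor satisfying A = L.T @ L is L = C.T.
L = np.linalg.cholesky(A).T

# np.linalg.norm on a 2-D array defaults to the Frobenius norm.
output_err = np.linalg.norm(X @ W - X @ W_hat)
whitened_err = np.linalg.norm(L @ W - L @ W_hat)
weight_err = np.linalg.norm(W - W_hat)

print(f"output-space error  : {output_err:.6f}")
print(f"whitened-space error: {whitened_err:.6f}")
print(f"weight-space error  : {weight_err:.6f}")
```

The first two printed values agree up to floating-point error, while the plain weight-space error generally differs because it ignores the activation statistics encoded in `L`.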
ROCKET then performs eigendecomposition on $W_L W_L^{\top}$ to obtain a data\-adaptive basis, applies structured sparsification to the coefficient matrix, and solves a least\-squares problem to obtain the final factorization $\hat{W} = L^{-1} D_{\text{final}} C_{\text{sparse}}$\.
To allocate compression budgets across layers, ROCKET profiles each layer with multiple candidate configurations \(varying rank $k$ and sparsity $s$\) and solves a constrained MCKP:

$$\min_{x_{\ell,i} \in \{0,1\}} \sum_{\ell=1}^{L} \sum_{i=1}^{K_{\ell}} e_{\ell,i} \cdot x_{\ell,i} \quad \text{s.t.} \quad \sum_{\ell=1}^{L} \sum_{i=1}^{K_{\ell}} c_{\ell,i} \cdot x_{\ell,i} \leq C_{\text{total}}, \quad \sum_{i=1}^{K_{\ell}} x_{\ell,i} = 1, \; \forall \ell, \tag{2}$$

where $c_{\ell,i}$ is the parameter count and $e_{\ell,i}$ is the reconstruction error for option $i$ of layer $\ell$\. ROCKET uses the *weight\-space* relative Frobenius error as the cost:

$$e^{\text{weight}}_{\ell,i} = \frac{\|W_{\ell} - \hat{W}_{\ell,i}\|_F}{\|W_{\ell}\|_F}. \tag{3}$$
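To make the structure of Equation 2 concrete, here is a minimal dynamic-programming sketch of a multi-choice knapsack with small integer costs. It is not ROCKET's solver; the function name `solve_mckp`, the toy error and cost tables, and the budget are hypothetical and chosen only for illustration.

```python
from math import inf

def solve_mckp(errors, costs, budget):
    """Choose exactly one option per layer to minimize total error,
    subject to total (integer) cost <= budget.
    Returns (total_error, choices), or (inf, None) if infeasible."""
    # best maps "total cost used so far" -> (min error, chosen option indices)
    best = {0: (0.0, [])}
    for layer_err, layer_cost in zip(errors, costs):
        nxt = {}
        for used, (err, picks) in best.items():
            for i, (e, c) in enumerate(zip(layer_err, layer_cost)):
                b = used + c
                if b > budget:
                    continue  # this option would exceed the global budget
                cand = (err + e, picks + [i])
                if b not in nxt or cand[0] < nxt[b][0]:
                    nxt[b] = cand
        best = nxt
    if not best:
        return inf, None
    return min(best.values(), key=lambda t: t[0])

# Toy instance: 2 layers, 3 candidate configurations each (hypothetical numbers).
errors = [[0.10, 0.30, 0.60], [0.05, 0.20, 0.50]]  # e[l][i]
costs = [[8, 5, 2], [8, 5, 2]]                     # c[l][i], e.g. parameter units
total_error, choices = solve_mckp(errors, costs, budget=10)
print(total_error, choices)
```

In this toy instance the budget forbids keeping both layers at their largest configuration, so the solver trades error between layers. Swapping the `errors` table for output-space errors changes only the objective, not the solver.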
### 2\.2 Output\-Space Allocation Cost
While ROCKET’s per\-layer factorization is derived from an output reconstruction objective \(Equation [1](https://arxiv.org/html/2606.27785#S2.E1)\), its global allocation uses weight\-space error \(Equation [3](https://arxiv.org/html/2606.27785#S2.E3)\)\. This creates a potential mismatch: the MCKP objective treats all weight\-space directions equally, which is not equivalent to the calibration\-distribution output objective\.
We propose ROCKET\-ActCost, which replaces the weight\-space cost with an *output\-space* \(whitened\) error:

$$e^{\text{out}}_{\ell,i} = \frac{\|W_{L,\ell} - \hat{W}_{L,\ell,i}\|_F}{\|W_{L,\ell}\|_F} = \frac{\|LW_{\ell} - L\hat{W}_{\ell,i}\|_F}{\|LW_{\ell}\|_F}. \tag{4}$$

This cost directly measures the impact of rank truncation on the layer’s output for the calibration distribution, aligning the allocation objective with the factorization derivation\.
Importantly, switching to output\-space error also changes which sparsity configuration \($k_s$ ratio\) is optimal for each layer and compression level\. During profiling, ROCKET evaluates multiple $k_s$ candidates per layer; ROCKET\-ActCost selects the candidate minimizing output\-space error rather than weight\-space error\. Since 96\.6% of \(layer, compression\-ratio\) pairs have different optimal $k_s$ values under the two metrics, ROCKET\-ActCost effectively changes both the MCKP cost *and* the per\-layer compression configuration relative to ROCKET\-default\. Figure [1](https://arxiv.org/html/2606.27785#S2.F1) illustrates the ROCKET\-ActCost pipeline\.

Figure 1: Overview of ROCKET\-ActCost\. The method modifies ROCKET’s MCKP allocation by replacing weight\-space Frobenius error $\|W - \hat{W}\|_F$ with output\-space error $\|XW - X\hat{W}\|_F$, computed equivalently as $\|W_L - \hat{W}_L\|_F$ in the whitened space\. Both methods share the same SVD decomposition and MCKP solver, but differ in the cost function and in the per\-layer sparsity configuration \($k_s$ ratio\) selected during profiling\.

Crucially, ROCKET\-ActCost adds no runtime overhead\. During profiling, ROCKET already computes the whitened weights $W_L = LW$ and whitened reconstructions $\hat{W}_L$ before mapping back to the original space\. The output\-space error and output\-optimal $k_s$ selection are computed from these existing matrices without additional calibration passes\. The MCKP solver runs in identical time regardless of which cost function is used\.
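The candidate selection described above can be illustrated with a deliberately tiny example in which the two metrics disagree. This is a sketch under stated assumptions, not the paper's profiling code: the helper `relative_errors`, the diagonal whitening factor, and the two candidate reconstructions are all made up for illustration.

```python
import numpy as np

def relative_errors(W, W_hat, L):
    """Return (weight-space, output-space) relative Frobenius errors,
    following the forms of Equations 3 and 4."""
    e_weight = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
    e_out = np.linalg.norm(L @ (W - W_hat)) / np.linalg.norm(L @ W)
    return e_weight, e_out

# Hypothetical whitening factor: the first input direction carries far more
# activation energy than the second (a diagonal L is a valid upper factor).
L = np.diag([10.0, 0.1])
W = np.array([[1.0], [1.0]])

# Two made-up candidate reconstructions of W.
candidates = [
    np.array([[0.9], [1.0]]),  # small error, but on the high-energy direction
    np.array([[1.0], [0.5]]),  # large error, but on the low-energy direction
]

errs = [relative_errors(W, c, L) for c in candidates]
best_by_weight = min(range(len(errs)), key=lambda i: errs[i][0])
best_by_output = min(range(len(errs)), key=lambda i: errs[i][1])
print("weight-space pick:", best_by_weight, "output-space pick:", best_by_output)
```

Here the weight-space metric prefers the first candidate and the output-space metric prefers the second, which is the kind of disagreement that leads the two variants to select different configurations.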
## 3 Experiments
We evaluate ROCKET\-ActCost against ROCKET\-default to test whether output\-space allocation cost improves compressed model fidelity\.
### 3\.1 Experimental Setup
#### Models and Compression Ratios\.
We evaluate on two settings: \(1\) Qwen3\-8B \(Yang et al\., [2025](https://arxiv.org/html/2606.27785#bib.bib5)\) at 50% compression ratio \(aggressive compression, primary evaluation\), and \(2\) Llama\-3\.2\-1B \(Meta AI, [2024](https://arxiv.org/html/2606.27785#bib.bib7)\) at 20% compression ratio \(milder compression, secondary check\)\. The 50% compression ratio on Qwen3\-8B represents a challenging setting where allocation decisions have significant impact\.
#### Calibration\.
Following ROCKET’s setup, we use 256 sequences of length 1024 from RefinedWeb \(Penedo et al\., [2023](https://arxiv.org/html/2606.27785#bib.bib8)\) for calibration\. For Qwen3\-8B, we run two calibration seeds \(2023 and 42\) and report mean results; for Llama\-3\.2\-1B, we use a single seed\.
#### Evaluation\.
We evaluate on 8 zero\-shot benchmarks using lm\-eval\-harness \(Biderman et al\., [2024](https://arxiv.org/html/2606.27785#bib.bib9)\): PIQA \(Bisk et al\., [2020](https://arxiv.org/html/2606.27785#bib.bib24)\), HellaSwag \(Zellers et al\., [2019](https://arxiv.org/html/2606.27785#bib.bib10)\), LAMBADA \(Paperno et al\., [2016](https://arxiv.org/html/2606.27785#bib.bib11)\), ARC\-Easy, ARC\-Challenge \(Clark et al\., [2018](https://arxiv.org/html/2606.27785#bib.bib12)\), SciQ \(Welbl et al\., [2017](https://arxiv.org/html/2606.27785#bib.bib25)\), RACE \(Lai et al\., [2017](https://arxiv.org/html/2606.27785#bib.bib26)\), and MMLU \(Hendrycks et al\., [2021](https://arxiv.org/html/2606.27785#bib.bib13)\)\. We report average accuracy \(AvgAcc\) across these benchmarks and WikiText\-2 \(Merity et al\., [2017](https://arxiv.org/html/2606.27785#bib.bib27)\) perplexity \(PPL\)\.
### 3\.2 Main Results
Table [1](https://arxiv.org/html/2606.27785#S3.T1) presents the main comparison on Qwen3\-8B at 50% compression ratio\. In addition to the output\-space MCKP cost, ROCKET\-ActCost uses output\-optimal $k_s$ ratios \(Section [2](https://arxiv.org/html/2606.27785#S2)\); both variants share the same profiling pipeline and are directly comparable\. ROCKET\-ActCost achieves \+0\.8 percentage points higher average accuracy than ROCKET\-default \(53\.1% vs 52\.3%\), which suggests that the output\-space allocation cost captures task\-relevant information more effectively\. However, this accuracy improvement comes with a perplexity tradeoff: WikiText PPL increases from 52\.98 to 61\.46 \(16% worse\)\.
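As a quick check of the headline numbers, the relative perplexity change and the accuracy gap follow directly from the values reported above:

```latex
\frac{61.46 - 52.98}{52.98} = \frac{8.48}{52.98} \approx 0.160 \;(16\%),
\qquad
53.1\% - 52.3\% = 0.8\ \text{pp}.
```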
Table 1: Main results on Qwen3\-8B at 50% compression ratio\. ROCKET\-ActCost improves average accuracy by \+0\.8pp but increases perplexity by 16%\. Best in bold\.

Table [2](https://arxiv.org/html/2606.27785#S3.T2) shows the per\-benchmark breakdown\. ROCKET\-ActCost improves on all 8 benchmarks, with the largest gains on reasoning tasks: ARC\-Challenge \(\+1\.5pp\), MMLU \(\+1\.5pp\), and LAMBADA \(\+1\.3pp\)\.

Table 2: Per\-benchmark accuracy comparison on Qwen3\-8B at 50% compression ratio\. ROCKET\-ActCost improves on all 8 benchmarks\. Values are mean accuracy \(%\) across 2 seeds\. Best in bold\.

This accuracy\-perplexity tradeoff suggests that perplexity and task accuracy measure different aspects of model fidelity under compression, with the output\-space cost favoring task\-relevant information over language modeling quality\.
### 3\.3 Analysis: Error Correlation Limits Allocation Divergence
To understand why the effect size is modest despite using a different cost function, we analyze the correlation between weight\-space and output\-space errors across compression candidates\. On Qwen3\-8B, the per\-layer Spearman rank correlation between $e^{\text{weight}}_{\ell,i}$ and $e^{\text{out}}_{\ell,i}$ across candidate configurations exceeds 0\.99 for nearly all layers, indicating that the two error metrics rank candidates almost identically\. This high correlation limits how much the MCKP allocation can diverge between the two cost functions\.
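For readers unfamiliar with the statistic, the sketch below computes a Spearman rank correlation between two error lists for a single layer. The error values are invented for illustration and the tie-free formula is an assumption made for brevity; the paper's analysis code may differ.

```python
def ranks(values):
    """Rank values from 1..n (assumes no ties, for simplicity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def spearman(a, b):
    """Spearman's rho for tie-free data: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(a)
    ra, rb = ranks(a), ranks(b)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical errors for five candidate configurations of one layer.
e_weight = [0.10, 0.20, 0.35, 0.50, 0.70]
e_out = [0.08, 0.21, 0.30, 0.55, 0.52]  # the last two candidates swap rank

rho = spearman(e_weight, e_out)
print(f"Spearman rho = {rho:.3f}")
```

A single swapped pair among five candidates already lowers the coefficient noticeably, so values above 0.99 imply that the two metrics order a layer's candidates almost identically.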
Concretely, approximately 70 of 252 compressible layers receive different allocations under ROCKET\-ActCost compared to ROCKET\-default, and these differences occur at borderline decision points where multiple candidates have similar costs\. The near\-identical error rankings explain why the accuracy improvement is limited to \+0\.8pp rather than a larger gain\.
### 3\.4 Secondary Setting: Llama\-3\.2\-1B at 20% Compression
Table [3](https://arxiv.org/html/2606.27785#S3.T3) presents results on Llama\-3\.2\-1B at 20% compression ratio\. In this milder setting, the two methods produce near\-identical results: ROCKET\-ActCost shows a marginal perplexity improvement \(14\.45 vs 14\.66\) and a negligible accuracy difference \(\-0\.2pp: 53\.3% vs 53\.5%\)\. This suggests that the effect of the allocation cost function is more pronounced under aggressive compression, and largely vanishes at lower compression ratios\.
Table 3: Results on Llama\-3\.2\-1B at 20% compression ratio\. Best in bold\.
### 3\.5 Runtime
ROCKET\-ActCost adds no runtime overhead compared to ROCKET\-default\. The output\-space error is computed from matrices already available during profiling \($W_L$ and $\hat{W}_L$\), and the MCKP solver runs in identical time regardless of which cost function is used\. In our experiments, ROCKET\-ActCost was actually 7\.7% faster on average due to different allocation decisions leading to slightly different compression configurations\.
## 4 Related Work
#### LLM Compression\.
Post\-training compression methods for large language models fall into three main categories \(Zhu et al\., [2024](https://arxiv.org/html/2606.27785#bib.bib14)\)\. *Quantization* methods such as GPTQ \(Frantar et al\., [2022](https://arxiv.org/html/2606.27785#bib.bib15)\) and AWQ \(Lin et al\., [2024](https://arxiv.org/html/2606.27785#bib.bib16)\) reduce precision of weights and activations\. *Pruning* methods including SparseGPT \(Frantar and Alistarh, [2023](https://arxiv.org/html/2606.27785#bib.bib17)\) and Wanda \(Sun et al\., [2024](https://arxiv.org/html/2606.27785#bib.bib18)\) remove weights based on importance scores\. *Low\-rank factorization* methods such as SVD\-LLM \(Wang et al\., [2025](https://arxiv.org/html/2606.27785#bib.bib19)\), ASVD \(Yuan et al\., [2023](https://arxiv.org/html/2606.27785#bib.bib20)\), and SliceGPT \(Ashkboos et al\., [2024](https://arxiv.org/html/2606.27785#bib.bib21)\) approximate weight matrices with low\-rank factors\.
#### Activation\-Aware Methods\.
Many successful compression approaches are *data\-aware*, using calibration activations to guide compression decisions\. AWQ \(Lin et al\., [2024](https://arxiv.org/html/2606.27785#bib.bib16)\) identifies salient weights based on activation magnitudes\. ASVD \(Yuan et al\., [2023](https://arxiv.org/html/2606.27785#bib.bib20)\) scales weight matrices by activation statistics before SVD decomposition\. SmoothQuant \(Xiao et al\., [2023](https://arxiv.org/html/2606.27785#bib.bib22)\) migrates quantization difficulty from activations to weights\. These methods share the insight that compression should account for how weights interact with typical activations\.
#### Rank Allocation\.
Global budget allocation across layers is critical for compression quality\. ROCKET \(Ali et al\., [2026](https://arxiv.org/html/2606.27785#bib.bib4)\) formulates this as a multi\-choice knapsack problem \(MCKP\), while CoSpaDi \(Shopkhoev et al\., [2025](https://arxiv.org/html/2606.27785#bib.bib23)\) uses iterative sparse dictionary learning\. Our work investigates whether ROCKET’s MCKP allocation should use output\-space error rather than weight\-space error\.
## 5 Conclusion
We investigated whether using output\-space error as the MCKP allocation cost in ROCKET improves compressed model fidelity\. On Qwen3\-8B at 50% compression, ROCKET\-ActCost achieves \+0\.8pp higher accuracy but 16% worse perplexity, revealing an accuracy\-perplexity tradeoff\. The high correlation \(\>0\.99\) between weight\-space and output\-space errors limits allocation divergence, explaining the modest effect size\. On Llama\-3\.2\-1B at 20% compression, the two methods produce near\-identical results, suggesting the effect is minor at lower compression ratios\. Our findings indicate that the choice of allocation cost function matters most under aggressive compression, informing future compression method design\.
## References
- Ali et al\. \(2026\)\. ROCKET: rapid optimization via calibration\-guided knapsack enhanced truncation for efficient model compression\. CoRR abs/2602\.11008\. [Link](https://doi.org/10.48550/arXiv.2602.11008)
- S\. Ashkboos, M\. L\. Croci, M\. G\. do Nascimento, T\. Hoefler, and J\. Hensman \(2024\)\. SliceGPT: compress large language models by deleting rows and columns\. In The Twelfth International Conference on Learning Representations\. [Link](https://openreview.net/forum?id=vXxardq6db)
- S\. Biderman, H\. Schoelkopf, L\. Sutawika, L\. Gao, J\. Tow, et al\. \(2024\)\. Lessons from the trenches on reproducible evaluation of language models\. ArXiv abs/2405\.14782\.
- Y\. Bisk, R\. Zellers, R\. L\. Bras, J\. Gao, and Y\. Choi \(2020\)\. PIQA: reasoning about physical commonsense in natural language\. In The Thirty\-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty\-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7\-12, 2020, pp\. 7432–7439\. [Link](https://doi.org/10.1609/aaai.v34i05.6239)
- P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord \(2018\)\. Think you have solved question answering? Try ARC, the AI2 reasoning challenge\. ArXiv abs/1803\.05457\.
- E\. Frantar and D\. Alistarh \(2023\)\. SparseGPT: massive language models can be accurately pruned in one\-shot\. In International Conference on Machine Learning, ICML 2023, 23\-29 July 2023, Honolulu, Hawaii, USA, A\. Krause, E\. Brunskill, K\. Cho, B\. Engelhardt, S\. Sabato, and J\. Scarlett \(Eds\.\), Proceedings of Machine Learning Research, Vol\. 202, pp\. 10323–10337\. [Link](https://proceedings.mlr.press/v202/frantar23a.html)
- E\. Frantar, S\. Ashkboos, T\. Hoefler, and D\. Alistarh \(2022\)\. GPTQ: accurate post\-training quantization for generative pre\-trained transformers\. ArXiv abs/2210\.17323\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021\)\. Measuring massive multitask language understanding\. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3\-7, 2021\. [Link](https://openreview.net/forum?id=d7KBjmI3GmQ)
- G\. Lai, Q\. Xie, H\. Liu, Y\. Yang, and E\. H\. Hovy \(2017\)\. RACE: large\-scale reading comprehension dataset from examinations\. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9\-11, 2017, M\. Palmer, R\. Hwa, and S\. Riedel \(Eds\.\), pp\. 785–794\. [Link](https://doi.org/10.18653/v1/d17-1082)
- J\. Lin, J\. Tang, H\. Tang, S\. Yang, W\. Chen, W\. Wang, G\. Xiao, X\. Dang, C\. Gan, and S\. Han \(2024\)\. AWQ: activation\-aware weight quantization for on\-device LLM compression and acceleration\. In Proceedings of the Seventh Annual Conference on Machine Learning and Systems, MLSys 2024, Santa Clara, CA, USA, May 13\-16, 2024, P\. B\. Gibbons, G\. Pekhimenko, and C\. D\. Sa \(Eds\.\)\. [Link](https://proceedings.mlsys.org/paper%5C_files/paper/2024/hash/42a452cbafa9dd64e9ba4aa95cc1ef21-Abstract-Conference.html)
- S\. Merity, C\. Xiong, J\. Bradbury, and R\. Socher \(2017\)\. Pointer sentinel mixture models\. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24\-26, 2017, Conference Track Proceedings\. [Link](https://openreview.net/forum?id=Byj72udxe)
- Meta AI \(2024\)\. Llama 3\.2: lightweight text models \(1b and 3b\)\. [https://ai\.meta\.com/blog/llama\-3\-2\-connect\-2024\-vision\-edge\-mobile\-devices/](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/)\. Accessed: 2025\-06\-01\.
- D\. Paperno, G\. Kruszewski, A\. Lazaridou, Q\. N\. Pham, R\. Bernardi, S\. Pezzelle, M\. Baroni, G\. Boleda, and R\. Fernández \(2016\)\. The LAMBADA dataset: word prediction requiring a broad discourse context\. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7\-12, 2016, Berlin, Germany, Volume 1: Long Papers\. [Link](https://doi.org/10.18653/v1/p16-1144)
- G\. Penedo, Q\. Malartic, D\. Hesslow, R\. Cojocaru, H\. Alobeidli, A\. Cappelli, B\. Pannier, E\. Almazrouei, and J\. Launay \(2023\)\. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data only\. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023, A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\)\. [Link](http://papers.nips.cc/paper%5C_files/paper/2023/hash/fa3ed726cc5073b9c31e3e49a807789c-Abstract-Datasets%5C_and%5C_Benchmarks.html)
- D\. Shopkhoev, D\. Makhov, M\. Zhussip, A\. Ali, and S\. Lefkimmiatis \(2025\)\. CoSpaDi: compressing LLMs via calibration\-guided sparse dictionary learning\. ArXiv abs/2509\.22075\.
- M\. Sun, Z\. Liu, A\. Bair, and J\. Z\. Kolter \(2024\)\. A simple and effective pruning approach for large language models\. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024\. [Link](https://openreview.net/forum?id=PxoFut3dWW)
- X\. Wang, Y\. Zheng, Z\. Wan, and M\. Zhang \(2025\)\. SVD\-LLM: truncation\-aware singular value decomposition for large language model compression\. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025\. [Link](https://openreview.net/forum?id=LNYIUouhdt)
- J\. Welbl, N\. F\. Liu, and M\. Gardner \(2017\)\. Crowdsourcing multiple choice science questions\. In Proceedings of the 3rd Workshop on Noisy User\-generated Text, NUT@EMNLP 2017, Copenhagen, Denmark, September 7, 2017, L\. Derczynski, W\. Xu, A\. Ritter, and T\. Baldwin \(Eds\.\), pp\. 94–106\. [Link](https://doi.org/10.18653/v1/w17-4413)
- G\. Xiao, J\. Lin, M\. Seznec, H\. Wu, J\. Demouth, and S\. Han \(2023\)\. SmoothQuant: accurate and efficient post\-training quantization for large language models\. In International Conference on Machine Learning, ICML 2023, 23\-29 July 2023, Honolulu, Hawaii, USA, A\. Krause, E\. Brunskill, K\. Cho, B\. Engelhardt, S\. Sabato, and J\. Scarlett \(Eds\.\), Proceedings of Machine Learning Research, Vol\. 202, pp\. 38087–38099\. [Link](https://proceedings.mlr.press/v202/xiao23c.html)
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, et al\. \(2025\)\. Qwen3 technical report\. ArXiv abs/2505\.09388\.
- Z\. Yuan, Y\. Shang, Y\. Song, Q\. Wu, Y\. Yan, and G\. Sun \(2023\)\. ASVD: activation\-aware singular value decomposition for compressing large language models\. ArXiv abs/2312\.05821\.
- R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi \(2019\)\. HellaSwag: can a machine really finish your sentence? In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28\- August 2, 2019, Volume 1: Long Papers, A\. Korhonen, D\. R\. Traum, and L\. Màrquez \(Eds\.\), pp\. 4791–4800\. [Link](https://doi.org/10.18653/v1/p19-1472)
- X\. Zhu, J\. Li, Y\. Liu, C\. Ma, and W\. Wang \(2024\)\. A survey on model compression for large language models\. Trans\. Assoc\. Comput\. Linguistics 12, pp\. 1556–1577\. [Link](https://doi.org/10.1162/tacl%5C_a%5C_00704)