SigmaScale: LLM Compression with SVD-based Low-Rank Decomposition and Learned Scaling Matrices
Summary
Introduces SigmaScale, a method that learns auxiliary scaling matrices for SVD-based LLM compression, showing competitive performance on Llama 3.1 8B and Qwen3-8B benchmarks.
View Cached Full Text
Cached at: 06/08/26, 09:22 AM
# LLM Compression with SVD-based Low-Rank Decomposition and Learned Scaling Matrices Source: [https://arxiv.org/html/2606.07098](https://arxiv.org/html/2606.07098) Ernests Lavrinovics1,Marco Letizia2,3,4,Roy Janco5,Shai Segal, Johannes Bjerva1,Maurizio Pierini4, 1Department of Computer Science, Aalborg University Copenhagen, Denmark 2MaLGa\-DIBRIS, University of Genoa, Genoa, Italy, 3INFN, Sezione di Genova, Genoa, Italy 4European Organization for Nuclear Research \(CERN\), Geneva, Switzerland 5Ceva, Inc\., Correspondence:[elav@cs\.aau\.dk](https://arxiv.org/html/2606.07098v1/mailto:[email protected]) ###### Abstract We present SigmaScale, a method for learning auxiliary scaling matricesSSto aid truncated Singular Value Decomposition \(SVD\) based Large Language Model \(LLM\) compression\. Instead of deriving scaling matrices analytically, SigmaScale optimizes two sets of vectors that define diagonal row and column scaling transformations under an activation\-aware compression loss\. We show that learned scaling lowers the effective intrinsic rank of weight matrices, as reflected by reductions in effective\-rank entropy, and that this reduction is strongly correlated with compression loss\. Experiments on Llama 3\.1 8B Instruct and Qwen3\-8B show that SigmaScale is competitive with closely related state\-of\-the\-art SVD\-based compression methods across perplexity and zero\-shot benchmarks\. By using learned activation\-aware transformations, SigmaScale explores a more flexible route to low\-rank LLM compression by adapting to the structure of individual model weights\. The advantage observed in specific tasks makes our approach a valid option for applications requiring a reduced LLM\-inference computing cost\. SigmaScale: LLM Compression with SVD\-based Low\-Rank Decomposition and Learned Scaling Matrices Ernests Lavrinovics1††thanks:Worked performed as part of a research internship in CERN, Marco Letizia2,3,4, Roy Janco5, Shai Segal††thanks:Worked performed as part of Ceva Inc\.,Johannes Bjerva1,Maurizio Pierini4,1Department of Computer Science, Aalborg University Copenhagen, Denmark2MaLGa\-DIBRIS, University of Genoa, Genoa, Italy,3INFN, Sezione di Genova, Genoa, Italy4European Organization for Nuclear Research \(CERN\), Geneva, Switzerland5Ceva, Inc\.,Correspondence:[elav@cs\.aau\.dk](https://arxiv.org/html/2606.07098v1/mailto:[email protected]) ## 1Introduction and Background Large Language Models \(LLMs\) exhibit a remarkable performance and generalization across a variety of NLP tasksBrownet al\.\([2020](https://arxiv.org/html/2606.07098#bib.bib1)\)and it has been demonstrated that their performance scales with the increase of parametersKaplanet al\.\([2020](https://arxiv.org/html/2606.07098#bib.bib2)\), therefore leading to developments of very large language models in the tens and hundreds of billions of parametersGrattafioriet al\.\([2024](https://arxiv.org/html/2606.07098#bib.bib5)\); DeepSeek\-AI \([2026](https://arxiv.org/html/2606.07098#bib.bib4)\); Yanget al\.\([2025](https://arxiv.org/html/2606.07098#bib.bib3)\)\. The high parameter count impacts the technological accesibility and has significant environmental impacts due to the high power consumption of inference systemsBommasaniet al\.\([2021](https://arxiv.org/html/2606.07098#bib.bib6)\)\. Therefore the AI research community has long explored methods of model compressionZhuet al\.\([2024](https://arxiv.org/html/2606.07098#bib.bib18)\); Liuet al\.\([2025a](https://arxiv.org/html/2606.07098#bib.bib7)\)which span across quantizationLiuet al\.\([2025b](https://arxiv.org/html/2606.07098#bib.bib33)\); Ashkbooset al\.\([2024](https://arxiv.org/html/2606.07098#bib.bib36)\); Frantaret al\.\([2023](https://arxiv.org/html/2606.07098#bib.bib37)\), pruningZhuet al\.\([2025](https://arxiv.org/html/2606.07098#bib.bib41)\), knowledge distillation \(KD\)Yanget al\.\([2024](https://arxiv.org/html/2606.07098#bib.bib30)\); Xinet al\.\([2026](https://arxiv.org/html/2606.07098#bib.bib35)\)and low\-rank decompositionYuanet al\.\([2023](https://arxiv.org/html/2606.07098#bib.bib14)\); Wanget al\.\([2024](https://arxiv.org/html/2606.07098#bib.bib8)\); Sahaet al\.\([2024](https://arxiv.org/html/2606.07098#bib.bib40)\)\. Despite the success of these methods, practical deployment of quantization and pruning requires specialized hardware support which is a limitation contrary to low\-rank decomposition and KD methods\. Figure 1:Visualization of the processing pipelineLow\-rank decomposition methods approximate a given matrixW∈ℝm×nW\\in\\mathbb\{R\}^\{m\\times n\}as the product of two lower\-rank matricesL∈ℝm×kL\\in\\mathbb\{R\}^\{m\\times k\}andR∈ℝk×nR\\in\\mathbb\{R\}^\{k\\times n\}, wherek≪min\(m,n\)k\\ll\\min\(m,n\)\. This means that low\-rank decomposition typically does not require specialized hardware for supporting it, and it can be deployed alongside quantization and pruningYuanet al\.\([2023](https://arxiv.org/html/2606.07098#bib.bib14)\); Wanget al\.\([2024](https://arxiv.org/html/2606.07098#bib.bib8)\)\. The Eckart–Young–Mirsky theoremEckart and Young \([1936](https://arxiv.org/html/2606.07098#bib.bib13)\); Mirsky \([1960](https://arxiv.org/html/2606.07098#bib.bib12)\)states that, for minimizing the Frobenius norm‖W−W′‖F\|\|W\-W^\{\\prime\}\|\|\_\{F\}, whereWWis the original weight matrix andW′W^\{\\prime\}is its low\-rank approximation, the optimal analytical solution is given by the truncated singular value decomposition \(SVD\): fsvd\(k\)\(W\)=UkΣkVkT=∑i=1kuiσiviT\.f^\{\(k\)\}\_\{\\rm svd\}\(W\)=U\_\{k\}\\Sigma\_\{k\}V\_\{k\}^\{T\}=\\sum\_\{i=1\}^\{k\}u\_\{i\}\\sigma\_\{i\}v\_\{i\}^\{T\}\.\(1\)Here,Uk∈ℝm×kU\_\{k\}\\in\\mathbb\{R\}^\{m\\times k\}andVk∈ℝn×kV\_\{k\}\\in\\mathbb\{R\}^\{n\\times k\}contain the topkkleft and right singular vectors ofWW, respectively, whileΣk∈ℝk×k\\Sigma\_\{k\}\\in\\mathbb\{R\}^\{k\\times k\}is a diagonal matrix containing the correspondingkklargest singular values in descending order\. Retaining only the topkksingular values and their corresponding singular vectors effectively discards components associated with lower\-energy modes\. However, a drawback of SVD is its computational cost,O\(n3\)O\(n^\{3\}\)for square matricesShishkinet al\.\([2019](https://arxiv.org/html/2606.07098#bib.bib11)\); Kishore Kumar and Schneider \([2017](https://arxiv.org/html/2606.07098#bib.bib10)\), and its unstable derivative, for which Taylor expansion\-based approximations have been used to approximate its gradientsWanget al\.\([2022](https://arxiv.org/html/2606.07098#bib.bib38),[2025](https://arxiv.org/html/2606.07098#bib.bib39)\)\. This means that performing SVD at each step of an optimization routine has its limitations, and it does not scale well as the matrix size increases\. Additionally naïve SVD decomposition on weight matricesWWminimizing the Frobenius norm‖W−W′‖F\|\|W\-W^\{\\prime\}\|\|\_\{F\}has been shown to perform poorly on neural network weight matricesHsuet al\.\([2022](https://arxiv.org/html/2606.07098#bib.bib17)\); Yuanet al\.\([2023](https://arxiv.org/html/2606.07098#bib.bib14)\)partially due to the presence of outliers in the activations\. Therefore, previous worksNagelet al\.\([2020](https://arxiv.org/html/2606.07098#bib.bib28)\); Wanget al\.\([2024](https://arxiv.org/html/2606.07098#bib.bib8)\); Sahaet al\.\([2024](https://arxiv.org/html/2606.07098#bib.bib40)\)include the activationsxxin the loss function‖Wx−W′x‖F\|\|Wx\-W^\{\\prime\}x\|\|\_\{F\}to optimize over the functionality instead of the structure for a given weight matrix\. Previous works further expand upon this idea by applying linear invertible scaling matricesSStoWWwith the goal of: \(1\) absorbing outliers of the activationsYuanet al\.\([2023](https://arxiv.org/html/2606.07098#bib.bib14)\), \(2\) aligning the singular values with the compression loss through Cholesky decomposition of the activation covariance matrixWanget al\.\([2024](https://arxiv.org/html/2606.07098#bib.bib8)\); Liet al\.\([2026](https://arxiv.org/html/2606.07098#bib.bib16)\)\. Since compression introduces a certain performance loss, compressed models are commonly fine\-tuned to realign their weights\. However, this is not straightforward for LLMs primarily because these models undergo multi\-step post\-training\. Ideally, achieving a faithful distribution recovery after compression would require access to the same datasets used during the original post\-training phases\. In practice, this is often not achievable, as popular open\-weight model technical reportsGrattafioriet al\.\([2024](https://arxiv.org/html/2606.07098#bib.bib5)\); Yanget al\.\([2025](https://arxiv.org/html/2606.07098#bib.bib3)\)do not disclose the exact datasets employed during their post\-training\. To this end, KDHintonet al\.\([2015](https://arxiv.org/html/2606.07098#bib.bib21)\)has been demonstrated to be useful for realigning the model to its original distributionXinet al\.\([2026](https://arxiv.org/html/2606.07098#bib.bib35)\)\. Given that learning scaling matrices for improving the SVD performance is underexplored and previous methodsYuanet al\.\([2023](https://arxiv.org/html/2606.07098#bib.bib14)\); Wanget al\.\([2024](https://arxiv.org/html/2606.07098#bib.bib8)\)rely on analytical means of derivingSS, and given that KD is suggested to be beneficial for performance recovery over supervised fine\-tuning, we cover the following contributions: \(1\) Empirical results on SVD compression performance when learning row\- and column\-wise scaling matrices\. To the best of our knowledge, this is the first work to explore learning the parameters of scaling matricesSSfor this purpose\. \(2\) Comparisons between KD and supervised fine\-tuning for performance recovery, with varied post\-compression performance recovery datasets\. \(3\) Custom variant of the AlpacaTaoriet al\.\([2023](https://arxiv.org/html/2606.07098#bib.bib29)\)dataset, based on Llama 3\.1\-8B Instruction output distribution\. See Appendix[G](https://arxiv.org/html/2606.07098#A7)for the codebase link\. ## 2Methodology The first step of our pipeline is sensitivity probing, which determines the compression levels for each given layer and module of the model, described in Section[2\.1](https://arxiv.org/html/2606.07098#S2.SS1)\. The second step is to learn scaling matrices that apply a linear transformation to the weight matrixWWbefore performing truncated SVD\. After the optimal scaling matrix has been learned, we perform the final compression on the model and do post\-compression fine\-tuning for realignment of weights\. We base our experiments on Llama 3\.1 8B\-InstructGrattafioriet al\.\([2024](https://arxiv.org/html/2606.07098#bib.bib5)\)and Qwen3\-8B modelsYanget al\.\([2025](https://arxiv.org/html/2606.07098#bib.bib3)\)\. See Figure[1](https://arxiv.org/html/2606.07098#S1.F1)for our pipeline visualization\. ### 2\.1Sensitivity Probing for Determining Truncation Ranks Sensitivity probing is done by defining a set of compression ratiosc∈\{0\.1,0\.2,…,0\.9\}c\\in\\\{0\.1,0\.2,\\dots,0\.9\\\}which are used to calculate the truncated SVD target rankkkusing Eq[2](https://arxiv.org/html/2606.07098#S2.E2)\. Intuitively the compression ratios describe the percentage of the parameter count that will be retained after the decomposition\. k=c\|𝐖\|\(m\+n\)−1\.k=c\\,\|\\mathbf\{W\}\|\\left\(m\+n\\right\)^\{\-1\}\.\(2\)Whereccdenotes the compression ratio,\|𝐖\|\|\\mathbf\{W\}\|the number of parameters in the weight matrix, andm,nm,nare the dimensions ofWWrows and columns\. We probe for perplexity metric in our condition models by performing truncated SVD compression at rankkkfor each isolated MLP and attention weight matrix at each layer\. This information is used to find most optimal set of compression rankskkacross the whole model that achieve the global target compression ratio, while minimizing the increase in perplexity\. This rank search is done with the binary search algorithm introduced in ASVDYuanet al\.\([2023](https://arxiv.org/html/2606.07098#bib.bib14)\)\. Truncation is performed by retaining the firstkksingular values and discarding the tail\-end of the distribution\. ### 2\.2Learned Scaling Matrices and Post Compression Fine\-Tuning For a given weight matrixW∈ℝm×nW\\in\\mathbb\{R\}^\{m\\times n\}, we initialize two vectorsdr∈ℝm,dc∈ℝnd\_\{r\}\\in\\mathbb\{R\}^\{m\},d\_\{c\}\\in\\mathbb\{R\}^\{n\}with a scaled Gaussian distribution:dr,c=\(0\.1\)σWϵr,cd\_\{r,c\}=\(0\.1\)\\,\\sigma\_\{W\}\\,\\epsilon\_\{r,c\}withϵr∼𝒩\(0,Im\)\\epsilon\_\{r\}\\sim\\mathcal\{N\}\(0,I\_\{m\}\)andϵc∼𝒩\(0,In\)\\epsilon\_\{c\}\\sim\\mathcal\{N\}\(0,I\_\{n\}\)\. We use the standard deviationσW\\sigma\_\{W\}of the weight matrix to scale the initialization ofdrd\_\{r\}anddcd\_\{c\}, to match thedrd\_\{r\}anddcd\_\{c\}with the scaled magnitude of the corresponding weight matrix\. From the vectorsdrd\_\{r\}anddcd\_\{c\}, we construct positive diagonal scaling via exponentiation, defined asSr=diag\(exp\(dr\)\)S\_\{r\}=\\rm\{diag\}\(\\exp\(d\_\{r\}\)\)andSc=diag\(exp\(dc\)S\_\{c\}=\\rm\{diag\}\(\\exp\(d\_\{c\}\)\. These are used to apply row and column scaling to model weightsWW\. We then perform truncated SVD \(Eq[1](https://arxiv.org/html/2606.07098#S1.E1)\), and apply the inverse scaling \(Eq\.[3](https://arxiv.org/html/2606.07098#S2.E3)\) before computing an activation aware loss with a normalization term \(Eq\.[4](https://arxiv.org/html/2606.07098#S2.E4)\)\. W′=Sr−1fsvd\(k\)\(SrWSc\)Sc−1W^\{\\prime\}=S\_\{r\}^\{\-1\}f^\{\(k\)\}\_\{\\mathrm\{svd\}\}\(S\_\{r\}WS\_\{c\}\)S\_\{c\}^\{\-1\}\(3\)ℒF=1mn‖WX−W′X‖F2\.\\mathcal\{L\}\_\{\\mathrm\{F\}\}=\\frac\{1\}\{mn\}\\left\\\|WX\-W^\{\\prime\}X\\right\\\|\_\{F\}^\{2\}\.\(4\)Here,WWis the original weight matrix,XXare activations from a calibration set,W′W^\{\\prime\}is the compressed weight matrix\. After learningdrd\_\{r\}anddcd\_\{c\}, we construct the final compressed weight matrixW′W^\{\\prime\}and replace the original matrix in the model\. We first apply truncated SVD to the scaled weight matrix:fsvd\(k\)\(SrWSc\)f^\{\(k\)\}\_\{\\mathrm\{svd\}\}\(S\_\{r\}WS\_\{c\}\)The final low\-rank factors are then obtained by absorbing the singular values and applying the inverse scaling transformations: L=Sr−1UkΣk,R=ΣkVkTSc−1,L=S\_\{r\}^\{\-1\}U\_\{k\}\\sqrt\{\\Sigma\_\{k\}\},\\qquad R=\\sqrt\{\\Sigma\_\{k\}\}V\_\{k\}^\{T\}S\_\{c\}^\{\-1\},\(5\)such that the compressed matrix satisfiesW′=LRW^\{\\prime\}=LR\. Finally, post\-compression fine\-tuning is performed to realign the impaired weight matrices\. See Appendix[B](https://arxiv.org/html/2606.07098#A2)for further details including pseudo\-code\. ## 3Experimental setup As part of the experiments, we use Qwen3\-8B and Llama 3\.1\-8B\-Instruction models with a focus on English language\. We use a Wikitext2\-raw\-v1Merityet al\.\([2016](https://arxiv.org/html/2606.07098#bib.bib27)\)test split with n=141 samples and 2048 sequence length for all perplexity measurements\. As our calibration data, we use a set of n=32 samples of 2048 sequence length from Wikitext training split\. AlpacaTaoriet al\.\([2023](https://arxiv.org/html/2606.07098#bib.bib29)\)is used for post\-compression fine\-tuning\. See Appendix[B](https://arxiv.org/html/2606.07098#A2)for a full set of implementation details\. Evaluation is done on five downstream task benchmarks with licensing terms summarized in Appendix[I](https://arxiv.org/html/2606.07098#A9)\. Our compute budget is described in Appendix[C](https://arxiv.org/html/2606.07098#A3)\. During post\-compression fine\-tuning we freeze all weight matrices that have not been modified by the low\-rank decomposition and perform comparisons with supervised fine\-tuning versus knowledge distillation \(KD\) using an uncompressed teacher model\. Our experimental setup does not perform compression on token embeddings, layer normalizations or language modeling head\. We run comparisons with SVD\-LLMWanget al\.\([2024](https://arxiv.org/html/2606.07098#bib.bib8)\)and ASVD\+Yuanet al\.\([2023](https://arxiv.org/html/2606.07098#bib.bib14)\)for which we unify the hyperparameter sets for direct comparisons and perform supervised\-fine\-tuning for performance recovery with frozen, non\-compressed elements of the model\. We use LM\-Evaluation\-Harness framework for running evaluationsGaoet al\.\([2024](https://arxiv.org/html/2606.07098#bib.bib20)\)on full downstream task benchmarks\. Table 1:Post\-compression fine tuning results for Llama 3\.1 8B Instruct and Qwen3\-8B\. Zero\-shot benchmarks report length\-normalized accuracy with standard error,pplreports mean perplexity over Wikitext\-Test split\. ## 4Results and Analysis Table[1](https://arxiv.org/html/2606.07098#S3.T1)show results for Llama 3\.1\-8B\-Instruction and Qwen3\-8B models with SigmaScale comparisons in KD and supervised fine\-tuning paradigms\. SigmaScale is most competitive in the mild\-to\-moderate compression regime\. At 0\.90x retention, it substantially improves perplexity over SVD\-LLM for both models, while also recovering much of the zero\-shot performance\. At 0\.75x retention SigmaScale generally improves several zero\-shot benchmarks, but perplexity gains are marginal\. At 0\.50x retention, SigmaScale degrades more sharply, particularly for Llama 3\.1\-8B\-Instruction\. This suggests that the method is most effective when reshaping the singular\-value spectrum can preserve the dominant components of the weight matrix\. Under aggressive compression, the retained subspace may become too small for learned scaling alone to compensate for the discarded singular directions\. SigmaScale should therefore be understood as a mechanism for improving truncation quality in the retained\-rank regime, rather than as a complete solution for extreme low\-rank compression\. Contrary toXinet al\.\([2026](https://arxiv.org/html/2606.07098#bib.bib35)\), our results do not show major improvements of KD over supervised fine\-tuning conditions for SigmaScale\. Given that singular values are indicative of the intrinsic rankKonstantinides and Yao \([2002](https://arxiv.org/html/2606.07098#bib.bib15)\), we perform an analysis of the given compressed weight matrices during the optimization of scaling vectorsdr,cd\_\{r,c\}\. In Table[2](https://arxiv.org/html/2606.07098#S4.T2)we aggregate the mean drop in compression loss as per Eq\.[4](https://arxiv.org/html/2606.07098#S2.E4)and also measure the average drop in the effective rank entropyRoy and Vetterli \([2007](https://arxiv.org/html/2606.07098#bib.bib19)\)of theΣ\\Sigmacomponent\. We see that there is a strong correlation between the compression loss and the effective rank entropy of the compressed weight matrices’Σ\\Sigmacomponents\. See Appendix[F](https://arxiv.org/html/2606.07098#A6)for further visualizations and corresponding results for Qwen3 model, for which we observe similar patterns\. Table 2:Llama 3\.1 average percentage of loss and effective rank entropy decrease during scaling matrix training ## 5Conclusions Our work demonstrates the effectiveness of learning scaling matricesSSfor SVD\-based LLM compression\. Our results show that SigmaScale performs on par with the most similar state\-of\-the\-art methods, while taking a fundamentally different approach: learningSSrather than deriving it analytically, as in SVD\-LLM or ASVD\. We show that the learned scaling matrices manipulate the intrinsic rank of a given weight matrix, as reflected by changes in the effective\-rank entropy of the singular values and its correlation with compression loss\. Future work should further investigate the impact of calibration data used to learnSS, explore different initialization strategies forSS, and examine how complementary current state\-of\-the\-art methods are to one another\. ## Limitations Our method relies on computing SVD at every update step while learning the scaling matrixSSwhich hasO\(n3\)O\(n^\{3\}\)computational expense, we do not explore faster alternative SVD methods that would use approximations\. Our method, as shown in Section[4](https://arxiv.org/html/2606.07098#S4)degrades sharply \(especially for Llama 3\.1\) and should not be viewed as a complete solution for extreme low\-rank compression\. Our evaluation is based on perplexity and a specific set of zero\-shot benchmarks\. We do not explore effects on longer\-form generation, or coding tasks\. Current method’s robustness to different calibration distributions has not been formally verified, yet we anticipate that at its core, Wikitext is a subpar choice which was used mainly to stay consistent for comparisons with SVD\-LLM and ASVD\. ## Ethical Considerations To the best of our knowledge, our work does not require an additional ethics review\. We do not conduct tests on humans nor use any sensitive data\. We summarize used asset licenses in Appendix[I](https://arxiv.org/html/2606.07098#A9)for which our work does not violate any of the licensing terms\. We do not foresee additional significant ethical, societal, or environmental risks arising directly from this work\. As common in the field, we urge anyone who uses our work for downstream applications to cross\-check and verify their model integrity before production deployments\. ## References - S\. Ashkboos, A\. Mohtashami, M\. L\. Croci, B\. Li, P\. Cameron, M\. Jaggi, D\. Alistarh, T\. Hoefler, and J\. Hensman \(2024\)QuaRot: Outlier\-Free 4\-Bit Inference in Rotated LLMs\.arXiv\.Note:arXiv:2404\.00456 \[cs\]External Links:[Link](http://arxiv.org/abs/2404.00456),[Document](https://dx.doi.org/10.48550/arXiv.2404.00456)Cited by:[§1](https://arxiv.org/html/2606.07098#S1.p1.1)\. - Y\. Bisk, R\. Zellers, R\. L\. Bras, J\. Gao, and Y\. Choi \(2020\)PIQA: reasoning about physical commonsense in natural language\.InThirty\-Fourth AAAI Conference on Artificial Intelligence,Cited by:[Table 1](https://arxiv.org/html/2606.07098#S3.T1.1.1.8.1.1)\. - R\. Bommasani, D\. A\. Hudson, E\. Adeli, R\. Altman, S\. Arora, S\. von Arx, M\. S\. Bernstein, J\. Bohg, A\. Bosselut, E\. Brunskill,et al\.\(2021\)On the opportunities and risks of foundation models\.arXiv preprint arXiv:2108\.07258\.Cited by:[§1](https://arxiv.org/html/2606.07098#S1.p1.1)\. - T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell,et al\.\(2020\)Language models are few\-shot learners\.Advances in neural information processing systems33,pp\. 1877–1901\.Cited by:[§1](https://arxiv.org/html/2606.07098#S1.p1.1)\. - P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord \(2018\)Think you have solved question answering? try arc, the ai2 reasoning challenge\.arXiv:1803\.05457v1\.Cited by:[Table 1](https://arxiv.org/html/2606.07098#S3.T1.1.1.6.1.1)\. - DeepSeek\-AI \(2026\)DeepSeek\-v4: towards highly efficient million\-token context intelligence\.External Links:[Link](https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf)Cited by:[§1](https://arxiv.org/html/2606.07098#S1.p1.1)\. - C\. Eckart and G\. Young \(1936\)The approximation of one matrix by another of lower rank\.Psychometrika1\(3\),pp\. 211–218\.Cited by:[§1](https://arxiv.org/html/2606.07098#S1.p3.3)\. - E\. Frantar, S\. Ashkboos, T\. Hoefler, and D\. Alistarh \(2023\)GPTQ: Accurate Post\-Training Quantization for Generative Pre\-trained Transformers\.arXiv\.Note:arXiv:2210\.17323 \[cs\]External Links:[Link](http://arxiv.org/abs/2210.17323),[Document](https://dx.doi.org/10.48550/arXiv.2210.17323)Cited by:[§1](https://arxiv.org/html/2606.07098#S1.p1.1)\. - L\. Gao, J\. Tow, B\. Abbasi, S\. Biderman, S\. Black, A\. DiPofi, C\. Foster, L\. Golding, J\. Hsu, A\. Le Noac’h, H\. Li, K\. McDonell, N\. Muennighoff, C\. Ociepa, J\. Phang, L\. Reynolds, H\. Schoelkopf, A\. Skowron, L\. Sutawika, E\. Tang, A\. Thite, B\. Wang, K\. Wang, and A\. Zou \(2024\)The language model evaluation harness\.Zenodo\.External Links:[Document](https://dx.doi.org/10.5281/zenodo.12608602),[Link](https://zenodo.org/records/12608602)Cited by:[§3](https://arxiv.org/html/2606.07098#S3.p2.1)\. - A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan,et al\.\(2024\)The llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[§1](https://arxiv.org/html/2606.07098#S1.p1.1),[§1](https://arxiv.org/html/2606.07098#S1.p5.2),[§2](https://arxiv.org/html/2606.07098#S2.p1.1)\. - G\. Hinton, O\. Vinyals, and J\. Dean \(2015\)Distilling the knowledge in a neural network\.arXiv preprint arXiv:1503\.02531\.Cited by:[§1](https://arxiv.org/html/2606.07098#S1.p5.2)\. - Y\. Hsu, T\. Hua, S\. Chang, Q\. Lou, Y\. Shen, and H\. Jin \(2022\)Language model compression with weighted low\-rank factorization\.ArXivabs/2207\.00112\.External Links:[Link](https://api.semanticscholar.org/CorpusID:250243971)Cited by:[§1](https://arxiv.org/html/2606.07098#S1.p4.6)\. - J\. Kaplan, S\. McCandlish, T\. Henighan, T\. B\. Brown, B\. Chess, R\. Child, S\. Gray, A\. Radford, J\. Wu, and D\. Amodei \(2020\)Scaling laws for neural language models\.arXiv preprint arXiv:2001\.08361\.Cited by:[§1](https://arxiv.org/html/2606.07098#S1.p1.1)\. - N\. Kishore Kumar and J\. Schneider \(2017\)Literature survey on low rank approximation of matrices\.Linear and Multilinear Algebra65\(11\),pp\. 2212–2244\.Cited by:[§1](https://arxiv.org/html/2606.07098#S1.p3.11)\. - K\. Konstantinides and K\. Yao \(2002\)Statistical analysis of effective singular values in matrix rank determination\.IEEE Transactions on Acoustics, Speech, and Signal Processing36\(5\),pp\. 757–763\.Cited by:[§4](https://arxiv.org/html/2606.07098#S4.p3.3)\. - Y\. Li, D\. Lee, R\. Yin, and P\. Panda \(2026\)Optimal brain decomposition for accurate llm low\-rank approximation\.arXiv preprint arXiv:2604\.00821\.Cited by:[§1](https://arxiv.org/html/2606.07098#S1.p4.6)\. - D\. Liu, Y\. Zhu, Z\. Liu, Y\. Liu, C\. Han, J\. Tian, R\. Li, and W\. Yi \(2025a\)A survey of model compression techniques: past, present, and future\.Frontiers in Robotics and AI12,pp\. 1518965\.Cited by:[§1](https://arxiv.org/html/2606.07098#S1.p1.1)\. - Z\. Liu, C\. Zhao, I\. Fedorov, B\. Soran, D\. Choudhary, R\. Krishnamoorthi, V\. Chandra, Y\. Tian, and T\. Blankevoort \(2025b\)SpinQuant: LLM quantization with learned rotations\.arXiv\.Note:arXiv:2405\.16406 \[cs\]External Links:[Link](http://arxiv.org/abs/2405.16406),[Document](https://dx.doi.org/10.48550/arXiv.2405.16406)Cited by:[§1](https://arxiv.org/html/2606.07098#S1.p1.1)\. - S\. Merity, C\. Xiong, J\. Bradbury, and R\. Socher \(2016\)Pointer sentinel mixture models\.External Links:1609\.07843Cited by:[§3](https://arxiv.org/html/2606.07098#S3.p1.1)\. - T\. Mihaylov, P\. Clark, T\. Khot, and A\. Sabharwal \(2018\)Can a suit of armor conduct electricity? a new dataset for open book question answering\.InProceedings of the 2018 conference on empirical methods in natural language processing,pp\. 2381–2391\.Cited by:[Table 1](https://arxiv.org/html/2606.07098#S3.T1.1.1.5.1.1)\. - L\. Mirsky \(1960\)Symmetric gauge functions and unitarily invariant norms\.The quarterly journal of mathematics11\(1\),pp\. 50–59\.Cited by:[§1](https://arxiv.org/html/2606.07098#S1.p3.3)\. - M\. Nagel, R\. A\. Amjad, M\. Van Baalen, C\. Louizos, and T\. Blankevoort \(2020\)Up or down? Adaptive rounding for post\-training quantization\.InProceedings of the 37th International Conference on Machine Learning2024 IEEE 44th International Conference on Distributed Computing Systems \(ICDCS\),H\. D\. III and A\. Singh \(Eds\.\),Proceedings of Machine Learning Research, Vol\.119,pp\. 7197–7206\.External Links:[Link](https://proceedings.mlr.press/v119/nagel20a.html)Cited by:[§1](https://arxiv.org/html/2606.07098#S1.p4.6)\. - O\. Roy and M\. Vetterli \(2007\)The effective rank: a measure of effective dimensionality\.In2007 15th European signal processing conference,pp\. 606–610\.Cited by:[§4](https://arxiv.org/html/2606.07098#S4.p3.3)\. - R\. Saha, N\. Sagan, V\. Srivastava, A\. J\. Goldsmith, and M\. Pilanci \(2024\)Compressing Large Language Models using Low Rank and Low Precision Decomposition\.arXiv\.Note:arXiv:2405\.18886 \[cs\]External Links:[Link](http://arxiv.org/abs/2405.18886),[Document](https://dx.doi.org/10.48550/arXiv.2405.18886)Cited by:[§1](https://arxiv.org/html/2606.07098#S1.p1.1),[§1](https://arxiv.org/html/2606.07098#S1.p4.6)\. - K\. Sakaguchi, R\. L\. Bras, C\. Bhagavatula, and Y\. Choi \(2021\)Winogrande: an adversarial winograd schema challenge at scale\.Communications of the ACM64\(9\),pp\. 99–106\.Cited by:[Table 1](https://arxiv.org/html/2606.07098#S3.T1.1.1.7.1.1)\. - S\. L\. Shishkin, A\. Shalaginov, and S\. D\. Bopardikar \(2019\)Fast approximate truncated svd\.Numerical Linear Algebra with Applications26\(4\),pp\. e2246\.Cited by:[§1](https://arxiv.org/html/2606.07098#S1.p3.11)\. - R\. Taori, I\. Gulrajani, T\. Zhang, Y\. Dubois, X\. Li, C\. Guestrin, P\. Liang, and T\. B\. Hashimoto \(2023\)Stanford alpaca: an instruction\-following llama model\.GitHub\.Note:[https://github\.com/tatsu\-lab/stanford\_alpaca](https://github.com/tatsu-lab/stanford_alpaca)Cited by:[Appendix B](https://arxiv.org/html/2606.07098#A2.p1.1),[Appendix B](https://arxiv.org/html/2606.07098#A2.p4.1),[§1](https://arxiv.org/html/2606.07098#S1.p5.2),[§3](https://arxiv.org/html/2606.07098#S3.p1.1)\. - Q\. Wang, J\. Ke, M\. Tomizuka, Y\. Chen, K\. Keutzer, and C\. Xu \(2025\)Dobi\-SVD: Differentiable SVD for LLM Compression and Some New Perspectives\.arXiv\.Note:arXiv:2502\.02723 \[cs\]External Links:[Link](http://arxiv.org/abs/2502.02723),[Document](https://dx.doi.org/10.48550/arXiv.2502.02723)Cited by:[§1](https://arxiv.org/html/2606.07098#S1.p3.11)\. - W\. Wang, Z\. Dang, Y\. Hu, P\. Fua, and M\. Salzmann \(2022\)Robust Differentiable SVD\.IEEE Transactions on Pattern Analysis and Machine Intelligence44\(9\),pp\. 5472–5487\.Note:arXiv:2104\.03821 \[cs\]External Links:ISSN 0162\-8828, 2160\-9292, 1939\-3539,[Link](http://arxiv.org/abs/2104.03821),[Document](https://dx.doi.org/10.1109/TPAMI.2021.3072422)Cited by:[Appendix B](https://arxiv.org/html/2606.07098#A2.p7.1),[§1](https://arxiv.org/html/2606.07098#S1.p3.11)\. - X\. Wang, Y\. Zheng, Z\. Wan, and M\. Zhang \(2024\)Svd\-llm: truncation\-aware singular value decomposition for large language model compression\.arXiv preprint arXiv:2403\.07378\.Cited by:[§1](https://arxiv.org/html/2606.07098#S1.p1.1),[§1](https://arxiv.org/html/2606.07098#S1.p2.4),[§1](https://arxiv.org/html/2606.07098#S1.p4.6),[§1](https://arxiv.org/html/2606.07098#S1.p5.2),[§3](https://arxiv.org/html/2606.07098#S3.p2.1)\. - M\. Xin, S\. Priyadarshi, J\. Xin, B\. Kartal, A\. Vavre, A\. K\. Thekkumpate, Z\. Chen, A\. S\. Mahabaleshwarkar, I\. Shahaf, A\. Bercovich, K\. Patel, S\. V\. Velury, C\. Luo, Z\. Cheng, J\. Chen, C\. Yu, W\. Ping, O\. Rybakov, N\. Tajbakhsh, O\. Olabiyi, D\. Stosic, D\. Wu, S\. Han, E\. Chung, S\. T\. Sreenivas, B\. Catanzaro, Y\. Suhara, T\. Blankevoort, and H\. Mao \(2026\)Quantization\-Aware Distillation for NVFP4 Inference Accuracy Recovery\.arXiv\.Note:arXiv:2601\.20088 \[cs\]External Links:[Link](http://arxiv.org/abs/2601.20088),[Document](https://dx.doi.org/10.48550/arXiv.2601.20088)Cited by:[§1](https://arxiv.org/html/2606.07098#S1.p1.1),[§1](https://arxiv.org/html/2606.07098#S1.p5.2),[§4](https://arxiv.org/html/2606.07098#S4.p2.1)\. - A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§1](https://arxiv.org/html/2606.07098#S1.p1.1),[§1](https://arxiv.org/html/2606.07098#S1.p5.2),[§2](https://arxiv.org/html/2606.07098#S2.p1.1)\. - R\. Yang, T\. Wu, J\. Wang, P\. Hu, Y\. Wu, N\. Wong, and Y\. Yang \(2024\)Llm\-neo: parameter efficient knowledge distillation for large language models\.arXiv preprint arXiv:2411\.06839\.Cited by:[§1](https://arxiv.org/html/2606.07098#S1.p1.1)\. - Z\. Yuan, Y\. Shang, Y\. Song, Q\. Wu, Y\. Yan, and G\. Sun \(2023\)ASVD: activation\-aware singular value decomposition for compressing large language models\.External Links:2312\.05821Cited by:[§1](https://arxiv.org/html/2606.07098#S1.p1.1),[§1](https://arxiv.org/html/2606.07098#S1.p2.4),[§1](https://arxiv.org/html/2606.07098#S1.p4.6),[§1](https://arxiv.org/html/2606.07098#S1.p5.2),[§2\.1](https://arxiv.org/html/2606.07098#S2.SS1.p2.3),[§3](https://arxiv.org/html/2606.07098#S3.p2.1)\. - R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi \(2019\)Hellaswag: can a machine really finish your sentence?\.InProceedings of the 57th annual meeting of the association for computational linguistics,pp\. 4791–4800\.Cited by:[Table 1](https://arxiv.org/html/2606.07098#S3.T1.1.1.9.1.1)\. - K\. Zhu, F\. Hu, Y\. Ding, W\. Zhou, and R\. Wang \(2025\)A comprehensive review of network pruning based on pruning granularity and pruning time perspectives\.Neurocomputing626,pp\. 129382\.External Links:ISSN 0925\-2312,[Link](https://www.sciencedirect.com/science/article/pii/S0925231225000542),[Document](https://dx.doi.org/https%3A//doi.org/10.1016/j.neucom.2025.129382)Cited by:[§1](https://arxiv.org/html/2606.07098#S1.p1.1)\. - X\. Zhu, J\. Li, Y\. Liu, C\. Ma, and W\. Wang \(2024\)A survey on model compression for large language models\.Transactions of the Association for Computational Linguistics12,pp\. 1556–1577\.Cited by:[§1](https://arxiv.org/html/2606.07098#S1.p1.1)\. ## Appendix ACore Experiment Variations Table 3:Llama\-Alpaca variations:♣\\clubsuit3 para, 1 epoch;♠\\spadesuit1para 3 epoch\.♡\\heartsuitVanilla Alpaca with 3 training epochsTable 4:Benchmark results for Wikitext\-Train as post\-compression training data for Llama 3\.1 8B Instruct model\.We perform additional experiments with a custom Alpaca dataset for which its outputs are generated from Llama 3\.1 8B Instruction model\. The dataset contains three output generations per single instruction, with a goal to introduce variance\. We perform post\-compression fine\-tuning by training on 1 answer per instruction over 3 epochs against 3 answers per instruction over 1 epoch\. The results are depicted in Table[3](https://arxiv.org/html/2606.07098#A1.T3)and our tests show minor improvements with Llama\-Alpaca dataset\. Specifically for 25% compression, there is 1 point perplexity improvement for between♣\\clubsuitand♡\\heartsuitexperiment variations but marginal changes across zero\-shot benchmarks\. Further details of the custom Alpaca dataset are described in Section[D](https://arxiv.org/html/2606.07098#A4)\. We also run experiments for using Wikitext2 as post\-compression fine\-tuning dataset by training on the continued pretraining task\. Results for this are depicted in Table[4](https://arxiv.org/html/2606.07098#A1.T4)which showcases improvements in perplexity although decrease in overall zero\-shot benchmark performance across the board\. ## Appendix BImplementation Details Scaling Matrix Learning:When learning the row and column scaling matrices, we perform hyperparameter optimization via grid search\. Optimal found configuration is described in Table[5](https://arxiv.org/html/2606.07098#A2.T5)\. See Algorithm[1](https://arxiv.org/html/2606.07098#alg1)for pseudo\-code of the training loop and Algorithm[2](https://arxiv.org/html/2606.07098#alg2)for pseudo\-code of constructing final compressedW′W^\{\\prime\}\. Table 5:Hyperparameter sweep configurationPost\-compression fine\-tuning:We perform post\-compression fine\-tuning with full Alpaca dataset training split over1 epochand computing the loss only over the response span\. Wikitext\-Test for all perplexity evaluations is first tokenized then split into 2048 sequences\. SVD\-LLM and ASVD has a discrepancy where first their implementations split text chunks intosequence\_len×10sequence\\\_len\\times 10character lengths and afterwards perform tokenization\. We use the same learning rate and epoch count for SigmaScale, ASVD and SVD\-LLM\. Our post\-compression fine\-tuning uses Alpaca datasetTaoriet al\.\([2023](https://arxiv.org/html/2606.07098#bib.bib29)\)for which we fine\-tune over 1 epoch computing cross\-entropy loss over the response span\. We use learning rate10−610^\{\-6\}with a cosine LR scheduler with 0\.1 warmup step ratio, this configuration is used for both SVD\-LLM and SigmaScale results\. Knowledge distillation \(KD\):We use the following loss function \(Eq\.[6](https://arxiv.org/html/2606.07098#A2.E6)\) for performing KD ℒtotal=αℒKD\+\(1−α\)ℒtask\\mathcal\{L\}\_\{\\text\{total\}\}=\\alpha\\mathcal\{L\}\_\{\\text\{KD\}\}\+\(1\-\\alpha\)\\mathcal\{L\}\_\{\\text\{task\}\}\(6\)whereℒKD\\mathcal\{L\}\_\{\\text\{KD\}\}is the KL divergence between student and teacher logits andℒtask\\mathcal\{L\}\_\{\\text\{task\}\}is cross\-entropy of student predictions over ground truth labels\. By default we always useα=0\.7\\alpha=0\.7unless explicitly specified otherwise\. 1:Input:Matrix Wm×nW^\{m\\times n\}, rank kk, number of epochs TT, activations XXfrom calibration data 2:Initialize dc∼𝒩\(0,In\)⋅σw⋅0\.1d\_\{c\}\\sim\\mathcal\{N\}\(0,I\_\{n\}\)\\cdot\\sigma\_\{w\}\\cdot 0\.1; dr∼𝒩\(0,Im\)⋅σw⋅0\.1d\_\{r\}\\sim\\mathcal\{N\}\(0,I\_\{m\}\)\\cdot\\sigma\_\{w\}\\cdot 0\.1 3:for t=0t=0to T−1T\-1do⊳\\trianglerightConstruct the scaling matrices and their inversions 4: Sc←diag\(exp\(dc\)\)S\_\{c\}\\leftarrow\\mathrm\{diag\}\(exp\(d\_\{c\}\)\) 5: Sc−1←diag\(exp\(−dc\)\)S\_\{c\}^\{\-1\}\\leftarrow\\mathrm\{diag\}\(exp\(\-d\_\{c\}\)\) 6: Sr←diag\(exp\(dr\)\)S\_\{r\}\\leftarrow\\mathrm\{diag\}\(exp\(d\_\{r\}\)\) 7: Sr−1←diag\(exp\(−dr\)\)S\_\{r\}^\{\-1\}\\leftarrow\\mathrm\{diag\}\(exp\(\-d\_\{r\}\)\)⊳\\trianglerightApply column and row scaling 8: Wscaled←SrWScW\_\{\\text\{scaled\}\}\\leftarrow S\_\{r\}WS\_\{c\}⊳\\trianglerightCompute truncated SVD 9: \(Uk,Sk,Vk\)←SVD\(Wscaled,k\)\(U\_\{k\},S\_\{k\},V\_\{k\}\)\\leftarrow\\mathrm\{SVD\}\(W\_\{\\text\{scaled\}\},k\)⊳\\trianglerightReconstruct W’ truncated SVD 10: Wscaled\(k\)←Ukdiag\(Sk\)VkW\_\{\\text\{scaled\}\}^\{\(k\)\}\\leftarrow U\_\{k\}\\mathrm\{diag\}\(S\_\{k\}\)V\_\{k\}⊳\\trianglerightInvert scaling 11: W\(r\)←Sr−1Wscaled\(r\)Sc−1W^\{\(r\)\}\\leftarrow S\_\{r\}^\{\-1\}W\_\{\\text\{scaled\}\}^\{\(r\)\}S\_\{c\}^\{\-1\}⊳\\triangleright… Compute loss, updatedcd\_\{c\}drd\_\{r\} 12:endfor 13:Output: dc,drd\_\{c\},d\_\{r\} Algorithm 1Pseudo\-code of training the scaling matrixSSSVD derivative during training:As outlined in contributionWanget al\.\([2022](https://arxiv.org/html/2606.07098#bib.bib38)\), the SVD algorithm has an unstable derivative, we bypass this by skipping update steps in which theσi−σj\\sigma\_\{i\}\-\\sigma\_\{j\}denominator reaches close to 0 causing NaN values\. We find that even with the skipped updates, our loss still converges often triggering early stop, therefore while this is not necessarily a robust solution, we do not experience this as a bottleneck for our usecase\. ## Appendix CCompute Budget For running our computation we use Nvidia and AMD GPUs summarized in Table[6](https://arxiv.org/html/2606.07098#A3.T6)with approximate compute times and GPU count used for a given processing stage\. Numbers are reported per experimental condition \(e\.g\. model and corresponding compression ratio\)\. Table 6:Compute budget overview ## Appendix DCustom Alpaca Dataset Created with Llama 3\.1 8B Instruct by generating 3 output versions per single datapoint row\. The idea is to introduce data variance for weight realignment\. For creating the dataset, we re\-ran the inference 3 times with the following generation settings Table[7](https://arxiv.org/html/2606.07098#A4.T7)\. Table 7:Hyperparameters used for generating Llama 3\.1 8B answers of the Alpaca inputs for a custom dataset used in experiments described in Table[3](https://arxiv.org/html/2606.07098#A1.T3)\.1:Input:Weight matrix WW, rank rr, scaling vectors dr,dcd\_\{r\},d\_\{c\}⊳\\trianglerightConstruct scaling matrices and their inverses 2: Sc←diag\(exp\(dc\)\)S\_\{c\}\\leftarrow\\mathrm\{diag\}\(\\exp\(d\_\{c\}\)\) 3: Sc−1←diag\(exp\(−dc\)\)S\_\{c\}^\{\-1\}\\leftarrow\\mathrm\{diag\}\(\\exp\(\-d\_\{c\}\)\) 4: Sr←diag\(exp\(dr\)\)S\_\{r\}\\leftarrow\\mathrm\{diag\}\(\\exp\(d\_\{r\}\)\) 5: Sr−1←diag\(exp\(−dr\)\)S\_\{r\}^\{\-1\}\\leftarrow\\mathrm\{diag\}\(\\exp\(\-d\_\{r\}\)\)⊳\\trianglerightApply symmetric row/column scaling 6: Wscaled←SrWScW\_\{\\mathrm\{scaled\}\}\\leftarrow S\_\{r\}WS\_\{c\}⊳\\trianglerightCompute truncated SVD 7: \(Uk,Σk,Vk\)←TruncateSVD\(Wscaled,k\)\(U\_\{k\},\\Sigma\_\{k\},V\_\{k\}\)\\leftarrow\\mathrm\{TruncateSVD\}\(W\_\{\\mathrm\{scaled\}\},k\)⊳\\trianglerightConstruct square\-root singular value factors 8: Lscaled←UkΣkL\_\{\\text\{scaled\}\}\\leftarrow U\_\{k\}\\sqrt\{\\Sigma\_\{k\}\} 9: Rscaled←ΣkVk⊤R\_\{\\text\{scaled\}\}\\leftarrow\\sqrt\{\\Sigma\_\{k\}\}V\_\{k\}^\{\\top\}⊳\\trianglerightMap factors back to original parameter space 10: L←Sr−1LscaledL\\leftarrow S\_\{r\}^\{\-1\}L\_\{\\text\{scaled\}\} 11: R←RscaledSc−1R\\leftarrow R\_\{\\text\{scaled\}\}S\_\{c\}^\{\-1\} 12:Output: L,RL,R Algorithm 2Construction of low\-rank matrices after learning the row/column scaling ## Appendix EInvestigating Scaling MatrixSSTraining Paradigms We conduct additional analysis of the scaling matrixSStraining to check for isolated and aggregated effects of row and column scaling with respect to the compression loss\. For this we use a Llama 3\.1 8B model’s Key matrix from layer 30 as the test case, see Table[8](https://arxiv.org/html/2606.07098#A5.T8)and Figure[2](https://arxiv.org/html/2606.07098#A5.F2)and Figure[3](https://arxiv.org/html/2606.07098#A5.F3)for loss curve of MLP\_down module\. The loss curves in Figures[2](https://arxiv.org/html/2606.07098#A5.F2)and[3](https://arxiv.org/html/2606.07098#A5.F3)show a clear benefit of scaling both rows and columns with respect to decreasing the compression loss\. Additionally, we execute a test run for training the row and column scaling matricesSSseparately by training first row and then column scaling, as well as jointly\. Table[9](https://arxiv.org/html/2606.07098#A5.T9)show this result for a single module as an example, we use this information to jointly train all scaling matrices as part of our core methodology\. Table 8:Compression loss and sigma effective rank entropy for different compression strategies after training scaling matrixSS\. Llama 3\.1 8B\-Instruct at 80% reduction for layer 30 key matrixTable 9:Compression loss for training row and column scaling matrices sequentially \(first rows, then columns\) and jointly\. Llama 3\.1 8B\-Instr layer 31 Query matrix at 80% reductionFigure 2:Overview of compression loss when training scaling matrices applied separately and together for rows and columns for Llama 3\.1 layer 30 Key matrixFigure 3:Overview of compression loss when training scaling matrices applied separately and together for rows and columns for Llama 3\.1 layer 14 MLP\_down matrix ## Appendix FFurther Analysis on Sigma Values As described in Section[4](https://arxiv.org/html/2606.07098#S4), we expand the analysis of compression loss vs sigma value effective rank entropy in Table[10](https://arxiv.org/html/2606.07098#A6.T10)for Qwen3\-8B model\. Additionally see Figures[4](https://arxiv.org/html/2606.07098#A6.F4)and[5](https://arxiv.org/html/2606.07098#A6.F5)which shows that by applying the scaling matrixSSto a weightWW, there is a downstream effect on the sigma value distribution\. The higher end of the sigma values are scaled up, whereas the lower end sees a minor scale\-down\. Table 10:Overview of loss and effective rank entropy decrease for all seven modules, for Qwen3\-8B  Figure 4:Overview of Llama 3\.1 Layer 30 Query matrix\. Plots of Sigma values after performing SVD on a scaled and unscaled weight matrix\. Side by side comparisons with a logarithmic and linear x axis scaling for an overview of top Sigma values\. The dashed Rank line indicates the SVD truncation rank\.  Figure 5:Overview of Llama 3\.1 Layer 14 Key matrix\. Plots of Sigma values after performing SVD on a scaled and unscaled weight matrix\. Side by side comparisons with a logarithmic and linear x axis scaling for an overview of top Sigma values\. The dashed Rank line indicates the SVD truncation rank\. ## Appendix GCodebase and Dataset Links ## Appendix HGenerative AI Disclosure As part of this work effort we used generative AI as a coding assistant as well as writing aid for refining text and cross\-checking grammar\. All generative AI outputs were human cross\-checked and validated\. ## Appendix IUsed Resources Licensing Overview We summarize licenses of the models and datasets that we have used as part of this study in Table[11](https://arxiv.org/html/2606.07098#A9.T11) Table 11:Licensing information for datasets and models used in this study\. ## Appendix JAlpaca Prompt Template We define the prompt with a similar template as per original Alpaca dataset\. We use Huggingface tokenizer to automatically apply the prompt formatting for Llama and Qwen models\. We define overall instructions in thesystemprompt, task\-specific instructions and any additional input as theuserprompt, expected output in theassistantsection\. See Listings[1](https://arxiv.org/html/2606.07098#LST1)and[2](https://arxiv.org/html/2606.07098#LST2)for full formatting for Llama 3\.1\. Listing 1:Llama 3\.1 Alpaca\-style Chat Template<\|begin\_of\_text\|\><\|start\_header\_id\|\>system<\|end\_header\_id\|\> Belowisaninstructionthatdescribesatask\.Writearesponsethatappropriatelycompletestherequest\. <\|eot\_id\|\><\|start\_header\_id\|\>user<\|end\_header\_id\|\> \#\#\#Instruction: \{instruction\} <\|eot\_id\|\><\|start\_header\_id\|\>assistant<\|end\_header\_id\|\> \{output\} <\|eot\_id\|\> Listing 2:Llama 3\.1 Alpaca\-style Chat Template’<\|begin\_of\_text\|\><\|start\_header\_id\|\>system<\|end\_header\_id\|\> Belowisaninstructionthatdescribesatask,pairedwithaninputthatprovidesfurthercontext\.Writearesponsethatappropriatelycompletestherequest\. <\|eot\_id\|\><\|start\_header\_id\|\>user<\|end\_header\_id\|\> \#\#\#Instruction:\{instruction\} \#\#\#Input:\{input\} <\|eot\_id\|\><\|start\_header\_id\|\>assistant<\|end\_header\_id\|\> \{output\} <\|eot\_id\|\>’
Similar Articles
ScaleSweep: Accurate NVFP4 Post-Training Quantization of LLMs via Block Scale Initialization
ScaleSweep proposes a new block scale initialization method for NVFP4 post-training quantization of LLMs, achieving improved accuracy by sweeping over feasible block scale candidates. Experiments on Llama and Qwen models show it preserves over 93% of full-precision performance under aggressive quantization.
Trainable Smooth-Rotation Transforms with Learned Channel Scales for LLM Quantization
This paper proposes trainable smooth-rotation transforms with quantile-robust scaling and gradient-based optimization to improve post-training quantization of LLMs, achieving significant error reduction on LLaMA-3.2-1B under W4A4 quantization.
Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models
This paper systematically studies scale vectors in LLM normalization layers, showing they optimize training through a self-amplifying preconditioning effect, and proposes three lightweight improvements that enhance performance and scaling behavior with negligible overhead.
Joint Structural Pruning and Mixed-Precision Quantization for LLM Compression
A novel end-to-end framework for LLM compression that jointly optimizes structural pruning and mixed-precision quantization, achieving significant perplexity reductions and speedups over state-of-the-art methods, especially at ultra-low bit precisions.
LiteFrame Scales Video LLM Efficiency (6 minute read)
LiteFrame introduces a highly efficient video encoder for Video LLMs that uses Compressed Token Distillation to enable up to 8x more frames and 35% latency reduction while maintaining accuracy, setting a new Pareto frontier for long-form video understanding.