Small LLMs: Pruning vs. Training from Scratch
Summary
This paper empirically compares pruning vs. training small language models from scratch, finding that pruning provides a strong advantage under limited token budgets but that the advantage diminishes as training scales, especially with coarse pruning.
# Small LLMs: Pruning vs. Training from Scratch
Source: [https://arxiv.org/html/2606.14150](https://arxiv.org/html/2606.14150)
Yufeng Xu¹,², Taiming Lu¹, Kunjun Li¹, Jiachen Zhu², Mingjie Sun³, Zhuang Liu¹
¹Princeton University, ²New York University, ³Carnegie Mellon University


Figure 1: Initialization by pruning provides a strong advantage over random initialization, but this advantage diminishes as training continues\. Left: under the same training token budget, pruning initialization beats random initialization, although the advantage decreases with longer training\. Right: when the random initialization baseline is trained with the full token budget used by the entire pruning pipeline, it becomes competitive with the pruning alternative\. Overall, pruning is a powerful shortcut to strong small models, but its advantage diminishes as training scales\.

Abstract

Pruning promises a shortcut to strong small language models\. In this work, we examine this promise by pruning Llama\-3\.1\-8B at pruning ratios of 0\.5–0\.8 with six methods spanning depth, width, and sparse granularities, under two controlled token\-matched settings\. \(1\) With the same training token budget, pruned initialization consistently outperforms random initialization\. This shows that the parent model provides a strong starting point, although the advantage narrows as the training token budget grows and as the pruning ratio rises, nearly vanishing at the highest pruning ratio we study\. \(2\) When training from scratch is instead given the full token budget consumed by the whole pipeline, pruning at finer granularities still retains an advantage, while coarser structured pruning can be matched or surpassed\. This suggests that the parent model transfers knowledge that additional training tokens alone cannot fully recover, but only at fine granularity\. Taken together, our results yield a clear recommendation: with a large pretrained model in hand and a limited training token budget, pruning is better than training from scratch; when the training budget is not limited, training from scratch can be competitive for coarser pruning, so a large pretrained parent is not always necessary\. Our code is available at [github\.com/zlab\-princeton/llm\-pruning\-collection](https://github.com/zlab-princeton/llm-pruning-collection)\.
## 1 Introduction
Remove more than half of a trained language model’s parameters, add a lightweight retraining pass, and the resulting small model still matches most of the original’s performance \(ma2023llmpruner;xia2024shearedllama;sun2024wanda;sreenivas2024minitron2\)\. This is the promise of network pruning \(lecun1989OBD;babak1993obs\): a shortcut to strong small language models that bypasses the rising cost of pretraining Large Language Models \(LLMs\) from scratch on trillions of tokens \(grattafiori2024llama3;qwen3;deepseekv3;kaplan2020scalinglaw;hoffmann2022computeoptimalllm;henighan2020scalinglaw;touvron2023llama;touvron2023llama2\)\.
Multi\-billion\-parameter open\-weight LLMs are now widely available, with releases such as Llama \(grattafiori2024llama3;meta2025llama3\.1\), Qwen \(qwen3\), Gemma \(gemma3\), and OLMo \(olmo3\) putting pretrained checkpoints in public hands\. At the same time, demand for smaller models is rising, driven by deployment constraints, inference efficiency, and accessibility \(hagele2024scalinglawscomputeoptimaltraining;hu2024minicpm;nag2024efficientcontinualpretrainingllms\)\. This is exactly the gap pruning promises to fill, and it raises a practical question: to obtain a capable small model, should we train it from scratch, or carve it out of an existing large one via pruning? Pruning is appealing because it may transfer knowledge from the parent, giving a stronger starting point than random initialization\. But the inheritance is not free: a large parent must be pretrained before it can be pruned, and whether pruning’s advantage survives a token\-matched comparison remains unclear\.
While pruning is typically viewed as a compression technique, we instead treat it as an *initialization strategy*: the pruned weights are simply a starting point for training the target small model\. This reframes the natural baseline: rather than comparing the pruned model against its parent, we compare it against a model of the same size trained from a random initialization on the same data stream\. The question becomes: under an equal training token budget, does initializing by pruning a larger model beat random initialization?
We then ask a sharper question: are pruning’s gains merely a head start that more training can buy, or knowledge that extra data cannot recover? To find out, we compare pruned against random initialization under two token\-matched settings\. \(i\) Equal training token budget: both target models are trained under the same training token budget, isolating the effect of initialization\. \(ii\) Equal total token budget: training from scratch is instead given the *entire* token budget of the pruning pipeline \(the parent’s pretraining tokens plus the retraining tokens\), testing whether more tokens alone can close the gap\. Varying pruning ratio and granularity across both, we draw two key observations, both illustrated in Figure [1](https://arxiv.org/html/2606.14150#S0.F1):
1. Under an equal training token budget, training from pruned initialization consistently outperforms training from random initialization, though this advantage diminishes as the pruning ratio increases\.
2. Even when training from scratch uses the full token budget \(number of training iterations\) of the pruning pipeline, sparse pruning still outperforms it; for structured pruning, however, training from scratch can close the gap\.
Together, these results recast pruning as a token\-efficient initialization strategy: with a strong parent model in hand and a limited training token budget, pruning beats training from scratch, especially at fine granularity, transferring knowledge that additional tokens alone cannot recover\.
\(a\) Pruning granularity

\(b\) Methods evaluated

| method | target modules | criterion |
| --- | --- | --- |
| Minitron depth | layers | influence on validation loss |
| Minitron width | hidden channels, mlp channels | block output activation norm |
| FLAP | hidden channels, mlp channels | input feature variance × weight column norm |
| Sheared LLaMA | layers, hidden channels, mlp channels, attn heads | learned pruning masks \(Lagrangian optim\.\) |
| Wanda | individual weights | weight magnitude × input feature norm |
| SparseGPT | individual weights | layerwise reconstruction error |

Figure 2: Pruning granularity and method overview\. \(a\) Illustration of the pruning granularities we study: depth, width, and sparse pruning\. \(b\) Overview of the six pruning methods we evaluate, including their target modules and criteria\.
## 2 Background
Network pruning \(lecun1989OBD\) identifies and removes unimportant weights or modules in neural networks to improve efficiency \(frankle2019LTH\)\. Methods fall into two families \(Figure [2](https://arxiv.org/html/2606.14150#S1.F2)\(a\)\): sparse pruning, which zeroes out individual weights while preserving model shape, and structured pruning, which removes entire architectural components such as layers, attention heads, or feedforward channels\.
#### Sparse pruning\.
Sparse methods mask individual weights to zero\. Semi\-structured \(n:m\) pruning \(zhou2021nmpruning\) keeps exactly *n* of every *m* weights and can exploit hardware support \(mishra2021acceleratingsparse;hubara2021acceleratedsparsetraining\); unstructured pruning places no constraint on which weights are removed\. SparseGPT \(frantar2023sparsegpt\) frames weight selection as a layerwise reconstruction problem inspired by Optimal Brain Surgeon \(babak1993obs\)\. Wanda \(sun2024wanda\) approximates the same objective more cheaply by scoring each weight as its magnitude times the corresponding input feature norm, with follow\-up work \(yang2025wanda\+\+\) further refining the activation\-based scoring\. Both methods support n:m and unstructured modes and require only a lightweight weight update after pruning\.
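The Wanda-style score and the n:m constraint described above can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' implementation: the function names (`wanda_scores`, `nm_mask`, `prune_nm`), the row-per-output-neuron weight layout, and the toy calibration inputs are all assumptions made for the example.

```python
import math


def wanda_scores(weights, inputs):
    """Score each weight as |W[i][j]| times the L2 norm of input feature j.

    weights: list of rows, one row per output neuron (assumed layout).
    inputs:  list of calibration samples, one value per input feature.
    """
    num_features = len(weights[0])
    # L2 norm of each input feature, taken over the calibration samples.
    norms = [math.sqrt(sum(sample[j] ** 2 for sample in inputs))
             for j in range(num_features)]
    return [[abs(w) * norms[j] for j, w in enumerate(row)] for row in weights]


def nm_mask(score_row, n=2, m=4):
    """Keep the n highest-scoring entries in every group of m consecutive entries."""
    mask = []
    for start in range(0, len(score_row), m):
        group = score_row[start:start + m]
        keep = sorted(range(len(group)), key=lambda k: group[k], reverse=True)[:n]
        mask.extend(1 if k in keep else 0 for k in range(len(group)))
    return mask


def prune_nm(weights, inputs, n=2, m=4):
    """Zero out the weights that the n:m mask drops; the matrix shape is unchanged."""
    scores = wanda_scores(weights, inputs)
    return [[w * keep for w, keep in zip(row, nm_mask(s, n, m))]
            for row, s in zip(weights, scores)]


# Toy example: the second input feature has a much larger norm than the others.
toy_weights = [[0.5, -0.1, 0.3, 0.2]]
toy_inputs = [[1.0, 10.0, 1.0, 1.0], [1.0, 10.0, 1.0, 1.0]]
pruned = prune_nm(toy_weights, toy_inputs)
```

In this toy row, magnitude alone would keep 0.5 and 0.3, but the large norm of the second input feature makes the activation-aware score keep 0.5 and -0.1 instead.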
#### Structured pruning\.
Structured methods, with roots in classical CNN\-era channel and filter pruning \(li2017l1normpruning;liu2017networkslimming;luo2017thinet;huang2018sparsestructureselection\), remove whole modules, yielding smaller dense architectures\. Depth pruning removes entire layers: ShortGPT \(men2024shortgpt\), Shortened Llama \(kim2024shortenedllama\), and SLEB \(song2024sleb\) estimate layer importance via activation similarity, Taylor approximations, and block\-skip perplexity respectively; Minitron\-depth \(sreenivas2024minitron2\) searches for the consecutive layer group whose removal least hurts validation loss\. Width pruning removes hidden or feedforward channels: Minitron\-width \(sreenivas2024minitron2\) ranks channels by output activation norm, while FLAP \(an2023flap\) uses input feature variance weighted by column norm\. Sheared LLaMA \(xia2024shearedllama\) jointly prunes both depth and width by learning pruning masks via Lagrangian optimization subject to a fixed target model size constraint specified in advance\.
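The depth-pruning search attributed to Minitron-depth above (find the consecutive layer group whose removal hurts validation loss least) can be illustrated with a small exhaustive scan. This is a sketch under stated assumptions: `best_block_to_remove` is a hypothetical helper, and `toy_val_loss` is a made-up stand-in for evaluating a real model with the listed layers kept.

```python
def best_block_to_remove(num_layers, block_size, val_loss):
    """Return (start, loss) for the consecutive block whose removal gives the lowest loss.

    val_loss is a callable that takes the list of kept layer indices and
    returns a validation loss; here it is supplied by the caller.
    """
    best_start, best_loss = None, float("inf")
    for start in range(num_layers - block_size + 1):
        kept = [i for i in range(num_layers)
                if not (start <= i < start + block_size)]
        loss = val_loss(kept)
        if loss < best_loss:
            best_start, best_loss = start, loss
    return best_start, best_loss


# Hypothetical per-layer importance: the first and last layers matter most.
TOY_IMPORTANCE = [5.0, 1.0, 0.5, 0.4, 0.3, 0.6, 1.5, 4.0]


def toy_val_loss(kept_layers):
    """Made-up loss: a base value plus the importance of every removed layer."""
    removed = set(range(len(TOY_IMPORTANCE))) - set(kept_layers)
    return 2.0 + sum(TOY_IMPORTANCE[i] for i in removed)


# Remove 4 of 8 layers, i.e. a 50% depth-pruning ratio on the toy model.
start, loss = best_block_to_remove(8, 4, toy_val_loss)
```

With the toy importances, the scan selects the block starting at layer 2, which removes the four least important middle layers. On a real model each candidate requires a forward pass over a validation set, so the cost grows with the number of candidate blocks.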
#### Pruning ratio and granularity\.
The pruning ratio is the fraction of parameters removed\. Granularity refers to the smallest unit treated as a single pruning decision, ranging from whole layers \(coarsest\) through channels and attention heads down to individual weights \(finest\)\. Under the same ratio, finer granularity preserves more of the base model’s performance but yields less hardware speedup and smaller memory savings at inference time \(an2023flap;sreenivas2024minitron2\)\.
## 3 Methodology
We study pruning under constrained token budgets through controlled comparisons that isolate \(i\) the value of pruning as an initialization strategy and \(ii\) whether additional training from scratch can close the gap to the pruning pipeline\. Section [3\.1](https://arxiv.org/html/2606.14150#S3.SS1) describes the experimental design, base model, data pipeline, optimizer, and evaluation protocol\. Section [3\.2](https://arxiv.org/html/2606.14150#S3.SS2) formalizes the six pruning methods we evaluate\.
### 3\.1 Training Setup
#### Experimental design\.
Throughout, we use three notations: S*N* denotes training the target architecture from Scratch with random initialization for *N*B tokens; P200\-R*N* denotes the pruning pipeline, which Pretrains a larger model for 200B tokens, prunes it at a target ratio, then Retrains the pruned model for *N*B tokens; and *Meta*\-R*N* denotes the same pipeline but starting from Meta’s released Llama\-3\.1\-8B checkpoint instead of our 200B\-pretrained one, retrained for *N*B tokens \(used as a reference point in Figure [3](https://arxiv.org/html/2606.14150#S4.F3)\)\. Both token\-matched experiments compare the pruning pipeline against training from scratch, differing only in how much data we allow the scratch baseline to see:
- \(1\) Equal training token budget \(S50 vs\. P200\-R50\): the scratch baseline trains on the same 50B\-token data stream used to retrain the pruned model, under the same target architecture and optimizer schedule, isolating the effect of initialization alone\. We further sweep the pruning ratio to track how this initialization advantage evolves as compression becomes more aggressive\.
- \(2\) Equal total token budget \(S250 vs\. P200\-R50\): the scratch baseline trains on all 250B tokens consumed by the pipeline \(pretraining plus retraining\), testing whether additional tokens alone can close the gap\.
#### Model and dataset\.
We adopt Llama\-3\.1\-8B \(grattafiori2024llama3\) as the base model for our experiments, as it is the most common choice in prior LLM pruning work \(frantar2023sparsegpt;sun2024wanda;sreenivas2024minitron2\)\. For training data, we adopt DCLM\-Baseline\-1\.0 \(li2024datacomplm\), a large\-scale curated training corpus comparable to other contemporary web\-scale corpora such as RefinedWeb \(penedo2023refinedweb\), Dolma \(soldaini2024dolma\), and RedPajama \(weber2024redpajama\)\. The dataset is tokenized with the Llama\-3\.1 tokenizer and divided into chunks of 8192 tokens\.
#### Data pipeline\.
The 200B pretraining tokens and the 50B retraining tokens are strictly non\-overlapping subsets of the DCLM corpus\. For the S250 comparison, the 250B tokens are exactly the union of these two disjoint sets, ensuring that S250 and P200\-R50 see exactly the same data in total\. For the S50 comparison, S50 uses the same 50B tokens as the retraining phase of P200\-R50\. This guarantees that all comparisons are token\-fair and that differences in performance reflect initialization strategy, not data composition\.
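The token-fair bookkeeping described above can be made concrete with a scaled-down sketch. Everything here is illustrative: `split_stream`, `build_runs`, and `chunks_for` are hypothetical helpers, each toy chunk stands in for 1B tokens, and "B" is assumed to mean exactly 10^9 tokens.

```python
CHUNK_TOKENS = 8192  # chunk length stated in the text


def chunks_for(num_tokens, chunk_tokens=CHUNK_TOKENS):
    """Number of fixed-length chunks needed to cover num_tokens (rounded up)."""
    return -(-num_tokens // chunk_tokens)


def split_stream(pretrain_chunks, retrain_chunks):
    """Assign disjoint, contiguous chunk-index ranges to the two phases."""
    pretrain = range(0, pretrain_chunks)
    retrain = range(pretrain_chunks, pretrain_chunks + retrain_chunks)
    return pretrain, retrain


def build_runs(pretrain, retrain):
    """Map each run name to the chunk indices it is allowed to see."""
    return {
        "P200-R50": {"pretrain": list(pretrain), "retrain": list(retrain)},
        "S50": {"train": list(retrain)},                    # same 50B as retraining
        "S250": {"train": list(pretrain) + list(retrain)},  # union of both sets
    }


# Scaled-down example: one toy chunk per 1B tokens.
pretrain, retrain = split_stream(200, 50)
runs = build_runs(pretrain, retrain)
pipeline_data = set(runs["P200-R50"]["pretrain"]) | set(runs["P200-R50"]["retrain"])
```

The point of the design is that `S250` and the pipeline cover the identical set of chunks, and `S50` is identical to the retraining set, so any performance gap cannot come from data composition. The order in which a scratch run visits the chunks is not specified in the text and is left as plain concatenation here.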
#### Training recipe\.
We follow the default recipe in Lingua \(meta2024lingua\), using AdamW \(loshchilov2019adamw\) with cosine learning rate decay and a 5% warmup\. For each experiment we sweep the max learning rate over \{1e\-5, 3e\-5, 1e\-4, 3e\-4, 1e\-3\} and select the one yielding the lowest training loss in a short prefix run\. This is important for retraining after pruning, where the learning rate must be chosen carefully to prevent catastrophic forgetting \(parmar2024reusedontretrainrecipe;gupta2023howtorewarmyourmodel\)\. The full set of training hyperparameters is reported in Appendix [D\.1](https://arxiv.org/html/2606.14150#A4.SS1)\.
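A schedule of this shape, together with the learning-rate sweep, can be sketched as follows. This is not the Lingua implementation: the linear warmup shape, the final learning rate `min_lr` (defaulting to 0 here), and the `prefix_loss` callable are assumptions, since the text only states cosine decay, a 5% warmup, and the sweep values.

```python
import math

SWEEP = [1e-5, 3e-5, 1e-4, 3e-4, 1e-3]  # max learning rates listed in the text


def lr_at(step, total_steps, max_lr, warmup_frac=0.05, min_lr=0.0):
    """Linear warmup over the first warmup_frac of steps, then cosine decay to min_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def pick_max_lr(prefix_loss, candidates=SWEEP):
    """Return the candidate whose short prefix run reports the lowest training loss.

    prefix_loss is a caller-supplied callable: max_lr -> loss after the prefix run.
    """
    return min(candidates, key=prefix_loss)


# Toy stand-in for a prefix run: pretend the loss is lowest near 1e-4.
def toy_prefix_loss(max_lr):
    return abs(math.log10(max_lr) + 4.0)


chosen = pick_max_lr(toy_prefix_loss)
schedule = [lr_at(s, 1000, chosen) for s in range(1001)]
```

In practice `prefix_loss` would launch a short training run per candidate; the toy function only exists so that the selection logic can be exercised.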
#### Evaluation\.
We evaluate on two dimensions: \(i\) linguistic perplexity across four general\-domain corpora \(C4, WikiText\-103, WikiText\-2, CNN/DailyMail\) and \(ii\) zero\-shot downstream accuracy across eight benchmarks spanning commonsense QA \(WinoGrande, HellaSwag, PIQA\), scientific QA \(ARC\-Challenge, ARC\-Easy, SciQ, OpenBookQA\), and logical reasoning \(BoolQ\)\. Full evaluation protocols, benchmark citations, and random\-chance baselines are provided in Appendix [E](https://arxiv.org/html/2606.14150#A5)\.
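For reference, the two headline metrics can be computed as below. This is a minimal sketch using the standard definition of perplexity (the exponential of the mean per-token negative log-likelihood, in nats) and an unweighted mean over benchmarks; the exact protocol used in the paper is given in its Appendix E and may differ in details such as tokenization or context windowing. The accuracy values in the example are made up.

```python
import math


def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token), NLLs in nats."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def average_accuracy(per_benchmark):
    """Unweighted mean of per-benchmark accuracies (in percent)."""
    return sum(per_benchmark.values()) / len(per_benchmark)


# Toy inputs: a model that assigns probability 1/4 to every token, and
# made-up accuracies for three of the eight benchmarks.
toy_ppl = perplexity([math.log(4.0)] * 10)
toy_acc = average_accuracy({"PIQA": 75.0, "SciQ": 90.0, "BoolQ": 60.0})
```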
### 3\.2 Pruning Methods
We evaluate six representative LLM pruning methods spanning two families, with their target modules and scoring criteria summarized in Figure [2](https://arxiv.org/html/2606.14150#S1.F2)\(b\) \(and described in Section [2](https://arxiv.org/html/2606.14150#S2)\)\. Structured methods remove entire model components while keeping the weight matrices dense \(Minitron\-depth \(muralidharan2024minitron1;sreenivas2024minitron2\), Minitron\-width \(muralidharan2024minitron1;sreenivas2024minitron2\), FLAP \(an2023flap\), and Sheared LLaMA \(xia2024shearedllama\)\); sparse methods retain the full macro\-architecture but zero\-mask individual weights \(Wanda \(sun2024wanda\) and SparseGPT \(frantar2023sparsegpt\)\)\. For brevity, we refer to Minitron\-depth as Minitron\-D, Minitron\-width as Minitron\-W, and the unstructured variants of Wanda and SparseGPT as Wanda\-U and SparseGPT\-U, respectively\. Formal mathematical descriptions of all six methods are provided in Appendix [C\.2](https://arxiv.org/html/2606.14150#A3.SS2)\.
#### Implementation\.
For Minitron\-depth and Minitron\-width, we developed a simplified re\-implementation and verified its correctness against the original results\. FLAP and Sheared LLaMA were originally designed for models with multi\-head attention \(MHA\); we adapted both methods to support grouped\-query attention \(GQA\) \(ainslie2023gqa\) as used in Llama\-3\.1, and verified that they achieve reasonable pruning performance on this architecture\. For Wanda and SparseGPT, we used the authors’ original implementations without modification\. When a target architecture was not specified in the original paper, we performed a lightweight architecture search over candidate dimension configurations \(see Appendix [D\.2](https://arxiv.org/html/2606.14150#A4.SS2)\)\. Full per\-method calibration sets and hyperparameters are reported in Appendix [D\.2](https://arxiv.org/html/2606.14150#A4.SS2)\.
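| method | ratio | num layers | attention heads | hidden size | FFN size | sparsity | parameters |
| --- | --- | --- | --- | --- | --- | --- | --- |
| base | 0\.0% | 32 | 32 | 4096 | 14336 | 0\.0% | 8\.0B |
| Minitron\-D | 50\.0% | 16 | 32 | 4096 | 14336 | 0\.0% | 4\.5B |
| Minitron\-D | 62\.5% | 8 | 32 | 4096 | 14336 | 0\.0% | 2\.8B |
| Minitron\-D | 75\.0% | 4 | 32 | 4096 | 14336 | 0\.0% | 1\.9B |
| Minitron\-D | 81\.3% | 2 | 32 | 4096 | 14336 | 0\.0% | 1\.5B |
| Minitron\-W | 50\.0% | 32 | 32 | 3072 | 9216 | 0\.0% | 4\.5B |
| Minitron\-W | 62\.5% | 32 | 32 | 2432 | 6144 | 0\.0% | 3\.1B |
| Minitron\-W | 75\.0% | 32 | 32 | 1792 | 6016 | 0\.0% | 2\.0B |
| Minitron\-W | 81\.3% | 32 | 32 | 1536 | 4736 | 0\.0% | 1\.5B |
| FLAP | 50\.0% | 32 | 20 | 4096 | 6656 | 0\.0% | 4\.5B |
| Sheared LLaMA | 50\.0% | 32 | 32 | 3456 | 7552 | 0\.0% | 4\.6B |
| Wanda | 50\.0% | 32 | 32 | 4096 | 14336 | 50\.0% | 4\.5B |
| SparseGPT | 50\.0% | 32 | 32 | 4096 | 14336 | 50\.0% | 4\.5B |

Table 1: Architectures of the pruned models obtained with different pruning methods and pruning ratios\. The base model is Llama\-3\.1\-8B \(8\.0B parameters\)\. For sparse methods the listed parameter count reflects non\-zero weights only\.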
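The parameter counts in the depth-pruning rows of Table 1 can be reproduced with a short calculation. This sketch relies on values that are not stated in this section and are assumed from the public Llama-3.1-8B configuration: a vocabulary of 128,256 tokens, 8 key/value heads with head dimension 128, a gated MLP with three projection matrices, and untied input and output embeddings. The function name `llama_params` is hypothetical.

```python
# Assumed Llama-3.1-8B configuration (not given in the text above).
VOCAB = 128_256
HIDDEN = 4096
FFN = 14_336
HEADS = 32
KV_HEADS = 8
HEAD_DIM = 128


def llama_params(num_layers, hidden=HIDDEN, ffn=FFN, heads=HEADS,
                 kv_heads=KV_HEADS, head_dim=HEAD_DIM, vocab=VOCAB):
    """Approximate parameter count of a Llama-style decoder with GQA."""
    # Query and output projections use all heads; key and value use the KV heads.
    attn = 2 * hidden * heads * head_dim + 2 * hidden * kv_heads * head_dim
    mlp = 3 * hidden * ffn          # gate, up, and down projections
    norms = 2 * hidden              # two RMSNorm weight vectors per layer
    per_layer = attn + mlp + norms
    embeddings = 2 * vocab * hidden  # untied input embedding and output head
    return num_layers * per_layer + embeddings + hidden  # plus the final norm


# Layer counts of the base model and the four Minitron-D rows in Table 1.
depth_rows = {layers: round(llama_params(layers) / 1e9, 1)
              for layers in (32, 16, 8, 4, 2)}
```

Under these assumptions the counts round to the values listed in Table 1 for the base model and every Minitron-D row. One reading of this is that the depth-pruning ratio is applied to the transformer layers while the embeddings are kept, which would explain why a 50% ratio leaves 4.5B rather than 4.0B parameters. The width-pruned rows are not covered here because the head dimension after width pruning is not specified in this section.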
#### Pruning configurations\.
The six methods span five granularities: depth, width, depth\-and\-width, 2:4 sparse, and unstructured sparse\. We center our comparison at a 50% pruning ratio because it is the canonical setting adopted by prior pruning work: Minitron \(sreenivas2024minitron2\), SparseGPT \(frantar2023sparsegpt\), and Wanda \(sun2024wanda\) all report their primary results at 50%\. We therefore prune every method to this same ratio and search for the best configuration within each method’s design space\. Concretely, we evaluate Minitron\-depth at \{50%, 62\.5%, 75%, 81\.3%\} pruning ratios, Minitron\-width at \{50%, 62\.5%, 75%\}, and FLAP and Sheared LLaMA at 50%; SparseGPT and Wanda are applied at 2:4 and 50% unstructured sparsity\. Resulting architectures are listed in Table [1](https://arxiv.org/html/2606.14150#S3.T1)\. Calibration sets and per\-method configurations are in Appendix [D\.2](https://arxiv.org/html/2606.14150#A4.SS2)\.
## 4 Pruning vs\. Training from Scratch
We evaluate pruning under the equal\-training\-token\-budget setting introduced in Section [3](https://arxiv.org/html/2606.14150#S3), comparing P200\-R50 against S50\. We also include S250 to see how training from scratch with the full pipeline token budget compares to pruning\. Beyond the 50% pruning ratio, we additionally vary the pruning ratio for Minitron\-D and Minitron\-W to understand how the advantage of pruned initialization changes with more aggressive compression\. Table [2](https://arxiv.org/html/2606.14150#S4.T2) summarizes the comparison of P200\-R50 against S50 and S250 across all six pruning methods \(with both 2:4 and unstructured\-sparsity variants for Wanda and SparseGPT\) at a 50% pruning ratio, with the full per\-benchmark breakdown deferred to Appendix [H\.4](https://arxiv.org/html/2606.14150#A8.SS4) \(Table [16](https://arxiv.org/html/2606.14150#A8.T16)\)\. Among the six methods, Wanda\-U \(the finest sparse granularity\) yields the most accurate pruned model \(68\.1% average accuracy\) and FLAP the lowest perplexity\. The accuracy ranking broadly tracks pruning granularity, though the most accurate \(sparse\) methods bring little inference speedup on commodity hardware, a trade\-off we examine in Section [4\.3](https://arxiv.org/html/2606.14150#S4.SS3)\. Table [11](https://arxiv.org/html/2606.14150#A8.T11) reports results across pruning ratios for Minitron\-D and Minitron\-W\. We highlight two key observations\.
| method | WikiText\-2 ppl ↓ \(S50\) | WikiText\-2 ppl ↓ \(P200\-R50\) | WikiText\-2 ppl ↓ \(S250\) | Avg accuracy \(%\) ↑ \(S50\) | Avg accuracy \(%\) ↑ \(P200\-R50\) | Avg accuracy \(%\) ↑ \(S250\) |
| --- | --- | --- | --- | --- | --- | --- |
| Minitron\-D | 10\.77 | 9\.41 | 9\.01 | 60\.7 | 64\.4 | 66\.2 |
| Minitron\-W | 11\.17 | 9\.01 | 8\.69 | 59\.9 | 66\.3 | 67\.5 |
| FLAP | 11\.34 | **7\.97** | 8\.82 | 57\.7 | 66\.5 | 65\.1 |
| Sheared LLaMA | 11\.28 | 8\.63 | 8\.85 | 58\.4 | 65\.9 | 65\.3 |
| Wanda\-2:4 | 10\.55 | 9\.15 | 8\.60 | 61\.5 | 66\.7 | 66\.2 |
| Wanda\-U | 10\.66 | 8\.37 | 8\.57 | 61\.6 | **68\.1** | 67\.2 |
| SparseGPT\-2:4 | 10\.54 | 8\.90 | 8\.65 | 62\.1 | 66\.7 | 67\.1 |
| SparseGPT\-U | 10\.47 | 8\.44 | 8\.59 | 61\.9 | 67\.5 | 66\.3 |

Table 2: Pruning–retraining vs\. training from scratch at 50% pruning ratio \(summary\)\. WikiText\-2 perplexity \(↓\) and average downstream accuracy \(↑, over the eight benchmarks\) for P200\-R50 \(pretrain 200B, prune, retrain 50B\), S50 \(scratch for 50B; equal training token budget\), and S250 \(scratch for 250B; equal total token budget\)\. Bold marks the best pruned\-model \(P200\-R50\) result per metric\. Pruning beats equal\-training\-token\-budget scratch \(S50\) for every method; under an equal total token budget \(S250\), only the finer \(sparse\) granularities keep their edge, while coarser structured pruning is matched\. Full per\-benchmark results are in Table [16](https://arxiv.org/html/2606.14150#A8.T16)\.

\(a\) Minitron\-D pruning\.
\(b\) Minitron\-W pruning\.
\(c\) FLAP pruning\.
\(d\) Sheared LLaMA pruning\.
Figure 3: Four structured pruning methods across retraining token budgets \(Llama\-3\.1\-8B → 4B, 50% pruning\)\. Each panel shows DCLM validation loss \(↓\) and average downstream accuracy \(↑; averaged over WinoGrande, ARC\-C, ARC\-E, HellaSwag, PIQA, SciQ, BoolQ, OBQA\) as a function of retraining tokens *N* ∈ \{10, 30, 50, 250, 500\}B under three initialization strategies: S*N* \(train from scratch for *N*B tokens\), P200\-R*N* \(pretrain 200B tokens, prune, retrain for *N*B tokens\), and Meta\-R*N* \(prune from Meta’s released Llama\-3\.1\-8B, retrain for *N*B tokens\)\. Pruned initialization \(P200\-R*N*\) starts well ahead of scratch \(S*N*\) and keeps its lead as retraining grows, though the gap narrows with more tokens\. Full results are in Appendix [H\.3](https://arxiv.org/html/2606.14150#A8.SS3)\.

### 4\.1 Equal Training Token Budget
Finding 1\. Under an equal training token budget, pruning initialization consistently beats random initialization, but the advantage shrinks as the pruning ratio grows and nearly vanishes at an 81\.3% pruning ratio\.
#### Pruning initialization is consistently better than random initialization at a 50% pruning ratio\.
The experimental results at a 50% pruning ratio are summarized in Table [2](https://arxiv.org/html/2606.14150#S4.T2) \(full per\-benchmark breakdown in Table [16](https://arxiv.org/html/2606.14150#A8.T16)\); Figure [3](https://arxiv.org/html/2606.14150#S4.F3) plots DCLM loss and average accuracy across retraining token budgets for all four structured pruning methods \(with full per\-metric results in Tables [12](https://arxiv.org/html/2606.14150#A8.T12), [13](https://arxiv.org/html/2606.14150#A8.T13), [14](https://arxiv.org/html/2606.14150#A8.T14), and [15](https://arxiv.org/html/2606.14150#A8.T15) in Appendix [H\.3](https://arxiv.org/html/2606.14150#A8.SS3)\), and the *N* = 50 slice matches the equal\-training\-token\-budget comparison discussed here\. Among structured pruning methods, Minitron\-D with P200\-R50 outperforms S50 on all 8 benchmarks, with improvements of 4\.5% on ARC\-Challenge and 6\.1% on HellaSwag \(Figure [3\(a\)](https://arxiv.org/html/2606.14150#S4.F3.sf1)\)\. On Minitron\-W, the improvement of P200\-R50 over S50 is even larger, with 8\.2% on ARC\-Challenge and 10\.2% on HellaSwag \(Figure [3\(b\)](https://arxiv.org/html/2606.14150#S4.F3.sf2)\), which is consistent with the observation that Minitron\-W preserves more of the base model’s capabilities than Minitron\-D \(sreenivas2024minitron2\)\. FLAP and Sheared LLaMA exhibit the same qualitative trend across token budgets \(Figures [3\(c\)](https://arxiv.org/html/2606.14150#S4.F3.sf3) and [3\(d\)](https://arxiv.org/html/2606.14150#S4.F3.sf4)\)\.
Sparse pruning follows the same pattern, with the finer unstructured granularity gaining more than 2:4: for Wanda, P200\-R50 improves WikiText\-2 perplexity over S50 by 1\.40 \(2:4\) versus 2\.30 \(unstructured\), and for SparseGPT the ARC\-Challenge gain grows from 2\.0% \(2:4\) to 8\.1% \(unstructured\)\. This unstructured\-over\-2:4 ordering holds across nearly all benchmarks \(Table [2](https://arxiv.org/html/2606.14150#S4.T2)\)\.
#### The advantage of pruning initialization diminishes as pruning ratio increases\.
Table [11](https://arxiv.org/html/2606.14150#A8.T11) \(Appendix [H\.2](https://arxiv.org/html/2606.14150#A8.SS2)\) presents results across different pruning ratios for Minitron\-D and Minitron\-W\. At a 50% pruning ratio, P200\-R50 shows a clear advantage over S50, with 1\.31 lower perplexity on WikiText\-103\. However, this advantage shrinks as pruning becomes more aggressive: at 62\.5% and 75% pruning ratios, the gap narrows substantially\. At an 81\.3% pruning ratio, P200\-R50 performs on par with or even slightly worse than S50 on some benchmarks, suggesting that pruning initialization loses its benefit under extreme compression\.
On perplexity benchmarks, the gap between P200\-R50 and S50 closes as the pruning ratio increases from 50% to 81\.3%, with the curves eventually intersecting\. Similar patterns are observed on accuracy benchmarks, though with some exceptions such as WinoGrande\. Figure [4](https://arxiv.org/html/2606.14150#S4.F4) visualizes the same trend across pruned model sizes for both depth pruning \(Figure [4\(a\)](https://arxiv.org/html/2606.14150#S4.F4.sf1)\) and width pruning \(Figure [4\(b\)](https://arxiv.org/html/2606.14150#S4.F4.sf2)\): the advantage of P200\-R50 over S50 shrinks as the target model becomes smaller \(i\.e\., as the pruning ratio grows\)\.
\(a\) Depth pruning: S50 vs\. P200\-R50 across model sizes\.
\(b\) Width pruning: S50 vs\. P200\-R50 across model sizes\.
Figure 4: S50 vs\. P200\-R50 across model sizes for depth and width pruning\. Top panels show average accuracy \(WinoGrande, ARC\-C, ARC\-E, HellaSwag, PIQA, SciQ, BoolQ, OBQA\); bottom panels show the gain of P200\-R50 over S50 per model size\. Dark blue bars denote scratch\-trained models \(S50\); purple bars denote pruned\-then\-retrained models \(P200\-R50\) at the same target parameter count\. The pruning advantage shrinks as the target model gets smaller \(i\.e\., as the pruning ratio grows\), nearly closing at the smallest sizes\.
### 4\.2 Can Extended Training Catch Up?
Finding 2\. Given the pipeline’s full token budget, training from scratch catches up to pruning at coarser granularities, but pruning at finer granularities retains its advantage\.
#### Longer scratch training can match pruning at coarser granularities but not at finer ones\.
Figure [5](https://arxiv.org/html/2606.14150#S4.F5) compares P200\-R50 against S250 \(equal total token budget\)\. On Minitron\-D, S250 surpasses P200\-R50 on every benchmark \(0\.40 lower WikiText\-2 perplexity; 0\.6–4\.0% higher accuracy\)\. The margin is smaller for the finer Minitron\-W \(e\.g\., the OBQA gain shrinks from 4\.0% to 2\.2%\), showing that the advantage of coarse\-granularity pruning can be recovered simply by training longer from scratch\.
On sparse pruning methods, however, prolonged training from scratch performs only comparably with or worse than the pruning pipeline\. For Wanda\-2:4, S250 has lower perplexity than P200\-R50, yet underperforms P200\-R50 on 6 of 8 accuracy benchmarks, including degradations of 1\.8% on WinoGrande and 1\.2% on ARC\-Challenge\. On Wanda\-U, S250 underperforms P200\-R50 on 6 of 8 benchmarks, the exceptions being BoolQ and OpenBookQA \(improvements of 0\.4% and 0\.8%\)\. This suggests that at 2:4 semi\-structured sparsity training from scratch reaches parity with pruning, while at unstructured \(the finest\) sparsity pruning initialization remains beneficial even with matched total data\. Per\-token\-budget numbers are tabulated in Tables [12](https://arxiv.org/html/2606.14150#A8.T12)–[15](https://arxiv.org/html/2606.14150#A8.T15) of Appendix [H\.3](https://arxiv.org/html/2606.14150#A8.SS3)\.
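The two comparisons can be read off the Table 2 summary with a small script. The accuracy values below are copied from Table 2; the helper name `accuracy_gains` and the dictionary layout are choices made for this sketch, and the per-benchmark picture discussed in the text (Table 16) is not captured by these averages.

```python
# Average downstream accuracy (%) from Table 2: (S50, P200-R50, S250).
TABLE2_ACC = {
    "Minitron-D": (60.7, 64.4, 66.2),
    "Minitron-W": (59.9, 66.3, 67.5),
    "FLAP": (57.7, 66.5, 65.1),
    "Sheared LLaMA": (58.4, 65.9, 65.3),
    "Wanda-2:4": (61.5, 66.7, 66.2),
    "Wanda-U": (61.6, 68.1, 67.2),
    "SparseGPT-2:4": (62.1, 66.7, 67.1),
    "SparseGPT-U": (61.9, 67.5, 66.3),
}


def accuracy_gains(table):
    """For each method, return (P200-R50 minus S50, P200-R50 minus S250)."""
    return {method: (round(pruned - s50, 1), round(pruned - s250, 1))
            for method, (s50, pruned, s250) in table.items()}


gains = accuracy_gains(TABLE2_ACC)
# Methods whose pruned model still leads after the total token budget is matched.
still_ahead = [m for m, (_, vs_s250) in gains.items() if vs_s250 > 0]
```

The first number in each pair corresponds to the equal-training-token-budget comparison of Section 4.1 and is positive for every method; the sign of the second number is what the equal-total-token-budget comparison turns on.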
### 4\.3 Finer Granularity Unlocks Greater Benefits
Finding 3\. At the same pruning ratio, finer pruning granularity yields a stronger initialization and a larger advantage over training from scratch, at the cost of hardware efficiency\.
#### At a 50% pruning ratio, finer pruning granularity gives the pruning pipeline more advantage\.
Table [16](https://arxiv.org/html/2606.14150#A8.T16) shows that, for structured pruning methods, Minitron\-W generally gives P200\-R50 a larger advantage over S50 than Minitron\-D does; for sparse pruning methods, unstructured sparsity provides a stronger pruned model than 2:4 sparsity\. Comparing P200\-R50 with S250, longer training from scratch surpasses pruning\-retraining on structured methods, is roughly on par at 2:4 sparsity, and remains behind unstructured sparse pruning\. Taken together, these findings suggest that under the same pruning ratio, finer granularity yields a stronger pruned model for retraining, one that outperforms the same architecture trained from scratch\. These results agree with previous observations that finer\-grained pruning methods are more capable of preserving the base model’s performance, whether or not retraining is performed \(an2023flap;sun2024wanda;sreenivas2024minitron2\)\. We provide a mechanistic discussion of why sparse pruning remains uniquely hard to match by training from scratch in Appendix [G](https://arxiv.org/html/2606.14150#A7)\.
#### The superior performance of sparse pruning is at the price of efficiency\.
Table [9](https://arxiv.org/html/2606.14150#A6.T9) \(Appendix [F](https://arxiv.org/html/2606.14150#A6)\) compares the training efficiency and WikiText\-103 performance of the methods we explored\. All step times are profiled on Google TPU v4\-256, which lacks sparse tensor cores\. On this hardware, both unstructured 50% sparse and 2:4 sparse models run at virtually the same speed as the dense model \(1\.0×\)\. \(Footnote 1: On NVIDIA GPUs with sparse tensor cores \(e\.g\., A100, H100\), 2:4 sparsity yields meaningful inference speedups via cuSPARSELt acceleration \(mishra2021acceleratingsparse\), whereas unstructured sparsity does not\. This hardware dependence does not affect the training\-time comparison reported here, but practitioners targeting GPU inference should account for it\.\) Minitron\-D achieves the highest training speedup under the same pruning ratio but at the cost of higher perplexity, while Minitron\-W offers a middle ground\. These observations imply that for practitioners who want compact, general\-purpose models that are efficient on commodity hardware, structured pruning is the practical choice, yet it is precisely this setting where the pruning pipeline fails to outperform training from scratch under the full token budget, motivating the need for more token\-efficient training paradigms\.
Figure 5: P200\-R50 versus S250 across pruning methods\. We plot the average accuracy \(WinoGrande, ARC\-C, ARC\-E, HellaSwag, PIQA, SciQ, BoolQ, OBQA\) from Table [16](https://arxiv.org/html/2606.14150#A8.T16)\. Blue bars denote S250 scratch training, and purple bars denote P200\-R50 pruning followed by retraining, matching the color convention in Figure [1](https://arxiv.org/html/2606.14150#S0.F1)\. The bottom panel shows P200\-R50 minus S250; positive values indicate that pruning remains better after matching the total token budget\.
## 5 Related Work
Our work draws on three lines of research; we summarize them here and extend the discussion in Appendix [B](https://arxiv.org/html/2606.14150#A2)\.
#### Scaling under data constraints\.
As models and datasets grow to hundreds of billions of parameters and trillions of tokens \(deepseekv3;qwen3;penedo2024fineweb;li2024datacomplm\), data availability rather than compute increasingly bottlenecks pretraining, motivating more token\-efficient recipes \(kaplan2020scalinglaw;kim2025pretraininginfinitecompute;muennighoff2025scalingdataconstrainedlm\)\. This is exactly the regime our study targets\.
#### Continual pretraining\.
Our retraining stage is effectively a form of continual pretraining \(CPT\) \(sun2020ernie2\), where catastrophic forgetting \(mccloskey1989catastrophicinterference\) and careful learning\-rate re\-warming are central concerns \(gupta2023howtorewarmyourmodel;grattafiori2024llama3;ibrahim2024simple\)\. These works inform the training recipe used in our experiments\.
#### Understanding network pruning\.
A line of work studies whether pruned subnetworks can be matched by training from scratch, from the Lottery Ticket Hypothesis \(frankle2019LTH;evci2020riggingthelottery\) to NAS\-style interpretations of structured pruning \(liu2019rethinkingvaluenetworkpruning\) and pruning\-scaling laws \(xu2024initializingmodelswithlargerones;chen2025p2scalinglaw\)\. These prior works directly inspired our experimental design\. Extended discussion of these findings and the limitations of our study are deferred to Appendix [A](https://arxiv.org/html/2606.14150#A1)\.
## 6Conclusion
Treating pruning as an initialization strategy rather than a compression tool, we ran a token\-matched comparison against training from scratch\. Under an equal training token budget, pruning initialization wins, with the margin shrinking as the pruning ratio grows\. Under an equal total token budget, the outcome splits along granularity: a longer scratch run overtakes structured pruning, but sparse pruning keeps its lead\.
## Acknowledgments
This work was primarily supported by the computational resources generously provided by Google’s TPU Research Cloud program\. We gratefully acknowledge the use of the Neuronic GPU computing cluster maintained by the Department of Computer Science at Princeton University\. This work was substantially performed using Princeton Research Computing resources, a consortium led by the Princeton Institute for Computational Science and Engineering \(PICSciE\) and Research Computing at Princeton University\.
## References
Appendix
This appendix provides additional methodology details, configurations, and per\-benchmark results supporting the main paper:
- §[A](https://arxiv.org/html/2606.14150#A1) extends the conclusion with a discussion of what transfers from the larger model and enumerates the axes of variation our study leaves unexplored\.
- §[B](https://arxiv.org/html/2606.14150#A2) gives the full related\-work discussion condensed in the main paper\.
- §[C](https://arxiv.org/html/2606.14150#A3) catalogues the pruning methods we evaluate, with formal scoring and reconstruction objectives for each\.
- §[D](https://arxiv.org/html/2606.14150#A4) documents the training recipe, calibration sets, pruning ratios, and the architecture\-search procedure used to pick a single configuration per method\.
- §[E](https://arxiv.org/html/2606.14150#A5) specifies the perplexity corpora, downstream benchmarks, and the standard\-deviation estimator we use to mark within\-noise differences\.
- §[F](https://arxiv.org/html/2606.14150#A6) reports training\-speed comparisons of the pruning methods we evaluate\.
- §[G](https://arxiv.org/html/2606.14150#A7) discusses why sparse pruning remains hard to match by training from scratch, drawing on prior work on lottery tickets and weight importance\.
- §[H](https://arxiv.org/html/2606.14150#A8) gives the full per\-benchmark breakdown, including pruning\-only baselines, pruning ratio ablations, retraining\-token budgets, and the equal\-total\-token\-budget setting\.
## Appendix AExtended Discussion and Limitations
#### Discussion\.
The split between sparse and structured pruning suggests that what transfers from the larger model is*weight\-level*information rather than architectural shape: when sparse pruning preserves the original parameter values, the surviving weights carry knowledge that extra training tokens alone do not reproduce, while a smaller dense architecture trained for longer can match a structurally pruned counterpart\. This implies an asymmetric recipe: structured pruning is justified only when the training token budget is the binding constraint; otherwise a longer scratch run on a smaller dense model is competitive\. Sparse pruning is the regime where the detour through a larger model genuinely pays off, but it is also the regime with the weakest hardware support—unstructured sparsity sees no speedup on the TPUs we used and only partial speedup on GPUs with sparse tensor cores\. Closing this gap will likely require either better token\-efficient recipes for structured small models, or broader hardware/kernel support for unstructured sparsity\. A practical corollary is that the right pipeline depends on what is scarce: when target\-model training data is the bottleneck, pruning beats a scratch run of equal length; when the full pipeline token budget is accounted for, only sparse pruning continues to justify the detour through the larger parent\. This is a useful decision rule for practitioners, and a reminder that headline pruning numbers reported under fixed retraining budgets may overstate the practical benefit once the cost of the parent is amortized in\.
#### Limitations\.
Our study leaves five axes of variation unexplored\. \(1\) Model family\. All experiments use Llama\-3\.1\-8B as the base; we do not test whether the same pruning\-vs\-scratch ordering holds for other architectures \(e\.g\. Qwen \(qwen3\), Gemma \(gemma3\), OLMo \(olmo3\)\) or for substantially smaller and larger base models\. \(2\) Pretraining corpus\. Both the P200 checkpoint and all retraining runs use DCLM\. We do not vary the corpus, so the gap between sparse and structured pruning may shift under different data distributions such as FineWeb\-Edu \(penedo2024fineweb\) or Dolma \(soldaini2024dolma\), and the Meta\-R$N$ comparisons control for architecture but not for the original Meta pretraining mixture\. \(3\) Knowledge\-distillation baselines\. We compare pruning to plain language\-modeling retraining; we do not include the post\-pruning knowledge\-distillation pipelines used by recent structured methods \(e\.g\. Minitron, ShearedLLaMA\), which could narrow or close the gap for structured pruning under matched total data\. \(4\) Data order\. Each setting is trained with a single shuffle of the retraining corpus; we do not measure run\-to\-run variance or the sensitivity of the pruning advantage to data ordering, which can be non\-trivial in low\-token regimes\. \(5\) Random\-initialization strategies\. Our scratch baselines use a single standard initialization; alternative schemes such as $\mu$P, depth\-aware variance scaling, or warm\-starting from a smaller pretrained checkpoint may shift the scratch curve and could change which pruning methods retain an advantage over training from scratch\.
## Appendix BExtended Related Work
This section gives the full treatment of the three lines of research summarized in Section[5](https://arxiv.org/html/2606.14150#S5)\.
#### Model scaling and constraints\.
The traditional language model scaling paradigm assumes that simply scaling model size and training data yields stronger performance \(kaplan2020scalinglaw\)\. However, as recent language models scale to hundreds of billions of parameters \(deepseekv3;qwen3;gemini3\) and training data scales to trillions of tokens \(penedo2024fineweb;li2024datacomplm\), data availability has become a practical bottleneck for language model pretraining\. Recent work on pretraining under limited data and unlimited compute examines how traditional scaling methods behave in this regime\. kim2025pretraininginfinitecompute show that existing approaches such as increasing the number of epochs and the parameter count eventually overfit, and that improved training recipes are needed for better scaling\. muennighoff2025scalingdataconstrainedlm observe a similar plateau when scaling training epochs and propose a compute\-optimality scaling law\. Overall, these results support the practicality of our scenario, in which data availability is the main bottleneck in language model pretraining and more token\-efficient training paradigms must be explored\.
#### Continual pretraining\.
As the domain\-specific capabilities of LLMs become increasingly strong, continual pretraining \(CPT\) \(sun2020ernie2\) has emerged as an approach to adapt general\-purpose language models to domain\-specific tasks\. The primary challenge in CPT is catastrophic forgetting \(mccloskey1989catastrophicinterference;luo2025catastrophicforgetting\), where the model loses its prior knowledge and capabilities during the continual learning stage, so the learning rate schedule must be carefully designed to mitigate this phenomenon\. gupta2023howtorewarmyourmodel observe that starting CPT with a high learning rate causes a sharp drop in the model’s performance, and that re\-warming the learning rate to 10–20% of the maximum learning rate significantly mitigates this problem\. grattafiori2024llama3 also note that a learning rate schedule with re\-warmup has been adopted to improve stability when continuing pretraining on a new domain\. ibrahim2024simple provide a systematic study of data replay and learning rate re\-warming strategies in CPT, showing that mixing in a small fraction of the original pretraining data prevents catastrophic forgetting\. These works provide a valuable reference for the design of training recipes in our experiments, since network pruning damages the internal structure of the base model and the retraining stage is effectively a form of CPT\.
#### Understanding network pruning\.
Since network pruning enables efficient acquisition of compact and capable models, its properties and its impact on the behavior of pruned models have been a topic of active interest in the network pruning community\. The Lottery Ticket Hypothesis \(frankle2019LTH\) conjectures that every randomly initialized network contains a sparse subnetwork that can be trained to match the full network’s performance; crucially, this subnetwork must be trained from its *original* initialization, since random reinitialization substantially degrades performance\. evci2020riggingthelottery further show that static sparse training from scratch can get stuck in poor local minima, whereas dynamic sparse topologies help escape them\. Together, these results suggest that a sparse mask derived from a pretrained model is difficult to match via random reinitialization\. xu2024initializingmodelswithlargerones observe that initializing small models from large pretrained models with rule\-based weight selection yields better training performance\. liu2019rethinkingvaluenetworkpruning show that, for structured methods, small models with the same architecture trained from scratch perform on par with or better than their pruned\-and\-retrained counterparts, and interpret network pruning as a form of Neural Architecture Search \(NAS\)\. chen2025p2scalinglaw propose a pruning\-scaling law connecting training data quality and post\-pruning budget to final performance\. These prior works have greatly inspired the experimental design of this work\.
## Appendix CPruning Methods
### C\.1Overview
Table [3](https://arxiv.org/html/2606.14150#A3.T3) summarizes representative LLM pruning methods organized by pruning granularity\. We group methods into four categories: depth pruning removes entire transformer layers; width pruning reduces the hidden dimension, FFN intermediate size, or number of attention heads; depth\+width pruning jointly compresses along both axes; and sparse pruning zero\-masks individual weights within linear modules while keeping the macro\-architecture intact\. For each method, we list the modules it targets and the scoring metric it uses to decide which units to prune\. This taxonomy motivates the six representative methods evaluated in our experiments \(Minitron\-depth, Minitron\-width, FLAP, Sheared LLaMA, Wanda, and SparseGPT\), which together span all four granularity regimes\.

| Granularity | Pruning Method | Target Modules | Pruning Metric |
|---|---|---|---|
| Depth | ShortGPT \(men2024shortgpt\) | layers | $1-\mathbb{E}_{h,t}\dfrac{h_{i,t}^{T}h_{i+1,t}}{\lVert h_{i,t}\rVert_{2}\,\lVert h_{i+1,t}\rVert_{2}}$ |
| Depth | Minitron\-depth \(sreenivas2024minitron2\) | layers | $\mathcal{L}(\theta,z)-\mathcal{L}(\theta)$ |
| Depth | Shortened Llama \(kim2024shortenedllama\) | layers | $\left\lvert\dfrac{\partial\mathcal{L}(D)}{\partial W_{i,j}^{k,n}}W_{i,j}^{k,n}\right\rvert$ |
| Width | LLM\-Pruner \(ma2023llmpruner\) | hidden size, FFN size, attention heads | $\left\lvert\dfrac{\partial\mathcal{L}(\mathcal{D})}{\partial W_{i}^{k}}W_{i}^{k}-\dfrac{1}{2}\sum_{i=1}^{N}\left(\dfrac{\partial\mathcal{L}(\mathcal{D})}{\partial W_{i}^{k}}W_{i}^{k}\right)^{2}+\mathcal{O}\left(\lVert W_{i}^{k}\rVert^{3}\right)\right\rvert$ |
| Width | Minitron\-width \(sreenivas2024minitron2\) | hidden size, FFN size | $\lVert X_{:,j}\rVert_{2},\quad 1\leq j\leq d$ |
| Width | Wanda\-sp \(an2023flap\) | hidden size, FFN size | $\sum_{i}\lvert W_{ij}\rvert\cdot\lVert X_{:,j}\rVert_{2}$ |
| Width | FLAP \(an2023flap\) | hidden size, FFN size | $\dfrac{1}{N-1}\sum_{n=1}^{N}\bigl(X_{n,j}^{l}-\bar{X}_{:,j}^{l}\bigr)^{2}\cdot\lVert W_{:,j}^{l}\rVert_{2}^{2}$ |
| Width | SliceGPT \(ashkboos2024slicegpt\) | hidden size | $\lambda_{j}\bigl(\mathbf{X}^{l\top}\mathbf{X}^{l}\bigr),\quad 1\leq j\leq d$ |
| Depth \+ Width | Sheared LLaMA \(xia2024shearedllama\) | layers, hidden size, FFN size, attention heads | $\mathcal{L}(\theta,z)+\sum_{j=1}^{L_{S}}\tilde{\mathcal{L}}_{j}^{\text{head}}+\sum_{j=1}^{L_{S}}\tilde{\mathcal{L}}_{j}^{\text{int}}+\tilde{\mathcal{L}}^{\text{layer}}+\tilde{\mathcal{L}}^{\text{hidden}}$ |
| Sparse | Magnitude \(han2015magnitudepruning\) | individual weights | $\lvert W_{ij}\rvert$ |
| Sparse | SparseGPT \(frantar2023sparsegpt\) | individual weights | $\dfrac{\lvert W\rvert^{2}}{\mathrm{diag}\bigl[(X^{T}X+\lambda I)^{-1}\bigr]}$ |
| Sparse | Wanda \(sun2024wanda\) | individual weights | $\lvert W_{ij}\rvert\cdot\lVert X_{j}\rVert_{2}$ |

Table 3:A taxonomy of different LLM pruning approaches categorized by pruning granularity\. Depth pruning removes layers, width pruning prunes attention heads / hidden / FFN dimensions, and sparse pruning zero\-masks individual weights without changing the model shape\.
### C\.2Method details
#### Preliminaries\.
Let $\theta\in\mathbb{R}^{d}$ denote the parameters of a pretrained LLM and $\mathcal{D}_{\mathrm{cal}}$ a small calibration corpus\. Pruning at ratio $r$ produces a smaller model by selecting a binary mask $m\in\{0,1\}^{d}$ that retains a fraction $1-r$ of the prunable components \(whole layers, channels, or individual weights\)\. Existing LLM pruning methods fall into two families based on how $m$ is chosen\.
*\(i\) Loss / reconstruction minimization\.* These methods select $m$ to minimize an objective on $\mathcal{D}_{\mathrm{cal}}$,
$$m^{\star}\;=\;\arg\min_{m\in\mathcal{M}_{r}}\;\mathcal{L}\bigl(\theta\odot m;\,\mathcal{D}_{\mathrm{cal}}\bigr), \tag{1}$$
where $\mathcal{M}_{r}$ is the feasible mask set at ratio $r$ and $\mathcal{L}$ is either the next\-token validation loss \(Minitron\-depth, Sheared LLaMA\) or a layerwise output\-reconstruction error \(SparseGPT\)\.
*\(ii\) Activation\-based importance scoring\.* These methods assign each prunable component $i$ a scalar score $s_{i}$ summarizing its contribution to the output activations on $\mathcal{D}_{\mathrm{cal}}$, and retain the top\-$k$ highest\-scoring components with $k=\lceil(1-r)\,d\rceil$:
$$m_{i}^{\star}\;=\;\mathbb{1}\!\left[\,i\in\mathrm{TopK}\bigl(\{s_{j}\}_{j=1}^{d},\,k\bigr)\,\right]. \tag{2}$$
The unit indexed by $i$ may be a channel \(Minitron\-width, FLAP\) or an individual weight \(Wanda\)\.
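As a hedged illustration of Eq\. \(2\), the sketch below builds the binary mask from a list of importance scores\. The function name `topk_mask` and the tie\-breaking rule \(lower index wins\) are assumptions made for this example, not details taken from the paper\.

```python
import math

def topk_mask(scores, ratio):
    """Return a 0/1 mask that keeps the ceil((1 - ratio) * d) highest scores."""
    d = len(scores)
    k = math.ceil((1.0 - ratio) * d)
    # Sort indices by descending score; ties go to the lower index (assumption).
    order = sorted(range(d), key=lambda i: (-scores[i], i))
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(d)]

# Four prunable units, pruning ratio r = 0.5, so k = 2 units survive.
mask = topk_mask([0.2, 1.5, 0.7, 0.1], ratio=0.5)
```

The same routine applies whether a unit is a channel or a single weight; only the definition of the score changes between methods\.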
#### Minitron\-depth
\(sreenivas2024minitron2\) prunes a contiguous block of $k$ transformer layers, with $k$ determined by the pruning ratio\. For each candidate starting index $i$, the validation loss is recomputed with layers $[i,\,i+k)$ replaced by identity mappings, and the block causing the smallest loss increase is removed:
$$i^{\star}\;=\;\arg\min_{i}\;\;\mathcal{L}\bigl(\theta_{[i:i+k]\to\mathrm{id}};\,\mathcal{D}_{\mathrm{cal}}\bigr)\;-\;\mathcal{L}(\theta;\,\mathcal{D}_{\mathrm{cal}}). \tag{3}$$
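The search in Eq\. \(3\) can be sketched as a loop over starting indices\. This is a minimal sketch: `loss_without_block` is a hypothetical callable standing in for a real calibration\-loss evaluation, and the toy loss used below is invented purely to make the example runnable\.

```python
def select_block_to_drop(num_layers, k, loss_without_block, base_loss):
    """Return (i*, loss increase) for the k-layer block whose removal hurts least.

    loss_without_block(i) is a hypothetical callable returning the calibration
    loss with layers [i, i + k) replaced by identity mappings.
    """
    best_start, best_increase = None, float("inf")
    for i in range(num_layers - k + 1):
        increase = loss_without_block(i) - base_loss
        if increase < best_increase:
            best_start, best_increase = i, increase
    return best_start, best_increase

# Toy stand-in for a real evaluation (not data from the paper).
toy_loss = lambda i: 3.0 + 1.0 / (1 + i)
start, increase = select_block_to_drop(
    num_layers=8, k=4, loss_without_block=toy_loss, base_loss=2.5
)
```

Because the block is contiguous, only `num_layers - k + 1` forward evaluations are needed, which keeps the search cheap compared with scoring arbitrary layer subsets\.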
#### Minitron\-width
\(sreenivas2024minitron2\) scores each hidden or FFN channel $j$ by the L2 norm of its block\-output activations on $\mathcal{D}_{\mathrm{cal}}$, and keeps the top\-scoring channels up to the target dimension:
$$s_{j}\;=\;\bigl\|X_{:,j}\bigr\|_{2},\qquad 1\leq j\leq d. \tag{4}$$
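A small NumPy sketch of Eq\. \(4\) follows\. The helper names and the choice to return kept indices in ascending order are assumptions for illustration; the activation matrix is a toy input rather than real calibration data\.

```python
import numpy as np

def activation_norm_scores(X):
    """Eq. (4): L2 norm of each channel's activations; X has shape (tokens, channels)."""
    return np.linalg.norm(X, axis=0)

def keep_top_channels(scores, target_dim):
    """Indices of the target_dim highest-scoring channels, in ascending index order."""
    top = np.argsort(-scores, kind="stable")[:target_dim]
    return np.sort(top)

# Toy activations: channel 1 is always zero, channel 2 carries the most energy.
X = np.array([[1.0, 0.0, 3.0],
              [1.0, 0.0, 4.0]])
scores = activation_norm_scores(X)
kept = keep_top_channels(scores, target_dim=2)
```

Slicing the relevant weight matrices with `kept` along the channel axis would then yield the narrower model\.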
#### FLAP
\(an2023flap\) scores each hidden / FFN channel by the variance of its input feature, weighted by the squared norm of the corresponding output\-weight column:
$$s_{j}\;=\;\frac{1}{N-1}\sum_{n=1}^{N}\bigl(X_{n,j}-\bar{X}_{:,j}\bigr)^{2}\cdot\bigl\|W_{:,j}\bigr\|_{2}^{2}. \tag{5}$$
Channels with the largest $s_{j}$ are retained\.
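To make Eq\. \(5\) concrete, the sketch below computes the score with NumPy\. The shapes \(`X` as samples by channels, `W` as outputs by channels\) are assumptions chosen to match the column indexing in the formula\.

```python
import numpy as np

def flap_scores(X, W):
    """Eq. (5): sample variance of each input channel times the squared column norm of W.

    X: (N, d) calibration inputs; W: (out_features, d) weight matrix.
    """
    variance = X.var(axis=0, ddof=1)       # 1/(N-1) * sum_n (X_nj - mean_j)^2
    col_norm_sq = np.sum(W ** 2, axis=0)   # ||W_{:,j}||_2^2
    return variance * col_norm_sq

# Channel 0 fluctuates; channel 1 is constant across samples.
X = np.array([[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]])
W = np.array([[1.0, 2.0], [1.0, 2.0]])
s = flap_scores(X, W)
```

In this toy input the constant channel scores zero despite its larger weights, which shows how the metric favors channels whose inputs actually vary\.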
#### Sheared LLaMA
\(xia2024shearedllama\) introduces continuous mask variables $z$ over layers, attention heads, hidden channels, and FFN channels, and learns them jointly with the model parameters via a constrained optimization\. The target\-architecture constraints are relaxed into an augmented Lagrangian penalty:
$$\min_{\theta,\,z}\;\;\mathcal{L}\bigl(\theta\odot z;\,\mathcal{D}\bigr)\;+\;\sum_{c\in\mathcal{C}}\lambda_{c}\bigl(\hat{s}_{c}(z)-s_{c}^{\star}\bigr)\;+\;\phi_{c}\bigl(\hat{s}_{c}(z)-s_{c}^{\star}\bigr)^{2}, \tag{6}$$
where $c$ ranges over architectural axes, $\hat{s}_{c}(z)$ is the current size along axis $c$, $s_{c}^{\star}$ is the target, and $(\lambda_{c},\phi_{c})$ are dual variables updated jointly with $z$\.
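The penalty term of Eq\. \(6\) can be evaluated as below\. This sketch assumes the sum over axes covers both the linear and the quadratic term, and it uses made\-up axis names and multiplier values; it does not implement the mask learning itself\.

```python
def lagrangian_penalty(current, target, lam, phi):
    """Sum over axes c of lam_c * gap_c + phi_c * gap_c**2, with gap_c = current_c - target_c."""
    total = 0.0
    for axis in target:
        gap = current[axis] - target[axis]
        total += lam[axis] * gap + phi[axis] * gap ** 2
    return total

# Hypothetical sizes and multipliers, for illustration only.
current = {"layers": 20, "heads": 32}
target = {"layers": 16, "heads": 32}
lam = {"layers": 0.5, "heads": 0.1}
phi = {"layers": 0.25, "heads": 0.05}
penalty = lagrangian_penalty(current, target, lam, phi)
```

The penalty vanishes only when every axis hits its target size, so minimizing it alongside the language\-modeling loss pushes the masked model toward the requested architecture\.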
#### Wanda
\(sun2024wanda\) scores each individual weight $W_{ij}$ by its magnitude times the L2 norm of the corresponding input feature on $\mathcal{D}_{\mathrm{cal}}$, and keeps the top\-scoring weights within each output row:
$$s_{ij}\;=\;|W_{ij}|\cdot\bigl\|X_{:,j}\bigr\|_{2}. \tag{7}$$
Restricting the comparison to within each output row allows Wanda to support both unstructured and 2:4 sparsity patterns without any additional weight updates after pruning\.
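The per\-row selection in Eq\. \(7\) can be sketched with NumPy as follows\. The function name `wanda_mask` and the toy matrices are assumptions for illustration, and the sketch covers only the unstructured case, not the 2:4 pattern\.

```python
import numpy as np

def wanda_mask(W, X, sparsity):
    """Boolean keep-mask that drops the lowest-scoring weights within each output row.

    W: (out_features, in_features); X: (tokens, in_features) calibration inputs.
    Score s_ij = |W_ij| * ||X_{:,j}||_2, as in Eq. (7).
    """
    scores = np.abs(W) * np.linalg.norm(X, axis=0)   # norms broadcast across rows
    n_prune = int(W.shape[1] * sparsity)             # weights removed per row
    # Indices of the n_prune lowest scores in each row.
    prune_idx = np.argsort(scores, axis=1, kind="stable")[:, :n_prune]
    mask = np.ones_like(W, dtype=bool)
    np.put_along_axis(mask, prune_idx, False, axis=1)
    return mask

W = np.array([[1.0, -0.5, 0.2, 3.0],
              [0.1,  2.0, 0.3, 0.4]])
X = np.array([[1.0, 1.0, 10.0, 0.1],
              [1.0, 1.0, 10.0, 0.1]])
mask = wanda_mask(W, X, sparsity=0.5)
pruned_W = W * mask
```

In the first row the largest\-magnitude weight \(3\.0\) is removed because its input feature is nearly silent, while the small weight 0\.2 survives thanks to a large input norm; pure magnitude pruning would have made the opposite choice\.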
#### SparseGPT
\(frantar2023sparsegpt\) formulates pruning as a layerwise reconstruction problem\. For each linear layer with weight $W$ and calibration input $X$, it jointly searches for a binary mask $m$ and updated weights $\hat{W}$ that minimize the output mismatch:
$$\min_{m,\,\hat{W}}\;\;\bigl\|\,XW^{\top}-X(\hat{W}\odot m)^{\top}\,\bigr\|_{F}^{2}\quad\text{s.t.}\quad\|m\|_{0}\leq(1-r)\,|W|. \tag{8}$$
The combinatorial mask is approximated by an OBS\-style score $|W_{ij}|^{2}/[(X^{\top}\!X+\lambda I)^{-1}]_{jj}$, and surviving weights are updated through the inverse Hessian to compensate for those removed\.
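The OBS\-style score above can be computed directly for a small layer\. The sketch below covers only the scoring step, not the inverse\-Hessian weight update, and it uses a dense matrix inverse that would not scale to real layer sizes; the name `obs_saliency` and the `damp` parameter \(playing the role of $\lambda$\) are assumptions\.

```python
import numpy as np

def obs_saliency(W, X, damp=0.01):
    """Score |W_ij|^2 / [(X^T X + damp * I)^{-1}]_jj for every weight.

    W: (out_features, in_features); X: (tokens, in_features) calibration inputs.
    """
    d = X.shape[1]
    H_inv = np.linalg.inv(X.T @ X + damp * np.eye(d))
    return W ** 2 / np.diag(H_inv)   # diagonal broadcasts over rows

# Feature 0 has four times the input energy of feature 1.
X = np.array([[2.0, 0.0],
              [0.0, 1.0]])
W = np.array([[1.0, 1.0]])
scores = obs_saliency(W, X, damp=0.0)
```

Weights with the lowest score are the cheapest to remove under this approximation, so equal\-magnitude weights are ranked by how strongly their input feature is excited\.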
## Appendix DImplementation Details
### D\.1Training recipe
All pretraining and retraining experiments use the default Lingua\(meta2024lingua\)recipe\. Key hyperparameters are summarised in Table[4](https://arxiv.org/html/2606.14150#A4.T4)\.

| config | value |
|---|---|
| sequence length | 8192 |
| batch size | 512 |
| weight decay | 0\.01 |
| warmup ratio | 5% |
| learning rate schedule | cosine decay |
| min learning rate ratio | 0\.1 |
| optimizer | AdamW |
| optimizer momentum | $\beta_{1}=0.9$, $\beta_{2}=0.95$ |
| optimizer eps | 1e\-8 |

Table 4: Training recipes of our pretraining and retraining experiments\.
#### Learning rate schedule\.
A cosine decay schedule is used for all runs\. The warmup phase occupies 5% of total steps, after which the learning rate decays to 10% of its peak value\. For each experiment we sweep the peak learning rate over $\{$1e\-5, 3e\-5, 1e\-4, 3e\-4, 1e\-3$\}$ using a short trial prefix and select the value that yields the lowest training loss\.
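A minimal sketch of such a schedule is shown below\. The linear shape of the warmup and the function name `lr_at_step` are assumptions; the text specifies only the 5% warmup fraction, the cosine decay, and the 10% floor\.

```python
import math

def lr_at_step(step, total_steps, peak_lr, warmup_ratio=0.05, min_ratio=0.1):
    """Linear warmup (assumed) for warmup_ratio of training, then cosine decay to min_ratio * peak_lr."""
    warmup_steps = max(1, int(warmup_ratio * total_steps))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)

# Candidate peak values from the sweep described above.
sweep = [1e-5, 3e-5, 1e-4, 3e-4, 1e-3]
final_lrs = [lr_at_step(1000, 1000, peak) for peak in sweep]
```

Each entry of `final_lrs` is one tenth of the corresponding peak, matching the minimum learning rate ratio in Table 4\.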
### D\.2Pruning configurations
We describe the calibration sets and configurations for each pruning method\.
#### Minitron\-depth\.
Calibration set: 1024 samples of 8192 tokens from WikiText\-103, used to compute validation loss\. Pruning ratios $\{50\%, 62.5\%, 75\%, 81.3\%\}$ correspond to 16, 8, 4, and 2 hidden layers in the pruned model\.
#### Minitron\-width\.
Same calibration set as Minitron\-depth\. The L2 norm of block output activations is used to score each hidden channel and MLP channel\. Pruning ratios: $\{50\%, 62.5\%, 75\%, 81.3\%\}$\. Since Minitron\-width prunes the hidden\-size and MLP\-size dimensions independently, multiple target architectures are possible for a given parameter budget\. To select the final configuration at each pruning ratio, we enumerate several candidate dimension pairs \(hidden size, MLP size\) that yield roughly the same total parameter count, following the original Minitron paper in keeping attention heads fixed at 32 and constraining $2h\leq\text{MLP}\leq 4h$\. We prune the pretrained Llama\-3\.1\-8B checkpoint to each candidate and evaluate validation loss on the calibration set without retraining\. The candidate with the lowest validation loss is used for all subsequent retraining\. The selected configurations are listed in Table [1](https://arxiv.org/html/2606.14150#S3.T1)\.
#### FLAP\.
Calibration set: 128 samples of 8192 tokens from C4\. Channels are scored by input feature variance weighted by the corresponding weight column norm, and pruned at a 50% ratio\.
#### Sheared LLaMA\.
We specify a target architecture matching a 50% pruning ratio and jointly learn pruning masks over layers, attention heads, hidden dimensions, and feedforward dimensions on the training data\.
#### SparseGPT and Wanda\.
Calibration set: 128 samples of 8192 tokens from C4\. Both 2:4 semi\-structured and 50% unstructured sparsity patterns are applied\.
### D\.3Architecture search
For Minitron\-W, FLAP, and Sheared LLaMA, the pruning method does not uniquely determine the output architecture for a given parameter budget: multiple dimension configurations can yield the same total parameter count\. For each method we enumerate a set of candidate architectures that match the target budget, prune the pretrained Llama\-3\.1\-8B checkpoint to each candidate without any retraining, and evaluate validation loss on the calibration set\. The candidate with the lowest validation loss is selected and used for all subsequent retraining experiments in this paper\.
#### Minitron\-W\.
Following the original Minitron paper, attention heads are kept unchanged \(32 heads\)\. We vary the hidden size and constrain MLP size such that $2h\leq\text{MLP}\leq 4h$\. The full candidate set and selected configuration are listed in Table [5](https://arxiv.org/html/2606.14150#A4.T5)\.
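The selection rule for Minitron\-W can be sketched as a filter followed by an arg\-min\. The candidate pairs and loss values below are placeholders invented for the example, not configurations or results from the paper, and `eval_loss` stands in for pruning the checkpoint and measuring calibration loss\.

```python
def within_mlp_bounds(hidden, mlp):
    """Constraint 2h <= MLP <= 4h on a (hidden size, MLP size) pair."""
    return 2 * hidden <= mlp <= 4 * hidden

def select_architecture(candidates, eval_loss):
    """Among feasible candidates, return the one with the lowest post-pruning loss."""
    feasible = [c for c in candidates if within_mlp_bounds(*c)]
    return min(feasible, key=eval_loss)

# Placeholder candidates and losses (hypothetical values).
candidates = [(1024, 4096), (1280, 3072), (1536, 2048)]
placeholder_loss = {(1024, 4096): 5.1, (1280, 3072): 4.7, (1536, 2048): 4.2}
best = select_architecture(candidates, placeholder_loss.get)
```

Here the pair with the lowest placeholder loss is rejected because its MLP size falls below twice the hidden size, so the search returns the best candidate that satisfies the constraint\.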
#### FLAP\.
Following the UL\-MM configuration in the FLAP paper, hidden size is kept unchanged at 4096\. We vary the number of KV heads and MLP size to reach the target of $\sim$4\.5B parameters, consistent with Minitron, Wanda, and SparseGPT\. Candidates and the selected configuration are listed in Table [6](https://arxiv.org/html/2606.14150#A4.T6)\.
#### Sheared LLaMA\.
We vary attention heads, hidden size, and MLP size to reach $\sim$4\.5B parameters\. Because varying all three dimensions simultaneously yields too many combinations, we fix hidden size to two representative values \(3072 and 3456\) and vary attention heads and MLP size within each\. The full candidate set and selected configuration are listed in Table [7](https://arxiv.org/html/2606.14150#A4.T7)\.
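
| Ratio | Hidden | Attn heads | MLP | Params | Sel\. |
|---|---|---|---|---|---|
| 62\.5% | 2176 | 32 | 7168 | 3\.1B | |
| 62\.5% | 2304 | 32 | 6656 | 3\.1B | |
| 62\.5% | 2432 | 32 | 6144 | 3\.1B | ✓ |
| 62\.5% | 2560 | 32 | 6400 | 3\.1B | |
| 62\.5% | 2688 | 32 | 5888 | 3\.1B | |
| 75% | 1664 | 32 | 6656 | 2\.0B | |
| 75% | 1792 | 32 | 6016 | 2\.0B | ✓ |
| 75% | 1920 | 32 | 5248 | 2\.0B | |
| 75% | 2048 | 32 | 4480 | 2\.0B | |
| 81\.3% | 1408 | 32 | 5632 | 1\.5B | |
| 81\.3% | 1536 | 32 | 4736 | 1\.5B | ✓ |
| 81\.3% | 1664 | 32 | 3840 | 1\.5B | |

Table 5: Candidate architectures for Minitron\-W\. Attention heads are fixed at 32; we vary hidden size with $2h\leq\text{MLP}\leq 4h$\. The selected configuration at each pruning ratio is marked with ✓\.

| Ratio | Hidden | KV heads | MLP | Params | Sel\. |
|---|---|---|---|---|---|
| 50% | 4096 | 16 | 7168 | 4\.5B | |
| 50% | 4096 | 20 | 6656 | 4\.5B | ✓ |
| 50% | 4096 | 24 | 6272 | 4\.5B | |
| 50% | 4096 | 28 | 5888 | 4\.5B | |
| 50% | 4096 | 32 | 5376 | 4\.5B | |

Table 6: Candidate architectures for FLAP\. Following the UL\-MM config in the FLAP paper, hidden size is fixed at 4096; KV heads and MLP size are varied\. The selected configuration is marked with ✓\.

| Ratio | Hidden | Attn heads | MLP | Params | Sel\. |
|---|---|---|---|---|---|
| 50% | 3072 | 16 | 11008 | 4\.5B | |
| 50% | 3072 | 20 | 10496 | 4\.5B | |
| 50% | 3072 | 24 | 10112 | 4\.5B | |
| 50% | 3072 | 28 | 9728 | 4\.5B | |
| 50% | 3072 | 32 | 9216 | 4\.5B | |
| 50% | 3456 | 16 | 9216 | 4\.5B | |
| 50% | 3456 | 20 | 8832 | 4\.5B | |
| 50% | 3456 | 24 | 8448 | 4\.5B | |
| 50% | 3456 | 28 | 7936 | 4\.5B | |
| 50% | 3456 | 32 | 7552 | 4\.5B | ✓ |

Table 7: Candidate architectures for Sheared LLaMA\. Hidden size is fixed to 3072 or 3456; attention heads and MLP size are varied within each\. The selected configuration is marked with ✓\.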
## Appendix EEvaluation protocol
#### Linguistic perplexity\.
We evaluate on the general\-domain corpora C4\(raffel2020c4\), WikiText\-103\(merity2017wikitext\), and WikiText\-2\(merity2017wikitext\), along with the news\-and\-summaries corpus CNN Dailymail\(chen2016cnndailymail\)\. For each corpus, we collect 256 sequences of length 8192 \(the max position embedding of Llama\-3\.1\) as the evaluation set\.
#### Downstream accuracy\.
We evaluate on eight zero\-shot benchmarks spanning three categories: commonsense QA \(WinoGrande\(sakaguchi2020winogrande\), HellaSwag\(zellers2019hellaswag\), PIQA\(bisk2019piqa\)\), scientific QA \(ARC\-Challenge\(clark2018arc\), ARC\-Easy\(clark2018arc\), SciQ\(welbl2017sciq\), OpenBookQA\(mihaylov2018obqa\)\), and logical reasoning \(BoolQ\(clark2019boolq\)\)\. Zero\-shot evaluation measures foundational knowledge and capability, independent of in\-context learning biases\. The random\-chance baselines are 50% for WinoGrande \(binary\), 25% for ARC\-C and ARC\-E \(four\-way\), and 50% for PIQA and BoolQ \(binary\); results near these baselines indicate that the model has essentially lost the measured capability, and we interpret them accordingly\.
#### Evaluation framework and standard deviations\.
All downstream evaluations are run with lm\-evaluation\-harness \(eleuther2021lmeval\)\. Grey entries in Table [16](https://arxiv.org/html/2606.14150#A8.T16) denote differences that fall within the evaluation standard deviation, estimated as $\sqrt{acc(1-acc)/N}$, where $N$ is the number of evaluation examples\. The per\-benchmark standard deviations are shown in Table [8](https://arxiv.org/html/2606.14150#A5.T8)\.
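The estimator is simple enough to check by hand\. In the sketch below, the helper `within_noise` and its choice to evaluate the standard deviation at the first accuracy are assumptions about how a within\-noise test might be applied, and the example size of 1000 is hypothetical\.

```python
import math

def accuracy_std(acc, n):
    """Binomial standard deviation of an accuracy estimate: sqrt(acc * (1 - acc) / n)."""
    return math.sqrt(acc * (1.0 - acc) / n)

def within_noise(acc_a, acc_b, n):
    """True when |acc_a - acc_b| does not exceed the std evaluated at acc_a (assumption)."""
    return abs(acc_a - acc_b) <= accuracy_std(acc_a, n)

# Hypothetical benchmark with 1000 examples and accuracy near chance.
std = accuracy_std(0.5, 1000)
small_gap = within_noise(0.50, 0.51, 1000)
large_gap = within_noise(0.50, 0.53, 1000)
```

Since the estimate shrinks with $\sqrt{N}$, benchmarks with fewer examples have wider noise bands, which is why the per\-benchmark values in Table 8 differ\.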

| Benchmark | Std\. Dev\. |
|---|---|
| WinoG | ±1\.3% |
| ARC\-C | ±1\.4% |
| ARC\-E | ±1\.0% |
| HSwag | ±0\.4% |
| PIQA | ±1\.0% |
| SciQ | ±0\.8% |
| BoolQ | ±0\.8% |
| OBQA | ±2\.2% |

Table 8: Per\-benchmark evaluation standard deviations\.
## Appendix FEfficiency comparison

| method | sparsity | FLOPs \(TF\) | Speedup |
|---|---|---|---|
| base | 0\.0% | 32\.9 | 1\.0× |
| Minitron\-D | 50\.0% | 17\.5 | 1\.9× |
| Minitron\-D | 62\.5% | 9\.8 | 4\.2× |
| Minitron\-D | 75\.0% | 6\.0 | 7\.2× |
| Minitron\-D | 81\.3% | 4\.1 | 11\.2× |
| Minitron\-W | 50\.0% | 19\.1 | 1\.4× |
| Minitron\-W | 62\.5% | 12\.6 | 2\.0× |
| Minitron\-W | 75\.0% | 9\.8 | 2\.4× |
| Minitron\-W | 81\.3% | 7\.9 | 2\.9× |
| FLAP | 50\.0% | 17\.7 | 1\.5× |
| Sheared LLaMA | 50\.0% | 18\.9 | 1\.4× |
| Wanda | 50\.0% | 16\.5 | 1\.0× |
| Wanda | 2:4 | 16\.5 | 1\.0× |
| SparseGPT | 50% | 16\.5 | 1\.0× |
| SparseGPT | 2:4 | 16\.5 | 1\.0× |

Table 9: Efficiency comparison between pruning methods\. Models are obtained by pretraining Llama\-3\.1\-8B for 200B tokens, pruning at the listed ratio, and retraining for 50B tokens\. For Wanda and SparseGPT, 50% denotes unstructured sparsity\. FLOPs are computed for a single forward pass with sequence length 2048; for sparse models, theoretical FLOPs assume 50% of weight multiplications are skipped\. All step times are profiled with MaxText \(aihypercomputer2025maxtext\) on Google TPU v4\-256, which has no sparse tensor cores\. On this hardware, sparse methods yield no training speedup despite halved theoretical FLOPs\. On NVIDIA GPUs with sparse tensor cores \(A100/H100\), 2:4 sparse models obtain meaningful speedups via cuSPARSELt; unstructured sparse models do not\.
Table [9](https://arxiv.org/html/2606.14150#A6.T9) compares training speed and WikiText\-103 \(merity2017wikitext\) performance among the pruning methods we experiment with\. Minitron\-D \(sreenivas2024minitron2\), with the coarsest granularity, provides the most speedup while preserving the least of the base model’s performance on WikiText\-103, while sparse pruning methods like SparseGPT \(frantar2023sparsegpt\) and Wanda \(sun2024wanda\) score the best but provide no training speedup at all\. Minitron\-W \(sreenivas2024minitron2\) provides a middle ground with moderate speedup and decent preservation of the base model’s performance\.
An overview of the methods is shown in Figure[2](https://arxiv.org/html/2606.14150#S1.F2)\.
Among these methods, we cover four different granularities: depth pruning, width pruning, 2:4 sparse pruning, and unstructured sparse pruning\. We also use four pruning ratios 50%, 62\.5%, 75\.0% and 81\.3%\.
## Appendix GWhy Does Sparse Pruning Maintain Its Advantage?
The persistent advantage of sparse pruning over training from scratch, even when scratch training uses the full 250B\-token budget \(Section[4\.3](https://arxiv.org/html/2606.14150#S4.SS3)\), calls for a mechanistic explanation\. We offer two complementary perspectives grounded in prior work\.
#### Optimizing a sparse model from scratch is hard\.
The Lottery Ticket Hypothesis \(frankle2019LTH\) posits that a pruned subnetwork trained from its *original* initialization converges faster and to higher accuracy than the same sparse structure trained from a *random* reinitialization\. evci2020riggingthelottery further demonstrate that static sparse training can get stuck in isolated local minima, and that allowing the sparse topology to evolve during training helps escape them\. In our S250\-sparse setting, the sparse mask is fixed and the weights are randomly reinitialized, placing it in exactly the hard regime identified by these works\.
#### Sparse pruning identifies and preserves non\-redundant weights\.
liu2019rethinkingvaluenetworkpruning observe that both fine\-tuned and scratch\-trained pruned models have far fewer near\-zero weights than the unpruned base model, suggesting that pruning actively identifies structurally important weights\. For unstructured sparse pruning in particular, the fine\-tuned model diverges more strongly from the scratch\-trained counterpart than structured pruning does, indicating that the specific weight *values* inherited from the large model, not just the sparsity pattern, are a key source of advantage\.
## Appendix HDetailed Per\-Benchmark Results
### H\.1Pruning\-only baselines
To contextualize the benefit of retraining, we report zero\-shot performance immediately after pruning, before any retraining is applied\. Table[10](https://arxiv.org/html/2606.14150#A8.T10)covers both structured pruning \(Minitron\-D, Minitron\-W, FLAP, Sheared LLaMA\) and sparse pruning \(Wanda and SparseGPT at 2:4 and unstructured 50% sparsity\)\. We note that some structured\-pruning configurations exhibit extremely high post\-pruning perplexity — for example, P200 \+ Minitron\-W reaches 24686\.5 on C4 and 36761\.5 on WikiText\-2 — because activation\-norm channel selection on the P200 checkpoint can leave the residual stream poorly conditioned before any weights are re\-tuned\. Despite this near\-broken starting point, the same configuration fully recovers after a modest retraining budget: at the 50B\-token retraining setting reported in the main text \(Table[16](https://arxiv.org/html/2606.14150#A8.T16)\), P200\-R50 with Minitron\-W reaches accuracy and perplexity comparable to the other 50%\-pruned methods, indicating that initial post\-pruning perplexity alone is a poor predictor of the retrained model’s final quality\.
| Base model | Method | C4 ppl ↓ | WT-2 ppl ↓ | WinoG % ↑ | ARC-C % ↑ | ARC-E % ↑ | HSwag % ↑ | PIQA % ↑ | SciQ % ↑ | BoolQ % ↑ | OBQA % ↑ | Avg % ↑ |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Meta | Base | 9.9 | 5.8 | 74.3 | 53.4 | 81.1 | 78.9 | 81.3 | 96.3 | 82.0 | 45.0 | 74.0 |
| Meta | Minitron-D | 532.4 | 599.9 | 48.5 | 24.0 | 30.6 | 29.1 | 54.7 | 31.1 | 37.9 | 27.2 | 35.4 |
| Meta | Minitron-W | 411.0 | 182.0 | 52.4 | 22.6 | 30.3 | 28.6 | 53.0 | 51.2 | 37.8 | 27.4 | 37.9 |
| Meta | FLAP | 85.4 | 34.2 | 53.4 | 25.5 | 36.8 | 34.6 | 57.7 | 61.9 | 38.8 | 30.0 | 42.3 |
| Meta | Sheared-LLaMA | 19.1 | 13.3 | 61.6 | 35.0 | 59.3 | 57.0 | 69.9 | 88.9 | 65.7 | 36.2 | 59.2 |
| Meta | Wanda-2:4 | 38.0 | 21.6 | 59.4 | 30.5 | 52.0 | 49.2 | 69.0 | 83.8 | 67.9 | 30.8 | 55.3 |
| Meta | Wanda-U | 15.5 | 9.3 | 70.4 | 42.2 | **68.3** | 68.5 | 76.4 | **90.8** | 79.1 | 40.2 | 67.0 |
| Meta | SparseGPT-2:4 | 24.2 | 15.0 | 63.5 | 34.8 | 59.3 | 56.5 | 70.5 | 87.3 | 68.2 | 35.0 | 59.4 |
| Meta | SparseGPT-U | **14.5** | **8.8** | **72.5** | **43.0** | 67.4 | **71.4** | **76.5** | 87.9 | **79.2** | **42.4** | **67.5** |
| P200 | Base | 15.8 | 9.0 | 68.4 | 46.3 | 75.1 | 74.5 | 78.8 | 95.3 | 69.3 | 41.0 | 68.6 |
| P200 | Minitron-D | 504.0 | 364.2 | 53.6 | 29.9 | 34.7 | 36.9 | 58.8 | 35.4 | 53.8 | 28.6 | 41.5 |
| P200 | Minitron-W | 24686.5 | 36761.5 | 49.3 | 26.3 | 25.8 | 26.2 | 51.2 | 22.8 | 38.9 | 27.6 | 33.5 |
| P200 | FLAP | 128.0 | 57.0 | 51.5 | 24.8 | 47.8 | 38.1 | 63.9 | 69.4 | 62.1 | 31.2 | 48.6 |
| P200 | Sheared-LLaMA | **20.4** | 15.2 | 61.1 | 41.8 | 69.4 | 64.1 | 74.9 | **91.5** | 65.8 | 37.6 | 63.3 |
| P200 | Wanda-2:4 | 142.8 | 67.1 | 57.5 | 30.5 | 53.9 | 52.7 | 67.6 | 89.5 | 63.9 | 32.0 | 56.0 |
| P200 | Wanda-U | 30.9 | 15.0 | 63.4 | 42.4 | 67.8 | **69.1** | 76.4 | 90.3 | **66.9** | 38.2 | 64.3 |
| P200 | SparseGPT-2:4 | 42.7 | 20.3 | 62.6 | 35.8 | 63.5 | 59.4 | 73.5 | 89.1 | 65.6 | 37.4 | 60.9 |
| P200 | SparseGPT-U | 22.4 | **11.9** | **66.5** | **43.9** | **70.2** | 68.9 | **77.9** | 90.4 | 65.4 | **40.2** | **65.4** |

Table 10: Zero-shot performance immediately after pruning, without any retraining. All pruning methods use 50% sparsity. Minitron-D and Minitron-W denote the depth and width variants of Minitron pruning, respectively. Wanda-U and SparseGPT-U denote the unstructured sparsity variants of Wanda and SparseGPT (as opposed to the 2:4 semi-structured variants). C4 and WT-2: perplexity (lower is better); others: 0-shot accuracy (%). Avg is the mean over all 8 zero-shot benchmarks (WinoG, ARC-C, ARC-E, HSwag, PIQA, SciQ, BoolQ, OBQA). The Base rows are the unpruned base model in each block (Meta's released Llama-3.1-8B and our 200B-token pretrained checkpoint); bold values mark the best result among the pruned methods within each block.
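As a quick consistency check on how the Avg column is defined, the sketch below recomputes it for two rows of Table 10 as the unweighted mean of the eight zero-shot accuracies. The values are copied from the table; the helper name `avg_accuracy` is illustrative and not part of the released code.

```python
BENCHMARKS = ["WinoG", "ARC-C", "ARC-E", "HSwag", "PIQA", "SciQ", "BoolQ", "OBQA"]

# Rows copied from Table 10 (accuracy in %, same order as BENCHMARKS).
rows = {
    ("Meta", "FLAP"): [53.4, 25.5, 36.8, 34.6, 57.7, 61.9, 38.8, 30.0],
    ("P200", "Minitron-W"): [49.3, 26.3, 25.8, 26.2, 51.2, 22.8, 38.9, 27.6],
}


def avg_accuracy(scores):
    """Unweighted mean over the eight benchmarks, rounded to one decimal."""
    assert len(scores) == len(BENCHMARKS)
    return round(sum(scores) / len(scores), 1)


for key, scores in rows.items():
    print(key, avg_accuracy(scores))
# ('Meta', 'FLAP') 42.3
# ('P200', 'Minitron-W') 33.5
```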
### H.2 Pruning ratio ablations
Table [11](https://arxiv.org/html/2606.14150#A8.T11) reports per-benchmark results for Minitron-D and Minitron-W at pruning ratios {50.0%, 62.5%, 75.0%, 81.3%} (equal training token budget). The ΔP200-R50 rows summarize the advantage of pruning initialization over scratch training: for depth pruning this advantage shrinks steadily as the pruning ratio grows, while for width pruning it remains positive at every ratio.
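| method | ratio | tokens | DCLM loss ↓ | C4 ppl ↓ | WT ppl ↓ | WT-2 ppl ↓ | CNN ppl ↓ | WinoG % ↑ | ARC-C % ↑ | ARC-E % ↑ | HSwag % ↑ | PIQA % ↑ | SciQ % ↑ | BoolQ % ↑ | OBQA % ↑ | Avg % ↑ |
| --- | --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Minitron-depth | 50.0% | S50 | 2.64 | 16.90 | 10.61 | 10.77 | 10.35 | 56.8 | 36.2 | 70.1 | 61.9 | 74.4 | 90.4 | 57.8 | 37.8 | 60.7 |
| Minitron-depth | 50.0% | P200-R50 | 2.54 | 15.34 | 9.30 | 9.41 | 9.46 | 64.4 | 40.7 | 71.5 | 68.0 | 76.8 | 91.7 | 64.0 | 38.0 | 64.4 |
| Minitron-depth | 50.0% | ΔP200-R50 | +0.10 | +1.56 | +1.31 | +1.36 | +0.89 | +7.6 | +4.5 | +1.4 | +6.1 | +2.4 | +1.3 | +6.2 | +0.2 | +3.7 |
| Minitron-depth | 62.5% | S50 | 2.82 | 20.23 | 13.43 | 13.70 | 12.57 | 54.0 | 32.3 | 62.3 | 51.1 | 71.4 | 85.2 | 51.6 | 33.8 | 55.2 |
| Minitron-depth | 62.5% | P200-R50 | 2.77 | 19.35 | 12.67 | 12.89 | 12.03 | 55.3 | 30.7 | 62.5 | 53.6 | 72.3 | 87.3 | 55.7 | 34.2 | 56.4 |
| Minitron-depth | 62.5% | ΔP200-R50 | +0.05 | +0.88 | +0.76 | +0.81 | +0.53 | +1.3 | -1.6 | +0.2 | +2.5 | +0.9 | +2.1 | +4.1 | +0.4 | +1.2 |
| Minitron-depth | 75.0% | S50 | 3.02 | 24.57 | 17.29 | 17.69 | 15.55 | 51.4 | 27.6 | 56.2 | 40.9 | 68.1 | 79.5 | 55.0 | 29.0 | 51.0 |
| Minitron-depth | 75.0% | P200-R50 | 3.01 | 24.41 | 17.11 | 17.48 | 15.49 | 51.1 | 29.3 | 55.8 | 41.4 | 68.6 | 81.9 | 54.9 | 30.4 | 51.7 |
| Minitron-depth | 75.0% | ΔP200-R50 | 0.01 | +0.16 | +0.18 | +0.21 | +0.06 | -0.3 | +1.7 | -0.4 | +0.6 | +0.5 | +2.4 | -0.2 | +1.4 | +0.7 |
| Minitron-depth | 81.3% | S50 | 3.25 | 30.79 | 24.57 | 25.16 | 19.97 | 49.4 | 25.9 | 52.2 | 35.7 | 65.1 | 74.4 | 55.1 | 28.8 | 48.3 |
| Minitron-depth | 81.3% | P200-R50 | 3.25 | 30.79 | 24.92 | 25.48 | 20.33 | 51.2 | 24.6 | 53.5 | 35.4 | 66.4 | 76.0 | 50.4 | 27.4 | 48.1 |
| Minitron-depth | 81.3% | ΔP200-R50 | 0.00 | 0.00 | -0.35 | -0.32 | -0.36 | +1.8 | -1.4 | +1.2 | -0.3 | +1.3 | +1.6 | -4.7 | -1.4 | -0.2 |
| Minitron-width | 50.0% | S50 | 2.70 | 17.47 | 10.99 | 11.17 | 10.65 | 61.3 | 33.6 | 65.1 | 60.3 | 74.7 | 89.6 | 60.8 | 33.8 | 59.9 |
| Minitron-width | 50.0% | P200-R50 | 2.47 | 14.6 | 8.93 | 9.01 | 8.92 | 65.4 | 41.8 | 73.4 | 70.5 | 77.4 | 93.8 | 67.4 | 40.4 | 66.3 |
| Minitron-width | 50.0% | ΔP200-R50 | +0.23 | +2.87 | +2.05 | +2.16 | +1.73 | +4.1 | +8.2 | +8.3 | +10.2 | +2.7 | +4.2 | +6.6 | +6.6 | +6.4 |
| Minitron-width | 62.5% | S50 | 2.75 | 18.67 | 12.00 | 12.23 | 11.45 | 56.1 | 30.1 | 65.2 | 56.3 | 72.9 | 86.2 | 55.7 | 32.6 | 56.9 |
| Minitron-width | 62.5% | P200-R50 | 2.50 | 14.51 | 8.49 | 8.61 | 9.01 | 61.5 | 41.7 | 71.0 | 69.5 | 75.4 | 92.6 | 72.4 | 35.4 | 64.9 |
| Minitron-width | 62.5% | ΔP200-R50 | +0.24 | +4.16 | +3.51 | +3.61 | +2.44 | +5.4 | +11.6 | +5.8 | +13.2 | +2.4 | +6.4 | +16.8 | +2.8 | +8.0 |
| Minitron-width | 75.0% | S50 | 2.82 | 20.00 | 12.99 | 13.26 | 12.27 | 56.3 | 29.8 | 61.7 | 52.6 | 71.9 | 87.4 | 55.7 | 32.2 | 56.0 |
| Minitron-width | 75.0% | P200-R50 | 2.60 | 15.98 | 9.61 | 9.76 | 9.91 | 57.2 | 37.2 | 68.0 | 63.5 | 73.4 | 89.8 | 66.1 | 32.4 | 61.0 |
| Minitron-width | 75.0% | ΔP200-R50 | +0.21 | +4.03 | +3.38 | +3.49 | +2.36 | +0.9 | +7.4 | +6.2 | +11.0 | +1.5 | +2.4 | +10.4 | +0.2 | +5.0 |
| Minitron-width | 81.3% | S50 | 2.86 | 20.93 | 13.77 | 14.08 | 12.88 | 54.1 | 27.9 | 60.7 | 49.5 | 70.7 | 86.0 | 53.2 | 31.5 | 54.2 |
| Minitron-width | 81.3% | P200-R50 | 2.63 | 16.16 | 9.53 | 9.71 | 10.10 | 54.0 | 35.8 | 65.8 | 61.2 | 71.7 | 88.3 | 67.7 | 29.8 | 59.3 |
| Minitron-width | 81.3% | ΔP200-R50 | +0.23 | +4.77 | +4.24 | +4.37 | +2.78 | -0.1 | +7.9 | +5.1 | +11.7 | +1.0 | +2.3 | +14.5 | -1.7 | +5.1 |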
Table 11: Effect of pruning ratio on the initialization advantage (equal training token budget). P200-R50: pretrain 200B tokens, prune, retrain 50B tokens. S50: train from scratch for 50B tokens. ΔP200-R50: difference relative to P200-R50, signed so that positive values favor P200-R50 for every metric. Blue: P200-R50 better; red: P200-R50 worse; grey: within evaluation standard deviation. The advantage of pruning initialization diminishes as the pruning ratio increases, vanishing near 81.3% for depth pruning.

#### Per-benchmark observations.
BoolQ is the most volatile benchmark across pruning ratios under both granularities: under depth pruning its delta decays monotonically from +6.2 at 50% to −4.7 at 81.3%, while under width pruning it swings non-monotonically between +6.6 and +16.8. ARC-Challenge under Minitron-D at 62.5% is a localized anomaly: pruning costs 1.6 points there even though the neighboring ratios (50%, 75%) both favor pruning. Minitron-W retains a positive average advantage at every ratio we tested (including 81.3%, ΔAvg = +5.1), in sharp contrast to depth pruning, where the average advantage vanishes at 81.3% (ΔAvg = −0.2) and only WinoG, ARC-E, PIQA, and SciQ remain positive.
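The average deltas discussed here can be recomputed from the S50 and P200-R50 rows of Table 11. The sketch below does so for the Avg column, assuming the table's convention that a positive delta means P200-R50 is better (for accuracy, P200-R50 minus S50). The values are copied from Table 11, and the function name `pruning_advantage` is hypothetical.

```python
# Avg accuracy (%) copied from Table 11 as (S50, P200-R50), keyed by pruning ratio.
table11_avg = {
    "Minitron-D": {"50.0%": (60.7, 64.4), "62.5%": (55.2, 56.4),
                   "75.0%": (51.0, 51.7), "81.3%": (48.3, 48.1)},
    "Minitron-W": {"50.0%": (59.9, 66.3), "62.5%": (56.9, 64.9),
                   "75.0%": (56.0, 61.0), "81.3%": (54.2, 59.3)},
}


def pruning_advantage(s50, p200_r50):
    """Positive means the pruned initialization (P200-R50) beats scratch (S50)."""
    return round(p200_r50 - s50, 1)


deltas = {
    method: {ratio: pruning_advantage(*pair) for ratio, pair in by_ratio.items()}
    for method, by_ratio in table11_avg.items()
}
for method, by_ratio in deltas.items():
    print(method, by_ratio)
# Minitron-D {'50.0%': 3.7, '62.5%': 1.2, '75.0%': 0.7, '81.3%': -0.2}
# Minitron-W {'50.0%': 6.4, '62.5%': 8.0, '75.0%': 5.0, '81.3%': 5.1}
```

The depth-pruning advantage falls at every step and changes sign at 81.3%, while the width-pruning advantage stays above 5 points throughout.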
### H.3 Token scaling across pruning methods
Tables [12](https://arxiv.org/html/2606.14150#A8.T12), [13](https://arxiv.org/html/2606.14150#A8.T13), [14](https://arxiv.org/html/2606.14150#A8.T14), and [15](https://arxiv.org/html/2606.14150#A8.T15) report per-benchmark results across retraining token budgets {10B, 30B, 50B, 250B, 500B} for Minitron-D, Minitron-W, FLAP, and Sheared LLaMA, respectively, under the three initialization strategies (train from scratch, prune from our 200B-pretrained checkpoint, and prune from Meta's released Llama-3.1-8B). All four tables are the full-data sources behind Figure [3](https://arxiv.org/html/2606.14150#S4.F3) in the main text.
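| method | tokens | DCLM loss ↓ | C4 ppl ↓ | WT ppl ↓ | WT-2 ppl ↓ | CNN ppl ↓ | WinoG % ↑ | ARC-C % ↑ | ARC-E % ↑ | HSwag % ↑ | PIQA % ↑ | SciQ % ↑ | BoolQ % ↑ | OBQA % ↑ | Avg % ↑ |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Minitron depth | S10 | 3.09 | 25.92 | 18.18 | 18.71 | 16.07 | 49.0 | 26.7 | 49.3 | 38.6 | 67.2 | 76.7 | 60.0 | 31.2 | 49.8 |
| Minitron depth | P200-R10 | 2.70 | 16.83 | 10.29 | 10.44 | 10.79 | 57.7 | 34.6 | 68.5 | 60.3 | 75.8 | 89.8 | 60.5 | 38.2 | 60.7 |
| Minitron depth | ΔS10 | +0.39 | +9.09 | +7.89 | +8.27 | +5.28 | +8.7 | +7.9 | +19.2 | +21.7 | +8.6 | +13.1 | +0.5 | +7.0 | +10.9 |
| Minitron depth | Meta-R10 | 2.69 | 16.29 | 9.86 | 9.98 | 10.58 | 58.5 | 36.7 | 70.1 | 61.9 | 75.6 | 92.0 | 61.2 | 39.4 | 61.9 |
| Minitron depth | ΔS10 | +0.41 | +9.63 | +8.32 | +8.73 | +5.49 | +9.5 | +10.0 | +20.8 | +23.3 | +8.4 | +15.3 | +1.2 | +8.2 | +12.1 |
| Minitron depth | S30 | 2.81 | 19.40 | 12.50 | 12.75 | 11.91 | 54.3 | 30.4 | 63.7 | 54.2 | 72.3 | 86.1 | 48.8 | 33.8 | 55.5 |
| Minitron depth | P200-R30 | 2.59 | 15.40 | 9.19 | 9.27 | 9.68 | 59.0 | 38.3 | 71.2 | 65.6 | 76.5 | 91.2 | 61.5 | 37.8 | 62.6 |
| Minitron depth | ΔS30 | +0.21 | +4.00 | +3.31 | +3.48 | +2.23 | +4.7 | +7.9 | +7.5 | +11.4 | +4.2 | +5.1 | +12.7 | +4.0 | +7.1 |
| Minitron depth | Meta-R30 | 2.59 | 15.07 | 8.96 | 9.04 | 9.64 | 61.7 | 39.8 | 72.8 | 66.8 | 76.7 | 93.0 | 61.2 | 40.6 | 64.1 |
| Minitron depth | ΔS30 | +0.22 | +4.33 | +3.54 | +3.71 | +2.27 | +7.4 | +9.4 | +9.1 | +12.6 | +4.4 | +6.9 | +12.4 | +6.8 | +8.6 |
| Minitron depth | S50 | 2.64 | 16.90 | 10.61 | 10.77 | 10.35 | 56.8 | 36.2 | 70.1 | 61.9 | 74.4 | 90.4 | 57.8 | 37.8 | 60.7 |
| Minitron depth | P200-R50 | 2.54 | 15.34 | 9.30 | 9.41 | 9.46 | 64.4 | 40.7 | 71.5 | 68.0 | 76.8 | 91.7 | 64.0 | 38.0 | 64.4 |
| Minitron depth | ΔS50 | +0.10 | +1.56 | +1.31 | +1.36 | +0.89 | +7.6 | +4.5 | +1.4 | +6.1 | +2.4 | +1.3 | +6.2 | +0.2 | +3.7 |
| Minitron depth | Meta-R50 | 2.50 | 14.42 | 8.55 | 8.61 | 9.21 | 63.3 | 40.0 | 72.7 | 69.0 | 77.6 | 93.2 | 62.2 | 40.6 | 63.4 |
| Minitron depth | ΔS50 | +0.14 | +2.48 | +2.06 | +2.16 | +1.14 | +6.5 | +3.8 | +2.6 | +7.1 | +3.2 | +2.8 | +4.4 | +2.8 | +2.7 |
| Minitron depth | S250 | 2.49 | 14.78 | 8.93 | 9.01 | 9.03 | 63.5 | 41.7 | 73.7 | 70.8 | 78.0 | 93.2 | 65.1 | 42.0 | 66.0 |
| Minitron depth | P200-R250 | 2.46 | 13.87 | 8.12 | 8.16 | 8.57 | 67.3 | 43.3 | 74.2 | 72.5 | 78.6 | 94.4 | 62.7 | 42.8 | 67.0 |
| Minitron depth | ΔS250 | +0.03 | +0.91 | +0.81 | +0.85 | +0.46 | +3.8 | +1.6 | +0.5 | +1.7 | +0.6 | +1.2 | -2.4 | +0.8 | +1.0 |
| Minitron depth | Meta-R250 | 2.46 | 13.72 | 8.00 | 8.03 | 8.56 | 66.6 | 44.5 | 75.0 | 72.9 | 78.8 | 94.4 | 68.6 | 43.0 | 68.0 |
| Minitron depth | ΔS250 | +0.03 | +1.06 | +0.93 | +0.98 | +0.47 | +3.1 | +2.8 | +1.3 | +2.1 | +0.8 | +1.2 | +3.5 | +1.0 | +2.0 |
| Minitron depth | S500 | 2.47 | 14.15 | 8.42 | 8.46 | 8.65 | 68.4 | 42.7 | 74.6 | 72.5 | 77.9 | 93.4 | 68.1 | 40.8 | 67.3 |
| Minitron depth | P200-R500 | 2.43 | 13.59 | 7.93 | 7.96 | 8.36 | 66.4 | 45.0 | 74.6 | 74.1 | 79.0 | 94.6 | 68.5 | 42.4 | 68.1 |
| Minitron depth | ΔS500 | +0.04 | +0.56 | +0.49 | +0.50 | +0.29 | -2.0 | +2.3 | 0.0 | +1.6 | +1.1 | +1.2 | +0.4 | +1.6 | +0.8 |
| Minitron depth | Meta-R500 | 2.43 | 13.46 | 7.81 | 7.83 | 8.34 | 68.7 | 45.8 | 75.7 | 74.5 | 80.0 | 95.0 | 69.4 | 44.4 | 69.2 |
| Minitron depth | ΔS500 | +0.04 | +0.69 | +0.61 | +0.63 | +0.31 | +0.3 | +3.1 | +1.1 | +2.0 | +2.1 | +1.6 | +1.3 | +3.6 | +1.9 |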
Table 12: Minitron depth pruning across retraining token budgets (Llama-3.1-8B → 4B, 50% pruning). S*N*: train from scratch for *N*B tokens. P200-R*N*: pretrain 200B tokens on DCLM, prune, retrain *N*B tokens. Meta-R*N*: prune from Meta's released Llama-3.1-8B, retrain *N*B tokens. DCLM loss is validation cross-entropy on DCLM (lower is better). Avg is the mean of WinoG, ARC-C, ARC-E, HSwag, PIQA, SciQ, BoolQ, and OBQA. Each ΔS*N* row reports the difference between the pruned-model row directly above (P200-R*N* or Meta-R*N*) and the S*N* baseline.

| method | tokens | DCLM loss ↓ | C4 ppl ↓ | WT ppl ↓ | WT-2 ppl ↓ | CNN ppl ↓ | WinoG % ↑ | ARC-C % ↑ | ARC-E % ↑ | HSwag % ↑ | PIQA % ↑ | SciQ % ↑ | BoolQ % ↑ | OBQA % ↑ | Avg % ↑ |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Minitron width | S10 | 3.11 | 26.68 | 18.74 | 19.26 | 16.55 | 51.9 | 25.3 | 51.5 | 38.2 | 67.1 | 78.6 | 59.7 | 19.8 | 46.8 |
| Minitron width | P200-R10 | 2.62 | 15.57 | 9.44 | 9.58 | 9.83 | 65.0 | 37.2 | 69.6 | 68.3 | 76.2 | 93.9 | 66.0 | 29.8 | 63.3 |
| Minitron width | ΔS10 | +0.49 | +11.11 | +9.30 | +9.68 | +6.72 | +13.1 | +11.9 | +18.1 | +30.1 | +9.1 | +15.3 | +6.3 | +10.0 | +16.5 |
| Minitron width | Meta-R10 | 2.55 | 14.40 | 8.35 | 8.46 | 9.07 | 66.8 | 42.2 | 73.3 | 70.9 | 78.3 | 95.1 | 73.6 | 29.6 | 66.3 |
| Minitron width | ΔS10 | +0.56 | +12.28 | +10.39 | +10.80 | +7.48 | +14.9 | +16.9 | +21.8 | +32.7 | +11.2 | +16.5 | +13.9 | +9.8 | +19.5 |
| Minitron width | S30 | 2.80 | 19.19 | 12.32 | 12.55 | 11.79 | 56.4 | 30.3 | 63.2 | 55.1 | 72.5 | 86.4 | 57.3 | 23.4 | 55.5 |
| Minitron width | P200-R30 | 2.47 | 13.66 | 7.94 | 8.03 | 8.55 | 68.9 | 48.0 | 76.3 | 73.8 | 79.9 | 95.6 | 76.0 | 31.0 | 69.4 |
| Minitron width | ΔS30 | +0.32 | +5.53 | +4.38 | +4.52 | +3.24 | +12.5 | +17.7 | +13.1 | +18.7 | +7.4 | +9.2 | +18.7 | +7.6 | +13.9 |
| Minitron width | Meta-R30 | 2.47 | 13.46 | 7.73 | 7.80 | 8.45 | 68.7 | 45.2 | 76.3 | 74.2 | 78.7 | 95.3 | 75.8 | 32.8 | 68.6 |
| Minitron width | ΔS30 | +0.33 | +5.73 | +4.59 | +4.75 | +3.34 | +12.3 | +14.9 | +13.1 | +19.1 | +6.2 | +8.9 | +18.5 | +9.4 | +13.1 |
| Minitron width | S50 | 2.70 | 17.47 | 10.99 | 11.17 | 10.62 | 61.3 | 33.6 | 65.1 | 60.3 | 74.7 | 89.6 | 60.8 | 24.2 | 59.0 |
| Minitron width | P200-R50 | 2.47 | 14.59 | 8.93 | 9.01 | 8.92 | 65.7 | 41.7 | 73.4 | 70.5 | 77.4 | 93.8 | 67.4 | 29.8 | 65.7 |
| Minitron width | ΔS50 | +0.23 | +2.88 | +2.06 | +2.16 | +1.70 | +4.4 | +8.1 | +8.3 | +10.2 | +2.7 | +4.2 | +6.6 | +5.6 | +6.7 |
| Minitron width | Meta-R50 | 2.39 | 12.95 | 7.41 | 7.47 | 8.13 | 69.7 | 48.5 | 77.6 | 75.7 | 79.0 | 95.8 | 75.3 | 32.3 | 70.1 |
| Minitron width | ΔS50 | +0.31 | +4.52 | +3.58 | +3.70 | +2.49 | +8.4 | +14.9 | +12.5 | +15.4 | +4.3 | +6.2 | +14.5 | +8.1 | +11.1 |
| Minitron width | S250 | 2.46 | 14.37 | 8.62 | 8.69 | 8.75 | 65.6 | 45.9 | 75.4 | 72.2 | 77.8 | 91.1 | 68.5 | 31.0 | 67.4 |
| Minitron width | P200-R250 | 2.39 | 13.02 | 7.56 | 7.60 | 8.02 | 66.3 | 45.2 | 76.0 | 75.2 | 78.8 | 95.5 | 75.6 | 28.0 | 68.3 |
| Minitron width | ΔS250 | +0.07 | +1.35 | +1.06 | +1.09 | +0.73 | +0.7 | -0.7 | +0.6 | +3.0 | +1.0 | +4.4 | +7.1 | -3.0 | +0.9 |
| Minitron width | Meta-R250 | 2.37 | 12.51 | 7.13 | 7.16 | 7.78 | 70.6 | 48.8 | 78.7 | 77.3 | 79.8 | 96.0 | 78.3 | 32.6 | 71.1 |
| Minitron width | ΔS250 | +0.10 | +1.86 | +1.49 | +1.53 | +0.97 | +5.0 | +2.9 | +3.3 | +5.1 | +2.0 | +4.9 | +9.8 | +1.6 | +3.7 |
| Minitron width | S500 | 2.44 | 13.68 | 8.11 | 8.15 | 8.36 | 66.9 | 45.8 | 75.9 | 73.9 | 79.9 | 94.8 | 71.6 | 32.2 | 68.5 |
| Minitron width | P200-R500 | 2.35 | 12.52 | 7.12 | 7.14 | 7.73 | 66.4 | 48.2 | 77.2 | 77.2 | 79.2 | 95.9 | 78.6 | 26.8 | 69.6 |
| Minitron width | ΔS500 | +0.09 | +1.16 | +0.99 | +1.01 | +0.63 | -0.5 | +2.4 | +1.3 | +3.3 | -0.7 | +1.1 | +7.0 | -5.4 | +1.1 |
| Minitron width | Meta-R500 | 2.34 | 12.35 | 6.99 | 7.01 | 7.65 | 72.1 | 50.6 | 79.8 | 78.1 | 80.1 | 95.7 | 79.5 | 33.0 | 72.1 |
| Minitron width | ΔS500 | +0.09 | +1.33 | +1.12 | +1.14 | +0.71 | +5.2 | +4.8 | +3.9 | +4.2 | +0.2 | +0.9 | +7.9 | +0.8 | +3.6 |
Table 13: Minitron width pruning across retraining token budgets (Llama-3.1-8B → 4B, 50% pruning). S*N*: train from scratch for *N*B tokens. P200-R*N*: pretrain 200B tokens on DCLM, prune, retrain *N*B tokens. Meta-R*N*: prune from Meta's released Llama-3.1-8B, retrain *N*B tokens. DCLM loss is validation cross-entropy on DCLM (lower is better). Avg is the mean of WinoG, ARC-C, ARC-E, HSwag, PIQA, SciQ, BoolQ, and OBQA. Each ΔS*N* row reports the difference between the pruned-model row directly above (P200-R*N* or Meta-R*N*) and the S*N* baseline.

| method | tokens | DCLM loss ↓ | C4 ppl ↓ | WT ppl ↓ | WT-2 ppl ↓ | CNN ppl ↓ | WinoG % ↑ | ARC-C % ↑ | ARC-E % ↑ | HSwag % ↑ | PIQA % ↑ | SciQ % ↑ | BoolQ % ↑ | OBQA % ↑ | Avg % ↑ |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| FLAP | S10 | 3.15 | 26.38 | 18.55 | 19.08 | 16.43 | 49.0 | 26.5 | 51.3 | 38.5 | 66.8 | 72.0 | 59.5 | 20.5 | 48.0 |
| FLAP | P200-R10 | 2.62 | 14.75 | 8.40 | 8.53 | 9.51 | 63.7 | 40.8 | 72.8 | 68.0 | 76.7 | 93.0 | 73.0 | 27.2 | 64.4 |
| FLAP | ΔS10 | +0.52 | +11.63 | +10.15 | +10.56 | +6.92 | +14.7 | +14.3 | +21.5 | +29.5 | +9.8 | +21.0 | +13.5 | +6.7 | +16.4 |
| FLAP | Meta-R10 | 2.58 | 14.40 | 8.15 | 8.28 | 9.25 | 66.0 | 43.0 | 74.5 | 70.5 | 77.8 | 93.8 | 74.5 | 30.0 | 66.3 |
| FLAP | ΔS10 | +0.57 | +11.98 | +10.40 | +10.80 | +7.18 | +17.0 | +16.5 | +23.2 | +32.0 | +11.0 | +21.8 | +15.0 | +9.5 | +18.3 |
| FLAP | S30 | 2.85 | 19.36 | 12.47 | 12.72 | 11.94 | 53.9 | 30.9 | 64.3 | 54.7 | 72.1 | 80.7 | 52.7 | 24.9 | 54.3 |
| FLAP | P200-R30 | 2.54 | 14.10 | 8.10 | 8.22 | 8.95 | 65.5 | 43.5 | 75.5 | 71.0 | 77.5 | 93.5 | 74.5 | 27.4 | 66.1 |
| FLAP | ΔS30 | +0.31 | +5.26 | +4.37 | +4.50 | +2.99 | +11.6 | +12.6 | +11.2 | +16.3 | +5.4 | +12.8 | +21.8 | +2.5 | +11.8 |
| FLAP | Meta-R30 | 2.51 | 13.85 | 7.95 | 8.02 | 8.85 | 67.8 | 45.8 | 75.8 | 72.5 | 78.3 | 93.8 | 74.9 | 31.0 | 67.5 |
| FLAP | ΔS30 | +0.34 | +5.51 | +4.52 | +4.70 | +3.09 | +13.9 | +14.9 | +11.5 | +17.8 | +6.2 | +13.1 | +22.2 | +6.1 | +13.2 |
| FLAP | S50 | 2.71 | 17.57 | 11.16 | 11.34 | 10.77 | 58.3 | 34.9 | 67.2 | 59.8 | 74.3 | 83.3 | 58.9 | 25.4 | 57.8 |
| FLAP | P200-R50 | 2.47 | 13.68 | 7.89 | 7.97 | 8.52 | 66.9 | 44.8 | 76.2 | 72.9 | 77.8 | 93.9 | 75.5 | 27.6 | 67.0 |
| FLAP | ΔS50 | +0.24 | +3.89 | +3.27 | +3.37 | +2.25 | +8.6 | +9.9 | +9.0 | +13.1 | +3.5 | +10.6 | +16.6 | +2.2 | +9.2 |
| FLAP | Meta-R50 | 2.47 | 13.43 | 7.69 | 7.75 | 8.39 | 69.2 | 46.3 | 76.5 | 74.6 | 78.8 | 93.8 | 75.3 | 31.6 | 68.3 |
| FLAP | ΔS50 | +0.24 | +4.14 | +3.47 | +3.59 | +2.38 | +10.9 | +11.4 | +9.3 | +14.8 | +4.5 | +10.5 | +16.4 | +6.2 | +10.5 |
| FLAP | S250 | 2.51 | 14.57 | 8.76 | 8.82 | 8.92 | 66.1 | 43.2 | 74.8 | 71.7 | 77.9 | 91.1 | 66.8 | 29.8 | 65.2 |
| FLAP | P200-R250 | 2.40 | 12.85 | 7.35 | 7.38 | 8.00 | 68.8 | 48.3 | 76.9 | 76.1 | 79.3 | 94.8 | 79.8 | 32.8 | 69.6 |
| FLAP | ΔS250 | +0.11 | +1.72 | +1.41 | +1.44 | +0.92 | +2.7 | +5.1 | +2.1 | +4.4 | +1.4 | +3.7 | +13.0 | +3.0 | +4.4 |
| FLAP | Meta-R250 | 2.39 | 12.80 | 7.30 | 7.33 | 7.93 | 69.9 | 48.5 | 77.2 | 76.4 | 79.8 | 93.9 | 75.4 | 34.2 | 69.4 |
| FLAP | ΔS250 | +0.12 | +1.77 | +1.46 | +1.49 | +0.99 | +3.8 | +5.3 | +2.4 | +4.7 | +1.9 | +2.8 | +8.6 | +4.4 | +4.2 |
| FLAP | S500 | 2.49 | 13.91 | 8.25 | 8.28 | 8.53 | 69.2 | 43.7 | 75.5 | 73.4 | 78.9 | 93.0 | 69.8 | 31.4 | 66.9 |
| FLAP | P200-R500 | 2.36 | 11.89 | 6.40 | 6.43 | 7.42 | 68.8 | 51.5 | 79.7 | 79.3 | 79.8 | 96.4 | 83.3 | 26.8 | 70.7 |
| FLAP | ΔS500 | +0.13 | +2.02 | +1.85 | +1.85 | +1.11 | -0.5 | +7.9 | +4.2 | +5.9 | +0.9 | +3.4 | +13.5 | -4.6 | +3.8 |
| FLAP | Meta-R500 | 2.35 | 11.75 | 6.28 | 6.31 | 7.35 | 72.3 | 52.0 | 80.3 | 80.0 | 80.4 | 95.8 | 79.5 | 36.5 | 72.1 |
| FLAP | ΔS500 | +0.14 | +2.16 | +1.97 | +1.97 | +1.18 | +3.1 | +8.3 | +4.8 | +6.6 | +1.5 | +2.8 | +9.7 | +5.1 | +5.2 |
Table 14: FLAP across retraining token budgets (Llama-3.1-8B → 4B, 50% pruning). S*N*: train from scratch for *N*B tokens. P200-R*N*: pretrain 200B tokens on DCLM, prune, retrain *N*B tokens. Meta-R*N*: prune from Meta's released Llama-3.1-8B, retrain *N*B tokens. DCLM loss is validation cross-entropy on DCLM (lower is better). Avg is the mean of WinoG, ARC-C, ARC-E, HSwag, PIQA, SciQ, BoolQ, and OBQA.

| method | tokens | DCLM loss ↓ | C4 ppl ↓ | WT ppl ↓ | WT-2 ppl ↓ | CNN ppl ↓ | WinoG % ↑ | ARC-C % ↑ | ARC-E % ↑ | HSwag % ↑ | PIQA % ↑ | SciQ % ↑ | BoolQ % ↑ | OBQA % ↑ | Avg % ↑ |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Sheared LLaMA | S10 | 3.10 | 26.30 | 18.46 | 18.98 | 16.31 | 50.7 | 25.9 | 50.6 | 38.4 | 67.1 | 77.8 | 59.8 | 19.0 | 48.7 |
| Sheared LLaMA | P200-R10 | 2.62 | 15.45 | 9.42 | 9.55 | 9.62 | 63.4 | 39.8 | 71.5 | 67.8 | 75.8 | 91.8 | 64.0 | 29.0 | 62.9 |
| Sheared LLaMA | ΔS10 | +0.48 | +10.85 | +9.04 | +9.43 | +6.69 | +12.7 | +13.9 | +20.9 | +29.4 | +8.7 | +14.0 | +4.2 | +10.0 | +14.2 |
| Sheared LLaMA | Meta-R10 | 2.58 | 14.95 | 8.95 | 9.05 | 9.30 | 65.5 | 42.5 | 73.6 | 70.2 | 77.0 | 93.2 | 67.0 | 31.0 | 65.0 |
| Sheared LLaMA | ΔS10 | +0.52 | +11.35 | +9.51 | +9.93 | +7.01 | +14.8 | +16.6 | +23.0 | +31.8 | +9.9 | +15.4 | +7.2 | +12.0 | +16.3 |
| Sheared LLaMA | S30 | 2.80 | 19.29 | 12.41 | 12.65 | 11.85 | 55.6 | 30.3 | 63.4 | 54.7 | 72.4 | 86.3 | 53.9 | 23.2 | 55.0 |
| Sheared LLaMA | P200-R30 | 2.55 | 14.70 | 8.85 | 8.96 | 9.10 | 64.8 | 42.0 | 73.6 | 70.0 | 76.4 | 92.8 | 66.2 | 32.0 | 64.7 |
| Sheared LLaMA | ΔS30 | +0.25 | +4.59 | +3.56 | +3.69 | +2.75 | +9.2 | +11.7 | +10.2 | +15.3 | +4.0 | +6.5 | +12.3 | +8.8 | +9.7 |
| Sheared LLaMA | Meta-R30 | 2.52 | 14.25 | 8.55 | 8.65 | 8.90 | 66.0 | 43.5 | 74.6 | 71.3 | 77.2 | 94.0 | 68.0 | 33.5 | 66.0 |
| Sheared LLaMA | ΔS30 | +0.28 | +5.04 | +3.86 | +4.00 | +2.95 | +10.4 | +13.2 | +11.2 | +16.6 | +4.8 | +7.7 | +14.1 | +10.3 | +11.0 |
| Sheared LLaMA | S50 | 2.67 | 17.51 | 11.10 | 11.28 | 10.69 | 60.0 | 34.2 | 66.1 | 59.9 | 74.6 | 89.0 | 59.6 | 23.8 | 58.4 |
| Sheared LLaMA | P200-R50 | 2.50 | 14.35 | 8.55 | 8.63 | 8.90 | 65.8 | 42.8 | 74.7 | 71.2 | 76.7 | 93.6 | 68.0 | 34.2 | 65.9 |
| Sheared LLaMA | ΔS50 | +0.17 | +3.16 | +2.55 | +2.65 | +1.79 | +5.8 | +8.6 | +8.6 | +11.3 | +2.1 | +4.6 | +8.4 | +10.4 | +7.5 |
| Sheared LLaMA | Meta-R50 | 2.47 | 13.95 | 8.25 | 8.32 | 8.70 | 67.0 | 44.6 | 75.5 | 72.5 | 77.8 | 94.3 | 69.0 | 35.0 | 67.0 |
| Sheared LLaMA | ΔS50 | +0.20 | +3.56 | +2.85 | +2.96 | +1.99 | +7.0 | +10.4 | +9.4 | +12.6 | +3.2 | +5.3 | +9.4 | +11.2 | +8.6 |
| Sheared LLaMA | S250 | 2.47 | 14.57 | 8.77 | 8.85 | 8.89 | 64.8 | 44.2 | 74.7 | 71.6 | 77.9 | 91.9 | 67.1 | 30.4 | 65.3 |
| Sheared LLaMA | P200-R250 | 2.42 | 13.35 | 7.78 | 7.85 | 8.35 | 67.5 | 45.8 | 76.4 | 74.2 | 79.4 | 94.6 | 70.0 | 35.5 | 67.9 |
| Sheared LLaMA | ΔS250 | +0.05 | +1.22 | +0.99 | +1.00 | +0.54 | +2.7 | +1.6 | +1.7 | +2.6 | +1.5 | +2.7 | +2.9 | +5.1 | +2.6 |
| Sheared LLaMA | Meta-R250 | 2.39 | 13.00 | 7.45 | 7.52 | 8.15 | 70.8 | 49.2 | 79.0 | 77.2 | 80.5 | 95.8 | 74.0 | 37.0 | 70.4 |
| Sheared LLaMA | ΔS250 | +0.08 | +1.57 | +1.32 | +1.33 | +0.74 | +6.0 | +5.0 | +4.3 | +5.6 | +2.6 | +3.9 | +6.9 | +6.6 | +5.1 |
| Sheared LLaMA | S500 | 2.45 | 13.91 | 8.26 | 8.30 | 8.50 | 67.5 | 44.6 | 75.4 | 73.3 | 79.1 | 94.2 | 70.2 | 31.9 | 67.0 |
| Sheared LLaMA | P200-R500 | 2.37 | 12.75 | 7.25 | 7.32 | 8.00 | 69.5 | 47.8 | 78.0 | 75.8 | 79.6 | 95.8 | 73.0 | 37.0 | 69.6 |
| Sheared LLaMA | ΔS500 | +0.08 | +1.16 | +1.01 | +0.98 | +0.50 | +2.0 | +3.2 | +2.6 | +2.5 | +0.5 | +1.6 | +2.8 | +5.1 | +2.6 |
| Sheared LLaMA | Meta-R500 | 2.34 | 12.35 | 6.95 | 7.02 | 7.75 | 72.0 | 51.0 | 80.0 | 78.2 | 79.8 | 96.2 | 76.0 | 38.5 | 71.5 |
| Sheared LLaMA | ΔS500 | +0.11 | +1.56 | +1.31 | +1.28 | +0.75 | +4.5 | +6.4 | +4.6 | +4.9 | +0.7 | +2.0 | +5.8 | +6.6 | +4.5 |
Table 15: Sheared LLaMA across retraining token budgets (Llama-3.1-8B → 4B, 50% pruning). S*N*: train from scratch for *N*B tokens. P200-R*N*: pretrain 200B tokens on DCLM, prune, retrain *N*B tokens. Meta-R*N*: prune from Meta's released Llama-3.1-8B, retrain *N*B tokens. DCLM loss is validation cross-entropy on DCLM (lower is better). Avg is the mean of WinoG, ARC-C, ARC-E, HSwag, PIQA, SciQ, BoolQ, and OBQA.

#### Per-benchmark observations.
Across the four structured methods, the average advantage of the P200 initialization stays positive at every budget, but individual benchmarks flip negative as the retraining budget grows: at 500B tokens, the delta reaches −2.0 on WinoG (Minitron-D), −5.4 on OBQA (Minitron-W), and −4.6 on OBQA (FLAP), even when other benchmarks (e.g. ARC-Challenge, HellaSwag) still favor pruning. BoolQ turns negative only for Minitron-D at 250B (−2.4). This suggests that, on individual benchmarks, raw data scale can eventually outweigh initialization quality. Sheared LLaMA is the only structured method with no negative delta on any benchmark across the full token range (ΔAvg ≥ +2.6 even at 500B), consistent with its joint-mask training-aware pruning objective. Meta-R*N* tends to outperform P200-R*N* at low budgets (e.g. Minitron-D ΔS10 = +12.1 vs. +10.9), with the gap closing as *N* grows, a residual signal of the original Meta pretraining mixture beyond what DCLM-only P200 retraining recovers.
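The narrowing of the initialization advantage with more retraining tokens can be read directly off the Avg column. The sketch below recomputes ΔAvg for Minitron-D at each budget for both parents; the values are copied from Table 12, and the variable names are illustrative.

```python
budgets = [10, 30, 50, 250, 500]  # retraining tokens, in billions

# Avg accuracy (%) for Minitron depth pruning, copied from Table 12.
scratch = [49.8, 55.5, 60.7, 66.0, 67.3]  # S10 ... S500
p200 = [60.7, 62.6, 64.4, 67.0, 68.1]     # P200-R10 ... P200-R500
meta = [61.9, 64.1, 63.4, 68.0, 69.2]     # Meta-R10 ... Meta-R500

delta_p200 = [round(p - s, 1) for p, s in zip(p200, scratch)]
delta_meta = [round(m - s, 1) for m, s in zip(meta, scratch)]

for n, dp, dm in zip(budgets, delta_p200, delta_meta):
    print(f"{n:>3}B tokens: P200 {dp:+.1f}, Meta {dm:+.1f}")


def shrinks(deltas):
    """True if the advantage never grows from one budget to the next."""
    return all(a >= b for a, b in zip(deltas, deltas[1:]))


print(shrinks(delta_p200), shrinks(delta_meta))  # True True
```

For both parents the average advantage shrinks at every step but remains positive at 500B tokens.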
### H.4 Pruning vs. scratch under equal total token budget
Table [16](https://arxiv.org/html/2606.14150#A8.T16) reports the complete per-benchmark breakdown underlying Figure [5](https://arxiv.org/html/2606.14150#S4.F5), comparing P200-R50 against S50 (equal training token budget) and S250 (equal total token budget) across all six pruning methods (with both 2:4 and unstructured variants for Wanda and SparseGPT) at the 50% pruning ratio. Under an equal total token budget (P200-R50 vs. S250), sparse methods retain a consistent advantage from the pruned initialization, whereas structured methods are largely matched or surpassed by extended scratch training.
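The two token-matched comparisons differ only in which tokens are counted. Below is a minimal sketch of that accounting for the P200-R50 pipeline (200B pretraining tokens for the parent plus 50B retraining tokens); the function name `matched_scratch_budgets` is hypothetical.

```python
def matched_scratch_budgets(pretrain_b, retrain_b):
    """Return scratch-training budgets (in billions of tokens) for both settings."""
    return {
        # Equal training budget: count only the tokens spent on the small model.
        "equal_training": retrain_b,
        # Equal total budget: also count the tokens spent pretraining the parent.
        "equal_total": pretrain_b + retrain_b,
    }


budgets = matched_scratch_budgets(pretrain_b=200, retrain_b=50)
print(budgets)  # {'equal_training': 50, 'equal_total': 250}
```

These two budgets correspond to the S50 and S250 baselines, respectively.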
| method | tokens | DCLM loss ↓ | C4 ppl ↓ | WT ppl ↓ | WT-2 ppl ↓ | CNN ppl ↓ | WinoG % ↑ | ARC-C % ↑ | ARC-E % ↑ | HSwag % ↑ | PIQA % ↑ | SciQ % ↑ | BoolQ % ↑ | OBQA % ↑ | Avg % ↑ |
| --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| Minitron depth | P200-R50 | 2.54 | 15.34 | 9.30 | 9.41 | 9.46 | 64.4 | 40.7 | 71.5 | 68.0 | 76.8 | 91.7 | 64.0 | 38.0 | 64.4 |
| Minitron depth | S50 | 2.64 | 16.90 | 10.61 | 10.77 | 10.35 | 56.8 | 36.2 | 70.1 | 61.9 | 74.4 | 90.4 | 57.8 | 37.8 | 60.7 |
| Minitron depth | ΔP200-R50 | -0.10 | -1.56 | -1.31 | -1.36 | -0.89 | -7.6 | -4.5 | -1.4 | -6.1 | -2.4 | -1.3 | -6.2 | -0.2 | -3.7 |
| Minitron depth | S250 | 2.49 | 14.79 | 8.94 | 9.01 | 9.03 | 65.0 | 41.6 | 73.7 | 70.6 | 77.9 | 93.1 | 65.4 | 42.0 | 66.2 |
| Minitron depth | ΔP200-R50 | +0.05 | +0.55 | +0.36 | +0.39 | +0.43 | +0.6 | +0.9 | +2.2 | +2.6 | +1.1 | +1.4 | +1.3 | +4.0 | +1.8 |
| Minitron width | P200-R50 | 2.47 | 14.60 | 8.93 | 9.01 | 8.92 | 65.4 | 41.8 | 73.4 | 70.5 | 77.4 | 93.8 | 67.4 | 40.4 | 66.3 |
| Minitron width | S50 | 2.70 | 17.47 | 10.99 | 11.17 | 10.65 | 61.3 | 33.6 | 65.1 | 60.3 | 74.7 | 89.6 | 60.8 | 33.8 | 59.9 |
| Minitron width | ΔP200-R50 | -0.23 | -2.87 | -2.05 | -2.16 | -1.73 | -4.1 | -8.2 | -8.2 | -10.2 | -2.7 | -4.2 | -6.6 | -6.6 | -6.4 |
| Minitron width | S250 | 2.46 | 14.37 | 8.62 | 8.69 | 8.75 | 65.8 | 46.2 | 75.6 | 72.2 | 77.9 | 91.2 | 68.4 | 42.6 | 67.5 |
| Minitron width | ΔP200-R50 | 0.00 | +0.23 | +0.32 | +0.33 | +0.17 | +0.4 | +4.4 | +2.2 | +1.7 | +0.5 | -2.6 | +1.0 | +2.2 | +1.2 |
| FLAP | P200-R50 | 2.47 | 13.64 | 7.90 | 7.97 | 8.55 | 65.9 | 43.5 | 75.8 | 72.8 | 76.9 | 93.9 | 75.5 | 27.6 | 66.5 |
| FLAP | S50 | 2.71 | 17.57 | 11.16 | 11.34 | 10.77 | 58.3 | 34.9 | 67.2 | 59.8 | 73.8 | 83.3 | 58.9 | 25.4 | 57.7 |
| FLAP | ΔP200-R50 | -0.24 | -3.93 | -3.26 | -3.37 | -2.22 | -7.6 | -8.6 | -8.6 | -13.0 | -3.1 | -10.6 | -16.6 | -2.2 | -8.8 |
| FLAP | S250 | 2.51 | 14.57 | 8.76 | 8.82 | 8.92 | 66.1 | 43.2 | 74.8 | 71.7 | 77.4 | 91.1 | 66.8 | 29.8 | 65.1 |
| FLAP | ΔP200-R50 | -0.04 | -0.93 | -0.86 | -0.85 | -0.37 | +0.2 | -0.3 | -1.0 | -1.1 | +0.5 | -2.8 | -8.7 | +2.2 | -1.4 |
| Sheared LLaMA | P200-R50 | 2.50 | 14.35 | 8.55 | 8.63 | 8.90 | 65.8 | 42.8 | 74.7 | 71.2 | 76.7 | 93.6 | 68.0 | 34.2 | 65.9 |
| Sheared LLaMA | S50 | 2.67 | 17.51 | 11.10 | 11.28 | 10.69 | 60.0 | 34.2 | 66.1 | 59.9 | 74.6 | 89.0 | 59.6 | 23.8 | 58.4 |
| Sheared LLaMA | ΔP200-R50 | -0.17 | -3.16 | -2.55 | -2.65 | -1.79 | -5.8 | -8.6 | -8.6 | -11.3 | -2.1 | -4.6 | -8.4 | -10.4 | -7.5 |
| Sheared LLaMA | S250 | 2.47 | 14.57 | 8.77 | 8.85 | 8.89 | 64.8 | 44.2 | 74.7 | 71.6 | 77.9 | 91.9 | 67.1 | 30.4 | 65.3 |
| Sheared LLaMA | ΔP200-R50 | +0.03 | -0.22 | -0.22 | -0.22 | +0.01 | -1.0 | +1.4 | 0.0 | +0.4 | +1.2 | -1.7 | -0.9 | -3.8 | -0.6 |
| Wanda-2:4 | P200-R50 | 2.49 | 14.71 | 9.07 | 9.15 | 8.96 | 66.1 | 43.6 | 74.5 | 70.5 | 78.8 | 93.5 | 66.7 | 40.2 | 66.7 |
| Wanda-2:4 | S50 | 2.63 | 16.68 | 10.40 | 10.55 | 10.23 | 59.2 | 36.8 | 70.3 | 62.8 | 75.4 | 90.4 | 60.4 | 37.0 | 61.5 |
| Wanda-2:4 | ΔP200-R50 | -0.14 | -1.97 | -1.33 | -1.40 | -1.27 | -6.9 | -6.8 | -4.2 | -7.7 | -3.4 | -3.1 | -6.2 | -3.2 | -5.2 |
| Wanda-2:4 | S250 | 2.45 | 14.33 | 8.54 | 8.60 | 8.74 | 64.2 | 42.4 | 74.4 | 70.3 | 77.1 | 92.3 | 68.5 | 40.4 | 66.2 |
| Wanda-2:4 | ΔP200-R50 | +0.04 | +0.39 | +0.53 | +0.54 | +0.22 | -1.8 | -1.2 | -0.1 | -0.2 | -1.7 | -1.2 | +1.8 | +0.2 | -0.5 |
| Wanda-U | P200-R50 | 2.43 | 13.94 | 8.31 | 8.37 | 8.47 | 67.7 | 46.7 | 75.8 | 73.1 | 79.0 | 94.9 | 66.7 | 40.8 | 68.1 |
| Wanda-U | S50 | 2.63 | 16.72 | 10.51 | 10.66 | 10.26 | 60.1 | 36.3 | 69.5 | 62.9 | 75.9 | 90.5 | 60.0 | 37.2 | 61.6 |
| Wanda-U | ΔP200-R50 | -0.21 | -2.78 | -2.21 | -2.30 | -1.80 | -7.6 | -10.3 | -6.3 | -10.2 | -3.1 | -4.4 | -6.7 | -3.6 | -6.5 |
| Wanda-U | S250 | 2.45 | 14.29 | 8.51 | 8.57 | 8.73 | 64.7 | 43.6 | 75.5 | 72.1 | 78.8 | 94.2 | 67.1 | 41.6 | 67.2 |
| Wanda-U | ΔP200-R50 | -0.02 | -0.35 | -0.20 | -0.20 | -0.26 | -3.0 | -3.1 | -0.3 | -1.0 | -0.2 | -0.7 | +0.4 | +0.8 | -0.9 |
| SparseGPT-2:4 | P200-R50 | 2.48 | 14.60 | 8.83 | 8.90 | 8.90 | 66.5 | 42.3 | 73.2 | 70.6 | 78.1 | 93.7 | 67.7 | 41.6 | 66.7 |
| SparseGPT-2:4 | S50 | 2.63 | 16.72 | 10.40 | 10.54 | 10.25 | 62.2 | 40.4 | 70.2 | 62.7 | 75.4 | 90.6 | 58.6 | 36.6 | 62.1 |
| SparseGPT-2:4 | ΔP200-R50 | -0.15 | -2.12 | -1.57 | -1.64 | -1.35 | -4.3 | -2.0 | -2.9 | -8.0 | -2.7 | -3.1 | -9.1 | -5.0 | -4.6 |
| SparseGPT-2:4 | S250 | 2.45 | 14.32 | 8.59 | 8.65 | 8.72 | 67.1 | 42.0 | 73.5 | 71.1 | 78.9 | 95.0 | 68.8 | 40.0 | 67.1 |
| SparseGPT-2:4 | ΔP200-R50 | +0.03 | +0.28 | +0.24 | +0.25 | +0.18 | +0.6 | -0.3 | +0.3 | +0.5 | +0.8 | +1.3 | +1.1 | -1.6 | +0.4 |
| SparseGPT-U | P200-R50 | 2.44 | 14.06 | 8.39 | 8.44 | 8.56 | 68.3 | 45.0 | 75.3 | 72.3 | 78.9 | 94.1 | 65.9 | 40.0 | 67.5 |
| SparseGPT-U | S50 | 2.63 | 16.65 | 10.31 | 10.47 | 10.22 | 60.4 | 36.9 | 69.3 | 62.5 | 75.0 | 90.6 | 62.5 | 38.2 | 61.9 |
| SparseGPT-U | ΔP200-R50 | -0.19 | -2.59 | -1.93 | -2.02 | -1.66 | -7.9 | -8.1 | -6.0 | -9.8 | -4.0 | -3.5 | -3.4 | -1.8 | -5.5 |
| SparseGPT-U | S250 | 2.45 | 14.31 | 8.53 | 8.59 | 8.74 | 67.0 | 43.2 | 73.7 | 71.0 | 78.2 | 93.1 | 62.4 | 41.6 | 66.3 |
| SparseGPT-U | ΔP200-R50 | -0.01 | -0.26 | -0.15 | -0.15 | -0.18 | -1.3 | -1.8 | -1.6 | -1.3 | -0.8 | -1.0 | -3.5 | +1.6 | -1.2 |
Table 16: Pruning-retraining versus training from scratch at 50% pruning ratio. P200-R50: pretrain 200B tokens, prune, retrain 50B tokens. S50: train from scratch for 50B tokens (equal training token budget). S250: train from scratch for 250B tokens (equal total token budget). ΔP200-R50: difference relative to P200-R50, signed so that negative values mean the scratch model in the row above is worse than P200-R50 for every metric. Red: P200-R50 is better; blue: P200-R50 is worse; grey: within evaluation standard deviation (see Appendix [E](https://arxiv.org/html/2606.14150#A5)).

#### Per-benchmark observations.
FLAP exhibits the strongest BoolQ initialization signal in the table: P200-R50 leads S50 by 16.6 points on BoolQ (shown as −16.6 under Table 16's sign convention), and it is the only benchmark/method combination where the residual advantage over S250 exceeds 8 points (8.7). FLAP also holds the best perplexity numbers in the table (C4 13.64, WT 7.90, WT-2 7.97), surpassing every sparse method despite being structured, a notable caveat to the headline "sparse > structured at full data" framing. On OBQA at S250, Sheared LLaMA retains a 3.8-point lead (the largest residual structured advantage), while Wanda-U and SparseGPT-U actually trail S250. Finally, the sparse-method advantage at full data is small and benchmark-dependent rather than uniform: Wanda-2:4, Wanda-U, and SparseGPT-U post slightly negative average deltas at S250 (down to −1.2), which under the table's sign convention means pruning keeps an average lead of at most 1.2 points, while SparseGPT-2:4 is at +0.4 and individual cells (e.g. BoolQ for Wanda-2:4 at +1.8) favor scratch training. Readers should treat the sparse advantage under an equal total token budget as a per-benchmark rather than a wholesale claim.