Apertus LLM Family Expansion via Distillation and Quantization
Summary
This paper validates distillation and quantization as cost-effective methods to expand the Apertus LLM family to new sizes and hardware formats, producing Apertus-v1.1 models with up to 4B parameters trained on 1.7T tokens.
# Apertus LLM Family Expansion via Distillation and Quantization

Source: [https://arxiv.org/html/2605.29128](https://arxiv.org/html/2605.29128)

###### Abstract

The wide adoption of LLMs has led to their use in a great variety of applications and scenarios, such as chatbot assistants and data annotation, creating the need for models that satisfy specific budget and hardware constraints. As a result, LLMs are now commonly released in batches of similar models of various sizes, so that the family covers as wide a range of constraints as possible. In this paper, we validate distillation and quantization as a cost-effective way to expand model families to new sizes and hardware formats. Based on the open-recipe Apertus 8B LLM, we produce Apertus-v1.1, a distilled family of models with up to 4B parameters trained on 1.7T permissively licensed tokens. We demonstrate the cost-efficiency and strong accuracy of our approach for covering large ranges of hardware and systems requirements.

Machine Learning, ICML

## 1 Background

The popularity and versatility of Large Language Models (LLMs) have introduced a wide spectrum of budget, memory, and hardware constraints for their deployment. To accommodate these varying requirements, it has become crucial to provide LLMs in multiple sizes and formats. Releasing a family of models allows practitioners to select the optimal trade-off between computational cost and predictive performance for their specific deployment scenarios, democratizing access to advanced AI capabilities across different hardware tiers.

However, training an entire family of models from scratch requires prohibitive amounts of compute. Knowledge Distillation (KD) in the pre-training phase, or Pre-training Distillation (PD), offers a powerful solution to dramatically cut these costs (Peng et al., [2024](https://arxiv.org/html/2605.29128#bib.bib1)).
When knowledge is transferred from a large, capable teacher model to a smaller student model through the teacher's logits, the student benefits from richer information and implicit label smoothing. This allows the student to converge faster and achieve higher downstream performance with significantly fewer training tokens and compute resources. Consequently, pre-training distillation enables the cost-effective expansion of a model family without the computational burden of standard pre-training.

An orthogonal direction for addressing cost requirements (e.g., disk space or latency) is quantization. While reducing numerical precision significantly lowers the memory footprint and accelerates inference, it inherently introduces a cost-accuracy trade-off. As we show here, by carefully balancing this trade-off around the Pareto frontier of compression methods, practitioners gain finer control over the model's performance and hardware profile. This fine-grained control allows for further expansion of the model family, bridging the gaps between pre-trained sizes at a cost significantly lower than even that of pre-training distillation.

Our work builds upon the foundation of the Apertus (Apertus et al., [2025](https://arxiv.org/html/2605.29128#bib.bib16)) project, which sets a new standard for fully open and compliant LLMs. Unlike many open-weight models that withhold training data and pipelines, the Apertus recipe emphasizes complete transparency, data compliance, and global multilingual representation. By grounding our distillation and quantization pipeline in the Apertus ecosystem, we inherit its rigorous openness and reproducibility.
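To make the memory side of this trade-off concrete, the sketch below estimates weight storage from a parameter count and a bit-width. The parameter counts are the storage sizes reported in Table 1; the function name `weight_footprint_gb` is ours, and the estimate ignores quantization scales, zero-points and runtime buffers, so real checkpoints will be somewhat larger.

```python
# Back-of-the-envelope weight storage: parameters * bits / 8 bytes.
# Assumption: no overhead for quantization scales, zero-points or activations.

STORAGE_PARAMS = {  # storage parameter counts from Table 1
    "Apertus-v1.1-0.5B": 0.4e9,
    "Apertus-v1.1-1.5B": 2.0e9,
    "Apertus-v1.1-4B": 4.6e9,
    "Apertus-8B": 9.1e9,
}


def weight_footprint_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for name, params in STORAGE_PARAMS.items():
        row = ", ".join(
            f"{bits}-bit: {weight_footprint_gb(params, bits):.2f} GB"
            for bits in (16, 8, 4)
        )
        print(f"{name}: {row}")
```

Under these assumptions the 4B model at 4 bits (about 2.3 GB) needs less space than the 1.5B model at 16 bits (about 4.0 GB), which is the kind of overlap between model sizes that quantization makes available.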
Table 1: Model architecture overview.

| Model | Layers | Dim | MLP Dim | Heads (Q/KV) | Dim/Layers | Tied Emb. | Model size (compute) | Model size (storage) |
|---|---|---|---|---|---|---|---|---|
| Apertus-v1.1-0.5B | 20 | 1024 | 6144 | 16/4 | 51.2 | Yes | 0.4B | 0.4B |
| Apertus-v1.1-1.5B | 16 | 2048 | 12288 | 32/8 | 128 | No | 1.5B | 2.0B |
| Apertus-v1.1-4B | 24 | 3072 | 16384 | 24/8 | 128 | No | 3.8B | 4.6B |
| Apertus-8B | 32 | 4096 | 21504 | 32/8 | 128 | No | 8.1B | 9.1B |

## 2 Pre-Training Distillation

### 2.1 Recipe

#### Data.

To produce the highest-quality models, we gathered the data corresponding to Phase 5 (the final phase) of the original Apertus pre-training, which consists of documents, code, and instruction samples with the highest level of quality filtering, for a total yield of approximately 1.7T tokens. Similar to Apertus, we cut and pack these documents into chunks of 4096 tokens and train with cross-document attention masked.

#### Logits generation.

To be able to efficiently re-use the logits for multiple models, we generated the logits for the entire training set in advance. We ran the collected documents through the Apertus-8B-2509 model to obtain ≈131k logits per token. After calculating the probability distributions from these logits, we kept the 256 highest probabilities per token. These probabilities, along with the corresponding token indices in the model vocabulary, were represented in 32-bit precision for a total of ≈2KB of data per token. The tensors were batched in groups of ≈131k tokens, compressed with gzip and stored in long-term storage for a total footprint of ≈1.5PB of disk space. We applied the sequence permutation at the logits generation stage, so that using the logits for training later requires only sequential disk loads.

#### Training objective.

Following Peng et al. ([2024](https://arxiv.org/html/2605.29128#bib.bib1)), who showed this to perform well, we use a 90%/10% mix between the KL-divergence and the label cross-entropy.
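A minimal sketch of the two steps above, in plain Python on a toy vocabulary: keeping the teacher's top-k probabilities per token, then mixing a sparse KL term with the label cross-entropy at 90%/10%. The function names are ours, and two details are assumptions that the text does not pin down: the kept probabilities are renormalized to sum to one, and the KL term is evaluated only on the teacher's top-k support. The released training code may differ on both points.

```python
import math

TOP_K = 256
# 256 probabilities + 256 token indices at 4 bytes each (32-bit) = 2048 bytes per token.
BYTES_PER_TOKEN = TOP_K * (4 + 4)


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def top_k_teacher_targets(teacher_logits, k=TOP_K):
    """Return (indices, probs) of the k most likely tokens.

    Assumption: probabilities are renormalized over the kept tokens.
    """
    log_p = log_softmax(teacher_logits)
    idx = sorted(range(len(log_p)), key=lambda i: log_p[i], reverse=True)[:k]
    probs = [math.exp(log_p[i]) for i in idx]
    total = sum(probs)
    return idx, [p / total for p in probs]


def distillation_loss(student_logits, topk_idx, topk_probs, label, kl_weight=0.9):
    """kl_weight * sparse KL(teacher || student) + (1 - kl_weight) * cross-entropy."""
    log_q = log_softmax(student_logits)  # also needed by plain cross-entropy
    # Only the k stored entries are visited for the KL term.
    kl = sum(
        p * (math.log(p) - log_q[i])
        for i, p in zip(topk_idx, topk_probs)
        if p > 0.0
    )
    ce = -log_q[label]
    return kl_weight * kl + (1.0 - kl_weight) * ce
```

When the student reproduces the teacher distribution exactly and k covers the whole vocabulary, the KL term vanishes and only the 10% cross-entropy share remains.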
Because the KL-divergence is computed only over the stored top-256 entries, it is sparse and introduces close to no computational or memory overhead relative to the basic cross-entropy calculation.

#### Model Architecture.

Apertus-v1.1 models follow the same architecture as Apertus: dense transformer models with grouped-query attention and the xIELU (Huang and Schlag, [2025](https://arxiv.org/html/2605.29128#bib.bib3)) activation in the MLP. Table [1](https://arxiv.org/html/2605.29128#S1.T1) details the architectural configurations, parameter counts, and the resulting memory and computational footprints for the Apertus-v1.1 models. Notably, we used tied embeddings and a thinner and deeper architecture for the smallest Apertus-v1.1 model to maximize performance while minimizing memory footprint (Liu et al., [2024](https://arxiv.org/html/2605.29128#bib.bib2)).

Figure 1: Training loss curves of Apertus-v1.1 models. The dashed line shows the loss of the teacher model (Apertus-8B-2509).

Figure 2: Multilingual performance macro average during pre-training of Apertus-v1.1 models and for a number of similar-sized models. Distillation allows Apertus-v1.1 models to achieve competitive performance with up to an order of magnitude less training compute.

#### Training dynamics.

Similar to Apertus, we use the AdEMAMix (Pagliardini et al., [2025](https://arxiv.org/html/2605.29128#bib.bib4)) optimizer with a warmup-stable-decay (WSD) schedule and weight decay. The next-token prediction (NTP) loss is shown in Figure [1](https://arxiv.org/html/2605.29128#S2.F1), and the multilingual macro downstream average in Figure [2](https://arxiv.org/html/2605.29128#S2.F2). We observed no training instabilities and consistent improvement in downstream performance, especially during the learning rate annealing stage (highlighted in gray).

#### SFT and alignment.

The supervised fine-tuning (SFT) stage followed immediately after pre-training.
For SFT, we reused the original Apertus SFT recipe exactly, adjusting only the learning rate (LR) to match the post-annealing LR of the Apertus-v1.1 models. For the subsequent alignment stage, we used a simplified DPO (Rafailov et al., [2024](https://arxiv.org/html/2605.29128#bib.bib5)) setup.

#### Evaluations.

Following the Apertus evaluation setup, we report the multilingual benchmark average during training in Figure [2](https://arxiv.org/html/2605.29128#S2.F2), selected final pre-training metrics in Table [3](https://arxiv.org/html/2605.29128#S2.T3), multilingual post-training evaluations in Table [4](https://arxiv.org/html/2605.29128#S2.T4) and broader post-training evaluations in Appendix [B](https://arxiv.org/html/2605.29128#A2). Unsurprisingly, the performance profile of Apertus-v1.1 models is extremely similar to that of Apertus-8B-2509: strong multilingual performance for the base models and good multilingual chat performance, but weaker results in certain capabilities such as instruction following and math.

### 2.2 Cost Analysis

Table 2: Cost for small LLM pre-training and distillation. Apertus-v1.1 is 2-10x cheaper than competing small LLM pre-training pipelines.

| Stage | Model | Tokens | FLOPs |
|---|---|---|---|
| Original pre-training | Apertus-8B | 15T | 3.7E23 |
| Logits generation | from Apertus-8B | 1.7T | 1.4E22 |
| Pre-training | Apertus-v1.1 0.5B | 1.7T | 0.2E22 |
| Pre-training | Apertus-v1.1 1.5B | 1.7T | 0.8E22 |
| Pre-training | Apertus-v1.1 4B | 1.7T | 2.0E22 |
| Pre-training | Qwen3-0.6B | 36T | 6.5E22 |
| Pre-training | EuroLLM-1.7B | 4T | 1.7E22 |
| Pre-training | SmolLM2-1.7B | 11T | 5.6E22 |
| Pre-training | SmolLM3-3B | 11T | 9.9E22 |

As seen from Table [2](https://arxiv.org/html/2605.29128#S2.T2), Apertus-v1.1 models used significantly less compute than similar-sized models, being trained on just 1.7T tokens, in contrast to the 15T tokens of Apertus.
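As a quick check of these figures, the snippet below sums the Table 2 entries for the Apertus-v1.1 family (the one-off logits generation plus the three distillation runs) and compares the result with the other rows. The dictionary keys and function names are our own labels; the numbers are copied from Table 2.

```python
# FLOPs copied from Table 2.
APERTUS_8B_PRETRAINING = 3.7e23
LOGITS_GENERATION = 1.4e22  # teacher forward passes over the 1.7T-token set
DISTILLATION_RUNS = {
    "Apertus-v1.1 0.5B": 0.2e22,
    "Apertus-v1.1 1.5B": 0.8e22,
    "Apertus-v1.1 4B": 2.0e22,
}
BASELINES = {
    "Qwen3-0.6B": 6.5e22,
    "EuroLLM-1.7B": 1.7e22,
    "SmolLM2-1.7B": 5.6e22,
    "SmolLM3-3B": 9.9e22,
}


def family_total_flops():
    """Teacher logits are generated once and shared by every student."""
    return LOGITS_GENERATION + sum(DISTILLATION_RUNS.values())


def teacher_cost_per_model():
    """Logits generation cost amortized over the distilled models."""
    return LOGITS_GENERATION / len(DISTILLATION_RUNS)


if __name__ == "__main__":
    total = family_total_flops()
    print(f"family total: {total:.2e} FLOPs")
    print(f"share of Apertus-8B pre-training: {total / APERTUS_8B_PRETRAINING:.1%}")
    for name, flops in BASELINES.items():
        print(f"{name}: {flops / total:.2f}x the family total")
```

Amortizing the teacher cost this way shows why adding one more student to the family is cheap: the logits are already on disk, so only the student's own training run is paid for.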
The cost of producing the logits from the 8B model is relatively small for two reasons: generating logits requires only the forward pass, and the same logits are computed once and shared by the entire family of distilled models, which dramatically cuts the teacher cost per model. The total compute cost of the entire Apertus-v1.1 model family, including logits generation, is 4.4E22 FLOPs (1.4E22 + 0.2E22 + 0.8E22 + 2.0E22 in Table [2](https://arxiv.org/html/2605.29128#S2.T2)). This is less than, for example, the cost of standalone pre-training for SmolLM2-1.7B and less than 12% of the original Apertus 8B pre-training cost.

Table 3: Base model evaluations.

| Model | Avg | ARC | HellaSwag | WinoGrande | XNLI | XCOPA | PIQA |
|---|---|---|---|---|---|---|---|
| Apertus-v1.1-0.5B | 51.79 | 44.96 | 40.42 | 57.06 | 41.51 | 55.49 | 71.27 |
| Apertus-v1.1-1.5B | 56.66 | 52.66 | 48.31 | 61.72 | 42.94 | 59.76 | 74.54 |
| Apertus-v1.1-4B | 61.53 | 61.15 | 53.51 | 67.48 | 45.03 | 63.82 | 78.18 |
| Apertus-8B | 64.96 | 71.66 | 59.62 | 69.30 | 44.09 | 65.69 | 79.38 |
| EuroLLM-1.7B | 54.03 | 50.80 | 45.01 | 59.51 | 40.88 | 55.76 | 72.20 |
| SmolLM2-1.7B | 58.00 | 60.23 | 53.38 | 66.22 | 37.57 | 53.51 | 77.10 |
| SmolLM3-3B-Base | 60.88 | 64.45 | 56.37 | 68.43 | 40.28 | 58.02 | 77.75 |
| Qwen3-0.6B-Base | 52.23 | 48.35 | 41.01 | 59.20 | 39.55 | 54.96 | 70.29 |
| Qwen3-1.7B-Base | 57.51 | 56.49 | 49.36 | 63.38 | 41.66 | 58.35 | 75.79 |
| Qwen3-4B-Base | 62.14 | 64.99 | 54.56 | 70.48 | 43.00 | 61.82 | 77.97 |

Table 4: Multilingual evaluations for instruction-tuned models.
Each benchmark here is the multilingual version thereof (see Appendix [B](https://arxiv.org/html/2605.29128#A2)).

| Model | Average | MMLU | TruthfulQA | Arc | IF | LogiQA |
|---|---|---|---|---|---|---|
| Apertus-v1.1-0.5B-Instruct | 0.318 | 0.258 | 0.461 | 0.225 | 0.328 | 0.279 |
| Apertus-v1.1-1.5B-Instruct | 0.382 | 0.377 | 0.451 | 0.266 | 0.434 | 0.276 |
| Apertus-v1.1-4B-Instruct | 0.473 | 0.504 | 0.506 | 0.332 | 0.550 | 0.296 |
| Apertus-8B-Instruct-2509 | 0.534 | 0.553 | 0.524 | 0.368 | 0.689 | 0.290 |
| EuroLLM-1.7B-Instruct | 0.291 | 0.260 | 0.433 | 0.250 | 0.222 | 0.269 |
| EuroLLM-9B-Instruct | 0.480 | 0.520 | 0.465 | 0.322 | 0.613 | 0.345 |
| gemma-3-270m-it | 0.289 | 0.242 | 0.465 | 0.215 | 0.236 | 0.205 |
| gemma-3-1b-it | 0.406 | 0.409 | 0.457 | 0.250 | 0.509 | 0.379 |
| gemma-3-4b-it | 0.497 | 0.547 | 0.492 | 0.316 | 0.635 | 0.411 |
| SmolLM2-1.7B-Instruct | 0.348 | 0.365 | 0.452 | 0.213 | 0.364 | 0.246 |
| SmolLM3-3B | 0.479 | 0.507 | 0.500 | 0.270 | 0.637 | 0.365 |
| Qwen3-0.6B | 0.401 | 0.377 | 0.464 | 0.222 | 0.541 | 0.353 |
| Qwen3-1.7B | 0.457 | 0.477 | 0.490 | 0.251 | 0.611 | 0.414 |
| Qwen3-4B | 0.521 | 0.581 | 0.497 | 0.274 | 0.733 | 0.500 |

## 3 Quantization

While pre-training distillation successfully generated the core Apertus-v1.1 models at a fraction of the cost, adapting these models for highly constrained environments requires further optimization for specific hardware profiles. In this section, we consider the problem of adapting Apertus-v1.1 models to NVIDIA GPUs and mobile devices, demonstrating how quantization yields a wider range of optimal, specialized models at close to no cost.

Figure 3: Visualization of the cost-accuracy trade-off for Apertus and Apertus-v1.1 models. Base models (left) are compared based on validation loss while instruction-tuned models (right) are compared based on downstream performance. Quantized models both optimize the trade-off and add intermediate points to the Pareto fronts.

### 3.1 Apertus-v1.1 Quantization Recipe

Figure 4: Apertus-v1.1 quantization recipe ablation.

#### Baseline.
We use GPTQ (Frantar et al., [2023](https://arxiv.org/html/2605.29128#bib.bib7)), the most widely used 1-shot LLM quantization method, as our baseline. We gauge our improvement over it differently for base and instruction-tuned models:

- For base models, we measure the loss increase over the corresponding unquantized models on a validation set of ≈17M tokens from the original pre-training mixture (Apertus Phase 5 data). We test *weight+activation* (FP8, NVFP4) quantization for base models with a focus on NVIDIA Blackwell GPUs, as we foresee their main usage in high-throughput scenarios such as data annotation and embedding.
- For instruction-tuned models, we measure the recovery of the macro average over normalized few-shot accuracies on Arc (Clark et al., [2018](https://arxiv.org/html/2605.29128#bib.bib12)), HellaSwag (Zellers et al., [2019](https://arxiv.org/html/2605.29128#bib.bib13)), MMLU (Hendrycks et al., [2021](https://arxiv.org/html/2605.29128#bib.bib14)) and WinoGrande (Sakaguchi et al., [2019](https://arxiv.org/html/2605.29128#bib.bib15)). We test *weight-only quantization* (INT2, INT3, INT4, INT6) for instruction-tuned models with a focus on inference on Apple devices (MLX), as we foresee their main usage in memory-limited scenarios such as mobile and edge deployment.

#### Quantization-aware distillation (QAD).

QAD is applied as a short recovery stage on a fully trained model: the entire model is treated as trainable, its weights are quantized on every forward pass, and they are updated with a standard gradient-based method via straight-through estimation (Bengio et al., [2013](https://arxiv.org/html/2605.29128#bib.bib11)). This bridges the gap between full quantization-aware training and post-training quantization (PTQ) methods.
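The straight-through estimator can be illustrated with a toy, single-neuron version of such a recovery step. This is a sketch under simplifying assumptions, not the implementation used for the released models: it uses symmetric round-to-nearest quantization with one shared scale, and a squared error against a teacher output stands in for the distillation loss on logits. `fake_quantize` and `qad_step` are hypothetical names.

```python
def fake_quantize(weights, bits=4):
    """Symmetric round-to-nearest quantization with one shared scale.

    Assumes bits >= 2; the single per-tensor scale is a simplification.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax or 1.0  # avoid a zero scale
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def qad_step(weights, x, teacher_out, lr=0.05, bits=4):
    """One update of the full-precision latent weights.

    Forward: the student output is computed with quantized weights.
    Backward: rounding has zero gradient almost everywhere, so the
    straight-through estimator treats d(quantized)/d(latent) as 1.
    """
    w_q = fake_quantize(weights, bits)
    student_out = sum(w * xi for w, xi in zip(w_q, x))
    err = student_out - teacher_out   # gradient of 0.5 * err**2 w.r.t. student_out
    grad = [err * xi for xi in x]     # gradient w.r.t. w_q, passed straight to weights
    new_weights = [w - lr * g for w, g in zip(weights, grad)]
    return new_weights, 0.5 * err * err
```

The latent weights stay in full precision throughout; only the values used in the forward pass are snapped to the quantization grid, which is what lets small gradient updates accumulate until a weight crosses to a neighbouring grid point.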
Similar to pre-training distillation, teacher model logits (usually from the corresponding unquantized model or a larger model from the same family) provide a much richer signal for this phase, making it preferable to quantization-aware supervised fine-tuning. QAD has been shown to yield consistent improvement over 0-shot and 1-shot post-training quantization (PTQ) methods (Lee et al., [2025](https://arxiv.org/html/2605.29128#bib.bib9); Egiazarian et al., [2026](https://arxiv.org/html/2605.29128#bib.bib8); Xin et al., [2026](https://arxiv.org/html/2605.29128#bib.bib10)). The open access to the original pre-training set and SFT mixture used for both Apertus and Apertus-v1.1 pre- and post-training allows us to use them for QAD of these models with the highest degree of confidence that the distillation curriculum captures close to the entirety of the models' capability. We test QAD for both base and instruction-tuned models, using ≈100M tokens (we see only marginal improvement beyond that) of the pre-training or the SFT mixture, respectively. We use Apertus-8B-2509 and Apertus-8B-Instruct-2509 as the teachers in this scenario. Additional implementation details and hyper-parameters are described in Appendix [C.2](https://arxiv.org/html/2605.29128#A3.SS2).

#### Norm fusion.

To further improve quantization quality, we propose the following zero-cost static model optimization: we scale the columns (input dimension) of the attention QKV and MLP up-projection matrices to have the same norm, and multiplicatively fuse the reciprocal scales into the weights of the preceding layer-normalization layers. The idea is to normalize the magnitudes of weight values to prevent flush-to-zero of small-magnitude but important weights and of weights adjacent to outlier channels. The loss measurements for compressed base models and the few-shot recovery measurements for instruction-tuned models show that this yields the most improvement for smaller models.
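The fusion itself can be sketched on plain nested lists, assuming a normalization layer whose element-wise weights (`gain`) feed directly into a linear layer stored as rows of output features. The common target norm is taken to be the mean column norm, which is our assumption; the text only requires that the columns end up with equal norm. Since Q, K and V read the same normalized input, their matrices would presumably be stacked along the output dimension before computing column norms, so that one set of scales serves all three.

```python
import math


def column_norms(weight):
    """L2 norm of each column (input feature) of a row-major matrix."""
    n_in = len(weight[0])
    return [math.sqrt(sum(row[j] ** 2 for row in weight)) for j in range(n_in)]


def fuse_norm(weight, gain):
    """Equalize column norms of `weight` and fold the inverse into `gain`.

    weight: list of rows, one row per output feature.
    gain:   element-wise weights of the preceding normalization layer.
    Assumes every column has a non-zero norm.
    """
    norms = column_norms(weight)
    target = sum(norms) / len(norms)      # assumed common norm
    scales = [target / n for n in norms]  # per-column multiplier
    new_weight = [[w * s for w, s in zip(row, scales)] for row in weight]
    new_gain = [g / s for g, s in zip(gain, scales)]  # reciprocal goes into the norm
    return new_weight, new_gain


def linear_after_norm(weight, gain, x_normed):
    """Output of the linear layer applied to gain * x_normed."""
    return [sum(w * g * x for w, g, x in zip(row, gain, x_normed)) for row in weight]
```

The transformation is exact in full precision: each column multiplier is cancelled by the matching entry of the normalization weights, so the layer output is unchanged and only the distribution of weight magnitudes seen by the quantizer differs.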
Additionally, although this normalization is mainly designed to assist with weight quantization, we find that it also improves weight+activation quantization (NVFP4), indicating that offloading these scales to the activations does not hurt their compressibility.

#### Weight averaging.

Weight averaging (arithmetic averaging of model weight tensors) of the last few checkpoints during the annealing stage has been shown to improve LLMs' resilience to post-training quantization (Ajroldi et al., [2025](https://arxiv.org/html/2605.29128#bib.bib6)). To validate this, we tested weight averaging for the Apertus-v1.1 0.5B base model combined with various quantization formats and methods, including round-to-nearest (RTN), GPTQ (Frantar et al., [2023](https://arxiv.org/html/2605.29128#bib.bib7)) and QAD. The results, shown in Figure [5](https://arxiv.org/html/2605.29128#S3.F5), demonstrate that weight averaging reduces the validation loss gap to BF16 by up to 10% for RTN and up to 2% for GPTQ, and has close to *no discernible effect on QAD*. As a result, we did not include it in our final quantization pipeline.

#### Final quantization recipe.

Our final recipe combines QAD with norm fusion to achieve just a 0.1-0.2 validation loss increase for base models and 90-104% few-shot accuracy recovery for instruction-tuned Apertus and Apertus-v1.1 models, as seen in Figure [4](https://arxiv.org/html/2605.29128#S3.F4).

Figure 5: The effect of weight averaging (WA) over the last few base model checkpoints on post-training quantization for various data types and algorithms. Checkpoints were taken every 1000 iterations.

### 3.2 Pareto Optimality

As mentioned at the beginning of this section, we analyze base model quantization in the context of high-throughput applications and instruction-tuned model quantization in the context of memory-constrained deployment.
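The notion of Pareto optimality used in this section can be made precise with a few lines of code. In the sketch below each model is a `(name, cost, score)` triple, where lower cost and higher score are better; a model is on the front if no other model is at least as good on both axes and strictly better on one. The example points are made up purely for illustration and are not measurements from this paper; when the quality measure is a loss, its negative can be used as the score.

```python
def pareto_front(models):
    """Return the non-dominated models, sorted by cost.

    models: iterable of (name, cost, score); lower cost and higher score are better.
    """
    models = list(models)
    front = []
    for name, cost, score in models:
        dominated = any(
            c <= cost and s >= score and (c < cost or s > score)
            for _, c, s in models
        )
        if not dominated:
            front.append((name, cost, score))
    return sorted(front, key=lambda m: m[1])


if __name__ == "__main__":
    # Hypothetical (cost, score) points, for illustration only.
    candidates = [
        ("small-bf16", 1.0, 0.50),
        ("large-int4", 1.2, 0.60),
        ("medium-bf16", 4.0, 0.58),
        ("large-bf16", 9.0, 0.62),
    ]
    for name, cost, score in pareto_front(candidates):
        print(name, cost, score)
```

In this toy set the unquantized medium model drops off the front because a quantized larger model is both cheaper and better, which mirrors the qualitative pattern the section describes.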
Naturally, the corresponding cost can be measured for every model we trained (quantized or otherwise), along with a representative measure of its capability, quantifying the cost-accuracy trade-off. Covering a larger range of costs is what drove the demand for smaller models in the first place, and Figure [3](https://arxiv.org/html/2605.29128#S3.F3) visualizes this trade-off. Interestingly, quantized models not only shift the Pareto front (i.e., the enveloping curve) towards more efficient solutions (as seen, for example, by BF16 models almost never being optimal), but also add more points on the frontier, allowing for more fine-grained control over cost. Without quantization, adding new points would have meant pre-training new models of intermediate sizes, which would have entailed spending compute on the order of trillions of tokens. QAD, on the other hand, achieves high recovery after only a few tens of millions of tokens, cutting the cost by more than *four orders of magnitude*.

## 4 Released Checkpoints

We provide a comprehensive suite of pre-trained and instruction-tuned models across multiple quantization formats to support various hardware constraints and deployment scenarios. Table [5](https://arxiv.org/html/2605.29128#S4.T5) summarizes all the checkpoints released as part of the Apertus and Apertus-v1.1 model families.

Table 5: Overview of released Apertus and Apertus-v1.1 checkpoints. Each cell links to the corresponding model weights on Hugging Face.

| Model | BF16 (Base) | BF16 (Instruct) | FP8 | NVFP4A16 | INT3 | INT4 | INT6 |
|---|---|---|---|---|---|---|---|
| Apertus-v1.1-0.5B | [HF](https://huggingface.co/swiss-ai/Apertus-v1.1-0.5B) | [HF](https://huggingface.co/swiss-ai/Apertus-v1.1-0.5B-Instruct) | [HF](https://huggingface.co/swiss-ai/Apertus-v1.1-0.5B-Instruct-vLLM-FP8) | [HF](https://huggingface.co/swiss-ai/Apertus-v1.1-0.5B-Instruct-vLLM-NVFP4A16) | [HF](https://huggingface.co/swiss-ai/Apertus-v1.1-0.5B-Instruct-MLX-INT3) | [HF](https://huggingface.co/swiss-ai/Apertus-v1.1-0.5B-Instruct-MLX-INT4) | [HF](https://huggingface.co/swiss-ai/Apertus-v1.1-0.5B-Instruct-MLX-INT6) |
| Apertus-v1.1-1.5B | [HF](https://huggingface.co/swiss-ai/Apertus-v1.1-1.5B) | [HF](https://huggingface.co/swiss-ai/Apertus-v1.1-1.5B-Instruct) | [HF](https://huggingface.co/swiss-ai/Apertus-v1.1-1.5B-Instruct-vLLM-FP8) | [HF](https://huggingface.co/swiss-ai/Apertus-v1.1-1.5B-Instruct-vLLM-NVFP4A16) | [HF](https://huggingface.co/swiss-ai/Apertus-v1.1-1.5B-Instruct-MLX-INT3) | [HF](https://huggingface.co/swiss-ai/Apertus-v1.1-1.5B-Instruct-MLX-INT4) | [HF](https://huggingface.co/swiss-ai/Apertus-v1.1-1.5B-Instruct-MLX-INT6) |
| Apertus-v1.1-4B | [HF](https://huggingface.co/swiss-ai/Apertus-v1.1-4B) | [HF](https://huggingface.co/swiss-ai/Apertus-v1.1-4B-Instruct) | [HF](https://huggingface.co/swiss-ai/Apertus-v1.1-4B-Instruct-vLLM-FP8) | [HF](https://huggingface.co/swiss-ai/Apertus-v1.1-4B-Instruct-vLLM-NVFP4A16) | [HF](https://huggingface.co/swiss-ai/Apertus-v1.1-4B-Instruct-MLX-INT3) | [HF](https://huggingface.co/swiss-ai/Apertus-v1.1-4B-Instruct-MLX-INT4) | [HF](https://huggingface.co/swiss-ai/Apertus-v1.1-4B-Instruct-MLX-INT6) |
| Apertus-8B-2509 | [HF](https://huggingface.co/swiss-ai/Apertus-8B-2509) | [HF](https://huggingface.co/swiss-ai/Apertus-8B-Instruct-2509) | [HF](https://huggingface.co/swiss-ai/Apertus-8B-Instruct-2509-vLLM-FP8) | [HF](https://huggingface.co/swiss-ai/Apertus-8B-Instruct-2509-vLLM-NVFP4A16) | | [HF](https://huggingface.co/swiss-ai/Apertus-8B-Instruct-2509-MLX-INT4) | |

## 5 Conclusion

We validated pre-training distillation for multi-billion parameter models and multi-trillion token budgets, demonstrating that such model family expansion can be done at a small fraction (less than 20%) of the teacher model's training cost and far more cheaply than pre-training from scratch. In total, we release 24 new model checkpoints, including 3 pre-trained base models, 3 instruction-tuned models, 8 quantized checkpoints for NVIDIA devices and 10 quantized checkpoints for Apple devices, as well as all the code to reproduce the training, post-training and quantization pipelines. We hope that our open-source, open-data and compliant recipe will be of use to LLM practitioners interested in producing and using small language models.

## References

- N. Ajroldi, A. Orvieto, and J. Geiping (2025). When, where and why to average weights? arXiv:2502.06761. [Link](https://arxiv.org/abs/2502.06761)
- P. Apertus, A. Hernández-Cano, A. Hägele, A. H. Huang, A. Romanou, A. Solergibert, B. Pasztor, B. Messmer, D. Garbaya, E. F. Ďurech, I. Hakimi, J. G. Giraldo, M. Ismayilzada, N. Foroutan, S. Moalla, T. Chen, V. Sabolčec, Y. Xu, M. Aerni, B. AlKhamissi, I. A. Mariñas, M. H. Amani, M. Ansaripour, I. Badanin, H. Benoit, E. Boros, N. Browning, F. Bösch, M. Böther, N. Canova, C. Challier, C. Charmillot, J. Coles, J. Deriu, A. Devos, L. Drescher, D. Dzenhaliou, M. Ehrmann, D. Fan, S. Fan, S. Gao, M. Gila, M. Grandury, D. Hashemi, A. Hoyle, J. Jiang, M. Klein, A. Kucharavy, A. Kucherenko, F. Lübeck, R. Machacek, T. Manitaras, A.
Marfurt, K. Matoba, S. Matrenok, H. Mendonça, F. R. Mohamed, S. Montariol, L. Mouchel, S. Najem-Meyer, J. Ni, G. Oliva, M. Pagliardini, E. Palme, A. Panferov, L. Paoletti, M. Passerini, I. Pavlov, A. Poiroux, K. Ponkshe, N. Ranchin, J. Rando, M. Sauser, J. Saydaliev, M. A. Sayfiddinov, M. Schneider, S. Schuppli, M. Scialanga, A. Semenov, K. Shridhar, R. Singhal, A. Sotnikova, A. Sternfeld, A. K. Tarun, P. Teiletche, J. Vamvas, X. Yao, H. Zhao, A. Ilic, A. Klimovic, A. Krause, C. Gulcehre, D. Rosenthal, E. Ash, F. Tramèr, J. VandeVondele, L. Veraldi, M. Rajman, T. Schulthess, T. Hoefler, A. Bosselut, M. Jaggi, and I. Schlag (2025). Apertus: democratizing open and compliant llms for global language environments. arXiv:2509.14233. [Link](https://arxiv.org/abs/2509.14233)
- Y. Bengio, N. Léonard, and A. Courville (2013). Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv:1308.3432. [Link](https://arxiv.org/abs/1308.3432)
- P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord (2018). Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457. [Link](https://arxiv.org/abs/1803.05457)
- A. Conneau, R. Rinott, G. Lample, A. Williams, S. Bowman, H. Schwenk, and V. Stoyanov (2018). XNLI: evaluating cross-lingual sentence representations. In Proceedings of the 2018 conference on empirical methods in natural language processing, pp. 2475–2485.
- V. Dac Lai, C. Van Nguyen, N. T. Ngo, T. Nguyen, F. Dernoncourt, R. A. Rossi, and T. H.
Nguyen (2023). Okapi: instruction-tuned large language models in multiple languages with reinforcement learning from human feedback. arXiv e-prints, pp. arXiv–2307.
- V. Egiazarian, R. L. Castro, D. Kuznedelev, A. Panferov, E. Kurtic, S. Pandit, A. Marques, M. Kurtz, S. Ashkboos, T. Hoefler, and D. Alistarh (2026). Bridging the gap between promise and performance for microscaling fp4 quantization. arXiv:2509.23202. [Link](https://arxiv.org/abs/2509.23202)
- E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh (2023). GPTQ: accurate post-training quantization for generative pre-trained transformers. arXiv:2210.17323. [Link](https://arxiv.org/abs/2210.17323)
- D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt (2021). Measuring massive multitask language understanding. arXiv:2009.03300. [Link](https://arxiv.org/abs/2009.03300)
- A. H. Huang and I. Schlag (2025). Deriving activation functions using integration. arXiv:2411.13010. [Link](https://arxiv.org/abs/2411.13010)
- J. H. Lee, S. Shin, V. Kim, J. You, and A. Chen (2025). Unifying block-wise ptq and distillation-based qat for progressive quantization toward 2-bit instruction-tuned llms. arXiv:2506.09104. [Link](https://arxiv.org/abs/2506.09104)
- Z. Liu, C. Zhao, F. Iandola, C. Lai, Y. Tian, I. Fedorov, Y. Xiong, E. Chang, Y. Shi, R. Krishnamoorthi, L. Lai, and V.
Chandra (2024). MobileLLM: optimizing sub-billion parameter language models for on-device use cases. arXiv:2402.14905. [Link](https://arxiv.org/abs/2402.14905)
- I. Loshchilov and F. Hutter (2019). Decoupled weight decay regularization. arXiv:1711.05101. [Link](https://arxiv.org/abs/1711.05101)
- N. Muennighoff, T. Wang, L. Sutawika, A. Roberts, S. Biderman, T. L. Scao, M. S. Bari, S. Shen, Z. Yong, H. Schoelkopf, X. Tang, D. Radev, A. F. Aji, K. Almubarak, S. Albanie, Z. Alyafeai, A. Webson, E. Raff, and C. Raffel (2022). Crosslingual generalization through multitask finetuning. arXiv:2211.01786.
- M. Pagliardini, P. Ablin, and D. Grangier (2025). The ademamix optimizer: better, faster, older. In International Conference on Learning Representations, Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu (Eds.), Vol. 2025, pp. 64715–64757. [Link](https://proceedings.iclr.cc/paper_files/paper/2025/file/a2cf225ba392627529efef14dc857e22-Paper-Conference.pdf)
- H. Peng, X. Lv, Y. Bai, Z. Yao, J. Zhang, L. Hou, and J. Li (2024). Pre-training distillation for large language models: a design space exploration. arXiv:2410.16215. [Link](https://arxiv.org/abs/2410.16215)
- E. M. Ponti, G. Glavaš, O. Majewska, Q. Liu, I. Vulić, and A. Korhonen (2020). XCOPA: a multilingual dataset for causal commonsense reasoning. arXiv:2005.00333. [Link](https://arxiv.org/abs/2005.00333)
- R. Rafailov, A. Sharma, E. Mitchell, S.
Ermon, C. D. Manning, and C. Finn (2024). Direct preference optimization: your language model is secretly a reward model. arXiv:2305.18290. [Link](https://arxiv.org/abs/2305.18290)
- A. Romanou, N. Foroutan, A. Sotnikova, S. H. Nelaturu, S. Singh, R. Maheshwary, M. Altomare, Z. Chen, M. Haggag, A. Amayuelas, et al. (2025). Include: evaluating multilingual language understanding with regional knowledge. In International Conference on Learning Representations, Vol. 2025, pp. 83291–83322.
- K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi (2019). WinoGrande: an adversarial winograd schema challenge at scale. arXiv:1907.10641. [Link](https://arxiv.org/abs/1907.10641)
- S. Singh, A. Romanou, C. Fourrier, D. I. Adelani, J. G. Ngui, D. Vila-Suero, P. Limkonchotiwat, K. Marchisio, W. Q. Leong, Y. Susanto, R. Ng, S. Longpre, W. Ko, S. Ruder, M. Smith, A. Bosselut, A. Oh, A. F. T. Martins, L. Choshen, D. Ippolito, E. Ferrante, M. Fadaee, B. Ermis, and S. Hooker (2025). Global mmlu: understanding and addressing cultural and linguistic biases in multilingual evaluation. arXiv:2412.03304. [Link](https://arxiv.org/abs/2412.03304)
- M. Xin, S. Priyadarshi, J. Xin, B. Kartal, A. Vavre, A. K. Thekkumpate, Z. Chen, A. S. Mahabaleshwarkar, I. Shahaf, A. Bercovich, K. Patel, S. V. Velury, C. Luo, Z. Cheng, J. Chen, C. Yu, W. Ping, O. Rybakov, N. Tajbakhsh, O. Olabiyi, D. Stosic, D. Wu, S. Han, E. Chung, S. T. Sreenivas, B. Catanzaro, Y. Suhara, T. Blankevoort, and H.
Mao \(2026\)Quantization\-aware distillation for nvfp4 inference accuracy recovery\.External Links:2601\.20088,[Link](https://arxiv.org/abs/2601.20088)Cited by:[§3\.1](https://arxiv.org/html/2605.29128#S3.SS1.SSS0.Px2.p1.1)\. - Y\. Yang, Y\. Zhang, C\. Tar, and J\. Baldridge \(2019\)PAWS\-X: a cross\-lingual adversarial dataset for paraphrase identification\.InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing \(EMNLP\-IJCNLP\),K\. Inui, J\. Jiang, V\. Ng, and X\. Wan \(Eds\.\),Hong Kong, China,pp\. 3687–3692\.External Links:[Link](https://aclanthology.org/D19-1382/),[Document](https://dx.doi.org/10.18653/v1/D19-1382)Cited by:[Appendix B](https://arxiv.org/html/2605.29128#A2.p1.1)\. - R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi \(2019\)HellaSwag: can a machine really finish your sentence?\.External Links:1905\.07830,[Link](https://arxiv.org/abs/1905.07830)Cited by:[2nd item](https://arxiv.org/html/2605.29128#S3.I1.i2.p1.1)\. ## Appendix ACodebases The full codebases for the pre\-training distillation, post\-training, evaluations and quantization stages of the pipeline are available on GitHub\. - •[![[Uncaptioned image]](https://arxiv.org/html/2605.29128v1/figures/logos/github-logo.png)Megatron\-LM\-Distill](https://github.com/swiss-ai/Megatron-LM-Distill): A fork of Megatron\-LM with added functionality for teacher logits generation and saving as well as pre\-training distillation\. - •[![[Uncaptioned image]](https://arxiv.org/html/2605.29128v1/figures/logos/github-logo.png)posttraining](https://github.com/swiss-ai/posttraining): The original post\-training codebase from Apertus that was reused for this project\. 
- [qat\-suite](https://github.com/swiss-ai/qat-suite): A lightweight quantization suite with support for vLLM and MLX data formats and various quantization algorithms, including QAD\.
- [evals](https://github.com/swiss-ai/evals): The Apertus pre\-training evaluation suite\.
- For the data preparation scripts, please refer to the original Apertus report \(Apertus et al\., [2025](https://arxiv.org/html/2605.29128#bib.bib16)\)\.

## Appendix B Evaluation Suite Details

For the evaluations reported in Tables [3](https://arxiv.org/html/2605.29128#S2.T3) and [4](https://arxiv.org/html/2605.29128#S2.T4), we used the publicly available Apertus evaluation suite\. The multilingual macro average shown in Figure [2](https://arxiv.org/html/2605.29128#S2.F2) includes INCLUDE \(Romanou et al\., [2025](https://arxiv.org/html/2605.29128#bib.bib17)\), XCOPA \(Ponti et al\., [2020](https://arxiv.org/html/2605.29128#bib.bib18)\), XNLI \(Conneau et al\., [2018](https://arxiv.org/html/2605.29128#bib.bib19)\), XWinograd \(Muennighoff et al\., [2022](https://arxiv.org/html/2605.29128#bib.bib20)\), PAWS\-X \(Yang et al\., [2019](https://arxiv.org/html/2605.29128#bib.bib21)\), Multilingual ARC \(Dac Lai et al\., [2023](https://arxiv.org/html/2605.29128#bib.bib22)\), Global MMLU \(Singh et al\., [2025](https://arxiv.org/html/2605.29128#bib.bib23)\), and Multilingual HellaSwag \(Dac Lai et al\., [2023](https://arxiv.org/html/2605.29128#bib.bib22)\)\.

## Appendix C Additional Hyper\-Parameters

### C\.1 Pre\-Training Details

Additional per\-model pre\-training hyper\-parameters are shown in Table [6](https://arxiv.org/html/2605.29128#A3.T6)\.
Table 6: Additional hyper\-parameters\.

| Model | Learning rate \(LR\) | Global batch size \(GBS\) | Total Iterations |
| --- | --- | --- | --- |
| Apertus\-v1\.1\-0\.5B | 6e\-4 | 512 | 800000 |
| Apertus\-v1\.1\-1\.5B | 3e\-4 | 512 | 800000 |
| Apertus\-v1\.1\-4B | 2e\-4 | 1024 | 400000 |

### C\.2 QAT Details

For the base models, we sample ≈130M tokens uniformly from the unused remainder of the gathered pre\-training data\. For the instruction\-tuned models, we sample ≈60M tokens uniformly from the Apertus SFT mixture\. We train with AdamW \(Loshchilov and Hutter, [2019](https://arxiv.org/html/2605.29128#bib.bib24)\) with a cosine LR schedule\. For base models, we use the same sequence length and batch size as in pre\-training\. For instruction\-tuned models, we use a slightly larger batch size of 512 to 2048 to compensate for the shorter length of some post\-training sequences\. As in pre\-training distillation, we pre\-compute and store the sparse logits from the teacher model \(Apertus\-8B\-2509 for base models and Apertus\-8B\-Instruct\-2509 for instruction\-tuned models\) once and re\-use them for all combinations of student model and quantization format\.
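The reuse of stored sparse teacher logits described above can be illustrated with a minimal sketch. This section does not specify the sparse format or the loss, so the sketch makes two assumptions: the sparse logits are a top\-k truncation of the teacher's per\-token logits, and the student is trained with a forward KL term against the teacher distribution renormalized over the stored tokens. The function names, the value of k, and the toy logits are all hypothetical, and this is not the Megatron\-LM\-Distill or qat\-suite implementation.

```python
import math


def top_k_sparse_logits(logits, k):
    """Keep the k largest teacher logits as (token_id, logit) pairs.

    Assumption: "sparse logits" means top-k truncation; the paper's exact
    storage format is not given in this section.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return [(i, logits[i]) for i in top]


def log_softmax(values):
    """Numerically stable log-softmax over a list of floats."""
    m = max(values)
    lse = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - lse for v in values]


def sparse_kd_loss(sparse_teacher, student_logits):
    """Forward KL from the stored teacher entries to the student.

    Teacher probabilities are renormalized over the stored token ids
    (an assumption); student log-probabilities use the full vocabulary.
    """
    ids = [i for i, _ in sparse_teacher]
    teacher_logp = log_softmax([v for _, v in sparse_teacher])
    student_logp = log_softmax(student_logits)
    return sum(
        math.exp(t) * (t - student_logp[i]) for i, t in zip(ids, teacher_logp)
    )


# Toy vocabulary of 4 tokens; all numbers are made up for illustration.
teacher_logits = [2.0, 0.5, -1.0, 3.0]
cached = top_k_sparse_logits(teacher_logits, k=2)  # computed once, then stored

# Hypothetical student / format combinations that all reuse the same cache.
students = {
    "student_a": [1.8, 0.4, -0.9, 2.7],
    "student_b": [0.0, 2.0, 0.0, 0.5],
}
for name, logits in students.items():
    print(name, round(sparse_kd_loss(cached, logits), 4))
```

Because the teacher entries are computed once and only read afterwards, the teacher forward pass is paid a single time regardless of how many student and format combinations are trained against the cache.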