Early Data Exposure Improves Robustness to Subsequent Fine-Tuning


Summary

This paper shows that mixing post-training data into pretraining (early exposure) improves how robustly a model retains capabilities after subsequent fine-tuning, challenging the notion that immediate post-training performance predicts retention. Controlled experiments on 135M and 1B models demonstrate that early exposure consistently improves the trade-off between upstream retention and downstream performance.

arXiv:2605.12705v1 Announce Type: new Abstract: How can we train models whose post-trained capabilities survive subsequent fine-tuning? Rather than focusing on downstream interventions to mitigate forgetting of upstream capabilities, we study how upstream training choices - that is, the manner in which a capability is acquired - shape how robustly that capability is retained. We investigate this question in a controlled three-stage language-model pipeline: pretraining, post-training to acquire a target capability, and downstream fine-tuning on a new objective. Across 135M and 1B models, two post-training domains, and two downstream fine-tuning tasks, we find that immediate post-training performance does not reliably predict retention after subsequent fine-tuning: training recipes that look equivalent immediately after post-training can retain the target capability very differently after subsequent fine-tuning. In particular, early exposure - mixing post-training data into pretraining - consistently improves the frontier between retained upstream performance and downstream performance. In compute-matched experiments, where the target data must be allocated between pretraining and post-training, we find that the optimum lies at neither extreme. Together with our other empirical and theoretical findings, this supports the view that post-training drives immediate specialization while early exposure improves robustness to later forgetting. Replay and dropout, typically used to mitigate forgetting as it occurs during fine-tuning, provide complementary gains to early exposure when applied during post-training. Our findings suggest that robustness to subsequent fine-tuning should be treated as a first-class objective of upstream training, addressed preventatively through choices like early exposure rather than reactively during fine-tuning itself.

# Early Data Exposure Improves Robustness to Subsequent Fine-Tuning
Source: [https://arxiv.org/html/2605.12705](https://arxiv.org/html/2605.12705)
Antiquus S\. Hippocampus, Natalia Cerebro & Amelie P\. Amygdale, Department of Computer Science, Cranberry\-Lemon University, Pittsburgh, PA 15213, USA \{hippo,brain,jen\}@cs\.cranberry\-lemon\.edu; Ji Q\. Ren & Yevgeny LeNet, Department of Computational Neuroscience, University of the Witwatersrand, Joburg, South Africa \{robot,net\}@wits\.ac\.za

###### Abstract

How can we train models whose post\-trained capabilities survive subsequent fine\-tuning? Rather than focusing on downstream interventions to mitigate forgetting of upstream capabilities, we study how upstream training choices \(that is, the manner in which a capability is acquired\) shape how robustly that capability is retained\. We investigate this question in a controlled three\-stage language\-model pipeline: pretraining, post\-training to acquire a target capability, and downstream fine\-tuning on a new objective\. Across 135M and 1B models, two post\-training domains, and two downstream fine\-tuning tasks, we find that immediate post\-training performance does not reliably predict retention after subsequent fine\-tuning: training recipes that look equivalent immediately after post\-training can retain the target capability very differently after subsequent fine\-tuning\. In particular, early exposure \(mixing post\-training data into pretraining\) consistently improves the frontier between retained upstream performance and downstream performance\. In compute\-matched experiments, where the target data must be allocated between pretraining and post\-training, we find that the optimum lies at neither extreme\. Together with our other empirical and theoretical findings, this supports the view that post\-training drives immediate specialization while early exposure improves robustness to later forgetting\. Replay and dropout, typically used to mitigate forgetting as it occurs during fine\-tuning, provide complementary gains to early exposure when applied during post\-training\. Our findings suggest that robustness to subsequent fine\-tuning should be treated as a first\-class objective of upstream training, addressed preventatively through choices like early exposure rather than reactively during fine\-tuning itself\.

## 1 Introduction

When a post\-trained language model is released for downstream fine\-tuning, its carefully acquired capabilities are at risk\. Fine\-tuning on a new objective routinely causes catastrophic forgetting of behaviors introduced during post\-training, whether instruction following, domain knowledge, coding ability, or safety\-related behavior \(Yang et al\., [2025](https://arxiv.org/html/2605.12705#bib.bib26); Olmo et al\., [2025](https://arxiv.org/html/2605.12705#bib.bib17)\)\.

Most prior work treats this as a problem for the downstream fine\-tuner to solve\. If fine\-tuning degrades prior capabilities, the natural response is to modify that fine\-tuning stage: replay earlier data \(Bethune et al\., [2025](https://arxiv.org/html/2605.12705#bib.bib3); Kotha & Liang, [2026](https://arxiv.org/html/2605.12705#bib.bib11)\), regularize the update \(Kirkpatrick et al\., [2017](https://arxiv.org/html/2605.12705#bib.bib10)\), restrict the trainable parameters \(Hu et al\., [2021](https://arxiv.org/html/2605.12705#bib.bib9); Biderman et al\., [2024](https://arxiv.org/html/2605.12705#bib.bib4)\), or jointly optimize competing objectives \(Wortsman et al\., [2022a](https://arxiv.org/html/2605.12705#bib.bib24); [b](https://arxiv.org/html/2605.12705#bib.bib25)\)\.

We take a complementary view: robustness to subsequent fine\-tuning should be treated as an objective of upstream model development\. Upstream developers typically train in two stages — first on a large general corpus to build broad language understanding, then on a smaller, often scarce, targeted dataset to instill specific capabilities\. Because this second stage uses limited data, how and when it is used matters\. Our central intuition is that how a model learns a capability shapes how robustly it is retained: two models that reach identical post\-training performance can differ substantially in how well those capabilities survive later adaptation\.

To study this question, we use a controlled three\-stage pipeline reflecting this standard practice: an upstream developer first pretrains on a broad corpus, then post\-trains on a smaller targeted dataset to acquire specific capabilities, and finally hands the resulting model to a downstream user who fine\-tunes it on a new objective \(Figure [1](https://arxiv.org/html/2605.12705#S1.F1)\)\. We study this framework across multiple controlled settings spanning different post\-training and downstream fine\-tuning regimes \(Table [1](https://arxiv.org/html/2605.12705#S3.T1)\), including both domain and behavioral adaptation, and evaluate these settings for 135M\-parameter models, extending our findings to 1B\-parameter models\. We hold the downstream fine\-tuning method fixed, applying standard supervised fine\-tuning, and sweep its learning rate to characterize how upstream choices shape the tradeoff between downstream performance, retention of the post\-trained capability, and performance on the broader pretraining distribution\. Accordingly, our evaluation centers on the tradeoff frontier induced by different methods, rather than immediate post\-training performance alone\.
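One way to picture the tradeoff frontier described above is as the set of non-dominated points produced by a learning-rate sweep. The following is a minimal sketch, not the authors' evaluation code: the function name `tradeoff_frontier`, the two-axis simplification (downstream loss versus retained post-training loss), and all numbers in `sweep` are illustrative assumptions.

```python
from typing import Dict, List, Tuple

# (downstream fine-tuning loss, retained post-training loss); lower is better on both.
Point = Tuple[float, float]


def tradeoff_frontier(points: List[Point]) -> List[Point]:
    """Return the points that no other point beats on both losses at once."""
    frontier: List[Point] = []
    best_retained = float("inf")
    # Sort by downstream loss (ties broken by retained loss), then keep a point
    # only if it strictly improves the best retained loss seen so far.
    for downstream, retained in sorted(points):
        if retained < best_retained:
            frontier.append((downstream, retained))
            best_retained = retained
    return frontier


# Hypothetical sweep results keyed by downstream learning rate (made-up numbers).
sweep: Dict[float, Point] = {
    1e-5: (2.9, 1.1),
    3e-5: (2.6, 1.3),
    1e-4: (2.4, 1.8),
    3e-4: (2.5, 2.6),
}

frontier = tradeoff_frontier(list(sweep.values()))
print(frontier)
```

In this toy sweep the largest learning rate is dominated: it is worse than the 1e-4 run on both losses, so it drops off the frontier. Comparing two upstream recipes then amounts to comparing their frontiers rather than any single checkpoint.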

[Figure 1 diagram: $\theta_{\mathrm{pre}} \longrightarrow \theta_{\mathrm{post}} \longrightarrow \theta_{\mathrm{ft}}$\. Stage labels: pretraining; post\-training for $X$ \(acquiring capability $X$\); fine\-tuning for task $Y$ by a downstream user \(is $X$ retained?\)\.]

Figure 1: Overview of our three\-stage experimental setup, in contrast to a typical two\-stage setup\. A first party pretrains then post\-trains a model with the goal of achieving high performance on domain $X$\. Subsequently, downstream users fine\-tune $\theta_{\mathrm{post}}$ for a task $Y$, causing catastrophic forgetting of domain $X$\. Previous work investigates interventions in the third stage: how can we fine\-tune for $Y$ while mitigating forgetting on $X$? In this work, we investigate how the way in which $X$ is learned affects how, or whether, it is forgotten\.

Our main intervention is simple: we expose the model to some of the eventual post\-training data earlier by mixing it in during pretraining\. Across datasets and model sizes, we find that this early exposure improves the tradeoff between retained upstream capability and downstream fine\-tuning loss \(Figure [2](https://arxiv.org/html/2605.12705#S1.F2)\), even when it has little or no visible effect on immediate post\-training performance \(Section [4\.1](https://arxiv.org/html/2605.12705#S4.SS1)\)\.
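As a rough illustration of what mixing post-training data into pretraining could look like at the data-loading level, here is a minimal sketch. It assumes per-sample Bernoulli mixing with sampling with replacement. The function name `mixed_pretraining_stream`, the `mix_fraction` parameter, and the example value 0.1 are all hypothetical; this passage does not specify the mixing ratio or sampling scheme.

```python
import random
from typing import List, Sequence


def mixed_pretraining_stream(
    pretrain_docs: Sequence[str],
    target_docs: Sequence[str],
    mix_fraction: float,
    num_samples: int,
    seed: int = 0,
) -> List[str]:
    """Draw a pretraining stream in which roughly `mix_fraction` of the samples
    come from the post-training (target) data and the rest from the general corpus."""
    if not 0.0 <= mix_fraction <= 1.0:
        raise ValueError("mix_fraction must lie in [0, 1]")
    rng = random.Random(seed)  # seeded so the stream is reproducible
    stream: List[str] = []
    for _ in range(num_samples):
        # Assumption: each sample independently picks its source.
        source = target_docs if rng.random() < mix_fraction else pretrain_docs
        stream.append(rng.choice(source))
    return stream


# Toy corpora standing in for the general and targeted datasets.
general = ["general doc A", "general doc B", "general doc C"]
targeted = ["target doc A", "target doc B"]
stream = mixed_pretraining_stream(general, targeted, mix_fraction=0.1, num_samples=20)
```

Setting `mix_fraction=0.0` recovers unmixed pretraining, which is the baseline the mixed runs are compared against.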

Why does early exposure help? Our theoretical account \(Section [5](https://arxiv.org/html/2605.12705#S5)\) suggests that mixing during pretraining allows the post\-training capability to be represented in specialized features that are less vulnerable to subsequent interference\. Our compute\-matched experiments reinforce this view: even under a fixed budget of post\-training data, the optimum does not lie at either extreme of allocating all exposure to pretraining or all exposure to post\-training, but between the two \(Section [4\.2](https://arxiv.org/html/2605.12705#S4.SS2)\)\.
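The compute-matched setting can be read as a one-dimensional allocation problem: a fixed budget of target-data tokens is split between pretraining and post-training, and each split is trained and evaluated. The sketch below only enumerates candidate splits. The names `split_target_budget` and `allocation_grid`, the token budget, and the grid of fractions are assumptions for illustration, not values from the paper.

```python
from typing import List, Sequence, Tuple


def split_target_budget(total_target_tokens: int, early_fraction: float) -> Tuple[int, int]:
    """Split a fixed target-data budget into (pretraining tokens, post-training tokens)."""
    if total_target_tokens < 0:
        raise ValueError("total_target_tokens must be non-negative")
    if not 0.0 <= early_fraction <= 1.0:
        raise ValueError("early_fraction must lie in [0, 1]")
    early_tokens = round(total_target_tokens * early_fraction)
    # The remainder goes to post-training, so the total is always preserved.
    return early_tokens, total_target_tokens - early_tokens


def allocation_grid(total_target_tokens: int, fractions: Sequence[float]) -> List[Tuple[int, int]]:
    """Enumerate candidate allocations; each one would correspond to a separate run."""
    return [split_target_budget(total_target_tokens, f) for f in fractions]


# Hypothetical budget and grid; the endpoints are the two extremes named in the text.
grid = allocation_grid(1_000_000, [0.0, 0.25, 0.5, 0.75, 1.0])
for early, late in grid:
    print(f"pretraining: {early:>9,d} tokens | post-training: {late:>9,d} tokens")
```

The first and last grid entries correspond to the two extremes (all exposure in post-training, all exposure in pretraining). The reported finding is that the best frontier comes from an interior split.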

If the manner of learning matters, a natural follow\-up question arises: what other interventions can shape how a capability is acquired? We study replay and dropout from this perspective \(Section [4\.4](https://arxiv.org/html/2605.12705#S4.SS4)\)\. Intuitively, replay mitigates forgetting by interleaving earlier data with new\-domain updates, while dropout discourages co\-adaptation and promotes more robust representations \(Rolnick et al\., [2019](https://arxiv.org/html/2605.12705#bib.bib20); Srivastava et al\., [2014](https://arxiv.org/html/2605.12705#bib.bib23)\)\. Importantly, we evaluate whether the benefits of these interventions persist after a later downstream fine\-tuning stage\. We find that both improve the tradeoff frontier while remaining complementary to pretraining\-time mixing\.

Together, our findings show that the upstream training process is a meaningful lever for shaping robustness to subsequent fine\-tuning\. In particular, a remarkably simple intervention—mixing a small amount of post\-training data into pretraining—can materially improve how well capabilities survive subsequent fine\-tuning\. Replay and dropout provide additional complementary gains, further suggesting that robustness can be influenced well before forgetting is observed downstream\. More broadly, our results point to a promising direction for future work: building models that are inherently easier to adapt by designing the upstream training pipeline to make valuable capabilities more durable from the start\.

![Refer to caption](https://arxiv.org/html/2605.12705v1/x1.png)

Figure 2: Mixing during pretraining improves the frontier across four training pipelines (135M). Each panel corresponds to one 3-stage pipeline. Within each panel, the left plot shows retained post-training loss versus downstream fine-tuning loss, and the right plot shows retained pretraining loss versus retained post-training loss. Black denotes the frontier obtained from unmixed pretraining, and purple denotes the frontier obtained from mixed pretraining. Across all four pipelines, mixing shifts the frontier toward lower retained post-training loss, lower retained pretraining loss, and lower downstream fine-tuning loss, indicating that early exposure to post-training data can improve its retention after subsequent fine-tuning.

## 2 Related Works

Catastrophic Forgetting. A recurring challenge in sequential training is *catastrophic forgetting*: when a model is optimized on new data, its performance can deteriorate on behaviors it previously exhibited (McCloskey & Cohen, [1989](https://arxiv.org/html/2605.12705#bib.bib14)). For language models, this phenomenon shows up in modern training pipelines. For example, instruction tuning and RLHF can trade off against preexisting capabilities, an effect often discussed as an “alignment tax” (Ouyang et al., [2022](https://arxiv.org/html/2605.12705#bib.bib18)). Relatedly, several works show that behaviors introduced during safety fine-tuning can be quickly weakened or reversed by subsequent training on different objectives or data (Yang et al., [2023](https://arxiv.org/html/2605.12705#bib.bib27); Qi et al., [2023](https://arxiv.org/html/2605.12705#bib.bib19)). These tradeoffs also appear in adjacent settings such as knowledge editing (Nishi et al., [2025](https://arxiv.org/html/2605.12705#bib.bib15)) and unlearning (Maini et al., [2024](https://arxiv.org/html/2605.12705#bib.bib12)). Beyond documenting the effect, recent work has started to map how training choices shape its severity: for instance, LoRA-style adaptation can alter forgetting dynamics (Biderman et al., [2024](https://arxiv.org/html/2605.12705#bib.bib4)), and longer pretraining can change how brittle or persistent acquired capabilities are (Springer et al., [2025](https://arxiv.org/html/2605.12705#bib.bib22)). In this paper, we focus on catastrophic forgetting of post-trained capabilities, and study what properties of an intermediate checkpoint determine whether capabilities persist under subsequent training.

Data Placement in Pretraining. A line of recent work examines pretraining interventions for enforcing desired downstream capabilities and properties. Maini et al. ([2025](https://arxiv.org/html/2605.12705#bib.bib13)) and O’Brien et al. ([2025](https://arxiv.org/html/2605.12705#bib.bib16)) propose filtering and augmenting data during pretraining to improve safety. Similarly, Sam et al. ([2026](https://arxiv.org/html/2605.12705#bib.bib21)) demonstrate that the impact of such interventions improves as they are introduced earlier in pretraining. While these works incorporate downstream tasks during pretraining, they extensively modify the pretraining corpus through data augmentation and filtering. Baek et al. ([2026](https://arxiv.org/html/2605.12705#bib.bib2)) demonstrate that mixing post-training data during pretraining can immediately improve in-domain performance relative to simply fine-tuning. In our work, we identify an additional benefit of early exposure to post-training data: robustness to catastrophic forgetting during future training.

## 3 Preliminaries and Setting

Typically, a model developer (1) pretrains a model on a general web corpus and (2) post-trains the model on a target domain; downstream users then (3) fine-tune the model for their own purposes. Throughout this work, we describe stages (1) and (2) as upstream relative to stage (3), which is controlled by the downstream end user. Each stage is associated with a dataset: $\mathcal{D}_{\text{pre}}$ (general pretraining), $\mathcal{D}_{\text{post}}$ (post-training), and $\mathcal{D}_{\text{ft}}$ (fine-tuning). The post-training corpus is assumed to be much smaller than the pretraining web corpus, reflecting the practical regime where post-training data is relatively scarce.

We write $\mathcal{L}(\theta;\mathcal{D})$ for the loss of parameters $\theta$ evaluated on dataset $\mathcal{D}$. All losses are computed on held-out splits of the corresponding datasets.

Stage 1: Upstream Pretraining. The model is first pretrained on a general web corpus $\mathcal{D}_{\text{pre}}$, producing a pretrained checkpoint $\theta_{\text{pre}}$. In some experiments, the upstream developer additionally mixes a fraction $\lambda \in [0,1]$ of the post-training corpus $\mathcal{D}_{\text{post}}$ into this stage. Here, $\lambda = 0$ denotes no exposure to $\mathcal{D}_{\text{post}}$ during pretraining, while $\lambda > 0$ denotes *early exposure* to the post-training dataset. In this work, $\lambda$ is at most one, corresponding to at most one pass over $\mathcal{D}_{\text{post}}$ during pretraining.
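To make the scale of early exposure concrete, the following minimal sketch computes how many post-training tokens a given $\lambda$ contributes to Stage 1 and what share of the Stage 1 stream they make up. It assumes corpus sizes are measured in tokens and that the mixed tokens are added on top of the general pretraining tokens rather than replacing them; the function name and return keys are hypothetical and not taken from the paper.

```python
def mixed_pretraining_tokens(pre_tokens: int, post_tokens: int, lam: float) -> dict:
    """Token counts for Stage 1 when a lambda-fraction of D_post is mixed in.

    Assumption (not stated in the paper): mixed tokens are appended to the
    general pretraining budget instead of displacing part of it.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lambda must lie in [0, 1]")
    mixed = round(lam * post_tokens)  # at most one pass over D_post
    return {
        "general_tokens": pre_tokens,
        "post_tokens_mixed": mixed,
        "post_share_of_stage1": mixed / (pre_tokens + mixed),
    }


if __name__ == "__main__":
    # Illustrative sizes: ~10B general tokens, 300M post-training tokens.
    for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(lam, mixed_pretraining_tokens(10_000_000_000, 300_000_000, lam))
```

Under these assumptions, even $\lambda = 1$ with a 300M-token corpus accounts for only a few percent of the Stage 1 stream, which matches the regime described above where post-training data is scarce relative to the web corpus.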

Stage 2: Upstream Post-training. Starting from $\theta_{\text{pre}}$, the upstream developer post-trains the model on a relatively smaller corpus $\mathcal{D}_{\text{post}}$ to acquire a target capability or domain adaptation, yielding the post-trained checkpoint $\theta_{\text{post}}$. We measure performance on $\mathcal{D}_{\text{post}}$ immediately after this stage using the *immediate post-training loss*

$$\mathcal{L}_{\mathrm{im}} := \mathcal{L}(\theta_{\text{post}}; \mathcal{D}_{\text{post}}).$$

Stage 3: Subsequent Fine-Tuning. A downstream user then fine-tunes the post-trained model on a new objective $\mathcal{D}_{\text{ft}}$, producing the checkpoint $\theta_{\text{ft}}$. We measure performance on this new objective using the *downstream fine-tuning loss*

$$\mathcal{L}_{\mathrm{ft}} := \mathcal{L}(\theta_{\text{ft}}; \mathcal{D}_{\text{ft}}).$$

Subsequent fine-tuning can degrade capabilities acquired during upstream post-training. To measure how much of the post-trained capability survives, we also evaluate $\theta_{\text{ft}}$ on $\mathcal{D}_{\text{post}}$, defining the *retained post-training loss*

$$\mathcal{L}_{\mathrm{ret}} := \mathcal{L}(\theta_{\text{ft}}; \mathcal{D}_{\text{post}}).$$

Our central question is how upstream training choices affect downstream adaptation, retention of post-trained capabilities, and preservation of general-domain performance under subsequent fine-tuning.

Table 1: Experimental instantiations of the three-stage pipeline. We vary the upstream post-training corpus $\mathcal{D}_{\text{post}}$ and downstream fine-tuning corpus $\mathcal{D}_{\text{ft}}$ while keeping the general pretraining corpus $\mathcal{D}_{\text{pre}}$ fixed to C4.

### 3\.1 Evaluation methodology

Subsequent fine-tuning is inherently multi-objective. A downstream user may care not only about performance on the new fine-tuning objective $\mathcal{D}_{\text{ft}}$, but also about retaining capabilities acquired during upstream post-training on $\mathcal{D}_{\text{post}}$ and preserving more general capabilities associated with the pretraining distribution $\mathcal{D}_{\text{pre}}$. We therefore track three losses throughout this work: the downstream fine-tuning loss $\mathcal{L}_{\mathrm{ft}}$, the retained post-training loss $\mathcal{L}_{\mathrm{ret}}$, and the *retained pretraining loss*

$$\mathcal{L}_{\mathrm{pre}} := \mathcal{L}(\theta_{\text{ft}}; \mathcal{D}_{\text{pre}}).$$

We use validation loss as our evaluation metric\. Prior work has established loss as a reliable, scale\-invariant proxy for capability: models with matched pretraining loss exhibit equivalent downstream task performance\(Du et al\.,[2024](https://arxiv.org/html/2605.12705#bib.bib6); Gadre et al\.,[2024](https://arxiv.org/html/2605.12705#bib.bib7); Chen et al\.,[2025](https://arxiv.org/html/2605.12705#bib.bib5)\)\. This is particularly important at our training scales, where task accuracy is noisy and near random chance; in contrast, loss provides a continuous and smooth signal that enables fine\-grained comparisons between models that may appear similar under coarse or discrete metrics\.
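The four quantities defined in this section differ only in which checkpoint is evaluated on which held-out split. The sketch below records that bookkeeping; `loss` stands for any callable that returns a held-out validation loss for a (checkpoint, dataset) pair, and all names here are hypothetical rather than part of the authors' code.

```python
from typing import Any, Callable, Dict


def pipeline_losses(
    loss: Callable[[Any, Any], float],
    theta_post: Any,
    theta_ft: Any,
    d_pre: Any,
    d_post: Any,
    d_ft: Any,
) -> Dict[str, float]:
    """Evaluate the four losses used throughout the paper.

    `loss(theta, dataset)` is an assumed interface returning the held-out
    validation loss of checkpoint `theta` on `dataset`.
    """
    return {
        "L_im": loss(theta_post, d_post),  # immediate post-training loss
        "L_ft": loss(theta_ft, d_ft),      # downstream fine-tuning loss
        "L_ret": loss(theta_ft, d_post),   # retained post-training loss
        "L_pre": loss(theta_ft, d_pre),    # retained pretraining loss
    }
```

Only $\mathcal{L}_{\mathrm{im}}$ is measured on $\theta_{\text{post}}$; the other three are all measured on the same fine-tuned checkpoint $\theta_{\text{ft}}$, which is what makes them directly comparable as a three-objective tradeoff.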

Sweeping upstream training choices and hyperparameters yields checkpoints with different tradeoffs among these objectives. We summarize the best attainable tradeoffs using 2D *loss frontiers*: for each method, we plot the Pareto-optimal set of checkpoints in a given 2D projection, i.e., those for which no other checkpoint from the same method achieves lower loss on both axes simultaneously. Our main analysis uses two complementary views: $(\mathcal{L}_{\mathrm{ret}}, \mathcal{L}_{\mathrm{ft}})$, which captures the tradeoff between retaining the post-trained capability and adapting to the downstream task, and $(\mathcal{L}_{\mathrm{pre}}, \mathcal{L}_{\mathrm{ret}})$, which captures the tradeoff between retaining broader pretraining capabilities and retaining the post-trained capability. Together, these views provide interpretable slices through the underlying three-objective tradeoff.
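A 2D loss frontier of this kind can be extracted with a single sorted pass. The sketch below is one possible implementation, not the authors' code: it treats each checkpoint as a pair of losses (lower is better on both axes) and drops weakly dominated points as well, which is a slightly stricter tie-breaking convention than the wording above. The function name is hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def loss_frontier(points: List[Point]) -> List[Point]:
    """Return the Pareto-optimal points when both coordinates are losses.

    Points are scanned in increasing order of the first loss; a point is kept
    only if it strictly improves the best second loss seen so far.
    """
    frontier: List[Point] = []
    best_second = float("inf")
    for first, second in sorted(set(points)):
        if second < best_second:
            frontier.append((first, second))
            best_second = second
    return frontier


if __name__ == "__main__":
    # Hypothetical (L_ret, L_ft) pairs from one method's sweep.
    sweep = [(1.0, 3.0), (2.0, 2.0), (3.0, 2.5), (2.5, 1.0), (1.0, 4.0)]
    print(loss_frontier(sweep))
```

Comparing two methods then amounts to computing one frontier per method and checking which curve lies closer to the origin in the chosen projection.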

### 3\.2 Experimental instantiations

Across experiments, we fix the general pretraining corpus to C4 and study the four three\-stage instantiations shown in Table[1](https://arxiv.org/html/2605.12705#S3.T1)\. These settings cover two qualitatively different forms of upstream capability acquisition, domain adaptation and instruction tuning, and let us test whether the same robustness phenomena appear across different downstream fine\-tuning objectives\.

### 3\.3 Model scales and training budgets

Unless otherwise stated, we use a SmolLM2-style architecture (Allal et al., [2025](https://arxiv.org/html/2605.12705#bib.bib1)) at two scales: our primary experiments use a 135M-parameter model, and we additionally run a 1B-parameter variant to test whether the same qualitative patterns persist at larger scale. For the 135M experiments, we pretrain on approximately 10B tokens from $\mathcal{D}_{\text{pre}}$, optionally with early exposure to $\mathcal{D}_{\text{post}}$ during Stage 1. Starting from the resulting checkpoint $\theta_{\text{pre}}$, we perform Stage 2 post-training on $\mathcal{D}_{\text{post}}$ using AdamW with linear warmup and cosine decay. In all but the compute-matched experiments, training proceeds exclusively on $\mathcal{D}_{\text{post}}$, with no restriction on dataset repetitions: we apply early stopping and continue training as long as validation loss on $\mathcal{D}_{\text{post}}$ improves (up to a maximum budget of 2B tokens). This ensures that all models are trained to convergence on $\mathcal{D}_{\text{post}}$, and that differences in downstream retention are not attributable to unequal training duration. We then fine-tune each post-trained checkpoint $\theta_{\text{post}}$ on $\mathcal{D}_{\text{ft}}$ for a fixed token budget of 200M tokens with various learning rates.
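The Stage 2 stopping rule described above can be sketched as follows. This is an illustration under stated assumptions: the evaluation cadence, the `patience` parameter, and the function name are hypothetical, and only the 2B-token cap and the criterion of continuing while validation loss improves come from the text.

```python
from typing import Iterable, Tuple


def post_train_with_early_stopping(
    eval_stream: Iterable[Tuple[int, float]],
    max_tokens: int = 2_000_000_000,
    patience: int = 1,
) -> Tuple[int, float]:
    """Pick the Stage 2 checkpoint with the best validation loss on D_post.

    `eval_stream` yields (tokens_seen, val_loss) at each evaluation point.
    Training stops after `patience` consecutive evaluations without
    improvement, or once the token budget is exceeded.
    """
    best_loss = float("inf")
    best_tokens = 0
    bad_evals = 0
    for tokens_seen, val_loss in eval_stream:
        if tokens_seen > max_tokens:
            break  # token budget exhausted
        if val_loss < best_loss:
            best_loss, best_tokens, bad_evals = val_loss, tokens_seen, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                break  # validation loss stopped improving
    return best_tokens, best_loss
```

Because the rule depends only on validation loss on $\mathcal{D}_{\text{post}}$, mixed and unmixed checkpoints may stop after different numbers of tokens while still being compared at their respective converged states.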

Full optimizer settings, sweep ranges, and per\-intervention details are provided in Appendix[A](https://arxiv.org/html/2605.12705#A1)\.

## 4 Experiments and Results

We begin by asking whether pretraining\-time mixing has any effect once post\-training is run to convergence \(Section[4\.1](https://arxiv.org/html/2605.12705#S4.SS1)\)\. We then ask whether scarce post\-training data should be mixed during pretraining or reserved for a dedicated post\-training stage \(Section[4\.2](https://arxiv.org/html/2605.12705#S4.SS2)\)\. Finally, we ask whether the benefits of mixing persist across broad hyperparameter sweeps and multiple pipeline instantiations \(Section[4\.3](https://arxiv.org/html/2605.12705#S4.SS3)\), before turning to replay and dropout as complementary post\-training interventions \(Section[4\.4](https://arxiv.org/html/2605.12705#S4.SS4)\)\.

![Refer to caption](https://arxiv.org/html/2605.12705v1/x2.png)

(a) Varying mixture fraction $\lambda$.

![Refer to caption](https://arxiv.org/html/2605.12705v1/x3.png)

(b) Compute-matched setting.

Figure 3: Left: As the mixture fraction $\lambda$ increases, immediate MusicPile loss after post-training remains nearly constant, while retained MusicPile loss after downstream fine-tuning on ChemPile improves. This shows that the benefits of mixing can be *latent*: they may not be visible immediately after post-training, but emerge after subsequent fine-tuning. Right: In a compute-matched setting where total MusicPile exposure is held fixed across pretraining and post-training, increasing $\lambda$ worsens immediate MusicPile loss after post-training but improves retained MusicPile loss after downstream fine-tuning. Thus, even under a fixed MusicPile token budget, allocating some exposure earlier in training yields better retention.

### 4\.1 Immediate post\-training performance does not reflect downstream retention

We begin with a controlled study of the mixing ratio $\lambda$, designed to isolate whether early exposure to $\mathcal{D}_{\text{post}}$ has any effect once upstream post-training is run to convergence using early stopping. Unlike the broad hyperparameter sweeps used in our main frontier analysis, these experiments fix the Stage 2 post-training procedure and vary only how much of $\mathcal{D}_{\text{post}}$ is seen during Stage 1 pretraining.

Setup. We fix the post-training configuration and vary only the pretraining mixing fraction $\lambda \in \{0, 0.25, 0.5, 0.75, 1.0\}$. We study three post-training dataset sizes $|\mathcal{D}_{\text{post}}| \in \{30\text{M}, 150\text{M}, 300\text{M}\}$, where $\mathcal{D}_{\text{post}} \subset \text{MusicPile}$. Starting from each pretrained checkpoint, we post-train on $\mathcal{D}_{\text{post}}$ until convergence using a fixed hyperparameter configuration. To induce forgetting, we then fine-tune on $\mathcal{D}_{\text{ft}} \subset \text{ChemPile}$, and report both the immediate post-training loss $\mathcal{L}_{\mathrm{im}}$ and the retained post-training loss $\mathcal{L}_{\mathrm{ret}}$ at a fixed Stage 3 learning rate of $5 \times 10^{-5}$ (Figure [3(a)](https://arxiv.org/html/2605.12705#S4.F3.sf1)).

Result. Varying $\lambda$ has little effect on immediate post-training performance: across dataset sizes, $\mathcal{L}_{\mathrm{im}}$ remains nearly flat as the mixing fraction increases. In other words, once post-training is allowed to run to convergence, mixed and unmixed models reach similar performance on $\mathcal{D}_{\text{post}}$. However, these checkpoints behave very differently under subsequent fine-tuning. As $\lambda$ increases, the retained post-training loss $\mathcal{L}_{\mathrm{ret}}$ consistently decreases, indicating that models with more exposure to $\mathcal{D}_{\text{post}}$ during pretraining forget less after subsequent downstream adaptation.

Takeaway. Early exposure can substantially improve retention under subsequent fine-tuning even when it provides little or no benefit to immediate post-training performance.

### 4\.2 Early exposure and post\-training play different roles under a fixed data budget

The experiment in Section [4.1](https://arxiv.org/html/2605.12705#S4.SS1) shows that early exposure to post-training data can improve its downstream retention even when it has little effect on immediate post-training performance. However, those experiments do not isolate whether the benefit comes from *when* $\mathcal{D}_{\text{post}}$ is introduced or simply from the model seeing more total $\mathcal{D}_{\text{post}}$ tokens. We therefore ask a more controlled question: under a fixed $\mathcal{D}_{\text{post}}$ budget, should post-training data be mixed into pretraining at all, or is it better reserved for a dedicated post-training stage?

Setup. We fix the total number of $\mathcal{D}_{\text{post}}$ tokens seen across Stage 1 pretraining and Stage 2 post-training and vary only how that budget is allocated. For each mixing fraction $\lambda \in \{0, 0.25, 0.5, 0.75, 1.0\}$, we expose the model to a $\lambda$-fraction of $\mathcal{D}_{\text{post}}$ during pretraining and reserve the remaining $(1-\lambda)$ fraction for post-training, so every model sees exactly one pass over $\mathcal{D}_{\text{post}}$ in total. Thus, $\lambda = 0$ assigns the full budget to dedicated post-training, while $\lambda = 1$ assigns it entirely to mixed pretraining. We evaluate this allocation study with $\mathcal{D}_{\text{post}} \subset \text{MusicPile}$ at three dataset sizes, $|\mathcal{D}_{\text{post}}| \in \{30\text{M}, 150\text{M}, 300\text{M}\}$, and use $\mathcal{D}_{\text{ft}} \subset \text{ChemPile}$ for downstream fine-tuning. We report $\mathcal{L}_{\mathrm{im}}$ after Stage 2 and $\mathcal{L}_{\mathrm{ret}}$ after Stage 3 fine-tuning at a fixed learning rate of $5 \times 10^{-5}$ (Figure [3(b)](https://arxiv.org/html/2605.12705#S4.F3.sf2)).
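The compute-matched allocation can be written down in a few lines. The sketch below assumes the dataset sizes are token counts and that the split is made by token count; the function name is hypothetical. Its only purpose is to show that the two stages always sum to exactly one pass over $\mathcal{D}_{\text{post}}$.

```python
from typing import Tuple


def split_post_budget(post_tokens: int, lam: float) -> Tuple[int, int]:
    """Split a fixed D_post budget between Stage 1 and Stage 2.

    Returns (tokens mixed into pretraining, tokens reserved for post-training).
    The two parts always sum to `post_tokens`, i.e. one pass in total.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lambda must lie in [0, 1]")
    stage1 = round(lam * post_tokens)
    stage2 = post_tokens - stage1
    return stage1, stage2


if __name__ == "__main__":
    for size in (30_000_000, 150_000_000, 300_000_000):
        for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
            stage1, stage2 = split_post_budget(size, lam)
            print(f"|D_post|={size:>11,d}  lambda={lam:.2f}  "
                  f"stage1={stage1:>11,d}  stage2={stage2:>11,d}")
```

In contrast to the study in Section 4.1, where Stage 2 always runs to convergence, here raising $\lambda$ directly removes tokens from Stage 2, which is why immediate post-training loss is expected to respond to the allocation.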

Result. Figure [3(b)](https://arxiv.org/html/2605.12705#S4.F3.sf2) reveals a clear tradeoff. As $\lambda$ increases, the immediate post-training loss $\mathcal{L}_{\mathrm{im}}$ worsens: under a fixed $\mathcal{D}_{\text{post}}$ budget, allocating more post-training data to pretraining leaves fewer tokens for the post-training stage that exclusively optimizes for $\mathcal{D}_{\text{post}}$. The downstream picture is different. Increasing $\lambda$ consistently improves retained performance after Stage 3 fine-tuning, lowering $\mathcal{L}_{\mathrm{ret}}$ across dataset sizes. As a result, the best immediate post-training performance is achieved by allocating all of $\mathcal{D}_{\text{post}}$ to Stage 2, while the best retained performance after downstream adaptation is achieved at a nonzero mixture fraction. Even under a fixed data budget, the optimum therefore lies between the two extremes of all-post-training and all-pretraining allocation. This suggests that dedicated post-training and early exposure are doing something meaningfully different: concentrating $\mathcal{D}_{\text{post}}$ in Stage 2 yields stronger immediate fitting to the post-training domain, while exposure during pretraining makes that capability less brittle under later training. We provide a theoretical account of this in Section [5](https://arxiv.org/html/2605.12705#S5).

Takeaway: Under a fixed $\mathcal{D}_{\text{post}}$ budget, the best immediate post-training performance occurs when all data is reserved for Stage 2, but the best retained performance after downstream fine-tuning occurs at a positive mixture fraction.

### 4\.3 Early exposure improves the loss frontier across hyperparameter sweeps

The controlled studies above isolate two key phenomena: the benefit of mixing can be latent, and under a fixed post\-training\-data budget the best retained performance is achieved at a nonzero mixture fraction\. We now ask whether these conclusions persist once we move beyond controlled comparisons to the more realistic setting in which both post\-training and downstream fine\-tuning offer many tunable degrees of freedom\. In practice, an upstream developer can vary post\-training hyperparameters to reach different tradeoffs between adaptation and retention, while a downstream end user may likewise vary fine\-tuning hyperparameters to target different operating points\. We therefore evaluate each upstream strategy not by a single checkpoint, but by the frontier of attainable checkpoints it induces across broad Stage 2 post\-training sweeps and Stage 3 fine\-tuning sweeps\.

Setup\.For each pipeline in Table[1](https://arxiv.org/html/2605.12705#S3.T1), we sweep Stage 2 post\-training hyperparameters under both unmixed and mixed pretraining, then fine\-tune every resulting checkpoint on the downstream objective using a range of Stage 3 learning rates, and evaluate the paired frontier views shown in Figure[2](https://arxiv.org/html/2605.12705#S1.F2)\. Within each pipeline, the left panel plots retained post\-training loss against downstream fine\-tuning loss, while the right panel plots retained pretraining loss against retained post\-training loss\.

Result. Across all four pipelines, early exposure consistently shifts the frontier relative to unmixed pretraining. In the $(\mathcal{L}_{\mathrm{ret}}, \mathcal{L}_{\mathrm{ft}})$ view, mixing yields lower retained post-training loss at matched downstream fine-tuning loss, indicating greater robustness of the post-trained capability under subsequent adaptation. In the $(\mathcal{L}_{\mathrm{pre}}, \mathcal{L}_{\mathrm{ret}})$ view, mixing also improves the tradeoff between preserving broader pretraining capabilities and preserving the post-trained capability. These gains appear across both domain and behavioral post-training settings, suggesting that the benefit of mixing is not confined to a single dataset pair or narrow training regime. We further find that a similar qualitative frontier improvement appears in our 1B experiments, suggesting that the benefit of mixing persists beyond the small-model setting (Figures [8](https://arxiv.org/html/2605.12705#A2.F8) and [9](https://arxiv.org/html/2605.12705#A2.F9)).

Takeaway: Across hyperparameter sweeps and training pipelines, early exposure consistently improves the attainable tradeoffs among downstream fine-tuning performance, retained post-training performance, and retained pretraining performance.

### 4\.4 Replay and dropout provide complementary gains

![Refer to caption](https://arxiv.org/html/2605.12705v1/x4.png)

Figure 4: Replay and dropout provide complementary gains on top of mixed pretraining. Each subfigure shows one 3-stage pipeline. Within each subfigure, the left panel compares unmixed pretraining, mixed pretraining, and mixed pretraining + dropout, while the right panel compares unmixed pretraining, mixed pretraining, and mixed pretraining + replay. Across both downstream settings, adding dropout or replay to mixed pretraining further shifts the loss frontier, indicating that these post-training interventions provide complementary gains rather than replacing the effect of pretraining-time mixing.

Mixing $\mathcal{D}_{\text{post}}$ into pretraining is one way to shape how the model acquires the target post-training capability, but it is not the only one. We next study two alternative upstream interventions that act during post-training itself: *replay*, which mixes a small amount of general-domain data into post-training, and *dropout*, which regularizes the post-training update. We view both as interventions on *how* the model learns $\mathcal{D}_{\text{post}}$, rather than simply how much performance it achieves immediately after post-training.

Setup. We evaluate replay and dropout in the same three-stage framework as above, using broad Stage 2 hyperparameter sweeps and, for every resulting checkpoint, sweeping Stage 3 learning rates before measuring the resulting frontier between downstream fine-tuning loss $\mathcal{L}_{\mathrm{ft}}$ and retained post-training loss $\mathcal{L}_{\mathrm{ret}}$. Replay is implemented by mixing a small fraction (1%, following Bethune et al. ([2025](https://arxiv.org/html/2605.12705#bib.bib3))) of general-domain data from $\mathcal{D}_{\textrm{pre}}$ into post-training, while dropout is applied during post-training as a regularizer.
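A minimal sketch of replay as described in the setup, assuming the mixing happens at the example level: each post-training example is drawn from the general-domain pool with probability 0.01 and from the post-training pool otherwise. The function `replay_stream`, the sampling granularity, and the toy data are assumptions made for illustration; the setup above does not specify them.

```python
import random
from typing import Iterator, Sequence


def replay_stream(post: Sequence[str], general: Sequence[str],
                  replay_frac: float = 0.01, n: int = 1000,
                  seed: int = 0) -> Iterator[str]:
    """Yield n examples, each taken from `general` with probability replay_frac."""
    rng = random.Random(seed)  # local generator so runs are reproducible
    for _ in range(n):
        pool = general if rng.random() < replay_frac else post
        yield rng.choice(pool)


# Toy stand-ins for the post-training and general-domain corpora (hypothetical).
post_data = [f"post_{i}" for i in range(100)]
general_data = [f"gen_{i}" for i in range(100)]

batch = list(replay_stream(post_data, general_data, replay_frac=0.01, n=10000))
frac = sum(x.startswith("gen_") for x in batch) / len(batch)
print(f"observed replay fraction: {frac:.4f}")
```

Mixing at the token or batch level would be an equally plausible reading of a 1% replay fraction; the sketch only fixes one option to make the idea concrete.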

Result. Figure [4](https://arxiv.org/html/2605.12705#S4.F4) shows that both replay and dropout further improve the loss frontier relative to early exposure alone. Replay encourages the model to acquire the post-training domain without simply overwriting broader pretraining features, while dropout may promote more distributed and robust representations during Stage 2 learning. These gains persist after downstream fine-tuning in both representative pipelines we study, indicating that post-training interventions can meaningfully improve robustness to later adaptation.

At 1B, the effect of replay on the $(\mathcal{L}_{\mathrm{ft}},\mathcal{L}_{\mathrm{ret}})$ frontier is weaker. However, replay remains useful for preserving broader general-domain performance, suggesting that its benefits may shift across scales and objectives (figures in Appendix [B.4.2](https://arxiv.org/html/2605.12705#A2.SS4.SSS2)).

Complementarity with mixing. Neither replay nor dropout eliminates the value of pretraining-time mixing. Instead, both remain complementary to mixing: the strongest frontiers are obtained by combining post-training interventions with mixed pretraining. This reinforces the broader picture developed above. Pretraining-time mixing plays a distinct role by changing when the model first encounters the post-training capability, while replay and dropout shape how that capability is learned during Stage 2.

Takeaway: Replay and dropout are upstream alternatives to mixing during pretraining that can also improve robustness to subsequent fine-tuning, and their gains are complementary with early exposure.

## 5 Theoretical Analysis of Early Exposure

We find that mixing even a small fraction of $\mathcal{D}_{\textrm{post}}$ into pretraining substantially improves retention under subsequent fine-tuning, and that this effect is complementary to regularizers applied during post-training. This is striking, as there is little reason to expect that a small change to the pretraining corpus should persistently shape retention two stages downstream. Moreover, these benefits persist across a range of post-training interventions, suggesting that mixing operates on a different axis than techniques that regularize the post-training or fine-tuning stages. Broadly, our findings indicate that early exposure has a distinct effect on how capabilities are implemented in the model, which then propagates through subsequent training stages.

We now formally characterize how early exposure affects the way post\-trained capabilities are implemented in the model and their vulnerability to future forgetting\. We analyze our three\-stage pipeline in a two\-layer linear model, which enables a precise characterization of feature learning during pretraining and its impact on subsequent training dynamics\.

Setup. We train a two-layer linear network $\theta=\mathbf{W}_{1}\mathbf{W}_{2}$ sequentially on three regression tasks simulating $\mathcal{D}_{\textrm{pre}}$, $\mathcal{D}_{\textrm{post}}$, and $\mathcal{D}_{\textrm{ft}}$, using gradient descent on the squared loss $\mathcal{L}(\theta;\mathcal{D})=\mathbf{E}[\|\theta\mathbf{x}-\mathbf{y}\|^{2}]$. Each task is defined by an input distribution and a ground-truth linear map $\mathbf{A}^{t}$, with $\mathbf{y}=\mathbf{A}^{t}\mathbf{x}$. Following Springer et al. ([2025](https://arxiv.org/html/2605.12705#bib.bib22)), we assume all tasks share singular vectors $\mathbf{U},\mathbf{V}$ so that $\mathbf{A}^{t}=\mathbf{U}\mathbf{\Sigma}_{t}\mathbf{V}^{\top}$; the singular values of $\mathbf{\Sigma}_{t}$ are the *features* of task $t$.

Feature structure. We consider the input space to be partitioned into three blocks that behave differently across the tasks:

- Invariant features ($n-2k$ features) have identical singular values across $\mathbf{A}^{\textrm{gen}}$, $\mathbf{A}^{\textrm{spec}}$, and $\mathbf{A}^{\textrm{ft}}$ (the ground-truth maps of the pretraining, post-training, and downstream tasks, respectively).
- Inconsistent features ($k$ features) lie in shared dimensions where the tasks disagree: their singular values differ between $\mathbf{A}^{\textrm{gen}}$, $\mathbf{A}^{\textrm{spec}}$, and $\mathbf{A}^{\textrm{ft}}$.
- Specialized features ($k$ features) are active only on $\mathcal{D}_{\textrm{post}}$, while having zero covariance under pretraining and downstream distributions.

Task definitions. The pretraining task $\mathcal{D}_{\textrm{pre}}$ draws inputs from $x\sim\mathcal{N}(0,\mathbf{I}_{n-k})$, activating only shared dimensions. The post-training task $\mathcal{D}_{\textrm{post}}$ draws inputs from $x\sim\mathcal{N}(0,\mathbf{I}_{n})$, activating all dimensions including the specialized ones. Like pretraining, the downstream task $\mathcal{D}_{\textrm{ft}}$ draws inputs from $x\sim\mathcal{N}(0,\mathbf{I}_{n-k})$; critically, it does not activate the specialized features. We model early exposure by training on the distribution $\mathcal{D}_{\textrm{mixed}}=(1-\alpha)\mathcal{D}_{\textrm{pre}}+\alpha\mathcal{D}_{\textrm{post}}$. Full details and formal assumptions are given in Appendix [C](https://arxiv.org/html/2605.12705#A3).
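Under the shared-singular-vector assumption, the squared loss separates into independent per-feature terms. The following is a sketch under two simplifying assumptions of ours, which this section does not state in this form: the learned map stays aligned with the shared basis, $\theta=\mathbf{U}\,\mathrm{diag}(s_1,\dots,s_n)\,\mathbf{V}^{\top}$ with orthonormal $\mathbf{U},\mathbf{V}$, and the input covariance of task $t$ is $\mathbf{V}\,\mathrm{diag}(c^{t}_1,\dots,c^{t}_n)\,\mathbf{V}^{\top}$, where $c^{t}_i\in\{0,1\}$ marks whether feature $i$ is active on task $t$. Writing $\sigma^{t}_i$ for the $i$-th singular value of $\mathbf{\Sigma}_t$:

$$
\mathcal{L}(\theta;\mathcal{D}_t)
=\mathbf{E}\big[\|(\theta-\mathbf{A}^{t})\mathbf{x}\|^{2}\big]
=\operatorname{tr}\!\big((\theta-\mathbf{A}^{t})\,\mathbf{E}[\mathbf{x}\mathbf{x}^{\top}]\,(\theta-\mathbf{A}^{t})^{\top}\big)
=\sum_{i=1}^{n} c^{t}_i\,\big(s_i-\sigma^{t}_i\big)^{2}.
$$

For the mixture, the loss is the corresponding weighted sum, $\mathcal{L}(\theta;\mathcal{D}_{\textrm{mixed}})=(1-\alpha)\,\mathcal{L}(\theta;\mathcal{D}_{\textrm{pre}})+\alpha\,\mathcal{L}(\theta;\mathcal{D}_{\textrm{post}})$. A specialized feature has $c^{\textrm{pre}}_i=0$ and $c^{\textrm{post}}_i=1$, so it contributes nothing to the unmixed pretraining loss and $\alpha\,(s_i-\sigma^{\textrm{post}}_i)^{2}$ to the mixed one.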

### 5\.1 Early exposure learns different features

We first characterize what features each pretraining strategy learns\.

###### Theorem 5.1 (Informal: Only Early Exposure Learns Specialized Features).

Let $\theta^{\textrm{mixed}}$ and $\theta^{\textrm{unmixed}}$ be the parameters learned by pretraining on $\mathcal{D}_{\textrm{mixed}}$ and $\mathcal{D}_{\textrm{pre}}$, respectively, after sufficient training. Then $\theta^{\textrm{mixed}}$ learns the specialized features, while $\theta^{\textrm{unmixed}}$ does not.

The key mechanism is that linear networks learn features in descending order of their singular value (Gidel et al., [2019](https://arxiv.org/html/2605.12705#bib.bib8); Springer et al., [2025](https://arxiv.org/html/2605.12705#bib.bib22)), with remaining features staying near zero. Without early exposure, the $k$ specialized features have the lowest singular values and are not learned. Early exposure, on the other hand, boosts their effective singular value above the learning threshold. See Appendix [C](https://arxiv.org/html/2605.12705#A3) for the formal statement.
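The ordering claim can be illustrated with a mode-wise simulation. This is a sketch, assuming the two-layer network stays aligned with the shared singular basis, so that each feature evolves independently as a product `a * b` of one scalar factor per layer starting from a small balanced initialization. The function name, step size, and singular values are illustrative choices, not values from the paper.

```python
def steps_to_learn(sigma: float, weight: float = 1.0, init: float = 1e-3,
                   lr: float = 0.01, max_steps: int = 50_000) -> int:
    """Gradient-descent steps until a*b reaches half of sigma, or -1 if it never does.

    Minimizes 0.5 * weight * (a*b - sigma)**2, where `weight` is the input
    variance that the training distribution places on this feature.
    """
    a = b = init
    for step in range(max_steps):
        if a * b >= 0.5 * sigma:
            return step
        err = a * b - sigma
        # Simultaneous update of both layer factors for this feature.
        a, b = a - lr * weight * err * b, b - lr * weight * err * a
    return -1


# Larger singular values are picked up earlier.
for sigma in (4.0, 2.0, 1.0):
    print(f"sigma={sigma}: learned after {steps_to_learn(sigma)} steps")

# A feature with zero input variance receives no gradient and is never learned.
print("inactive feature (weight=0):", steps_to_learn(2.0, weight=0.0))

# A small positive weight (hypothetical alpha=0.1) is slower but still learns.
print("mixed feature (weight=0.1):", steps_to_learn(2.0, weight=0.1))
```

In this toy setting, a feature's learning time is governed by the product of its input weight and its singular value, which is one way to read the "effective singular value" in the text.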

Impact of a Small Mixing Ratio. The mixing fraction $\alpha$ need not be large to have an impact. Suppose the specialized features have a singular value $\beta$ in the post-training task ($A^{\textrm{post}}$). Without exposure to the post-training data, specialized features have zero effective singular value and are thus never learned. Mixing an $\alpha$-fraction of $\mathcal{D}_{\textrm{post}}$ boosts the effective singular value to $\alpha\beta$, so when $\beta$ is sufficiently large, even small $\alpha$ suffices to cross the threshold at which specialized features are learned. Empirically, we also observe benefits from early exposure even when the quantity of mixed data is small relative to the total pretraining corpus.
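One way to see where $\alpha\beta$ comes from, assuming (as our reading, not something this section states explicitly) that the effective singular value of a feature is the corresponding singular value of the input-output cross-covariance $\mathbf{E}[\mathbf{y}\mathbf{x}^{\top}]$. Let $\mathbf{u}_i,\mathbf{v}_i$ be the shared singular vectors of a specialized feature $i$. Then:

$$
\mathbf{u}_i^{\top}\,\mathbf{E}_{\mathcal{D}_{\textrm{mixed}}}[\mathbf{y}\mathbf{x}^{\top}]\,\mathbf{v}_i
=(1-\alpha)\,\underbrace{\mathbf{u}_i^{\top}\,\mathbf{E}_{\mathcal{D}_{\textrm{pre}}}[\mathbf{y}\mathbf{x}^{\top}]\,\mathbf{v}_i}_{=\,0}
+\alpha\,\underbrace{\mathbf{u}_i^{\top}\,\mathbf{E}_{\mathcal{D}_{\textrm{post}}}[\mathbf{y}\mathbf{x}^{\top}]\,\mathbf{v}_i}_{=\,\beta}
=\alpha\beta .
$$

The first term vanishes because pretraining inputs have zero variance along $\mathbf{v}_i$. The second equals $\beta$ because post-training inputs are isotropic, so the cross-covariance is the ground-truth map itself.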

### 5\.2 Post\-training primarily reuses existing features

Next, we study the impact of the different features learned by early exposure on the post\-training process\.

###### Theorem 5.2 (Informal: Post-training on $\theta^{\textrm{mixed}}$ versus $\theta^{\textrm{unmixed}}$).

After sufficient post-training on $\mathcal{D}_{\textrm{post}}$, $\theta^{\textrm{mixed}}_{\textrm{post}}$ converges to parameters that leverage specialized features to reduce post-training loss, while $\theta^{\textrm{unmixed}}_{\textrm{post}}$ converges to parameters that only modify the inconsistent features.

The key insight is that features absent at initialization remain absent throughout post-training: if $\Sigma_{ii}=0$ at the start of post-training, it stays $0$. Since $\theta^{\textrm{unmixed}}$ never learned the specialized features, post-training from this checkpoint can only reduce loss on $\mathcal{D}_{\textrm{post}}$ by distorting the inconsistent features. Post-training from $\theta^{\textrm{mixed}}$, having learned the specialized features during pretraining, can additionally improve loss on $\mathcal{D}_{\textrm{post}}$ using the specialized features.

Cost of Specialization to General Performance. Our analysis shows that, without early exposure, adaptation to $\mathcal{D}_{\textrm{post}}$ proceeds by modifying features that also support general capabilities (namely, the inconsistent features), thereby degrading performance on the pretraining task (i.e., increasing loss on $\mathcal{D}_{\textrm{pre}}$). In contrast, early exposure enables the formation of specialized features that support post-training without interfering with other tasks. Consistent with the predictions of our analytical model, we observe that post-training following early exposure incurs less degradation in C4 loss than post-training from a base model without early exposure (Figure [2](https://arxiv.org/html/2605.12705#S1.F2)).
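The persistence of absent features can be checked directly in a mode-wise view. As a sketch, assume feature $i$ is parameterized as a product $s_i=a_ib_i$ of one scalar factor per layer, and that "absent" means both factors are zero (simplifying assumptions on our part; the formal conditions are in Appendix C). Gradient descent with step size $\eta$ on the per-feature post-training loss $\tfrac{1}{2}(a_ib_i-\sigma_i)^{2}$ gives

$$
a_i \leftarrow a_i-\eta\,(a_ib_i-\sigma_i)\,b_i,
\qquad
b_i \leftarrow b_i-\eta\,(a_ib_i-\sigma_i)\,a_i .
$$

Each update is proportional to the other layer's factor, so $a_i=b_i=0$ is a fixed point: the gradient vanishes even though the loss $\tfrac{1}{2}\sigma_i^{2}$ is nonzero.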

### 5\.3 Characterizing downstream forgetting behavior

Finally, we show that the different feature usage established above directly determines forgetting under downstream fine\-tuning\.

###### Theorem 5.3 (Informal: $\theta^{\textrm{unmixed}}_{\textrm{post}}$ experiences more forgetting than $\theta_{\textrm{post}}^{\textrm{mixed}}$).

Let $\Delta_{\textrm{mixed}},\Delta_{\textrm{unmixed}}$ denote the increase in loss on $\mathcal{D}_{\textrm{post}}$ after fine-tuning on $\mathcal{D}_{\textrm{ft}}$, for the mixed and unmixed checkpoints, respectively. Then $\Delta_{\textrm{unmixed}}\geq\Delta_{\textrm{mixed}}$.

Because $\mathcal{D}_{\textrm{ft}}$ has no covariance along the specialized features, gradient updates during fine-tuning have zero projection along those directions. Inconsistent features, however, overlap with $\mathcal{D}_{\textrm{ft}}$ and are overwritten. Without early exposure, all the loss reduction on $\mathcal{D}_{\textrm{post}}$ is achieved by modifying the inconsistent features and is therefore vulnerable to erasure during fine-tuning. With early exposure, loss reduction is also implemented in isolated specialized features, enabling it to persist after further training. This provides a formal account of the frontier shift observed empirically: the mixed checkpoint's retention advantage traces back to feature learning during pretraining.

Summary. Ultimately, our analysis shows that even limited exposure to post-training data during pretraining induces the formation of *specialized features* for that domain. These features are isolated from both $\mathcal{D}_{\textrm{gen}}$ and $\mathcal{D}_{\textrm{ft}}$, rendering them resistant to overwriting during fine-tuning. In contrast, without early exposure, post-training relies on modifying broadly shared features to reduce loss. This makes the adaptation inherently fragile and prone to forgetting under subsequent fine-tuning.
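The argument above can be traced in a toy, mode-wise version of Stages 2 and 3 with one inconsistent and one specialized feature. This is a sketch under several assumptions of ours: each feature is a product of two balanced scalar factors, the Stage 1 outcome is taken from Theorem 5.1 rather than simulated (specialized feature equal to `beta` for the mixed checkpoint and exactly zero for the unmixed one), and all numeric values are hypothetical.

```python
import math


def gd_feature(s0: float, target: float, weight: float,
               lr: float = 0.05, steps: int = 2000) -> float:
    """Run gradient descent on 0.5 * weight * (a*b - target)**2 and return a*b.

    The feature starts balanced (a = b = sqrt(s0)); `weight` is the input
    variance the current task places on this feature (0 means inactive).
    """
    a = b = math.sqrt(s0)
    for _ in range(steps):
        err = a * b - target
        a, b = a - lr * weight * err * b, b - lr * weight * err * a
    return a * b


def retained_post_loss(s_inc: float, s_spec: float, sig_post: float,
                       sig_ft: float, beta: float) -> float:
    """Post-train, fine-tune, then return squared error on the post-training task."""
    # Stage 2: the post-training task activates both features.
    s_inc = gd_feature(s_inc, sig_post, weight=1.0)
    s_spec = gd_feature(s_spec, beta, weight=1.0)
    # Stage 3: fine-tuning inputs have zero variance along the specialized feature.
    s_inc = gd_feature(s_inc, sig_ft, weight=1.0)
    s_spec = gd_feature(s_spec, 0.0, weight=0.0)
    return (s_inc - sig_post) ** 2 + (s_spec - beta) ** 2


# Hypothetical values: inconsistent-feature targets per task, beta, and alpha.
sig_gen, sig_post, sig_ft, beta, alpha = 3.0, 1.0, 2.0, 2.0, 0.1

# Assumed Stage 1 outcomes: the unmixed checkpoint has no specialized feature;
# the mixed checkpoint sits at the minimizer of the mixture loss.
unmixed = retained_post_loss(sig_gen, 0.0, sig_post, sig_ft, beta)
mixed = retained_post_loss((1 - alpha) * sig_gen + alpha * sig_post, beta,
                           sig_post, sig_ft, beta)
print(f"retained post-training loss  unmixed: {unmixed:.3f}  mixed: {mixed:.3f}")
```

In this simplified setting, fine-tuning overwrites the inconsistent feature identically for both checkpoints. The gap in retained loss comes entirely from the specialized feature, which fine-tuning cannot change and which the unmixed checkpoint never acquired.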

## 6 Conclusion

Main themes. This work advances three claims. First, how well a capability survives later fine-tuning cannot be read off from how well a model performs on that capability immediately after it is acquired; two training recipes that look equivalent at the handoff can diverge substantially once the model is adapted downstream. Second, the *manner* in which a capability is learned (when it is introduced during training, how it is presented, and what else the model is learning at the same time) shapes how durable that capability will be under later updates. Third, upstream training offers not a single knob but a family of interventions: early exposure during pretraining, replay during post-training, and regularization during post-training each shift the retention–adaptation frontier, and their effects are largely complementary rather than substitutable. Together, these observations reframe robustness to fine-tuning from a downstream problem to be mitigated into a design objective of upstream training.

Limitations. Our experiments restrict the amount of post-training data introduced during pretraining to at most a single pass over that corpus; we do not characterize what happens when the post-training data is repeated much more aggressively, for instance by allocating several percent of the total pretraining budget to post-training data, which would imply many repetitions of a small corpus (see Baek et al. ([2026](https://arxiv.org/html/2605.12705#bib.bib2))). We focus on a single downstream adaptation method and do not characterize how early exposure interacts with preference-based or reinforcement learning fine-tuning schemes. We provide additional LoRA experiments in Appendix [B.3](https://arxiv.org/html/2605.12705#A2.SS3).

Future work. A fuller characterization of when early exposure helps (across varying degrees of overlap between upstream and downstream data, across a wider range of exposure levels, and into regimes of heavy repetition of post-training data) would sharpen practical guidance for upstream developers. Extending the study to larger models and longer training runs would test whether upstream interventions continue to shift the retention frontier at scale. A complementary line of work is algorithmic: whether new objectives or regularizers applied later in training can approximate the representational effect of early exposure, without modifying the pretraining corpus.

## References

- Allal et al\. \(2025\)Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, Joshua Lochner, Caleb Fahlgren, Xuan\-Son Nguyen, Clémentine Fourrier, Ben Burtenshaw, Hugo Larcher, Haojun Zhao, Cyril Zakka, Mathieu Morlon, Colin Raffel, Leandro von Werra, and Thomas Wolf\.Smollm2: When smol goes big – data\-centric training of a small language model, 2025\.URL[https://arxiv\.org/abs/2502\.02737](https://arxiv.org/abs/2502.02737)\.
- Baek et al\. \(2026\)Christina Baek, Ricardo Pio Monti, David Schwab, Amro Abbas, Rishabh Adiga, Cody Blakeney, Maximilian Böther, Paul Burstein, Aldo Gael Carranza, Alvin Deng, Parth Doshi, Vineeth Dorna, Alex Fang, Tony Jiang, Siddharth Joshi, Brett W\. Larsen, Jason Chan Lee, Katherine L\. Mentzer, Luke Merrick, Haakon Mongstad, Fan Pan, Anshuman Suri, Darren Teh, Jason Telanoff, Jack Urbanek, Zhengping Wang, Josh Wills, Haoli Yin, Aditi Raghunathan, J\. Zico Kolter, Bogdan Gaza, Ari Morcos, Matthew Leavitt, and Pratyush Maini\.The finetuner’s fallacy: When to pretrain with your finetuning data, 2026\.URL[https://arxiv\.org/abs/2603\.16177](https://arxiv.org/abs/2603.16177)\.
- Bethune et al\. \(2025\)Louis Bethune, David Grangier, Dan Busbridge, Eleonora Gualdoni, Marco Cuturi, and Pierre Ablin\.Scaling laws for forgetting during finetuning with pretraining data injection, 2025\.URL[https://arxiv\.org/abs/2502\.06042](https://arxiv.org/abs/2502.06042)\.
- Biderman et al\. \(2024\)Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, and John P\. Cunningham\.Lora learns less and forgets less, 2024\.URL[https://arxiv\.org/abs/2405\.09673](https://arxiv.org/abs/2405.09673)\.
- Chen et al\. \(2025\)Yangyi Chen, Binxuan Huang, Yifan Gao, Zhengyang Wang, Jingfeng Yang, and Heng Ji\.Scaling laws for predicting downstream performance in llms, 2025\.URL[https://arxiv\.org/abs/2410\.08527](https://arxiv.org/abs/2410.08527)\.
- Du et al\. \(2024\)Zhengxiao Du, Aohan Zeng, Yuxiao Dong, and Jie Tang\.Understanding emergent abilities of language models from the loss perspective\.In*The Thirty\-eighth Annual Conference on Neural Information Processing Systems*, 2024\.
- Gadre et al\. \(2024\)Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Luca Soldaini, Alexandros G\. Dimakis, Gabriel Ilharco, Pang Wei Koh, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, and Ludwig Schmidt\.Language models scale reliably with over\-training and on downstream tasks, 2024\.URL[https://arxiv\.org/abs/2403\.08540](https://arxiv.org/abs/2403.08540)\.
- Gidel et al\. \(2019\)Gauthier Gidel, Francis Bach, and Simon Lacoste\-Julien\.Implicit regularization of discrete gradient dynamics in linear neural networks, 2019\.URL[https://arxiv\.org/abs/1904\.13262](https://arxiv.org/abs/1904.13262)\.
- Hu et al\. \(2021\)Edward J\. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen\-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen\.Lora: Low\-rank adaptation of large language models, 2021\.URL[https://arxiv\.org/abs/2106\.09685](https://arxiv.org/abs/2106.09685)\.
- Kirkpatrick et al\. \(2017\)James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A\. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska\-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell\.Overcoming catastrophic forgetting in neural networks\.*Proceedings of the National Academy of Sciences*, 114\(13\):3521–3526, March 2017\.ISSN 1091\-6490\.doi:10\.1073/pnas\.1611835114\.URL[http://dx\.doi\.org/10\.1073/pnas\.1611835114](http://dx.doi.org/10.1073/pnas.1611835114)\.
- Kotha & Liang \(2026\)Suhas Kotha and Percy Liang\.Replaying pre\-training data improves fine\-tuning, 2026\.URL[https://arxiv\.org/abs/2603\.04964](https://arxiv.org/abs/2603.04964)\.
- Maini et al\. \(2024\)Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C\. Lipton, and J\. Zico Kolter\.Tofu: A task of fictitious unlearning for llms, 2024\.URL[https://arxiv\.org/abs/2401\.06121](https://arxiv.org/abs/2401.06121)\.
- Maini et al\. \(2025\)Pratyush Maini, Sachin Goyal, Dylan Sam, Alex Robey, Yash Savani, Yiding Jiang, Andy Zou, Matt Fredrikson, Zachary C\. Lipton, and J\. Zico Kolter\.Safety pretraining: Toward the next generation of safe ai, 2025\.URL[https://arxiv\.org/abs/2504\.16980](https://arxiv.org/abs/2504.16980)\.
- McCloskey & Cohen \(1989\)Michael McCloskey and Neal J Cohen\.Catastrophic interference in connectionist networks: The sequential learning problem\.In*Psychology of learning and motivation*, volume 24, pp\. 109–165\. Elsevier, 1989\.
- Nishi et al\. \(2025\)Kento Nishi, Rahul Ramesh, Maya Okawa, Mikail Khona, Hidenori Tanaka, and Ekdeep Singh Lubana\.Representation shattering in transformers: A synthetic study with knowledge editing, 2025\.URL[https://arxiv\.org/abs/2410\.17194](https://arxiv.org/abs/2410.17194)\.
- O’Brien et al\. \(2025\)Kyle O’Brien, Stephen Casper, Quentin Anthony, Tomek Korbak, Robert Kirk, Xander Davies, Ishan Mishra, Geoffrey Irving, Yarin Gal, and Stella Biderman\.Deep ignorance: Filtering pretraining data builds tamper\-resistant safeguards into open\-weight llms, 2025\.URL[https://arxiv\.org/abs/2508\.06601](https://arxiv.org/abs/2508.06601)\.
- Olmo et al\. \(2025\)Team Olmo, :, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, Pradeep Dasigi, Robert Berry, Saumya Malik, Saurabh Shah, Scott Geng, Shane Arora, Shashank Gupta, Taira Anderson, Teng Xiao, Tyler Murray, Tyler Romero, Victoria Graf, Akari Asai, Akshita Bhagia, Alexander Wettig, Alisa Liu, Aman Rangapur, Chloe Anastasiades, Costa Huang, Dustin Schwenk, Harsh Trivedi, Ian Magnusson, Jaron Lochner, Jiacheng Liu, Lester James V\. Miranda, Maarten Sap, Malia Morgan, Michael Schmitz, Michal Guerquin, Michael Wilson, Regan Huff, Ronan Le Bras, Rui Xin, Rulin Shao, Sam Skjonsberg, Shannon Zejiang Shen, Shuyue Stella Li, Tucker Wilde, Valentina Pyatkin, Will Merrill, Yapei Chang, Yuling Gu, Zhiyuan Zeng, Ashish Sabharwal, Luke Zettlemoyer, Pang Wei Koh, Ali Farhadi, Noah A\. Smith, and Hannaneh Hajishirzi\.Olmo 3, 2025\.URL[https://arxiv\.org/abs/2512\.13961](https://arxiv.org/abs/2512.13961)\.
- Ouyang et al\. \(2022\)Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L\. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe\.Training language models to follow instructions with human feedback, 2022\.URL[https://arxiv\.org/abs/2203\.02155](https://arxiv.org/abs/2203.02155)\.
- Qi et al\. \(2023\)Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin\-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson\.Fine\-tuning aligned language models compromises safety, even when users do not intend to\!, 2023\.URL[https://arxiv\.org/abs/2310\.03693](https://arxiv.org/abs/2310.03693)\.
- Rolnick et al\. \(2019\)David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy P\. Lillicrap, and Greg Wayne\.Experience replay for continual learning, 2019\.URL[https://arxiv\.org/abs/1811\.11682](https://arxiv.org/abs/1811.11682)\.
- Sam et al\. \(2026\)Dylan Sam, Sachin Goyal, Pratyush Maini, Alexander Robey, and J\. Zico Kolter\.When should we introduce safety interventions during pretraining?, 2026\.URL[https://arxiv\.org/abs/2601\.07087](https://arxiv.org/abs/2601.07087)\.
- Springer et al\. \(2025\)Jacob Mitchell Springer, Sachin Goyal, Kaiyue Wen, Tanishq Kumar, Xiang Yue, Sadhika Malladi, Graham Neubig, and Aditi Raghunathan\.Overtrained language models are harder to fine\-tune, 2025\.URL[https://arxiv\.org/abs/2503\.19206](https://arxiv.org/abs/2503.19206)\.
- Srivastava et al\. \(2014\)Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov\.Dropout: a simple way to prevent neural networks from overfitting\.*J\. Mach\. Learn\. Res\.*, 15\(1\):1929–1958, January 2014\.ISSN 1532\-4435\.
- Wortsman et al\. \(2022a\)Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo\-Lopes, Ari S\. Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt\.Model soups: averaging weights of multiple fine\-tuned models improves accuracy without increasing inference time, 2022a\.URL[https://arxiv\.org/abs/2203\.05482](https://arxiv.org/abs/2203.05482)\.
- Wortsman et al\. \(2022b\)Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo\-Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, and Ludwig Schmidt\.Robust fine\-tuning of zero\-shot models, 2022b\.URL[https://arxiv\.org/abs/2109\.01903](https://arxiv.org/abs/2109.01903)\.
- Yang et al\. \(2025\)An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang, Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang, Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, and Zihan Qiu\.Qwen3 technical report, 2025\.URL[https://arxiv\.org/abs/2505\.09388](https://arxiv.org/abs/2505.09388)\.
- Yang et al\. \(2023\)Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, and Dahua Lin\.Shadow alignment: The ease of subverting safely\-aligned language models, 2023\.URL[https://arxiv\.org/abs/2310\.02949](https://arxiv.org/abs/2310.02949)\.

## Contributions

Lawrence Feng led the project and conducted all the main experiments\.

Gaurav R\. Ghosal, Ziqian Zhong, Jacob Mitchell Springer, and Aditi Raghunathan were involved throughout the project, including project direction, experimental design, analysis, and paper writing\.

## Acknowledgments

We gratefully acknowledge support from Apple, Google, Jane Street, the National Science Foundation and the FLAME cluster at Carnegie Mellon University\.

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No\. DGE2140739\. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation\.

We’d like to thank Christina Baek for her insights on the dynamics of pretraining and fine\-tuning\. We’d like to thank Kevin Li for his feedback on earlier versions of this work\.

## Appendix A Training Details

### A.1 SmolLM2-1B model architecture

Table 2: SmolLM2-1B model architecture (custom config interpolated from SmolLM2 family).

### A.2 Dataset statistics

Table 3: 135M Parameter Experiments

Table 4: 1B Parameter Experiments

### A.3 Optimizer configuration

Table 5: Optimizer configuration used across all experiments.

### A.4 1B Stage 1 pretraining

Table 6: Stage 1 pretraining configuration for 1B experiments.

### A.5 Post-training hyperparameters

Table 7: FFT hyperparameter search space for Stage 2 post-training (135M).

Table 8: Stage 2 FFT hyperparameter search space for 1B-scale post-training.

### A\.6 LoRA configuration

Table 9: LoRA hyperparameter configuration for Stage 2 post-training.

| Parameter | Values | Notes |
|---|---|---|
| LoRA rank ($r$) | 64 | Fixed |
| LoRA alpha ($\alpha$) | 128 | Fixed, $\alpha/r=2$ |
| LoRA dropout | {0.0, 0.02, 0.05} | Same as FFT dropout |
| LoRA targets | projection, mlp, head | Q/K/V excluded |
| Learning rate | {1e-4, 2e-4, 5e-4, 1e-3, 5e-3} | Same as FFT |
| Weight decay | 0.1 | Same as FFT |
| Other parameters | Same as FFT (Table [7](https://arxiv.org/html/2605.12705#A1.T7)) | |

### A\.7 1B Stage 2 CPT: dropout ablation

Table 10: Dropout ablation configuration for 1B Stage 2 CPT. MusicPile CPT pipeline only. 12 runs total (2 $\lambda$ values $\times$ 3 LRs $\times$ 2 dropout rates).

## Appendix B Additional Plots

### B\.1 135M dropout and replay frontiers with retained pretraining loss \(C4\)

![Refer to caption](https://arxiv.org/html/2605.12705v1/x5.png) Figure 5: Dropout and replay preserve broader pretraining capability in addition to the post-training capability (135M). Companion to Figure [4](https://arxiv.org/html/2605.12705#S4.F4), plotting the same Stage 2 hyperparameter sweeps against retained pretraining loss on C4 instead of downstream fine-tuning loss. Within each pipeline, the left panel compares unmixed pretraining, mixed pretraining, and mixed pretraining + dropout, and the right panel compares unmixed pretraining, mixed pretraining, and mixed pretraining + replay. Across both downstream settings, adding dropout or replay to mixed pretraining further lowers retained pretraining loss at matched retained post-training loss, indicating that these post-training interventions protect broader pretraining capabilities as well as the targeted post-trained capability. (a) C4 → MusicPile → ChemPile. (b) C4 → MusicPile → FLAN.

### B\.2 Additional 135M dropout and replay frontiers \(dropout and replay without mixing\)

![Refer to caption](https://arxiv.org/html/2605.12705v1/x6.png)

Figure 6: Dropout and replay applied without pretraining-time mixing (135M). To isolate the effect of post-training interventions from pretraining-time mixing, each panel applies dropout or replay on top of *unmixed* pretraining ($\lambda = 0$), with the mixed-pretraining frontier shown for reference. Within each pipeline, the left panel adds dropout during Stage 2 post-training; the right panel adds a small fraction (1%) of general-domain replay. Both interventions shift the fine-tuning–retention frontier, but less than pretraining-time mixing alone, reinforcing that mixing acts on a distinct axis from Stage 2 regularization.

### B\.3 135M LoRA experiments

![Refer to caption](https://arxiv.org/html/2605.12705v1/x7.png)

Figure 7: FFT vs LoRA fine-tuning–retention frontiers (135M). Each panel shows four frontiers obtained by sweeping Stage 2 post-training hyperparameters and Stage 3 fine-tuning learning rates: FFT with unmixed pretraining (black circles, solid), FFT with mixed pretraining (purple circles, solid), LoRA with unmixed pretraining (black squares, dashed), and LoRA with mixed pretraining (purple squares, dashed). Mixed pretraining improves both the FFT and LoRA frontiers in both downstream settings, and FFT generally attains a better fine-tuning–retention tradeoff than LoRA at matched upstream configurations. This suggests that the benefit of pretraining-time mixing is not specific to a particular fine-tuning method.

### B\.4 1B experiments

Black denotes the frontier obtained from unmixed pretraining, and purple denotes the frontier obtained from mixed pretraining.

#### B\.4\.1 Mixing frontiers

![Refer to caption](https://arxiv.org/html/2605.12705v1/x8.png)

Figure 8: Mixing frontiers at 1B, MusicPile post-training pipelines. Companion to Figure [2](https://arxiv.org/html/2605.12705#S1.F2) at 1B scale. Within each pipeline, the left panel plots retained post-training loss against downstream fine-tuning loss, and the right panel plots retained pretraining loss against retained post-training loss. As at the 135M scale, mixed pretraining consistently shifts the frontier toward lower retained post-training loss, lower retained pretraining loss, and lower downstream fine-tuning loss, indicating that the benefit of early exposure to post-training data persists beyond the small-model setting.

![Refer to caption](https://arxiv.org/html/2605.12705v1/x9.png)

Figure 9: Mixing frontiers at 1B, FLAN post-training pipeline. Left: retained post-training loss vs fine-tuning loss. Right: retained pretraining loss (C4) vs retained post-training loss. In this pipeline, mixed pretraining does not noticeably improve the retained post-training vs fine-tuning frontier (left), but it does improve the retained pretraining vs retained post-training frontier (right), indicating that the benefit of early exposure here is concentrated in preserving broader pretraining capabilities rather than further improving the post-training/fine-tuning tradeoff.

#### B\.4\.2 Upstream dropout and replay at 1B parameters

![Refer to caption](https://arxiv.org/html/2605.12705v1/x10.png)

Figure 10: Replay and dropout provide complementary gains on top of mixed pretraining at 1B. Companion to Figure [4](https://arxiv.org/html/2605.12705#S4.F4) at 1B scale. The left panel compares unmixed pretraining, mixed pretraining, and mixed pretraining + dropout; the right panel compares unmixed pretraining, mixed pretraining, and mixed pretraining + replay. As at 135M scale, both dropout and replay further shift the fine-tuning–retention frontier beyond mixed pretraining alone, indicating that these post-training interventions provide complementary gains at the larger scale as well.

![Refer to caption](https://arxiv.org/html/2605.12705v1/x11.png)

Figure 11: Dropout and replay on top of mixed pretraining, broader pretraining retention (1B). Companion to Figure [5](https://arxiv.org/html/2605.12705#A2.F5) at 1B scale. The same Stage 2 sweeps are plotted against retained pretraining loss on C4. At this scale, dropout at the rate we swept degrades C4 retention relative to mixed pretraining alone, which we attribute to insufficient tuning of the dropout rate; replay continues to preserve and often improves C4 retention, consistent with its role of keeping general-domain data present during Stage 2.

![Refer to caption](https://arxiv.org/html/2605.12705v1/x12.png)

Figure 12: Dropout and replay applied without pretraining-time mixing (1B). Companion to Figure [6](https://arxiv.org/html/2605.12705#A2.F6) at 1B scale. Each panel applies dropout or replay on top of unmixed pretraining ($\lambda = 0$), with the mixed-pretraining frontier shown for reference. At 1B, dropout continues to shift the fine-tuning–retention frontier relative to the unmixed baseline, while replay alone has a weaker effect, consistent with our observation in the main text that the relative strength of replay as a frontier-shifting intervention diminishes at this scale.

## Appendix C Theoretical Analysis

### C\.1 Preliminaries and Setup

##### Model and data distribution

We consider a two-layer linear network $\mathbf{x} \mapsto \theta\mathbf{x}$ with $\theta = \mathbf{W}_1\mathbf{W}_2$, trained on a series of regression problems using the squared loss. In particular, the problems take the form $\mathcal{L}_t(\theta) = \mathbf{E}_{x \sim \mathcal{D}_t}\left[\|\theta x - \mathbf{A}^t x\|_2^2\right]$, where $t \in \{\textrm{pre}, \textrm{post}, \textrm{ft}\}$ indexes the current stage of training. Here, $\mathcal{D}_t$ denotes the input distribution and the ground-truth outputs are generated as $\mathbf{A}^t\mathbf{X}$, where $\mathbf{X} \sim \mathcal{D}_t$. Following the analysis in Springer et al. ([2025](https://arxiv.org/html/2605.12705#bib.bib22)), we treat the singular values and vectors of $\mathbf{A}^t$ as the learned features for training task $t$.

###### Assumption C.1 (Simultaneous Diagonalizability).

There are orthonormal matrices $\mathbf{U}, \mathbf{V}$ such that for $t \in \{\textrm{pre}, \textrm{post}, \textrm{ft}\}$ we can write $\mathbf{A}^t = \mathbf{U}\mathbf{\Sigma}_t\mathbf{V}^\top$, where all the $\mathbf{\Sigma}_t$ are diagonal matrices.

In order to model transfer and interference between the distributions, we next specify a structure on the relationships between the different features. We first assume the presence of *invariant features*, capturing common linguistic capabilities that are broadly applicable across domains and tasks. Across these definitions, we assume a consistent indexing of the singular values (although the ordering of the singular values by magnitude may differ). In the following, we use the notations $\sigma^t_i$ and $(\mathbf{\Sigma}_t)_{ii}$ interchangeably for the singular values of the task covariances, that is, the ground-truth value of feature $i$ in stage $t$.

###### Definition C.2 (Invariant Features).

For $i \in [1, n-2k]$, we have that $(\mathbf{\Sigma}_{\textrm{pre}})_{ii} = (\mathbf{\Sigma}_{\textrm{post}})_{ii} = (\mathbf{\Sigma}_{\textrm{ft}})_{ii}$. For clarity and to emphasize their static nature, we will often denote the values of the invariant features as $\sigma^{\textrm{inv}}_1, \ldots, \sigma^{\textrm{inv}}_{n-2k}$. For conciseness, we will also use $d_{\textrm{invariant}} = n - 2k$ to refer to the number of invariant features.

In addition to these highly general features, we also consider the features through which the model may learn more domain specific information\. We consider that such specialization can be implemented through one of two pathways:

###### Definition C.3 (Inconsistent Features).

We define a feature (indexed by $i$) to be inconsistent if $(\mathbf{\Sigma}_{\textrm{post}})_{ii} > (\mathbf{\Sigma}_{\textrm{pre}})_{ii}$ and $(\mathbf{\Sigma}_{\textrm{post}})_{ii} - (\mathbf{\Sigma}_{\textrm{pre}})_{ii} > c_{\textrm{mis}}$.

Inconsistent features therefore incur a tradeoff between reducing loss on $\mathcal{D}_{\textrm{post}}$ and preserving performance on $\mathcal{D}_{\textrm{pre}}$. Finally, we introduce *specialized features*, which do not incur such a tradeoff.

###### Definition C.4 (Specialized Features).

We say that feature $i$ is *specialized* if $(\mathbf{V}^\top)_i\mathbf{x} = 0$ and $(\mathbf{\Sigma}_{\textrm{post}})_{ii} > 0$. For simplicity, we will assume that all specialized features take the same value $(\mathbf{\Sigma}_{\textrm{post}})_{ii} = \beta$. We will also assume that $\beta < \frac{1}{2}c_{\textrm{mis}}$, which encodes that inconsistent features shift by a relatively large magnitude across the distributions.

Intuitively, $\mathcal{D}_{\textrm{pre}}$ and $\mathcal{D}_{\textrm{ft}}$ have no covariance along the specialized feature directions. As we will show, gradient steps taken on these distributions therefore cause no interference along these directions. However, as a result of this zero covariance, these features are also impossible to learn without explicitly seeing the post-training data.

##### Downstream Tuning Task

We consider that the downstream tuning task is relatively more similar to the pretraining task than the post-training task. As such, we consider that the inputs are sampled according to $x \sim \mathcal{N}(0, \mathbf{I}_{n-k})$ (i.e., they do not activate the specialized features). As previously, the singular values corresponding to the invariant features remain constant. We will also consider that $\mathbf{A}^{\textrm{post}}$ and $\mathbf{A}^{\textrm{ft}}$ diverge on the inconsistent features. Concretely, we have that $(\mathbf{\Sigma}_{\textrm{post}})_{ii} > (\mathbf{\Sigma}_{\textrm{ft}})_{ii}$ and $(\mathbf{\Sigma}_{\textrm{post}})_{ii} - (\mathbf{\Sigma}_{\textrm{ft}})_{ii} > c_{\textrm{mis}}$.

##### Mixed Training

We parameterize the mixed distribution by a parameter $\alpha$ and train on the distribution $\mathcal{D}_{\alpha,\textrm{mixed}} = (1-\alpha)\mathcal{D}_{\textrm{pre}} + \alpha\mathcal{D}_{\textrm{post}}$. As all distributions in our setting have mean zero, the covariance matrix of this mixture of Gaussian distributions is $(1-\alpha)\mathbf{\Sigma}_{\textrm{pre}} + \alpha\mathbf{\Sigma}_{\textrm{post}}$.
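The covariance identity for a zero-mean mixture can be checked empirically. The sketch below is illustrative only: the function name, the two diagonal covariances, the sample size, and the seed are all hypothetical choices, and sampling error means the match is approximate.

```python
import numpy as np


def mixture_covariance(cov_pre, cov_post, alpha, n=200_000, seed=0):
    """Empirical second moment of samples drawn from
    (1 - alpha) * N(0, cov_pre) + alpha * N(0, cov_post)."""
    rng = np.random.default_rng(seed)
    d = cov_pre.shape[0]
    # Each sample comes from the post-training component with probability alpha.
    from_post = rng.random(n) < alpha
    x_pre = rng.multivariate_normal(np.zeros(d), cov_pre, size=n)
    x_post = rng.multivariate_normal(np.zeros(d), cov_post, size=n)
    x = np.where(from_post[:, None], x_post, x_pre)
    # Both components have mean zero, so the covariance is E[x x^T].
    return x.T @ x / n


cov_pre = np.diag([1.0, 0.5])    # hypothetical pretraining covariance
cov_post = np.diag([0.2, 2.0])   # hypothetical post-training covariance
alpha = 0.25
empirical = mixture_covariance(cov_pre, cov_post, alpha)
expected = (1 - alpha) * cov_pre + alpha * cov_post
print(np.round(empirical, 2))
print(expected)
```

The identity relies on both components having mean zero; with non-zero means, an extra between-component term would appear.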

###### Assumption C.5 (Invariant Features are High Magnitude).

We consider that the invariant features are higher magnitude than the specialized features and the inconsistent features. Concretely:

$$\sigma_i^{\textrm{pre}} > \sigma_j^{\textrm{pre}} \quad \forall i \in [0, d_{\textrm{invariant}}] \text{ and } \forall j \in (d_{\textrm{invariant}}, n).$$

We make a similar assumption on the relationship between the invariant features and the specialized features. Concretely:

$$\sigma_i^{\textrm{pre}} > \sigma_j^{\textrm{post}} \quad \forall i \in [0, d_{\textrm{invariant}}] \text{ and } \forall j \in (d_{\textrm{invariant}}, n).$$

This intuitively encodes that the invariant features correspond to the strongest directions in the data.

###### Assumption C.6 (Sufficient Specialized Mixing).

We assume that there exists $\alpha \in [0,1]$ such that

$$\alpha\beta > (1-\alpha)\sigma^{\textrm{pre}}_i + \alpha\sigma_i^{\textrm{post}} \quad \forall i \in [d_{\textrm{invariant}}, d_{\textrm{invariant}} + k].$$

Intuitively, Assumption [C.6](https://arxiv.org/html/2605.12705#A3.Ex8) states that there exists a mixing ratio at which the specialized features become more salient than the inconsistent features in the mixed distribution. However, this mixing ratio need not be high if the strength of the specialized feature is high in the covariance of $\mathcal{D}_{\textrm{post}}$.

### C\.2 Analysis of Initial Pretraining

Here, we study the dynamics of the pretraining stage. We first introduce an important result on the sequential learning dynamics of features in two-layer linear networks (Gidel et al., [2019](https://arxiv.org/html/2605.12705#bib.bib8)). For a given pretraining task where $\mathbf{X}, \mathbf{Y}$ represent the inputs and outputs, respectively, we define $\mathbf{\Sigma}_{xy} = \frac{1}{n}\mathbf{X}^\top\mathbf{Y}$ and $\mathbf{\Sigma}_x = \frac{1}{n}\mathbf{X}^\top\mathbf{X}$. We write $\mathbf{\Sigma}_{xy} = \sum_{i=1}^{R_{xy}} \sigma_i\mathbf{u}_i\mathbf{v}_i^\top$, where $R_{xy}$ is the rank of $\mathbf{\Sigma}_{xy}$. We will also assume that:

###### Assumption C.7 (Joint Decomposition).

There exist orthogonal matrices $\mathbf{U}, \mathbf{V}$ such that

$$\mathbf{\Sigma}_{xy} = \mathbf{U}\mathbf{D}_{xy}\mathbf{V}^\top, \qquad \mathbf{\Sigma}_{xx} = \mathbf{U}\mathbf{D}_{xx}\mathbf{U}^\top, \tag{1}$$

and we will denote the singular values of $\mathbf{\Sigma}_{xy}$ as $\sigma_1, \ldots, \sigma_{R_{xy}}$ and the diagonal entries of $\mathbf{D}_{xx}$ as $\lambda_1, \ldots, \lambda_{R_x} = 1$.

Next, we characterize the initialization scale of the model before pretraining. Following Springer et al. ([2025](https://arxiv.org/html/2605.12705#bib.bib22)), we assume the following initialization:

###### Assumption C.8 (Pretrained Initialization Scale).

Let $(\mathbf{W}_1(0), \mathbf{W}_2(0))$ be the parameters at initialization. Then we have that $\mathbf{W}_1(0) = \mathbf{W}_2(0) = \exp(-\mathcal{T})\,\mathbf{I}_d$.

Essentially, Assumption [C.8](https://arxiv.org/html/2605.12705#A3.Thmtheorem8) requires that the model parameters are close to, but not exactly, $0$, which yields *sequential feature learning*. We next explicitly restate the result from Gidel et al. ([2019](https://arxiv.org/html/2605.12705#bib.bib8)).

###### Theorem C.9 (Sequential Learning of Features; Gidel et al. ([2019](https://arxiv.org/html/2605.12705#bib.bib8)); Springer et al. ([2025](https://arxiv.org/html/2605.12705#bib.bib22))).

Suppose $\mathbf{W}_1, \mathbf{W}_2$ obey the initialization in Assumption [C.8](https://arxiv.org/html/2605.12705#A3.Thmtheorem8) and the pretraining task obeys Assumption [C.7](https://arxiv.org/html/2605.12705#A3.Thmtheorem7). Then there exist times $t_1, \ldots, t_r$ such that

$$\|\mathbf{W}_1(t_i) - \mathbf{U}(\Sigma_{:i})^{\frac{1}{2}}\|_F \leq \exp(-C\tau), \qquad \|\mathbf{W}_2(t_i) - (\Sigma_{:i})^{\frac{1}{2}}\mathbf{V}^\top\|_F \leq \exp(-C\tau),$$

where $\Sigma_{:i}$ is defined to be $\textrm{diag}(\sigma_1, \ldots, \sigma_i, 0, \ldots, 0)$, equivalently the rank-$i$ approximation of $\textrm{diag}(\sigma_1, \ldots, \sigma_{R_{xy}})$.

Conceptually, Theorem [C.9](https://arxiv.org/html/2605.12705#A3.Thmtheorem9) demonstrates that during pretraining, $\mathbf{W}_1\mathbf{W}_2$ learns features in decreasing order of the corresponding singular values of $\Sigma_{xy}$. Next, we apply this result to compare the features learned during mixed and non-mixed pretraining.
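The ordering described by Theorem C.9 can be observed in a small simulation. This is a minimal sketch rather than the authors' setup: it assumes whitened inputs, $\mathbf{U} = \mathbf{V} = \mathbf{I}$, a diagonal target, full-batch gradient descent on $\frac{1}{2}\|\mathbf{W}_1\mathbf{W}_2 - \mathbf{A}\|_F^2$, and hypothetical values for the singular values, initialization scale, and learning rate.

```python
import numpy as np


def learning_order(singular_values, init_scale=1e-3, lr=0.01, steps=3000):
    """Train a two-layer linear network from a small identity-scaled
    initialization and record when each feature is half learned."""
    s = np.asarray(singular_values, dtype=float)
    d = len(s)
    A = np.diag(s)                      # target map with U = V = I (assumption)
    W1 = init_scale * np.eye(d)
    W2 = init_scale * np.eye(d)
    half_time = np.full(d, -1)          # first step at which feature i reaches s_i / 2
    for t in range(steps):
        R = W1 @ W2 - A                 # residual of the end-to-end map
        g1 = R @ W2.T                   # gradient w.r.t. W1
        g2 = W1.T @ R                   # gradient w.r.t. W2
        W1 -= lr * g1
        W2 -= lr * g2
        learned = np.diag(W1 @ W2)
        for i in range(d):
            if half_time[i] < 0 and learned[i] >= 0.5 * s[i]:
                half_time[i] = t + 1
    return half_time, W1 @ W2


half_time, theta = learning_order([3.0, 1.5, 0.5])
print(half_time)            # step counts increase as the singular value decreases
print(np.round(np.diag(theta), 3))
```

In this toy run the feature with the largest singular value crosses its halfway point first and the smallest last, matching the qualitative statement of the theorem.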

###### Theorem C.10 (Only Mixing Learns Specialized Features).

Let $\theta^{\textrm{gen}}(t) = \mathbf{W}^{\textrm{gen}}_1(t)\mathbf{W}^{\textrm{gen}}_2(t)$ be the parameters learned when pretraining only on $\mathcal{D}_{\textrm{gen}}$ (the unmixed case), and let $\theta^{\textrm{mixed}}(t) = \mathbf{W}^{\textrm{mixed}}_1(t)\mathbf{W}^{\textrm{mixed}}_2(t)$ be the parameters learned under mixed pretraining. Denote by $\mathbf{u}_{\textrm{spec}}, \mathbf{v}_{\textrm{spec}}$ the left and right singular vectors corresponding to the specialized feature. Then, there exists a time $t$ such that

$$\begin{aligned}
\|\mathbf{W}_1^{(\textrm{unmixed})} - \mathbf{U}(\mathbf{\Sigma}^{(\textrm{unmixed})})^{\frac{1}{2}}\|_F &\leq \exp(-C\tau), \\
\|\mathbf{W}_2^{(\textrm{unmixed})} - (\mathbf{\Sigma}^{(\textrm{unmixed})})^{\frac{1}{2}}\mathbf{V}^\top\|_F &\leq \exp(-C\tau), \\
\|\mathbf{W}_1^{(\textrm{mixed})} - \mathbf{U}(\mathbf{\Sigma}^{(\textrm{mixed})})^{\frac{1}{2}}\|_F &\leq \exp(-C\tau), \\
\|\mathbf{W}_2^{(\textrm{mixed})} - (\mathbf{\Sigma}^{(\textrm{mixed})})^{\frac{1}{2}}\mathbf{V}^\top\|_F &\leq \exp(-C\tau),
\end{aligned}$$

where $\mathbf{\Sigma}^{(\textrm{mixed})} = \textrm{diag}(\sigma^{\textrm{inv}}_1, \ldots, \sigma^{\textrm{inv}}_{n-2k}, \mathbf{0}_k, \alpha\beta, \ldots, \alpha\beta)$ and $\mathbf{\Sigma}^{(\textrm{unmixed})} = \textrm{diag}(\sigma^{\textrm{inv}}_1, \ldots, \sigma^{\textrm{inv}}_{n-2k}, \sigma^{\textrm{post}}_1, \ldots, \sigma^{\textrm{post}}_k, \mathbf{0}_k)$.

###### Proof\.

This is a relatively straightforward application of Theorem [C.9](https://arxiv.org/html/2605.12705#A3.Thmtheorem9). Denote $(\mathbf{X}^{(\textrm{mixed})}, \mathbf{Y}^{(\textrm{mixed})})$ as the data used for mixed pretraining and $\mathbf{\Sigma}_{xy}^{(\textrm{mixed})} = \frac{1}{n}(\mathbf{X}^{(\textrm{mixed})})^\top\mathbf{Y}^{(\textrm{mixed})}$. We have that $\mathbf{\Sigma}_{xy}^{(\textrm{mixed})} = (1-\alpha)\mathbf{\Sigma}^{\textrm{pre}}_{xy} + \alpha\mathbf{\Sigma}^{\textrm{spec}}_{xy}$, which follows from the fact that the $\Sigma_{xy}$ are submatrices of the covariance matrix of Gaussian random vectors. By Theorem [C.9](https://arxiv.org/html/2605.12705#A3.Thmtheorem9), the features are learned in order of the singular values of $\mathbf{\Sigma}_{xy}$. By Assumptions [C.5](https://arxiv.org/html/2605.12705#A3.Thmtheorem5) and [C.6](https://arxiv.org/html/2605.12705#A3.Ex8), the top $n-k$ singular values of $\mathbf{\Sigma}_{xy}^{(\textrm{mixed})}$ are the $n-2k$ shared features and the $k$ specialized features. Define $\mathbf{\Sigma}^{(\textrm{mixed})}_{:n-k} = \textrm{diag}(\sigma^{\textrm{inv}}_1, \ldots, \sigma^{\textrm{inv}}_{n-2k}, \mathbf{0}_k, \alpha\beta, \ldots, \alpha\beta)$. Applying Theorem [C.9](https://arxiv.org/html/2605.12705#A3.Thmtheorem9), we have that

$$\|\mathbf{W}_1^{(\textrm{mixed})} - \mathbf{U}(\mathbf{\Sigma}^{(\textrm{mixed})}_{:n-k})^{\frac{1}{2}}\|_F \leq \exp(-C\tau), \qquad \|\mathbf{W}_2^{(\textrm{mixed})} - (\mathbf{\Sigma}^{(\textrm{mixed})}_{:n-k})^{\frac{1}{2}}\mathbf{V}^\top\|_F \leq \exp(-C\tau).$$

Repeating this analysis for unmixed training, the top $n-k$ singular values of $\mathbf{\Sigma}^{(\textrm{gen})}_{xy}$ are the $n-2k$ shared features and the $k$ inconsistent features. We can define $\mathbf{\Sigma}^{(\textrm{unmixed})}_{:n-k} = \textrm{diag}(\sigma^{\textrm{inv}}_1, \ldots, \sigma^{\textrm{inv}}_{n-2k}, \sigma^{\textrm{post}}_1, \ldots, \sigma^{\textrm{post}}_k, \mathbf{0}_k)$. Similarly, by applying Theorem [C.9](https://arxiv.org/html/2605.12705#A3.Thmtheorem9), we have that

$$\|\mathbf{W}_1^{(\textrm{unmixed})} - \mathbf{U}(\mathbf{\Sigma}^{(\textrm{unmixed})}_{:n-k})^{\frac{1}{2}}\|_F \leq \exp(-C\tau), \qquad \|\mathbf{W}_2^{(\textrm{unmixed})} - (\mathbf{\Sigma}^{(\textrm{unmixed})}_{:n-k})^{\frac{1}{2}}\mathbf{V}^\top\|_F \leq \exp(-C\tau).$$

∎

Intuitively, our result in Theorem [C.10](https://arxiv.org/html/2605.12705#A3.Thmtheorem10) demonstrates that mixing $\mathcal{D}_{\textrm{post}}$ during pretraining results in a pretrained initialization that has different features. Mixing learns the specialized features, while not mixing learns only the inconsistent features. In the following, we examine the impact that these different features have on the retention of $\mathcal{D}_{\textrm{post}}$ during subsequent training.

### C\.3 Analysis of Post\-Training

We now study the dynamics of the post-training process. To formalize the post-training process, we first examine the dynamics beginning from the idealized pretraining initialization (as performed by Springer et al. ([2025](https://arxiv.org/html/2605.12705#bib.bib22))). We perform the post-training stage on the regularized loss $\mathbf{E}\left[\|\theta x - \mathbf{A}^{\textrm{sp}}x\|_F^2\right] + \lambda\|\theta - \theta_0\|_F^2$. Observe that because $x \sim \mathcal{N}(0, \mathbf{I}_d)$, the expectation term equals $\|\theta - \mathbf{A}^{\textrm{sp}}\|_F^2$. We follow the assumptions on the regularity of fine-tuning established in Springer et al. ([2025](https://arxiv.org/html/2605.12705#bib.bib22)).

###### Assumption C.11 (Bound on Parameters Throughout Training).

$$\begin{aligned}
\|\hat{\mathbf{W}}^{(\textrm{mixed})}_1\|_{\textrm{op}} &\leq \sqrt{\Gamma}, &
\|\hat{\mathbf{W}}^{(\textrm{mixed})}_1\|_{\textrm{op}} &\leq \sqrt{\Gamma}, \\
\|\hat{\mathbf{W}}^{(\textrm{unmixed})}_1\|_{\textrm{op}} &\leq \sqrt{\Gamma}, &
\|\hat{\mathbf{W}}^{(\textrm{unmixed})}_1\|_{\textrm{op}} &\leq \sqrt{\Gamma}.
\end{aligned}$$

Moreover, we assume that the regularization strength and the learning rates are likewise bounded\.

###### Assumption C.12 (Bound on Learning Rate).

$$4\eta(\lambda + 2)\Gamma < 1$$

##### Idealized Pretraining Initialization

We denote the ideal initialization parameters for the mixed and unmixed cases by $(\hat{\mathbf{W}}_1(0), \hat{\mathbf{W}}_2(0))$. For the mixed initialization:

$$\hat{\mathbf{W}}_1^{(\textrm{mixed})}(0) = \mathbf{U}(\Sigma^{\textrm{mixed}}_{:n-k})^{\frac{1}{2}}, \qquad \hat{\mathbf{W}}_2^{(\textrm{mixed})}(0) = (\Sigma^{\textrm{mixed}}_{:n-k})^{\frac{1}{2}}\mathbf{V}^\top.$$

Similarly, we have the following idealized initialization for the unmixed case:

$$\hat{\mathbf{W}}_1^{(\textrm{unmixed})}(0) = \mathbf{U}(\Sigma^{\textrm{unmixed}}_{:n-k})^{\frac{1}{2}}, \qquad \hat{\mathbf{W}}_2^{(\textrm{unmixed})}(0) = (\Sigma^{\textrm{unmixed}}_{:n-k})^{\frac{1}{2}}\mathbf{V}^\top.$$

In the idealized setting, we can track the evolution of each singular value independently. In particular, we have the following update rule as derived in Springer et al. ([2025](https://arxiv.org/html/2605.12705#bib.bib22)), where we denote by $\sigma_i^{\textrm{spec}}$ the $i$-th singular value of $\mathbf{A}^{\textrm{spec}}$ and by $\sigma^{\textrm{(un)mixed}}_i(t)$ the $i$-th singular value at step $t$. In what follows, we suppress the superscript for compactness:

$$\sigma_i(t+1) = \sigma_i(t) - 2\eta\,\sigma_i(t)\left(\sigma_i(t)^2 - (\sigma_{\textrm{spec},i})^2\right) + 2\eta\lambda\left(\sigma_i(t)^2 - \sigma_i(0)^2\right) \tag{2}$$

As a result, note that when $\sigma_i^{\textrm{(un)mixed}}(0) = 0$, we have $\sigma_i^{\textrm{(un)mixed}}(t) = 0$ for all $t$.
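The recursion in Eq. (2) is easy to iterate directly, which makes the "zero stays zero" observation concrete. The sketch below transcribes the update term by term as it is written above; the function names and the numerical values of $\eta$, $\lambda$, the target, and the step count are hypothetical choices for illustration.

```python
def singular_value_step(sigma, sigma_init, sigma_target, eta, lam):
    """One step of the per-singular-value recursion in Eq. (2)."""
    fit_term = 2.0 * eta * sigma * (sigma**2 - sigma_target**2)
    reg_term = 2.0 * eta * lam * (sigma**2 - sigma_init**2)
    return sigma - fit_term + reg_term


def run(sigma_init, sigma_target, eta=0.01, lam=0.1, steps=2000):
    """Iterate the recursion for a fixed number of steps."""
    sigma = sigma_init
    for _ in range(steps):
        sigma = singular_value_step(sigma, sigma_init, sigma_target, eta, lam)
    return sigma


# A feature that is absent at initialization is never picked up.
print(run(sigma_init=0.0, sigma_target=2.0))
# A feature that is present at initialization moves toward the target.
print(run(sigma_init=1.0, sigma_target=2.0))
```

Both terms of the update vanish at $\sigma_i(t) = \sigma_i(0) = 0$, so the first trajectory stays at exactly zero, while the second settles near the target value.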

Next, we study the dynamics of the non-zero singular values (Lemma A.11 of Springer et al. ([2025](https://arxiv.org/html/2605.12705#bib.bib22))). We will assume that post-training is performed for a sufficient number of steps.

###### Assumption C.13 (Sufficient Post-Training Steps).

The number of post-training steps (denoted by $K$) satisfies $K \geq \frac{1}{\lambda c_{\textrm{min}}}\log\frac{100\Gamma}{\epsilon}$, for a constant $\epsilon < \frac{c_{\textrm{mis}}}{2c_{\textrm{mis}} - 4\beta}$, where $c_{\textrm{min}} = \min\{(\mathbf{\Sigma}_{\textrm{post}})_{ii} \mid (\mathbf{\Sigma}_{\textrm{post}})_{ii} \neq 0\}$, that is, the minimum non-zero singular value.

Given these technical conditions, we now state a general result (adapted for our setting from Springer et al. ([2025](https://arxiv.org/html/2605.12705#bib.bib22))).

###### Lemma C.14.

When training on $\mathcal{D}_{\textrm{post}}$ with infinite batch size from the ideal pretraining initialization and taking a sufficient number of steps $K$, for all $i \in \textrm{rank}(\theta_n(0))$, we have that

$$\left|(\mathbf{U}^\top\hat{\theta}^{\textrm{(un)mixed}}(t)\mathbf{V})_{ii} - (\mathbf{\Sigma}_{\textrm{post}})_{ii}\right| \leq \epsilon, \tag{3}$$

where $\mathbf{\Sigma}_{\textrm{post}}$ is such that $\mathbf{A}^{\textrm{post}} = \mathbf{U}\mathbf{\Sigma}_{\textrm{post}}\mathbf{V}^\top$.

Now, we define the matrices

$$\mathbf{\Sigma}^{\textrm{shared,post}} = \textrm{diag}(\sigma^{\textrm{inv}}_1, \ldots, \sigma^{\textrm{inv}}_{n-2k}, \sigma^{\textrm{post}}_{d_{\textrm{invariant}}+1}, \ldots, \sigma^{\textrm{post}}_{d_{\textrm{invariant}}+k}, \mathbf{0}_k)$$

and

$$\mathbf{\Sigma}^{\textrm{spec,post}} = \textrm{diag}(\sigma^{\textrm{inv}}_1, \ldots, \sigma^{\textrm{inv}}_{n-2k}, \mathbf{0}_k, \sigma^{\textrm{post}}_{d_{\textrm{invariant}}+k+1}, \ldots, \sigma^{\textrm{post}}_{d_{\textrm{invariant}}+2k}).$$

Intuitively, $\mathbf{\Sigma}^{\textrm{shared,post}}$ lowers loss on $\mathcal{D}_{\textrm{post}}$ by shifting the values on the shared features, while $\mathbf{\Sigma}^{\textrm{spec,post}}$ accomplishes this by modifying the unique features. We are now ready to state the main theorem.

###### Theorem C.15 (Post-training on $\theta^{\textrm{mixed}}$ versus $\theta^{\textrm{unmixed}}$).

Let $\theta^{\textrm{unmixed}}(K)$ denote the parameters after training for $K$ steps starting from the idealized unmixed initialization, and let $\theta^{\textrm{mixed}}(K)$ be the same starting from the idealized mixed checkpoint. Then we have

$$\|\mathbf{U}^\top\theta^{\textrm{mixed}}\mathbf{V} - \mathbf{\Sigma}^{\textrm{spec,post}}\|_{\textrm{op}} \leq \epsilon, \qquad \|\mathbf{U}^\top\theta^{\textrm{unmixed}}\mathbf{V} - \mathbf{\Sigma}^{\textrm{shared,post}}\|_{\textrm{op}} \leq \epsilon.$$

###### Proof\.

This theorem follows by noting that under the idealized pretraining initialization any singular value that is0at initialization remains that way during the entire optimization trajectory\. Note that the0singular values of𝐔⊤​θmixed​𝐕\\mathbf\{U\}^\{\\top\}\\theta^\{\\textrm\{mixed\}\}\\mathbf\{V\}coincide with𝚺spec, post\\mathbf\{\\Sigma\}^\{\\textrm\{spec, post\}\}\(the specialized features\) and likewise𝐔⊤​θunmixed​𝐕\\mathbf\{U\}^\{\\top\}\\theta^\{\\textrm\{unmixed\}\}\\mathbf\{V\}coincide with𝚺shared, post\\mathbf\{\\Sigma\}^\{\\textrm\{shared, post\}\}\(the inconsistent features\)\.

This implies that we have

$$\begin{aligned}
\max_{i\in[1,n]}\left|(\mathbf{U}^{\top}\theta^{\textrm{mixed}}\mathbf{V})_{ii}-(\mathbf{\Sigma}^{\textrm{spec,post}})_{ii}\right|&\leq\max_{i\in\textrm{rank}(\theta)}\left|(\mathbf{U}^{\top}\theta^{\textrm{mixed}}\mathbf{V})_{ii}-(\mathbf{\Sigma}^{\textrm{spec,post}})_{ii}\right|,\\
\max_{i\in[1,n]}\left|(\mathbf{U}^{\top}\theta^{\textrm{unmixed}}\mathbf{V})_{ii}-(\mathbf{\Sigma}^{\textrm{shared,post}})_{ii}\right|&\leq\max_{i\in\textrm{rank}(\theta)}\left|(\mathbf{U}^{\top}\theta^{\textrm{unmixed}}\mathbf{V})_{ii}-(\mathbf{\Sigma}^{\textrm{shared,post}})_{ii}\right|.
\end{aligned}$$

Now, applying the result from Lemma [C.14](https://arxiv.org/html/2605.12705#A3.Thmtheorem14) yields the desired claim. ∎
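As a rough numerical illustration of the proof idea (a toy sketch, not the authors' code), the snippet below builds the diagonals of $\mathbf{\Sigma}^{\textrm{shared,post}}$ and $\mathbf{\Sigma}^{\textrm{spec,post}}$ and iterates per-mode gradient descent. It assumes a balanced, diagonal factorization in the $\mathbf{U}$, $\mathbf{V}$ basis with identity input covariance, so that each singular value $s=a^{2}$ evolves through its factor as $a\leftarrow a-2\eta(a^{2}-t)a$. The function names, initial values, target values, and step size are all hypothetical; the paper's actual initializations and loss are defined earlier and are not reproduced here.

```python
import math


def post_training_targets(sigma_inv, sigma_post_shared, sigma_post_spec):
    """Return the diagonals of Sigma^{shared,post} and Sigma^{spec,post}.

    sigma_inv holds the invariant singular values; the two remaining
    arguments hold k values each for the shared and specialized blocks.
    """
    k = len(sigma_post_shared)
    if len(sigma_post_spec) != k:
        raise ValueError("shared and specialized blocks must both have size k")
    zeros = [0.0] * k
    shared = list(sigma_inv) + list(sigma_post_shared) + zeros
    spec = list(sigma_inv) + zeros + list(sigma_post_spec)
    return shared, spec


def train_modes(s_init, s_target, lr=0.05, steps=500):
    """Per-mode gradient descent under the assumed balanced factorization.

    Each singular value s = a * a is updated through its factor a.
    The update is proportional to a, so a mode that starts at 0 never moves.
    """
    factors = [math.sqrt(s) for s in s_init]
    for _ in range(steps):
        factors = [a - 2.0 * lr * (a * a - t) * a
                   for a, t in zip(factors, s_target)]
    return [a * a for a in factors]


if __name__ == "__main__":
    shared, spec = post_training_targets([3.0, 2.0], [0.7], [0.9])
    # Hypothetical target that is nonzero on both post-training blocks.
    target = [3.0, 2.0, 0.7, 0.9]
    # Toy "unmixed-like" start: specialized mode is exactly zero.
    print(train_modes([3.0, 2.0, 0.1, 0.0], target), "vs", shared)
    # Toy "mixed-like" start: shared mode is exactly zero.
    print(train_modes([3.0, 2.0, 0.0, 0.1], target), "vs", spec)
```

In this toy setting the two runs see the same target but end on different supports, because whichever block starts at zero stays at zero; that is the mechanism the proof relies on.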

### C.4 Analysis of Downstream Adaptation

In the previous section, we characterized the impact of post-training from a mixed versus an unmixed initialization, demonstrating that different features are used to minimize the loss on $\mathcal{D}_{\textrm{post}}$. In this section, we study how these different features impact the ultimate retention of $\mathcal{D}_{\textrm{post}}$. We first establish that the singular values corresponding to directions in which there is no covariance remain unchanged throughout the downstream fine-tuning stage.

###### Lemma C.16.

Consider performing downstream unregularized fine-tuning on $\mathcal{D}_{\textrm{ft}}$. If $x\sim\mathcal{D}_{\textrm{ft}}$ has $0$ covariance along a singular direction, the corresponding singular value remains unchanged throughout downstream adaptation.

###### Proof.

To see this, note that the gradient updates for $\mathbf{W}_{1}$ and $\mathbf{W}_{2}$ take the following form:

$$\begin{aligned}
\mathbf{W}_{1}(k+1)&=\mathbf{W}_{1}(k)-2\eta\left(\mathbf{W}_{1}(k)\mathbf{W}_{2}(k)-\mathbf{A}^{\textrm{spec}}\right)\Sigma_{x}\mathbf{W}_{2}(k)^{\top},\\
\mathbf{W}_{2}(k+1)&=\mathbf{W}_{2}(k)-2\eta\,\mathbf{W}_{1}(k)\Sigma_{x}\left(\mathbf{W}_{1}(k)\mathbf{W}_{2}(k)-\mathbf{A}^{\textrm{spec}}\right).
\end{aligned}$$

Here, $\Sigma_{x}$ denotes the covariance of the input data $x$. Thus, along any singular direction in which the data has zero variance, the $\Sigma_{x}$ term projects the gradient to $0$. Therefore, the singular values along such directions must also remain unchanged. ∎
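The lemma can be checked numerically on a small example. The sketch below is an illustration under stated assumptions rather than the paper's setup: it assumes the convention $f(x)=\mathbf{W}_{1}\mathbf{W}_{2}x$ with squared loss, for which the gradients are $2(\mathbf{W}_{1}\mathbf{W}_{2}-\mathbf{A})\Sigma_{x}\mathbf{W}_{2}^{\top}$ and $2\mathbf{W}_{1}^{\top}(\mathbf{W}_{1}\mathbf{W}_{2}-\mathbf{A})\Sigma_{x}$ (the ordering of factors may differ from the paper's convention), a balanced initialization, and a $2\times 2$ problem whose singular bases are arbitrary rotations. All helper names, angles, and values are hypothetical.

```python
import math


def matmul(A, B):
    """Plain-Python matrix product for small dense matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(row) for row in zip(*A)]


def diag(values):
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


def rotation(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s], [s, c]]


def subtract_scaled(X, G, scale):
    """Return X - scale * G entrywise."""
    return [[x - scale * g for x, g in zip(rx, rg)] for rx, rg in zip(X, G)]


def fine_tune_singular_values(s_init, s_target, variances, lr=0.05, steps=300):
    """Run gradient descent on a 2x2 two-layer linear model.

    s_init, s_target, variances are length-2 lists giving, per singular
    direction, the initial singular value, the target singular value, and
    the input variance. Returns the diagonal of U^T W1 W2 V after training.
    """
    U, V = rotation(0.4), rotation(1.1)  # assumed singular bases
    Vt = transpose(V)
    sigma_x = matmul(matmul(V, diag(variances)), Vt)  # input covariance
    A = matmul(matmul(U, diag(s_target)), Vt)         # regression target
    root = [math.sqrt(s) for s in s_init]
    W1 = matmul(U, diag(root))                        # balanced factors
    W2 = matmul(diag(root), Vt)
    for _ in range(steps):
        residual = subtract_scaled(matmul(W1, W2), A, 1.0)
        rs = matmul(residual, sigma_x)
        grad_w1 = matmul(rs, transpose(W2))
        grad_w2 = matmul(transpose(W1), rs)
        W1 = subtract_scaled(W1, grad_w1, 2.0 * lr)
        W2 = subtract_scaled(W2, grad_w2, 2.0 * lr)
    M = matmul(matmul(transpose(U), matmul(W1, W2)), V)
    return [M[i][i] for i in range(len(s_init))]


if __name__ == "__main__":
    # Direction 0 has unit variance; direction 1 has zero variance.
    print(fine_tune_singular_values([1.0, 0.8], [2.0, 0.0], [1.0, 0.0]))
```

In this example the target asks the second singular value to shrink to zero, but because the inputs have no variance along that direction the value stays at its initial level, while the first singular value moves to its target.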

We consider performing downstream adaptation by taking steps of unregularized gradient descent on $\mathcal{D}_{\textrm{ft}}$ and show the following result.

###### Theorem C.17.

Consider performing $K$ steps of gradient descent on the downstream fine-tuning dataset beginning from the initializations $\theta^{\textrm{post,mixed}}(K)$ and $\theta^{\textrm{post,unmixed}}(K)$, and let $\theta^{\textrm{FT,mixed}}(K)$ and $\theta^{\textrm{FT,unmixed}}(K)$ denote the respective final parameters. Let

$$\Delta_{\textrm{unmixed}}=\mathcal{L}(\theta^{\textrm{FT,unmixed}};\mathcal{D}_{\textrm{post}})-\mathcal{L}(\theta^{\textrm{post,unmixed}}(K);\mathcal{D}_{\textrm{post}})$$

and likewise

$$\Delta_{\textrm{mixed}}=\mathcal{L}(\theta^{\textrm{FT,mixed}};\mathcal{D}_{\textrm{post}})-\mathcal{L}(\theta^{\textrm{post,mixed}}(K);\mathcal{D}_{\textrm{post}}).$$

Then we have that $\Delta_{\textrm{unmixed}}>\Delta_{\textrm{mixed}}$.

###### Proof.

As the invariant features take the same values, they will not move during downstream adaptation. Moreover, by Lemma [C.16](https://arxiv.org/html/2605.12705#A3.Thmtheorem16), the specialized features will not change during downstream fine-tuning. This implies that $\Delta_{\textrm{mixed}}=0$. Next, we examine the changes induced by downstream training on the unmixed models. Observe that $\mathcal{L}(\theta;\mathcal{D}_{\textrm{post}})=\|\theta-\mathbf{A}^{\textrm{spec}}\|_{F}^{2}$. We define the matrix

$$\Sigma_{\textrm{FT}}^{\textrm{(unmixed)}}=\textrm{diag}\left(\sigma_{1}^{\textrm{inv}},\ldots,\sigma^{\textrm{inv}}_{d_{\textrm{invariant}}},\sigma^{\textrm{ft}}_{d_{\textrm{invariant}}+1},\ldots,\sigma^{\textrm{ft}}_{d_{\textrm{invariant}}+k},\mathbf{0}_{k}\right)$$

and note that Lemma [C.14](https://arxiv.org/html/2605.12705#A3.Thmtheorem14) gives us that

$$\|\mathbf{U}^{\top}\theta^{\textrm{(unmixed)}}_{\textrm{FT}}\mathbf{V}-\Sigma_{\textrm{FT}}^{\textrm{(unmixed)}}\|_{\textrm{op}}\leq\epsilon.$$

The loss function we use here is simply the squared difference of the singular values. Thus, we can upper bound

$$\mathcal{L}(\theta_{\textrm{post}}^{\textrm{unmixed}};\mathcal{D}_{\textrm{post}})\leq k(\beta+\epsilon)^{2}+k\epsilon^{2}$$

and likewise lower bound

$$\mathcal{L}(\theta_{\textrm{ft}}^{\textrm{unmixed}};\mathcal{D}_{\textrm{post}})\geq k(\beta-\epsilon)^{2}+k(c_{\textrm{mis}}-\epsilon)^{2}.$$

Then, we can lower bound

$$\Delta_{\textrm{unmixed}}\geq k\left[-(\beta+\epsilon)^{2}+(\beta-\epsilon)^{2}\right]+k\left[(c_{\textrm{mis}}-\epsilon)^{2}-\epsilon^{2}\right]=k\left[-4\beta\epsilon-2c_{\textrm{mis}}\epsilon+c_{\textrm{mis}}^{2}\right].$$

From the condition that $\beta<\frac{1}{2}c_{\textrm{mis}}$, the positivity of $c_{\textrm{mis}}^{2}$, and the condition on $\epsilon$, we thus have that $\Delta_{\textrm{unmixed}}>0=\Delta_{\textrm{mixed}}$, which is what we wanted to show.

∎
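To make the final step explicit, the expansion below spells out the arithmetic behind the lower bound on $\Delta_{\textrm{unmixed}}$. The threshold on $\epsilon$ shown here is derived only from the displayed inequality, assuming $c_{\textrm{mis}}>0$, $k>0$, and $\epsilon\geq 0$; it is an illustrative sufficient condition, and the paper's own condition on $\epsilon$ is stated elsewhere and may differ.

```latex
\begin{aligned}
(\beta-\epsilon)^{2}-(\beta+\epsilon)^{2} &= -4\beta\epsilon,\\
(c_{\textrm{mis}}-\epsilon)^{2}-\epsilon^{2} &= c_{\textrm{mis}}^{2}-2c_{\textrm{mis}}\epsilon,\\
\Delta_{\textrm{unmixed}} &\geq k\left[c_{\textrm{mis}}^{2}-(4\beta+2c_{\textrm{mis}})\epsilon\right],\\
\beta<\tfrac{1}{2}c_{\textrm{mis}} \;\Rightarrow\; 4\beta+2c_{\textrm{mis}} &< 4c_{\textrm{mis}}
\;\Rightarrow\; k\left[c_{\textrm{mis}}^{2}-(4\beta+2c_{\textrm{mis}})\epsilon\right] > 0
\ \text{ whenever } \epsilon\leq\tfrac{1}{4}c_{\textrm{mis}}.
\end{aligned}
```

Under these assumptions, any $\epsilon$ no larger than a quarter of $c_{\textrm{mis}}$ is enough for the bracket to be strictly positive, which gives $\Delta_{\textrm{unmixed}}>0$.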
