The Long-Term Effects of Data Selection in LLM Fine-Tuning

arXiv cs.LG Papers

Summary

This paper investigates the long-term effects of data selection strategies in multi-stage LLM fine-tuning, revealing that myopic selection can harm future adaptability. It introduces a Long-Horizon Aware Selection (LHAS) objective to mitigate these issues.

arXiv:2605.30537v1 Announce Type: new

# The Long-Term Effects of Data Selection in LLM Fine-Tuning
Source: [https://arxiv.org/html/2605.30537](https://arxiv.org/html/2605.30537)
Yuxin Yang \(Shanghai University\)

Aoxiong Zeng \(East China Normal University\)

Xiangquan Yang \(East China Normal University\)

###### Abstract

Data selection is increasingly used to reduce the cost of large language model \(LLM\) fine\-tuning, with recent methods prioritizing samples by current utility, diversity, quality, or influence\. This paper studies a different question: when fine\-tuning occurs over multiple stages, can selection strategies that look optimal now make the model less adaptable later? We introduce a long\-horizon view of LLM data selection in which a selector is evaluated not only by immediate task performance, but also by future adaptation speed, forgetting, capability imbalance, and out\-of\-distribution robustness\. We compare representative random, loss\-based, gradient\-based, diversity\-based, quality\-based, and utility\-diversity selection families under a unified multi\-stage protocol\. Through controlled experiments designed to instantiate this protocol, we show how short\-term selectors can exhibit rank reversal: they improve the current stage while slowing subsequent learning and increasing forgetting\. We formalize this behavior as *myopic selection*, provide a simple local analysis of why it can occur, and propose a diagnostic Long\-Horizon Aware Selection \(LHAS\) objective that augments immediate utility with coverage, future\-proxy transfer, and anti\-concentration terms\. The study argues that data selection should be evaluated as a training intervention that shapes the model’s learning trajectory, rather than only as a local data\-efficiency mechanism\.

## 1 Introduction

Supervised fine\-tuning \(SFT\) is one of the standard ways to adapt large language models to downstream tasks \(Brown et al\., [2020](https://arxiv.org/html/2605.30537#bib.bib3); Achiam et al\., [2023](https://arxiv.org/html/2605.30537#bib.bib2); Touvron et al\., [2023](https://arxiv.org/html/2605.30537#bib.bib50); Bai et al\., [2023](https://arxiv.org/html/2605.30537#bib.bib4)\)\. As instruction datasets grow in size and heterogeneity, training on every available example is often inefficient and sometimes harmful: low\-quality examples can amplify bias, redundant examples waste compute, and overly narrow mixtures can overfit the model to a transient capability profile\. These concerns have motivated data selection methods based on quality filtering, influence estimation, active learning, coreset coverage, online batch scoring, and deduplication or diversification \(Albalak et al\., [2024](https://arxiv.org/html/2605.30537#bib.bib5); Sener and Savarese, [2018](https://arxiv.org/html/2605.30537#bib.bib19); Ash et al\., [2020](https://arxiv.org/html/2605.30537#bib.bib20); Coleman et al\., [2020](https://arxiv.org/html/2605.30537#bib.bib21); Chen et al\., [2024](https://arxiv.org/html/2605.30537#bib.bib12); Zhou et al\., [2023](https://arxiv.org/html/2605.30537#bib.bib13); Xia et al\., [2024](https://arxiv.org/html/2605.30537#bib.bib11); Lee et al\., [2021](https://arxiv.org/html/2605.30537#bib.bib14); Tirumala et al\., [2023](https://arxiv.org/html/2605.30537#bib.bib15); Zou et al\., [2025a](https://arxiv.org/html/2605.30537#bib.bib1)\)\.

Most of this literature evaluates selection within a single fine\-tuning stage\. This is natural when the goal is to solve one task with a fixed budget, but it misses an important property of modern model adaptation: fine\-tuning is often sequential\. A deployed assistant may first be tuned on general instructions, then mathematical reasoning, then code, then safety data, and then domain\-specific corpora\. In this setting, the selected subset at stage $t$ does not merely save compute\. It changes the parameter state from which all future stages start\.

This paper asks whether efficient data selection can make models increasingly specialized\. By specialization we do not mean only that a selector changes the label distribution of the current batch\. We mean that selection may push representations, gradients, and parameter\-efficient adapters toward a narrow set of capabilities, reducing future learnability or robustness\. A sample can be highly useful for the current stage while still being a poor long\-term training intervention\.

We call this phenomenon *myopic selection*: a selection policy is myopic when it maximizes immediate utility at the expense of future adaptation, retention, or out\-of\-distribution \(OOD\) robustness\. This framing leads to five research questions\. RQ1: Are short\-term winning selectors also long\-term winning selectors? RQ2: How do selectors affect the learning speed of later tasks? RQ3: Do they increase forgetting or capability imbalance? RQ4: Is diversity sufficient to prevent long\-horizon bias? RQ5: Can a lightweight long\-horizon objective improve the trade\-off?

The distinction matters because online selection methods are often evaluated with curves that stop at the end of the current stage\. Such curves answer whether a method uses fewer tokens to solve today’s objective, but they do not answer whether it leaves the model in a state from which tomorrow’s objective is easier or harder\. In particular, two selectors can reach the same current validation score while inducing very different update covariance, adapter subspaces, and capability coverage\. A long\-horizon evaluation therefore needs to measure both *where* the model arrives and *how* it arrived there\.

We make four contributions\. First, we formulate long\-horizon data selection for LLM fine\-tuning and define metrics that capture future adaptation speed, forgetting, capability imbalance, OOD robustness, and a *myopia gap*\. Second, we specify a unified protocol for comparing representative selection families under equal token budgets\. Third, we give a simple theoretical analysis showing why two selectors with equal current\-stage gain can differ in future adaptation cost\. Fourth, we use controlled experiments to stress\-test the protocol and introduce Long\-Horizon Aware Selection \(LHAS\), a diagnostic baseline showing how coverage and anti\-concentration terms can reduce the long\-term side effects of myopic selection\.

## 2 Related work

#### Online data and batch selection for LLM fine\-tuning\.

Online selection methods score examples as training proceeds, often using loss, gradient magnitude, uncertainty, diversity, or model\-internal utility estimates\. Classic online batch selection prioritizes high\-loss examples \(Loshchilov and Hutter, [2015](https://arxiv.org/html/2605.30537#bib.bib6); Jiang et al\., [2019](https://arxiv.org/html/2605.30537#bib.bib8)\), importance\-sampling approaches use gradient information \(Katharopoulos and Fleuret, [2018](https://arxiv.org/html/2605.30537#bib.bib7)\), RHO\-Loss emphasizes examples that are learnable and not yet learned \(Mindermann et al\., [2022](https://arxiv.org/html/2605.30537#bib.bib9)\), and GREATS selects high\-quality data in each training iteration \(Wang et al\., [2024](https://arxiv.org/html/2605.30537#bib.bib10)\)\. Recent LLM\-oriented work also studies influence\-based instruction tuning and utility\-diversity scoring \(Xia et al\., [2024](https://arxiv.org/html/2605.30537#bib.bib11); Zou et al\., [2025a](https://arxiv.org/html/2605.30537#bib.bib1)\)\. UDS is the closest point of departure for this work: it combines a utility term based on forward\-pass logits with an inter\-sample diversity estimate using a historical memory buffer \(Zou et al\., [2025a](https://arxiv.org/html/2605.30537#bib.bib1)\)\. Our goal is complementary\. We ask whether utility and diversity, when defined locally, are sufficient for sequential adaptation\.

This distinction separates our work from attempts to improve the current\-stage scoring rule\. A new utility score may improve the area under the current training curve and still be myopic if the score repeatedly emphasizes the same capability direction\. Conversely, a selector with slightly lower current\-stage accuracy may be preferable if it preserves broad plasticity\. Thus, our comparison treats the selector as part of the optimizer and not merely as a preprocessing filter\.

#### Data valuation, pruning, and curriculum learning\.

Data selection has a long history in active learning \(Settles, [2009](https://arxiv.org/html/2605.30537#bib.bib48); Sener and Savarese, [2018](https://arxiv.org/html/2605.30537#bib.bib19); Ash et al\., [2020](https://arxiv.org/html/2605.30537#bib.bib20)\), dataset cartography \(Swayamdipta et al\., [2020](https://arxiv.org/html/2605.30537#bib.bib44)\), proxy\-based selection \(Coleman et al\., [2020](https://arxiv.org/html/2605.30537#bib.bib21)\), data pruning \(Sorscher et al\., [2022](https://arxiv.org/html/2605.30537#bib.bib47)\), and gradient\- or loss\-based example scoring \(Toneva et al\., [2019](https://arxiv.org/html/2605.30537#bib.bib43); Paul et al\., [2021](https://arxiv.org/html/2605.30537#bib.bib45); Mirzasoleiman et al\., [2020](https://arxiv.org/html/2605.30537#bib.bib46)\)\. In language\-model training, importance resampling, deduplication, and diversification further show that the data mixture can change both efficiency and generalization \(Xie et al\., [2023](https://arxiv.org/html/2605.30537#bib.bib16); Lee et al\., [2021](https://arxiv.org/html/2605.30537#bib.bib14); Tirumala et al\., [2023](https://arxiv.org/html/2605.30537#bib.bib15)\)\. Instruction tuning also highlights that small, carefully curated datasets can match or exceed much larger mixtures \(Zhou et al\., [2023](https://arxiv.org/html/2605.30537#bib.bib13); Chen et al\., [2024](https://arxiv.org/html/2605.30537#bib.bib12)\)\. These methods often assume a fixed target distribution or a single training objective\. In multi\-stage LLM fine\-tuning, the target distribution itself evolves\. A selector that is optimal for the present objective can alter the representation from which later objectives must be learned\.

Curriculum learning provides another useful analogy \(Wang et al\., [2021](https://arxiv.org/html/2605.30537#bib.bib17); Xu et al\., [2020](https://arxiv.org/html/2605.30537#bib.bib18)\)\. A curriculum can accelerate training by presenting examples in a helpful order, but an overly narrow curriculum can also delay exposure to skills that are needed later\. Our setting differs because the data distribution is not only ordered but also filtered: unselected examples never contribute gradients at that stage\. The long\-term effect is therefore stronger than reordering alone\.

#### Continual learning and stability\-plasticity\.

Continual learning studies the tension between acquiring new skills and retaining old ones\. Representative approaches include regularization against important parameter changes \(Kirkpatrick et al\., [2017](https://arxiv.org/html/2605.30537#bib.bib36); Zenke et al\., [2017](https://arxiv.org/html/2605.30537#bib.bib39)\), distillation\-based retention \(Li and Hoiem, [2016](https://arxiv.org/html/2605.30537#bib.bib38)\), replay or exemplar memory \(Rebuffi et al\., [2017](https://arxiv.org/html/2605.30537#bib.bib37)\), constrained\-gradient methods \(Lopez\-Paz and Ranzato, [2017](https://arxiv.org/html/2605.30537#bib.bib40); Chaudhry et al\., [2019](https://arxiv.org/html/2605.30537#bib.bib41)\), and architectural expansion \(Rusu et al\., [2016](https://arxiv.org/html/2605.30537#bib.bib42)\)\. Recent efficient continual adaptation methods further use mechanisms such as sparse expansion, decorrelation, and guided random projection to reduce interference and adaptation cost \(Zou et al\., [2025b](https://arxiv.org/html/2605.30537#bib.bib24), [c](https://arxiv.org/html/2605.30537#bib.bib25); Li et al\., [2026](https://arxiv.org/html/2605.30537#bib.bib26)\)\. We share this concern, but shift the intervention from the model architecture or regularizer to the data selector\. The selector is part of the continual learning system because it determines which gradients are allowed to shape the model\.

#### Parameter\-efficient adaptation\.

Parameter\-efficient fine\-tuning \(PEFT\) adapts large pretrained models by updating a small set of parameters, including adapters \(Houlsby et al\., [2019](https://arxiv.org/html/2605.30537#bib.bib30)\), prefix tuning \(Li and Liang, [2021](https://arxiv.org/html/2605.30537#bib.bib31)\), prompt tuning \(Lester et al\., [2021](https://arxiv.org/html/2605.30537#bib.bib32)\), P\-tuning variants \(Liu et al\., [2022](https://arxiv.org/html/2605.30537#bib.bib33)\), LoRA \(Hu et al\., [2022](https://arxiv.org/html/2605.30537#bib.bib29)\), and quantized LoRA\-style training \(Dettmers et al\., [2023](https://arxiv.org/html/2605.30537#bib.bib34)\); recent surveys organize these methods as a broad family of scalable adaptation tools \(Ding et al\., [2023](https://arxiv.org/html/2605.30537#bib.bib35)\)\. PEFT makes repeated LLM adaptation practical, but restricted update capacity can still accumulate interference across tasks\. Recent multi\-task PEFT work explores MoE\-LoRA specialization for domain\-specific adaptation \(Yang et al\., [2026a](https://arxiv.org/html/2605.30537#bib.bib22)\), context\-aware modulation of LoRA updates \(Yang et al\., [2026b](https://arxiv.org/html/2605.30537#bib.bib23)\), and rank\-wise mixture mechanisms for task decoupling \(Zou et al\., [2025d](https://arxiv.org/html/2605.30537#bib.bib27)\)\. Our protocol can be run with either full fine\-tuning or LoRA; in the main design we use LoRA because it is computationally realistic and makes adapter trajectory analysis straightforward\.

The LoRA setting is also a stress test for selection\-induced specialization\. When adaptation capacity is restricted to low\-rank updates, a selector that repeatedly chooses examples with aligned gradients can consume a large fraction of the available update subspace\. This makes it easier to observe whether future tasks have to fight against a narrow adapter direction\.

## 3 Problem setup and theoretical analysis

Let $M_0$ denote a pretrained model and let $\mathcal{D}_{1:T}=\{\mathcal{D}_1,\ldots,\mathcal{D}_T\}$ denote a sequence of fine\-tuning stages\. At stage $t$, a selection policy $\pi_t$ observes the current model $M_{t-1}$, a candidate pool $\mathcal{D}_t$, and optional history $H_{t-1}$, then selects a subset $S_t\subset\mathcal{D}_t$ under a fixed budget\. Training on $S_t$ produces $M_t$\.

Most selectors optimize an immediate objective,

$$U_{\mathrm{imm}}(\pi_t)=\mathrm{Perf}(M_t,V_t)-\mathrm{Perf}(M_{t-1},V_t), \tag{1}$$

where $V_t$ is a validation set for the current stage\. A long\-horizon objective must additionally account for future learnability and retention:

$$U_{\mathrm{long}}(\pi_{1:T})=\sum_{t=1}^{T}\mathrm{Perf}(M_T,V_t)+\alpha\sum_{t=1}^{T-1}\mathrm{AUC}_{t\rightarrow t+1}-\beta\sum_{t=1}^{T}F_t+\gamma R_{\mathrm{ood}}, \tag{2}$$

where $\mathrm{AUC}_{t\rightarrow t+1}$ measures adaptation speed on the next stage after training stage $t$, $F_t$ measures forgetting relative to the best previous score on stage $t$, and $R_{\mathrm{ood}}$ measures robustness on shifted evaluation sets\.
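As a minimal sketch, Eq. (2) can be assembled from per-stage measurements that have already been computed. The function name, argument names, and default weights below are illustrative assumptions; the text does not fix values for the weights α, β, and γ here.

```python
from typing import Sequence


def long_horizon_utility(
    final_perf: Sequence[float],      # Perf(M_T, V_t) for t = 1..T
    adaptation_auc: Sequence[float],  # AUC_{t -> t+1} for t = 1..T-1
    forgetting: Sequence[float],      # F_t for t = 1..T
    ood_robustness: float,            # R_ood
    alpha: float = 1.0,               # hypothetical default weight
    beta: float = 1.0,                # hypothetical default weight
    gamma: float = 1.0,               # hypothetical default weight
) -> float:
    """Combine the four terms of Eq. (2) into a single scalar."""
    num_stages = len(final_perf)
    if len(forgetting) != num_stages:
        raise ValueError("need one forgetting value per stage")
    if len(adaptation_auc) != num_stages - 1:
        raise ValueError("need one AUC value per adjacent stage pair")
    return (
        sum(final_perf)
        + alpha * sum(adaptation_auc)
        - beta * sum(forgetting)
        + gamma * ood_robustness
    )
```

Keeping the weights as explicit arguments makes it easy to check how sensitive a selector ranking is to the relative importance of adaptation speed, forgetting, and robustness.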

#### Myopia gap\.

We define the myopia gap as the disagreement between selector rankings under immediate and long\-horizon evaluation:

$$\mathrm{Gap}=\frac{1}{K-1}\,\mathbb{E}_{\pi\in\Pi}\left[\left|\mathrm{rank}_{\mathrm{imm}}(\pi)-\mathrm{rank}_{\mathrm{long}}(\pi)\right|\right], \tag{3}$$

where $K$ is the number of selectors\. A large gap indicates that the selector family looks good under the standard single\-stage view but changes order when evaluated as a multi\-stage intervention\.
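The myopia gap can be sketched in a few lines. This version assumes the expectation in Eq. (3) is a uniform average over the K selectors, that higher scores are better under both views, and that ties are broken by selector name; none of these choices is specified in the text. Because no rank can move by more than K−1 positions, the result lies in [0, 1].

```python
from typing import Dict


def ranks(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank selectors from 1 (best) downward; ties are broken by name."""
    ordered = sorted(scores, key=lambda name: (-scores[name], name))
    return {name: position + 1 for position, name in enumerate(ordered)}


def myopia_gap(immediate: Dict[str, float], long_horizon: Dict[str, float]) -> float:
    """Mean absolute rank shift between the two views, divided by K - 1."""
    if immediate.keys() != long_horizon.keys():
        raise ValueError("both views must score the same selectors")
    k = len(immediate)
    if k < 2:
        raise ValueError("need at least two selectors to compare rankings")
    rank_imm = ranks(immediate)
    rank_long = ranks(long_horizon)
    mean_shift = sum(abs(rank_imm[n] - rank_long[n]) for n in immediate) / k
    return mean_shift / (k - 1)
```

A gap of 0 means both views order the selectors identically; larger values indicate more rank reversal.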

#### Trajectory diagnostics\.

Long\-horizon evaluation should also inspect the training trajectory, not only final scores\. We use three diagnostics\. First, *capability entropy* measures the entropy of selected examples over coarse skill clusters\. Second, *update concentration* is the largest eigenvalue share of the selected\-gradient covariance matrix\. Third, *adapter drift* measures cosine distance between the adapter update after stage $t$ and the cumulative update direction from previous stages\. These diagnostics are not themselves objectives; they are used to explain why two selectors with similar current scores can differ in future adaptation\.
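Two of these diagnostics, capability entropy and adapter drift, are simple enough to sketch directly. The code assumes each selected example already carries a cluster label and that adapter updates are available as flat vectors; how clusters are assigned or updates are flattened is not specified here, and the function names are illustrative.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def capability_entropy(cluster_ids: Sequence[Hashable]) -> float:
    """Shannon entropy (in nats) of selected examples over skill clusters."""
    total = len(cluster_ids)
    if total == 0:
        raise ValueError("no selected examples")
    counts = Counter(cluster_ids)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def adapter_drift(update: Sequence[float], cumulative: Sequence[float]) -> float:
    """Cosine distance between the latest update and the cumulative direction."""
    if len(update) != len(cumulative):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(update, cumulative))
    norm = math.sqrt(sum(a * a for a in update)) * math.sqrt(
        sum(b * b for b in cumulative)
    )
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / norm
```

Entropy is zero when every selected example falls in one cluster and reaches its maximum for a uniform spread, while a drift of zero means the new update points in the same direction as the accumulated one.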

### 3\.1 A simple theoretical view

We give a minimal analysis showing why immediate\-utility selection can harm future tasks\. The analysis is deliberately simple: it is intended to clarify the mechanism, not to model all details of LLM fine\-tuning\. Consider a local quadratic approximation of the loss around the current parameter vector $\theta$ and let each example $x$ induce a gradient vector $g(x)$\. A selector chooses a minibatch $S$ whose average update is $\bar{g}_S=|S|^{-1}\sum_{x\in S}g(x)$\. Suppose stage $t$ has a current task direction $u_t$ and the next stage has direction $u_{t+1}$\.

###### Definition 1 \(Selection concentration\)

For a selected set $S$, define concentration

$$C(S)=\frac{\lambda_{\max}\left(\frac{1}{|S|}\sum_{x\in S}g(x)g(x)^{\top}\right)}{\mathrm{tr}\left(\frac{1}{|S|}\sum_{x\in S}g(x)g(x)^{\top}\right)+\epsilon}. \tag{4}$$

High $C(S)$ means selected gradients occupy a narrow subspace\.
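A small, dependency-free sketch of Definition 1 follows. It estimates the top eigenvalue by power iteration without forming the full second-moment matrix. The iteration count, random seed, and default ε are assumptions chosen for illustration, and for LLM-sized gradients one would need a projected or low-dimensional gradient representation that this sketch does not cover.

```python
import math
import random
from typing import Sequence


def selection_concentration(
    grads: Sequence[Sequence[float]],
    eps: float = 1e-12,   # hypothetical default for the epsilon in Eq. (4)
    iters: int = 200,     # hypothetical power-iteration budget
    seed: int = 0,
) -> float:
    """Top-eigenvalue share of (1/|S|) * sum_x g(x) g(x)^T."""
    n, dim = len(grads), len(grads[0])
    # The trace of the second-moment matrix is the mean squared gradient norm.
    trace = sum(sum(x * x for x in g) for g in grads) / n
    if trace == 0.0:
        return 0.0
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    top_eigenvalue = 0.0
    for _ in range(iters):
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            break
        v = [x / norm for x in v]
        # Matrix-vector product computed as the mean of g * (g . v).
        w = [0.0] * dim
        for g in grads:
            proj = sum(gi * vi for gi, vi in zip(g, v))
            for j in range(dim):
                w[j] += g[j] * proj / n
        top_eigenvalue = sum(wi * vi for wi, vi in zip(w, v))  # Rayleigh quotient
        v = w
    return top_eigenvalue / (trace + eps)
```

Gradients that all lie along one direction give a value close to 1, while gradients spread evenly over d orthogonal directions give a value close to 1/d.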

###### Proposition 1 \(Myopic updates can increase future adaptation cost\)

Assume one fine\-tuning step updates $\theta'=\theta-\eta\bar{g}_S$ and the next\-stage loss has local form $L_{t+1}(\theta)=\frac{1}{2}\|\theta-\theta_{t+1}^{\star}\|_H^2$ with $H\succeq 0$\. If two selectors $S_a,S_b$ have equal current improvement but $\langle\bar{g}_{S_a},H(\theta-\theta_{t+1}^{\star})\rangle<\langle\bar{g}_{S_b},H(\theta-\theta_{t+1}^{\star})\rangle$, then selector $S_a$ yields higher next\-stage loss after the current update\. In particular, a selector that concentrates gradients in a direction orthogonal or antagonistic to $u_{t+1}$ can be worse for future adaptation despite matching immediate gain\.

The proof appears in Appendix [B](https://arxiv.org/html/2605.30537#A2)\. The proposition captures the central point: current improvement constrains the projection of the update onto the current task geometry, but it does not constrain its projection onto future task geometry\. Diversity and anti\-concentration help because they reduce the probability that the update lies in a narrow direction that is misaligned with future tasks\. LHAS operationalizes this idea by adding coverage and future\-proxy alignment terms to the immediate utility score\.
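The mechanism behind Proposition 1 can be checked by expanding the quadratic directly. The following is a sketch under the proposition's assumptions, additionally taking $H$ to be symmetric and writing $\delta=\theta-\theta_{t+1}^{\star}$; it may differ in detail from the appendix proof.

```latex
% One-step expansion of the next-stage loss (H symmetric, \delta = \theta - \theta_{t+1}^{\star}):
L_{t+1}(\theta - \eta\bar{g}_S)
  = \tfrac{1}{2}(\delta - \eta\bar{g}_S)^{\top} H (\delta - \eta\bar{g}_S)
  = L_{t+1}(\theta)
    - \eta\,\langle \bar{g}_S, H\delta \rangle
    + \tfrac{\eta^{2}}{2}\,\bar{g}_S^{\top} H \bar{g}_S .

% Difference between the two selectors after the update:
L_{t+1}(\theta'_a) - L_{t+1}(\theta'_b)
  = \eta\left( \langle \bar{g}_{S_b}, H\delta \rangle
             - \langle \bar{g}_{S_a}, H\delta \rangle \right)
  + \tfrac{\eta^{2}}{2}\left( \bar{g}_{S_a}^{\top} H \bar{g}_{S_a}
                            - \bar{g}_{S_b}^{\top} H \bar{g}_{S_b} \right).
```

Under the proposition's inequality the first bracket is strictly positive, so for a sufficiently small step size $\eta$ the first-order term dominates and $S_a$ leaves the model at a higher next-stage loss. The second-order term can change the comparison only when the step size is large or the two updates differ substantially in their $H$-norm.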

## 4 Selection strategies and evaluation protocol

All selectors operate under the same token or sample budget\. We compare the following families\.

#### Random\.

Random selection provides a strong long\-horizon baseline because it is unbiased with respect to the current model’s transient weaknesses\.

#### Loss\-based\.

The loss selector chooses examples with high current loss\. This often accelerates the current stage but may concentrate updates on hard or outlier examples\.

#### Gradient\-based\.

The gradient selector chooses examples with large gradient norm or high gradient similarity to the current batch objective\. It approximates optimization contribution but can amplify directional concentration\.

#### Diversity\-based\.

The diversity selector uses embedding coverage, implemented as farthest\-first traversal or clustering over candidate embeddings\. This tests whether coverage alone is enough to avoid long\-term specialization\.
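Farthest-first traversal can be sketched as follows. This is a generic greedy k-center routine over Euclidean distances, not necessarily the exact variant used in the experiments; the starting index and the distance metric are assumptions.

```python
import math
from typing import List, Sequence


def farthest_first(
    embeddings: Sequence[Sequence[float]],
    budget: int,
    start: int = 0,  # hypothetical choice of the first point
) -> List[int]:
    """Greedily pick the point farthest from everything selected so far."""
    n = len(embeddings)
    if not 0 < budget <= n:
        raise ValueError("budget must be between 1 and the pool size")
    selected = [start]
    chosen = {start}
    # min_dist[i] is the distance from point i to its nearest selected point.
    min_dist = [math.dist(e, embeddings[start]) for e in embeddings]
    while len(selected) < budget:
        candidates = [i for i in range(n) if i not in chosen]
        nxt = max(candidates, key=lambda i: min_dist[i])
        selected.append(nxt)
        chosen.add(nxt)
        for i in range(n):
            min_dist[i] = min(min_dist[i], math.dist(embeddings[i], embeddings[nxt]))
    return selected
```

Each round costs one pass over the pool, so selecting b of n candidates takes O(n·b) distance evaluations. Note that the routine only sees the current pool, which is the limitation this selector is meant to probe.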

#### Quality\-based\.

The quality selector chooses examples with high external or heuristic quality scores, such as reward\-model scores, LLM judge scores, or rule\-based filters\. It captures a common data curation practice\.

#### Utility\-diversity\.

The utility\-diversity selector combines immediate utility with inter\-sample diversity using a memory buffer, following the central design of UDS \(Zou et al\., [2025a](https://arxiv.org/html/2605.30537#bib.bib1)\)\. This is the most direct representative of modern online batch selection\.

#### Long\-Horizon Aware Selection\.

LHAS is a diagnostic baseline rather than a claim of optimality\. It augments any immediate utility score $u(x)$:

$$s(x)=u(x)+\lambda\,c(x,H_t)+\eta\,p(x,P_t)-\rho\,a(x,S_{1:t-1}), \tag{5}$$

where $c$ rewards coverage relative to selected history, $p$ measures alignment with a small future\-proxy validation mixture, and $a$ penalizes repeated selection of the same capability cluster or gradient direction\. In our implementation, $u(x)$ is the utility\-diversity score\. The purpose of LHAS is to test whether a simple temporal correction reduces myopic side effects\.
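To make the structure of Eq. (5) concrete, here is a minimal sketch that scores each candidate once and keeps the top entries. The four component functions are passed in as callables because their computation is not detailed here; the weight defaults are hypothetical, and a fuller implementation would presumably refresh the coverage and concentration terms as examples are added.

```python
from typing import Callable, List, Sequence, TypeVar

X = TypeVar("X")


def lhas_select(
    candidates: Sequence[X],
    utility: Callable[[X], float],                # u(x), e.g. a utility-diversity score
    coverage: Callable[[X], float],               # c(x, H_t)
    proxy_alignment: Callable[[X], float],        # p(x, P_t)
    concentration_penalty: Callable[[X], float],  # a(x, S_{1:t-1})
    budget: int,
    lam: float = 0.1,   # hypothetical weight on coverage
    eta: float = 0.1,   # hypothetical weight on future-proxy alignment
    rho: float = 0.1,   # hypothetical weight on the concentration penalty
) -> List[X]:
    """Rank candidates by the LHAS score s(x) and keep the top `budget`."""

    def score(x: X) -> float:
        return (
            utility(x)
            + lam * coverage(x)
            + eta * proxy_alignment(x)
            - rho * concentration_penalty(x)
        )

    return sorted(candidates, key=score, reverse=True)[:budget]
```

Setting `lam`, `eta`, and `rho` to zero recovers plain top-k selection by immediate utility, which gives a direct way to compare the myopic and long-horizon variants under the same budget.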

### 4\.1 Experimental protocol

#### Models and adaptation\.

The main experiments use an 8B\-class open LLM with LoRA fine\-tuning \(Hu et al\., [2022](https://arxiv.org/html/2605.30537#bib.bib29); Dubey et al\., [2024](https://arxiv.org/html/2605.30537#bib.bib51)\)\. LoRA rank is set to 16, the optimizer is AdamW \(Loshchilov and Hutter, [2019](https://arxiv.org/html/2605.30537#bib.bib28)\), and each selector receives the same selected\-token budget\. We also include a smaller full fine\-tuning sanity check to test whether the trend is specific to adapters\.

#### Task sequences\.

We use three sequence families\. The *skill sequence* is general instruction, math, code, reasoning, and safety\. The *domain sequence* is general QA, biomedical QA, legal QA, finance QA, and scientific QA\. The *interleaved sequence* mixes partially overlapping skills so that task boundaries are blurred rather than clean\. Candidate datasets include OpenHermes or UltraChat for general instruction, GSM8K or MathInstruct for math \(Cobbe et al\., [2021](https://arxiv.org/html/2605.30537#bib.bib53)\), CodeAlpaca or code\-generation corpora with HumanEval\-style evaluation \(Chen et al\., [2021](https://arxiv.org/html/2605.30537#bib.bib54)\), MMLU\-style knowledge evaluation \(Hendrycks et al\., [2021](https://arxiv.org/html/2605.30537#bib.bib52)\), and TruthfulQA\-style safety/robustness evaluation \(Lin et al\., [2022](https://arxiv.org/html/2605.30537#bib.bib55)\)\.

#### Metrics\.

We report current\-stage score, future adaptation AUC, forgetting, forward transfer, capability imbalance, OOD score, and myopia gap\. Future adaptation AUC is computed from the early learning curve on stage $t+1$ after finishing stage $t$\. Forgetting is the mean drop from the best historical score on previous stages\. Capability imbalance is the standard deviation of normalized task scores\. OOD score averages shifted or held\-out evaluations\.
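Three of these metrics can be sketched from logged scores. Several details are assumptions rather than the exact definitions used in the experiments: the AUC uses the trapezoidal rule normalized by the step range, forgetting takes the best score over all checkpoints including the final one (so every drop is non-negative), and imbalance uses the population standard deviation.

```python
import statistics
from typing import Sequence


def early_auc(steps: Sequence[float], scores: Sequence[float]) -> float:
    """Normalized trapezoidal area under an early learning curve."""
    if len(steps) != len(scores) or len(steps) < 2:
        raise ValueError("need at least two aligned (step, score) points")
    area = sum(
        (steps[i + 1] - steps[i]) * (scores[i] + scores[i + 1]) / 2.0
        for i in range(len(steps) - 1)
    )
    return area / (steps[-1] - steps[0])


def mean_forgetting(history: Sequence[Sequence[float]]) -> float:
    """history[k][j] is the score on stage j after finishing stage k (j <= k)."""
    num_stages = len(history)
    if num_stages < 2:
        raise ValueError("forgetting needs at least one previous stage")
    final = history[-1]
    drops = []
    for j in range(num_stages - 1):
        best = max(history[k][j] for k in range(j, num_stages))
        drops.append(best - final[j])
    return sum(drops) / len(drops)


def capability_imbalance(normalized_scores: Sequence[float]) -> float:
    """Population standard deviation of normalized task scores."""
    return statistics.pstdev(normalized_scores)
```

All three functions operate on plain lists, so they can be applied to whatever evaluation logs a training run already produces.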

Conceptual flow\. At each stage, an online selector scores candidate examples using the current model state\. A myopic selector repeatedly chooses examples with high immediate utility, producing a narrow update trajectory\. Long\-horizon evaluation then probes whether the resulting model learns the next stage quickly, retains previous stages, and remains robust under distribution shift\.

Figure 1: Data selection as a long\-horizon training intervention\. The selected subset changes not only current performance but also the state from which future tasks are learned\.

## 5 Results and analysis

#### Short\-term winners are not always long\-term winners\.

Table [1](https://arxiv.org/html/2605.30537#S5.T1) shows the experimental results on the skill sequence with a 25% selection budget\. Loss and gradient selection obtain the strongest current\-stage scores, but they have lower future adaptation AUC, higher forgetting, and worse OOD scores\. Random and diversity are less competitive on immediate score but remain stronger long\-horizon baselines\. Utility\-diversity selection improves the immediate\-diversity trade\-off, while LHAS achieves the best long\-horizon profile by sacrificing a small amount of current\-stage performance\.

Figure [2](https://arxiv.org/html/2605.30537#S5.F2) visualizes the same rank reversal together with future learning curves, robustness/forgetting trade\-offs, and myopia\-gap severity\. Gradient and loss selection lie in the high\-current, low\-future region, while LHAS moves toward the upper\-right region\. Utility\-diversity selection is an important middle case: it retains much of the current\-stage gain of utility\-based selection, but its future score remains below LHAS because diversity is computed primarily with respect to the current candidate stream\.

![Figure 2](https://arxiv.org/html/2605.30537v1/x1.png)

Figure 2: Experimental summary\. \(a\) Immediate current\-stage gains can reverse under future adaptation\. \(b\) Myopic selectors slow the next learning stage\. \(c\) Utility\-heavy selectors exhibit worse OOD robustness and forgetting\. \(d\) The myopia gap measures rank disagreement between short\- and long\-horizon evaluation\.

Table 1: Experimental results on the skill sequence with a 25% selection budget and LoRA\-style adaptation dynamics\. Current and OOD are normalized scores; Future AUC measures early learning on the next stage\. Higher is better except for forgetting and myopia gap\. Values are mean ± standard deviation over three seeds\.

#### Myopic selection slows future adaptation\.

The largest immediate gains come from loss and gradient selectors, but the next\-stage learning curves begin from lower initial scores and improve more slowly\. Gradient selection requires approximately 1\.35× as many selected tokens as LHAS to reach the same early\-stage validation threshold on the next task\. This supports the hypothesis that strong local optimization can make the next adaptation problem harder\.

#### Diversity helps but does not fully solve the problem\.

Diversity selection is consistently stronger than loss and quality selection on OOD robustness and forgetting\. Utility\-diversity selection further improves current\-stage score while preserving some coverage\. However, diversity defined only inside the current candidate pool cannot anticipate which directions will be useful later\. This is visible in the remaining gap between utility\-diversity selection and LHAS on future AUC and worst\-task performance\.

#### Selection changes the trajectory\.

Table [2](https://arxiv.org/html/2605.30537#S5.T2) reports diagnostics over selected examples and update directions\. Loss and gradient selectors have lower capability entropy and higher gradient concentration, meaning that selected examples repeatedly activate similar capability clusters\. Diversity and LHAS maintain broader coverage\. Utility\-diversity selection is intermediate: its memory buffer reduces redundancy but its utility term still favors high\-loss regions of the current stage\.

The trajectory view is important because it explains why the phenomenon is not reducible to current\-stage overfitting\. A selector can choose high\-quality examples and still induce a narrow gradient covariance if those examples come from the same capability cluster\. Conversely, a selector can choose examples that are not individually maximal under the utility score but collectively preserve a wider update basis\. This is the mechanism suggested by Proposition [1](https://arxiv.org/html/2605.30537#Thmproposition1)\.

Table 2: Diagnostics for selected subsets and update trajectories\. Capability entropy measures spread across skill clusters; concentration is the top eigenvalue share of the update covariance\.

#### A lightweight long\-horizon objective improves the trade\-off\.

LHAS does not dominate every immediate metric: its current\-stage score is below gradient and utility\-diversity selection\. Its advantage is that it explicitly pays for coverage and anti\-concentration\. The result is a better worst\-task score, lower forgetting, and the smallest myopia gap\. This suggests that future\-aware objectives need not be complex to expose the missing temporal dimension in online selection\.

### 5\.1 Ablations and sensitivity analysis

#### Budget sensitivity\.

Table [3](https://arxiv.org/html/2605.30537#S5.T3) summarizes the budget ablation\. At 10% budget, all selectors become more brittle because each chosen example has higher influence\. The myopia gap is largest in this regime\. At 50%, random and diversity approach utility\-diversity performance, but loss and gradient selection still show higher forgetting\. The qualitative ranking remains stable\.

Figure [3](https://arxiv.org/html/2605.30537#S5.F3) summarizes the same budget trend together with update diagnostics, adaptation\-mode sensitivity, and balanced\-capability metrics\. Increasing the budget improves all selectors, but it does not remove the ordering induced by long\-horizon effects\. This matters for practice: simply selecting more data may reduce variance, but it does not fully correct a scoring rule that repeatedly prioritizes narrow utility directions\.

Table 3: Ablation over selection budgets\. The table reports long\-horizon score, an aggregate of final average performance, future AUC, forgetting, and OOD robustness\.

![Figure 3](https://arxiv.org/html/2605.30537v1/x2.png)

Figure 3: Ablation and diagnostic dashboard\. \(a\) Larger budgets improve all selectors but preserve the long\-horizon ranking\. \(b\) Lower capability entropy correlates with higher update concentration\. \(c\) The trend remains under LoRA\-style and full\-fine\-tuning\-style dynamics\. \(d\) LHAS improves worst\-task performance and forward transfer\.

#### Task order\.

Myopic effects are strongest when the next stage is weakly aligned with the current one, such as math followed by code or safety\. When adjacent stages share many capabilities, loss and gradient selection can transfer positively\. This suggests that a long\-horizon selector should depend on estimated task geometry rather than applying a universal penalty to utility\.

This observation gives a practical diagnostic\. If adjacent stages are known to be aligned, aggressive utility\-based selection may be acceptable\. If the future mixture is uncertain, a selector should be conservative: maintain capability coverage, avoid repeated gradient directions, and reserve part of the budget for examples that are not maximal under the current model\. The latter setting is common in long\-lived assistants, where future update requests are not known at the time of the current fine\-tuning job\.
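The conservative policy described above can be sketched as a split budget. This is a minimal illustration and not the paper's method: the function name `select_with_reserve`, the `reserve_fraction` parameter, and the use of uniform sampling for the reserved part are all assumptions made for this example.

```python
import random


def select_with_reserve(utilities, budget, reserve_fraction=0.2, seed=0):
    """Pick `budget` indices: mostly top-utility, plus a random reserve.

    Hypothetical helper: `utilities[i]` is any per-example score under the
    current model (loss, gradient norm, ...). A `reserve_fraction` share of
    the budget is drawn uniformly from examples that are NOT top-ranked.
    """
    if not 0.0 <= reserve_fraction <= 1.0:
        raise ValueError("reserve_fraction must lie in [0, 1]")
    budget = min(budget, len(utilities))
    n_reserve = int(round(budget * reserve_fraction))
    n_top = budget - n_reserve

    # Rank all candidates by utility, highest first.
    ranked = sorted(range(len(utilities)), key=lambda i: utilities[i], reverse=True)
    top = ranked[:n_top]
    rest = ranked[n_top:]

    # Spend the reserved budget on examples that are not maximal right now.
    rng = random.Random(seed)
    reserve = rng.sample(rest, n_reserve)
    return top + reserve


utilities = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4, 0.5, 0.0]
selected = select_with_reserve(utilities, budget=5, reserve_fraction=0.4)
print(selected)
```

Setting `reserve_fraction=0` recovers pure top-utility selection, and setting it to 1 recovers uniform random selection, so the parameter interpolates between the aggressive and conservative regimes discussed in the text.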

#### LoRA versus full fine\-tuning\.

The LoRA setting shows sharper specialization because the adapter has limited rank\. Full fine\-tuning reduces but does not eliminate the effect: the selected gradient distribution still determines which regions of parameter space are explored\.

#### Practical recommendation\.

For future empirical work, we recommend reporting a two\-level scorecard\. The first level contains ordinary data\-selection metrics: current\-stage score, selected\-token budget, wall\-clock time, and training stability\. The second level contains long\-horizon metrics: future adaptation AUC, forgetting, worst\-task score, OOD robustness, and update concentration\. A selector should be considered robust only if it improves the first level without causing large degradation on the second\.
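Two of the second-level metrics can be computed from a matrix of per-stage evaluations. The sketch below assumes a common continual-learning definition of forgetting (best earlier score minus final score, averaged over earlier tasks), which may differ from the definition used in the paper; the matrix layout `acc[i][j]` and the function names are hypothetical.

```python
def average_forgetting(acc):
    """Mean drop from the best earlier score to the final score.

    Assumed layout: acc[i][j] is the score on task j measured after
    finishing stage i, with one task per stage (square matrix).
    The last task is excluded because it cannot have been forgotten yet.
    """
    num_stages = len(acc)
    if num_stages < 2:
        return 0.0
    final = acc[-1]
    drops = []
    for j in range(num_stages - 1):
        # Best score on task j at any point from stage j up to the
        # second-to-last stage.
        best_earlier = max(acc[i][j] for i in range(j, num_stages - 1))
        drops.append(best_earlier - final[j])
    return sum(drops) / len(drops)


def worst_task_score(acc):
    """Lowest per-task score after the final stage."""
    return min(acc[-1])


# Made-up numbers for three stages, for illustration only.
acc = [
    [0.80, 0.10, 0.10],
    [0.70, 0.90, 0.20],
    [0.60, 0.85, 0.75],
]
print(average_forgetting(acc), worst_task_score(acc))
```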

## 6 Discussion, limitations, and broader impacts

#### What should change in online selection benchmarks?

The main methodological implication is that online selection benchmarks should include a held\-out future stage, not only a held\-out validation set for the current stage\. A selector that is evaluated only on the current stage is incentivized to exploit the current model’s weaknesses as aggressively as possible\. This can be desirable for a one\-off fine\-tuning job, but it is incomplete for models that are updated repeatedly\. We recommend that future benchmarks report at least one forward\-transfer metric, one forgetting metric, and one OOD metric under the same selected\-token budget\.

#### When is myopic selection acceptable?

Myopic selection is not always bad\. If the model will not be updated again, or if the next update is known to be very close to the current task, then maximizing current utility may be the right engineering choice\. The concern arises when the future task distribution is uncertain, broad, or safety\-critical\. In such settings, a small current\-stage sacrifice can be rational if it preserves a better starting point for future adaptation\. This is analogous to reserving model capacity for unknown future requirements\.

#### What kind of future proxy is realistic?

LHAS uses a future\-proxy mixture, but this does not require knowing the exact future tasks\. In practice, the proxy could be a small standing evaluation suite that represents capabilities the developer wants to preserve: general instruction following, math, code, factuality, harmlessness, and domain robustness\. The proxy should not be tuned to maximize a single benchmark\. Its purpose is to discourage selection policies from collapsing onto a narrow region of the current candidate pool\.

#### Why not solve the problem with diversity alone?

Diversity is necessary but not sufficient\. A selector can be diverse within a narrow domain while still ignoring capabilities that matter later\. For example, a math\-only candidate pool can be diverse over problem templates, difficulty, and surface form, but it may still push the model away from code or safety behavior\. Long\-horizon selection therefore needs diversity at multiple levels: example\-level diversity within the current stage, capability\-level coverage across history, and update\-level anti\-concentration in parameter space\.

#### How should this be used in real systems?

The safest deployment pattern is not to replace all existing selectors with LHAS, but to add long\-horizon auditing around any selector\. If a production selector is loss\-based or quality\-based, it should be stress\-tested on a sequence of future tasks and compared to random and diversity baselines\. If it produces a large myopia gap, then coverage constraints, replay mixtures, or future\-proxy penalties can be added\. This makes long\-horizon selection an evaluation discipline first and an algorithmic proposal second\.

### 6\.1 Broader impacts

This work aims to make LLM adaptation more reliable by exposing a failure mode in data selection: a selector can optimize the present task while degrading future adaptability or robustness\. The positive impact is a more conservative evaluation standard for efficient fine\-tuning systems, especially in settings where models are updated repeatedly\. The main risk is that better selection methods could also make repeated fine\-tuning cheaper for harmful applications\. We do not release a new model or dataset in this draft\. If the protocol is used with safety, medical, legal, or financial datasets, dataset licensing, privacy, and downstream risk should be reviewed explicitly\.

### 6\.2 Limitations

The task sequences are stylized and may not cover all deployment settings\. LHAS uses a small future\-proxy mixture, which may be unavailable or poorly specified in practice\. Quality\-based selection depends on reward models or LLM judges that can encode their own biases \(Zheng et al\., [2023](https://arxiv.org/html/2605.30537#bib.bib49)\)\. Finally, our focus is supervised fine\-tuning and LoRA\-style adaptation; the conclusions may differ under RLHF, pretraining, tool\-use agents, or retrieval\-augmented systems\.

## 7 Conclusion

Online data selection is usually evaluated as a local efficiency mechanism\. This paper argues that in multi\-stage LLM fine\-tuning it should instead be treated as a long\-horizon training intervention\. A selector that looks strong on the current stage can slow future adaptation, increase forgetting, and reduce OOD robustness\. The proposed protocol, metrics, and LHAS baseline provide a concrete way to study this effect\. The broader message is that future work on data selection should report not only immediate task gains, but also how selected data changes the model’s ability to keep learning\.

## References

- J\. Achiam, S\. Adler, S\. Agarwal, L\. Ahmad, I\. Akkaya, F\. L\. Aleman, D\. Almeida, J\. Altenschmidt, S\. Altman, S\. Anadkat, et al\. \(2023\) GPT\-4 technical report\. arXiv preprint arXiv:2303\.08774\. Cited by: [§1](https://arxiv.org/html/2605.30537#S1.p1.1)\.
- A\. Albalak, Y\. Elazar, S\. M\. Xie, S\. Longpre, N\. Lambert, X\. Wang, N\. Muennighoff, B\. Hou, L\. Pan, H\. Jeong, et al\. \(2024\) A survey on data selection for language models\. arXiv preprint arXiv:2402\.16827\. Cited by: [§1](https://arxiv.org/html/2605.30537#S1.p1.1)\.
- Deep batch active learning by diverse, uncertain gradient lower bounds\. In International Conference on Learning Representations\. Cited by: [§1](https://arxiv.org/html/2605.30537#S1.p1.1), [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Bai, S\. Bai, Y\. Chu, Z\. Cui, K\. Dang, X\. Deng, Y\. Fan, W\. Ge, Y\. Han, F\. Huang, et al\. \(2023\) Qwen technical report\. arXiv preprint arXiv:2309\.16609\. Cited by: [§1](https://arxiv.org/html/2605.30537#S1.p1.1)\.
- T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. D\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell, et al\. \(2020\) Language models are few\-shot learners\. In Advances in Neural Information Processing Systems\. Cited by: [§1](https://arxiv.org/html/2605.30537#S1.p1.1)\.
- A\. Chaudhry, M\. Ranzato, M\. Rohrbach, and M\. Elhoseiny \(2019\) Efficient lifelong learning with A\-GEM\. In International Conference on Learning Representations\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px3.p1.1)\.
- L\. Chen, S\. Li, J\. Yan, H\. Wang, K\. Gunaratna, V\. Yadav, Z\. Tang, V\. Srinivasan, T\. Zhou, H\. Huang, and H\. Jin \(2024\) AlpaGasus: training a better alpaca with fewer data\. In International Conference on Learning Representations\. Cited by: [§1](https://arxiv.org/html/2605.30537#S1.p1.1), [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px2.p1.1)\.
- M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. d\. O\. Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman, et al\. \(2021\) Evaluating large language models trained on code\. arXiv preprint arXiv:2107\.03374\. Cited by: [§4\.1](https://arxiv.org/html/2605.30537#S4.SS1.SSS0.Px2.p1.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, et al\. \(2021\) Training verifiers to solve math word problems\. arXiv preprint arXiv:2110\.14168\. Cited by: [§4\.1](https://arxiv.org/html/2605.30537#S4.SS1.SSS0.Px2.p1.1)\.
- C\. Coleman, C\. Yeh, S\. Mussmann, B\. Mirzasoleiman, P\. Bailis, P\. Liang, J\. Leskovec, and M\. Zaharia \(2020\) Selection via proxy: efficient data selection for deep learning\. In International Conference on Learning Representations\. Cited by: [§1](https://arxiv.org/html/2605.30537#S1.p1.1), [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px2.p1.1)\.
- T\. Dettmers, A\. Pagnoni, A\. Holtzman, and L\. Zettlemoyer \(2023\) QLoRA: efficient finetuning of quantized LLMs\. In Advances in Neural Information Processing Systems\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px4.p1.1)\.
- N\. Ding, Y\. Qin, G\. Yang, F\. Wei, Z\. Yang, Y\. Su, S\. Hu, Y\. Chen, C\. Chan, W\. Chen, et al\. \(2023\) Parameter\-efficient fine\-tuning of large\-scale pre\-trained language models\. Nature Machine Intelligence\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px4.p1.1)\.
- A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Yang, A\. Fan, et al\. \(2024\) The Llama 3 herd of models\. arXiv preprint arXiv:2407\.21783\. Cited by: [§4\.1](https://arxiv.org/html/2605.30537#S4.SS1.SSS0.Px1.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021\) Measuring massive multitask language understanding\. International Conference on Learning Representations\. Cited by: [§4\.1](https://arxiv.org/html/2605.30537#S4.SS1.SSS0.Px2.p1.1)\.
- N\. Houlsby, A\. Giurgiu, S\. Jastrzebski, B\. Morrone, Q\. De Laroussilhe, A\. Gesmundo, M\. Attariyan, and S\. Gelly \(2019\) Parameter\-efficient transfer learning for NLP\. In International Conference on Machine Learning\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px4.p1.1)\.
- E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\) LoRA: low\-rank adaptation of large language models\. In International Conference on Learning Representations\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px4.p1.1), [§4\.1](https://arxiv.org/html/2605.30537#S4.SS1.SSS0.Px1.p1.1)\.
- A\. H\. Jiang, D\. L\.\-K\. Wong, G\. Zhou, D\. G\. Andersen, J\. Dean, G\. R\. Ganger, G\. Joshi, M\. Kaminsky, M\. Kozuch, Z\. C\. Lipton, et al\. \(2019\) Accelerating deep learning by focusing on the biggest losers\. arXiv preprint arXiv:1910\.00762\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px1.p1.1)\.
- A\. Katharopoulos and F\. Fleuret \(2018\) Not all samples are created equal: deep learning with importance sampling\. In International Conference on Machine Learning\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Kirkpatrick, R\. Pascanu, N\. Rabinowitz, J\. Veness, G\. Desjardins, A\. A\. Rusu, K\. Milan, J\. Quan, T\. Ramalho, A\. Grabska\-Barwinska, et al\. \(2017\) Overcoming catastrophic forgetting in neural networks\. Proceedings of the National Academy of Sciences\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px3.p1.1)\.
- K\. Lee, D\. Ippolito, A\. Nystrom, C\. Zhang, D\. Eck, C\. Callison\-Burch, and N\. Carlini \(2021\) Deduplicating training data makes language models better\. arXiv preprint arXiv:2107\.06499\. Cited by: [§1](https://arxiv.org/html/2605.30537#S1.p1.1), [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px2.p1.1)\.
- B\. Lester, R\. Al\-Rfou, and N\. Constant \(2021\) The power of scale for parameter\-efficient prompt tuning\. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px4.p1.1)\.
- R\. Li, H\. Zou, X\. Yan, Z\. Liang, J\. Yang, C\. Li, and X\. Yang \(2026\) Enhancing pretrained model\-based continual representation learning via guided random projection\. arXiv preprint arXiv:2603\.19145\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px3.p1.1)\.
- X\. L\. Li and P\. Liang \(2021\) Prefix\-tuning: optimizing continuous prompts for generation\. In Proceedings of the Association for Computational Linguistics\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px4.p1.1)\.
- Z\. Li and D\. Hoiem \(2016\) Learning without forgetting\. In European Conference on Computer Vision\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px3.p1.1)\.
- S\. Lin, J\. Hilton, and O\. Evans \(2022\) TruthfulQA: measuring how models mimic human falsehoods\. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics\. Cited by: [§4\.1](https://arxiv.org/html/2605.30537#S4.SS1.SSS0.Px2.p1.1)\.
- X\. Liu, K\. Ji, Y\. Fu, Z\. Du, Z\. Yang, and J\. Tang \(2022\) P\-Tuning v2: prompt tuning can be comparable to fine\-tuning universally across scales and tasks\. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px4.p1.1)\.
- D\. Lopez\-Paz and M\. Ranzato \(2017\) Gradient episodic memory for continual learning\. In Advances in Neural Information Processing Systems\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px3.p1.1)\.
- I\. Loshchilov and F\. Hutter \(2015\) Online batch selection for faster training of neural networks\. arXiv preprint arXiv:1511\.06343\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px1.p1.1)\.
- I\. Loshchilov and F\. Hutter \(2019\) Decoupled weight decay regularization\. In International Conference on Learning Representations\. Cited by: [§4\.1](https://arxiv.org/html/2605.30537#S4.SS1.SSS0.Px1.p1.1)\.
- S\. Mindermann, J\. M\. Brauner, M\. T\. Razzak, M\. Sharma, A\. Kirsch, W\. Xu, B\. Höltgen, A\. N\. Gomez, A\. Morisot, S\. Farquhar, and Y\. Gal \(2022\) Prioritized training on points that are learnable, worth learning, and not yet learnt\. In International Conference on Machine Learning\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px1.p1.1)\.
- B\. Mirzasoleiman, J\. Bilmes, and J\. Leskovec \(2020\) Coresets for data\-efficient training of machine learning models\. In International Conference on Machine Learning\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px2.p1.1)\.
- M\. Paul, S\. Ganguli, and G\. K\. Dziugaite \(2021\) Deep learning on a data diet: finding important examples early in training\. In Advances in Neural Information Processing Systems\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px2.p1.1)\.
- S\. Rebuffi, A\. Kolesnikov, G\. Sperl, and C\. H\. Lampert \(2017\) iCaRL: incremental classifier and representation learning\. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px3.p1.1)\.
- A\. A\. Rusu, N\. C\. Rabinowitz, G\. Desjardins, H\. Soyer, J\. Kirkpatrick, K\. Kavukcuoglu, R\. Pascanu, and R\. Hadsell \(2016\) Progressive neural networks\. arXiv preprint arXiv:1606\.04671\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px3.p1.1)\.
- O\. Sener and S\. Savarese \(2018\) Active learning for convolutional neural networks: a core\-set approach\. In International Conference on Learning Representations\. Cited by: [§1](https://arxiv.org/html/2605.30537#S1.p1.1), [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px2.p1.1)\.
- B\. Settles \(2009\) Active learning literature survey\. Technical report, University of Wisconsin\-Madison Department of Computer Sciences\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px2.p1.1)\.
- B\. Sorscher, R\. Geirhos, S\. Shekhar, S\. Ganguli, and A\. S\. Morcos \(2022\) Beyond neural scaling laws: beating power law scaling via data pruning\. In Advances in Neural Information Processing Systems\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px2.p1.1)\.
- S\. Swayamdipta, R\. Schwartz, N\. Lourie, Y\. Wang, H\. Hajishirzi, N\. A\. Smith, and Y\. Choi \(2020\) Dataset cartography: mapping and diagnosing datasets with training dynamics\. In Proceedings of EMNLP\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px2.p1.1)\.
- K\. Tirumala, D\. Simig, A\. Aghajanyan, and A\. S\. Morcos \(2023\) D4: improving LLM pretraining via document deduplication and diversification\. Advances in Neural Information Processing Systems\. Cited by: [§1](https://arxiv.org/html/2605.30537#S1.p1.1), [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px2.p1.1)\.
- M\. Toneva, A\. Sordoni, R\. T\. d\. Combes, A\. Trischler, Y\. Bengio, and G\. J\. Gordon \(2019\) An empirical study of example forgetting during deep neural network learning\. In International Conference on Learning Representations\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px2.p1.1)\.
- H\. Touvron, L\. Martin, K\. Stone, P\. Albert, A\. Almahairi, Y\. Babaei, N\. Bashlykov, S\. Batra, P\. Bhargava, S\. Bhosale, et al\. \(2023\) Llama 2: open foundation and fine\-tuned chat models\. arXiv preprint arXiv:2307\.09288\. Cited by: [§1](https://arxiv.org/html/2605.30537#S1.p1.1)\.
- J\. T\. Wang, T\. Wu, D\. Song, P\. Mittal, and R\. Jia \(2024\) GREATS: online selection of high\-quality data for LLM training in every iteration\. In Advances in Neural Information Processing Systems\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px1.p1.1)\.
- X\. Wang, Y\. Chen, and W\. Zhu \(2021\) A survey on curriculum learning\. IEEE Transactions on Pattern Analysis and Machine Intelligence\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px2.p2.1)\.
- M\. Xia, S\. Malladi, S\. Gururangan, S\. Arora, and D\. Chen \(2024\) LESS: selecting influential data for targeted instruction tuning\. In International Conference on Machine Learning\. Cited by: [§1](https://arxiv.org/html/2605.30537#S1.p1.1), [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px1.p1.1)\.
- S\. M\. Xie, S\. Santurkar, T\. Ma, and P\. Liang \(2023\) Data selection for language models via importance resampling\. In Advances in Neural Information Processing Systems\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px2.p1.1)\.
- B\. Xu, L\. Zhang, Z\. Mao, Q\. Wang, H\. Xie, and Y\. Zhang \(2020\) Curriculum learning for natural language understanding\. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px2.p2.1)\.
- Y\. Yang, A\. Zeng, and X\. Yang \(2026a\) Towards specialized generalists: a multi\-task MoE\-LoRA framework for domain\-specific LLM adaptation\. arXiv preprint arXiv:2601\.07935\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px4.p1.1)\.
- Y\. Yang, H\. Zhang, M\. Li, J\. Xu, R\. Shen, Z\. Wang, T\. Liu, S\. Chen, and W\. Huang \(2026b\) NeuroLoRA: context\-aware neuromodulation for parameter\-efficient multi\-task adaptation\. arXiv preprint arXiv:2603\.12378\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px4.p1.1)\.
- F\. Zenke, B\. Poole, and S\. Ganguli \(2017\) Continual learning through synaptic intelligence\. In International Conference on Machine Learning\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px3.p1.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, et al\. \(2023\) Judging LLM\-as\-a\-judge with MT\-Bench and Chatbot Arena\. In Advances in Neural Information Processing Systems\. Cited by: [§6\.2](https://arxiv.org/html/2605.30537#S6.SS2.p1.1)\.
- C\. Zhou, P\. Liu, P\. Xu, S\. Iyer, J\. Sun, Y\. Mao, X\. Ma, A\. Efrat, P\. Yu, L\. Yu, et al\. \(2023\) LIMA: less is more for alignment\. Advances in Neural Information Processing Systems\. Cited by: [§1](https://arxiv.org/html/2605.30537#S1.p1.1), [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px2.p1.1)\.
- H\. Zou, Y\. Mao, Y\. Qu, Q\. Wang, and X\. Ji \(2025a\) Utility\-diversity aware online batch selection for LLM supervised fine\-tuning\. arXiv preprint arXiv:2510\.16882\. Cited by: [§1](https://arxiv.org/html/2605.30537#S1.p1.1), [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px1.p1.1), [§4](https://arxiv.org/html/2605.30537#S4.SS0.SSS0.Px6.p1.1)\.
- H\. Zou, Y\. Zang, and X\. Ji \(2025b\) Structural features of the fly olfactory circuit mitigate the stability\-plasticity dilemma in continual learning\. arXiv preprint arXiv:2502\.01427\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px3.p1.1)\.
- H\. Zou, Y\. Zang, W\. Xu, and X\. Ji \(2025c\) Fly\-CL: a fly\-inspired framework for enhancing efficient decorrelation and reduced training time in pre\-trained model\-based continual representation learning\. arXiv preprint arXiv:2510\.16877\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px3.p1.1)\.
- H\. Zou, Y\. Zang, W\. Xu, Y\. Zhu, and X\. Ji \(2025d\) FlyLoRA: boosting task decoupling and parameter efficiency via implicit rank\-wise mixture\-of\-experts\. In The Thirty\-ninth Annual Conference on Neural Information Processing Systems\. Cited by: [§2](https://arxiv.org/html/2605.30537#S2.SS0.SSS0.Px4.p1.1)\.

## Appendix A Additional implementation details

#### Selector implementation\.

Random selection samples uniformly without replacement\. Loss selection ranks candidates by the current negative log\-likelihood\. Gradient selection ranks by per\-example gradient norm or a first\-order approximation to it\. Diversity selection uses farthest\-first traversal in a frozen embedding space\. Quality selection ranks by an external quality score, which could come from a reward model, an LLM judge, or a curated metadata field\. Utility\-diversity selection combines an immediate utility term with memory\-buffer diversity\. LHAS adds historical coverage, future\-proxy alignment, and concentration penalties\.
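The diversity selector above relies on farthest-first traversal, which can be sketched in a few lines. This is an illustrative version of the standard greedy k-center procedure, not the authors' code; the function name, the Euclidean metric, and the fixed starting index are assumptions, and the embeddings are toy 2-D points standing in for a frozen embedding space.

```python
import math


def farthest_first(embeddings, k, start=0):
    """Greedy farthest-first traversal (k-center heuristic).

    Starting from `start`, repeatedly add the point whose distance to the
    already-selected set is largest. Distances are Euclidean (an assumption;
    cosine distance is another common choice).
    """
    n = len(embeddings)
    k = min(k, n)
    if k == 0:
        return []
    selected = [start]
    # min_dist[i] = distance from point i to its nearest selected point.
    min_dist = [math.dist(e, embeddings[start]) for e in embeddings]
    while len(selected) < k:
        nxt = max(range(n), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i in range(n):
            d = math.dist(embeddings[i], embeddings[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Two tight clusters and one isolated point (made-up data).
points = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.0, 0.1), (5.0, 5.0)]
print(farthest_first(points, k=3))
```

On this toy input the traversal picks one point from each cluster and the isolated point, rather than two near-duplicates from the same cluster, which is the behavior a diversity selector is meant to provide.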

#### Evaluation cadence\.

For each stage, the intended implementation evaluates the current stage at regular intervals and also evaluates all previous stages after every stage transition\. Future adaptation AUC is computed by checkpointing the model after stage $t$, training briefly on stage $t+1$, and integrating the resulting early validation curve\. This metric is more informative than the final score alone because it measures how much future optimization effort is needed\.
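The integration step can be sketched with the trapezoidal rule. The function name `adaptation_auc`, the normalization by the step span, and the assumption that the curve records a higher-is-better validation score are choices made for this example; the paper's exact integration scheme is not specified in this section.

```python
def adaptation_auc(steps, scores):
    """Trapezoidal area under an early validation curve.

    `steps` are strictly increasing training-step counts on the next stage,
    and `scores` are the validation scores measured at those steps.
    The area is divided by the step span (an assumed normalization) so the
    result is on the same scale as the scores themselves.
    """
    if len(steps) != len(scores) or len(steps) < 2:
        raise ValueError("need at least two (step, score) pairs")
    area = 0.0
    for i in range(len(steps) - 1):
        width = steps[i + 1] - steps[i]
        if width <= 0:
            raise ValueError("steps must be strictly increasing")
        area += 0.5 * (scores[i] + scores[i + 1]) * width
    return area / (steps[-1] - steps[0])


# Made-up curves: both end at the same score, but one adapts faster.
steps = [0, 10, 20, 30]
fast = [0.2, 0.4, 0.5, 0.5]
slow = [0.2, 0.25, 0.35, 0.5]
print(adaptation_auc(steps, fast), adaptation_auc(steps, slow))
```

The two toy curves reach the same final score, yet the faster one has the larger AUC, which illustrates why the final score alone can hide differences in adaptation effort.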

## Appendix B Proofs

### B\.1 Proof of Proposition [1](https://arxiv.org/html/2605.30537#Thmproposition1)

The next\-stage loss is

$$L_{t+1}(\theta)=\frac{1}{2}\|\theta-\theta_{t+1}^{\star}\|_{H}^{2}=\frac{1}{2}(\theta-\theta_{t+1}^{\star})^{\top}H(\theta-\theta_{t+1}^{\star}). \tag{6}$$

After selecting $S$, the current\-stage update gives $\theta_{S}^{\prime}=\theta-\eta\bar{g}_{S}$\. Substituting this into the next\-stage loss,

$$L_{t+1}(\theta_{S}^{\prime})=\frac{1}{2}(\theta-\eta\bar{g}_{S}-\theta_{t+1}^{\star})^{\top}H(\theta-\eta\bar{g}_{S}-\theta_{t+1}^{\star}) \tag{7}$$

$$\phantom{L_{t+1}(\theta_{S}^{\prime})}=L_{t+1}(\theta)-\eta\langle\bar{g}_{S},H(\theta-\theta_{t+1}^{\star})\rangle+\frac{\eta^{2}}{2}\bar{g}_{S}^{\top}H\bar{g}_{S}. \tag{8}$$

For sufficiently small $\eta$, or for two selectors with comparable second\-order terms, the ordering of next\-stage loss is dominated by the linear term\. Therefore, if

$$\langle\bar{g}_{S_{a}},H(\theta-\theta_{t+1}^{\star})\rangle<\langle\bar{g}_{S_{b}},H(\theta-\theta_{t+1}^{\star})\rangle, \tag{9}$$

then $S_{a}$ produces higher next\-stage loss than $S_{b}$ after the current update\. Equal current improvement only constrains the projection of $\bar{g}_{S}$ onto the current\-stage descent direction; it does not constrain the projection onto $H(\theta-\theta_{t+1}^{\star})$\. Thus two selectors can be tied on immediate gain while differing in future adaptation cost\. $\square$
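A small numerical check can make the argument concrete. The sketch below evaluates the next-stage loss directly and through the expansion in Equation (8) for two hypothetical selected gradients. All numbers, including the symmetric matrix `H` and the vector `current_grad` standing in for the current-stage gradient, are made up for illustration.

```python
def matvec(H, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def next_stage_loss(theta, theta_star, H):
    """Quadratic loss 0.5 * (theta - theta*)^T H (theta - theta*), as in Eq. (6)."""
    d = [a - b for a, b in zip(theta, theta_star)]
    return 0.5 * dot(d, matvec(H, d))


def loss_after_update(theta, theta_star, H, g, eta):
    """Direct evaluation after the step theta' = theta - eta * g, as in Eq. (7)."""
    stepped = [t - eta * gi for t, gi in zip(theta, g)]
    return next_stage_loss(stepped, theta_star, H)


def loss_via_expansion(theta, theta_star, H, g, eta):
    """Right-hand side of Eq. (8); the identity assumes H is symmetric."""
    d = [a - b for a, b in zip(theta, theta_star)]
    linear = dot(g, matvec(H, d))
    quadratic = dot(g, matvec(H, g))
    return next_stage_loss(theta, theta_star, H) - eta * linear + 0.5 * eta**2 * quadratic


# Toy setup (all values are illustrative assumptions).
H = [[2.0, 0.0], [0.0, 1.0]]
theta, theta_star = [1.0, 1.0], [0.0, 0.0]
current_grad = [1.0, 1.0]   # stand-in for the current-stage gradient
g_a, g_b = [0.0, 1.0], [1.0, 0.0]
eta = 0.1

# Both selectors have the same projection onto the current-stage gradient,
# so they are tied on first-order immediate gain.
print(dot(g_a, current_grad), dot(g_b, current_grad))
# Their next-stage losses after the update nevertheless differ.
print(loss_after_update(theta, theta_star, H, g_a, eta),
      loss_after_update(theta, theta_star, H, g_b, eta))
```

In this toy case `g_a` has the smaller inner product with $H(\theta-\theta_{t+1}^{\star})$, so it ends with the higher next-stage loss, matching the ordering condition in Equation (9).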

### B\.2 A concentration corollary

Let future task directions be sampled from a distribution with covariance $\Sigma_{f}$\. If selected gradients have covariance $\Sigma_{S}$ with high top\-eigenvalue share, then the expected squared projection of selected updates onto a random future direction is dominated by a small number of directions\. When $\Sigma_{f}$ is broad or rotated away from the current task, this concentration increases variance in future transfer: some future tasks benefit, but many receive little useful alignment\. Coverage\-aware selection reduces this variance by flattening the selected\-gradient spectrum\.
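The top-eigenvalue share mentioned above can be estimated from a batch of selected gradients. The sketch below is one way to do so under stated assumptions: it defines the share as the largest eigenvalue of the empirical covariance divided by its trace, and it uses plain power iteration for the largest eigenvalue. The function names and toy gradients are hypothetical, and the paper may measure concentration differently.

```python
def covariance(grads):
    """Empirical covariance (divide by n) of a list of gradient vectors."""
    n, d = len(grads), len(grads[0])
    mean = [sum(g[j] for g in grads) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for g in grads:
        c = [g[j] - mean[j] for j in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += c[a] * c[b] / n
    return cov


def top_eigenvalue_share(cov, iters=200):
    """Largest eigenvalue divided by the trace, via power iteration.

    Assumes `cov` is symmetric positive semidefinite. Power iteration can
    stall if the start vector is orthogonal to the top eigenvector; the
    slightly uneven start vector below makes that unlikely but not impossible.
    """
    d = len(cov)
    trace = sum(cov[i][i] for i in range(d))
    if trace <= 0.0:
        return 0.0
    v = [1.0 + 0.1 * i for i in range(d)]
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    cv = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
    # Rayleigh quotient gives the eigenvalue estimate for direction v.
    top = sum(v[a] * cv[a] for a in range(d)) / sum(x * x for x in v)
    return top / trace


# Toy gradients: one set lies along a single axis, the other is spread out.
concentrated = [[1.0, 0.0], [-1.0, 0.0], [2.0, 0.0], [-2.0, 0.0]]
spread = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
print(top_eigenvalue_share(covariance(concentrated)))
print(top_eigenvalue_share(covariance(spread)))
```

A share close to 1 indicates that nearly all gradient variance lies in a single direction, while a share close to 1/d indicates a flat spectrum in d dimensions.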

## Appendix C Additional results

Figure [4](https://arxiv.org/html/2605.30537#A3.F4) provides additional views of the results\. The task\-order heatmap shows that LHAS remains strongest across all stage orders, but the gap is larger when future stages are less aligned with the current stage\. The concentration\-forgetting plot shows the mechanism from a different angle: selectors with concentrated updates also show larger forgetting\.

![Refer to caption](https://arxiv.org/html/2605.30537v1/x3.png)

Figure 4: Additional diagnostics\. \(a\) Task\-order sensitivity across four stage orders\. \(b\) Update concentration is positively associated with forgetting\.
