# Rethinking Data Curation in LLM Training: Online Reweighting Offers Better Generalization than Offline Methods
Source: [https://arxiv.org/html/2605.05227](https://arxiv.org/html/2605.05227)
Wanru Zhao¹, Yihong Chen², Yuzhi Tang³†, Wentao Ma³†, Shengchao Hu⁴, Shell Xu Hu⁵, Alex Iacob¹, Abhinav Mehrotra⁵, Nicholas D. Lane¹
¹University of Cambridge, ²OATML, University of Oxford, ³University of Toronto, ⁴Shanghai Jiao Tong University, ⁵Samsung AI Center. †Equal contribution.
###### Abstract
Data curation is a critical yet underexplored area in large language model (LLM) training. Existing methods, such as data selection and mixing, operate in an *offline* paradigm, detached from training. This separation introduces engineering overhead and makes the curation brittle: the entire pipeline must be re-run under model/task shifts. Moreover, offline methods alter data size through hard filtering or resampling, often sacrificing data diversity and harming generalization.
We propose to rethink data curation as an *online* reweighting problem, where sample importance is dynamically adjusted during training via loss weighting rather than static pre-processing. Specifically, we introduce ADAPT (Adaptive Data reweighting for Pretraining and Fine-Tuning), a dynamic online framework that reweights training samples with adaptive per-sample learning rates guided by similarity-based quality signals, without changing the number of training samples. Unlike offline methods that enforce a static data distribution, ADAPT acts as an implicit curriculum learner, progressively shifting focus from coarse-grained patterns to fine-grained semantic distinctions as the model evolves. Experiments on both instruction tuning and large-scale pretraining show that ADAPT consistently outperforms offline selection/mixing and prior online methods, achieving stronger cross-benchmark generalization under equal FLOPs. Models: [https://huggingface.co/DATA-ADAPT](https://huggingface.co/DATA-ADAPT) · Code: [https://github.com/Ryan0v0/ADAPT](https://github.com/Ryan0v0/ADAPT)
## 1 Introduction
Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks (brown2020language; chowdhery2022palm; touvron2023llama), yet their generalization usually hinges on the quality of the training data, including its diversity and mixing (hoffmann2022training; kaplan2020scaling). While some recent efforts seek to circumvent the data quality issue and improve generalization through architectural interventions (peyrard2022invariant; chen2023improving; bengio2025superintelligent), most pragmatic practices still rely on curating the "right" dataset for the model and the task. Data curation – deciding what data to keep and how to weight it – has thus become a cornerstone of LLM development. In practice, however, data curation has largely remained a secret sauce, with pipelines relying on opaque heuristics, ad-hoc engineering choices, or costly trial-and-error rather than principled frameworks.
Current approaches to data curation largely follow an *offline* paradigm, falling into two camps: data selection, which keeps a subset of valuable examples, and data mixing, which adjusts sampling frequencies to rebalance distributions. Despite their differences, both approaches typically follow a multi-stage pipeline: (1) train or extract features with a proxy model, (2) compute quality signals as selection or weighting criteria on a validation set, and (3) retrain the main model on the curated data. While effective in some cases, this paradigm suffers from fundamental drawbacks (Sec. [4](https://arxiv.org/html/2605.05227#S4)) due to its static and subtractive nature. First, it ignores training dynamics: the value of a data sample is not static but changes as the model learns. Offline selection freezes this value based on a proxy model, creating a misalignment with the training model's evolving needs. Second, it compromises diversity: by relying on hard filtering to select a fixed subset, offline methods risk discarding the broad data distribution essential for robust generalization. Consequently, the resulting subsets often overfit to specific benchmarks and are brittle: the entire pipeline must be re-run whenever the model architecture or corpus changes.
In contrast, we propose a paradigm shift to *online reweighting*, which fundamentally differs from offline selection not just in *when* the signal is computed, but also in *how* the data is used. Instead of hard-filtering a subset based on static scores, online reweighting retains the full dataset to preserve diversity, but dynamically modulates each sample's contribution via loss weighting. This preserves data diversity compared to offline "hard cuts" while intelligently controlling each sample's "step size" in parameter updates to adapt to evolving model states.
To address the limitations of offline data curation, we propose a unified online reweighting framework, ADAPT (Adaptive Data reweighting for Pretraining and Fine-Tuning), which aligns training samples with a validation set via similarity-based weighting. Our method incurs nearly zero additional overhead yet achieves superior cross-benchmark generalization. Under a unified evaluation protocol, it consistently outperforms both offline selection/mixing strategies and prior online reweighting methods, across large-scale pretraining and instruction tuning setups.
#### Our contributions.
This work makes the following contributions:
1. **Formalization and unification of data curation pipelines.** We present a unified view of data curation, showing that data selection, mixing, and balancing can all be expressed as special cases of *data reweighting*. This perspective shifts the focus from offline pre-processing to dynamic, in-training adjustment of data weights, and offers a fair protocol for viewing and evaluating these methods through the unified lens of *online data reweighting*.
2. **Online sample-level reweighting method.** We introduce a new online global reweighting algorithm that is effective under this unified formalization and protocol. It employs an adaptive per-sample learning rate guided by sample quality signals, dynamically adjusting sample weights during training without altering the dataset size.
3. **Better performance and generalization across tasks.** On both instruction tuning and large-scale pretraining experiments, our framework achieves up to 7.2% higher accuracy per FLOP than offline baselines, including LESS (xia2024less), DoReMi (doremi), and RegMix (liu2024regmix), as well as other online reweighting methods (sow2025dynamic), and demonstrates 6.1% stronger out-of-domain generalization across benchmarks.
## 2 Background and Related Work
### 2.1 Data curation for large language models
Researchers and engineers usually rely on web crawls to gather large datasets for training large language models (gpt3; redpajama; penedo2024finewebdatasetsdecantingweb; tang2024txt360; bai2023qwen; kandpal2025common). However, these crawls often include a substantial amount of low-quality or irrelevant content, which makes data curation necessary to build high-quality training sets. Most data curation efforts focus on methods for improving model performance (raffel2020exploring; gpt3; gopher; penedo2023refinedweb; dolma), including filtering by language (raffel2020exploring; xue2020mt5), heuristic-based filtering (pile; gopher; penedo2023refinedweb; dolma), quality filtering (du2022glam; xie2023data), data deduplication (Lee2021DeduplicatingTD), and mixing (xie2023doremi; cerebras2023slimpajama; albalak2023efficient). However, current approaches to curating such datasets are generally ad-hoc. We aim to develop a principled and automated method for data curation that can also unify different processing stages.
#### Pretraining data curation.
Several recent studies (xie2023doremi; chen2023skill; fan2023doge; thakkar2023self) have explored reweighting techniques to enhance the generalization and efficiency of language model pretraining. For instance, xie2023doremi and fan2023doge aim to find the optimal mixture of pretraining corpora to enhance performance across domains. chen2023skill propose an ordered skill-learning method for data selection that measures how effectively a sample teaches interdependent skills for continual pretraining and fine-tuning. Although effective, these works operate at the group level, whereas our work explores reweighting at the sample level, offering fine-grained control over model training dynamics. Instance-level reweighting has been used in LLM pretraining settings (chen2024take; jiang2024importance), where each sample in a mini-batch is weighted based on its loss value. In contrast, our work studies the effects of various adaptive learning rates with different quality signals that account for the model state, to enhance both performance and generalization across LLM pretraining and fine-tuning regimes.
#### Instruction data curation.
Research has demonstrated that prioritizing data quality and diversity over quantity is more helpful for instruction-following capabilities (cao2023instruction; chen2023alpagasus; bukharin2023data; du2023mods; liu2023makes; li2023one). Instruction tuning data includes task-based datasets curated from traditional NLP tasks (wang2022super; sanh2022multitask; wei2022finetuned; longpre2023flan) and open-ended datasets (alpaca; DatabricksBlog2023DollyV2; kopf2023openassistant; xu2023wizardlm; mukherjee2023orca; zhou2023lima; ding2023enhancing). In our work, we focus on adaptively adjusting data weights during instruction tuning, without explicitly reducing the number of instruction samples, to obtain a model with better generalization.
### 2.2 Commonly used quality signals for data curation
- **Lexical Similarity (BM25)** (silva2024improving) quantifies term-based overlap between training and validation data through sparse retrieval scoring:
  $$s_{\text{BM25}}(x)=\frac{1}{|\mathcal{D}_{\rm val}|}\sum_{v\in\mathcal{D}_{\rm val}}\text{BM25}(x,v).\qquad(1)$$
  This metric captures surface-level textual similarity without semantic understanding.
- **Semantic Similarity (Embedding)** (rubin2021learning) measures dense representation alignment using pretrained encoders:
  $$s_{\text{Embed}}(x)=\frac{1}{|\mathcal{D}_{\rm val}|}\sum_{v\in\mathcal{D}_{\rm val}}\cos\big(\phi(x),\phi(v)\big),\qquad(2)$$
  where $\phi(\cdot)$ denotes a frozen embedding model. This approach captures semantic proximity beyond lexical overlap.
- **Distributional Alignment (Perplexity)** (antonello2020selecting) evaluates likelihood under a reference language model $\theta_0$:
  $$s_{\text{PPL}}(x)=-\log P_{\theta_0}(x).\qquad(3)$$
  Lower perplexity indicates stronger distributional alignment with the reference corpus.
- **Gradient-Based Influence** (xia2024less) estimates training utility through a first-order approximation of validation loss reduction:
  $$s_{\text{Grad}}(x)=\big\langle\nabla_{\theta}\ell(x;\theta_0),\,\nabla_{\theta}\ell(\mathcal{D}_{\rm val};\theta_0)\big\rangle,\qquad(4)$$
  where $\theta_0$ denotes a proxy model. This metric directly quantifies how training on $x$ influences performance on the validation set $\mathcal{D}_{\rm val}$.
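To make the similarity-based signals concrete, the sketch below computes the embedding signal of Eq. (2) as a mean cosine similarity, using toy 2-d vectors in place of real encoder outputs; the function names are illustrative, not from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (with a small floor on the norms)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / max(nu * nv, 1e-8)

def semantic_score(phi_x, val_embeddings):
    """Eq. (2): mean cosine similarity of one training example's embedding
    to every validation embedding."""
    return sum(cosine(phi_x, v) for v in val_embeddings) / len(val_embeddings)

# Toy embeddings: x_aligned points the same way as the validation set, x_off does not.
val = [[1.0, 0.0], [0.9, 0.1]]
x_aligned = [1.0, 0.05]
x_off = [0.0, 1.0]
assert semantic_score(x_aligned, val) > semantic_score(x_off, val)
```

The other three signals follow the same pattern: replace `cosine` by a BM25 score, a reference-model log-likelihood, or a gradient inner product.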
## 3 Unified Formalization of Data Curation
#### Setup.
We are given two datasets: the *train* set $\mathcal{D}_{\rm train}$ and the *validation* set $\mathcal{D}_{\rm val}$. Usually, the validation set is much smaller than the train set. When $\mathcal{D}_{\rm val}$ is sampled from the distribution of downstream test data, it is considered *in-domain*. The train set $\mathcal{D}_{\rm train}$, on the other hand, consists of both in-domain and *out-of-domain* samples, as is the case when web crawling is used to collect training data from the whole internet. Our goal is to subsample or weight the train set $\mathcal{D}_{\rm train}$ under the guidance of the validation set $\mathcal{D}_{\rm val}$, so that model training is less affected by out-of-domain samples.
### 3.1 Design Space of Data Quality Signals
In Sec. [2](https://arxiv.org/html/2605.05227#S2), we presented a unified framework for data quality assessment in which a *scoring function* of the form $s(x)\equiv s(x;\theta,\mathcal{D}_{\rm val})$ assigns a quality signal to each data example $x$, optionally depending on a (proxy) model $\theta$ and the validation set $\mathcal{D}_{\rm val}$.
### 3.2 Offline Data Curation
The scores are computed once before training (e.g., with a proxy model $\theta_0$ and a fixed validation set $\mathcal{D}_{\rm val}$). The resulting weights $\{w(x)\}_{x\in\mathcal{D}_{\rm train}}$ are fixed and used to decide which training examples pass to the actual training stage.
- **Data Selection (sample-level binary weights).** Data selection (or filtering) removes part of the training corpus before pretraining begins. It keeps data examples with quality signal above a certain threshold $\tau$:
  $$\mathcal{D}^{\prime}_{\rm train}=\{x\in\mathcal{D}_{\rm train}\mid s(x)\geq\tau\}.$$
  From a data reweighting perspective, filtering data points is equivalent to assigning binary weights of the form $w(x)=\mathbf{1}[s(x)\geq\tau]\in\{0,1\}$.
- **Data Mixing (domain-level fractional weights).** Data mixing can be seen as a coarse-grained data reweighting method that operates at the domain level: data points within the same domain receive the same quality score. Pooling all domain scores $\{s_d\}_{d\in\text{domains}}$ and normalizing them yields a mixing probability for each domain $d$:
  $$w_d=\frac{g(s_d)}{\sum_{d^{\prime}}g(s_{d^{\prime}})},$$
  where $g$ transforms the score to be non-negative, a common choice being $g\colon s\mapsto\exp(s)$. The weight $w_d$ determines how much of a total training budget $B$ is allocated to each domain: $B_d=w_d B$. Alternatively, training can be implemented by sampling domains with probability $w_d$ (probability mixing) or by assigning a fixed quota $B_d$ (quota mixing). Since $w_d$ is decided before training, mixing is an *offline* operation that alters the effective number of examples each domain contributes.
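Under this unified view, both offline operations reduce to simple weight assignments. A minimal sketch, with $g(s)=\exp(s)$ for mixing; the function names and toy scores are illustrative, not from the paper:

```python
import math

def selection_weights(scores, tau):
    """Offline selection: binary weight w(x) = 1[s(x) >= tau]."""
    return [1 if s >= tau else 0 for s in scores]

def mixing_weights(domain_scores):
    """Offline mixing: w_d = g(s_d) / sum_d' g(s_d'), with g = exp."""
    exps = {d: math.exp(s) for d, s in domain_scores.items()}
    z = sum(exps.values())
    return {d: e / z for d, e in exps.items()}

# Selection keeps only examples scoring above the threshold.
assert selection_weights([0.9, 0.2, 0.7], tau=0.5) == [1, 0, 1]

# Mixing allocates a larger share of the budget to higher-scoring domains.
w = mixing_weights({"web": 0.0, "code": 1.0})
assert abs(sum(w.values()) - 1.0) < 1e-9 and w["code"] > w["web"]
```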
### 3.3 Online Data Curation
In this setting, the scoring function depends on the evolving model state $\theta_t$ and, in some cases, on a dynamic validation set. The resulting weights $\{w_t(x)\}_{x\in\mathcal{D}_{\rm train}}$ evolve throughout training, adjusting the *gradient contributions* of examples while preserving the full volume of the training set $\mathcal{D}_{\rm train}$.
#### Data Reweighting (sample-level fractional weights)
Data reweighting assigns normalized weights to examples while keeping the training set size unchanged. Similar to Sec. [3.2](https://arxiv.org/html/2605.05227#S3.I1.i2), the quality score of example $x$ is transformed into a non-negative weight $w(x)=g(s(x))$, which scales its contribution to the loss:
$$\mathcal{L}^{*}(\theta)=\frac{1}{Z}\sum_{x\in\mathcal{D}}w(x)\,\mathcal{L}(\theta;x),\qquad Z=\sum_{x\in\mathcal{D}}w(x).$$
Equivalently, from a stochastic gradient descent perspective, the stochastic gradient w.r.t. $x$ is scaled by $\frac{w(x)}{Z}$, which plays the role of a *per-sample* learning rate on top of the *global* learning rate $\eta$:
$$\theta_{t+1}=\theta_{t}-\eta\frac{w(x)}{Z}\,\nabla_{\theta}\ell(\theta_{t};x).$$
Since the total number of training examples remains unchanged and only their relative contribution to parameter updates is modified, data reweighting is inherently an *online* method.
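The scaled-gradient update can be sketched directly; the example below applies one reweighted step to a toy two-parameter model, with per-sample gradients supplied as plain lists (the function name is ours, not from the paper).

```python
def reweighted_step(theta, grads, weights, eta):
    """One online-reweighting update: each per-sample gradient is scaled
    by w(x)/Z, so the weight acts as a per-sample learning rate."""
    z = sum(weights)
    update = [0.0] * len(theta)
    for g, w in zip(grads, weights):
        for k in range(len(theta)):
            update[k] += (w / z) * g[k]
    return [t - eta * u for t, u in zip(theta, update)]

theta = [1.0, 1.0]
grads = [[1.0, 0.0], [0.0, 1.0]]          # gradients of two samples
new = reweighted_step(theta, grads, weights=[3.0, 1.0], eta=0.1)
# Upweighting sample 0 (w=3) moves theta mostly along its gradient:
assert new[0] < new[1]
```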
### 3.4 Unified Evaluation of Total FLOPs (Offline vs. Online)
This unification enables us to compare different methods under the same FLOPs accounting framework. We evaluate curation methods under a *cost-aware* metric that accounts for all computation spent to obtain and use the curated data. Let $B$ denote a training budget (tokens or update steps).
#### Total FLOPs for offline data curation
Offline curation modifies the *amount of data that participates in training* before training starts. Its total compute is
$$\mathsf{F}^{\mathrm{off}}_{\mathrm{total}}=\underbrace{\mathsf{F}^{\mathrm{off}}_{\mathrm{prep}}(\mathcal{D};f)}_{\text{data scoring as preprocessing}}+\underbrace{\mathsf{F}_{\mathrm{train}}\big(\mathcal{D}^{\prime},B^{\prime}\big)}_{\text{model training on the sampled subset}},\qquad(5)$$
where (i) $\mathsf{F}^{\mathrm{off}}_{\mathrm{prep}}$ includes any corpus-wide scoring, proxy-model passes, retrieval, or filtering necessary to construct $\mathcal{D}^{\prime}$; (ii) $\mathcal{D}^{\prime}\subseteq\mathcal{D}$ is the retained subset (for selection), or an *effective* subset induced by data mixing; (iii) $B^{\prime}$ is the effective training budget after curation. By construction, offline selection uses *binary* example weights (keep/drop), and offline domain mixing changes per-domain sampling rates, thereby reducing or reallocating the volume of data seen during training.
#### Total FLOPs for online data curation
Online curation *does not change the amount of training data*; instead, it modulates each example's contribution *during training*. Its total compute is
$$\mathsf{F}^{\mathrm{on}}_{\mathrm{total}}=\underbrace{\mathsf{F}_{\mathrm{train}}\big(\mathcal{D},B\big)}_{\text{model training on the full corpus}}+\underbrace{\mathsf{F}^{\mathrm{on}}_{\mathrm{metrics}}}_{\text{on-the-fly features}},\qquad(6)$$
where $\mathsf{F}^{\mathrm{on}}_{\mathrm{metrics}}$ denotes lightweight, in-training computations needed to obtain scores (e.g., using current logits, per-example loss, or gradient norms). Because scoring is amortized inside the training loop and no data are removed, $\mathsf{F}^{\mathrm{on}}_{\mathrm{metrics}}$ is typically modest relative to $\mathsf{F}_{\mathrm{train}}$.
Thus, whereas offline curation pays a substantial preprocessing cost and then trains on a reduced dataset, online reweighting maintains the full training signal while adding only minimal in-loop metric computation, providing a clearer accuracy–compute trade-off when measured by *total* FLOPs.
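The accounting in Eqs. (5) and (6) is plain arithmetic; the toy comparison below uses hypothetical FLOP units (not measurements from the paper) to show how a large preprocessing cost can erase the savings of training on a subset.

```python
def offline_total(f_prep, f_train_subset):
    """Eq. (5): corpus scoring/preprocessing plus training on the subset."""
    return f_prep + f_train_subset

def online_total(f_train_full, f_metrics):
    """Eq. (6): training on the full corpus plus in-loop scoring."""
    return f_train_full + f_metrics

# Hypothetical budgets: full training costs 100 units; a proxy-model
# scoring pass (35) nearly offsets training on a 60-unit subset, while
# in-loop metrics add only a few percent on top of full training.
assert offline_total(f_prep=35.0, f_train_subset=60.0) == 95.0
assert online_total(f_train_full=100.0, f_metrics=3.0) == 103.0
```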
## 4 Revisiting Offline Data Selection
In this section, based on the evaluation protocol derived from the unification above, we revisit the accuracy–FLOPs trade-off of existing offline data selection methods.
Figure 12: In-domain evaluation: MMLU validation set → MMLU benchmark.
Figure 14: Out-of-domain evaluation: MMLU validation set → BBH benchmark.
Figure 15: Efficiency–accuracy tradeoff of different data selection methods and full finetuning under our proposed *total FLOPs* metric. Selection methods (e.g., LESS) appear competitive under in-domain evaluation, but their advantage vanishes or even reverses in out-of-domain settings. In contrast, full SFT remains consistently strong across domains.
#### Observation.
Figure[15](https://arxiv.org/html/2605.05227#S4.F15)illustrates the limitation of offline data selection methods:Overfitting to the validation benchmark\.Models trained on offline\-selected data with MMLU\(mmlu\)as validation set perform well on the MMLU benchmark \(Figure[15](https://arxiv.org/html/2605.05227#S4.F15)\), but the same models generalize poorly to the BBH benchmark\(bbh\)\(Figure[15](https://arxiv.org/html/2605.05227#S4.F15)\)\. This issue is especially severe when using LESS\(xia2024less\)\. This reveals that offline methods often*overfit to the chosen validation task*, lacking a true generalization\. However, we observed that vanilla full dataset training demonstrates more stable performance in both benchmarks\. One possible reason is that offline data curation, in which we directly change the number of data involved in the training by repetition, would cause the model to replace generalization ability with memorization\(hernandez2022scaling\)\.
In summary, the offline paradigm has severe limitations, being neither cost-effective nor generalizable. This motivates us to explore an online alternative, which may alleviate memorization while retaining more data to enhance generalization. In the next section, we leverage these insights to design an *online reweighting method*.
## 5 Adaptive Learning Rate for Online Data Reweighting
The limitations of offline data selection suggest that a different paradigm is needed: one that is generalizable and adaptive to the evolving state of the model. To this end, we propose an *online data reweighting framework* that also accommodates data selection and data mixing under a unified formalization. Unlike offline methods that commit to a fixed subset before training begins, our framework dynamically adjusts data weights as the model learns, so it naturally adapts to new model states without re-running the data preprocessing and training pipeline from scratch.
### 5.1 Per-Sample Learning Rate Update
We cast data selection and mixing as *online data reweighting*. At training step $t$, given a minibatch $B_t\subset\mathcal{D}$, we update model parameters as
$$\theta_{t+1}=\theta_{t}-\eta\sum_{i\in B_{t}}w_{t}(i)\,\nabla_{\theta}\ell\big(f_{\theta}(x_{i}),y_{i}\big),\qquad(7)$$
where $w_t(i)\geq 0$ denotes a dynamic weight assigned to sample $i$. Here, $w_t(i)$ directly scales the gradient contribution of sample $i$ and thus acts as a *per-sample learning rate multiplier*: larger weights amplify the effective step size on informative examples, while smaller weights downweight less useful ones.
In our framework, weights are derived from scoring functions $s_t(v,i)$ that compare a training example $i$ with validation/query points $v\in\mathcal{D}_{\rm val}$. These scores are aggregated across $v$, normalized within the current batch or pool, and transformed by a smooth gating function to produce the final weights $\widehat{w}_t(i)$. To ensure stability, weights are clipped to prevent excessively large effective learning rates and avoid gradient explosion. We now describe our score functions.
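The score → aggregate → gate → clip pipeline can be sketched as one small function. This is a simplified stand-alone version: the name `sample_weight`, the mean aggregation, and the clip bound `w_max` are illustrative choices, with a sigmoid gate as used later in Eq. (11).

```python
import math

def sample_weight(scores_vs_val, tau=1.0, w_max=2.0):
    """Illustrative weight pipeline: aggregate per-validation-point scores,
    squash with a temperature-scaled sigmoid gate, then clip for stability."""
    s = sum(scores_vs_val) / len(scores_vs_val)       # aggregate over v
    w = 1.0 / (1.0 + math.exp(-s / max(tau, 1e-8)))   # smooth sigmoid gate
    return min(w, w_max)                              # clip the effective LR

# Higher similarity to the validation set yields a larger per-sample weight,
# and clipping bounds the weight even for extreme scores.
assert sample_weight([0.9, 0.8]) > sample_weight([-0.5, -0.4])
assert sample_weight([100.0]) <= 2.0
```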
### 5.2 Model-Agnostic Per-Sample Learning Rate Update
**ADAPT-BM25** quantifies term-based overlap between training and validation data through sparse retrieval scoring:
$$s_{\text{BM25}}(x)=\frac{1}{|\mathcal{D}_{\rm val}|}\sum_{v\in\mathcal{D}_{\rm val}}\text{BM25}(x,v).\qquad(8)$$
This metric captures surface-level textual similarity without semantic understanding. We use the standard BM25 score $s_{\mathrm{BM25}}(v,i)$ over sparse token matches between query $v$ and example $i$. Aggregated scores are normalized to obtain target weights.
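For reference, a textbook Okapi BM25 scorer in pure Python (a minimal sketch over pre-tokenized lists; hyperparameters `k1` and `b` are the common defaults, and the toy corpus is ours):

```python
import math

def bm25(query, doc, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of `doc` for `query`, both given as token lists;
    `corpus` supplies document frequencies and the average length."""
    n = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n
    score = 0.0
    for term in set(query):
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)  # smoothed idf
        tf = doc.count(term)                               # term frequency
        denom = tf + k1 * (1.0 - b + b * len(doc) / avg_len)
        score += idf * tf * (k1 + 1.0) / denom
    return score

corpus = [["data", "curation", "matters"], ["cats", "sleep", "a", "lot"]]
query = ["data", "curation"]
# The document sharing the query's terms scores higher, as Eq. (8) intends.
assert bm25(query, corpus[0], corpus) > bm25(query, corpus[1], corpus)
```

Averaging `bm25(v, x, corpus)` over all validation points `v` then gives the $s_{\text{BM25}}(x)$ of Eq. (8).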
### 5.3 Per-Sample Learning Rate Update with Model States
**ADAPT** measures alignment using the model's own dense representations rather than a frozen encoder (ivison2025large). For an input $x$ with last-layer hidden states $\{h_i\}_{i=1}^{L}$, we compute a position-weighted mean pooling:
$$w_{i}=\frac{i}{\sum_{j=1}^{L}j},\qquad\phi(x)=\sum_{i=1}^{L}w_{i}h_{i},\qquad(9)$$
where later tokens receive higher weights to counteract the causal-mask bias of decoder-only models. We then define the similarity score:
$$s_{\text{ADAPT}}(x)=\frac{1}{|\mathcal{D}_{\rm val}|}\sum_{v\in\mathcal{D}_{\rm val}}\cos\big(\phi(x),\phi(v)\big).\qquad(10)$$
Equivalently, we can instantiate a representation-based scorer $s_{\mathrm{ADAPT}}(v,i)=\cos\big(\phi(v),\phi(i)\big)$, where $\phi(\cdot)$ is the weighted hidden representation.
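The pooling of Eq. (9) is easy to sketch in pure Python; the toy 2-d "hidden states" below stand in for real last-layer activations, and the function name is ours.

```python
def position_weighted_pool(hidden_states):
    """Eq. (9): phi(x) = sum_i w_i h_i with w_i = i / sum_{j=1}^{L} j,
    so later tokens (least affected by the causal mask) weigh more."""
    L = len(hidden_states)
    denom = L * (L + 1) / 2              # sum_{j=1}^{L} j
    dim = len(hidden_states[0])
    pooled = [0.0] * dim
    for i, h in enumerate(hidden_states, start=1):
        for k in range(dim):
            pooled[k] += (i / denom) * h[k]
    return pooled

# Three toy 2-d hidden states: the last one gets weight 3/6, the first 1/6.
h = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
phi = position_weighted_pool(h)
assert phi[1] > phi[0]   # the pooled vector leans toward the last token
```

Plugging the pooled vectors into a cosine similarity against the anchor embeddings then yields $s_{\text{ADAPT}}(x)$ of Eq. (10).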
We use a sigmoid function with temperature scaling to produce *absolute* weights that are independent of the batch composition:
$$w_{t}(i)=\sigma\left(\frac{s_{\text{ADAPT}}(x_{i})}{\max(\tau,\epsilon)}\right)=\frac{1}{1+\exp\big(-s_{\text{ADAPT}}(x_{i})/\max(\tau,\epsilon)\big)},\qquad(11)$$
where $\sigma(\cdot)$ denotes the sigmoid function, $\tau>0$ is a temperature hyperparameter (default $\tau=1.0$), and $\epsilon>0$ is a small constant (e.g., $\epsilon=10^{-8}$) used for numerical stability. The temperature $\tau$ controls the steepness of the sigmoid: larger values produce flatter weight distributions, while smaller values create sharper distinctions between high- and low-similarity samples.
**Global Weighting via Sigmoid.** The sigmoid transformation maps similarity scores to absolute weights in the interval $[0,1]$ without requiring normalization across samples in the batch. This design ensures that the weight assigned to each sample $x_i$ depends solely on its similarity score $s_{\text{ADAPT}}(x_i)$ relative to the anchor set, rather than its rank within the current mini-batch. Consequently, a sample with a given similarity score receives the same weight regardless of whether it appears in a high-quality or low-quality batch, making the weighting mechanism robust to batch-level variations in data quality. This contrasts with normalized weighting schemes (e.g., softmax normalization), where weights are relative to other samples in the same batch, potentially amplifying or dampening effects based on batch composition.
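The batch-independence argument can be checked numerically. In the sketch below, `softmax_weights` is a hypothetical batch-relative alternative included only for contrast; the sigmoid matches Eq. (11) with $\tau=1$.

```python
import math

def sigmoid_weight(s, tau=1.0):
    """Eq. (11): an absolute weight, independent of the rest of the batch."""
    return 1.0 / (1.0 + math.exp(-s / tau))

def softmax_weights(scores):
    """Hypothetical batch-relative scheme: weights depend on batch-mates."""
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

batch_a = [0.5, 2.0, 2.0]    # same sample among high-scoring batch-mates
batch_b = [0.5, -2.0, -2.0]  # same sample among low-scoring batch-mates
# Sigmoid: the 0.5-score sample gets an identical weight in both batches.
assert sigmoid_weight(batch_a[0]) == sigmoid_weight(batch_b[0])
# Softmax: the same sample's weight swings with batch composition.
assert softmax_weights(batch_b)[0] > softmax_weights(batch_a)[0]
```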
**Embedding Normalization.** We apply L2 normalization to the embeddings $\phi(x)$ and $\phi(v)$ before computing cosine similarities, ensuring scale-invariant similarity measurements. Specifically, for each embedding vector we compute $\phi(x)\leftarrow\phi(x)/\max(\|\phi(x)\|_2,\epsilon)$, where $\epsilon$ prevents division by zero. This normalization step is distinct from weight normalization and serves to standardize the embedding space for similarity computation.
**Online Validation Embedding Updates.** To maintain alignment with the evolving model representations, anchor embeddings $\{\phi(v)\}_{v\in\mathcal{D}_{\rm val}}$ are refreshed every $R$ training steps via forward passes on the current model parameters $\theta_t$. This online update mechanism ensures that similarity scores reflect the model's current representation space rather than outdated embeddings from earlier training stages. The refresh interval $R$ is a hyperparameter that balances computational efficiency against representation freshness. The complete algorithm is summarized in Algorithm [1](https://arxiv.org/html/2605.05227#alg1).
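A minimal sketch of the refresh scheduling (the function name is ours; the actual re-embedding of $\mathcal{D}_{\rm val}$ under $\theta_t$ is elided as a comment):

```python
def anchor_refresh_steps(num_steps, R):
    """Steps at which the anchor embeddings {phi(v)} would be recomputed
    under the current parameters theta_t, i.e., every R training steps."""
    refreshes = []
    for t in range(num_steps):
        if t % R == 0:
            refreshes.append(t)   # re-embed D_val with a forward pass here
        # ... reweighted gradient step on the minibatch goes here ...
    return refreshes

# With R = 4 over 10 steps, anchors refresh at steps 0, 4, and 8; between
# refreshes, cached embeddings are reused, bounding the extra cost to
# |D_val| forward passes per R training steps.
assert anchor_refresh_steps(10, 4) == [0, 4, 8]
```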
While gradient-based methods (e.g., LESS) directly measure influence, gradients fluctuate significantly during early training and are expensive to compute frequently. In contrast, semantic embeddings provide a smoother, more stable signal of relevance that evolves consistently with the model's representation space, offering a robust guide for reweighting without the noise of instantaneous gradient estimates.
## 6 Evaluation
### 6.1 Experimental Design
#### Instruction Tuning Models and Data.
We use LoRA (hu2021lora) to fine-tune the base model Llama-2-7B (touvron2023llama2). Following the experimental setup in wang2023far, we use the instruction tuning datasets Flan V2 (longpre2023flan), CoT (wei2022chain), Dolly (DatabricksBlog2023DollyV2), and Open Assistant 1 (kopf2023openassistant). These datasets do not contain any obvious in-domain data for the target queries. We evaluate our method on MMLU (hendrycks2020measuring), a multiple-choice dataset spanning 57 tasks, and BBH (suzgun2023challenging), a subset of BIG-Bench selected to evaluate reasoning capabilities. For MMLU, we report 5-shot accuracy; for BBH, we report the 3-shot exact match score. Appendix [G](https://arxiv.org/html/2605.05227#A7) contains more details on the training hyperparameters; Appendices [D](https://arxiv.org/html/2605.05227#A4) and [E](https://arxiv.org/html/2605.05227#A5) contain more dataset details.
#### Instruction Tuning Baselines.
We introduce the instruction finetuning baselines in Sec. [2.2](https://arxiv.org/html/2605.05227#S2.SS2).
#### Pretraining Models and Data.
We adopt the TinyLlama architecture (tinyllama) with 120M parameters, using FlashAttention (flash) and Lit-GPT (litgpt). More details about the training settings can be found in Appendix [F](https://arxiv.org/html/2605.05227#A6). Following prior work (llama; tinyllama; qrating; doremi), we employ SlimPajama (llama; redpajama) as the text corpus, which is specifically curated for pretraining LLMs. All selections are performed on about 590M training files of SlimPajama, processed with the Llama tokenizer (llama). We evaluate our method on a diverse set of 15 downstream benchmarks, following common practice in prior work such as RegMix (liu2024regmix). These tasks span a wide range of realistic settings, including: ARC-E (arc_easy18), ARC-C (arc_easy18), COPA (copa20), HellaSwag (zellers2019hellaswag), Lambada-S (paperno2016lambada), Lambada-O (radford2019language), LogiQA (liu2020logiqa), MultiRC (multirc18), OpenBookQA (openbook18), PiQA (bisk2020piqa), QQP (qqp18), RACE (lai2017race), SciQ (sciq17), Social IQA (sap2019socialiqa), and WinoGrande (sakaguchi2021winogrande). The accuracy reported in [Table 2](https://arxiv.org/html/2605.05227#S6.T2) is measured in the 0-shot setting, scored using the lm-eval-harness evaluation framework (eval-harness).
#### Pretraining Baselines\.
We compare ADAPT with Uniform selection and existing file selection methods for LLM pre-training, including DoReMi (doremi) and RegMix (liu2024regmix). ADAPT requires an anchor set, which typically consists of examples from the evaluation distribution. To construct this set, we sample 50 validation examples from each of eight evaluation benchmarks: ARC-C, COPA, Lambada, MultiRC, PiQA, RACE, SciQ, and Social IQA. For DoReMi and RegMix we use the domain weights in (lu2023chameleon) as the selection ratios of text samples across domains; for Uniform we use the same ratios to sample from each domain. We also compare against LinUpper (sow2025dynamic), an online sample reweighting strategy in which the sample weight is proportional to the normalized loss but capped at a predefined α value, ensuring that outliers do not dominate training. Due to the large cost of pretraining, for each method we sample 9B unique tokens from SlimPajama and train for a total budget of 50B tokens, i.e., approximately 5.6 epochs.
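The capped-weight idea behind LinUpper can be sketched in a few lines. This is an illustrative re-implementation, not the exact formulation of sow2025dynamic; the batch-mean normalization and the default α value are our assumptions:

```python
import numpy as np

def linupper_weights(losses, alpha=2.0):
    """Sketch of the LinUpper idea: per-sample weights proportional to the
    batch-normalized loss, capped at alpha so that outliers cannot dominate.
    The normalization scheme and alpha value are illustrative assumptions."""
    losses = np.asarray(losses, dtype=float)
    normalized = losses / losses.mean()   # normalize within the batch
    return np.minimum(normalized, alpha)  # cap at the predefined alpha
```

For a batch with one outlier loss, e.g. `linupper_weights([1.0, 1.0, 1.0, 9.0])`, the hard sample is upweighted but only up to the cap, illustrating why such purely loss-driven schemes can still misallocate weight when high loss reflects noise rather than usefulness.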
### 6.2 Results for Instruction Tuning
Figure 27: In-domain evaluation: MMLU validation set → MMLU benchmark.

Figure 29: Out-of-domain evaluation: MMLU validation set → BBH benchmark.

Figure 30: Efficiency-accuracy tradeoff. (a) While offline BM25 is efficient at low FLOPs, it saturates early; ADAPT dominates in the high-performance regime (higher FLOPs), avoiding the early plateauing of heuristics. (b) Crucially, LESS overfits to the in-domain task, regressing on the OOD task (BBH), whereas ADAPT consistently improves generalization on the Pareto frontier.
As shown in Figure [30](https://arxiv.org/html/2605.05227#S6.F30), our proposed online data reweighting method consistently outperforms existing approaches under comparable computational budgets:
Table 1: Generalization results under different validation-test configurations.

| Method | MMLU(val) → MMLU(test) | MMLU(val) → BBH(test) |
|---|---|---|
| BM25 | 48.7 ± 0.9 | 42.3 ± 0.8 |
| Embedding | 47.0 ± 0.6 | 40.1 ± 0.5 |
| LESS | 50.2 ± 0.5 | 38.7 ± 1.5 |
| PPL | 46.2 ± 1.1 | 40.9 ± 0.9 |
| Random | 43.5 ± 0.3 | 38.4 ± 1.0 |
| Full Dataset SFT | 49.7 ± 0.2 | 44.4 ± 0.3 |
| ADAPT-BM25 | 50.9 ± 0.6 | 43.7 ± 1.2 |
| ADAPT | 50.7 ± 0.7 | 44.8 ± 1.3 |

Comparison with offline data selection methods. Across all baselines in both panels of Figure [30](https://arxiv.org/html/2605.05227#S6.F30), our approach consistently lies on the Pareto frontier, demonstrating the best trade-off between computational efficiency and final accuracy. At the same FLOPs, our method achieves substantially higher accuracy than LESS, the state-of-the-art offline data selection method; to reach the same performance level, our approach requires significantly fewer FLOPs. Moreover, after reaching this level, further training with LESS leads to overfitting and accuracy degradation, whereas online data reweighting continues to exhibit steady performance gains. Notably, our online reweighting method using BM25 as the quality signal (ADAPT-BM25) generalizes better than its offline selection counterpart (BM25) with all other configurations kept the same, which demonstrates the advantage of online data curation.
Comparison with full-data fine-tuning. As shown in Figure [30](https://arxiv.org/html/2605.05227#S6.F30) and Table [1](https://arxiv.org/html/2605.05227#S6.T1), our method also outperforms full-data fine-tuning under equal FLOPs, with the performance gap widening as training progresses. Furthermore, our reweighting strategy that incorporates model state information (ADAPT) outperforms our fixed-state online reweighting (ADAPT-BM25), highlighting the importance of adapting to the evolving model state for better generalization.
Cross-benchmark generalization. As shown in Figure [30](https://arxiv.org/html/2605.05227#S6.F30), we further validate the generalization capability of models trained with ADAPT: we use MMLU (mmlu) as the validation set to select the data, and evaluate the resulting model on the BBH (bbh) task. The results, also reported in Table [1](https://arxiv.org/html/2605.05227#S6.T1), show that ADAPT achieves better cross-benchmark generalization than the offline data selection baselines and is comparable to vanilla full-dataset instruction tuning. This highlights a key strength of ADAPT: when new downstream tasks or benchmarks emerge, there is no need to repeat the entire data selection and model training pipeline. Over the long term, this substantially reduces FLOPs costs, making ADAPT highly practical for real-world applications.
Table 2: Benchmark performance of *TinyLlama-120M* trained on 50B tokens using Uniform, LinUpper, DoReMi, RegMix, and ADAPT.

| Tasks | Uniform | LinUpper | DoReMi | RegMix | ADAPT-BM25 | ADAPT |
|---|---|---|---|---|---|---|
| ARC-C | 17.75 ± 0.00 | 18.77 ± 0.24 | 18.00 ± 0.84 | 18.60 ± 0.06 | 18.31 ± 0.55 | 19.11 ± 0.60 |
| ARC-E | 39.60 ± 0.10 | 38.85 ± 0.12 | 40.70 ± 0.87 | 41.04 ± 0.21 | 39.44 ± 0.60 | 39.06 ± 0.54 |
| COPA | 63.00 ± 0.00 | 60.00 ± 1.72 | 66.00 ± 1.95 | 61.00 ± 1.50 | 63.20 ± 0.40 | 64.00 ± 1.32 |
| HellaSwag | 28.39 ± 0.00 | 27.90 ± 0.13 | 28.03 ± 0.03 | 27.63 ± 0.03 | 27.96 ± 0.07 | 28.11 ± 0.00 |
| Lambada-O | 24.68 ± 0.64 | 23.64 ± 0.28 | 22.38 ± 0.73 | 24.82 ± 0.94 | 26.19 ± 0.30 | 24.63 ± 0.53 |
| Lambada-S | 16.98 ± 0.81 | 16.79 ± 0.91 | 16.30 ± 1.05 | 18.30 ± 0.14 | 18.43 ± 0.33 | 18.07 ± 0.79 |
| LogiQA | 20.43 ± 0.54 | 21.35 ± 0.03 | 21.51 ± 0.11 | 20.89 ± 0.07 | 20.52 ± 0.69 | 21.66 ± 0.69 |
| MultiRC | 56.68 ± 0.00 | 50.74 ± 0.29 | 48.99 ± 2.57 | 56.70 ± 0.93 | 56.53 ± 0.44 | 55.67 ± 1.11 |
| OpenBookQA | 15.80 ± 0.40 | 14.00 ± 0.18 | 15.20 ± 0.60 | 18.20 ± 0.50 | 15.40 ± 0.72 | 14.20 ± 0.43 |
| PIQA | 60.55 ± 0.29 | 60.83 ± 0.45 | 59.90 ± 0.39 | 58.98 ± 0.29 | 61.00 ± 0.19 | 61.48 ± 0.23 |
| QQP | 36.88 ± 0.00 | 37.85 ± 1.21 | 36.84 ± 0.04 | 36.83 ± 0.00 | 36.82 ± 0.00 | 36.81 ± 0.00 |
| RACE | 27.85 ± 0.14 | 26.41 ± 0.18 | 26.41 ± 0.03 | 27.85 ± 0.00 | 27.39 ± 0.35 | 26.60 ± 0.49 |
| SciQ | 71.00 ± 1.00 | 70.70 ± 1.34 | 72.90 ± 0.23 | 71.20 ± 0.50 | 72.48 ± 0.44 | 72.50 ± 1.37 |
| SocialIQA | 36.90 ± 0.03 | 36.85 ± 0.16 | 36.95 ± 0.07 | 37.41 ± 0.46 | 36.52 ± 0.47 | 37.05 ± 0.30 |
| WinoGrande | 50.75 ± 0.17 | 50.83 ± 0.09 | 49.64 ± 0.79 | 50.12 ± 0.72 | 50.61 ± 1.25 | 50.99 ± 0.00 |
| Average (All) (↑) | 37.81 ± 0.13 | 37.03 ± 0.12 | 37.32 ± 0.15 | 37.97 ± 0.02 | 38.05 ± 0.21 | 38.00 ± 0.22 |
| Average (Unseen) (↑) | 31.98 ± 0.09 | 30.56 ± 0.28 | 31.95 ± 0.17 | 32.46 ± 0.39 | 33.49 ± 0.37 | 33.73 ± 0.39 |
| FLOPs overhead (↓) | 0 | 0 | 4.92×10^19 | 3.072×10^18 | ≪ 1.0×10^14 | ≪ 1.1×10^15 |
Figure 42: Distribution of effectively selected examples.

Figure 44: Similarity distribution for different epochs.

Figure 45: A summary of the distribution of effectively selected samples and similarity scores. Both the differences across methods and the changes across epochs demonstrate the proposed method's ability to capture data features and learn adaptively.
#### Data mixing and curriculum\.
We present the distribution of the effectively selected examples for the different methods in Figure [45](https://arxiv.org/html/2605.05227#S6.F45). The approaches end up selecting very different amounts of data from each training subset, which shows that online reweighting can also help determine a proper data mixture. We also calculate the *effective proportion of used data* by summing the weights of all samples and normalizing by the dataset size; the result is 0.501 in our experimental setting. In practice, this number automatically adapts to the overall quality of the training corpus being selected from.
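Concretely, with the σ(s/τ) weighting from Algorithm 1 this quantity is just the mean per-sample weight. The sketch below is illustrative; the temperature value and function names are ours:

```python
import math

def adapt_weight(score, tau=0.1, eps=1e-8):
    """Per-sample weight from a similarity-based quality score, as in
    Algorithm 1: sigma(score / max(tau, eps)). tau = 0.1 is illustrative."""
    return 1.0 / (1.0 + math.exp(-score / max(tau, eps)))

def effective_proportion(scores, tau=0.1):
    """Effective proportion of used data: the sum of all sample weights
    divided by the number of samples, i.e. the mean weight."""
    weights = [adapt_weight(s, tau) for s in scores]
    return sum(weights) / len(weights)
```

A score distribution roughly symmetric around zero yields a proportion near 0.5, consistent with the 0.501 observed in our setting; a higher-quality corpus shifts the scores, and hence the proportion, upward.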
Figure [45](https://arxiv.org/html/2605.05227#S6.F45) also reports the similarity distributions across epochs. The shift from epoch 1 to epoch 2 reflects a transition from collapsed, overly homogeneous representations toward more diverse and fine-grained embeddings. This diversification enhances generalization by improving the model's ability to capture subtle distinctions and by reducing over-reliance on coarse features. Notably, this process can be viewed as an instance of implicit curriculum learning: the model first clusters samples based on coarse, "easier" patterns and progressively moves toward harder, fine-grained discrimination, mirroring the principles of curriculum learning without explicit scheduling.
Table 3: Benchmark performance of *TinyLlama-120M* trained on 100B tokens using Uniform and ADAPT.

| Tasks | Uniform | ADAPT |
|---|---|---|
| ARC-C | 17.75 | 18.86 |
| ARC-E | 40.66 | 40.45 |
| COPA | 63.00 | 61.00 |
| HellaSwag | 28.35 | 28.75 |
| Lambada-O | 25.13 | 25.91 |
| Lambada-S | 18.47 | 20.14 |
| LogiQA | 20.28 | 21.66 |
| MultiRC | 55.98 | 56.44 |
| OpenBookQA | 16.00 | 15.00 |
| PIQA | 61.48 | 61.97 |
| QQP | 36.81 | 36.82 |
| RACE | 27.37 | 27.46 |
| SciQ | 71.50 | 72.40 |
| SocialIQA | 35.77 | 37.05 |
| WinoGrande | 51.14 | 51.54 |
| Average (All) (↑) | 37.98 | 38.36 |
| Average (Unseen) (↑) | 32.51 | 32.67 |
### 6.3 Pretraining Results
As shown in Table [2](https://arxiv.org/html/2605.05227#S6.T2), ADAPT outperforms all baselines in average performance, achieving improvements of 0.19% over Uniform, 0.68% over DoReMi, and 0.97% over LinUpper. RegMix reaches comparable but slightly lower average accuracy. Interestingly, LinUpper performs worse than Uniform sampling, suggesting that naive sample-level adjustments may not generalize well. Compared to LinUpper, which normalizes and computes sample-level weights within each batch, our global sample-level reweighting demonstrates superior performance in handling unknown ratios of mixed-quality corpora.
To examine the effect of larger training budgets, we extend training to 100B tokens. As shown in Table [3](https://arxiv.org/html/2605.05227#S6.T3), ADAPT achieves higher benchmark performance than Uniform on 11 of 15 downstream tasks, with an average improvement of 0.38%. Table [6](https://arxiv.org/html/2605.05227#A8.T6) further shows that ADAPT consistently yields lower validation perplexity than Uniform at both the 50B and 100B training budgets. These findings demonstrate that ADAPT improves both validation perplexity and downstream task performance consistently across compute scales.
## 7 Conclusion and Discussion
Our work unifies data selection, mixing, and reweighting under a FLOPs-aware framework. We show that offline pipelines often incur significant preprocessing cost and face issues with generalization. In contrast, online reweighting integrates seamlessly into training, adapts to the model state, and improves efficiency without reducing the effective data size.
Empirically, we demonstrate that ADAPT consistently outperforms both offline and online baselines across instruction tuning and pretraining setups. In instruction tuning, ADAPT not only achieves higher accuracy under equal FLOPs but also exhibits stronger cross-benchmark generalization (e.g., from MMLU to BBH), mitigating the brittleness of offline methods. In pretraining, ADAPT improves both downstream task accuracy and validation perplexity under 50B and 100B token budgets, underscoring its robustness across compute scales. We further discuss the implications and outline future directions in Sec. [A](https://arxiv.org/html/2605.05227#A1).
## Ethics and Reproducibility Statements
We adhere to the ICLR Code of Ethics\. This paper focuses on comparing methodologies for data curation\. To ensure fairness, we evaluate online data reweighting and offline data mixing methods under a FLOPs\-equivalent setting\. We also rigorously test their performance in both instruction fine\-tuning and pretraining scenarios to provide a comprehensive and balanced comparison\. During the preparation of this manuscript, we utilized large language models \(LLMs\) to assist with grammar correction and refinement of the writing\. We provide all necessary details to ensure reproducibility of our work\. The theoretical justification for our FLOPs\-equivalent data curation formalization is presented in Section[3](https://arxiv.org/html/2605.05227#S3), with detailed FLOP calculations in Appendix[C](https://arxiv.org/html/2605.05227#A3)\. Implementation details and training protocols are provided in Section[6\.1](https://arxiv.org/html/2605.05227#S6.SS1), Appendix[B](https://arxiv.org/html/2605.05227#A2), Appendix[F](https://arxiv.org/html/2605.05227#A6), and Appendix[G](https://arxiv.org/html/2605.05227#A7), while descriptions of the training data are given in Appendix[D](https://arxiv.org/html/2605.05227#A4)\.
## Acknowledgement
We would like to thank Zhihao Zhan, Ziqing Fan, Jonathan Hayase, Guilherme Penedo, Hynek Kydlíček and Loubna Ben Allal for many stimulating and helpful discussions. This research was supported by the following entities: The Royal Academy of Engineering via DANTE (a RAEng Chair); the European Research Council, specifically the REDIAL project; SPRIND under the composite learning challenge; Google through a Google Academic Research Award. Yihong acknowledges funding from the Horizon Europe grant No. 101213369 (DVPS).
## References
## Appendix
## Appendix A Implication and Future Work
Implication. ADAPT can be used to gain better control under constrained budgets (both data budget and compute budget). Our empirical observations provide a principled view and a reference for automatically allocating computational resources to the most valuable samples while maintaining end-to-end efficiency. Additionally, we expect our method to perform even better when data quality is low, which is the common case in practical setups: data is crude and not ideal for direct use in LLM training, drawn from sources of generally low quality, and subject to the biases inherent in the distribution of content on the web.
Future Work. We did not explicitly discuss or evaluate deduplication, a key step in data preprocessing. Within our framework, deduplication can be treated as reweighting related documents according to their frequency. Building on this view, we will explore the trade-offs inherent in each stage, such as quality-aware deduplication, where high-quality documents are allowed to appear more than once.
In future work, we aim to explore ADAPT for safety-oriented data curation (gal2025customizable) and to study how dynamic data reweighting influences memorization and internal knowledge structure formation in large language models (chen2025structure). In particular, combining ADAPT with interpretability techniques such as influence functions (basu2020influence) and Jet Expansions (chen2026jet) may enable us to trace how safety-critical data shapes learned representations and to examine whether models preserve underlying data distributions under dynamic reweighting. More broadly, we view ADAPT as a mechanism for dynamically modulating what information is reinforced during training. Such *data-centric* control of model memory opens avenues not only for improving model quality and safety but also for gaining insight into LLM internal knowledge formation when paired with model-side interpretability methods.
## Appendix B Instruction Tuning Baselines
We compare ADAPT with the following baselines:

1) Random Selection: we randomly sample data from the instruction tuning dataset.
2) BM25 (robertson2009probabilistic): we assign weights to training samples based on textual statistical features (i.e., TF-IDF) and select the top-k data points with the highest scores.
3) PPL (yin2024computeconstraineddataselection; marion2023moreinvestigatingdatapruning; ankner2024perplexedperplexityperplexitybaseddata): we compute the loss of each training sample on the original base model and use it as the sample's score.
4) LESS (xia2024less): we train LoRAs on a random subset of the data, then select data by computing the gradient-based influence of each training sample on validation samples.
## Appendix C Data-Selection FLOPs
To estimate computational costs throughout our paper, we adopt the methodology of kaplan2020scalinglawsneurallanguage, which approximates training computation as roughly 6N FLOPs per processed token, where N is the model's parameter count (approximately 7B). According to kaplan2020scalinglawsneurallanguage, the forward pass consumes roughly half the computational resources of the backward pass, yielding an estimate of 2N FLOPs per token during sample processing. We use an approximation of 2,048 tokens per sample, as we limit all samples to this maximum length during both training and selection. Note that in all experimental configurations, we conduct full fine-tuning of models over two complete epochs. Defining N as the model size, P as the data pool size (measured in samples), D as the number of samples chosen for training, and E as the number of training epochs, the computational expense of each methodology is:
1. Random Selection: 2k × 6N × D × E
2. BM25: 2k × 6N × D × E
3. Embedding: 2k × 2N′ × P + 2k × 6N × D × E (embedding model with N′ parameters)
4. PPL: 2k × 2N × P + 2k × 6N × D × E
5. LESS: 1.53 × 2k × 6N × P + 2k × 6N × D × E (LESS computes gradients for three checkpoints over the entire pool.)
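These estimates are straightforward to reproduce in code. The sketch below implements the five formulas above; the function name and signature are ours, not taken from any released codebase:

```python
def selection_flops(method, N, P, D, E, tokens=2048, N_embed=None):
    """FLOPs estimate per Appendix C: training costs 6N FLOPs/token and a
    forward pass 2N FLOPs/token. N = model params, P = pool size (samples),
    D = selected samples, E = training epochs, tokens = tokens per sample."""
    train = tokens * 6 * N * D * E            # fine-tuning on selected data
    if method in ("random", "bm25"):
        select = 0                            # negligible scoring cost
    elif method == "embedding":
        select = tokens * 2 * N_embed * P     # forward pass of embed model
    elif method == "ppl":
        select = tokens * 2 * N * P           # forward pass of base model
    elif method == "less":
        select = 1.53 * tokens * 6 * N * P    # gradients at 3 checkpoints
    else:
        raise ValueError(f"unknown method: {method}")
    return select + train
```

Comparing the selection terms directly, the overhead of LESS exceeds that of PPL by a factor of 1.53 × 6N / 2N = 4.59 regardless of pool size, which is why gradient-based offline selection dominates the FLOPs budget at scale.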
## Appendix D Training Datasets in Instruction Tuning
Table[4](https://arxiv.org/html/2605.05227#A4.T4)contains information about the training sets used in instruction tuning\.
| Dataset | # Instances | Sourced from | # Rounds | Prompt Len. | Completion Len. |
|---|---|---|---|---|---|
| Flan V2 | 100,000 | NLP datasets and human-written instructions | 1 | 355.7 | 31.2 |
| CoT | 100,000 | NLP datasets and human-written CoTs | 1 | 266 | 53.2 |
| Dolly | 15,011 | Human-written from scratch | 1 | 118.1 | 91.3 |
| Open Assistant 1 | 55,668 | Human-written from scratch | 1.6 | 34.8 | 212.5 |

Table 4: Details of training datasets from wang2023far. Len. is short for token length.
## Appendix E Evaluation Datasets in Instruction Tuning
Table[5](https://arxiv.org/html/2605.05227#A5.T5)contains detailed statistics of the evaluation datasets used in instruction finetuning\.
| Dataset | # Shot | # Tasks | \|𝒟_val\| | \|𝒟_test\| | Answer Type |
|---|---|---|---|---|---|
| MMLU | 5 | 57 | 285 | 18,721 | Letter options |
| BBH | 3 | 23 | 69 | 920 | CoT and answer |

Table 5: Statistics of evaluation datasets. The selected evaluation tasks cover different answer types.
## Appendix F Pretraining Details
We follow all settings in TinyLlama (tinyllama). The optimizer is AdamW (adamw) with β₁ = 0.9 and β₂ = 0.95. We adopt a cosine learning rate schedule with a maximum learning rate of 4e-4 and a minimum of 4e-5, a batch size of 2M tokens, a weight decay of 0.1, and a gradient clipping threshold of 1.
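As a minimal sketch, the cosine schedule between these two endpoints can be written as follows (any warmup phase is omitted for brevity; this is an illustration of the schedule shape, not the exact training script):

```python
import math

def cosine_lr(step, total_steps, max_lr=4e-4, min_lr=4e-5):
    """Cosine decay from max_lr at step 0 to min_lr at total_steps,
    matching the maximum/minimum learning rates described above."""
    progress = step / total_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Halfway through training the learning rate sits at the midpoint (max_lr + min_lr) / 2 = 2.2e-4, and it flattens out near min_lr toward the end of the budget.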
## Appendix G Instruction Tuning Details
All experiments were conducted with the parameter-efficient finetuning method LoRA (hu2021lora). For the LoRA adapter, we specified a rank of 128, an α value of 512, and a dropout rate of 0.1, and applied it across all attention matrices. Adding the LoRA adapter introduces minimal FLOPs overhead during training, having no impact on our FLOPs analysis, and mainly reduces memory requirements for more accessible training.
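To make "minimal overhead" concrete, the added parameter count can be computed directly. The helper below is ours, and it assumes the standard Llama-2-7B dimensions (hidden size 4096, 32 layers, four square attention projections per layer), which are not stated in the text above:

```python
def lora_added_params(d_model=4096, rank=128, mats_per_layer=4, layers=32):
    """Each d x d attention matrix gains two low-rank factors, A (r x d)
    and B (d x r), i.e. 2 * d * r extra parameters per adapted matrix."""
    return 2 * d_model * rank * mats_per_layer * layers

added = lora_added_params()
# Roughly 134M added parameters, i.e. under 2% of the 7B base model,
# which is why the extra training FLOPs are negligible in our accounting.
```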
## Appendix H Pretraining Perplexity
In Table[6](https://arxiv.org/html/2605.05227#A8.T6), we evaluate validation perplexity on SlimPajama for TinyLlama\-120M under two compute budgets \(50B and 100B tokens\)\. Across both settings,ADAPTachieves consistently lower perplexity than the Uniform baseline, highlighting its effectiveness in improving data efficiency\.
Table 6: Validation perplexity of TinyLlama-120M on SlimPajama under different training budgets.

| Method | 50B Train Budget | 100B Train Budget |
|---|---|---|
| Uniform | 17.15 | 15.94 |
| ADAPT | 16.55 | 15.36 |
## Appendix I Pseudo Code
Algorithm 1: Online Data Reweighting via Embedding Similarity

Input: training set 𝒟 = {(x_i, y_i)}_{i=1}^{N}; validation set 𝒟_val = {v_j}_{j=1}^{|𝒟_val|}; model f_θ; temperature τ; learning rate η; refresh interval R; training steps T; small constant ε > 0.
Output: updated parameters θ_T.

    Initialize model parameters θ_0
    for t = 1 to T do
        Sample mini-batch B_t ⊂ 𝒟
        ▷ Refresh anchor embeddings every R steps via a forward pass on the current model θ_t (online update)
        if t mod R = 1 or t = 1 then
            for each v ∈ 𝒟_val do
                Forward pass: (h_1^(v), …, h_{L_v}^(v)) ← f_{θ_t}(v)
                Position weights: w_ℓ ← ℓ / Σ_{j=1}^{L_v} j  for ℓ = 1, …, L_v
                φ(v) ← Σ_{ℓ=1}^{L_v} w_ℓ h_ℓ^(v)
                φ(v) ← φ(v) / max(‖φ(v)‖₂, ε)
            end for
        end if
        ▷ Forward propagation for training samples
        for each (x_i, y_i) ∈ B_t do
            Forward pass: (ŷ_i, h_1^(i), …, h_{L_i}^(i)) ← f_{θ_t}(x_i)
            Position weights: w_ℓ^(i) ← ℓ / Σ_{j=1}^{L_i} j  for ℓ = 1, …, L_i
            φ(x_i) ← Σ_{ℓ=1}^{L_i} w_ℓ^(i) h_ℓ^(i)
            φ(x_i) ← φ(x_i) / max(‖φ(x_i)‖₂, ε)
        end for
        ▷ Compute quality scores
        for each i ∈ B_t do
            s_ADAPT(x_i) ← (1 / |𝒟_val|) Σ_{v ∈ 𝒟_val} cos(φ(x_i), φ(v))
        end for
        ▷ Compute per-sample weights
        for each i ∈ B_t do
            w_t(i) ← σ(s_ADAPT(x_i) / max(τ, ε))
        end for
        ▷ Compute weighted loss
        ℒ_t ← Σ_{i ∈ B_t} w_t(i) · ℓ(ŷ_i, y_i)
        ▷ Backward propagation and parameter update
        θ_{t+1} ← θ_t − η Σ_{i ∈ B_t} w_t(i) ∇_{θ_t} ℓ(ŷ_i, y_i)
    end for
    return θ_T
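The per-batch reweighting step of Algorithm 1 can be sketched in NumPy. This is an illustrative re-implementation operating on precomputed hidden states, not the released code; variable names and the default τ are ours:

```python
import numpy as np

def pooled_embedding(hidden, eps=1e-8):
    """Position-weighted pooling: token l gets weight l / sum_j j, then the
    pooled vector is L2-normalized with an eps floor, as in Algorithm 1."""
    L = hidden.shape[0]
    w = np.arange(1, L + 1) / (L * (L + 1) / 2)   # w_l = l / sum_{j=1}^{L} j
    phi = (w[:, None] * hidden).sum(axis=0)
    return phi / max(np.linalg.norm(phi), eps)

def adapt_batch_weights(batch_hidden, anchor_hidden, tau=0.1, eps=1e-8):
    """Quality score = mean cosine similarity to the anchor embeddings;
    per-sample weight = sigmoid(score / max(tau, eps))."""
    anchors = [pooled_embedding(h, eps) for h in anchor_hidden]
    weights = []
    for h in batch_hidden:
        phi = pooled_embedding(h, eps)
        score = np.mean([phi @ a for a in anchors])  # unit vectors: dot = cos
        weights.append(1.0 / (1.0 + np.exp(-score / max(tau, eps))))
    return np.array(weights)

def weighted_loss(per_sample_losses, weights):
    """The reweighted training objective: L_t = sum_i w_t(i) * loss_i."""
    return float(np.dot(weights, per_sample_losses))
```

In training, `batch_hidden` and `anchor_hidden` would come from forward passes of the current model, with the anchor embeddings refreshed every R steps, and `weighted_loss` would replace the usual mean-reduced loss before the backward pass.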
## Appendix J Qualitative Analysis
We provide a detailed qualitative comparison in Table[7](https://arxiv.org/html/2605.05227#A10.T7), Table[8](https://arxiv.org/html/2605.05227#A10.T8)and Table[9](https://arxiv.org/html/2605.05227#A10.T9)\.
Table 7:Examples with highest and lowest scores retrieved by BM25 for the MMLU validation set\.MMLU Validation Set ExamplesUser:Which expression is equivalent to 5 x 9? A\. \(5 x 4\) x \(6 x 5\) B\. \(5 x 5\) \+ \(5 x 4\) C\. \(5 x 5\) \+ \(5 x 9\) D\. \(5 x 9\) x \(6 x 9\)\. Answer:Assistant:BUser:Let A be the set of all ordered pairs of integers \(m, n\) such that7m\+12n=227m\+12n=22\. What is the greatest negative number in the setB=\{m\+n:\(m,n\)∈A\}B=\\\{m\+n:\(m,n\)\\in A\\\}? A\. \-5 B\. \-4 C\. \-3 D\. \-2 Answer:Assistant:BUser:Ten students take a biology test and receive the following scores: 45, 55, 50, 70, 65, 80, 40, 90, 70, 85\. What is the mean of the students’ test scores? A\. 55 B\. 60 C\. 62 D\. 65 Answer:Assistant:DRetrieved by BM25 \(Highest similarity\)Retrieved by BM25 \(Lowest similarity\)Sample Index:cot 60736Score:0\.9403User:Question: A multiple choice test consists of 4 questions, and each question has 5 answer choices\. In how many ways can the test be completed if every question is unanswered? Options: \(A\) 24 \(B\) 120 \(C\) 625 \(D\) 720 \(E\) 1024 Answer: 5 choices for each of the 4 questions, thus total of5×5×5×5=54=6255\\times 5\\times 5\\times 5=5^\{4\}=625ways\.Assistant:The answer is \(C\)\.Sample Index:cot 60736 \(same entry, continued\)Score:0\.9403User:Question: In how many different ways can the letters of the word “CLAIM” be rearranged? Options: \(A\) 110 \(B\) 120 \(C\) 125 \(D\) 130 \(E\) 140 Answer: The total number of arrangements is5\!=1205\!=120\.Assistant:The answer is \(B\)\.Sample Index:cot 60736 \(same entry, continued\)Score:0\.9403User:Question: A pipe can fill a cistern in 20 minutes whereas the cistern when full can be emptied by a leak in 70 minutes\. When both are opened, find when the cistern will be full? 
Options: \(A\) 17 minutes \(B\) 87 minutes \(C\) 70 minutes \(D\) 28 minutes \(E\) 11 minutes Answer:1/20−1/70=\(7−2\)/140=5/140=1/28⇒1/20\-1/70=\(7\-2\)/140=5/140=1/28\\Rightarrow28 minutes\.Assistant:The answer is \(D\)\.Sample Index:cot 60736 \(same entry, continued\)Score:0\.9403User:Question: What is the area of a rectangle whose length is twice its width and whose perimeter equals that of a square with area 1? Options: \(A\) 1/9 sq units\. \(B\) 4/9 sq units\. \(C\) 8/9 sq units\. \(D\) 7/9 sq units\. \(E\) 6/9 sq units\. Answer: Perimeter of square = 4, so perimeter of rectangle = 4→\\rightarrow2\(l \+ w\) = 4→\\rightarrowl \+ w = 2\. Since l = 2w→\\rightarrow3w = 2→\\rightarroww = 2/3, l = 4/3\. Area = \(4/3\)×\\times\(2/3\) = 8/9\.Assistant:The answer is \(C\)\.Sample Index:oasst1\_41616Score:0\.0User:\[Thai\] Why is the Lao language similar to Thai?Assistant:\[Thai\] The Lao and Thai languages are similar because they belong to the Tai–Lao language family in the Indochina region\. They share writing systems, grammar, and pronunciation\. Their similarity is natural given shared history and culture, as well as long\-standing contact and cultural exchange between Laos and Thailand\.Sample Index:oasst1\_4573Score:0\.0Question:Buenos díasAnswer:Hola buenos días\. ¿En qué te puedo ayudar?Sample Index:oasst1\_42983Score:0\.0Question:\[Chinese\] First time visiting Shenzhen, please arrange an itinerary for about 3 days\.Answer:\[Chinese\] Shenzhen is a modern city in Guangdong Province, China, known for its high\-tech industry, beautiful coastal scenery, and rich shopping and entertainment\. 
Here is a 3-day Shenzhen travel itinerary: […] (full 3-day itinerary)

**Sample Index:** oasst1_43212, **Score:** 0.0
**Question:** [Japanese] I am a cat; I have no name yet.
**Answer:** [Japanese] That is the opening line of Natsume Soseki's novel "I Am a Cat."

**Table 8:** Examples with highest and lowest scores retrieved by LESS (gradient-based) for the MMLU validation set.

**MMLU Validation Set Examples**

- **User:** Which expression is equivalent to 5 x 9? A. (5 x 4) x (6 x 5) B. (5 x 5) + (5 x 4) C. (5 x 5) + (5 x 9) D. (5 x 9) x (6 x 9). Answer: **Assistant:** B
- **User:** Let A be the set of all ordered pairs of integers (m, n) such that 7m + 12n = 22. What is the greatest negative number in the set B = {m + n : (m, n) ∈ A}? A. -5 B. -4 C. -3 D. -2 Answer: **Assistant:** B
- **User:** Ten students take a biology test and receive the following scores: 45, 55, 50, 70, 65, 80, 40, 90, 70, 85. What is the mean of the students' test scores? A. 55 B. 60 C. 62 D. 65 Answer: **Assistant:** D

**Retrieved by LESS (Highest similarity)**

- **Sample Index:** flan_v2_78815, **Score:** 0.9922
  **User:** the compound ratio of 2/3, 6/7, 4/3 and 1/8 is given by? Options: a) 2/21, b) 1/48, c) 1/98, d) 1/96, e) 1/68
  **Assistant:** a
- **Sample Index:** flan_v2_40013, **Score:** 1.0
  **User:** Look, this is a photo of Mr. White's family. The man in the photo is Mr. White, the father. The woman is the mother. They have two daughters. The girl behind Mr. White is Mary. She's 14. The girl behind Mrs. White is Jane. She is 11. Mary and Jane are in the same school, but not the same grade. Mary is in Grade 2. Jane is in Grade 1. They are good students. How old are Mary and Jane? Options: A) Eleven and thirteen B) Fourteen and eleven C) Eleven D) Thirteen
  **Assistant:** B
- **Sample Index:** flan_v2_40832, **Score:** 0.946
  **User:** Please answer the following question: Dogs have an understanding of fair play and become angry if they feel that another dog is getting a better deal, a new study has found. The study looked at how dogs react when a companion is rewarded for the same trick in an unequal way. Friederike Range, a researcher at the University of Vienna in Austria, and her colleagues did a series of experiments with dogs who knew how to respond to the command "give the paw". The dogs were normally happy to repeatedly give the paw, whether they got a reward or not. Range found that, compared with monkeys, dogs _. Options: A) care more about whether they are rewarded. B) care less about what they are rewarded with. C) care more about what they are ordered to do. D) care less about who gives them orders.
  **Assistant:** B

**Retrieved by LESS (Lowest similarity)**

- **Sample Index:** flan_v2_81227, **Score:** 0.0
  **User:** Personality: I only sleep 3 hours a day. I have no parents or siblings. I'm incredibly smart. […] I don't blame you. What is your favorite food? I like well done steak. Candidates 1) Hi, how are you doing? 2) Hi, I am Kathie! I don't like eating fast food. 4) I like Mac N cheese.
  **Assistant:** I like Mac N cheese.
- **Sample Index:** flan_v2_40281, **Score:** 0.0187
  **User:** Where will a book being read by one person be kept? Options: school room, university, bedside table, pantry, discount store. I believe the correct choice is "bedside table", here's why:
  **Assistant:** a book being read by one person be kept bedside table.
- **Sample Index:** flan_v2_78179, **Score:** 0.0153
  **User:** Esbjerg forenede Boldklubber (EfB) is a professional Danish football club from the port city of Esbjerg in West Jutland. The club was formed in 1924, as a merger between Esbjerg Boldklub af 1898 and Esbjerg Amatørklub af 1911. The club has training facilities and stadium at Gl. Vardevej in Esbjerg, and plays in blue and white striped shirts. […] Subject: morten karlsen, Relation: place_of_birth. Options: (A) 1955 (B) central (C) copenhagen (D) denmark (E) esbjerg (F) farum (G) island (H) jutland (I) of (J) once (K) randers (L) time (M) uerdingen (N) union (O) viborg (P) zealand
  **Assistant:** copenhagen

**Table 9:** Examples with highest and lowest scores retrieved by ADAPT for the MMLU validation set.

**MMLU Validation Set Examples** (identical to the three validation examples shown in Table 8)

**Retrieved by ADAPT (Highest similarity)**

- **User:** Write something based on this summary: canada foreign minister regrets handling of visa denial for haitian prime minister […]
  **Assistant:** the maximum disturbing and unraveling word for an author's agent to hear is "no." so in the ceremony of book publishing, the relationship that is most culturally challenging and polarizing is the one between writers' agents and publishers. […]
- **User:** Generate a short summary: an airplane en route to algiers from paris made an unscheduled stop thursday in central france after a passenger reported mistakenly bringing on board a suitcase that belonged to someone else […] (several summarization tasks)
  **Assistant:** paris-algiers flight makes unexpected stop in central france […] (several summarization answers)

**Retrieved by ADAPT (Lowest similarity)**

- **User:** [Spanish] Okay
  **Assistant:** [Spanish] Tell me, man, what do you need?
- **User:** Write a sentence not in English.
  **Assistant:** MDN DOM 195 Chef - Service d'examen Vérification Description:
- **User:** Kiiltoraunioinen Translate to English
  **Assistant:** Asplenium adiantum-nigrum
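The scores in Tables 8 and 9 come from ranking candidate training samples by their similarity to validation examples. As a minimal, hypothetical sketch of such score-based retrieval (the feature vectors, sample indices, and function names below are illustrative assumptions, not the paper's actual LESS gradient features or ADAPT's quality signal), one can rank a pool by cosine similarity:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def rank_by_similarity(val_feat: np.ndarray, candidates: dict) -> list:
    """Score each candidate training sample against a validation feature
    vector; return (sample_index, score) pairs sorted highest to lowest."""
    scores = [(idx, cosine_similarity(val_feat, feat))
              for idx, feat in candidates.items()]
    return sorted(scores, key=lambda t: t[1], reverse=True)

# Toy 3-d features standing in for per-sample representations
# (e.g. gradient features in LESS, or embeddings for a similarity signal).
val = np.array([1.0, 0.0, 1.0])
pool = {
    "flan_v2_78815": np.array([0.9, 0.1, 1.0]),    # on-topic: similar
    "oasst1_43212":  np.array([-1.0, 1.0, -0.5]),  # off-topic: dissimilar
}
ranking = rank_by_similarity(val, pool)
print(ranking[0][0])  # index of the most similar candidate
```

The ranking step is analogous across methods; what differs is where the feature vectors come from (LESS is described above as gradient-based, while ADAPT derives similarity-based quality signals online during training).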