Proposes CodeBlock, a structure-aware sparse supervision framework for supervised fine-tuning of code LLMs. It selects high-quality instruction-response pairs and partitions code responses into syntactically coherent coding items, applying loss only to selected items to achieve stronger pass@1 rates using only 1.9% of supervised response tokens.
This paper introduces Spokes, a probabilistic diversification framework that uses the G-Vendi score to optimize diversity in pretraining data selection. By jointly optimizing quality and diversity, it achieves significant improvements in downstream task performance on FineWeb and DCLM.
This paper demonstrates that data selection in low-resource verification regimes, where verifiers only have access to fragmented and biased slices of the target distribution, can paradoxically accelerate model collapse by pruning globally relevant tail modes. The authors provide a theoretical proof and propose a collaborative proxy reference mechanism as a mitigation strategy.
This paper introduces WebGraphMix, a lightweight framework that uses web graph centrality scores from Common Crawl to select pretraining data, showing that mixing central and peripheral documents improves language model performance.
APEX introduces a dynamic data selection strategy for automatic prompt optimization that stratifies datasets into easy, hard, and mixed tiers to improve data efficiency, achieving significant performance gains over initial prompts on multiple benchmarks.
DOG-DPO is a training-free data selection framework that treats preference pairs as structured geometric signals, decomposing multi-dataset preference geometry into anchor and residual subspaces to select diverse subsets for safety alignment. It achieves strong utility-robustness trade-offs using only 11% of preference pairs across six safety benchmarks.
This paper identifies a failure mode called PhysHack in LLM-based LEGO assembly generation and proposes PVPO, a sample-efficient reinforcement learning method with model-based data selection that improves physical and semantic alignment using only a small fraction of training data.
This paper evaluates adaptive data selection strategies for wearable health prediction, finding they significantly improve AUROC for participants with low baseline performance but offer limited gains for strong baselines.
This paper explores whether generalist coding agents (Claude Code, Codex, etc.) can automate data curation loops. The agents match published baselines within 10 iterations but show a gap in exploring new methods. A scaffold that forces agents to adapt prior research yields policies that beat baselines using 10x less data.
LARK proposes a learnability-grounded method for selecting reasoning trajectories in LLM distillation, employing a learnability factor and χ²-regularized selection policy that balances efficiency and generalization, consistently outperforming baselines across models and tasks.
This paper investigates the long-term effects of data selection strategies in multi-stage LLM fine-tuning, revealing that myopic selection can harm future adaptability. It introduces a Long-Horizon Aware Selection (LHAS) objective to mitigate these issues.
Trust functions enable near-lossless weak-to-strong generalization by identifying reliable weak labels for training, achieving performance comparable to ground-truth supervision across multiple domains.
MIRA is a data selection framework for the mid-training stage of LLM development that adaptively constructs quality rubrics per data source, using a teacher model to propose dimensions and distilling them into lightweight scorers. It achieves superior performance using only half the tokens of full-corpus training.
Proposes SLAP, a novel data selection framework for efficient instruction tuning of large language models that evaluates batch learnability and uses stratified sampling to achieve superior performance with 20-40% less training data.
P2D is a unified framework that leverages task-sensitive attention heads for both data selection and structural pruning, achieving an 8.3 pp performance gain and 7.0× speedup by updating only 10% of heads on 10% of data.
The paper proposes High-Entropy Sum (HES), a training-free metric for selecting high-quality reasoning data for LLM training, validated across SFT, RFT, and RL paradigms.
Weasel is a trajectory selection method for offline training of web agents that improves out-of-domain generalization by balancing importance and diversity. It achieves up to 12.5x training speedups and improved performance across several benchmarks.
This paper presents a comprehensive survey of data mixing methods for LLM pretraining, formalizing the problem as bilevel optimization and introducing a taxonomy that distinguishes static (rule-based and learning-based) from dynamic (adaptive and externally guided) mixing approaches. The authors analyze trade-offs, identify cross-cutting challenges, and outline future research directions including finer-grained domain partitioning and pipeline-aware designs.
SAI-DPO introduces a dynamic sampling framework that adapts training data to a model's evolving capabilities during mathematical reasoning tasks, using self-aware difficulty metrics and knowledge semantic alignment to achieve state-of-the-art efficiency with less data on benchmarks like AIME24 and AMC23.
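For the data mixing survey summarized above, the bilevel framing can be written in a generic form. The notation below is an assumption for illustration and is not taken from the survey: \alpha denotes mixture weights over K domains constrained to the probability simplex, \mathcal{L}_k is the training loss on domain k, and \mathcal{L}_{\mathrm{val}} is a held-out target objective.

```latex
\[
\begin{aligned}
\min_{\alpha \in \Delta^{K-1}} \quad & \mathcal{L}_{\mathrm{val}}\bigl(\theta^{*}(\alpha)\bigr) \\
\text{s.t.} \quad & \theta^{*}(\alpha) \in \arg\min_{\theta} \sum_{k=1}^{K} \alpha_k \, \mathcal{L}_k(\theta)
\end{aligned}
\]
```

In this sketch the inner problem trains the model under a fixed mixture, and the outer problem chooses the mixture that gives the best held-out loss. Static methods would fix \alpha before training, while dynamic methods would update it as training proceeds.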