data-selection

#data-selection

CODEBLOCK: Learning to Supervise Code at the Right Granularity

arXiv cs.LG ↗ · 6d ago Cached

Proposes CodeBlock, a structure-aware sparse supervision framework for supervised fine-tuning of code LLMs. It selects high-quality instruction-response pairs and partitions code responses into syntactically coherent coding items, applying loss only to selected items to achieve stronger pass@1 rates using only 1.9% of supervised response tokens.
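Applying loss only to selected items amounts to masking the per-token loss. The sketch below is a generic masked cross-entropy in NumPy, not the paper's implementation; the function name `masked_cross_entropy` and the 0/1 `loss_mask` convention are assumptions made for illustration.

```python
import numpy as np


def masked_cross_entropy(logits, targets, loss_mask):
    """Mean cross-entropy over positions where loss_mask is 1.

    logits:    (seq_len, vocab) array of unnormalized scores
    targets:   (seq_len,) array of target token ids
    loss_mask: (seq_len,) array of 0/1; 1 marks a supervised token
               (hypothetical convention, not taken from the paper)
    """
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets)
    mask = np.asarray(loss_mask, dtype=float)

    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    # Negative log-likelihood of the target token at each position.
    token_nll = -log_probs[np.arange(len(targets)), targets]

    n_supervised = mask.sum()
    if n_supervised == 0:
        return 0.0  # nothing selected, so nothing contributes to the loss
    return float((token_nll * mask).sum() / n_supervised)
```

Tokens with a mask value of 0 still pass through the model as context but contribute nothing to the averaged loss.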


Spokes: Optimizing for Diverse Pretraining Data Selection

arXiv cs.CL ↗ · 2026-06-16 Cached

This paper introduces Spokes, a probabilistic diversification framework using the G-Vendi score to optimize diversity in pretraining data selection, achieving significant improvements in downstream task performance on FineWeb and DCLM by jointly optimizing quality and diversity.
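For orientation, the plain Vendi score (the exponential of the Shannon entropy of the eigenvalues of a normalized similarity matrix) can be sketched as follows. This is not the G-Vendi variant the summary names, whose details are not given here; the cosine kernel and the function name `vendi_score` are assumptions for illustration.

```python
import numpy as np


def vendi_score(embeddings):
    """Plain Vendi score of a set of embeddings under a cosine kernel.

    embeddings: (n, d) array with non-zero rows (assumed).
    Returns a value between 1 (all items identical) and n (all orthogonal).
    """
    X = np.asarray(embeddings, dtype=float)
    # Scale each row to unit length so that X @ X.T is a cosine-similarity kernel.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    n = X.shape[0]
    K = X @ X.T

    # Eigenvalues of K / n are non-negative and sum to 1.
    eigvals = np.linalg.eigvalsh(K / n)
    eigvals = eigvals[eigvals > 1e-12]  # drop numerical zeros before taking logs

    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))
```

The score behaves like an effective number of distinct items: three parallel vectors score about 1, while three mutually orthogonal vectors score about 3.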


When Sample Selection Bias Precipitates Model Collapse

arXiv cs.AI ↗ · 2026-06-15 Cached

This paper demonstrates that data selection in low-resource verification regimes, where verifiers only have access to fragmented and biased slices of the target distribution, can paradoxically accelerate model collapse by pruning globally relevant tail modes. The authors provide theoretical proof and propose a collaborative proxy reference mechanism as a mitigation strategy.


Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality

arXiv cs.CL ↗ · 2026-06-11 Cached

This paper introduces WebGraphMix, a lightweight framework that uses web graph centrality scores from Common Crawl to select pretraining data, showing that mixing central and peripheral documents improves language model performance.


APEX: Automated Prompt Engineering eXpert with Dynamic Data Selection

arXiv cs.CL ↗ · 2026-06-11 Cached

APEX introduces a dynamic data selection strategy for automatic prompt optimization, stratifying datasets into easy, hard, and mixed tiers to improve data efficiency, achieving significant performance gains over initial prompts on multiple benchmarks.
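The summary does not say how the tiers are assigned. One simple way to stratify a dataset into easy, hard, and mixed tiers, offered purely as an assumption and not as APEX's rule, is to group examples by how several candidate prompts fare on them; the function name and the `outcomes` structure below are hypothetical.

```python
def stratify_by_outcomes(outcomes):
    """Group example ids into easy / hard / mixed tiers.

    outcomes: dict mapping example id -> list of booleans, one per candidate
              prompt, True when that prompt answered the example correctly
              (hypothetical input format).
    """
    tiers = {"easy": [], "hard": [], "mixed": []}
    for example_id, results in outcomes.items():
        if not results:
            continue  # no evaluations recorded, so no tier can be assigned
        if all(results):
            tiers["easy"].append(example_id)   # every prompt succeeds
        elif not any(results):
            tiers["hard"].append(example_id)   # every prompt fails
        else:
            tiers["mixed"].append(example_id)  # prompts disagree
    return tiers
```

Under this assumed rule, the mixed tier holds the examples that separate one candidate prompt from another, since the prompts disagree on them.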


DOG-DPO: Dynamic Optimization in Geometry for Safety Alignment

arXiv cs.LG ↗ · 2026-06-09 Cached

DOG-DPO is a training-free data selection framework that treats preference pairs as structured geometric signals, decomposing multi-dataset preference geometry into anchor and residual subspaces to select diverse subsets for safety alignment. It achieves strong utility-robustness trade-offs using only 11% of preference pairs across six safety benchmarks.


Sample-Efficient Post-Training for LEGO Spatial-Physics Reasoning

arXiv cs.LG ↗ · 2026-06-09 Cached

This paper identifies a failure mode called PhysHack in LLM-based LEGO assembly generation and proposes PVPO, a sample-efficient reinforcement learning method with model-based data selection that improves physical and semantic alignment using only a small fraction of training data.


Adaptive data selection improves wearable prediction under low baseline performance

arXiv cs.LG ↗ · 2026-06-02 Cached

This paper evaluates adaptive data selection strategies for wearable health prediction, finding they significantly improve AUROC for participants with low baseline performance but offer limited gains for strong baselines.


Can Generalist Agents Automate Data Curation?

Hugging Face Daily Papers ↗ · 2026-06-02 Cached

This paper explores whether generalist coding agents (Claude Code, Codex, etc.) can automate data curation loops. The agents match published baselines within 10 iterations but show a gap in exploring new methods. A scaffold that forces agents to adapt prior research yields policies that beat baselines using 10x less data.


LARK: Learnability-Grounded Trajectory Selection for Efficient Reasoning Distillation

arXiv cs.LG ↗ · 2026-06-01 Cached

LARK proposes a learnability-grounded method for selecting reasoning trajectories in LLM distillation, employing a learnability factor and χ²-regularized selection policy that balances efficiency and generalization, consistently outperforming baselines across models and tasks.


The Long-Term Effects of Data Selection in LLM Fine-Tuning

arXiv cs.LG ↗ · 2026-06-01 Cached

This paper investigates the long-term effects of data selection strategies in multi-stage LLM fine-tuning, revealing that myopic selection can harm future adaptability. It introduces a Long-Horizon Aware Selection (LHAS) objective to mitigate these issues.


Trust Functions: Near-Lossless Weak-to-Strong Generalization by Learning When to Trust the Weak Teacher

Hugging Face Daily Papers ↗ · 2026-05-31 Cached

Trust functions enable near-lossless weak-to-strong generalization by identifying reliable weak labels for training, achieving performance comparable to ground-truth supervision across multiple domains.


MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection

Hugging Face Daily Papers ↗ · 2026-05-29 Cached

MIRA is a data selection framework for the mid-training stage of LLM development that adaptively constructs quality rubrics per data source, using a teacher model to propose dimensions and distilling into lightweight scorers. It achieves superior performance using only half the tokens compared to full-corpus training.


SLAP: Stratified Loss-based Pruning for On-Policy Data-Efficient Instruction Tuning

arXiv cs.CL ↗ · 2026-05-26 Cached

Proposes SLAP, a novel data selection framework for efficient instruction tuning of large language models that evaluates batch learnability and uses stratified sampling to achieve superior performance with 20-40% less training data.


From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment

arXiv cs.LG ↗ · 2026-05-22 Cached

P2D is a unified framework that leverages task-sensitive attention heads for both data selection and structural pruning, achieving an 8.3 pp performance gain and 7.0× speedup by updating only 10% of heads on 10% of data.


Unified Data Selection for LLM Reasoning

arXiv cs.CL ↗ · 2026-05-22 Cached

The paper proposes High-Entropy Sum (HES), a training-free metric for selecting high-quality reasoning data for LLM training, validated across SFT, RFT, and RL paradigms.


Weasel: Out-of-Domain Generalization for Web Agents via Importance-Diversity Data Selection

arXiv cs.LG ↗ · 2026-05-21 Cached

Weasel is a trajectory selection method for offline training of web agents that improves out-of-domain generalization by balancing importance and diversity. It achieves up to 12.5x training speedups and improved performance across several benchmarks.


Data Mixing for Large Language Models Pretraining: A Survey and Outlook

arXiv cs.CL ↗ · 2026-04-21 Cached

This paper presents a comprehensive survey of data mixing methods for LLM pretraining, formalizing the problem as bilevel optimization and introducing a taxonomy that distinguishes static (rule-based and learning-based) from dynamic (adaptive and externally guided) mixing approaches. The authors analyze trade-offs, identify cross-cutting challenges, and outline future research directions including finer-grained domain partitioning and pipeline-aware designs.


Dynamic Sampling that Adapts: Self-Aware Iterative Data Persistent Optimization for Mathematical Reasoning

arXiv cs.CL ↗ · 2026-04-20 Cached

SAI-DPO introduces a dynamic sampling framework that adapts training data to a model's evolving capabilities during mathematical reasoning tasks, using self-aware difficulty metrics and knowledge semantic alignment to achieve state-of-the-art efficiency with less data on benchmarks like AIME24 and AMC23.
