Tag
The article explains how GEPA (Genetic-Pareto Optimization) within DSPy is used for efficient prompt tuning, specifically applied to pretraining data curation at Microsoft AI, allowing researchers to replace manual prompt engineering with automated compute-driven optimization.
PORTER is a language-grounded structured EHR foundation model that represents clinical events through text descriptions and numeric values, enabling vocabulary-independent transfer across institutions without retraining. On pediatric prediction tasks, PORTER matches fixed-vocabulary models and recovers 97.1% of AUROC when transferred to unseen event descriptions.
The author details the process of pretraining and post-training a 500M parameter language model and a 330M parameter image generator entirely from scratch.
This paper investigates training-time data augmentation techniques to mitigate overfitting in autoregressive language model pretraining under data-constrained, compute-abundant regimes, finding that combining token-level noise, sequence permutations, and target offset prediction improves validation loss.
This paper finds that egocentric human video, when processed with a filtering and labeling pipeline, can outperform teleoperated real-robot data for pretraining embodied foundation models, achieving lower validation loss and higher success rates on real-robot tasks.
This paper identifies a fundamental limitation (shrinkage bias) in non-uniform FP4 quantization formats for LLM pretraining and proposes UFP4, a uniform 4-bit training recipe that outperforms existing E2M1-based methods.
This paper shows that reducing parameter initialization scale consistently improves pretraining of large language models, with the largest gains on reasoning-demanding tasks. It uncovers a critical initialization that balances reasoning and training, and proposes a simple γ-initialization rule.
This paper introduces Spokes, a probabilistic diversification framework using the G-Vendi score to optimize diversity in pretraining data selection, achieving significant improvements in downstream task performance on FineWeb and DCLM by jointly optimizing quality and diversity.
Joel Niklaus from Hugging Face will give a live stream on synthetic data's role in advancing pretraining; the team has also published a playbook on the topic.
ACE-EGO-0 is a unified Vision-Language-Action pretraining framework that leverages egocentric human videos and robot trajectories via a reliability-aware training objective, achieving state-of-the-art on embodied AI benchmarks.
AC-ODM uses reinforcement learning to dynamically optimize pretraining data composition for LLMs, achieving faster convergence and higher downstream accuracy with negligible computational overhead.
OpenMedQ is a fully-open medical vision-language model pretrained on 14 datasets (~3.35M samples), achieving state-of-the-art results on medical VQA and classification benchmarks.
This paper proposes a probabilistic contrastive pretraining framework for molecular graph transformers to improve multi-task ADME property prediction in drug discovery, achieving significant gains on three benchmarks.
This paper introduces WebGraphMix, a lightweight framework that uses web graph centrality scores from Common Crawl to select pretraining data, showing that mixing central and peripheral documents improves language model performance.
This paper studies a staged promotion protocol for micro-pretraining, using escalating budgets from minutes to hours to filter configurations. It finds that early screens are useful but unstable, and that a staged approach can retain a long-horizon reference while identifying alternatives that fail continuation thresholds.
LabVLA is a vision-language-action model for scientific laboratory automation, trained with a two-stage approach combining action token pretraining and flow matching. It achieves state-of-the-art success rates on the LabUtopia benchmark by leveraging simulated data to bridge the gap between household demonstrations and lab-specific tasks.
This paper systematically evaluates 11 synthetic time-series generators for foundation model pretraining and finds that generator rankings are not stable across architectures, but an equal-weight mixture of all generators matches or beats the best individual. Blending this mixture with real data yields the strongest pretraining corpora, reframing synthetic pretraining as a corpus composition problem rather than a generator selection problem.
EditSR proposes a two-layer framework combining a neural symbolic regression model with an edit-based Rectifier to efficiently rectify structural errors in generated expressions, reducing error accumulation and improving recovery of complex symbolic structures with limited extra cost.
GRASP introduces a geometry-aware, interaction-based method for scalable pretraining data attribution that models subset dynamics, outperforming existing additive approaches by over double the task-level rank correlation while reducing computation costs.
This paper studies data-constrained language model pretraining, proposing masked-input regularization (MIR) to improve validation loss and downstream performance, and SoftQ, a scaling law that better captures model-data interaction under repeated data.