pretraining

#pretraining

@harold_matmul: dspy.GEPA used in pretraining data curation in the new Microsoft AI effort :-)

X AI KOLs Timeline · 11h ago

The post describes how GEPA (Genetic-Pareto), a prompt optimizer in DSPy, is applied to pretraining data curation at Microsoft AI, letting researchers replace manual prompt engineering with automated, compute-driven optimization.

PORTER: Language-Grounded Event Representations for Portable Structured EHR Foundation Models

arXiv cs.CL · yesterday

PORTER is a language-grounded structured EHR foundation model that represents clinical events through text descriptions and numeric values, enabling vocabulary-independent transfer across institutions without retraining. On pediatric prediction tasks, PORTER matches fixed-vocabulary models and recovers 97.1% of AUROC when transferred to unseen event descriptions.

I pretrained and post trained a 500M parameter LLM and 330M parameter Image generator from scratch

Reddit r/LocalLLaMA · 3d ago

The author details the process of pretraining and post-training a 500M parameter language model and a 330M parameter image generator entirely from scratch.

Demystifying Training-Time Augmentation for Data-Constrained Language Model Pretraining

Hugging Face Daily Papers · 6d ago

This paper investigates training-time data augmentation techniques to mitigate overfitting in autoregressive language model pretraining under data-constrained, compute-abundant regimes, finding that combining token-level noise, sequence permutations, and target offset prediction improves validation loss.

HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

Hugging Face Daily Papers · 2026-06-18

This paper finds that egocentric human video, when processed with a filtering and labeling pipeline, can outperform teleoperated real-robot data for pretraining embodied foundation models, achieving lower validation loss and higher success rates on real-robot tasks.

Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe

Hugging Face Daily Papers · 2026-06-18

This paper identifies a fundamental limitation (shrinkage bias) in non-uniform FP4 quantization formats for LLM pretraining and proposes UFP4, a uniform 4-bit training recipe that outperforms existing E2M1-based methods.

Small Initialization Matters for Large Language Models

arXiv cs.AI · 2026-06-17

This paper shows that reducing parameter initialization scale consistently improves pretraining of large language models, with the largest gains on reasoning-demanding tasks. It uncovers a critical initialization that balances reasoning and training, and proposes a simple γ-initialization rule.

Spokes: Optimizing for Diverse Pretraining Data Selection

arXiv cs.CL · 2026-06-16

This paper introduces Spokes, a probabilistic diversification framework using the G-Vendi score to optimize diversity in pretraining data selection, achieving significant improvements in downstream task performance on FineWeb and DCLM by jointly optimizing quality and diversity.

@yacinelearning: okay folks buckle up because this thursday we have @joelniklaus from @huggingface that will join us on stream to teach …

X AI KOLs Timeline · 2026-06-15

Joel Niklaus from Hugging Face will give a live stream on synthetic data's role in advancing pretraining; the team has also published a playbook on the topic.

ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining

Hugging Face Daily Papers · 2026-06-15

ACE-Ego-0 is a unified Vision-Language-Action pretraining framework that leverages egocentric human videos and robot trajectories via a reliability-aware training objective, achieving state-of-the-art results on embodied AI benchmarks.

AC-ODM: Actor-Critic Online Data Mixing for Sample-Efficient LLM Pretraining

Hugging Face Daily Papers · 2026-06-14

AC-ODM uses reinforcement learning to dynamically optimize pretraining data composition for LLMs, achieving faster convergence and higher downstream accuracy with negligible computational overhead.

OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models

arXiv cs.AI · 2026-06-12

OpenMedQ is a fully open medical vision-language model pretrained on 14 datasets (~3.35M samples), achieving state-of-the-art results on medical VQA and classification benchmarks.

Probabilistic Contrastive Pretraining for Multi-task ADME Property Prediction

arXiv cs.LG · 2026-06-11

This paper proposes a probabilistic contrastive pretraining framework for molecular graph transformers to improve multi-task ADME property prediction in drug discovery, achieving significant gains on three benchmarks.

Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality

arXiv cs.CL · 2026-06-11

This paper introduces WebGraphMix, a lightweight framework that uses web graph centrality scores from Common Crawl to select pretraining data, showing that mixing central and peripheral documents improves language model performance.

Small Experiments, Cheaper Decisions: A Case Study in Staged Promotion for Micro-Pretraining

arXiv cs.CL · 2026-06-11

This paper studies a staged promotion protocol for micro-pretraining, using escalating budgets from minutes to hours to filter configurations. It finds that early screens are useful but unstable, and that a staged approach can retain a long-horizon reference while identifying alternatives that fail continuation thresholds.

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

Hugging Face Daily Papers · 2026-06-11

LabVLA is a vision-language-action model for scientific laboratory automation, trained with a two-stage approach combining action token pretraining and flow matching. It achieves state-of-the-art success rates on the LabUtopia benchmark by leveraging simulated data to bridge the gap between household demonstrations and lab-specific tasks.

Mix, Don't Pick: Why Synthetic Corpus Composition Matters for Time Series Foundation Model Pretraining

arXiv cs.LG · 2026-06-10

This paper systematically evaluates 11 synthetic time-series generators for foundation model pretraining and finds that generator rankings are not stable across architectures, but an equal-weight mixture of all generators matches or beats the best individual. Blending this mixture with real data yields the strongest pretraining corpora, reframing synthetic pretraining as a corpus composition problem rather than a generator selection problem.

EditSR: Enhancing Neural Symbolic Regression via Edit-based Rectification

arXiv cs.AI · 2026-06-09

EditSR proposes a two-layer framework combining a neural symbolic regression model with an edit-based Rectifier to efficiently rectify structural errors in generated expressions, reducing error accumulation and improving recovery of complex symbolic structures with limited extra cost.

GRASP: Geometry-aware Residual Alignment for Scalable Pretraining Data Attribution

arXiv cs.LG · 2026-06-08

GRASP introduces a geometry-aware, interaction-based method for scalable pretraining data attribution that models subset dynamics, more than doubling the task-level rank correlation of existing additive approaches while reducing computation costs.

Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws

arXiv cs.LG · 2026-06-08

This paper studies data-constrained language model pretraining, proposing masked-input regularization (MIR) to improve validation loss and downstream performance, and SoftQ, a scaling law that better captures model-data interaction under repeated data.
