pretraining

#pretraining

Internal Data Repetition Destroys Language Models

arXiv cs.LG ↗ · 17h ago Cached

This paper systematically studies the damage caused by exact document repetition during language model pretraining, showing that repeating a moderately sized subset a moderate number of times maximally harms performance, and that repetition can waste up to 33% of compute (as measured by compute-equivalent loss).

Improved Large Language Diffusion Models

arXiv cs.CL ↗ · 17h ago Cached

iLLaDA is an 8B parameter masked diffusion language model with fully bidirectional attention, trained from scratch on 12T tokens. It shows broad improvements over LLaDA and remains competitive with Qwen2.5 7B on several benchmarks. The model and code are open-sourced.

@harold_matmul: dspy.GEPA used in pretraining data curation in the new Microsoft AI effort :-)

X AI KOLs Timeline ↗ · yesterday Cached

The article explains how GEPA (Genetic-Pareto Optimization) within DSPy is used for efficient prompt tuning, specifically applied to pretraining data curation at Microsoft AI, allowing researchers to replace manual prompt engineering with automated compute-driven optimization.

PORTER: Language-Grounded Event Representations for Portable Structured EHR Foundation Models

arXiv cs.CL ↗ · yesterday Cached

PORTER is a language-grounded structured EHR foundation model that represents clinical events through text descriptions and numeric values, enabling vocabulary-independent transfer across institutions without retraining. On pediatric prediction tasks, PORTER matches fixed-vocabulary models and recovers 97.1% of AUROC when transferred to unseen event descriptions.

I pretrained and post trained a 500M parameter LLM and 330M parameter Image generator from scratch

Reddit r/LocalLLaMA ↗ · 4d ago

The author details the process of pretraining and post-training a 500M parameter language model and a 330M parameter image generator entirely from scratch.

Demystifying Training-Time Augmentation for Data-Constrained Language Model Pretraining

Hugging Face Daily Papers ↗ · 6d ago Cached

This paper investigates training-time data augmentation techniques to mitigate overfitting in autoregressive language model pretraining under data-constrained, compute-abundant regimes, finding that combining token-level noise, sequence permutations, and target offset prediction improves validation loss.

HumanScale: Egocentric Human Video Can Outperform Real-Robot Data for Embodied Pretraining

Hugging Face Daily Papers ↗ · 2026-06-18 Cached

This paper finds that egocentric human video, when processed with a filtering and labeling pipeline, can outperform teleoperated real-robot data for pretraining embodied foundation models, achieving lower validation loss and higher success rates on real-robot tasks.

Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe

Hugging Face Daily Papers ↗ · 2026-06-18 Cached

This paper identifies a fundamental limitation (shrinkage bias) in non-uniform FP4 quantization formats for LLM pretraining and proposes UFP4, a uniform 4-bit training recipe that outperforms existing E2M1-based methods.

Small Initialization Matters for Large Language Models

arXiv cs.AI ↗ · 2026-06-17 Cached

This paper shows that reducing parameter initialization scale consistently improves pretraining of large language models, with the largest gains on reasoning-demanding tasks. It uncovers a critical initialization that balances reasoning and training, and proposes a simple γ-initialization rule.

Spokes: Optimizing for Diverse Pretraining Data Selection

arXiv cs.CL ↗ · 2026-06-16 Cached

This paper introduces Spokes, a probabilistic diversification framework using the G-Vendi score to optimize diversity in pretraining data selection, achieving significant improvements in downstream task performance on FineWeb and DCLM by jointly optimizing quality and diversity.

@yacinelearning: okay folks buckle up because this thursday we have @joelniklaus from @huggingface that will join us on stream to teach …

X AI KOLs Timeline ↗ · 2026-06-15 Cached

Joel Niklaus from Hugging Face will give a live stream on synthetic data's role in advancing pretraining; the team has also published a playbook on the topic.

ACE-Ego-0: Unifying Egocentric Human and Robotic Data for VLA Pretraining

Hugging Face Daily Papers ↗ · 2026-06-15 Cached

ACE-Ego-0 is a unified Vision-Language-Action (VLA) pretraining framework that leverages egocentric human videos and robot trajectories via a reliability-aware training objective, achieving state-of-the-art results on embodied AI benchmarks.

AC-ODM: Actor-Critic Online Data Mixing for Sample-Efficient LLM Pretraining

Hugging Face Daily Papers ↗ · 2026-06-14 Cached

AC-ODM uses reinforcement learning to dynamically optimize pretraining data composition for LLMs, achieving faster convergence and higher downstream accuracy with negligible computational overhead.

OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models

arXiv cs.AI ↗ · 2026-06-12 Cached

OpenMedQ is a fully-open medical vision-language model pretrained on 14 datasets (~3.35M samples), achieving state-of-the-art results on medical VQA and classification benchmarks.

Probabilistic Contrastive Pretraining for Multi-task ADME Property Prediction

arXiv cs.LG ↗ · 2026-06-11 Cached

This paper proposes a probabilistic contrastive pretraining framework for molecular graph transformers to improve multi-task ADME property prediction in drug discovery, achieving significant gains on three benchmarks.

Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality

arXiv cs.CL ↗ · 2026-06-11 Cached

This paper introduces WebGraphMix, a lightweight framework that uses web graph centrality scores from Common Crawl to select pretraining data, showing that mixing central and peripheral documents improves language model performance.

Small Experiments, Cheaper Decisions: A Case Study in Staged Promotion for Micro-Pretraining

arXiv cs.CL ↗ · 2026-06-11 Cached

This paper studies a staged promotion protocol for micro-pretraining, using escalating budgets from minutes to hours to filter configurations. It finds that early screens are useful but unstable, and that a staged approach can retain a long-horizon reference while identifying alternatives that fail continuation thresholds.

LabVLA: Grounding Vision-Language-Action Models in Scientific Laboratories

Hugging Face Daily Papers ↗ · 2026-06-11 Cached

LabVLA is a vision-language-action model for scientific laboratory automation, trained with a two-stage approach combining action token pretraining and flow matching. It achieves state-of-the-art success rates on the LabUtopia benchmark by leveraging simulated data to bridge the gap between household demonstrations and lab-specific tasks.

Mix, Don't Pick: Why Synthetic Corpus Composition Matters for Time Series Foundation Model Pretraining

arXiv cs.LG ↗ · 2026-06-10 Cached

This paper systematically evaluates 11 synthetic time-series generators for foundation model pretraining and finds that generator rankings are not stable across architectures, but an equal-weight mixture of all generators matches or beats the best individual. Blending this mixture with real data yields the strongest pretraining corpora, reframing synthetic pretraining as a corpus composition problem rather than a generator selection problem.

EditSR: Enhancing Neural Symbolic Regression via Edit-based Rectification

arXiv cs.AI ↗ · 2026-06-09 Cached

EditSR proposes a two-layer framework combining a neural symbolic regression model with an edit-based Rectifier to efficiently rectify structural errors in generated expressions, reducing error accumulation and improving recovery of complex symbolic structures with limited extra cost.
