pre-training

#pre-training

A Pre-Training Analogue of Grokking in Language Models: Tracing Delayed Grammatical Generalization

arXiv cs.LG · 2026-06-02

This paper introduces an exposure-based framework to study grokking-like delayed generalization during LLM pre-training, using BLiMP minimal pairs and critical phrases. The authors observe delayed generalization across five grammatical phenomena and analyze internal changes such as concept vector predictability and attention head concentration.

#pre-training

Humanoid-GPT: Scaling Data and Structure for Zero-Shot Motion Tracking

Hugging Face Daily Papers · 2026-06-02

Humanoid-GPT is a GPT-style Transformer pre-trained on a billion-scale motion corpus, achieving zero-shot generalization for whole-body motion tracking across unseen motions and tasks.

#pre-training

@nv_pavlichenko: Today we're releasing Mellum2: our first "serious" LLM. This is a 12B A2.5B MoE LLM pre-trained on ~11T tokens and post…

X AI KOLs Timeline · 2026-06-01

@nv_pavlichenko announces Mellum2, a 12B A2.5B MoE LLM pre-trained on ~11T tokens and post-trained with RLVR. Base, SFT, and RL checkpoints are released alongside a technical report.

#pre-training

@latkins: With heart.

X AI KOLs Timeline · 2026-05-30

Martin Casado raises concerns about whether open-source models can keep pace, given expensive pre-training and blocked distillation access; @latkins replies, 'With heart.'

#pre-training

DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation

Hugging Face Daily Papers · 2026-05-28

DynaFLIP is a dynamics-aware multimodal pre-training framework that integrates motion understanding into visual perception for robot manipulation. It uses image-language-3D flow triplets and geometric regularization to improve representation learning, achieving significant gains in out-of-distribution scenarios.

#pre-training

The Stability of Singular Distribution: A Spectral Perspective on the Two-Phase Dynamics of Language Model Pre-training

arXiv cs.LG · 2026-05-27

This paper identifies a spectral phenomenon called Stability of Singular Distribution (SoSD) in large language model pre-training, where the singular value spectrum stabilizes early while parameters continue to evolve. The authors prove that this stabilization marks the transition to the slow-descent phase of training, and they analyze how training strategies like WSD and Muon affect this behavior.

#pre-training

Extra-Merge: Tracing the Rank-1 Subspace of Model Merging in Language Model Pre-Training

arXiv cs.LG · 2026-05-27

This paper identifies a Rank-1 Subspace phenomenon in LLM pre-training trajectories and proposes Extra-Merge, a training-free strategy that extrapolates along this subspace to minimize loss, achieving consistent zero-shot accuracy gains across GPT-2 and LLaMA families up to 2B parameters.

#pre-training

GEM: Geometric Entropy Mixing for Optimal LLM Data Curation

arXiv cs.LG · 2026-05-27

GEM reformulates LLM data curation as a variational problem on the hypersphere, using geometric entropy mixing and a minorize-maximize algorithm to discover balanced semantic clusters, and reports gains of up to 1.2% in average downstream accuracy over state-of-the-art data mixing strategies.

#pre-training

NITP: Next Implicit Token Prediction for LLM Pre-training

Hugging Face Daily Papers · 2026-05-24

Next Implicit Token Prediction (NITP) enhances language model pre-training by adding dense continuous supervision in representation space, improving generalization and performance across model sizes with minimal computational overhead.

#pre-training

@jinchenma_ai: I watched Xiaojun Zhang's interview with Yao Shunyu for 4 hours, packed with valuable insights. He made a particularly contrarian judgment. Many say pre-training has hit a wall and Scaling Law has reached its limit. He says no, and there are no signs of hitting a ceiling in the coming months. So why do so many people think it's hit a wall? He directly said: the vast majority of people who shout about hitting a wall have bugs in their own code...

X AI KOLs Timeline · 2026-05-21

In the interview, Yao Shunyu takes the contrarian view that pre-training has not hit a wall and that scaling laws have not reached their limit, claiming that most people who say otherwise have bugs in their code.

#pre-training

OpenAI co-founder Andrej Karpathy joins Anthropic’s pre-training team

TechCrunch AI · 2026-05-19

Andrej Karpathy, co-founder of OpenAI and former Tesla AI lead, has joined Anthropic to work on pre-training and lead a team focused on using Claude to accelerate pre-training research.

#pre-training

Generalization Dynamics of LM Pre-training (17 minute read)

TLDR AI · 2026-05-19

This paper reveals that during pre-training, language models frequently and suddenly switch between pattern-matching and generalization behaviors, a phenomenon called mode-hopping, and presents a toy evaluation suite to study it.

#pre-training

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

Hugging Face Daily Papers · 2026-05-18

Researchers introduce symmetry-compatible optimizers that respect the equivariance structures of neural network parameters, improving training stability and performance over traditional methods like Adam. The approach is validated on various language model architectures including Qwen3-0.6B, Gemma 3 1B, and OLMoE-1B-7B.

#pre-training

sapientinc/HRM-Text-1B

Hugging Face Models Trending · 2026-05-17

Sapient Intelligence released HRM-Text-1B, a 1-billion-parameter language model with a novel dual-timescale recurrent architecture (Hierarchical Reasoning Model) that provides unbounded compute depth with a bounded parameter count. The pre-alignment checkpoint is available on Hugging Face.

#pre-training

Nous Research Releases Token Superposition Training to Speed Up LLM Pre-Training by Up to 2.5x Across 270M to 10B Parameter Models

Reddit r/singularity · 2026-05-16

Nous Research releases Token Superposition Training (TST), a method that speeds up LLM pre-training by up to 2.5x across models from 270M to 10B parameters, reducing wall-clock time without altering architecture or data.

#pre-training

@NousResearch: Today we release Lighthouse Attention, a selection-based hierarchical attention for long-context pre-training that deli…

X AI KOLs Following · 2026-05-15

Nous Research releases Lighthouse Attention, a selection-based hierarchical attention mechanism that achieves a 1.4-1.7x wall-clock speedup at 98K context and a ~17x faster forward/backward pass than standard attention at 512K context on a single B200, validated on 530M-parameter Llama-3 models across 50B tokens.

#pre-training

@stanfordnlp: Lots of @stanfordnlp work at @icmlconf. See you in Seoul! Towards Execution-Grounded Automated AI Research @ChengleiSi …

X AI KOLs Following · 2026-05-14

This paper investigates execution-grounded automated AI research by building an automated executor that implements LLM-generated ideas and runs experiments. It shows that execution-guided evolutionary search can find methods that significantly outperform baselines in both pre-training and post-training tasks.

#pre-training

@percyliang: For the next Marin model, we are putting together a new data mix. Currently we have 18T tokens, but could use more. So …

X AI KOLs Following · 2026-05-13

Percy Liang announces that a new data mix is being compiled for the next Marin model and requests high-quality token data for pre-training, mid-training, and SFT.

#pre-training

On Predicting the Post-training Potential of Pre-trained LLMs

arXiv cs.CL · 2026-05-13

This paper introduces RuDE, a framework for predicting the post-training potential of pre-trained LLMs by leveraging response discrimination, addressing the limitations of traditional benchmarks like MMLU.

#pre-training

Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

Hugging Face Daily Papers · 2026-05-13

This paper presents a systematic study of long-context continued pre-training for vision-language models, achieving generalization beyond 128K context with an efficient data mixture design and introducing the MMProLong model.
