Tag
This paper introduces an exposure-based framework to study grokking-like delayed generalization during LLM pre-training, using BLiMP minimal pairs and critical phrases. The authors observe delayed generalization across five grammatical phenomena and analyze internal changes such as concept vector predictability and attention head concentration.
Humanoid-GPT is a GPT-style Transformer pre-trained on a billion-scale motion corpus, achieving zero-shot generalization for whole-body motion tracking across unseen motions and tasks.
Mellum2, a 12B-parameter MoE LLM with 2.5B active parameters (12B A2.5B), has been released; it was pretrained on ~11T tokens and post-trained with RLVR. Base, SFT, and RL checkpoints are available alongside a technical report.
Martin Casado raises concerns about whether open-source models can keep pace, given expensive pre-training and blocked distillation access; @latkins replies 'With heart.'
DynaFLIP is a dynamics-aware multimodal pre-training framework that integrates motion understanding into visual perception for robot manipulation. It uses image-language-3D flow triplets and geometric regularization to improve representation learning, achieving significant gains in out-of-distribution scenarios.
This paper identifies a spectral phenomenon called Stability of Singular Distribution (SoSD) in large language model pre-training, where the singular value spectrum stabilizes early while parameters continue to evolve. The authors prove that this stabilization marks the transition to the slow-descent phase of training, and they analyze how training strategies like WSD and Muon affect this behavior.
This paper identifies a Rank-1 Subspace phenomenon in LLM pre-training trajectories and proposes Extra-Merge, a training-free strategy that extrapolates along this subspace to minimize loss, achieving consistent zero-shot accuracy gains across GPT-2 and LLaMA families up to 2B parameters.
GEM reformulates LLM data curation as a variational problem on the hypersphere, using geometric entropy mixing and a minorize-maximize algorithm to discover balanced semantic clusters. It reports state-of-the-art results among data mixing strategies, improving average downstream accuracy by up to 1.2%.
Next Implicit Token Prediction (NITP) enhances language model pre-training by adding dense continuous supervision in representation space, improving generalization and performance across model sizes with minimal computational overhead.
In an interview, Yao Shunyu offered the contrarian view that pre-training has not hit a wall and that scaling laws have not reached their limit, claiming that most people who say otherwise have bugs in their code.
Andrej Karpathy, co-founder of OpenAI and former Tesla AI lead, has joined Anthropic to work on pre-training and lead a team focused on using Claude to accelerate pre-training research.
This paper reveals that during pre-training, language models frequently and suddenly switch between pattern-matching and generalization behaviors, a phenomenon called mode-hopping, and presents a toy evaluation suite to study it.
Researchers introduce symmetry-compatible optimizers that respect the equivariance structures of neural network parameters, improving training stability and performance over traditional methods like Adam. The approach is validated on various language model architectures including Qwen3-0.6B, Gemma 3 1B, and OLMoE-1B-7B.
Sapient Intelligence released HRM-Text-1B, a 1-billion-parameter language model with a novel dual-timescale recurrent architecture (Hierarchical Reasoning Model) that provides unbounded compute depth at bounded parameter count. The pre-alignment checkpoint is available on Hugging Face.
Nous Research releases Token Superposition Training (TST), a method that speeds up LLM pre-training by up to 2.5x across models from 270M to 10B parameters, reducing wall-clock time without altering architecture or data.
Nous Research releases Lighthouse Attention, a selection-based hierarchical attention mechanism that achieves a 1.4-1.7x wall-clock speedup at 98K context and a ~17x faster forward/backward pass than standard attention at 512K context on a single B200, validated on 530M-parameter Llama-3 models across 50B tokens.
This paper investigates execution-grounded automated AI research by building an automated executor that implements LLM-generated ideas and runs experiments. It shows that execution-guided evolutionary search can find methods that significantly outperform baselines in both pre-training and post-training tasks.
Percy Liang announces that a new data mix is being compiled for the next Marin model and requests high-quality token data for pre-training, mid-training, and SFT.
This paper introduces RuDE, a framework for predicting the post-training potential of pre-trained LLMs by leveraging response discrimination, addressing the limitations of traditional benchmarks like MMLU.
This paper presents a systematic study of long-context continued pre-training for vision-language models, achieving generalization beyond 128K context with an efficient data mixture design and introducing the MMProLong model.