FastMix is a novel framework that automates data mixture discovery for training large models using a single proxy model and bilevel optimization, achieving state-of-the-art performance with significant efficiency gains.
Kairos is a native world model framework for Physical AI that learns from diverse experiences using a cross-embodiment data curriculum, maintains persistent states with hybrid temporal attention, and supports efficient deployment on server and consumer hardware.
Google DeepMind's pre-training lead Vlad Feinberg highlights kernel development and low-level performance engineering as high-demand skills for frontier AI labs.
Interview with Google DeepMind's pre-training area lead Vlad Feinberg about landing jobs at frontier AI labs, covering needed skills, research vs engineering differences, and scaling laws.
This paper tackles code generation for no-resource programming languages by building benchmarks and proposing a method that combines further pre-training with weight difference transfer to create specialized instruction-following models at reduced cost.
This paper introduces the Curse of Depth in LLMs, where deep layers become ineffective because Pre-Layer Normalization causes output variance to explode with depth. The authors propose LayerNorm Scaling to mitigate this, showing consistent improvements in pre-training and fine-tuning across model sizes up to 7B.
A 230-page book that comprehensively covers LLM concepts including pre-training, fine-tuning, alignment, and prompting techniques.
This paper introduces 'fragility', a metric complementary to probe accuracy that measures the activation-noise level at which probe accuracy collapses, enabling analysis of how representations evolve during LLM pre-training even after accuracy saturates.
A new method called ECHO bridges RL and pre-training by applying next-token prediction to tool call outputs, so the model learns from the environment beyond reward signals, combining world modeling with agentic actions.
CodeAlchemy is a synthetic data generation framework that transforms publicly available code into semantically rich training data using five strategies, producing over 500 billion tokens and enabling small models to outperform much larger ones on code benchmarks.
The paper identifies repetition mismatch as a primary cause of data mixture experiments failing to scale, and proposes a repetition-controlled subsampling procedure that allows small-scale experiments to recover near-optimal mixtures using far fewer tokens.
3Blue1Brown's new video explains that LLMs are fundamentally compression machines, linking next-token prediction to efficient encoding of human knowledge, which leads to better abstraction and reasoning.
The Aizpurua team at Multiverse Computing, Spain, proposes expanding pretrained large models with small quantum circuits. Adding only about 6,000 parameters to Llama 3.1 8B reduces perplexity by 1.4%, demonstrating the feasibility of quantum-circuit-assisted scaling of large models.
Detailed walkthrough of Cursor's approach to training Composer 2: using Kimi 2.5 as the base, learning code knowledge through large-scale intermediate training, then large-scale RL to teach the model to write correct code in real environments, and using self-summarization to handle long contexts.
The paper proposes a hybrid pre-training objective combining JEPA latent-space prediction with MLM reconstruction for language models, showing improved embedding uniformity and semantic-lexical balance.
A thread reviewing the paper 'Pretraining Large Language Models with NVFP4' and discussing NVFP4 pre-training, especially for NVIDIA Blackwell.
Harvard researchers challenge the standard LLM training pipeline by showing RL can be effectively applied during pre-training rather than only after SFT, finding that data composition matters more than model scale, and proposing parallel averaging of RL and SFT objectives that outperforms sequential approaches while preserving general capabilities.
This paper introduces ChristBERT, a family of domain-specific RoBERTa-based language models for German clinical NLP, and evaluates three domain adaptation strategies (continued pre-training, pre-training from scratch, and vocabulary adaptation) on medical named entity recognition and text classification tasks, achieving state-of-the-art results.
This paper introduces Regret Pre-training, a self-supervised framework that uses a dual-view architecture to incorporate future context into causal language model training, improving performance on downstream tasks by up to 18 percentage points without adding parameters.
Explains mid-training as a stage between pre-training and post-training, where a base model is continued on curated data to strengthen specific capabilities before instruction tuning.
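The 'fragility' entry above describes a metric defined as the activation-noise level at which probe accuracy collapses. A minimal sketch of that idea follows, under stated assumptions: the summary does not specify the probe type, the noise distribution, or the collapse criterion, so the least-squares linear probe, isotropic Gaussian noise, the 0.75 accuracy threshold, and every function name here are hypothetical choices for illustration, not the paper's method. The synthetic "activations" stand in for two checkpoints whose probes both reach saturated clean accuracy but differ in class separation.

```python
import numpy as np


def fit_linear_probe(X, y):
    """Fit a least-squares linear probe (with bias) for labels in {0, 1}."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    targets = 2.0 * y - 1.0  # map {0, 1} -> {-1, +1}
    w, *_ = np.linalg.lstsq(Xb, targets, rcond=None)
    return w


def probe_accuracy(w, X, y):
    """Accuracy of the linear probe w on activations X."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    preds = (Xb @ w > 0).astype(int)
    return float((preds == y).mean())


def fragility(X, y, sigmas, collapse_acc=0.75, n_trials=5, seed=0):
    """Smallest sigma whose mean noisy accuracy drops below collapse_acc.

    Assumption: 'collapse' means falling below a fixed accuracy threshold.
    Returns inf if accuracy never drops below the threshold on the grid.
    """
    rng = np.random.default_rng(seed)
    w = fit_linear_probe(X, y)  # probe is trained on clean activations
    for sigma in sigmas:
        accs = [
            probe_accuracy(w, X + rng.normal(0.0, sigma, X.shape), y)
            for _ in range(n_trials)
        ]
        if np.mean(accs) < collapse_acc:
            return float(sigma)
    return float("inf")


def synthetic_activations(margin, n=400, dim=16, seed=0):
    """Two classes separated by 2 * margin along one random direction."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    X = np.outer(2.0 * y - 1.0, margin * direction)
    X += 0.1 * rng.normal(size=(n, dim))  # small intrinsic noise
    return X, y


sigmas = np.linspace(0.25, 10.0, 40)

# Two hypothetical checkpoints: both are linearly separable on clean data,
# but the second encodes the feature with a larger margin.
X_narrow, y_narrow = synthetic_activations(margin=1.0)
X_wide, y_wide = synthetic_activations(margin=4.0)

acc_narrow = probe_accuracy(fit_linear_probe(X_narrow, y_narrow), X_narrow, y_narrow)
acc_wide = probe_accuracy(fit_linear_probe(X_wide, y_wide), X_wide, y_wide)
frag_narrow = fragility(X_narrow, y_narrow, sigmas)
frag_wide = fragility(X_wide, y_wide, sigmas)

print("clean accuracy:", acc_narrow, acc_wide)
print("fragility (collapse sigma):", frag_narrow, frag_wide)
```

In this toy setup the two clean accuracies are indistinguishable, while the collapse noise level differs, which is the sense in which such a metric can keep tracking representation changes after accuracy saturates.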