A technical deep-dive into common causes of failed pretraining runs in large language models, including causality-breaking issues in expert routing and numerical precision bugs, with examples from Llama 4, Gemini 2 Pro, and GPT-4.
A GitHub repository provides scripts to train billion-parameter language models from scratch on a single GPU using PyTorch, based on the Transformer architecture.
Fast-Slow Training (FST) interleaves context optimization (via GEPA) with model weight updates via RL, achieving 3× sample efficiency over RL alone on math, code, and physics reasoning while preserving plasticity and enabling continual learning.
This paper introduces Rotation-Preserving Supervised Fine-Tuning (RPSFT), a method that improves out-of-domain generalization by preserving projected rotations in pretrained singular subspaces during fine-tuning.
The paper proposes ActGuide-RL, a method for training agentic policies in LLMs by using human action data as guidance to overcome exploration barriers in reinforcement learning without extensive supervised fine-tuning.
This paper introduces YFPO, a neuron-guided preference optimization framework that uses internal activation signals to improve mathematical reasoning in large language models.
This paper proposes LayerTracer, an interpretable framework for layer allocation in continued pre-training, demonstrating that freezing deep layers while training shallow ones outperforms full-parameter fine-tuning. It offers a low-cost, actionable strategy for resource-constrained teams optimizing large language models.
The article introduces a method for reducing the cost of on-policy distillation (OPD), making post-training of large language models more efficient.
This paper introduces On-Policy Harness Self-Distillation (OPHSD), a method that internalizes the capabilities of inference-time reasoning harnesses into the base model through self-distillation. The approach improves standalone performance on complex reasoning tasks, allowing the model to retain reasoning scaffolds without permanent external dependencies.
The article introduces DataArc-SynData-Toolkit, an open-source framework designed to simplify multi-path, multimodal, and multilingual synthetic data generation. It aims to lower technical barriers and improve usability for training large language models through a unified, configuration-driven pipeline.
This paper introduces Pion, a novel spectrum-preserving optimizer for large language model training that uses orthogonal equivalence transformations to maintain singular values during weight updates, offering stable performance comparable to standard optimizers.
Unsloth, an open-source library for efficient LLM training and inference, has officially joined the PyTorch Ecosystem to enhance accessibility and performance. The announcement highlights new features like Unsloth Studio and optimized kernels for reduced VRAM usage.
This paper proposes Shadow Mask Distillation (SMD) to solve the off-policy bias caused by KV cache compression during reinforcement learning post-training for large language models. It introduces a mechanism that ensures on-policy alignment and improves memory efficiency for long-context reasoning tasks.
This paper introduces a training-free diagnostic framework to analyze per-token distillation signals for reasoning models, revealing that guidance is more beneficial on incorrect rollouts and depends on student capacity and task context.
This paper introduces G-Zero, a verifier-free framework that enables autonomous large language model self-improvement through co-evolutionary training using intrinsic rewards and hint-based guidance. It aims to overcome the limitations of proxy LLM judges in open-ended tasks by deriving supervision from internal distributional dynamics.
This paper introduces RLRT, a method that reverses teacher signals in self-distillation to reinforce successful student deviations, enhancing reasoning exploration in large language models.
This article analyzes post-training methods for language models through a distributional perspective, comparing how SFT, RL, and on-policy distillation reshape model distributions and impact phenomena like catastrophic forgetting.
The article claims that Stanford has released a free technique for training LLMs to adhere strictly to prompts, a skill Anthropic reportedly pays high salaries for. It urges readers to bookmark the resource before it is removed.
The author details the process of optimizing custom matrix multiplication kernels in Swift to train a large language model on Apple Silicon, aiming to outperform C implementations by leveraging CPU, SIMD, AMX, and GPU capabilities.
MiniMax published a technical blog post analyzing the systematic vocabulary degradation behind its M2-series models' inability to output certain personal names. The post attributes the problem to parameter shifts caused by a gap in data coverage between the pre-training and post-training stages, and proposes remediation with full-scale synthetic data.