The MiniMax-M2 series introduces Mixture-of-Experts language models that achieve high performance on agentic tasks with minimal activated parameters (9.8B per token out of 229.9B total), leveraging agent-driven data pipelines, a scalable RL system called Forge, and a checkpoint that takes early steps toward self-evolution.
Norway's National Library is building a sovereign Norwegian LLM using 2 PB of Huawei OceanStor Dorado flash storage for its AI training data pipeline, addressing the need for a local language model.
ARES proposes a framework for automatically constructing rubric-based RL data from pretraining documents, generating question-answer pairs and weighted rubrics to enable instance-level reward supervision for open-ended LLM responses, outperforming existing methods on multi-dimensional open-ended tasks.
NVIDIA has officially published a set of Skills for AI agents, covering video analysis, voice agents, LLM training, model acceleration, RAG, secure environments, logistics optimization, and CUDA programming.
This paper introduces Vector Policy Optimization (VPO), a reinforcement learning algorithm that trains LLMs to produce diverse solutions by optimizing across multiple reward dimensions, significantly improving test-time search performance compared to scalar RL baselines.
A user shares a workflow for training a 9B-parameter LLM overnight on an A100 GPU in Google Colab for CAD $13.99, noting how easy it has become to train small language models.
This paper introduces a framework to quantify hyperparameter transfer in LLMs and finds that the benefit of μP over standard parameterization (SP) in AdamW training largely comes from increasing the embedding layer learning rate. It also explores the impact of weight decay and other factors.
This paper clarifies that under AdamW, μP's embedding learning rate rule (constant) is essentially correct and explains most of μP's benefit, contrary to a previous finding by Hayou et al. about realistic LLM vocabulary sizes.
PlanningBench is a framework for generating scalable, diverse, and verifiable planning data to evaluate and train large language models, featuring a constraint-driven synthesis pipeline with adaptive difficulty control and quality filtering. Experiments show that frontier LLMs struggle with coupled constraints, and reinforcement learning on PlanningBench data improves performance on unseen planning tasks.
CODA reparameterizes memory-bound operations in LLM training to fuse them into the matmul epilogue, achieving near state-of-the-art performance with LLM-generated kernels.
RPS is a two-stage LLM post-training method inspired by neuroscience that combines curriculum learning with learning rate decay. Preliminary results show improved program synthesis reliability on Qwen3-8B compared to training with an equal learning rate.
Introspective Training (IXT) is a unified feedback-conditioning algorithm that uses a thinking reward model to annotate data with natural language critiques, enabling quality-aware training across all LLM stages. The method improves compute efficiency by up to 2.8x and achieves better performance in math and code domains.
Aggressive AI scrapers are disrupting wiki operations by imitating human traffic and routing requests through residential proxies, which drastically increases server costs and causes service instability.
Karpathy open-sourced an experimental project, autoresearch, that lets an AI agent automatically complete the research loop for small-scale LLM training: modify code, run experiments, evaluate results, and iterate. Humans only need to write the research plan and constraints.
An open-source repository called train-llm-from-scratch enables training billion-parameter LLMs on a single GPU, with a configurable pipeline from raw text to inference, including dataset streaming and checkpointing, under MIT License.
DynaTrain is a distributed training system enabling sub-second online reconfiguration of parallelism for large language models, using a Virtual Parameter Space abstraction to achieve up to three orders of magnitude faster transitions than existing methods.
This paper develops an economic model combining scaling laws with microeconomic theory to analyze profit-optimal training of large language models, considering trade-offs between model quality, training costs, and hardware efficiency.
A technical deep-dive into common causes of failed pretraining runs in large language models, including causality-breaking issues in expert routing and numerical precision bugs, with examples from Llama 4, Gemini 2 Pro, and GPT-4.
A GitHub repository provides scripts to train billion-parameter language models from scratch on a single GPU using PyTorch, based on the Transformer architecture.
Fast-Slow Training (FST) interleaves context optimization (via GEPA) with model weight updates via RL, achieving 3× sample efficiency over RL alone on math, code, and physics reasoning while preserving plasticity and enabling continual learning.