CLI-Universe is a synthesis engine that generates verifiable terminal-agent tasks via a multi-dimensional capability taxonomy and evidence-guided research, producing a distilled dataset of 6,000 trajectories. Fine-tuning Qwen3-32B on this dataset achieves 33.4% on Terminal-Bench 2.0, setting a new state of the art for open-source models at or below 32B parameters.
Z.ai has open-sourced its RL infrastructure, the slime framework, which enabled efficient OPD post-training of GLM-5.2 in about two days. The slime framework is built for LLM post-training and RL scaling, integrates Megatron and SGLang, and has been battle-tested on frontier models such as GLM, Qwen, DeepSeek, and Llama.
Researchers from Stanford, UC, and Nanjing University release SEFD, a dataset of 152B tokens from SEC filings converted to layout-faithful MultiMarkdown, preserving table structure for LLM training with minimal overlap with Common Crawl.
A repository that builds a GPT-style transformer from scratch without high-level libraries, covering everything from data preprocessing to generation, and includes guides for SFT and RLHF.
This paper introduces the 'culture funnel' concept, demonstrating that cultural signals in LLM training data sharply decline during post-training stages. The authors release a 5.6M-sample tagged dataset to help preserve cultural grounding in model alignment.
MLX-LoRA-Studio is a native macOS app for fine-tuning LLMs on Apple Silicon, offering a user-friendly interface and support for various training algorithms including SFT, DPO, and QAT. It is fully open-source and allows local, private fine-tuning without cloud dependency.
llm.istanbul is a WebGPU LLM workbench for training small models, training tokenizers, and generating text entirely in the browser. Everything runs locally, with no server required.
This paper introduces FormatMix, a multi-format training approach that improves LLM consistency across different answer formats by expanding a subset of training items into multiple equivalent formats, showing that format diversity is key to robustness.
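The expansion step described for FormatMix can be pictured with a small sketch. This is not the paper's implementation: the item layout (`question`, `options`, `answer`), the three formatter functions, and the `expand_fraction` parameter are all assumptions made up for illustration.

```python
import random
import string

def to_multiple_choice(item):
    # Render the item as a lettered multiple-choice prompt.
    letters = string.ascii_uppercase
    opts = "\n".join(f"{letters[i]}. {o}" for i, o in enumerate(item["options"]))
    target = letters[item["options"].index(item["answer"])]
    return {"prompt": f"{item['question']}\n{opts}\nAnswer with a letter.", "target": target}

def to_free_form(item):
    # Same question, but the model must write the answer itself.
    return {"prompt": item["question"], "target": item["answer"]}

def to_true_false(item):
    # Verification-style variant; this toy version only emits "True" cases.
    prompt = f"{item['question']}\nProposed answer: {item['answer']}\nTrue or False?"
    return {"prompt": prompt, "target": "True"}

FORMATTERS = [to_multiple_choice, to_free_form, to_true_false]

def format_mix(items, expand_fraction=0.3, seed=0):
    """Expand a random subset of items into every format; keep the rest as-is."""
    rng = random.Random(seed)
    out = []
    for item in items:
        if rng.random() < expand_fraction:
            out.extend(fmt(item) for fmt in FORMATTERS)
        else:
            out.append(to_multiple_choice(item))
    return out

items = [{"question": "2 + 2 = ?", "options": ["3", "4", "5", "6"], "answer": "4"}]
expanded = format_mix(items, expand_fraction=1.0)
unexpanded = format_mix(items, expand_fraction=0.0)
```

With `expand_fraction=1.0` every item appears once per format; with `0.0` the dataset keeps only its original multiple-choice form.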
This article explains how to use GRPO to fine-tune an LLM (Qwen3-8B) for reliable JSON structured output, improving schema accuracy from 62% to 82%, surpassing GPT-4.1's 58%.
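To make the GRPO setup concrete, here is a minimal sketch of the two pieces such a pipeline typically needs: a verifiable reward for JSON output and the group-relative advantage computed over several completions of one prompt. The article's actual reward function is not described in this summary, so the `SCHEMA`, the 1.0/0.5/0.0 reward levels, and the function names below are assumptions for illustration only.

```python
import json
from statistics import mean, pstdev

# Hypothetical schema: required keys mapped to expected Python types.
SCHEMA = {"name": str, "age": int}

def json_reward(completion, schema=SCHEMA):
    """1.0 for a schema-valid JSON object, 0.5 for any other JSON object, else 0.0."""
    try:
        obj = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(obj, dict):
        return 0.0
    keys_match = set(obj) == set(schema)
    types_match = keys_match and all(isinstance(obj[k], t) for k, t in schema.items())
    return 1.0 if types_match else 0.5

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and standard deviation of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled completions for the same prompt (toy data).
group = [
    '{"name": "Ada", "age": 36}',    # valid
    '{"name": "Ada"}',               # missing key
    'name: Ada',                     # not JSON
    '{"name": "Ada", "age": "36"}',  # wrong type
]
rewards = [json_reward(c) for c in group]
advantages = group_relative_advantages(rewards)
```

Completions that beat the group average get a positive advantage and are reinforced; those below it are pushed down, so no separate value model is needed.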
The paper introduces RACES, a recursive automated composition framework that treats verifiable environments as composable building blocks to scale reinforcement learning for LLMs, enabling efficient reasoning generalization through compositional operators.
The tweet outlines a 3-step loop for LLM training in 2026: train on data, run evals, and add synthetic data for underperforming tasks. It emphasizes the accessibility of legal distillation via open source models and cheap APIs, noting that training on reasoning traces alone can achieve high scores.
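The three-step loop can be sketched as a short driver function. The `train`, `evaluate`, and `synthesize` callables, the `threshold` of 0.7, and the toy stand-ins at the bottom are hypothetical placeholders, not anything specified in the tweet.

```python
def weak_tasks(scores, threshold=0.7):
    """Return the names of tasks whose eval score falls below the threshold."""
    return sorted(task for task, score in scores.items() if score < threshold)

def training_loop(train, evaluate, synthesize, data, rounds=3, threshold=0.7):
    """Train, evaluate, then add synthetic data for underperforming tasks."""
    model = None
    for _ in range(rounds):
        model = train(data)                 # step 1: train on current data
        scores = evaluate(model)            # step 2: run evals
        weak = weak_tasks(scores, threshold)
        if not weak:                        # nothing left to fix
            break
        data = data + synthesize(weak)      # step 3: add targeted synthetic data
    return model, data

# Toy stand-ins: the "model" is just the dataset size, and math improves with more data.
toy_train = lambda data: len(data)
toy_evaluate = lambda model: {"math": min(1.0, model / 10), "code": 0.9}
toy_synthesize = lambda weak: [f"synthetic-{t}" for t in weak] * 3

final_model, final_data = training_loop(toy_train, toy_evaluate, toy_synthesize, ["seed"] * 4)
```

In a real setup, `synthesize` would call an open-source model or an API to generate examples for the weak tasks.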
The author shares lessons from training a 160M-parameter LLM from scratch, experimenting with architectures like multi-token prediction and hierarchical reasoning models. They emphasize the importance of fast iteration, simplifying ideas, and understanding why architectures work.
This paper introduces MAPL, a method for learned orthogonal compression of activations in pipeline parallelism, reducing communication overhead while maintaining performance via Stiefel manifold constraints and per-stage factorized anchor embeddings.
Hugging Face's Niels introduces On-policy Distillation (OPD), a key post-training technique used in models like Qwen 3.6/3.7, GLM-5.1, and DeepSeek-V4, now featured on PapersWithCode with a linked whiteboard explanation by Sasha Rush and Dwarkesh Patel.
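As a rough illustration of the idea, on-policy distillation is commonly described as sampling a sequence from the student and then minimizing a per-token reverse KL between the student's and teacher's next-token distributions at each sampled position. The sketch below assumes that formulation and uses hand-written toy distributions over a three-token vocabulary. Whether the models named above use exactly this loss is not stated here.

```python
import math

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) for one next-token distribution."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(student_probs, teacher_probs)
        if p > 0  # terms with zero student probability contribute nothing
    )

def opd_loss(student_steps, teacher_steps):
    """Mean per-token reverse KL along one student-sampled sequence."""
    kls = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(kls) / len(kls)

# Toy next-token distributions at two positions of a sequence the student sampled.
# The teacher only scores the student's own trajectory; it does not generate it.
student_steps = [[0.7, 0.2, 0.1], [0.5, 0.5, 0.0]]
teacher_steps = [[0.6, 0.3, 0.1], [0.9, 0.05, 0.05]]

loss = opd_loss(student_steps, teacher_steps)
```

The loss is zero only when the student matches the teacher at every visited position, and it is measured on states the student actually reaches rather than on teacher-written text.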
Harvard researchers challenge the standard LLM training pipeline by showing RL can be effectively applied during pre-training rather than only after SFT, finding that data composition matters more than model scale, and proposing parallel averaging of RL and SFT objectives that outperforms sequential approaches while preserving general capabilities.
SDPG (Self-Distilled Policy Gradient) is a new RL training framework for LLMs that combines group-relative verifier advantages with on-policy self-distillation and KL regularization to address sparse rewards and instability in RLVR training. The method uses a shared model as both student and teacher by conditioning on privileged context, showing improved stability and performance over RLVR and self-distillation baselines.
This paper proposes a difficulty-aware SFT-then-RL framework for training small language models (≤3B parameters) on reasoning tasks, arguing that data difficulty should be strategically aligned with the distinct roles of SFT (learning new skills) and RL (consolidating partial skills). The authors introduce a Bridge mechanism for hard SFT samples and Critique Fine-Tuning for RL failures, showing consistent improvements across five reasoning benchmarks.
Recommended reading: the MAI-Thinking-1 technical paper, which details almost all the steps to train a SOTA large language model.
KeyLM is a 75M-parameter LLM trained from scratch on 18B tokens. It achieves competitive instruction-following scores against larger models while using fewer parameters and less data.
OmniOPD introduces a logit-free on-policy distillation method that uses chunk-level semantic similarity and speculative verification to train student models with black-box teachers, achieving up to +28.64% improvement on math benchmarks over standard OPD.