OpenBMB released UltraData-SFT-2605, a 15M-sample high-quality SFT dataset for fine-tuning AI models such as MiniCPM5-1B that are meant to run on phones or laptops.
OpenBMB releases UltraData-SFT-2605, a large-scale dataset with over 15 million high-quality samples for supervised fine-tuning (SFT) of reasoning LLMs, covering deep thinking, non-thinking, math, code, knowledge, instruction following, and multilingual data.
We propose LIFT, a learnability-informed fine-tuning algorithm for diffusion language models that aligns training with token difficulty and time step, achieving substantial gains on reasoning benchmarks.
An experiment comparing three Supervised Fine-Tuning data formats (demonstrations, first-person statements, synthetic documents) for injecting a C-3PO persona into Qwen3-4B, finding first-person statements best for generalization and synthetic documents best for factual knowledge.
Dynamic Fine-Tuning (DFT) is introduced as a method that reweights the SFT loss using the model's own token probability, creating a feedback loop, and adds forward KL to penalize tokens the base model finds likely but the policy has pushed toward zero probability. The tweet expresses skepticism about SFT papers in practice but praises the attempt.
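The reweighting described in the DFT summary can be illustrated with a minimal sketch. It assumes the per-token loss is the usual negative log-likelihood scaled by the policy's own probability of the target token (treated as a constant weight), plus a forward KL term from the base model to the policy. The function names, the `beta` coefficient, and its default value are hypothetical and are not taken from the paper or the tweet.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sft_token_loss(policy_logits, target):
    """Standard SFT loss for one token: -log p(target)."""
    p = softmax(policy_logits)
    return -math.log(p[target])

def dft_token_loss(policy_logits, target):
    """Assumed DFT-style loss: the SFT loss scaled by p(target).

    In an autograd framework the weight would be detached so that no
    gradient flows through it; here it is simply a plain float.
    """
    p = softmax(policy_logits)
    weight = p[target]
    return -weight * math.log(p[target])

def forward_kl(base_logits, policy_logits):
    """KL(base || policy): grows when the policy assigns little
    probability to tokens the base model considers likely."""
    base = softmax(base_logits)
    policy = softmax(policy_logits)
    return sum(b * math.log(b / q) for b, q in zip(base, policy) if b > 0)

def combined_token_loss(policy_logits, base_logits, target, beta=0.1):
    """Hypothetical combination; beta is an illustrative coefficient."""
    return dft_token_loss(policy_logits, target) + beta * forward_kl(
        base_logits, policy_logits
    )

if __name__ == "__main__":
    base = [2.0, 1.0, 0.0]
    policy = [4.0, -3.0, 0.0]  # policy has pushed token 1 toward zero
    print("SFT loss:", sft_token_loss(policy, target=0))
    print("DFT loss:", dft_token_loss(policy, target=0))
    print("forward KL:", forward_kl(base, policy))
    print("combined:", combined_token_loss(policy, base, target=0))
```

Because the weight is the target probability itself, tokens the model already finds unlikely contribute less to the loss than under plain SFT, which is the feedback loop the summary mentions. The sketch uses moderate logits; with extreme values a policy probability can underflow to zero, and a real implementation would work in log space.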
This article describes using Fireworks Agent to automate the fine-tuning of a small open-weight model to generate wiki-style summaries, enabling a self-improving agent loop where model training becomes a callable step.
Anyscale introduces a new Agent Skill for LLM post-training that automatically selects the optimal fine-tuning method (SFT, DPO, GRPO, etc.) and generates ready-to-launch configs, helping avoid wasted GPU runs.
The author trained 1B, 2B, and 3B models with the same SFT recipe and observed that instruction-following (IFEval) regressed for the 1B and 2B models but improved for the 3B, possibly due to different learning rates or model capacity.
Percy Liang announces that they are compiling a new data mix for the next Marin model and requests high-quality token data for pre-training, mid-training, and SFT.
TRL v1.4 is released, featuring chunked NLL loss for SFT to reduce VRAM usage and first-class integration with OpenReward for GRPO.
This paper investigates where and why output diversity collapses during post-training of language models, analyzing three OLMo 3 lineages (Think, Instruct, RL-Zero) across multiple tasks and metrics. The authors find that diversity collapse is primarily determined by training data composition and is embedded in the model weights during training, so it cannot be addressed at inference time alone.