Tag
This paper studies the problem of selecting which completion pairs to label for human preference feedback in LLM post-training. It formulates comparison curation as a sampling-design problem, provides theoretical bounds on DPO's policy optimality gap, and proposes practical sampling designs that improve sample efficiency over common heuristics on synthetic and real benchmarks.
This research introduces an interpretability-based method that predicts, before training, which behaviors DPO will amplify or suppress given a preference dataset, enabling data debugging to prevent undesired effects. The predictions reach R² = 0.9, and the technique is integrated into Goodfire's Silico platform.
This paper proposes LLM-GNN Co-Teaching, a bidirectional framework for few-shot graph learning on text-attributed graphs. The LLM and GNN exchange confident pseudo-labels and use round-based preference optimization (RPL-PO) to mutually improve, outperforming prior methods on benchmarks.
Presents DV-DPO, a method for fine-tuning Qwen2.5-7B on domain-specific tasks using only ~$3 in API calls and no human labelers, achieving 96% of Claude Haiku's composite performance via adversarial cross-examination.
This paper introduces Multimodal Multi-Dimensional Scalarization Process Reward Modeling (MMS-PRM), which enforces robustness on the worst-performing dimension in multimodal reasoning, so that failures such as visual hallucinations are not masked by strong textual logic.
DOG-DPO is a training-free data selection framework that treats preference pairs as structured geometric signals, decomposing multi-dataset preference geometry into anchor and residual subspaces to select diverse subsets for safety alignment. It achieves strong utility-robustness trade-offs using only 11% of preference pairs across six safety benchmarks.
A comprehensive guide to 15 policy- and preference-optimization techniques important in 2026, including GRPO, DPO, REINFORCE++, and many newer variants, mapping the landscape of reasoning RL methods.
VCIFBench is a new benchmark for evaluating complex instruction following in video understanding, featuring 306 test instructions with content, format, style, and structure constraints, plus a DPO preference dataset. Experiments on 10 MLLMs reveal that joint constraint satisfaction remains challenging, and DPO training on the benchmark data improves instruction-following performance.
Direct Preference Optimization (DPO) is applied beyond chatbots to OCR tasks, reducing text degeneration across multiple model families by 59.4% on average.
Sanbu 散步 released Hands-On Modern RL, a code-first tutorial on modern RL that runs from CartPole + PPO basics through LLM post-training (RLHF, DPO, GRPO) to agentic RL. An English version is coming soon.
This paper introduces CroCo, a method for cross-lingual contrastive preference tuning on self-generated responses, showing that a reward model trained on English preferences can effectively rank responses in other languages, improving model performance across 14 languages without language-specific annotations.
Announces an upcoming video on preference tuning for tiny models, covering reward models, RLHF, DPO, and ORPO with Unsloth and TRL.
This paper proves that the equivalence between Direct Preference Optimization (DPO) and Reinforcement Learning from Human Feedback (RLHF) is conditional and often violated in practice, revealing failure modes where DPO optimizes relative advantage rather than absolute alignment. The authors introduce Constrained Preference Optimization (CPO) for provable alignment and demonstrate state-of-the-art performance.
TransformerLab is an open-source platform that orchestrates GPUs across clouds and provides pre-built templates for AI training and evaluation workflows like LoRA, DPO, and MMLU.
Anyscale introduces a new Agent Skill for LLM post-training that automatically selects the optimal fine-tuning method (SFT, DPO, GRPO, etc.) and generates ready-to-launch configs, helping avoid wasted GPU runs.
The paper introduces Macro, a preference alignment framework using DPO to improve the validity and minimality of self-generated counterfactual explanations across multiple languages.
Talkie-1930-13b-it is a 13B-parameter instruction-tuned language model trained on pre-1931 text and fine-tuned using reinforcement learning with DPO.
This paper investigates where and why output diversity collapses during post-training of language models, analyzing three OLMo 3 lineages (Think, Instruct, RL-Zero) across multiple tasks and metrics. The authors find that diversity collapse is determined primarily by training data composition and becomes embedded in the model weights during training, so it cannot be addressed at inference time alone.
GroupDPO introduces a memory-efficient algorithm for group-wise direct preference optimization that leverages multiple candidate responses per prompt while reducing peak memory usage through decoupled backpropagation. The method demonstrates consistent improvements over standard DPO across offline and online alignment settings.