dpo

#dpo

Which Pairs to Compare for LLM Post-Training?

arXiv cs.AI · 6d ago

This paper studies the problem of selecting which completion pairs to label for human preference feedback in LLM post-training. It formulates comparison curation as a sampling-design problem, provides theoretical bounds on DPO's policy optimality gap, and proposes practical sampling designs that improve sample efficiency over common heuristics on synthetic and real benchmarks.


Predictive Data Debugging: Reveal and Shape What Your Model Learns, Before You Train (11 minute read)

TLDR AI · 2026-06-12

This research introduces a method that uses interpretability to predict, before training, which behaviors DPO will amplify or suppress given a preference dataset, so that the data can be debugged to prevent undesired effects. The predictions reach R² = 0.9, and the technique is integrated into Goodfire's Silico platform.


Beyond the Golden Teacher: Enhancing Graph Learning through LLM-GNN Co-teaching

arXiv cs.LG · 2026-06-11

This paper proposes LLM-GNN Co-Teaching, a bidirectional framework for few-shot graph learning on text-attributed graphs. The LLM and GNN exchange confident pseudo-labels and use round-based preference optimization (RPL-PO) to mutually improve, outperforming prior methods on benchmarks.


Fine-tuned Qwen2.5-7B to 96% of Claude Haiku on a domain-specific task using ~$3 of API calls and zero human labelers

Reddit r/LocalLLaMA · 2026-06-10

This post presents DV-DPO, a method for fine-tuning Qwen2.5-7B on a domain-specific task using only ~$3 of API calls and zero human labelers. The method relies on adversarial cross-examination, and the fine-tuned model reaches 96% of Claude Haiku's composite performance.


Improving Multimodal Reasoning via Worst Dimension Optimization

arXiv cs.AI · 2026-06-09

This paper introduces Multimodal Multi-Dimensional Scalarization Process Reward Modeling (MMS-PRM), which enforces the worst dimension's robustness in multimodal reasoning to prevent failures like visual hallucinations from being masked by strong text logic.


DOG-DPO: Dynamic Optimization in Geometry for Safety Alignment

arXiv cs.LG · 2026-06-09

DOG-DPO is a training-free data selection framework that treats preference pairs as structured geometric signals, decomposing multi-dataset preference geometry into anchor and residual subspaces to select diverse subsets for safety alignment. It achieves strong utility-robustness trade-offs using only 11% of preference pairs across six safety benchmarks.


@TheTuringPost: 15 Policy Optimization and Preference Optimization techniques important in 2026 GRPO DPO REINFORCE++ DAPO (Dynamic sAmp…

X AI KOLs Timeline · 2026-06-07

A comprehensive guide to 15 policy optimization and preference optimization techniques important in 2026, including GRPO, DPO, REINFORCE++, and many newer variants, mapping the landscape of reasoning RL methods.


VCIFBench: Evaluating Complex Instruction Following for Video Understanding

arXiv cs.CL · 2026-06-04

VCIFBench is a new benchmark for evaluating complex instruction following in video understanding, featuring 306 test instructions with content, format, style, and structure constraints, plus a DPO preference dataset. Experiments on 10 MLLMs reveal that joint constraint satisfaction remains challenging, and DPO training on the benchmark data improves instruction-following performance.


Direct Preference Optimization Beyond Chatbots

Hugging Face Blog · 2026-06-03

This post applies Direct Preference Optimization (DPO) beyond chatbots to OCR tasks, reducing text degeneration across multiple model families by 59.4% on average.


@yuwen_lu_: I'm halfway through, damn why did no one ever tell me RL is this fun

X AI KOLs Timeline · 2026-05-30

Sanbu 散步 released Hands-On Modern RL, a code-first tutorial that runs from CartPole + PPO basics to LLM post-training (RLHF, DPO, GRPO) and agentic RL. An English version is coming soon.


CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations

arXiv cs.CL · 2026-05-27

This paper introduces CroCo, a method for cross-lingual contrastive preference tuning on self-generated responses, showing that a reward model trained on English preferences can effectively rank responses in other languages, improving model performance across 14 languages without language-specific annotations.


@neural_avb: Next video is on training tiny (<1B) models for preference tuning. Plus how to generate preference datasets with local …

X AI KOLs Timeline · 2026-05-26

The author announces an upcoming video on preference tuning for tiny (<1B) models, covering reward models, RLHF, DPO, and ORPO with Unsloth and TRL.


Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment

arXiv cs.AI · 2026-05-22

This paper proves that the equivalence between Direct Preference Optimization (DPO) and Reinforcement Learning from Human Feedback (RLHF) is conditional and often violated in practice, revealing failure modes where DPO optimizes relative advantage rather than absolute alignment. The authors introduce Constrained Preference Optimization (CPO) for provable alignment and demonstrate state-of-the-art performance.


@akshay_pachaar: The Operating System for AI Research Labs. TransformerLab orchestrates GPUs across any cloud and runs any training or e…

X AI KOLs Following · 2026-05-20

TransformerLab is an open-source platform that orchestrates GPUs across clouds and provides pre-built templates for AI training and evaluation workflows like LoRA, DPO, and MMLU.


@anyscalecompute: LLM post-training is the new baseline. Picking the wrong method or GPU config is how you waste a 36-hour run. Introduci…

X AI KOLs Following · 2026-05-15

Anyscale introduces a new Agent Skill for LLM post-training that automatically selects the optimal fine-tuning method (SFT, DPO, GRPO, etc.) and generates ready-to-launch configs, helping avoid wasted GPU runs.


Enhancing Multilingual Counterfactual Generation through Alignment-as-Preference Optimization

arXiv cs.CL · 2026-05-13

The paper introduces Macro, a preference alignment framework using DPO to improve the validity and minimality of self-generated counterfactual explanations across multiple languages.


talkie-lm/talkie-1930-13b-it

Hugging Face Models Trending · 2026-04-20

Talkie-1930-13b-it is a 13B parameter instruction-tuned language model trained on pre-1931 text and fine-tuned using reinforcement learning with DPO.


Where does output diversity collapse in post-training?

arXiv cs.CL · 2026-04-20

This paper investigates where and why output diversity collapses during language-model post-training, analyzing three OLMo 3 lineages (Think, Instruct, RL-Zero) across multiple tasks and metrics. The authors find that diversity collapse is determined primarily by training data composition and is embedded in the model weights during training, so it cannot be addressed at inference time alone.


GroupDPO: Memory efficient Group-wise Direct Preference Optimization

arXiv cs.CL · 2026-04-20

GroupDPO introduces a memory-efficient algorithm for group-wise direct preference optimization that leverages multiple candidate responses per prompt while reducing peak memory usage through decoupled backpropagation. The method demonstrates consistent improvements over standard DPO across offline and online alignment settings.
