DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization
Summary
This paper proposes DRIFT, a framework that combines offline trajectories with importance-weighted supervised fine-tuning to efficiently achieve multi-turn interactive learning performance comparable to reinforcement learning.
Source: https://huggingface.co/papers/2605.31455 Published on May 29
Submitted by mj (https://huggingface.co/mujianijan) on Jun 1
Abstract
DRIFT is a framework that combines offline trajectories with importance-weighted supervised fine-tuning to reach performance comparable to multi-turn reinforcement learning while keeping the training efficiency of supervised fine-tuning.
Large language models are increasingly deployed in multi-turn interactive settings where users or environments can iteratively provide lightweight feedback. Optimizing such behavior presents a sharp dilemma in practice: online reinforcement learning can effectively address multi-turn dynamics but is prohibitively expensive due to the cost of generating full correction trajectories at every update, whereas offline supervised fine-tuning (SFT) is efficient but suffers from distribution shift and behavioral collapse. To this end, we propose DRIFT (Decoupled Rollouts and Importance-Weighted Fine-Tuning), a framework that operationalizes the theoretical insight that the KL-regularized RL objective is equivalent to importance-weighted supervised learning. DRIFT decouples rollout from optimization by sampling offline interaction trajectories from a fixed reference policy, deriving return-based importance weights, and optimizing the policy via weighted SFT on the resulting dataset. Empirically, we demonstrate that DRIFT matches or exceeds the performance of multi-turn reinforcement learning baselines while maintaining the training efficiency and simplicity of standard supervised fine-tuning. Code is available at https://github.com/2020-qqtcg/DRIFT.
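The pipeline described in the abstract (sample trajectories offline from a fixed reference policy, derive return-based weights, then run weighted SFT) can be illustrated with a minimal sketch. It relies on the standard result that the KL-regularized objective with temperature beta has an optimum proportional to the reference policy times exp(R / beta), so fitting that optimum by maximum likelihood on reference-policy samples weights each trajectory by exp(R / beta). The exact weighting used by DRIFT is not stated on this page, so the exponential form, the self-normalization, and the names `importance_weights`, `weighted_sft_loss`, and `beta` are assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import List, Sequence


def importance_weights(returns: Sequence[float], beta: float = 1.0) -> List[float]:
    """Self-normalized weights w_i proportional to exp(R_i / beta).

    Assumption: an exponential return weighting, as implied by the
    closed-form optimum of a KL-regularized objective. The paper's
    actual weighting scheme may differ.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    if not returns:
        raise ValueError("returns must be non-empty")
    scaled = [r / beta for r in returns]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sft_loss(traj_log_probs: Sequence[float],
                      weights: Sequence[float]) -> float:
    """Weighted negative log-likelihood: -sum_i w_i * log pi_theta(tau_i).

    traj_log_probs[i] is the log-probability the current policy assigns
    to offline trajectory i (summed over its response tokens).
    """
    if len(traj_log_probs) != len(weights):
        raise ValueError("one weight per trajectory is required")
    return -sum(w * lp for w, lp in zip(weights, traj_log_probs))


# Toy data: three offline trajectories sampled from the reference policy.
returns = [1.0, 0.0, 0.5]        # hypothetical trajectory returns
log_probs = [-2.0, -1.0, -3.0]   # hypothetical log pi_theta(tau_i)

weights = importance_weights(returns, beta=0.5)
loss = weighted_sft_loss(log_probs, weights)
```

In this sketch a smaller `beta` concentrates the weight on the highest-return trajectories, while a large `beta` approaches uniform weights, which recovers plain SFT on the offline data. Because the weights depend only on the fixed offline dataset, they can be computed once before training, which is what lets rollout generation be decoupled from optimization.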
Similar Articles
Drift Q-Learning
Proposes DriftQL, which combines a drift-based behavioral regularizer with critic-driven policy improvement for offline RL, outperforming diffusion and flow methods on D4RL and OGBench while maintaining simplicity and efficiency.
Drifting Objectives for Refining Discrete Diffusion Language Models
This paper introduces TokenDrift, a drifting objective that refines discrete diffusion language models by lifting categorical predictions to a continuous semantic space for anti-symmetric drifting, significantly improving generation quality under a fixed number of denoising steps.
PROWL: Prioritized Regret-Driven Optimization for World Model Learning
Introduces PROWL, a prioritized regret-driven optimization framework that uses an adversarial curriculum to improve diffusion-based world model robustness by focusing on high-error trajectories, achieving better performance on out-of-distribution scenarios in MineRL.
Learnability-Informed Fine-Tuning of Diffusion Language Models
We propose LIFT, a learnability-informed fine-tuning algorithm for diffusion language models that aligns training with token difficulty and time step, achieving substantial gains on reasoning benchmarks.
RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework
RAD-2 presents a unified generator-discriminator framework for autonomous driving that combines diffusion-based trajectory generation with RL-optimized reranking, achieving 56% collision rate reduction compared to diffusion-based planners. The approach introduces techniques like Temporally Consistent Group Relative Policy Optimization and BEV-Warp simulation environment for efficient large-scale training.