DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization

Hugging Face Daily Papers

Summary

This paper proposes DRIFT, a framework that combines offline trajectories with importance-weighted supervised fine-tuning to efficiently achieve multi-turn interactive learning performance comparable to reinforcement learning.

Large language models are increasingly deployed in multi-turn interactive settings where users or environments can iteratively provide lightweight feedback. Unfortunately, optimizing such behavior presents a sharp dilemma in practice: online reinforcement learning effectively addresses multi-turn dynamics but is prohibitively expensive because full correction trajectories must be generated at every update, whereas offline supervised fine-tuning (SFT) is efficient but suffers from distribution shift and behavioral collapse. To this end, we propose DRIFT (Decoupled Rollouts and Importance-Weighted Fine-Tuning), a framework that operationalizes the theoretical insight that the KL-regularized RL objective is equivalent to importance-weighted supervised learning. DRIFT decouples rollout from optimization by sampling offline interaction trajectories from a fixed reference policy, deriving return-based importance weights, and optimizing the policy via weighted SFT on the resulting dataset. Empirically, we demonstrate that DRIFT matches or exceeds the performance of multi-turn reinforcement learning baselines while maintaining the training efficiency and simplicity of standard supervised fine-tuning. Code is available at https://github.com/2020-qqtcg/DRIFT.
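
The pipeline described in the abstract (sample trajectories from a fixed reference policy, turn returns into importance weights, then run weighted SFT) can be illustrated with a small sketch. It relies on a standard result rather than on details from the paper: the maximizer of E[R] - beta * KL(pi || pi_ref) is proportional to pi_ref * exp(R / beta), so fitting a policy to that target from reference-policy samples amounts to a weighted log-likelihood with weights proportional to exp(R / beta). The function names, the temperature `beta`, the self-normalization of the weights, and the toy numbers are all assumptions made for illustration; DRIFT's actual weighting scheme may differ, and the linked repository is the authoritative source.

```python
import math


def importance_weights(returns, beta=1.0):
    """Self-normalized weights proportional to exp(R_i / beta).

    Hypothetical helper: `beta` plays the role of the KL-regularization
    strength. The maximum return is subtracted before exponentiating
    for numerical stability; this does not change the normalized result.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    max_return = max(returns)
    unnormalized = [math.exp((r - max_return) / beta) for r in returns]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def weighted_sft_loss(token_logprobs, weights):
    """Weighted negative log-likelihood over offline trajectories.

    token_logprobs[i] holds the current policy's log-probabilities for
    the response tokens of trajectory i (assumed to come from a model's
    forward pass; plain floats are used here to stay dependency-free).
    """
    return -sum(w * sum(lps) for w, lps in zip(weights, token_logprobs))


# Toy offline dataset: three trajectories sampled once from the
# reference policy, each with a scalar return (made-up values).
returns = [0.0, 1.0, 1.0]
token_logprobs = [[-1.0, -1.0], [-0.5], [-0.25, -0.25]]

weights = importance_weights(returns, beta=0.5)
loss = weighted_sft_loss(token_logprobs, weights)
print(weights, loss)
```

In this sketch a large `beta` pushes the weights toward uniform, which recovers plain SFT on the offline data, while a small `beta` concentrates the update on the highest-return trajectories.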

Cached at: 06/01/26, 07:18 AM

Source: https://huggingface.co/papers/2605.31455
Published on May 29

Submitted by mj (https://huggingface.co/mujianijan) on Jun 1

Abstract

DRIFT is a framework that combines offline trajectories with importance-weighted supervised fine-tuning to achieve multi-turn interactive learning efficiency and performance comparable to reinforcement learning.

Get this paper in your agent:

hf papers read 2605.31455

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

Drift Q-Learning

arXiv cs.LG

Proposes DriftQL, which combines a drift-based behavioral regularizer with critic-driven policy improvement for offline RL, outperforming diffusion and flow methods on D4RL and OGBench while maintaining simplicity and efficiency.

Drifting Objectives for Refining Discrete Diffusion Language Models

arXiv cs.CL

This paper introduces TokenDrift, a drifting objective that refines discrete diffusion language models by lifting categorical predictions to a continuous semantic space for anti-symmetric drifting, significantly improving generation quality under a fixed number of denoising steps.

PROWL: Prioritized Regret-Driven Optimization for World Model Learning

arXiv cs.LG

Introduces PROWL, a prioritized regret-driven optimization framework that uses an adversarial curriculum to improve diffusion-based world model robustness by focusing on high-error trajectories, achieving better performance on out-of-distribution scenarios in MineRL.

RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework

Hugging Face Daily Papers

RAD-2 presents a unified generator-discriminator framework for autonomous driving that combines diffusion-based trajectory generation with RL-optimized reranking, achieving 56% collision rate reduction compared to diffusion-based planners. The approach introduces techniques like Temporally Consistent Group Relative Policy Optimization and BEV-Warp simulation environment for efficient large-scale training.