Tag
OPID is a framework that extracts dense token-level supervision from completed on-policy trajectories for reinforcement learning of language agents, using hierarchical skills (episode-level and step-level) to improve sample efficiency and robustness.
OPID proposes an on-policy skill distillation framework that extracts dense hindsight supervision from completed trajectories, combining outcome-based RL with token-level self-distillation to improve language agent training efficiency and performance on multi-turn tasks.
DanceOPD proposes an on-policy generative field distillation framework for flow-matching models that unifies text-to-image generation, local editing, and global editing via capability-specific routing and velocity-based training, improving multi-capability composition while preserving anchor generation quality.
Introduces d-OPSD, the first on-policy self-distillation framework for diffusion large language models, using suffix conditioning and step-level supervision to outperform RLVR and SFT baselines on reasoning benchmarks.
The paper identifies that reintroducing privileged context to a distilled student model degrades performance (context-induced degradation), and proposes a lightweight consistency regularizer that anchors no-context outputs to mitigate this issue, improving robustness across 12 configurations.
The paper introduces OPDLM, a method that transforms autoregressive language models into diffusion language models via on-policy distillation, requiring 15x to 7000x fewer training tokens while retaining knowledge from the original model.
Sign-Gated On-Policy Distillation (SG-OPD) enhances standard on-policy distillation by using a binary verifier as a trust signal for teacher supervision, improving performance on competition-level math reasoning benchmarks.
Dwarkesh Patel shares an explanation from Sasha Rush on targeted on-policy self-distillation, where hint tokens are inserted into a trajectory to downweight specific model errors without requiring new rollouts.
Proposes on-policy critique distillation (Opcd) using weak models as critics to provide revision directions for strong models, improving reasoning and alignment without requiring weak models to solve tasks.
The paper proposes Trust Region On-Policy Distillation (TrOPD) to stabilize on-policy distillation of large language models by using trust regions, outlier estimation, and off-policy guidance, outperforming existing methods on reasoning and code generation benchmarks.
The SAVE framework improves reward model training by using value functions to grade on-policy responses and update models through contrastive objectives, achieving superior results across six benchmarks.
Proposes SLAP, a novel data selection framework for efficient instruction tuning of large language models that evaluates batch learnability and uses stratified sampling to achieve superior performance with 20-40% less training data.
This paper proposes AKBE, an on-policy method for LLM agent reinforcement learning that dynamically identifies when tool use is needed and when internal knowledge suffices, improving accuracy by 1.85 on average and reducing tool calls by 18% relative to standard agentic RL.
Proposes Adversarial Flow Distillation (AFD) for distilling heterogeneous black-box video generation models into autoregressive students, using on-policy feedback and forward-process flow-matching updates.
This paper identifies that teacher token reliability in reasoning distillation is trajectory-structured and proposes Position-Weighted On-Policy Self-Distillation (PW-OPSD), which applies increasing position weights to improve performance without additional teacher computation.
Introduces Pedagogical RL, a new paradigm where models learn to be self-teachers by using privileged information to actively sample successful and easy-to-follow trajectories, achieving up to 40% relative gains over GRPO and on-policy distillation methods.
This paper introduces a family of loss functions derived from f-divergences for training generative models like GFlowNets and LLMs, which are valid off-policy while matching on-policy gradients of the corresponding f-divergence. Applications include molecule discovery and asynchronous LLM tuning.
This paper introduces OPSA, an on-policy self-distillation method for LLM safety alignment that reduces the safety tax by training on the model's own rollouts and using a teacher flip rate to activate latent safety reasoning, achieving stronger safety-reasoning tradeoffs across multiple model scales.
This paper introduces OP-Mix, a data mixing algorithm that uses low-rank adapters trained on the current model to cheaply simulate candidate data mixtures, enabling efficient and unified data mixing across pretraining, continual midtraining, and continual instruction tuning. OP-Mix consistently finds near-optimal mixtures while using a fraction of the compute of baselines, improving pretraining perplexity by 6.3% and reducing compute by 66-95% in continual learning settings.
Introduces Multi-Rollout On-Policy Distillation (MOPD), a method that conditions the teacher on both successful and failed peer rollouts to provide denser token-level supervision for language model post-training, improving performance across multiple benchmarks.