on-policy

#on-policy

OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning

arXiv cs.CL · 3d ago

OPID is a framework that extracts dense token-level supervision from completed on-policy trajectories for reinforcement learning of language agents, using hierarchical skills (episode-level and step-level) to improve sample efficiency and robustness.

OPID: On-Policy Skill Distillation for Agentic Reinforcement Learning

Hugging Face Daily Papers · 4d ago

OPID proposes an on-policy skill distillation framework that extracts dense hindsight supervision from completed trajectories, combining outcome-based RL with token-level self-distillation to improve language agent training efficiency and performance on multi-turn tasks.

DanceOPD: On-Policy Generative Field Distillation

Hugging Face Daily Papers · 4d ago

DanceOPD proposes an on-policy generative field distillation framework for flow-matching models that unifies text-to-image generation, local editing, and global editing via capability-specific routing and velocity-based training, improving multi-capability composition while preserving anchor generation quality.

Learning from the Self-future: On-policy Self-distillation for dLLMs

arXiv cs.CL · 2026-06-17

Introduces d-OPSD, the first on-policy self-distillation framework for diffusion large language models, using suffix conditioning and step-level supervision to outperform RLVR and SFT baselines on reasoning benchmarks.

When Context Returns: Toward Robust Internalization in On-Policy Distillation

arXiv cs.LG · 2026-06-11

The paper identifies that reintroducing privileged context to a distilled student model degrades performance (context-induced degradation), and proposes a lightweight consistency regularizer that anchors no-context outputs to mitigate this issue, improving robustness across 12 configurations.

Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

arXiv cs.CL · 2026-06-08

The paper introduces OPDLM, a method that transforms autoregressive language models into diffusion language models via on-policy distillation, requiring 15x to 7000x fewer training tokens while retaining knowledge from the original model.

SG-OPD: Sign-Gated On-Policy Distillation via Sign-Consistency Gating and Phased Teacher Sampling

Hugging Face Daily Papers · 2026-06-08

Sign-Gated On-Policy Distillation (SG-OPD) enhances standard on-policy distillation by using a binary verifier as a trust signal for teacher supervision, improving performance on competition-level math reasoning benchmarks.

@dwarkesh_sp: Recently met @srush_nlp and he started giving me an impromptu lecture on how targeted on-policy self-distillation works…

X AI KOLs Following · 2026-06-04

Dwarkesh Patel shares an explanation from Sasha Rush on targeted on-policy self-distillation, where hint tokens are inserted into a trajectory to downweight specific model errors without requiring new rollouts.

Weak Critics Make Strong Learners: On-Policy Critique Distillation for Scalable Oversight

arXiv cs.AI · 2026-06-02

Proposes on-policy critique distillation (Opcd) using weak models as critics to provide revision directions for strong models, improving reasoning and alignment without requiring weak models to solve tasks.

Trust Region On-Policy Distillation

Hugging Face Daily Papers · 2026-05-31

The paper proposes Trust Region On-Policy Distillation (TrOPD) to stabilize on-policy distillation of large language models by using trust regions, outlier estimation, and off-policy guidance, outperforming existing methods on reasoning and code generation benchmarks.

The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement

Hugging Face Daily Papers · 2026-05-29

The SAVE framework improves reward model training by using value functions to grade on-policy responses and updating the models through contrastive objectives, achieving improved results across six benchmarks.

SLAP: Stratified Loss-based Pruning for On-Policy Data-Efficient Instruction Tuning

arXiv cs.CL · 2026-05-26

Proposes SLAP, a novel data selection framework for efficient instruction tuning of large language models that evaluates batch learnability and uses stratified sampling to achieve superior performance with 20-40% less training data.

Efficient Agentic Reinforcement Learning with On-Policy Intrinsic Knowledge Boundary Enhancement

Hugging Face Daily Papers · 2026-05-26

This paper proposes AKBE, an on-policy method for LLM agent reinforcement learning that dynamically identifies when tool use is needed and when internal knowledge suffices, raising accuracy by 1.85 on average and reducing tool calls by 18% relative to standard agentic RL.

On-Policy Adversarial Flow Distillation for Autoregressive Video Generation

Hugging Face Daily Papers · 2026-05-25

Proposes Adversarial Flow Distillation (AFD) for distilling heterogeneous black-box video generation models into autoregressive students, using on-policy feedback and forward-process flow-matching updates.

When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning

arXiv cs.LG · 2026-05-22

This paper identifies that teacher token reliability in reasoning distillation is trajectory-structured and proposes Position-Weighted On-Policy Self-Distillation (PW-OPSD), which applies increasing position weights to improve performance without additional teacher computation.

@lateinteraction: Indeed. But the next breakthrough for a far more scalable RL paradigm than GRPO is already here: Train your self-teache…

X AI KOLs Following · 2026-05-19

Introduces Pedagogical RL, a new paradigm where models learn to be self-teachers by using privileged information to actively sample successful and easy-to-follow trajectories, achieving up to 40% relative gains over GRPO and on-policy distillation methods.

$f$-Trajectory Balance: A Loss Family for Tuning GFlowNets, Generative Models, and LLMs with Off- and On-Policy Data

arXiv cs.LG · 2026-05-18

This paper introduces a family of loss functions derived from f-divergences for training generative models like GFlowNets and LLMs, which are valid off-policy while matching on-policy gradients of the corresponding f-divergence. Applications include molecule discovery and asynchronous LLM tuning.

Reducing the Safety Tax in LLM Safety Alignment with On-Policy Self-Distillation

arXiv cs.LG · 2026-05-18

This paper introduces OPSA, an on-policy self-distillation method for LLM safety alignment that reduces the safety tax by training on the model's own rollouts and using a teacher flip rate to activate latent safety reasoning, achieving stronger safety-reasoning tradeoffs across multiple model scales.

Always Learning, Always Mixing: Efficient and Simple Data Mixing All The Time

arXiv cs.CL · 2026-05-18

This paper introduces OP-Mix, a data mixing algorithm that uses low-rank adapters trained on the current model to cheaply simulate candidate data mixtures, enabling efficient and unified data mixing across pretraining, continual midtraining, and continual instruction tuning. OP-Mix consistently finds near-optimal mixtures while using a fraction of the compute of baselines, improving pretraining perplexity by 6.3% and reducing compute by 66-95% in continual learning settings.

Multi-Rollout On-Policy Distillation via Peer Successes and Failures

arXiv cs.LG · 2026-05-14

Introduces Multi-Rollout On-Policy Distillation (MOPD), a method that conditions the teacher on both successful and failed peer rollouts to provide denser token-level supervision for language model post-training, improving performance across multiple benchmarks.
