off-policy

#off-policy

Token-Level Off-Policy Learning for Faithful Generation Under Distribution Shift

Hugging Face Daily Papers · 2d ago

The paper proposes Token-Level Off-Policy Labeling (TOPL), an off-policy training paradigm for faithful generation that reframes post-training as token-level correctness prediction. It reports strong out-of-distribution generalization on summarization and machine translation tasks.

Trust Region On-Policy Distillation

Hugging Face Daily Papers · 2026-05-31

The paper proposes Trust Region On-Policy Distillation (TrOPD) to stabilize on-policy distillation of large language models by using trust regions, outlier estimation, and off-policy guidance, outperforming existing methods on reasoning and code generation benchmarks.

Behavior-Aware Auxiliary Corrections for Off-Policy Temporal-Difference Prediction

arXiv cs.AI · 2026-05-29

This paper proposes behavior-aware auxiliary corrections for off-policy temporal-difference prediction, introducing BA-TDC and BA-TDRC algorithms that replace the auxiliary covariance matrix with the behavior Bellman matrix to improve stability and convergence. Theoretical analysis and experiments on standard benchmarks validate the effectiveness of the proposed methods.
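For orientation, the baseline that these variants modify can be sketched as follows. This is a minimal sketch of the standard linear TDC update as given in Sutton and Barto's textbook, not the paper's BA-TDC or BA-TDRC. The function name `tdc_update`, the helper `dot`, and the default step sizes are illustrative assumptions. The auxiliary weights `w` are the part whose fixed point depends on the feature covariance matrix, which is presumably the term the paper replaces.

```python
from typing import List, Tuple


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def tdc_update(
    theta: List[float],      # primary value-function weights
    w: List[float],          # auxiliary correction weights
    phi: List[float],        # features of the current state
    phi_next: List[float],   # features of the next state
    reward: float,
    rho: float,              # importance ratio pi(a|s) / b(a|s)
    gamma: float = 0.9,      # discount factor (illustrative default)
    alpha: float = 0.1,      # primary step size (illustrative default)
    beta: float = 0.05,      # auxiliary step size (illustrative default)
) -> Tuple[List[float], List[float]]:
    """One step of standard linear TDC (baseline only, not BA-TDC)."""
    # TD error under the current primary weights.
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)
    # Auxiliary estimate of the expected TD error at the current state.
    correction = dot(phi, w)
    # Primary update: TD(0) term minus the gradient-correction term.
    new_theta = [
        t + alpha * rho * (delta * p - gamma * correction * pn)
        for t, p, pn in zip(theta, phi, phi_next)
    ]
    # Auxiliary update: regress phi . w toward the TD error.
    new_w = [
        wi + beta * rho * (delta - correction) * p
        for wi, p in zip(w, phi)
    ]
    return new_theta, new_w


theta, w = tdc_update(
    theta=[0.0, 0.0], w=[0.0, 0.0],
    phi=[1.0, 0.0], phi_next=[0.0, 1.0],
    reward=1.0, rho=1.0,
)
```

With `w` initialized to zero, the correction term vanishes and the first step coincides with an ordinary importance-weighted TD(0) update.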

Trust Region Q Adjoint Matching

Hugging Face Daily Papers · 2026-05-26

Trust Region Q-Adjoint Matching (TRQAM) addresses instability in off-policy reinforcement learning by adaptively controlling the path-space KL divergence through projected dual descent, enabling stable fine-tuning of pretrained flow policies. The method consistently outperforms prior methods on 50 OGBench tasks, achieving a 68% success rate in offline RL compared with 46% for the strongest baseline.

$f$-Trajectory Balance: A Loss Family for Tuning GFlowNets, Generative Models, and LLMs with Off- and On-Policy Data

arXiv cs.LG · 2026-05-18

This paper introduces a family of loss functions, derived from $f$-divergences, for training generative models such as GFlowNets and LLMs. The losses are valid off-policy while matching the on-policy gradients of the corresponding $f$-divergence. Applications include molecule discovery and asynchronous LLM tuning.

Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

Hugging Face Daily Papers · 2026-05-12

This paper addresses the missing-old-logits problem in asynchronous reinforcement learning for LLMs, proposing exact and approximate off-policy correction methods to improve training stability and performance.
