Tag
The paper proposes Trust Region On-Policy Distillation (TrOPD) to stabilize on-policy distillation of large language models by using trust regions, outlier estimation, and off-policy guidance, outperforming existing methods on reasoning and code generation benchmarks.
This paper proposes behavior-aware auxiliary corrections for off-policy temporal-difference prediction, introducing BA-TDC and BA-TDRC algorithms that replace the auxiliary covariance matrix with the behavior Bellman matrix to improve stability and convergence. Theoretical analysis and experiments on standard benchmarks validate the effectiveness of the proposed methods.
This paper introduces a family of loss functions derived from f-divergences for training generative models like GFlowNets and LLMs, which are valid off-policy while matching on-policy gradients of the corresponding f-divergence. Applications include molecule discovery and asynchronous LLM tuning.
This paper addresses the missing old logits problem in asynchronous reinforcement learning for LLMs, proposing exact and approximate correction methods to improve training stability and performance.