ppo-correction

Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

Hugging Face Daily Papers · 2026-05-12

This paper addresses the missing old logits problem in asynchronous reinforcement learning for LLMs, proposing exact and approximate correction methods to improve training stability and performance.
