Tag
This paper addresses the missing-old-logits problem in asynchronous reinforcement learning for LLMs and proposes exact and approximate correction methods that improve training stability and performance.