Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

Hugging Face Daily Papers 05/12/26, 12:00 AM Papers

Summary

This paper addresses the missing old logits problem in asynchronous reinforcement learning for LLMs, proposing exact and approximate correction methods to improve training stability and performance.

Asynchronous reinforcement learning improves rollout throughput for large language model agents by decoupling sample generation from policy optimization, but it also introduces a critical failure mode for PPO-style off-policy correction. In heterogeneous training systems, the total importance ratio should ideally be decomposed into two semantically distinct factors: a training-inference discrepancy term that aligns inference-side and training-side distributions at the same behavior-policy version, and a policy-staleness term that constrains the update from the historical policy to the current policy. We show that practical asynchronous pipelines with delayed updates and partial rollouts often lose the required historical training-side logits, or old logits. This missing-old-logit problem entangles discrepancy repair with staleness correction, breaks the intended semantics of decoupled correction, and makes clipping and masking thresholds interact undesirably. To address this issue, we study both exact and approximate correction routes. We propose three exact old-logit acquisition strategies: snapshot-based version tracking, a dedicated old-logit model, and synchronization via partial rollout interruption, and compare their system trade-offs. From the perspective of approximate correction, we focus on preserving the benefits of decoupled correction through a more appropriate approximate policy when exact old logits cannot be recovered at low cost, without incurring extra system overhead. Following this analysis, we adopt a revised PPO-EWMA method, which achieves significant gains in both training speed and optimization performance. Code at https://github.com/millioniron/ROLL.
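The two-factor decomposition described above can be illustrated with a minimal per-token sketch. It is not taken from the paper or its repository. The function names, the truncation cap `disc_cap`, and the clip range `clip_eps` are hypothetical, and the choice to truncate the discrepancy term while PPO-clipping the staleness term is an assumption about one common way to apply decoupled correction. The second function shows what is left when the old log-probability is missing: a single ratio from the inference-side policy to the current policy, so that one clip threshold has to absorb both effects.

```python
import math


def decoupled_token_loss(logp_infer, logp_old, logp_new, advantage,
                         clip_eps=0.2, disc_cap=2.0):
    """Per-token surrogate loss with the importance ratio split in two.

    logp_infer: log-prob of the sampled token from the inference engine
    logp_old:   log-prob from the training-side model at the same
                behavior-policy version (the "old logits")
    logp_new:   log-prob under the current policy being optimized
    All thresholds are illustrative assumptions.
    """
    # Training-inference discrepancy term: pi_old / mu, truncated from above.
    discrepancy = min(math.exp(logp_old - logp_infer), disc_cap)
    # Policy-staleness term: pi_theta / pi_old, clipped PPO-style.
    staleness = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(staleness, 1.0 + clip_eps))
    surrogate = min(staleness * advantage, clipped * advantage)
    return -discrepancy * surrogate


def entangled_token_loss(logp_infer, logp_new, advantage, clip_eps=0.2):
    """Fallback without old logits: one ratio pi_theta / mu is clipped,
    so discrepancy and staleness share a single threshold."""
    ratio = math.exp(logp_new - logp_infer)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return -min(ratio * advantage, clipped * advantage)


# The policy has not moved (new == old), but the inference engine assigned
# a lower probability to the token than the training-side model did.
mu, old, new = math.log(0.4), math.log(0.5), math.log(0.5)
print(decoupled_token_loss(mu, old, new, advantage=1.0))
print(entangled_token_loss(mu, new, advantage=1.0))
```

In this toy case the policy update itself is zero, yet the entangled form already hits the clip boundary because the training-inference gap alone uses up the clip budget. The decoupled form keeps the staleness ratio at 1 and handles the gap as a separate weight.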

Cached at: 05/13/26, 08:11 AM

Source: https://huggingface.co/papers/2605.12070

Abstract

Asynchronous reinforcement learning in large language models faces challenges with PPO-style corrections due to delayed updates and missing historical logits, which are addressed through exact and approximate correction methods including snapshot tracking and revised PPO-EWMA techniques.
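This page does not specify how the revised PPO-EWMA method works, so the following is only a generic sketch of the idea the name suggests: keeping an exponentially weighted moving average of past policy parameters that could stand in as an approximate old policy when exact old logits are unavailable. The function `ewma_update`, the decay `beta`, and the dictionary-of-lists parameter layout are all hypothetical and are not drawn from the paper or its code.

```python
def ewma_update(avg_params, new_params, beta=0.9):
    """Return beta * avg + (1 - beta) * new for every named parameter.

    avg_params, new_params: dicts mapping a parameter name to a list of
    floats (a stand-in for real tensors). beta is an assumed decay rate.
    """
    return {
        name: [beta * a + (1.0 - beta) * p
               for a, p in zip(avg_params[name], new_params[name])]
        for name in new_params
    }


# Toy usage: the averaged parameters trail the live policy parameters.
avg = {"w": [0.0, 0.0]}
for live in ({"w": [1.0, 2.0]}, {"w": [1.0, 2.0]}):
    avg = ewma_update(avg, live, beta=0.5)
print(avg)
```

The averaged copy needs no per-version snapshot bookkeeping, which is one plausible reading of the low-overhead goal in the abstract. Whether the paper's revision averages parameters in this way is an assumption here.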

Get this paper in your agent:

hf papers read 2605.12070

Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

Diagnosing Training Inference Mismatch in LLM Reinforcement Learning

Z.ai's Stable Asynchronous RL (13 minute read)

When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL

vLLM V0 to V1: Correctness Before Corrections in RL

Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
