Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction
Summary
This paper addresses the missing old logits problem in asynchronous reinforcement learning for LLMs, proposing exact and approximate correction methods to improve training stability and performance.
Cached at: 05/13/26, 08:11 AM
Source: https://huggingface.co/papers/2605.12070
Abstract
Asynchronous reinforcement learning in large language models faces challenges with PPO-style corrections due to delayed updates and missing historical logits, which are addressed through exact and approximate correction methods including snapshot tracking and revised PPO-EWMA techniques.
Asynchronous reinforcement learning improves rollout throughput for large language model agents by decoupling sample generation from policy optimization, but it also introduces a critical failure mode for PPO-style off-policy correction. In heterogeneous training systems, the total importance ratio should ideally be decomposed into two semantically distinct factors: a training-inference discrepancy term that aligns inference-side and training-side distributions at the same behavior-policy version, and a policy-staleness term that constrains the update from the historical policy to the current policy. We show that practical asynchronous pipelines with delayed updates and partial rollouts often lose the required historical training-side logits, or old logits. This missing-old-logit problem entangles discrepancy repair with staleness correction, breaks the intended semantics of decoupled correction, and makes clipping and masking thresholds interact undesirably. To address this issue, we study both exact and approximate correction routes. We propose three exact old-logit acquisition strategies: snapshot-based version tracking, a dedicated old-logit model, and synchronization via partial rollout interruption, and compare their system trade-offs. From the perspective of approximate correction, we focus on preserving the benefits of decoupled correction through a more appropriate approximate policy when exact old logits cannot be recovered at low cost, without incurring extra system overhead. Following this analysis, we adopt a revised PPO-EWMA method, which achieves significant gains in both training speed and optimization performance. Code at https://github.com/millioniron/ROLL.
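The two-factor decomposition described in the abstract can be illustrated with a small per-token sketch. This is not the paper's implementation. The function names, the `clip_eps` and `mask_max` thresholds, and the toy probabilities are all assumptions made for illustration, and the surrogate shown is a generic decoupled PPO-style form (mask on the discrepancy term, clip on the staleness term) rather than the authors' revised PPO-EWMA.

```python
import math

def decoupled_ratios(logp_infer_old, logp_train_old, logp_train_new):
    """Split the total importance ratio into two per-token factors."""
    # Training-inference discrepancy: same behavior-policy version,
    # training-side probability over inference-side probability.
    discrepancy = [math.exp(t - i) for t, i in zip(logp_train_old, logp_infer_old)]
    # Policy staleness: current policy over the historical training-side policy.
    # This factor needs the "old logits" (here, logp_train_old).
    staleness = [math.exp(n - t) for n, t in zip(logp_train_new, logp_train_old)]
    return discrepancy, staleness

def decoupled_surrogate(logp_infer_old, logp_train_old, logp_train_new,
                        advantages, clip_eps=0.2, mask_max=2.0):
    """Hypothetical decoupled objective: mask on discrepancy, clip on staleness."""
    discrepancy, staleness = decoupled_ratios(
        logp_infer_old, logp_train_old, logp_train_new)
    out = []
    for d, s, a in zip(discrepancy, staleness, advantages):
        # Masking threshold acts only on the discrepancy term.
        if d > mask_max or d < 1.0 / mask_max:
            out.append(0.0)
            continue
        # PPO clipping acts only on the staleness term.
        s_clipped = min(max(s, 1.0 - clip_eps), 1.0 + clip_eps)
        out.append(d * min(s * a, s_clipped * a))
    return out

def entangled_ratio(logp_infer_old, logp_train_new):
    """What remains when old training-side logits are missing: one mixed ratio."""
    return [math.exp(n - i) for n, i in zip(logp_train_new, logp_infer_old)]

# Toy per-token probabilities (assumed values, two tokens).
logp_infer_old = [math.log(0.5), math.log(0.2)]   # inference engine, old version
logp_train_old = [math.log(0.4), math.log(0.25)]  # trainer, same old version
logp_train_new = [math.log(0.6), math.log(0.1)]   # trainer, current version
advantages = [1.0, -1.0]

discrepancy, staleness = decoupled_ratios(
    logp_infer_old, logp_train_old, logp_train_new)
total = entangled_ratio(logp_infer_old, logp_train_new)
objective = decoupled_surrogate(
    logp_infer_old, logp_train_old, logp_train_new, advantages)
```

In this sketch the product of the two factors equals the total ratio for every token, but `entangled_ratio` alone cannot tell how much of the deviation comes from the engine mismatch and how much from policy drift. A single clipping or masking threshold applied to it therefore constrains both effects at once, which is one way to read the undesirable threshold interaction the abstract mentions.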
Similar Articles
Diagnosing Training Inference Mismatch in LLM Reinforcement Learning
This paper diagnoses Training-Inference Mismatch (TIM) in LLM reinforcement learning, showing that small numerical disagreements between training and inference token probabilities can cause training collapse, and proposes remedies.
When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL
This paper frames LLM-generated reward shaping for sparse structured RL as a debugging problem, identifying failure modes like reward flooding and semantic misunderstanding. The authors propose diagnostic-driven iterative refinement, achieving dramatic success rate improvements (e.g., DoorKey-8×8 from 2.3% to 97.6%) compared to one-shot generation.
vLLM V0 to V1: Correctness Before Corrections in RL
ServiceNow engineers detail their migration from vLLM V0 to V1, focusing on resolving backend correctness issues like logprob semantics and runtime defaults to ensure stable reinforcement learning training dynamics.
Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
Proposes Correction-Oriented Policy Optimization (CIPO), an extension to RLVR that converts failed trajectories into correction-oriented supervision, improving reasoning and correction performance in LLMs across math and code benchmarks.
Agentic RL: Token-In, Token-Out Done Right (16 minute read)
This article explains the 'Token-In, Token-Out' (TITO) invariant in reinforcement learning for LLMs, highlighting a common error when training multi-turn agents with tool calls. It presents two solutions: using per-model renderers or designing training to avoid re-encoding decoded tokens, emphasizing prefix-preserving chat templates.