Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

Hugging Face Daily Papers

Summary

This paper addresses the missing old logits problem in asynchronous reinforcement learning for LLMs, proposing exact and approximate correction methods to improve training stability and performance.

Asynchronous reinforcement learning improves rollout throughput for large language model agents by decoupling sample generation from policy optimization, but it also introduces a critical failure mode for PPO-style off-policy correction. In heterogeneous training systems, the total importance ratio should ideally be decomposed into two semantically distinct factors: a training-inference discrepancy term that aligns inference-side and training-side distributions at the same behavior-policy version, and a policy-staleness term that constrains the update from the historical policy to the current policy. We show that practical asynchronous pipelines with delayed updates and partial rollouts often lose the required historical training-side logits, or old logits. This missing-old-logit problem entangles discrepancy repair with staleness correction, breaks the intended semantics of decoupled correction, and makes clipping and masking thresholds interact undesirably. To address this issue, we study both exact and approximate correction routes. We propose three exact old-logit acquisition strategies: snapshot-based version tracking, a dedicated old-logit model, and synchronization via partial rollout interruption, and compare their system trade-offs. From the perspective of approximate correction, we focus on preserving the benefits of decoupled correction through a more appropriate approximate policy when exact old logits cannot be recovered at low cost, without incurring extra system overhead. Following this analysis, we adopt a revised PPO-EWMA method, which achieves significant gains in both training speed and optimization performance. Code at https://github.com/millioniron/ROLL.
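To make the two-factor decomposition concrete, here is a minimal Python sketch, not the paper's implementation. It assumes per-token log-probabilities of the sampled tokens are available from three sources. The function names, the argument names, and the threshold values `mask_max` and `clip_eps` are all hypothetical, and the weighting rule is a simplification of a full PPO surrogate.

```python
import math

def decoupled_ratios(logp_infer_old, logp_train_old, logp_train_new):
    """Split the total importance ratio into two per-token factors.

    Hypothetical inputs (log-probs of the sampled tokens):
      logp_infer_old: inference engine, behavior-policy version
      logp_train_old: training engine, same version (the "old logits")
      logp_train_new: training engine, current policy
    """
    # Training-inference discrepancy: same policy version, two engines.
    discrepancy = [math.exp(t - i)
                   for i, t in zip(logp_infer_old, logp_train_old)]
    # Policy staleness: historical policy versus current policy.
    staleness = [math.exp(n - t)
                 for t, n in zip(logp_train_old, logp_train_new)]
    return discrepancy, staleness

def token_weights(discrepancy, staleness, mask_max=2.0, clip_eps=0.2):
    """Apply each threshold to its own factor (illustrative values only)."""
    weights = []
    for d, s in zip(discrepancy, staleness):
        # Mask tokens whose engine mismatch is too large.
        if d > mask_max or d < 1.0 / mask_max:
            weights.append(0.0)
            continue
        # Clip only the staleness factor, as in PPO-style clipping.
        s_clipped = min(max(s, 1.0 - clip_eps), 1.0 + clip_eps)
        weights.append(d * s_clipped)
    return weights
```

The product of the two factors equals the total ratio between the current training-side policy and the inference-side behavior policy. If `logp_train_old` is unavailable, only that product can be computed, so a single threshold ends up acting on discrepancy and staleness at once. This is the entanglement the abstract describes.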

Cached at: 05/13/26, 08:11 AM


Source: https://huggingface.co/papers/2605.12070

Abstract

Asynchronous reinforcement learning in large language models faces challenges with PPO-style corrections due to delayed updates and missing historical logits, which are addressed through exact and approximate correction methods including snapshot tracking and revised PPO-EWMA techniques.

Asynchronous reinforcement learning improves rollout throughput for large language model agents by decoupling sample generation from policy optimization, but it also introduces a critical failure mode for PPO-style off-policy correction. In heterogeneous training systems, the total importance ratio should ideally be decomposed into two semantically distinct factors: a training-inference discrepancy term that aligns inference-side and training-side distributions at the same behavior-policy version, and a policy-staleness term that constrains the update from the historical policy to the current policy. We show that practical asynchronous pipelines with delayed updates and partial rollouts often lose the required historical training-side logits, or old logits. This missing-old-logit problem entangles discrepancy repair with staleness correction, breaks the intended semantics of decoupled correction, and makes clipping and masking thresholds interact undesirably. To address this issue, we study both exact and approximate correction routes. We propose three exact old-logit acquisition strategies: snapshot-based version tracking, a dedicated old-logit model, and synchronization via partial rollout interruption, and compare their system trade-offs. From the perspective of approximate correction, we focus on preserving the benefits of decoupled correction through a more appropriate approximate policy when exact old logits cannot be recovered at low cost, without incurring extra system overhead. Following this analysis, we adopt a revised PPO-EWMA method, which achieves significant gains in both training speed and optimization performance. Code at https://github.com/millioniron/ROLL.
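The abstract does not spell out how the revised PPO-EWMA method works. As a rough illustration of the general idea behind EWMA-based approximate policies, the sketch below keeps an exponentially weighted moving average of policy parameters that could stand in for the missing historical policy. The function name `ewma_update`, the dictionary-of-lists parameter layout, and the decay value `beta` are assumptions made for this example and are not taken from the paper or its code.

```python
def ewma_update(avg_params, new_params, beta=0.9):
    """Return an updated moving average of the policy parameters.

    Hypothetical layout: each argument maps a parameter name to a flat
    list of floats. The inputs are left unmodified.
    """
    return {
        name: [beta * a + (1.0 - beta) * n
               for a, n in zip(avg_params[name], new_params[name])]
        for name in avg_params
    }

# Usage: fold the latest policy weights into the running average after
# each optimizer step (toy values).
avg = {"w": [0.0, 0.0]}
latest = {"w": [1.0, 2.0]}
avg = ewma_update(avg, latest, beta=0.5)
```

An averaged policy of this kind can be evaluated on the training side at update time, so no per-version logits need to be stored. How closely it tracks the true historical policy depends on `beta` and on how stale the rollouts are.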


Get this paper in your agent:

hf papers read 2605.12070

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL

arXiv cs.LG

This paper frames LLM-generated reward shaping for sparse structured RL as a debugging problem, identifying failure modes like reward flooding and semantic misunderstanding. The authors propose diagnostic-driven iterative refinement, achieving dramatic success rate improvements (e.g., DoorKey-8×8 from 2.3% to 97.6%) compared to one-shot generation.

vLLM V0 to V1: Correctness Before Corrections in RL

Hugging Face Blog

ServiceNow engineers detail their migration from vLLM V0 to V1, focusing on resolving backend correctness issues like logprob semantics and runtime defaults to ensure stable reinforcement learning training dynamics.

Agentic RL: Token-In, Token-Out Done Right (16 minute read)

TLDR AI

This article explains the 'Token-In, Token-Out' (TITO) invariant in reinforcement learning for LLMs, highlighting a common error when training multi-turn agents with tool calls. It presents two solutions: using per-model renderers or designing training to avoid re-encoding decoded tokens, emphasizing prefix-preserving chat templates.