Diagnosing Training Inference Mismatch in LLM Reinforcement Learning

arXiv cs.LG Papers

Summary

This paper diagnoses Training-Inference Mismatch (TIM) in LLM reinforcement learning, showing that small numerical disagreements between training and inference token probabilities can cause training collapse, and proposes remedies.

arXiv:2605.14220v1 Announce Type: new Abstract: Modern LLM RL systems separate rollout generation from policy optimization. These two stages are expected to produce token probabilities that match exactly. However, implementation differences can make them assign different values to the same sequence under the same model weights, inducing Training-Inference Mismatch (TIM). TIM is difficult to inspect because it is entangled with off-policy drift and common stabilization mechanisms. In this work, we isolate TIM in a zero-mismatch diagnostic setting (VeXact), and show that small token-level numerical disagreements can independently cause training collapse. We further show that TIM changes the effective optimization problem, and identify a set of remedies that could mitigate TIM. Our results suggest that TIM is not benign numerical noise, but a systems-level perturbation that should be treated as a first-order factor in analyzing LLM RL stability.

Cached at: 05/15/26, 06:27 AM

# Diagnosing Training Inference Mismatch in LLM Reinforcement Learning
Source: [https://arxiv.org/html/2605.14220](https://arxiv.org/html/2605.14220)

Tianle Zhong (1,2,∗), Neiwen Ling (1,∗), Yifan Pi (1), Zijun Wei (1), Tianshu Yu (1), Geoffrey Fox (2), Peng Wu (1,†), Xiao Yu (1,†)

(1) ByteDance; (2) The University of Virginia. ∗ Equal contribution. † Corresponding authors.


## 1 Introduction

Large language model (LLM) reinforcement learning (RL) has become a central paradigm for post-training foundation models and a key driver of recent advances in complex reasoning capabilities (Schulman et al., [2017](https://arxiv.org/html/2605.14220#bib.bib24); Ouyang et al., [2022](https://arxiv.org/html/2605.14220#bib.bib16); Ziegler et al., [2019](https://arxiv.org/html/2605.14220#bib.bib17); Shao et al., [2024](https://arxiv.org/html/2605.14220#bib.bib25); Stiennon et al., [2020](https://arxiv.org/html/2605.14220#bib.bib18); Liu et al., [2025b](https://arxiv.org/html/2605.14220#bib.bib54)). However, RL training remains difficult to stabilize in practice: policies may rapidly degrade, causing reward signals to drop over short training windows.

Understanding what causes these collapses is therefore essential for building reliable LLM RL training systems (Sheng et al., [2025b](https://arxiv.org/html/2605.14220#bib.bib46); Hu et al., [2024](https://arxiv.org/html/2605.14220#bib.bib45); Fu et al., [2025b](https://arxiv.org/html/2605.14220#bib.bib49); Team, [2025](https://arxiv.org/html/2605.14220#bib.bib47); Cao et al., [2025](https://arxiv.org/html/2605.14220#bib.bib48); Sheng et al., [2025a](https://arxiv.org/html/2605.14220#bib.bib50); MiniMax et al., [2025](https://arxiv.org/html/2605.14220#bib.bib35)). However, diagnosing the root cause is difficult because many failure modes are deeply entangled and arise at different levels of the training stack. A collapse may be caused by many factors, such as poorly tuned hyperparameters, reward misspecification, and reward hacking (Fu et al., [2025a](https://arxiv.org/html/2605.14220#bib.bib52); Pan et al., [2024](https://arxiv.org/html/2605.14220#bib.bib53)). Among these factors, Training-Inference Mismatch (TIM) is an infrastructure-level confounder: implementation differences between training and inference engines can cause divergent token probabilities even for the same input and model weights.

In response to these stability challenges, the community has developed a range of training-level stabilization techniques, including importance sampling, rejection sampling, and other forms of conservative policy updates (Schulman et al., [2017](https://arxiv.org/html/2605.14220#bib.bib24); Yao et al., [2025](https://arxiv.org/html/2605.14220#bib.bib29); Li et al., [2026](https://arxiv.org/html/2605.14220#bib.bib33); Team et al., [2025](https://arxiv.org/html/2605.14220#bib.bib1); Ring AI Team, [2025](https://arxiv.org/html/2605.14220#bib.bib21); Zheng et al., [2025a](https://arxiv.org/html/2605.14220#bib.bib19); Liu et al., [2025a](https://arxiv.org/html/2605.14220#bib.bib20)). Although effective in some settings, their connection to specific failure mechanisms remains unclear: the same technique may correct PPO mini-step off-policy drift, suppress TIM-induced numerical outliers, or introduce additional optimization bias. Without a TIM-free diagnostic baseline to isolate these effects, practitioners must tune interventions and filtering thresholds by trial and error rather than by causal diagnosis.

In this paper, we aim to systematically understand the impact of TIM on LLM RL stability\. Specifically, we aim to answer two key questions: First, does TIM contribute to RL training instability, and if so, to what extent? Second, how do common stabilization techniques interact with TIM, what aspects of the mismatch do they mitigate, and what optimization side effects do they introduce?

To answer these questions, we develop VeXact, a lightweight rollout engine that achieves zero mismatch with the FSDP (Zhao et al., [2023](https://arxiv.org/html/2605.14220#bib.bib38); Rajbhandari et al., [2020](https://arxiv.org/html/2605.14220#bib.bib39)) engine on top of VeRL (Sheng et al., [2025b](https://arxiv.org/html/2605.14220#bib.bib46)). (Footnote 1: The source code of VeXact is openly available at [https://github.com/verl-project/vexact](https://github.com/verl-project/vexact).) VeXact eliminates TIM by unifying kernel and model implementations with the FSDP training engine, and by employing batch-invariant kernels (He, [2025](https://arxiv.org/html/2605.14220#bib.bib3)) (§[3.1](https://arxiv.org/html/2605.14220#S3.SS1)). Using VeXact, we conduct fine-grained diagnostic studies of LLM RL stability. Concretely, our contributions are:

Isolating TIM impact for LLM RL. Using our TIM-free baseline, we identify TIM alone as a significant factor in triggering RL training collapse (§[3.2](https://arxiv.org/html/2605.14220#S3.SS2)).

Analyzing failure modes of TIM-induced RL training collapse. We then conduct ablation studies on TIM's role in RL training collapse under general setups. Specifically, we analyze why RL training collapses under both trainer-side log-probability recomputation and rollout-side log-probability bypass. We find that TIM fundamentally changes the optimization objective, thereby inducing distinct failure modes (§[4.1](https://arxiv.org/html/2605.14220#S4.SS1)).

Ablating effectiveness of algorithmic TIM compensation. Furthermore, we evaluate whether common stabilization techniques can effectively mitigate TIM, including truncated importance sampling (TIS) (Yao et al., [2025](https://arxiv.org/html/2605.14220#bib.bib29)) and rejection sampling (RS) (Li et al., [2026](https://arxiv.org/html/2605.14220#bib.bib33)) (§[4.2](https://arxiv.org/html/2605.14220#S4.SS2)). Based on our ablation study, we identify an effective combination of existing algorithmic TIM compensations that can closely track our TIM-free baseline.

## 2 Training-Inference Mismatch in LLM RL

Due to implementation differences between the training engines (FSDP (Zhao et al., [2023](https://arxiv.org/html/2605.14220#bib.bib38); Rajbhandari et al., [2020](https://arxiv.org/html/2605.14220#bib.bib39)), Megatron (Shoeybi et al., [2020](https://arxiv.org/html/2605.14220#bib.bib40)), etc.) and the inference engines (vLLM (Kwon et al., [2023](https://arxiv.org/html/2605.14220#bib.bib11)), SGLang (Zheng et al., [2024](https://arxiv.org/html/2605.14220#bib.bib41)), etc.), including divergent model and kernel implementations, the next-token probability distribution over the vocabulary can differ even with the exact same model checkpoint and inputs. This introduces an unintended off-policy bias between sampling and the model update. Unlike the off-policy bias introduced by PPO mini-steps (Schulman et al., [2017](https://arxiv.org/html/2605.14220#bib.bib24)), the TIM off-policy bias is infrastructure-level noise, which cannot be addressed by naive PPO clipping methods (discussed in §[4](https://arxiv.org/html/2605.14220#S4)).

We formulate this issue in RL objectives as follows. Given a context $x$ and a sampled response $y=(a_{1},\ldots,a_{T})$, let $s_{t}=(x,y_{<t})$. We distinguish three token-level distributions: $\pi_{\theta}(a_{t}|s_{t})$ denotes the current policy being optimized; $\pi^{\mathrm{rollout}}_{old}(a_{t}|s_{t})$ denotes the behavioral distribution realized by the rollout engine when the token is sampled; and $\pi^{\mathrm{train}}_{old}(a_{t}|s_{t})$ denotes the trainer-side reference distribution used when an algorithm requires an old-policy probability.

In an exact on\-policy implementation, the probability assigned to each sampled token should be consistent between rollout and training\. TIM occurs when the rollout execution path and the trainer execution path assign different probabilities to the same token under the same model weights and sampled sequence\. At the token level, this discrepancy can be written as

$$\delta_{t}=\log\pi^{\mathrm{train}}_{old}(a_{t}|s_{t})-\log\pi^{\mathrm{rollout}}_{old}(a_{t}|s_{t}). \tag{1}$$

This definition is objective-agnostic: the mismatch exists before choosing whether the update is implemented with REINFORCE (Williams, [1992](https://arxiv.org/html/2605.14220#bib.bib36); Hu et al., [2025](https://arxiv.org/html/2605.14220#bib.bib34)), PPO (Schulman et al., [2017](https://arxiv.org/html/2605.14220#bib.bib24)), or GRPO (Shao et al., [2024](https://arxiv.org/html/2605.14220#bib.bib25); Yu et al., [2025](https://arxiv.org/html/2605.14220#bib.bib27)).

Table [1](https://arxiv.org/html/2605.14220#S2.T1) shows several sampled tokens from the same context and model checkpoint. For each token, we compare the log-probability produced by the rollout engine with the one re-evaluated by the training engine. Ideally, these two values should be identical for an on-policy update. However, we observe clear discrepancies in token log-probabilities between the two sides, including cases where the top-1 token choice at a given position is flipped. As Figure [1](https://arxiv.org/html/2605.14220#S2.F1) shows, although the average difference in token-level probabilities is small per training batch, the maximum difference can reach 1.0 for some extreme tokens, which is generally observable whenever TIM exists.
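As a minimal sketch of Eq. 1, the snippet below computes $\delta_t$ from two lists of per-token log-probabilities, using the eight values reported in Table 1. The variable names are illustrative and are not taken from the VeXact code base.

```python
# Per-token log-probabilities copied from Table 1 (Qwen3-8B, bf16).
tokens = ["The", "problem", "states", "that", "there", "exist", "real", "numbers"]
logp_rollout = [-0.279, -0.063, -0.314, -0.694, -0.000, -0.030, -0.000, -0.000]
logp_train = [-0.278, -0.063, -0.314, -0.827, -0.000, -0.038, -0.000, -0.000]

# Eq. 1: delta_t = log pi_train - log pi_rollout, evaluated on the sampled tokens.
delta = [lt - lr for lt, lr in zip(logp_train, logp_rollout)]

# Per-sequence summaries analogous to the per-batch statistics in Figure 1.
abs_delta = [abs(d) for d in delta]
mean_abs = sum(abs_delta) / len(abs_delta)
max_abs = max(abs_delta)
worst_token = tokens[abs_delta.index(max_abs)]

print(f"mean |delta| = {mean_abs:.5f}")
print(f"max  |delta| = {max_abs:.3f} at token {worst_token!r}")
```

The mean stays small because most positions agree to three decimals, while a single token carries almost all of the mismatch. This is the same pattern that the mean and max curves in Figure 1 summarize at batch scale.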

| Token | The | problem | states | that† | there | exist | real | numbers |
|---|---|---|---|---|---|---|---|---|
| $\log\pi_{\text{rollout}}$ | -0.279 | -0.063 | -0.314 | -0.694 | -0.000 | -0.030 | -0.000 | -0.000 |
| $\log\pi_{\text{train}}$ | -0.278 | -0.063 | -0.314 | -0.827 | -0.000 | -0.038 | -0.000 | -0.000 |
| $\delta_{t}$ | 0.001 | 0.000 | 0.000 | -0.133 | 0.000 | -0.008 | 0.000 | 0.000 |

Table 1: Token-level kernel-numerical drift between the rollout and training stacks on the same Qwen3-8B (bf16) weights, traced along one sentence from a greedy sampled response on an AIME-2024 problem. $\delta_{t}=\log\pi_{\text{train}}-\log\pi_{\text{rollout}}$. Most positions are bit-close, but † marks an argmax flip: at "that", $\pi_{\text{train}}$'s top-1 token is actually the punctuation + newline string `":\n\n"` (log prob -0.577), so the training side would have ended the clause as "The problem states:\n\n".

![Refer to caption](https://arxiv.org/html/2605.14220v1/x1.png) (a) $|\delta_{t}|$ (max)
![Refer to caption](https://arxiv.org/html/2605.14220v1/x2.png) (b) $|\delta_{t}|$ (mean)

Figure 1: Per-batch maximum and mean of $|\delta_{t}|$ in the Qwen3-1.7B GRPO experiment (detailed configuration in Appendix [A.1](https://arxiv.org/html/2605.14220#A1.SS1)). While the mean of $|\delta_{t}|$ is small, some extreme tokens have $|\delta_{t}|$ near 1.0.

## 3 Isolating TIM with VeXact

In this section, we isolate the impact of TIM on RL training stability. This requires two key ingredients: (1) a TIM-free rollout implementation, VeXact, which serves as a diagnostic baseline that removes infrastructure-induced mismatch from the RL loop; and (2) an evaluation of this baseline under REINFORCE, which avoids PPO ratio clipping that may mask or distort TIM-induced changes in the loss and gradient signals.

### 3.1 VeXact: A Zero-mismatch Rollout Engine

For the TIM-free baseline, we introduce VeXact, a lightweight rollout engine whose rollout token log-probabilities can achieve bit-wise alignment with the FSDP engine.

TIM comes from two sources: (1) Differences in model and kernel implementations between the inference and training engines. Although semantically and mathematically equivalent, the two often make different decisions about implementation details. For example, inference engines prefer inference-optimized kernel libraries like FlashInfer, which are not applicable in training engines. (2) Variations in kernel reduction order and tiling. Even when the same kernel implementation is used, performance-oriented optimizations such as atomic additions can introduce non-determinism, causing the kernel to produce different outputs for identical inputs. Moreover, even a deterministic kernel may exhibit batch-dependent numerical behavior: changes in batch size can trigger different launch-grid configurations through auto-tuning, thereby altering GPU tiling strategies and reduction orders. Since floating-point accumulation is non-associative under finite precision, these changes in execution order can ultimately lead to numerically different results.
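The non-associativity argument can be reproduced in a few lines. The sketch below is plain Python on float64 values chosen by hand to make the effect visible. It does not model any real GPU kernel, and `tiled_sum` is a hypothetical stand-in for a reduction whose tile size changes with the launch configuration.

```python
def sequential_sum(values):
    """Accumulate left to right with ordinary floating-point addition."""
    total = 0.0
    for v in values:
        total += v
    return total


def tiled_sum(values, tile):
    """Reduce each tile separately, then reduce the per-tile partial sums."""
    partials = [
        sequential_sum(values[i:i + tile]) for i in range(0, len(values), tile)
    ]
    return sequential_sum(partials)


# Hand-picked inputs: the 1.0 terms are below the rounding granularity of 1e16.
values = [1e16, 1.0, -1e16, 1.0]

print(sequential_sum(values))   # 1.0
print(tiled_sum(values, 2))     # 0.0
print(tiled_sum(values, 4))     # 1.0
```

The inputs and the arithmetic are identical in all three calls; only the grouping changes. A batch-invariant kernel, as described in the text, removes this degree of freedom by fixing the tiling and reduction order.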

Hence, VeXact addresses these two sources of mismatch by (1) using the same HuggingFace-based model implementation and registering the VeXact kernel implementation in the FSDP engine initialization, and (2) employing deterministic and batch-invariant kernels, which fix the tiling and reduction order in the GPU kernel implementation. Following the original batch-invariant kernel implementation (He, [2025](https://arxiv.org/html/2605.14220#bib.bib3)), VeXact additionally implements RMSNorm (Zhang and Sennrich, [2019](https://arxiv.org/html/2605.14220#bib.bib42)), batched matrix multiplication, and batch-invariant fused MoE kernels for efficient MoE training and inference. For the attention implementation (Dao et al., [2022](https://arxiv.org/html/2605.14220#bib.bib6)) specifically, we disable KV splitting (Dao et al., [2023](https://arxiv.org/html/2605.14220#bib.bib37)) to ensure determinism as well.

Meanwhile, since fixed tiling in batch-invariant kernels trades performance for numerical stability, VeXact retains reasonable throughput for practical RL training by integrating chunked prefill (Agrawal et al., [2023](https://arxiv.org/html/2605.14220#bib.bib12)), CUDAGraph (NVIDIA Corporation, [2025](https://arxiv.org/html/2605.14220#bib.bib13)), pipeline parallelism (Huang et al., [2019](https://arxiv.org/html/2605.14220#bib.bib14)), and optimistic KV allocation with preemption fallback. VeXact remains hackable and very lightweight; its line count (LOC) is similar to that of nano-vLLM (GeeeekExplorer, [2025](https://arxiv.org/html/2605.14220#bib.bib15)).

![Refer to caption](https://arxiv.org/html/2605.14220v1/x3.png) (a) MoE training reward
![Refer to caption](https://arxiv.org/html/2605.14220v1/x4.png) (b) MoE val-reward
![Refer to caption](https://arxiv.org/html/2605.14220v1/x5.png) (c) MoE $\delta_{t}$ (mean)
![Refer to caption](https://arxiv.org/html/2605.14220v1/x6.png) (d) MoE gradient norm
![Refer to caption](https://arxiv.org/html/2605.14220v1/x7.png) (e) Dense training reward
![Refer to caption](https://arxiv.org/html/2605.14220v1/x8.png) (f) Dense val-reward
![Refer to caption](https://arxiv.org/html/2605.14220v1/x9.png) (g) Dense $\delta_{t}$ (mean)
![Refer to caption](https://arxiv.org/html/2605.14220v1/x10.png) (h) Dense gradient norm

Figure 2: REINFORCE experiments comparing vLLM non-exact rollout with VeXact. Top row: Qwen3-30B-A3B MoE. Bottom row: Qwen3-1.7B dense. Each row reports training reward, AIME 2024 validation reward, $\delta_{t}$ (mean), and log-scale gradient norm. More experimental results in Appendix [A.2](https://arxiv.org/html/2605.14220#A1.SS2).

### 3.2 Exposing the Impact of TIM

To isolate how TIM impacts RL training stability from low-level numerical disagreement, we first study REINFORCE (Williams, [1992](https://arxiv.org/html/2605.14220#bib.bib36)) on-policy updates. Unlike PPO-style objectives, REINFORCE consumes each rollout batch in a single policy-gradient update. This makes it a cleaner diagnostic objective for attributing changes in loss, gradient, and reward to rollout-training probability disagreement.

We conduct REINFORCE experiments (with batch-whitened advantages) for both dense (Qwen3-1.7B) and MoE (Qwen3-30B-A3B) models (Yang and others, [2025](https://arxiv.org/html/2605.14220#bib.bib44)). Both settings compare a standard non-exact rollout engine (vLLM) against VeXact. The dense run is trained on Sanity-Test-R1D-1.5B (Qi et al., [2025](https://arxiv.org/html/2605.14220#bib.bib30)) and evaluated on AIME 2024 (Mathematical Association of America, [2024](https://arxiv.org/html/2605.14220#bib.bib43)) every 50 global steps. The MoE run is trained on the DAPO (Yu et al., [2025](https://arxiv.org/html/2605.14220#bib.bib27)) dataset and evaluated on AIME 2024 every 20 global steps. The full experimental configuration is summarized in Appendix Table [2](https://arxiv.org/html/2605.14220#A1.T2).
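For readers unfamiliar with this setup, here is a minimal sketch of a REINFORCE surrogate with batch-whitened advantages. It assumes that whitening means subtracting the batch mean reward and dividing by the batch standard deviation, a common convention that the text does not spell out. The function names, the `eps` constant, and the toy numbers are illustrative only.

```python
import math


def whiten(rewards, eps=1e-8):
    """Center rewards at zero and scale to unit standard deviation over the batch."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def reinforce_loss(seq_logps, advantages):
    """Surrogate -mean_i [ A_i * sum_t log pi_theta(a_t | s_t) ].

    seq_logps[i] holds the trainer-side token log-probs of response i.
    """
    terms = [-adv * sum(logps) for logps, adv in zip(seq_logps, advantages)]
    return sum(terms) / len(terms)


# Toy batch of four responses with binary rewards (made-up numbers).
rewards = [1.0, 0.0, 0.0, 1.0]
seq_logps = [[-0.2, -0.1], [-0.5, -0.4], [-0.3, -0.3], [-0.1, -0.1]]

adv = whiten(rewards)
loss = reinforce_loss(seq_logps, adv)
print(adv, loss)
```

In this form there is no probability ratio and no clipping, which is why the text treats REINFORCE as a cleaner diagnostic objective than PPO-style surrogates.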

Figure [2](https://arxiv.org/html/2605.14220#S3.F2) shows that the non-exact rollout run exhibits instability jointly in reward and gradient signals, while VeXact is significantly more stable. For example, in the MoE setting of Figure [2](https://arxiv.org/html/2605.14220#S3.F2), the vLLM run initially improves but starts to degrade after step 280, with training and validation rewards decreasing from 0.574 and 0.293 to 0.255 and 0.067, respectively. By contrast, the VeXact reference remains stable and continues to improve, reaching 0.753 training reward and 0.534 validation reward. Since TIM is the only difference between VeXact and the vLLM baseline, these experiments confirm that TIM is itself a critical destabilizing factor, not merely a secondary artifact compounded with other training effects.

## 4 Ablating TIM Mitigations under the VeXact Baseline

§[3](https://arxiv.org/html/2605.14220#S3) shows that TIM alone can destabilize RL training with the REINFORCE setup. However, algorithms like PPO and GRPO have more complicated setups with PPO mini-steps. Unlike fully on-policy algorithms such as REINFORCE, PPO and GRPO fix a snapshot of the policy ($\theta_{old}$) to collect a batch of samples and perform multiple gradient steps on this batch. To optimize the current policy $\theta$ against data sampled from a different distribution $\theta_{old}$, they apply importance sampling, introducing the probability ratio $\pi_{\theta}/\pi_{old}$ into the objective.

Under such setups, TIM is intertwined with the stabilization techniques these algorithms rely on, such as PPO clipping. We now turn to the second research question: when practitioners apply common stabilization techniques, some designed for general PPO training stability and some TIM-aware, what do these mechanisms actually fix, and what optimization side effects do they introduce?

Experiment setup. We study this question in a practical GRPO training setting. Unless otherwise stated, the experiments in this section use Qwen3-1.7B with FSDP on mathematical reasoning workloads. The model is trained on Sanity-Test-R1D-1.5B and evaluated on AIME 2024 every 50 global steps. We compare VeXact with vLLM non-exact rollout implementations using recomputation, bypass, rollout correction, and rejection sampling variants.

![Refer to caption](https://arxiv.org/html/2605.14220v1/x11.png)\(a\)Training reward\.
![Refer to caption](https://arxiv.org/html/2605.14220v1/x12.png)\(b\)AIME24 val\-reward\.
![Refer to caption](https://arxiv.org/html/2605.14220v1/x13.png)\(c\)Loss\.
![Refer to caption](https://arxiv.org/html/2605.14220v1/x14.png)\(d\)Gradient norm\.

Figure 3: Qwen3-1.7B GRPO experiments with VeXact and vLLM recomputation and bypass, where only VeXact maintains training stability. More experimental results on the DAPO dataset in Appendix [A.3](https://arxiv.org/html/2605.14220#A1.SS3).

### 4.1 The Failure Modes of Recomputation and Bypass

There are two implementations for acquiring $\pi_{old}$: recomputation (where the trainer re-evaluates the sampled tokens) and bypass (where the rollout engine directly transmits its log-probabilities). Both strategies use the same clipped PPO/GRPO surrogate, but differ in how the denominator of the policy ratio is obtained. For a sampled token $a_{t}$ under state $s_{t}$ with advantage $A_{t}$, the token-level clipped surrogate is

$$\mathcal{L}_{\mathrm{ppo}}(r_{ppo},A)=-\min\left(r_{ppo}A,\ \mathrm{clip}(r_{ppo},1-\epsilon,1+\epsilon)A\right). \tag{2}$$

Under recomputation and bypass, the PPO ratios are respectively defined as

$$r^{\mathrm{train}}_{ppo}=\frac{\pi_{\theta}(a_{t}\mid s_{t})}{\pi^{\mathrm{train}}_{old}(a_{t}\mid s_{t})},\qquad r^{\mathrm{rollout}}_{ppo}=\frac{\pi_{\theta}(a_{t}\mid s_{t})}{\pi^{\mathrm{rollout}}_{old}(a_{t}\mid s_{t})}. \tag{3}$$

Thus, recomputation and bypass instantiate the same clipped surrogate with different denominators:

$$\mathcal{L}_{\mathrm{recomp}}=\mathcal{L}_{\mathrm{ppo}}(r^{\mathrm{train}}_{ppo},A),\qquad \mathcal{L}_{\mathrm{bypass}}=\mathcal{L}_{\mathrm{ppo}}(r^{\mathrm{rollout}}_{ppo},A). \tag{4}$$

The sequence-level training loss is obtained by summing these token-level terms over the response and averaging over the sampled batch.
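To make the difference between the two denominators concrete, the sketch below evaluates the clipped surrogate of Eq. 2 for a single token under both ratios of Eq. 3. The two old-policy log-probabilities are borrowed from the † column of Table 1. The current-policy log-probability, the advantage, and the default `eps=0.2` are assumptions chosen for illustration, and `ppo_token_loss` is a hypothetical helper rather than a function from any training library.

```python
import math


def ppo_token_loss(logp_theta, logp_old, adv, eps=0.2):
    """Eq. 2 for one token: -min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    r = math.exp(logp_theta - logp_old)
    r_clipped = min(max(r, 1.0 - eps), 1.0 + eps)
    return -min(r * adv, r_clipped * adv)


logp_theta = -0.80          # current policy, trainer side (assumed value)
logp_old_train = -0.827     # trainer-recomputed old log-prob (Table 1)
logp_old_rollout = -0.694   # log-prob reported by the rollout engine (Table 1)
adv = 1.0                   # assumed advantage

# Eq. 4: same surrogate, different denominators.
loss_recomp = ppo_token_loss(logp_theta, logp_old_train, adv)
loss_bypass = ppo_token_loss(logp_theta, logp_old_rollout, adv)

print(f"recomputation: {loss_recomp:.4f}")
print(f"bypass:        {loss_bypass:.4f}")
```

With these numbers the recomputation ratio is slightly above 1 while the bypass ratio is below 1, so the same token sits on opposite sides of the clipping interval's center depending on which engine supplied the old log-probability. When the two old log-probabilities are equal, as under VeXact, the two losses coincide.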

Figure [3](https://arxiv.org/html/2605.14220#S4.F3) shows the main phenomenon: while VeXact keeps the training reward around 0.93, vLLM recomputation first degrades from about 0.87 to roughly 0.40 during the first 650 steps, partially recovers afterward, and then drops rapidly again after about step 1610 before collapsing to near-zero reward after about step 1665. vLLM bypass shows a single-stage degradation, with training reward dropping to roughly 0.4 but not collapsing to zero, and this degradation is not accompanied by comparably large loss spikes. Unlike the REINFORCE setting in §[3.2](https://arxiv.org/html/2605.14220#S3.SS2), where reward collapse is accompanied by clear loss and gradient-norm anomalies, the GRPO runs in Figure [3](https://arxiv.org/html/2605.14220#S4.F3) show less synchronized behavior across these signals: bypass exhibits reward degradation without a comparably synchronized loss anomaly, while recomputation enters an early reward-degradation phase before the later gradient-norm spike becomes visible. This motivates a finer-grained analysis.

KL estimators are not sufficient indicators. Since TIM directly perturbs the effective distance between the updated policy $\pi_{\theta}$ and the old policy $\pi_{\mathrm{old}}$, we first inspect KL estimators computed on the PPO ratio (Schulman, [2020](https://arxiv.org/html/2605.14220#bib.bib32)), such as $K_{1}(r_{\mathrm{ppo}})=-\log r_{\mathrm{ppo}}$ and $K_{3}(r_{\mathrm{ppo}})=(r_{\mathrm{ppo}}-1)-\log r_{\mathrm{ppo}}$, where $r_{\mathrm{ppo}}$ can be instantiated as either $r^{\mathrm{train}}_{\mathrm{ppo}}$ or $r^{\mathrm{rollout}}_{\mathrm{ppo}}$ as defined in Eq. [3](https://arxiv.org/html/2605.14220#S4.E3). In bypass mode, where the rollout and trainer probabilities become visibly inconsistent, these aggregate KL probes increase noticeably in both $K_{1}$ and $K_{3}$. However, under recomputation mode, even over the first 700 training steps in which the run already exhibits the onset of failure, the KL estimators remain close to the VeXact baseline and fail to expose the emerging instability. This indicates that aggregate probability-space divergence is not sufficient to characterize the earliest stage of TIM-induced training failure.
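The two estimators are simple enough to evaluate by hand. The sketch below uses a toy pair of ratios with equal and opposite log-ratios (made-up values) to show how the signed $K_{1}$ terms can cancel in an average while $K_{3}$, which is non-negative per token, does not. The helper names `k1` and `k3` are illustrative.

```python
import math


def k1(r):
    """K1(r) = -log r. Signed: positive when r < 1, negative when r > 1."""
    return -math.log(r)


def k3(r):
    """K3(r) = (r - 1) - log r. Non-negative for every r > 0."""
    return (r - 1.0) - math.log(r)


# Two tokens whose log-ratios are +0.1 and -0.1 (illustrative values).
ratios = [math.exp(0.1), math.exp(-0.1)]

mean_k1 = sum(k1(r) for r in ratios) / len(ratios)
mean_k3 = sum(k3(r) for r in ratios) / len(ratios)

print(f"mean K1 = {mean_k1:.6f}")
print(f"mean K3 = {mean_k3:.6f}")
```

A batch-averaged $K_{1}$ near zero therefore does not imply that individual tokens have ratios near 1, which is one reason an aggregate probe can stay flat while token-level contributions are already distorted.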

![Refer to caption](https://arxiv.org/html/2605.14220v1/x15.png) (a) Recomputation: $K_{1}/K_{3}$ metrics
![Refer to caption](https://arxiv.org/html/2605.14220v1/x16.png) (b) Bypass: $K_{1}/K_{3}$ metrics

Figure 4: KL estimators under recomputation and bypass. In recomputation mode, both $K_{1}$ and $K_{3}$ remain nearly flat during the first 700 steps, even though the reward is already entering the degradation phase. In bypass mode, both $K_{1}$ and $K_{3}$ increase noticeably. Under VeXact, $\pi_{\mathrm{old}}^{\mathrm{train}}=\pi_{\mathrm{old}}^{\mathrm{rollout}}$, so the two corresponding estimators coincide. With recomputation, $K_{1}\langle\pi_{\theta},\pi_{\mathrm{old}}^{\mathrm{train}}\rangle$ is lower than under VeXact because the $K_{1}$ estimator is signed, so per-token $K_{1}$ values can cancel out when accumulated at the batch level.

Zero-centered loss contribution. We therefore shift the analysis from probability space to objective space. The optimizer does not directly consume probability discrepancies; it consumes advantage-weighted surrogate-loss contributions. We isolate this effect with the zero-centered loss contribution

$$C(r_{ppo})=-(r_{ppo}-1)A_{t},\qquad r_{ppo}\in\{r_{ppo}^{\mathrm{train}},r_{ppo}^{\mathrm{rollout}}\}, \tag{5}$$

which has the same gradient as the standard $-r_{ppo}A_{t}$ objective but is zero when $r_{ppo}=1$. Here $r_{ppo}^{\mathrm{train}}$ measures the ratio between the current model distribution and the trainer-side reference $\pi_{\mathrm{train}}$, while $r_{ppo}^{\mathrm{rollout}}$ measures the ratio between the current model distribution and the rollout-side behavioral distribution $\pi_{\mathrm{rollout}}$. Under recomputation, $C(r_{ppo}^{\mathrm{train}})$ is the contribution that actually drives the trainer update, because the optimizer uses the trainer-reconstructed denominator. By contrast, $C(r_{ppo}^{\mathrm{rollout}})$ is the rollout-equivalent contribution of the same sampled token: it evaluates the update pressure relative to the behavioral distribution that generated the token. Under VeXact, the two references are numerically aligned, so $r_{ppo}^{\mathrm{train}}=r_{ppo}^{\mathrm{rollout}}$ and therefore $C(r_{ppo}^{\mathrm{train}})=C(r_{ppo}^{\mathrm{rollout}})$. Figure [5](https://arxiv.org/html/2605.14220#S4.F5) shows that recomputation does not merely add uniform noise to this contribution. The distortion is uneven across positive- and negative-advantage samples: some harmful contributions are amplified, while offsetting contributions are not amplified symmetrically. This sign-imbalanced contribution distribution changes the effective optimization pressure before the mismatch becomes large enough to appear as a KL estimator spike.

We hypothesize that the heavy-tailed numerical errors introduced by TIM interact asymmetrically with PPO's clipping bounds. Because the surrogate objective reacts differently depending on the sign of $A_{t}$, symmetric numerical noise in $\delta_{t}$ is transformed into a skewed, non-zero-mean distortion of the gradient updates. This contribution-level distortion explains why the early degradation phase can remain nearly invisible to KL estimators. At this stage, the dominant failure mode is not a large global distributional shift, but a skew in the advantage-weighted loss contributions seen by the optimizer.
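The relation between the two contributions can be written out from the definitions already given. The short derivation below uses only Eq. 1, Eq. 3, and Eq. 5, under the assumption that the numerator $\pi_{\theta}$ is the same trainer-side quantity in both ratios and that clipping is inactive. It is a worked identity, not an additional experimental result.

```latex
\begin{aligned}
r^{\mathrm{rollout}}_{ppo}
  &= \frac{\pi_{\theta}}{\pi^{\mathrm{rollout}}_{old}}
   = \frac{\pi_{\theta}}{\pi^{\mathrm{train}}_{old}}
     \cdot \frac{\pi^{\mathrm{train}}_{old}}{\pi^{\mathrm{rollout}}_{old}}
   = r^{\mathrm{train}}_{ppo}\, e^{\delta_{t}}, \\[4pt]
C\!\left(r^{\mathrm{rollout}}_{ppo}\right) - C\!\left(r^{\mathrm{train}}_{ppo}\right)
  &= -\left(r^{\mathrm{rollout}}_{ppo} - r^{\mathrm{train}}_{ppo}\right) A_{t}
   = -\, r^{\mathrm{train}}_{ppo}\left(e^{\delta_{t}} - 1\right) A_{t}.
\end{aligned}
```

The gap between the two contributions thus scales with $A_{t}$ and with $e^{\delta_{t}}-1$. Since $e^{\delta}-1$ is not an odd function of $\delta$, a $\delta_{t}$ distribution that is symmetric around zero does not yield a gap distribution that is symmetric around zero. This is consistent with, though it does not by itself establish, the sign-dependent skew hypothesized above.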

![Refer to caption](https://arxiv.org/html/2605.14220v1/x17.png) (a) VeXact $C(r_{ppo}^{\mathrm{rollout}})$ / $C(r_{ppo}^{\mathrm{train}})$
![Refer to caption](https://arxiv.org/html/2605.14220v1/x18.png) (b) vLLM $C(r_{ppo}^{\mathrm{rollout}})$
![Refer to caption](https://arxiv.org/html/2605.14220v1/x19.png) (c) vLLM $C(r_{ppo}^{\mathrm{train}})$

Figure 5: Sign-imbalanced ratio contributions under recomputation. We plot the zero-centered ratio contribution $C(r)=-(r-1)A$, which isolates the ratio-dependent component of the surrogate objective. This shows that recomputation induces a sign-dependent skew in the advantage-weighted update signal, rather than merely adding uniform noise.

Why bypass also fails. In bypass mode, the PPO ratio correctly uses the behavior distribution in the denominator ($r^{\mathrm{rollout}}_{ppo}=\pi^{\mathrm{train}}_{\theta}/\pi^{\mathrm{rollout}}_{old}$). However, the numerator $\pi_{\theta}^{\mathrm{train}}$ is evaluated using the trainer's numerical execution path. During the backward pass, the optimizer computes the score-function gradient based on $\nabla_{\theta}\log\pi_{\theta}^{\mathrm{train}}$. Because TIM creates a misaligned probability landscape between the trainer and the rollout engine, the optimizer exploits numerical artifacts in the trainer's forward pass. These weight updates fail to translate into actual behavioral improvements when $\theta$ is deployed back to the rollout engine, leading to silent policy degradation.

#### Takeaway\.

(1) In recomputation mode, the sampling log-probabilities used in the loss computation do not come from the actual sampler (the rollout engine), which skews the advantage-weighted loss contributions seen by the optimizer. (2) Even in bypass mode, the optimization target is still defined in a different probability space from the one the rollout samples from, making policy updates ineffective.

### 4\.2Ablating Algorithmic TIM Compensation

§[4.1](https://arxiv.org/html/2605.14220#S4.SS1) shows that recomputation and bypass instantiate different ways of accounting for the old-policy probability, but neither removes TIM at the source. This motivates a natural question: can post-hoc algorithmic compensation recover the behavior of zero-mismatch rollout, or does it only suppress some observable symptoms? We answer this question using VeXact as a TIM-free reference. Our goal is to diagnose what existing TIM-aware correction scaffolds can approximate, which design choices matter, and where their limits remain.

#### Existing algorithmic corrections\.

Existing rollout-correction methods commonly suppress unreliable samples through token-level truncation or sequence-level rejection. Let $r_{\mathrm{corr}} = \frac{\pi^{\mathrm{train}}_{\mathrm{old}}}{\pi^{\mathrm{rollout}}_{\mathrm{old}}}$ denote the correction ratio. They typically act either at the token level through truncated importance sampling,

$$\mathcal{L}_{\mathrm{TIS}} = \sum_{t=1}^{T} \min(r_{\mathrm{corr},t}, \tau_{\mathrm{tok}})\, \mathcal{L}_{\mathrm{PPO}}\left(r^{\mathrm{train}}_{\mathrm{ppo},t}, A_{t}\right), \tag{6}$$

or at the sequence level through rejection sampling,

$$\mathcal{L}_{\mathrm{RS}} = \mathbf{1}\left[S_{\mathrm{seq}}(q_{1:T}) \leq \tau_{\mathrm{seq}}\right] \sum_{t=1}^{T} \mathcal{L}_{\mathrm{PPO}}\left(r^{\mathrm{rollout}}_{\mathrm{ppo},t}, A_{t}\right). \tag{7}$$

Here, $\tau_{\mathrm{tok}}$ is a token-level truncation threshold that limits large correction weights, and $\tau_{\mathrm{seq}}$ is the sequence-level rejection threshold. $q_{1:T}$ denotes the sequence of token-level diagnostic signals used for rejection, and $S_{\mathrm{seq}}(\cdot)$ maps these signals to a scalar trajectory-level trust-region score.
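The token-level objective in Eq. (6) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: the function and argument names are hypothetical, $\mathcal{L}_{\mathrm{PPO}}$ is taken to be the standard clipped surrogate written as a loss, the trainer-side ratio is assumed to be $\pi^{\mathrm{train}}_{\theta} / \pi^{\mathrm{train}}_{\mathrm{old}}$, and the clip range of 0.2 is a placeholder.

```python
import math


def ppo_clipped_loss(ratio, advantage, eps=0.2):
    """Token-level clipped surrogate, written as a loss (negated objective)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return -min(ratio * advantage, clipped * advantage)


def tis_loss(logp_cur_train, logp_old_train, logp_old_rollout, advantages,
             tau_tok=2.0, eps=0.2):
    """Truncated importance sampling loss over one trajectory (sketch of Eq. 6).

    All inputs are per-token lists of equal length:
      logp_cur_train   -- log pi_theta^train       (current policy, trainer path)
      logp_old_train   -- log pi_old^train         (old policy, trainer path)
      logp_old_rollout -- log pi_old^rollout       (old policy, rollout engine)
    """
    total = 0.0
    for lc, lt, lr, adv in zip(logp_cur_train, logp_old_train,
                               logp_old_rollout, advantages):
        r_corr = math.exp(lt - lr)       # trainer vs. rollout mismatch
        r_ppo_train = math.exp(lc - lt)  # ASSUMED definition of the trainer-side ratio
        weight = min(r_corr, tau_tok)    # truncate large correction weights
        total += weight * ppo_clipped_loss(r_ppo_train, adv, eps)
    return total


# One token whose rollout probability is far below the trainer's:
# the correction weight is capped at tau_tok instead of exploding.
print(tis_loss([0.0], [0.0], [-5.0], [1.0]))
```

The sketch has no autograd, so it does not decide whether the truncated weight should carry gradient; a real implementation would need to make that choice explicitly.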

#### Our diagnostic instantiation\.

We therefore use these post-hoc corrections as diagnostic probes rather than as new algorithmic proposals. §[4.1](https://arxiv.org/html/2605.14220#S4.SS1) shows that TIM can enter PPO/GRPO both through trainer–rollout old-policy displacement and through skewed token-level loss contributions. These observations motivate a controlled diagnostic ablation along two axes: the ratio used to drive sequence-level rejection, and the effect of adding token-level truncation before sequence-level rejection.

For the masking signal, we compare $q_t \in \{r_{\mathrm{corr},t},\, r^{\mathrm{rollout}}_{\mathrm{ppo},t}\}$. Here, $r_{\mathrm{corr},t}$ measures system-induced mismatch between $\pi^{\mathrm{train}}_{\mathrm{old}}$ and $\pi^{\mathrm{rollout}}_{\mathrm{old}}$, whereas $r^{\mathrm{rollout}}_{\mathrm{ppo},t}$ measures policy movement from the rollout behavior policy to the current policy. For both choices, $q_t$ is used only as the masking signal, rather than necessarily as the PPO ratio inside the surrogate objective. At the granularity level, we apply sequence-level rejection to both signals by aggregating token-wise mismatch into $S_{\mathrm{seq}}(q_{1:T}) = \sum_{t=1}^{T} K(q_t)$, with $K(\cdot)$ instantiated as $K_1$ or $K_3$, and rejecting trajectories whose accumulated mismatch exceeds a threshold. We also evaluate a joint token- and sequence-level variant, which first filters localized token-level outliers and then applies the sequence-level rejection criterion.
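The sequence-level rejection rule can be sketched as below. This is a hedged illustration: it assumes that $K_1$ and $K_3$ are the `k1` and `k3` KL estimators from Schulman's note on approximating KL divergence (their exact definitions lie outside this section), and the function names are hypothetical. The threshold 0.001 is the $\tau_{\mathrm{seq}}$ value reported in Appendix A.4.

```python
import math


def k1(ratio):
    """ASSUMED K1: the unbiased but signed estimator -log(r)."""
    return -math.log(ratio)


def k3(ratio):
    """ASSUMED K3: the non-negative estimator (r - 1) - log(r)."""
    return (ratio - 1.0) - math.log(ratio)


def sequence_score(ratios, estimator=k3):
    """S_seq(q_{1:T}) = sum_t K(q_t) over the token-level masking signal."""
    return sum(estimator(q) for q in ratios)


def keep_trajectory(ratios, tau_seq, estimator=k3):
    """Indicator 1[S_seq <= tau_seq]: True keeps the trajectory, False rejects it."""
    return sequence_score(ratios, estimator) <= tau_seq


matched = [1.0, 1.0, 1.0, 1.0]     # trainer and rollout agree on every token
one_outlier = [1.0, 1.0, 1.5, 1.0]  # a single token with a 1.5x mismatch

print(keep_trajectory(matched, tau_seq=0.001))      # kept
print(keep_trajectory(one_outlier, tau_seq=0.001))  # rejected
```

Under this assumed definition, per-token `k3` terms are never negative, so mismatches can only accumulate, whereas `k1` terms of opposite sign can cancel within a trajectory.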

Together, these choices yield three diagnostic configurations: sequence-level rejection based on either $r_{\mathrm{corr}}$ or $r^{\mathrm{rollout}}_{\mathrm{ppo}}$, and joint token-level clipping plus sequence-level rejection based on $r_{\mathrm{corr}}$, with the sequence-level score instantiated by either $K_1$ or $K_3$. We provide the expanded objectives in Appendix [A.4](https://arxiv.org/html/2605.14220#A1.SS4).

![Refer to caption](https://arxiv.org/html/2605.14220v1/x20.png)(a) Training reward
![Refer to caption](https://arxiv.org/html/2605.14220v1/x21.png)(b) Validation reward
![Refer to caption](https://arxiv.org/html/2605.14220v1/x22.png)(c) Loss
![Refer to caption](https://arxiv.org/html/2605.14220v1/x23.png)(d) Gradient norm

Figure 6: Algorithmic-patch comparison. We evaluate four correction baselines: srs-k3-corr-ratio and srs-k3-ppo-ratio apply sequence-level rejection using $r_{\mathrm{corr}}$ and $r_{\mathrm{ppo}}$, respectively, while tis-srs-k1-corr-ratio and tis-srs-k3-corr-ratio combine truncated importance sampling with $r_{\mathrm{corr}}$-based sequence rejection instantiated with $K_1$ and $K_3$, respectively. We provide the expanded objectives in Appendix [A.4](https://arxiv.org/html/2605.14220#A1.SS4). For sequence rejection, $r_{\mathrm{corr}}$ is more effective than $r_{\mathrm{ppo}}$. Additionally, with TIS, $r_{\mathrm{corr}}$-based sequence rejection can track VeXact closely. For visual clarity, curves are smoothed with a centered moving average with a window size of 25.

Experiment results. As Figure [6](https://arxiv.org/html/2605.14220#S4.F6) shows, among the sequence-level variants, using $r_{\mathrm{corr},t}$ as the filtering signal yields higher reward than using $r^{\mathrm{rollout}}_{\mathrm{ppo},t}$. We attribute this difference to the distinct semantics of the two ratios. Filtering on the correction ratio ensures that the base distribution being updated ($\pi_{old}^{train}$) is not too far from the sampling distribution. In contrast, the rollout ratio overlaps with PPO’s policy-ratio mechanism, whose purpose is to control how far the current policy moves away from the behavior policy, not where the update starts from.

The same ablation also shows that the configuration combining token-level filtering with sequence-level rejection most closely tracks the VeXact reference, indicating that TIM manifests at multiple granularities. Localized token-level mismatch outliers can distort individual PPO contributions even when the aggregate sequence-level score remains moderate, while some trajectories exhibit large accumulated mismatch and should be rejected at the sequence level. The choice between $K_1$ and $K_3$ for the sequence rejection has a comparatively minor effect in our experiments.

These results suggest that algorithmic correction can closely approach the zero-mismatch reference in the evaluated setting. However, these forms of algorithmic TIM compensation remain post-hoc: unlike VeXact, they can only suppress already-generated samples and may also discard useful learning signals. More importantly, without VeXact as a ground-truth reference, patch configuration (e.g., $\tau_{\mathrm{seq}}$, $\tau_{\mathrm{tok}}$) remains largely ad hoc. VeXact therefore complements algorithmic patches by enabling their principled calibration.

Takeaway. (1) $r_{corr}$-based sequence-level rejection sampling is more effective than $r_{ppo}$-based rejection. The choice of threshold metric, $K_1$ or $K_3$, has little effect. (2) TIS is effective because it corrects the PPO ratio used in the loss function from $r_{ppo}^{train}$ to $r_{ppo}^{rollout}$ by multiplying $r_{ppo}^{train}$ by $r_{corr}$, consistent with our conclusion from §[4.1](https://arxiv.org/html/2605.14220#S4.SS1). (3) Combining (1) and (2), we find that carefully designed algorithmic configurations can closely track our zero-mismatch reference VeXact.
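The identity behind takeaway (2) can be written out in one line. This is a sketch that assumes the trainer-side ratio is $r^{\mathrm{train}}_{\mathrm{ppo},t} = \pi^{\mathrm{train}}_{\theta} / \pi^{\mathrm{train}}_{\mathrm{old}}$ (its definition lies outside this section) and that ignores truncation and clipping:

$$
r_{\mathrm{corr},t} \cdot r^{\mathrm{train}}_{\mathrm{ppo},t}
= \frac{\pi^{\mathrm{train}}_{\mathrm{old}}}{\pi^{\mathrm{rollout}}_{\mathrm{old}}} \cdot \frac{\pi^{\mathrm{train}}_{\theta}}{\pi^{\mathrm{train}}_{\mathrm{old}}}
= \frac{\pi^{\mathrm{train}}_{\theta}}{\pi^{\mathrm{rollout}}_{\mathrm{old}}}
= r^{\mathrm{rollout}}_{\mathrm{ppo},t}.
$$

The equality is exact only when $r_{\mathrm{corr},t} \leq \tau_{\mathrm{tok}}$ and the PPO clip is inactive, because in Eq. (6) the truncated weight multiplies $\mathcal{L}_{\mathrm{PPO}}$ from outside the clipping operator.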

## 5Related Works

RL stabilization techniques. Existing stabilization techniques for PPO/GRPO-style training fall into three categories. Fine-grained clipping regions: DAPO (Yu et al., [2025](https://arxiv.org/html/2605.14220#bib.bib27)) applies an asymmetric clipping region for tokens with positive and negative advantages, and GSPO (Zheng et al., [2025b](https://arxiv.org/html/2605.14220#bib.bib28)) moves to sequence-level importance ratios to reduce variance accumulation over long responses. TIM-aware corrections explicitly target the training–inference numerical mismatch: Truncated IS (TIS) (Yao et al., [2025](https://arxiv.org/html/2605.14220#bib.bib29)) truncates extreme token-level IS weights, and Masked IS (Zheng et al., [2025a](https://arxiv.org/html/2605.14220#bib.bib19); Li et al., [2026](https://arxiv.org/html/2605.14220#bib.bib33)) zeros out gradients for the most divergent tokens at the token and sequence levels. Additionally, Qi et al. ([2025](https://arxiv.org/html/2605.14220#bib.bib30)) and Dirhoussi et al. ([2026](https://arxiv.org/html/2605.14220#bib.bib8)) find that using FP16 instead of BF16 improves RL training stability. MoE-specific corrections address the additional instability from dynamic expert routing. For example, R3 (Ma et al., [2025](https://arxiv.org/html/2605.14220#bib.bib31)) directly replays inference-time routing decisions during training mini-steps to eliminate divergent expert selections.

Efficient batch-invariant kernels. To address the performance penalty of fixing tiling and reduction order in batch-invariant kernels, DeepSeek-V4 (DeepSeek-AI, [2026](https://arxiv.org/html/2605.14220#bib.bib22)) introduces dual-kernel strategies for the attention kernel and a set of optimizations for GEMM kernels. Tree-based invariant kernels (TBIK) (Zhang et al., [2025](https://arxiv.org/html/2605.14220#bib.bib23)) resolve the accumulation-order mismatch across tensor-parallel sizes by using a tree-based reduction order, enabling zero-mismatch rollout under TP inference. These works demonstrate that batch-invariant kernels can be highly performant and scalable for large-scale RL. Our work focuses instead on analyzing TIM’s role in RL training collapse, using VeXact as a baseline.
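Why reduction order matters can be seen in a toy example. The sketch below is not the TBIK kernel; it is a hypothetical CPU illustration in which a list of floats is split across 1, 2, or 4 "shards" (standing in for tensor-parallel ranks), each shard is reduced locally, and the partial sums are then combined. It assumes the input length and shard counts are powers of two.

```python
def sequential_sum(values):
    """Left-to-right accumulation, as a simple running-sum reduction would do."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


def tree_sum(values):
    """Pairwise (binary-tree) accumulation with a fixed split at the midpoint."""
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return tree_sum(values[:mid]) + tree_sum(values[mid:])


def sharded_sum(values, num_shards, reduce_fn):
    """Reduce each contiguous shard locally, then reduce the partial sums."""
    size = len(values) // num_shards
    partials = [reduce_fn(values[i * size:(i + 1) * size])
                for i in range(num_shards)]
    return reduce_fn(partials)


# Floating-point addition is not associative, so grouping changes the result.
values = [1e16, 1.0, -1e16, 1.0, 1.0, 1.0, 1.0, 1.0]

sequential = {tp: sharded_sum(values, tp, sequential_sum) for tp in (1, 2, 4)}
tree = {tp: sharded_sum(values, tp, tree_sum) for tp in (1, 2, 4)}

print("sequential:", sequential)  # depends on the shard count
print("tree:      ", tree)        # identical for every shard count
```

With aligned power-of-two shards, the sharded tree reduction performs exactly the same additions as the global tree, so the result is bitwise identical across shard counts. The point is consistency rather than accuracy: the exact sum of these values is 6, and neither scheme recovers it.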

## 6Discussion and Limitations

Implications for asynchronous LLM RL. Large-scale asynchronous LLM RL, for example on agentic tasks, is off-policy by design in order to speed up training. However, we believe that addressing TIM is still necessary in this setting, since TIM fundamentally creates different probability landscapes for the optimization and sampling spaces, as discussed above. DeepSeek-V4 also reports that the use of batch-invariant kernels ensures exact log-probability behavior across training, inference, and async RL pipelines. In addition, unlike algorithmic corrections that mask or discard high-mismatch tokens or sequences, zero-mismatch RL preserves the full learning signal. This suggests that eliminating TIM at the system level may enable more robust and sample-efficient learning across both synchronous and asynchronous RL settings.

VeXact as a system-algorithm calibration tool. In real-world development, VeXact can act as a useful calibration tool: it allows researchers to benchmark algorithmic patches and tune sensitive filtering thresholds in a noise-free environment before deploying algorithmic mitigations to large-scale RL pipelines.

Limitations\.Our experiments show that several algorithmic corrections can reduce the impact of TIM and, in some settings, closely track the zero\-mismatch baseline\. However, our evaluation is limited in scale and coverage: we study a representative set of stabilization and correction techniques under a finite set of models, tasks, and system configurations\. It remains unclear whether these mitigations generalize across broader RL settings, or whether they introduce additional optimization side effects that are not visible in our current experiments\. This limitation reaffirms a joint systems\-and\-algorithms perspective on RL stability\.

## 7Conclusion

In this work, we investigate Training-Inference Mismatch (TIM) as a systems-level confounder in LLM RL stability. With our zero-mismatch diagnostic baseline VeXact, we demonstrate that TIM alone can destabilize RL training under a wide range of setups. We further analyze how TIM alters the effective optimization problem, explaining why common implementation choices such as trainer-side recomputation and rollout-side bypass fail to fully eliminate its impact. Finally, we find that existing algorithmic corrections can closely approach the zero-mismatch reference, but require careful design and calibration. Overall, our findings call for a joint system-algorithm perspective on RL stability and highlight the need for zero-mismatch RL execution.

## Acknowledgments

We are grateful to Haolin Liu \(University of Virginia\), Yuxuan Tong, Xibin Wu and Xia Xiao for their valuable suggestions and insightful discussions\.

## References

- SARATHI: efficient llm inference by piggybacking decodes with chunked prefills\.External Links:2308\.16369,[Link](https://arxiv.org/abs/2308.16369)Cited by:[§3\.1](https://arxiv.org/html/2605.14220#S3.SS1.p4.1)\.
- S\. Cao, S\. Hegde, D\. Li, T\. Griggs, S\. Liu, E\. Tang, J\. Pan, X\. Wang, A\. Malik, G\. Neubig, K\. Hakhamaneshi, R\. Liaw, P\. Moritz, M\. Zaharia, J\. E\. Gonzalez, and I\. Stoica \(2025\)SkyRL\-v0: train real\-world long\-horizon agents via reinforcement learning\.Cited by:[§1](https://arxiv.org/html/2605.14220#S1.p2.1)\.
- T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré (2022) FlashAttention: fast and memory-efficient exact attention with IO-awareness. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA. External Links: ISBN 9781713871088. Cited by: [§3.1](https://arxiv.org/html/2605.14220#S3.SS1.p3.1).
- T\. Dao, D\. Haziza, F\. Massa, and G\. Sizov \(2023\)Flash\-decoding for long\-context inference\.Note:Blog postExternal Links:[Link](https://crfm.stanford.edu/2023/10/12/flashdecoding.html)Cited by:[§3\.1](https://arxiv.org/html/2605.14220#S3.SS1.p3.1)\.
- DeepSeek\-AI \(2026\)DeepSeek\-v4: towards highly efficient million\-token context intelligence\.Cited by:[§5](https://arxiv.org/html/2605.14220#S5.p2.1)\.
- A\. Dirhoussi, Q\. Gallouédec, E\. Beeching, L\. Tunstall, K\. Rasul, and L\. von Werra \(2026\)Defeating the trainer\-generator precision mismatch in trl\.Cited by:[§5](https://arxiv.org/html/2605.14220#S5.p1.1)\.
- J\. Fu, X\. Zhao, C\. Yao, H\. Wang, Q\. Han, and Y\. Xiao \(2025a\)Reward shaping to mitigate reward hacking in rlhf\.arXiv preprint arXiv:2502\.18770\.Cited by:[§1](https://arxiv.org/html/2605.14220#S1.p2.1)\.
- W\. Fu, J\. Gao, X\. Shen, C\. Zhu, Z\. Mei, C\. He, S\. Xu, G\. Wei, J\. Mei, J\. Wang,et al\.\(2025b\)Areal: a large\-scale asynchronous reinforcement learning system for language reasoning\.arXiv preprint arXiv:2505\.24298\.Cited by:[§1](https://arxiv.org/html/2605.14220#S1.p2.1)\.
- GeeeekExplorer \(2025\)Nano\-vllm: nano vllm\.GitHub\.Note:[https://github\.com/GeeeekExplorer/nano\-vllm](https://github.com/GeeeekExplorer/nano-vllm)Accessed: 2026\-04\-09Cited by:[§3\.1](https://arxiv.org/html/2605.14220#S3.SS1.p4.1)\.
- H\. He \(2025\)Defeating nondeterminism in llm inference\.Thinking Machines Lab: Connectionism\.Note:https://thinkingmachines\.ai/blog/defeating\-nondeterminism\-in\-llm\-inference/External Links:[Document](https://dx.doi.org/10.64434/tml.20250910)Cited by:[§1](https://arxiv.org/html/2605.14220#S1.p5.1),[§3\.1](https://arxiv.org/html/2605.14220#S3.SS1.p3.1)\.
- J\. Hu, J\. K\. Liu, H\. Xu, and W\. Shen \(2025\)REINFORCE\+\+: stabilizing critic\-free policy optimization with global advantage normalization\.External Links:2501\.03262,[Link](https://arxiv.org/abs/2501.03262)Cited by:[§2](https://arxiv.org/html/2605.14220#S2.p3.2)\.
- J. Hu, X. Wu, Z. Zhu, W. Wang, D. Zhang, Y. Cao, et al. (2024) OpenRLHF: an easy-to-use, scalable and high-performance RLHF framework. arXiv preprint arXiv:2405.11143. Cited by: [§1](https://arxiv.org/html/2605.14220#S1.p2.1).
- Y\. Huang, Y\. Cheng, A\. Bapna, O\. Firat, M\. X\. Chen, D\. Chen, H\. Lee, J\. Ngiam, Q\. V\. Le, Y\. Wu, and Z\. Chen \(2019\)GPipe: efficient training of giant neural networks using pipeline parallelism\.External Links:1811\.06965,[Link](https://arxiv.org/abs/1811.06965)Cited by:[§3\.1](https://arxiv.org/html/2605.14220#S3.SS1.p4.1)\.
- W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. E\. Gonzalez, H\. Zhang, and I\. Stoica \(2023\)Efficient memory management for large language model serving with pagedattention\.External Links:2309\.06180,[Link](https://arxiv.org/abs/2309.06180)Cited by:[§2](https://arxiv.org/html/2605.14220#S2.p1.1)\.
- Y\. Li, J\. Liu, J\. Xu, Y\. Tong, Z\. Li, Q\. Liu, and B\. Wang \(2026\)Trust region masking for long\-horizon llm reinforcement learning\.External Links:2512\.23075,[Link](https://arxiv.org/abs/2512.23075)Cited by:[§1](https://arxiv.org/html/2605.14220#S1.p3.1),[§1](https://arxiv.org/html/2605.14220#S1.p8.1),[§5](https://arxiv.org/html/2605.14220#S5.p1.1)\.
- J\. Liu, Y\. Li, Y\. Fu, J\. Wang, Q\. Liu, and Y\. Shen \(2025a\)External Links:[Link](https://richardli.xyz/rl-collapse)Cited by:[§1](https://arxiv.org/html/2605.14220#S1.p3.1)\.
- Z\. Liu, C\. Chen, W\. Li, P\. Qi, T\. Pang, C\. Du, W\. S\. Lee, and M\. Lin \(2025b\)Understanding r1\-zero\-like training: a critical perspective\.arXiv preprint arXiv:2503\.20783\.Cited by:[§1](https://arxiv.org/html/2605.14220#S1.p1.1)\.
- W\. Ma, H\. Zhang, L\. Zhao, Y\. Song, Y\. Wang, Z\. Sui, and F\. Luo \(2025\)Stabilizing MoE reinforcement learning by aligning training and inference routers\.arXiv preprint arXiv:2510\.11370\.External Links:[Link](https://arxiv.org/abs/2510.11370)Cited by:[§5](https://arxiv.org/html/2605.14220#S5.p1.1)\.
- Mathematical Association of America \(2024\)AIME 2024: American Invitational Mathematics Examination\.External Links:[Link](https://artofproblemsolving.com/wiki/index.php/2024_AIME_I)Cited by:[§3\.2](https://arxiv.org/html/2605.14220#S3.SS2.p2.1)\.
- MiniMax, :, A\. Chen, A\. Li, B\. Gong, B\. Jiang, B\. Fei, B\. Yang, B\. Shan, C\. Yu, C\. Wang, C\. Zhu, C\. Xiao, C\. Du, C\. Zhang, C\. Qiao, C\. Zhang, C\. Du, C\. Guo, D\. Chen, D\. Ding, D\. Sun, D\. Li, E\. Jiao, H\. Zhou, H\. Zhang, H\. Ding, H\. Sun, H\. Feng, H\. Cai, H\. Zhu, J\. Sun, J\. Zhuang, J\. Cai, J\. Song, J\. Zhu, J\. Li, J\. Tian, J\. Liu, J\. Xu, J\. Yan, J\. Liu, J\. He, K\. Feng, K\. Yang, K\. Xiao, L\. Han, L\. Wang, L\. Yu, L\. Feng, L\. Li, L\. Zheng, L\. Du, L\. Yang, L\. Zeng, M\. Yu, M\. Tao, M\. Chi, M\. Zhang, M\. Lin, N\. Hu, N\. Di, P\. Gao, P\. Li, P\. Zhao, Q\. Ren, Q\. Xu, Q\. Li, Q\. Wang, R\. Tian, R\. Leng, S\. Chen, S\. Chen, S\. Shi, S\. Weng, S\. Guan, S\. Yu, S\. Li, S\. Zhu, T\. Li, T\. Cai, T\. Liang, W\. Cheng, W\. Kong, W\. Li, X\. Chen, X\. Song, X\. Luo, X\. Su, X\. Li, X\. Han, X\. Hou, X\. Lu, X\. Zou, X\. Shen, Y\. Gong, Y\. Ma, Y\. Wang, Y\. Shi, Y\. Zhong, Y\. Duan, Y\. Fu, Y\. Hu, Y\. Gao, Y\. Fan, Y\. Yang, Y\. Li, Y\. Hu, Y\. Huang, Y\. Li, Y\. Xu, Y\. Mao, Y\. Shi, Y\. Wenren, Z\. Li, Z\. Li, Z\. Tian, Z\. Zhu, Z\. Fan, Z\. Wu, Z\. Xu, Z\. Yu, Z\. Lyu, Z\. Jiang, Z\. Gao, Z\. Wu, Z\. Song, and Z\. Sun \(2025\)MiniMax\-m1: scaling test\-time compute efficiently with lightning attention\.External Links:2506\.13585,[Link](https://arxiv.org/abs/2506.13585)Cited by:[§1](https://arxiv.org/html/2605.14220#S1.p2.1)\.
- NVIDIA Corporation \(2025\)CUDA programming guide: CUDA graphs\.Note:[https://docs\.nvidia\.com/cuda/cuda\-programming\-guide/04\-special\-topics/cuda\-graphs\.html](https://docs.nvidia.com/cuda/cuda-programming-guide/04-special-topics/cuda-graphs.html)Accessed: 2026\-04\-09Cited by:[§3\.1](https://arxiv.org/html/2605.14220#S3.SS1.p4.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray,et al\.\(2022\)Training language models to follow instructions with human feedback\.InAdvances in Neural Information Processing Systems,Vol\.35\.Cited by:[§1](https://arxiv.org/html/2605.14220#S1.p1.1)\.
- A\. Pan, E\. Jones, M\. Jagadeesan, and J\. Steinhardt \(2024\)Feedback loops with language models drive in\-context reward hacking\.arXiv preprint arXiv:2402\.06627\.Cited by:[§1](https://arxiv.org/html/2605.14220#S1.p2.1)\.
- P\. Qi, Z\. Liu, X\. Zhou, T\. Pang, C\. Du, W\. S\. Lee, and M\. Lin \(2025\)Defeating the training\-inference mismatch via FP16\.arXiv preprint arXiv:2510\.26788\.External Links:[Link](https://arxiv.org/abs/2510.26788)Cited by:[§3\.2](https://arxiv.org/html/2605.14220#S3.SS2.p2.1),[§5](https://arxiv.org/html/2605.14220#S5.p1.1)\.
- S\. Rajbhandari, J\. Rasley, O\. Ruwase, and Y\. He \(2020\)ZeRO: memory optimizations toward training trillion parameter models\.External Links:1910\.02054,[Link](https://arxiv.org/abs/1910.02054)Cited by:[§1](https://arxiv.org/html/2605.14220#S1.p5.1),[§2](https://arxiv.org/html/2605.14220#S2.p1.1)\.
- Ring AI Team \(2025\)Learning for trillion\-scale thinking model\.arXiv preprint arXiv:2510\.18855\.Cited by:[§1](https://arxiv.org/html/2605.14220#S1.p3.1)\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\)Proximal policy optimization algorithms\.arXiv preprint arXiv:1707\.06347\.External Links:[Link](https://arxiv.org/abs/1707.06347)Cited by:[§1](https://arxiv.org/html/2605.14220#S1.p1.1),[§1](https://arxiv.org/html/2605.14220#S1.p3.1),[§2](https://arxiv.org/html/2605.14220#S2.p1.1),[§2](https://arxiv.org/html/2605.14220#S2.p3.2)\.
- J\. Schulman \(2020\)Approximating kl divergence\.Note:[http://joschu\.net/blog/kl\-approx\.html](http://joschu.net/blog/kl-approx.html)Cited by:[§4\.1](https://arxiv.org/html/2605.14220#S4.SS1.p3.9)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\.K\. Li, Y\. Wu, and D\. Guo \(2024\)DeepSeekMath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.External Links:[Link](https://arxiv.org/abs/2402.03300)Cited by:[§1](https://arxiv.org/html/2605.14220#S1.p1.1),[§2](https://arxiv.org/html/2605.14220#S2.p3.2)\.
- G\. Sheng, Y\. Tong, B\. Wan, W\. Zhang, C\. Jia, X\. Wu, Y\. Wu, X\. Li, C\. Zhang, Y\. Peng,et al\.\(2025a\)Laminar: a scalable asynchronous rl post\-training framework\.arXiv preprint arXiv:2510\.12633\.Cited by:[§1](https://arxiv.org/html/2605.14220#S1.p2.1)\.
- G\. Sheng, C\. Zhang, Z\. Ye, X\. Wu, W\. Zhang, R\. Zhang, Y\. Peng, H\. Lin, and C\. Wu \(2025b\)HybridFlow: a flexible and efficient rlhf framework\.InProceedings of the Twentieth European Conference on Computer Systems,EuroSys ’25,pp\. 1279–1297\.External Links:[Link](http://dx.doi.org/10.1145/3689031.3696075),[Document](https://dx.doi.org/10.1145/3689031.3696075)Cited by:[§1](https://arxiv.org/html/2605.14220#S1.p2.1),[§1](https://arxiv.org/html/2605.14220#S1.p5.1)\.
- M\. Shoeybi, M\. Patwary, R\. Puri, P\. LeGresley, J\. Casper, and B\. Catanzaro \(2020\)Megatron\-lm: training multi\-billion parameter language models using model parallelism\.External Links:1909\.08053,[Link](https://arxiv.org/abs/1909.08053)Cited by:[§2](https://arxiv.org/html/2605.14220#S2.p1.1)\.
- N\. Stiennon, L\. Ouyang, J\. Wu, D\. Ziegler, R\. Lowe, C\. Voss, A\. Radford, D\. Amodei, and P\. F\. Christiano \(2020\)Learning to summarize with human feedback\.InAdvances in Neural Information Processing Systems,Vol\.33\.Cited by:[§1](https://arxiv.org/html/2605.14220#S1.p1.1)\.
- L\. Team, A\. Shen, B\. Li, B\. Hu, B\. Jing, C\. Chen, C\. Huang, C\. Zhang, C\. Yang, C\. Lin, C\. Wen, C\. Li, D\. Zhao, D\. Yuan, D\. You, F\. Mao, F\. Meng, F\. Xu, G\. Li, G\. Wang, H\. Dai, H\. Zheng, H\. Liu, J\. Guo, J\. Liu, J\. Liu, J\. Fu, J\. Shi, J\. Wang, J\. Lai, J\. Yang, J\. Mei, J\. Zhou, J\. Zhao, J\. Zhao, K\. Xu, L\. Su, L\. Chen, L\. Tang, L\. Jiang, L\. Fu, L\. Xu, L\. Shi, L\. Liao, L\. Zheng, M\. Li, M\. Chen, Q\. Zuo, Q\. Cheng, Q\. Cao, Q\. Shi, Q\. Guo, S\. Zhu, S\. Wang, S\. Zheng, S\. Li, S\. Gu, S\. Chen, T\. Wu, T\. Zhang, T\. Zhang, T\. Zhou, T\. Bie, T\. Yang, W\. Hong, W\. Ren, W\. Chen, W\. Yu, W\. Zheng, X\. Wang, X\. Yan, X\. Wan, X\. Zhao, X\. Kong, X\. Tang, X\. Han, X\. Wang, X\. Yang, X\. Hu, Y\. Zhang, Y\. Sun, Y\. Shan, Y\. Wang, Y\. Xu, Y\. Liu, Y\. Guo, Y\. Wang, Y\. Yan, Y\. Wang, Y\. Guo, Z\. Li, Z\. Xu, Z\. Li, Z\. Zhang, Z\. Gui, Z\. Pan, Z\. Huang, Z\. Lan, Z\. Ding, Z\. Zhang, Z\. Li, Z\. Liu, Z\. Wang, and Z\. Wen \(2025\)Every step evolves: scaling reinforcement learning for trillion\-scale thinking model\.External Links:2510\.18855,[Link](https://arxiv.org/abs/2510.18855)Cited by:[§1](https://arxiv.org/html/2605.14220#S1.p3.1)\.
- R\. Team \(2025\)Introducing miles: rl framework to fire up large\-scale moe training\.Note:[https://lmsys\.org/blog/2025\-11\-19\-miles/](https://lmsys.org/blog/2025-11-19-miles/)Cited by:[§1](https://arxiv.org/html/2605.14220#S1.p2.1)\.
- R\. J\. Williams \(1992\)Simple statistical gradient\-following algorithms for connectionist reinforcement learning\.Machine Learning8,pp\. 229–256\.External Links:[Document](https://dx.doi.org/10.1007/BF00992696)Cited by:[§2](https://arxiv.org/html/2605.14220#S2.p3.2),[§3\.2](https://arxiv.org/html/2605.14220#S3.SS2.p1.1)\.
- A\. Yanget al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.External Links:[Link](https://arxiv.org/abs/2505.09388)Cited by:[§3\.2](https://arxiv.org/html/2605.14220#S3.SS2.p2.1)\.
- F\. Yao, L\. Liu, D\. Zhang, C\. Dong, J\. Shang, and J\. Gao \(2025\)On the rollout\-training mismatch in modern RL systems\.InNeurIPS 2025 Workshop on Efficient Reasoning,External Links:[Link](https://openreview.net/forum?id=8MHqvb4lK9)Cited by:[§1](https://arxiv.org/html/2605.14220#S1.p3.1),[§1](https://arxiv.org/html/2605.14220#S1.p8.1),[§5](https://arxiv.org/html/2605.14220#S5.p1.1)\.
- Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, Y\. Yue, W\. Dai, T\. Fan, G\. Liu, L\. Liu,et al\.\(2025\)DAPO: an open\-source LLM reinforcement learning system at scale\.arXiv preprint arXiv:2503\.14476\.External Links:[Link](https://arxiv.org/abs/2503.14476)Cited by:[§2](https://arxiv.org/html/2605.14220#S2.p3.2),[§3\.2](https://arxiv.org/html/2605.14220#S3.SS2.p2.1),[§5](https://arxiv.org/html/2605.14220#S5.p1.1)\.
- B\. Zhang and R\. Sennrich \(2019\)Root mean square layer normalization\.InAdvances in Neural Information Processing Systems,Vol\.32,pp\. 12360–12371\.External Links:[Link](https://arxiv.org/abs/1910.07467)Cited by:[§3\.1](https://arxiv.org/html/2605.14220#S3.SS1.p3.1)\.
- Z\. Zhang, X\. Ding, J\. Yuan, R\. Liu, H\. Mao, J\. Xing, and Z\. Liu \(2025\)Deterministic inference across tensor parallel sizes that eliminates training\-inference mismatch\.External Links:2511\.17826,[Link](https://arxiv.org/abs/2511.17826)Cited by:[§5](https://arxiv.org/html/2605.14220#S5.p2.1)\.
- Y\. Zhao, A\. Gu, R\. Varma, L\. Luo, C\. Huang, M\. Xu, L\. Wright, H\. Shojanazeri, M\. Ott, S\. Shleifer, A\. Desmaison, C\. Balioglu, P\. Damania, B\. Nguyen, G\. Chauhan, Y\. Hao, A\. Mathews, and S\. Li \(2023\)PyTorch fsdp: experiences on scaling fully sharded data parallel\.External Links:2304\.11277,[Link](https://arxiv.org/abs/2304.11277)Cited by:[§1](https://arxiv.org/html/2605.14220#S1.p5.1),[§2](https://arxiv.org/html/2605.14220#S2.p1.1)\.
- C\. Zheng, K\. Dang, B\. Yu, M\. Li, H\. Jiang, J\. Lin, Y\. Liu, A\. Yang, J\. Zhou, and J\. Lin \(2025a\)Stabilizing reinforcement learning with llms: formulation and practices\.arXiv preprint arXiv:2512\.01374\.Cited by:[§1](https://arxiv.org/html/2605.14220#S1.p3.1),[§5](https://arxiv.org/html/2605.14220#S5.p1.1)\.
- C\. Zheng, S\. Liu, M\. Li, X\. Chen, B\. Yu, C\. Gao, K\. Dang, Y\. Liu, R\. Men, A\. Yang, J\. Zhou, and J\. Lin \(2025b\)Group sequence policy optimization\.arXiv preprint arXiv:2507\.18071\.External Links:[Link](https://arxiv.org/abs/2507.18071)Cited by:[§5](https://arxiv.org/html/2605.14220#S5.p1.1)\.
- L\. Zheng, L\. Yin, Z\. Xie, C\. Sun, J\. Huang, C\. H\. Yu, S\. Cao, C\. Kozyrakis, I\. Stoica, J\. E\. Gonzalez, C\. Barrett, and Y\. Sheng \(2024\)SGLang: efficient execution of structured language model programs\.External Links:2312\.07104,[Link](https://arxiv.org/abs/2312.07104)Cited by:[§2](https://arxiv.org/html/2605.14220#S2.p1.1)\.
- D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving (2019) Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593. Cited by: [§1](https://arxiv.org/html/2605.14220#S1.p1.1).

## Appendix AAdditional Details

### A\.1Experimental Settings

| Setting | Dense GRPO | Dense REINFORCE | MoE REINFORCE |
| --- | --- | --- | --- |
| Model | Qwen3-1.7B | Qwen3-1.7B | Qwen3-30B-A3B |
| RL algorithm | GRPO | REINFORCE | REINFORCE |
| Training data | Sanity-Test-R1D-1.5B | Sanity-Test-R1D-1.5B | DAPO-Math-17k |
| Evaluation | AIME 2024 every 50 steps | AIME 2024 every 50 steps | AIME 2024 every 20 steps |
| Batching | Global batch 64, mini-batch 16, rollout group 8 | Global batch 64 | Global batch 512 |
| Sequence length | Prompt 1024, response 8192 | Prompt 1024, response 8192 | Prompt 2048, response 20480 |
| Hardware | 1 node, 8 GPUs (H100) | 2 nodes, 16 GPUs (H100) | 8 nodes, 64 GPUs (H100) |
| Engine | FSDP2 + vLLM/VeXact | FSDP2 + vLLM/VeXact | FSDP2 + vLLM/VeXact |

Table 2: Experimental settings corresponding to the dense GRPO, dense REINFORCE, and MoE REINFORCE recipes.

### A\.2Additional REINFORCE Results

![Refer to caption](https://arxiv.org/html/2605.14220v1/x24.png)(a) MoE $\delta_t$ (max)
![Refer to caption](https://arxiv.org/html/2605.14220v1/x25.png)(b) MoE sequence loss
![Refer to caption](https://arxiv.org/html/2605.14220v1/x26.png)(c) Dense $\delta_t$ (max)
![Refer to caption](https://arxiv.org/html/2605.14220v1/x27.png)(d) Dense sequence loss

Figure 7: Additional metrics of REINFORCE experiments comparing vLLM non-exact rollout with VeXact. Each row reports $\delta_t$ (max) and sequence loss. Top row: Qwen3-30B-A3B MoE. Bottom row: Qwen3-1.7B dense.

### A\.3Additional DAPO Results

![Refer to caption](https://arxiv.org/html/2605.14220v1/x28.png)(a) Training reward.
![Refer to caption](https://arxiv.org/html/2605.14220v1/x29.png)(b) Validation reward.

Figure 8: Qwen3-1.7B GRPO experiments on the DAPO dataset, comparing VeXact with vLLM in recomputation and bypass modes. VeXact maintains training stability better.

### A\.4Expanded Objectives for Post\-hoc Patch Variants in Section[4\.2](https://arxiv.org/html/2605.14220#S4.SS2)

Let $r^{\mathrm{rollout}}_{\mathrm{ppo},t}$ be the rollout-side PPO ratio defined in Eq. [3](https://arxiv.org/html/2605.14220#S4.E3), and let $\mathcal{L}_{\mathrm{PPO}}$ denote the token-level clipped surrogate.

#### srs\-k3\-corr\-ratio\.

Sequence-level rejection using the correction ratio $r_{\mathrm{corr}}$ as the filtering signal:

$$\mathcal{L}_{\mathrm{srs\text{-}k3\text{-}corr}} = \mathbf{1}\left[\sum_{t=1}^{T} K_3(r_{\mathrm{corr},t}) \leq \tau_{\mathrm{seq}}\right] \sum_{t=1}^{T} \mathcal{L}_{\mathrm{PPO}}\left(r^{\mathrm{rollout}}_{\mathrm{ppo},t}, A_{t}\right). \tag{8}$$

#### srs\-k3\-ppo\-ratio\.

Sequence-level rejection using the rollout-side PPO ratio $r^{\mathrm{rollout}}_{\mathrm{ppo}}$ as the filtering signal:

$$\mathcal{L}_{\mathrm{srs\text{-}k3\text{-}ppo}}=\mathbf{1}\left[\sum_{t=1}^{T}K3(r^{\mathrm{rollout}}_{\mathrm{ppo},1:T})\leq\tau_{\mathrm{seq}}\right]\sum_{t=1}^{T}\mathcal{L}_{\mathrm{PPO}}\left(r^{\mathrm{rollout}}_{\mathrm{ppo},t},A_{t}\right). \tag{9}$$
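The two sequence-level rejection objectives in Eqs. 8 and 9 differ only in which ratio feeds the filter, so they can be illustrated with one function. The following is a minimal sketch, not the authors' implementation. It assumes that K3 is the common "k3" KL estimator $(r-1)-\log r$ applied per token and summed over the sequence, that $\mathcal{L}_{\mathrm{PPO}}$ is the standard clipped surrogate with a hypothetical clip range `eps=0.2`, and that the sign convention (maximize or negate) is left to the caller. All function names are illustrative.

```python
import math


def k3(ratio):
    # Assumed definition of the "k3" KL estimator: (r - 1) - log r.
    # It is non-negative and equals zero when r == 1.
    return (ratio - 1.0) - math.log(ratio)


def clipped_surrogate(ratio, advantage, eps=0.2):
    # Standard PPO clipped surrogate for one token.
    # eps is a hypothetical clip range, not a value from the paper.
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def sequence_gate(filter_ratios, tau_seq):
    # Indicator from Eqs. 8 and 9: keep the sequence only if the summed
    # per-token k3 values stay within tau_seq.
    return 1.0 if sum(k3(r) for r in filter_ratios) <= tau_seq else 0.0


def srs_k3_objective(filter_ratios, ppo_ratios, advantages, tau_seq=0.001):
    # filter_ratios: r_corr for srs-k3-corr-ratio (Eq. 8),
    #                r_ppo^rollout for srs-k3-ppo-ratio (Eq. 9).
    # ppo_ratios:    rollout-side PPO ratios used inside the surrogate.
    gate = sequence_gate(filter_ratios, tau_seq)
    total = sum(clipped_surrogate(r, a) for r, a in zip(ppo_ratios, advantages))
    return gate * total
```

Passing the same list as both `filter_ratios` and `ppo_ratios` gives the Eq. 9 variant; passing the correction ratios as `filter_ratios` gives the Eq. 8 variant.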

#### tis\-srs\-k3\-corr\-ratio\.

Token-level truncation followed by sequence-level rejection, using the correction ratio $r_{\mathrm{corr}}$ as the filtering signal:

$$\mathcal{L}_{\mathrm{tis\text{-}srs\text{-}k3\text{-}corr}}=\mathbf{1}\left[\sum_{t=1}^{T}K3(r_{\mathrm{corr},1:T})\leq\tau_{\mathrm{seq}}\right]\sum_{t=1}^{T}\min\left(r_{\mathrm{corr},t},\tau_{\mathrm{tok}}\right)\mathcal{L}_{\mathrm{PPO}}\left(r^{\mathrm{train}}_{\mathrm{ppo},t},A_{t}\right). \tag{10}$$

#### tis\-srs\-k1\-corr\-ratio\.

Token-level truncation followed by sequence-level rejection, using the correction ratio $r_{\mathrm{corr}}$ as the filtering signal, with the K1 estimator in place of K3:

$$\mathcal{L}_{\mathrm{tis\text{-}srs\text{-}k1\text{-}corr}}=\mathbf{1}\left[\sum_{t=1}^{T}K1(r_{\mathrm{corr},1:T})\leq\tau_{\mathrm{seq}}\right]\sum_{t=1}^{T}\min\left(r_{\mathrm{corr},t},\tau_{\mathrm{tok}}\right)\mathcal{L}_{\mathrm{PPO}}\left(r^{\mathrm{train}}_{\mathrm{ppo},t},A_{t}\right). \tag{11}$$

In our implementation, we set $\tau_{\mathrm{tok}}=2$ and $\tau_{\mathrm{seq}}=0.001$.
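To make the combined truncation-plus-rejection variants (Eqs. 10 and 11) concrete, the sketch below evaluates one sequence with the stated thresholds. It is an illustration under assumptions, not the paper's code. K1 and K3 are taken to be the common per-token estimators $-\log r$ and $(r-1)-\log r$, summed over the sequence, and the clipped surrogate uses a hypothetical clip range `eps=0.2`. All function names are hypothetical.

```python
import math

TAU_TOK = 2.0    # token-level truncation threshold stated in the text
TAU_SEQ = 0.001  # sequence-level rejection threshold stated in the text


def k1(ratio):
    # Assumed "k1" estimator: -log r (signed, can be negative).
    return -math.log(ratio)


def k3(ratio):
    # Assumed "k3" estimator: (r - 1) - log r (non-negative).
    return (ratio - 1.0) - math.log(ratio)


def clipped_surrogate(ratio, advantage, eps=0.2):
    # Standard PPO clipped surrogate; eps is a hypothetical clip range.
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def tis_srs_objective(corr_ratios, train_ppo_ratios, advantages,
                      estimator=k3, tau_tok=TAU_TOK, tau_seq=TAU_SEQ):
    # Sequence-level rejection: drop the whole sequence when the summed
    # estimator over the correction ratios exceeds tau_seq.
    if sum(estimator(r) for r in corr_ratios) > tau_seq:
        return 0.0
    # Token-level truncation: weight each token's surrogate by
    # min(r_corr, tau_tok), using the training-side PPO ratio inside.
    total = 0.0
    for r_corr, r_ppo, adv in zip(corr_ratios, train_ppo_ratios, advantages):
        total += min(r_corr, tau_tok) * clipped_surrogate(r_ppo, adv)
    return total
```

Calling `tis_srs_objective(..., estimator=k3)` corresponds to tis-srs-k3-corr-ratio and `estimator=k1` to tis-srs-k1-corr-ratio. The surrogate here takes the training-side ratio $r^{\mathrm{train}}_{\mathrm{ppo},t}$, in contrast to the rollout-side ratio used in Eqs. 8 and 9.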
