# Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning
Source: [https://arxiv.org/html/2604.16918](https://arxiv.org/html/2604.16918)
Weiyu Ma¹, Yongcheng Zeng², Yan Song³, Xinyu Cui², Jian Zhao⁴, Xuhui Liu¹, Mohamed Elhoseiny¹
¹King Abdullah University of Science and Technology (KAUST) ²Chinese Academy of Sciences, Institute of Automation (CASIA) ³AI Centre, Department of Computer Science, University College London ⁴Zhongguancun Institute of Artificial Intelligence
weiyu.ma@kaust.edu.sa, mohamed.elhoseiny@kaust.edu.sa
###### Abstract
Reinforcement Learning (RL) has achieved impressive success in post-training Large Language Models (LLMs) and Vision-Language Models (VLMs), with on-policy algorithms such as PPO, GRPO, and REINFORCE++ serving as the dominant paradigm. However, these methods discard all collected trajectories after a single gradient update, resulting in poor sample efficiency; this is particularly wasteful for agentic tasks where multi-turn environment interactions are expensive. While Experience Replay drives sample efficiency in classic RL by allowing agents to reuse past trajectories and prioritize informative ones, directly applying Prioritized Experience Replay (PER) to LLMs fails. The rapid policy evolution of billion-parameter models renders stored priorities stale, causing old high-priority trajectories to dominate sampling long after they have become uninformative. We propose FreshPER, which addresses this *priority staleness* problem by augmenting any PER-based priority with a multiplicative exponential age decay grounded in effective sample size analysis. To the best of our knowledge, FreshPER is the first work to successfully apply PER to LLM/VLM reinforcement learning. We evaluate on eight multi-step agentic, reasoning, and math competition tasks with 0.5B, 3B, and 7B models. FreshPER significantly outperforms on-policy baselines, achieving +46% on NQ Search, +367% on Sokoban, and +133% on VLM FrozenLake, while standard PER without age decay consistently degrades performance. Our code is publicly available at [https://github.com/Vision-CAIR/Freshness-Aware-PER](https://github.com/Vision-CAIR/Freshness-Aware-PER).
## 1 Introduction
Figure 1: Overview of the FreshPER training pipeline. Top: on-policy loop, in which the behavior policy $\pi_\mu$ (vLLM inference) interacts with agentic environments and the current policy $\pi_\theta$ (DeepSpeed training) is updated via policy gradient on fresh data. Bottom: off-policy loop, in which trajectories with their behavior log-probs and rewards are stored in the replay buffer on the CPU controller. An asynchronous thread refreshes priorities via $p_i \leftarrow p_i^{\text{base}} \cdot \exp(-\Delta_i/\tau)$, and prioritized batches are sampled for additional off-policy training.

Reinforcement learning (RL) has become a transformative technique for large language models (LLMs). RLHF (Ouyang et al., [2022](https://arxiv.org/html/2604.16918#bib.bib15)) played a central role in producing ChatGPT, demonstrating that RL-based post-training can dramatically improve the usability and safety of LLMs. More recently, OpenAI o1 (OpenAI, [2024](https://arxiv.org/html/2604.16918#bib.bib74)) and DeepSeek-R1 (DeepSeek-AI, [2025](https://arxiv.org/html/2604.16918#bib.bib21)) showed that RL can unlock advanced reasoning capabilities, achieving expert-level performance on mathematics and coding benchmarks. These successes have been powered by on-policy policy gradient algorithms: PPO (Schulman et al., [2017](https://arxiv.org/html/2604.16918#bib.bib7)), GRPO (Shao et al., [2024](https://arxiv.org/html/2604.16918#bib.bib18)), and REINFORCE++ (Hu et al., [2025](https://arxiv.org/html/2604.16918#bib.bib23)), which remain the dominant training paradigm.
A particularly exciting frontier is *agentic RL*, where LLMs and vision-language models (VLMs) interact with external environments over multiple turns: searching the web (Jin et al., [2025](https://arxiv.org/html/2604.16918#bib.bib72)), executing code (Wei et al., [2025](https://arxiv.org/html/2604.16918#bib.bib60)), calling tools (Schick et al., [2023](https://arxiv.org/html/2604.16918#bib.bib53)), and navigating visual environments (Driess et al., [2023](https://arxiv.org/html/2604.16918#bib.bib54)). Unlike single-turn preference alignment, agentic RL requires the model to take sequential actions and receive feedback from a live environment, bringing LLM training much closer to the classical RL paradigm. This shift also introduces a critical new challenge: *environment interactions are very expensive*.
Consider training a search agent with REINFORCE++ on a retrieval-augmented QA task (Jin et al., [2025](https://arxiv.org/html/2604.16918#bib.bib72)). Each prompt generates multiple rollout trajectories (e.g., 8 per prompt), and each trajectory involves up to 5 search turns. For a batch of 128 prompts, this yields over 5,000 retrieval calls per iteration, each requiring embedding computation and vector index lookup on dedicated hardware. The rollout stage alone dominates training time, often exceeding 70% of the total wall-clock cost (Yu et al., [2025a](https://arxiv.org/html/2604.16918#bib.bib75)). Yet on-policy algorithms use these expensive trajectories for a single gradient update and then discard them entirely ([Fig. 2](https://arxiv.org/html/2604.16918#S1.F2)).
Figure 2: On-policy LLM RL algorithms (PPO, REINFORCE++, GRPO) use each trajectory for a single gradient update and then discard it ($\otimes$), regardless of its learning potential.

In classical RL, *experience replay* (Lin, [1992](https://arxiv.org/html/2604.16918#bib.bib1); Mnih et al., [2015](https://arxiv.org/html/2604.16918#bib.bib2)) and its prioritized variant PER (Schaul et al., [2016](https://arxiv.org/html/2604.16918#bib.bib3)) are the standard solutions to this problem, enabling agents to reuse past experiences and prioritize the most informative ones. However, directly applying PER to LLM RL fails. The core issue is *priority staleness*: LLM policies evolve rapidly due to large gradient updates on long token sequences, causing old high-priority trajectories to dominate sampling long after they have become uninformative or even detrimental. Standard PER has no mechanism to account for this temporal degradation.
We present FreshPER, which addresses priority staleness by augmenting any PER base priority with a multiplicative exponential age decay, directly motivated by the exponential decay of effective sample size (ESS) as the policy diverges from the behavior policy. This simple mechanism ensures that even the highest-priority old trajectory is eventually deprioritized below a fresh trajectory of moderate priority, while preserving the informativeness-driven sampling of standard PER.
The main contributions of this work are summarized as follows:
- To the best of our knowledge, we are the first to successfully apply PER to LLM/VLM RL. We identify *priority staleness* as a key failure mode and propose freshness-aware age decay grounded in importance sampling theory.
- We implement a complete off-policy training pipeline with trajectory-level replay, integrated into the ROLL framework (Wang et al., [2025a](https://arxiv.org/html/2604.16918#bib.bib67)).
- We demonstrate consistent improvements across eight environments with 0.5B, 3B, and 7B models, achieving +46% on NQ Search, +367% on Sokoban, and +133% on VLM FrozenLake, while standard PER without age decay consistently degrades performance.
## 2 Related Work
Reinforcement learning has become a central technique for post-training LLMs and VLMs, spanning preference alignment (Ouyang et al., [2022](https://arxiv.org/html/2604.16918#bib.bib15)), reasoning (DeepSeek-AI, [2025](https://arxiv.org/html/2604.16918#bib.bib21); Shao et al., [2024](https://arxiv.org/html/2604.16918#bib.bib18)), agentic tasks (Yao et al., [2022](https://arxiv.org/html/2604.16918#bib.bib51); Jin et al., [2025](https://arxiv.org/html/2604.16918#bib.bib72); Wang et al., [2025b](https://arxiv.org/html/2604.16918#bib.bib57)), and visual reasoning (Huang et al., [2025](https://arxiv.org/html/2604.16918#bib.bib62); Shen et al., [2025](https://arxiv.org/html/2604.16918#bib.bib63)). Distributed training frameworks such as ROLL (Wang et al., [2025a](https://arxiv.org/html/2604.16918#bib.bib67)), veRL (Sheng et al., [2024](https://arxiv.org/html/2604.16918#bib.bib68)), and OpenRLHF (Hu et al., [2024](https://arxiv.org/html/2604.16918#bib.bib69)) provide the infrastructure to scale these methods. Below we review the three threads most relevant to our work.
#### Online and Offline RL for LLMs.
Online (on-policy) methods dominate LLM RL. PPO (Schulman et al., [2017](https://arxiv.org/html/2604.16918#bib.bib7)) remains the workhorse of RLHF (Ouyang et al., [2022](https://arxiv.org/html/2604.16918#bib.bib15)); GRPO (Shao et al., [2024](https://arxiv.org/html/2604.16918#bib.bib18)) simplifies it by removing the critic; DAPO (Yu et al., [2025b](https://arxiv.org/html/2604.16918#bib.bib22)), REINFORCE++ (Hu et al., [2025](https://arxiv.org/html/2604.16918#bib.bib23)), and Dr. GRPO (Liu et al., [2025a](https://arxiv.org/html/2604.16918#bib.bib24)) further refine the policy gradient estimator. On the offline side, DPO (Rafailov et al., [2023](https://arxiv.org/html/2604.16918#bib.bib17)) and its iterative variants (Dong et al., [2024](https://arxiv.org/html/2604.16918#bib.bib25)) optimize from preference data without environment interaction. A common limitation across online methods is that trajectories are discarded after a single gradient update, while offline methods forgo interaction entirely.
#### Off-Policy RL for LLMs.
A middle ground is *off-policy* training, which reuses historical trajectories. Asynchronous RLHF (Noukhovitch et al., [2024](https://arxiv.org/html/2604.16918#bib.bib36)) and AReaL (Fu et al., [2025](https://arxiv.org/html/2604.16918#bib.bib33)) decouple generation from training with uniform replay. This introduces data staleness, prompting work on off-policy stability through importance-weight control (Roux et al., [2025](https://arxiv.org/html/2604.16918#bib.bib37); Zheng et al., [2025](https://arxiv.org/html/2604.16918#bib.bib38); Xi et al., [2025](https://arxiv.org/html/2604.16918#bib.bib39); Luo et al., [2026](https://arxiv.org/html/2604.16918#bib.bib40)). On the data reuse side, RLEP (Zhang et al., [2025](https://arxiv.org/html/2604.16918#bib.bib43)) replays correct trajectories, DOTS (Sun et al., [2025](https://arxiv.org/html/2604.16918#bib.bib44)) combines difficulty-targeted selection with replay, and LoRR (Liu et al., [2025b](https://arxiv.org/html/2604.16918#bib.bib46)) enables high replay ratios via parameter resets. Most recently, Fatemi et al. ([2026](https://arxiv.org/html/2604.16918#bib.bib42)) propose problem-level prioritized scheduling for RL post-training, but explicitly argue that transition-level PER is “ill-suited for sequence models” and instead use success-rate-based curriculum scheduling. In contrast, we demonstrate that trajectory-level PER *can* succeed in LLM RL when augmented with freshness-aware age decay. Across all existing approaches, none employ prioritized sampling that accounts for both sample informativeness and temporal freshness.
#### Experience Replay in Classic RL.
Experience replay (Lin, [1992](https://arxiv.org/html/2604.16918#bib.bib1)), popularized by DQN (Mnih et al., [2015](https://arxiv.org/html/2604.16918#bib.bib2)), stores past transitions for reuse. PER (Schaul et al., [2016](https://arxiv.org/html/2604.16918#bib.bib3)) assigns priority proportional to TD error so that “surprising” transitions are replayed more often, and has been extended to distributed settings (Horgan et al., [2018](https://arxiv.org/html/2604.16918#bib.bib9)) and high replay ratios (D’Oro et al., [2023](https://arxiv.org/html/2604.16918#bib.bib13); Schwarzer et al., [2023](https://arxiv.org/html/2604.16918#bib.bib14); Fedus et al., [2020](https://arxiv.org/html/2604.16918#bib.bib5)). Most related to our decay mechanism is FPER (Ma et al., [2022](https://arxiv.org/html/2604.16918#bib.bib71)), which discounts priority based on *replay count*. In contrast, our age decay is measured in *gradient steps* and directly grounded in the exponential ESS decay caused by policy divergence ([Section 3.3](https://arxiv.org/html/2604.16918#S3.SS3)). These techniques are well-established for fixed-dimensional state-action spaces but have not been adapted to the LLM setting, where trajectories are variable-length token sequences and policy drift is far more rapid.
Despite recent attempts at problem-level scheduling (Fatemi et al., [2026](https://arxiv.org/html/2604.16918#bib.bib42)), no prior work has successfully applied trajectory-level PER to LLM training. FreshPER bridges this gap by combining informativeness-driven sampling with explicit temporal decay grounded in ESS analysis.
## 3 Method: FreshPER

### 3.1 Problem Formulation

We model LLM RL as a multi-turn Markov Decision Process (MDP) $(\mathcal{S},\mathcal{A},T,R,\gamma)$. At each turn $t$, the state $s_t \in \mathcal{S}$ is the full conversation history, comprising the initial prompt concatenated with all prior assistant responses and environment observations. The action $a_t \in \mathcal{A}$ is the assistant response generated at the current turn. Upon receiving $a_t$, the environment returns an observation $o_t \in \mathcal{O}$. The transition function $T:\mathcal{S}\times\mathcal{A}\times\mathcal{O}\rightarrow\mathcal{S}$ is deterministic, defined by $s_{t+1} = s_t \oplus a_t \oplus o_t$, where $\oplus$ denotes sequence concatenation. The reward function $R:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ assigns a scalar reward to each state-action pair. The episode terminates after at most $H$ turns, and we set $\gamma=1$, corresponding to an undiscounted episodic return. Given a parameterized policy $\pi_\theta$, the objective is to maximize the expected discounted return:
$$\max_\theta\;\mathbb{E}_{\pi_\theta}\!\left[\sum_t \gamma^t r_t\right] = \max_\theta\;\mathbb{E}_{\pi_\theta}\!\left[\sum_t r_t\right].$$ (1)

Policy gradient methods optimize this objective via clipped importance ratios $\rho = \pi_\theta(a \mid s)/\pi_{\mathrm{old}}(a \mid s)$, where $\pi_{\mathrm{old}}$ denotes the current policy prior to the update. Accordingly, we define the state-value function $V_\pi(s_t) = \mathbb{E}_\pi[\sum_{k=0}^{\infty}\gamma^k r(s_{t+k}) \mid s_t]$ and the action-value function $Q_\pi(s_t,a_t) = \mathbb{E}_\pi[\sum_{k=0}^{\infty}\gamma^k r(s_{t+k}) \mid s_t, a_t]$, yielding the advantage function $A_\pi(s,a) \coloneqq Q_\pi(s,a) - V_\pi(s)$.
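To make the concatenation transition concrete, the following minimal Python sketch rolls out one episode of this MDP; `policy` and `env` are hypothetical callables standing in for the inference engine and the agentic environment, not the released code:

```python
def rollout(policy, env, prompt: str, max_turns: int):
    """One episode of the multi-turn MDP with s_{t+1} = s_t ⊕ a_t ⊕ o_t.

    Assumes policy(state) returns the assistant response a_t, and
    env.step(state, action) returns (observation, reward, done).
    """
    state, rewards = prompt, []
    for _ in range(max_turns):          # episode ends after at most H turns
        action = policy(state)
        obs, reward, done = env.step(state, action)
        state = state + action + obs    # deterministic concatenation transition
        rewards.append(reward)
        if done:
            break
    return state, sum(rewards)          # undiscounted episodic return (gamma = 1)
```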
### 3.2 Prioritized Experience Replay

Prioritized Experience Replay (PER) (Schaul et al., [2016](https://arxiv.org/html/2604.16918#bib.bib3)) maintains a replay buffer $\mathcal{B}$ of transitions $(s,a,r,s')$ and samples them with probability proportional to their priority:
$$P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha},$$ (2)

where $p_i$ is the priority of transition $i$ and $\alpha \in [0,1]$ controls the degree of prioritization. When $\alpha = 0$, the sampling distribution reduces to uniform sampling. The priority is typically set to the absolute temporal-difference (TD) error: $p_i = |\delta_i| + \epsilon$, where the TD error $\delta_i = r + \gamma V_\pi(s') - V_\pi(s)$ measures the discrepancy between the observed return $r + \gamma V_\pi(s')$ and the current value estimate $V_\pi(s)$. Large $|\delta_i|$ indicates “surprising” transitions with high learning potential.
To correct for the non-uniform sampling bias, importance sampling weights are applied:

$$w_i = \left(\frac{1}{N \cdot P(i)}\right)^{\beta} \Big/ \max_j w_j,$$ (3)

where $N = |\mathcal{B}|$ is the current number of transitions in the replay buffer, the $\max_j w_j$ normalization ensures weights are at most 1, and $\beta$ anneals from a small value to 1 over training, gradually increasing the correction.
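As a minimal NumPy sketch of Eqs. (2) and (3), normalizing by the maximum weight within the sampled batch (a common simplification of the global maximum):

```python
import numpy as np

def per_probabilities(priorities, alpha=0.6):
    """Eq. (2): P(i) = p_i^alpha / sum_k p_k^alpha."""
    p = np.asarray(priorities, dtype=np.float64) ** alpha
    return p / p.sum()

def per_is_weights(probs, sampled_idx, beta=0.4):
    """Eq. (3): w_i = (1 / (N * P(i)))^beta, normalized to be at most 1."""
    n = len(probs)
    w = (1.0 / (n * probs[sampled_idx])) ** beta
    return w / w.max()

probs = per_probabilities([0.1, 2.0, 0.5, 1.0])
idx = np.random.default_rng(0).choice(len(probs), size=2, p=probs)
weights = per_is_weights(probs, idx)   # rarely sampled items get larger corrections
```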
### 3.3 Priority Staleness and Exponential Age Decay

In standard Prioritized Experience Replay (PER) (Schaul et al., [2016](https://arxiv.org/html/2604.16918#bib.bib3)), each trajectory stored in the replay buffer carries a base priority $p_i^{\text{base}}$ that reflects its perceived learning value at the time of collection, such as the absolute TD error $|\delta_i|$, the reward magnitude $|r_i|$, or the absolute advantage $|\hat{A}_i|$. Once assigned, this priority remains fixed. As training progresses, however, the current policy $\pi_\theta$ gradually diverges from the behavior policy $\pi_\mu$ that originally generated the trajectory, and the priority recorded at collection time becomes an increasingly poor indicator of how useful the trajectory actually is for training the current policy. We refer to this growing mismatch as *priority staleness*.
This problem is especially pronounced in LLM RL, where two factors amplify the effect. First, because the policy contains billions of parameters, even a single gradient step can produce a substantial shift in the output distribution. Second, the action space is discrete: each action corresponds to a token, and the probability of selecting a given token can change drastically after only a few updates. Despite this rapid distributional shift, old trajectories with high priorities continue to dominate sampling long after they have become uninformative or even detrimental to learning.
When a trajectory collected under $\pi_\mu$ is reused to update $\pi_\theta$, the distributional mismatch between the two policies must be corrected through *importance weighting*. Concretely, each trajectory is weighted by the importance ratio
$$\rho = \frac{\pi_\theta(a \mid s)}{\pi_\mu(a \mid s)},$$ (4)

which re-scales the contribution of each sample so that the resulting gradient estimate is unbiased with respect to $\pi_\theta$. When the two policies are close, $\rho$ stays near 1 and the correction is benign. As they diverge, however, $\rho$ becomes highly variable: a few samples receive very large weights while most receive weights close to zero. In this regime, the gradient estimate is dominated by a small number of samples, and the majority of the replay buffer contributes little to learning.
The *effective sample size* (ESS) (Kong, [1992](https://arxiv.org/html/2604.16918#bib.bib12)) formalizes this intuition. Given $n$ importance-weighted samples, the ESS quantifies how many samples drawn directly from $\pi_\theta$ would yield an estimate of equivalent statistical quality:
$$\mathrm{ESS} = \frac{n}{1 + \mathrm{Var}_{\pi_\mu}[\rho]},$$ (5)

where $\mathrm{Var}[\rho] = \mathbb{E}[\rho^2] - (\mathbb{E}[\rho])^2$. When $\pi_\theta = \pi_\mu$, every importance weight equals 1, $\mathrm{Var}[\rho] = 0$, and all $n$ samples are fully effective ($\mathrm{ESS} = n$). As the variance of $\rho$ increases, more samples are effectively wasted and the ESS shrinks. Eq. ([5](https://arxiv.org/html/2604.16918#S3.E5)) reveals that the ESS is entirely governed by $\mathrm{Var}[\rho]$. To understand how fast the ESS decays as the policies diverge, it suffices to characterize how $\mathrm{Var}[\rho]$ grows over the course of training. To connect $\mathrm{Var}[\rho]$ to a divergence measure, we introduce the $\chi^2$-divergence, defined as
$$\chi^2(P \,\|\, Q) = \mathbb{E}_Q\!\left[\left(\frac{P(x)}{Q(x)} - 1\right)^{\!2}\right],$$ (6)

which measures how far the likelihood ratio $P/Q$ deviates from 1 on average under $Q$. Since the importance ratio $\rho = \pi_\theta/\pi_\mu$ is precisely such a likelihood ratio, we can expand Eq. ([6](https://arxiv.org/html/2604.16918#S3.E6)) as
$$\chi^2(\pi_\theta \,\|\, \pi_\mu) = \mathbb{E}_{\pi_\mu}[\rho^2] - 2\,\mathbb{E}_{\pi_\mu}[\rho] + 1.$$ (7)

A fundamental property of importance weights is that their expectation under the proposal distribution is always 1:
$$\mathbb{E}_{\pi_\mu}[\rho] = \int \pi_\mu(x)\,\frac{\pi_\theta(x)}{\pi_\mu(x)}\,dx = \int \pi_\theta(x)\,dx = 1,$$ (8)

since $\pi_\theta$ is a valid probability distribution. Substituting $\mathbb{E}_{\pi_\mu}[\rho] = 1$ into both Eq. ([7](https://arxiv.org/html/2604.16918#S3.E7)) and the standard variance formula $\mathrm{Var}[\rho] = \mathbb{E}[\rho^2] - (\mathbb{E}[\rho])^2$, both expressions reduce to $\mathbb{E}_{\pi_\mu}[\rho^2] - 1$, yielding the identity
$$\mathrm{Var}_{\pi_\mu}[\rho] = \chi^2(\pi_\theta \,\|\, \pi_\mu).$$ (9)

This result converts the problem of bounding $\mathrm{Var}[\rho]$ into the problem of bounding the $\chi^2$-divergence between the two policies.
We now link the $\chi^2$-divergence back to the KL divergence. The bridge is the Rényi divergence of order 2, denoted $D_2$, which satisfies $D_2(P \,\|\, Q) = \log \mathbb{E}_Q[(P/Q)^2]$ by definition. Combined with Eq. ([9](https://arxiv.org/html/2604.16918#S3.E9)), this gives the exact relation $\chi^2 = \exp(D_2) - 1$ (Metelli et al., [2020](https://arxiv.org/html/2604.16918#bib.bib11)). Moreover, Rényi divergence is non-decreasing in its order, so $D_2 \geq D_{\mathrm{KL}}$. Applying the monotonicity of the exponential function, we obtain
$$\mathrm{Var}_{\pi_\mu}[\rho] = \exp(D_2) - 1 \;\geq\; \exp(D_{\mathrm{KL}}) - 1.$$ (10)
Plugging this into the ESS formula, we obtain:
$$\mathrm{ESS} \;\leq\; \frac{n}{1 + \exp(D_{\mathrm{KL}}) - 1} = n \cdot \exp(-D_{\mathrm{KL}}).$$ (11)

Each sample’s effective contribution to learning therefore decays *exponentially* with the KL divergence between the two policies.
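As a quick numeric sanity check of this exponential decay (a toy model of our making, not the paper's training setup), we can draw lognormal importance ratios with $\mathbb{E}[\rho]=1$ and watch the standard self-normalized ESS estimator shrink as the divergence grows:

```python
import numpy as np

def ess(log_rho):
    """Self-normalized ESS estimate (sum rho)^2 / sum rho^2, approximating Eq. (5)."""
    rho = np.exp(log_rho)
    return rho.sum() ** 2 / (rho ** 2).sum()

rng = np.random.default_rng(0)
n = 100_000
for sigma in (0.25, 0.5, 1.0):
    # Toy model: log rho ~ N(-sigma^2/2, sigma^2) so that E[rho] = 1. Here the
    # order-2 Rényi divergence is D2 = sigma^2, and Eq. (11) predicts
    # ESS/n <= exp(-D_KL), with equality at exp(-D2) for this family.
    log_rho = rng.normal(-sigma**2 / 2, sigma, n)
    print(f"sigma={sigma}: ESS/n = {ess(log_rho)/n:.3f} vs exp(-D2) = {np.exp(-sigma**2):.3f}")
```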
In this work, we heuristically approximate the KL divergence between the two policies by $c \cdot \Delta$, where $c > 0$ is a constant determined by the learning rate and the typical gradient magnitude, and $\Delta$ denotes the number of gradient steps elapsed since collection. Substituting this approximation into Eq. ([11](https://arxiv.org/html/2604.16918#S3.E11)) yields:
$$\mathrm{ESS} \;\leq\; n \cdot \exp(-c \cdot \Delta).$$ (12)
Eq. ([12](https://arxiv.org/html/2604.16918#S3.E12)) motivates down-weighting older trajectories by an exponential factor in their age. In practice, the per-step divergence rate $c$ depends on training dynamics such as the learning rate schedule, gradient noise, and batch size, which are difficult to estimate reliably online. We therefore absorb $c$ into a single tunable hyperparameter $\tau = 1/c$, which we call the *age decay constant*, and define the age decay factor as
$$w_{\mathrm{age}}(\Delta) = \exp\!\left(-\frac{\Delta}{\tau}\right).$$ (13)

When $\tau$ is large, the decay is gentle and older samples retain substantial weight; when $\tau$ is small, the decay is rapid and the buffer effectively forgets old data more quickly. We compare this exponential form against polynomial decay alternatives in [Section 4.3](https://arxiv.org/html/2604.16918#S4.SS3).
### 3.4 Freshness-Aware Priority

Let $t_i$ be the global training step when trajectory $i$ was collected, and $t$ the current step. We define the *age* $\Delta_i = t - t_i$ and propose the following priority function:
$$p_i = \underbrace{p_i^{\text{base}}}_{\text{base priority}} \cdot \underbrace{\exp\!\left(-\frac{\Delta_i}{\tau}\right)}_{\text{age decay}},$$ (14)

where $p_i^{\text{base}}$ is an existing PER base priority and $\tau > 0$ is the age decay constant controlling the half-life of priority ($t_{1/2} = \tau \ln 2$). The base priority $p_i^{\text{base}}$ follows the classical PER intuition that higher-magnitude training signals indicate greater learning potential (Schaul et al., [2016](https://arxiv.org/html/2604.16918#bib.bib3)); it can be instantiated with any standard PER priority signal:
The base priority can be instantiated as reward magnitude $p_i^{\text{base}} = |r_i| + \epsilon$ for critic-free methods such as REINFORCE++ (Hu et al., [2025](https://arxiv.org/html/2604.16918#bib.bib23)) and GRPO (Shao et al., [2024](https://arxiv.org/html/2604.16918#bib.bib18)), advantage magnitude $p_i^{\text{base}} = |\hat{A}_i| + \epsilon$ for actor-critic methods such as PPO (Schulman et al., [2017](https://arxiv.org/html/2604.16918#bib.bib7)), or TD error $p_i^{\text{base}} = |\delta_i| + \epsilon$ following the classical PER formulation (Schaul et al., [2016](https://arxiv.org/html/2604.16918#bib.bib3)).
In all cases, the absolute value ensures that both high-reward and high-penalty trajectories receive high priority. The age decay term $\exp(-\Delta_i/\tau)$ is a modular layer applied on top of any base priority, compensating for the exponential ESS decay in Eq. ([11](https://arxiv.org/html/2604.16918#S3.E11)). This multiplicative formulation decouples the question of *what makes a sample informative* (base priority) from the question of *how stale a sample is* (age decay), and ensures that even the highest-priority old trajectory is eventually deprioritized below a fresh trajectory of moderate priority.
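A one-function sketch of Eq. (14), including a check of the half-life property $t_{1/2} = \tau \ln 2$ (illustrative code, not the released implementation):

```python
import math

def fresh_priority(base_priority: float, collected_step: int,
                   current_step: int, tau: float = 500.0) -> float:
    """Eq. (14): p_i = p_i^base * exp(-Delta_i / tau), age measured in gradient steps."""
    delta = current_step - collected_step
    return base_priority * math.exp(-delta / tau)

# Half-life check: after tau * ln(2) ≈ 347 steps (tau = 500) the priority has halved,
# so a stale high-priority trajectory eventually falls below a fresh moderate one.
half_life = 500 * math.log(2)
assert abs(fresh_priority(1.0, 0, round(half_life)) - 0.5) < 1e-3
```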
### 3.5 Sampling and Implementation Details

#### Proportional Sampling.
Sampling from the priority distribution (Eq. ([2](https://arxiv.org/html/2604.16918#S3.E2))) naïvely requires $O(N)$ time. We use a sum segment tree to achieve $O(\log N)$ per sample. To further reduce variance, we employ stratified sampling: the total priority mass $S = \sum_i p_i^\alpha$ is partitioned into $B$ equal segments (where $B$ is the batch size), and one trajectory is drawn uniformly from each segment.
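A compact sketch of this machinery, assuming a power-of-two capacity for simplicity (the released code may organize the tree differently):

```python
import random

class SumTree:
    """Binary sum tree over leaf priorities (capacity assumed a power of two).

    Leaves live at indices [capacity, 2*capacity); internal node i stores the
    sum of its two children, so tree[1] holds the total priority mass S.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)

    def update(self, idx: int, value: float) -> None:
        """Set leaf `idx` to `value` in O(log N) by propagating sums to the root."""
        i = idx + self.capacity
        self.tree[i] = value
        i //= 2
        while i >= 1:
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            i //= 2

    def find(self, mass: float) -> int:
        """Descend from the root to the leaf whose cumulative range contains `mass`."""
        i = 1
        while i < self.capacity:
            if mass < self.tree[2 * i]:
                i = 2 * i
            else:
                mass -= self.tree[2 * i]
                i = 2 * i + 1
        return i - self.capacity

def stratified_sample(tree: SumTree, batch_size: int) -> list:
    """Partition the total mass S into B equal segments and draw once per segment."""
    segment = tree.tree[1] / batch_size
    return [tree.find(random.uniform(k * segment, (k + 1) * segment))
            for k in range(batch_size)]
```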
#### Importance Sampling Correction.
Prioritized sampling introduces bias that we correct with importance sampling weights:
$$w_i = \left(\frac{1}{N \cdot P(i)}\right)^{\beta} \Big/ \max_j w_j,$$ (15)

where $N = |\mathcal{B}|$, $P(i) = p_i^\alpha / \sum_k p_k^\alpha$, and $\beta$ anneals from $\beta_0$ to 1.0 over training (following Eq. ([3](https://arxiv.org/html/2604.16918#S3.E3))). These weights scale each sample’s policy gradient loss $\ell_i$:
$$\mathcal{L}_{\text{replay}} = \frac{1}{B} \sum_{i=1}^{B} w_i \cdot \ell_i.$$ (16)
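In PyTorch terms, Eq. (16) is a weighted mean over the per-trajectory policy-gradient losses (a sketch; the tensor names are ours):

```python
import torch

def replay_loss(per_traj_losses: torch.Tensor, is_weights: torch.Tensor) -> torch.Tensor:
    """Eq. (16): L_replay = (1/B) * sum_i w_i * l_i.

    `per_traj_losses` holds one policy-gradient loss l_i per sampled trajectory;
    `is_weights` are the Eq. (15) corrections, detached so they rescale gradients
    without being differentiated through.
    """
    return (is_weights.detach() * per_traj_losses).mean()
```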
#### Priority Update.
The base priority and the age decay term are updated independently. Each iteration, a background CPU thread refreshes the age decay for all entries by recomputing $\exp(-\Delta_i/\tau)$ with current ages (Line 6 of Algorithm [1](https://arxiv.org/html/2604.16918#alg1)). The base priority depends on the chosen instantiation: for reward-based priorities it is fixed at collection time, while for advantage- or TD-error-based variants it is recomputed after each training step to reflect the current policy.
#### Overall Algorithm.
Algorithm [1](https://arxiv.org/html/2604.16918#alg1) summarizes the complete FreshPER training procedure.
**Algorithm 1** FreshPER Training

0: **Require:** policy $\pi_\theta$, environment $\mathcal{E}$, replay buffer $\mathcal{B}$, age decay constant $\tau$, replay ratio $K$
1: **for** each iteration **do**
2: &nbsp;&nbsp;Roll out $\pi_\mu$ in $\mathcal{E}$ to collect episodes $\{e_j\}$ with rewards $\{r_j\}$
3: &nbsp;&nbsp;Compute behavior log-probs $\log \pi_\mu(e_j)$ ▷ record before training
4: &nbsp;&nbsp;Store each $e_j$ into $\mathcal{B}$ with base priority $p_j^{\text{base}}$ and $\log \pi_\mu$
5: &nbsp;&nbsp;Train $\pi_\theta$ on the fresh batch $\{e_j\}$ via policy gradient ▷ on-policy update
6: &nbsp;&nbsp;Refresh priorities: $p_i \leftarrow p_i^{\text{base}} \cdot \exp(-\Delta_i/\tau)$ for all $i \in \mathcal{B}$ ▷ async on CPU
7: &nbsp;&nbsp;**for** $k = 1, \ldots, K$ **do**
8: &nbsp;&nbsp;&nbsp;&nbsp;Sample batch $\mathcal{S}$ from $\mathcal{B}$ with $P(i) \propto p_i^\alpha$ ▷ stratified sampling
9: &nbsp;&nbsp;&nbsp;&nbsp;Compute IS weights $w_i$ to correct sampling bias (Eq. ([15](https://arxiv.org/html/2604.16918#S3.E15)))
10: &nbsp;&nbsp;&nbsp;&nbsp;Train $\pi_\theta$ on $\mathcal{S}$ with IS-weighted policy gradient
11: &nbsp;&nbsp;**end for**
12: &nbsp;&nbsp;Sync inference engine: $\pi_\mu \leftarrow \pi_\theta$
13: **end for**
[Fig. 1](https://arxiv.org/html/2604.16918#S1.F1) illustrates the overall training pipeline. The replay buffer and priority logic run on the CPU controller, naturally pipelining with GPU-bound inference and training. Further architectural details (eviction policy, asynchronous priority refresh, framework integration) are provided in the supplementary material.
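For readers who prefer code, here is a minimal Python rendering of Algorithm 1; every name (`policy`, `env`, `buffer`, and their methods) is an illustrative stand-in for the corresponding ROLL component, not the framework's actual API:

```python
def freshper_train(policy, env, buffer, tau=500.0, K=2, num_iterations=1000):
    """Algorithm 1 in sketch form; all attribute and method names are hypothetical."""
    for step in range(num_iterations):
        episodes = env.rollout(policy)                 # behavior policy pi_mu (inference)
        for ep in episodes:
            ep.behavior_logprobs = policy.logprobs(ep) # record before any update
            buffer.add(ep, base_priority=abs(ep.reward) + 1e-3, step=step)
        policy.update(episodes)                        # on-policy policy-gradient update
        buffer.refresh_priorities(step, tau)           # p_i <- p_i^base * exp(-Δ_i/τ)
        for _ in range(K):                             # K off-policy replay updates
            batch, sample_probs = buffer.sample_stratified()  # P(i) ∝ p_i^α
            weights = buffer.is_weights(sample_probs)         # Eq. (15)
            policy.update(batch, loss_weights=weights)        # Eq. (16)
        env.sync_inference_weights(policy)             # pi_mu <- pi_theta
```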
## 4 Experiments

We evaluate FreshPER across eight environments spanning text-only LLM and multimodal VLM settings. All experiments use REINFORCE++ (Hu et al., [2025](https://arxiv.org/html/2604.16918#bib.bib23)) as the policy gradient algorithm and are implemented on the ROLL framework (Wang et al., [2025a](https://arxiv.org/html/2604.16918#bib.bib67)).
### 4.1 Experimental Setup

#### Environments.
We select eight environments of varying difficulty and modality. On the LLM side, **NQ Search** (Jin et al., [2025](https://arxiv.org/html/2604.16918#bib.bib72)) is an agentic retrieval-augmented QA task on Natural Questions in which the model interleaves multi-step reasoning with search engine queries and produces a final answer scored by exact match; we use FAISS-based retrieval with E5 embeddings. **AIME** is a math competition task where the model solves American Invitational Mathematics Examination problems with integer answers (000–999), allowing up to 3 attempts per problem. **Sokoban** (ROLL built-in) is a box-pushing puzzle on a $6\times 6$ grid requiring multi-step planning with irreversible actions; we evaluate a *Simple* variant (larger rooms, fewer boxes) and a *Hard* variant (tighter layouts, more boxes), with scores ranging from $-1$ to $+3$. **CliffWalking** (ROLL built-in) is a $4\times 12$ grid navigation task (optimal score 0) that serves as a simple-environment control. **GSM8K** (Cobbe et al., [2021](https://arxiv.org/html/2604.16918#bib.bib73)) consists of grade-school math word problems on which the base model already exceeds 93% accuracy, serving as a near-saturated control. On the VLM side, **FrozenLake** (ROLL built-in) is a visual navigation task where a VLM agent navigates a $4\times 4$ grid rendered as RGB images, avoiding holes on slippery ice; **GeoQA** is a geometry question-answering task requiring the VLM to interpret diagrams and solve problems.
#### Models.
We use Qwen2.5-7B-Instruct (Qwen Team, [2024](https://arxiv.org/html/2604.16918#bib.bib65)) for NQ Search and AIME, Qwen2.5-0.5B-Instruct for all other LLM tasks (Sokoban, CliffWalking, GSM8K), and Qwen2.5-VL-3B-Instruct (Qwen Team, [2025](https://arxiv.org/html/2604.16918#bib.bib66)) for VLM tasks (FrozenLake, GeoQA). Full per-experiment hardware and configuration details are provided in the supplementary material.
#### RL Algorithm and Base Priority.
Since REINFORCE++ is critic-free, we instantiate the base priority in Eq. ([14](https://arxiv.org/html/2604.16918#S3.E14)) as $p_i^{\text{base}} = |r_i| + \epsilon$ (reward magnitude). When combined with actor-critic methods, the base priority can be replaced with advantage- or TD-error-based variants without modifying the age decay mechanism.
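The three instantiations from Section 3.4 amount to a one-line choice per algorithm family; in sketch form (the trajectory fields here are hypothetical):

```python
def base_priority(traj, kind: str = "reward", eps: float = 1e-3) -> float:
    """Base priority choices from Section 3.4: absolute magnitude plus a floor eps."""
    if kind == "reward":       # critic-free: REINFORCE++, GRPO
        return abs(traj.reward) + eps
    if kind == "advantage":    # actor-critic: PPO
        return abs(traj.advantage) + eps
    if kind == "td_error":     # classical PER
        return abs(traj.td_error) + eps
    raise ValueError(f"unknown base priority kind: {kind}")
```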
#### Baselines.
We compare three configurations: (1) **On-Policy**: standard REINFORCE++ without a replay buffer, as implemented in ROLL (Wang et al., [2025a](https://arxiv.org/html/2604.16918#bib.bib67)); (2) **Standard PER**: reward-based prioritized replay ($p_i = |r_i| + \epsilon$) without age decay, corresponding to conventional PER (Schaul et al., [2016](https://arxiv.org/html/2604.16918#bib.bib3)) adapted for LLM RL; and (3) **FreshPER (Ours)**: freshness-aware PER with age decay ($p_i = (|r_i| + \epsilon) \cdot \exp(-\Delta_i/\tau)$, $\tau = 500$ by default).
#### Implementation Details.
We implement FreshPER on the ROLL framework (Wang et al., [2025a](https://arxiv.org/html/2604.16918#bib.bib67)), using DeepSpeed for distributed training and vLLM for inference, with GPU counts ranging from 2 to 8 depending on model size (see supplementary material). Replay buffer capacity: 50K trajectories. Priority exponent $\alpha = 0.6$, IS exponent $\beta = 0.4$, age decay $\tau = 500$ by default, replay ratio $K = 2$, learning rate $1\times 10^{-6}$, clip $\epsilon = 0.2$. For off-policy training with smaller models (0.5B) and for AIME (7B), we use tighter advantage clipping and disable KL regularization (“Config A”; see supplementary material). AIME uses $\tau = 1000$.
### 4.2 Main Results
Figure 3: Learning curves on LLM tasks (panels: (a) NQ Search, (b) AIME, (c) Sokoban Simple, (d) Sokoban Hard, (e) FrozenLake). Blue ■: On-Policy; Yellow ▲: Standard PER; Red ●: FreshPER (Ours).
Figure 4: Learning curves on VLM tasks (panels: (a) VLM FrozenLake, (b) VLM GeoQA). Blue ■: On-Policy; Yellow ▲: Standard PER; Red ●: FreshPER (Ours).

[Table 1](https://arxiv.org/html/2604.16918#S4.T1) summarizes peak performance across all seven main experiments, and [Figs. 3](https://arxiv.org/html/2604.16918#S4.F3) and [4](https://arxiv.org/html/2604.16918#S4.F4) show the corresponding learning curves. FreshPER achieves the best peak performance on all seven tasks, with the largest gains on challenging agentic tasks (NQ Search +46%, Sokoban Simple +367%, VLM FrozenLake +133%) and consistent improvements on math competition (AIME +18%). Standard PER without age decay underperforms the on-policy baseline on most tasks, sometimes catastrophically (Sokoban Simple, NQ Search). We analyze these results in detail in [Section 4.3](https://arxiv.org/html/2604.16918#S4.SS3).
Table 1: Peak performance across main tasks. Best in **bold**. FreshPER achieves the best result on all tasks.
### 4.3 Analysis
Figure 5: Ablation of age decay constant $\tau$ on (a) Sokoban Simple and (b) FrozenLake (LLM). Blue ■: Baseline; Red ●: $\tau=500$; Orange ◆: $\tau=1000$; Gray ▼: $\tau=1500$. (a) Sokoban: $\tau=500$ is optimal; $\tau=1500$ fails completely. (b) FrozenLake: $\tau=1000$ is optimal; all $\tau$ values outperform Baseline.

#### The benefit of replay scales with task difficulty.
Figure 6: Control experiments on (a) CliffWalking and (b) GSM8K. Blue ■: On-Policy; Yellow ▲: Standard PER; Red ●: FreshPER (Ours). (a) CliffWalking: all methods converge to optimal; replay adds transient instability. (b) GSM8K: initial performance is already >93%; all methods saturate at ~97%.

Ablation on the age decay constant $\tau$ across Sokoban Simple and FrozenLake ([Fig. 5](https://arxiv.org/html/2604.16918#S4.F5)) reveals that $\tau$ is task-dependent: Sokoban, which exhibits rapid policy drift, requires aggressive decay ($\tau=500$), while the slower-evolving FrozenLake benefits from a gentler setting ($\tau=1000$). This pattern extends to the main results. The largest gains appear on tasks where on-policy training struggles or collapses: NQ Search (+46%), Sokoban Simple (+367%), and VLM FrozenLake (+133%). Conversely, control experiments on CliffWalking and GSM8K ([Fig. 6](https://arxiv.org/html/2604.16918#S4.F6)) show that when the task is simple enough for on-policy training to solve quickly, or the model is already near-saturated, replay provides little additional benefit. Taken together, these results suggest a practical guideline: *the harder the task and the faster the policy evolves, the more valuable freshness-aware replay becomes, and the more aggressively $\tau$ should be set*.
#### Priority staleness validates the need for age decay.
Standard PER without age decay underperforms the on-policy baseline on most tasks, and fails catastrophically on Sokoban Simple and NQ Search. This is consistent with the ESS decay analysis in [Section 3.3](https://arxiv.org/html/2604.16918#S3.SS3): without temporal decay, old trajectories with high base priorities dominate sampling even after they have become stale relative to the current policy. The contrast between Standard PER and FreshPER isolates the contribution of age decay, confirming that it is essential for applying PER to LLM RL.
#### Training stability.
Figure 7: IS correction ablation on FrozenLake (LLM). Blue ■: Baseline; Red ●: $\tau=500$; Green +: $\tau=500$ + IS. Adding IS ($\beta=0.4$) does not substantially improve peak performance but eliminates late-stage degradation: the IS variant maintains its peak (0.281) through the end of training.

Across multiple tasks, on-policy training peaks early and then degrades (e.g., Sokoban Simple, GeoQA). FreshPER consistently sustains its peak performance through the end of training. IS correction further enhances this stability without improving peak performance ([Fig. 7](https://arxiv.org/html/2604.16918#S4.F7)), validating the modular design in which age decay and IS correction address complementary aspects of off-policy learning.
#### Generalization to VLM.
FreshPER transfers to multimodal settings without modification, achieving +133% on VLM FrozenLake. This confirms that the age decay mechanism is agnostic to the observation modality.
## 5 Conclusion

We presented FreshPER, which, to the best of our knowledge, is the first method to successfully apply Prioritized Experience Replay to LLM and VLM reinforcement learning. The key enabler is freshness-aware age decay, which addresses the priority staleness problem caused by the rapid policy evolution of billion-parameter models. Evaluations across eight environments with models ranging from 0.5B to 7B parameters confirm that FreshPER consistently outperforms both on-policy baselines and standard PER, with particularly large gains on challenging agentic tasks. More broadly, our results suggest that classical sample-efficiency techniques from RL remain highly effective in the LLM era, provided they are adapted to handle the unique training dynamics of large language models.
#### Limitations and Future Work.
Our experiments span three model scales (0.5B, 3B, and 7B parameters), but the interaction of freshness decay with even larger models (e.g., 70B+) and different training dynamics remains to be studied. Automatic tuning of $\tau$ based on observed policy divergence rates is a promising direction for future work.
## Acknowledgement
This research was supported by funding from the King Abdullah University of Science and Technology (KAUST) Center of Excellence for Generative AI under Award No. 5940.
## References
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- P. D’Oro, M. Schwarzer, E. Nikishin, P. Bacon, M. G. Bellemare, and A. Courville (2023). Sample-efficient reinforcement learning by breaking the replay ratio barrier. In ICLR.
- DeepSeek-AI (2025). DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.
- H. Dong, W. Xiong, B. Pang, H. Wang, H. Zhao, Y. Zhou, N. Jiang, D. Sahoo, C. Xiong, and T. Zhang (2024). RLHF workflow: from reward modeling to online RLHF. arXiv preprint arXiv:2405.07863.
- D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, et al. (2023). PaLM-E: an embodied multimodal language model. In ICML.
- M. Fatemi, B. Rafiee, K. Asadi, Y. Li, Y. Tang, Y. Zhou, and D. Bahdanau (2026). Prioritized replay for RL post-training. arXiv preprint arXiv:2601.02648.
- W. Fedus, P. Ramachandran, A. Rajeswaran, C. Blundell, T. Lillicrap, et al. (2020). Revisiting fundamentals of experience replay. arXiv preprint arXiv:2007.06700.
- W. Fu, J. Gao, X. Shen, C. Zhu, S. Li, W. Zhang, Y. Wu, T. Xie, Y. Zhang, T. Yu, Z. Jia, and Z. Wang (2025). AReaL: a large-scale asynchronous reinforcement learning system for language reasoning. arXiv preprint arXiv:2505.24298.
- D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. van Hasselt, and D. Silver (2018). Distributed prioritized experience replay. In ICLR.
- J. Hu, J. K. Liu, H. Xu, and W. Shen (2025). REINFORCE++: stabilizing critic-free policy optimization with global advantage normalization. arXiv preprint arXiv:2501.03262.
- J. Hu, X. Wu, W. Shen, J. K. Liu, Z. Zhu, W. Wang, S. Jiang, H. Wang, H. Chen, B. Chen, et al. (2024). OpenRLHF: an easy-to-use, scalable and high-performance RLHF framework. arXiv preprint arXiv:2405.11143.
- W. Huang, B. Jia, Z. Zhai, et al. (2025). Vision-R1: incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749.
- B. Jin, H. Zeng, Z. Yue, J. Yoon, S. Arik, D. Wang, H. Zamani, and J. Han (2025). Search-R1: training LLMs to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516.
- A. Kong (1992). A note on importance sampling using standardized weights. University of Chicago, Department of Statistics, Technical Report 348.
- L. Lin (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning 8(3-4), pp. 293–321.
- Z. Liu, C. Chen, W. Li, P. Qi, T. Pang, C. Du, W. S. Lee, and M. Lin (2025a). Understanding R1-zero-like training: a critical perspective. arXiv preprint arXiv:2503.20783.
- Z. Liu, J. Wang, L. Song, and J. Bian (2025b). Sample-efficient LLM optimization with reset replay. arXiv preprint arXiv:2508.06412.
- Y. Luo, S. Han, Y. Hu, D. Li, and J. Hao (2026). R2VPO: ratio-variance regularized policy optimization for efficient LLM fine-tuning. arXiv preprint arXiv:2601.03320.
- J. Ma, D. Ning, C. Zhang, and S. Liu (2022). Fresher experience plays a more important role in prioritized experience replay. Applied Sciences 12(23), pp. 12489.
- A. M. Metelli, M. Papini, N. Montali, and M. Restelli (2020). Importance sampling techniques for policy optimization. JMLR 21(141), pp. 1–75.
- V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015). Human-level control through deep reinforcement learning. Nature 518(7540), pp. 529–533.
- M. Noukhovitch, S. Huang, S. Xhonneux, A. Hosseini, R. Agarwal, and A. Courville (2024). Asynchronous RLHF: faster and more efficient off-policy RL for language models. arXiv preprint arXiv:2410.18252.
- OpenAI (2024). Learning to reason with LLMs. [Link](https://openai.com/index/learning-to-reason-with-llms/).
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. (2022). Training language models to follow instructions with human feedback. In NeurIPS.
- Qwen Team (2024). Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.
- Qwen Team (2025). Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923.
- R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn (2023). Direct preference optimization: your language model is secretly a reward model. In NeurIPS.
- N. L. Roux, M. G. Bellemare, J. Lebensold, A. Bergeron, J. Greaves, et al. (2025). Tapered off-policy REINFORCE: stable and efficient reinforcement learning for LLMs. arXiv preprint arXiv:2503.14286.
- T. Schaul, J. Quan, I. Antonoglou, and D. Silver (2016). Prioritized experience replay. In ICLR.
- T. Schick, J. Dwivedi-Yu, R. Dessi, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom (2023). Toolformer: language models can teach themselves to use tools. In NeurIPS.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- M. Schwarzer, J. Obando-Ceron, A. Courville, M. Bellemare, R. Agarwal, and P. S. Castro (2023). Bigger, better, faster: human-level Atari with human-level efficiency. In ICML.
- Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. K. Li, Y. Wu, and D. Guo (2024). DeepSeekMath: pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300.
- H. Shen, P. Liu, J. Li, et al. (2025). VLM-R1: a stable and generalizable R1-style large vision-language model. arXiv preprint arXiv:2504.07615.
- G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu (2024). HybridFlow: a flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256.
- Y. Sun, J. Shen, Y. Wang, T. Chen, Z. Wang, et al. (2025). Improving data efficiency for LLM reinforcement fine-tuning through difficulty-targeted online data selection and rollout replay. arXiv preprint arXiv:2506.05316.
- T. van Erven and P. Harremoës (2014). Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory 60(7), pp. 3797–3820.
- W. Wang, S. Xiong, G. Chen, W. Gao, S. Guo, Y. He, J. Huang, J. Liu, Z. Li, X. Li, Z. Liu, H. Zhao, et al. (2025a). ROLL: reinforcement learning optimization for large-scale learning: an efficient and user-friendly scaling library. arXiv preprint arXiv:2506.06122.
- Z. Wang, K. Wang, Q. Wang, P. Zhang, L. Li, Z. Yang, K. Yu, M. Luu, J. Gao, and Y. Lu (2025b). RAGEN: understanding self-evolution in LLM agents via multi-turn reinforcement learning. arXiv preprint arXiv:2504.20073.
- Y. Wei, O. Duchenne, J. Copet, et al. (2025). SWE-RL: advancing LLM reasoning via reinforcement learning on open software evolution. arXiv preprint arXiv:2502.18449.
- Z. Xi, X. Guo, Y. Nan, E. Zhou, J. Shen, et al. (2025). BAPO: stabilizing off-policy reinforcement learning for LLMs via balanced policy optimization with adaptive clipping. arXiv preprint arXiv:2510.18927.
- S. Yao, H. Chen, J. Yang, and K. Narasimhan (2022). WebShop: towards scalable real-world web shopping environments. In NeurIPS.
- L. Yu, Z. Zeng, J. Chen, Q. Yu, J. Li, Y. Zhang, H. Yan, B. Yi, A. Liu, T. Ji, Z. Chen, D. Lin, J. Zhao, and Z. Zheng (2025a). ROLL Flash: accelerating RLVR and agentic training with asynchrony. arXiv preprint arXiv:2510.11345.
- Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, T. Fan, G. Chen, L. Liu, X. Liu, et al. (2025b). DAPO: an open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476.
- H. Zhang, J. Fu, J. Zhang, et al. (2025). RLEP: reinforcement learning with experience replay for LLM reasoning. arXiv preprint arXiv:2507.07451.
- H. Zheng, J. Zhao, and B. Chen (2025). Prosperity before collapse: how far can off-policy RL reach with stale data on LLMs? arXiv preprint arXiv:2510.01161.
## Appendix A System Design Details

### A.1 Replay Buffer Architecture

The replay buffer operates at the *trajectory level*: each entry stores a complete episode (prompt, all assistant turns, and environment observations) along with its behavior log-probabilities, reward, collection step $t_i$, and current priority $p_i$. This granularity naturally matches agentic RL, where episode-level rewards are the primary training signal.
When the buffer reaches capacity, we evict the entry with the lowest effective priority, i.e., the oldest, lowest-priority trajectory; this follows naturally from the age decay, since entries whose priority has decayed near zero are evicted first. The priority refresh step (Line 6 of Algorithm [1](https://arxiv.org/html/2604.16918#alg1) in the main paper) recomputes all priorities with current ages once per iteration. To avoid blocking the training loop, this $O(N)$ scan runs in a background CPU thread during on-policy GPU training, completing in 50–100 ms for $N = 100$K entries and fully overlapping with GPU computation.
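A minimal sketch of the combined refresh-and-evict pass; the entry layout (`base_priority`, `step`, `priority` fields) is a hypothetical stand-in for the actual buffer structure:

```python
import math
import threading

def refresh_and_evict(entries, current_step, tau=500.0, capacity=50_000):
    """O(N) priority refresh plus lowest-effective-priority eviction.

    Assumes `entries` is a list of objects with base_priority, step, and
    priority fields. Run on a background CPU thread so the linear scan
    overlaps with GPU-bound training, as described above.
    """
    for e in entries:
        e.priority = e.base_priority * math.exp(-(current_step - e.step) / tau)
    entries.sort(key=lambda e: e.priority, reverse=True)  # highest effective priority first
    del entries[capacity:]                                # decayed entries are evicted first

# Launched once per iteration, overlapping with the on-policy GPU update:
# threading.Thread(target=refresh_and_evict, args=(buf, step), daemon=True).start()
```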
### A.2 Framework Integration

We implement FreshPER on top of the ROLL framework (Wang et al., [2025a](https://arxiv.org/html/2604.16918#bib.bib67)), which provides a single-controller architecture that decouples inference and training across GPU pools. The replay buffer and all priority logic run on the CPU-side controller, naturally pipelining with GPU-bound training and inference. The training flow follows Algorithm [1](https://arxiv.org/html/2604.16918#alg1): fresh rollouts are first used for on-policy training, then stored into the buffer; the replay loop draws prioritized batches for $K$ additional off-policy updates per environment interaction.
## Appendix B Additional Ablation Studies

### B.1 Age Decay Constant $\tau$

We ablate $\tau \in \{500, 1000, 1500\}$ on Sokoban Simple and FrozenLake (LLM, 0.5B), comparing against On-Policy and Standard PER (equivalent to $\tau = \infty$).

\(a\) Sokoban Simple

\(b\) FrozenLake \(LLM\)
Figure 8:Ablation of age decay constantτ\\tau\.Blue■\\blacksquare: Baseline;Red∙\\bullet:τ=500\\tau\{=\}500;Orange◆\\blacklozenge:τ=1000\\tau\{=\}1000;Gray▼\\blacktriangledown:τ=1500\\tau\{=\}1500\.On Sokoban Simple \([fig\.8](https://arxiv.org/html/2604.16918#A2.F8)a\),τ=500\\tau\{=\}500achieves a peak score of 2\.30,τ=1000\\tau\{=\}1000reaches 1\.50, andτ=1500\\tau\{=\}1500completely fails \(−\-0\.90\), identical to Standard PER\. This environment exhibits rapid policy drift, requiring aggressive decay\. On FrozenLake \([fig\.8](https://arxiv.org/html/2604.16918#A2.F8)b\), the ranking shifts:τ=1000\\tau\{=\}1000achieves the highest peak \(0\.336\), followed byτ=1500\\tau\{=\}1500\(0\.328\) andτ=500\\tau\{=\}500\(0\.266\)\. Unlike Sokoban, evenτ=1500\\tau\{=\}1500remains effective, indicating that FrozenLake tolerates higher data staleness\.
### B.2 Importance Sampling Correction
We investigate whether importance sampling (IS) correction improves training stability when combined with freshness decay. We compare $\tau = 500$ with and without IS ($\beta = 0.4$) on FrozenLake.

Figure 9: IS correction ablation on FrozenLake (LLM). Blue ■: Baseline; Red ●: $\tau = 500$; Green +: $\tau = 500$ + IS.

[Fig. 9](https://arxiv.org/html/2604.16918#A2.F9) shows that IS correction does not substantially boost peak performance (+5.9%, from 0.266 to 0.281); its primary value lies in *training stability*. The IS variant is the only configuration whose final performance equals its peak (0.281 at both step 40 and step 390), whereas all other methods degrade in late training. The two mechanisms are thus complementary: freshness decay improves peak performance by suppressing stale priorities, while IS correction stabilizes training by compensating for distribution shift. This validates the modular design of FreshPER, where age decay and IS correction can be independently activated based on task requirements.
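For completeness, here is a sketch of the standard PER importance-sampling weights with $\beta = 0.4$. This is generic PER machinery (Schaul et al.'s $w_i = (N \cdot P(i))^{-\beta}$, max-normalized), shown as one plausible way the IS term could plug in, not necessarily FreshPER's exact form.

```python
import numpy as np

def per_is_weights(sample_probs, buffer_size, beta=0.4):
    """Standard PER importance-sampling weights: w_i = (N * P(i))^(-beta),
    normalized by the maximum weight for stability."""
    w = (buffer_size * np.asarray(sample_probs)) ** (-beta)
    return w / w.max()

# Usage with the buffer sketch from Appendix A.1: scale each replayed
# trajectory's loss term by its weight before averaging, e.g.
#   batch, probs = buffer.refresh_and_sample(step, 64)
#   weights = per_is_weights(probs, len(buffer.entries))
```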
### B.3 When Does Replay Help?
To understand the boundary conditions of FreshPER, we evaluate on two environments where replay is expected to offer little benefit.

Figure 10: Control experiments on (a) CliffWalking and (b) GSM8K. Blue ■: On-Policy; Yellow ▲: Standard PER; Red ●: FreshPER (Ours).

#### Too-Simple Environments (CliffWalking).
All three methods converge to the optimal score of 0 ([fig. 10](https://arxiv.org/html/2604.16918#A2.F10)a). On-Policy converges fastest and most stably, while Standard PER and FreshPER exhibit transient instability (PER shows spikes down to −17 at step 140). When the task is simple enough for on-policy training to solve quickly, replay provides no benefit and may introduce unnecessary off-policy noise.
#### Near-Saturated Models (GSM8K).
All methods cluster in the narrow 0.94–0.99 range ([fig. 10](https://arxiv.org/html/2604.16918#A2.F10)b). The base model already achieves 93–95% success; training provides only ~2–3% additional improvement. With so little room for improvement, replay-based methods cannot demonstrate meaningful advantages.
These control experiments delineate the regime where FreshPER is most valuable: tasks that are *challenging enough* that on-policy training either converges slowly or collapses, and where the model has *sufficient room for improvement*.
## Appendix C Derivations for Priority Staleness
This section provides detailed derivations for the results stated in [Section 3.3](https://arxiv.org/html/2604.16918#S3.SS3).
### C.1 Variance of Importance Weights Equals $\chi^2$-Divergence
###### Proposition 1.
Let $P$ and $Q$ be two probability distributions with $P$ absolutely continuous with respect to $Q$, and let $\rho(x) = P(x)/Q(x)$ denote the importance ratio. Then

$$\mathrm{Var}_Q[\rho] = \chi^2(P\|Q), \tag{17}$$

where $\chi^2(P\|Q) = \mathbb{E}_Q\!\left[(\rho - 1)^2\right]$ is the $\chi^2$-divergence.
###### Proof.
By the definition of variance:

$$\mathrm{Var}_Q[\rho] = \mathbb{E}_Q[\rho^2] - \left(\mathbb{E}_Q[\rho]\right)^2. \tag{18}$$

We first evaluate the mean of $\rho$ under $Q$:

$$\mathbb{E}_Q[\rho] = \int Q(x) \cdot \frac{P(x)}{Q(x)}\,dx = \int P(x)\,dx = 1, \tag{19}$$

where the last equality holds because $P$ is a valid probability distribution. Substituting into Eq. ([18](https://arxiv.org/html/2604.16918#A3.E18)):

$$\mathrm{Var}_Q[\rho] = \mathbb{E}_Q[\rho^2] - 1. \tag{20}$$

Now expand the $\chi^2$-divergence from its definition:

$$\chi^2(P\|Q) = \mathbb{E}_Q\!\left[(\rho - 1)^2\right] = \mathbb{E}_Q[\rho^2] - 2\,\mathbb{E}_Q[\rho] + 1 = \mathbb{E}_Q[\rho^2] - 1, \tag{21}$$

where we again used $\mathbb{E}_Q[\rho] = 1$. Comparing Eq. ([20](https://arxiv.org/html/2604.16918#A3.E20)) and Eq. ([21](https://arxiv.org/html/2604.16918#A3.E21)) completes the proof. ∎
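As a quick numerical sanity check of Proposition 1, consider a small discrete example; the distribution values below are arbitrary illustrative choices.

```python
import numpy as np

P = np.array([0.6, 0.3, 0.1])   # target distribution
Q = np.array([0.4, 0.4, 0.2])   # proposal distribution
rho = P / Q                     # importance ratio

var_rho = np.sum(Q * rho**2) - np.sum(Q * rho) ** 2  # Var_Q[rho], Eq. (18)
chi2 = np.sum(Q * (rho - 1) ** 2)                    # chi^2(P || Q)
assert np.isclose(var_rho, chi2)                     # Eq. (17) holds
```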
### C.2 Monotonicity of Rényi Divergence and the $D_2 \geq D_{\mathrm{KL}}$ Bound
The Rényi divergence of order $\alpha > 0$ ($\alpha \neq 1$) between distributions $P$ and $Q$ is defined as:

$$D_\alpha(P\|Q) = \frac{1}{\alpha - 1}\log \mathbb{E}_Q\!\left[\left(\frac{P(x)}{Q(x)}\right)^{\alpha}\right]. \tag{22}$$

Two standard properties are relevant here:
#### Property 1: Connection to $\chi^2$-divergence.
Setting $\alpha = 2$ in Eq. ([22](https://arxiv.org/html/2604.16918#A3.E22)):

$$D_2(P\|Q) = \log \mathbb{E}_Q[\rho^2] = \log\left(1 + \chi^2(P\|Q)\right), \tag{23}$$

where we used $\mathbb{E}_Q[\rho^2] = 1 + \chi^2$ from Eq. ([21](https://arxiv.org/html/2604.16918#A3.E21)). Rearranging:

$$\chi^2(P\|Q) = \exp\left(D_2(P\|Q)\right) - 1. \tag{24}$$
#### Property 2: Monotonicity in $\alpha$.
Previous studies [van Erven and Harremoës, 2014](https://arxiv.org/html/2604.16918#bib.bib10) have shown that $D_\alpha(P\|Q)$ is non-decreasing in $\alpha$. The KL divergence is recovered in the limit $\alpha \to 1$:

$$D_{\mathrm{KL}}(P\|Q) = \lim_{\alpha \to 1} D_\alpha(P\|Q). \tag{25}$$

Since $\alpha = 2 > 1$, monotonicity immediately gives:

$$D_2(P\|Q) \geq D_{\mathrm{KL}}(P\|Q). \tag{26}$$
Combining Eq. ([24](https://arxiv.org/html/2604.16918#A3.E24)) and Eq. ([26](https://arxiv.org/html/2604.16918#A3.E26)), and using the monotonicity of the exponential:

$$\mathrm{Var}_Q[\rho] = \chi^2(P\|Q) = \exp(D_2) - 1 \geq \exp(D_{\mathrm{KL}}) - 1. \tag{27}$$
### C.3 Effective Sample Size Decay
The effective sample size (ESS) [Kong, 1992](https://arxiv.org/html/2604.16918#bib.bib12) quantifies how many of $n$ importance-weighted samples effectively contribute to an estimate. Formally:

$$\mathrm{ESS} = \frac{n}{1 + \mathrm{Var}_Q[\rho]}. \tag{28}$$

When $\pi_\theta = \pi_\mu$ (i.e., $\rho \equiv 1$ and $\mathrm{Var}[\rho] = 0$), all samples are equally useful and $\mathrm{ESS} = n$. As $\mathrm{Var}[\rho]$ grows, more samples are effectively wasted and the ESS shrinks.

Substituting the lower bound from Eq. ([27](https://arxiv.org/html/2604.16918#A3.E27)):

$$\mathrm{ESS} = \frac{n}{1 + \mathrm{Var}[\rho]} \leq \frac{n}{1 + \exp(D_{\mathrm{KL}}) - 1} = \frac{n}{\exp(D_{\mathrm{KL}})}. \tag{29}$$

This shows that the per-sample effective contribution decays exponentially with the KL divergence between the two policies.
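This chain of bounds is easy to verify numerically; reusing the toy distributions from C.1:

```python
import numpy as np

P = np.array([0.6, 0.3, 0.1])   # current policy pi_theta
Q = np.array([0.4, 0.4, 0.2])   # behavior policy pi_mu
rho = P / Q
n = 1000                        # nominal sample count

var_rho = np.sum(Q * rho**2) - 1.0   # Eq. (20)
kl = np.sum(P * np.log(P / Q))       # D_KL(P || Q)
ess = n / (1.0 + var_rho)            # Eq. (28)

assert var_rho >= np.exp(kl) - 1.0   # Eq. (27)
assert ess <= n / np.exp(kl)         # Eq. (29)
```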
## Appendix D Environment and Model Details
Table [2](https://arxiv.org/html/2604.16918#A4.T2) summarizes the eight evaluation environments. We use three model scales across experiments, chosen to match the computational cost of each task.

Table 2: Environment and model configurations. "Max Actions" denotes the maximum number of assistant turns per episode. "Seq Len" is the maximum sequence length (prompt + all turns). "Config" refers to the hyperparameter configuration ([Appendix E](https://arxiv.org/html/2604.16918#A5)).

| Environment | Type | Model | GPUs | Seq Len | Max Actions | Config | Reward |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NQ Search | LLM | Qwen2.5-7B | 8 | 12800 | 5 | Default | Binary (EM) |
| AIME | LLM | Qwen2.5-7B | 8 | 4096 | 3 | A | Binary |
| Sokoban Simple | LLM | Qwen2.5-0.5B | 2 | 2048 | 10 | A | [−1, +3] |
| Sokoban Hard | LLM | Qwen2.5-0.5B | 2 | 2048 | 10 | A | [−1, +3] |
| FrozenLake (LLM) | LLM | Qwen2.5-0.5B | 2 | 2048 | 10 | A | Binary |
| CliffWalking | LLM | Qwen2.5-0.5B | 2 | 2048 | 200 | A | [−∞, 0] |
| GSM8K | LLM | Qwen2.5-0.5B | 2 | 4096 | 3 | A | Binary |
| FrozenLake (VLM) | VLM | Qwen2.5-VL-3B | 4 | 4096 | 10 | Default | Binary |
| GeoQA | VLM | Qwen2.5-VL-3B | 4 | 4096 | 3 | Default | Binary |

#### NQ Search.
An agentic retrieval-augmented QA task on Natural Questions [Jin et al., 2025](https://arxiv.org/html/2604.16918#bib.bib72). The agent alternates between reasoning (<think>), issuing search queries (<search>), and producing a final answer (<answer>). Retrieved documents are provided via a FAISS index with E5 embeddings ($k = 3$ passages per query). Reward is binary exact-match (EM), with case/punctuation/article normalization. The long sequence length (12800 tokens) accommodates multi-turn search context.
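For illustration, here is a minimal retrieval sketch in the spirit of this environment, using FAISS with an E5 encoder. The specific checkpoint (`intfloat/e5-base-v2`), the `passage:`/`query:` prefixes, and the toy corpus are our assumptions; the paper specifies only a FAISS index with E5 embeddings and $k = 3$.

```python
import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("intfloat/e5-base-v2")  # assumed E5 variant
passages = [
    "passage: The Natural Questions corpus is drawn from Wikipedia.",
    "passage: Paris is the capital and largest city of France.",
    "passage: Mount Everest is Earth's highest mountain above sea level.",
]
emb = encoder.encode(passages, normalize_embeddings=True)

# Inner product on unit-normalized vectors is cosine similarity.
index = faiss.IndexFlatIP(emb.shape[1])
index.add(emb)

query = encoder.encode(["query: what is the capital of france"],
                       normalize_embeddings=True)
scores, ids = index.search(query, k=3)  # k = 3 passages per query
```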
#### AIME.
A math competition task based on the American Invitational Mathematics Examination. Each problem requires an integer answer in the range 000–999. The agent has up to 3 attempts per problem; if the first answer is incorrect, it receives feedback and may try again. Reward is binary (correct/incorrect). We use Qwen2.5-7B-Instruct with Config A ($\tau = 1000$) on 8 GPUs.
#### Sokoban.
A box-pushing puzzle on a grid where the agent must push all boxes onto target positions [Wang et al., 2025a](https://arxiv.org/html/2604.16918#bib.bib67). Actions are irreversible: a misplaced box cannot be pulled back. We evaluate two difficulty levels: *Simple* ($6 \times 6$ grid, 1 box) and *Hard* (larger layouts, more boxes). Scores range from −1 (all boxes misplaced) to +3 (all on target), making this a dense reward signal with a meaningful performance gradient.
#### FrozenLake (LLM).
A text-only navigation task on a $4 \times 4$ grid with slippery ice. The agent receives a text description of the grid state and must navigate to the goal while avoiding holes. Slippery dynamics mean the agent moves in the intended direction with probability 1/3 and slides sideways otherwise, making the task stochastic. Used for both main evaluation and ablation studies.
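These dynamics match Gymnasium's `FrozenLake-v1` with `is_slippery=True`; assuming the text environment mirrors that implementation, a minimal instantiation looks like:

```python
import gymnasium as gym

# Slippery 4x4 FrozenLake: the agent moves in the intended direction with
# probability 1/3 and slides to each perpendicular direction with prob. 1/3.
env = gym.make("FrozenLake-v1", map_name="4x4", is_slippery=True)
obs, info = env.reset(seed=42)
obs, reward, terminated, truncated, info = env.step(2)  # action 2 = right
```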
#### CliffWalking.
A $4 \times 12$ grid navigation task where the agent must reach the goal while avoiding a cliff along the bottom row. Falling off the cliff incurs a large negative reward (−100) and resets the agent. The optimal policy achieves a score of 0. This serves as a control: the task is simple enough that on-policy training solves it quickly, and replay is not expected to help.
#### GSM8K.
Grade-school math word problems [Cobbe et al., 2021](https://arxiv.org/html/2604.16918#bib.bib73) requiring multi-step arithmetic reasoning. The base Qwen2.5-0.5B model already achieves over 93% accuracy on this task, so it serves as a near-saturated control where replay cannot provide meaningful improvements.
#### FrozenLake (VLM).
The visual counterpart of the text FrozenLake: the $4 \times 4$ grid is rendered as an RGB image, and the VLM agent must interpret the visual state to navigate. This tests whether FreshPER generalizes to multimodal settings where observations are images rather than text.
#### GeoQA.
A geometry question-answering task where the VLM interprets geometric diagrams and solves problems requiring spatial reasoning. Reward is binary (correct/incorrect).
## Appendix E Engineering Insights: Hyperparameter Configuration for Off-Policy LLM RL
During development, we discovered that standard on-policy hyperparameters are unsuitable for off-policy training with a replay buffer. We summarize the key findings below, as they may benefit practitioners adopting experience replay for LLM RL.
#### Advantage Clipping is Critical.
The default advantage clipping threshold in many LLM RL implementations is large ($\epsilon_{\text{clip}} = 20$), effectively disabling clipping. For on-policy training this is harmless, but with replay data, stale trajectories can produce extreme advantage values that destabilize training. We found that reducing the clip to $\epsilon_{\text{clip}} = 0.2$, the same value used for the PPO ratio clip, was essential for stable off-policy learning. With $\epsilon_{\text{clip}} = 20$, replay-augmented training frequently diverged; with $\epsilon_{\text{clip}} = 0.2$, it was consistently stable across all environments.
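A sketch of how such a clamp might sit inside a PPO-style clipped objective (REINFORCE++ uses a PPO-style ratio clip). The function name, the symmetric clamping of advantages, and the exact placement are our assumptions, not the paper's code.

```python
import torch

def clipped_pg_loss(logp_new, logp_old, adv, eps_clip=0.2):
    """Illustrative policy-gradient loss combining the usual PPO ratio clip
    with the advantage clamp discussed above; eps_clip is shared by both."""
    adv = adv.clamp(-eps_clip, eps_clip)        # advantage clipping
    ratio = torch.exp(logp_new - logp_old)      # importance ratio
    unclipped = ratio * adv
    clipped = ratio.clamp(1 - eps_clip, 1 + eps_clip) * adv
    return -torch.min(unclipped, clipped).mean()
```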
#### KL Regularization Should Be Disabled.
On-policy methods commonly use KL divergence penalties ($\beta_{\text{KL}} = 0.05$–$0.1$) to prevent the policy from deviating too far from the reference model. However, with off-policy replay, the KL penalty interacts poorly with stale data: old trajectories already have large KL divergence from the reference, and penalizing this further suppresses useful gradient signal from replay. We found that setting $\beta_{\text{KL}} = 0$ (and disabling the adaptive KL controller) improved performance across all environments when using a replay buffer.
#### Entropy Bonus is Unnecessary.
Similarly, the entropy loss coefficient ($\lambda_{\text{ent}} = 0.01$) commonly used for exploration in on-policy training was counterproductive with replay. The replay buffer itself provides exploration diversity; adding an entropy bonus on top led to overly stochastic policies.
Table [3](https://arxiv.org/html/2604.16918#A5.T3) summarizes these differences. We refer to the off-policy-optimized configuration as "Config A" throughout our experiments.

Table 3: Hyperparameter comparison between standard on-policy settings and our off-policy-optimized Config A. Note: Config A was developed on FrozenLake (LLM, 0.5B) and subsequently validated on Sokoban, CliffWalking, and GSM8K without modification; all 0.5B experiments in this paper use Config A for both baseline and replay runs. The NQ Search (7B) and VLM (3B) experiments use the default on-policy settings, as larger models appear more robust to these hyperparameter choices.
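For reference, the Config A deltas documented above can be collected into a small override mapping; the key names are illustrative and do not correspond to the ROLL config schema.

```python
# Config A overrides relative to standard on-policy defaults (illustrative keys;
# values taken from the text of this appendix).
CONFIG_A = {
    "advantage_clip": 0.2,   # default ~20 effectively disables clipping
    "kl_coef": 0.0,          # beta_KL = 0; adaptive KL controller disabled
    "entropy_coef": 0.0,     # on-policy default is lambda_ent = 0.01
    "tau": 1000,             # age decay constant paired with Config A on AIME
}
```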
## Appendix F Detailed Experimental Configurations
Table [4](https://arxiv.org/html/2604.16918#A6.T4) lists the full training configuration for each experiment family. All experiments use the ROLL framework [Wang et al., 2025a](https://arxiv.org/html/2604.16918#bib.bib67) with REINFORCE++ as the policy gradient algorithm, DeepSpeed for training, and vLLM for inference.

Table 4: Training configurations across experiment families. "GA" = gradient accumulation steps, "BS/dev" = per-device batch size. All experiments use learning rate $10^{-6}$.

| Experiment | Model | GPUs | BS/dev | GA | DeepSpeed | Max Steps | Config |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NQ Search | 7B | 8 | 1 | 16 | ZeRO-3 | 300 | Default |
| AIME | 7B | 8 | 2 | 32 | ZeRO-2 | 400 | A |
| Sokoban | 0.5B | 2 | 4 | 32 | ZeRO-2 | 400 | A |
| FrozenLake (LLM) | 0.5B | 2 | 4 | 32 | ZeRO-2 | 400 | A |
| CliffWalking | 0.5B | 2 | 4 | 32 | ZeRO-2 | 400 | A |
| GSM8K | 0.5B | 2 | 4 | 32 | ZeRO-2 | 400 | A |
| FrozenLake (VLM) | VL-3B | 4 | 1 | 32 | ZeRO-2 | 100 | Default |
| GeoQA | VL-3B | 4 | 1 | 32 | ZeRO-2 | 100 | Default |

Table [5](https://arxiv.org/html/2604.16918#A6.T5) details the replay buffer configuration. All replay-enabled experiments use a trajectory-level buffer with FIFO eviction and the same core priority parameters.
Table 5: Replay buffer configurations. All experiments use trajectory-level sampling, FIFO eviction, and priority exponent $\alpha = 0.6$.

#### Common Settings.
All experiments share: rollout batch size 128, validation batch size 128, seed 42, cosine learning rate schedule with 10-step warmup, BF16 mixed precision, and FlashAttention-2. Inference uses vLLM with top-$p = 0.99$, top-$k = 100$, and temperature 0.99.
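The stated decoding settings map directly onto vLLM's `SamplingParams`; a sketch with engine construction and model loading omitted:

```python
from vllm import SamplingParams

# Decoding settings shared across experiments (from the text above).
sampling = SamplingParams(temperature=0.99, top_p=0.99, top_k=100)
```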
#### GPU Layout.
For 2-GPU experiments (0.5B models): GPU 0 runs training (DeepSpeed) and reference model inference; GPU 1 runs vLLM inference. For 4-GPU experiments (VLM 3B): GPUs 0–1 run training and reference; GPUs 2–3 run vLLM. For 8-GPU experiments (NQ Search 7B): GPUs 0–3 run training (ZeRO-3); GPUs 4–7 run vLLM.