Revisiting Hard Questions with Replay Buffers (8 minute read)
Summary
ZPPO introduces a replay buffer for hard questions in reinforcement learning for LLMs/VLMs, allowing repeated exposure to gradually improve rollout accuracy without policy drift. The method graduates more hard questions than GRPO, especially those with near-zero initial accuracy.
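As a rough illustration of the idea in the summary, the sketch below keeps a buffer of hard questions, revisits them, and "graduates" a question once its rollout accuracy is high enough. It is a minimal sketch under stated assumptions, not the paper's implementation: the class name `HardQuestionBuffer`, the thresholds `hard_below` and `graduate_at`, and the use of binary rollout rewards are all hypothetical and do not come from the source.

```python
import random
from dataclasses import dataclass, field


@dataclass
class HardQuestionBuffer:
    """Hypothetical buffer of questions whose rollout accuracy is low."""
    hard_below: float = 0.25   # assumed: accuracy under this marks a question as hard
    graduate_at: float = 0.75  # assumed: accuracy at or above this graduates it
    accuracy: dict = field(default_factory=dict)  # question id -> latest rollout accuracy

    def update(self, qid: str, rewards: list) -> None:
        """Record mean rollout reward, then admit, keep, or graduate the question."""
        if not rewards:
            return  # no rollouts, nothing to record
        acc = sum(rewards) / len(rewards)
        if qid in self.accuracy:
            if acc >= self.graduate_at:
                del self.accuracy[qid]    # graduated: stop replaying it
            else:
                self.accuracy[qid] = acc  # still hard: keep it for another visit
        elif acc < self.hard_below:
            self.accuracy[qid] = acc      # newly identified hard question

    def sample(self, k: int, rng: random.Random) -> list:
        """Pick up to k buffered question ids to mix into the next batch."""
        ids = sorted(self.accuracy)
        return rng.sample(ids, min(k, len(ids)))


buf = HardQuestionBuffer()
buf.update("q1", [0, 0, 0, 0])  # admitted: accuracy 0.0 is below hard_below
buf.update("q1", [1, 1, 1, 0])  # graduated: accuracy 0.75 reaches graduate_at
```

In a training loop, `sample` would supply a few buffered questions per batch alongside fresh ones. How the actual method keeps the policy from drifting during replay is not described in this summary, so the sketch leaves it out.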
Cached at: 06/20/26, 02:14 PM
Similar Articles
Replay What Matters: Off-Policy Replay for Efficient LLM Reinforcement Unlearning
This paper introduces ReRULE, an off-policy replay method for reinforcement unlearning in LLMs, improving forgetting and retention efficiency on benchmarks like RWKU and MUSE.
Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning
FreshPER introduces a freshness-aware prioritized experience replay method for LLM/VLM reinforcement learning. It addresses the 'priority staleness' problem by applying exponential age decay to stored priorities, which enables off-policy reuse of trajectories. Evaluated on eight agentic, reasoning, and math tasks, FreshPER significantly outperforms on-policy baselines, with gains of up to 367% on Sokoban.
SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models
Researchers propose SPS (Steering Probability Squeezing), a training paradigm combining reinforcement learning with inverse reinforcement learning to address probability squeezing in LLM reasoning training, where probability mass concentrates too narrowly on high-reward trajectories, limiting exploration and multi-sample performance (Pass@k). Experiments on five reasoning benchmarks demonstrate improved exploration and Pass@k metrics.
Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients
Zone of Proximal Policy Optimization (ZPPO) improves knowledge distillation by using reformulated prompts that help students learn from both correct and incorrect responses, enhancing performance especially at smaller model sizes.
Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization
This paper proposes PTD-PO, a privileged tutoring distillation framework that provides dense token-level supervision for reinforcement learning with verifiable rewards in multimodal reasoning tasks, without exposing the answer. It uses structured hints and a Top-K JS divergence objective to stabilize training, consistently outperforming existing methods on 2B-8B LVLMs.