Revisiting Hard Questions with Replay Buffers (8 minute read)

TLDR AI Papers

Summary

ZPPO introduces a replay buffer for hard questions in reinforcement learning for LLMs/VLMs, allowing repeated exposure to gradually improve rollout accuracy without policy drift. The method graduates more hard questions than GRPO, especially those with near-zero initial accuracy.

ZPPO stores difficult questions in a replay buffer so the model can repeatedly train on them rather than seeing them only once. The method is designed to strengthen learning on challenging examples and improve rollout accuracy.
Original Article

Cached at: 06/20/26, 02:14 PM

# NVIDIA-ZPPO: Zone of Proximal Policy Optimization

Source: [https://byungkwanlee.github.io/ZPPO-page/](https://byungkwanlee.github.io/ZPPO-page/)

## TL;DR

### Accuracy Gain (Δ pp)

| Method | 10 LLM Benchmarks | 16 VLM Benchmarks | 5 Video Benchmarks |
|---|---|---|---|
| Off-Policy Distill† | 0.0 | 0.0 | 0.0 |
| On-Policy Distill† | 0.0 | 0.0 | 0.0 |
| GRPO† | 0.0 | 0.0 | 0.0 |
| GRPO† + Teacher response | 0.0 | 0.0 | 0.0 |
| **ZPPO** (Ours) | 0.0 | 0.0 | 0.0 |

†: prompt replay buffer · all experiments run on Qwen3.5

**1. Off-Policy Distill† and On-Policy Distill†**

Distillation forces a student to imitate teacher logits, inducing **memorization of the training samples** while **degrading generalization** to unseen samples (overfitting to the dataset and the teacher).

**2. GRPO†**

RL gives the model the freedom to keep answering a question until it solves it, encouraging **reasoning exploration via self-reflection such as "Wait, that step looks wrong; let me re-check."** The model is not forced to imitate any response, which **preserves generalization**. However, RL cannot learn to solve **hard questions** whose rollout accuracy is near zero: they are **discarded forever**.

**3. GRPO† + Teacher response**

To solve hard questions, some RL methods naively inject the teacher's response into the student as if it were the student's own response, breaking the **on-policy assumption** and **degrading generalization again**.

## Insight

### Research Question

> For **hard questions**, how can we transfer the teacher's knowledge to the student without imitating the teacher's logits or injecting the teacher's response directly into the student's gradient? How can the student be made to solve the hard question without **policy drift** (degraded generalization)?

## Method

Technically, we use a **Replay Buffer** to store **hard questions**, so the model revisits each **hard question** many times, not just once as in GRPO. Repeated exposure strengthens the BCQ/NCQ effect on each **hard question**, which we expect to lift its **rollout accuracy**.

1. **Batch** includes new questions, replayed questions, **BCQ**, and **NCQ**; the **Student** is **RL-trained** on them.

## Results

A question is admitted to the **Replay Buffer** when its rollout accuracy stays **below 50%**, and it **graduates** (leaves the buffer) once that accuracy reaches **50%**. ZPPO graduates far more hard questions than GRPO, and the gap is widest where the initial accuracy starts near **zero**.

## Qualitative

### *BCQ* + *NCQ* on hard questions.
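The buffer rule described in the results (admit a question while its rollout accuracy is below 50%, graduate it once accuracy reaches 50%) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class `HardQuestionBuffer`, the helper `build_batch`, the string question ids, and the binary 0/1 rewards are all hypothetical names and assumptions chosen for illustration. BCQ and NCQ are left out because the text above does not define them.

```python
import random
from dataclasses import dataclass, field

# Threshold taken from the text: admitted below 50%, graduates at 50%.
GRADUATION_THRESHOLD = 0.5


@dataclass
class HardQuestionBuffer:
    """Hypothetical replay buffer of hard questions, keyed by question id."""

    threshold: float = GRADUATION_THRESHOLD
    # question id -> most recent rollout accuracy (only for buffered questions)
    accuracy: dict = field(default_factory=dict)
    # ids of questions that were buffered and later reached the threshold
    graduated: set = field(default_factory=set)

    def update(self, question_id, rollout_rewards):
        """Record one group of rollouts; admit, keep, or graduate the question.

        Assumes each reward is 1 for a correct rollout and 0 otherwise.
        """
        acc = sum(rollout_rewards) / len(rollout_rewards)
        if acc < self.threshold:
            # Still hard: admit it, or refresh its stored accuracy.
            self.accuracy[question_id] = acc
        elif question_id in self.accuracy:
            # Reached the threshold: the question leaves the buffer.
            del self.accuracy[question_id]
            self.graduated.add(question_id)

    def sample(self, k, rng):
        """Draw up to k buffered question ids to replay."""
        ids = sorted(self.accuracy)
        return rng.sample(ids, min(k, len(ids)))


def build_batch(new_ids, buffer, replay_k, rng):
    """Mix fresh questions with replayed hard questions (assumed layout)."""
    return list(new_ids) + buffer.sample(replay_k, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    buf = HardQuestionBuffer()
    buf.update("q1", [0, 0, 0, 0])  # accuracy 0.0: admitted
    buf.update("q2", [1, 1, 0, 1])  # accuracy 0.75: never enters the buffer
    buf.update("q1", [1, 0, 1, 0])  # accuracy 0.5: graduates
    print(sorted(buf.accuracy), sorted(buf.graduated))
    print(build_batch(["n1", "n2"], buf, replay_k=2, rng=rng))
```

In this sketch a question that is never below the threshold is simply ignored, and a graduated question is not re-admitted automatically unless a later `update` call finds it below the threshold again; whether ZPPO re-admits graduated questions is not stated in the text.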

Similar Articles

Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning

arXiv cs.CL

FreshPER introduces a freshness-aware prioritized experience replay method for LLM/VLM reinforcement learning that addresses the 'priority staleness' problem by applying exponential age decay to stored priorities, enabling off-policy reuse of trajectories. Evaluated on eight agentic, reasoning, and math tasks, FreshPER significantly outperforms on-policy baselines with gains up to +367% on Sokoban.

SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models

arXiv cs.CL

Researchers propose SPS (Steering Probability Squeezing), a training paradigm combining reinforcement learning with inverse reinforcement learning to address probability squeezing in LLM reasoning training, where probability mass concentrates too narrowly on high-reward trajectories, limiting exploration and multi-sample performance (Pass@k). Experiments on five reasoning benchmarks demonstrate improved exploration and Pass@k metrics.