Revisiting Hard Questions with Replay Buffers (8 minute read)

TLDR AI 06/19/26, 12:00 AM Papers

reinforcement-learning policy-optimization replay-buffer llm vlm generalization hard-questions

Summary

ZPPO introduces a replay buffer for hard questions in reinforcement learning for LLMs/VLMs, allowing repeated exposure to gradually improve rollout accuracy without policy drift. The method graduates more hard questions than GRPO, especially those with near-zero initial accuracy.

ZPPO stores difficult questions in a replay buffer so the model can repeatedly train on them rather than seeing them only once. The method is designed to strengthen learning on challenging examples and improve rollout accuracy.
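A possible illustration of why near-zero-accuracy questions stall under GRPO-style training. This is a minimal sketch, assuming binary rewards and the usual group-normalized advantage. The function name `group_relative_advantages` and the `eps` value are illustrative and not taken from the ZPPO page.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps only guards against division by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# A question the policy sometimes solves: rewards differ, so the
# advantages are non-zero and carry a learning signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A hard question: every rollout fails, every reward equals the group
# mean, and every advantage collapses to zero.
all_fail = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

With all advantages at zero, that group contributes nothing to the policy gradient, which is consistent with the page's remark that such questions are effectively discarded unless they are revisited.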

Original Article

Cached at: 06/20/26, 02:14 PM

# NVIDIA-ZPPO: Zone of Proximal Policy Optimization

Source: [https://byungkwanlee.github.io/ZPPO-page/](https://byungkwanlee.github.io/ZPPO-page/)

TL;DR

### Accuracy Gain (Δ pp)

| Method | 10 LLM Benchmarks | 16 VLM Benchmarks | 5 Video Benchmarks |
| --- | --- | --- | --- |
| Off-Policy Distill† | 0.0 | 0.0 | 0.0 |
| On-Policy Distill† | 0.0 | 0.0 | 0.0 |
| GRPO† | 0.0 | 0.0 | 0.0 |
| GRPO† + Teacher response | 0.0 | 0.0 | 0.0 |
| **ZPPO** (Ours) | 0.0 | 0.0 | 0.0 |

†: prompt replay buffer · all experiments run on Qwen3.5

**1 / 3: Off-Policy Distill† and On-Policy Distill†**

Distillation forces a student to imitate teacher logits, inducing **memorization of the training samples** while **degrading generalization** to unseen samples (overfitting to both the dataset and the teacher).

**2 / 3: GRPO†**

RL leaves the model free to keep attempting a question until it solves it, encouraging **reasoning exploration via self-reflection, such as "Wait, that step looks wrong. Let me re-check."** The model is not forced to imitate any response, which **preserves generalization**. However, RL cannot learn to solve **hard questions** whose rollout accuracy is near zero; they are **discarded forever**.

**3 / 3: GRPO† + Teacher response**

To solve hard questions, some RL methods naively inject the teacher's response into the student as if it were the student's own response. This breaks the **on-policy assumption** and **degrades generalization again**.

Insight

Research Question

> For **hard questions**, how can we transfer the teacher's knowledge to the student without imitating the teacher's logits or injecting the teacher's response directly into the student's gradient? How can the student be made to solve hard questions without **policy drift** (degraded generalization)?

Method

Technically, we use a **Replay Buffer** to store **hard questions**, so the model revisits each **hard question** many times, not just once as in GRPO. Repeated exposure strengthens the BCQ/NCQ effect on each **hard question**, which we expect to lift its **rollout accuracy**.

1. The **Batch** includes new questions, replayed questions, **BCQ**, and **NCQ**; the **Student** is **RL-trained** on them.

Results

A question is admitted to the **Replay Buffer** when its rollout accuracy stays **below 50%**, and it **graduates**, leaving the buffer, once that accuracy reaches **50%**. ZPPO graduates far more hard questions than GRPO, and the gap is widest where the initial accuracy starts near **zero**.

Qualitative

## *BCQ* + *NCQ* on hard questions
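The admission and graduation rule described in the results (a question enters the buffer below 50% rollout accuracy and leaves once it reaches 50%) might be sketched as follows. This is an illustrative sketch only, not the authors' implementation. The class `HardQuestionBuffer`, the helper `build_batch`, and the uniform sampling are assumptions. The sketch admits a question after a single low-accuracy measurement, whereas the page says accuracy "stays" below 50%, and it leaves BCQ/NCQ items out entirely because the page does not specify how they are built.

```python
import random
from dataclasses import dataclass, field

# From the page: below 50% admits, reaching 50% graduates.
GRADUATION_THRESHOLD = 0.5


@dataclass
class HardQuestionBuffer:
    """Hypothetical buffer of questions whose rollout accuracy is still low."""
    threshold: float = GRADUATION_THRESHOLD
    pending: dict = field(default_factory=dict)   # question id -> last accuracy
    graduated: set = field(default_factory=set)   # ids that have left the buffer

    def update(self, question_id, rewards):
        """Record one group of rollouts (1 = correct, 0 = wrong)."""
        accuracy = sum(rewards) / len(rewards)
        if accuracy < self.threshold:
            # Admit the question, or keep it buffered with a fresh accuracy.
            self.pending[question_id] = accuracy
        elif question_id in self.pending:
            # Accuracy reached the threshold: the question graduates.
            del self.pending[question_id]
            self.graduated.add(question_id)
        return accuracy

    def sample(self, k, rng=random):
        """Pick up to k buffered questions to replay (uniform, by assumption)."""
        ids = sorted(self.pending)
        return rng.sample(ids, min(k, len(ids)))


def build_batch(new_questions, buffer, replay_k, rng=random):
    """Mix fresh questions with replayed hard ones (BCQ/NCQ omitted here)."""
    return list(new_questions) + buffer.sample(replay_k, rng)


buf = HardQuestionBuffer()
buf.update("q1", [0, 0, 0, 0])  # accuracy 0.0: admitted
buf.update("q2", [1, 1, 0, 1])  # accuracy 0.75: never enters the buffer
buf.update("q1", [1, 0, 1, 0])  # accuracy 0.5: graduates
```

Keeping only the most recent accuracy per question is the simplest choice. A running average over several visits would be a natural alternative if single rollout groups prove too noisy.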

Similar Articles

Replay What Matters: Off-Policy Replay for Efficient LLM Reinforcement Unlearning

Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning

SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

Teaching the Way, Not the Answer: Privileged Tutoring Distillation for Multimodal Policy Optimization
