Revisiting Hard Questions with Replay Buffers (8 minute read)

TLDR AI · 2026-06-19

ZPPO introduces a replay buffer for hard questions in reinforcement learning for LLMs/VLMs, allowing repeated exposure to gradually improve rollout accuracy without policy drift. The method graduates more hard questions than GRPO, especially those with near-zero initial accuracy.
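
As a rough illustration of the replay-buffer idea in the summary above, the Python sketch below keeps questions with low rollout accuracy in a buffer, samples them for repeated exposure, and releases them once accuracy is high enough. It is not the ZPPO algorithm. The class name `HardQuestionBuffer`, both thresholds, the uniform sampling, and the reading of "graduates" as "leaves the buffer" are all assumptions made for illustration, and the policy update itself is omitted.

```python
import random
from dataclasses import dataclass, field


@dataclass
class HardQuestionBuffer:
    # Hypothetical thresholds; the summary does not state the real criteria.
    hard_threshold: float = 0.25      # at or below this, a question counts as hard
    graduate_threshold: float = 0.75  # at or above this, it leaves the buffer
    accuracy: dict = field(default_factory=dict)  # question id -> latest rollout accuracy
    graduated: set = field(default_factory=set)   # ids that have left the buffer

    def record(self, qid: str, correct: int, total: int) -> None:
        """Update the buffer from one group of rollouts for a question."""
        acc = correct / total
        if qid in self.accuracy:
            if acc >= self.graduate_threshold:
                # Accuracy is now high enough: stop replaying this question.
                del self.accuracy[qid]
                self.graduated.add(qid)
            else:
                self.accuracy[qid] = acc
        elif acc <= self.hard_threshold:
            # First time the question looks hard: start replaying it.
            self.accuracy[qid] = acc

    def sample(self, k: int, rng: random.Random) -> list:
        """Pick up to k buffered questions to revisit in the next batch."""
        ids = sorted(self.accuracy)
        return rng.sample(ids, min(k, len(ids)))


buf = HardQuestionBuffer()
buf.record("q1", 0, 8)  # 0/8 correct: enters the buffer
buf.record("q2", 7, 8)  # already easy: never buffered
buf.record("q1", 3, 8)  # improving but still buffered
revisit = buf.sample(4, random.Random(0))  # only "q1" is available
buf.record("q1", 7, 8)  # 7/8 correct: graduates
```

In a training loop, the ids returned by `sample` would be mixed into the next batch alongside fresh questions. How many to mix in, and how to weight them, is a design choice this sketch leaves open.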
