dapo


@johnschulman2: PPO had a second wave in the LLM era for reasons unanticipated by the original paper - the importance-ratio objective f…

X AI KOLs Following ↗ · yesterday Cached

This paper shows that the clipping mechanism in PPO and GRPO biases policy entropy in reinforcement learning with verifiable rewards (RLVR) for LLMs: clip-low increases entropy, while clip-high decreases it. The authors prove that standard clipping reduces entropy even when rewards are random, and show that adjusting the clip-low bound can prevent entropy collapse and promote exploration.
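For readers unfamiliar with the two bounds, here is a minimal sketch of the standard PPO-style clipped surrogate for a single token, with the lower and upper clip ranges decoupled. The function names and the parameters `eps_low` and `eps_high` are illustrative assumptions, not taken from the paper or the post; the sketch only shows which bound is active for which sign of the advantage, not the entropy analysis itself.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """PPO-style clipped objective for one token.

    ratio      -- importance ratio pi_new(a|s) / pi_old(a|s)
    advantage  -- advantage estimate for this token
    eps_low    -- hypothetical name for the lower clip range (clip-low)
    eps_high   -- hypothetical name for the upper clip range (clip-high)
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic minimum of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


def active_clip(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Return which bound, if any, cuts off the gradient for this token."""
    if advantage > 0 and ratio > 1.0 + eps_high:
        return "clip-high"  # positive-advantage token already pushed up too far
    if advantage < 0 and ratio < 1.0 - eps_low:
        return "clip-low"   # negative-advantage token already pushed down too far
    return None             # inside the trust region, or clipping is not binding


# A positive-advantage token with a large ratio hits clip-high;
# a negative-advantage token with a small ratio hits clip-low.
print(active_clip(1.5, 1.0))   # clip-high
print(active_clip(0.5, -1.0))  # clip-low
print(active_clip(1.5, -1.0))  # None
```

In this form, clip-high only ever binds on positive-advantage tokens and clip-low only on negative-advantage tokens, which is why the two bounds can be tuned separately.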


