This paper demonstrates that choosing the SFT checkpoint with the highest pass@1 as the starting point for GRPO can fail: SFT overtraining compresses output diversity, which leads to entropy collapse during reinforcement learning and to rank inversion, where the checkpoint that ranks best before RL is no longer the best after it. Experiments on Qwen2.5-Coder-3B and DeepSeek-Coder-6.7B show that pre-RL entropy is positively associated with GRPO outcome, and that a two-stage diagnostic can detect high-risk checkpoints.
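To make the tension between pass@1 and output diversity concrete, here is a minimal sketch that computes both quantities for one prompt from a set of sampled completions. It is an illustration under stated assumptions, not the paper's diagnostic: the function names, the toy samples, and the use of sequence-level Shannon entropy over distinct completions as a diversity proxy are all hypothetical choices (the paper's entropy measure may well be token-level).

```python
import math
from collections import Counter


def pass_at_1(passed):
    """Empirical pass@1: fraction of sampled completions that pass the tests."""
    return sum(passed) / len(passed)


def sample_entropy(completions):
    """Shannon entropy (in nats) of the empirical distribution over
    distinct completions. Assumption: a sequence-level proxy for output
    diversity, not necessarily the entropy measure used in the paper."""
    n = len(completions)
    counts = Counter(completions)
    # Sum of p * log(1/p) over the distinct completions.
    return sum((c / n) * math.log(n / c) for c in counts.values())


# Hypothetical samples for a single prompt from two SFT checkpoints.
# Overtrained checkpoint: always emits the same (correct) solution.
late_completions = ["sol_1"] * 8
late_passed = [True] * 8

# Earlier checkpoint: four distinct solutions, one of which fails.
early_completions = ["sol_1", "sol_1", "sol_2", "sol_2",
                     "sol_3", "sol_3", "sol_4", "sol_4"]
early_passed = [True, True, True, True, True, True, False, False]

late_stats = (pass_at_1(late_passed), sample_entropy(late_completions))
early_stats = (pass_at_1(early_passed), sample_entropy(early_completions))
```

In this toy setup the overtrained checkpoint wins on pass@1 but has zero sample entropy, while the earlier checkpoint scores lower on pass@1 yet still produces several distinct solutions. This is the kind of pre-RL signal the summary says is positively associated with GRPO outcome.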