rank-inversion


SFT Overtraining Predicts Rank Inversion via Entropy Collapse Under RLVR

arXiv cs.LG · 2026-06-18

This paper demonstrates that selecting the SFT checkpoint with the highest pass@1 for GRPO can fail because SFT overtraining compresses output diversity, leading to entropy collapse and rank inversion in reinforcement learning. Experiments on Qwen2.5-Coder-3B and DeepSeek-Coder-6.7B show that pre-RL entropy is positively associated with GRPO outcome, and a two-stage diagnostic can detect high-risk checkpoints.
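To make the idea concrete, here is a minimal sketch of an entropy-aware checkpoint screen. It is not the paper's two-stage diagnostic, whose details are not given in this summary. The function names, the `min_entropy` threshold, the checkpoint names, and all numbers below are hypothetical toy values chosen for illustration. The sketch assumes you already have next-token probability distributions sampled from each SFT checkpoint before RL.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def mean_entropy(distributions):
    """Average per-token entropy over a list of next-token distributions."""
    return sum(token_entropy(d) for d in distributions) / len(distributions)

def select_checkpoint(checkpoints, min_entropy):
    """Hypothetical screen: discard checkpoints whose pre-RL entropy falls
    below min_entropy, then return the survivor with the highest pass@1.
    Returns None if every checkpoint is screened out."""
    survivors = [c for c in checkpoints if mean_entropy(c["dists"]) >= min_entropy]
    if not survivors:
        return None
    return max(survivors, key=lambda c: c["pass_at_1"])

# Toy data (illustrative only, not results from the paper).
checkpoints = [
    {"name": "step-1000", "pass_at_1": 0.40,
     "dists": [[0.5, 0.5], [0.25, 0.25, 0.25, 0.25]]},
    {"name": "step-3000", "pass_at_1": 0.46,
     "dists": [[0.99, 0.01], [0.97, 0.01, 0.01, 0.01]]},
]

naive = max(checkpoints, key=lambda c: c["pass_at_1"])
screened = select_checkpoint(checkpoints, min_entropy=0.5)
print(naive["name"], screened["name"])
```

In this toy setup, picking by pass@1 alone chooses the later checkpoint, while the entropy screen rejects it because its output distributions are nearly deterministic. Where to set the threshold in practice is an open choice that this sketch does not settle.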


