The S2L-PO framework uses smaller models as natural explorers to increase policy diversity when training large language models with GRPO (Group Relative Policy Optimization). It converges faster and improves accuracy on mathematical reasoning benchmarks while reducing rollout compute.
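To make the GRPO setting concrete, here is a minimal sketch of the group-relative advantage that GRPO computes over the rollouts sampled for a single prompt: each reward is standardized against the mean and standard deviation of its group. The pooling of rollouts from a small and a large model into one group, the `rollouts` data, and the name `group_relative_advantages` are illustrative assumptions; the summary above does not specify how S2L-PO combines the two sources, and the choice of population standard deviation with a small `eps` is likewise an assumption.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one prompt's rollout group (GRPO-style).

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when every rollout receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rollout group for one prompt: (source model, scalar reward).
# Mixing "small" and "large" rollouts in one group is an assumption made
# for illustration, not a detail confirmed by the summary.
rollouts = [("small", 1.0), ("small", 0.0), ("large", 0.0), ("large", 0.0)]

advantages = group_relative_advantages([reward for _, reward in rollouts])

for (source, reward), adv in zip(rollouts, advantages):
    print(f"{source:>5}  reward={reward:.1f}  advantage={adv:+.3f}")
```

In this toy group only one rollout is correct, so it receives a positive advantage and the rest receive negative ones. If every rollout in a group earns the same reward, all advantages are zero and the group contributes no learning signal, which is one reason more diverse rollouts can matter.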