small-to-large-policy-optimization


Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO

Hugging Face Daily Papers · 2026-06-02

The S2L-PO framework uses smaller models as natural explorers to enhance policy diversity in GRPO for training large language models. It achieves faster convergence and improves accuracy on mathematical reasoning benchmarks while reducing rollout compute.
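As a rough illustration of why extra rollout diversity can matter in GRPO, the sketch below computes the standard group-relative advantage, where each rollout's reward is normalized by the mean and standard deviation of its group. The summary does not say how S2L-PO combines rollouts, so pooling small-model and large-model rollouts into one group is an assumption made only for this example. The reward values, the variable names, and the use of the population standard deviation are likewise hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations vary
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for one prompt (1.0 = correct answer, 0.0 = incorrect).
small_model_rewards = [1.0, 0.0, 0.0, 1.0]  # rollouts from a smaller model (assumed)
large_model_rewards = [0.0, 0.0, 0.0, 0.0]  # rollouts from the large policy (assumed)

# A group with identical rewards yields all-zero advantages, so no learning signal.
large_only = group_relative_advantages(large_model_rewards)

# Assumption: both sets of rollouts are pooled into one group before normalizing.
pooled = group_relative_advantages(small_model_rewards + large_model_rewards)

print(large_only)
print(pooled)
```

In this toy case the large-only group gives every rollout an advantage of zero, while the pooled group assigns positive advantages to the correct rollouts and negative ones to the rest. This is only one plausible reading of how more diverse rollouts could help; the paper itself is the reference for the actual method.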

