Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance

Hugging Face Daily Papers 05/14/26, 12:00 AM Papers

reinforcement-learning llm few-shot supervised-fine-tuning sample-efficiency math-reasoning coding

Summary

FEST is a few-shot demonstration-guided reinforcement learning algorithm that achieves strong performance with minimal supervised fine-tuning data by combining a supervised signal, an on-policy signal, and decaying weights on the few-shot SFT data to prevent overfitting.

Reinforcement Learning with Verifiable Rewards (RLVR) has achieved great success in developing Large Language Models (LLMs) with chain-of-thought rollouts for many tasks such as math and coding. Nevertheless, RLVR struggles with sample efficiency on difficult problems where correct rollouts are hard to generate. Prior work proposes to address this issue via demonstration-guided RLVR, i.e., conducting Supervised Fine-Tuning (SFT) when RL fails; however, SFT often requires a lot of data, which can be expensive to acquire. In this paper, we propose FEST, a FEw-ShoT demonstration-guided RLVR algorithm. It attains compelling results with only 128 demonstrations randomly selected from an SFT dataset. We find that three components are vital for its success: a supervised signal, an on-policy signal, and decaying weights on the few-shot SFT dataset to prevent overfitting from multiple-epoch training. On several benchmarks, FEST outperforms baselines with orders of magnitude less SFT data, even matching their performance with the full dataset.
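The three components named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the loss or the decay schedule, so the cosine schedule, the additive combination of the two losses, and all function names and parameters below (sample_few_shot, sft_weight, combined_loss, w0, w_min) are assumptions made for illustration. Only the figure of 128 randomly selected demonstrations comes from the abstract.

```python
import math
import random


def sample_few_shot(sft_dataset, k=128, seed=0):
    """Randomly select k demonstrations from an SFT dataset.

    k=128 follows the abstract; the seed is an illustrative assumption.
    """
    rng = random.Random(seed)
    return rng.sample(sft_dataset, min(k, len(sft_dataset)))


def sft_weight(step, total_steps, w0=1.0, w_min=0.0):
    """Hypothetical cosine decay of the SFT loss weight from w0 to w_min.

    The abstract only says the weights decay; the schedule shape is assumed.
    total_steps must be positive.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return w_min + 0.5 * (w0 - w_min) * (1.0 + math.cos(math.pi * progress))


def combined_loss(rl_loss, sft_loss, step, total_steps):
    """Assumed objective: on-policy RL loss plus a decaying-weight SFT loss."""
    return rl_loss + sft_weight(step, total_steps) * sft_loss


if __name__ == "__main__":
    demos = sample_few_shot(list(range(10_000)))
    print(len(demos))
    for step in (0, 50, 100):
        print(step, round(sft_weight(step, 100), 3))
```

Under this sketch, the supervised term dominates early in training and fades as the small demonstration set is revisited over many epochs, which is one plausible way to limit overfitting to the 128 examples.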

Original Article

Cached at: 05/15/26, 04:26 PM

Source: https://huggingface.co/papers/2605.15012

Similar Articles

From Demonstrations to Rewards: Test-Time Prompt Optimization for VLM Reward Models

Fair Reinforcement Learning

Reinforcement learning with prediction-based rewards

Reinforcing Few-step Generators via Reward-Tilted Distribution Matching

CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization
