Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance
Summary
FEST is a few-shot demonstration-guided reinforcement learning algorithm that achieves strong performance with minimal supervised fine-tuning data by combining supervised signals, on-policy learning, and weighted training to prevent overfitting.
Cached at: 05/15/26, 04:26 PM
Source: https://huggingface.co/papers/2605.15012
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has achieved great success in developing Large Language Models (LLMs) with chain-of-thought rollouts for many tasks such as math and coding. Nevertheless, RLVR struggles with sample efficiency on difficult problems where correct rollouts are hard to generate. Prior works propose to address this issue via demonstration-guided RLVR, i.e., to conduct Supervised Fine-Tuning (SFT) when RL fails; however, SFT often requires a lot of data, which can be expensive to acquire. In this paper, we propose FEST, a FEw-ShoT demonstration-guided RLVR algorithm. It attains compelling results with only 128 demonstrations randomly selected from an SFT dataset. We find that three components are vital for its success: a supervised signal, an on-policy signal, and decaying weights on the few-shot SFT dataset to prevent overfitting from multiple-epoch training. On several benchmarks, FEST outperforms baselines while using orders of magnitude less SFT data, even matching their performance with the full dataset.
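The abstract names the ingredients but not their exact form, so the following is only a minimal sketch of one way they might fit together: draw 128 demonstrations at random, then add a supervised loss on them to the on-policy RL loss with a weight that decays over training. The function names, the cosine schedule, and the additive combination are all assumptions made for illustration; they are not taken from the paper.

```python
import math
import random


def select_few_shot(sft_dataset, k=128, seed=0):
    """Randomly pick k demonstrations (k=128 matches the abstract)."""
    rng = random.Random(seed)
    return rng.sample(sft_dataset, k)


def sft_weight(step, total_steps, w0=1.0):
    """Hypothetical cosine decay from w0 down to 0.

    The schedule FEST actually uses is not stated in the abstract;
    any monotonically decreasing schedule would fit this sketch.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return w0 * 0.5 * (1.0 + math.cos(math.pi * progress))


def combined_loss(rl_loss, sft_loss, step, total_steps):
    """Assumed additive form: on-policy RL loss plus a decaying SFT term."""
    return rl_loss + sft_weight(step, total_steps) * sft_loss


if __name__ == "__main__":
    # Placeholder dataset of demonstration ids, for illustration only.
    demos = select_few_shot(list(range(10_000)))
    print(len(demos))
    for step in (0, 500, 1000):
        print(step, combined_loss(rl_loss=1.0, sft_loss=2.0,
                                  step=step, total_steps=1000))
```

Shrinking the supervised weight is what lets the same small demonstration set be reused for many epochs: early on the demonstrations supply signal on problems where correct rollouts are rare, and later their influence fades so that repeated passes over them are less likely to cause overfitting.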
Similar Articles
From Demonstrations to Rewards: Test-Time Prompt Optimization for VLM Reward Models
Proposes Demo2Reward, a test-time prompt optimization technique for VLM reward models using a few expert demonstrations, significantly reducing false positives and improving policy learning in robotics without additional model training.
Fair Reinforcement Learning
Fair Reinforcement Learning introduces Democratic Alignment to incorporate multiple competing value sets from different agents, overcoming traditional RLHF limitations, and achieves orders of magnitude faster optimization via a black-box policy wrapper.
Reinforcement learning with prediction-based rewards
OpenAI introduces Random Network Distillation (RND), a prediction-based method for encouraging exploration in RL agents through curiosity, achieving human-level performance on Montezuma's Revenge without demonstrations or game state access.
Reinforcing Few-step Generators via Reward-Tilted Distribution Matching
RTDMD is a two-stage framework combining distribution matching distillation with reward-guided reinforcement learning to improve few-step image generation alignment with human preferences. It achieves state-of-the-art results on multiple models with only 4 inference steps.
CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization
CEPO improves reinforcement learning with verifiable rewards by using contrastive signals from rejected rollouts to distinguish decisive reasoning steps from filler tokens, achieving higher accuracy on multimodal math reasoning benchmarks compared to GRPO.