The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement
Summary
The SAVE framework improves reward model training by using value functions to grade on-policy responses and update models through contrastive objectives, achieving outperforming results across six benchmarks.
View Cached Full Text
Cached at: 06/01/26, 07:18 AM
Paper page - The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement
Source: https://huggingface.co/papers/2605.30888
Abstract
SAVE framework improves reward model training by using value functions to grade on-policy responses and update models through contrastive objectives.
Building strongreward models(RMs) for language model alignment is bottlenecked by the cost and difficulty of acquiring diverse and reliable preference data from human annotation or judge models. It is dramatically worse as the policy evolves beyond the static RM training. Therefore, we propose SAVE (Self-supervised reward model improvement via Value-Anchored On-policy feedback), a framework that gradeson-policy responsesas feedback by using thevalue functionfor on-policy RM training. SAVE naturally converts the reward-gradedon-policy responsesinto supervision with a prompt-specific value head as an adaptive anchor. It computes RM advantages and filters ambiguous samples to update the RM via acontrastive objective. The effectiveness of SAVE for enhancing RM training is strongly validated through rigorous empirical evaluation across six diverse benchmarks. It achieves outperforming results across all datasets while maintaining consistent improvements across three RL algorithms (GRPO,RLOO,GSPO) and different policy backbones.
View arXiv pageView PDFAdd to collection
Get this paper in your agent:
hf papers read 2605\.30888
Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash
Models citing this paper0
No model linking this paper
Cite arxiv.org/abs/2605.30888 in a model README.md to link it from this page.
Datasets citing this paper0
No dataset linking this paper
Cite arxiv.org/abs/2605.30888 in a dataset README.md to link it from this page.
Spaces citing this paper0
No Space linking this paper
Cite arxiv.org/abs/2605.30888 in a Space README.md to link it from this page.
Collections including this paper0
No Collection including this paper
Add this paper to acollectionto link it from this page.
Similar Articles
Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards
Proposes Correction-Oriented Policy Optimization (CIPO), an extension to RLVR that converts failed trajectories into correction-oriented supervision, improving reasoning and correction performance in LLMs across math and code benchmarks.
Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR
This paper introduces POW3R, a policy-aware rubric reward framework for reinforcement learning with verifiable rewards (RLVR). It shows that static rubric aggregation misallocates learning signal, and POW3R achieves faster convergence and better performance across multiple settings.
Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges
Survey introduces the Proxy Compression Hypothesis to explain how RLHF and related methods systematically induce reward hacking, deception, and oversight gaming in large language and multimodal models.
@LakshyAAAgrawal: Learning from rich textual feedback (errors, traces, partial reasoning) beats scalar reward alone for LLM optimization.…
Fast-Slow Training (FST) interleaves context optimization (via GEPA) with model weight updates via RL, achieving 3× sample efficiency over RL alone on math, code, and physics reasoning while preserving plasticity and enabling continual learning.
From Demonstrations to Rewards: Test-Time Prompt Optimization for VLM Reward Models
Proposes Demo2Reward, a test-time prompt optimization technique for VLM reward models using a few expert demonstrations, significantly reducing false positives and improving policy learning in robotics without additional model training.