Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR
Summary
This paper introduces POW3R, a policy-aware rubric reward framework for reinforcement learning with verifiable rewards (RLVR). It shows that static rubric aggregation misallocates learning signal, and reports that POW3R converges faster and performs better across three base policies in multimodal and text-only settings.
Source: https://huggingface.co/papers/2605.20164 “Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR”
As RL post-training expands beyond fully verifiable domains, rubrics, or checklists, are becoming a common reward interface for open-ended and multimodal tasks.
The question we study is: Should the same rubric weights that define final answer quality also determine what the current policy learns from during RL?
Our finding is no. A criterion can be important for the final response, but if all sampled rollouts pass it or all sampled rollouts fail it, it is non-contrastive and provides no group-relative learning signal. Across our multimodal setting and HealthBench, roughly half of rubric criteria are non-contrastive for a fresh policy, and static aggregation routes 45–51% of within-category training pressure to such criteria.
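The claim about criteria that every rollout passes or every rollout fails can be checked with a minimal sketch. It assumes binary pass/fail judgments per criterion, nonnegative static weights, and a group-mean baseline in the style of group-relative methods. The function names, toy weights, and rollout matrix are hypothetical and not taken from the paper, and the "wasted weight" fraction is only a crude stand-in for the paper's training-pressure measure, which may be defined differently.

```python
from typing import List


def contrastive_mask(passes: List[List[int]]) -> List[bool]:
    """passes[i][k] is 1 if rollout i satisfies criterion k, else 0.

    A criterion is contrastive only if rollouts in the group disagree on it.
    """
    num_criteria = len(passes[0])
    mask = []
    for k in range(num_criteria):
        column = [row[k] for row in passes]
        mask.append(min(column) != max(column))
    return mask


def wasted_weight_fraction(weights: List[float], mask: List[bool]) -> float:
    """Share of static rubric weight sitting on non-contrastive criteria."""
    wasted = sum(w for w, is_contrastive in zip(weights, mask) if not is_contrastive)
    return wasted / sum(weights)


def group_advantages(passes: List[List[int]], weights: List[float]) -> List[float]:
    """Weighted rubric score of each rollout minus the group mean."""
    scores = [sum(w * p for w, p in zip(weights, row)) for row in passes]
    mean = sum(scores) / len(scores)
    return [s - mean for s in scores]


# Toy group: 4 rollouts scored against 4 criteria (hypothetical data).
passes = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
]
static_weights = [3.0, 2.0, 1.0, 1.0]

mask = contrastive_mask(passes)
wasted = wasted_weight_fraction(static_weights, mask)

# Zeroing the weights of non-contrastive criteria leaves advantages unchanged,
# because a term that is constant across the group cancels against the mean.
full = group_advantages(passes, static_weights)
pruned = group_advantages(passes, [0.0, 0.0, 1.0, 1.0])
```

In this toy group the two highest-weighted criteria are all-pass and all-fail, so they contribute nothing to the advantages even though they carry most of the static weight.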
In this work, we:
• diagnose how static rubric aggregation misallocates learning signal,
• show that human importance and policy-dependent usefulness can decouple, and
• introduce POW3R, a policy-aware rubric reward framework that preserves the evaluation target while adapting criterion-level reward weights during training.
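The idea of keeping the evaluation target fixed while adapting criterion-level training weights can be pictured with the sketch below. This is not POW3R's update rule, which this page does not specify. It assumes a hypothetical heuristic that scales each static weight by the empirical pass-rate variance p(1 - p) within the sampled group and renormalizes to the original total weight. All names and data are illustrative.

```python
from typing import List


def pass_rates(passes: List[List[int]]) -> List[float]:
    """Fraction of rollouts in the group that satisfy each criterion."""
    n = len(passes)
    return [sum(row[k] for row in passes) / n for k in range(len(passes[0]))]


def training_weights(static_weights: List[float], rates: List[float]) -> List[float]:
    """Hypothetical heuristic, NOT the paper's rule: scale each static weight
    by p * (1 - p), then rescale so the total weight is unchanged."""
    raw = [w * p * (1.0 - p) for w, p in zip(static_weights, rates)]
    total_raw = sum(raw)
    if total_raw == 0.0:
        # No criterion is contrastive; fall back to the static weights.
        return list(static_weights)
    scale = sum(static_weights) / total_raw
    return [r * scale for r in raw]


def rubric_score(weights: List[float], row: List[int]) -> float:
    """Normalized weighted rubric score for one rollout."""
    return sum(w * p for w, p in zip(weights, row)) / sum(weights)


# Toy group: 4 rollouts scored against 4 criteria (hypothetical data).
passes = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
]
static_weights = [3.0, 2.0, 1.0, 1.0]

rates = pass_rates(passes)
train_w = training_weights(static_weights, rates)

# Evaluation always uses the static weights; only the training reward changes.
eval_score = rubric_score(static_weights, passes[2])
train_score = rubric_score(train_w, passes[2])
```

Under this heuristic the all-pass and all-fail criteria receive zero training weight for the current group, while the score used for evaluation is still computed from the static weights.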
Across three base policies and multimodal/text-only settings, POW3R wins 24/30 base-policy/metric comparisons and reaches the same plateau in 2.5–4× fewer training steps.
Similar Articles
RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards
This paper introduces RubricEM, a reinforcement learning framework that uses rubric-guided policy decomposition and reflection-based meta-policy evolution to train deep research agents for long-form tasks. The resulting RubricEM-8B model demonstrates strong performance on long-form research benchmarks by leveraging stage-aware planning and denser semantic feedback.
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria
This paper introduces the Auto-Rubric as Reward (ARR) framework, which externalizes implicit preference knowledge into explicit rubrics for multimodal alignment. It proposes Rubric Policy Optimization (RPO) to stabilize policy gradients, achieving better performance in text-to-image and image editing tasks.
Reward Hacking in Rubric-Based Reinforcement Learning
This paper investigates reward hacking in rubric-based reinforcement learning, analyzing the divergence between training verifiers and evaluation metrics. It introduces a diagnostic for the 'self-internalization gap' and demonstrates that stronger verification reduces but does not eliminate reward hacking.
RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains
RUBRIC-ARROW presents an alternating framework for reward modeling that improves upon rubric-based methods by reducing ties and leveraging pairwise preference data, achieving competitive accuracy and gains for LLM post-training in non-verifiable domains.
C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences
C2 proposes a scalable rubric-augmented reward modeling framework that trains a cooperative rubric generator and critical verifier exclusively from binary preferences, eliminating the need for costly rubric annotations while achieving up to 6.5 point gains on RM-Bench.