RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains
Summary
RUBRIC-ARROW presents an alternating framework for reward modeling that improves upon rubric-based methods by reducing ties and leveraging pairwise preference data, achieving competitive accuracy and gains for LLM post-training in non-verifiable domains.
View Cached Full Text
Cached at: 05/29/26, 07:00 AM
Paper page - RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains
Source: https://huggingface.co/papers/2605.29156
Abstract
RUBRIC-ARROW presents an alternating framework for reward modeling that improves upon rubric-based methods by reducing ties and leveraging pairwise preference data for training.
Pointwisereward modelingoffers critical signals forLLM post-training, yet struggles with absolute scoring in subjective, non-verifiable settings.Rubric-based methodsaddress this by decomposing evaluation into explicit criteria, but existing approaches typically depend on frontier LLMs and suffer from ties caused by hard Boolean aggregation. We present RUBRIC-ARROW, an alternating framework that jointly trains a rubric generator and a rubric-conditioned judge, with itsRL stageusing onlypairwise preference data. Our method couples aprobability-based scoring rulethat reduces ties withphase-specific preference-based rewardsand analternating GRPO schemethat together train thepointwise evaluator. Extensive experiments show that RUBRIC-ARROW achieves competitive reward-modeling accuracy and yields consistent gains for downstream policy post-training.
View arXiv pageView PDFAdd to collection
Models citing this paper0
No model linking this paper
Cite arxiv.org/abs/2605.29156 in a model README.md to link it from this page.
Datasets citing this paper0
No dataset linking this paper
Cite arxiv.org/abs/2605.29156 in a dataset README.md to link it from this page.
Spaces citing this paper0
No Space linking this paper
Cite arxiv.org/abs/2605.29156 in a Space README.md to link it from this page.
Collections including this paper0
No Collection including this paper
Add this paper to acollectionto link it from this page.
Similar Articles
C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences
C2 proposes a scalable rubric-augmented reward modeling framework that trains a cooperative rubric generator and critical verifier exclusively from binary preferences, eliminating the need for costly rubric annotations while achieving up to 6.5 point gains on RM-Bench.
DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification
DeltaRubric is a research paper introducing a two-step multimodal preference evaluation approach using a single MLLM to improve reward modeling reliability through joint planning and verification.
Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR
This paper introduces POW3R, a policy-aware rubric reward framework for reinforcement learning with verifiable rewards (RLVR). It shows that static rubric aggregation misallocates learning signal, and POW3R achieves faster convergence and better performance across multiple settings.
Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria
This paper introduces the Auto-Rubric as Reward (ARR) framework, which externalizes implicit preference knowledge into explicit rubrics for multimodal alignment. It proposes Rubric Policy Optimization (RPO) to stabilize policy gradients, achieving better performance in text-to-image and image editing tasks.
ARBOR: Online Process Rewards via a Reusable Rubric Buffer for Search Agents
ARBOR introduces a reusable rubric buffer to provide online process rewards for LLM-based search agents, improving training efficiency when outcome-only rewards are insufficient. It outperforms GRPO and DAPO on multi-hop QA benchmarks, converting up to 42% of zero-gradient training groups into informative ones.