RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains

Hugging Face Daily Papers Papers

Summary

RUBRIC-ARROW presents an alternating framework for reward modeling that improves upon rubric-based methods by reducing ties and leveraging pairwise preference data, achieving competitive accuracy and gains for LLM post-training in non-verifiable domains.

Pointwise reward modeling offers critical signals for LLM post-training, yet struggles with absolute scoring in subjective, non-verifiable settings. Rubric-based methods address this by decomposing evaluation into explicit criteria, but existing approaches typically depend on frontier LLMs and suffer from ties caused by hard Boolean aggregation. We present RUBRIC-ARROW, an alternating framework that jointly trains a rubric generator and a rubric-conditioned judge, with its RL stage using only pairwise preference data. Our method couples a probability-based scoring rule that reduces ties with phase-specific preference-based rewards and an alternating GRPO scheme that together train the pointwise evaluator. Extensive experiments show that RUBRIC-ARROW achieves competitive reward-modeling accuracy and yields consistent gains for downstream policy post-training.
Original Article
View Cached Full Text

Cached at: 05/29/26, 07:00 AM

Paper page - RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains

Source: https://huggingface.co/papers/2605.29156

Abstract

RUBRIC-ARROW presents an alternating framework for reward modeling that improves upon rubric-based methods by reducing ties and leveraging pairwise preference data for training.

Pointwisereward modelingoffers critical signals forLLM post-training, yet struggles with absolute scoring in subjective, non-verifiable settings.Rubric-based methodsaddress this by decomposing evaluation into explicit criteria, but existing approaches typically depend on frontier LLMs and suffer from ties caused by hard Boolean aggregation. We present RUBRIC-ARROW, an alternating framework that jointly trains a rubric generator and a rubric-conditioned judge, with itsRL stageusing onlypairwise preference data. Our method couples aprobability-based scoring rulethat reduces ties withphase-specific preference-based rewardsand analternating GRPO schemethat together train thepointwise evaluator. Extensive experiments show that RUBRIC-ARROW achieves competitive reward-modeling accuracy and yields consistent gains for downstream policy post-training.

View arXiv pageView PDFAdd to collection

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2605.29156 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2605.29156 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2605.29156 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Similar Articles

C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences

Hugging Face Daily Papers

C2 proposes a scalable rubric-augmented reward modeling framework that trains a cooperative rubric generator and critical verifier exclusively from binary preferences, eliminating the need for costly rubric annotations while achieving up to 6.5 point gains on RM-Bench.

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

Hugging Face Daily Papers

This paper introduces POW3R, a policy-aware rubric reward framework for reinforcement learning with verifiable rewards (RLVR). It shows that static rubric aggregation misallocates learning signal, and POW3R achieves faster convergence and better performance across multiple settings.

ARBOR: Online Process Rewards via a Reusable Rubric Buffer for Search Agents

arXiv cs.CL

ARBOR introduces a reusable rubric buffer to provide online process rewards for LLM-based search agents, improving training efficiency when outcome-only rewards are insufficient. It outperforms GRPO and DAPO on multi-hop QA benchmarks, converting up to 42% of zero-gradient training groups into informative ones.