RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains

Hugging Face Daily Papers 05/27/26, 12:00 AM Papers

Summary

RUBRIC-ARROW presents an alternating framework for reward modeling that improves upon rubric-based methods by reducing ties and leveraging pairwise preference data, achieving competitive accuracy and gains for LLM post-training in non-verifiable domains.

Pointwise reward modeling offers critical signals for LLM post-training, yet struggles with absolute scoring in subjective, non-verifiable settings. Rubric-based methods address this by decomposing evaluation into explicit criteria, but existing approaches typically depend on frontier LLMs and suffer from ties caused by hard Boolean aggregation. We present RUBRIC-ARROW, an alternating framework that jointly trains a rubric generator and a rubric-conditioned judge, with its RL stage using only pairwise preference data. Our method couples a probability-based scoring rule that reduces ties with phase-specific preference-based rewards and an alternating GRPO scheme that together train the pointwise evaluator. Extensive experiments show that RUBRIC-ARROW achieves competitive reward-modeling accuracy and yields consistent gains for downstream policy post-training.
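To make the tie problem concrete, here is a minimal sketch contrasting hard Boolean aggregation with a probability-based score. It is not the paper's scoring rule, which the abstract does not specify. The function names, the 0.5 threshold, the mean aggregation, and the per-criterion probabilities are all assumptions chosen for illustration.

```python
from typing import Sequence


def boolean_score(verdicts: Sequence[bool]) -> float:
    """Hard aggregation: fraction of rubric criteria judged satisfied."""
    return sum(verdicts) / len(verdicts)


def probability_score(p_yes: Sequence[float]) -> float:
    """Soft aggregation (assumed form): mean judge probability per criterion."""
    return sum(p_yes) / len(p_yes)


# Hypothetical judge probabilities that each of three criteria is satisfied.
resp_a = [0.91, 0.62, 0.30]
resp_b = [0.75, 0.55, 0.45]

# Thresholding at 0.5 (an assumption) turns probabilities into Boolean verdicts.
hard_a = boolean_score([p >= 0.5 for p in resp_a])
hard_b = boolean_score([p >= 0.5 for p in resp_b])

# Keeping the probabilities preserves the difference between the responses.
soft_a = probability_score(resp_a)
soft_b = probability_score(resp_b)

print("hard:", hard_a, hard_b)
print("soft:", soft_a, soft_b)
```

Both responses satisfy two of three criteria after thresholding, so the Boolean scores tie, while the soft scores still separate them.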

Original Article

Cached at: 05/29/26, 07:00 AM

Source: https://huggingface.co/papers/2605.29156

Abstract

RUBRIC-ARROW presents an alternating framework for reward modeling that improves upon rubric-based methods by reducing ties and leveraging pairwise preference data for training.

Pointwise reward modeling offers critical signals for LLM post-training, yet struggles with absolute scoring in subjective, non-verifiable settings. Rubric-based methods address this by decomposing evaluation into explicit criteria, but existing approaches typically depend on frontier LLMs and suffer from ties caused by hard Boolean aggregation. We present RUBRIC-ARROW, an alternating framework that jointly trains a rubric generator and a rubric-conditioned judge, with its RL stage using only pairwise preference data. Our method couples a probability-based scoring rule that reduces ties with phase-specific preference-based rewards and an alternating GRPO scheme that together train the pointwise evaluator. Extensive experiments show that RUBRIC-ARROW achieves competitive reward-modeling accuracy and yields consistent gains for downstream policy post-training.
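One way to read "using only pairwise preference data" for a pointwise evaluator is sketched below: score the preferred and rejected responses independently, reward the rollout when the scores rank them correctly, then normalize rewards within a group as GRPO does. The abstract does not give the phase-specific rewards, so `preference_reward`, the zero reward for ties, and the example scores are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def preference_reward(score_chosen: float, score_rejected: float) -> float:
    """1.0 if the pointwise scores rank the preferred response higher, else 0.0.

    Ties earn 0.0 here (an assumption), so tied scores are never rewarded.
    """
    return 1.0 if score_chosen > score_rejected else 0.0


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: center by the group mean, scale by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical (chosen, rejected) pointwise scores from four rollouts on one prompt.
rollout_scores: List[Tuple[float, float]] = [
    (0.61, 0.58),
    (0.40, 0.40),  # tie: no reward
    (0.70, 0.20),
    (0.30, 0.55),  # wrong order: no reward
]

rewards = [preference_reward(c, r) for c, r in rollout_scores]
advantages = group_relative_advantages(rewards)

print(rewards)
print(advantages)
```

In an alternating scheme, the same kind of reward could drive updates to the rubric generator in one phase and to the judge in the other, with the other model held fixed. That split is an interpretation of the abstract, not a confirmed detail.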

Similar Articles

C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences

DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

Beyond Score Prediction: LLM-Based Essay Scoring and Feedback Generation via Reinforcement Learning with Rubric Rewards

Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria
