Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

Hugging Face Daily Papers

Summary

This paper introduces the Auto-Rubric as Reward (ARR) framework, which externalizes a VLM's implicit preference knowledge into explicit, prompt-specific rubrics for multimodal alignment. It also proposes Rubric Policy Optimization (RPO), which distills the rubric-based evaluation into a binary reward that stabilizes policy gradients, outperforming pairwise reward models and VLM judges on text-to-image generation and image editing benchmarks.

Aligning multimodal generative models with human preferences demands reward signals that respect the compositional, multi-dimensional structure of human judgment. Prevailing RLHF approaches reduce this structure to scalar or pairwise labels, collapsing nuanced preferences into opaque parametric proxies and exposing vulnerabilities to reward hacking. While recent Rubrics-as-Reward (RaR) methods attempt to recover this structure through explicit criteria, generating rubrics that are simultaneously reliable, scalable, and data-efficient remains an open problem. We introduce Auto-Rubric as Reward (ARR), a framework that reframes reward modeling from implicit weight optimization to explicit, criteria-based decomposition. Before any pairwise comparison, ARR externalizes a VLM's internalized preference knowledge as prompt-specific rubrics, translating holistic intent into independently verifiable quality dimensions. This conversion of implicit preference structure into inspectable, interpretable constraints substantially suppresses evaluation biases including positional bias, enabling both zero-shot deployment and few-shot conditioning on minimal supervision. To extend these gains into generative training, we propose Rubric Policy Optimization (RPO), which distills ARR's structured multi-dimensional evaluation into a robust binary reward, replacing opaque scalar regression with rubric-conditioned preference decisions that stabilize policy gradients. On text-to-image generation and image editing benchmarks, ARR-RPO outperforms pairwise reward models and VLM judges, demonstrating that explicitly externalizing implicit preference knowledge into structured rubrics achieves more reliable, data-efficient multimodal alignment, revealing that the bottleneck is the absence of a factorized interface, not a deficit of knowledge.
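The abstract describes two steps: externalizing a VLM's preference knowledge as prompt-specific, independently verifiable criteria, and then scoring candidates against those criteria instead of regressing a scalar. The page carries no code, so the sketch below is only a rough illustration of that idea; the `vlm_chat` helper, the prompt wording, and the all-criteria-pass aggregation rule are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch of a rubric-based reward in the spirit of ARR (not the paper's code).
from dataclasses import dataclass


def vlm_chat(prompt: str, images=None) -> str:
    """Placeholder for any vision-language model chat API.
    Replace with a real client call; returns a canned answer so the sketch runs."""
    return "yes"


@dataclass
class Criterion:
    name: str
    question: str  # phrased so the VLM can answer yes/no


def generate_rubric(user_prompt: str, max_criteria: int = 5) -> list[Criterion]:
    """Externalize the VLM's implicit preferences as prompt-specific criteria."""
    instruction = (
        "Decompose the following image-generation request into at most "
        f"{max_criteria} independently verifiable quality criteria. "
        "Return one yes/no question per line.\n\nRequest: " + user_prompt
    )
    lines = [line.strip() for line in vlm_chat(instruction).splitlines() if line.strip()]
    return [Criterion(name=f"c{i}", question=q) for i, q in enumerate(lines[:max_criteria])]


def check_criterion(criterion: Criterion, user_prompt: str, image) -> bool:
    """Ask the VLM one rubric question about one candidate image."""
    answer = vlm_chat(
        f"Request: {user_prompt}\nCriterion: {criterion.question}\n"
        "Answer strictly 'yes' or 'no'.",
        images=[image],
    )
    return answer.strip().lower().startswith("yes")


def binary_reward(user_prompt: str, image, rubric: list[Criterion]) -> float:
    """Collapse per-criterion verdicts into one binary reward (assumed AND rule)."""
    return 1.0 if all(check_criterion(c, user_prompt, image) for c in rubric) else 0.0
```

In a pairwise setting, the same per-criterion verdicts could instead be compared across two candidates to produce the rubric-conditioned preference decision the abstract mentions, rather than an absolute pass/fail score.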
Original Article

Paper page - Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

Source: https://huggingface.co/papers/2605.08354

Abstract

The Auto-Rubric as Reward (ARR) framework externalizes implicit preference knowledge into structured rubrics for improved multimodal alignment, while Rubric Policy Optimization (RPO) stabilizes policy gradients through binary rewards derived from its multi-dimensional evaluation.
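How such a binary reward enters policy optimization is easiest to see in a generic REINFORCE-style update. The snippet below shows that generic form in PyTorch; it is not the RPO objective itself, which the page does not specify.

```python
# Generic REINFORCE-style update with a binary rubric reward (illustrative only;
# RPO as defined in the paper may differ substantially).
import torch


def policy_gradient_step(log_probs: torch.Tensor, rewards: torch.Tensor,
                         optimizer: torch.optim.Optimizer) -> float:
    """log_probs: summed log-probabilities of each sampled generation, shape (N,).
    rewards: binary rubric rewards in {0.0, 1.0}, shape (N,)."""
    # Center the binary rewards within the batch so the gradient moves probability
    # mass from rubric-failing samples toward rubric-passing ones.
    advantages = (rewards - rewards.mean()).detach()
    loss = -(advantages * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Centering the {0, 1} rewards keeps the per-sample weights bounded, which is one plausible reading of how a binary, rubric-derived signal could yield more stable policy gradients than an unbounded scalar reward.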



Get this paper in your agent:

hf papers read 2605.08354

Don’t have the latest CLI? Install it with: curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

OpenEnvisionLab/Auto-Rubric-as-Reward (Text-to-Image) • Updated 33 minutes ago • 2


Similar Articles

C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences

Hugging Face Daily Papers

C2 proposes a scalable rubric-augmented reward modeling framework that trains a cooperative rubric generator and critical verifier exclusively from binary preferences, eliminating the need for costly rubric annotations while achieving up to 6.5 point gains on RM-Bench.

RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

Hugging Face Daily Papers

This paper introduces RubricEM, a reinforcement learning framework that uses rubric-guided policy decomposition and reflection-based meta-policy evolution to train deep research agents for long-form tasks. The resulting RubricEM-8B model demonstrates strong performance on long-form research benchmarks by leveraging stage-aware planning and denser semantic feedback.

Reward Hacking in Rubric-Based Reinforcement Learning

Hugging Face Daily Papers

This paper investigates reward hacking in rubric-based reinforcement learning, analyzing the divergence between training verifiers and evaluation metrics. It introduces a diagnostic for the 'self-internalization gap' and demonstrates that stronger verification reduces but does not eliminate reward hacking.