Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria
Summary
This paper introduces the Auto-Rubric as Reward (ARR) framework, which externalizes implicit preference knowledge into explicit rubrics for multimodal alignment. It proposes Rubric Policy Optimization (RPO) to stabilize policy gradients, achieving better performance in text-to-image and image editing tasks.
Source: https://huggingface.co/papers/2605.08354
Abstract
Auto-Rubric as Reward (ARR) framework externalizes implicit preference knowledge into structured rubrics for improved multimodal alignment, while Rubric Policy Optimization (RPO) stabilizes policy gradients through binary rewards derived from multi-dimensional evaluation.
Aligning multimodal generative models with human preferences demands reward signals that respect the compositional, multi-dimensional structure of human judgment. Prevailing RLHF approaches reduce this structure to scalar or pairwise labels, collapsing nuanced preferences into opaque parametric proxies and exposing vulnerabilities to reward hacking. While recent Rubrics-as-Reward (RaR) methods attempt to recover this structure through explicit criteria, generating rubrics that are simultaneously reliable, scalable, and data-efficient remains an open problem. We introduce Auto-Rubric as Reward (ARR), a framework that reframes reward modeling from implicit weight optimization to explicit, criteria-based decomposition. Before any pairwise comparison, ARR externalizes a VLM's internalized preference knowledge as prompt-specific rubrics, translating holistic intent into independently verifiable quality dimensions. This conversion of implicit preference structure into inspectable, interpretable constraints substantially suppresses evaluation biases including positional bias, enabling both zero-shot deployment and few-shot conditioning on minimal supervision. To extend these gains into generative training, we propose Rubric Policy Optimization (RPO), which distills ARR's structured multi-dimensional evaluation into a robust binary reward, replacing opaque scalar regression with rubric-conditioned preference decisions that stabilize policy gradients. On text-to-image generation and image editing benchmarks, ARR-RPO outperforms pairwise reward models and VLM judges, demonstrating that explicitly externalizing implicit preference knowledge into structured rubrics achieves more reliable, data-efficient multimodal alignment, revealing that the bottleneck is the absence of a factorized interface, not a deficit of knowledge.
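The core mechanics the abstract describes are rubric decomposition (a prompt-specific list of independently verifiable criteria) and distillation of the per-criterion scores into a binary preference reward for policy optimization. A minimal sketch of that aggregation step is below; all names are hypothetical, and the `check` callable stands in for the VLM verification call that the actual ARR pipeline would perform per criterion.

```python
# Hypothetical sketch of rubric-conditioned binary reward aggregation.
# In ARR/RPO a VLM generates the prompt-specific rubric and verifies each
# criterion against an image; here simple string checks stand in for that.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    name: str
    check: Callable[[str], bool]  # placeholder for a VLM verification call


def rubric_reward(candidate: str, rubric: List[Criterion]) -> float:
    """Score a candidate as the fraction of rubric criteria it satisfies."""
    return sum(c.check(candidate) for c in rubric) / len(rubric)


def binary_preference(a: str, b: str, rubric: List[Criterion]) -> int:
    """RPO-style binary reward: 1 if candidate `a` wins under the rubric."""
    return int(rubric_reward(a, rubric) > rubric_reward(b, rubric))


# Toy rubric for the prompt "a red cube on a blue table"
rubric = [
    Criterion("depicts a red cube", lambda s: "red cube" in s),
    Criterion("depicts a blue table", lambda s: "blue table" in s),
]

print(binary_preference("a red cube on a blue table", "a green sphere", rubric))  # 1
```

The point of the binary decision, per the abstract, is that it replaces noisy scalar regression targets with a bounded reward, which stabilizes the policy gradient during training.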
Get this paper in your agent:
hf papers read 2605.08354
Don't have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Models citing this paper: 1
#### OpenEnvisionLab/Auto-Rubric-as-Reward (Text-to-Image) • Updated 33 minutes ago • 2
Datasets citing this paper: 0
No dataset linking this paper
Cite arxiv.org/abs/2605.08354 in a dataset README.md to link it from this page.
Spaces citing this paper: 0
No Space linking this paper
Cite arxiv.org/abs/2605.08354 in a Space README.md to link it from this page.
Collections including this paper: 0
No Collection including this paper
Add this paper to a collection to link it from this page.
Similar Articles
C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences
C2 proposes a scalable rubric-augmented reward modeling framework that trains a cooperative rubric generator and critical verifier exclusively from binary preferences, eliminating the need for costly rubric annotations while achieving up to 6.5 point gains on RM-Bench.
RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards
This paper introduces RubricEM, a reinforcement learning framework that uses rubric-guided policy decomposition and reflection-based meta-policy evolution to train deep research agents for long-form tasks. The resulting RubricEM-8B model demonstrates strong performance on long-form research benchmarks by leveraging stage-aware planning and denser semantic feedback.
DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification
DeltaRubric is a research paper introducing a two-step multimodal preference evaluation approach using a single MLLM to improve reward modeling reliability through joint planning and verification.
Quantifying the Statistical Effect of Rubric Modifications on Human-Autorater Agreement
This study analyzes how modifications to evaluation rubrics, such as shifting from holistic to analytic criteria, impact the agreement between human raters and AI autoraters. The findings suggest that providing examples and reducing bias improves agreement, while higher complexity tends to decrease it.
Reward Hacking in Rubric-Based Reinforcement Learning
This paper investigates reward hacking in rubric-based reinforcement learning, analyzing the divergence between training verifiers and evaluation metrics. It introduces a diagnostic for the 'self-internalization gap' and demonstrates that stronger verification reduces but does not eliminate reward hacking.