DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

Hugging Face Daily Papers 05/10/26, 12:00 AM Papers

Summary

DeltaRubric is a research paper introducing a two-step multimodal preference evaluation approach using a single MLLM to improve reward modeling reliability through joint planning and verification.

Aligning Multimodal Large Language Models (MLLMs) requires reliable reward models, yet existing single-step evaluators can suffer from lazy judging, exploiting language priors over fine-grained visual verification. While rubric-based evaluation mitigates these biases in text-only settings, extending it to multimodal tasks is bottlenecked by the complexity of visual reasoning. The critical differences between responses often depend on instance-specific visual details. Robust evaluation requires dynamically synthesizing rubrics that isolate spatial and factual discrepancies. To address this, we introduce DeltaRubric, an approach that reformulates multimodal preference evaluation as a plan-and-execute process within a single MLLM. DeltaRubric operates in two steps: acting first as a Disagreement Planner, the model generates a neutral, instance-specific verification checklist. Transitioning into a Checklist Verifier, it executes these self-generated checks against the image and question to produce the final grounded judgment. We formulate DeltaRubric as a multi-role reinforcement learning problem, jointly optimizing planning and verification capabilities. Validated on Qwen3-VL 4B and 8B Instruct models, DeltaRubric achieves solid empirical gains. For instance, on VL-RewardBench, it improves base model overall accuracy by +22.6 (4B) and +18.8 (8B) points, largely outperforming standard no-rubric baselines. The results demonstrate that decomposing evaluation into structured, verifiable steps leads to more reliable and generalizable multimodal reward modeling.
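The two-step flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt wording, the `MLLM` callable interface, the `Preferred: A/B` output convention, and the names `plan_checklist`, `verify`, `judge`, and `fake_model` are all assumptions made for the example. A canned stub stands in for a real model so the sketch runs on its own.

```python
import re
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical model interface (an assumption): (prompt, image_bytes) -> text.
MLLM = Callable[[str, bytes], str]


@dataclass
class Judgment:
    checklist: List[str]
    preferred: str  # "A" or "B"


def plan_checklist(model: MLLM, image: bytes, question: str, a: str, b: str) -> List[str]:
    """Step 1 (Disagreement Planner): ask for neutral, instance-specific checks."""
    prompt = (
        "Role: Disagreement Planner\n"
        f"Question: {question}\nResponse A: {a}\nResponse B: {b}\n"
        "List one visual check per line that would settle where the responses "
        "disagree. Do not state which response is better."
    )
    lines = model(prompt, image).splitlines()
    # Drop empty lines and leading bullet characters.
    return [ln.lstrip("-* ").strip() for ln in lines if ln.strip()]


def verify(model: MLLM, image: bytes, question: str, a: str, b: str,
           checklist: List[str]) -> str:
    """Step 2 (Checklist Verifier): run the checks against the image, then pick."""
    checks = "\n".join(f"{i}. {c}" for i, c in enumerate(checklist, 1))
    prompt = (
        "Role: Checklist Verifier\n"
        f"Question: {question}\nResponse A: {a}\nResponse B: {b}\n"
        f"Checks:\n{checks}\n"
        "Answer each check from the image, then end with "
        "'Preferred: A' or 'Preferred: B'."
    )
    match = re.search(r"Preferred:\s*([AB])\b", model(prompt, image))
    if match is None:
        raise ValueError("verifier output did not contain a final preference")
    return match.group(1)


def judge(model: MLLM, image: bytes, question: str, a: str, b: str) -> Judgment:
    """Run both roles with the same model, planner first, then verifier."""
    checklist = plan_checklist(model, image, question, a, b)
    return Judgment(checklist, verify(model, image, question, a, b, checklist))


def fake_model(prompt: str, image: bytes) -> str:
    """Canned stub (not a real MLLM) so the sketch runs end to end."""
    if prompt.startswith("Role: Disagreement Planner"):
        return "- Is the cup to the left of the plate?\n- How many chairs are visible?"
    return "1. Yes, left of the plate.\n2. Three chairs.\nPreferred: B"
```

Calling `judge(fake_model, b"", question, response_a, response_b)` returns a `Judgment` holding the generated checklist and the final preference. Keeping both roles behind one model callable mirrors the single-MLLM setup in the abstract. The reinforcement learning stage that jointly trains the two roles is not covered by this sketch.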

Original Article

Cached at: 05/12/26, 07:34 AM

Source: https://huggingface.co/papers/2605.09269

Abstract

DeltaRubric introduces a two-step multimodal preference evaluation approach using a single MLLM, where a Disagreement Planner generates instance-specific verification checklists and a Checklist Verifier executes these checks to produce grounded judgments, improving reward modeling reliability.

Aligning Multimodal Large Language Models (MLLMs) requires reliable reward models, yet existing single-step evaluators can suffer from lazy judging, exploiting language priors over fine-grained visual verification. While rubric-based evaluation mitigates these biases in text-only settings, extending it to multimodal tasks is bottlenecked by the complexity of visual reasoning. The critical differences between responses often depend on instance-specific visual details. Robust evaluation requires dynamically synthesizing rubrics that isolate spatial and factual discrepancies. To address this, we introduce DeltaRubric, an approach that reformulates multimodal preference evaluation as a plan-and-execute process within a single MLLM. DeltaRubric operates in two steps: acting first as a Disagreement Planner, the model generates a neutral, instance-specific verification checklist. Transitioning into a Checklist Verifier, it executes these self-generated checks against the image and question to produce the final grounded judgment. We formulate DeltaRubric as a multi-role reinforcement learning problem, jointly optimizing planning and verification capabilities. Validated on Qwen3-VL 4B and 8B Instruct models, DeltaRubric achieves solid empirical gains. For instance, on VL-RewardBench, it improves base model overall accuracy by +22.6 (4B) and +18.8 (8B) points, largely outperforming standard no-rubric baselines. The results demonstrate that decomposing evaluation into structured, verifiable steps leads to more reliable and generalizable multimodal reward modeling.
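To make the reported "points" concrete, here is a small sketch of how a pairwise preference accuracy and its gain over a base model could be computed. It assumes that accuracy means the percentage of response pairs where the judge picks the labeled preferred response, and that gains are percentage-point differences; this page does not spell out the benchmark's exact definition of overall accuracy. The function names and the toy labels are invented for illustration and are not results from the paper.

```python
from typing import Sequence


def pairwise_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Percentage of pairs where the judge picked the labeled preferred response."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * hits / len(gold)


def point_gain(base_acc: float, tuned_acc: float) -> float:
    """Gain in percentage points (an absolute difference, not a relative one)."""
    return tuned_acc - base_acc


# Toy labels, made up for this example only.
gold = ["A", "B", "B", "A"]
base_preds = ["A", "A", "B", "B"]   # hypothetical base judge
tuned_preds = ["A", "B", "B", "B"]  # hypothetical trained judge

base_acc = pairwise_accuracy(base_preds, gold)
tuned_acc = pairwise_accuracy(tuned_preds, gold)
gain = point_gain(base_acc, tuned_acc)
```

Under this reading, a gain of +22.6 points means that many more pairs per hundred were judged correctly, not a 22.6% relative improvement.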

Similar Articles

C2: Scalable Rubric-Augmented Reward Modeling from Binary Preferences

RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains

Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

RubricEM: Meta-RL with Rubric-guided Policy Decomposition beyond Verifiable Rewards

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR
