Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

Hugging Face Daily Papers 05/13/26, 12:00 AM Papers

image-editing benchmark reward-modeling evaluation computer-vision multimodal

Summary

Introduces Edit-Compass and EditReward-Compass, a unified benchmark suite for evaluating image editing models and reward models, with 2,388 annotated instances and 2,251 preference pairs for realistic RL scenarios.

Recent image editing models have achieved remarkable progress in instruction following, multimodal understanding, and complex visual editing. However, existing benchmarks often fail to faithfully reflect human judgment, especially for strong frontier models, due to limited task difficulty and coarse-grained evaluation protocols. In parallel, reward models have become increasingly important for RL-based image editing optimization, yet existing reward model benchmarks still rely on unrealistic evaluation settings that deviate from practical RL scenarios. These limitations hinder reliable assessment of both image editing models and reward models. To address these challenges, we introduce Edit-Compass and EditReward-Compass, a unified evaluation suite for image editing and reward modeling. Edit-Compass contains 2,388 carefully annotated instances spanning six progressively challenging task categories, covering capabilities such as world knowledge reasoning, visual reasoning, and multi-image editing. Beyond broad task coverage, Edit-Compass adopts a fine-grained multidimensional evaluation framework based on structured reasoning and carefully designed scoring rubrics. In parallel, EditReward-Compass contains 2,251 preference pairs that simulate realistic reward modeling scenarios during RL optimization.

Cached at: 05/14/26, 04:17 AM

Source: https://huggingface.co/papers/2605.13062

Abstract

Recent image editing models have achieved remarkable progress in instruction following, multimodal understanding, and complex visual editing. However, existing benchmarks often fail to faithfully reflect human judgment, especially for strong frontier models, due to limited task difficulty and coarse-grained evaluation protocols. In parallel, reward models have become increasingly important for RL-based image editing optimization, yet existing reward model benchmarks still rely on unrealistic evaluation settings that deviate from practical RL scenarios. These limitations hinder reliable assessment of both image editing models and reward models. To address these challenges, we introduce Edit-Compass and EditReward-Compass, a unified evaluation suite for image editing and reward modeling. Edit-Compass contains 2,388 carefully annotated instances spanning six progressively challenging task categories, covering capabilities such as world knowledge reasoning, visual reasoning, and multi-image editing. Beyond broad task coverage, Edit-Compass adopts a fine-grained multidimensional evaluation framework based on structured reasoning and carefully designed scoring rubrics. In parallel, EditReward-Compass contains 2,251 preference pairs that simulate realistic reward modeling scenarios during RL optimization.
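The abstract does not say how reward models are scored on the preference pairs. One common choice for preference benchmarks is pairwise accuracy: the fraction of pairs in which the reward model scores the human-preferred edit strictly higher than the rejected one. The sketch below illustrates that metric only. It is not the paper's protocol, and every name in it (`PreferencePair`, `pairwise_accuracy`, the field names, and the toy scorer) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class PreferencePair:
    """One preference pair (hypothetical schema, not the benchmark's actual format)."""
    instruction: str  # the editing instruction shown to the model
    chosen: str       # identifier of the human-preferred edited image
    rejected: str     # identifier of the less-preferred edited image


def pairwise_accuracy(
    pairs: Iterable[PreferencePair],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of pairs where score(chosen) > score(rejected).

    Ties count as errors, since a tied reward gives no preference signal.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("pairwise_accuracy needs at least one pair")
    correct = sum(
        1
        for p in pairs
        if score(p.instruction, p.chosen) > score(p.instruction, p.rejected)
    )
    return correct / len(pairs)


# Toy stand-in for a reward model: a fixed lookup table of scores (assumed data).
TOY_SCORES = {
    "a_good": 0.9, "a_bad": 0.2,  # ranked correctly
    "b_good": 0.4, "b_bad": 0.7,  # ranked incorrectly
    "c_good": 0.5, "c_bad": 0.5,  # tie, counted as an error
    "d_good": 0.8, "d_bad": 0.1,  # ranked correctly
}


def toy_score(instruction: str, image_id: str) -> float:
    # A real reward model would look at the instruction and the images;
    # this stub ignores the instruction and reads a precomputed score.
    return TOY_SCORES[image_id]


toy_pairs = [
    PreferencePair("make the sky orange", "a_good", "a_bad"),
    PreferencePair("remove the car", "b_good", "b_bad"),
    PreferencePair("add a hat", "c_good", "c_bad"),
    PreferencePair("swap the two cups", "d_good", "d_bad"),
]

accuracy = pairwise_accuracy(toy_pairs, toy_score)
print(f"pairwise accuracy: {accuracy:.2f}")
```

Counting ties as errors is a design choice in this sketch. A benchmark could equally give half credit for ties or exclude them.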

Similar Articles

HP-Edit: A Human-Preference Post-Training Framework for Image Editing

VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects

WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models

DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

Delta-Adapter: Scalable Exemplar-Based Image Editing with Single-Pair Supervision
