Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

Hugging Face Daily Papers Papers

Summary

Introduces Edit-Compass and EditReward-Compass, a unified benchmark suite for evaluating image editing models and reward models, with 2,388 annotated instances and 2,251 preference pairs for realistic RL scenarios.

Recent image editing models have achieved remarkable progress in instruction following, multimodal understanding, and complex visual editing. However, existing benchmarks often fail to faithfully reflect human judgment, especially for strong frontier models, due to limited task difficulty and coarse-grained evaluation protocols. In parallel, reward models have become increasingly important for RL-based image editing optimization, yet existing reward model benchmarks still rely on unrealistic evaluation settings that deviate from practical RL scenarios. These limitations hinder reliable assessment of both image editing models and reward models. To address these challenges, we introduce Edit-Compass and EditReward-Compass, a unified evaluation suite for image editing and reward modeling. Edit-Compass contains 2,388 carefully annotated instances spanning six progressively challenging task categories, covering capabilities such as world knowledge reasoning, visual reasoning, and multi-image editing. Beyond broad task coverage, Edit-Compass adopts a fine-grained multidimensional evaluation framework based on structured reasoning and carefully designed scoring rubrics. In parallel, EditReward-Compass contains 2,251 preference pairs that simulate realistic reward modeling scenarios during RL optimization.
Original Article Export to Word Export to PDF
View Cached Full Text

Cached at: 05/14/26, 04:17 AM

Paper page - Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling

Source: https://huggingface.co/papers/2605.13062

Abstract

Recent image editing models have achieved remarkable progress in instruction following, multimodal understanding, and complex visual editing. However, existing benchmarks often fail to faithfully reflect human judgment, especially for strong frontier models, due to limited task difficulty and coarse-grained evaluation protocols. In parallel, reward models have become increasingly important for RL-based image editing optimization, yet existing reward model benchmarks still rely on unrealistic evaluation settings that deviate from practical RL scenarios. These limitations hinder reliable assessment of both image editing models and reward models. To address these challenges, we introduce Edit-Compass and EditReward-Compass, a unified evaluation suite for image editing and reward modeling. Edit-Compass contains 2,388 carefully annotated instances spanning six progressively challenging task categories, covering capabilities such as world knowledge reasoning, visual reasoning, and multi-image editing. Beyond broad task coverage, Edit-Compass adopts a fine-grained multidimensional evaluation framework based on structured reasoning and carefully designed scoring rubrics. In parallel, EditReward-Compass contains 2,251 preference pairs that simulate realistic reward modeling scenarios during RL optimization.

Recentimage editing modelshave achieved remarkable progress in instruction following, multimodal understanding, and complex visual editing. However, existing benchmarks often fail to faithfully reflect human judgment, especially for strong frontier models, due to limited task difficulty and coarse-grainedevaluation protocols. In parallel,reward modelshave become increasingly important for RL-based image editing optimization, yet existing reward model benchmarks still rely on unrealistic evaluation settings that deviate from practical RL scenarios. These limitations hinder reliable assessment of bothimage editing modelsandreward models. To address these challenges, we introduce Edit-Compass and EditReward-Compass, a unified evaluation suite for image editing and reward modeling. Edit-Compass contains 2,388 carefully annotated instances spanning six progressively challenging task categories, covering capabilities such as world knowledge reasoning, visual reasoning, and multi-image editing. Beyond broad task coverage, Edit-Compass adopts a fine-grained multidimensional evaluation framework based onstructured reasoningand carefully designedscoring rubrics. In parallel, EditReward-Compass contains 2,251preference pairsthat simulate realisticreward modeling scenariosduring RL optimization.

View arXiv pageView PDFGitHub4Add to collection

Get this paper in your agent:

hf papers read 2605\.13062

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2605.13062 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2605.13062 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2605.13062 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Similar Articles

VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects

Hugging Face Daily Papers

VEFX-Bench introduces a large-scale human-annotated video editing dataset (5,049 examples) with multi-dimensional quality labels and a specialized reward model for standardized evaluation of video editing systems. The paper addresses the lack of comprehensive benchmarks in AI-assisted video creation by providing VEFX-Dataset, VEFX-Reward, and a 300-video-prompt benchmark that reveals gaps in current editing models.

WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models

Hugging Face Daily Papers

WebCompass is a multimodal benchmark for evaluating LLMs on web coding tasks across three input modalities (text, image, video) and three task types (generation, editing, repair). It introduces an Agent-as-a-Judge paradigm that autonomously executes generated websites in a real browser to assess visual fidelity and interactivity.