Is This Edit Correct? A Multi-Dimensional Benchmark for Reasoning-Aware Image Editing
Summary
This paper introduces RE-Edit, a benchmark for evaluating image editing systems across five reasoning dimensions (physical, environmental, cultural, causal, referential) to assess logical consistency beyond visual plausibility. The benchmark includes 1,000 samples and evaluates ten open-source and two commercial models, showing that even advanced systems struggle with implicit multi-dimensional reasoning.
Cached at: 06/05/26, 06:07 AM
Source: https://huggingface.co/papers/2606.05172 Published on Apr 16
Submitted by ding (https://huggingface.co/Yixuan-Ding-ZJU) on Jun 5
Abstract
RE-Edit benchmark evaluates image editing systems on five reasoning dimensions to assess logical consistency beyond visual plausibility.
Diffusion-based image editing has achieved strong visual fidelity under natural language instructions, yet most existing systems still operate at the level of surface instruction following, without reasoning about the implicit contextual constraints embedded in real user requests. This often leads to visually plausible but logically inconsistent edits. In this work, we introduce RE-Edit, a benchmark for REasoning-aware image Editing that evaluates image editing systems across five complementary reasoning dimensions: physical, environmental, cultural, causal, and referential. RE-Edit comprises 1,000 carefully curated samples, each designed such that visual plausibility alone is insufficient and correct editing requires satisfying implicit logical constraints. To support fine-grained analysis, we establish dimension-aligned evaluation criteria and conduct a comprehensive study of ten open-source and two commercial image editing models. Our results show that even advanced systems frequently struggle with implicit multi-dimensional reasoning despite producing high-quality visuals. We further present a lightweight reasoning-guided post-edit baseline as an initial exploration, illustrating how inserting explicit reasoning can help mitigate such failures in a model-agnostic manner.
Models citing this paper: 1
#### Yixuan-Ding-ZJU/EditRefine 8B • Updated about 3 hours ago
Datasets citing this paper: 1
#### Yixuan-Ding-ZJU/RE-Edit • Updated about 3 hours ago • 1k • 33
Similar Articles
ETCHR: Editing To Clarify and Harness Reasoning
ETCHR is a novel image editing approach that decouples visual reasoning from image generation, using a two-stage training process (Reasoning Imitation and Reasoning Enhancement) to improve multimodal language model performance across five visual reasoning tasks. It achieves consistent gains of 4-5% Pass@1 on models like Qwen3-VL-8B, Gemini-3.1-Flash-Lite, and Kimi K2.5.
Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling
Introduces Edit-Compass and EditReward-Compass, a unified benchmark suite for evaluating image editing models and reward models, with 2,388 annotated instances and 2,251 preference pairs for realistic RL scenarios.
Editor's Choice: Evaluating Abstract Intent in Image Editing through Atomic Entity Analysis
This paper introduces AbstractEdit, a benchmark for abstract image editing, and Entity-Rubrics, a framework for entity-level assessment, revealing challenges in balancing intent and preservation for abstract instructions and highlighting the need for LLM integration.
PaintBench: Deterministic Evaluation of Precise Visual Editing
PaintBench is a new benchmark for evaluating precise visual editing in multimodal models, covering 20 operations across 4 categories with deterministic pixel-level evaluation. Testing 11 models reveals overall low performance, with the best model scoring only 17.1% mIoU.
Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs
This paper introduces UniKE, the first benchmark for cross-modal knowledge editing in unified multimodal models (UMMs), revealing a significant modality gap where text edits achieve 92% efficacy but only 18.5% transfer to image generation. It proposes Reasoning-augmented Parameter Editing to improve cross-modal transfer, with gains up to 18.6 percentage points.