Is This Edit Correct? A Multi-Dimensional Benchmark for Reasoning-Aware Image Editing

Hugging Face Daily Papers

Summary

This paper introduces RE-Edit, a benchmark for evaluating image editing systems across five reasoning dimensions (physical, environmental, cultural, causal, referential) to assess logical consistency beyond visual plausibility. The benchmark includes 1,000 samples and evaluates ten open-source and two commercial models, showing that even advanced systems struggle with implicit multi-dimensional reasoning.

Diffusion-based image editing has achieved strong visual fidelity under natural language instructions, yet most existing systems still operate at the level of surface instruction following, without reasoning about the implicit contextual constraints embedded in real user requests. This often leads to visually plausible but logically inconsistent edits. In this work, we introduce RE-Edit, a benchmark for REasoning-aware image Editing that evaluates image editing systems across five complementary reasoning dimensions: physical, environmental, cultural, causal, and referential. RE-Edit comprises 1,000 carefully curated samples, each designed such that visual plausibility alone is insufficient and correct editing requires satisfying implicit logical constraints. To support fine-grained analysis, we establish dimension-aligned evaluation criteria and conduct a comprehensive study of ten open-source and two commercial image editing models. Our results show that even advanced systems frequently struggle with implicit multi-dimensional reasoning despite producing high-quality visuals. We further present a lightweight reasoning-guided post-edit baseline as an initial exploration, illustrating how inserting explicit reasoning can help mitigate such failures in a model-agnostic manner.

Cached at: 06/05/26, 06:07 AM


Source: https://huggingface.co/papers/2606.05172

Published on Apr 16

Submitted by ding (https://huggingface.co/Yixuan-Ding-ZJU) on Jun 5

Abstract

RE-Edit benchmark evaluates image editing systems on five reasoning dimensions to assess logical consistency beyond visual plausibility.

Diffusion-based image editing has achieved strong visual fidelity under natural language instructions, yet most existing systems still operate at the level of surface instruction following, without reasoning about the implicit contextual constraints embedded in real user requests. This often leads to visually plausible but logically inconsistent edits. In this work, we introduce RE-Edit, a benchmark for REasoning-aware image Editing that evaluates image editing systems across five complementary reasoning dimensions: physical, environmental, cultural, causal, and referential. RE-Edit comprises 1,000 carefully curated samples, each designed such that visual plausibility alone is insufficient and correct editing requires satisfying implicit logical constraints. To support fine-grained analysis, we establish dimension-aligned evaluation criteria and conduct a comprehensive study of ten open-source and two commercial image editing models. Our results show that even advanced systems frequently struggle with implicit multi-dimensional reasoning despite producing high-quality visuals. We further present a lightweight reasoning-guided post-edit baseline as an initial exploration, illustrating how inserting explicit reasoning can help mitigate such failures in a model-agnostic manner.
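As a rough illustration of what dimension-aligned reporting could look like, the sketch below averages a per-sample score within each of the five reasoning dimensions. The record layout (`(dimension, score)` pairs), the 0-to-1 score range, and the function name `per_dimension_scores` are assumptions made for this example; the page does not describe the paper's actual criteria or scoring scale.

```python
from collections import defaultdict
from statistics import mean

# The five reasoning dimensions named in the abstract.
DIMENSIONS = ("physical", "environmental", "cultural", "causal", "referential")


def per_dimension_scores(results):
    """Average per-sample scores within each reasoning dimension.

    `results` is an iterable of (dimension, score) pairs. Both the pair
    layout and the 0-to-1 score range are hypothetical, chosen only for
    this sketch.
    """
    buckets = defaultdict(list)
    for dimension, score in results:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension!r}")
        buckets[dimension].append(score)
    # Report None (not 0) for a dimension with no samples, so missing
    # data is not mistaken for a failing score.
    return {d: (mean(buckets[d]) if buckets[d] else None) for d in DIMENSIONS}


if __name__ == "__main__":
    demo = [("physical", 1.0), ("physical", 0.0), ("causal", 1.0)]
    print(per_dimension_scores(demo))
```

Keeping the dimensions separate, rather than collapsing them into one overall number, makes it easier to see a model that produces high-quality visuals yet fails on one specific kind of implicit constraint.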


Get this paper in your agent:

hf papers read 2606.05172

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper: 1

#### Yixuan-Ding-ZJU/EditRefine 8B • Updated about 3 hours ago

Datasets citing this paper: 1

#### Yixuan-Ding-ZJU/RE-Edit Viewer • Updated about 3 hours ago • 1k • 33


Similar Articles

ETCHR: Editing To Clarify and Harness Reasoning

Hugging Face Daily Papers

ETCHR is a novel image editing approach that decouples visual reasoning from image generation, using a two-stage training process (Reasoning Imitation and Reasoning Enhancement) to improve multimodal language model performance across five visual reasoning tasks. It achieves consistent gains of 4-5% Pass@1 on models like Qwen3-VL-8B, Gemini-3.1-Flash-Lite, and Kimi K2.5.

PaintBench: Deterministic Evaluation of Precise Visual Editing

Hugging Face Daily Papers

PaintBench is a new benchmark for evaluating precise visual editing in multimodal models, covering 20 operations across 4 categories with deterministic pixel-level evaluation. Testing 11 models reveals overall low performance, with the best model scoring only 17.1% mIoU.