Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

Hugging Face Daily Papers

Summary

Researchers introduce Mind’s Eye, a benchmark of eight visuo-cognitive tasks on which top multimodal LLMs score under 50% while humans reach 80%, exposing major gaps in visual abstraction, relation mapping, and mental transformation.

Multimodal large language models (MLLMs) have achieved impressive progress on vision-language benchmarks, yet their capacity for visual cognitive and visuospatial reasoning remains less understood. We introduce "Mind's Eye", a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel "A-R-T" taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction, analogical relation mapping, and mental transformation. We evaluate a diverse suite of closed-source and open-source MLLMs and compare their performance with human participants. Humans achieve 80% accuracy, while top-performing MLLMs remain below 50%. Error analysis reveals failures in (i) visual attention allocation, (ii) internal perceptual manipulation, and (iii) abstraction of underlying visual concepts. Our findings suggest that current MLLMs exhibit limited visuospatial reasoning capabilities when compared with human participants, highlighting the need for more cognitively grounded evaluation frameworks.

Cached at: 04/22/26, 10:35 AM

Paper page - Mind’s Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

Source: https://huggingface.co/papers/2604.16054

Abstract

Multimodal large language models demonstrate significant limitations in visuospatial reasoning tasks compared to human performance, revealing deficiencies in visual attention, perceptual manipulation, and conceptual abstraction.

Multimodal large language models (MLLMs) have achieved impressive progress on vision-language benchmarks, yet their capacity for visual cognitive and visuospatial reasoning remains less understood. We introduce "Mind's Eye", a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel "A-R-T" taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction, analogical relation mapping, and mental transformation. We evaluate a diverse suite of closed-source and open-source MLLMs and compare their performance with human participants. Humans achieve 80% accuracy, while top-performing MLLMs remain below 50%. Error analysis reveals failures in (i) visual attention allocation, (ii) internal perceptual manipulation, and (iii) abstraction of underlying visual concepts. Our findings suggest that current MLLMs exhibit limited visuospatial reasoning capabilities when compared with human participants, highlighting the need for more cognitively grounded evaluation frameworks.
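Since the benchmark is multiple-choice, the headline comparison (80% human accuracy versus below 50% for the best MLLMs) reduces to per-category accuracy over the A-R-T taxonomy. The following is a minimal, hypothetical evaluation sketch illustrating that computation; the item fields and the query_mllm function are assumptions for illustration, not the authors' released code.

# Hypothetical sketch of scoring a multiple-choice visuo-cognitive benchmark.
# Item fields ("image", "question", "options", "answer", "category") and
# query_mllm() are assumptions, not the authors' released interface.
from collections import defaultdict

def evaluate(items, query_mllm):
    """Return overall accuracy and accuracy per A-R-T category."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        # query_mllm is expected to return a single option label, e.g. "A".."D".
        prediction = query_mllm(item["image"], item["question"], item["options"])
        total[item["category"]] += 1
        correct[item["category"]] += prediction == item["answer"]
    per_category = {cat: correct[cat] / total[cat] for cat in total}
    overall = sum(correct.values()) / max(sum(total.values()), 1)
    return overall, per_category

A per-category breakdown of this kind is what enables taxonomy-level error analysis (Abstraction vs. Relation vs. Transformation) rather than a single aggregate score.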


Get this paper in your agent:

hf papers read 2604.16054

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

Multiplication in Multimodal LLMs: Computation with Text, Image, and Audio Inputs

Hugging Face Daily Papers

This paper investigates the arithmetic limitations of multimodal LLMs on multi-digit multiplication across text, image, and audio modalities, introducing a controlled benchmark and a novel 'arithmetic load' metric (C) that better predicts model accuracy than traditional step-counting methods. Results show accuracy collapses as C grows, and that performance degradation is primarily computational rather than perceptual.

Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap

arXiv cs.CL

This paper introduces CrossMath, a controlled multimodal reasoning benchmark that reveals a critical limitation in current vision-language models: they perform reasoning primarily in textual space rather than genuine vision-grounded reasoning, with visual input often degrading performance compared to text-only baselines. The authors propose fine-tuning approaches to mitigate this modality gap and improve multimodal reasoning capabilities.

The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring

arXiv cs.CL

A new cross-domain benchmark (Metacognitive Monitoring Battery) with 524 items evaluates LLM self-monitoring capabilities across six cognitive domains using human psychometric methodology. Applied to 20 frontier LLMs, it reveals three distinct metacognitive profiles and shows that accuracy rank and metacognitive sensitivity rank are largely inverted.

HumanLLM: Benchmarking and Improving LLM Anthropomorphism via Human Cognitive Patterns

arXiv cs.CL

HumanLLM presents a framework for benchmarking and improving LLM anthropomorphism by modeling psychological patterns as interacting causal forces, constructing 244 patterns from academic literature and 11,359 multi-pattern scenarios. The approach demonstrates that authentic human alignment requires cognitive modeling rather than shallow behavioral mimicry, with HumanLLM-8B outperforming larger models like Qwen3-32B on multi-pattern dynamics.