Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

Hugging Face Daily Papers

Summary

Researchers introduce Mind’s Eye, a benchmark of eight visuo-cognitive tasks on which top multimodal LLMs score under 50% while humans reach 80%, exposing major gaps in visual abstraction, relation mapping, and mental transformation.

Multimodal large language models (MLLMs) have achieved impressive progress on vision-language benchmarks, yet their capacity for visual cognitive and visuospatial reasoning remains less understood. We introduce "Mind's Eye", a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel "A-R-T" taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction, analogical relation mapping, and mental transformation. We evaluate a diverse suite of closed-source and open-source MLLMs and compare their performance with human participants. Humans achieve 80% accuracy, while top-performing MLLMs remain below 50%. Error analysis reveals failures in (i) visual attention allocation, (ii) internal perceptual manipulation, and (iii) abstraction of underlying visual concepts. Our findings suggest that current MLLMs exhibit limited visuospatial reasoning capabilities when compared with human participants, highlighting the need for more cognitively grounded evaluation frameworks.

Cached at: 04/22/26, 10:35 AM

Paper page - Mind’s Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

Source: https://huggingface.co/papers/2604.16054

Abstract

Multimodal large language models demonstrate significant limitations in visuospatial reasoning tasks compared to human performance, revealing deficiencies in visual attention, perceptual manipulation, and conceptual abstraction.

Multimodal large language models (MLLMs) have achieved impressive progress on vision-language benchmarks, yet their capacity for visual cognitive and visuospatial reasoning remains less understood. We introduce "Mind's Eye", a multiple-choice benchmark of eight visuo-cognitive tasks inspired by classic human intelligence tests and organized under a novel "A-R-T" taxonomy: Abstraction, Relation, and Transformation. The tasks probe core processes of fluid intelligence such as pattern induction, analogical relation mapping, and mental transformation. We evaluate a diverse suite of closed-source and open-source MLLMs and compare their performance with human participants. Humans achieve 80% accuracy, while top-performing MLLMs remain below 50%. Error analysis reveals failures in (i) visual attention allocation, (ii) internal perceptual manipulation, and (iii) abstraction of underlying visual concepts. Our findings suggest that current MLLMs exhibit limited visuospatial reasoning capabilities when compared with human participants, highlighting the need for more cognitively grounded evaluation frameworks.
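Since the benchmark is multiple-choice, the headline comparison (80% human accuracy versus below 50% for the best MLLMs) reduces to per-category accuracy over the A-R-T taxonomy. The following is a minimal, hypothetical evaluation sketch illustrating that computation; the item fields and the query_mllm function are assumptions for illustration, not the authors' released code.

# Hypothetical sketch of scoring a multiple-choice visuo-cognitive benchmark.
# Item fields ("image", "question", "options", "answer", "category") and
# query_mllm() are assumptions, not the authors' released interface.
from collections import defaultdict

def evaluate(items, query_mllm):
    """Return overall accuracy and accuracy per A-R-T category."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        # query_mllm is expected to return a single option label, e.g. "A".."D".
        prediction = query_mllm(item["image"], item["question"], item["options"])
        total[item["category"]] += 1
        correct[item["category"]] += prediction == item["answer"]
    per_category = {cat: correct[cat] / total[cat] for cat in total}
    overall = sum(correct.values()) / max(sum(total.values()), 1)
    return overall, per_category

A per-category breakdown of this kind is what enables taxonomy-level error analysis (Abstraction vs. Relation vs. Transformation) rather than a single aggregate score.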


Get this paper in your agent:

hf papers read 2604.16054

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash


Similar Articles

Multiplication in Multimodal LLMs: Computation with Text, Image, and Audio Inputs

Hugging Face Daily Papers

This paper investigates the arithmetic limitations of multimodal LLMs on multi-digit multiplication across text, image, and audio modalities, introducing a controlled benchmark and a novel 'arithmetic load' metric (C) that better predicts model accuracy than traditional step-counting methods. Results show accuracy collapses as C grows, and that performance degradation is primarily computational rather than perceptual.

Do Vision-Language Models Truly Perform Vision Reasoning? A Rigorous Study of the Modality Gap

arXiv cs.CL

This paper introduces CrossMath, a controlled multimodal reasoning benchmark that reveals a critical limitation in current vision-language models: they perform reasoning primarily in textual space rather than genuine vision-grounded reasoning, with visual input often degrading performance compared to text-only baselines. The authors propose fine-tuning approaches to mitigate this modality gap and improve multimodal reasoning capabilities.

The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring

arXiv cs.CL

A new cross-domain benchmark (Metacognitive Monitoring Battery) with 524 items evaluates LLM self-monitoring capabilities across six cognitive domains using human psychometric methodology. Applied to 20 frontier LLMs, it reveals three distinct metacognitive profiles and shows that accuracy rank and metacognitive sensitivity rank are largely inverted.

HumanLLM: Benchmarking and Improving LLM Anthropomorphism via Human Cognitive Patterns

arXiv cs.CL

HumanLLM presents a framework for benchmarking and improving LLM anthropomorphism by modeling psychological patterns as interacting causal forces, constructing 244 patterns from academic literature and 11,359 multi-pattern scenarios. The approach demonstrates that authentic human alignment requires cognitive modeling rather than shallow behavioral mimicry, with HumanLLM-8B outperforming larger models like Qwen3-32B on multi-pattern dynamics.