This paper proposes a causal framework for probing internal visual representations in Multimodal Large Language Models (MLLMs), revealing differences in how entities and abstract concepts are encoded. The study finds that greater model depth is crucial for encoding abstract concepts, and it uncovers a disconnect between perception and reasoning in current MLLMs.
This paper investigates the arithmetic limitations of multimodal LLMs on multi-digit multiplication across text, image, and audio modalities, introducing a controlled benchmark and a novel 'arithmetic load' metric (C) that predicts model accuracy better than traditional step-counting methods. Results show that accuracy collapses as C grows and that the degradation is primarily computational rather than perceptual.
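To make the 'arithmetic load' idea concrete, here is a minimal sketch of how one might count digit-level operations in schoolbook multiplication. The summary does not reproduce the paper's actual definition of C, so the `arithmetic_load` formula below is a hypothetical proxy, not the authors' metric.

```python
# Hypothetical proxy for "arithmetic load": count the digit-level
# operations that schoolbook multiplication of an n-digit by an
# m-digit number requires. (Assumption: the paper's C is not defined
# in this summary; this is an illustrative stand-in only.)

def arithmetic_load(n_digits: int, m_digits: int) -> int:
    """Digit products plus the digit additions needed to merge
    the partial-product rows."""
    digit_products = n_digits * m_digits            # one product per digit pair
    merge_adds = (m_digits - 1) * (n_digits + 1)    # summing m partial rows
    return digit_products + merge_adds

if __name__ == "__main__":
    # Load grows roughly quadratically in operand length, which would be
    # consistent with the reported accuracy collapse on larger problems.
    for d in (2, 3, 4, 5):
        print(f"{d}x{d}-digit multiplication -> load ~ {arithmetic_load(d, d)}")
```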
Researchers introduce Mind’s Eye, a benchmark of eight visual-cognitive tasks on which top multimodal LLMs score under 50% while humans reach 80%, exposing major gaps in visual abstraction, relation mapping, and mental transformation.
Research shows that Chain-of-Thought prompting harms visual-spatial reasoning in multimodal LLMs, driven by shortcut learning and the hallucination of visual details from text alone.