This paper proposes a causal framework for probing internal visual representations in Multimodal Large Language Models (MLLMs), revealing differences in how entities and abstract concepts are encoded. The study finds that greater model depth is crucial for encoding abstract concepts, and it uncovers a disconnect between perception and reasoning in current MLLMs.
This paper investigates the arithmetic limitations of multimodal LLMs on multi-digit multiplication across text, image, and audio modalities, introducing a controlled benchmark and a novel 'arithmetic load' metric (C) that predicts model accuracy better than traditional step-counting methods. Results show that accuracy collapses as C grows and that the degradation is primarily computational rather than perceptual.
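To make the 'arithmetic load' idea concrete, here is a minimal sketch of how one might count digit-level operations in schoolbook multiplication. The summary does not reproduce the paper's actual definition of C, so the `arithmetic_load` formula below is a hypothetical proxy, not the authors' metric.

```python
# Hypothetical proxy for "arithmetic load": count the digit-level
# operations that schoolbook multiplication of an n-digit by an
# m-digit number requires. (Assumption: the paper's C is not defined
# in this summary; this is an illustrative stand-in only.)

def arithmetic_load(n_digits: int, m_digits: int) -> int:
    """Digit products plus the digit additions needed to merge
    the partial-product rows."""
    digit_products = n_digits * m_digits            # one product per digit pair
    merge_adds = (m_digits - 1) * (n_digits + 1)    # summing m partial rows
    return digit_products + merge_adds

if __name__ == "__main__":
    # Load grows roughly quadratically in operand length, which would be
    # consistent with the reported accuracy collapse on larger problems.
    for d in (2, 3, 4, 5):
        print(f"{d}x{d}-digit multiplication -> load ~ {arithmetic_load(d, d)}")
```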
Researchers introduce Mind’s Eye, a benchmark of eight visual-cognitive tasks on which top multimodal LLMs score under 50% while humans reach 80%, exposing major gaps in visual abstraction, relation mapping, and mental transformation.
Research shows that Chain-of-Thought prompting harms visual-spatial reasoning in multimodal LLMs, driven by shortcut learning and the hallucination of visual details from text alone.