Multiplication in Multimodal LLMs: Computation with Text, Image, and Audio Inputs

Hugging Face Daily Papers 04/20/26, 12:00 AM Papers

multimodal-llms arithmetic benchmarking multiplication lora reasoning evaluation

Summary

This paper investigates the arithmetic limits of multimodal LLMs on multi-digit multiplication across text, image, and audio modalities. It introduces a controlled benchmark and an 'arithmetic load' metric (C) that predicts model accuracy nearly as well as more complex step-counting measures. Accuracy collapses as C grows, and the degradation is primarily computational rather than perceptual.

Multimodal LLMs can accurately perceive numerical content across modalities yet fail to perform exact multi-digit multiplication when the identical underlying arithmetic problem is presented as numerals, number words, images, or in audio form. Because existing benchmarks often lack systematically paired instances across modalities, it remains difficult to compare genuine arithmetic limits within and across model families. We therefore introduce a controlled multimodal multiplication benchmark that factorially varies digit length, digit sparsity, representation (e.g., numerals vs. number words), and modality (text, rendered images, audio), with paired instances from a reproducible generator. We also define arithmetic load, C, as the product of the total and non-zero digit count as a compact, mechanistically motivated proxy for operation count. Across evaluations, accuracy falls sharply as C grows, often nearing zero by C > 100. Indeed, C remains predictive of performance across modalities and models, with R-squared often > 0.5, nearing the value from more complex measures of arithmetic load that count the number of intermediate arithmetic steps. A separate perception-versus-computation decomposition shows that multimodal degradation is primarily computational rather than perceptual: on matched-perception checks, models are near-perfect (> 99%) across modalities, even when multiplication accuracy drops. Beyond measuring when models fail, we ask which procedures they are predisposed to follow. We introduce a forced-completion loss probe that scores heuristic-specific reasoning prefixes--including columnar multiplication, distributive decomposition, and rounding/compensation. Here, decomposition is favored in both text and vision modalities; heuristic-specific LoRA adapters produce near-orthogonal updates yet degrade accuracy, indicating the base model maintains a well-tuned internal router.
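The definition of arithmetic load can be made concrete with a minimal sketch. The abstract states only that C is the product of the total and non-zero digit counts; how those counts are taken over the two operands is not specified here. The version below assumes both counts are summed over the two operands before multiplying. The function names `digit_counts` and `arithmetic_load` are hypothetical and do not come from the paper.

```python
def digit_counts(n: int) -> tuple[int, int]:
    """Return (total digits, non-zero digits) of an integer, ignoring sign."""
    digits = str(abs(n))
    total = len(digits)
    nonzero = sum(1 for d in digits if d != "0")
    return total, nonzero


def arithmetic_load(a: int, b: int) -> int:
    """Arithmetic load C for the problem a * b.

    ASSUMPTION: total and non-zero digit counts are pooled (summed) over
    both operands before being multiplied. The abstract does not say how
    the counts are aggregated, so this is one possible reading.
    """
    total_a, nonzero_a = digit_counts(a)
    total_b, nonzero_b = digit_counts(b)
    return (total_a + total_b) * (nonzero_a + nonzero_b)


if __name__ == "__main__":
    # Dense two-digit operands: (2 + 2) * (2 + 2)
    print(arithmetic_load(47, 86))
    # Sparse operands with many zeros: (4 + 3) * (1 + 1)
    print(arithmetic_load(4000, 300))
    # Dense six-digit operands: (6 + 6) * (6 + 5)
    print(arithmetic_load(123456, 789012))
```

Under this reading, digit sparsity lowers C even when operands are long, which matches the benchmark's choice to vary digit length and digit sparsity as separate factors.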

Original Article

Cached at: 04/21/26, 07:20 AM

Source: https://huggingface.co/papers/2604.18203

Abstract

Multimodal large language models show consistent computational limitations in exact multi-digit multiplication across representations and modalities, with performance closely tied to a novel arithmetic load metric whose predictive power approaches that of more complex step-counting measures.

Multimodal LLMs can accurately perceive numerical content across modalities yet fail to perform exact multi-digit multiplication when the identical underlying arithmetic problem is presented as numerals, number words, images, or in audio form. Because existing benchmarks often lack systematically paired instances across modalities, it remains difficult to compare genuine arithmetic limits within and across model families. We therefore introduce a controlled multimodal multiplication benchmark that factorially varies digit length, digit sparsity, representation (e.g., numerals vs. number words), and modality (text, rendered images, audio), with paired instances from a reproducible generator. We also define arithmetic load, C, as the product of the total and non-zero digit count as a compact, mechanistically motivated proxy for operation count. Across evaluations, accuracy falls sharply as C grows, often nearing zero by C > 100. Indeed, C remains predictive of performance across modalities and models, with R-squared often > 0.5, nearing the value from more complex measures of arithmetic load that count the number of intermediate arithmetic steps. A separate perception-versus-computation decomposition shows that multimodal degradation is primarily computational rather than perceptual: on matched-perception checks, models are near-perfect (> 99%) across modalities, even when multiplication accuracy drops. Beyond measuring when models fail, we ask which procedures they are predisposed to follow. We introduce a forced-completion loss probe that scores heuristic-specific reasoning prefixes--including columnar multiplication, distributive decomposition, and rounding/compensation. Here, decomposition is favored in both text and vision modalities; heuristic-specific LoRA adapters produce near-orthogonal updates yet degrade accuracy, indicating the base model maintains a well-tuned internal router.
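The three heuristics named in the abstract can be illustrated as different ways of splitting one product into partial results. The sketch below is only an arithmetic illustration: it does not reproduce the paper's reasoning prefixes or its forced-completion loss probe, and the function names are hypothetical. The rounding rule (round the second operand up at its leading digit's place, then subtract the excess) is one assumed variant of rounding/compensation. Operands are assumed to be non-negative integers.

```python
def place_terms(n: int) -> list[int]:
    """Split n into non-zero place-value terms, e.g. 86 -> [6, 80]."""
    return [int(d) * 10 ** i for i, d in enumerate(reversed(str(n))) if d != "0"]


def columnar_partials(a: int, b: int) -> list[int]:
    """Columnar multiplication: one shifted partial product per digit of b."""
    return [a * int(d) * 10 ** i for i, d in enumerate(reversed(str(b)))]


def distributive_partials(a: int, b: int) -> list[int]:
    """Distributive decomposition: expand both operands by place value."""
    return [x * y for x in place_terms(a) for y in place_terms(b)]


def rounding_compensation(a: int, b: int) -> tuple[int, int]:
    """Round b up at its leading digit's place, then compensate.

    ASSUMPTION: this rounding rule is chosen for illustration only.
    Returns (a * rounded_b, a * (rounded_b - b)); the product is the
    first value minus the second.
    """
    scale = 10 ** (len(str(b)) - 1)
    rounded_b = -(-b // scale) * scale  # ceiling to a multiple of scale
    return a * rounded_b, a * (rounded_b - b)


if __name__ == "__main__":
    a, b = 47, 86
    print(columnar_partials(a, b), sum(columnar_partials(a, b)))
    print(distributive_partials(a, b), sum(distributive_partials(a, b)))
    over, excess = rounding_compensation(a, b)
    print(over, excess, over - excess)
```

All three routes reach the same product, but they differ in how many intermediate values must be produced and carried, which is the kind of difference a procedure-specific probe could be sensitive to.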

Get this paper in your agent:

hf papers read 2604.18203

Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash

Similar Articles

What We are Missing in Multimodal LLM Evaluation?

MODE: Modality-Decomposed Expert-Level Mixed-Precision Quantization for MoE Multimodal LLMs

Disentangling Mathematical Reasoning in LLMs: A Methodological Investigation of Internal Mechanisms

Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs
