Multiplication in Multimodal LLMs: Computation with Text, Image, and Audio Inputs

Hugging Face Daily Papers

Summary

This paper investigates the arithmetic limitations of multimodal LLMs on multi-digit multiplication across text, image, and audio modalities. It introduces a controlled benchmark and an 'arithmetic load' metric (C) whose predictive power for model accuracy approaches that of more complex step-counting measures. Accuracy collapses as C grows, and the degradation is primarily computational rather than perceptual.

Multimodal LLMs can accurately perceive numerical content across modalities yet fail to perform exact multi-digit multiplication when the identical underlying arithmetic problem is presented as numerals, number words, images, or in audio form. Because existing benchmarks often lack systematically paired instances across modalities, it remains difficult to compare genuine arithmetic limits within and across model families. We therefore introduce a controlled multimodal multiplication benchmark that factorially varies digit length, digit sparsity, representation (e.g., numerals vs. number words), and modality (text, rendered images, audio), with paired instances from a reproducible generator. We also define arithmetic load, C, as the product of the total and non-zero digit count as a compact, mechanistically motivated proxy for operation count. Across evaluations, accuracy falls sharply as C grows, often nearing zero by C > 100. Indeed, C remains predictive of performance across modalities and models, with R-squared often > 0.5, nearing the value from more complex measures of arithmetic load that count the number of intermediate arithmetic steps. A separate perception-versus-computation decomposition shows that multimodal degradation is primarily computational rather than perceptual: on matched-perception checks, models are near-perfect (> 99%) across modalities, even when multiplication accuracy drops. Beyond measuring when models fail, we ask which procedures they are predisposed to follow. We introduce a forced-completion loss probe that scores heuristic-specific reasoning prefixes--including columnar multiplication, distributive decomposition, and rounding/compensation. Here, decomposition is favored in both text and vision modalities; heuristic-specific LoRA adapters produce near-orthogonal updates yet degrade accuracy, indicating the base model maintains a well-tuned internal router.
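
The arithmetic load defined above can be illustrated with a small sketch. The abstract does not say how digits are pooled across the two operands, so the version below assumes C = (total digits across both operands) × (non-zero digits across both operands). The function names `digit_counts` and `arithmetic_load` are hypothetical and not taken from the paper.

```python
def digit_counts(n: int) -> tuple[int, int]:
    """Return (total digit count, non-zero digit count) for an integer."""
    digits = str(abs(n))
    total = len(digits)
    nonzero = sum(1 for d in digits if d != "0")
    return total, nonzero


def arithmetic_load(a: int, b: int) -> int:
    """Arithmetic load C for the product a * b.

    ASSUMPTION: digit counts are summed over both operands before
    multiplying; the paper may pool them differently.
    """
    total_a, nonzero_a = digit_counts(a)
    total_b, nonzero_b = digit_counts(b)
    return (total_a + total_b) * (nonzero_a + nonzero_b)


if __name__ == "__main__":
    # Dense 4-digit operands: 8 total digits, 8 non-zero digits.
    print(arithmetic_load(1234, 5678))      # 64
    # Sparse operands of the same length: 8 total digits, 2 non-zero digits.
    print(arithmetic_load(1000, 2000))      # 16
    # Dense 6-digit operands: 12 total digits, 11 non-zero digits.
    print(arithmetic_load(123456, 789012))  # 132
```

Under this assumed reading, sparse operands such as 1000 × 2000 carry a much lower load than dense operands of the same length, and the 6-digit example already exceeds the C > 100 region where the abstract reports accuracy often nearing zero.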

Cached at: 04/21/26, 07:20 AM

Source: https://huggingface.co/papers/2604.18203

Abstract

Multimodal large language models show consistent computational limitations in exact multi-digit multiplication across representations and modalities. Performance is closely tied to a novel arithmetic load metric whose predictive power approaches that of more complex step-counting measures.
