MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
Summary
MathNet is a large-scale, multilingual, multimodal benchmark of 30,676 Olympiad-level math problems spanning 47 countries and 17 languages, designed to evaluate mathematical reasoning and retrieval in generative and embedding-based models. Even state-of-the-art models such as Gemini and GPT-5 struggle with the benchmark, highlighting significant room for improvement in mathematical AI.
Source: https://huggingface.co/papers/2604.18584
Abstract
MathNet is a large-scale, multilingual, multimodal dataset of Olympiad-level math problems designed for evaluating mathematical reasoning and retrieval in generative models and embedding-based systems.
Mathematical problem solving remains a challenging test of reasoning for large language and multimodal models, yet existing benchmarks are limited in size, language coverage, and task diversity. We introduce MathNet, a high-quality, large-scale, multimodal, and multilingual dataset of Olympiad-level math problems together with a benchmark for evaluating mathematical reasoning in generative models and mathematical retrieval in embedding-based systems. MathNet spans 47 countries, 17 languages, and two decades of competitions, comprising 30,676 expert-authored problems with solutions across diverse domains. In addition to the core dataset, we construct a retrieval benchmark consisting of mathematically equivalent and structurally similar problem pairs curated by human experts. MathNet supports three tasks: (i) Problem Solving, (ii) Math-Aware Retrieval, and (iii) Retrieval-Augmented Problem Solving. Experimental results show that even state-of-the-art reasoning models (78.4% for Gemini-3.1-Pro and 69.3% for GPT-5) remain challenged, while embedding models struggle to retrieve equivalent problems. We further show that retrieval-augmented generation performance is highly sensitive to retrieval quality; for example, DeepSeek-V3.2-Speciale achieves gains of up to 12%, obtaining the highest scores on the benchmark. MathNet provides the largest high-quality Olympiad dataset together with the first benchmark for evaluating mathematical problem retrieval, and we publicly release both the dataset and benchmark at https://mathnet.mit.edu.
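To make the Math-Aware Retrieval task concrete, below is a minimal sketch of how an embedding model could be scored on the equivalent-pair benchmark with Recall@k. The pair format, the `all-MiniLM-L6-v2` embedder, and the scoring protocol are illustrative assumptions, not the paper's actual evaluation setup:

```python
# Minimal Recall@k sketch for math-aware retrieval.
# Assumes a list of (query_problem, equivalent_problem) string pairs;
# the embedding model and protocol are placeholders, not the paper's.
import numpy as np
from sentence_transformers import SentenceTransformer

def recall_at_k(pairs, k=10):
    model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedder
    queries = model.encode([q for q, _ in pairs], normalize_embeddings=True)
    corpus = model.encode([p for _, p in pairs], normalize_embeddings=True)
    sims = queries @ corpus.T  # cosine similarity, since embeddings are unit-norm
    hits = 0
    for i, row in enumerate(sims):
        top_k = np.argsort(-row)[:k]  # indices of the k nearest corpus problems
        hits += int(i in top_k)       # pair i's equivalent sits at corpus index i
    return hits / len(pairs)
```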
Get this paper in your agent:
hf papers read 2604.18584
Don’t have the latest CLI? curl -LsSf https://hf.co/cli/install.sh | bash
Datasets citing this paper: 1
ShadenA/MathNet
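The linked dataset can presumably be loaded with the Hugging Face `datasets` library; the split name below is an assumption, so consult the dataset card for the actual configuration and schema:

```python
# Sketch: loading the linked MathNet dataset from the Hugging Face Hub.
# Split and column names are assumptions; check the dataset card for the schema.
from datasets import load_dataset

ds = load_dataset("ShadenA/MathNet", split="train")
print(ds)     # features and row count
print(ds[0])  # first problem record
```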
Similar Articles
MIT & the IMO released MathNet, the world’s largest dataset of International Math Olympiad problems & solutions. MathNet is 5x larger than previous datasets & is sourced from over 40 countries across 4 decades
MIT and the IMO release MathNet, a massive dataset of International Math Olympiad problems and solutions spanning 40 years and 40+ countries, 5x larger than prior datasets.
MIT scientists build the world’s largest collection of Olympiad-level math problems, and open it to everyone
MIT researchers, in collaboration with KAUST and HUMAIN, have released MathNet, the largest open-source dataset of Olympiad-level math problems, containing over 30,000 expert-authored problems from 47 countries.
Large Language Models for Math Education in Low-Resource Languages: A Study in Sinhala and Tamil
This paper evaluates the mathematical reasoning capabilities of large language models in Sinhala and Tamil, two low-resource South Asian languages, using a parallel dataset of independently authored problems. The study demonstrates that while basic arithmetic transfers well across languages, complex reasoning tasks show significant performance degradation in non-English languages, with implications for deploying AI tutoring tools in multilingual educational contexts.
TabularMath: Understanding Math Reasoning over Tables with Large Language Models
TabularMath introduces a benchmark and AutoT2T framework for evaluating LLMs' mathematical reasoning over tabular data, revealing that table complexity, data quality, and modality significantly impact model performance. The study addresses a gap in LLM evaluation by systematically assessing robustness to incomplete or inconsistent table information in real-world scenarios.
NSMQ Riddles: A Benchmark of Scientific and Mathematical Riddles for Quizzing Large Language Models
This paper introduces NSMQ Riddles, a novel benchmark using scientific and mathematical riddles from Ghana's National Science and Maths Quiz to evaluate Large Language Models, addressing the underrepresentation of Global South datasets in AI research.