TabularMath: Understanding Math Reasoning over Tables with Large Language Models
Summary
TabularMath introduces a benchmark and AutoT2T framework for evaluating LLMs' mathematical reasoning over tabular data, revealing that table complexity, data quality, and modality significantly impact model performance. The study addresses a gap in LLM evaluation by systematically assessing robustness to incomplete or inconsistent table information in real-world scenarios.
# TabularMath: Understanding Math Reasoning over Tables with Large Language Models

Source: https://arxiv.org/html/2505.19563

Shi-Yu Tian1,2∗, Zhi Zhou1, Wei Dong2∗, Kun-Yang Yu1,2, Ming Yang1,2, Zi-Jian Cheng1,3, Lan-Zhe Guo1,3†, Yu-Feng Li1,2
(∗ Equal contribution; † Corresponding author)
1National Key Laboratory for Novel Software Technology, Nanjing University
2School of Artificial Intelligence, Nanjing University
3School of Intelligence Science and Technology, Nanjing University
{tiansy,zhouz,guolz,liyf}@lamda.nju.edu.cn

###### Abstract

Mathematical reasoning has long been a key benchmark for evaluating large language models (LLMs). Although substantial progress has been made on math word problems, the need for reasoning over tabular data in real-world applications has been overlooked. For instance, applications such as business intelligence demand not only multi-step numerical reasoning with tables but also robustness to incomplete or inconsistent information. However, comprehensive evaluation in this area is severely limited, constrained by the reliance on manually collected tables that are difficult to scale and the lack of coverage for potential traps encountered in real-world scenarios. To address this problem, we propose AutoT2T, a neuro-symbolic framework that controllably transforms math word problems into scalable and verified tabular reasoning tasks, enabling the evaluation of both accuracy and robustness. Building on this pipeline, we develop TabularMath, a benchmark comprising three progressively complex subsets and an imperfect subset, together with their corresponding image versions. Our study reveals three key observations: (1) table complexity and reasoning difficulty jointly impact reasoning performance; (2) low-quality tables pose severe risks to reliable reasoning in current LLMs; (3) different table modalities show similar trends, with text-based tables typically being easier for models to reason over, even for MLLMs. In-depth analyses are conducted for each observation to guide future research.

## 1 Introduction

Mathematical reasoning has long been a critical benchmark for evaluating the capabilities of large language models (LLMs). The field has advanced remarkably in recent years (OpenAI, 2023; Guo et al., 2025a), with many single-scenario benchmarks now considered largely solved (Hosseini et al., 2014; Patel et al., 2021; Cobbe et al., 2021). This progress has prompted a shift in research focus toward real-world applications, particularly reasoning over semi-structured data like tables (Lu et al., 2023). Unlike plain text, tables present information in a highly structured and organized format, making them indispensable in domains such as business intelligence (Zhang et al., 2024) and financial forecasting (Zhu et al., 2021). Nevertheless, real-world table reasoning scenarios present significant challenges for LLMs.
For example, in the financial sector, the need to process large-scale tables continues to grow with the increasing volume and complexity of data, alongside stricter requirements for reliability and security (Bradley et al., 2024; Zavitsanos et al., 2024). In quarterly financial reports, models are expected not only to perform cross-column computations on numerous metrics like revenue, profit, and liabilities but also to verify numerical consistency (e.g., ensuring total assets equal the sum of liabilities and equity). Failure to properly interpret the data or detect inconsistencies can lead to severe consequences in downstream applications like investment decisions and risk assessment (Cerchiello and Giudici, 2016).
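For illustration only, the following minimal Python sketch shows the kind of consistency check described above, verifying that total assets equal liabilities plus equity for each row of a quarterly-report table. The column names and values are hypothetical and are not drawn from any dataset used in this work.

```python
# Hypothetical quarterly-report rows; the second row is deliberately inconsistent.
rows = [
    {"quarter": "Q1", "total_assets": 120.0, "liabilities": 70.0, "equity": 50.0},
    {"quarter": "Q2", "total_assets": 135.0, "liabilities": 80.0, "equity": 50.0},
]

def check_balance(row, tol=1e-6):
    """Return True if total assets match liabilities + equity within a tolerance."""
    return abs(row["total_assets"] - (row["liabilities"] + row["equity"])) <= tol

for row in rows:
    if not check_balance(row):
        print(f"Inconsistent row: {row['quarter']}")  # prints "Inconsistent row: Q2"
```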
Figure 1: Model performance comparison between average questions and the top 10% most complex questions.
Although prior works (Zhu et al., 2021; Chen et al., 2021; Lu et al., 2023) address some aspects of tabular mathematical reasoning, these efforts have been limited in table scale and have focused primarily on the accuracy of perfectly crafted problems. Specifically, existing tabular benchmarks largely rely on manual annotation and collection, making it difficult to scale the datasets effectively. As a result, these benchmarks fail to explore the limits of LLMs' reasoning capabilities on more complex tables, where models perform worse (as shown in Fig. 1). Additionally, current benchmarks have not adequately assessed the robustness of tabular mathematical reasoning, overlooking the risk that LLMs produce hallucinated answers when faced with incomplete or inconsistent data. There is therefore an urgent need for a comprehensive and systematic evaluation framework that assesses model capabilities across multiple dimensions and thoroughly probes the boundaries of existing models.
To address the above limitations, we propose an Automatic Text-to-Table generation framework, AutoT2T. It is a neuro-symbolic pipeline that converts math word problems into scalable and verified tabular reasoning tasks without human annotation, enabling the evaluation of both accuracy and robustness. To facilitate standardized evaluation and fair comparison, we construct a comprehensive tabular math reasoning benchmark, TabularMath, based on AutoT2T. It includes three subsets of increasing difficulty (*Easy, Medium, Hard*) as well as an *Imperfect* subset aimed at evaluating the robustness of models when faced with incomplete or inconsistent tabular data, covering both the table-complexity and robustness dimensions. Based on this benchmark, we conduct systematic experiments and analyses on 18 open-source and proprietary models. The results are organized around the following three research questions and lead to several key observations.
1. **How does table complexity affect mathematical reasoning?** Table complexity and reasoning difficulty jointly impact reasoning performance. Nearly all models suffer significant performance drops when transitioning from pure text to tabular modalities, with degradation increasing as table complexity grows. The coupling between retrieval and reasoning forms a core bottleneck: pure retrieval is substantially easier than joint retrieval and reasoning, with a performance gap exceeding 20% on average.
2. **How does table quality affect mathematical reasoning?** Low-quality tables pose severe risks to reliable reasoning in current LLMs. When tables contain missing or contradictory information, most models fail to identify these flaws and produce misleading answers, with error rates exceeding 50% in some cases. Moreover, when models are informed that inputs may contain imperfect information, their performance degrades on well-defined problems, demonstrating a trade-off between problem-solving and discriminative ability.
3. **How does table representation affect mathematical reasoning?** Different table modalities show similar trends, but text-based tables are typically easier for models to reason over: across models and difficulty levels, even multimodal models achieve comparable or higher accuracy on text-based tables than on image-based ones. Among textual formats, key-value structured representations such as JSON and serialized text perform better (illustrated in the sketch below).
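As a purely illustrative sketch of the representation dimension, the snippet below renders one hypothetical two-row table in several textual forms of the kind discussed above (a Markdown table, JSON records, and serialized key-value text). The rows, field names, and the inclusion of Markdown as a baseline format are assumptions for this example and are not drawn from TabularMath.

```python
# The same hypothetical two-row table in three textual representations.
markdown_table = """\
| name  | hours_worked | hourly_rate |
|-------|--------------|-------------|
| Carol | 35           | 18          |
| Dave  | 40           | 15          |
"""

json_records = [
    {"name": "Carol", "hours_worked": 35, "hourly_rate": 18},
    {"name": "Dave", "hours_worked": 40, "hourly_rate": 15},
]

serialized_text = (
    "name: Carol | hours_worked: 35 | hourly_rate: 18\n"
    "name: Dave | hours_worked: 40 | hourly_rate: 15"
)
```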
Overall, we conduct a systematic and in-depth analysis of tabular mathematical reasoning from the perspectives of table complexity, table quality, and table representation, complemented by additional discussions. This work represents an exploratory step toward multimodal reasoning over structured data, laying the groundwork for addressing these challenges in future research.
## 2 Related Work
Figure 2: An overview of the AutoT2T pipeline.
### Math Reasoning and Benchmark Evaluation
Mathematical reasoning serves as a key benchmark for evaluating the capabilities of large language models (LLMs) due to its verifiable nature. Early progress was made on elementary-level math problems using datasets such as GSM8K (Cobbe et al., 2021), MultiArith (Koncel-Kedziorski et al., 2016), and SVAMP (Patel et al., 2021), where methods like in-context learning (Wei et al., 2022; Gao et al., 2023), supervised fine-tuning (Li et al., 2024b), and reinforcement learning (Guo et al., 2025a) demonstrated strong performance. Since then, researchers have questioned how accurately current assessments reflect large models' mathematical reasoning, exploring approaches such as neuro-symbolic methods (Mirzadeh et al., 2024). These neuro-symbolic methods are also widely used in multimodal benchmark generation (Zhou et al.; Shang et al., 2026; Yang et al., 2026; Ma et al., 2026; Huang et al., 2026). A growing area of interest is the robustness of mathematical reasoning (Zhou et al., 2024; Shi et al., 2023), specifically whether models can refrain from generating hallucinations when faced with incomplete or logically deceptive prompts (Tian et al., 2025b; Zhao et al., 2024). Such verification of problem descriptions is receiving increasing attention in current research on LLMs (Huang et al., 2024; Guo et al., 2025b; Yang et al., 2025), with the aim of achieving robust and reliable AI paradigms (Tian et al., 2024, 2025a; Dai et al., 2026).
### Table Question Answering
Table Question Answering (Table QA) has significant practical applications across various domains, including financial statement analysis (Chen et al., 2021) and medical diagnosis (Hasny et al., 2025). The field has advanced considerably with the development of high-quality datasets, beginning with the pioneering work of Pasupat and Liang (2015), who constructed the WikiTableQuestions (WTQ) dataset from Wikipedia tables. Subsequent research shifted to more complex QA tasks requiring reasoning capabilities, exemplified by datasets such as ToTTo (Parikh et al., 2020) (focused on answer generation) and OTTQA (Chen et al., 2020) (emphasizing cross-table reasoning). More recently, FinQA (Chen et al., 2021) and AiTQA (Katis et al., 2021) have explored numerical reasoning over tables, while TableBench (Wu et al., 2025) and Text2Analysis (He et al., 2024) introduced multimodal approaches incorporating visual elements. There has also been work in tabular machine learning on table-based problem solving in open environments (Yu et al., 2026; Zhou et al., 2025a). However, most existing datasets rely on manual annotation and lack an automated pipeline for scalable data generation, which is common in other application areas (Zhou et al., 2025b; Yang et al., 2026).
Figure 3: Illustrative cases in TabularMath and corresponding model responses.
## 3 Automated Text to Table
The AutoT2T pipeline converts math word problems into tabular problems through the following three stages (Fig. 2).
### 3.1 Semantic Decoupling
First, our objective is to semantically decouple the text of the math word problems and extract key elements that can be structurally represented. We decompose math problems using formal language modeling (such as SMT-LIB (Barrett et al., 2010; Li et al., 2024a)), structuring problems as:
**Goal** g := solve f(v)
**Constraints** c := e₁(v) ⋈ e₂(v), ⋈ ∈ {≥, ≤, >, <, =, ≠}
**Expressions** e := h | e₁ ⊕ e₂, ⊕ ∈ {+, −, ×, ÷}
**Domains** D := ℕ | ℕ⁺ | ℝ
where v is a variable, c is a constraint, e is an expression, h is a constant, and f is the objective function. For a problem p, we define the modeling state as S = (V, C), where V and C denote the variable and constraint sets, respectively. The LLM constructs S by extracting candidate components from the problem description, while a formal solver Φ (e.g., Z3 (de Moura and Bjørner, 2008), CVC5 (Barbosa et al., 2022)) verifies satisfiability and consistency, providing feedback for refinement and identifying ill-defined formulations.
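As a minimal sketch of this verification step, assuming the z3-solver Python bindings, the snippet below checks a modeling state S = (V, C) for satisfiability. The variables and constraints come from an invented word problem ("apples cost 2 each and Alice spends 10; how many apples did she buy?"), not from an actual AutoT2T extraction.

```python
from z3 import Int, Solver, sat

apples = Int("apples")                  # V: variable set extracted from the problem text
constraints = [
    apples >= 0,                        # domain constraint (natural number)
    2 * apples == 10,                   # C: constraint set from the problem description
]

solver = Solver()
solver.add(*constraints)
if solver.check() == sat:               # satisfiability check flags ill-defined formulations
    print("Consistent modeling state; example solution:", solver.model()[apples])
else:
    print("Unsatisfiable or inconsistent formulation; feed back for refinement")
```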
### 3.2 Table Transformation
After obtaining the formal modeling state, the next step is to transform the semantically decoupled components into a structured tabular representation. Specifically, we convert the formal state into a table by introducing a name field as the primary key and mapping variables V and active constraints C_a to table columns. Given a problem p, the LLM produces a blurred textual description p̂ together with a two-row seed table t_seed.
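The following minimal Python sketch illustrates the table-transformation idea described above: per-entity variable values are mapped onto table columns, with a name field serving as the primary key of a two-row seed table. The entities, field names, and helper function are hypothetical and do not reflect the actual AutoT2T implementation.

```python
def to_seed_table(entities):
    """Build a seed table (header + rows) from per-entity variable values."""
    columns = ["name"] + sorted({k for e in entities for k in e if k != "name"})
    rows = [[e.get(col, "") for col in columns] for e in entities]
    return columns, rows

# Two entities derived from an illustrative problem description.
entities = [
    {"name": "Alice", "apples": 12, "price_per_apple": 0.50},
    {"name": "Bob", "apples": 8, "price_per_apple": 0.75},
]
header, rows = to_seed_table(entities)
print(header)   # ['name', 'apples', 'price_per_apple']
print(rows)     # [['Alice', 12, 0.5], ['Bob', 8, 0.75]]
```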