TabularMath: Understanding Math Reasoning over Tables with Large Language Models
Summary
TabularMath introduces a benchmark and AutoT2T framework for evaluating LLMs' mathematical reasoning over tabular data, revealing that table complexity, data quality, and modality significantly impact model performance. The study addresses a gap in LLM evaluation by systematically assessing robustness to incomplete or inconsistent table information in real-world scenarios.
# TabularMath: Understanding Math Reasoning over Tables with Large Language Models

Source: https://arxiv.org/html/2505.19563

Shi-Yu Tian1,2∗, Zhi Zhou1, Wei Dong2∗, Kun-Yang Yu1,2, Ming Yang1,2, Zi-Jian Cheng1,3, Lan-Zhe Guo1,3†, Yu-Feng Li1,2
(∗ Equal contribution; † Corresponding author)
1National Key Laboratory for Novel Software Technology, Nanjing University
2School of Artificial Intelligence, Nanjing University
3School of Intelligence Science and Technology, Nanjing University
{tiansy,zhouz,guolz,liyf}@lamda.nju.edu.cn

###### Abstract

Mathematical reasoning has long been a key benchmark for evaluating large language models (LLMs). Although substantial progress has been made on math word problems, the need for reasoning over tabular data in real-world applications has been overlooked. For instance, applications such as business intelligence demand not only multi-step numerical reasoning with tables but also robustness to incomplete or inconsistent information. However, comprehensive evaluation in this area is severely limited, constrained by the reliance on manually collected tables that are difficult to scale and the lack of coverage for potential traps encountered in real-world scenarios. To address this problem, we propose AutoT2T, a neuro-symbolic framework that controllably transforms math word problems into scalable and verified tabular reasoning tasks, enabling the evaluation of both accuracy and robustness. Building on this pipeline, we develop TabularMath, a benchmark comprising three progressively complex subsets and an imperfect subset, together with their corresponding image versions. Our study reveals three key observations: (1) table complexity and reasoning difficulty jointly impact reasoning performance; (2) low-quality tables pose severe risks to reliable reasoning in current LLMs; (3) different table modalities show similar trends, with text-based tables typically being easier for models to reason over, even for MLLMs. In-depth analyses are conducted for each observation to guide future research.

## 1 Introduction

Mathematical reasoning has long been a critical benchmark for evaluating the capabilities of large language models (LLMs). The field has advanced remarkably in recent years (OpenAI, 2023; Guo et al., 2025a), with many single-scenario benchmarks now considered largely solved (Hosseini et al., 2014; Patel et al., 2021; Cobbe et al., 2021). This progress has prompted a shift in research focus toward real-world applications, particularly reasoning over semi-structured data like tables (Lu et al., 2023). Unlike plain text, tables present information in a highly structured and organized format, making them indispensable in domains such as business intelligence (Zhang et al., 2024) and financial forecasting (Zhu et al., 2021). Nevertheless, real-world table reasoning scenarios present significant challenges for LLMs.
For example, in the financial sector, the need to process large-scale tables continues to grow with the increasing volume and complexity of data, alongside stricter requirements for reliability and security (Bradley et al., 2024; Zavitsanos et al., 2024). In quarterly financial reports, models are expected not only to perform cross-column computations on numerous metrics like revenue, profit, and liabilities but also to verify numerical consistency (e.g., ensuring total assets equal the sum of liabilities and equity). Failure to properly interpret the data or detect inconsistencies can lead to severe consequences in downstream applications like investment decisions and risk assessment (Cerchiello and Giudici, 2016).
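For illustration only, the following minimal Python sketch shows the kind of consistency check described above, verifying that total assets equal liabilities plus equity for each row of a quarterly-report table. The column names and values are hypothetical and are not drawn from any dataset used in this work.

```python
# Hypothetical quarterly-report rows; the second row is deliberately inconsistent.
rows = [
    {"quarter": "Q1", "total_assets": 120.0, "liabilities": 70.0, "equity": 50.0},
    {"quarter": "Q2", "total_assets": 135.0, "liabilities": 80.0, "equity": 50.0},
]

def check_balance(row, tol=1e-6):
    """Return True if total assets match liabilities + equity within a tolerance."""
    return abs(row["total_assets"] - (row["liabilities"] + row["equity"])) <= tol

for row in rows:
    if not check_balance(row):
        print(f"Inconsistent row: {row['quarter']}")  # prints "Inconsistent row: Q2"
```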
Figure 1: Model performance comparison between average questions and the top 10% most complex questions.
Although prior works (Zhu et al., 2021; Chen et al., 2021; Lu et al., 2023) address some aspects of tabular mathematical reasoning, these efforts have been limited in table scale and have focused primarily on the accuracy of perfectly crafted problems. Specifically, existing tabular benchmarks largely rely on manual annotation and collection, making it difficult to scale the datasets effectively. As a result, these benchmarks fail to explore the limits of LLMs' reasoning capabilities on more complex tables, where models perform worse (as shown in Fig. 1). Additionally, current benchmarks have not adequately assessed the robustness of tabular mathematical reasoning, overlooking the risk that LLMs produce hallucinated answers when faced with incomplete or inconsistent data. There is therefore an urgent need for a comprehensive and systematic evaluation framework that assesses model capabilities across multiple dimensions and thoroughly probes the boundaries of existing models.
To address the above limitations, we propose an Automatic Text-to-Table generation framework, AutoT2T. It is a neuro-symbolic pipeline that converts math word problems into scalable and verified tabular reasoning tasks without human annotation, enabling the evaluation of both accuracy and robustness. To facilitate standardized evaluation and fair comparison, we construct a comprehensive tabular math reasoning benchmark, TabularMath, based on AutoT2T. It includes three subsets of increasing difficulty (*Easy, Medium, Hard*) as well as an *Imperfect* subset aimed at evaluating the robustness of models when faced with incomplete or inconsistent tabular data, covering both the table-complexity and robustness dimensions. Based on this benchmark, we conduct systematic experiments and analyses on 18 open-source and proprietary models. The results are organized around the following three research questions and lead to several key observations.
1. **How does table complexity affect mathematical reasoning?** Table complexity and reasoning difficulty jointly impact reasoning performance. Nearly all models suffer significant performance drops when transitioning from pure text to tabular modalities, with degradation increasing as table complexity grows. The coupling between retrieval and reasoning forms a core bottleneck: pure retrieval is substantially easier than joint retrieval and reasoning, with a performance gap exceeding 20% on average.
2. **How does table quality affect mathematical reasoning?** Low-quality tables pose severe risks to reliable reasoning in current LLMs. When tables contain missing or contradictory information, most models fail to identify these flaws and produce misleading answers, with error rates exceeding 50% in some cases. Moreover, when models are informed that inputs may contain imperfect information, their performance degrades on well-defined problems, demonstrating a trade-off between problem-solving and discriminative ability.
3. **How does table representation affect mathematical reasoning?** Different table modalities show similar trends, but text-based tables are typically easier for models to reason over: across models and difficulty levels, even multimodal models achieve comparable or higher accuracy on text-based tables than on image-based ones. Among textual formats, key-value structured representations such as JSON and serialized text perform better (illustrated in the sketch below).
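As a purely illustrative sketch of the representation dimension, the snippet below renders one hypothetical two-row table in several textual forms of the kind discussed above (a Markdown table, JSON records, and serialized key-value text). The rows, field names, and the inclusion of Markdown as a baseline format are assumptions for this example and are not drawn from TabularMath.

```python
# The same hypothetical two-row table in three textual representations.
markdown_table = """\
| name  | hours_worked | hourly_rate |
|-------|--------------|-------------|
| Carol | 35           | 18          |
| Dave  | 40           | 15          |
"""

json_records = [
    {"name": "Carol", "hours_worked": 35, "hourly_rate": 18},
    {"name": "Dave", "hours_worked": 40, "hourly_rate": 15},
]

serialized_text = (
    "name: Carol | hours_worked: 35 | hourly_rate: 18\n"
    "name: Dave | hours_worked: 40 | hourly_rate: 15"
)
```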
Overall, we conduct a systematic and in-depth analysis of tabular mathematical reasoning from the perspectives of table complexity, table quality, and table representation, complemented by additional discussions. This work represents an exploratory step toward multimodal reasoning over structured data, laying the groundwork for addressing these challenges in future research.
## 2 Related Work
Figure 2: An overview of the AutoT2T pipeline.
### Math Reasoning and Benchmark Evaluation
Mathematical reasoning serves as a key benchmark for evaluating the capabilities of large language models (LLMs) due to its verifiable nature. Early progress was made on elementary-level math problems using datasets such as GSM8K (Cobbe et al., 2021), MultiArith (Koncel-Kedziorski et al., 2016), and SVAMP (Patel et al., 2021), where methods like in-context learning (Wei et al., 2022; Gao et al., 2023), supervised fine-tuning (Li et al., 2024b), and reinforcement learning (Guo et al., 2025a) demonstrated strong performance. Since then, researchers have questioned how accurately current assessments reflect large models' mathematical reasoning, exploring approaches such as neuro-symbolic methods (Mirzadeh et al., 2024). These neuro-symbolic methods are also widely used in multimodal benchmark generation (Zhou et al.; Shang et al., 2026; Yang et al., 2026; Ma et al., 2026; Huang et al., 2026). A growing area of interest is the robustness of mathematical reasoning (Zhou et al., 2024; Shi et al., 2023), specifically whether models can refrain from generating hallucinations when faced with incomplete or logically deceptive prompts (Tian et al., 2025b; Zhao et al., 2024). Such verification of problem descriptions is receiving increasing attention in current research on LLMs (Huang et al., 2024; Guo et al., 2025b; Yang et al., 2025), with the aim of achieving robust and reliable AI paradigms (Tian et al., 2024, 2025a; Dai et al., 2026).
### Table Question Answering
Table Question Answering (Table QA) has significant practical applications across various domains, including financial statement analysis (Chen et al., 2021) and medical diagnosis (Hasny et al., 2025). The field has advanced considerably with the development of high-quality datasets, beginning with the pioneering work of Pasupat and Liang (2015), who constructed the WikiTableQuestions (WTQ) dataset from Wikipedia tables. Subsequent research shifted to more complex QA tasks requiring reasoning capabilities, exemplified by datasets such as ToTTo (Parikh et al., 2020) (focused on answer generation) and OTTQA (Chen et al., 2020) (emphasizing cross-table reasoning). More recently, FinQA (Chen et al., 2021) and AiTQA (Katis et al., 2021) have explored numerical reasoning over tables, while TableBench (Wu et al., 2025) and Text2Analysis (He et al., 2024) introduced multimodal approaches incorporating visual elements. There has also been work in tabular machine learning on table-based problem solving in open environments (Yu et al., 2026; Zhou et al., 2025a). However, most existing datasets rely on manual annotation and lack an automated pipeline for scalable data generation, which is common in other application areas (Zhou et al., 2025b; Yang et al., 2026).
Figure 3: Illustrative cases in TabularMath and corresponding model responses.
## 3 Automated Text to Table
The AutoT2T pipeline converts math word problems into tabular problems through the following three stages (Fig. 2).
### 3.1 Semantic Decoupling
First, our objective is to semantically decouple the text of the math word problems and extract key elements that can be structurally represented. We decompose math problems using formal language modeling (such as SMT-LIB (Barrett et al., 2010; Li et al., 2024a)), structuring problems as:
**Goal** g := solve f(v)
**Constraints** c := e₁(v) ⋈ e₂(v), ⋈ ∈ {≥, ≤, >, <, =, ≠}
**Expressions** e := h | e₁ ⊕ e₂, ⊕ ∈ {+, −, ×, ÷}
**Domains** D := ℕ | ℕ⁺ | ℝ
where v is a variable, c is a constraint, e is an expression, h is a constant, and f is the objective function. For a problem p, we define the modeling state as S = (V, C), where V and C denote the variable and constraint sets, respectively. The LLM constructs S by extracting candidate components from the problem description, while a formal solver Φ (e.g., Z3 (de Moura and Bjørner, 2008), CVC5 (Barbosa et al., 2022)) verifies satisfiability and consistency, providing feedback for refinement and identifying ill-defined formulations.
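As a minimal sketch of this verification step, assuming the z3-solver Python bindings, the snippet below checks a modeling state S = (V, C) for satisfiability. The variables and constraints come from an invented word problem ("apples cost 2 each and Alice spends 10; how many apples did she buy?"), not from an actual AutoT2T extraction.

```python
from z3 import Int, Solver, sat

apples = Int("apples")                  # V: variable set extracted from the problem text
constraints = [
    apples >= 0,                        # domain constraint (natural number)
    2 * apples == 10,                   # C: constraint set from the problem description
]

solver = Solver()
solver.add(*constraints)
if solver.check() == sat:               # satisfiability check flags ill-defined formulations
    print("Consistent modeling state; example solution:", solver.model()[apples])
else:
    print("Unsatisfiable or inconsistent formulation; feed back for refinement")
```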
### 3.2 Table Transformation
After obtaining the formal modeling state, the next step is to transform the semantically decoupled components into a structured tabular representation. Specifically, we convert the formal state into a table by introducing a name field as the primary key and mapping variables V and active constraints C_a to table columns. Given a problem p, the LLM produces a blurred textual description p̂ together with a two-row seed table t_seed.
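The following minimal Python sketch illustrates the table-transformation idea described above: per-entity variable values are mapped onto table columns, with a name field serving as the primary key of a two-row seed table. The entities, field names, and helper function are hypothetical and do not reflect the actual AutoT2T implementation.

```python
def to_seed_table(entities):
    """Build a seed table (header + rows) from per-entity variable values."""
    columns = ["name"] + sorted({k for e in entities for k in e if k != "name"})
    rows = [[e.get(col, "") for col in columns] for e in entities]
    return columns, rows

# Two entities derived from an illustrative problem description.
entities = [
    {"name": "Alice", "apples": 12, "price_per_apple": 0.50},
    {"name": "Bob", "apples": 8, "price_per_apple": 0.75},
]
header, rows = to_seed_table(entities)
print(header)   # ['name', 'apples', 'price_per_apple']
print(rows)     # [['Alice', 12, 0.5], ['Bob', 8, 0.75]]
```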