MathFormer: Testing whether symbolic math is pattern matching or reasoning [D]
Summary
MathFormer is a small seq2seq model that reaches ~98.6% accuracy on symbolic math tasks. The result suggests that what looks like mathematical reasoning in LLMs may be large-scale structured pattern completion rather than reasoning in any stronger sense.
Similar Articles
Disentangling Mathematical Reasoning in LLMs: A Methodological Investigation of Internal Mechanisms
This paper investigates how large language models perform arithmetic operations by analyzing internal mechanisms through early decoding, revealing that proficient models exhibit a clear division of labor between attention and MLP modules in reasoning tasks.
Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions
This paper evaluates three approaches (pure chain-of-thought reasoning, single-shot code execution, and iterative code execution) on 1,000 GSM-Symbolic problems using Claude Haiku 4.5, finding that chain-of-thought is the most robust to perturbation, while code execution does not improve reasoning robustness on grade-school math problems.
TabularMath: Understanding Math Reasoning over Tables with Large Language Models
TabularMath introduces a benchmark and AutoT2T framework for evaluating LLMs' mathematical reasoning over tabular data, revealing that table complexity, data quality, and modality significantly impact model performance. The study addresses a gap in LLM evaluation by systematically assessing robustness to incomplete or inconsistent table information in real-world scenarios.
What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code
This paper challenges the belief that code improves reasoning in language models, finding through controlled pretraining experiments that code alone primarily enhances programming ability, while reasoning gains come from structured reasoning traces like code-text and math-text mixtures.
MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
MathNet is a large-scale multilingual, multimodal benchmark of 30,676 Olympiad-level math problems spanning 47 countries and 17 languages, designed to evaluate mathematical reasoning and retrieval in generative and embedding-based models. Even state-of-the-art models such as Gemini and GPT-5 struggle with the benchmark, which leaves significant room for improvement in mathematical AI.