MIT researchers, in collaboration with KAUST and HUMAIN, have released MathNet, the largest open-source dataset of Olympiad-level math problems, containing over 30,000 expert-authored problems from 47 countries.
An empirical study of LLM formal-math reasoning finds a single-prompt ceiling: accuracy plateaus in the 60–79% range regardless of prompt size, a limit the authors attribute to undecidability, model fragility, and distribution mismatch.
Researchers introduce GeoRepEval, a framework for evaluating LLM robustness across equivalent representations of the same geometric problem (Euclidean, coordinate, vector). Testing 11 LLMs on 158 geometry problems, they find accuracy gaps of up to 14 percentage points arising solely from the choice of representation, with vector formulations a consistent failure point.
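To make the idea of equivalent representations concrete, here is a minimal, self-contained sketch (my own toy example, not from the benchmark): the same fact, "the diagonals of a square are perpendicular", checked once in coordinate form and once in vector form. A representation-robust solver should agree across such rephrasings.

```python
# Hypothetical illustration: one geometric fact in two equivalent
# representations, of the kind GeoRepEval pairs up for evaluation.

def perpendicular_coordinate(p1, p2, q1, q2):
    """Coordinate form: cross-multiplied slope condition dy1*dy2 = -dx1*dx2
    (avoids dividing by zero for vertical lines)."""
    dx1, dy1 = p2[0] - p1[0], p2[1] - p1[1]
    dx2, dy2 = q2[0] - q1[0], q2[1] - q1[1]
    return dy1 * dy2 == -dx1 * dx2

def perpendicular_vector(u, v):
    """Vector form: dot product of direction vectors is zero."""
    return u[0] * v[0] + u[1] * v[1] == 0

# Diagonals of the unit square: (0,0)-(1,1) and (1,0)-(0,1).
coord = perpendicular_coordinate((0, 0), (1, 1), (1, 0), (0, 1))
vect = perpendicular_vector((1, 1), (-1, 1))
print(coord, vect)  # True True: the two representations agree
```

The benchmark's finding is that LLMs, unlike this deterministic check, often answer the two phrasings differently.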
SAI-DPO introduces a dynamic sampling framework that adapts training data to a model's evolving capabilities during mathematical reasoning tasks, using self-aware difficulty metrics and knowledge semantic alignment to achieve state-of-the-art efficiency with less data on benchmarks like AIME24 and AMC23.
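The core sampling idea can be sketched in a few lines. This is a simplified illustration under my own assumptions (the function names, the 50% target, and the weighting rule are all hypothetical, not the paper's): weight each problem by how close its estimated solve rate is to the model's current frontier, so training focuses on problems it can almost, but not yet reliably, solve.

```python
import random

def frontier_weights(solve_rates, target=0.5):
    """Higher weight for problems whose solve rate is near the target
    frontier; near-solved and near-impossible problems get little weight."""
    return [max(1e-6, 1.0 - abs(r - target) * 2) for r in solve_rates]

def sample_batch(problems, solve_rates, k, seed=0):
    """Draw a training batch weighted toward the capability frontier."""
    rng = random.Random(seed)
    return rng.choices(problems, weights=frontier_weights(solve_rates), k=k)

problems = ["easy", "medium", "hard", "unsolved"]
solve_rates = [0.95, 0.55, 0.30, 0.02]   # estimated from recent rollouts
batch = sample_batch(problems, solve_rates, k=8)
print(batch)
```

As the model improves and solve rates shift, the weights shift with them, which is the "dynamic" part of the framework.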
This paper evaluates the mathematical reasoning capabilities of large language models in Sinhala and Tamil, two low-resource South Asian languages, using a parallel dataset of independently authored problems. The study demonstrates that while basic arithmetic transfers well across languages, complex reasoning tasks show significant performance degradation in non-English languages, with implications for deploying AI tutoring tools in multilingual educational contexts.
This paper identifies and addresses the problem of 'Miracle Steps' in LLM mathematical reasoning—unjustified jumps to correct answers that indicate reward hacking—by proposing the Rubric Reward Model (RRM), a process-oriented reward function that evaluates entire reasoning trajectories. RRM raises Verified Pass@1024 on AIME2024 from 26.7% to 62.6% and reduces Miracle Steps by 71%.
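A minimal sketch of the process-oriented idea (the rubric criteria and weights below are hypothetical, not the paper's actual rubric): score every step of a trajectory against rubric criteria rather than only checking the final answer, so an unjustified jump that happens to land on the right result still earns a low reward.

```python
# Hypothetical rubric: each criterion contributes a weight when satisfied.
RUBRIC = {
    "states_justification": 0.4,  # the step cites a rule or a prior step
    "valid_inference": 0.4,       # the step follows from what it cites
    "advances_solution": 0.2,     # the step makes measurable progress
}

def score_step(step_checks):
    """step_checks maps rubric criteria to booleans for one step."""
    return sum(w for c, w in RUBRIC.items() if step_checks.get(c, False))

def trajectory_reward(steps):
    """Mean per-step rubric score over the whole trajectory."""
    if not steps:
        return 0.0
    return sum(score_step(s) for s in steps) / len(steps)

justified = [{"states_justification": True, "valid_inference": True,
              "advances_solution": True}] * 3
miracle = justified[:2] + [{"advances_solution": True}]  # jump to the answer

print(trajectory_reward(justified))  # 1.0
print(trajectory_reward(miracle))    # ~0.73: the unjustified step is penalized
```

An outcome-only reward would score both trajectories identically whenever the final answer is correct, which is exactly the reward-hacking loophole the paper targets.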
This paper proposes DeepInsightTheorem, a hierarchical dataset and Progressive Multi-Stage SFT training strategy to improve LLMs' informal theorem proving by teaching them to identify and apply core techniques through insight-aware reasoning.
This paper investigates how large language models perform arithmetic operations by analyzing internal mechanisms through early decoding, revealing that proficient models exhibit a clear division of labor between attention and MLP modules in reasoning tasks.
STRATAGEM is a new framework for improving reasoning transferability in language models by using game self-play with a Reasoning Transferability Coefficient and Reasoning Evolution Reward to reinforce abstract, domain-agnostic reasoning patterns over game-specific heuristics. Experiments show strong improvements on mathematical reasoning, general reasoning, and code generation benchmarks.
MathNet is a large-scale multilingual multimodal benchmark of 30,676 Olympiad-level math problems spanning 47 countries and 17 languages, designed to evaluate mathematical reasoning and retrieval in generative and embedding-based models. Even state-of-the-art models like Gemini and GPT-5 struggle with the benchmark, highlighting significant room for improvement in mathematical AI.
DiPO introduces a novel reinforcement learning approach for LLMs that uses perplexity-based sample partitioning to disentangle exploration and exploitation subspaces, combined with a bidirectional reward allocation mechanism for more stable policy optimization. The method demonstrates superior performance on mathematical reasoning and function calling tasks.
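The partitioning step can be sketched as follows (a simplified illustration under my own assumptions; the fixed threshold is hypothetical, not DiPO's actual rule): compute each sampled response's perplexity from its per-token log-probabilities, then split the batch into a low-perplexity "exploitation" pool and a high-perplexity "exploration" pool.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def partition(samples, threshold):
    """samples: list of (sample_id, per-token logprobs) pairs.
    Returns (exploitation pool, exploration pool) of sample ids."""
    exploit, explore = [], []
    for sid, lps in samples:
        (exploit if perplexity(lps) <= threshold else explore).append(sid)
    return exploit, explore

samples = [
    ("a", [-0.1, -0.2, -0.1]),   # confident decoding: low perplexity
    ("b", [-2.0, -1.5, -2.5]),   # uncertain decoding: high perplexity
    ("c", [-0.3, -0.4, -0.2]),
]
exploit, explore = partition(samples, threshold=2.0)
print(exploit, explore)  # ['a', 'c'] ['b']
```

Treating the two pools differently during the policy update is what the paper's bidirectional reward allocation then builds on.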
OpenAI introduces FrontierScience, a new benchmark for measuring expert-level AI scientific capabilities across physics, chemistry, and biology, with GPT-5.2 achieving 77% on olympiad-style tasks and 25% on research-style tasks. The paper presents early evidence that GPT-5 meaningfully accelerates real scientific workflows, shortening work from weeks to hours while establishing metrics for tracking progress toward AI-accelerated science.
Google DeepMind's advanced Gemini with Deep Think achieved gold-medal standard at the International Mathematical Olympiad 2025, solving 5 out of 6 problems for 35 points—a significant advance over last year's silver-medal performance, operating end-to-end in natural language within competition time limits.
OpenAI demonstrates that process supervision—rewarding intermediate reasoning steps rather than just final answers—improves mathematical reasoning while reducing alignment costs. This approach produces more interpretable, human-aligned reasoning without sacrificing model performance.
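The contrast between the two supervision signals can be sketched in a few lines (a hypothetical step checker and 50/50 weighting of my own choosing, not OpenAI's actual reward model): outcome supervision looks only at the final answer, while process supervision also credits each verified intermediate step, so a lucky wrong derivation cannot score highly.

```python
def outcome_reward(final_answer, gold):
    """Outcome supervision: reward depends only on the final answer."""
    return 1.0 if final_answer == gold else 0.0

def process_reward(steps, final_answer, gold):
    """Process supervision sketch. steps: list of (text, is_valid) pairs;
    verified steps and the final answer are weighted equally here."""
    if not steps:
        return 0.0
    step_score = sum(ok for _, ok in steps) / len(steps)
    return 0.5 * step_score + 0.5 * outcome_reward(final_answer, gold)

lucky = [("guess a pattern", False), ("assert the result", False)]
careful = [("factor the quadratic", True), ("solve each root", True)]

print(outcome_reward("x=2", "x=2"))           # 1.0 regardless of the steps
print(process_reward(lucky, "x=2", "x=2"))    # 0.5: correct answer, bad steps
print(process_reward(careful, "x=2", "x=2"))  # 1.0: correct answer, valid steps
```

Under outcome supervision both trajectories look equally good; the process reward separates them, which is the interpretability and alignment benefit the paper reports.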
OpenAI presents GPT-f, a transformer-based automated theorem prover for the Metamath formalization language, which discovered new short proofs accepted into the main Metamath library — marking the first time a deep-learning system contributed proofs adopted by a formal mathematics community.