MIT researchers, in collaboration with KAUST and HUMAIN, have released MathNet, the largest open-source dataset of Olympiad-level math problems, containing over 30,000 expert-authored problems from 47 countries.
An empirical study of LLM formal-math reasoning finds a single-prompt ceiling: accuracy plateaus in the 60–79% range regardless of prompt size, a limit the authors attribute to undecidability, model fragility, and distribution mismatch.
Researchers introduce GeoRepEval, a framework for evaluating LLM robustness across equivalent representations of the same geometric problem (Euclidean, coordinate, vector). Testing 11 LLMs on 158 geometry problems, they find accuracy gaps of up to 14 percentage points arising solely from the choice of representation, with vector formulations a consistent failure point.
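To make the idea of equivalent representations concrete, here is a minimal, self-contained sketch (my own toy example, not from the benchmark): the same fact, "the diagonals of a square are perpendicular", checked once in coordinate form and once in vector form. A representation-robust solver should agree across such rephrasings.

```python
# Hypothetical illustration: one geometric fact in two equivalent
# representations, of the kind GeoRepEval pairs up for evaluation.

def perpendicular_coordinate(p1, p2, q1, q2):
    """Coordinate form: cross-multiplied slope condition dy1*dy2 = -dx1*dx2
    (avoids dividing by zero for vertical lines)."""
    dx1, dy1 = p2[0] - p1[0], p2[1] - p1[1]
    dx2, dy2 = q2[0] - q1[0], q2[1] - q1[1]
    return dy1 * dy2 == -dx1 * dx2

def perpendicular_vector(u, v):
    """Vector form: dot product of direction vectors is zero."""
    return u[0] * v[0] + u[1] * v[1] == 0

# Diagonals of the unit square: (0,0)-(1,1) and (1,0)-(0,1).
coord = perpendicular_coordinate((0, 0), (1, 1), (1, 0), (0, 1))
vect = perpendicular_vector((1, 1), (-1, 1))
print(coord, vect)  # True True: the two representations agree
```

The benchmark's finding is that LLMs, unlike this deterministic check, often answer the two phrasings differently.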
SAI-DPO introduces a dynamic sampling framework that adapts training data to a model's evolving capabilities during mathematical reasoning tasks, using self-aware difficulty metrics and knowledge semantic alignment to achieve state-of-the-art efficiency with less data on benchmarks like AIME24 and AMC23.
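The core sampling idea can be sketched in a few lines. This is a simplified illustration under my own assumptions (the function names, the 50% target, and the weighting rule are all hypothetical, not the paper's): weight each problem by how close its estimated solve rate is to the model's current frontier, so training focuses on problems it can almost, but not yet reliably, solve.

```python
import random

def frontier_weights(solve_rates, target=0.5):
    """Higher weight for problems whose solve rate is near the target
    frontier; near-solved and near-impossible problems get little weight."""
    return [max(1e-6, 1.0 - abs(r - target) * 2) for r in solve_rates]

def sample_batch(problems, solve_rates, k, seed=0):
    """Draw a training batch weighted toward the capability frontier."""
    rng = random.Random(seed)
    return rng.choices(problems, weights=frontier_weights(solve_rates), k=k)

problems = ["easy", "medium", "hard", "unsolved"]
solve_rates = [0.95, 0.55, 0.30, 0.02]   # estimated from recent rollouts
batch = sample_batch(problems, solve_rates, k=8)
print(batch)
```

As the model improves and solve rates shift, the weights shift with them, which is the "dynamic" part of the framework.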
This paper evaluates the mathematical reasoning capabilities of large language models in Sinhala and Tamil, two low-resource South Asian languages, using a parallel dataset of independently authored problems. The study demonstrates that while basic arithmetic transfers well across languages, complex reasoning tasks show significant performance degradation in non-English languages, with implications for deploying AI tutoring tools in multilingual educational contexts.
This paper identifies and addresses the problem of 'Miracle Steps' in LLM mathematical reasoning—unjustified jumps to correct answers that indicate reward hacking—by proposing the Rubric Reward Model (RRM), a process-oriented reward function that evaluates entire reasoning trajectories. RRM raises Verified Pass@1024 on AIME2024 from 26.7% to 62.6% and reduces Miracle Steps by 71%.
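A minimal sketch of the process-oriented idea (the rubric criteria and weights below are hypothetical, not the paper's actual rubric): score every step of a trajectory against rubric criteria rather than only checking the final answer, so an unjustified jump that happens to land on the right result still earns a low reward.

```python
# Hypothetical rubric: each criterion contributes a weight when satisfied.
RUBRIC = {
    "states_justification": 0.4,  # the step cites a rule or a prior step
    "valid_inference": 0.4,       # the step follows from what it cites
    "advances_solution": 0.2,     # the step makes measurable progress
}

def score_step(step_checks):
    """step_checks maps rubric criteria to booleans for one step."""
    return sum(w for c, w in RUBRIC.items() if step_checks.get(c, False))

def trajectory_reward(steps):
    """Mean per-step rubric score over the whole trajectory."""
    if not steps:
        return 0.0
    return sum(score_step(s) for s in steps) / len(steps)

justified = [{"states_justification": True, "valid_inference": True,
              "advances_solution": True}] * 3
miracle = justified[:2] + [{"advances_solution": True}]  # jump to the answer

print(trajectory_reward(justified))  # 1.0
print(trajectory_reward(miracle))    # ~0.73: the unjustified step is penalized
```

An outcome-only reward would score both trajectories identically whenever the final answer is correct, which is exactly the reward-hacking loophole the paper targets.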
This paper proposes DeepInsightTheorem, a hierarchical dataset and Progressive Multi-Stage SFT training strategy to improve LLMs' informal theorem proving by teaching them to identify and apply core techniques through insight-aware reasoning.
This paper investigates how large language models perform arithmetic operations by analyzing internal mechanisms through early decoding, revealing that proficient models exhibit a clear division of labor between attention and MLP modules in reasoning tasks.
STRATAGEM is a new framework for improving reasoning transferability in language models by using game self-play with a Reasoning Transferability Coefficient and Reasoning Evolution Reward to reinforce abstract, domain-agnostic reasoning patterns over game-specific heuristics. Experiments show strong improvements on mathematical reasoning, general reasoning, and code generation benchmarks.
MathNet is a large-scale multilingual multimodal benchmark of 30,676 Olympiad-level math problems spanning 47 countries and 17 languages, designed to evaluate mathematical reasoning and retrieval in generative and embedding-based models. Even state-of-the-art models like Gemini and GPT-5 struggle with the benchmark, highlighting significant room for improvement in mathematical AI.
DiPO introduces a novel reinforcement learning approach for LLMs that uses perplexity-based sample partitioning to disentangle exploration and exploitation subspaces, combined with a bidirectional reward allocation mechanism for more stable policy optimization. The method demonstrates superior performance on mathematical reasoning and function calling tasks.
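The partitioning step can be sketched as follows (a simplified illustration under my own assumptions; the fixed threshold is hypothetical, not DiPO's actual rule): compute each sampled response's perplexity from its per-token log-probabilities, then split the batch into a low-perplexity "exploitation" pool and a high-perplexity "exploration" pool.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def partition(samples, threshold):
    """samples: list of (sample_id, per-token logprobs) pairs.
    Returns (exploitation pool, exploration pool) of sample ids."""
    exploit, explore = [], []
    for sid, lps in samples:
        (exploit if perplexity(lps) <= threshold else explore).append(sid)
    return exploit, explore

samples = [
    ("a", [-0.1, -0.2, -0.1]),   # confident decoding: low perplexity
    ("b", [-2.0, -1.5, -2.5]),   # uncertain decoding: high perplexity
    ("c", [-0.3, -0.4, -0.2]),
]
exploit, explore = partition(samples, threshold=2.0)
print(exploit, explore)  # ['a', 'c'] ['b']
```

Treating the two pools differently during the policy update is what the paper's bidirectional reward allocation then builds on.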
OpenAI introduces FrontierScience, a new benchmark for measuring expert-level AI scientific capabilities across physics, chemistry, and biology, with GPT-5.2 achieving 77% on olympiad-style tasks and 25% on research-style tasks. The paper presents early evidence that GPT-5 meaningfully accelerates real scientific workflows, shortening work from weeks to hours while establishing metrics for tracking progress toward AI-accelerated science.
Google DeepMind's advanced Gemini with Deep Think achieved gold-medal standard at the International Mathematical Olympiad 2025, solving 5 out of 6 problems for 35 points—a significant advance over last year's silver-medal performance, operating end-to-end in natural language within competition time limits.
OpenAI demonstrates that process supervision—rewarding intermediate reasoning steps rather than just final answers—improves mathematical reasoning while reducing alignment costs. This approach produces more interpretable, human-aligned reasoning without sacrificing model performance.
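The contrast between the two supervision signals can be sketched in a few lines (a hypothetical step checker and 50/50 weighting of my own choosing, not OpenAI's actual reward model): outcome supervision looks only at the final answer, while process supervision also credits each verified intermediate step, so a lucky wrong derivation cannot score highly.

```python
def outcome_reward(final_answer, gold):
    """Outcome supervision: reward depends only on the final answer."""
    return 1.0 if final_answer == gold else 0.0

def process_reward(steps, final_answer, gold):
    """Process supervision sketch. steps: list of (text, is_valid) pairs;
    verified steps and the final answer are weighted equally here."""
    if not steps:
        return 0.0
    step_score = sum(ok for _, ok in steps) / len(steps)
    return 0.5 * step_score + 0.5 * outcome_reward(final_answer, gold)

lucky = [("guess a pattern", False), ("assert the result", False)]
careful = [("factor the quadratic", True), ("solve each root", True)]

print(outcome_reward("x=2", "x=2"))           # 1.0 regardless of the steps
print(process_reward(lucky, "x=2", "x=2"))    # 0.5: correct answer, bad steps
print(process_reward(careful, "x=2", "x=2"))  # 1.0: correct answer, valid steps
```

Under outcome supervision both trajectories look equally good; the process reward separates them, which is the interpretability and alignment benefit the paper reports.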
OpenAI presents GPT-f, a transformer-based automated theorem prover for the Metamath formalization language, which discovered new short proofs accepted into the main Metamath library — marking the first time a deep-learning system contributed proofs adopted by a formal mathematics community.