This paper introduces YFPO, a neuron-guided preference optimization framework that uses internal activation signals to improve mathematical reasoning in large language models.
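The summary does not say how YFPO turns activation signals into a training objective, so the sketch below shows one plausible reading under stated assumptions: a standard DPO-style preference loss whose per-pair weight comes from a hypothetical neuron-activation score. The function name, the `activation_score` input, and the weighting scheme are illustrative, not the paper's method.

```python
import torch
import torch.nn.functional as F

def dpo_loss_with_activation_weight(
    logp_chosen, logp_rejected,          # policy log-probs of chosen/rejected responses
    ref_logp_chosen, ref_logp_rejected,  # frozen reference-model log-probs
    activation_score,                    # hypothetical per-pair neuron signal in [0, 1]
    beta: float = 0.1,
):
    """Standard DPO objective, re-weighted by an activation-derived score (assumed)."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    per_pair = -F.logsigmoid(margin)
    # Pairs flagged by the (assumed) neuron signal contribute more to the loss.
    return (activation_score * per_pair).mean()

# Toy usage with random tensors standing in for real model outputs.
B = 4
loss = dpo_loss_with_activation_weight(
    torch.randn(B, requires_grad=True), torch.randn(B),
    torch.randn(B), torch.randn(B),
    activation_score=torch.rand(B),
)
loss.backward()
print(float(loss))
```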
This paper introduces ThinC (Thinking in Code), a framework where language models use code blocks exclusively for reasoning after a brief natural language planning step, outperforming existing tool-integrated reasoning baselines on math benchmarks.
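ThinC's actual prompt and execution sandbox are not given in this summary; the sketch below illustrates the general scaffold under assumptions: ask for a brief natural-language plan, require all further reasoning inside one Python block, execute that block, and take its printed output as the answer. The `generate` callable and the `PROMPT` wording are placeholders.

```python
import re, subprocess, sys, tempfile

PROMPT = (
    "First write a short natural-language plan (2-3 sentences). "
    "Then do ALL remaining reasoning inside a single ```python``` block "
    "that prints the final answer.\n\nProblem: {problem}"
)

def solve(problem: str, generate) -> str:
    """`generate` is any callable str -> str wrapping the language model."""
    reply = generate(PROMPT.format(problem=problem))
    match = re.search(r"```python\n(.*?)```", reply, re.S)
    if match is None:
        return reply.strip()                      # fall back to raw text
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(match.group(1))
    out = subprocess.run([sys.executable, f.name],
                         capture_output=True, text=True, timeout=30)
    return out.stdout.strip()                     # answer printed by the code

# Stubbed model reply for a dry run; a real deployment would call an LLM here.
fake_reply = ("Plan: sum the series directly.\n"
              "```python\nprint(sum(range(1, 101)))\n```")
print(solve("What is 1 + 2 + ... + 100?", lambda _: fake_reply))  # -> 5050
```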
This paper proposes D-RPC, a method for distilling reasoning from large language models to smaller ones by compressing reasoning paths into a reusable bank, achieving better performance and consistency on math and commonsense benchmarks.
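How D-RPC compresses reasoning paths and retrieves them is not detailed here, so the following is only a toy illustration of the reuse pattern: store compressed paths once, then retrieve the closest one for a new problem. The bag-of-words similarity and the class API are assumptions, not the paper's design.

```python
from collections import Counter

class ReasoningBank:
    def __init__(self):
        self.entries = []          # (problem_text, compressed_reasoning_path)

    def add(self, problem: str, compressed_path: str):
        self.entries.append((problem, compressed_path))

    def retrieve(self, query: str) -> str:
        q = Counter(query.lower().split())
        def overlap(text):  # crude lexical similarity, stand-in for real embeddings
            return sum((q & Counter(text.lower().split())).values())
        best = max(self.entries, key=lambda e: overlap(e[0]))
        return best[1]

bank = ReasoningBank()
bank.add("Find the GCD of two integers.",
         "1) apply Euclid's algorithm  2) stop when the remainder is 0")
bank.add("Count subsets of a set of size n.",
         "1) each element is in or out independently  2) answer is 2**n")
print(bank.retrieve("What is the GCD of 48 and 36?"))
```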
This paper presents a comprehensive empirical study on on-policy distillation for large language models, identifying failure mechanisms like distribution mismatch and optimization instability, and proposing fixes such as stop-gradient objectives and RLVR-adapted teachers.
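The study's exact objectives are not reproduced in this summary; below is a minimal sketch, under assumptions, of the general shape of an on-policy distillation step with a stop-gradient target: the student generates the text, the teacher's distribution over that text is treated as a constant, and only the student's log-probabilities carry gradient. The KL direction and tensor shapes are illustrative choices.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_step(student_logits, teacher_logits):
    """Both tensors are [batch, seq, vocab], scored on student-sampled text."""
    log_p_student = F.log_softmax(student_logits, dim=-1)
    with torch.no_grad():                      # stop-gradient: teacher is a fixed target
        log_p_teacher = F.log_softmax(teacher_logits, dim=-1)
        p_teacher = log_p_teacher.exp()
    # KL(teacher || student), averaged over sampled positions.
    kl = (p_teacher * (log_p_teacher - log_p_student)).sum(-1)
    return kl.mean()

student_logits = torch.randn(2, 8, 100, requires_grad=True)
teacher_logits = torch.randn(2, 8, 100)
loss = on_policy_distill_step(student_logits, teacher_logits)
loss.backward()
print(float(loss))
```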
The paper proposes Crosslingual On-Policy Self-Distillation (COPSD), a method to transfer high-resource language reasoning capabilities to low-resource languages using a shared student-teacher architecture. Experiments across 17 African languages show significant improvements in mathematical reasoning and answer-format adherence, outperforming Group Relative Policy Optimization (GRPO).
DyStruct is a training-free Bayesian decoding framework for discrete Diffusion Language Models that enables flexible-length generation by dynamically determining expansion size and decoding order, improving accuracy on math and code tasks.
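The summary names two decisions, how many positions to decode next and in what order, without giving DyStruct's Bayesian criterion. The toy sketch below substitutes a simple per-position confidence rule purely to show where those decisions sit in a diffusion-LM decoding loop; the threshold and scoring are assumptions.

```python
import numpy as np

def choose_positions(token_probs: np.ndarray, threshold: float = 0.9, max_k: int = 4):
    """token_probs: max probability the denoiser assigns to each masked slot."""
    order = np.argsort(-token_probs)              # most confident positions first
    confident = [i for i in order if token_probs[i] >= threshold]
    k = max(1, min(len(confident), max_k))        # dynamic step size, never zero
    return order[:k].tolist()

probs = np.array([0.35, 0.97, 0.92, 0.60, 0.99])
print(choose_positions(probs))                    # -> [4, 1, 2]
```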
Soohak is a new benchmark of 439 research-level math problems curated by mathematicians to evaluate the reasoning capabilities of frontier LLMs, highlighting significant gaps in solving advanced problems and recognizing ill-posed questions.
MIT researchers, in collaboration with KAUST and HUMAIN, have released MathNet, the largest open-source dataset of Olympiad-level math problems, containing over 30,000 expert-authored problems from 47 countries.
An empirical study of LLM formal-math reasoning finds a single-prompt ceiling: accuracy plateaus in the 60–79% range regardless of prompt size, driven by undecidability, model fragility, and distribution mismatch.
Researchers introduce GeoRepEval, a framework to evaluate LLM robustness across equivalent geometric problem representations (Euclidean, coordinate, vector). Testing 11 LLMs on 158 geometry problems, they find accuracy gaps up to 14 percentage points based solely on representation choice, with vector formulations being a consistent failure point.
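GeoRepEval's data format and scoring are not specified here; the sketch below only illustrates the comparison pattern the summary describes, per-representation accuracy on equivalent problems, with placeholder items and a stubbed `ask` callable standing in for the model under test.

```python
from collections import defaultdict

def evaluate(items, ask):
    """items: list of dicts with 'representation', 'prompt', and 'answer' keys."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for it in items:
        total[it["representation"]] += 1
        if ask(it["prompt"]).strip() == it["answer"]:
            correct[it["representation"]] += 1
    return {rep: correct[rep] / total[rep] for rep in total}

items = [
    {"representation": "euclidean", "prompt": "placeholder problem", "answer": "12"},
    {"representation": "vector",    "prompt": "placeholder problem", "answer": "12"},
]
print(evaluate(items, ask=lambda _: "12"))   # stubbed model always answers "12"
```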
SAI-DPO introduces a dynamic sampling framework that adapts training data to a model's evolving capabilities during mathematical reasoning tasks, using self-aware difficulty metrics and knowledge semantic alignment to achieve state-of-the-art efficiency with less data on benchmarks like AIME24 and AMC23.
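SAI-DPO's self-aware difficulty metric and semantic-alignment term are not described in this summary, so the snippet below only illustrates the underlying idea of capability-adaptive sampling: weight problems by how close their current pass rate sits to a learning-frontier target. The weighting function is an assumption.

```python
import random

def sampling_weight(pass_rate: float, target: float = 0.5) -> float:
    # Highest weight when the model currently solves the problem about half the time.
    return max(1e-3, 1.0 - abs(pass_rate - target) * 2.0)

def sample_batch(problems, pass_rates, k=2, seed=0):
    weights = [sampling_weight(r) for r in pass_rates]
    random.seed(seed)
    return random.choices(problems, weights=weights, k=k)

problems = ["easy warm-up", "frontier problem", "far too hard"]
pass_rates = [0.95, 0.45, 0.02]
print(sample_batch(problems, pass_rates))
```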
This paper evaluates the mathematical reasoning capabilities of large language models in Sinhala and Tamil, two low-resource South Asian languages, using a parallel dataset of independently authored problems. The study demonstrates that while basic arithmetic transfers well across languages, complex reasoning tasks show significant performance degradation in non-English languages, with implications for deploying AI tutoring tools in multilingual educational contexts.
This paper identifies and addresses 'Miracle Steps' in LLM mathematical reasoning, unjustified jumps to correct answers that indicate reward hacking, by proposing the Rubric Reward Model (RRM), a process-oriented reward function that evaluates entire reasoning trajectories. RRM raises Verified Pass@1024 on AIME2024 from 26.7% to 62.6% and reduces Miracle Steps by 71%.
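RRM's actual rubric is not given in this summary; the sketch below only illustrates how a process-oriented reward differs from an outcome-only reward: every step is scored, and a correct answer reached through an unjustified final jump (a 'Miracle Step') is penalized. The criteria, weights, and penalty are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class StepScore:
    justified: bool       # does the step follow from the previous ones?
    arithmetic_ok: bool   # do the stated computations check out?

def rubric_reward(steps: list[StepScore], answer_correct: bool) -> float:
    if not steps:
        return 0.0
    process = sum(0.5 * s.justified + 0.5 * s.arithmetic_ok for s in steps) / len(steps)
    miracle_step = answer_correct and not steps[-1].justified
    outcome = 1.0 if answer_correct else 0.0
    reward = 0.7 * process + 0.3 * outcome
    return reward - (0.5 if miracle_step else 0.0)   # discourage reward hacking

good = [StepScore(True, True), StepScore(True, True)]
hacked = [StepScore(True, True), StepScore(False, True)]  # correct answer, unjustified jump
print(rubric_reward(good, True), rubric_reward(hacked, True))
```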
This paper proposes DeepInsightTheorem, a hierarchical dataset and Progressive Multi-Stage SFT training strategy to improve LLMs' informal theorem proving by teaching them to identify and apply core techniques through insight-aware reasoning.
This paper investigates how large language models perform arithmetic operations by analyzing internal mechanisms through early decoding, revealing that proficient models exhibit a clear division of labor between attention and MLP modules in reasoning tasks.
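The paper's probing setup is not spelled out in this summary; the sketch below shows the generic "early decoding" (logit-lens style) idea it relies on, projecting intermediate hidden states through the unembedding to see at which layer the answer token emerges, using GPT-2 purely as a small stand-in model.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "23 + 19 ="
inputs = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# One hidden state per layer (plus the embedding layer); decode the last position early.
for layer, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h[:, -1]))
    top = tok.decode(logits.argmax(-1))
    print(f"layer {layer:2d} -> {top!r}")
```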
STRATAGEM is a new framework for improving reasoning transferability in language models by using game self-play with a Reasoning Transferability Coefficient and Reasoning Evolution Reward to reinforce abstract, domain-agnostic reasoning patterns over game-specific heuristics. Experiments show strong improvements on mathematical reasoning, general reasoning, and code generation benchmarks.
MathNet is a large-scale multilingual multimodal benchmark of 30,676 Olympiad-level math problems spanning 47 countries and 17 languages, designed to evaluate mathematical reasoning and retrieval in generative and embedding-based models. Even state-of-the-art models like Gemini and GPT-5 struggle with the benchmark, highlighting significant room for improvement in mathematical AI.
DiPO introduces a novel reinforcement learning approach for LLMs that uses perplexity-based sample partitioning to disentangle exploration and exploitation subspaces, combined with a bidirectional reward allocation mechanism for more stable policy optimization. The method demonstrates superior performance on mathematical reasoning and function calling tasks.
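DiPO's partition rule and bidirectional reward allocation are not detailed in this summary; the snippet below is a minimal sketch of only the first step as described, splitting sampled responses by their perplexity under the current policy into exploitation-like and exploration-like groups. The median cut-off is an assumption.

```python
import math
from statistics import median

def perplexity(token_logprobs):
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def partition(samples):
    """samples: list of (response_text, per-token log-probs under the policy)."""
    ppls = [perplexity(lp) for _, lp in samples]
    cut = median(ppls)
    exploit = [s for s, p in zip(samples, ppls) if p <= cut]   # low perplexity: familiar
    explore = [s for s, p in zip(samples, ppls) if p > cut]    # high perplexity: novel
    return exploit, explore

samples = [("familiar solution", [-0.1, -0.2, -0.1]),
           ("novel solution",    [-1.5, -2.0, -1.2])]
exploit, explore = partition(samples)
print(len(exploit), len(explore))
```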
OpenAI introduces FrontierScience, a new benchmark for measuring expert-level AI scientific capabilities across physics, chemistry, and biology, with GPT-5.2 achieving 77% on olympiad-style tasks and 25% on research-style tasks. The paper presents early evidence that GPT-5 meaningfully accelerates real scientific workflows, shortening work from weeks to hours, and establishes metrics for tracking progress toward AI-accelerated science.
Google DeepMind's advanced Gemini with Deep Think achieved gold-medal standard at the International Mathematical Olympiad 2025, solving 5 out of 6 problems for 35 points, a significant advance over last year's silver-medal performance, while operating end-to-end in natural language within the competition time limits.