Google DeepMind's AlphaProof Nexus combines LLM-driven proof generation with machine verification using Lean, solving nine out of 353 open Erdős problems, two of which had been open for 56 years, at a cost of only a few hundred dollars per problem.
Research Math Agents (RMA) is an agentic framework for automated reasoning on research-level mathematical problems, achieving state-of-the-art results on the First Proof benchmark by solving 8 out of 10 problems, outperforming strong baselines like GPT-5.2R and Aletheia.
This paper introduces Digit Entropy Loss (DEL), a novel loss function for numerical learning in large language models that reformulates entropy optimization to improve digit-level prediction accuracy and handle floating-point numbers, consistently outperforming existing methods on mathematical reasoning benchmarks.
SCRL is a curriculum reinforcement learning framework that uses subproblem-level normalization and curriculum learning to improve credit assignment in LLM reasoning, outperforming baselines on mathematical reasoning benchmarks.
This paper challenges the belief that code improves reasoning in language models, finding through controlled pretraining experiments that code alone primarily enhances programming ability, while reasoning gains come from structured reasoning traces like code-text and math-text mixtures.
This survey synthesizes recent advancements in mathematical reasoning with large language models, covering benchmarks, architectures, training strategies, and evaluation protocols. It identifies key challenges such as reasoning faithfulness and benchmark biases.
This paper introduces Stepwise Confidence Attribution (SCA), a framework that assigns step-level confidence to reasoning traces from black-box LLMs without internal access, using the Information Bottleneck principle to distinguish legitimate variability from errors. Experiments show that SCA reliably identifies low-confidence steps and improves self-correction success rates by up to 13.5% over answer-level feedback.
This paper introduces LinAlg-Bench, a diagnostic benchmark that evaluates 10 frontier LLMs on structured linear algebra computation across matrix dimensions. It finds that LLM mathematical failure is structurally constrained, shifting from execution errors to computational abandonment at the 4×4 scale.
CEPO improves reinforcement learning with verifiable rewards by using contrastive signals from rejected rollouts to distinguish decisive reasoning steps from filler tokens, achieving higher accuracy than GRPO on multimodal math reasoning benchmarks.
Gemini 3.2 Flash can solve IMO 2025 P6 when given scaffolding, but only GPT-5.5-Pro solves it without any scaffolding or harness engineering.
This paper identifies a failure mode called 'trajectory locking' in reward-maximizing post-training for diffusion language models, and proposes TraFL, a trajectory-balance objective that improves diversity and performance across math and code benchmarks.
This paper proposes a principled offline reasoning distillation framework that corrects teacher-student distribution drift, improving reasoning accuracy on math benchmarks without requiring online rollouts.
This paper presents a simple, unified recipe that combines supervised fine-tuning, two-stage reinforcement learning, and test-time scaling to train a reasoning model (SU-01) that achieves gold-medal-level performance on International Mathematical Olympiad and International Physics Olympiad problems.
This paper introduces YFPO, a neuron-guided preference optimization framework that uses internal activation signals to improve mathematical reasoning in large language models.
HölderPO introduces a generalized policy optimization framework that uses the Hölder mean for token-level probability aggregation in GRPO, with a dynamic annealing schedule to balance gradient concentration and variance. The method achieves state-of-the-art results on mathematical benchmarks (54.9% average, 7.2% relative gain over GRPO) and a 93.8% success rate on ALFWorld.
This paper introduces ThinC (Thinking in Code), a framework where language models use code blocks exclusively for reasoning after a brief natural language planning step, outperforming existing tool-integrated reasoning baselines on math benchmarks.
This paper proposes D-RPC, a method for distilling reasoning from large language models to smaller ones by compressing reasoning paths into a reusable bank, achieving better performance and consistency on math and commonsense benchmarks.
This paper presents a comprehensive empirical study on on-policy distillation for large language models, identifying failure mechanisms like distribution mismatch and optimization instability, and proposing fixes such as stop-gradient objectives and RLVR-adapted teachers.
The paper proposes Crosslingual On-Policy Self-Distillation (COPSD), a method to transfer high-resource language reasoning capabilities to low-resource languages using a shared student-teacher architecture. Experiments across 17 African languages show significant improvements in mathematical reasoning and answer-format adherence, outperforming Group Relative Policy Optimization (GRPO).
DyStruct is a training-free Bayesian decoding framework for discrete Diffusion Language Models that enables flexible-length generation by dynamically determining expansion size and decoding order, improving accuracy on math and code tasks.
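For readers unfamiliar with the Hölder mean named in the HölderPO entry above, the following is a minimal sketch of the standard power-mean definition applied to a list of token probabilities. It is not the paper's implementation: the function name `holder_mean`, the exponent argument `p`, and the sample probabilities are illustrative assumptions, and the paper's annealing schedule is not modeled here.

```python
import math


def holder_mean(probs, p):
    """Power (Hölder) mean of positive values.

    p = 1 is the arithmetic mean, p -> 0 the geometric mean,
    and p = -1 the harmonic mean. (Hypothetical helper for illustration.)
    """
    if not probs:
        raise ValueError("probs must be non-empty")
    if any(x <= 0 for x in probs):
        raise ValueError("all values must be positive")
    n = len(probs)
    if abs(p) < 1e-12:
        # Limit p -> 0: exponentiated average of logs (geometric mean).
        return math.exp(sum(math.log(x) for x in probs) / n)
    return (sum(x ** p for x in probs) / n) ** (1.0 / p)


# Made-up token probabilities for a single sampled response.
token_probs = [0.9, 0.5, 0.1]

arithmetic = holder_mean(token_probs, 1.0)
geometric = holder_mean(token_probs, 0.0)
harmonic = holder_mean(token_probs, -1.0)

print(arithmetic, geometric, harmonic)
```

Lower exponents weight the smallest probabilities more heavily, so the harmonic mean is never larger than the geometric mean, which is never larger than the arithmetic mean. An exponent that changes during training would therefore shift how much low-probability tokens dominate the aggregate.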