A comprehensive survey analyzing over 300 papers on LLM reasoning, presenting a taxonomy of reasoning paradigms including Chain-of-Thought, Multi-Hop, Mathematical, Commonsense, and others, along with common failure modes and research gaps.
ComBench is an Olympiad-level combinatorics benchmark of 100 problems designed to evaluate rigorous proof reasoning and constructive realization in large language models. It reveals that frontier models such as GPT-5.5 reach an overall average of only 65.4%, and that the two capabilities are distinct.
Proposes PADD, a framework for distilling knowledge from dense teachers into mixture-of-experts (MoE) students, addressing the challenge of learning routing policies without a router in the teacher. The method involves four stages and shows improvements on mathematical reasoning benchmarks.
N-GRPO introduces semantic neighbor mixing in the GRPO framework to enhance mathematical reasoning diversity while preserving semantic consistency, achieving improvements on math benchmarks and out-of-distribution tasks.
This paper introduces Prefix Utility Model (PUM), which evaluates LLM reasoning prefixes based on their utility (improvement in solve rate) rather than local correctness. PUM shows strong performance in mathematical reasoning tasks across selection, search, and reinforcement learning.
RASFT is a novel supervised fine-tuning framework for large language models that adapts expert supervision to the model's own reasoning capabilities, achieving better performance on mathematical and code reasoning benchmarks than standard SFT and reinforcement learning methods.
This paper benchmarks sub-1B models on mathematical reasoning tasks, revealing that full fine-tuning actively harms performance in models under 300M parameters, while parameter-efficient fine-tuning (PEFT) methods such as LoRA and DoRA provide stability. The authors recommend defaulting to PEFT for all aligned sub-1B models and caution against full fine-tuning for architectures smaller than 500M parameters to prevent catastrophic forgetting.
Introduces CrowdMath, a dataset of 164 expert-annotated progress chains from the MIT PRIMES–AoPS CrowdMath program, capturing collaborative mathematical problem-solving. Benchmarks six frontier models, finding they achieve 83-88% accuracy on next-post prediction but only 0.42 macro-F1 on post-role classification, highlighting a gap in understanding collaborative progress.
Sign-Gated On-Policy Distillation (SG-OPD) enhances standard on-policy distillation by using a binary verifier as a trust signal for teacher supervision, improving performance on competition-level math reasoning benchmarks.
This paper presents CERO, a cross-epoch adaptive rollout optimization method for RL post-training of LLMs. It allocates a fixed rollout budget across prompts and epochs using Bayesian posterior variance to maximize sample efficiency, comes with theoretical regret bounds, and outperforms GRPO on mathematical reasoning tasks.
Deliberate Evolution (DE) is an agentic framework that improves LLM-based symbolic regression by decoupling candidate generation from search control, using adaptive operators, structural diagnosis tools, and reflective memory to achieve better results with only 40% of the standard sample budget.
BiNSGPS is a framework that introduces bidirectional interaction between a multimodal LLM adviser and a symbolic solver for geometry problem solving, allowing feedback from the solver to correct errors and generate auxiliary hypotheses. It achieves state-of-the-art performance of 90.5% on Geometry3K and 90.1% on PGPS9K benchmarks.
Researchers from Oxford, Cambridge, MIT, CMU and other institutions conduct a mixed-methods study examining how people integrate AI tools into mathematical proof formalization workflows, finding that participants generally achieve higher formalization accuracy with AI assistance while preferring to maintain high-level human control over the proof discovery process.
The paper introduces GTBench, a curriculum-grounded benchmark for evaluating LLMs as mathematical research assistants in graph theory, containing 63 problems across three difficulty levels. It evaluates five frontier models and finds that performance degrades with difficulty, with GPT-5 achieving near-perfect results on basic problems but only 82% on graduate-level proofs.
EvoTrainer introduces an autonomous training framework that co-evolves LLM policies and training harnesses through empirical feedback, outperforming human-engineered RL baselines on mathematical reasoning, code generation, and long-horizon software engineering tasks.
GRAIL introduces gradient-reweighted advantages to improve token-level credit assignment in reinforcement learning for LLM reasoning, outperforming GRPO across multiple models.
AXIOM is a trust-first neuro-symbolic execution architecture for mathematical reasoning where the LLM acts as a canonicalizer, rewriting natural language problems into schemas processed by a deterministic CAS pipeline, achieving 94.36% correctness with 100% trust on parseable queries.
KACE introduces a knowledge-adaptive context engineering method that separates storage from usage via an epistemic tree and tiered self-consistency, achieving 62.2% on AIME 2025—a 10.4-point gain over fixed self-consistency.
This paper proposes CAST, a non-privileged clipped asymmetric self-teaching method that enhances GRPO-based reinforcement learning with verifiable rewards by providing dense token-level guidance and addressing zero-variance group issues, demonstrating improvements in mathematical reasoning.
This paper introduces the Universal Quantum Transformer (UQT), a quantum-native architecture that uses multi-qubit systems for exact mathematical reasoning, achieving deterministic generalization on modular arithmetic and permutation groups while bypassing classical over-parameterization and quadratic attention bottlenecks, with deployment on IBM Quantum hardware.