mathematical-reasoning

#mathematical-reasoning

Google DeepMind's AlphaProof Nexus solves decades-old math problems for a few hundred dollars (7 minute read)

TLDR AI · 2026-05-26

Google DeepMind's AlphaProof Nexus combines LLM-driven proof generation with machine verification using Lean, solving nine out of 353 open Erdős problems, two of which had been open for 56 years, at a cost of only a few hundred dollars per problem.

RMA: an Agentic System for Research-Level Mathematical Problems

arXiv cs.AI · 2026-05-25

Research Math Agents (RMA) is an agentic framework for automated reasoning on research-level mathematical problems, achieving state-of-the-art results on the First Proof benchmark by solving 8 out of 10 problems, outperforming strong baselines like GPT-5.2R and Aletheia.

DEL: Digit Entropy Loss for Numerical Learning of Large Language Models

arXiv cs.CL · 2026-05-21

This paper introduces Digit Entropy Loss (DEL), a novel loss function for numerical learning in large language models that reformulates entropy optimization to improve digit-level prediction accuracy and handle floating-point numbers, consistently outperforming existing methods on mathematical reasoning benchmarks.

From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning

Hugging Face Daily Papers · 2026-05-21

SCRL is a curriculum reinforcement learning framework that uses subproblem-level normalization and curriculum learning to improve credit assignment in LLM reasoning, outperforming baselines on mathematical reasoning benchmarks.

What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code

arXiv cs.AI · 2026-05-20

This paper challenges the belief that code improves reasoning in language models, finding through controlled pretraining experiments that code alone primarily enhances programming ability, while reasoning gains come from structured reasoning traces like code-text and math-text mixtures.

Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges

arXiv cs.CL · 2026-05-20

This survey synthesizes recent advancements in mathematical reasoning with large language models, covering benchmarks, architectures, training strategies, and evaluation protocols. It identifies key challenges such as reasoning faithfulness and benchmark biases.

Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

arXiv cs.CL · 2026-05-20

Introduces Stepwise Confidence Attribution (SCA), a framework for assigning step-level confidence to reasoning traces from black-box LLMs without internal access, using the Information Bottleneck principle to distinguish legitimate variability from errors. Experiments show SCA reliably identifies low-confidence steps and improves self-correction success rates by up to 13.5% over answer-level feedback.

LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning

arXiv cs.AI · 2026-05-19

Introduces LinAlg-Bench, a diagnostic benchmark evaluating 10 frontier LLMs on structured linear algebra computation across matrix dimensions, revealing that LLM mathematical failure is structurally constrained and transitions from execution errors to computational abandonment at 4x4 scale.

CEPO: RLVR Self-Distillation using Contrastive Evidence Policy Optimization

Hugging Face Daily Papers · 2026-05-19

CEPO improves reinforcement learning with verifiable rewards by using contrastive signals from rejected rollouts to distinguish decisive reasoning steps from filler tokens, achieving higher accuracy on multimodal math reasoning benchmarks compared to GRPO.

Gemini 3.2 Flash is capable of solving IMO 2025 P6. Only GPT-5.5-Pro can solve it currently without any scaffolding / harness engineering.

Reddit r/singularity · 2026-05-18

Gemini 3.2 Flash can solve IMO 2025 P6, but only GPT-5.5-Pro can do so without any scaffolding or harness engineering.

Beyond Mode-Seeking RL: Trajectory-Balance Post-Training for Diffusion Language Models

arXiv cs.LG · 2026-05-15

This paper identifies a failure mode called 'trajectory locking' in reward-maximizing post-training for diffusion language models, and proposes TraFL, a trajectory-balance objective that improves diversity and performance across math and code benchmarks.

Distribution Corrected Offline Data Distillation for Large Language Models

arXiv cs.CL · 2026-05-15

This paper proposes a principled offline reasoning distillation framework that corrects teacher-student distribution drift, improving reasoning accuracy on math benchmarks without requiring online rollouts.

Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling

arXiv cs.AI · 2026-05-14

This paper presents a simple and unified recipe combining supervised fine-tuning, two-stage reinforcement learning, and test-time scaling to train a reasoning model (SU-01) that achieves gold-medal-level performance on International Mathematical Olympiad and International Physics Olympiad problems.

YFPO: A Preliminary Study of Yoked Feature Preference Optimization with Neuron-Guided Rewards for Mathematical Reasoning

arXiv cs.CL · 2026-05-13

This paper introduces YFPO, a neuron-guided preference optimization framework that uses internal activation signals to improve mathematical reasoning in large language models.

Hölder Policy Optimisation

Hugging Face Daily Papers · 2026-05-12

HölderPO introduces a generalized policy optimization framework that uses the Hölder mean for token-level probability aggregation in GRPO, with a dynamic annealing schedule to balance gradient concentration and variance. The method achieves state-of-the-art results on mathematical benchmarks (54.9% average, 7.2% relative gain over GRPO) and a 93.8% success rate on ALFWorld.

Teaching Language Models to Think in Code

arXiv cs.CL · 2026-05-11

This paper introduces ThinC (Thinking in Code), a framework where language models use code blocks exclusively for reasoning after a brief natural language planning step, outperforming existing tool-integrated reasoning baselines on math benchmarks.

Structural Rationale Distillation via Reasoning Space Compression

arXiv cs.CL · 2026-05-11

This paper proposes D-RPC, a method for distilling reasoning from large language models to smaller ones by compressing reasoning paths into a reusable bank, achieving better performance and consistency on math and commonsense benchmarks.

The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

Hugging Face Daily Papers · 2026-05-11

This paper presents a comprehensive empirical study on on-policy distillation for large language models, identifying failure mechanisms like distribution mismatch and optimization instability, and proposing fixes such as stop-gradient objectives and RLVR-adapted teachers.

Crosslingual On-Policy Self-Distillation for Multilingual Reasoning

Hugging Face Daily Papers · 2026-05-10

The paper proposes Crosslingual On-Policy Self-Distillation (COPSD), a method to transfer high-resource language reasoning capabilities to low-resource languages using a shared student-teacher architecture. Experiments across 17 African languages show significant improvements in mathematical reasoning and answer-format adherence, outperforming Group Relative Policy Optimization (GRPO).

DyStruct: Dynamically Structured Diffusion Language Model Decoding via Bayesian Inference

Hugging Face Daily Papers · 2026-05-10

DyStruct is a training-free Bayesian decoding framework for discrete Diffusion Language Models that enables flexible-length generation by dynamically determining expansion size and decoding order, improving accuracy on math and code tasks.
