@FinanceYF5: Google's new paper lets LLMs solve math competition problems, with accuracy jumping from 10% to 70%. [LEAP framework] Instead of having the model write a complete proof at once, it breaks the problem down into a goal tree, learns step by step from Lean verifier feedback, and reuses proven lemmas. Result: all 12 problems of Putnam 2025 solved, IMO style…
Summary
A new Google paper proposes the LEAP framework, which decomposes math problems into goal trees, learns from Lean verifier feedback, and reuses proven lemmas, raising LLM accuracy on math competition problems from 10% to 70%. It solves all 12 problems of Putnam 2025 and surpasses dedicated gold-medal-level systems on IMO-style benchmarks.
Cached at: 06/05/26, 09:09 AM
Google's new paper enables LLMs to solve math competition problems, with accuracy jumping from 10% to 70%.
[LEAP Framework] Instead of having the model write a complete proof in one go, it breaks the problem into a goal tree, learning on the fly from Lean verifier feedback and reusing proven lemmas.
Result: All 12 problems of Putnam 2025 solved, surpassing the dedicated gold-medal-level system by 48% on IMO-style benchmarks.
The model’s capability didn’t change; the structure did, and the ceiling shifted. https://t.co/aY2IEGePO9
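As a rough illustration of the goal-tree idea described above, here is a minimal Python sketch. It is not the paper's implementation: `toy_verifier` stands in for "the LLM proposes a proof and Lean checks it" (reduced here to pass/fail), `toy_decompose` stands in for the LLM splitting a goal into subgoals, and the statements `T`, `L1`, `L2`, `A`, `B` are all made up for the example. The only point is to show how a cache of proven lemmas lets a later branch reuse earlier work.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Goal:
    """A node in the goal tree: a statement plus the subgoals it was split into."""
    statement: str
    children: List["Goal"] = field(default_factory=list)


def prove(
    goal: Goal,
    verifier: Callable[[str], bool],
    decompose: Callable[[str], List[str]],
    lemma_cache: Dict[str, str],
    depth: int = 0,
    max_depth: int = 3,
) -> bool:
    """Try to prove `goal`, recording every proven statement in `lemma_cache`."""
    # Reuse: a statement proven earlier is never sent to the verifier again.
    if goal.statement in lemma_cache:
        return True
    # Direct attempt: hypothetical stand-in for "LLM writes proof, Lean checks it".
    if verifier(goal.statement):
        lemma_cache[goal.statement] = "direct"
        return True
    if depth >= max_depth:
        return False
    # The direct attempt failed, so split the goal into subgoals and recurse.
    subgoals = decompose(goal.statement)
    if not subgoals:
        return False
    goal.children = [Goal(s) for s in subgoals]
    for child in goal.children:
        if not prove(child, verifier, decompose, lemma_cache, depth + 1, max_depth):
            return False
    lemma_cache[goal.statement] = "from " + ", ".join(subgoals)
    return True


# Toy problem (entirely hypothetical): T needs L1 and L2, and both need A.
DIRECT = {"A", "B"}
SPLITS = {"T": ["L1", "L2"], "L1": ["A", "B"], "L2": ["A"]}
calls: List[str] = []  # log of statements sent to the toy verifier


def toy_verifier(statement: str) -> bool:
    calls.append(statement)
    return statement in DIRECT


def toy_decompose(statement: str) -> List[str]:
    return SPLITS.get(statement, [])


cache: Dict[str, str] = {}
root = Goal("T")
proved = prove(root, toy_verifier, toy_decompose, cache)
print(proved, calls, cache)
```

In this toy run the lemma `A` is needed by both `L1` and `L2` but reaches the verifier only once; the second use is served from the cache. A real system would also feed the verifier's error messages back into the next attempt, which this sketch omits.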
Similar Articles
@rohanpaul_ai: Another great paper from Google. Shows general LLMs can solve formal math by planning proofs and checking each step. Ra…
A new Google paper introduces LEAP, an agentic framework that enables general LLMs to solve formal math problems by planning proofs and checking each step, raising performance from under 10% to 70% on the Lean IMO benchmark and solving all 2025 Putnam problems.
LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks
LEAP is an agentic framework that enables general-purpose LLMs to achieve state-of-the-art performance in formal theorem proving in Lean, solving all 12 problems from the 2025 Putnam Competition and boosting formal solve rates from below 10% to 70% on a new benchmark (Lean-IMO-Bench), surpassing specialized systems.
@stevibe: Which LLMs actually love to think? Tested 7 models on 5 math problems, measured reasoning length. The think winners: bo…
Benchmarked 7 LLMs on 5 math problems; Qwen3.5 27B and 35B A3B generated the longest reasoning chains, exceeding 10k tokens per question.
LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening
LLMEval-Logic is a new Chinese benchmark for evaluating logical reasoning in LLMs, featuring solver-verified answers and adversarial hardening. The benchmark reveals significant gaps in current models, with the best reaching only 37.5% accuracy on hard items.
@berryxia: Honestly, only truly brilliant people dare to say such things! An undergraduate could handle the math behind training LLMs! In a recent interview, Terence Tao laid out the core mystery of LLMs directly. The Fields Medal winner (the highest honor in mathematics, often called the Nobel Prize of math) and one of the top contemporary…
Terence Tao pointed out that the math behind current LLMs is actually very simple, but the real puzzle lies in the intermediate zone of natural language data, which leads to unpredictable model behavior.