Tag
This paper presents Process-Verified Reinforcement Learning, using the Lean proof assistant as a process oracle to provide fine-grained tactic-level feedback during training, improving theorem proving performance.
This paper presents improvements to IsabeLLM, an automated theorem proving tool built on Isabelle, by integrating a retrieval-augmented generation framework, error tracing, and counterexample generation. The improved tool is evaluated on the formal verification of Bitcoin's Proof of Work consensus protocol.
This paper presents an agent pipeline for formalizing a numerical analysis textbook in Lean 4 and introduces a quality audit framework that evaluates semantic correctness and library reuse beyond kernel acceptance, revealing common unfaithful formalization patterns.
This paper presents a case study of using an LLM-based coding agent (Claude Code) to formalize Grothendieck's vanishing theorem in the Lean theorem prover. It finds that while agents can produce verified code, they struggle with definitions and API design, emphasizing the need for expert review beyond mere compilation.
MA-ProofBench is a new formal benchmark for evaluating LLMs on theorem proving in mathematical analysis, containing 200 problems across two difficulty levels. The best model, GPT-5.5, achieves only 16% on Level I and 5% on Level II, highlighting a significant gap between informal and formal reasoning.
ATS is a statically typed programming language that unifies implementation with formal specification, supporting functional, imperative, concurrent, and modular programming with dependent and linear types for high efficiency and safety.
Gilad Bracha envisions a future where software engineers use AI to translate informal requirements into formal specs and review them, while AI implements and verifies code. The human ensures the formal spec is correct, writing only natural language.
LEAP is an agentic framework that enables general-purpose LLMs to achieve state-of-the-art performance in formal theorem proving in Lean, solving all 12 problems from the 2025 Putnam Competition and boosting formal solve rates from below 10% to 70% on a new benchmark (Lean-IMO-Bench), surpassing specialized systems.
Feedback Distillation is a training method that uses token-level supervision from an LLM to improve complex reasoning, evaluated on Lean 4 theorem proving. It maintains diversity better than GRPO, and the two methods are complementary.
Terry Tao remarks on AI enabling mass-produced mathematics at scale, turning proof-writing into a search problem that generates thousands of mini-lemmas and filters them with cheap checkers.
Galois announces that SAW now supports generating Isabelle theories from Cryptol specifications, bridging the usability of Cryptol and SAW with the expressivity of interactive theorem provers like Isabelle, enabling semi-automated verification of cryptographic protocols.
Shares experience with the AI loop from the Yang Zhang lab group meeting, covering automated theorem proving, multi-machine collaboration, and distilling a private experience base, and mentions examples of Fields medalists using AI to solve mathematical problems.
Researchers introduce Self-Guided Self-Play (SGS), a self-play algorithm for LLMs that prevents reward hacking by using a Guide role to score synthetic problems. Applied to theorem proving in Lean 4, SGS surpasses RL baselines and allows a 7B model to outperform a 671B model.
Aleph, a fully autonomous AI agent system for formal verification, achieved top performance on major theorem proving benchmarks including PutnamBench, VeriSoftBench, and Verina.
FormalSLT is a Lean 4 library that formally proves finite-sample statistical learning theory results (ERM, VC bounds, Rademacher bounds, PAC-Bayes, etc.) with explicit assumptions and zero sorry statements, providing a machine-checked foundation for ML theory.
This paper introduces the AI Co-Mathematician, a workbench that uses agentic AI to support mathematicians in open-ended research tasks like ideation and theorem proving. Early tests show the system achieving state-of-the-art results on hard problem-solving benchmarks, including a 48% score on FrontierMath Tier 4.
Researchers from Charles University introduce Bolzano, an open-source multi-agent LLM system that orchestrates prover and verifier agents to assist with mathematical research, reporting new results on six problems where four reached publishable quality and three were produced essentially autonomously.
Verus is a static verification tool for Rust that uses SMT solving to prove full functional correctness of low-level systems code without runtime checks.
This paper proposes DeepInsightTheorem, a hierarchical dataset and Progressive Multi-Stage SFT training strategy to improve LLMs' informal theorem proving by teaching them to identify and apply core techniques through insight-aware reasoning.
Google DeepMind's advanced Gemini with Deep Think achieved gold-medal standard at the International Mathematical Olympiad 2025, solving 5 out of 6 problems for 35 points. Operating end-to-end in natural language within competition time limits, it marks a significant advance over last year's silver-medal performance.