Process-Verified Reinforcement Learning for Theorem Proving via Lean

arXiv cs.AI 06/20/26, 04:00 AM Papers
theorem-proving lean reinforcement-learning process-reward formal-verification symbolic-reasoning llm
Summary
This paper presents Process-Verified Reinforcement Learning, using the Lean proof assistant as a process oracle to provide fine-grained tactic-level feedback during training, improving theorem proving performance.
arXiv:2606.20068v1 Announce Type: new Abstract: While reinforcement learning from verifiable rewards (RLVR) typically has relied on a single binary verification signal, symbolic proof assistants in formal reasoning offer rich, fine-grained structured feedback. This gap between structured processes and unstructured rewards highlights the importance of feedback that is both dense and sound. In this work, we demonstrate that the Lean proof assistant itself can serve as a symbolic process oracle, supplying both outcome-level and fine-grained tactic-level verified feedback during training. Proof attempts are parsed into tactic sequences, and Lean's elaboration marks both locally sound steps and the earliest failing step, yielding dense, verifier-grounded credit signals rooted in type theory. We incorporate these structured rewards into a GRPO-style reinforcement learning objective with first-error propagation and first-token credit methods that balances outcome- and process-level advantages. Experiments with STP-Lean and DeepSeek-Prover-V1.5 show that tactic-level supervision outperforms outcome-only baselines in most settings, delivering improvements on benchmarks such as MiniF2F and ProofNet. Beyond empirical gains, our study highlights a broader perspective: symbolic proof assistants are not only verifiers at evaluation time, but can also act as process-level reward oracles during training. This opens a path toward reinforcement learning frameworks that combine the scalability of language models with the reliability of symbolic verification for formal reasoning.
Original Article
View Cached Full Text
Cached at: 06/20/26, 02:35 PM
# Process-Verified Reinforcement Learning for Theorem Proving via Lean
Source: [https://arxiv.org/html/2606.20068](https://arxiv.org/html/2606.20068)
###### Abstract

While reinforcement learning from verifiable rewards \(RLVR\) typically has relied on a single binary verification signal, symbolic proof assistants in formal reasoning offer rich, fine\-grained structured feedback\. This gap between structured processes and unstructured rewards highlights the importance of feedback that is both dense and sound\. In this work, we demonstrate that the Lean proof assistant itself can serve as a symbolic process oracle, supplying both outcome\-level and fine\-grained tactic\-level verified feedback during training\. Proof attempts are parsed into tactic sequences, and Lean’s elaboration marks both locally sound steps and the earliest failing step, yielding dense, verifier\-grounded credit signals rooted in type theory\. We incorporate these structured rewards into a GRPO\-style reinforcement learning objective with first\-error propagation and first\-token credit methods that balances outcome\- and process\-level advantages\. Experiments with STP\-Lean and DeepSeek\-Prover\-V1\.5 show that tactic\-level supervision outperforms outcome\-only baselines in most settings, delivering improvements on benchmarks such as MiniF2F and ProofNet\. Beyond empirical gains, our study highlights a broader perspective: symbolic proof assistants are not only verifiers at evaluation time, but can also act as process\-level reward oracles during training\. This opens a path toward reinforcement learning frameworks that combine the scalability of language models with the reliability of symbolic verification for formal reasoning\.

## 1Introduction

Automated theorem proving \(ATP\) is one of the long\-term goals of AI\(Newellet al\.,[1957](https://arxiv.org/html/2606.20068#bib.bib1)\)\. Compared to reasoning in natural language \(NL\), which often contains vague or ambiguous symbols, formal theorem proving based on formal logic and type theory provides technical and precise language for proving mathematical theorem\(Church,[1940](https://arxiv.org/html/2606.20068#bib.bib3); Fitting,[1996](https://arxiv.org/html/2606.20068#bib.bib2)\)\. Currently, interactive theorem provers \(ITP\) such as Lean\(de Mouraet al\.,[2015](https://arxiv.org/html/2606.20068#bib.bib4); Moura and Ullrich,[2021](https://arxiv.org/html/2606.20068#bib.bib7)\), Isabelle\(Nipkowet al\.,[2002](https://arxiv.org/html/2606.20068#bib.bib5)\)and Coq\(Barraset al\.,[1997](https://arxiv.org/html/2606.20068#bib.bib6)\), serve as reliable and powerful tools for verifying mathematical proofs\. Lean proofs are sequences of tactics, with automation handling routine arithmetic/logic and verification\-so ITPs provide a middle ground between full automation and human guidance

By contrast, LLMs model next\-token probabilities from large corpora via pre\- and post\-training, learning lexical correlations rather than rule\-based symbolic manipulation\(Brownet al\.,[2020](https://arxiv.org/html/2606.20068#bib.bib8)\)\. With further techniques such as instruction tuning and Reinforcement Learning from Human Feedback \(RLHF\), LLMs have evolved to handle a wide range of tasks, including question answering, summarization, dialogue\(Ouyanget al\.,[2022](https://arxiv.org/html/2606.20068#bib.bib9); Baiet al\.,[2022](https://arxiv.org/html/2606.20068#bib.bib10)\)\. In particular, reinforcement learning \(RL\) approaches for reasoning tasks aim to enhance the model’s reasoning ability by encouraging the generation of long chains of thought rationale\(DeepSeek\-AIet al\.,[2025](https://arxiv.org/html/2606.20068#bib.bib13); OpenAIet al\.,[2024](https://arxiv.org/html/2606.20068#bib.bib11)\)\.

Compared to other reasoning tasks which often verify or reward LLMs’ response according to its final answer\(Cobbeet al\.,[2021](https://arxiv.org/html/2606.20068#bib.bib12)\), the theorem prover can verify the correctness of entire proof when LLMs respond with formal language\. In this context, given the human\-in\-the\-loop nature of ITPs, there have been growing attempts to use LLMs for formal theorem proving tasks\(AlphaProof and AlphaGeometry teams,[2024](https://arxiv.org/html/2606.20068#bib.bib16); Trinhet al\.,[2024](https://arxiv.org/html/2606.20068#bib.bib15)\)\. LLMs act as prover agents while theorem provers serve as verifiers, being used either at inference time\-to search and validate tactics and premises\-or for augmenting formal reasoning datasets with verified samples\(Lampleet al\.,[2022](https://arxiv.org/html/2606.20068#bib.bib18); Wanget al\.,[2023](https://arxiv.org/html/2606.20068#bib.bib19); Yinget al\.,[2024a](https://arxiv.org/html/2606.20068#bib.bib17); Zhuet al\.,[2025](https://arxiv.org/html/2606.20068#bib.bib57)\)\. Furthermore, some recent studies incorporated binary feedback from the Lean theorem prover into its online RL framework\(Xinet al\.,[2024b](https://arxiv.org/html/2606.20068#bib.bib20)\)\.

The tactic\-based proof structure in Lean contains information relevant for reasoning tasks such as the positions of tactics or the nature of proof errors or failures\. This structured information captures not just the outcome of a proof, but also the underlying reasoning process\. However, despite its potential, only a few works have explored incorporating this kind of fine\-grained supervision into the training of LLMs\(Jiet al\.,[2025](https://arxiv.org/html/2606.20068#bib.bib52)\)\. At the same time, recent RL approaches for reasoning have increasingly emphasized the use of process\-based reward models \(PRMs\) to guide model behavior\. While these models show promising performance, there is still a lack of clarity around how PRMs are constructed, how the reasoning step or step reward should be defined, what training signals or datasets they should depend on\(Yuanet al\.,[2024](https://arxiv.org/html/2606.20068#bib.bib22); Luoet al\.,[2024](https://arxiv.org/html/2606.20068#bib.bib24); Cuiet al\.,[2025](https://arxiv.org/html/2606.20068#bib.bib23)\)\.

Unlike recent approaches that rely on PRMs or long NL CoT\(Linet al\.,[2025a](https://arxiv.org/html/2606.20068#bib.bib29);[b](https://arxiv.org/html/2606.20068#bib.bib33)\), we directly leverage the Lean proof assistant as a*symbolic process oracle*during RL training, without any natural\-language guidance\. For each generated proof, Lean provides \(i\) a global outcome signal and \(ii\) fine\-grained tactic\-level feedback via info trees and error logs\.

While fine\-grained tactic level signals are available from Lean, leveraging them effectively during RL training is nontrivial\. Lean outputs symbolic, tree\-structured language feedback, such as proof states and error locations, whereas LLMs operate over autoregressive token sequences and learn from scalar rewards in RL\. This representational mismatch creates a credit\-assignment challenge: symbolic verifier feedback must be transformed into structured token\-level training signals\. To bridge this gap, we introduce a structured credit assignment framework for integrating symbolic verifier signals into an online RL objective, requiring three components: \(i\) a formulation for incorporating fine\-grained signals, \(ii\) a principled rule for reducing Lean’s symbolic feedback into per\-tactic scores, and \(iii\) a mapping from per\-tactic scores to token\-level advantages\. We instantiate this pipeline using a tactic\-level MDP, a first\-error propagation rule grounded in Lean’s semantics, and a first\-token credit assignment strategy\.

We integrate the resulting per\-tactic signals into a Group Relative Policy Optimization \(GRPO\) style objective combining outcome\- and process\-level advantages\. This enables precise, type\-theoretic credit assignment grounded in verifier feedback without the need for an auxiliary PRM\. Empirically, we found that incorporating symbolic verifier feedback into the RL objective consistently improves performance on MiniF2F and ProofNet, demonstrating the value of fine\-grained verifier signals for reliable credit assignment in reasoning tasks\. Our key contributions are as follows:

- •Formalizing Lean’s Symbolic Feedback\.We formalize Lean’s symbolic, tactic\-level feedback and reduce it into scalar training signals that enable fine\-grained, token\-level credit assignment\.
- •Symbolic verifier\-guided RL\.We integrate outcome and tactic\-level rewards derived from Lean into an RL framework, providing dense and verifiable credit assignment\.
- •Stable improvements on benchmarks\.On MiniF2F and ProofNet, our approach consistently outperforms both outcome\-only RL and vanilla baselines, yielding more stable and robust gains without NL notation or external PRM\.

![Refer to caption](https://arxiv.org/html/2606.20068v1/x1.png)Figure 1:Overall framework for combining outcome and tactic level rewards via Lean: the proofYYis parsed into tacticsT1,…,T5T\_\{1\},\\dots,T\_\{5\}, with Lean providing outcome feedbackg\(Y\)g\(Y\)and step\-level errors \(e\.g\., failure atT3T\_\{3\}invalidates later tactics\)\. Rewards are then assigned to the first token of each tactic\.
## 2Related Work

#### Automatic Theorem Proving

An automated theorem prover typically consists of two stages\. The first is autoformalization, i\.e\., translating natural language mathematical statements into formal ones\. LLMs have been used for this task\(Wuet al\.,[2022](https://arxiv.org/html/2606.20068#bib.bib25)\), producing datasets such as MiniF2F, ProofNet, Deepseek\-Prover, and LeanWorkbook\(Zhenget al\.,[2022](https://arxiv.org/html/2606.20068#bib.bib30); Azerbayevet al\.,[2023](https://arxiv.org/html/2606.20068#bib.bib31); Xinet al\.,[2024a](https://arxiv.org/html/2606.20068#bib.bib14); Yinget al\.,[2024a](https://arxiv.org/html/2606.20068#bib.bib17)\)\. The second stage is proof generation, which can be performed step\-by\-step via tree search\(Polu and Sutskever,[2020](https://arxiv.org/html/2606.20068#bib.bib26); Azerbayevet al\.,[2024](https://arxiv.org/html/2606.20068#bib.bib28); Wuet al\.,[2024](https://arxiv.org/html/2606.20068#bib.bib27); Xinet al\.,[2024b](https://arxiv.org/html/2606.20068#bib.bib20)\)or by generating entire proofs at once\(Xinet al\.,[2024a](https://arxiv.org/html/2606.20068#bib.bib14); Linet al\.,[2025b](https://arxiv.org/html/2606.20068#bib.bib33)\)\. Existing approaches such as Lean\-STaR and RMaxTS use Lean only as a step\-checker during inference\(Linet al\.,[2025a](https://arxiv.org/html/2606.20068#bib.bib29); Xinet al\.,[2024b](https://arxiv.org/html/2606.20068#bib.bib20)\), whereas recent work has employed Lean as a whole\-proof verifier during training\(Wanget al\.,[2025a](https://arxiv.org/html/2606.20068#bib.bib44); Zhanget al\.,[2025](https://arxiv.org/html/2606.20068#bib.bib45); Renet al\.,[2025](https://arxiv.org/html/2606.20068#bib.bib46)\)\. In this paper, we go beyond step\-checking or whole\-proof verification by using Lean’s fine\-grained, tactic\-level feedback as process\-based rewards in online RL training\.

#### Reinforcement Learning in Language Models

Beyond algorithmic advances such as PPO\(Schulmanet al\.,[2017](https://arxiv.org/html/2606.20068#bib.bib38)\)and GRPO\(Shaoet al\.,[2024](https://arxiv.org/html/2606.20068#bib.bib37)\), reward shaping and credit assignment remain core challenges in RL\. Outcome\-based rewards\(Cobbeet al\.,[2021](https://arxiv.org/html/2606.20068#bib.bib12)\), though widely used in RLHF, suffer from sparsity\(Chanet al\.,[2024](https://arxiv.org/html/2606.20068#bib.bib40); Zhenget al\.,[2023](https://arxiv.org/html/2606.20068#bib.bib41)\)\. Process\-based reward models \(PRMs\) address this by assigning step\-level rewards\(Lightmanet al\.,[2023](https://arxiv.org/html/2606.20068#bib.bib39); Setluret al\.,[2024](https://arxiv.org/html/2606.20068#bib.bib42); Kazemnejadet al\.,[2024](https://arxiv.org/html/2606.20068#bib.bib21); Yuanet al\.,[2024](https://arxiv.org/html/2606.20068#bib.bib22); Cuiet al\.,[2025](https://arxiv.org/html/2606.20068#bib.bib23)\)\. Rewards can be defined implicitly\(Cuiet al\.,[2025](https://arxiv.org/html/2606.20068#bib.bib23)\)or explicitly via correctness annotations\(Lightmanet al\.,[2023](https://arxiv.org/html/2606.20068#bib.bib39)\)or Monte Carlo rollout success rates\(Wanget al\.,[2024](https://arxiv.org/html/2606.20068#bib.bib48)\), but existing methods require large annotated datasets of step\-level correctness\. This motivates our approach of leveraging the Lean prover itself as a process oracle, automatically verifying each step without human labels or sampling\. Additional discussion is in Appendix[A](https://arxiv.org/html/2606.20068#A1)\.

## 3Preliminaries

### 3\.1Lean4

In Lean theorem proving, a statement to be established is represented as an initial goal and incrementally reduced into subgoals through a sequence of tactics\. Each tactic is parsed and elaborated by unifying it with lemmas or theorems in the library, generating new subgoals, and verifying their validity\. The elaboration stage produces structured info trees that record proof states and error messages\. Finally, the kernel ensures that the elaborated proof is type\-theoretically consistent and constitutes a valid proof for the original theorem\.

Formally, letxxdenote a theorem statement provided to an LLM, and letYYbe the response, a proof expressed in the Lean language\. Write𝒴\\mathcal\{Y\}for the set of Lean proofs and𝒯\\mathcal\{T\}for the set of tactics\. ForY∈𝒴Y\\in\\mathcal\{Y\}, We view a proofYYas a sequence of tactics\(T1,T2,…,TN\(Y\)\)\(T\_\{1\},T\_\{2\},\\dots,T\_\{N\(Y\)\}\)parsed from the Abstract Syntax Tree \(AST\) and sorted by their starting positions, whereN\(Y\)N\(Y\)is the number of tactics inYYwhich aligns with the LLM’s autoregressive generation process\. Each tacticTiT\_\{i\}comprises corresponding tokensyty\_\{t\}inYY\. Lean represents tactics as AST nodes; each node encodes the tactic’s syntactic structure and binding context, and may carry metadata such as error messages, proof states, and an index through which users \(or training frameworks\) can interact with Lean\. If a tactic does not appear in the error log, then it has been elaborated successfully and passed Lean’s internal rule\-based verification, which guarantees that the step is locally sound under dependent type theory\. Thus, any tactic not marked as an error constitutes a verified reasoning step\-even if it does not contribute to closing the proof because some subgoals remain or later tactics fail\. In other words, Lean ensures tactic\-level soundness, while proof\-level completeness depends on whether the entire sequence resolves all goals\. Leveraging this parsing and validation feedback, we define the parsing functionf:𝒴→𝒯∗f:\\mathcal\{Y\}\\to\\mathcal\{T\}^\{\*\}to be the sequence obtained by sorting𝖳𝖺𝖼𝖲𝖾𝗍\(Y\)\\mathsf\{TacSet\}\(Y\):f\(Y\)=\(T1,…,TN\(Y\)\)\.f\(Y\)=\(T\_\{1\},\\dots,T\_\{N\(Y\)\}\)\.We also define the global scoring functiong:𝒴→\[0,1\]g:\\mathcal\{Y\}\\to\[0,1\], whereg\(Y\)=1g\(Y\)=1ifYYpasses the Lean verifier and0otherwise, and the per\-tactic scoring functionφ:\{\(Y,T\)∣Y∈𝒴,T∈TacSet\(Y\)\}⟶\{1,d1,d2\}\.\\varphi:\\\{\(Y,T\)\\mid Y\\in\\mathcal\{Y\},\\,T\\in\\mathrm\{TacSet\}\(Y\)\\\}\\ \\longrightarrow\\ \\\{1,d\_\{1\},d\_\{2\}\\\}\.Specifically,

φ\(Y,T\)=\{1,ifg\(Y\)=1,d1,else ifg\(Y\)=0andThas no errors in Lean,d2,else ifg\(Y\)=0andTcontains errors\.\\varphi\(Y,T\)=\\begin\{cases\}1,&\\text\{if \}g\(Y\)=1,\\\\ d\_\{1\},&\\text\{else if \}g\(Y\)=0\\text\{ and \}T\\text\{ has no errors in Lean\},\\\\ d\_\{2\},&\\text\{else if \}g\(Y\)=0\\text\{ and \}T\\text\{ contains errors\}\.\\end\{cases\}Combining these components, we represent Lean’s role viaf,g,φf,g,\\varphias

Lean\\displaystyle\\mathrm\{Lean\}:𝒴→\{0,1\}×\(𝒯×\{1,d1,d2\}\)∗,\\displaystyle:\\ \\mathcal\{Y\}\\to\\\{0,1\\\}\\times\(\\mathcal\{T\}\\times\\\{1,d\_\{1\},d\_\{2\}\\\}\)^\{\*\},Lean\(Y\)\\displaystyle\\mathrm\{Lean\}\(Y\)=\(g\(Y\),\[\(T1,φ\(Y,T1\)\),…,\(TN\(Y\),φ\(Y,TN\(Y\)\)\)\]\)f\(Y\)=\(T1,…,TN\(Y\)\)\.\\displaystyle=\\bigl\(g\(Y\),\[\(T\_\{1\},\\varphi\(Y,T\_\{1\}\)\),\\dots,\(T\_\{N\(Y\)\},\\varphi\(Y,T\_\{N\(Y\)\}\)\)\]\\bigr\)\_\{\\begin\{subarray\}\{c\}f\(Y\)=\(T\_\{1\},\\dots,T\_\{N\(Y\)\}\)\\end\{subarray\}\}\.

### 3\.2Tactic\-Level MDP

We define a tactic\-level Markov Decision Process \(MDP\) as the tupleℳ=\(𝒮,𝒜,r,F,m\)\\mathcal\{M\}=\(\\mathcal\{S\},\\mathcal\{A\},r,F,m\)\. The state space𝒮\\mathcal\{S\}contains partial formal proofs; eachs∈𝒮s\\in\\mathcal\{S\}is the proof prefix produced so far\. The action space𝒜\\mathcal\{A\}coincides with the tactic space𝒯\\mathcal\{T\}; each actiona∈𝒜a\\in\\mathcal\{A\}is a single Lean tactic\. The reward functionr:𝒮×𝒜→ℝr:\\mathcal\{S\}\\times\\mathcal\{A\}\\to\\mathbb\{R\}assigns a tactic\-level rewardr\(s,a\)r\(s,a\)\. The transition functionF:𝒮×𝒜→𝒮F:\\mathcal\{S\}\\times\\mathcal\{A\}\\to\\mathcal\{S\}is deterministic:sj\+1=F\(sj,aj\)=sj⊕aj,s\_\{j\+1\}=F\(s\_\{j\},a\_\{j\}\)=s\_\{j\}\\oplus a\_\{j\},where⊕\\oplusdenotes concatenation of the tacticaja\_\{j\}to the proofsjs\_\{j\}at time stepjj\. Transitions are pure concatenations; Lean feedback affectsrr, notFF\. Let𝒮term⊆𝒮\\mathcal\{S\}\_\{\\text\{term\}\}\\subseteq\\mathcal\{S\}be EOS absorbing states\. Letm∈𝒮m\\in\\mathcal\{S\}be the initial state\. In Section[4](https://arxiv.org/html/2606.20068#S4), we extend this formulation with outcome\- and tactic\-level rewards derived from the Lean theorem prover to obtain the final training signal\.

### 3\.3Credit Assignment in Reinforcement Learning

PPO assigns a sparse end\-of\-sequence reward and propagates credit with a value model with Generalized Advantage Estimate \(GAE\), reducing variance at the cost of extra learning complexity; full details are deferred to Appendix[E](https://arxiv.org/html/2606.20068#A5)\.

In contrast, REINFORCE style GRPO optimizes directly from verifiable whole\-trajectory rewards without a value model\. For a promptqq, we sampleGGresponses\{yi\}i=1G\\\{y\_\{i\}\\\}\_\{i=1\}^\{G\}fromπold\\pi\_\{\\mathrm\{old\}\}and obtain rewardsrir\_\{i\}\. A normalized, response\-level advantage is applied uniformly to all tokens ofyiy\_\{i\}:

A^i=ri−mean\(r\)std\(r\)\.\\hat\{A\}\_\{i\}=\\frac\{r\_\{i\}\-\\mathrm\{mean\}\(r\)\}\{\\mathrm\{std\}\(r\)\}\.The objective is

LGRPO\(θ\)=𝔼\[1G∑i=1G\{min⁡\(πθ\(yi∣q\)πθold\(yi∣q\)A^i,clip\(πθ\(yi∣q\)πθold\(yi∣q\),1−ϵ,1\+ϵ\)A^i\)−βDKL\[πθ∥πref\]\}\]\.L\_\{\\mathrm\{GRPO\}\}\(\\theta\)=\\mathbb\{E\}\\Bigg\[\\frac\{1\}\{G\}\\sum\_\{i=1\}^\{G\}\\Big\\\{\\min\\\!\\Big\(\\frac\{\\pi\_\{\\theta\}\(y\_\{i\}\\\!\\mid q\)\}\{\\pi\_\{\\theta\_\{\\mathrm\{old\}\}\}\(y\_\{i\}\\\!\\mid q\)\}\\,\\hat\{A\}\_\{i\},\\;\\mathrm\{clip\}\\\!\\Big\(\\frac\{\\pi\_\{\\theta\}\(y\_\{i\}\\\!\\mid q\)\}\{\\pi\_\{\\theta\_\{\\mathrm\{old\}\}\}\(y\_\{i\}\\\!\\mid q\)\},1\\\!\-\\\!\\epsilon,1\\\!\+\\\!\\epsilon\\Big\)\\hat\{A\}\_\{i\}\\Big\)\-\\beta\\,D\_\{\\mathrm\{KL\}\}\[\\pi\_\{\\theta\}\\\|\\pi\_\{\\mathrm\{ref\}\}\]\\Big\\\}\\Bigg\]\.We make this dense and sound by injecting*Lean\-derived tactic advantages*into GRPO: the outcome signal remains at response level, while tactic\-level signals are mapped to tokens at the first token of each tactic \(Sec\.[4](https://arxiv.org/html/2606.20068#S4)\)\. This preserves GRPO’s simplicity while addressing sparse credit\.

## 4method

### 4\.1Define Tactic\-level Rewards

In the previous section, we modeled the correctness of proofsYYgenerated by the Lean proof assistant and parsed and verified each tactic withinYY\. We now introduce a reward mechanism that integrates both outcome\-based and process\-based signals explicitly into the RL framework\. Specifically, we employ an outcome\-based reward defined through a functiong\(Y\)g\(Y\), similar to approaches used by\(DeepSeek\-AIet al\.,[2025](https://arxiv.org/html/2606.20068#bib.bib13)\), as a global reward evaluating the entire proof\. Additionally, we define a process\-based rewardφ\(Y,T\)\\varphi\(Y,T\), assessing the correctness or validity at the level of individual tacticsT∈YT\\in Y\. Unlike implicit rewards or Monte Carlo estimations typically interpreted as process rewards, our method explicitly assigns correctness\-based rewards at each tactic step\.

Assume that, analogous to the GRPO training rollout framework, given a questionqq, an LLM generates a group of responses\{Y1,Y2,…,YG\}\\\{Y\_\{1\},Y\_\{2\},\\ldots,Y\_\{G\}\\\}\. Lean produces an outcome\-based rewards:

routcome\(Yi\)=g\(Yi\)r\_\{\\mathrm\{outcome\}\}\(Y\_\{i\}\)\\;=g\(Y\_\{i\}\)\\We define the outcome\-based advantage for any tokenyi,ty\_\{i,t\}in responseYiY\_\{i\}as:

Aoutcome,i,t=g\(Yi\)−mean\(g\(Y1\),…,g\(YG\)\)std\(g\(Y1\),…,g\(YG\)\)\.A\_\{\\mathrm\{outcome\},\\,i,\\,t\}\\;=\\;\\frac\{\\,g\\bigl\(Y\_\{i\}\\bigr\)\\;\-\\;\\mathrm\{mean\}\\\!\\bigl\(g\(Y\_\{1\}\),\\dots,g\(Y\_\{G\}\)\\bigr\)\\,\}\{\\mathrm\{std\}\\\!\\bigl\(g\(Y\_\{1\}\),\\dots,g\(Y\_\{G\}\)\\bigr\)\}\\,\.Beyond binary outcome verification signals, we further design elaborate rewards based on the AST feedback produced by the Lean parser as in section[3\.1](https://arxiv.org/html/2606.20068#S3.SS1)\. We leverage this AST feedback to distinguish between different kinds of tactics: for example, whether a tactic is elaborated successfully \(i\.e\., type\-correct and locally sound\), but may still leave unresolved subgoals that prevent the proof from being completed, or whether it has type errors or parser\-level mismatches\. This structured feedback allows us to assign more fine\-grained process\-based rewards\. Since, we sorted the tree node containing proof state by increasing order, we apply a First Error Propagation rule when mapping Lean’s feedback into tactic\-level rewards as\(Luet al\.,[2024](https://arxiv.org/html/2606.20068#bib.bib32); Lightmanet al\.,[2023](https://arxiv.org/html/2606.20068#bib.bib39)\)\. Given a sequence of tactics\(T1,…,TN\)\(T\_\{1\},\\dots,T\_\{N\}\), once an error is observed atTjT\_\{j\}, we propagate this failure to all subsequent tactics, i\.e\., everyTkT\_\{k\}withk≥jk\\geq jis treated as erroneous for the purpose of reward assignment\.

Letj=min\{i:Ticontains an error\}\.φ\(Y,Tk\)=\{1,g\(Y\)=1\.d1,g\(Y\)=0andk<jand no error,d2,g\(Y\)=0andk≥j,\\text\{Let \}j=\\min\\\{i:\\,T\_\{i\}\\text\{ contains an error\}\\\}\.\\quad\\varphi\(Y,T\_\{k\}\)=\\begin\{cases\}1,&g\(Y\)=1\.\\\\ d\_\{1\},&g\(Y\)=0\\text\{ and \}k<j\\text\{ and no error\},\\\\ d\_\{2\},&g\(Y\)=0\\text\{ and \}k\\geq j,\\end\{cases\}Unlike Lean, which parses proofs into a tree structure, the LLM generates proofs in an autoregressive, causal manner\. Once the first erroneous tacticTjT\_\{j\}occurs, the continuationTj\+1,…,TNT\_\{j\+1\},\\dots,T\_\{N\}is conditioned on an invalid prefix, and therefore cannot constitute a valid reasoning process\. First\-error propagation enforces this principle by assigning error signals to all subsequent tactics, ensuring causal and type\-theoretic credit assignment\.

For any arbitrary responseYiY\_\{i\}, composed of tacticsYi=\{Ti,1,Ti,2,…\}Y\_\{i\}=\\\{\\,T\_\{i,1\},\\,T\_\{i,2\},\\,\\ldots\\\}, if we setsj,ajs\_\{j\},a\_\{j\}as the state and tacticTi,jT\_\{i,j\}at stepjjin responseYiY\_\{i\}, the process\-based reward for tacticTi,jT\_\{i,j\}is:

rprocess\(sj,aj\)=rprocess,i,j=φ\(Yi,Ti,j\)\.r\_\{\\mathrm\{process\}\}\(s\_\{j\},a\_\{j\}\)\\;=r\_\{\\mathrm\{process\},\\,i,\\,j\}\\;=\\;\\varphi\\bigl\(Y\_\{i\},\\,T\_\{i,j\}\\bigr\)\.The corresponding process\-based advantage is

Aprocess,i,j=rprocess,i,j−mean\(g\(Y1\),…,g\(YG\)\)\.A\_\{\\mathrm\{process\},\\,i,\\,j\}\\;=\\;r\_\{\\mathrm\{process\},\\,i,\\,j\}\\;\-\\;\\mathrm\{mean\}\\\!\\bigl\(g\(Y\_\{1\}\),\\dots,g\(Y\_\{G\}\)\\bigr\)\.
Here, the subtraction of the mean outcome reward serves as a dynamic baseline reflecting the difficulty of the problemqqas GRPO algorithm\. If the problem is easier, the mean outcome reward becomes higher, thus penalizing incorrect proofs and their tactics more heavily\. Conversely, for more challenging problems, the lower baseline imposes less severe penalties\.

### 4\.2Integrating Lean into Tactic\-based Reinforcement Learning

We then integrate these two types of advantages into the standard GRPO objective as follows\.

Ai,t=Aoutcome,i,t\+1\{t=first\(Ti,s\(i,t\)\)\}⋅Aprocess,i,s\(i,t\),A\_\{i,t\}\\;=\\;A\_\{\\mathrm\{outcome\},\\,i,\\,t\}\\;\+\\;\\mathbf\{1\}\\\{t=\\mathrm\{first\}\(T\_\{i,s\(i,t\)\}\)\\\}\\,\\cdot A\_\{\\mathrm\{process\},\\,i,\\,s\(i,t\)\},wheres\(i,t\)∈\{1,…,N\}s\(i,t\)\\in\\\{1,\\ldots,N\\\}is the index of the tactic containing the tokenttinYiY\_\{i\},first\(Ti,j\)\\mathrm\{first\}\(T\_\{i,j\}\)indicates the first token of the tactic\. i\.e\., we assign the tactic advantage only to the first token of each tactic\. We applied the advantageAi,tA\_\{i,t\}into GRPO objective function:

L\(θ\)=𝔼q∼P\(Q\),\{Yi\}i=1G∼πθold\(Y∣q\)\[1G∑i=1G\{1\|Yi\|∑t=1\|Yi\|min⁡\(ρi,tAi,t,clip\(ρi,t,1−ϵ,1\+ϵ\)Ai,t\)−βDKL\[πθ∥πref\]\}\]\.\\displaystyle\\begin\{split\}L\(\\theta\)&=\\mathbb\{E\}\_\{q\\sim P\(Q\),\\;\\\{\\,Y\_\{i\}\\\}\_\{i=1\}^\{G\}\\sim\\pi\_\{\\theta\_\{\\mathrm\{old\}\}\}\(Y\\mid q\)\}\\\\\[\-2\.15277pt\] &\\hskip\-15\.00002pt\\Biggl\[\\frac\{1\}\{G\}\\sum\_\{i=1\}^\{G\}\\Bigl\\\{\\frac\{1\}\{\\lvert Y\_\{i\}\\rvert\}\\sum\_\{t=1\}^\{\\lvert Y\_\{i\}\\rvert\}\\min\\Bigl\(\\rho\_\{i,t\}\\,A\_\{i,t\},\\;\\mathrm\{clip\}\\bigl\(\\rho\_\{i,t\},\\,1\-\\epsilon,\\,1\+\\epsilon\\bigr\)\\,A\_\{i,t\}\\Bigr\)\\;\-\\;\\beta\\,D\_\{\\mathrm\{KL\}\}\\bigl\[\\pi\_\{\\theta\}\\,\\\|\\,\\pi\_\{\\mathrm\{ref\}\}\\bigr\]\\Bigr\\\}\\Biggr\]\.\\end\{split\}

\(1\)whereρi,t=πθ\(yi,t∣q,Yi,<t\)πθold\(yi,t∣q,Yi,<t\)\.\\rho\_\{i,t\}=\\frac\{\\pi\_\{\\theta\}\\\!\\bigl\(y\_\{i,t\}\\mid q,\\,Y\_\{i,<t\}\\bigr\)\}\{\\pi\_\{\\theta\_\{\\mathrm\{old\}\}\}\\\!\\bigl\(y\_\{i,t\}\\mid q,\\,Y\_\{i,<t\}\\bigr\)\}\.This formulation explicitly leverages both the global correctness signalAoutcome,i,tA\_\{\\mathrm\{outcome\},\\,i,\\,t\}from proof outcomes and the detailed, tactic level correctness assessmentAprocess,i,s\(i,t\)A\_\{\\mathrm\{process\},\\,i,\\,s\(i,t\)\}\. By combining them into a single advantageAi,tA\_\{i,t\}, we enrich the learning signal provided to the LLM based proof generator under the GRPO framework\.

In Appendix[J](https://arxiv.org/html/2606.20068#A10), we provide a mathematical grounding and interpretation of our method via a potential\-based reward shaping\. We can interpret our credit assignment as a discrete approximation of potential\-based reward shaping, where the potential function of a state is defined as the probability that the current proof prefix can be completed into a valid Lean proof\. We treat the error state as an absorbing state and set its potential to zero \(see Appendix[J](https://arxiv.org/html/2606.20068#A10)for details\)\.

Rather than propagating cumulative rewards across an entire proof trajectory, we collapse credit assignment to Lean\-verified, tactic\-level signals\. In general RL, a suboptimal step may still obtain positive return if later rewards are high, but in mathematical proof, this could be unsound: once a tactic fails, all subsequent steps are invalid under first\-error propagation\. Empirically, return\-based credit led to unstable optimization, as it requires a value function or auxiliary estimator to normalize scale and reduce variance\. Hence, we adopt a simpler formulation that combines normalized outcome\-level signals with tactic\-level rewards, without computing returns \(See Appendix[G](https://arxiv.org/html/2606.20068#A7)\)\.

Table 1:Budgets for whole\-proof methods denote the*sample budget*\(NN\) per problem; for tree\-search methods, budgets denote the authors’ reported*search expansions counts*\. We compare with InternLM family and DeepSeek\-Prover based tree search methods for fair comparison with our method\. Bold indicates the best number within the whole\-proof block\. All our GRPO\-style runs use the same STP subset, generations per query, and a 15s Lean timeout\.The notationμ±σ\\mu\\pm\\ \\sigmaindicates the mean and the standard deviation each\.

## 5Experiments

### 5\.1Experimental Setup

We trained on 10k samples randomly drawn from the STP dataset \(3\.26M proofs\)\. Proofs were verified via Lean through a REPL interface, with a 15s timeout per attempt\. Baselines included STP\-Lean and DeepSeek\-Prover\-V1\.5\-SFT, the latter additionally fine\-tuned on 500k STP samples before RL\. We used non\-CoT prompt, response styles as in\(Xinet al\.,[2024b](https://arxiv.org/html/2606.20068#bib.bib20)\)\. We used tactic\-level rewardsd1=−0\.05d\_\{1\}=\-0\.05andd2=−0\.1d\_\{2\}=\-0\.1for the main experiment\. Full hyperparameters and training details are provided in Appendix[B](https://arxiv.org/html/2606.20068#A2)\.

### 5\.2Main Results

In Table[1](https://arxiv.org/html/2606.20068#S4.T1), the results on both the MiniF2F and ProofNet datasets demonstrate that training with tactic\-based advantage via Lean consistently enhances model performance across most evaluation settings\. For the STP\-Lean model, our method improves MiniF2F performance up to\+2\.5%\+2\.5\\%p \(pass@64\), and ProofNet performance by\+1\.4%\+1\.4\\%p \(pass@32\), while showing a negligible decrease of−0\.1%\-0\.1\\%p on pass@64\. Similarly, for DeepSeek\-Prover\-V1\.5, our approach achieves marginal yet consistent increases across all benchmarks\.

Across both MiniF2F and ProofNet, leveraging Lean as a*process\-level oracle*yields consistent, stable gains over outcome\-only reinforcement learning, without increasing training cost\. In particular, in Table[2](https://arxiv.org/html/2606.20068#S5.T2), when applied to DeepSeek\-Prover models, GRPO fails to yield any gains on the ProofNet\-Test set, and in some cases even underperforms relative to the supervised baseline\. This highlights a key limitation of purely outcome\-based credit assignment: it often lacks stability and fails to provide consistent guidance for proof search\.

By comparison, tactic\-level credit assignment yields more reliable improvements\. While minor drops appear in some settings, it generally provides stable gains over outcome\-only GRPO\. For example, on MiniF2F \(pass@64\), STP\-Lean \+ Ours improves by\+2\.5%\+2\.5\\%p over the baseline, compared to\+1\.2%\+1\.2\\%p with GRPO\. As shown in Table[1](https://arxiv.org/html/2606.20068#S4.T1)and[2](https://arxiv.org/html/2606.20068#S5.T2), tactic\-based training consistently matches or surpasses both the supervised baseline and GRPO\. Importantly, this stability comes with almost no extra cost: since both methods already use REPL interactions with Lean, the additional sorting and scoring overhead is negligible\.

Compared to strong search\-based baselines \(e\.g\., InternLM families, DeepSeek\-Prover\-RL\+RMaxTS\), our single\-shot, whole\-proof training approaches their reported accuracy \(e\.g\., 59\.2% vs\. 58\.5% pass@64 on MiniF2F\) while avoiding large search\-time compute\.

![Refer to caption](https://arxiv.org/html/2606.20068v1/x2.png)Figure 2:Training dynamics showing \(a\) outcome reward, \(b\) tactic reward, \(c\) entropy, and \(d\) mean of response length during reinforcement learning\. We used time\-weighted smooting with smooting factor 0\.99\.Table 2:Ablation study of STP\-Lean with various verifier methods on MiniF2F\-Test and ProofNet\-Test benchmarks\.
### 5\.3Analysis

#### The Role of Outcome and Tactic Rewards\.

Integrating both outcome\-level and tactic\-level signals yields more effective learning than employing either signal in isolation\. Outcome\-only RL, as in GRPO, is constrained by the sparsity of binary feedback: improvements are gradual and the final performance plateaus at a relatively low level \(Figure[2](https://arxiv.org/html/2606.20068#S5.F2)\(a\)\)\. In contrast, tactic\-only training provides dense feedback but lacks a global objective, resulting in premature convergence\. When combined, outcome rewards serve as a global objective function, while tactic rewards provide local credit assignment, enabling both rapid progress and higher performance\. This complementary relationship is further reflected in Figure[2](https://arxiv.org/html/2606.20068#S5.F2)\(b\), where tactic\-only supervision’s tactic reward plateaus, but outcome\-tactic combined rewards continue to increase steadily\. The results in Table[2](https://arxiv.org/html/2606.20068#S5.T2)supports this finding: outcome signals enforce proof\-level correctness, while tactic signals supply verifiable intermediate feedback; only their integration consistently improves performance across benchmarks\.

#### Entropy and Proof Length\.

The use of fine\-grained rewards influences exploration not by indiscriminately broadening the search space but by focusing learning on more informative decision points\. As shown in Figure[2](https://arxiv.org/html/2606.20068#S5.F2)\(c\), outcome\+tactic training converges to lower entropy than tactic\-only and outcome\-only settings, indicating that the policy becomes more decisive as training progresses\. This does not correspond to mode collapse: Figure[2](https://arxiv.org/html/2606.20068#S5.F2)\(d\) shows that the average proof length remains stable across all methods, suggesting that the performance gains are not attributable to trivial lengthening of outputs\. Instead, denser intermediate rewards appear to reduce the need for broad stochastic exploration, guiding the model toward more efficient proof strategies\.

#### Tactic to Token Level Credit Assignment\.

After defining tactic\-level rewards, next step is how to distribute them across tokens\. In our main method, the tactic advantage is assigned only to the first token of the tactic\. For comparison, we conducted ablations where the tactic advantage was instead \(i\) distributed to all tokens of a tactic, \(ii\) assigned only to the last token, \(iii\) keep first token reward distribution, but additionally choose 10% tokens within the tactic with respect to high entropy\. AsWanget al\.\([2025d](https://arxiv.org/html/2606.20068#bib.bib49)\)showed that high entropy tokens could be reasoning drive tokens, we speculated that this method can automatically select the tokens for serving as fork in formal reasoning\. Assigning credit to the first token of each tactic achieves the most stable and consistent improvements, as evidenced by Table[3](https://arxiv.org/html/2606.20068#S5.T3)\. Alternative strategies do not yield comparable gains and in some cases even degrade performance\. This outcome aligns with the semantics of Lean proofs: the first token corresponds to the tactic keyword \(e\.g\.,intro,apply,have\), determining the subsequent proof strategy and constrains the structure of subgoals\. Concentrating credit on this decision point enhances the model’s ability to select tactics appropriately, resulting in more reliable downstream reasoning\. This finding is also aligned with\(Fanget al\.,[2025](https://arxiv.org/html/2606.20068#bib.bib63)\), showing that focusing on key tokens during training improves performance on long\-context tasks\.

#### Reward Strategy for Tactic\-level Feedback\.

For tactic\-level feedback to be effective, it must reflect the sequential dependency of proof construction, account for task difficulty, and distinguish between partially correct and erroneous steps\. The first\-error propagation rule ensures that once an error occurs, subsequent tactics are treated as invalid; removing this rule significantly reduces performance \(Table[4](https://arxiv.org/html/2606.20068#S5.T4)\), because once the first error occurs, the remaining tactics are evaluated in an invalid context and cannot salvage correctness\. Incorporating a difficulty\-normalized baseline further stabilizes training, while its absence leads to degraded results\. Finally, differentiating penalties between partially correct tactics and outright erroneous ones proves essential: collapsing these into a single penaltyd1=d2d\_\{1\}=d\_\{2\}yields inconsistent outcomes\- improvements on MiniF2F but declines on ProofNet\. These results indicate that an effective tactic\-level reward scheme must combine sequential error propagation, difficulty\-aware normalization, and differentiated penalties in order to provide stable and semantically faithful learning signals\. In the sensitivity analysis of Appendix[C](https://arxiv.org/html/2606.20068#A3), assigning different values tod1d\_\{1\}andd2d\_\{2\}leads to robust performance, tending to outperform the GRPO baseline and yielding the strongest improvements on MiniF2F\.

Table 3:Ablation study of STP\-Lean on how to distribute tactic\-level advantages across tokens\.Table 4:Ablation study on reward strategies for tactic\-level feedback in STP\-Lean\. Additional experiments include removing the first\-error propagation policy \(No First Error\), removing the baseline extraction \(No Baseline\)\. and using equal penalties for all tactics \(Same tactic reward\)\.
#### Effect of Verification Timeouts

When using Lean as a verifier, long proofs can lead to excessive verification time, so we introduced timeout thresholds of 5, 10, 15, and 30s \(Figure[3](https://arxiv.org/html/2606.20068#S5.F3)\)\. A 5s limit gave the worst results, since even relatively simple proofs often exceeded this window and produced too few valid reward signals\. In contrast, 10\-30s yielded much stronger performance, with 15s giving the best overall balance\. Interestingly, 10\-15s sometimes outperformed 30s despite the shorter allowance\. We attribute this to the fact that discarding overly complex proofs biases training toward shorter and more efficient proof strategies\. This effect is amplified in our setting because we evaluate non\-CoT responses purely by Lean verification \(without NL commentary\): longer outputs are not only slower to check but also more error\-prone\. As a result, shorter verification limits encourage the model to generate concise, canonical proofs, which we hypothesize leads to better generalization at test time\.

![Refer to caption](https://arxiv.org/html/2606.20068v1/x3.png)Figure 3:Ablation study of STP\-Lean on different Lean verification timeouts \(5, 10, 15, and 30 seconds\) during outcome\+tactic based training\. We report evaluation performance on the MiniF2F and ProofNet benchmarks \(a\),\(b\), and the maximum response length observed during training \(c\)\.
#### Qualitative Analysis

We conduct a qualitative analysis to better understand the differences between our tactic\-reward\-based approach and the baseline STP model\. Specifically, we examine proofs from two benchmark problems: Imo\_1960\_p2 in the MiniF2F benchmark and 1\_14 from ProofNet \(See Appendix[F](https://arxiv.org/html/2606.20068#A6)\)\. Table[8](https://arxiv.org/html/2606.20068#A6.T8)presents the proof generated by our tactic\-reward model, while Table[9](https://arxiv.org/html/2606.20068#A6.T9)shows the corresponding proof from the STP model\.The key difference in the first example lies in how the upper boundx<45/8x<45/8is established\. The STP model attempts to use the nonlinear inequality tactic nlinarith, which results in an error\. By contrast, our tactic\-reward model learns to penalize such invalid tactic choices\. Instead, it carefully applies previously proven assumptions and intermediate lemmas before invoking nlinarith, thereby producing a correct and more robust proof\. The second example comes from the ProofNet benchmark \(1\_14 exercise\)\. As shown in Table[10](https://arxiv.org/html/2606.20068#A6.T10), the tactic\-reward model begins by normalizing the problem using a rewrite tactic\. In contrast, the baseline model in Table[10](https://arxiv.org/html/2606.20068#A6.T10)skips this normalization step and directly attempts inequality manipulations, which ultimately causes the proof to fail\. Analysis for Failure case is in Appendix[H](https://arxiv.org/html/2606.20068#A8)These anecdotal examples illustrate plausible mechanisms; for definitive evidence, see Table[1](https://arxiv.org/html/2606.20068#S4.T1)\.

## 6Conclusion

We introduced a reinforcement learning framework that uses the Lean proof assistant as a process\-level reward oracle\. Unlike prior outcome\-only methods, our approach leverages Lean’s parsing and validation to provide both global outcome signals and fine\-grained tactic rewards, integrated into a GRPO objective\. This enables denser, verifiable credit assignment: outcome rewards enforce proof\-level success, while tactic rewards guide step\-level reasoning\. Experiments on STP\-Lean and DeepSeek\-Prover\-V1\.5 show consistent improvements on MiniF2F and ProofNet, with stable gains achieved by assigning tactic rewards to the first token of each tactic and first error propagation method\. Overall, proof assistants can serve not only as checkers at inference but also as structured feedback sources during training, pointing toward more stable and effective RL for reasoning\.

## Limitations

We did not compare against learned PRMs, as they rely on natural\-language CoT supervision and large annotated datasets that are not yet available for Lean\. Our models also generate pure Lean proofs without long CoT, leaving open how to design fine\-grained rewards for long\-form reasoning\. In addition, tactic rewards in our method were fixed scores \(d1,d2d\_\{1\},d\_\{2\}\), which proved effective but somewhat sensitive across different models and datasets\. Developing general advantage estimators and large\-scale tactic\-level datasets remains important future work\.

## Ethics Statement

This research does not involve human subjects, personal data, or sensitive information that could raise ethical concerns\. All experiments were conducted on publicly available formal mathematical datasets, and the proposed models are solely trained and evaluated for automated theorem proving tasks\.

## Reproducibility statement

To ensure the reproducibility of our work, we provide a comprehensive description of our and training process in Section[5\.1](https://arxiv.org/html/2606.20068#S5.SS1)\. For further details such as hyperparameter, version of Lean, we introduced it in Appendix[B](https://arxiv.org/html/2606.20068#A2)\. We utilized Huggingface and trl library for our experiments\.

## Acknowledgement

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation \(IITP\) grant funded by the Korean government \(MSIT\) \[No\. RS\-2022\-II220311, Development of Goal\-Oriented Reinforcement Learning Techniques for Contact\-Rich Robotic Manipulation of Everyday Objects, 90%\] and \[No\. 2019\-0\-00075, Artificial Intelligence Graduate School Program \(KAIST\), 10%\]\.

## References

- AlphaProof and AlphaGeometry teams \(2024\)AI achieves silver\-medal standard solving international mathematical olympiad problems\.Note:Google DeepMind BlogAccessed: 17 April 2025External Links:[Link](https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/)Cited by:[§1](https://arxiv.org/html/2606.20068#S1.p3.1)\.
- Z\. Azerbayev, B\. Piotrowski, H\. Schoelkopf, E\. W\. Ayers, D\. Radev, and J\. Avigad \(2023\)ProofNet: autoformalizing and formally proving undergraduate\-level mathematics\.External Links:2302\.12433,[Link](https://arxiv.org/abs/2302.12433)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.20068#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Azerbayev, H\. Schoelkopf, K\. Paster, M\. D\. Santos, S\. McAleer, A\. Q\. Jiang, J\. Deng, S\. Biderman, and S\. Welleck \(2024\)Llemma: an open language model for mathematics\.External Links:2310\.10631,[Link](https://arxiv.org/abs/2310.10631)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px1.p2.1),[§2](https://arxiv.org/html/2606.20068#S2.SS0.SSS0.Px1.p1.1)\.
- K\. Baba, C\. Liu, S\. Kurita, and A\. Sannai \(2025\)Prover agent: an agent\-based framework for formal mathematical proofs\.External Links:2506\.19923,[Link](https://arxiv.org/abs/2506.19923)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px1.p2.1)\.
- Y\. Bai, A\. Jones, K\. Ndousse, A\. Askell, A\. Chen, N\. Dassarma, D\. Drain, S\. Fort, D\. Ganguli, T\. Henighan, N\. Joseph, S\. Kadavath, J\. Kernion, T\. Conerly, S\. El\-Showk, N\. Elhage, Z\. Hatfield\-Dodds, D\. Hernandez, T\. Hume, S\. Johnston, S\. Kravec, L\. Lovitt, N\. Nanda, C\. Olsson, D\. Amodei, T\. B\. Brown, J\. Clark, S\. McCandlish, C\. Olah, B\. Mann, and J\. Kaplan \(2022\)Training a helpful and harmless assistant with reinforcement learning from human feedback\.ArXivabs/2204\.05862\.External Links:[Link](https://api.semanticscholar.org/CorpusID:248118878)Cited by:[§1](https://arxiv.org/html/2606.20068#S1.p2.1)\.
- B\. Barras, S\. Boutin, C\. Cornes, J\. Courant, J\. Filliâtre, E\. Giménez, H\. Herbelin, G\. Huet, C\. Muñoz, C\. Murthy, C\. Parent\-vigouroux, C\. Paulin\-Mohring, A\. Saïbi, and B\. Werner \(1997\)The coq proof assistant reference manual : version 6\.1\.pp\.\.Cited by:[§1](https://arxiv.org/html/2606.20068#S1.p1.1)\.
- T\. B\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell, S\. Agarwal, A\. Herbert\-Voss, G\. Krueger, T\. Henighan, R\. Child, A\. Ramesh, D\. M\. Ziegler, J\. Wu, C\. Winter, C\. Hesse, M\. Chen, E\. Sigler, M\. Litwin, S\. Gray, B\. Chess, J\. Clark, C\. Berner, S\. McCandlish, A\. Radford, I\. Sutskever, and D\. Amodei \(2020\)Language models are few\-shot learners\.External Links:2005\.14165,[Link](https://arxiv.org/abs/2005.14165)Cited by:[§1](https://arxiv.org/html/2606.20068#S1.p2.1)\.
- M\. Cao, S\. Zhang, X\. Chang, and D\. Precup \(2025\)SCAR: shapley credit assignment for more efficient rlhf\.External Links:2505\.20417,[Link](https://arxiv.org/abs/2505.20417)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px2.p3.1)\.
- A\. J\. Chan, H\. Sun, S\. Holt, and M\. van der Schaar \(2024\)Dense reward for free in reinforcement learning from human feedback\.External Links:2402\.00782,[Link](https://arxiv.org/abs/2402.00782)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px2.p1.1),[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px2.p3.1),[§2](https://arxiv.org/html/2606.20068#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Church \(1940\)A formulation of the simple theory of types\.Journal of Symbolic Logic5\(3\),pp\. 114–115\.External Links:[Document](https://dx.doi.org/10.2307/2266866)Cited by:[§1](https://arxiv.org/html/2606.20068#S1.p1.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\)Training verifiers to solve math word problems\.External Links:2110\.14168,[Link](https://arxiv.org/abs/2110.14168)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2606.20068#S1.p3.1),[§2](https://arxiv.org/html/2606.20068#S2.SS0.SSS0.Px2.p1.1)\.
- G\. Cui, L\. Yuan, Z\. Wang, H\. Wang, W\. Li, B\. He, Y\. Fan, T\. Yu, Q\. Xu, W\. Chen, J\. Yuan, H\. Chen, K\. Zhang, X\. Lv, S\. Wang, Y\. Yao, X\. Han, H\. Peng, Y\. Cheng, Z\. Liu, M\. Sun, B\. Zhou, and N\. Ding \(2025\)Process reinforcement through implicit rewards\.External Links:2502\.01456,[Link](https://arxiv.org/abs/2502.01456)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px2.p2.1),[§1](https://arxiv.org/html/2606.20068#S1.p4.1),[§2](https://arxiv.org/html/2606.20068#S2.SS0.SSS0.Px2.p1.1)\.
- L\. M\. de Moura, S\. Kong, J\. Avigad, F\. van Doorn, and J\. von Raumer \(2015\)The lean theorem prover \(system description\)\.InCADE,External Links:[Link](https://api.semanticscholar.org/CorpusID:232990)Cited by:[§1](https://arxiv.org/html/2606.20068#S1.p1.1)\.
- DeepSeek\-AI, D\. Guo, D\. Yang, H\. Zhang, J\. Song, R\. Zhang, R\. Xu, Q\. Zhu, S\. Ma, P\. Wang, X\. Bi, X\. Zhang, X\. Yu, Y\. Wu, Z\. F\. Wu, Z\. Gou, Z\. Shao, Z\. Li, Z\. Gao, A\. Liu, B\. Xue, B\. Wang, B\. Wu, B\. Feng, C\. Lu, C\. Zhao, C\. Deng, C\. Zhang, C\. Ruan, D\. Dai, D\. Chen, D\. Ji, E\. Li, F\. Lin, F\. Dai, F\. Luo, G\. Hao, G\. Chen, G\. Li, H\. Zhang, H\. Bao, H\. Xu, H\. Wang, H\. Ding, H\. Xin, H\. Gao, H\. Qu, H\. Li, J\. Guo, J\. Li, J\. Wang, J\. Chen, J\. Yuan, J\. Qiu, J\. Li, J\. L\. Cai, J\. Ni, J\. Liang, J\. Chen, K\. Dong, K\. Hu, K\. Gao, K\. Guan, K\. Huang, K\. Yu, L\. Wang, L\. Zhang, L\. Zhao, L\. Wang, L\. Zhang, L\. Xu, L\. Xia, M\. Zhang, M\. Zhang, M\. Tang, M\. Li, M\. Wang, M\. Li, N\. Tian, P\. Huang, P\. Zhang, Q\. Wang, Q\. Chen, Q\. Du, R\. Ge, R\. Zhang, R\. Pan, R\. Wang, R\. J\. Chen, R\. L\. Jin, R\. Chen, S\. Lu, S\. Zhou, S\. Chen, S\. Ye, S\. Wang, S\. Yu, S\. Zhou, S\. Pan, S\. S\. Li, S\. Zhou, S\. Wu, S\. Ye, T\. Yun, T\. Pei, T\. Sun, T\. Wang, W\. Zeng, W\. Zhao, W\. Liu, W\. Liang, W\. Gao, W\. Yu, W\. Zhang, W\. L\. Xiao, W\. An, X\. Liu, X\. Wang, X\. Chen, X\. Nie, X\. Cheng, X\. Liu, X\. Xie, X\. Liu, X\. Yang, X\. Li, X\. Su, X\. Lin, X\. Q\. Li, X\. Jin, X\. Shen, X\. Chen, X\. Sun, X\. Wang, X\. Song, X\. Zhou, X\. Wang, X\. Shan, Y\. K\. Li, Y\. Q\. Wang, Y\. X\. Wei, Y\. Zhang, Y\. Xu, Y\. Li, Y\. Zhao, Y\. Sun, Y\. Wang, Y\. Yu, Y\. Zhang, Y\. Shi, Y\. Xiong, Y\. He, Y\. Piao, Y\. Wang, Y\. Tan, Y\. Ma, Y\. Liu, Y\. Guo, Y\. Ou, Y\. Wang, Y\. Gong, Y\. Zou, Y\. He, Y\. Xiong, Y\. Luo, Y\. You, Y\. Liu, Y\. Zhou, Y\. X\. Zhu, Y\. Xu, Y\. Huang, Y\. Li, Y\. Zheng, Y\. Zhu, Y\. Ma, Y\. Tang, Y\. Zha, Y\. Yan, Z\. Z\. Ren, Z\. Ren, Z\. Sha, Z\. Fu, Z\. Xu, Z\. Xie, Z\. Zhang, Z\. Hao, Z\. Ma, Z\. Yan, Z\. Wu, Z\. Gu, Z\. Zhu, Z\. Liu, Z\. Li, Z\. Xie, Z\. Song, Z\. Pan, Z\. Huang, Z\. Xu, Z\. Zhang, and Z\. Zhang \(2025\)DeepSeek\-r1: incentivizing reasoning capability in llms via reinforcement learning\.External Links:2501\.12948,[Link](https://arxiv.org/abs/2501.12948)Cited by:[§1](https://arxiv.org/html/2606.20068#S1.p2.1),[§4\.1](https://arxiv.org/html/2606.20068#S4.SS1.p1.5)\.
- K\. Dong and T\. Ma \(2025\)STP: self\-play llm theorem provers with iterative conjecturing and proving\.External Links:2502\.00212,[Link](https://arxiv.org/abs/2502.00212)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px1.p2.1),[Appendix D](https://arxiv.org/html/2606.20068#A4.p1.1),[Table 1](https://arxiv.org/html/2606.20068#S4.T1.14.14.14.14.14.3)\.
- L\. Fang, Y\. Wang, Z\. Liu, C\. Zhang, S\. Jegelka, J\. Gao, B\. Ding, and Y\. Wang \(2025\)What is wrong with perplexity for long\-context language modeling?\.External Links:2410\.23771,[Link](https://arxiv.org/abs/2410.23771)Cited by:[§5\.3](https://arxiv.org/html/2606.20068#S5.SS3.SSS0.Px3.p1.1)\.
- M\. Fitting \(1996\)First\-order logic and automated theorem proving \(2nd ed\.\)\.Springer\-Verlag,Berlin, Heidelberg\.External Links:ISBN 0387945938Cited by:[§1](https://arxiv.org/html/2606.20068#S1.p1.1)\.
- X\. Ji, Y\. Liu, Q\. Wang, J\. Zhang, Y\. Yue, R\. Shi, C\. Sun, F\. Zhang, G\. Zhou, and K\. Gai \(2025\)Leanabell\-prover\-v2: verifier\-integrated reasoning for formal theorem proving via reinforcement learning\.External Links:2507\.08649,[Link](https://arxiv.org/abs/2507.08649)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px1.p3.1),[§1](https://arxiv.org/html/2606.20068#S1.p4.1)\.
- A\. Q\. Jiang, S\. Welleck, J\. P\. Zhou, W\. Li, J\. Liu, M\. Jamnik, T\. Lacroix, Y\. Wu, and G\. Lample \(2023\)Draft, sketch, and prove: guiding formal theorem provers with informal proofs\.External Links:2210\.12283,[Link](https://arxiv.org/abs/2210.12283)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px1.p2.1)\.
- A\. Kazemnejad, M\. Aghajohari, E\. Portelance, A\. Sordoni, S\. Reddy, A\. Courville, and N\. L\. Roux \(2024\)VinePPO: unlocking rl potential for llm reasoning through refined credit assignment\.External Links:2410\.01679,[Link](https://arxiv.org/abs/2410.01679)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px2.p2.1),[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px2.p3.1),[§2](https://arxiv.org/html/2606.20068#S2.SS0.SSS0.Px2.p1.1)\.
- G\. Lample, M\. Lachaux, T\. Lavril, X\. Martinet, A\. Hayat, G\. Ebner, A\. Rodriguez, and T\. Lacroix \(2022\)HyperTree proof search for neural theorem proving\.External Links:2205\.11491,[Link](https://arxiv.org/abs/2205.11491)Cited by:[§1](https://arxiv.org/html/2606.20068#S1.p3.1)\.
- C\. Li, Z\. Tang, Z\. Li, M\. Xue, K\. Bao, T\. Ding, R\. Sun, B\. Wang, X\. Wang, J\. Lin, and D\. Liu \(2025\)CoRT: code\-integrated reasoning within thinking\.External Links:2506\.09820,[Link](https://arxiv.org/abs/2506.09820)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px1.p4.1)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2023\)Let’s verify step by step\.External Links:2305\.20050,[Link](https://arxiv.org/abs/2305.20050)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px1.p3.1),[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px2.p2.1),[§2](https://arxiv.org/html/2606.20068#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2606.20068#S4.SS1.p2.8)\.
- H\. Lin, Z\. Sun, S\. Welleck, and Y\. Yang \(2025a\)Lean\-star: learning to interleave thinking and proving\.External Links:2407\.10040,[Link](https://arxiv.org/abs/2407.10040)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px1.p3.1),[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px1.p4.1),[§1](https://arxiv.org/html/2606.20068#S1.p5.1),[§2](https://arxiv.org/html/2606.20068#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Lin, S\. Tang, B\. Lyu, J\. Wu, H\. Lin, K\. Yang, J\. Li, M\. Xia, D\. Chen, S\. Arora, and C\. Jin \(2025b\)Goedel\-prover: a frontier model for open\-source automated theorem proving\.External Links:2502\.07640,[Link](https://arxiv.org/abs/2502.07640)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px1.p2.1),[§1](https://arxiv.org/html/2606.20068#S1.p5.1),[§2](https://arxiv.org/html/2606.20068#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Lin, S\. Tang, B\. Lyu, Z\. Yang, J\. Chung, H\. Zhao, L\. Jiang, Y\. Geng, J\. Ge, J\. Sun, J\. Wu, J\. Gesi, X\. Lu, D\. Acuna, K\. Yang, H\. Lin, Y\. Choi, D\. Chen, S\. Arora, and C\. Jin \(2025c\)Goedel\-prover\-v2: scaling formal theorem proving with scaffolded data synthesis and self\-correction\.External Links:2508\.03613,[Link](https://arxiv.org/abs/2508.03613)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px1.p4.1),[Table 1](https://arxiv.org/html/2606.20068#S4.T1.10.10.10.10.10.3)\.
- J\. Lu, Y\. Wan, Z\. Liu, Y\. Huang, J\. Xiong, C\. Liu, J\. Shen, H\. Jin, J\. Zhang, H\. Wang, Z\. Yang, J\. Tang, and Z\. Guo \(2024\)Process\-driven autoformalization in lean 4\.External Links:2406\.01940,[Link](https://arxiv.org/abs/2406.01940)Cited by:[§4\.1](https://arxiv.org/html/2606.20068#S4.SS1.p2.8)\.
- L\. Luo, Y\. Liu, R\. Liu, S\. Phatale, M\. Guo, H\. Lara, Y\. Li, L\. Shu, Y\. Zhu, L\. Meng, J\. Sun, and A\. Rastogi \(2024\)Improve mathematical reasoning in language models by automated process supervision\.External Links:2406\.06592,[Link](https://arxiv.org/abs/2406.06592)Cited by:[§1](https://arxiv.org/html/2606.20068#S1.p4.1)\.
- L\. Moura and S\. Ullrich \(2021\)The lean 4 theorem prover and programming language\.pp\. 625–635\.External Links:ISBN 978\-3\-030\-79875\-8,[Document](https://dx.doi.org/10.1007/978-3-030-79876-5%5F37)Cited by:[§1](https://arxiv.org/html/2606.20068#S1.p1.1)\.
- A\. Newell, J\. C\. Shaw, and H\. A\. Simon \(1957\)Empirical explorations of the logic theory machine: a case study in heuristic\.InPapers Presented at the February 26\-28, 1957, Western Joint Computer Conference: Techniques for Reliability,IRE\-AIEE\-ACM ’57 \(Western\),New York, NY, USA,pp\. 218–230\.External Links:ISBN 9781450378611,[Link](https://doi.org/10.1145/1455567.1455605),[Document](https://dx.doi.org/10.1145/1455567.1455605)Cited by:[§1](https://arxiv.org/html/2606.20068#S1.p1.1)\.
- A\. Y\. Ng, D\. Harada, and S\. J\. Russell \(1999\)Policy invariance under reward transformations: theory and application to reward shaping\.InProceedings of the Sixteenth International Conference on Machine Learning,ICML ’99,San Francisco, CA, USA,pp\. 278–287\.External Links:ISBN 1558606122Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px2.p3.1),[§J\.2](https://arxiv.org/html/2606.20068#A10.SS2.p4.2)\.
- T\. Nipkow, M\. Wenzel, and L\. C\. Paulson \(2002\)Isabelle/hol: a proof assistant for higher\-order logic\.Springer\-Verlag,Berlin, Heidelberg\.External Links:ISBN 3540433767Cited by:[§1](https://arxiv.org/html/2606.20068#S1.p1.1)\.
- OpenAI, :, A\. Jaech, A\. Kalai, A\. Lerer, A\. Richardson, A\. El\-Kishky, A\. Low, A\. Helyar, A\. Madry, A\. Beutel, A\. Carney, A\. Iftimie, A\. Karpenko, A\. T\. Passos, A\. Neitz, A\. Prokofiev, A\. Wei, A\. Tam, A\. Bennett, A\. Kumar, A\. Saraiva, A\. Vallone, A\. Duberstein, A\. Kondrich, A\. Mishchenko, A\. Applebaum, A\. Jiang, A\. Nair, B\. Zoph, B\. Ghorbani, B\. Rossen, B\. Sokolowsky, B\. Barak, B\. McGrew, B\. Minaiev, B\. Hao, B\. Baker, B\. Houghton, B\. McKinzie, B\. Eastman, C\. Lugaresi, C\. Bassin, C\. Hudson, C\. M\. Li, C\. de Bourcy, C\. Voss, C\. Shen, C\. Zhang, C\. Koch, C\. Orsinger, C\. Hesse, C\. Fischer, C\. Chan, D\. Roberts, D\. Kappler, D\. Levy, D\. Selsam, D\. Dohan, D\. Farhi, D\. Mely, D\. Robinson, D\. Tsipras, D\. Li, D\. Oprica, E\. Freeman, E\. Zhang, E\. Wong, E\. Proehl, E\. Cheung, E\. Mitchell, E\. Wallace, E\. Ritter, E\. Mays, F\. Wang, F\. P\. Such, F\. Raso, F\. Leoni, F\. Tsimpourlas, F\. Song, F\. von Lohmann, F\. Sulit, G\. Salmon, G\. Parascandolo, G\. Chabot, G\. Zhao, G\. Brockman, G\. Leclerc, H\. Salman, H\. Bao, H\. Sheng, H\. Andrin, H\. Bagherinezhad, H\. Ren, H\. Lightman, H\. W\. Chung, I\. Kivlichan, I\. O’Connell, I\. Osband, I\. C\. Gilaberte, I\. Akkaya, I\. Kostrikov, I\. Sutskever, I\. Kofman, J\. Pachocki, J\. Lennon, J\. Wei, J\. Harb, J\. Twore, J\. Feng, J\. Yu, J\. Weng, J\. Tang, J\. Yu, J\. Q\. Candela, J\. Palermo, J\. Parish, J\. Heidecke, J\. Hallman, J\. Rizzo, J\. Gordon, J\. Uesato, J\. Ward, J\. Huizinga, J\. Wang, K\. Chen, K\. Xiao, K\. Singhal, K\. Nguyen, K\. Cobbe, K\. Shi, K\. Wood, K\. Rimbach, K\. Gu\-Lemberg, K\. Liu, K\. Lu, K\. Stone, K\. Yu, L\. Ahmad, L\. Yang, L\. Liu, L\. Maksin, L\. Ho, L\. Fedus, L\. Weng, L\. Li, L\. McCallum, L\. Held, L\. Kuhn, L\. Kondraciuk, L\. Kaiser, L\. Metz, M\. Boyd, M\. Trebacz, M\. Joglekar, M\. Chen, M\. Tintor, M\. Meyer, M\. Jones, M\. Kaufer, M\. Schwarzer, M\. Shah, M\. Yatbaz, M\. Y\. Guan, M\. Xu, M\. Yan, M\. Glaese, M\. Chen, M\. Lampe, M\. Malek, M\. Wang, M\. Fradin, M\. McClay, M\. Pavlov, M\. Wang, M\. Wang, M\. Murati, M\. Bavarian, M\. Rohaninejad, N\. McAleese, N\. Chowdhury, N\. Chowdhury, N\. Ryder, N\. Tezak, N\. Brown, O\. Nachum, O\. Boiko, O\. Murk, O\. Watkins, P\. Chao, P\. Ashbourne, P\. Izmailov, P\. Zhokhov, R\. Dias, R\. Arora, R\. Lin, R\. G\. Lopes, R\. Gaon, R\. Miyara, R\. Leike, R\. Hwang, R\. Garg, R\. Brown, R\. James, R\. Shu, R\. Cheu, R\. Greene, S\. Jain, S\. Altman, S\. Toizer, S\. Toyer, S\. Miserendino, S\. Agarwal, S\. Hernandez, S\. Baker, S\. McKinney, S\. Yan, S\. Zhao, S\. Hu, S\. Santurkar, S\. R\. Chaudhuri, S\. Zhang, S\. Fu, S\. Papay, S\. Lin, S\. Balaji, S\. Sanjeev, S\. Sidor, T\. Broda, A\. Clark, T\. Wang, T\. Gordon, T\. Sanders, T\. Patwardhan, T\. Sottiaux, T\. Degry, T\. Dimson, T\. Zheng, T\. Garipov, T\. Stasi, T\. Bansal, T\. Creech, T\. Peterson, T\. Eloundou, V\. Qi, V\. Kosaraju, V\. Monaco, V\. Pong, V\. Fomenko, W\. Zheng, W\. Zhou, W\. McCabe, W\. Zaremba, Y\. Dubois, Y\. Lu, Y\. Chen, Y\. Cha, Y\. Bai, Y\. He, Y\. Zhang, Y\. Wang, Z\. Shao, and Z\. Li \(2024\)OpenAI o1 system card\.External Links:2412\.16720,[Link](https://arxiv.org/abs/2412.16720)Cited by:[§1](https://arxiv.org/html/2606.20068#S1.p2.1)\.
- A\. Ospanov, F\. Farnia, and R\. Yousefzadeh \(2025\)APOLLO: automated llm and lean collaboration for advanced formal reasoning\.External Links:2505\.05758,[Link](https://arxiv.org/abs/2505.05758)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px1.p2.1)\.
- L\. Ouyang, J\. Wu, X\. Jiang, D\. Almeida, C\. L\. Wainwright, P\. Mishkin, C\. Zhang, S\. Agarwal, K\. Slama, A\. Ray, J\. Schulman, J\. Hilton, F\. Kelton, L\. Miller, M\. Simens, A\. Askell, P\. Welinder, P\. Christiano, J\. Leike, and R\. Lowe \(2022\)Training language models to follow instructions with human feedback\.External Links:2203\.02155,[Link](https://arxiv.org/abs/2203.02155)Cited by:[§1](https://arxiv.org/html/2606.20068#S1.p2.1)\.
- S\. Polu and I\. Sutskever \(2020\)Generative language modeling for automated theorem proving\.External Links:2009\.03393,[Link](https://arxiv.org/abs/2009.03393)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px1.p2.1),[§2](https://arxiv.org/html/2606.20068#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Z\. Ren, Z\. Shao, J\. Song, H\. Xin, H\. Wang, W\. Zhao, L\. Zhang, Z\. Fu, Q\. Zhu, D\. Yang, Z\. F\. Wu, Z\. Gou, S\. Ma, H\. Tang, Y\. Liu, W\. Gao, D\. Guo, and C\. Ruan \(2025\)DeepSeek\-prover\-v2: advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition\.External Links:2504\.21801,[Link](https://arxiv.org/abs/2504.21801)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px1.p3.1),[§2](https://arxiv.org/html/2606.20068#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Schulman, F\. Wolski, P\. Dhariwal, A\. Radford, and O\. Klimov \(2017\)Proximal policy optimization algorithms\.External Links:1707\.06347,[Link](https://arxiv.org/abs/1707.06347)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.20068#S2.SS0.SSS0.Px2.p1.1)\.
- A\. Setlur, C\. Nagpal, A\. Fisch, X\. Geng, J\. Eisenstein, R\. Agarwal, A\. Agarwal, J\. Berant, and A\. Kumar \(2024\)Rewarding progress: scaling automated process verifiers for llm reasoning\.External Links:2410\.08146,[Link](https://arxiv.org/abs/2410.08146)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px2.p2.1),[§2](https://arxiv.org/html/2606.20068#S2.SS0.SSS0.Px2.p1.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo \(2024\)DeepSeekMath: pushing the limits of mathematical reasoning in open language models\.External Links:2402\.03300,[Link](https://arxiv.org/abs/2402.03300)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.20068#S2.SS0.SSS0.Px2.p1.1)\.
- T\. Trinh, Y\. T\. Wu, Q\. Le, H\. He, and T\. Luong \(2024\)Solving olympiad geometry without human demonstrations\.Nature625,pp\. 476–482\.External Links:[Link](https://www.nature.com/articles/s41586-023-06747-5)Cited by:[§1](https://arxiv.org/html/2606.20068#S1.p3.1)\.
- H\. Wang, M\. Unsal, X\. Lin, M\. Baksys, J\. Liu, M\. D\. Santos, F\. Sung, M\. Vinyes, Z\. Ying, Z\. Zhu, J\. Lu, H\. de Saxcé, B\. Bailey, C\. Song, C\. Xiao, D\. Zhang, E\. Zhang, F\. Pu, H\. Zhu, J\. Liu, J\. Bayer, J\. Michel, L\. Yu, L\. Dreyfus\-Schmidt, L\. Tunstall, L\. Pagani, M\. Machado, P\. Bourigault, R\. Wang, S\. Polu, T\. Barroyer, W\. Li, Y\. Niu, Y\. Fleureau, Y\. Hu, Z\. Yu, Z\. Wang, Z\. Yang, Z\. Liu, and J\. Li \(2025a\)Kimina\-prover preview: towards large formal reasoning models with reinforcement learning\.External Links:2504\.11354,[Link](https://arxiv.org/abs/2504.11354)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px1.p3.1),[§2](https://arxiv.org/html/2606.20068#S2.SS0.SSS0.Px1.p1.1)\.
- H\. Wang, Y\. Yuan, Z\. Liu, J\. Shen, Y\. Yin, J\. Xiong, E\. Xie, H\. Shi, Y\. Li, L\. Li, J\. Yin, Z\. Li, and X\. Liang \(2023\)DT\-solver: automated theorem proving with dynamic\-tree sampling guided by proof\-level value function\.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),A\. Rogers, J\. Boyd\-Graber, and N\. Okazaki \(Eds\.\),Toronto, Canada,pp\. 12632–12646\.External Links:[Link](https://aclanthology.org/2023.acl-long.706/),[Document](https://dx.doi.org/10.18653/v1/2023.acl-long.706)Cited by:[§1](https://arxiv.org/html/2606.20068#S1.p3.1)\.
- P\. Wang, L\. Li, Z\. Shao, R\. X\. Xu, D\. Dai, Y\. Li, D\. Chen, Y\. Wu, and Z\. Sui \(2024\)Math\-shepherd: verify and reinforce llms step\-by\-step without human annotations\.External Links:2312\.08935,[Link](https://arxiv.org/abs/2312.08935)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px2.p2.1),[§2](https://arxiv.org/html/2606.20068#S2.SS0.SSS0.Px2.p1.1)\.
- R\. Wang, Y\. Li, Y\. R\. Fung, and T\. Zhang \(2025b\)Let’s reason formally: natural\-formal hybrid reasoning enhances llm’s math capability\.External Links:2505\.23703,[Link](https://arxiv.org/abs/2505.23703)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px1.p4.1)\.
- R\. Wang, R\. Pan, Y\. Li, J\. Zhang, Y\. Jia, S\. Diao, R\. Pi, J\. Hu, and T\. Zhang \(2025c\)MA\-lot: multi\-agent lean\-based long chain\-of\-thought reasoning enhances formal theorem proving\.External Links:2503\.03205,[Link](https://arxiv.org/abs/2503.03205)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px1.p2.1)\.
- S\. Wang, L\. Yu, C\. Gao, C\. Zheng, S\. Liu, R\. Lu, K\. Dang, X\. Chen, J\. Yang, Z\. Zhang, Y\. Liu, A\. Yang, A\. Zhao, Y\. Yue, S\. Song, B\. Yu, G\. Huang, and J\. Lin \(2025d\)Beyond the 80/20 rule: high\-entropy minority tokens drive effective reinforcement learning for llm reasoning\.External Links:2506\.01939,[Link](https://arxiv.org/abs/2506.01939)Cited by:[§5\.3](https://arxiv.org/html/2606.20068#S5.SS3.SSS0.Px3.p1.1)\.
- Y\. Wu, A\. Q\. Jiang, W\. Li, M\. N\. Rabe, C\. Staats, M\. Jamnik, and C\. Szegedy \(2022\)Autoformalization with large language models\.External Links:2205\.12615,[Link](https://arxiv.org/abs/2205.12615)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.20068#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Wu, S\. Huang, Z\. Zhou, H\. Ying, J\. Wang, D\. Lin, and K\. Chen \(2024\)InternLM2\.5\-stepprover: advancing automated theorem proving via expert iteration on large\-scale lean problems\.External Links:2410\.15700,[Link](https://arxiv.org/abs/2410.15700)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px1.p2.1),[§2](https://arxiv.org/html/2606.20068#S2.SS0.SSS0.Px1.p1.1)\.
- H\. Xin, D\. Guo, Z\. Shao, Z\. Ren, Q\. Zhu, B\. Liu, C\. Ruan, W\. Li, and X\. Liang \(2024a\)DeepSeek\-prover: advancing theorem proving in llms through large\-scale synthetic data\.External Links:2405\.14333,[Link](https://arxiv.org/abs/2405.14333)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px1.p1.1),[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px1.p2.1),[§2](https://arxiv.org/html/2606.20068#S2.SS0.SSS0.Px1.p1.1),[Table 1](https://arxiv.org/html/2606.20068#S4.T1.2.2.2.2.2.3),[Table 1](https://arxiv.org/html/2606.20068#S4.T1.34.34.34.34.34.3),[Table 1](https://arxiv.org/html/2606.20068#S4.T1.6.6.6.6.6.3)\.
- H\. Xin, Z\. Z\. Ren, J\. Song, Z\. Shao, W\. Zhao, H\. Wang, B\. Liu, L\. Zhang, X\. Lu, Q\. Du, W\. Gao, Q\. Zhu, D\. Yang, Z\. Gou, Z\. F\. Wu, F\. Luo, and C\. Ruan \(2024b\)DeepSeek\-prover\-v1\.5: harnessing proof assistant feedback for reinforcement learning and monte\-carlo tree search\.External Links:2408\.08152,[Link](https://arxiv.org/abs/2408.08152)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px1.p2.1),[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px1.p3.1),[Appendix B](https://arxiv.org/html/2606.20068#A2.SS0.SSS0.Px3.p1.7),[Appendix D](https://arxiv.org/html/2606.20068#A4.p1.1),[§1](https://arxiv.org/html/2606.20068#S1.p3.1),[§2](https://arxiv.org/html/2606.20068#S2.SS0.SSS0.Px1.p1.1),[§5\.1](https://arxiv.org/html/2606.20068#S5.SS1.p1.2)\.
- H\. Ying, Z\. Wu, Y\. Geng, J\. Wang, D\. Lin, and K\. Chen \(2024a\)Lean workbook: a large\-scale lean problem set formalized from natural language math problems\.External Links:2406\.03847,[Link](https://arxiv.org/abs/2406.03847)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.20068#S1.p3.1),[§2](https://arxiv.org/html/2606.20068#S2.SS0.SSS0.Px1.p1.1)\.
- H\. Ying, S\. Zhang, L\. Li, Z\. Zhou, Y\. Shao, Z\. Fei, Y\. Ma, J\. Hong, K\. Liu, Z\. Wang, Y\. Wang, Z\. Wu, S\. Li, F\. Zhou, H\. Liu, S\. Zhang, W\. Zhang, H\. Yan, X\. Qiu, J\. Wang, K\. Chen, and D\. Lin \(2024b\)InternLM\-math: open math large language models toward verifiable reasoning\.External Links:2402\.06332,[Link](https://arxiv.org/abs/2402.06332)Cited by:[Table 1](https://arxiv.org/html/2606.20068#S4.T1.30.30.30.30.30.2)\.
- Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, Y\. Yue, W\. Dai, T\. Fan, G\. Liu, L\. Liu, X\. Liu, H\. Lin, Z\. Lin, B\. Ma, G\. Sheng, Y\. Tong, C\. Zhang, M\. Zhang, W\. Zhang, H\. Zhu, J\. Zhu, J\. Chen, J\. Chen, C\. Wang, H\. Yu, Y\. Song, X\. Wei, H\. Zhou, J\. Liu, W\. Ma, Y\. Zhang, L\. Yan, M\. Qiao, Y\. Wu, and M\. Wang \(2025\)DAPO: an open\-source llm reinforcement learning system at scale\.External Links:2503\.14476,[Link](https://arxiv.org/abs/2503.14476)Cited by:[Appendix B](https://arxiv.org/html/2606.20068#A2.SS0.SSS0.Px3.p1.7)\.
- L\. Yuan, W\. Li, H\. Chen, G\. Cui, N\. Ding, K\. Zhang, B\. Zhou, Z\. Liu, and H\. Peng \(2024\)Free process rewards without process labels\.External Links:2412\.01981,[Link](https://arxiv.org/abs/2412.01981)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px2.p2.1),[§1](https://arxiv.org/html/2606.20068#S1.p4.1),[§2](https://arxiv.org/html/2606.20068#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Zhang, Q\. Wang, X\. Ji, Y\. Liu, Y\. Yue, F\. Zhang, D\. Zhang, G\. Zhou, and K\. Gai \(2025\)Leanabell\-prover: posttraining scaling in formal reasoning\.External Links:2504\.06122,[Link](https://arxiv.org/abs/2504.06122)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px1.p3.1),[§2](https://arxiv.org/html/2606.20068#S2.SS0.SSS0.Px1.p1.1)\.
- K\. Zheng, J\. M\. Han, and S\. Polu \(2022\)MiniF2F: a cross\-system benchmark for formal olympiad\-level mathematics\.External Links:2109\.00110,[Link](https://arxiv.org/abs/2109.00110)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px1.p1.1),[§2](https://arxiv.org/html/2606.20068#S2.SS0.SSS0.Px1.p1.1)\.
- R\. Zheng, S\. Dou, S\. Gao, Y\. Hua, W\. Shen, B\. Wang, Y\. Liu, S\. Jin, Q\. Liu, Y\. Zhou, L\. Xiong, L\. Chen, Z\. Xi, N\. Xu, W\. Lai, M\. Zhu, C\. Chang, Z\. Yin, R\. Weng, W\. Cheng, H\. Huang, T\. Sun, H\. Yan, T\. Gui, Q\. Zhang, X\. Qiu, and X\. Huang \(2023\)Secrets of rlhf in large language models part i: ppo\.External Links:2307\.04964,[Link](https://arxiv.org/abs/2307.04964)Cited by:[Appendix A](https://arxiv.org/html/2606.20068#A1.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2606.20068#S2.SS0.SSS0.Px2.p1.1)\.
- T\. Zhu, J\. Clune, J\. Avigad, A\. Q\. Jiang, and S\. Welleck \(2025\)Premise selection for a lean hammer\.External Links:2506\.07477,[Link](https://arxiv.org/abs/2506.07477)Cited by:[§1](https://arxiv.org/html/2606.20068#S1.p3.1)\.

## Appendix

## Appendix AAdditional Related Works

#### Automatic Theorem Proving

An automated theorem prover typically consists of two stages\. The first is the process of translating mathematical statements written in natural language into formal statements\.Wuet al\.\([2022](https://arxiv.org/html/2606.20068#bib.bib25)\)utilized large language models to translate mathematical questions into formal languages such as Isabelle and HOL\. This process, known as autoformalization, is primarily used for constructing datasets intended for formal reasoning\. Benchmarks or training datasets such as MiniF2F, LeanWorkbook, ProofNet, Deepseek\-Prover have employed LLMs to translate natural language mathematical statements into formal expressions, contributing to the creation of high\-quality formal reasoning datasets\(Zhenget al\.,[2022](https://arxiv.org/html/2606.20068#bib.bib30); Azerbayevet al\.,[2023](https://arxiv.org/html/2606.20068#bib.bib31); Xinet al\.,[2024a](https://arxiv.org/html/2606.20068#bib.bib14); Yinget al\.,[2024a](https://arxiv.org/html/2606.20068#bib.bib17)\)\.

The second stage involves generating a formal proof from the translated formal statement\. This proof generation process is typically divided into two approaches: one involves step\-by\-step inference, such as tree search during inference time\(Polu and Sutskever,[2020](https://arxiv.org/html/2606.20068#bib.bib26); Azerbayevet al\.,[2024](https://arxiv.org/html/2606.20068#bib.bib28); Wuet al\.,[2024](https://arxiv.org/html/2606.20068#bib.bib27); Xinet al\.,[2024b](https://arxiv.org/html/2606.20068#bib.bib20)\), and the other generates the entire proof at once\(Xinet al\.,[2024a](https://arxiv.org/html/2606.20068#bib.bib14); Linet al\.,[2025b](https://arxiv.org/html/2606.20068#bib.bib33)\)\.Ospanovet al\.\([2025](https://arxiv.org/html/2606.20068#bib.bib55)\); Wanget al\.\([2025c](https://arxiv.org/html/2606.20068#bib.bib34)\); Babaet al\.\([2025](https://arxiv.org/html/2606.20068#bib.bib56)\)use Lean compiler as agent for complementing formal reasoning ability of LLMs, while\(Dong and Ma,[2025](https://arxiv.org/html/2606.20068#bib.bib35)\)enhances formal reasoning by augmenting problems via conjecture\.\(Jianget al\.,[2023](https://arxiv.org/html/2606.20068#bib.bib36)\)presents a unified framework that combines both autoformalization and proof generation in a single pipeline\.

Existing methods such as Lean\-STaR and RMaxTS\(Linet al\.,[2025a](https://arxiv.org/html/2606.20068#bib.bib29); Xinet al\.,[2024b](https://arxiv.org/html/2606.20068#bib.bib20)\)utilize Lean as a step\-checker during inference, generating steps sequentially and searching optimally via tree search to find valid proofs\. In contrast, in this paper, similar to\(Wanget al\.,[2025a](https://arxiv.org/html/2606.20068#bib.bib44); Zhanget al\.,[2025](https://arxiv.org/html/2606.20068#bib.bib45); Renet al\.,[2025](https://arxiv.org/html/2606.20068#bib.bib46); Jiet al\.,[2025](https://arxiv.org/html/2606.20068#bib.bib52)\), we utilize Lean as a whole\-proof verifier during the training stage\. Additionally, beyond merely providing correctness checks for the entire proof, we leverage Lean’s parsing and elaboration capabilities to validate each individual tactic step, integrating this step\-level validation into the training process\. In other words, we employ the Lean proof assistant as a process\-based reward model for validating the correctness of each reasoning steps\.\(Lightmanet al\.,[2023](https://arxiv.org/html/2606.20068#bib.bib39)\)\.

Unlike prior work that leverages dense feedback from proof assistants, our research takes a different perspective: we rely solely on the rule\-based signals of the symbolic engine, without introducing any natural language\. Approaches such asLinet al\.\([2025a](https://arxiv.org/html/2606.20068#bib.bib29);[c](https://arxiv.org/html/2606.20068#bib.bib53)\); Wanget al\.\([2025b](https://arxiv.org/html/2606.20068#bib.bib58)\); Liet al\.\([2025](https://arxiv.org/html/2606.20068#bib.bib59)\),exploit natural language reasoning as a form of annotation to enhance LLMs’ formal reasoning abilities\. In contrast, our method improves performance exclusively through reward signals provided by Lean, without any reliance on natural language\.

#### Reinforcement Learning in Language Models

While developing or applying algorithms such as PPO\(Schulmanet al\.,[2017](https://arxiv.org/html/2606.20068#bib.bib38)\)and GRPO\(Shaoet al\.,[2024](https://arxiv.org/html/2606.20068#bib.bib37)\)plays a significant role in reinforcement learning, reward shaping and credit assignment are central challenges in reinforcement learning\.\(Cobbeet al\.,[2021](https://arxiv.org/html/2606.20068#bib.bib12)\)introduced a reward model based on the outcome of a response\. However, similar to other areas of RLHF, this approach suffers from the limitation of sparse rewards\(Chanet al\.,[2024](https://arxiv.org/html/2606.20068#bib.bib40); Zhenget al\.,[2023](https://arxiv.org/html/2606.20068#bib.bib41)\)\.

To address this, process\-based reward model \(PRM\) assigns step\-level rewards during inference to guide rationale generation\(Lightmanet al\.,[2023](https://arxiv.org/html/2606.20068#bib.bib39)\), and can also be used to reward responses during training\(Setluret al\.,[2024](https://arxiv.org/html/2606.20068#bib.bib42); Kazemnejadet al\.,[2024](https://arxiv.org/html/2606.20068#bib.bib21)\)\.\(Yuanet al\.,[2024](https://arxiv.org/html/2606.20068#bib.bib22); Cuiet al\.,[2025](https://arxiv.org/html/2606.20068#bib.bib23)\)derived an implicit PRM from the ORM without any data annotation or additional training\. When assigning scores to reasoning steps,\(Lightmanet al\.,[2023](https://arxiv.org/html/2606.20068#bib.bib39)\)defined the reward as the correctness of each step, which required substantial human annotation effort\.\(Wanget al\.,[2024](https://arxiv.org/html/2606.20068#bib.bib48)\)instead adopted a Monte Carlo approach, defining the score of a step as the proportion of successful rollouts originating from that step\. While recent PRM approaches show promise in natural language reasoning, they require large annotated datasets of step\-level correctness\. To the best of our knowledge, no such dataset exists for Lean or formal theorem proving, making a direct comparison with a learned PRM baseline infeasible\. This further motivates our approach of leveraging the Lean verifier itself as a process oracle\. In contrast, our process\-based reward leverages the Lean theorem prover to automatically verify the correctness of each step, thereby eliminating the need for human annotators or sampling many proofs steps\.

Our work can also be interpreted through the lens of reward shaping\(Nget al\.,[1999](https://arxiv.org/html/2606.20068#bib.bib62)\)\. Prior approaches have explored different mechanisms for distributing reward signals:Chanet al\.\([2024](https://arxiv.org/html/2606.20068#bib.bib40)\)leverages the internal attention patterns of LLMs to assign higher weights to important tokens,Caoet al\.\([2025](https://arxiv.org/html/2606.20068#bib.bib60)\)employs Shapley values to allocate credit across actions, andKazemnejadet al\.\([2024](https://arxiv.org/html/2606.20068#bib.bib21)\)uses Monte Carlo rollouts to estimate and distribute rewards over intermediate steps\. In contrast, our method relies on an external parser\-the Lean theorem prover\-to parse tactics and assign reward to the first token, thereby implementing a form of credit assignment\.

## Appendix BExperimental Detail

#### Data\.

We randomly sampled 10k instances from the STP dataset \(3\.26M total\) for RL training\. For DeepSeek\-Prover\-V1\.5\-SFT, we applied an additional supervised fine\-tuning step on 500k STP samples before RL, since the vanilla model produced low\-quality proofs during RL training\.

#### Verification\.

We use Lean 4\.9\.0\-rc1 for all experiments in the paper\. During training, we used a REPL \(read\-eval\-print loop\) interface with Lean to verify proofs and assign outcome\- and tactic\-level rewards\. Each proof attempt was given a maximum of 15 seconds for verification; longer runs were treated as failures \(both outcomes, tactic rewards are zero\)\.

#### RL configuration\.

For GRPO training, we usedG=4G=4generations per prompt, sampling temperature0\.90\.9, KL coefficient0\.040\.04, clippingϵ=0\.2\\epsilon=0\.2, and the DAPO upper bound0\.280\.28\(Yuet al\.,[2025](https://arxiv.org/html/2606.20068#bib.bib47)\)\. Tactic\-level rewards were fixed atd1=−0\.05d\_\{1\}=\-0\.05andd2=−0\.1d\_\{2\}=\-0\.1for partially valid and erroneous tactics in the main experiments, respectively\. All experiments used non\-CoT prompts, followingXinet al\.\([2024b](https://arxiv.org/html/2606.20068#bib.bib20)\)\.

#### Training details\.

We fine\-tuned the models with LoRA \(rank 64,α=64\\alpha=64\) using bf16 precision\. The AdamW optimizer was used with a learning rate of1\.0×10−51\.0\\times 10^\{\-5\}\. Maximum response length was set to 1024 tokens during both training and evaluation\.

#### Evaluation\.

For decoding we used temperature1\.01\.0and top\-p0\.950\.95\. We re\-evaluated all baselines under the same non\-CoT and budget settings \(32/64 samples\)\. All reported results are from the final checkpoint\.

#### Compute\.

Training was conducted on 4×\\timesNVIDIA A6000 GPUs, requiring approximately 21\-23 hours\.

## Appendix CHyperparameter ablations ondid\_\{i\}

Table 5:Ablation study on tactic\-level penaltiesd1,d2d\_\{1\},d\_\{2\}\. We compare outcome\-only GRPO baseline with three variants of\(d1,d2\)\(d\_\{1\},d\_\{2\}\)settings\. Results are reported as pass@32 and pass@64 \(%\) on MiniF2F and ProofNet test sets\. The experiment is couducted with STP\-Lean model\.This ablation shows that introducing a gap betweend1d\_\{1\}andd2d\_\{2\}makes the method more robust: performance remains consistently above the GRPO baseline, with stable gains across different penalty scales and especially clear improvements on MiniF2F\.

## Appendix DPrompts

For training and evaluation, we used non\-COT evaluation followed by\(Dong and Ma,[2025](https://arxiv.org/html/2606.20068#bib.bib35)\)and\(Xinet al\.,[2024b](https://arxiv.org/html/2606.20068#bib.bib20)\)\. The examples are introduced in Table[6](https://arxiv.org/html/2606.20068#A4.T6),[7](https://arxiv.org/html/2606.20068#A4.T7)\.

Prompt Template[⬇](data:text/plain;base64,Q29tcGxldGUgdGhlIGZvbGxvd2luZyBMZWFuIDQgY29kZTpcblxuCmBgYGxlYW40XG57aGVhZGVyfXtmb3JtYWxfc3RhdGVtZW50fQ==)CompletethefollowingLean4code:\\n\\n“‘lean4\\n\{header\}\{formal\_statement\}Table 6:Prompt template used in training and evaluation\. We selected non\-COT generation which is appropriate with our MDP settingPrompt Example[⬇](data:text/plain;base64,Q29tcGxldGUgdGhlIGZvbGxvd2luZyBMZWFuIDQgY29kZToKCmBgYGxlYW40CmltcG9ydCBNYXRobGliCmltcG9ydCBBZXNvcApzZXRfb3B0aW9uIG1heEhlYXJ0YmVhdHMgMApvcGVuIEJpZ09wZXJhdG9ycyBSZWFsIE5hdCBUb3BvbG9neSBSYXQKCnRoZW9yZW0gdGhlb3JlbV9leGVyY2lzZV8yMDExXzJfMjU3IChHIDogVHlwZSopIFtHcm91cCBHXSBbRmludHlwZSBHXQooaOKCgSA6IEZpbnR5cGUuY2FyZCBHIHwgMikgKHggOiBHKSA6IHggXiAyID0gMSDihpQKICAo4oiAIHggeSA6IEcsIHggKiB5ID0geSAqIHgpIOKIpyAo4oiAIGEgOiBHLCBhID0gYeKBu8K5KSDiiKcg4oiAIGEgOiBHLCBhXjIgPSAxIOKIpwogIGxldCBwIHggOiBHIOKGkiBHIDo9IGJ5)CompletethefollowingLean4code:\`\`\`lean4importMathlibimportAesopset\_optionmaxHeartbeats0openBigOperatorsRealNatTopologyRattheoremtheorem\_exercise\_2011\_2\_257\(G:Type\*\)\[GroupG\]\[FintypeG\]\(h1:Fintype\.cardG\|2\)\(x:G\):x^2=1↔\\leftrightarrow\(∀\\forallxy:G,x\*y=y\*x\)∧\\wedge\(∀\\foralla:G,a=a\-1\)∧\\wedge∀\\foralla:G,a^2=1∧\\wedgeletpx:G→\\rightarrowG:=byTable 7:A training sample used in training and evaluation\. We selected non\-COT generation which is appropriate with our MDP setting\.
## Appendix ECredit Assignment in Reinforcement Learning

Letyty\_\{t\}be thett\-th token ofyy,RRdenote the reward model,πθ\\pi\_\{\\theta\}represent the policy model, andπref\\pi\_\{\\text\{ref\}\}be the reference model\.LLdenote the response length andBBbe a coefficient controlling the distance between the policy and the reference policy\. In PPO, the token\-level reward at positionttis defined as:rt\(x,yt\)=R\(x,y\)𝟏\(yt=L\)−Blog⁡\(πθ\(yt\|x\)πref\(yt\|x\)\)r\_\{t\}\(x,y\_\{t\}\)=R\(x,y\)\\mathbf\{1\}\(y\_\{t\}=L\)\-B\\log\\left\(\\frac\{\\pi\_\{\\theta\}\(y\_\{t\}\|x\)\}\{\\pi\_\{\\text\{ref\}\}\(y\_\{t\}\|x\)\}\\right\), where the non\-zero rewardR\(x,y\)R\(x,y\)is assigned only to the last token\. For all other tokens, only a KL divergence penalty is applied via a log ratiolog⁡\(πθ\(yt\|x\)πref\(yt\|x\)\)\\log\\left\(\\frac\{\\pi\_\{\\theta\}\(y\_\{t\}\|x\)\}\{\\pi\_\{\\text\{ref\}\}\(y\_\{t\}\|x\)\}\\right\)\. Direct usage of rewards can lead to high variance; therefore, PPO reduces variance by utilizing a learned value modelVV\.This value network assigns a value to each tokenyty\_\{t\}, from which the Temporal\-Difference \(TD\) error is computed as:δt=rt\+γV\(yt\+1\)−V\(yt\)whereγis discounted factor\.\\delta\_\{t\}=r\_\{t\}\+\\gamma V\(y\_\{t\+1\}\)\-V\(y\_\{t\}\)\\text\{ where \}\\gamma\\text\{ is discounted factor\.\}Then, the advantage for each token is recursively calculated as follows:AL=δLA\_\{L\}=\\delta\_\{L\},At=δt\+γλAt\+1,fort=L−1,L−2,…,1\.A\_\{t\}=\\delta\_\{t\}\+\\gamma\\lambda A\_\{t\+1\},\\quad\\text\{for \}t=L\-1,L\-2,\\dots,1\.Subsequently, because the computed advantagesAtA\_\{t\}can exhibit high variance during exploration, normalization or similar techniques are applied, resulting in the final adjusted advantageAtA\_\{t\}\. This adjusted advantage is then utilized in the PPO loss defined as:

LCLIP\(θ\)=𝔼t\[min⁡\(πθ\(yt∣x\)πθold\(yt∣x\)At,clip\(πθ\(yt∣x\)πθold\(yt∣x\),1−ϵ,1\+ϵ\)At\)\]L^\{\\text\{CLIP\}\}\(\\theta\)=\\mathbb\{E\}\_\{t\}\\left\[\\min\\left\(\\frac\{\\pi\_\{\\theta\}\(y\_\{t\}\\mid x\)\}\{\\pi\_\{\\theta\_\{\\text\{old\}\}\}\(y\_\{t\}\\mid x\)\}A\_\{t\},\\;\\text\{clip\}\\left\(\\frac\{\\pi\_\{\\theta\}\(y\_\{t\}\\mid x\)\}\{\\pi\_\{\\theta\_\{\\text\{old\}\}\}\(y\_\{t\}\\mid x\)\},\\,1\-\\epsilon,\\,1\+\\epsilon\\right\)A\_\{t\}\\right\)\\right\]\(2\)In contrast, REINFORCE\-based methods such as GRPO and RLOO have proposed algorithms that optimize policies directly from verifiable rewards without requiring a value model, due to concerns about the computational cost and estimation capability associated with training value networks\.

GRPO generates multiple response groups\{y\(i\)\}i=1G\\\{\\,y\_\{\(i\)\}\\\}\_\{i=1\}^\{G\}for a given questionqqfrom an old policyπold\\pi\_\{\\text\{old\}\}\. Subsequently, a reward function outputs rewardrr=\{r\(i\)\}i=1G\\\{\\,r\_\{\(i\)\}\\\}\_\{i=1\}^\{G\}for each response group\. If we setyi,ty\_\{i,t\}ast−t\-th token index of responseyiy\_\{i\}The advantage foryi,ty\_\{i,t\},Ai,tA\_\{i,t\}is then computed by normalizing these rewards as follows:A^i,t=ri−mean\(r\)std\(r\)\.\\hat\{A\}\_\{i,t\}\\;=\\;\\frac\{r\_\{i\}\-\\mathrm\{mean\}\(r\)\}\{\\mathrm\{std\}\(r\)\}\.

This advantage is uniformly assigned to each tokenyi,ty\_\{i,t\}constituting the responseyiy\_\{i\}\. Subsequently, this identical token\-level advantage is utilized in calculating the following loss:

LGRPO\(θ\)=𝔼q∼P\(Q\),\{yi\}i=1G∼πθold\(Y∣q\)\[1G∑i=1G\{min⁡\(πθ\(yi,t∣q\)πθold\(yi,t∣q\)A^i,t,clip\(πθ\(yi,t∣q\)πθold\(yi,t∣q\),1−ϵ,1\+ϵ\)A^i,t\)−βDKL\[πθ∥πref\]\}\]\.\\displaystyle\\begin\{split\}L\_\{\\mathrm\{GRPO\}\}\(\\theta\)&=\\mathbb\{E\}\_\{q\\sim P\(Q\),\\;\\\{\\,y\_\{i\}\\\}\_\{i=1\}^\{G\}\\sim\\pi\_\{\\theta\_\{\\mathrm\{old\}\}\}\(Y\\mid q\)\}\\\\\[\-2\.15277pt\] &\\hskip\-15\.00002pt\\Biggl\[\\frac\{1\}\{G\}\\sum\_\{i=1\}^\{G\}\\Bigl\\\{\\min\\Bigl\(\\frac\{\\pi\_\{\\theta\}\(y\_\{i,t\}\\mid q\)\}\{\\pi\_\{\\theta\_\{\\mathrm\{old\}\}\}\(y\_\{i,t\}\\mid q\)\}\\,\\hat\{A\}\_\{i,t\},\\mathrm\{clip\}\\Bigl\(\\frac\{\\pi\_\{\\theta\}\(y\_\{i,t\}\\mid q\)\}\{\\pi\_\{\\theta\_\{\\mathrm\{old\}\}\}\(y\_\{i,t\}\\mid q\)\},\\,1\-\\epsilon,\\,1\+\\epsilon\\Bigr\)\\,\\hat\{A\}\_\{i,t\}\\Bigr\)\\;\-\\;\\beta\\,D\_\{\\mathrm\{KL\}\}\\bigl\[\\pi\_\{\\theta\}\\,\\\|\\,\\pi\_\{\\mathrm\{ref\}\}\\bigr\]\\Bigr\\\}\\Biggr\]\.\\end\{split\}

\(3\)
As our goal is to integrate Lean\-based fine\-grained rewards to GRPO algorithm, our method keep balance between value\-based PPO’s fine\-grained credit assignmethod and GRPO’s simple unifrom credit assignment approach\.

## Appendix FGenerated Proofs Examples

imo\_1960\_p2 in minif2f from tactic\-reward\-trained model[⬇](data:text/plain;base64,aW1wb3J0IE1hdGhsaWIKaW1wb3J0IEFlc29wCgpzZXRfb3B0aW9uIG1heEhlYXJ0YmVhdHMgMAoKb3BlbiBCaWdPcGVyYXRvcnMgUmVhbCBOYXQgVG9wb2xvZ3kgUmF0Cgp0aGVvcmVtIGltb18xOTYwX3AyICh4IDog4oSdKSAoaF8wIDogMCDiiaQgMSArIDIgKiB4KQooaF8xIDogKDEgLSBSZWFsLnNxcnQgKDEgKyAyICogeCkpIF4gMiDiiaAgMCkKICAgIChoXzIgOiA0ICogeCBeIDIgLyAoMSAtIFJlYWwuc3FydCAoMSArIDIgKiB4KSkgXiAyIDwgMiAqIHggKyA5KSA6CiAgICAtKDEgLyAyKSDiiaQgeCDiiKcgeCA8IDQ1IC8gOCA6PSBieQogIGhhdmUgaF8zIDogMCDiiaQgKDEgKyAyICogeCkgOj0gYnkgbGluYXJpdGgKICBoYXZlIGhfNCA6IDAg4omkIFJlYWwuc3FydCAoMSArIDIgKiB4KTo9IFJlYWwuc3FydF9ub25uZWcoMSArIDIgKiB4KQogIGhhdmUgaF81IDogMCDiiaQgKDEgKyAyICogeCkgKiBSZWFsLnNxcnQgKDEgKyAyICogeCkgOj0gYnkgcG9zaXRpdml0eQogIGhhdmUgaF82IDogMCA8ICgxIC0gUmVhbC5zcXJ0ICgxICsgMiAqIHgpKSBeIDIgOj0gYnkKICAgIGFwcGx5IHNxX3Bvc19vZl9uZV96ZXJvCiAgICBpbnRybyBoCiAgICBhcHBseSBoXzEKICAgIG5saW5hcml0aAogIGNvbnN0cnVjdG9yCiAgbmxpbmFyaXRoIFtzcV9zcXJ0IChzaG93IDAg4omkIDEgKyAyICogeCBmcm9tIGJ5IGxpbmFyaXRoKSwgaF8yXQogIGhhdmUgaF83IDogeCA8IDQ1IC8gOCA6PSBieQogICAgYXBwbHkgbHRfb2Zfbm90X2dlCiAgICBpbnRybyBoZ2UKICAgIGhhdmUgaF84IDogNCAqIHggXiAyIC8gKDEgLSBSZWFsLnNxcnQgKDEgKyAyICogeCkpIF4gMiDiiaUgMip4KzkKICAgIDo9IGJ5CiAgICAgIHJ3IFtnZV9pZmZfbGVdCiAgICAgIHJ3IFtsZV9kaXZfaWZmIGhfNl0KICAgICAgbmxpbmFyaXRoIFtzcV9zcXJ0IChzaG93IDAg4omkIDEgKyAyICogeCBmcm9tIGJ5IGxpbmFyaXRoKSwgaGdlXQogICAgbGluYXJpdGgKICBsaW5hcml0aA==)importMathlibimportAesopset\_optionmaxHeartbeats0openBigOperatorsRealNatTopologyRattheoremimo\_1960\_p2\(x:ℝ\\mathbb\{R\}\)\(h\_0:0≤\\leq1\+2\*x\)\(h\_1:\(1\-Real\.sqrt\(1\+2\*x\)\)^2≠\\neq0\)\(h\_2:4\*x^2/\(1\-Real\.sqrt\(1\+2\*x\)\)^2<2\*x\+9\):\-\(1/2\)≤\\leqx∧\\wedgex<45/8:=byhaveh\_3:0≤\\leq\(1\+2\*x\):=bylinarithhaveh\_4:0≤\\leqReal\.sqrt\(1\+2\*x\):=Real\.sqrt\_nonneg\(1\+2\*x\)haveh\_5:0≤\\leq\(1\+2\*x\)\*Real\.sqrt\(1\+2\*x\):=bypositivityhaveh\_6:0<\(1\-Real\.sqrt\(1\+2\*x\)\)^2:=byapplysq\_pos\_of\_ne\_zerointrohapplyh\_1nlinarithconstructornlinarith\[sq\_sqrt\(show0≤\\leq1\+2\*xfrombylinarith\),h\_2\]haveh\_7:x<45/8:=byapplylt\_of\_not\_geintrohgehaveh\_8:4\*x^2/\(1\-Real\.sqrt\(1\+2\*x\)\)^2≥\\geq2\*x\+9:=byrw\[ge\_iff\_le\]rw\[le\_div\_iffh\_6\]nlinarith\[sq\_sqrt\(show0≤\\leq1\+2\*xfrombylinarith\),hge\]linarithlinarithTable 8:Proof ofimo\_1960\_p2from MiniF2F generated by our tactic\-reward\-trained model\.imo\_1960\_p2 in minif2f from baseline model[⬇](data:text/plain;base64,aW1wb3J0IE1hdGhsaWIKaW1wb3J0IEFlc29wCgpzZXRfb3B0aW9uIG1heEhlYXJ0YmVhdHMgMAoKb3BlbiBCaWdPcGVyYXRvcnMgUmVhbCBOYXQgVG9wb2xvZ3kgUmF0Cgp0aGVvcmVtIGltb18xOTYwX3AyICh4IDog4oSdKQooaF8wIDogMCDiiaQgMSArIDIgKiB4KSAoaF8xIDogKDEgLSBSZWFsLnNxcnQgKDEgKyAyICogeCkpIF4gMiDiiaAgMCkKICAgIChoXzIgOiA0ICogeCBeIDIgLyAoMSAtIFJlYWwuc3FydCAoMSArIDIgKiB4KSkgXiAyIDwgMiAqIHggKyA5KSA6CiAgICAtKDEgLyAyKSDiiaQgeCDiiKcgeCA8IDQ1IC8gOCA6PSBieQogIGhhdmUgaF8zIDogMCA8ICgxIC0gUmVhbC5zcXJ0ICgxICsgMiAqIHgpKSBeIDIgOj0gYnkgcG9zaXRpdml0eQogIGhhdmUgaF80IDogMCDiiaQgUmVhbC5zcXJ0ICgxICsgMiAqIHgpIDo9IFJlYWwuc3FydF9ub25uZWcgXwogIGhhdmUgaF81IDogMCDiiaQgMSArIDIgKiB4IDo9IGJ5IGxpbmFyaXRoCiAgaGF2ZSBoXzYgOiBSZWFsLnNxcnQgKDEgKyAyICogeCkg4omkIDEgKyB4IDo9IGJ5CiAgICBydyBbUmVhbC5zcXJ0X2xlX2lmZl0KICAgIGNvbnN0cnVjdG9yCiAgICDCtyBsaW5hcml0aAogICAgwrcgbmxpbmFyaXRoCiAgaGF2ZSBoXzcgOiA0ICogeCBeIDIgLyAoMSAtIFJlYWwuc3FydCAoMSArIDIgKiB4KSkgXiAyIDwgMiAqIHggKyA5CiAgOj0KICBieSBhc3N1bXB0aW9uCiAgaGF2ZSBoXzggOiAtKDEgLyAyKSDiiaQgeCA6PSBieQogICAgcncgW2Rpdl9sdF9pZmYgaF8zXSBhdCBoXzcKICAgIG5saW5hcml0aAogIGhhdmUgaF85IDogeCA8IDQ1IC8gOCA6PSBieQogICAgcncgW2Rpdl9sdF9pZmYgaF8zXSBhdCBoXzcKICAgIG5saW5hcml0aAogIGV4YWN0IOKfqGhfOCwgaF854p+pCiAgPDs+IGFzc3VtcHRpb24KICA8Oz4gYXNzdW1wdGlvbgogIDw7PiBhc3N1bXB0aW9uCiAgPDs+IGFzc3VtcHRpb24KICA8Oz4gYXNzdW1wdGlvbgogIDw7PiBhc3N1bXB0aW9uCiAgPDs+IGFzc3VtcHRpb24KICA8Oz4gYXNzdW1wdGlvbgogIDw7PiBhc3N1bXB0aW9u)importMathlibimportAesopset\_optionmaxHeartbeats0openBigOperatorsRealNatTopologyRattheoremimo\_1960\_p2\(x:ℝ\\mathbb\{R\}\)\(h\_0:0≤\\leq1\+2\*x\)\(h\_1:\(1\-Real\.sqrt\(1\+2\*x\)\)^2≠\\neq0\)\(h\_2:4\*x^2/\(1\-Real\.sqrt\(1\+2\*x\)\)^2<2\*x\+9\):\-\(1/2\)≤\\leqx∧\\wedgex<45/8:=byhaveh\_3:0<\(1\-Real\.sqrt\(1\+2\*x\)\)^2:=bypositivityhaveh\_4:0≤\\leqReal\.sqrt\(1\+2\*x\):=Real\.sqrt\_nonneg\_haveh\_5:0≤\\leq1\+2\*x:=bylinarithhaveh\_6:Real\.sqrt\(1\+2\*x\)≤\\leq1\+x:=byrw\[Real\.sqrt\_le\_iff\]constructor⋅\\cdotlinarith⋅\\cdotnlinarithhaveh\_7:4\*x^2/\(1\-Real\.sqrt\(1\+2\*x\)\)^2<2\*x\+9:=byassumptionhaveh\_8:\-\(1/2\)≤\\leqx:=byrw\[div\_lt\_iffh\_3\]ath\_7nlinarithhaveh\_9:x<45/8:=byrw\[div\_lt\_iffh\_3\]ath\_7nlinarithexact⟨\\langleh\_8,h\_9⟩\\rangle<;\>assumption<;\>assumption<;\>assumption<;\>assumption<;\>assumption<;\>assumption<;\>assumption<;\>assumption<;\>assumptionTable 9:Proof ofimo\_1960\_p2from MiniF2F generated by the baseline STP model\.exercise\_1\_14 in ProofNet from tactic\-reward\-trained model\.[⬇](data:text/plain;base64,aW1wb3J0IE1hdGhsaWIKCm9wZW4gVG9wb2xvZ3kgRmlsdGVyIFJlYWwgQ29tcGxleCBUb3BvbG9naWNhbFNwYWNlIEZpbnNldApvcGVuIHNjb3BlZCBCaWdPcGVyYXRvcnMKCnRoZW9yZW0gZXhlcmNpc2VfMV8xNAogICh6IDog4oSCKSAoaCA6IGFicyB6ID0gMSkKICA6IChhYnMgKDEgKyB6KSkgXiAyICsgKGFicyAoMSAtIHopKSBeIDIgPSA0IDo9IGJ5CiAgcncgW+KGkENvbXBsZXgubm9ybVNxX2VxX2Ficywg4oaQQ29tcGxleC5ub3JtU3FfZXFfYWJzXQogIHNpbXAgW0NvbXBsZXgubm9ybVNxX2FkZCwgQ29tcGxleC5ub3JtU3Ffc3ViLCBoXQogIHJpbmdfbmYKICA8Oz4gc2ltcCBbaCwgQ29tcGxleC5ub3JtU3FfZXFfYWJzXQogIDw7PiByaW5nX25m)importMathlibopenTopologyFilterRealComplexTopologicalSpaceFinsetopenscopedBigOperatorstheoremexercise\_1\_14\(z:ℂ\\mathbb\{C\}\)\(h:absz=1\):\(abs\(1\+z\)\)^2\+\(abs\(1\-z\)\)^2=4:=byrw\[←\\leftarrowComplex\.normSq\_eq\_abs,←\\leftarrowComplex\.normSq\_eq\_abs\]simp\[Complex\.normSq\_add,Complex\.normSq\_sub,h\]ring\_nf<;\>simp\[h,Complex\.normSq\_eq\_abs\]<;\>ring\_nfTable 10:Proof ofexercise\_1\_14from ProofNet generated by our tactic\-reward\-trained model\.exercise\_1\_14 in ProofNet from baseline[⬇](data:text/plain;base64,aW1wb3J0IE1hdGhsaWIKCm9wZW4gVG9wb2xvZ3kgRmlsdGVyIFJlYWwgQ29tcGxleCBUb3BvbG9naWNhbFNwYWNlIEZpbnNldApvcGVuIHNjb3BlZCBCaWdPcGVyYXRvcnMKCnRoZW9yZW0gZXhlcmNpc2VfMV8xNAogICh6IDog4oSCKSAoaCA6IGFicyB6ID0gMSkKICA6IChhYnMgKDEgKyB6KSkgXiAyICsgKGFicyAoMSAtIHopKSBeIDIgPSA0IDo9IGJ5CiAgaGF2ZSBoXzAgOiBhYnMgeiBeIDIgPSAxIDo9IGJ5IHNpbXAgW2hdCiAgc2ltcCBbcG93X3R3bywgYWJzX2FkZCwgYWJzX3N1YiwgaF8wXQogIHJpbmdfbmYKICBzaW1wX2FsbCBbc3FfYWJzXQ==)importMathlibopenTopologyFilterRealComplexTopologicalSpaceFinsetopenscopedBigOperatorstheoremexercise\_1\_14\(z:ℂ\\mathbb\{C\}\)\(h:absz=1\):\(abs\(1\+z\)\)^2\+\(abs\(1\-z\)\)^2=4:=byhaveh\_0:absz^2=1:=bysimp\[h\]simp\[pow\_two,abs\_add,abs\_sub,h\_0\]ring\_nfsimp\_all\[sq\_abs\]Table 11:Proof ofexercise\_1\_14from ProofNet generated by the baseline STP model\.
## Appendix GResults on Return\-based Advantage

Table 12:Ablation study on reward strategies for return\-based advantage\.![Refer to caption](https://arxiv.org/html/2606.20068v1/x4.png)Figure 4:Training dynamics showing \(a\) outcome reward,\(b\) entropy, and \(c\) mean of response length during reinforcement learning in our method and return\-based method\.
## Appendix HFailure case and limitation of our method

imo\_2019\_p1 in MiniF2f from tactic\-reward\-trained model\.[⬇](data:text/plain;base64,aW1wb3J0IE1hdGhsaWIKaW1wb3J0IEFlc29wCgpzZXRfb3B0aW9uIG1heEhlYXJ0YmVhdHMgMAoKb3BlbiBCaWdPcGVyYXRvcnMgUmVhbCBOYXQgVG9wb2xvZ3kgUmF0Cgp0aGVvcmVtIGltb18yMDE5X3AxIChmIDogXFog4oaSIFxaKSA6CiAgICAoXGZvcmFsbCBhIGIsIGYgKDIgKiBhKSArIDIgKiBmIGIgPSBmIChmIChhICsgYikpKQogICAgXG4gXGlmZiAoXGZvcmFsbCB6LCBmIHogPSAwKSDiiKgKICAgIFxuIChcZXhpc3RzIGMsIFxmb3JhbGwgeiwgZiB6ID0gMiAqIHogKyBjKSA6PSBieQogIGNvbnN0cnVjdG9yCiAgaW50cm8gaAogIGludHJvIHoKICBoYXZlIGhfMSA6PSBoIDAgMAogIGhhdmUgaF8yIDo9IGggMCAxCiAgaGF2ZSBoXzMgOj0gaCAxIDAKICBoYXZlIGhfNCA6PSBoIDEgMQogIGhhdmUgaF81IDo9IGggMiAwCiAgaGF2ZSBoXzYgOj0gaCAyIDEKICBoYXZlIGhfNyA6PSBoIDAgMgogIGhhdmUgaF84IDo9IGggMSAyCiAgaGF2ZSBoXzkgOj0gaCAyIDIKICBzaW1wIGF0IGhfMSBoXzIgaF8zIGhfNCBoXzUgaF82IGhfNyBoXzggaF85CiAgbm9ybV9udW0gYXQgaF8xIGhfMiBoXzMgaF80IGhfNSBoXzYgaF83IGhfOCBoXzkKICBvbWVnYQogIGludHJvIGgKICBpbnRybyBhIGIKICBoYXZlIGhfMSA6PSBoICgyICogYSkKICBoYXZlIGhfMiA6PSBoIGIKICBoYXZlIGhfMyA6PSBoIChhICsgYikKICBjYXNlcycgaF8xIHdpdGggaF8xIGhfMSA8Oz4gY2FzZXMnIGhfMiB3aXRoIGhfMiBoXzIKICBcbiA8Oz4gY2FzZXMnIGhfMyB3aXRoIGhfMyBoXzMgPDs+IHNpbXBfYWxsCiAgPDs+IG9tZWdhCg==)importMathlibimportAesopset\_optionmaxHeartbeats0openBigOperatorsRealNatTopologyRattheoremimo\_2019\_p1\(f:\\Z→\\rightarrow\\Z\):\(\\forallab,f\(2\*a\)\+2\*fb=f\(f\(a\+b\)\)\)\\n\\iff\(\\forallz,fz=0\)∨\\vee\\n\(\\existsc,\\forallz,fz=2\*z\+c\):=byconstructorintrohintrozhaveh\_1:=h00haveh\_2:=h01haveh\_3:=h10haveh\_4:=h11haveh\_5:=h20haveh\_6:=h21haveh\_7:=h02haveh\_8:=h12haveh\_9:=h22simpath\_1h\_2h\_3h\_4h\_5h\_6h\_7h\_8h\_9norm\_numath\_1h\_2h\_3h\_4h\_5h\_6h\_7h\_8h\_9omegaintrohintroabhaveh\_1:=h\(2\*a\)haveh\_2:=hbhaveh\_3:=h\(a\+b\)cases'h\_1withh\_1h\_1<;\>cases'h\_2withh\_2h\_2\\n<;\>cases'h\_3withh\_3h\_3<;\>simp\_all<;\>omegaTable 13:Proof ofimo\_2019\_p1in MiniF2f generated by our tactic\-reward\-trained model\.Consider a functionf:ℤ→ℤf:\\mathbb\{Z\}\\to\\mathbb\{Z\}satisfying

∀a,b∈ℤ,f\(2a\)\+2f\(b\)=f\(f\(a\+b\)\)\.\\forall a,b\\in\\mathbb\{Z\},\\qquad f\(2a\)\+2\\,f\(b\)\\;=\\;f\\bigl\(f\(a\+b\)\\bigr\)\.

The task is to prove that necessarily one of the following holds:

1. \(i\)∀z∈ℤ,f\(z\)=0\\forall z\\in\\mathbb\{Z\},\\ f\(z\)=0, or
2. \(ii\)∃c∈ℤ,∀z∈ℤ,f\(z\)=2z\+c\\exists c\\in\\mathbb\{Z\},\\ \\forall z\\in\\mathbb\{Z\},\\ f\(z\)=2z\+c\.

Our model first introduced the assumption

h:∀a,b∈ℤ,f\(2a\)\+2f\(b\)=f\(f\(a\+b\)\),h:\\ \\forall a,b\\in\\mathbb\{Z\},\\ f\(2a\)\+2\\,f\(b\)=f\\bigl\(f\(a\+b\)\\bigr\),and then instantiated it at several concrete pairs to create hypotheseshih\_\{i\}\(e\.g\.,h1:=h\(0,0\)h\_\{1\}:=h\(0,0\),h2:=h\(0,1\)h\_\{2\}:=h\(0,1\),…\\dots\)\. After some local simplification steps \(e\.g\.,simp,norm\_num\), it attempted to close the goal using theomegatactic, a decision procedure for Presburger arithmetic \(linear integer arithmetic\)\.

However, theomegacall produced the first Lean error\. While our method correctly assigns thed2d\_\{2\}penalty to this failingomegatactic under first\-error propagation, it does not penalize the preceding tactics \(intro,have,simp\) because they elaborate successfully and thus appear locally valid\. In other words, although introducinghhand instantiatinghih\_\{i\}is not logically incorrect, this route is strategically unproductive for this problem: the remaining goal still involves quantifiers, disjunction, and an uninterpreted functionff, which lie outsideomega’s theory\. Consequently, our current scheme only punishes the terminal failing step and fails to capture that the earlier \(locally successful\) steps did not make meaningful progress toward solving the global goal\.

## Appendix ILarge Language Model Usage

In preparing this manuscript, we made limited use of large language models strictly for writing assistance\. Specifically, we used ChatGPT\-5 and Gemini\-2\.5 to improve grammar, enhance clarity of expression, and polish the overall presentation\.

## Appendix JMathematical Ground for Tactic Reward

In this section, we provide a conceptual interpretation of our tactic\-level rewards using a simple value model and potential\-based reward shaping\. Our goal is not to claim a formal optimality guarantee, but rather to clarify how the structure of our discrete Lean\-based rewards is aligned with an idealized concept of proof success under a “First error propagation” assumption with potential function\.

### J\.1Define Value function

Consider a Lean proof trajectory

s0→T1s1→T2⋯→TNsN,s\_\{0\}\\xrightarrow\{T\_\{1\}\}s\_\{1\}\\xrightarrow\{T\_\{2\}\}\\cdots\\xrightarrow\{T\_\{N\}\}s\_\{N\},
wherests\_\{t\}denotes the prefix of tactics \(T1,…,Tt−1T\_\{1\},\\dots,T\_\{t\-1\}\), andsNs\_\{N\}refers to a completed proof\.

We adopt the modeling assumption already used in the main paper:

From the first error propagation, once the first erroneous tactic occurs, the proof can no longer be repaired into a valid Lean proof\.

Formally, letjjbe the index of the first tactic for which Lean reports an error\. Then all statessts\_\{t\}witht≥jt\\geq jlie in an absorbing “failed” state\. Instead of assuming independent Bernoulli errors, we consider a more general and realistic model with conditional valid probabilities\. For a valid prefixsk−1s\_\{k\-1\}, let

q\(sk−1\)=P\(no error at stepk∣sk−1is valid\)\.q\(s\_\{k\-1\}\)=P\(\\text\{no error at step \}k\\mid s\_\{k\-1\}\\text\{ is valid\}\)\.
Under the first error propagation assumption, we define the value function as the probability of eventually producing a valid proof from a valid prefixsts\_\{t\}is

V\(st\)=P\(success∣st\)=∏k\>tq\(sk−1\)\.V\(s\_\{t\}\)=P\(\\text\{success\}\\mid s\_\{t\}\)=\\prod\_\{k\>t\}q\(s\_\{k\-1\}\)\.
Along a valid trajectory, we have

V\(st\+1\)=V\(st\)q\(st\)≥V\(st\),V\(s\_\{t\+1\}\)=\\frac\{V\(s\_\{t\}\)\}\{q\(s\_\{t\}\)\}\\geq V\(s\_\{t\}\),
soV\(st\)V\(s\_\{t\}\)is monotone increasing until no errors are founded\. if the first error occurs at stepjj, the success probability collapses to zero:

V\(sj\)=V\(sj\+1\)=⋯=0\.V\(s\_\{j\}\)=V\(s\_\{j\+1\}\)=\\cdots=0\.
Qualitatively, this yields the following structure:

- •For a successful proof,V\(st\)V\(s\_\{t\}\)increases from a small value ats0s\_\{0\}toV\(sN\)=1V\(s\_\{N\}\)=1\.
- •For a failed proof,V\(st\)V\(s\_\{t\}\)increases along the correct prefix, and then drops to 0 at the first error and stays at 0 afterwards\.

Thus, the “ideal” value function encodes \(i\) Monotone growth along error\-free prefixes and \(ii\) Irreversible collapse after the first error\.

This structure motivates using stronger positive credit for tactics on an error\-free prefix and negative or neutral credit after the first failure\.

### J\.2Potential\-Based Reward Shaping withV\(s\)V\(s\)

The value defined in previous section suggests a way to define potential\-based shaping\. In an idealized MDP setting where the environment state is exactlysts\_\{t\}and the agent has access to the value functionV\(s\)V\(s\), one could define a potential

Φ⋆\(s\)=f\(V\(s\)\),\\Phi^\{\\star\}\(s\)=f\\bigl\(V\(s\)\\bigr\),whereffis any monotonically increasing transformation \(e\.g\.,f\(v\)=vf\(v\)=v\)\. A shaped reward can then be written as

rt⋆=routcome,t\+γΦ⋆\(st\+1\)−Φ⋆\(st\),r\_\{t\}^\{\\star\}\\;=\\;r\_\{\\text\{outcome\},t\}\+\\gamma\\Phi^\{\\star\}\(s\_\{t\+1\}\)\-\\Phi^\{\\star\}\(s\_\{t\}\),
whereroutcome,tr\_\{\\text\{outcome\},t\}is the sparse end\-of\-proof reward derived fromg\(Y\)g\(Y\)\. Under the assumptions of\(Nget al\.,[1999](https://arxiv.org/html/2606.20068#bib.bib62)\), such potential\-based shaping preserves the set of optimal policies\.

Intuitively, usingΦ⋆\(s\)=f\(V\(s\)\)\\Phi^\{\\star\}\(s\)=f\(V\(s\)\)means that the potential is highest on globally successful trajectories, increases along error\-free prefixes, and collapses after the first error, similar with the structure ofV\(s\)V\(s\)in previous section\. The corresponding temporal\-difference term

γΦ⋆\(st\+1\)−Φ⋆\(st\)\\gamma\\Phi^\{\\star\}\(s\_\{t\+1\}\)\-\\Phi^\{\\star\}\(s\_\{t\}\)acts as a local improvement signal: it is positive along valid contexts, negative when the value collapses at the first error, and zero afterwards\.

In our setting, however, we neither assume access to the trueV\(s\)V\(s\)\. We therefore view this potential\-based construction as a*normative model*that suggests the qualitative shape of a desirable local credit signal, rather than as a source of formal optimality guarantees\.

### J\.3Lean\-Based Discrete Approximation as Quantized Local Shaping

In practice, we did not estimateV\(s\)V\(s\)orΦ⋆\(s\)\\Phi^\{\\star\}\(s\)explicitly\. Instead, we exploit Lean’s symbolic feedback \(AST errors, first\-error propagation\) to construct discrete tactic\-level scores\.

For a proofYY, with first error indexjj\(if any\), recall that we define

φ\(Y,Tt\)=\{1,ifg\(Y\)=1,d1,ifg\(Y\)=0andt<j,d2,ifg\(Y\)=0andt≥j,1\>d1\>d2\.\\varphi\(Y,T\_\{t\}\)=\\begin\{cases\}1,&\\text\{if \}g\(Y\)=1,\\\\\[2\.5pt\] d\_\{1\},&\\text\{if \}g\(Y\)=0\\text\{ and \}t<j,\\\\\[2\.5pt\] d\_\{2\},&\\text\{if \}g\(Y\)=0\\text\{ and \}t\\geq j,\\end\{cases\}\\qquad 1\>d\_\{1\}\>d\_\{2\}\.as the process\-level reward for tacticTtT\_\{t\}\.

Conceptually,φ\(Y,Tt\)\\varphi\(Y,T\_\{t\}\)is a coarse, Lean\-driven*quantization*of the ideal local improvement signal suggested by the value model\. States on globally successful trajectories receive the highest score \(11\); states on error\-free prefixes of failed proofs receive an intermediate score \(d1d\_\{1\}\); and states at or after the first error receive the lowest score \(d2d\_\{2\}\)\. This partitions trajectories into three regions whose ordering \(success\>\>pre\-error\>\>post\-error\) is aligned with the ordering ofV\(s\)V\(s\)implied by previous section\.

Assumingγ=1\\gamma=1and a finite horizon, any such per\-step process reward sequence can be written as a difference of a state potential\. For a fixed trajectoryYYof lengthNN, defineΦ\\Phibackwards by

Φ\(sN\)=0,Φ\(st\+1\)−Φ\(st\)=rprocess,t,t=N−1,…,0\.\\Phi\(s\_\{N\}\)=0,\\qquad\\Phi\(s\_\{t\+1\}\)\-\\Phi\(s\_\{t\}\)=r\_\{\\mathrm\{process\},t\},\\quad t=N\-1,\\dots,0\.
By construction,

so the total shaped reward becomes

rt′=routcome,t\+rprocess,t=routcome,t\+Φ\(st\+1\)−Φ\(st\)\.r^\{\\prime\}\_\{t\}=r\_\{\\text\{outcome\},t\}\+r\_\{\\mathrm\{process\},t\}=r\_\{\\text\{outcome\},t\}\+\\Phi\(s\_\{t\+1\}\)\-\\Phi\(s\_\{t\}\)\.This potentialΦ\\Phiis not intended as an estimate of the true value functionV\(s\)V\(s\); rather, it is an implicit potential induced by our discrete Lean\-based scoring rule\. The key point is that its level sets respect the same qualitative ordering \(success\>\>pre\-error\>\>post\-error\) as the ideal value model, providing a theoretically motivated yet practical shaping signal\.

### J\.4Discussion and Limitations

Our analysis suggests the following:

- •Under a first\-error propagation assumption, an ideal value functionV\(s\)V\(s\)for Lean proofs increases along error\-free prefixes and collapses to zero after the first error\.
- •Our discrete tactic\-level scoresφ\(Y,Tt\)∈1,d1,d2\\varphi\(Y,T\_\{t\}\)\\in\{1,d\_\{1\},d\_\{2\}\}can be viewed as a quantized local improvement signal, capturing this qualitative structure without estimatingV\(s\)V\(s\)explicitly\.
- •For any given trajectory, the resulting process rewardsrprocess,tr\_\{\\mathrm\{process\},t\}can be written as a potential\-based shaping termrprocess,t=Φ\(st\+1\)−Φ\(st\)r\_\{\\mathrm\{process\},t\}=\\Phi\(s\_\{t\+1\}\)\-\\Phi\(s\_\{t\}\)for a suitable potentialΦ\\Phi\.

Consequently, our use of potential\-based shaping should be understood as a theoretical framework that explains the structure of our rewards and motivates our design choices, rather than as a strict proof that our procedure preserves the optimal policy for the original sparse outcome reward\. Empirically, we observe that this verifier\-informed, discretized shaping leads to more stable training and consistent improvements over outcome\-only GRPO on MiniF2F and ProofNet\. We emphasize that we do not claim any formal optimality guarantee for this shaped reward in our large\-scale LLM and Lean setting; the potential\-based perspective is used purely as a conceptual framework for designing and interpreting our tactic\-level rewards\.
Process-Verified Reinforcement Learning for Theorem Proving via Lean

Similar Articles

Distilling LLM Feedback for Lean Theorem Proving

Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation

Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4

OProver: A Unified Framework for Agentic Formal Theorem Proving

LaTER: Efficient Test-Time Reasoning via Latent Exploration and Explicit Verification

Submit Feedback

Similar Articles

Distilling LLM Feedback for Lean Theorem Proving
Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation
Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4
OProver: A Unified Framework for Agentic Formal Theorem Proving
LaTER: Efficient Test-Time Reasoning via Latent Exploration and Explicit Verification