IsabeLLM: Automated Theorem Proving Applied to Formally Verifying Consensus

arXiv cs.AI Papers

Summary

This paper presents improvements to IsabeLLM, an automated theorem proving tool built on Isabelle, by integrating a retrieval-augmented generation framework, error tracing, and counterexample generation. The improved tool is evaluated on the formal verification of Bitcoin's Proof of Work consensus protocol.

arXiv:2606.18098v1 Announce Type: new Abstract: Advances in Artificial Intelligence (AI) have led AI for Theorem Proving to become a promising means of formally verifying computer systems. Whilst formal verification is traditionally reserved for safety-critical systems due to the required amount of expertise and effort, AI can help to automate a large amount of this workload and make it far more accessible. Blockchain-based systems are becoming increasingly popular and are frequently targeted by malicious actors, often resulting in huge financial losses, highlighting the need to better verify these systems and mitigate vulnerabilities. Arguably the most important component of these systems is the consensus protocol, which allows nodes to agree on decisions in a potentially adversarial environment. In this paper, we improve upon IsabeLLM, the automated theorem proving tool in Isabelle. Namely, we implement a Retrieval-Augmented Generation framework, Error tracing and counterexample generation for improved context supplied to the Large Language Model. Compatibility with the latest version of Isabelle and Sledgehammer is also implemented for improved efficiency. We compare the performance of the two versions of IsabeLLM in their ability to complete the verification of Bitcoin's Proof of Work consensus.
Original Article
View Cached Full Text

Cached at: 06/17/26, 05:41 AM

# IsabeLLM: Automated Theorem Proving Applied to Formally Verifying Consensus
Source: [https://arxiv.org/html/2606.18098](https://arxiv.org/html/2606.18098)
andWilliam KnottenbeltImperial College LondonLondonUnited Kingdom

###### Abstract\.

Advances in Artificial Intelligence \(AI\) have led AI for Theorem Proving to become a promising means of formally verifying computer systems\. Whilst formal verification is traditionally reserved for safety\-critical systems due to the required amount of expertise and effort, AI can help to automate a large amount of this workload and make it far more accessible\. Blockchain\-based systems are becoming increasingly popular and are frequently targeted by malicious actors, often resulting in huge financial losses, highlighting the need to better verify these systems and mitigate vulnerabilities\. Arguably the most important component of these systems is the consensus protocol, which allows nodes to agree on decisions in a potentially adversarial environment\. In this paper, we improve upon IsabeLLM, the automated theorem proving tool in Isabelle\. Namely, we implement a Retrieval\-Augmented Generation framework, Error tracing and counterexample generation for improved context supplied to the Large Language Model\. Compatibility with the latest version of Isabelle and Sledgehammer is also implemented for improved efficiency\. We compare the performance of the two versions of IsabeLLM in their ability to complete the verification of Bitcoin’s Proof of Work consensus\.

Theorem Proving, Formal Verification, Blockchain, Consensus, Artificial Intelligence

††ccs:Security and privacy Logic and verification## 1\.Introduction

A blockchain enables peer\-to\-peer digital transactions without the need for a trusted intermediary\. This is only possible because of its consensus protocol, which allows nodes within the system to agree on the state of the blockchain, even in the presence of adversaries\. For this reason, it is paramount that consensus is designed and implemented correctly to prevent the system from reaching unwanted states that can be exploited by adversaries\. The most famous example of this is Bitcoin’s Proof of Work \(PoW\) consensus protocol and its susceptibility to a 51% attack, where adversaries control the majority of the compute power in the system and giving them the potential to double spend\. Infamous examples of such attacks include Ethereum Classic\(CoinDesk,[2021](https://arxiv.org/html/2606.18098#bib.bib4)\), Bitcoin Gold\(CoinDesk,[2020](https://arxiv.org/html/2606.18098#bib.bib5)\), and Vertcoin\(CoinDesk,[2019](https://arxiv.org/html/2606.18098#bib.bib6)\), totalling losses of over $30 million\.

Other key components of modern blockchain systems are bridging protocols for cross\-chain data transfer and smart contracts for automated agreement execution\. These components are also not without their exploits, with infamous examples such as the Poly Network\(Reuters,[2021](https://arxiv.org/html/2606.18098#bib.bib7)\), Wormhole Bridge\(Investopedia,[2022](https://arxiv.org/html/2606.18098#bib.bib8)\), Binance Smart Chain\(Decrypt,[2024](https://arxiv.org/html/2606.18098#bib.bib9)\), and Qubit Finance\(Chainalysis,[2023](https://arxiv.org/html/2606.18098#bib.bib10)\), totalling losses of over $1\.5 Billion\. These failures further underscore the need to ensure correctness across the domain\.

Formal verification is the process of formalising a system and then mathematically proving its correctness\. However, it is often underutilised in the software development process because of the large amount of effort and expertise it requires\. Whilst organisations like the Ethereum Foundation actively fund formal verification work in the space, it continues to see the huge financial losses discussed previously\. Furthermore, blockchain systems cannot rely on the traditional ‘test and patch’model for their consensus protocols as patching would require a hard fork, such as the Ethereum/Ethereum Classic split\(CoinBase,[2016](https://arxiv.org/html/2606.18098#bib.bib11)\)\. Hard Forks are disruptive and controversial as they challenge blockchain’s core principle of immutability\. Furthermore, patching smart contracts is often impossible once they have been deployed on a blockchain as they are usually immutable\(Investopedia,[2024](https://arxiv.org/html/2606.18098#bib.bib12)\)with some exceptions\(Alchemy,[2024](https://arxiv.org/html/2606.18098#bib.bib13)\)\. This outlines the need for formal verification, as it can be used for correctness\-by\-construction\(Bordiset al\.,[2023](https://arxiv.org/html/2606.18098#bib.bib14)\)and minimise the need for costly post\-deployment fixes\.

In recent years, the field of Artificial Intelligence has made incredible progress, particularly within the realm of Large Language Models \(LLMs\) like OpenAI’s ChatGPT and High\-Flyer’s DeepSeek\. This advancement has opened up new opportunities across all domains, including the formal verification space\. In particular, AI for theorem proving has gained traction and has started to see applications outside of purely mathematical statements and instead for program verification\. An example of this is FVEL\(Linet al\.,[2024](https://arxiv.org/html/2606.18098#bib.bib15)\), which is used to assist in automated verification of C/C\+\+ programs in the Isabelle proof assistant\. This reduces the entry barrier for the formal verification of such programs, making it more accessible and less time\-consuming\.

In our original paper, we presented IsabeLLM\(Jones and Knottenbelt,[2025](https://arxiv.org/html/2606.18098#bib.bib59)\), a tool that integrates the proof assistant Isabelle with a Large Language Model to assist and automate proofs\. We demonstrated the effectiveness of IsabeLLM by using it to prove the correctness of a novel model for Bitcoin’s Proof\-of\-Work consensus protocol\.

We expand on the original paper by introducing IsabeLLM\-RAG, an improved version of IsabeLLM\. Specifically, we made the following improvements:

1. \(1\)Implemented a Retrieval\-Augmented Generation \(RAG\) database of proofs from the binary tree model in\(Jones and Marmsoler,[2024](https://arxiv.org/html/2606.18098#bib.bib16)\)for improved domain\-specific context\.
2. \(2\)Implemented the Nitpick Counterexample Generator\(Blanchette and Nipkow,[2010](https://arxiv.org/html/2606.18098#bib.bib60)\)to identify logically false proof steps\.
3. \(3\)Implemented an error trace that captures all modifications between LLM calls, providing the history of changes as context and highlighting which proof steps have been corrected and should be preserved\.
4. \(4\)Compatibility with Isabelle 2025 and general efficiency improvements\.
5. \(5\)Changed LLM from DeepSeek R1 to DeepSeek R1T2 Chimera for improved efficiency\.

We discuss these changes in more detail in Section[5\.3](https://arxiv.org/html/2606.18098#S5.SS3)\.

## 2\.Background

### 2\.1\.Blockchain

A blockchain is a decentralised ledger that allows multiple parties to carry out transactions without the need of a trusted intermediary, eliminating the need for trust\. This is only possible through a blockchain’s consensus protocol, which allows all parties to agree on the current state of the blockchain and the transactions recorded on it\. The most popular consensus protocol is Proof of Work \(PoW\) used by Bitcoin’s blockchain, which has around 1\.2 billion recorded transactions\(Blockchain\.com,[2025](https://arxiv.org/html/2606.18098#bib.bib19)\)with Bitcoin’s market capitalisation sitting around $1\.8 trillion\(CoinMarketCap,[2025](https://arxiv.org/html/2606.18098#bib.bib18)\)\. The core idea of PoW is that the longest blockchain is correct since it assumes the majority of computing power within the system is honest and therefore should be able to solve hashes and add blocks faster than adversaries\(Nakamoto,[2008](https://arxiv.org/html/2606.18098#bib.bib2)\)\.

### 2\.2\.Isabelle

Isabelle is a proof assistant written in Scala and ML that uses Higher\-Order Logic \(HOL\)\. It is used to write and verify formal proofs with high assurance due to the mechanisation of these proofs\(Nipkowet al\.,[2002](https://arxiv.org/html/2606.18098#bib.bib3)\)\. Isabelle’s Isar proof language allows these proofs to be more readable than the traditional approach to theorem proving by repeatedly applying tactics\. Isabelle also makes use of automation tools like Sledgehammer, which uses external automated theorem provers \(ATPs\) to help you complete proofs\. Outside of the proof assistant itself, the Scala library Scala\-Isabelle provides the functionality to interact with an Isabelle process inside of a Scala application\(Unruh,[2022](https://arxiv.org/html/2606.18098#bib.bib20)\)\.

### 2\.3\.Retrieval\-Augmented Generation

Retrieval\-Augmented Generation \(RAG\) is the technique of using external data to improve the performance of LLMs\. Although LLMs are great generalist tools, they often struggle in niche domains where there is little available data, resulting in the LLM hallucinating an incorrect answer\. RAG aims to combat this by providing a database of domain\-specific data to help give the LLM more context and understand the problem\.

A RAG framework consists of a retrieval mechanism and a generative LLM\. Data is first mapped into a vector space such that the closer data points are to one another, the more similar they are\. The retriever searches this vector space to find the most relevant information to a given query and then includes it in the prompt fed to the LLM\(Lewis and others,[2020](https://arxiv.org/html/2606.18098#bib.bib56)\)\.

In the context of AI for Theorem Proving, RAG is used to identify similar statements to what we are currently trying to prove\. By seeing how similar statements have been proven in the past, it helps the LLM understand how to prove the new statement\. RAG\-based techniques have been used in both Lean\(Thakuret al\.,[2023](https://arxiv.org/html/2606.18098#bib.bib57)\)and Isabelle\(Firstet al\.,[2023](https://arxiv.org/html/2606.18098#bib.bib58)\)for purely mathematical proofs but has yet to be used in the context of formal verification\.

## 3\.Related Work

Isabelle has been used extensively in the last 20 years to carry out numerous verifications\. Some of the most notable verifications include the seL4 Microkernel\(Kleinet al\.,[2010](https://arxiv.org/html/2606.18098#bib.bib21)\), the ML compiler\(Hupel and Nipkow,[2018](https://arxiv.org/html/2606.18098#bib.bib22)\), and numerous protocol and program verifications\(Gomeset al\.,[2017](https://arxiv.org/html/2606.18098#bib.bib23); Fosteret al\.,[2021](https://arxiv.org/html/2606.18098#bib.bib24); Marić,[2010](https://arxiv.org/html/2606.18098#bib.bib26)\)\. Outside of verification, Isabelle has been used to formalise a large amount of mathematics that can be found in the Archive of Formal Proofs \(AFP\)\(Isabelle/HOL,[2004](https://arxiv.org/html/2606.18098#bib.bib27)\)\. Some of the most notable formalisms in the AFP include Gödel’s incompleteness theorems\(Paulson,[2015](https://arxiv.org/html/2606.18098#bib.bib28)\), Jordan curve theorem\(Thiemann and Yamada,[2016](https://arxiv.org/html/2606.18098#bib.bib29)\), and Ramsey’s Theorem\(Paulson,[2004](https://arxiv.org/html/2606.18098#bib.bib30)\)\. In recent years, Isabelle has been used for verifications and formalisms of blockchain systems, including the Ethereum Virtual Machine\(Amaniet al\.,[2018](https://arxiv.org/html/2606.18098#bib.bib31)\)and a framework to verify solidity smart contracts\(Marmsoler and Brucker,[2021](https://arxiv.org/html/2606.18098#bib.bib32)\)\.

Outside of Isabelle, various other theorem provers have been used to carry out verifications in the blockchain domain\. To name a few, Agda\(Setzer,[2018](https://arxiv.org/html/2606.18098#bib.bib35); Alhabardiet al\.,[2022](https://arxiv.org/html/2606.18098#bib.bib33); Alhabardi and Setzer,[2023](https://arxiv.org/html/2606.18098#bib.bib34)\), Coq\(Sun and Yu,[2020](https://arxiv.org/html/2606.18098#bib.bib37); Yanget al\.,[2020](https://arxiv.org/html/2606.18098#bib.bib36); Alturkiet al\.,[2019](https://arxiv.org/html/2606.18098#bib.bib38)\), and Lean\(Nethermind,[2024](https://arxiv.org/html/2606.18098#bib.bib40); Pusceddu and Bartoletti,[2024](https://arxiv.org/html/2606.18098#bib.bib39)\)have been used for the formalisms of blockchain\. The field has also started to see formalisms of Decentralised Finance \(DeFi\) components\(Bartolettiet al\.,[2021](https://arxiv.org/html/2606.18098#bib.bib43),[2022](https://arxiv.org/html/2606.18098#bib.bib42); Bartoletti and Zunino,[2023](https://arxiv.org/html/2606.18098#bib.bib41)\)but are yet to be mechanised\. Other major works within the space include a formalism of the Ethereum Virtual Machine in the K\(Hildenbrandtet al\.,[2018](https://arxiv.org/html/2606.18098#bib.bib44)\), Certora Prover\(Certora,[2024](https://arxiv.org/html/2606.18098#bib.bib45)\)for semi\-automated smart contract verification, and fuzzer Echidna\(Griecoet al\.,[2020](https://arxiv.org/html/2606.18098#bib.bib65)\)\. Most recently, PropertyGPT\(Liuet al\.,[2024](https://arxiv.org/html/2606.18098#bib.bib66)\)used LLMs trained on Certora auditing reports to formally verify smart contracts\. Zero\-Knowledge Proof circuits, which are widely used in blockchain systems, have also seen increased interest in formal analysis\(Wenet al\.,[2024](https://arxiv.org/html/2606.18098#bib.bib71)\)\.

The field of AI for theorem proving has seen the development of major data sets for proof assistants in recent years, including IsarStep and PISA for Isabelle\(Liet al\.,[2020](https://arxiv.org/html/2606.18098#bib.bib47); Jianget al\.,[2021](https://arxiv.org/html/2606.18098#bib.bib48)\), LeanDojo for Lean\(Yanget al\.,[2023](https://arxiv.org/html/2606.18098#bib.bib49)\), and GamePad and CoqGym for Coq\(Yang and Deng,[2019](https://arxiv.org/html/2606.18098#bib.bib51)\)\. Using these datasets has allowed for the development of various theorem proving models, including LEGO\-Prover\(Wanget al\.,[2023](https://arxiv.org/html/2606.18098#bib.bib52)\), LISA\(Jianget al\.,[2021](https://arxiv.org/html/2606.18098#bib.bib48)\)and DeepSeek\-Prover\(Xinet al\.,[2024](https://arxiv.org/html/2606.18098#bib.bib53)\)\. Artificial Intelligence for formal verification has seen limited use, with the aforementioned FVEL\(Linet al\.,[2024](https://arxiv.org/html/2606.18098#bib.bib15)\)being the major work in this area\. As for AI for formal verification of blockchain, the literature is sparse and has only seen research into extracting smart contract specifications from natural language\(Leiteet al\.,[2024](https://arxiv.org/html/2606.18098#bib.bib54)\)\.

## 4\.Model

Our consensus model builds on previous work\(Jones and Marmsoler,[2024](https://arxiv.org/html/2606.18098#bib.bib16)\)by generalising the blockchain structure from a binary tree to an n\-ary tree\. This extension enables the model to account for an arbitrary number of forks at any given block, reflecting a more realistic view of a blockchain\. We prove that consensus holds in a majority honest network using the common prefix and chain quality properties outlined in the Bitcoin Backbone Protocol\(Garayet al\.,[2015](https://arxiv.org/html/2606.18098#bib.bib1)\), where they are discussed in more detail\. We make the same assumptions of majority honesty and synchronisation in the network, meaning that the majority of the computing power in the network is honest and that everyone shares the same view of the blockchain\. Moreover, we assume block difficulty remains constant, simplifying Bitcoin’s heaviest chain rule to the longest chain rule\. We make these assumptions to reduce complexity in our model and prevent the need to switch to a computational model\. Whilst such models offer a more realistic verification, the objective of this research is to assess the efficacy of LLMs in verification\. This model serves as a sufficient baseline for future verifications of increased complexity\.

As in the previous work, we can omit the chain quality property under our majority honesty assumption, leaving us with the common prefix property which states that all honest parties agree on a common chain up to the lastkkblocks in a chain\. This is a safety property, showing honest nodes do not diverge except near the tip of the chain\. We first define some preliminaries before specifying our consensus theorem\.

###### Definition 4\.1 \(nn\-ary Trees and Height Function\)\.

Let𝒯\\mathcal\{T\}be the set ofnn\-ary trees\. A treet∈𝒯t\\in\\mathcal\{T\}is defined as a tuplet=\(x,t​s\)t=\(x,ts\)wherexxis the node ID andt​s=\[t1,t2,…,tk\]ts=\[t\_\{1\},t\_\{2\},\\dots,t\_\{k\}\]is a finite list of subtrees, such that eachti∈𝒯t\_\{i\}\\in\\mathcal\{T\}andk≥0k\\geq 0\. LetH​e​i​g​h​t:𝒯→ℕHeight:\\mathcal\{T\}\\to\\mathbb\{N\}be a function returning the height of a tree\.

###### Definition 4\.2 \(Depth Check Function\)\.

For depthsn,d∈ℕn,d\\in\\mathbb\{N\}, the functionC​h​e​c​k:ℕ×ℕ×𝒯→𝔹Check:\\mathbb\{N\}\\times\\mathbb\{N\}\\times\\mathcal\{T\}\\to\\mathbb\{B\}is defined as follows:

C​h​e​c​k​\(0,d,t\)=T​r​u​eCheck\(0,d,t\)=TrueC​h​e​c​k​\(n\+1,d,\(x,t​s\)\)=∃p∈t​s:\(H​e​i​g​h​t​\(p\)\>d\+maxp′∈t​s∖\{p\}⁡H​e​i​g​h​t​\(p′\)∧C​h​e​c​k​\(n,d,p\)\)Check\(n\+1,d,\(x,ts\)\)=\\exists p\\in ts:\\left\(Height\(p\)\>d\+\\max\_\{p^\{\\prime\}\\in ts\\setminus\\\{p\\\}\}Height\(p^\{\\prime\}\)\\land Check\(n,d,p\)\\right\)
This function ensures that no other branch is less thanddnodes away from the longest chain up to depthnnof the tree\.

###### Definition 4\.3 \(Events\)\.

An evente∈ℰe\\in\\mathcal\{E\}is defined as a tuple\(b,t\)∈𝔹×𝒯\(b,t\)\\in\\mathbb\{B\}\\times\\mathcal\{T\}, wherebbis a boolean flag indicating if the event was honest, andttis the state of the tree immediately after the event\.ℒ​\(ℰ\)\\mathcal\{L\}\(\\mathcal\{E\}\)denotes the set of all finite sequences of events, representing the traces of the blockchain\.

###### Definition 4\.4 \(Traces\)\.

Leta​d​dh:𝒯→𝒯add\_\{h\}:\\mathcal\{T\}\\to\\mathcal\{T\}anda​d​dd:𝒯→𝒯add\_\{d\}:\\mathcal\{T\}\\to\\mathcal\{T\}be the transition functions for honest \(adding to the longest chain\) and dishonest \(adding to any branch\) block additions, respectively\. LetS​t​a​t​e:ℰ→𝒯State:\\mathcal\{E\}\\to\\mathcal\{T\}extract the tree from an event,h​d:ℒ​\(ℰ\)→ℰhd:\\mathcal\{L\}\(\\mathcal\{E\}\)\\to\\mathcal\{E\}return the most recent event in a trace, andCb​o​o​l:ℒ​\(ℰ\)→ℕC\_\{bool\}:\\mathcal\{L\}\(\\mathcal\{E\}\)\\to\\mathbb\{N\}return the number of events in a trace matching the specified boolean flag\.

Given a starting treet0t\_\{0\}under the assumption thatC​h​e​c​k​\(n,H​e​i​g​h​t​\(t0\)−n\+1,t0\)Check\(n,Height\(t\_\{0\}\)\-n\+1,t\_\{0\}\)holds, the inductive set of valid blockchain traces𝒮⊆ℒ​\(ℰ\)\\mathcal\{S\}\\subseteq\\mathcal\{L\}\(\\mathcal\{E\}\)is defined by the following rules:

1. \(1\)Honest Base Case: \[\(T​r​u​e,a​d​dh​\(t0\)\)\]∈𝒮\[\(True,add\_\{h\}\(t\_\{0\}\)\)\]\\in\\mathcal\{S\}
2. \(2\)Dishonest Base Case: \[\(F​a​l​s​e,a​d​dd​\(t0\)\)\]∈𝒮\[\(False,add\_\{d\}\(t\_\{0\}\)\)\]\\in\\mathcal\{S\}
3. \(3\)Honest Inductive Step:For any existing tracet​r∈𝒮tr\\in\\mathcal\{S\}: \(True,addh\(State\(hd\(tr\)\)\)\)::tr∈𝒮\(True,add\_\{h\}\(State\(hd\(tr\)\)\)\)::tr\\in\\mathcal\{S\}
4. \(4\)Dishonest Inductive Step:For any existing tracet​r∈𝒮tr\\in\\mathcal\{S\}, given the conditionCF​\(t​r\)<CT​\(t​r\)\+\(H​e​i​g​h​t​\(t0\)−n\)C\_\{F\}\(tr\)<C\_\{T\}\(tr\)\+\(Height\(t\_\{0\}\)\-n\)holds: \(False,addd\(State\(hd\(tr\)\)\)\)::tr∈𝒮\(False,add\_\{d\}\(State\(hd\(tr\)\)\)\)::tr\\in\\mathcal\{S\}

###### Theorem 4\.5 \(Consensus\)\.

LetL​o​n​g​e​s​t​P​a​t​h​s:ℒ​\(ℰ\)→𝒫​\(𝒯\)LongestPaths:\\mathcal\{L\}\(\\mathcal\{E\}\)\\to\\mathcal\{P\}\(\\mathcal\{T\}\)return the set of the longest branches, and let the functiont​a​k​e:ℕ×𝒯→ℕtake:\\mathbb\{N\}\\times\\mathcal\{T\}\\to\\mathbb\{N\}return the prefix of a branch up to the specified depthnn\. For all valid traces, all longest paths share the same prefix up to depthnn:

∀t​r∈𝒮,∀p,p′∈L​o​n​g​e​s​t​P​a​t​h​s​\(S​t​a​t​e​\(h​d​\(t​r\)\)\):t​a​k​e​\(n,p\)=t​a​k​e​\(n,p′\)\\forall tr\\in\\mathcal\{S\},\\forall p,p^\{\\prime\}\\in LongestPaths\(State\(hd\(tr\)\)\):take\(n,p\)=take\(n,p^\{\\prime\}\)

This statement is identical to the consensus statement for the binary tree model\. However, the generalisation to an n\-ary tree significantly increases the complexity of the proof\. In the binary tree case, inductive arguments typically require only two cases \(e\.g\., left and right subtrees\), whereas the n\-ary setting necessitates reasoning over an arbitrary number of branches, complicating case distinctions and inductive reasoning\. To show this, Table[1](https://arxiv.org/html/2606.18098#S4.T1)gives a description for each lemma and the Lines of Proof \(LoP\) required to complete the verification of each model\. We only list the lemmas that were more than one LoP in at least one of the models\. In the binary tree column, “N/A” means that the lemma was not required for the verification\. For the rows withx\+yx\+y,xxis the LoP that are ‘original’ andyyis the LoP that are symmetric toxxand are just repeated for the different cases\. With this in mind, it is clear that the n\-ary tree model has more than double the original LoP when compared to the binary tree model\.

Table 1\.Lines of Proof \(LoP\) and Lemma Descriptions for each tree model\.![Refer to caption](https://arxiv.org/html/2606.18098v1/Images/new_architecture.png)Figure 1\.IsabeLLM Architecture\.Shows the interface between Isabelle, Sledgehammer and the LLM\.
## 5\.IsabeLLM

IsabeLLM is an interface between the Isabelle proof assistant and an LLM\. It is designed for general purpose and so can be used to prove any kind of statements within Isabelle\. It should be noted that if you are using bespoke imports for your theory file, then they should be given to the LLM as context for it to understand\. In our models, we are only importing Isabelle’s Main library, and everything is contained within the single theory file, meaning we do not need to provide extra context\.

Table 2\.Isabelle build errors\.### 5\.1\.Architecture

The high\-level architecture for IsabeLLM can be seen in Fig\.[1](https://arxiv.org/html/2606.18098#S4.F1)\. The main idea is that we use an LLM to understand the high\-level structure of a proof and then use Isabelle’s Sledgehammer tool to solve the intermediate steps that the LLM failed \(if any\)\. The general workflow for IsabeLLM is as follows:

1. 1The user uploads their Isabelle theory file \(\.thy\) to their working directory, along with a ROOT file so that the Isabelle server knows which files to look at\. The user starts IsabeLLM\.
2. 2IsabeLLM first uses the Isabelle server to try and build the theory file\. If there are no issues with the file and all statements have been proven, then the build completes, and we are done\. If not, then IsabeLLM captures the errors raised to identify the unproven statements and extracts them\.
3. 3IsabeLLM sends the context of the theory file and the unproven lemma to the LLM via its API\. The LLM tries to prove the lemma and returns a proof of the statement\.
4. 4IsabeLLM injects the new proof into the theory file and tries to build it again\. If this fails, we send the file to Isabelle’s Sledgehammer tool\.
5. 5Sledgehammer tries to solve each unproven line within the proof\. If some are left unproven, then IsabeLLM extracts these lines and their errors, along with the rest of the updated theory file\.
6. 6IsabeLLM returns the current proof state to the LLM and asks it to resolve the remaining errors\. The LLM returns a proof of the statement\.
7. 7Steps 4–6 are repeated until the theory file is successfully built or IsabeLLM reaches a set number of iterations\.

Fig\.[2](https://arxiv.org/html/2606.18098#S5.F2)shows an example workflow in IsabeLLM\. In this example, we prove the lemma subtree\_height which states that the height of a tree in a set of trees is always greater than or equal to the maximum height of the set of trees\. The LLM generates a proof that fails a proof step, which we then correct with Sledgehammer\. The proof is by induction over the list of trees and splitting the inductive step into the cases of whether the tree is at the head of the list or not\.

For this paper, we opted to use variations of the DeepSeek R1 model as our LLM due to its strong coding benchmarks and free access to its API\. Claude Sonnet was also considered, but was ultimately decided to be too expensive\. Other models like OpenAI’s GPT\-4 and Mistral’s Le Chat were also considered but showed poor performance during manual testing\. As for our choice of proof assistant, we chose Isabelle due to its existing automation tool Sledgehammer and integration library Scala\-Isabelle\. Although not integral, the Isar language also helps to understand the logic of the proofs and the dialogue between IsabeLLM and LLM\.

### 5\.2\.Implementation

Almost all of IsabeLLM is written in Scala, with some Python to access the LLM API using the OpenAI library\. The main reason for choosing Scala is to be able to use the Scala\-Isabelle library, which offers the functionality to control an Isabelle process from a Scala application\. In particular, we make use of Scala\-Isabelle for calling Sledgehammer in our theory file\. All of our code, including the IsabeLLM source code and theory files, can be found at\(Jones,[2025](https://arxiv.org/html/2606.18098#bib.bib55)\)\. We list IsabeLLM’s features below:

1. \(1\)Interface between Isabelle and a LLM API\.
2. \(2\)Code extraction from a theory file, including lemmas, definitions, and proofs\.
3. \(3\)Injection of code into a theory file, including lemmas, definitions, and proofs\.
4. \(4\)Sledgehammer functionality with a timeout control and option to select which provers it uses\. In this paper, we set this timeout to 60 seconds and use the default provers\.
5. \(5\)Handling of errors in the build process\. An Error trace is maintained between LLM calls for improved context\.
6. \(6\)Records and updates to the LLM chat history\.
7. \(7\)LLM Prompt Generation\.
8. \(8\)RAG Database for improved context sent to LLM\.
9. \(9\)Nitpick counterexample generator to identify logically false statements\.
10. \(10\)Controls for the maximum number of LLM iterations before timing out to prevent endless loops\.

Due to the stochastic nature of LLMs, the most challenging part of automating proof with IsabeLLM is ensuring the output from the LLM has the correct syntax to be injected into Isabelle\. Generally speaking, most models understand the syntax well enough to give a coherent response\. However, most outputs will trigger at least one error in the build process and must be handled accordingly\. Table[2](https://arxiv.org/html/2606.18098#S5.T2)highlights the general types of error that are encountered when trying to build the theory file after injecting a generated proof\.

When sending requests to the LLM, IsabeLLM automatically builds the required prompts to make the context clear\. When we first initialise a proof, we send a prompt that includes the context of the lemma we are trying to prove and everything in the theory file before it\. After the initialisation prompt, we send prompts that include only the current proof state of the lemma and error trace\. Similar proofs extracted from the RAG database are appended to the end of both messages\. The templates for these prompts can be seen in Table[3](https://arxiv.org/html/2606.18098#S5.T3)\. IsabeLLM also maintains the chat history with the LLM\. To minimise the size of our context, we reset the history after successfully proving a lemma, then update the initial context to the theory file with the updated lemma\.

Table 3\.IsabeLLM prompts\.![Refer to caption](https://arxiv.org/html/2606.18098v1/x1.png)Figure 2\.IsabeLLM example workflow\.We prove the subtree\_height lemma\.
### 5\.3\.Improvements

This paper improves on the original IsabeLLM tool by making a variety of changes to improve performance and efficiency, most notably by enabling stronger context in our prompts\. The first of these is the implementation of a Retrieval\-Augmented Generation \(RAG\) framework\. To ensure the LLM receives self\-contained context, we process the Isabelle \.thy files using regular expressions to isolate definitions and lemmas\. We then build a dependency graph and perform a breadth\-first search to identify all transitive dependencies \(e\.g\., datatypes and functions\) that a lemma relies on\. These dependencies are recursively bundled with the proof text before vectorisation\. Embeddings are generated using the SentenceTransformers all\-MiniLM\-L6\-v2 model\(Reimers and Gurevych,[2019](https://arxiv.org/html/2606.18098#bib.bib63)\)and stored in a local ChromaDB instance\. The database consists of all 22 proofs \(trivial and non\-trivial\) from the previous binary tree model in\(Jones and Marmsoler,[2024](https://arxiv.org/html/2606.18098#bib.bib16)\)\. We query the vector database to find the three most similar lemmas using Euclidean distance as our metric\. These lemmas are then injected into the LLM prompt under a dedicated reference header\.

We also integrated the Nitpick counterexample generator\(Blanchette and Nipkow,[2010](https://arxiv.org/html/2606.18098#bib.bib60)\)\. In the previous paper we found that LLMs would repeatedly try to prove statements that were logically false\. Nitpick identifies these statements which we can then provide as context to the LLM to help it understand why the proof does not work and adapt to a new approach\. Fig\.[3](https://arxiv.org/html/2606.18098#S5.F3)shows an example counter example\. Here, the LLM is trying to prove the intermediate step that the height of treettis strictly less than the maximum height of all trees in the sett​sts\. However, nitpick identifies a counterexample wheretthas the maximum height and so the statement must be false\.

![Refer to caption](https://arxiv.org/html/2606.18098v1/Images/nitpick.png)Figure 3\.Nitpick ExampleLastly, we implemented an error trace that is added context to the LLM\. In the previous paper we noted the LLM would often override changes we made between LLM calls and get stuck in a loop by repeatedly changing proof steps we have already corrected\. The trace provides the exact changes that have been made since the last LLM call and highlights what should be preserved, enabling the LLM to build on our improvements instead of trying to start from scratch each call\.

Next, we aimed to improve the efficiency of IsabeLLM\. First, we made the tool compatible with Isabelle 2025\. The original tool used Isabelle 2022 as this was the version that the Scala\-Isabelle package\(Unruh,[2022](https://arxiv.org/html/2606.18098#bib.bib20)\)supported\. Since then, the package has been updated to be compatible with the latest Isabelle versions and its associated tools, including Sledgehammer\. In the previous paper we noted that the older version of Sledgehammer failed to find proofs in the allotted time but the Isabelle 2025 sledgehammer succeeded\. Updating to the latest version enables us to use these more efficient tools, improving our ability to find proofs\. Moreover, the original IsabeLLM ran different processes for the Isabelle Server and Sledgehammer\. We have updated this to consolidate the whole tool into a single process, massively improving the speed of computation and significantly reducing the required memory\.

### 5\.4\.Model Selection

In this paper, we test IsabeLLM\-RAG with 3 different models\. Namely, these are DeepSeek R1T2 Chimera\(TNG Technology Consulting GmbH,[2025](https://arxiv.org/html/2606.18098#bib.bib62)\), NVIDIA\-Nemotron\-3\-Super\-120B\-A12B\(NVIDIA,[2026](https://arxiv.org/html/2606.18098#bib.bib67)\)and OpenAI’s gpt\-oss\-120b\(OpenAI,[2025](https://arxiv.org/html/2606.18098#bib.bib68)\)\. We summarise their architectures in Table[4](https://arxiv.org/html/2606.18098#S5.T4)\. The original IsabeLLM paper used DeepSeek R1 as its model, using a standard Mixture\-of\-Experts \(MoE\) transformer architecture with 671B total parameters and around 37B active parameters, optimised via reinforcement learning for pure reasoning\. In contrast, DeepSeek R1T2 Chimera maintains the same parameter footprint but employs an Assembly of Experts \(AoE\) methodology\(Klaggeset al\.,[2025](https://arxiv.org/html/2606.18098#bib.bib61)\), combining the R1, V3\-0324, and R1\-0528 models to stabilise outputs\. It also offers a slightly larger context window\. NVIDIA Nemotron 3 Super introduces a MoE Hybrid Mamba\-Transformer architecture with 120B total and 12B active parameters, leveraging LatentMoE routing\(Elangoet al\.,[2026](https://arxiv.org/html/2606.18098#bib.bib69)\)and Multi\-Token Prediction \(MTP\)\(Gloeckleet al\.,[2024](https://arxiv.org/html/2606.18098#bib.bib70)\)to achieve a 1M token context window\. Finally, GPT\-OSS\-120B uses a traditional MoE transformer with 117B total and 5\.1B active parameters, with MXFP4 quantisation and a 128K context window\.

Our DeepSeek models are by far the largest in terms of total parameters, roughly 5 times the size of our NVIDIA and OpenAI models\. From this, we would expect our DeepSeek models to outperform the others\. Nemotron and GPT\-OSS are of similar size, with Nemotron being the most recent model and making use of recent innovations to enable a massive context window\. Moreover, GPT\-OSS is quantised with MXFP4, meaning it operates at lower precision to significantly reduce its memory footprint, while activating far fewer parameters \(5\.1B\) than our other models\.

While the Chimera model is arguably less advanced compared to the pure reasoning of the R1 model, it is much more stable and efficient\. The reason for this switch is that we noted one of the main efficiency bottlenecks of the tool is the slow response time of the LLM\. In this case, the speed increase is worth the slight decrease in performance, especially when our other improvements help mitigate this decrease\.

While the baseline R1 model is highly optimised for pure mathematical reasoning, we hypothesised that the AoE methodology in the Chimera architecture would provide greater syntax stability and inference efficiency\. As noted, a primary bottleneck of IsabeLLM is the latency of LLM generation\. Reviewing our OpenRouter API logs, DeepSeek R1’s speed typically ranged between10−2010\-20tokens per second \(tok/s\)\. In contrast, Chimera’s speed typically ranged between30−5030\-50tok/s, representing a significant latency improvement\. Furthermore, as demonstrated in Section[6](https://arxiv.org/html/2606.18098#S6), Chimera also minimised the number of corrective loops required, achieving the lowest average number of LLM iterations per successful proof \(1\.06 compared to 1\.31 for R1\)\. Both Nemotron and GPT\-OSS exhibited a similar generation speed to the Chimera model\.

Table 4\.Model Architectures

## 6\.Results

We tested IsabeLLM and IsabeLLM\-RAG by using them to prove each of the 16 non\-trivial lemmas listed in Table[1](https://arxiv.org/html/2606.18098#S4.T1)\. We attempt each proof 10 times with a maximum of 3 iterations per attempt, not counting instances when the LLM would return an empty output as this is an API failure rather than a methodological failure\. We reduced the maximum number of iterations from 5 in the first paper as we discovered that the additional iterations were almost always ineffective\. We consider an attempt to be a failure if it exceeds the maximum number of iterations or exits prematurely \(often due to syntax issues\)\. It should be noted that the LoP specified for each lemma can vary as the LLM can generate different proofs for the same statement\.

The original IsabeLLM used DeepSeek R1 as its LLM\. To evaluate the impact of underlying model architecture on our retrieval\-augmented framework, IsabeLLM\-RAG was tested using DeepSeek R1T2 Chimera, NVIDIA\-Nemotron\-3\-Super\-120B\-A12B, and OpenAI’s gpt\-oss\-120b\. As discussed in Section[5\.3](https://arxiv.org/html/2606.18098#S5.SS3), IsabeLLM\-RAG benefits from the RAG database, error trace, counterexample generator, and access to the Isabelle 2025 Sledgehammer\. All experiments were conducted on a workstation running Windows 11\. The system was equipped with an AMD Ryzen 7 5825U processor \(8 cores, 16 threads\) and 16\.0 GB of RAM\.

Table[5](https://arxiv.org/html/2606.18098#S6.T5)shows the results of each version of IsabeLLM\. Recall that in the previous paper, for the lemma branch\_height, we had to manually use the 2025 Sledgehammer to succeed\. These manual interventions have been removed so that we can more clearly see the automated improvements in IsabeLLM\-RAG\.

Generally speaking, IsabeLLM\-RAG outperforms the original across all lemmas when using the Chimera and Nemotron models\. It matches or improves the success rate for nearly every lemma while decreasing the average number of iterations per successful attempt\. Overall, the Chimera model achieved a 94\.4% success rate with a significant reduction in LLM iterations\. Nemotron achieved a 87\.5% success rate, outperforming IsabeLLM despite it using a much larger model\. The most notable improvement is for the lemma branch\_height, which previously failed without manual Sledgehammer intervention\. IsabeLLM\-RAG \(Chimera\) achieved a 90% success rate here, needing only 1 LLM call in almost every successful attempt\. Conversely, the GPT\-OSS model struggled, yielding a 67\.5% average success rate and frequently requiring the maximum number of iterations\. However, this is not far behind how IsabeLLM performed with DeepSeek R1, a much larger model without the precision loss of quantisation\.

Similar to the original paper, IsabeLLM\-RAG rarely obtained the correct answer on the very first LLM generation for complex lemmas; IsabeLLM had to apply various error corrections and leverage Sledgehammer to fully complete the proofs\. Interestingly, where the original paper saw some variation in proof approaches, particularly for larger proofs, IsabeLLM\-RAG produced highly deterministic structural approaches across iterations\.

Table 5\.Performance comparison between IsabeLLM Versions\.\(SR = Success Rate, Iter\. = Avg\. Iterations\)
## 7\.Discussion

We first examine the variance in success rates across the models\. The five lemmas that IsabeLLM\-RAG \(Chimera\) failed to prove with a 100% success rate have the largest LoP, showing that increased logical complexity inherently reduces success rate\. Exceptions to this are the honest mining variants of height\_add and check\_add, which both require a high number of LoP yet achieved perfect completions across Chimera and Nemotron\. This success is likely attributable to the RAG system\. As discussed in Section[4](https://arxiv.org/html/2606.18098#S4), the two tree models have different approaches to proof; however, the logic of adapting the proofs from the mining variant to the honest mining variant remains consistent in both models\. The LLM identifies this structural symmetry from the RAG context and applies it to prove these lemmas reliably\. On the other hand, three of the failed proofs were unique to the n\-ary tree model, meaning there was no analogous proof in the binary tree database\. It is evident that the RAG context is less effective here as it cannot provide a good template\.

As expected, our largest model Chimera performed best, achieving the highest success rate and lowest average iterations\. Interestingly, Nemotron performed exceedingly well despite its small size\. Despite having a large context window, our prompt lengths rarely exceeded 12K tokens, indicating its success does not stem from context capacity\. We attribute its performance primarily to its usage of MTP, which forces the model to learn the broader structural syntax of Isar proofs rather than prioritising myopic next\-token prediction\. GPT\-OSS’s weaker performance highlights its architectural limitations\. Its lower active parameter count and quantisation appear to restrict its capacity for deep reasoning, context retention and syntax consistency\.

The lower variability observed in LLM responses is attributable to RAG\. The use of RAG effectively prunes the potential proof space, providing a strict template on how to approach the proof and drastically reducing the chance of hallucination\. Despite the original DeepSeek R1 having an advantage in pure mathematical reasoning, we saw no instance where it outperformed the Chimera model\. However, this is also influenced by other improvements in IsabeLLM\-RAG\.

In terms of overall efficiency, IsabeLLM\-RAG exceeded its predecessor\. One of the highlighted issues in the original paper was the LLM getting stuck in a proof loop, repeatedly failing on the same step\. Proof loops were effectively mitigated across the Chimera and Nemotron tests due to the stronger context supplied by Nitpick and the error trace, allowing the LLM to learn from immediate failure\. It was also noted that the API interactions were far more reliable\.

As before, IsabeLLM is still limited to proof completion, unable to generate the initial properties to be proven\. This means that the user must specify a statement before it can be verified, including setting up functions, locales, and sets\. However, this issue will be largely mitigated if IsabeLLM is used in conjunction with existing verification frameworks rather than building from the ground up\. For example, Isabelle/Solidity\(Marmsoler and Brucker,[2021](https://arxiv.org/html/2606.18098#bib.bib32)\)builds most of the model automatically, leaving only the invariant properties to be specified\. A remaining challenge is ensuring the LLM understands the bespoke calculus of such frameworks, given the scarcity of relevant proof corpora\.

While the primary objective of IsabeLLM\-RAG is the full automation of the verification process, the system inherently facilitates proof discovery\. All proofs use Isabelle’s Isar proof language that is highly readable, allowing users to follow the logic of the proof and gain insight into potentially new approaches\. Furthermore, the integration of Nitpick enables the discovery of previously unconsidered edge cases through counterexample generation, effectively highlighting logical flaws in proof steps\.

While our current RAG implementation provides static templates based on prior successful proofs, scaling IsabeLLM to larger verifications involving multiple Isabelle theory files proves to be a challenge\. A large context becomes infeasible to include in our prompt due to the limited context window, causing our RAG methodology to break down\. In this scenario, integrating the Model Context Protocol \(MCP\)\(Anthropic,[2024](https://arxiv.org/html/2606.18098#bib.bib64)\)presents a complementary solution to our RAG strategy\. Whereas RAG effectively anchors the overarching architecture of a proof by matching structural templates, a MCP\-based approach would act as a dynamic retrieval mechanism\. By equipping the LLM with MCP tools, the model could actively query Isabelle’s environment during inference\. This dynamic retrieval would mitigate the context window bottleneck by fetching only the necessary facts\.

## 8\.Conclusion

In this paper, we improved the original IsabeLLM tool with IsabeLLM\-RAG, implementing a RAG database, counterexample generator \(Nitpick\), improved conversational error trace, and compatibility with Isabelle 2025 and its associated tools\. We evaluated IsabeLLM\-RAG’s performance using three models \- DeepSeek R1T2 Chimera, NVIDIA Nemotron 3 Super, and GPT\-OSS\-120B\. With these improvements, we were able to outperform the original tool across the board using the Chimera and Nemotron architectures, increasing success rates and decreasing the average number of LLM iterations required\. Although GPT\-OSS showed weaker performance, it was not far behind the performance of the original IsabeLLM tool, which used a much larger model\.

While our results highlight IsabeLLM’s capacity to evolve alongside state\-of\-the\-art AI and theorem\-proving infrastructure, several avenues for future work remain\. IsabeLLM could be parallelised to query different LLMs simultaneously, generating a distributed proof tree to maximise the probability of finding a valid proof path\. Furthermore, the LLMs could be fine\-tuned on domain\-specific data, such as the Archive of Formal Proofs or bespoke consensus verifications\. Moreover, IsabeLLM could be integrated with semantic embeddings of programming languages within Isabelle, the most prominent in the blockchain space being Isabelle/Solidity\. This would transition IsabeLLM from verifying abstract state\-machine models to verifying concrete smart contract bytecode, solidifying its viability in real\-world, safety\-critical applications\. Conversely, because IsabeLLM remains a general\-purpose proof assistant wrapper, its methodologies could be readily translated to formal verifications in domains entirely outside the blockchain space\.

## References

- Alchemy \(2024\)What are upgradeable smart contracts?\.Note:\[Accessed April 2025\]External Links:[Link](https://docs.alchemy.com/docs/upgradeable-smart-contracts)Cited by:[§1](https://arxiv.org/html/2606.18098#S1.p3.1)\.
- F\. F\. Alhabardi, A\. Beckmann, B\. Lazar, and A\. Setzer \(2022\)Verification of bitcoin script in Agda using weakest preconditions for access control\.Note:PreprintExternal Links:2203\.03054Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p2.1)\.
- F\. Alhabardi and A\. Setzer \(2023\)A simulator of Solidity\-style smart contracts in the theorem prover Agda\.InProceedings of the 2023 6th International Conference on Blockchain Technology and Applications \(ICBTA\),Xi’an, China,pp\. 1–11\.Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p2.1)\.
- M\. A\. Alturki, J\. Chen, V\. Luchangco, B\. Moore, K\. Palmskog, L\. Peña, and G\. Roşu \(2019\)Towards a verified model of the Algorand consensus protocol in coq\.InInternational Symposium on Formal Methods,Lecture Notes in Computer Science, Vol\.11800,Cham,pp\. 362–367\.Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p2.1)\.
- S\. Amani, M\. Bégel, M\. Bortin, and M\. Staples \(2018\)Towards verifying ethereum smart contract bytecode in Isabelle/HOL\.InProceedings of the 7th ACM SIGPLAN International Conference on Certified Programs and Proofs \(CPP\),New York, NY, USA,pp\. 66–77\.Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p1.1)\.
- Anthropic \(2024\)Note:Accessed: April 2026External Links:[Link](https://modelcontextprotocol.io/)Cited by:[§7](https://arxiv.org/html/2606.18098#S7.p7.1)\.
- M\. Bartoletti, J\. H\. Chiang, and A\. L\. Lafuente \(2021\)Towards a theory of decentralized finance\.InFinancial Cryptography and Data Security Workshops \(FC 2021\),Berlin, Heidelberg,pp\. 227–232\.Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p2.1)\.
- M\. Bartoletti, J\. H\. Chiang, and A\. Lluch\-Lafuente \(2022\)A theory of automated market makers in DeFi\.Logical Methods in Computer Science18\(4\),pp\. 12:1–12:35\.Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p2.1)\.
- M\. Bartoletti and R\. Zunino \(2023\)A theoretical basis for blockchain extractable value\.Note:PreprintExternal Links:2302\.02154Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p2.1)\.
- J\. C\. Blanchette and T\. Nipkow \(2010\)Nitpick: a counterexample generator for higher\-order logic based on a relational model finder\.InInteractive Theorem Proving \(ITP\),Lecture Notes in Computer Science, Vol\.6172,Berlin, Heidelberg,pp\. 131–146\.Cited by:[item 2](https://arxiv.org/html/2606.18098#S1.I1.i2.p1.1),[§5\.3](https://arxiv.org/html/2606.18098#S5.SS3.p2.3)\.
- Blockchain\.com \(2025\)BTC total number of transactions\.Note:\[Accessed April 2025\]External Links:[Link](https://www.blockchain.com/explorer/charts/n-transactions-total)Cited by:[§2\.1](https://arxiv.org/html/2606.18098#S2.SS1.p1.1)\.
- T\. Bordis, T\. Runge, A\. Kittelmann, and I\. Schaefer \(2023\)Correctness\-by\-construction: an overview of the corc ecosystem\.Ada Letters42\(2\),pp\. 75–78\.Cited by:[§1](https://arxiv.org/html/2606.18098#S1.p3.1)\.
- Certora \(2024\)Certora prover\.Note:\[Accessed April 2025\]External Links:[Link](https://www.certora.com/prover)Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p2.1)\.
- Chainalysis \(2023\)The $80 million qubit hack likely the work of north korea\-linked cybercriminals\.Note:\[Accessed April 2025\]External Links:[Link](https://www.chainalysis.com/blog/qubit-hack-north-korea/)Cited by:[§1](https://arxiv.org/html/2606.18098#S1.p2.1)\.
- CoinBase \(2016\)Ethereum classic and the ethereum hard fork\.Note:\[Accessed April 2025\]External Links:[Link](https://help.coinbase.com/en/coinbase/getting-started/crypto-education/eth-hard-fork)Cited by:[§1](https://arxiv.org/html/2606.18098#S1.p3.1)\.
- CoinDesk \(2019\)The Vertcoin cryptocurrency just got 51% attacked – again\.Note:\[Accessed April 2025\]External Links:[Link](https://www.coindesk.com/tech/2019/12/02/the-vertcoin-cryptocurrency-just-got-51-attacked-again)Cited by:[§1](https://arxiv.org/html/2606.18098#S1.p1.1)\.
- CoinDesk \(2020\)Bad actors rent hashing power to hit bitcoin gold with new 51% attacks\.Note:\[Accessed April 2025\]External Links:[Link](https://www.coindesk.com/tech/2020/01/27/bad-actors-rent-hashing-power-to-hit-bitcoin-gold-with-new-51-attacks)Cited by:[§1](https://arxiv.org/html/2606.18098#S1.p1.1)\.
- CoinDesk \(2021\)Ethereum classic hit by third 51% attack in a month\.Note:\[Accessed April 2025\]External Links:[Link](https://www.coindesk.com/markets/2020/08/29/ethereum-classic-hit-by-third-51-attack-in-a-month)Cited by:[§1](https://arxiv.org/html/2606.18098#S1.p1.1)\.
- CoinMarketCap \(2025\)BTC market capitalisation\.Note:\[Accessed April 2025\]External Links:[Link](https://coinmarketcap.com/currencies/bitcoin/)Cited by:[§2\.1](https://arxiv.org/html/2606.18098#S2.SS1.p1.1)\.
- Decrypt \(2024\)BNB chain hits record\-high sandwich attacks exposing $1\.5 billion in trades\.Note:\[Accessed April 2025\]External Links:[Link](https://decrypt.co/294648/bnb-smart-chain-blocks-hits-record-high-sandwich-attacks)Cited by:[§1](https://arxiv.org/html/2606.18098#S1.p2.1)\.
- V\. Elango, N\. Bhatia, R\. Waleffe, R\. Shafipour, T\. Asida, A\. Khattar, N\. Assaf, M\. Golub, J\. Guman, T\. Mitra,et al\.\(2026\)LatentMoE: toward optimal accuracy per flop and parameter in mixture of experts\.arXiv preprint arXiv:2601\.18089\.Cited by:[§5\.4](https://arxiv.org/html/2606.18098#S5.SS4.p1.1)\.
- E\. First, M\. Rabe, T\. Ringer, and Y\. Brun \(2023\)Baldur: whole\-proof generation and repair with large language models\.InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering \(ESEC/FSE\),New York, NY, USA,pp\. 1229–1241\.Cited by:[§2\.3](https://arxiv.org/html/2606.18098#S2.SS3.p3.1)\.
- S\. Foster, J\. J\. Huerta y Munive, M\. Gleirscher, and G\. Struth \(2021\)Hybrid systems verification with Isabelle/HOL: simpler syntax, better models, faster proofs\.InFormal Methods: 24th International Symposium \(FM 2021\),Lecture Notes in Computer Science, Vol\.13047,Cham,pp\. 367–386\.Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p1.1)\.
- J\. Garay, A\. Kiayias, and N\. Leonardos \(2015\)The bitcoin backbone protocol: analysis and applications\.InAnnual International Conference on the Theory and Applications of Cryptographic Techniques \(EUROCRYPT\),Lecture Notes in Computer Science, Vol\.9057,Sofia, Bulgaria,pp\. 281–310\.Cited by:[§4](https://arxiv.org/html/2606.18098#S4.p1.1)\.
- F\. Gloeckle, B\. Y\. Idrissi, B\. Rozière, D\. Lopez\-Paz, and G\. Synnaeve \(2024\)Better & faster large language models via multi\-token prediction\.arXiv preprint arXiv:2404\.19737\.Cited by:[§5\.4](https://arxiv.org/html/2606.18098#S5.SS4.p1.1)\.
- V\. B\. Gomes, M\. Kleppmann, D\. P\. Mulligan, and A\. R\. Beresford \(2017\)Verifying strong eventual consistency in distributed systems\.Proceedings of the ACM on Programming Languages1\(OOPSLA\),pp\. 1–28\.Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p1.1)\.
- G\. Grieco, W\. Song, A\. Cygan, J\. Feist, and A\. Groce \(2020\)Echidna: effective, usable, and fast fuzzing for smart contracts\.InProceedings of the 29th ACM SIGSOFT international symposium on software testing and analysis,pp\. 557–560\.Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p2.1)\.
- E\. Hildenbrandt, M\. Saxena, N\. Rodrigues, X\. Zhu, P\. Daian, D\. Guth, B\. Moore, D\. Park, Y\. Zhang, A\. Stefanescu,et al\.\(2018\)Kevm: a complete formal semantics of the ethereum virtual machine\.In2018 IEEE 31st Computer Security Foundations Symposium \(CSF\),Oxford, UK,pp\. 204–217\.Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p2.1)\.
- L\. Hupel and T\. Nipkow \(2018\)A verified compiler from isabelle/HOL to CakeML\.InEuropean Symposium on Programming \(ESOP\),Lecture Notes in Computer Science, Vol\.10801,Cham,pp\. 999–1026\.Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p1.1)\.
- Investopedia \(2022\)Crypto worth over $320 million taken in wormhole hack\.Note:\[Accessed April 2025\]External Links:[Link](https://www.investopedia.com/crypto-theft-of-usd320-million-wormhole-hack-5218062)Cited by:[§1](https://arxiv.org/html/2606.18098#S1.p2.1)\.
- Investopedia \(2024\)What are smart contracts on the blockchain and how do they work?\.Note:\[Accessed April 2025\]External Links:[Link](https://www.investopedia.com/terms/s/smart-contracts.asp)Cited by:[§1](https://arxiv.org/html/2606.18098#S1.p3.1)\.
- Isabelle/HOL \(2004\)Archive of formal proofs\.Note:\[Accessed April 2025\]External Links:[Link](https://www.isa-afp.org/)Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p1.1)\.
- A\. Q\. Jiang, W\. Li, J\. M\. Han, and Y\. Wu \(2021\)LISA: language models of isabelle proofs\.In6th Conference on Artificial Intelligence and Theorem Proving \(AITP\),Innsbruck, Austria,pp\. 378–392\.Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p3.1)\.
- E\. Jones and W\. Knottenbelt \(2025\)Towards automating blockchain consensus verification with isabellm\.InInternational Conference on Blockchain Technology and Emerging Applications,Communications in Computer and Information Science, Vol\.2100,Cham,pp\. 3–18\.Cited by:[§1](https://arxiv.org/html/2606.18098#S1.p5.1)\.
- E\. Jones and D\. Marmsoler \(2024\)Towards mechanised consensus in isabelle\.In5th International Workshop on Formal Methods for Blockchains \(FMBC 2024\),OASIcs, Vol\.117,Dagstuhl, Germany,pp\. 4:1–4:22\.Cited by:[item 1](https://arxiv.org/html/2606.18098#S1.I1.i1.p1.1),[§4](https://arxiv.org/html/2606.18098#S4.p1.1),[§5\.3](https://arxiv.org/html/2606.18098#S5.SS3.p1.1)\.
- E\. Jones \(2025\)IsabeLLM\.Note:GitHub RepositoryExternal Links:[Link](https://github.com/EllbellCode/IsabeLLM)Cited by:[§5\.2](https://arxiv.org/html/2606.18098#S5.SS2.p1.1)\.
- H\. Klagges, F\. Klemm, B\. Merkel, D\. Klingmann, D\. A\. Reiss, and D\. Zecha \(2025\)Assembly of experts: linear\-time construction of the chimera llm variants with emergent and adaptable behaviors\.Note:PreprintExternal Links:2506\.14794Cited by:[§5\.4](https://arxiv.org/html/2606.18098#S5.SS4.p1.1)\.
- G\. Klein, T\. Sewell, and S\. Winwood \(2010\)Refinement in the formal verification of the seL4 microkernel\.InDesign and Verification of Microprocessor Systems for High\-Assurance Applications,D\. S\. Warren \(Ed\.\),pp\. 323–339\.Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p1.1)\.
- G\. Leite, F\. Arruda, P\. Antonino, A\. Sampaio, and A\. Roscoe \(2024\)Extracting formal smart\-contract specifications from natural language with llms\.InInternational Conference on Formal Aspects of Component Software,Lecture Notes in Computer Science, Vol\.14330,Cham,pp\. 109–126\.Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p3.1)\.
- P\. Lewiset al\.\(2020\)Retrieval\-augmented generation for knowledge\-intensive nlp tasks\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Vol\.33,New York, NY, USA,pp\. 9459–9474\.Cited by:[§2\.3](https://arxiv.org/html/2606.18098#S2.SS3.p2.1)\.
- W\. Li, L\. Yu, Y\. Wu, and L\. C\. Paulson \(2020\)Isarstep: a benchmark for high\-level mathematical reasoning\.Note:PreprintExternal Links:2006\.09265Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p3.1)\.
- X\. Lin, Q\. Cao, Y\. Huang, H\. Wang, J\. Lu, Z\. Liu, L\. Song, and X\. Liang \(2024\)FVEL: interactive formal verification environment with large language models via theorem proving\.Note:PreprintExternal Links:2406\.14408,[Link](https://arxiv.org/abs/2406.14408)Cited by:[§1](https://arxiv.org/html/2606.18098#S1.p4.1),[§3](https://arxiv.org/html/2606.18098#S3.p3.1)\.
- Y\. Liu, Y\. Xue, D\. Wu, Y\. Sun, Y\. Li, M\. Shi, and Y\. Liu \(2024\)Propertygpt: llm\-driven formal verification of smart contracts through retrieval\-augmented property generation\.arXiv preprint arXiv:2405\.02580\.Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p2.1)\.
- F\. Marić \(2010\)Formal verification of a modern SAT solver by shallow embedding into Isabelle/HOL\.Theoretical Computer Science411\(50\),pp\. 4333–4356\.Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p1.1)\.
- D\. Marmsoler and A\. D\. Brucker \(2021\)A denotational semantics of solidity in Isabelle/HOL\.InInternational Conference on Software Engineering and Formal Methods \(SEFM\),Lecture Notes in Computer Science, Vol\.13085,Cham,pp\. 403–422\.Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p1.1),[§7](https://arxiv.org/html/2606.18098#S7.p5.1)\.
- S\. Nakamoto \(2008\)Bitcoin: a peer\-to\-peer electronic cash system\.Note:WhitepaperExternal Links:[Link](https://bitcoin.org/bitcoin.pdf)Cited by:[§2\.1](https://arxiv.org/html/2606.18098#S2.SS1.p1.1)\.
- Nethermind \(2024\)Clear–prove anything about your Solidity smart contracts\.Note:\[Accessed April 2025\]External Links:[Link](https://www.nethermind.io/blog/clear-prove-anything-about-your-solidity-smart-contracts)Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p2.1)\.
- T\. Nipkow, M\. Wenzel, and L\. C\. Paulson \(2002\)Isabelle/HOL: a proof assistant for higher\-order logic\.Lecture Notes in Computer Science, Vol\.2283,Springer,Berlin, Heidelberg\.Cited by:[§2\.2](https://arxiv.org/html/2606.18098#S2.SS2.p1.1)\.
- NVIDIA \(2026\)NVIDIA\-nemotron\-3\-super\-120b\-a12b\.External Links:[Link](https://build.nvidia.com/nvidia/nemotron-3-super-120b-a12b/modelcard)Cited by:[§5\.4](https://arxiv.org/html/2606.18098#S5.SS4.p1.1)\.
- OpenAI \(2025\)Gpt\-oss\-120b\.External Links:[Link](https://huggingface.co/openai/gpt-oss-120b)Cited by:[§5\.4](https://arxiv.org/html/2606.18098#S5.SS4.p1.1)\.
- L\. C\. Paulson \(2004\)Theory ramsey\.Note:Isabelle/HOL LibraryExternal Links:[Link](https://isabelle.in.tum.de/website-Isabelle2021-1/dist/library/HOL/HOL-Library/Ramsey.html)Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p1.1)\.
- L\. C\. Paulson \(2015\)A mechanised proof of gödel’s incompleteness theorems using Nominal Isabelle\.Journal of Automated Reasoning55\(1\),pp\. 1–37\.Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p1.1)\.
- D\. Pusceddu and M\. Bartoletti \(2024\)Formalizing automated market makers in the Lean 4 theorem prover\.Note:PreprintExternal Links:2402\.06064Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p2.1)\.
- N\. Reimers and I\. Gurevych \(2019\)Sentence\-bert: sentence embeddings using siamese bert\-networks\.Association for Computational Linguistics\.External Links:[Link](https://arxiv.org/abs/1908.10084)Cited by:[§5\.3](https://arxiv.org/html/2606.18098#S5.SS3.p1.1)\.
- Reuters \(2021\)How hackers stole $613 million in crypto tokens from poly network\.Note:\[Accessed April 2025\]External Links:[Link](https://www.reuters.com/technology/how-hackers-stole-613-million-crypto-tokens-poly-network-2021-08-12/)Cited by:[§1](https://arxiv.org/html/2606.18098#S1.p2.1)\.
- A\. Setzer \(2018\)Modelling bitcoin in Agda\.Note:PreprintExternal Links:1804\.06398Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p2.1)\.
- T\. Sun and W\. Yu \(2020\)A formal verification framework for security issues of blockchain smart contracts\.Electronics9\(2\),pp\. 255\.Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p2.1)\.
- A\. Thakur, Y\. Wu, K\. Pelrine, B\. Holmgren, J\. Ahn, Z\. Zhang, J\. Denain, K\. Clark, N\. Smith, A\. Drouin, and E\. Grefenstette \(2023\)A language\-agent approach to formal theorem\-proving\.External Links:2305\.07641,[Link](https://arxiv.org/abs/2305.07641)Cited by:[§2\.3](https://arxiv.org/html/2606.18098#S2.SS3.p3.1)\.
- R\. Thiemann and A\. Yamada \(2016\)Formalizing Jordan normal forms in Isabelle/HOL\.InProceedings of the 5th ACM SIGPLAN Conference on Certified Programs and Proofs \(CPP\),New York, NY, USA,pp\. 88–99\.Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p1.1)\.
- TNG Technology Consulting GmbH \(2025\)DeepSeek\-tng r1t2 chimera\.Hugging Face\.Note:Model version 1\.0External Links:[Link](https://huggingface.co/tngtech/DeepSeek-TNG-R1T2-Chimera)Cited by:[§5\.4](https://arxiv.org/html/2606.18098#S5.SS4.p1.1)\.
- D\. Unruh \(2022\)Scala\-isabelle – a scala library for controlling isabelle/hol\.Note:\[Accessed April 2025\]External Links:[Link](https://dominique-unruh.github.io/scala-isabelle/)Cited by:[§2\.2](https://arxiv.org/html/2606.18098#S2.SS2.p1.1),[§5\.3](https://arxiv.org/html/2606.18098#S5.SS3.p4.1)\.
- H\. Wang, H\. Xin, C\. Zheng, L\. Li, Z\. Liu, Q\. Cao, Y\. Huang, J\. Xiong, H\. Shi,et al\.\(2023\)Lego\-prover: Neural theorem proving with growing libraries\.Note:PreprintExternal Links:2310\.00656Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p3.1)\.
- H\. Wen, J\. Stephens, Y\. Chen, K\. Ferles, S\. Pailoor, K\. Charbonnet, I\. Dillig, and Y\. Feng \(2024\)Practical security analysis of zero\-knowledge proof circuits\.In33rd USENIX Security Symposium \(USENIX Security 24\),pp\. 1471–1487\.Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p2.1)\.
- H\. Xin, D\. Guo, Z\. Shao, Z\. Ren, Q\. Zhu, B\. Liu, C\. Ruan, W\. Li, and X\. Liang \(2024\)Deepseek\-prover: Advancing theorem proving in llms through large\-scale synthetic data\.Note:PreprintExternal Links:2405\.14333Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p3.1)\.
- K\. Yang and J\. Deng \(2019\)Learning to prove theorems via interacting with proof assistants\.InInternational Conference on Machine Learning \(ICML\),Long Beach, CA, USA,pp\. 6984–6994\.Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p3.1)\.
- K\. Yang, A\. Swope, A\. Gu, R\. Chalamala, P\. Song, S\. Yu, S\. Godil, R\. J\. Prenger, and A\. Anandkumar \(2023\)LeanDojo: theorem proving with retrieval\-augmented language models\.Advances in Neural Information Processing Systems36,pp\. 21573–21612\.Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p3.1)\.
- Z\. Yang, H\. Lei, and W\. Qian \(2020\)A hybrid formal verification system in coq for ensuring the reliability and security of ethereum\-based service smart contracts\.IEEE Access8,pp\. 21411–21436\.Cited by:[§3](https://arxiv.org/html/2606.18098#S3.p2.1)\.

## Appendix

All relevant code for IsabeLLM can be found at:

The repository includes:

- •Source code for IsabeLLM\.
- •Isabelle theory files for the n\-ary tree PoW model\.
- •Setup instructions\.

Similar Articles

Announcing Isabelle support for SAW

Lobsters Hottest

Galois announces that SAW now supports generating Isabelle theories from Cryptol specifications, bridging the usability of Cryptol and SAW with the expressivity of interactive theorem provers like Isabelle, enabling semi-automated verification of cryptographic protocols.

OProver: A Unified Framework for Agentic Formal Theorem Proving

Hugging Face Daily Papers

OProver is a unified framework for agentic formal theorem proving in Lean 4 that iteratively improves proof generation through training with verified proofs and compiler feedback, achieving state-of-the-art results on multiple benchmarks.