Improving LLM Code Reasoning via Semantic Equivalence Self-Play with Formal Verification
Summary
Researchers from the University of Edinburgh propose a self-play framework that uses Liquid Haskell formal verification to train LLMs on semantic equivalence reasoning, releasing the OpInstruct-HSx dataset (≈28k programs) and achieving up to 13.3pp accuracy gains on EquiBench.
# Improving LLM Code Reasoning via Semantic Equivalence Self-Play with Formal Verification
Source: [https://arxiv.org/html/2604.17010](https://arxiv.org/html/2604.17010)
Poon Tsz Nok, School of Informatics, University of Edinburgh, trevorpoon@gmail.com; Antonio Valerio Miceli Barone, School of Informatics, University of Edinburgh, antonio@ed.ac.uk
###### Abstract
We introduce a self-play framework for semantic equivalence in Haskell, using formal verification to guide adversarial training between a generator and an evaluator. The framework leverages Liquid Haskell proofs to validate equivalence and execution-based counterexamples to establish inequivalence, organized via a difficulty-aware curriculum. To facilitate this, we release OpInstruct-HSx, a synthetic dataset of ≈28k validated Haskell programs. Empirical experiments show that our evaluator transfers effectively to downstream tasks, achieving up to a 13.3pp accuracy gain on EquiBench and consistent gains on PySecDB. Ablation studies on the SEQ–SINQ regimes indicate that while inequivalence supervision provides data volume, equivalence proofs are uniquely responsible for the model's reasoning capabilities. The entire training pipeline and dataset are publicly released on GitHub and Hugging Face, respectively.
## 1 Introduction

The rise of large language models (LLMs) has reshaped how software can be generated and maintained. Although models such as Codex and Qwen2.5-Coder have demonstrated strong capabilities in producing functional code from natural language prompts (Murphy et al., [2024](https://arxiv.org/html/2604.17010#bib.bib2); Hui et al., [2024](https://arxiv.org/html/2604.17010#bib.bib1)), their outputs often fail to preserve the intended program behavior beyond basic test coverage (Laneve et al., [2025](https://arxiv.org/html/2604.17010#bib.bib3); Nguyen et al., [2025](https://arxiv.org/html/2604.17010#bib.bib4); Wei et al., [2025](https://arxiv.org/html/2604.17010#bib.bib5)). This gap raises a fundamental challenge: how can we design training that explicitly teaches models to reason about semantic equivalence between programs? Addressing this problem is critical not only for reliable code generation but also for downstream applications such as program optimization, automated refactoring, and vulnerability detection.

Current approaches largely rely on test suites, which are insufficient for capturing deep semantic properties and edge cases. To bridge this gap, we ground our framework in Haskell for three key reasons. First, its pure functional nature and strong static typing eliminate hidden state and side effects (Thompson, [2011](https://arxiv.org/html/2604.17010#bib.bib6); see Appendix [A](https://arxiv.org/html/2604.17010#A1) for an illustration), making equivalence reasoning mathematically tractable (Launchbury, [1993](https://arxiv.org/html/2604.17010#bib.bib12); Sestoft, [1997](https://arxiv.org/html/2604.17010#bib.bib13)). Second, the Liquid Haskell ecosystem enables the generation of machine-checkable proofs via refinement types, making it possible to certify semantic equivalence for a subset of Haskell programs (Liquid Haskell Tutorial, [2025](https://arxiv.org/html/2604.17010#bib.bib11)). This offers a source of formal supervision that is unavailable in languages such as Python or Java, where formal verification is much harder. Finally, training on underrepresented functional paradigms pushes the model beyond standard object-oriented patterns (van Dam et al., [2024](https://arxiv.org/html/2604.17010#bib.bib15); Giagnorio et al., [2025](https://arxiv.org/html/2604.17010#bib.bib14)), encouraging deeper abstraction capabilities.
We propose a self-play framework for semantic equivalence, in which two specialized agents interact: Alice, a generator that produces variants of reference programs, and Bob, an evaluator trained to decide whether two programs are equivalent. The self-play loop alternates between program generation, verification through proofs or counterexamples, difficulty scoring, and fine-tuning of both agents. By framing the problem as a game between generator and evaluator, the system encourages progressively harder examples and deeper reasoning about semantics.

This work investigates three core questions regarding the utility of functional programming for LLM alignment. First, we examine the dynamics of self-play, asking whether an adversarial loop in a functional language can induce a progressive curriculum that improves semantic reasoning. Second, we evaluate cross-domain and cross-language transferability, assessing whether semantic reasoning skills acquired in Haskell generalize to zero-shot synthesis and vulnerability detection in broader coding benchmarks. Finally, we perform a controlled ablation to determine the relative contributions of supervision signals, distinguishing between the effects of formal equivalence proofs and execution-based counterexamples on evaluator robustness.
## 2 Related Work

Determining semantic equivalence is extremely challenging in general, and current LLMs often fail to recognise semantic equivalence in code (Laneve et al., [2025](https://arxiv.org/html/2604.17010#bib.bib3); Nguyen et al., [2025](https://arxiv.org/html/2604.17010#bib.bib4); Wei et al., [2025](https://arxiv.org/html/2604.17010#bib.bib5)). By Rice's Theorem, determining whether two arbitrary programs are semantically equivalent is undecidable in general (Rice, [1953](https://arxiv.org/html/2604.17010#bib.bib16)). Traditional approaches rely on unit testing; however, even extended test suites such as HumanEval+ and MBPP+ remain insufficient to guarantee correctness (Gren and Antinyan, [2017](https://arxiv.org/html/2604.17010#bib.bib19); Chioteli et al., [2021](https://arxiv.org/html/2604.17010#bib.bib18)). Similarly, symbolic execution offers path-sensitive analysis but suffers from combinatorial state-space explosion as program complexity increases (Badihi et al., [2020](https://arxiv.org/html/2604.17010#bib.bib20)).

To address these incompleteness issues, recent research has pivoted towards formal verification. In the LLM domain, frameworks like DeepSeek-Prover (Xin et al., [2024a](https://arxiv.org/html/2604.17010#bib.bib21), [b](https://arxiv.org/html/2604.17010#bib.bib22); Ren et al., [2025](https://arxiv.org/html/2604.17010#bib.bib23)) and Kimina-Prover (Wang et al., [2025](https://arxiv.org/html/2604.17010#bib.bib24)) have demonstrated that fine-tuning models on self-generated proofs (e.g., in Lean) significantly enhances their verifiable reasoning capabilities. However, generating fully machine-checkable equivalence proofs for imperative languages like Python or C++ is beyond the reach of current tools except for trivial cases (Miceli-Barone et al., [2025](https://arxiv.org/html/2604.17010#bib.bib25)).

Our approach bridges this gap by using Haskell and Liquid Haskell. Liquid Haskell embeds refinement types into the language (Vazou et al., [2014](https://arxiv.org/html/2604.17010#bib.bib8)), allowing logical properties to be verified automatically via Satisfiability Modulo Theories (SMT) solvers (Diatchki, [2015](https://arxiv.org/html/2604.17010#bib.bib10); Jhala et al., [2020](https://arxiv.org/html/2604.17010#bib.bib26); Liquid Haskell Tutorial, [2025](https://arxiv.org/html/2604.17010#bib.bib11)). This framework enables the construction of machine-checkable lemmas, using reflection and Proof by Logical Evaluation (PLE), to formally certify pointwise equality between candidate functions. This provides a deterministic, high-fidelity feedback signal for the self-play loop.
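To make this mechanism concrete, the following is a minimal Liquid Haskell sketch of a pointwise-equality lemma in the style described above. It is our own illustration, not taken from the paper: the function and lemma names are hypothetical, and checking it requires the Liquid Haskell toolchain (the `liquidhaskell` package with an SMT solver) rather than plain GHC.

```haskell
{-@ LIQUID "--reflection" @-}
{-@ LIQUID "--ple"        @-}
module EquivSketch where

import Language.Haskell.Liquid.ProofCombinators

-- Two candidate implementations whose equivalence we want to certify.
{-@ reflect doubleAdd @-}
doubleAdd :: Int -> Int
doubleAdd x = x + x

{-@ reflect doubleMul @-}
doubleMul :: Int -> Int
doubleMul x = 2 * x

-- Machine-checkable lemma: pointwise equality on every input x.
-- The refinement type states the proof obligation; reflection unfolds
-- the definitions and PLE discharges the arithmetic via the SMT solver.
{-@ equivLemma :: x:Int -> { doubleAdd x == doubleMul x } @-}
equivLemma :: Int -> Proof
equivLemma x = doubleAdd x === x + x === 2 * x === doubleMul x *** QED
```

A successful check of such a lemma is exactly the kind of deterministic accept/reject signal the self-play loop consumes.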
Self-play has historically driven breakthroughs in game-playing agents like AlphaZero (Silver et al., [2017](https://arxiv.org/html/2604.17010#bib.bib27)) and OpenAI Five (OpenAI et al., [2019](https://arxiv.org/html/2604.17010#bib.bib28)). In the coding domain, recent frameworks such as Sol-Ver (Lin et al., [2025](https://arxiv.org/html/2604.17010#bib.bib29)) and AutoIF (Dong et al., [2025](https://arxiv.org/html/2604.17010#bib.bib30)) adapt this by using model-generated unit tests and execution feedback to filter synthetic data, achieving significant gains on benchmarks like MBPP and IFEval.

Our work builds most directly on the adversarial framework proposed by Miceli-Barone et al. ([2025](https://arxiv.org/html/2604.17010#bib.bib25)), known as the Semantic Inequivalence Game (SINQ). In their setup, a generator ("Alice") creates program variants and diverging inputs (counterexamples), while an evaluator ("Bob") attempts to detect inequivalence without seeing the generator's justification. This adversarial loop provides a scalable curriculum that improves semantic reasoning. We extend this paradigm by including Semantic Equivalence tasks (SEQ). While Miceli-Barone et al. ([2025](https://arxiv.org/html/2604.17010#bib.bib25)) focused exclusively on inequivalence via execution feedback, we integrate formal verification using Liquid Haskell. This allows us to train on positive instances of equivalence supported by reasoning traces.
## 3 Methodology

We introduce the core methodology of the self-play framework for improving code reasoning in LLMs through two complementary tasks: SEQ and SINQ. In the SEQ task, the generator model is asked to produce a functionally identical variant of a given Haskell program, along with a formal proof certifying its equivalence. In contrast, the SINQ task requires generating a function that diverges from the original program on at least one input. Figure [1](https://arxiv.org/html/2604.17010#S3.F1) shows the complete self-play framework, with difficulty-based supervised fine-tuning guiding the adaptive training loop.

Figure 1: Overview of the semantic self-play framework for improving code reasoning in LLMs via Haskell.

### 3.1 Framework Overview

The self-play framework is formulated as a two-agent adversarial game between Alice (the generator) and Bob (the evaluator).
1. Each round starts with a reference Haskell program $P$, from which Alice must generate a program $Q$ that is either a semantically equivalent variant of $P$ (with a proof) or a semantically inequivalent program (with a diverging input that demonstrates the differing behaviour).
2. The proof or the diverging input is then verified; if verification fails, Alice loses.
3. Bob then decides whether $(P, Q)$ are semantically equivalent. If Bob's judgment is correct, he wins and Alice loses, and vice versa.

Alice's objective is to create instances that are difficult for Bob to classify, whereas Bob's task is to correctly assess them. Through repeated interactions, both agents gradually improve their performance. Miceli-Barone et al. ([2025](https://arxiv.org/html/2604.17010#bib.bib25)) proved that this adversarial framework has no theoretical upper bound on performance improvement: in principle, both agents can keep learning about complex programming logic indefinitely while training on a real-world coding dataset.
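The round structure above can be sketched as a pure function; this is our own minimal rendering (the `Program`, `Proof`, and `Input` synonyms and all names are placeholders, and the verification oracle and Bob's verdict are abstracted as function arguments):

```haskell
-- Placeholder types for illustration; in the real pipeline these are
-- Haskell sources, Liquid Haskell proof scripts, and input literals.
type Program = String
type Proof   = String
type Input   = String

data Claim   = Equivalent Proof | Inequivalent Input
data Outcome = AliceLoses | BobWins | AliceWins
  deriving (Eq, Show)

playRound :: (Claim -> Bool)              -- step 2: verify proof / diverging input
          -> (Program -> Program -> Bool) -- step 3: Bob's equivalence verdict
          -> Program -> Program -> Claim -> Outcome
playRound verify bob p q claim
  | not (verify claim)      = AliceLoses   -- failed verification
  | bob p q == claimedEquiv = BobWins      -- Bob judged correctly
  | otherwise               = AliceWins
  where
    claimedEquiv = case claim of
      Equivalent _   -> True
      Inequivalent _ -> False
```
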
### 3.2 Dataset Generation and Preparation
The availability of high-quality Haskell datasets remains extremely limited. Among the few usable resources, the most substantial is the Blastwind dataset ([https://huggingface.co/datasets/blastwind/github-code-haskell-file](https://huggingface.co/datasets/blastwind/github-code-haskell-file)), which aggregates real-world source files scraped from public GitHub repositories. However, its utility is hindered by substantial noise: the data contains unannotated, inconsistently formatted, and often non-compilable code.

To address this scarcity, we adopt a complementary strategy: we introduce OpInstruct-HSx, a synthetic Haskell dataset adapted from the nvidia-OpenCodeInstruct dataset (Ahmad et al., [2025](https://arxiv.org/html/2604.17010#bib.bib33)), a large-scale, high-quality instruction corpus originally built for Python code generation.
The programs are first pre-filtered and then transformed into Haskell programs using the DeepSeek-R1-Distill-Llama-70B model. This process results in a synthetic dataset of Haskell programs. To ensure its quality, we applied an automated filtering and validation stage. For each generated Haskell program, we extract the function name and its argument types using syntactic heuristics, and synthesize a type-correct input using a recursive literal generator supporting common base types (e.g., Int, Bool, List, Tuple). Each program is compiled with the Glasgow Haskell Compiler (GHC) and executed on the synthesised input. Only those that compile successfully and execute without errors are retained. This filtering process eliminates malformed or non-functional code, ensuring that the resulting dataset consists of minimally functional and executable Haskell programs.
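The type-directed literal synthesis can be pictured as follows. This is a minimal sketch under our own naming (the paper does not publish this helper's exact form), covering the base types listed above:

```haskell
import Data.List (intercalate)

-- Hypothetical sketch of the recursive literal generator: given a
-- parsed argument type, emit a type-correct Haskell literal that can
-- be spliced into a test harness and executed under GHC.
data HsType = TInt | TBool | TList HsType | TTuple [HsType]

synthLiteral :: HsType -> String
synthLiteral TInt        = "1"
synthLiteral TBool       = "True"
synthLiteral (TList t)   = "[" ++ synthLiteral t ++ "]"
synthLiteral (TTuple ts) =
  "(" ++ intercalate ", " (map synthLiteral ts) ++ ")"
```

For example, a function of type `(Int, [Bool]) -> …` would receive the literal `(1, [True])` as its synthesized input.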
Figure [2](https://arxiv.org/html/2604.17010#S3.F2) shows the entire multi-stage filtering mechanism. We have contributed OpInstruct-HSx, a clean and executable Haskell dataset for both SEQ and SINQ games, consisting of approximately 28,000 validated Haskell functions derived from real-world problems. The dataset is publicly available ([https://huggingface.co/datasets/Trevor0501/OpInstruct-HSx](https://huggingface.co/datasets/Trevor0501/OpInstruct-HSx)), serving as a high-quality synthetic Haskell resource for training LLMs on semantic reasoning tasks. The code to create the data and replicate the experiments is released in a public repository ([https://github.com/TrevorPoon/llm-self-play-liquidhaskell](https://github.com/TrevorPoon/llm-self-play-liquidhaskell)).

Figure 2: Full pipeline for the OpInstruct-HSx dataset generation.
### 3.3 The Self-Play Loop: Alice and Bob

#### Step 1: Program Selection and Branching

Let $\mathcal{D}$ be the dataset of reference Haskell programs. We randomly choose a reference program $P \in \mathcal{D}$ and then select the SEQ game with 50% probability; otherwise we choose the SINQ game.
#### Step 2a: SEQ Game (Alice's Turn)

In the SEQ game, Alice receives $P$ and must synthesize $Q$ such that for all inputs $x$:

$$\forall x \in \mathcal{X}.\; P(x) = Q(x)$$

To challenge Bob, Alice is encouraged to construct a hard instance $Q$, aiming for the maximum target difficulty level ($d = 10$, defined in Section [3.4.1](https://arxiv.org/html/2604.17010#S3.SS4.SSS1)). In addition, Alice is required to produce a formal proof (in Liquid Haskell) of this semantic equivalence. See Appendix [B](https://arxiv.org/html/2604.17010#A2) for an SEQ instance.
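For intuition, an illustrative SEQ-style pair (our own example, not drawn from OpInstruct-HSx): $P$ is an explicitly recursive sum and $Q$ the fold-based variant, equal on every input.

```haskell
-- P: explicit recursion over the list.
sumP :: [Int] -> Int
sumP []     = 0
sumP (x:xs) = x + sumP xs

-- Q: the same function expressed as a right fold.
-- SEQ claim: forall xs. sumP xs == sumQ xs, provable in Liquid
-- Haskell by induction on the list structure.
sumQ :: [Int] -> Int
sumQ = foldr (+) 0
```
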
#### Step 2b: SINQ Game (Alice's Turn)

Alternatively, in the SINQ game, Alice is instructed to produce a function $Q$ that diverges from $P$ on at least one input:

$$\exists x^{*} \in \mathcal{X} : P(x^{*}) \neq Q(x^{*})$$

Alice is also incentivized to construct a difficult function $Q$ that Bob is likely to misclassify, again targeting the maximum difficulty level ($d = 10$). Alice must also output a diverging input $x_a$ demonstrating the inequivalence, such that $P(x_a) \neq Q(x_a)$. See Appendix [B](https://arxiv.org/html/2604.17010#A2) for an SINQ instance.
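An illustrative SINQ-style pair (again our own, for exposition): $Q$ silently drops the negative case, and Alice reports a diverging input witnessing the inequivalence.

```haskell
-- P: absolute value.
absP :: Int -> Int
absP x = if x < 0 then negate x else x

-- Q: a near-miss variant that diverges on negative inputs.
absQ :: Int -> Int
absQ x = x

-- Alice's diverging input x_a, for which P(x_a) /= Q(x_a).
divergingInput :: Int
divergingInput = -1
```
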
#### Step 3: Verification through Liquid Haskell or Execution

- SEQ Game: Alice's proof is verified by Liquid Haskell, which acts as an external oracle. If the proof is accepted, it is retained as a fine-tuning example.
- SINQ Game: The candidate $x_a$ is tested by execution: if $P(x_a) \neq Q(x_a)$, the instance is accepted; otherwise, the sample is discarded.
All candidates undergo compilation, execution, and formal verification checks. This ensures that the training data for both Alice and Bob remains high-quality and executable.
#### Step 4: Bob's Turn – Difficulty Estimation

After Alice produces her candidate program $Q$, Bob is presented with only $P$ and $Q$ and must decide whether the two programs are semantically equivalent. To estimate the challenge posed by each example, Bob is sampled $N$ times, and the proportion of correct responses is used to compute a difficulty score. Further details on the difficulty-based curriculum and dataset sampling can be found in Section [3.4](https://arxiv.org/html/2604.17010#S3.SS4).
### 3.4 Implementation: Supervised Fine-Tuning with a Difficulty Score

Given the practical challenges of reinforcement learning (Appendix [C](https://arxiv.org/html/2604.17010#A3)), we instead adopt rejection-sampling supervised fine-tuning (SFT). We construct fine-tuning datasets by having Alice generate challenging programs and Bob learn from his own correct identifications. This semi-adversarial pipeline ensures that Alice continually refines her ability to craft borderline-difficult SEQ or SINQ program pairs, while Bob steadily improves his reasoning about semantic equivalence. Detailed prompt formats for both agents are provided in Appendix [D](https://arxiv.org/html/2604.17010#A4).
#### 3.4.1 Alice's Training Data Selection

After Alice generates the candidate pairs $(P, Q)$, her outputs are not immediately used for fine-tuning. Instead, for each pair, Bob is asked to evaluate their semantic relation multiple times (typically $N = 10$). The number of correct Bob responses $N_{\text{success}}$ determines the difficulty score $\hat{d}$, as defined in Equation [1](https://arxiv.org/html/2604.17010#S3.E1).
$$\hat{d} = d(P,Q) = 10 \times \left(1 - \frac{N_{\text{success}}}{N}\right) \tag{1}$$

However, most of Alice's early generations are trivial for Bob. Including all examples would flood the dataset with low-difficulty cases that Bob already solves easily. Following the design of Miceli-Barone et al. ([2025](https://arxiv.org/html/2604.17010#bib.bib25)), we only retain examples that are sufficiently challenging, as determined by the difficulty score $\hat{d}$. Hence, as shown in Schema [2](https://arxiv.org/html/2604.17010#S3.E2), we split all $(P,Q)$ pairs into $\mathcal{D}_{Hard}$ and $\mathcal{D}_{Easy}$, where $\tau$ is a chosen difficulty threshold (e.g., $\tau = 5$):

$$\mathcal{D}_{Hard} = \left\{(P,Q) \mid d(P,Q) > \tau\right\}, \qquad \mathcal{D}_{Easy} = \left\{(P,Q) \mid d(P,Q) \leq \tau\right\} \tag{2}$$
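The difficulty score and the hard/easy split amount to the following minimal sketch (function names are ours, not the paper's):

```haskell
import Data.List (partition)

-- Difficulty score of Eq. (1): N Bob samples, nSuccess correct answers.
difficulty :: Int -> Int -> Double
difficulty n nSuccess =
  10 * (1 - fromIntegral nSuccess / fromIntegral n)

-- Split of Eq. (2): pairs scoring strictly above threshold tau are hard.
splitByDifficulty :: Double -> [(pq, Double)] -> ([(pq, Double)], [(pq, Double)])
splitByDifficulty tau = partition ((> tau) . snd)
```
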
Finally, Alice's training dataset comprises three SFT examples for each validated pair $(P, Q)$. The first example pairs the prompts with Alice's generation and directly trains Alice to generate a challenging program $Q$, improving her ability to craft precise SEQ or SINQ code (Schema [3](https://arxiv.org/html/2604.17010#S3.E3)). $\mathrm{SP}$ and $\mathrm{UP}$ denote the system and user prompts conditioned on the reference program $P$ and a target difficulty $d = 10$; $O$ represents the model's raw output, which includes the chain-of-thought followed by the generated program $Q$ (and the diverging input $x$ in the case of SINQ).

$$\text{SEQ} = \bigl(\mathrm{SP}_{A}^{eq},\; \mathrm{UP}_{A}^{eq}(P, 10),\; O_{A}^{eq}\bigr), \qquad \text{SINQ} = \bigl(\mathrm{SP}_{A}^{inq},\; \mathrm{UP}_{A}^{inq}(P, 10),\; O_{A}^{inq}\bigr) \tag{3}$$
We then select every hard example, plus 20% as many easy ones sampled round-robin across integer difficulty bins, to maintain a balanced curriculum.
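This sampling rule can be sketched as follows; only the 20% ratio and the round-robin-over-integer-bins rule come from the text, so take the names and details as our own interpretation:

```haskell
import Data.List (transpose)
import qualified Data.Map.Strict as M

-- Keep every hard pair, plus 20% as many easy pairs drawn round-robin
-- from integer difficulty bins, so no single difficulty level
-- dominates the easy sample.
sampleCurriculum :: [(a, Double)] -> [(a, Double)] -> [(a, Double)]
sampleCurriculum hard easy = hard ++ take quota roundRobin
  where
    quota      = length hard * 20 `div` 100
    -- Bin easy pairs by floor of their difficulty, preserving order.
    bins       = M.fromListWith (flip (++))
                   [ (floor d :: Int, [p]) | p@(_, d) <- easy ]
    -- Interleave the bins: one element from each bin per pass.
    roundRobin = concat (transpose (M.elems bins))
```
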
The second example, a "difficulty-prediction" instance, teaches Alice to self-assess the hardness of her own creations: it takes Alice's generated program $Q$ together with the difficulty-prediction user prompt and supervises on the numeric label "Difficulty level: $\hat{d}$" (Schema [4](https://arxiv.org/html/2604.17010#S3.E4)). This training subset is also biased towards hard examples, with easy examples comprising 50% of the subset. Alice thereby learns to calibrate her difficulty estimates, ensuring that future generations are appropriately challenging.

$$\text{SEQ}_{\text{DP}} = \bigl(\mathrm{SP}_{A,DP}^{eq},\; \mathrm{UP}_{A,DP}^{eq},\; \hat{d}\bigr), \qquad \text{SINQ}_{\text{DP}} = \bigl(\mathrm{SP}_{A,DP}^{inq},\; \mathrm{UP}_{A,DP}^{inq},\; \hat{d}\bigr) \tag{4}$$
The third example covers all $(P, Q)$ pairs in the SEQ game that have been successfully proved. We include not only the generated candidate $Q$ for a given $P$, but also Alice's complete reasoning trace and valid Liquid Haskell proof script (Schema [5](https://arxiv.org/html/2604.17010#S3.E5)). These examples serve as supervised training signals that help Alice internalize the proof obligation $\pi$, corresponding to the refinement type $\forall x.\; P(x) = Q(x)$.

$$\text{SEQ}_{\text{proof}} = \Bigl\{\bigl(\mathrm{SP}_{A}^{eq},\; \mathrm{UP}_{A}^{eq}(P),\; O_{A}^{eq}(P, Q, \pi)\bigr) \;\Big|\; \mathrm{LH}(P,Q) \vdash \pi\Bigr\} \tag{5}$$

This three-part structure enables Alice to generate challenging variants, assess their difficulties, and internalize proof strategies.
#### 3.4.2 Bob's Training Data Selection

Bob's training data is constructed from all of his correct evaluations of $(P, Q)$ pairs. Each training example (Schema [6](https://arxiv.org/html/2604.17010#S3.E6)) consists of the original pair, the system prompt, the user prompt, and Bob's response $O_B$. By sampling from a wide range of difficulties, Bob learns to reason about and recognize both easy and subtle equivalence relations.

$$\mathcal{E}_{\text{Bob}} = \bigl(\mathrm{SP}_{B},\; \mathrm{UP}_{B}(P, Q),\; O_{B}\bigr) \tag{6}$$
## 4 Experimental Setup

We use DeepSeek-R1-Distill-Qwen-7B as our base model for both Alice and Bob. Fine-tuning is performed using LoRA adaptors. Please see Appendix [E](https://arxiv.org/html/2604.17010#A5) for more details.

To investigate the distinct contributions of equivalence versus inequivalence supervision, we conduct the main experiment ($E_0$) alongside three controlled ablations ($E_1$–$E_3$). These regimes systematically vary the game-type probability and the reference program budget ($P$) to isolate the impact of SEQ supervision under controlled data constraints. The configurations are shown in Table [1](https://arxiv.org/html/2604.17010#S4.T1).

Table 1: Experimental regimes to test SEQ impact.
## 5 Results

The following evaluation results are structured to address the three research questions outlined in the introduction.
### 5.1 Intrinsic Evaluation

The goal of $E_0$ is to teach Alice (the generator) to produce increasingly challenging instances while Bob (the evaluator) improves at judging program semantics.

#### 5.1.1 Capability Assessment

As shown in Figure [3](https://arxiv.org/html/2604.17010#S5.F3), the mean difficulty rises from 0.50 to 1.18 by round 7, indicating that Alice crafts harder instances for a fixed Bob. See Appendix [H.1.1](https://arxiv.org/html/2604.17010#A8.SS1.SSS1) for the difficulty-trajectory breakdown for both the SEQ and SINQ games.
Figure 3: Mean and standard deviation of the difficulty scores for Alice's generated instances, judged by a fixed untrained Bob over 7 rounds.

In addition, we measure how much the evaluator model (Bob) improves after its first and only training round in the semantic equivalence self-play. Challenge instances are generated by the final trained generator model (Alice from round 7), using source programs from both the training split of OpInstruct-HSx and the unseen test split. The resulting accuracies, summarised in Table [2](https://arxiv.org/html/2604.17010#S5.T2), show modest improvements on unseen data.

Table 2: OpInstruct-HSx accuracy results comparing the Base Model and Trained (Bob adapter) models.
### 5.2 Extrinsic Evaluation

Having validated the agents' improvement within the self-play loop, we now investigate whether this training enhances the model's understanding of program semantics across different programming languages and domains.
#### 5.2.1 Haskell Program Generation

On MBPP and HumanEval from MultiPL-E (Cassano et al., [2022](https://arxiv.org/html/2604.17010#bib.bib35)), both models' compilation errors decrease substantially and Pass@1 improves on the Haskell tasks (shown in Table [3](https://arxiv.org/html/2604.17010#S5.T3)). Performance trajectories for Alice across self-play rounds are provided in Appendices [H.1.2](https://arxiv.org/html/2604.17010#A8.SS1.SSS2) and [H.1.3](https://arxiv.org/html/2604.17010#A8.SS1.SSS3).
Table 3: HumanEval and MBPP results (zero-shot prompting in Haskell), comparing the Base Model (untrained) and Trained models. Averages over 16 trials.

The results indicate that semantic reasoning training transfers positively to Haskell code generation performance and robustness for both agents. For Bob, the gains confirm that the semantic discrimination skills learned during self-play yield broader benefits in practical programming scenarios. For Alice, the improvements suggest that training on high-quality, verifiable program transformations in self-play enhances downstream synthesis accuracy and reduces compilation errors.
### 5.3 Coding-Related Tasks Evaluation

We investigate whether the richer type semantics learnt by the evaluator model Bob further enhance his ability to analyse correct and semantically meaningful code across other coding paradigms, not just functional ones.
#### 5.3.1 EquiBench

The EquiBench (Wei et al., [2025](https://arxiv.org/html/2604.17010#bib.bib5)) benchmark consists of six data categories. DCE (C pairs with dead/live code variations), x86-64 (identical assembly sequences via superoptimization), and CUDA (equivalent kernels with different scheduling) form the low-level systems group. The algorithmic group uses Python pairs from competitive programming, featuring OJ\_A (algorithmic refactors), OJ\_V (variable renaming), and OJ\_VA (combined algorithmic and variable changes).

In the zero-shot prompting evaluation, our fine-tuned Bob model shows clear gains over the base model in several dataset configurations.
Figure 4: Accuracy comparison between base and fine-tuned Bob models across EquiBench dataset configurations.

Figure 5: F1 score comparison between base and fine-tuned Bob models across EquiBench dataset configurations.

From Figures [4](https://arxiv.org/html/2604.17010#S5.F4) and [5](https://arxiv.org/html/2604.17010#S5.F5), the model transfers best to OJ\_A and OJ\_V: both categories (algorithmic refactors and variable renaming, respectively) align with Bob's learned skill of detecting high-level behavioural divergence, resulting in large performance gains. OJ\_VA's performance stays modest; as the benchmark's authors note, it is harder than pure renaming and closer to the "non-local structural" end of the spectrum.

In contrast, DCE, STOKE, and TVM require reasoning about low-level or non-functional semantics absent from our Haskell curriculum. This mismatch in language, abstraction level, and equivalence definition explains the weaker or negligible improvements on these tasks.
#### 5.3.2 PySecDB

PySecDB is the first comprehensive dataset of security-related commits in Python (Sun et al., [2023](https://arxiv.org/html/2604.17010#bib.bib37)). In Table [4](https://arxiv.org/html/2604.17010#S5.T4), our fine-tuned Bob model achieves consistent improvements over the base model across all evaluated metrics. The results indicate that Bob's Haskell semantic reasoning training transfers positively to Python vulnerability detection, improving the identification of security-relevant code changes.

Table 4: PySecDB vulnerability detection results comparing the Base Model and Trained (Bob adapter).
#### 5.3.3 CodeXGLUE

CodeXGLUE's defect detection dataset (Zhou et al., [2019](https://arxiv.org/html/2604.17010#bib.bib38)) is a curated collection of over 27,000 C-language functions, each manually labeled to indicate whether it contains a security-relevant defect.

Table 5: CodeXGLUE defect detection results comparing the Base Model and Trained (Bob adapter).

In Table [5](https://arxiv.org/html/2604.17010#S5.T5), both the base and fine-tuned Bob models perform near the 50% accuracy mark, indistinguishable from a random classifier. CodeXGLUE gives Bob only a single C function and asks whether it is vulnerable. The vulnerabilities are low-level memory issues that require tracking pointers, memory allocation, and data layouts. Haskell, however, is memory-safe and garbage-collected, with no manual frees or pointer arithmetic. The trained model therefore shows no meaningful advantage over the base model on this task.
### 5.4 Controlled Comparison of SEQ–SINQ Regimes

A critical finding from $E_0$ is the asymmetry in the number of validated programs between SINQ and SEQ, even under a uniform sampling policy when choosing the game. In Figure [6](https://arxiv.org/html/2604.17010#S5.F6), we observe hundreds of accepted SINQ instances per round but very few SEQ instances (single digits in most rounds). A review of the self-play output shows that while Alice readily produces SINQ pairs, the small reasoning model often struggles to produce Liquid Haskell proofs for SEQ, which drastically limits the number of compiled and verified SEQ examples available for training. This imbalance limits the diversity of program types in the training buffer, with the current SEQ+SINQ configuration heavily skewed towards SINQ examples.

Figure 6: Counts and proportions of validated Alice generations by round.

This subsection investigates whether incorporating SEQ supervision yields unique reasoning benefits beyond those of SINQ alone, aiming to isolate the impact of the supervision type from the disparity in verified data volume (see Appendix [F](https://arxiv.org/html/2604.17010#A6) for derivation details).
#### 5.4.1 Evaluation Results

We analyze the four experimental regimes to decouple the effects of supervision type (SEQ vs. SINQ) from training data volume. All results are presented in Appendices [G](https://arxiv.org/html/2604.17010#A7) and [H](https://arxiv.org/html/2604.17010#A8).
#### Value of SEQ Supervision

Despite having significantly fewer verified training pairs in $E_0$ than in the pure SINQ regime $E_1$ (937 vs. 1,839), $E_0$ achieves superior performance on semantic equivalence tasks. This indicates that SEQ supervision improves high-level semantic judgment even when data volume is reduced. Furthermore, when strictly controlling for data volume ($E_2$ vs. $E_3$, both ≈150 pairs), the inclusion of SEQ ($E_2$) yields consistent advantages on structural reasoning tasks over pure SINQ ($E_3$), as shown in Table [8](https://arxiv.org/html/2604.17010#A7.T8). This confirms that SEQ supervision confers a unique benefit that cannot be replicated by inequivalence signals alone.
#### Volume Trade\-offs
While SEQ is qualitatively valuable, data volume remains critical. Regime $E_2$, which aggressively prioritizes SEQ (96% of attempts) to force a balanced validated distribution, suffers a sharp drop in total verified pairs ($N = 140$) due to low proof yields, causing performance degradation across all benchmarks compared to $E_0$. The 50/50 mixed regime ($E_0$) therefore represents an acceptable trade-off, balancing the semantic depth of SEQ with the high verification yield of SINQ.
## 6 Conclusions
This paper investigates the use of self\-play to improve code generation and semantic reasoning in LLMs, focusing on program equivalence in Haskell and Liquid Haskell\. The framework introduces two agents: a generator of semantically equivalent or inequivalent program variants, and an evaluator trained to judge equivalence\. Evaluation shows that Bob benefits substantially: Haskell coding performance improves, and transfer is strong on high\-level semantic reasoning tasks, but negligible on low\-level or memory\-oriented tasks\. Controlled comparisons reveal that although inequivalence data yields far more verified pairs, the inclusion of equivalence supervision confers unique gains in semantic reasoning that cannot be achieved through inequivalence alone\.
## 7 Future Work
There are several promising avenues for future research\. First, extending training across more rounds could allow Bob to accumulate richer experience\. Additionally, \(1\) scaling to larger parameter models and \(2\) employing full\-parameter fine\-tuning instead of LoRA may increase both Alice’s proof synthesis capabilities and Bob’s reasoning capacity\.
A second priority is to improve the statistical robustness of the controlled comparisons. The experiments in regimes $E_2$ and $E_3$ were limited by small sample sizes, which may not be representative of the true differences between SINQ- and SEQ-based supervision. Running larger-scale experiments would provide stronger evidence about the relative contributions of the two regimes.
Furthermore, the Haskell self-play framework could be extended methodologically to cover low-level program behaviors by introducing tasks that explicitly model memory usage, side effects, and bit-level operations.
Moreover, converting the self\-play pipeline into a dedicated Haskell Equivalence Evaluation test set would help to provide a more direct and sustainable benchmark for semantic reasoning, especially given the scarcity of high\-quality Haskell resources currently available\.
Finally, reinforcement learning could be re\-examined as a means of enhancing the SEQ game’s contribution\. By running Alice long enough to generate sufficiently challenging programs and providing Bob with richer feedback through reward\-based updates, it may be possible to overcome the current proof bottleneck and more effectively integrate equivalence reasoning into the self\-play loop\.
## 8 Limitations
A central limitation of this work lies in the proof synthesis bottleneck. Alice rarely produces Liquid Haskell proofs that successfully pass PLE, resulting in very low validation yields for SEQ cases. This imbalance leaves the majority of verified training data SINQ-dominated, reducing the contribution of SEQ supervision to Bob's learning. The issue is likely exacerbated by the relatively small model size used in this study, which restricts Alice's ability to generate structurally complex proofs.
Another limitation arises from the inherent constraints of Liquid Haskell itself\. The verification process depends heavily on reflection and proof by logical evaluation, which are effective for local reasoning but struggle with certain classes of programs\. Non\-terminating behaviors, partial functions, and large\-scale algebraic rewrites are difficult to certify, and in some cases impossible, within this fragment of the logic\. Moreover, not all Haskell programs are reflectable, further narrowing the scope of tasks where equivalence can be formally verified\.
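As a concrete illustration of the partiality issue, consider two functions that agree on every non-empty list yet are not equivalent over the full input domain, since one is undefined on `[]`. The function names below are ours, chosen for illustration only:

```haskell
-- sumHead is partial: it crashes on [] because of head/tail.
sumHead :: [Int] -> Int
sumHead xs = head xs + sum (tail xs)

-- sumSafe is total: it returns 0 on [].
sumSafe :: [Int] -> Int
sumSafe = sum

main :: IO ()
main = do
  -- The two agree on every non-empty list...
  print (sumHead [1, 2, 3] == sumSafe [1, 2, 3])
  -- ...but `sumHead []` throws, so full equivalence cannot be certified.
  print (sumSafe [])
```

Execution-based testing can only exhibit the divergence by hitting the crashing input, and the proof fragment cannot certify equivalence of a partial function over its whole domain.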
Finally, there is a cross\-domain mismatch in generalization\. While Bob transfers strongly to high\-level semantic reasoning tasks, such as OJ\_A and OJ\_V in EquiBench and vulnerability detection in PySecDB, his performance is notably weaker on low\-level or stateful semantics, including DCE, STOKE, TVM, and CodeXGlue\. This gap highlights a limitation of the current framework in addressing equivalence reasoning that depends on memory, side effects, or bit\-level operations\.
## 9 Ethical considerations
This work adversarially trains models to introduce hard-to-detect semantic modifications to computer programs and then to detect them, which may teach them to detect security-relevant vulnerabilities or to introduce obfuscated backdoors. In principle, these capabilities have potential for malicious use; however, they can also be used pro-socially to improve the reliability and security of code, whether written by humans or LLMs. We believe the net societal impact of models better able to reason about the subtleties of program semantics is positive, especially as these capabilities are disseminated through open-source releases, as we do in this work.
## References
- OpenCodeInstruct: a large-scale instruction tuning dataset for code LLMs. [arXiv:2504.04030](https://arxiv.org/abs/2504.04030).
- S. Badihi, F. Akinotcho, Y. Li, and J. Rubin (2020). ARDiff: scaling program equivalence checking via iterative abstraction and refinement of common code. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 13–24.
- F. Cassano, J. Gouwar, D. Nguyen, S. Nguyen, L. Phipps-Costin, D. Pinckney, M. Yee, Y. Zi, C. J. Anderson, M. Q. Feldman, A. Guha, M. Greenberg, and A. Jangda (2022). MultiPL-E: a scalable and extensible approach to benchmarking neural code generation. [arXiv:2208.08227](https://arxiv.org/abs/2208.08227).
- E. Chioteli, I. Batas, and D. Spinellis (2021). Does unit-tested code crash? A case study of Eclipse. In 25th Pan-Hellenic Conference on Informatics (PCI 2021), pp. 260–264. [doi:10.1145/3503823.3503872](https://dx.doi.org/10.1145/3503823.3503872).
- I. S. Diatchki (2015). Improving Haskell types with SMT. SIGPLAN Notices 50(12), pp. 1–10. [doi:10.1145/2887747.2804307](https://dx.doi.org/10.1145/2887747.2804307).
- G. Dong, K. Lu, C. Li, T. Xia, B. Yu, C. Zhou, and J. Zhou (2025). Self-play with execution feedback: improving instruction-following capabilities of large language models. In The Thirteenth International Conference on Learning Representations. [OpenReview](https://openreview.net/forum?id=cRR0oDFEBC).
- A. Giagnorio, A. Martin-Lopez, and G. Bavota (2025). Enhancing code generation for low-resource languages: no silver bullet. arXiv preprint arXiv:2501.19085.
- L. Gren and V. Antinyan (2017). On the relation between unit testing and code quality. In 2017 43rd Euromicro Conference on Software Engineering and Advanced Applications (SEAA), pp. 52–56. [doi:10.1109/SEAA.2017.36](https://dx.doi.org/10.1109/seaa.2017.36).
- B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Lu, et al. (2024). Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186.
- R. Jhala, E. Seidel, and N. Vazou (2020). Programming with refinement types: an introduction to Liquid Haskell. Version 13, July 20, 2020. [Link](https://ucsd-progsys.github.io/liquidhaskell-tutorial/book.pdf).
- C. Laneve, A. Spanò, D. Ressi, S. Rossi, and M. Bugliesi (2025). Assessing code understanding in LLMs. [arXiv:2504.00065](https://arxiv.org/abs/2504.00065).
- J. Launchbury (1993). A natural semantics for lazy evaluation. In POPL '93: Proceedings of the 20th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pp. 144–154. [doi:10.1145/158511.158618](https://dx.doi.org/10.1145/158511.158618).
- Z. Lin, S. Shen, J. Shang, J. Weston, and Y. Nie (2025). Learning to solve and verify: a self-play framework for code and test generation. [arXiv:2502.14948](https://arxiv.org/abs/2502.14948).
- Liquid Haskell Tutorial (2025). Liquid Haskell tutorial: introduction. [https://liquid.kosmikus.org/01-intro.html](https://liquid.kosmikus.org/01-intro.html). Accessed: 2025-07-18.
- Y. Liu, S. Li, L. Cao, Y. Xie, M. Zhou, H. Dong, X. Ma, S. Han, and D. Zhang (2025). SuperRL: reinforcement learning with supervision to boost language model reasoning. [arXiv:2506.01096](https://arxiv.org/abs/2506.01096).
- M. Mahaut and F. Franzon (2025). Repetitions are not all alike: distinct mechanisms sustain repetition in language models. arXiv preprint arXiv:2504.01100.
- A. V. Miceli-Barone, V. Belle, and A. Payani (2025). Program semantic inequivalence game with large language models. arXiv preprint arXiv:2505.03818.
- W. Murphy, N. Holzer, F. Qiao, L. Cui, R. Rothkopf, N. Koenig, and M. Santolucito (2024). Combining LLM code generation with formal specifications and reactive program synthesis. arXiv preprint arXiv:2410.19736.
- T. Nguyen, T. T. Vu, H. D. Vo, and S. Nguyen (2025). An empirical study on capability of large language models in understanding code semantics. Information and Software Technology, pp. 107780.
- OpenAI: C. Berner, G. Brockman, B. Chan, V. Cheung, P. Dębiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, R. Józefowicz, S. Gray, C. Olsson, J. Pachocki, M. Petrov, H. P. d. O. Pinto, J. Raiman, T. Salimans, J. Schlatter, J. Schneider, S. Sidor, I. Sutskever, J. Tang, F. Wolski, and S. Zhang (2019). Dota 2 with large scale deep reinforcement learning. [arXiv:1912.06680](https://arxiv.org/abs/1912.06680).
- L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe (2022). Training language models to follow instructions with human feedback. [arXiv:2203.02155](https://arxiv.org/abs/2203.02155).
- Z. Z. Ren, Z. Shao, J. Song, H. Xin, H. Wang, W. Zhao, L. Zhang, Z. Fu, Q. Zhu, D. Yang, Z. F. Wu, Z. Gou, S. Ma, H. Tang, Y. Liu, W. Gao, D. Guo, and C. Ruan (2025). DeepSeek-Prover-V2: advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition. [arXiv:2504.21801](https://arxiv.org/abs/2504.21801).
- H. G. Rice (1953). Classes of recursively enumerable sets and their decision problems. Transactions of the American Mathematical Society 74(2), pp. 358–366.
- P. Sestoft (1997). Deriving a lazy abstract machine. Journal of Functional Programming 7(3), pp. 231–264. [doi:10.1017/S0956796897002712](https://dx.doi.org/10.1017/S0956796897002712).
- D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, T. Lillicrap, K. Simonyan, and D. Hassabis (2017). Mastering chess and shogi by self-play with a general reinforcement learning algorithm. [arXiv:1712.01815](https://arxiv.org/abs/1712.01815).
- S. Sun, S. Wang, X. Wang, Y. Xing, E. Zhang, and K. Sun (2023). Exploring security commits in Python. [arXiv:2307.11853](https://arxiv.org/abs/2307.11853).
- S. Thompson (2011). Haskell: The Craft of Functional Programming. Addison-Wesley.
- T. van Dam, F. van der Heijden, P. de Bekker, B. Nieuwschepen, M. Otten, and M. Izadi (2024). Investigating the performance of language models for completing code in functional programming languages: a Haskell case study. [arXiv:2403.15185](https://arxiv.org/abs/2403.15185).
- N. Vazou, E. L. Seidel, R. Jhala, D. Vytiniotis, and S. Peyton-Jones (2014). Refinement types for Haskell. SIGPLAN Notices 49(9), pp. 269–282. [doi:10.1145/2692915.2628161](https://dx.doi.org/10.1145/2692915.2628161).
- H. Wang, M. Unsal, X. Lin, M. Baksys, J. Liu, M. D. Santos, F. Sung, M. Vinyes, Z. Ying, Z. Zhu, et al. (2025). Kimina-Prover preview: towards large formal reasoning models with reinforcement learning. arXiv preprint arXiv:2504.11354.
- A. Wei, J. Cao, R. Li, H. Chen, Y. Zhang, Z. Wang, Y. Liu, T. S. Teixeira, D. Yang, K. Wang, et al. (2025). EquiBench: benchmarking large language models' understanding of program semantics via equivalence checking. arXiv preprint arXiv:2502.12466.
- H. Xin, D. Guo, Z. Shao, Z. Ren, Q. Zhu, B. Liu, C. Ruan, W. Li, and X. Liang (2024a). DeepSeek-Prover: advancing theorem proving in LLMs through large-scale synthetic data. arXiv preprint arXiv:2405.14333.
- H. Xin, Z. Z. Ren, J. Song, Z. Shao, W. Zhao, H. Wang, B. Liu, L. Zhang, X. Lu, Q. Du, W. Gao, Q. Zhu, D. Yang, Z. Gou, Z. F. Wu, F. Luo, and C. Ruan (2024b). DeepSeek-Prover-V1.5: harnessing proof assistant feedback for reinforcement learning and Monte-Carlo tree search. [arXiv:2408.08152](https://arxiv.org/abs/2408.08152).
- Y. Zhou, S. Liu, J. Siow, X. Du, and Y. Liu (2019). Devign: effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In Advances in Neural Information Processing Systems, pp. 10197–10207.
## Appendix A Illustrative Haskell Features for Semantic Reasoning and Verification in LLM Self-Play
**[1] Pure Semantics.** Every function is a pure mapping from inputs to outputs; there is no hidden state or mutation. This purity ensures that two functions are semantically equivalent if they produce the same outputs for all inputs, regardless of how those outputs are computed.

```haskell
revRec :: [a] -> [a]
revRec []     = []
revRec (x:xs) = revRec xs ++ [x]

revFold :: [a] -> [a]
revFold = foldl (flip (:)) []
```

Because Haskell is referentially transparent, `revRec xs` can be substituted with `revFold xs` wherever it appears. This makes reasoning about equivalence both feasible and formalizable.
**[2] Static Typing and GHC Compilation.** Haskell's static type system provides strong guarantees at compile time. Once a program is accepted by the compiler, most classes of errors, such as type mismatches and null dereferencing, are eliminated. This makes the type checker an effective pre-filter for program validity in the self-play loop.

```haskell
add :: Int -> Int -> Int
add x y = x + y
```

When writing `add True False`, GHC raises a compilation error before the program is ever run, as `add` expects two `Int` arguments.
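The two validation signals used by the framework can be contrasted in a few lines of plain Haskell: sampled agreement on concrete inputs (the execution-based side) is cheap but only suggestive, whereas a Liquid Haskell proof certifies all inputs. The following sketch uses function names of our own choosing to show the execution-based check:

```haskell
-- Two implementations of the sum 1 + 2 + ... + n, equivalent for all n >= 0:
-- a closed form and a direct fold over the enumeration.
sumTo, sumTo' :: Int -> Int
sumTo  n = n * (n + 1) `div` 2
sumTo' n = sum [1 .. n]

-- Sampled agreement on a finite list of inputs: necessary evidence for
-- equivalence, but not a proof (unlike a Liquid Haskell lemma).
agreeOn :: Eq b => (a -> b) -> (a -> b) -> [a] -> Bool
agreeOn p q = all (\x -> p x == q x)

main :: IO ()
main = print (agreeOn sumTo sumTo' [0 .. 100])  -- True
```

A passing run of `agreeOn` is the kind of signal test suites provide; the SEQ branch of the framework replaces it with a machine-checked proof.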
## Appendix B Liquid Haskell SEQ / SINQ Instances
Listing 1: Example of a SEQ Instance

```haskell
double :: Int -> Int
double x = x + x

double_alt :: Int -> Int
double_alt x = 2 * x

{-@ lemma_double_equiv :: x:Int -> { double x == double_alt x } @-}
lemma_double_equiv :: Int -> Proof
lemma_double_equiv x
  =   double x
  === double_alt x
  *** QED
```
Listing 2: Example of a SINQ Instance

```haskell
sign :: Int -> String
sign n
  | n < 0     = "negative"
  | n == 0    = "zero"
  | otherwise = "positive"

signIneq :: Int -> String
signIneq n
  | n <= 0    = "non-positive"
  | otherwise = "positive"

-- Diverging input
x_a = 0
```
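For a SINQ instance such as Listing 2, the executor only needs to run both programs on Alice's proposed diverging input and compare outputs. A minimal sketch of that check follows; the helper name `divergesAt` is ours, not part of the released pipeline:

```haskell
sign :: Int -> String
sign n
  | n < 0     = "negative"
  | n == 0    = "zero"
  | otherwise = "positive"

signIneq :: Int -> String
signIneq n
  | n <= 0    = "non-positive"
  | otherwise = "positive"

-- Executor step for a SINQ instance: the pair (P, Q, x_a) is accepted
-- only if the proposed input actually separates P and Q.
divergesAt :: Eq b => (a -> b) -> (a -> b) -> a -> Bool
divergesAt p q x = p x /= q x

main :: IO ()
main = print (divergesAt sign signIneq 0)  -- True: "zero" /= "non-positive"
```

At `x_a = 0` the programs return `"zero"` and `"non-positive"` respectively, so the instance is validated; at any positive input the check would fail, since both return `"positive"`.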
## Appendix C Implementation with Reinforcement Learning
Reinforcement learning (RL) offers a conceptually natural way to optimize the game by directly rewarding or penalizing generated programs based on the game outcome. Based on the game rules, a straightforward RL setup would proceed as follows:
### C.1 RL Formulation
We treat Alice and Bob as two competing agents in a zero\-sum Markov game\.
- For the SINQ branch: an executor first checks that $Q$ compiles and that $P(x_a) \neq Q(x_a)$ on Alice's proposed diverging input $x_a$.
- For the SEQ branch: the SMT-based checker executes the Liquid Haskell proof script asserting $\forall x.\; P(x) = Q(x)$.

If the above verification fails, the episode terminates with Alice receiving a negative reward $r_A = r_{\mathrm{fail}} < 0$. Otherwise, Bob observes the state $(P, Q)$ (but not Alice's diverging input $x_a$) and must identify whether $P$ and $Q$ are semantically equivalent.

If Bob is correct, he earns $r_B = r_{\mathrm{success}} > 0$ and Alice is penalized with $r_A = r_{\mathrm{too\_easy}} < 0$. If Bob is incorrect, he receives $r_B = r_{\mathrm{fail}} < 0$ and Alice wins $r_A = r_{\mathrm{win}} > 0$. Both agents update their policies to maximize expected cumulative reward over many self-play episodes.
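The reward assignment above can be summarized as a pure function over episode outcomes. The numeric reward values below are illustrative placeholders, since the text only constrains their signs:

```haskell
-- Outcome of one self-play episode, following the rules in C.1.
data Outcome = VerificationFailed  -- Alice's (Q, x_a) or her proof is rejected
             | BobCorrect          -- Bob judges equivalence correctly
             | BobIncorrect        -- Bob is fooled

-- Illustrative constants; the paper fixes only r_fail, r_too_easy < 0
-- and r_success, r_win > 0, not their magnitudes.
rFail, rTooEasy, rSuccess, rWin :: Double
rFail = -1.0; rTooEasy = -0.5; rSuccess = 1.0; rWin = 1.0

-- Rewards (alice, bob) per outcome; 0 where an agent gets no signal,
-- because the episode ends before Bob plays.
rewards :: Outcome -> (Double, Double)
rewards VerificationFailed = (rFail, 0)
rewards BobCorrect         = (rTooEasy, rSuccess)
rewards BobIncorrect       = (rWin, rFail)

main :: IO ()
main = print (rewards BobIncorrect)
```

The zero-sum character of the game shows up in the last two cases: whenever Bob gains, Alice loses, and vice versa.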
### C.2 Potential Benefits of RL
RL can be layered on top of the semantic equivalence self‐play by treating Alice’s generator as an RL agent whose policy is updated based on proof outcomes\. Over many episodes, this encourages the policy to avoid classes of common semantic errors \(e\.g\. off‐by‐one edge cases\) and refines its internal value function to distinguish subtle equivalence‐breaking patterns, resulting in an LLM that is both more precise and robust\.
### C.3 Practical Challenges
- Sparse, High-Variance Rewards: In our reasoning tasks, reward is only provided at the end of a multi-step chain of thought. As a result, LLMs often struggle to find successful trajectories, making it difficult for them to learn which intermediate steps contributed to success (Liu et al., [2025](https://arxiv.org/html/2604.17010#bib.bib31)).
- Credit Assignment: Attributing success or failure back to specific reasoning tokens or code-editing steps is nontrivial, and naive reward assignment can lead to undesirable shortcuts, such as inserting semantically neutral modifications to exploit reward signals without genuine reasoning.
- Computational Resource Constraints: RL is more computationally expensive than supervised fine-tuning due to repeated sampling, model rollouts, and reward computation across multiple steps (Ouyang et al., [2022](https://arxiv.org/html/2604.17010#bib.bib32)).
## Appendix D Prompt Templates
### D.1 Alice SEQ System Prompt

You are a helpful and expert Haskell programmer, powered by Liquid Haskell.
Your job is to *transform* any given function `P` into a new function `Q` that:
- Is syntactically correct Haskell.
- Is semantically equivalent: ∀x. `P x == Q x`.
- Uses a *different* implementation -- implement a non-structural change rather than merely swapping operator order.
- Uses a different function name (e.g. add a trailing `_alt`).

The original program and your program will be used in a test to evaluate the skill of an expert Haskell programmer, who will identify whether your program Q is semantically equivalent to the original program P, so make sure that the differences you introduce are not very easy to understand.
You will be given a difficulty level from 0 (easiest) to 10 (hardest) to target. E.g. difficulty level 0 means that an expert computer scientist in the bottom decile or above should be able to find a diverging example, difficulty level 9 means that only an expert computer scientist in the top decile should be able to find a diverging example, and difficulty level 10 means that only the top 0.01 or less of expert Haskell programmers should be able to find a diverging example.
Always think through your transformation steps in `<think>...</think>`, then emit exactly:

Generated Program `Q`:
```haskell
<your Q here>
```
### D.2 Alice SEQ User Prompt

Difficulty level: {difficulty_level}
Here is the original Haskell function `P`:
```haskell
{program_p_completion}
```
Its argument type is
```haskell
t = {t}
```
Your task: produce a new function `Q` that satisfies the system prompt requirements.
- Make sure `Q` has a different name (e.g. append `_alt`).
- Avoid trivial symmetric rewrites - show a genuine alternative implementation.
- Do not include any extra commentary beyond the required `<think>...</think>` and the Generated Program `Q`: block.
- Where appropriate, feel free to use Prelude functions such as foldr, map, or zipWith to encourage diverse strategies.
<think>
### D.3 Lemma SEQ Proof System Prompt

You are an expert Haskell/Liquid Haskell prover.
You are asked to prove that two reflected functions are equivalent.
The most basic proof should be in the following format:
```haskell
{-@ lemma_{func_name_p}_equiv :: x:{arg_type} -> { {func_name_p} x == {func_name_q} x } @-}
lemma_{func_name_p}_equiv :: {arg_type} -> Proof
lemma_{func_name_p}_equiv x
  =   {func_name_p} x
  === {func_name_q} x
  *** QED
```
However, you should also use more advanced proof techniques if necessary.
Few-Shot Example 1:
```haskell
{-@ LIQUID "--reflection" @-}
{-@ LIQUID "--ple" @-}
module MyTest where

import Language.Haskell.Liquid.ProofCombinators

-- Alice program P
{-@ reflect double @-}
double :: Int -> Int
double x = x + x

-- Alice proposes Q
{-@ reflect double' @-}
double' :: Int -> Int
double' x = 2 * x

-- Here is the full lemma, from annotation to QED:
{-@ lemma_double_equiv :: x:Int -> { double x == double' x } @-}
lemma_double_equiv :: Int -> Proof
lemma_double_equiv x
  =   double x
  === double' x
  *** QED
```
Few-Shot Example 2:
```haskell
{-@ LIQUID "--reflection" @-}
{-@ LIQUID "--ple" @-}
module Equiv where

import Language.Haskell.Liquid.ProofCombinators

{-@ reflect addNumbers @-}
addNumbers :: Int -> Int -> Int
addNumbers a b = a + b

{-@ reflect addNumbers' @-}
addNumbers' :: Int -> (Int -> Int)
addNumbers' a = \b -> a + b

-- Alice detailed proof of equivalence
lemma_addNumbers_equiv :: Int -> Int -> Proof
lemma_addNumbers_equiv x y
  =   addNumbers x y
  === addNumbers' x y
  *** QED
```
When you answer, output only the complete lemma block in the same style:
1. Use the `{-@ lemma_... @-}` annotation, with the exact naming pattern lemma_<P>_equiv
2. The Haskell type signature
3. The function definition with `===` steps
4. End with `*** QED`
5. Please put your proof between \`\`\`haskell and \`\`\`

No extra text, no additional comments.
Your answer must match the example format exactly, without trailing whitespace or newlines outside the code block.
### D.4 Lemma SEQ Proof User Prompt

{error_msg_section}
{equiv_code}
------------------------------------------------------------
Your task: produce the proof of equivalence for the following function:
`{func_name_p} x == {func_name_q} x` for all `x`.
```haskell
{-@ LIQUID "--reflection" @-}
{-@ LIQUID "--ple" @-}
module Equiv where

import Language.Haskell.Liquid.ProofCombinators

{-@ reflect {func_name_p} @-}
{program_p_content}

{-@ reflect {func_name_q} @-}
{program_q_content}

-- Your complete proof of equivalence
/* PROOF BODY HERE */
```
<think>
### D.5 Alice SINQ System Prompt

You are an expert Haskell programmer. Your task is to generate a semantically inequivalent variant of a given Haskell program, which means that there must exist at least one diverging input example such that the original program and your program either produce different outputs or exceptions, or one halts and the other one does not halt.
You must also provide a diverging input, which is a valid input for both programs, but on which they produce different outputs.
A good inequivalent program `Q` should be subtly different from `P`.
A good diverging input `x` should be simple and clearly demonstrate the semantic difference between `P` and `Q`.
The original program and your program will be used in a test to evaluate the skill of an expert Haskell programmer, who will have to produce a diverging example (not necessarily the same as yours), so make sure that the differences you introduce are not very easy to understand.
You will be given a difficulty level from 0 (easiest) to 10 (hardest) to target. E.g. difficulty level 0 means that an expert computer scientist in the bottom decile or above should be able to find a diverging example, difficulty level 9 means that only an expert computer scientist in the top decile should be able to find a diverging example, and difficulty level 10 means that only the top 0.01 or less of expert Haskell programmers should be able to find a diverging example.
First, think step-by-step and write down your analysis of program `P` and your strategy for creating an inequivalent program `Q`. Enclose this reasoning within `<think>` and `</think>` tags.
After the thinking block, the final answer may only be in the following format, without any additional explanation or context.
Final output MUST be exactly:

Generated Program `Q`:
```haskell
<Your generated Haskell code for `Q`>
```
Diverging Input `x`:
```
<The diverging input `x`>
```
### D.6 Alice SINQ User Prompt

Difficulty level: {difficulty_level}
Original program `P`:
```haskell
{program}
```
<think>
### D.7 Bob System Prompt

You are an expert Haskell programmer. You are given two Haskell programs, `P` and `Q`.
Your task is to determine if they are semantically equivalent.
Use the following format to respond:

\# Equivalent?
Yes or No

If the programs are equivalent, respond with your thought process and a final output with:

\# Equivalent?
Yes

If they are inequivalent, respond with your thought process and a final output with:

\# Equivalent?
No
### D.8 Bob User Prompt

Program `P`:
```haskell
{program_p}
```
Program `Q`:
```haskell
{program_q}
```
<think>
### D.9 Alice SEQ Difficulty Prediction System Prompt

Difficulty level: Any
Program P
```haskell
{program_p}
```
The program Q below is semantically equivalent to the original program P, where the equivalence is due to the fact that the two programs have the same behavior on all inputs.
Program Q
```haskell
{program_q}
```
### D.10 Alice SINQ Difficulty Prediction System Prompt

Difficulty level: Any
Program P
```haskell
{program_p}
```
The program Q below is semantically inequivalent to the original program P, where the inequivalence is due to the fact that the two programs have different behavior on some inputs.
Program Q
```haskell
{program_q}
```
### D.11 Alice Difficulty Prediction User Prompt

Predict the difficulty level of the instance of Program Q compared to Program P. Just write "Difficulty level: D" where D is your prediction; do not write anything else.
## Appendix E Main Experiment Setup

### E.1 Model Configs and Experiment Settings

We employ DeepSeek-R1-Distill-Qwen-7B as the base model. The decoding parameters are configured as: temperature $T = 0.6$, top-$p = 0.95$, top-$k = 20$, presence penalty $= 1.5$ (applied to reduce repetition in outputs from smaller reasoning models (Mahaut and Franzon, [2025](https://arxiv.org/html/2604.17010#bib.bib34))), and a context length of 32,768 tokens.
In $E_0$, Alice is configured with a 50/50 probability of playing either the SEQ or SINQ game (Section [3.3](https://arxiv.org/html/2604.17010#S3.SS3.SSSx1)), with the difficulty level set to 10 in her prompts (Appendices [D.2](https://arxiv.org/html/2604.17010#A4.SS2) and [D.6](https://arxiv.org/html/2604.17010#A4.SS6)). Bob serves solely as an evaluator and difficulty scorer; he is not fine-tuned iteratively but only once after all rounds are complete.
### E.2 Dataset Size and Difficulty Threshold

Due to computational constraints, each experimental run uses a maximum of 500 Haskell reference programs (a main run with fine-tuning takes about 3 days on 4 NVIDIA L40S GPUs), i.e., $|\mathcal{D}| = P = 500$. The difficulty threshold $\tau$ is set to 3 (rather than the default 5) to increase the number of training examples available to Alice. For future experiments with greater resources, it is recommended to increase $|\mathcal{D}|$ and restore $\tau = 5$ to more closely simulate a fully adversarial setting.
### E.3 Fine-Tuning Protocol
After each self\-play round, Alice is fine\-tuned on the combination of newly generated instances and the retained examples from all previous rounds\. More importantly, each round’s fine\-tuning is initialized from the original base model rather than from the fine\-tuned adapter checkpoint obtained in the previous round\. This strategy is adopted to avoid bias accumulation across iterations and to maintain stable and unbiased difficulty estimation\.
During generation in round $i+1$, Alice uses the LoRA adapter learned in round $i$, but fine-tuning for that round still begins from the base model.
### E.4 Experimental Duration
The main run comprises 7 self\-play rounds\. In each round, Alice is fine\-tuned according to the above protocol\. Bob undergoes a single fine\-tuning step after all 7 rounds are complete\. For both Alice and Bob, the fine\-tuning minimises the loss only on the model’s own outputs\.
## Appendix F Rationale and Derivations for the SEQ-SINQ Comparison Experiment

### F.1 Fairness Criterion and Derivation

Let $P$ be the number of reference programs attempted per round, let $p \in [0,1]$ be the probability of playing SEQ (so $1-p$ for SINQ), and let $r_{\text{SEQ}}, r_{\text{SINQ}}$ denote the verification yields (the probability that an attempt becomes a verified training example). By linearity of expectation, the expected number of verified examples per round is

$$K(P,p) = P \bigl( p\, r_{\text{SEQ}} + (1-p)\, r_{\text{SINQ}} \bigr).$$

To isolate the effect of supervision type from the unbalanced SEQ vs. SINQ data volume, we match the expected count $K$ of verified pairs across arms.
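As a sanity check, the formula can be evaluated directly; a one-function sketch in plain Haskell, with the yields estimated from the main run (Appendix F.2), reproduces the per-round verified counts observed in $E_0$:

```haskell
-- Expected verified examples per round: K(P, p) = P (p r_SEQ + (1-p) r_SINQ).
expectedVerified :: Double -> Double -> Double -> Double -> Double
expectedVerified bigP p rSeq rSinq = bigP * (p * rSeq + (1 - p) * rSinq)

main :: IO ()
main = do
  -- E0 setting: P = 500 with a 50/50 game mix and the estimated yields,
  -- giving roughly 134 verified pairs per round.
  print (expectedVerified 500 0.5 0.0194 0.516)
```

This matches the observed totals in $E_0$, where SINQ dominates the verified pairs because $r_{\text{SINQ}} \gg r_{\text{SEQ}}$.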
### F.2 Yield Estimation from the Main Run $E_0$
From the 7 rounds of the main experiment $E_0$, with $|\mathcal{D}| = P = 500$ (per-round attempts $\approx 250$ SEQ and $250$ SINQ):
Table 6: Validation yields from $E_0$. Validated counts by round were: SEQ (1, 2, 4, 6, 3, 9, 9) and SINQ (130, 139, 124, 128, 131, 138, 113), giving per-attempt yields of $r_{\text{SEQ}} = 34/1750 \approx 0.0194$ and $r_{\text{SINQ}} = 903/1750 \approx 0.516$.
### F.3 Predicting Supervision for Each Arm
To achieve a balanced distribution of verified training examples (an approximately 50/50 split), we configured regime $E_2$ with a mix of 96% SEQ and 4% SINQ. With a reference budget of $P = 500$, the estimated verified yields per round are:
$$N_{\text{SEQ}} \approx 500 \cdot 0.96 \cdot 0.0194 \approx 9.3 \qquad (7)$$
$$N_{\text{SINQ}} \approx 500 \cdot 0.04 \cdot 0.516 \approx 10.3 \qquad (8)$$

Summing these via $K(P,p)$ gives an expected $K \approx 19.6$ verified examples per round, achieving the desired parity between the two supervision signals.
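The per-arm arithmetic above can be checked directly; a small sketch, with illustrative constant names:

```haskell
-- Expected verified counts per round for the E2 regime
-- (P = 500, 96% SEQ / 4% SINQ), using the yields from the main run.
nSeqE2, nSinqE2, kE2 :: Double
nSeqE2  = 500 * 0.96 * 0.0194  -- expected verified SEQ examples  (~9.3)
nSinqE2 = 500 * 0.04 * 0.516   -- expected verified SINQ examples (~10.3)
kE2     = nSeqE2 + nSinqE2     -- total per round                 (~19.6)
```

The two signals contribute nearly equally (9.312 vs. 10.32), which is exactly the parity the $E_2$ regime is designed to achieve.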
### F.4 Matched comparator for $E_2$
Choose $P_{E_3}$ in a 100% SINQ run so that

$$P_{E_3} \cdot r_{\text{SINQ}} = K_{E_2} \quad\Rightarrow\quad P_{E_3} = \frac{19.632}{0.516} \approx 38.0, \qquad (9)$$

which we round up to $P_{E_3} = 40$.
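The matched budget in Eq. (9) is just the target count divided by the SINQ yield; a minimal sketch, with a hypothetical function name:

```haskell
-- SINQ-only reference budget needed to match a target number of
-- verified examples per round, given the SINQ verification yield.
matchedBudget :: Double -> Double -> Double
matchedBudget kTarget rSinq = kTarget / rSinq
```

`matchedBudget 19.632 0.516` evaluates to about 38.0, which is rounded up to the $P = 40$ used for $E_3$.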
### F.5 What this controls, and what it does not
Matching $K$ ensures each arm receives a similar amount of verified training data, so outcome differences are attributable to the type of game (presence or absence of SEQ) rather than to data volume. However, residual differences may still arise from the difficulty distribution of accepted items and from class imbalance within Bob’s updates.
### F.6 Comparison Objectives
The experimental design supports the following four key pairwise comparisons:
- $E_0$ vs. $E_1$: Under fixed compute resources (constant $P$), does a mixed 50% SEQ / 50% SINQ regime yield greater performance improvements than a pure SINQ regime similar to the setting of Miceli-Barone et al. ([2025](https://arxiv.org/html/2604.17010#bib.bib25)), given that the latter produces substantially more verified training examples due to higher verification success rates in SINQ?
- $E_0$ vs. $E_2$: With fixed compute resources, does shifting towards a more balanced number of verified SEQ and SINQ examples, at the cost of reducing the overall number of verified training pairs, lead to improved evaluator performance compared to a SINQ-dominated main experiment?
- $E_1$ vs. $E_2$: Under the same compute constraints, does a pure SINQ regime (maximising verified training pairs) outperform a more balanced SEQ–SINQ dataset that sacrifices data volume for greater semantic diversity?
- $E_2$ vs. $E_3$: When the number of verified training pairs is held constant, does the inclusion of SEQ supervision yield measurable performance benefits over a SINQ-only regime?
### F.7 Summary of experiments
- $E_0$ (main): 50% SEQ / 50% SINQ, $P = 500$.
- $E_1$: 100% SINQ, $P = 500$.
- $E_2$: 96% SEQ / 4% SINQ, $P = 500$.
- $E_3$ (matched to $E_2$): 100% SINQ, $P = 40$ to equalize expected verified pairs with $E_2$.
This suite isolates the causal effect of adding SEQ supervision under controlled total verified pairs, enabling a fair assessment of whether SEQ contributes beyond SINQ alone\.
## Appendix G Experiment Results
Table 7: Benchmark performances over all experiments. Averages over 16 samples.
Table 8: Total verified generations from Alice across all experiments over 7 rounds.
## Appendix H Extended Regime Results
### H.1 Experiment $E_0$
Most of the evaluation results are included in the main report\.
#### H.1.1 Difficulty trajectories for both task types
The difficulty trajectories for both task types, shown in Figure [7](https://arxiv.org/html/2604.17010#A8.F7), move upward across rounds.
- For SEQ, although the sample size is small, the mean difficulty score jumps from near $0.0$ in early rounds to $2.0$–$3.0$ in later ones, signaling that Alice manages to construct instances that are non-trivial from Bob’s perspective.
- For SINQ, the means exceed the medians ($0.0$ for all rounds), indicating a right-skewed distribution. This suggests that while most generated instances are of lower difficulty, Alice occasionally produces much harder examples that significantly influence the mean. Starting in round 4, the gap between mean and median widens, and by round 7 the mean even exceeds the upper quartile. Alice is therefore producing increasingly harder examples over successive rounds.


Figure 7: Mean and box-and-whisker plots of the difficulty scores of Alice’s generated instances by round. Top: SEQ Game. Bottom: SINQ Game. Both SEQ and SINQ setups show rising difficulty scores across rounds, indicating that Alice improves at generating more challenging instances for Bob.
#### H.1.2 Alice’s HumanEval (Haskell) Performance
Figure 8: Alice’s HumanEval results. Averages over 16 trials.
#### H.1.3 Alice’s MBPP (Haskell) Performance
Figure 9: Alice’s MBPP results. Averages over 16 trials.
### H.2 Experiment $E_1$
#### H.2.1 Alice’s HumanEval (Haskell) Performance
Figure 10: Alice’s HumanEval results. Averages over 16 trials.
#### H.2.2 Alice’s MBPP (Haskell) Performance
Figure 11: Alice’s MBPP results. Averages over 16 trials.
#### H.2.3 Alice’s SEQ vs SINQ Validated Generation Counts
Figure 12: SEQ vs SINQ Validated Generation Counts.
### H.3 Experiment $E_2$
#### H.3.1 Alice’s HumanEval (Haskell) Performance
Figure 13: Alice’s HumanEval results. Averages over 16 trials.
#### H.3.2 Alice’s MBPP (Haskell) Performance
Figure 14: Alice’s MBPP results. Averages over 16 trials.
#### H.3.3 Alice’s SEQ vs SINQ Validated Generation Counts
Figure 15: SEQ vs SINQ Validated Generation Counts.
### H.4 Experiment $E_3$
#### H.4.1 Alice’s HumanEval (Haskell) Performance
Figure 16: Alice’s HumanEval results. Averages over 16 trials.
#### H.4.2 Alice’s MBPP (Haskell) Performance
Figure 17: Alice’s MBPP results. Averages over 16 trials.
#### H.4.3 Alice’s SEQ vs SINQ Validated Generation Counts
Figure 18: SEQ vs SINQ Validated Generation Counts.
#### H.4.4 $E_2$ vs $E_3$ Validated Generation Counts
Figure 19: Validated generation counts of both experiments, illustrating the effect of controlling $P$ as described in Appendix [F](https://arxiv.org/html/2604.17010#A6) to achieve an equal number of verified generations.