Cycle-Consistent Neural Explanation of Formal Verification Certificates

arXiv cs.AI 06/24/26, 04:00 AM Papers
Summary
This paper proposes a cycle-consistent neural architecture that generates faithful natural language explanations of formal verification certificates, achieving 90% soundness and 860x faster inference than LLM baselines.
arXiv:2606.24414v1 Announce Type: new Abstract: Formal verification produces machine-checkable certificates that attest to the satisfaction or violation of temporal properties, yet these certificates remain opaque to non-specialist stakeholders. We propose a cycle-consistent neural architecture that generates faithful natural language explanations of verification certificates. A forward network NN1 maps certificates to explanations, and an inverse network NN2 reconstructs certificates from explanations; a symbolic verifier closes the loop, providing a differentiable faithfulness proxy. A pointer-generator mechanism ensures lexical grounding by copying state names directly from the certificate. We evaluate on 420 test certificates spanning six verification methods (bounded proof, k-induction, inductive invariant, lasso, reachability, witness pair) in both YES and NO verdict variants, drawn from a financial compliance domain with 207 named states. Our trained architecture, combined with a hybrid inference-time routing strategy, achieves 90.0% cycle-verified soundness, surpassing a multi- LLM few-shot baseline (76.1% for the best of 16 LLM combinations across four frontier models) by 13.9 percentage points. The neural model wins on 10 of 12 verdict/kind categories, with three categories reaching 100% soundness. The architecture offers 860x faster inference (185 ms vs. 160 s per certificate for the full multi-LLM baseline), offline operation, deterministic outputs, and zero per-inference cost. These results demonstrate that trained specialization outperforms general-purpose LLM prompting for structured certificate explanation, while eliminating the deployment constraints of cloud-based inference.
Original Article
View Cached Full Text
Cached at: 06/24/26, 07:47 AM
# Cycle-Consistent Neural Explanation of Formal Verification Certificates
Source: [https://arxiv.org/html/2606.24414](https://arxiv.org/html/2606.24414)
###### Abstract

Formal verification produces machine\-checkable certificates that attest to the satisfaction or violation of temporal properties, yet these certificates remain opaque to non\-specialist stakeholders\. We propose a cycle\-consistent neural architecture that generates faithful natural language explanations of verification certificates\. A forward networkNN1\\mathrm\{NN\}\_\{1\}maps certificates to explanations, and an inverse networkNN2\\mathrm\{NN\}\_\{2\}reconstructs certificates from explanations; a symbolic verifier closes the loop, providing a differentiable faithfulness proxy\. A pointer\-generator mechanism ensures lexical grounding by copying state names directly from the certificate\.

We evaluate on 420 test certificates spanning six verification methods \(bounded proof,kk\-induction, inductive invariant, lasso, reachability, witness pair\) in both YES and NO verdict variants, drawn from a financial compliance domain with 207 named states\. Our trained architecture, combined with a hybrid inference\-time routing strategy, achieves 90\.0% cycle\-verified soundness—surpassing a multi\-LLM few\-shot baseline \(76\.1% for the best of 16 LLM combinations across four frontier models\) by 13\.9 percentage points\. The neural model wins on 10 of 12 verdict/kind categories, with three categories reaching 100% soundness\. The architecture offers∼860×\{\\sim\}860\\timesfaster inference \(∼185\{\\sim\}185ms vs\.∼160\{\\sim\}160s per certificate for the full multi\-LLM baseline\), offline operation, deterministic outputs, and zero per\-inference cost\. These results demonstrate that trained specialization outperforms general\-purpose LLM prompting for structured certificate explanation, while eliminating the deployment constraints of cloud\-based inference\.

## 1Introduction

Formal verification produces machine\-checkable certificates\[[11](https://arxiv.org/html/2606.24414#bib.bib72),[24](https://arxiv.org/html/2606.24414#bib.bib32),[6](https://arxiv.org/html/2606.24414#bib.bib33)\]—lasso\-shaped counterexamples, reachability witnesses, inductive invariants, bounded proofs—that provide mathematical guarantees about system behavior\. In regulated domains such as financial compliance, these certificates are the evidentiary basis for regulatory assurance, yet they remain opaque to the auditors and compliance officers who must act on them\. Generating natural language explanations that are simultaneously*faithful*\(no hallucinated facts\) and*fluent*\(comprehensible to non\-specialists\) is an open challenge: existing neural approaches prioritize fluency at the expense of grounding, while template systems are rigid and domain\-specific\.

We observe that faithfulness admits a testable formulation: an explanationEEof a certificateCCis faithful if and only ifCCcan be reconstructed fromEEalone\. This motivates a*cycle\-consistent*architecture: a forward pointer\-generator networkNN1\\mathrm\{NN\}\_\{1\}mapsC→EC\\rightarrow E, copying state names directly from the certificate to eliminate entity hallucination; an inverse TransformerNN2\\mathrm\{NN\}\_\{2\}reconstructsE→C′E\\rightarrow C^\{\\prime\}; and a symbolic verifier comparesCCwithC′C^\{\\prime\}, providing a faithfulness proxy that is made differentiable through surrogate reconstruction losses \(Section[3\.2](https://arxiv.org/html/2606.24414#S3.SS2)\), requiring no human annotation\. In addition, a hybrid inference\-time router selects among several decoding configurations per example, exploiting structural differences across certificate types\.

We evaluate on 420 test certificates spanning 12 verdict/kind categories from financial compliance workflows\. The hybrid router achieves 90\.0% cycle\-verified soundness—a lower bound on true faithfulness, since NN2’s 3\.6% false\-negative rate means some genuinely faithful explanations are conservatively flagged\. This surpasses the best of 16 frontier LLM pair configurations \(76\.1%\) by 13\.9 percentage points \(non\-overlapping 95% CIs\), winning 10 of 12 categories with three at 100% and no losses\. The sub\-1M parameter model operates offline at∼860×\{\\sim\}860\\timeslower latency, with deterministic outputs, zero per\-inference cost, and no data exposure to external services—properties critical for regulated industries\. Our contributions:

1. 1\.Cycle\-consistent explanation\.A neural architecture combining a pointer\-generator forward model, an inverse Transformer, and a symbolic verifier to enforce faithfulness via cycle consistency without human annotation \(Section[3](https://arxiv.org/html/2606.24414#S3)\)\.
2. 2\.Hybrid inference\-time routing\.A test\-time strategy achieving 90\.0% soundness vs\. 61\.7% for the best single configuration, requiring no additional training \(Section[4\.3](https://arxiv.org/html/2606.24414#S4.SS3)\)\.
3. 3\.Surpassing LLM prompting\.A sub\-1M parameter model outperforms 16 frontier LLM configurations on cycle\-verified soundness \(90\.0% vs\. 76\.1%\), providing evidence that trained specialization outperforms general\-purpose prompting when the faithfulness criterion is formally verifiable \(Section[4\.4](https://arxiv.org/html/2606.24414#S4.SS4)\)\.
4. 4\.Deployment\-ready architecture\.Offline operation, deterministic outputs, zero marginal cost, and no data exposure—with a fixed computation graph that lacks the free\-text input channel through which prompt injection attacks operate \(Section[4](https://arxiv.org/html/2606.24414#S4)\)\.

## 2Related Work

Our approach sits at the intersection of four research areas, which we discuss in turn before positioning our contribution against prior work \(Table[1](https://arxiv.org/html/2606.24414#S2.T1)\)\.

#### Cycle\-consistent learning\.

Cycle consistency as a training objective was popularized by CycleGAN\[[82](https://arxiv.org/html/2606.24414#bib.bib1)\]for unpaired image\-to\-image translation, enforcing that a mappingG:X→YG\\colon X\\rightarrow Ycomposed with its inverseF:Y→XF\\colon Y\\rightarrow Xapproximates the identity\. In natural language processing, back\-translation\[[73](https://arxiv.org/html/2606.24414#bib.bib2)\]applies an analogous principle: translating a sentence to a pivot language and back, then penalizing divergence from the original\. Subsequent work extended cycle consistency to style transfer\[[64](https://arxiv.org/html/2606.24414#bib.bib3)\], data augmentation\[[31](https://arxiv.org/html/2606.24414#bib.bib4)\], and unsupervised machine translation\[[50](https://arxiv.org/html/2606.24414#bib.bib5)\]\. Our work differs in a fundamental respect: the cycle traverses two*different modalities*—structured symbolic certificates and natural language text—rather than two instances of the same modality \(e\.g\., two image domains or two languages\)\. Moreover, the faithfulness criterion at the cycle’s endpoint is not approximate reconstruction but*symbolic semantic equivalence*, evaluated by a formal verifier that compares state sets, path structures, and component memberships\.

#### LLM reward signals and judging\.

The use of large language models to provide training\-time signals has gained traction through RLAIF\[[5](https://arxiv.org/html/2606.24414#bib.bib6)\], LLM\-as\-Judge\[[81](https://arxiv.org/html/2606.24414#bib.bib7)\], and Constitutional AI\[[5](https://arxiv.org/html/2606.24414#bib.bib6)\]\. In these frameworks, the LLM typically judges*overall quality*—a single signal conflating correctness, fluency, helpfulness, and safety\. Our architecture introduces a critical separation: the LLM judges*only fluency*, while faithfulness is enforced by an entirely independent mechanism \(cycle consistency with symbolic verification\), ensuring that fluency optimization cannot introduce hallucinations\.

Our hybrid router is also related to test\-time compute scaling, where generating and reranking multiple candidates improves output quality\[[14](https://arxiv.org/html/2606.24414#bib.bib24)\]; our twist is that the reranking oracle is an*exact symbolic verifier*rather than a learned reward model, and the sub\-1M model makes evaluating dozens of candidates per certificate computationally trivial\.

#### Neural text generation from structured data\.

Data\-to\-text generation maps structured inputs to natural language\. Early neural approaches used sequence\-to\-sequence models with attention\[[4](https://arxiv.org/html/2606.24414#bib.bib8)\], later augmented with copy mechanisms\[[37](https://arxiv.org/html/2606.24414#bib.bib9),[71](https://arxiv.org/html/2606.24414#bib.bib10)\]\. Recent work includes structured data verbalization\[[16](https://arxiv.org/html/2606.24414#bib.bib11)\], table\-to\-text with pre\-trained models\[[59](https://arxiv.org/html/2606.24414#bib.bib12)\], and knowledge\-grounded generation\[[45](https://arxiv.org/html/2606.24414#bib.bib13)\]\. We adapt pointer\-generator networks\[[71](https://arxiv.org/html/2606.24414#bib.bib10)\]to certificate\-to\-explanation generation; our setting is distinguished by formally verifiable faithfulness and the finding that a small trained model surpasses LLM few\-shot prompting \(90\.0% vs\. 76\.1%\)\.

A parallel line of work measures generation faithfulness post hoc, using learned proxies: entailment\-based factual\-consistency classifiers\[[49](https://arxiv.org/html/2606.24414#bib.bib26)\], question\-answering\-based consistency metrics\[[29](https://arxiv.org/html/2606.24414#bib.bib27),[77](https://arxiv.org/html/2606.24414#bib.bib28)\], and the broader hallucination taxonomies surveyed by Ji et al\.\[[45](https://arxiv.org/html/2606.24414#bib.bib13)\]\. These metrics approximate faithfulness with a trained or prompted scorer and are therefore subject to their own errors\. Our criterion is fundamentally different: faithfulness is not estimated but*exactly verified*, since the symbolic verifier checks set\- and sequence\-level equivalence between the original and reconstructed certificate\. This exactness is possible only because certificates are structured objects with a well\-defined fact set\.

#### Explainability in formal verification\.

Counterexample explanation has been addressed through simulation\[[36](https://arxiv.org/html/2606.24414#bib.bib14)\], abstraction\-refinement traces\[[22](https://arxiv.org/html/2606.24414#bib.bib15)\]and fault localization\[[10](https://arxiv.org/html/2606.24414#bib.bib16)\]\. In general, template\-based systems are faithful by construction but fundamentally inflexible—each new certificate kind or domain requires manual template authoring\. Our neural approach addresses these limitations through learned generation, while maintaining faithfulness enforcement via cycle consistency rather than template rigidity\.

In the realm of very recent work in reactive systems,\[[9](https://arxiv.org/html/2606.24414#bib.bib80)\]produce formally\-guaranteed explanations of reactive neural networks via abductive formal reasoning and\[[69](https://arxiv.org/html/2606.24414#bib.bib83)\]generate simple counter\-examples \(similar to the traces we use\) to assess a system is not realizable\. Neither of these approaches produce NL explanations whose formalism in double checked with a cycle consistent architecture\.

#### Neuro\-symbolic and verifier\-in\-the\-loop generation\.

A growing body of work couples neural generators with symbolic or programmatic checkers that constrain or validate outputs, spanning verifier\-guided decoding, tool\- and checker\-augmented generation, and neuro\-symbolic NLG that enforces logical constraints on generated text\[[57](https://arxiv.org/html/2606.24414#bib.bib29),[63](https://arxiv.org/html/2606.24414#bib.bib30)\]\. A closely related but*inverse*task is autoformalization, which maps natural language into formal statements—e\.g\., translating informal mathematics or specifications into a proof assistant or temporal logic\[[79](https://arxiv.org/html/2606.24414#bib.bib31)\]\. Our forward task is the opposite direction: we*de\-formalize*, mapping a formal certificate to natural language\. The connection is that our inverse networkNN2\\mathrm\{NN\}\_\{2\}performs a constrained, domain\-specific form of autoformalization \(explanation→\\rightarrowcertificate\), so the cycle as a whole composes verbalization with re\-formalization under a symbolic verifier\. Unlike autoformalization systems, however, faithfulness here is not the correctness of a single NL→\\rightarrowformal translation but the*round\-trip equivalence*of the original and reconstructed certificate\. More broadly, our architecture is verifier\-in\-the\-loop in a strong sense: the symbolic verifier is not merely a post\-hoc filter but closes a differentiable training cycle, and at inference time it serves as the selection oracle for routing\. Unlike systems where the checker validates surface form, our verifier evaluates*semantic equivalence*of reconstructed certificates, making faithfulness a property the architecture optimizes rather than one it audits after the fact\.

#### Positioning\.

The novelty lies not in any single component but in their*principled integration*for a domain where faithfulness is formally verifiable\. The separation of fluency from faithfulness provides a structural property absent from prior neural explanation systems: the faithfulness constraint is enforced by a symbolic verifier, not by a learned or prompted model\. Table[1](https://arxiv.org/html/2606.24414#S2.T1)summarizes how our approach compares to prior work along the dimensions of faithfulness, fluency, supervision requirements, and formal grounding\.

Table 1:Comparison with related approaches\. Our work uniquely combines cycle\-consistent faithfulness, LLM fluency scoring, and pointer\-generator copying for formal verification certificates\.

## 3Methodology

We present a cycle\-consistent neural architecture for generating natural language explanations of formal verification certificates\. The system comprises two jointly trained neural networks, a symbolic verifier, and a hybrid inference\-time routing strategy \(Figure[1](https://arxiv.org/html/2606.24414#S3.F1)\)\.

### 3\.1Problem Formulation

Let𝒞\\mathcal\{C\}denote the space of verification certificates andℰ\\mathcal\{E\}the space of natural language explanations\. A certificateC∈𝒞C\\in\\mathcal\{C\}is a structured object comprising a verdict \(YES/NO\), a kind \(reachability, lasso, witness\_pair, inductive\_invariant, bounded\_proof,kk\-induction\), and kind\-specific content \(state sequences, path structures, invariant sets\)\. An explanationE∈ℰE\\in\\mathcal\{E\}is a natural language sentence or paragraph describing the certificate’s meaning\. The underlying temporal properties are specified in standard logics such as LTL\[[62](https://arxiv.org/html/2606.24414#bib.bib74)\]and CTL\[[23](https://arxiv.org/html/2606.24414#bib.bib75)\]; our method consumes the resulting certificates and is agnostic to the source logic\.

We seek a mappingf:𝒞→ℰf\\colon\\mathcal\{C\}\\rightarrow\\mathcal\{E\}that produces explanations satisfying two properties:

- •Soundness \(no hallucination\)\.Every factual claim inEEis supported byCC\. If the explanation mentions a statess, thenssmust appear in the certificate\. If it claims a path exists, that path must be present inCC\. Soundness is a hard constraint: any violation constitutes a hallucination that could mislead a human decision\-maker\.
- •Fluency\.The explanation reads as natural English comprehensible to domain experts who are not formal methods specialists\.

We additionally define a desirable but certificate\-type\-dependent property:

- •Completeness \(coverage\)\.The explanation mentions a sufficient proportion of the certificate’s essential content\. Unlike soundness, completeness is not a binary correctness criterion but an*editorial quality*: the appropriate level of completeness varies by certificate type and intended audience\. An exhaustive listing of every state in a 10\-step reachability path may reduce clarity, while omitting states from an inductive invariant may undermine the proof’s validity\.

This distinction between soundness and completeness is fundamental to our approach\. Soundness must be maximized universally\. Completeness is a learned editorial judgment that the model acquires from training examples\.

### 3\.2Faithfulness as Cycle Consistency

We operationalize soundness through a cycle\-consistent architecture \(see Figure[1](https://arxiv.org/html/2606.24414#S3.F1)\)\. A forward modelNN1:𝒞→ℰ\\mathrm\{NN\}\_\{1\}\\colon\\mathcal\{C\}\\rightarrow\\mathcal\{E\}generates explanations, and an inverse modelNN2:ℰ→𝒞\\mathrm\{NN\}\_\{2\}\\colon\\mathcal\{E\}\\rightarrow\\mathcal\{C\}reconstructs certificates from explanations\. Given a certificateCC:

E=NN1\(C\),C′=NN2\(E\)E=\\mathrm\{NN\}\_\{1\}\(C\),\\qquad C^\{\\prime\}=\\mathrm\{NN\}\_\{2\}\(E\)\(1\)
Soundness is enforced by requiring that the reconstructed certificateC′C^\{\\prime\}is consistent with the originalCC\. We evaluate this through a symbolic verifierV\(C,C′\)\\mathrm\{V\}\(C,C^\{\\prime\}\)that decomposes the comparison into:

Soundness\(C,C′\)=\|ℱ\(C\)∩ℱ\(C′\)\|\|ℱ\(C′\)\|,Completeness\(C,C′\)=\|ℱ\(C\)∩ℱ\(C′\)\|\|ℱ\(C\)\|\\text\{Soundness\}\(C,C^\{\\prime\}\)=\\frac\{\|\\mathcal\{F\}\(C\)\\cap\\mathcal\{F\}\(C^\{\\prime\}\)\|\}\{\|\\mathcal\{F\}\(C^\{\\prime\}\)\|\},\\qquad\\text\{Completeness\}\(C,C^\{\\prime\}\)=\\frac\{\|\\mathcal\{F\}\(C\)\\cap\\mathcal\{F\}\(C^\{\\prime\}\)\|\}\{\|\\mathcal\{F\}\(C\)\|\}\(2\)
whereℱ\(⋅\)\\mathcal\{F\}\(\\cdot\)extracts the set of facts—states, path segments, and structural elements—from a certificate\. Soundness measures precision \(of the factsNN2\\mathrm\{NN\}\_\{2\}reconstructed, how many were in the original?\), and completeness measures recall \(of the facts in the original, how many survived the cycle?\)\. In the evaluation metrics of Section[3\.6](https://arxiv.org/html/2606.24414#S3.SS6), we report completeness under the name*coverage*\.

CCCertificateNN1\\mathrm\{NN\}\_\{1\}Pointer\-GenEEExplanationNN2\\mathrm\{NN\}\_\{2\}TransformerC′C^\{\\prime\}ReconstructedSymbolic VerifierV\(C,C′\)\\mathrm\{V\}\(C,C^\{\\prime\}\)ℒcycle\\mathcal\{L\}\_\{\\mathrm\{cycle\}\}ℒrecon\\mathcal\{L\}\_\{\\mathrm\{recon\}\}ErefE^\{\\mathrm\{ref\}\}Referencecopy

Figure 1:Cycle\-consistent architecture\. CertificateCCis mapped to explanationEEbyNN1\\mathrm\{NN\}\_\{1\}\(pointer\-generator\), then reconstructed asC′C^\{\\prime\}byNN2\\mathrm\{NN\}\_\{2\}\(Transformer\)\. A symbolic verifier comparesCCandC′C^\{\\prime\}, producing a cycle consistency lossℒcycle\\mathcal\{L\}\_\{\\mathrm\{cycle\}\}\. The pointer\-generator copies state names directly from the encoded certificate, while generating English scaffolding from the vocabulary\.The cycle consistency loss used during training approximates this check differentiably:

ℒcycle=ℒcycle\-recon\(C,C′\)\+ℒcycle\-sem\(C,C′\)\\mathcal\{L\}\_\{\\mathrm\{cycle\}\}=\\mathcal\{L\}\_\{\\mathrm\{cycle\\text\{\-\}recon\}\}\(C,C^\{\\prime\}\)\+\\mathcal\{L\}\_\{\\mathrm\{cycle\\text\{\-\}sem\}\}\(C,C^\{\\prime\}\)\(3\)
whereℒcycle\-recon\\mathcal\{L\}\_\{\\mathrm\{cycle\\text\{\-\}recon\}\}is the token\-level reconstruction loss between the encoded representations ofCCandC′C^\{\\prime\}, andℒcycle\-sem\\mathcal\{L\}\_\{\\mathrm\{cycle\\text\{\-\}sem\}\}is the semantic divergence measured by the symbolic verifier\.

Crucially, the cycle enforces soundness by construction: ifNN1\\mathrm\{NN\}\_\{1\}hallucinates a state that is not inCC, thenNN2\\mathrm\{NN\}\_\{2\}will reconstruct a certificate containing that hallucinated state, and the verifier will penalize the mismatch\. The only way to minimize cycle loss is forNN1\\mathrm\{NN\}\_\{1\}to produce explanations whose content is faithfully grounded in the input certificate\.

The appropriate level of completeness varies by certificate type—reachability witnesses tolerate lower completeness while inductive invariants require high completeness—and is learned from training examples rather than hard\-coded \(see Appendix[D](https://arxiv.org/html/2606.24414#A4)for details\)\.

### 3\.3Architecture

#### Certificate Encoder\.

Certificates are encoded as token sequences with a structured header\. The format is as follows:

\[START\]\[v0v1\]\[k0k1k2\]\[SEP\]content tokens\[END\]\[\\texttt\{START\}\]\\;\[v\_\{0\}\\;v\_\{1\}\]\\;\[k\_\{0\}\\;k\_\{1\}\\;k\_\{2\}\]\\;\[\\texttt\{SEP\}\]\\;\\text\{content tokens\}\\;\[\\texttt\{END\}\]where\[v0v1\]\[v\_\{0\}\\;v\_\{1\}\]is a 2\-bit verdict encoding and\[k0k1k2\]\[k\_\{0\}\\;k\_\{1\}\\;k\_\{2\}\]is a 3\-bit kind encoding using deterministic token pairs\. Content tokens are state indices drawn from a shared vocabulary with the explanation tokenizer, ensuring that the same token ID refers to the same state in both modalities\.

#### NN1: Certificate to Explanation \(Pointer\-Generator\)\.

The forward model is a Transformer encoder\-decoder\[[76](https://arxiv.org/html/2606.24414#bib.bib71)\]augmented with a pointer\-generator mechanism\[[71](https://arxiv.org/html/2606.24414#bib.bib10)\]\. At each decoding steptt, the model computes the following:

- •A vocabulary distributionPvocab\(w\)P\_\{\\mathrm\{vocab\}\}\(w\)over the output vocabulary via the decoder’s softmax layer\.
- •A copy distributionPcopy\(w\)P\_\{\\mathrm\{copy\}\}\(w\)that places probability mass on source tokens via an attention distribution over the encoded certificate\.
- •A generation probabilitypgen∈\[0,1\]p\_\{\\mathrm\{gen\}\}\\in\[0,1\]computed from the decoder state, context vector, and current input embedding\.

The final output distribution is the mixture:

P\(w\)=pgen⋅Pvocab\(w\)\+\(1−pgen\)⋅Pcopy\(w\)P\(w\)=p\_\{\\mathrm\{gen\}\}\\cdot P\_\{\\mathrm\{vocab\}\}\(w\)\+\(1\-p\_\{\\mathrm\{gen\}\}\)\\cdot P\_\{\\mathrm\{copy\}\}\(w\)\(4\)
This mechanism is essential for soundness: state names are copied directly from the certificate tokens rather than generated from the vocabulary distribution, eliminating the primary source of entity hallucination\.

#### NN2: Explanation to Certificate\.

The inverse model is a standard Transformer encoder\-decoder sharing the vocabulary with NN1\. It does not require a copy mechanism because certificate reconstruction targets a fixed, small output vocabulary \(state tokens and special tokens\)\. Full architectural details \(layer counts, dimensions\) are in Section[4\.1](https://arxiv.org/html/2606.24414#S4.SS1)\.

#### End\-to\-End Walkthrough\.

Figure[2](https://arxiv.org/html/2606.24414#S3.F2)shows the architecture from Figure[1](https://arxiv.org/html/2606.24414#S3.F1)instantiated on a concrete YES/reachability certificate, with the actual data flowing through each component\. Appendix[I](https://arxiv.org/html/2606.24414#A9)illustrates the full pipeline on four concrete examples with varying soundness and coverage outcomes\.

CertificateCCYES / reachability
path: \[KYCVerification→\\rightarrowSanctionsScreening→\\rightarrowCollateralCheck→\\rightarrowSettlementFinal\]
target:SettlementFinalNN1\\mathrm\{NN\}\_\{1\}: Pointer\-Generatorforward modelExplanationEE
“The verification confirms that the property holds: a reachability path exists fromKYCVerificationthroughSanctionsScreeningandCollateralCheckto the target stateSettlementFinal\.”
Blue bold= copied from certificate via pointer mechanism \| Regular text = generated from vocabularygenerate explanationNN2\\mathrm\{NN\}\_\{2\}: Inverse Transformerinverse modelReconstructed CertificateC′C^\{\\prime\}
path: \[KYCVerification→\\rightarrowSanctionsScreening→\\rightarrowCollateralCheck→\\rightarrowSettlementFinal\]
target:SettlementFinalreconstruct certificateSymbolic VerifierV\(C,C′\)\\mathrm\{V\}\(C,C^\{\\prime\}\)compareFact Extraction
ℱ\(C\)=ℱ\(C′\)=\{\\mathcal\{F\}\(C\)=\\mathcal\{F\}\(C^\{\\prime\}\)=\\\{path\_0: KYCVerification, path\_1: SanctionsScreening, path\_2: CollateralCheck,
path\_3: SettlementFinal, target: SettlementFinal\}\\\}
Soundness:5/5 = 1\.00 ✓Coverage:5/5 = 1\.00 ✓SOUNDverifyFigure 2:The cycle\-consistent architecture \(Figure[1](https://arxiv.org/html/2606.24414#S3.F1)\) instantiated on a YES/reachability certificate\.Blue boldtokens are copied directly from the certificate via the pointer\-generator; regular text is generated from vocabulary\. The symbolic verifier confirmsℱ\(C\)=ℱ\(C′\)\\mathcal\{F\}\(C\)=\\mathcal\{F\}\(C^\{\\prime\}\): all 5 facts match, yielding perfect soundness and coverage\. The dashed arrow indicates that the verifier comparesC′C^\{\\prime\}against the originalCC\.

### 3\.4Training Objective and Protocol

The training objective combines three loss terms:

ℒ=λreconℒrecon\+λcycleℒcycle\+ℒunk\\mathcal\{L\}=\\lambda\_\{\\mathrm\{recon\}\}\\,\\mathcal\{L\}\_\{\\mathrm\{recon\}\}\+\\lambda\_\{\\mathrm\{cycle\}\}\\,\\mathcal\{L\}\_\{\\mathrm\{cycle\}\}\+\\mathcal\{L\}\_\{\\mathrm\{unk\}\}\(5\)
whereℒrecon\\mathcal\{L\}\_\{\\mathrm\{recon\}\}is the negative log\-likelihood of the reference explanation under NN1’s output distribution,ℒcycle=ℒcycle\-recon\+ℒcycle\-sem\\mathcal\{L\}\_\{\\mathrm\{cycle\}\}=\\mathcal\{L\}\_\{\\mathrm\{cycle\\text\{\-\}recon\}\}\+\\mathcal\{L\}\_\{\\mathrm\{cycle\\text\{\-\}sem\}\}is the cycle consistency loss \(Section[3\.2](https://arxiv.org/html/2606.24414#S3.SS2)\), andℒunk\\mathcal\{L\}\_\{\\mathrm\{unk\}\}penalizes the generation of unknown tokens\. We useλrecon=1\.0\\lambda\_\{\\mathrm\{recon\}\}=1\.0andλcycle=2\.0\\lambda\_\{\\mathrm\{cycle\}\}=2\.0\. Teacher forcing with linear decay from ratio1\.01\.0to0\.50\.5is applied during training to balance exposure bias and stability\.

The model is trained on paired\(C,E\)\(C,E\)examples whereEEis a reference explanation generated from diverse templates\. Supervised training serves three purposes: \(i\)language bootstrapping—NN1learns English syntax, vocabulary usage, and certificate\-type\-specific phrasing; \(ii\)editorial judgment—the reference explanations encode which parts of each certificate type to emphasize and what level of completeness is appropriate; and \(iii\)cold\-start resolution—without supervised signal, NN1produces unintelligible output, NN2receives no useful gradient, and the cycle collapses\. Training uses AdamW\[[51](https://arxiv.org/html/2606.24414#bib.bib18)\]with learning rate3×10−43\\times 10^\{\-4\}, batch size 16, and converges in approximately 39 epochs \(∼\\sim18 hours on CPU\)\.

Figure[3](https://arxiv.org/html/2606.24414#S3.F3)shows the training loss trajectory over 50 epochs\. All loss components decrease rapidly in the first 5 epochs as the model learns basic English scaffolding and certificate\-to\-token mappings\. The reconstruction loss \(ℒrecon\\mathcal\{L\}\_\{\\mathrm\{recon\}\}\) continues decreasing steadily, reflecting improving template adherence\. The cycle reconstruction loss \(ℒcycle\-recon\\mathcal\{L\}\_\{\\mathrm\{cycle\\text\{\-\}recon\}\}\) decreases monotonically throughout training, while the cycle semantic loss \(ℒcycle\-sem\\mathcal\{L\}\_\{\\mathrm\{cycle\\text\{\-\}sem\}\}\) plateaus near 0\.30 after epoch 10\. We interpret this plateau as a floor set by the information bottleneck of the natural language channel: some certificate facts \(e\.g\., precise ordering within long state sequences\) cannot be fully preserved through the discrete text representation\. The best checkpoint is selected at epoch 39 by minimum total cycle loss\.

![Refer to caption](https://arxiv.org/html/2606.24414v1/x1.png)Figure 3:Training loss over 50 epochs\. Left: all loss components showing rapid initial convergence and steady improvement\. Right: detail of cycle and reconstruction losses\. The cycle semantic loss plateaus near 0\.30 after epoch 10, indicating a floor set by the information bottleneck of the natural language channel\. Best checkpoint \(epoch 39\) is selected by minimum cycle loss\.Optionally, an LLM fluency oracle can provide an additional training signal that evaluates the naturalness of generated text \(see Appendix[C](https://arxiv.org/html/2606.24414#A3)for details on fluency\-guided refinement\)\. In our main experiments, we train withλLM=0\\lambda\_\{\\mathrm\{LM\}\}=0\(no fluency loss\), relying on the reconstruction loss to implicitly enforce fluency\.

### 3\.5Hybrid Inference\-Time Routing

At inference time, multiple decoding strategies—greedy, beam search\[[32](https://arxiv.org/html/2606.24414#bib.bib23)\]with varying widths, and constrained decoding \(restricting the output vocabulary to valid certificate tokens\[[63](https://arxiv.org/html/2606.24414#bib.bib30)\]\)—produce different trade\-offs between soundness and coverage\. Rather than selecting a single strategy, we introduce a*hybrid routing*approach that exploits the structural diversity across certificate types\.

We observe that certificate categories divide into two regimes:

- •Copy\-dominated categories\(e\.g\., NO/bounded\_proof, YES/kk\-induction, lasso variants\)\. These categories require primarily copying state names from the certificate\. The best strategy is a single fixed decoding configuration per category, selected by soundness rate on held\-out data\.
- •Generation\-heavy categories\(e\.g\., witness\_pair, reachability, inductive\_invariant\)\. These categories require generating explanatory text around copied elements\. Different examples within the same category benefit from different decoding configurations, and the*coverage*of the cycle\-verified reconstruction positively predicts soundness\.

The hybrid router operates as follows: for each test example, it identifies the certificate’s verdict/kind category\. The copy\-dominated vs\. generation\-heavy categorization is determined on a held\-out validation set \(disjoint from the test set\) by examining per\-category soundness variance across configurations\. If the category is copy\-dominated, the router applies the pre\-selected best configuration for that category\. Otherwise, the router runs all available configurations and selects the result with the highest coverage score \(see Figure[4](https://arxiv.org/html/2606.24414#S3.F4)\)\. This strategy requires no additional training—it operates entirely at inference time over the same trained model, and achieves substantially higher soundness than any single configuration \(Section[4\.3](https://arxiv.org/html/2606.24414#S4.SS3)\)\.

CertificateCC\(verdict, kind\)Copy\-dominated?Apply fixed bestconfig for category\(e\.g\., e39\_beam5\)YesRun all 37 configs→\\rightarrow\(E1,C1′\),…,\(E37,C37′\)\(E\_\{1\},C^\{\\prime\}\_\{1\}\),\\ldots,\(E\_\{37\},C^\{\\prime\}\_\{37\}\)NoSelectarg⁡maxi\\arg\\max\_\{i\}coverage\(C,Ci′\)\(C,C^\{\\prime\}\_\{i\}\)E∗E^\{\*\}\(sound\)E∗E^\{\*\}\(sound\)NO/bounded\_proof
YES/kk\-induction
YES/lasso, NO/lassowitness\_pair
reachability
inductive\_invariantFigure 4:Hybrid inference\-time router\. For copy\-dominated categories \(left\), a single pre\-selected decoding configuration is applied\. For generation\-heavy categories \(right\), all 37 configurations are run and the result with the highest cycle\-verified coverage is selected\. No additional training is required\.
### 3\.6Evaluation Metrics

The primary metric issoundness rate: the fraction of test examples for which the reconstructed certificateC′C^\{\\prime\}contains zero incorrect facts \(\|ℱ\(C′\)∖ℱ\(C\)\|=0\|\\mathcal\{F\}\(C^\{\\prime\}\)\\setminus\\mathcal\{F\}\(C\)\|=0\), using the fact sets defined in Eq\.[2](https://arxiv.org/html/2606.24414#S3.E2)\. This is a strict, per\-example criterion: a single hallucinated state renders an explanation unsound\. We additionally reportcoverageas a secondary diagnostic—it is particularly useful for the hybrid router \(Section[3\.5](https://arxiv.org/html/2606.24414#S3.SS5)\), where it serves as a selection criterion among candidate explanations\.

## 4Experiments

We evaluate the cycle\-consistent neural architecture on verification certificates drawn from a financial regulatory compliance domain\. Our experiments address four questions: \(1\) Does the architecture produce sound explanations across diverse certificate types? \(2\) How does a hybrid inference\-time routing strategy improve over single\-configuration decoding? \(3\) How does the trained architecture compare to direct LLM few\-shot prompting? \(4\) What structural properties of certificates determine explanation difficulty?

### 4\.1Experimental Setup

#### Domain and Certificate Types\.

We instantiate the architecture over a workflow domain comprising 207 named states drawn from financial compliance processes \(e\.g\.,KYCVerification,SanctionsScreening,CollateralCheck\)\. Certificates span six kinds, each corresponding to a standard verification technique:

1. 1\.Reachability: a path from an initial state to a target state \(1 sequence\)\.
2. 2\.Lasso: a prefix path followed by a repeating cycle, witnessing or refuting liveness properties \(2 sequences\)\.
3. 3\.Witness pair: a path demonstrating that two requirements are jointly satisfiable \(1 sequence\)\.
4. 4\.Inductive invariant: a set of states closed under the transition relation, proving or refuting safety \(1–2 sequences\)\.
5. 5\.Bounded proof \(BMC\): a bounded model checking\[[11](https://arxiv.org/html/2606.24414#bib.bib72)\]trace or checked state set \(1 sequence\)\.
6. 6\.kk\-induction: base and step state sets with an optional counterexample trace \(2–3 sequences\)\[[74](https://arxiv.org/html/2606.24414#bib.bib73)\]\.

Each kind has both YES \(property holds / violation found\) and NO \(property refuted / no counterexample\) variants, yielding 12 verdict/kind categories\. Appendix[B](https://arxiv.org/html/2606.24414#A2)provides concrete examples\.

#### Dataset\.

Training data consists of certificates generated by bounded model checking of deployed financial compliance workflows from operational systems, spanning structurally diverse regulatory processes \(specific process names are withheld for compliance reasons, but they cover distinct areas of financial regulation including customer due diligence, transaction monitoring, and settlement assurance\)\. For each verdict/kind combination, explanation templates provide diverse natural language renderings of the certificate content\. The template set comprises 48 unique templates \(4 per verdict/kind category\) with lexical variation in phrasing, connectives, and sentence structure\. Crucially, the 12 verdict/kind categories require fundamentally different linguistic patterns—describing a lasso cycle differs structurally from describing an inductive invariant or a bounded proof trace—so the 48 templates span a broad space of explanation structures rather than minor paraphrases of a single pattern\. Test\-set templates are drawn from the same pool but applied to held\-out certificates generated with a separate random seed; we additionally evaluate a held\-out template split that confirms generalization beyond template surface forms \(Section[4\.5](https://arxiv.org/html/2606.24414#S4.SS5)\)\. The training set contains 5,841 examples\. A held\-out test set of 420 examples \(35 per verdict/kind category\) is generated with a separate random seed\. The shared vocabulary contains 543 tokens: 207 state names, special tokens, and approximately 330 common English words\.

#### Model Configuration\.

Both NN1and NN2use Transformer encoder\-decoder architectures withdmodel=256d\_\{\\mathrm\{model\}\}=256,nheads=8n\_\{\\mathrm\{heads\}\}=8,nlayers=4n\_\{\\mathrm\{layers\}\}=4, anddff=1024d\_\{\\mathrm\{ff\}\}=1024\. NN1is augmented with a pointer\-generator copy mechanism \(Section[3\.3](https://arxiv.org/html/2606.24414#S3.SS3)\)\. Maximum certificate encoding length is 512 tokens; maximum explanation length is 120 tokens\. The total model has fewer than 1M parameters\.

#### Evaluation Protocol\.

Each test certificate is processed through the full cycle:C→NN1E→NN2C′C\\xrightarrow\{\\mathrm\{NN\}\_\{1\}\}E\\xrightarrow\{\\mathrm\{NN\}\_\{2\}\}C^\{\\prime\}\. The symbolic verifier extracts facts from bothCCandC′C^\{\\prime\}and computes binary soundness \(zero wrong facts\) and coverage \(fraction of correct facts\)\. We reportsoundness rate\(fraction of sound examples\) as the primary metric\. Soundness is the focus of this work; fluency is a secondary objective that is addressed architecturally through the decoupled LLM fluency oracle \(Appendix[C](https://arxiv.org/html/2606.24414#A3)\), which can score generated explanations independently of the faithfulness signal\. Since our main experiments train withλLM=0\\lambda\_\{\\mathrm\{LM\}\}=0and the training data is template\-anchored, the generated outputs inherit the linguistic quality of the reference templates by design—they are grammatical and domain\-appropriate, though not stylistically diverse\. Workflow owners confirmed that outputs are comprehensible, actionable, and suitable for compliance reports; a formal human evaluation protocol remains future work\.

#### Quantitative Fluency: BERTScore\.

To complement the qualitative fluency endorsement from workflow owners, we compute BERTScore\[[80](https://arxiv.org/html/2606.24414#bib.bib25)\]between the generated explanations and the reference template\-based explanations on all 420 test examples\. We use themicrosoft/deberta\-xlarge\-mnlimodel as the scoring backbone\.

Table 2:BERTScore \(F1, Precision, Recall\) for generated vs\. reference explanations on 420 test certificates\.The high BERTScore F1 confirms that generated explanations closely match the lexical and semantic content of the reference templates\. Sound examples achieve slightly higher scores than unsound examples, consistent with the expectation that faithful explanations also better match reference text\. We note that BERTScore measures similarity to template references, not intrinsic fluency; however, since the templates were authored by domain experts and validated for use in compliance reports, high similarity to these references is a meaningful quality signal\.

#### NN2Standalone Accuracy\.

To quantify the tightness of the cycle\-consistency proxy, we evaluate NN2on ground\-truth reference explanations \(i\.e\., bypassing NN1\)\. NN2achieves 96\.4% soundness and 0\.98 mean coverage when reconstructing from reference explanations, confirming that the false\-negative rate—faithful explanations that fail the cycle check due to NN2reconstruction errors—is approximately 3\.6%\. This establishes that cycle consistency is a tight proxy for faithfulness, not merely a loose correlate\. Consequently, the reported 90\.0% soundness is a*lower bound*on true faithfulness: some genuinely faithful explanations are conservatively flagged as unsound due to NN2reconstruction limitations rather than actual hallucination by NN1\. The converse concern—false positives where NN2“charitably” reconstructs a certificate from an ambiguous explanation—is mitigated by the architectural separation \(NN1and NN2share no hidden state\) and by the private\-language test \(Section[4\.7](https://arxiv.org/html/2606.24414#S4.SS7)\), which confirms that NN2relies on the surface content of the explanation rather than co\-adapted latent signals\. Nonetheless, subtle co\-adaptation remains an inherent limitation of any learned inverse proxy, and we cannot fully rule it out\.

### 4\.2Single\-Model Results

Table[3](https://arxiv.org/html/2606.24414#S4.T3)shows the per\-category results for the best single decoding configuration \(epoch 39, beam 9\), achieving 61\.7% overall soundness\. Performance varies substantially across categories: copy\-dominated categories \(NO/bounded\_proof: 97%, YES/inductive\_invariant: 93%\) achieve high soundness, while generation\-heavy categories \(YES/witness\_pair: 17%, YES/reachability: 27%\) are more challenging\. The mean coverage column shows that even unsound explanations typically preserve most certificate facts \(overall mean coverage 0\.94\), indicating that failures are predominantly near\-misses rather than catastrophic\.

Table 3:Best single model results \(epoch 39, beam 9\) on 420 test examples across 12 verdict/kind categories\.
### 4\.3Hybrid Router Results

We train a single model and evaluate it under 37 decoding configurations \(varying epoch checkpoints, beam widths 1–11, and constrained vs\. unconstrained decoding\)\. The hybrid router \(Section[3\.5](https://arxiv.org/html/2606.24414#S3.SS5)\) selects among these configurations per example\.

Table 4:Comparison of inference strategies on 420 test examples \(no additional training\)\.The hybrid router achieves 90\.0% soundness, a 28\.3 percentage point improvement over the best single model, and more than 11 points above the ensemble baseline that selects the best configuration per category\. We note that the best single decoding configuration \(61\.7%\) falls below the best LLM pair \(76\.1%\); the neural architecture’s advantage emerges only when combined with the hybrid routing strategy, which exploits the symbolic verifier’s coverage score as a selection criterion\. In principle, an analogous best\-of\-NNstrategy could be applied to the LLM baseline \(generatingNNcandidates per certificate and reranking by verifier coverage\)\. However, this is precisely where the neural architecture’s deployment advantage is decisive: running 37 neural configurations costs∼185\{\\sim\}185ms and $0 per certificate, whereas 37 LLM calls would require∼370\{\\sim\}370s and∼$0\.74\{\\sim\}\\mathdollar 0\.74—a2,000×2\{,\}000\\timeslatency increase that makes test\-time search impractical for LLMs at production scale\. The neural model’s contribution is not merely accuracy but that it makes verifier\-guided test\-time selection*computationally feasible*\. The per\-example coverage\-based selection for generation\-heavy categories is the key driver: it identifies which decoding configuration produces the most faithful explanation for each specific certificate instance, rather than committing to a single strategy per category\. Table[5](https://arxiv.org/html/2606.24414#S4.T5)shows the per\-category breakdown\. Comparing with the single\-model results \(Table[3](https://arxiv.org/html/2606.24414#S4.T3)\) reveals that the router’s largest gains concentrate on generation\-heavy categories: YES/witness\_pair improves from 17% to 87%, YES/reachability from 27% to 83%, and YES/bounded\_proof from 40% to 100%, while copy\-dominated categories that were already strong \(e\.g\., NO/bounded\_proof: 97%→\\rightarrow100%\) see only modest gains\.

Table 5:Per\-category soundness: neural hybrid router vs\. best LLM pair \(Opus 4\.5/Opus 4\.5\)\. Left: hybrid router results \(sound count and rate\)\. Right: best LLM pair rate and winner\. Both evaluated with identical cycle\-consistency verification on 420 test examples \(35 per category\)\.Three categories achieve 100% soundness \(NO/bounded\_proof, YES/bounded\_proof, YES/kk\-induction\)\. All categories except NO/kk\-induction and NO/reachability exceed 80%\. Soundness correlates with the number of distinct state sequences that must be preserved through the natural language channel rather than with certificate kind*per se*; see Appendix[G](https://arxiv.org/html/2606.24414#A7)\.

### 4\.4Baseline: Multi\-LLM Few\-Shot Prompting

Since the hybrid router evaluates multiple decoding configurations per certificate, a fair LLM baseline should similarly explore multiple model configurations\. We evaluate four frontier LLMs—GPT\-4o and GPT\-5\.3\[[58](https://arxiv.org/html/2606.24414#bib.bib19)\], Claude Opus 4\.5 and Claude Sonnet 4\.6\[[2](https://arxiv.org/html/2606.24414#bib.bib20)\]—in all 16 pairwise combinations for the NN1and NN2roles\. For each test certificate and each LLM pair, we perform two independent calls with isolated conversation contexts: one LLM generates an explanation from the certificate \(NN1role\), and a separate LLM instance reconstructs the certificate from that explanation alone \(NN2role\)\. Importantly, the LLM baseline does*not*use the trained NN2: each LLM pair performs its own reconstruction, and only the symbolic verifier—which operates on certificate structures, not on explanation text—is shared across both evaluation pipelines\. This ensures the comparison is not biased by NN2’s co\-adaptation with NN1\.

We evaluate all 16 LLM pair combinations \(see Table[6](https://arxiv.org/html/2606.24414#S4.T6)\)\. The best pair is Opus 4\.5/Opus 4\.5 at 76\.1%, followed by GPT\-4o/GPT\-4o at 74\.2%; same\-model pairs generally outperform cross\-model pairs\. The “Best LLM” column in Table[5](https://arxiv.org/html/2606.24414#S4.T5)compares the best LLM pair with the neural hybrid router per category\.

Table 6:Multi\-LLM baseline: soundness rate \(%\) for each NN1/NN2pair on 420 test examples\. Best result in bold\.The neural hybrid router outperforms the best LLM pair on 10 of 12 categories and ties on 2 \(YES/bounded\_proof and NO/kk\-induction\), with no category losses\. The overall gap is 13\.9 percentage points \(90\.0% vs\. 76\.1%\)\. Withn=420n=420, this gap is statistically significant \(95% binomial confidence intervals:90\.0%±2\.9%90\.0\\%\\pm 2\.9\\%vs\.76\.1%±4\.1%76\.1\\%\\pm 4\.1\\%, non\-overlapping\)\. However, per\-category intervals withn=35n=35are wide \(±10\\pm 10–1515pp\), so most individual category wins—particularly narrow gaps such as YES/reachability \(\+3\+3pp\)—are not individually significant and should be interpreted as directional rather than conclusive\. The second\-best LLM pair \(GPT\-4o/GPT\-4o\) achieves 74\.2%, and the remaining 14 combinations range from 65\.3% to 73%, confirming that the gap is robust to model selection\. Cross\-model pairs \(e\.g\., Opus 4\.5 as NN1with GPT\-4o as NN2\) generally underperform same\-model pairs, suggesting that LLMs develop model\-specific explanation conventions that are not consistently parsed by other models\.

We note that this comparison evaluates*few\-shot*LLM prompting; a fine\-tuned small LLM \(e\.g\., T5\-small\) on the same 5,841 training examples would be a natural additional baseline\. Our claim is not that model size is the decisive factor, but that the*cycle\-consistent architecture*—which enables cheap verifier\-guided test\-time search—produces a system that outperforms few\-shot frontier LLMs on this structured task\. The architecture and the routing strategy are complementary: the pointer\-generator ensures lexical grounding, while the routing exploits the verifier to select among candidates\. Neither component alone achieves the headline result\.

#### Analysis\.

The neural architecture’s advantage is largest on copy\-dominated categories where the LLM’s transcription strategy introduces errors\. Even the best LLM pair achieves only 43% soundness on NO/bounded\_proof—it frequently hallucinates state names or misspells copied identifiers—while the pointer\-generator achieves 100% by copying directly from the encoded certificate\. Similarly, YES/lasso \(97% vs\. 80%\) and NO/lasso \(93% vs\. 77%\) reflect the neural model’s learned sequence attribution patterns for prefix/cycle decomposition\. NO/kk\-induction is the one category where the neural model and the best LLM pair are tied at 73%\. This category involves complex multi\-sequence certificates with counterexample traces, where the LLM’s larger context window partially compensates for its copying errors\.

#### Why the neural model wins\.

The neural advantage stems from a single structural difference\. All four LLMs adopt the same strategy—*transcription*: listing certificate content in natural language\. This introduces three error mechanisms: state\-name misspelling during generation, omission of states from long lists, and misattribution of states between structural roles \(e\.g\., base vs\. step inkk\-induction\)\. These persist across all model families and scales tested, indicating they are intrinsic to the transcription strategy rather than artifacts of any single model\. Our architecture eliminates the first mechanism by construction—the pointer\-generator copies state tokens directly rather than regenerating them—while gradient training on thousands of examples addresses the latter two in ways few\-shot prompting cannot\.

#### Failure Analysis\.

The 42 unsound examples \(10\.0%\) cluster in three error modes: \(i\)*sequence attribution errors*\(20/42\): states are correctly copied but assigned to the wrong structural role \(e\.g\., a base state attributed to the step set inkk\-induction\); \(ii\)*partial omission*\(14/42\): the explanation omits states from multi\-state sequences, causing NN2to reconstruct a truncated certificate that the verifier flags as containing spurious facts when it attempts to align the shorter sequence; \(iii\)*scaffolding errors*\(8/42\): the generated English text misrepresents the certificate’s structural claim \(e\.g\., describing a cycle as a prefix\)\. Categories with 3\+ interleaved sequences \(NO/kk\-induction, NO/reachability\) account for 22 of the 42 failures\. The majority of failures are near\-misses rather than catastrophic: 33/42 unsound examples have only a single incorrect fact \(one misattributed or hallucinated state\), while only 9/42 exhibit multiple errors\.

#### Pointer\-Generator Behavior\.

The pointer\-generator copy mechanism is essential for soundness: the generation probabilitypgenp\_\{\\mathrm\{gen\}\}exhibits a clear bimodal pattern—close to 0 when generating state names \(copying\) and close to 1 for English scaffolding \(generating\)—confirming the intended separation between structural content and linguistic scaffolding\.

### 4\.5Held\-Out Template Evaluation

To assess generalization beyond template surface forms, we perform a leave\-one\-out template split: for each of the 12 verdict/kind categories, we train on 3 of the 4 templates and evaluate on certificates rendered with the held\-out 4th template\. This is repeated for all 4 folds, and results are averaged\.

The held\-out template evaluation achieves 82\.4% overall soundness \(vs\. 90\.0% with same\-pool templates\), with mean coverage of 0\.91\. Per\-category results are shown in Table[7](https://arxiv.org/html/2606.24414#S4.T7)\.

Table 7:Held\-out template split: soundness rate \(%\) by verdict/kind category, averaged over 4 template folds\. “Same\-pool” column repeats the main evaluation for comparison\.The moderate drop from same\-pool to held\-out evaluation \(7\.6 pp\) confirms that the model has learned genuine certificate\-to\-explanation mappings rather than pure template memorization: soundness remains 6\.3 pp above the best LLM baseline \(76\.1%\) even on unseen template structures\. The largest drops occur in multi\-sequence categories \(YES/kk\-induction:−11\-11pp, YES/lasso:−11\-11pp\), where novel template phrasing disrupts the learned sequence\-attribution patterns\. Copy\-dominated categories \(NO/bounded\_proof:−6\-6pp, YES/bounded\_proof:−3\-3pp\) are most robust, as the pointer\-generator’s copying behavior generalizes across template variations\.

### 4\.6Deployment Characteristics

The neural architecture offers decisive latency and determinism advantages over LLM\-based approaches \(Table[8](https://arxiv.org/html/2606.24414#S4.T8)\):∼860×\{\\sim\}860\\timesfaster inference than the full 16\-pair LLM evaluation \(185 ms vs\. 160 s\), or∼54×\{\\sim\}54\\timesfaster than a single LLM pair, with fully deterministic outputs\.

Table 8:Deployment comparison: neural architecture vs\. multi\-LLM baseline\. LLM costs reflect the full 16\-pair evaluation needed to match the hybrid router’s multi\-configuration approach; single\-pair costs are shown in parentheses\.In a production deployment processing 10,000 certificates per day, the neural hybrid router requires∼31\{\\sim\}31minutes total \(37 configurations×\\times∼5\{\\sim\}5ms each×\\times10K\)\. The multi\-LLM baseline, evaluated across all 16 pairs to select the best, requires∼444\{\\sim\}444hours of serial API time at∼$3,200\{\\sim\}\\mathdollar 3\{,\}200/day\. Even restricting to the single best pair \(Opus 4\.5/Opus 4\.5\) still requires∼28\{\\sim\}28hours at∼$200\{\\sim\}\\mathdollar 200/day\.

Three structural advantages persist regardless of future LLM accuracy improvements:

- •Cost\.The neural model has zero marginal inference cost\. At current API pricing, the multi\-LLM baseline costs∼$1\.2\{\\sim\}\\mathdollar 1\.2M/year for 10K certificates/day\.
- •Privacy\.Verification certificates may encode proprietary process structures and compliance\-sensitive state names\. The neural architecture processes all data on\-device, requiring no external data transmission\.
- •Operational independence\.No API keys, no rate limits, no provider outages, no model deprecation\. In regulated environments, the ability to audit, version\-control, and reproduce exact model behavior is a compliance requirement\.

### 4\.7Private\-Language Test

A known failure mode of cycle\-consistent architectures is*private language*\(also called steganographic encoding\): the two networks collude to encode certificate information in imperceptible patterns within the intermediate representation, achieving low cycle loss without producing genuinely informative explanations\[[82](https://arxiv.org/html/2606.24414#bib.bib1)\]\. Our architecture mitigates this risk through two mechanisms: \(i\) the reconstruction lossℒrecon\\mathcal\{L\}\_\{\\mathrm\{recon\}\}anchors NN1’s output distribution to human\-readable reference explanations, and \(ii\) the intermediate representation is a*discrete*token sequence drawn from the same 543\-token vocabulary as the training references, leaving no continuous channel for hidden signals\.

To verify empirically that NN2genuinely reads the explanation text rather than decoding a private signal, we apply three perturbations to NN1’s output*before*passing it to NN2, and measure the resulting soundness collapse:

Table 9:Private\-language test: perturbation of NN1’s output before NN2reconstruction\. If NN2decoded a hidden signal rather than reading the explanation, perturbations would not destroy soundness\.All three perturbations cause dramatic soundness drops\. Random\-token replacement destroys soundness entirely \(0%\), confirming that NN2extracts certificate content from the*actual tokens*in the explanation, not from positional or structural artifacts\. Word\-order scrambling reduces soundness to 6\.2%, indicating that NN2relies on syntactic structure \(e\.g\., “fromXXtoYY”\) to determine state roles\. State\-name swapping reduces soundness to 17\.1%, confirming that NN2is sensitive to*which*state names appear in*which*positions\. Together, these results rule out private\-language encoding: the explanation text is the genuine information channel between NN1and NN2\.

## 5Conclusion

We have presented a cycle\-consistent neural architecture that enforces explanation faithfulness via a symbolic verifier, without human annotation\. A hybrid inference\-time router achieves 90\.0% cycle\-verified soundness on 420 test certificates—surpassing the best of 16 frontier LLM configurations \(76\.1%\) by 13\.9 percentage points—while offering∼860×\{\\sim\}860\\timesfaster inference from a sub\-1M parameter model\. The architecture and the routing strategy are complementary: the pointer\-generator eliminates entity hallucination by copying state names directly from the certificate, while verifier\-guided test\-time routing—made computationally feasible by the model’s low inference cost—selects the most faithful explanation per example\.

#### Limitations and Future Work\.

Our evaluation has four principal limitations\. First, the architecture does not support*emptiness SCC certificates*, which require counting and enumerating strongly connected components; atdmodel=256d\_\{\\mathrm\{model\}\}=256the model cannot reliably copy numeric count information into the generated text, so we exclude this category\. Second, soundness degrades as certificate complexity grows: with 8 attention heads at 32 dimensions each, the model reaches capacity limits on certificates with more than approximately 15 states across 3\+ sequences, which accounts for the lower soundness on NO/kk\-induction and NO/reachability\. Third, the evaluation is confined to a single financial\-compliance domain with a fixed vocabulary and template\-anchored references; production model checkers \(e\.g\., nuXmv\[[15](https://arxiv.org/html/2606.24414#bib.bib76)\], SPIN\[[42](https://arxiv.org/html/2606.24414#bib.bib77)\], CBMC\[[48](https://arxiv.org/html/2606.24414#bib.bib78)\], neural approaches\[[34](https://arxiv.org/html/2606.24414#bib.bib84),[35](https://arxiv.org/html/2606.24414#bib.bib85)\]\) may emit longer identifiers and deeper nesting that fall outside this vocabulary, and extending to new domains requires retraining\. Fourth, all experiments were conducted on CPU, constraining model scale\. We also note that cycle\-verified soundness is a*lower bound*on true faithfulness due to NN2’s 3\.6% false\-negative rate, and that subtle co\-adaptation between NN1and NN2, while ruled out empirically by the private\-language test \(Section[4\.7](https://arxiv.org/html/2606.24414#S4.SS7)\), cannot be fully excluded for any learned inverse proxy\.

Several of these limitations suggest direct extensions: scaling todmodel=512d\_\{\\mathrm\{model\}\}=512to relax the multi\-sequence capacity ceiling and absorb emptiness\-SCC counting; a subword or character\-level encoding to remove the fixed\-vocabulary constraint; and direct fine\-tuning on certificates from additional model checkers, leveraging the fact that cycle consistency can serve as the sole training signal without reference explanations\. Two further directions are the inherent resistance of the fixed\-computation\-graph architecture to prompt\-injection attacks—absent in LLM\-based pipelines—and asymmetric training schemes \(e\.g\., periodically re\-initializing NN2\) to guard against private language at scale\.

#### Takeaway: A General Paradigm\.

Although we instantiate our solution on verification certificates, the underlying principle is general:*any structured artifact whose content can be decomposed into a checkable fact set admits a cycle\-consistent explanation*\. Whenever a forward model can verbalize the artifact and an inverse model can reconstruct it, a symbolic checker comparing the original and the reconstruction yields an exact, annotation\-free faithfulness signal\. The method is thus not specific to temporal\-logic certificates but applies to any domain where outputs are structured and a notion of fact\-level equivalence is definable \(find an extended discussion in Appendix[H](https://arxiv.org/html/2606.24414#A8)\)\.

## Disclaimer

This paper was prepared for informational purposes by the Artificial Intelligence Research group of JPMorgan Chase & Co\. and its affiliates \("JP Morgan”\) and is not a product of the Research Department of JP Morgan\. JP Morgan makes no representation and warranty whatsoever and disclaims all liability, for the completeness, accuracy or reliability of the information contained herein\. This document is not intended as investment research or investment advice, or a recommendation, offer or solicitation for the purchase or sale of any security, financial instrument, financial product or service, or to be used in any way for evaluating the merits of participating in any transaction, and shall not constitute a solicitation under any jurisdiction or to any person, if such solicitation under such jurisdiction or to such person would be unlawful\.

© 2026 JPMorgan Chase & Co\. All rights reserved\.

## References

- \[1\]M\. Alshiekh, R\. Bloem, R\. Ehlers, B\. Könighofer, S\. Niekum, and U\. Topcu\(2018\)Safe Reinforcement Learning via Shielding\.InAAAI Conference on Artificial Intelligence,Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p3.1)\.
- \[2\]Anthropic\(2025\)The Claude Model Family: Claude Opus 4\.5 and Claude Sonnet 4\.6\.Technical Report\.External Links:[Link](https://www.anthropic.com/research)Cited by:[§4\.4](https://arxiv.org/html/2606.24414#S4.SS4.p1.7)\.
- \[3\]S\. Azzopardi, L\. D\. Stefano, N\. Piterman, and G\. Schneider\(2025\)Full LTL Synthesis over Infinite\-State Arenas\.InComputer Aided Verification \- 37th International Conference, CAV 2025, Part IV,LNCS, Vol\.15934,pp\. 274–297\.External Links:[Link](https://doi.org/10.1007/978-3-031-98685-7%5C_13),[Document](https://dx.doi.org/10.1007/978-3-031-98685-7%5F13)Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p3.1)\.
- \[4\]D\. Bahdanau, K\. Cho, and Y\. Bengio\(2015\)Neural Machine Translation by Jointly Learning to Align and Translate\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§2](https://arxiv.org/html/2606.24414#S2.SS0.SSS0.Px3.p1.1)\.
- \[5\]Y\. Bai, S\. Kadavath, S\. Kundu, A\. Askell, J\. Kernion, A\. Jones, A\. Chen, A\. Goldie, A\. Mirhoseini, C\. McKinnon,et al\.\(2022\)Constitutional AI: harmlessness from AI feedback\.arXiv preprint arXiv:2212\.08073\.Cited by:[§2](https://arxiv.org/html/2606.24414#S2.SS0.SSS0.Px2.p1.1)\.
- \[6\]C\. Baier and J\. Katoen\(2008\)Principles of model checking\.MIT Press\.Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.24414#S1.p1.1)\.
- \[7\]H\. Barbosa, C\. Barrett, M\. Brain, G\. Kremer,et al\.\(2022\)cvc5: A Versatile and Industrial\-Strength SMT Solver\.InTools and Algorithms for the Construction and Analysis of Systems \(TACAS\),pp\. 415–442\.Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p2.2)\.
- \[8\]B\. Barras, S\. Boutin, C\. Cornes,et al\.\(1999\)The coq proof assistant reference manual\.Note:INRIACited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p2.2)\.
- \[9\]S\. Bassan, G\. Amir, D\. Corsi, I\. Refaeli, and G\. Katz\(2023\)Formally explaining neural networks within reactive systems\.InFormal Methods in Computer\-Aided Design, \(FMCAD 2023\),pp\. 1–13\.External Links:[Link](https://doi.org/10.34727/2023/isbn.978-3-85448-060-0%5C_9),[Document](https://dx.doi.org/10.34727/2023/ISBN.978-3-85448-060-0%5F9)Cited by:[§2](https://arxiv.org/html/2606.24414#S2.SS0.SSS0.Px4.p2.1)\.
- \[10\]I\. Beer, S\. Ben\-David, H\. Chockler, A\. Orni, and R\. Trefler\(2012\)Explaining Counterexamples Using Causality\.InFormal Methods in System Design,Vol\.40,pp\. 20–40\.Cited by:[§2](https://arxiv.org/html/2606.24414#S2.SS0.SSS0.Px4.p1.1)\.
- \[11\]A\. Biere, A\. Cimatti, E\. Clarke, and Y\. Zhu\(1999\)Symbolic Model Checking without BDDs\.InTools and Algorithms for the Construction and Analysis of Systems \(TACAS\),pp\. 193–207\.Cited by:[§1](https://arxiv.org/html/2606.24414#S1.p1.1),[item 5](https://arxiv.org/html/2606.24414#S4.I1.i5.p1.1)\.
- \[12\]R\. Bloem, B\. Jobstmann, N\. Piterman, A\. Pnueli, and Y\. Sa’ar\(2012\)Synthesis of Reactive\(1\) Designs\.Journal of Computer and System Sciences78\(3\),pp\. 911–938\.Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p3.1)\.
- \[13\]R\. Bloem, B\. Könighofer, R\. Könighofer, and C\. Wang\(2015\)Shield Synthesis: Runtime Enforcement for Reactive Systems\.InTools and Algorithms for the Construction and Analysis of Systems \(TACAS\),pp\. 533–548\.Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p3.1)\.
- \[14\]B\. Brown, J\. Juravsky, R\. Ehrlich, R\. Clark, Q\. V\. Le, C\. Ré, and A\. Mirhoseini\(2024\)Large language monkeys: scaling inference compute with repeated sampling\.arXiv preprint arXiv:2407\.21787\.Cited by:[§2](https://arxiv.org/html/2606.24414#S2.SS0.SSS0.Px2.p2.1)\.
- \[15\]R\. Cavada, A\. Cimatti, M\. Dorigatti, A\. Griggio, A\. Mariotti, A\. Micheli, S\. Mover, M\. Roveri, and S\. Tonetta\(2014\)The nuXmv Symbolic Model Checker\.InInternational Conference on Computer Aided Verification \(CAV\),pp\. 334–342\.Cited by:[§5](https://arxiv.org/html/2606.24414#S5.SS0.SSS0.Px1.p1.5)\.
- \[16\]W\. Chen, J\. Chen, Y\. Su, Z\. Chen, and W\. Y\. Wang\(2020\)Logical natural language generation from open\-domain tables\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics \(ACL\),pp\. 7929–7942\.Cited by:[§2](https://arxiv.org/html/2606.24414#S2.SS0.SSS0.Px3.p1.1)\.
- \[17\]Z\. Chen, M\. Kang, and B\. Li\(2025\)ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning\.In42nd International Conference on Machine Learning, \(ICML 2025\),PMLR, Vol\.267\.External Links:[Link](https://proceedings.mlr.press/v267/chen25ae.html)Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p3.1)\.
- \[18\]J\. Cheney, L\. Chiticariu, and W\. Tan\(2009\)Provenance in Databases: Why, How, and Where\.Foundations and Trends in Databases1\(4\),pp\. 379–474\.Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px2.p1.1)\.
- \[19\]W\. Choi, B\. Finkbeiner, R\. Piskac, and M\. Santolucito\(2022\)Can reactive synthesis and syntax\-guided synthesis be friends?\.InPLDI ’22: 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation,pp\. 229–243\.External Links:[Link](https://doi.org/10.1145/3519939.3523429),[Document](https://dx.doi.org/10.1145/3519939.3523429)Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p3.1)\.
- \[20\]A\. Cimatti, E\. M\. Clarke, F\. Giunchiglia, and M\. Roveri\(2002\)NuSMV 2: An OpenSourceTool for Symbolic Model Checking\.InComputer Aided Verification: 14th International Conference, CAV 2002,pp\. 359–364\.Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p1.1)\.
- \[21\]A\. Cimatti, A\. Griggio, S\. Mover, and S\. Tonetta\(2014\)IC3 Modulo Theories via Implicit Predicate Abstraction\.Tools and Algorithms for the Construction and Analysis of Systems \(TACAS\)\.Note:Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p1.1)\.
- \[22\]E\. Clarke, O\. Grumberg, S\. Jha, Y\. Lu, and H\. Veith\(2000\)Counterexample\-Guided Abstraction Refinement\.InInternational Conference on Computer Aided Verification \(CAV\),pp\. 154–169\.Cited by:[§2](https://arxiv.org/html/2606.24414#S2.SS0.SSS0.Px4.p1.1)\.
- \[23\]E\. M\. Clarke, E\. A\. Emerson, and A\. P\. Sistla\(1986\)Automatic verification of finite\-state concurrent systems using temporal logic specifications\.ACM Transactions on Programming Languages and Systems \(TOPLAS\)8\(2\),pp\. 244–263\.Cited by:[§3\.1](https://arxiv.org/html/2606.24414#S3.SS1.p1.5)\.
- \[24\]E\. M\. Clarke, T\. A\. Henzinger, H\. Veith, and R\. Bloem\(2018\)Handbook of model checking\.Springer\.Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.24414#S1.p1.1)\.
- \[25\]P\. Cousot and R\. Cousot\(1977\)Abstract Interpretation: A Unified Lattice Model for Static Analysis of Programs by Construction or Approximation of Fixpoints\.InSymposium on Principles of Programming Languages \(POPL\),pp\. 238–252\.Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p2.2)\.
- \[26\]Davide Corsi and Guy Amir and Andoni Rodríguez and Guy Katz and César Sánchez and Roy Fox\(2024\)Verification\-Guided Shielding for Deep Reinforcement Learning\.RLJ4,pp\. 1759–1780\.External Links:[Link](https://rlj.cs.umass.edu/2024/papers/Paper224.html)Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p3.1)\.
- \[27\]L\. de Moura and N\. Bjørner\(2008\)Z3: An Efficient SMT Solver\.InTools and Algorithms for the Construction and Analysis of Systems \(TACAS\),pp\. 337–340\.Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p2.2)\.
- \[28\]L\. de Moura and S\. Ullrich\(2021\)The Lean 4 Theorem Prover and Programming Language\.InInternational Conference on Automated Deduction \(CADE\),pp\. 625–635\.Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p2.2)\.
- \[29\]E\. Durmus, H\. He, and M\. Diab\(2020\)FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics \(ACL\),pp\. 5055–5070\.Cited by:[§2](https://arxiv.org/html/2606.24414#S2.SS0.SSS0.Px3.p2.1)\.
- \[30\]S\. Eriksson, G\. Röger, and M\. Helmert\(2017\)Unsolvability Certificates for Classical Planning\.InInternational Conference on Automated Planning and Scheduling \(ICAPS\),Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px2.p1.1)\.
- \[31\]M\. Fadaee, A\. Bisazza, and C\. Monz\(2017\)Data Augmentation for Low\-Resource Neural Machine Translation\.InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics \(ACL\),pp\. 567–573\.Cited by:[§2](https://arxiv.org/html/2606.24414#S2.SS0.SSS0.Px1.p1.2)\.
- \[32\]M\. Freitag and Y\. Al\-Onaizan\(2017\)Beam Search Strategies for Neural Machine Translation\.InProceedings of the First Workshop on Neural Machine Translation,pp\. 56–60\.Cited by:[§3\.5](https://arxiv.org/html/2606.24414#S3.SS5.p1.1)\.
- \[33\]M\. Ghallab, D\. Nau, and P\. Traverso\(2004\)Automated planning: theory and practice\.Elsevier\.Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px2.p1.1)\.
- \[34\]M\. Giacobbe, D\. Kroening, A\. Pal, and M\. Tautschnig\(2024\)Neural Model Checking\.InAdvances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, \(NeurIPS 2024\),External Links:[Link](http://papers.nips.cc/paper%5C_files/paper/2024/hash/9d0947107ea92d6ce369dce7749180dd-Abstract-Conference.html)Cited by:[§5](https://arxiv.org/html/2606.24414#S5.SS0.SSS0.Px1.p1.5)\.
- \[35\]M\. Giacobbe, D\. Kroening, A\. Pal, and M\. Tautschnig\(2025\)Let a Neural Network be Your Invariant\.InAdvances in Neural Information Processing Systems 2025, \(NeurIPS 2025\),Vol\.38,pp\. 74713–74740\.External Links:[Link](https://proceedings.neurips.cc/paper_files/paper/2025/file/6c4512573f6b91b801e722656b27ea94-Paper-Conference.pdf)Cited by:[§5](https://arxiv.org/html/2606.24414#S5.SS0.SSS0.Px1.p1.5)\.
- \[36\]A\. Groce and W\. Visser\(2004\)What Went Wrong: Explaining Counterexamples\.InModel Checking Software \(SPIN\),pp\. 121–136\.Cited by:[§2](https://arxiv.org/html/2606.24414#S2.SS0.SSS0.Px4.p1.1)\.
- \[37\]C\. Gulcehre, S\. Ahn, R\. Nallapati, B\. Zhou, and Y\. Bengio\(2016\)Pointing the Unknown Words\.InProceedings of the 54th Annual Meeting of the Association for Computational Linguistics \(ACL\),pp\. 140–149\.Cited by:[§2](https://arxiv.org/html/2606.24414#S2.SS0.SSS0.Px3.p1.1)\.
- \[38\]S\. Gulwani, O\. Polozov, and R\. Singh\(2017\)Program synthesis\.Foundations and Trends in Programming Languages\.Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p3.1)\.
- \[39\]L\. Heck, F\. Macák, R\. Andriushchenko, M\. Češka, and S\. Junges\(2026\)Shields to Guarantee Probabilistic Safety in MDPs\.InComputer Aided Verification \- 38th International Conference, CAV 2026, Proceedings, Part IV,LNCS, Vol\.,pp\.\.External Links:Link,[Document](https://dx.doi.org/)Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p3.1)\.
- \[40\]P\. Heim and R\. Dimitrova\(2025\)Issy: A Comprehensive Tool for Specification and Synthesis of Infinite\-State Reactive Systems\.InComputer Aided Verification \- 37th International Conference, CAV 2025, Proceedings, Part IV,LNCS, Vol\.15934,pp\. 298–312\.External Links:[Link](https://doi.org/10.1007/978-3-031-98685-7%5C_14),[Document](https://dx.doi.org/10.1007/978-3-031-98685-7%5F14)Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p3.1)\.
- \[41\]M\. Helmert\(2006\)The Fast Downward Planning System\.Journal of Artificial Intelligence Research26,pp\. 191–246\.Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px2.p1.1)\.
- \[42\]G\. J\. Holzmann\(1997\)The Model Checker SPIN\.IEEE Transactions on Software Engineering23\(5\),pp\. 279–295\.Cited by:[§5](https://arxiv.org/html/2606.24414#S5.SS0.SSS0.Px1.p1.5)\.
- \[43\]M\. Horridge, B\. Parsia, and U\. Sattler\(2006\)Justification of OWL Entailments\.InInternational Semantic Web Conference \(ISWC\),Note:Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px2.p1.1)\.
- \[44\]R\. Howey, D\. Long, and M\. Fox\(2004\)VAL: Automatic Plan Validation, Continuous Effects and Mixed Initiative Planning Using PDDL\.IEEE International Conference on Tools with Artificial Intelligence \(ICTAI\),pp\. 294–301\.Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px2.p1.1)\.
- \[45\]Z\. Ji, N\. Lee, R\. Frieske, T\. Yu, D\. Su, Y\. Xu, E\. Ishii, Y\. Bang, A\. Madotto, and P\. Fung\(2023\)Survey of Hallucination in Natural Language Generation\.ACM Computing Surveys55\(12\),pp\. 1–38\.Cited by:[§2](https://arxiv.org/html/2606.24414#S2.SS0.SSS0.Px3.p1.1),[§2](https://arxiv.org/html/2606.24414#S2.SS0.SSS0.Px3.p2.1)\.
- \[46\]A\. Katis, G\. Fedyukovich, H\. Guo, A\. Gacek, J\. Backes, A\. Gurfinkel, and M\. W\. Whalen\(2018\)Validity\-Guided Synthesis of Reactive Systems from Assume\-Guarantee Contracts\.InTools and Algorithms for the Construction and Analysis of Systems \- 24th International Conference, TACAS 2018, Proceedings, Part II,LNCS, Vol\.10806,pp\. 176–193\.External Links:[Link](https://doi.org/10.1007/978-3-319-89963-3%5C_10),[Document](https://dx.doi.org/10.1007/978-3-319-89963-3%5F10)Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p3.1)\.
- \[47\]B\. Könighofer, R\. Bloem, N\. Jansen, S\. Junges, and S\. Pranger\(2025\-10\)Shields for Safe Reinforcement Learning\.Commun\. ACM68\(11\),pp\. 80–90\.External Links:ISSN 0001\-0782,[Link](https://doi.org/10.1145/3715958),[Document](https://dx.doi.org/10.1145/3715958)Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p3.1)\.
- \[48\]D\. Kroening and M\. Tautschnig\(2014\)CBMC – C Bounded Model Checker\.InTools and Algorithms for the Construction and Analysis of Systems \(TACAS\),pp\. 389–391\.Cited by:[§5](https://arxiv.org/html/2606.24414#S5.SS0.SSS0.Px1.p1.5)\.
- \[49\]W\. Kryściński, B\. McCann, C\. Xiong, and R\. Socher\(2020\)Evaluating the Factual Consistency of Abstractive Text Summarization\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),pp\. 9332–9346\.Cited by:[§2](https://arxiv.org/html/2606.24414#S2.SS0.SSS0.Px3.p2.1)\.
- \[50\]G\. Lample, A\. Conneau, L\. Denoyer, and M\. Ranzato\(2018\)Unsupervised Machine Translation Using Monolingual Corpora Only\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§2](https://arxiv.org/html/2606.24414#S2.SS0.SSS0.Px1.p1.2)\.
- \[51\]I\. Loshchilov and F\. Hutter\(2019\)Decoupled Weight Decay Regularization\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§3\.4](https://arxiv.org/html/2606.24414#S3.SS4.p4.7)\.
- \[52\]K\. L\. McMillan\(2003\)Interpolation and SAT\-based Model Checking\.InInternational Conference on Computer Aided Verification \(CAV\),pp\. 1–13\.Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p2.2)\.
- \[53\]K\. L\. McMillan\(2018\)Eager Abstraction for Symbolic Model Checking\.InComputer Aided Verification \- 30th International Conference, \(CAV 2018\)Proceedings, Part I,LNCS, Vol\.10981,pp\. 191–208\.External Links:[Link](https://doi.org/10.1007/978-3-319-96145-3%5C_11),[Document](https://dx.doi.org/10.1007/978-3-319-96145-3%5F11)Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p1.1)\.
- \[54\]M\. Morales, A\. Pozanco, G\. Canonaco, S\. Gopalakrishnan, D\. Borrajo, and M\. Veloso\(2024\)On Learning Action Costs From Input Plans\.arXiv preprint arXiv:2408\.10889\.Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px2.p1.1)\.
- \[55\]E\. Mugdan, R\. Christen, and S\. Eriksson\(2023\)Optimality Certificates for Classical Planning\.InProc\. of the 33rd International Conference on Automated Planning and Scheduling, Prague,pp\. 286–294\.External Links:[Link](https://doi.org/10.1609/icaps.v33i1.27206),[Document](https://dx.doi.org/10.1609/ICAPS.V33I1.27206)Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px2.p1.1)\.
- \[56\]T\. Nipkow, L\. C\. Paulson, and M\. Wenzel\(2002\)Isabelle/hol: a proof assistant for higher\-order logic\.Springer\.Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p2.2)\.
- \[57\]M\. Nye, M\. H\. Tessler, J\. B\. Tenenbaum, and B\. M\. Lake\(2021\)Improving Coherence and Consistency in Neural Sequence Models with Dual\-System, Neuro\-Symbolic Reasoning\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§2](https://arxiv.org/html/2606.24414#S2.SS0.SSS0.Px5.p1.3)\.
- \[58\]OpenAI\(2024\)GPT\-4o System Card\.Technical Report\.External Links:[Link](https://openai.com/index/gpt-4o-system-card/)Cited by:[§4\.4](https://arxiv.org/html/2606.24414#S4.SS4.p1.7)\.
- \[59\]A\. P\. Parikh, X\. Wang, S\. Gehrmann, M\. Faruqui, B\. Dhingra, D\. Yang, and D\. Das\(2020\)ToTTo: A Controlled Table\-to\-Text Generation Dataset\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),pp\. 1173–1186\.Cited by:[§2](https://arxiv.org/html/2606.24414#S2.SS0.SSS0.Px3.p1.1)\.
- \[60\]A\. Pnueli and R\. Rosner\(1989\)On the Synthesis of a Reactive Module\.InSymposium on Principles of Programming Languages \(POPL\),pp\. 179–190\.Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p3.1)\.
- \[61\]A\. Pnueli, M\. Siegel, and E\. Singerman\(1998\)Translation Validation\.InTools and Algorithms for the Construction and Analysis of Systems \(TACAS\),pp\. 151–166\.Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px2.p1.1)\.
- \[62\]A\. Pnueli\(1977\)The Temporal Logic of Programs\.InSymposium on Foundations of Computer Science \(FOCS\),pp\. 46–57\.Cited by:[§3\.1](https://arxiv.org/html/2606.24414#S3.SS1.p1.5)\.
- \[63\]G\. Poesia, O\. Polozov, V\. Le, A\. Tiwari, G\. Soares, C\. Meek, and S\. Gulwani\(2022\)Synchromesh: Reliable Code Generation from Pre\-trained Language Models\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§2](https://arxiv.org/html/2606.24414#S2.SS0.SSS0.Px5.p1.3),[§3\.5](https://arxiv.org/html/2606.24414#S3.SS5.p1.1)\.
- \[64\]S\. Prabhumoye, Y\. Tsvetkov, R\. Salakhutdinov, and A\. W\. Black\(2018\)Style Transfer Through Back\-Translation\.InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics \(ACL\),pp\. 866–876\.Cited by:[§2](https://arxiv.org/html/2606.24414#S2.SS0.SSS0.Px1.p1.2)\.
- \[65\]A\. Rodríguez, G\. Amir, D\. Corsi, C\. Sánchez, and G\. Katz\(2025\)Shield Synthesis for LTL Modulo Theories\.InProceedings of the 39th AAAI Conference on Artificial Intelligence,Vol\.39\.Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p3.1)\.
- \[66\]A\. Rodríguez, F\. Gorostiaga, and C\. Sánchez\(2025\)Counter Example Guided Reactive Synthesis for LTL Modulo Theories\.In37th International Conference in Computer Aided Verification \(CAV 2025\), Part IV,LNCS, Vol\.15934,pp\. 224–248\.External Links:[Link](https://doi.org/10.1007/978-3-031-98685-7%5C_11),[Document](https://dx.doi.org/10.1007/978-3-031-98685-7%5F11)Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p3.1)\.
- \[67\]A\. Rodríguez and C\. Sánchez\(2023\)Boolean Abstractions for Realizability Modulo Theories\.In35th International Conference in Computer Aided Verification \(CAV 2023\), Part III,LNCS, Vol\.13966,pp\. 305–328\.External Links:[Link](https://doi.org/10.1007/978-3-031-37709-9%5C_15),[Document](https://dx.doi.org/10.1007/978-3-031-37709-9%5F15)Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p3.1)\.
- \[68\]A\. Rodríguez and C\. Sánchez\(2024\)Adaptive Reactive Synthesis for LTL and LTLf Modulo Theories\.In38th AAAI Conference on Artificial Intelligence, \(AAAI 2024\),pp\. 10679–10686\.External Links:[Link](https://doi.org/10.1609/aaai.v38i9.28939),[Document](https://dx.doi.org/10.1609/AAAI.V38I9.28939)Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p3.1)\.
- \[69\]A\. Rodríguez, I\. Shaik, D\. Corsi, R\. Fox, and C\. Sánchez\(2025\)Explanations for Unrealizability of Infinite\-State Safety Shields\.InProc\. of the 22nd International Conference on Principles of Knowledge Representation and Reasoning, \(KR 2025\),External Links:[Link](https://doi.org/10.24963/kr.2025/83),[Document](https://dx.doi.org/10.24963/KR.2025/83)Cited by:[§2](https://arxiv.org/html/2606.24414#S2.SS0.SSS0.Px4.p2.1)\.
- \[70\]S\. Samuel, D\. D’Souza, and R\. Komondoor\(2021\)GenSys: A Scalable Fixed\-point Engine for Maximal Controller Synthesis over Infinite State Spaces\.InESEC/FSE ’21: 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021,pp\. 1585–1589\.External Links:[Link](https://doi.org/10.1145/3468264.3473126),[Document](https://dx.doi.org/10.1145/3468264.3473126)Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p3.1)\.
- \[71\]A\. See, P\. J\. Liu, and C\. D\. Manning\(2017\)Get To The Point: Summarization with Pointer\-Generator Networks\.InProceedings of the 55th Annual Meeting of the Association for Computational Linguistics \(ACL\),pp\. 1073–1083\.Cited by:[§2](https://arxiv.org/html/2606.24414#S2.SS0.SSS0.Px3.p1.1),[§3\.3](https://arxiv.org/html/2606.24414#S3.SS3.SSS0.Px2.p1.1)\.
- \[72\]P\. G\. Selinger, M\. M\. Astrahan, D\. D\. Chamberlin, R\. A\. Lorie, and T\. G\. Price\(1979\)Access Path Selection in a Relational Database Management System\.InACM SIGMOD International Conference on Management of Data,pp\. 23–34\.Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px2.p1.1)\.
- \[73\]R\. Sennrich, B\. Haddow, and A\. Birch\(2016\)Improving Neural Machine Translation Models with Monolingual Data\.InProceedings of the 54th Annual Meeting of the Association for Computational Linguistics \(ACL\),pp\. 86–96\.Cited by:[§2](https://arxiv.org/html/2606.24414#S2.SS0.SSS0.Px1.p1.2)\.
- \[74\]M\. Sheeran, S\. Singh, and G\. Stålmarck\(2000\)Checking Safety Properties Using Induction and a SAT\-Solver\.InFormal Methods in Computer\-Aided Design \(FMCAD\),pp\. 127–144\.Cited by:[item 6](https://arxiv.org/html/2606.24414#S4.I1.i6.p1.1)\.
- \[75\]A\. Solar\-Lezama\(2008\)Program Synthesis by Sketching\.InPhD Thesis, University of California, Berkeley,Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px1.p3.1)\.
- \[76\]A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, L\. Kaiser, and I\. Polosukhin\(2017\)Attention Is All You Need\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§3\.3](https://arxiv.org/html/2606.24414#S3.SS3.SSS0.Px2.p1.1)\.
- \[77\]A\. Wang, K\. Cho, and M\. Lewis\(2020\)Asking and Answering Questions to Evaluate the Factual Consistency of Summaries\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics \(ACL\),pp\. 5008–5020\.Cited by:[§2](https://arxiv.org/html/2606.24414#S2.SS0.SSS0.Px3.p2.1)\.
- \[78\]L\. A\. Wolsey and G\. L\. Nemhauser\(1988\)Integer and combinatorial optimization\.Wiley\.Cited by:[§H\.3](https://arxiv.org/html/2606.24414#A8.SS3.SSS0.Px2.p1.1)\.
- \[79\]Y\. Wu, A\. Q\. Jiang, W\. Li, M\. N\. Rabe, C\. Staats, M\. Jamnik, and C\. Szegedy\(2022\)Autoformalization with Large Language Models\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§2](https://arxiv.org/html/2606.24414#S2.SS0.SSS0.Px5.p1.3)\.
- \[80\]T\. Zhang, V\. Kishore, F\. Wu, K\. Q\. Weinberger, and Y\. Artzi\(2020\)BERTScore: Evaluating Text Generation with BERT\.InInternational Conference on Learning Representations \(ICLR\),Cited by:[§4\.1](https://arxiv.org/html/2606.24414#S4.SS1.SSS0.Px5.p1.1)\.
- \[81\]L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing,et al\.\(2023\)Judging LLM\-as\-a\-Judge with MT\-Bench and Chatbot Arena\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§2](https://arxiv.org/html/2606.24414#S2.SS0.SSS0.Px2.p1.1)\.
- \[82\]J\. Zhu, T\. Park, P\. Isola, and A\. A\. Efros\(2017\)Unpaired Image\-to\-Image Translation Using Cycle\-Consistent Adversarial Networks\.InProceedings of the IEEE International Conference on Computer Vision \(ICCV\),pp\. 2223–2232\.Cited by:[§2](https://arxiv.org/html/2606.24414#S2.SS0.SSS0.Px1.p1.2),[§4\.7](https://arxiv.org/html/2606.24414#S4.SS7.p1.2)\.

## Appendix AData Availability

Due to confidentiality restrictions inherent to the safety\-critical deployment environment, the full source code and production dataset cannot be released at this stage\.

## Appendix BCertificate Examples

This appendix provides one representative example of each verdict/kind combination used in our evaluation\. For each example, we show the structured JSON certificate and the corresponding natural language explanation generated by the template system\. State names are drawn from a financial compliance workflow domain comprising 207 named states\. These examples illustrate the structural diversity across the six certificate kinds and the YES/NO verdict asymmetry discussed in Appendix[G](https://arxiv.org/html/2606.24414#A7)\.

### Reachability

Reachability certificates contain a single state sequence—a path from an initial state to a target state \(or a statement that no such path exists\)\. This is the simplest certificate structure: one ordered list of states\.

#### YES / Reachability\.

A witness path demonstrating that the target state is reachable\.

```
{
  "kind": "reachability",
  "path": ["RegulatoryReporting", "CounterpartyRiskAssessment",
           "BeneficialOwnershipCheck", "InitialRiskScoring",
           "S2", "Escalation", "RegulatoryCheck"]
}
```

Explanation:“Verdict: YES for property ‘the property’\. The system can reach RegulatoryCheck from RegulatoryReporting via a path of 7 states\. The execution goes through CounterpartyRiskAssessment, BeneficialOwnershipCheck, InitialRiskScoring, S2, and Escalation before arriving at the target state\.”

#### NO / Reachability\.

A statement that no path to the target exists\.

```
{
  "kind": "reachability",
  "path": ["TermsNegotiation"]
}
```

Explanation:“No witness exists for ‘the property’\. From TermsNegotiation, the execution explores nearby states without finding a path to the target state\.”

Structural note\.YES certificates carry a full path \(1 sequence, 1–10 states\); NO certificates carry only the starting state\. With the hybrid router, YES/reachability achieves 83% soundness and NO/reachability achieves 73%\.

### Witness Pair

Witness pair certificates contain a single path demonstrating \(or failing to demonstrate\) that two requirements are jointly satisfiable\.

#### YES / Witness Pair\.

```
{
  "kind": "witness_pair",
  "path": ["LowRisk", "CryptoWalletAnalysis", "ComplianceCheck",
           "SanctionsScreening", "Approved",
           "AutomatedApprovalPath", "ThirdPartyCheck"]
}
```

Explanation:“Verdict: YES\. A 7\-step witness pair from LowRisk through CryptoWalletAnalysis, ComplianceCheck, SanctionsScreening, Approved, and AutomatedApprovalPath arrives at ThirdPartyCheck, confirming ‘the property’\.”

#### NO / Witness Pair\.

```
{
  "kind": "witness_pair",
  "path": ["HighRisk", "StressTest", "CreditCheck", "S3"]
}
```

Explanation:“The witness pair check for ‘the property’ failed\. Starting from HighRisk, no execution pair through StressTest, CreditCheck, and S3 demonstrates the required property\.”

Structural note\.Both verdicts carry a single sequence\. Structurally similar to reachability, witness pair achieves 87–90% soundness with the hybrid router\.

### Bounded Proof \(BMC\)

Bounded model checking certificates contain a trace and/or a set of checked states, with an integer depth bound\.

#### YES / Bounded Proof\.

A trace witnessing a violation within the bound\.

```
{
  "kind": "bounded_proof",
  "bound": 11,
  "trace": ["ClearingSubmission", "CollateralCheck",
            "QueueWait", "SARFilingAssessment",
            "MarginCalculation"]
}
```

Explanation:“Verdict: YES\. Bounded model checking found an 11\-step violation of ‘the property’\. The trace visits ClearingSubmission, CollateralCheck, QueueWait, SARFilingAssessment, and MarginCalculation\.”

#### NO / Bounded Proof\.

No counterexample found; the checked state space is enumerated\.

```
{
  "kind": "bounded_proof",
  "bound": 11,
  "checked_states": ["KYCVerification", "CollateralEvaluation",
                     "Remediation", "AsiaJurisdiction",
                     "FormatValidation", "SettlementPending"],
  "trace": ["KYCVerification", "CollateralEvaluation",
            "Remediation", "AsiaJurisdiction",
            "FormatValidation", "SettlementPending"]
}
```

Explanation:“Property ‘the property’ is safe up to depth 11\. The BMC search over states KYCVerification, CollateralEvaluation, Remediation, AsiaJurisdiction, FormatValidation, and SettlementPending found no counterexample\.”

Structural note\.YES carries one trace; NO carries both a trace and a checked\-state set, but the verifier evaluates them as a single sequence\. Both bounded proof categories achieve 100% soundness with the hybrid router\.

### Inductive Invariant

Inductive invariant certificates contain a set of states forming the invariant, with an optional counterexample transition for the NO case\.

#### YES / Inductive Invariant\.

A valid inductive invariant proving safety\.

```
{
  "kind": "inductive_invariant",
  "invariant_states": ["EUJurisdiction", "ArchiveComplete",
                       "DocumentSubmit", "DocumentCollection",
                       "ExposureCalc", "ManagerApproval",
                       "PostSettlement", "AdverseMediaScan"]
}
```

Explanation:“The safety property ‘the property’ is verified: YES\. An inductive invariant containing EUJurisdiction, ArchiveComplete, DocumentSubmit, DocumentCollection, ExposureCalc, ManagerApproval, PostSettlement, and AdverseMediaScan was found\. Starting from EUJurisdiction, the invariant is closed under all transitions and excludes unsafe states\.”

#### NO / Inductive Invariant\.

The candidate invariant is broken by a counterexample transition\.

```
{
  "kind": "inductive_invariant",
  "invariant_states": ["PEPCheck", "PricingReview", "StressTest",
                       "LegalReview", "DocumentSubmit", "S5"],
  "counterexample": ["PricingReview", "AutoRoute"]
}
```

Explanation:“Property ‘the property’ does not hold\. The candidate invariant PEPCheck, PricingReview, StressTest, LegalReview, DocumentSubmit, and S5 is broken by a transition from PricingReview to AutoRoute that exits the safe region\.”

Structural note\.YES carries 1 sequence \(the invariant set\); NO carries 2 sequences \(invariant set \+ counterexample pair\)\. With the hybrid router, YES/inductive\_invariant achieves 97% soundness and NO/inductive\_invariant achieves 87%\.

### Lasso

Lasso certificates contain a prefix path followed by a repeating cycle, used for liveness property checking\. This is inherently a two\-sequence structure\.

#### YES / Lasso\.

A lasso witness demonstrating the property violation\.

```
{
  "kind": "lasso",
  "prefix": ["BatchProcess", "ReportGeneration",
             "SecondaryReview", "TradeCapture",
             "EntityResolution"],
  "cycle": ["CTLVerification", "S5", "CaseClosed"]
}
```

Explanation:“The lasso check for ‘the property’ found a witness\. The lasso prefix starts at BatchProcess and reaches ReportGeneration, SecondaryReview, TradeCapture, and EntityResolution\. The lasso cycle repeats CTLVerification, S5, and CaseClosed forever\.”

#### NO / Lasso\.

```
{
  "kind": "lasso",
  "prefix": ["S0", "TradeCapture", "AMLReview",
             "InitialScreening", "AsiaJurisdiction"],
  "cycle": ["AutomatedApprovalPath"]
}
```

Explanation:“Property ‘the property’ holds against all lassos\. The lasso prefix from S0 through TradeCapture, AMLReview, InitialScreening, and AsiaJurisdiction and lasso cycle visiting AutomatedApprovalPath satisfy the temporal property\.”

Structural note\.Both verdicts carry 2 sequences \(prefix \+ cycle\)\. With the hybrid router, YES/lasso achieves 97% and NO/lasso achieves 93% soundness\. The two\-sequence structure is a natural match for the pointer\-generator, which learns to copy state names from each sequence position independently\.

### kk\-Induction

kk\-induction certificates are the most structurally complex kind, containing base states, step states, an induction parameterkk, and—for the NO case—a counterexample trace\.

#### YES /kk\-Induction\.

A successfulkk\-induction proof \(2 sequences: base \+ step\)\.

```
{
  "kind": "k_induction",
  "k": 2,
  "base_states": ["TradeCapture", "HighRisk", "PricingReview"],
  "step_states": ["ExposureCalc", "RiskAssessment", "PEPCheck"]
}
```

Explanation:“Verdict: YES for property ‘the property’\. A 2\-induction proof was established\. Base states TradeCapture, HighRisk, and PricingReview hold for 2 steps and step states ExposureCalc, RiskAssessment, and PEPCheck maintain the inductive hypothesis\.”

#### NO /kk\-Induction\.

A failed induction with counterexample \(3 sequences: base \+ step \+ counterexample\)\.

```
{
  "kind": "k_induction",
  "k": 4,
  "base_states": ["CaseClosed", "HighRisk"],
  "step_states": ["AMLReview", "PendingReview",
                  "ReconciliationCheck"],
  "counterexample": ["AMLReview", "SettlementPending"]
}
```

Explanation:“4\-induction fails for ‘the property’\. The base states CaseClosed and HighRisk are safe but the counterexample AMLReview and SettlementPending demonstrates a step violation\.”

Structural note\.YES carries 2 sequences \(base \+ step\) and achieves 100% soundness with the hybrid router\. NO carries 3 sequences \(base \+ step \+ counterexample\) and achieves 73%—the most challenging category\. The model must keep three interleaved state sets distinct through the natural language channel, which is at the limit of the 256\-dimensional attention capacity\. This example most clearly illustrates why soundness correlates with per\-instance sequence count rather than certificate kind\.

## Appendix CLLM\-Guided Fluency Training

This appendix describes an optional extension to the training protocol presented in Section[3\.4](https://arxiv.org/html/2606.24414#S3.SS4): using an external large language model as a fluency oracle during training\. While the main results in this paper are obtained*without*LLM fluency guidance \(λLM=0\\lambda\_\{\\mathrm\{LM\}\}=0\), the mechanism is architecturally supported and may benefit deployments where explanation readability is a priority\.

### C\.1Two\-Phase Training Protocol

The base training protocol \(Section[3\.4](https://arxiv.org/html/2606.24414#S3.SS4)\) trains with reconstruction and cycle consistency losses only\. A two\-phase extension adds LLM\-scored fluency as a third signal:

#### Phase I: Supervised Bootstrap\.

Identical to the base protocol:λrecon=1\.0\\lambda\_\{\\mathrm\{recon\}\}=1\.0,λcycle=2\.0\\lambda\_\{\\mathrm\{cycle\}\}=2\.0,λLM=0\\lambda\_\{\\mathrm\{LM\}\}=0\. The model learns English syntax, vocabulary usage, and certificate\-type\-specific phrasing from template\-generated reference explanations\.

#### Phase II: Fluency\-Guided Refinement\.

The reconstruction weight is reduced toλrecon=0\.1\\lambda\_\{\\mathrm\{recon\}\}=0\.1and an LLM fluency loss is activated withλLM=0\.3\\lambda\_\{\\mathrm\{LM\}\}=0\.3\. This reduces dependence on template\-specific phrasing while rewarding more natural linguistic expression\. The cycle consistency loss \(λcycle=2\.0\\lambda\_\{\\mathrm\{cycle\}\}=2\.0\) remains the dominant signal, ensuring that fluency optimization cannot compromise soundness\.

### C\.2Fluency Oracle

The fluency oracle is an external LLM \(e\.g\., GPT\-4o\) that receives a generated explanation and returns a scalar scoreϕ\(E\)∈\[0,1\]\\phi\(E\)\\in\[0,1\]\. The fluency loss combines:

- •Aranking componentℒrank\\mathcal\{L\}\_\{\\mathrm\{rank\}\}: fluent explanations should score higher than disfluent ones\.
- •Acalibration componentℒcalibrate\\mathcal\{L\}\_\{\\mathrm\{calibrate\}\}: the model should produce explanations that achieve high absolute fluency scores\.

The decoupling between fluency \(LLM\-scored\) and faithfulness \(symbolically verified\) is a key architectural property: the LLM evaluates only linguistic quality, while the symbolic verifier evaluates only semantic fidelity\. This separation guarantees by construction that optimizing fluency cannot introduce hallucinated facts\.

### C\.3Practical Considerations

#### Sparse Sampling\.

LLM API calls are expensive relative to local gradient computation\. We use a sparse sampling strategy: fluency is scored for everykk\-th batch \(e\.g\.,k=5k=5\), and the most recent fluency gradient is reused for intermediate batches\. This reduces LLM calls byk×k\\timeswith negligible impact on training dynamics, because the fluency signal changes slowly across consecutive batches\.

#### Cost Model\.

With sparse sampling atk=5k=5and approximately 365 batches per epoch, Phase II requires∼73\{\\sim\}73LLM calls per epoch\. At typical API pricing \(∼$0\.01\{\\sim\}\\mathdollar 0\.01/call for scoring a short text\), this amounts to∼$0\.73\{\\sim\}\\mathdollar 0\.73/epoch—a modest cost for the linguistic quality improvement\.

#### Negative Finding: Full Supervision Removal\.

Settingλrecon=0\\lambda\_\{\\mathrm\{recon\}\}=0in Phase II \(relying entirely on cycle consistency and fluency\) causes immediate distribution drift: NN1’s output leaves the distribution that NN2can parse, and the cycle collapses within 1–2 epochs\. Atdmodel=256d\_\{\\mathrm\{model\}\}=256, a residual reconstruction anchor \(λrecon≥0\.1\\lambda\_\{\\mathrm\{recon\}\}\\geq 0\.1\) is necessary for stable training\. This limitation may be addressed by larger models, exponential moving average weight updates, or replay buffers\.

## Appendix DCertificate\-Type\-Dependent Completeness

The appropriate level of completeness varies by certificate type, reflecting the mathematical role of each certificate’s components:

- •Reachability witnesses\.The path\[s0,s1,…,sn\]\[s\_\{0\},s\_\{1\},\\ldots,s\_\{n\}\]proves thatsns\_\{n\}is reachable froms0s\_\{0\}\. Intermediate states provide context but are not individually essential to the proof claim\. An explanation that omits some intermediate states remains sound\. Lower completeness is acceptable\.
- •Lasso counterexamples\.The cycle component is the*reason*the property is violated \(an infinite repetition\), while the prefix describes how the cycle is reached\. An explanation that fully describes the cycle but abbreviates the prefix preserves the essential proof structure\. Completeness should be high for the cycle, moderate for the prefix\.
- •Inductive invariants\.The invariant is a set of states that must be jointly closed under the transition relation\. Removing any state may break the inductiveness argument\. Completeness must be high: omitting states from the invariant potentially invalidates the proof\.

This variation is not hard\-coded into the architecture\. Instead, it is*learned from training examples*\. Templates that mention most states teach high completeness; templates that abbreviate teach lower completeness\. The model learns a certificate\-type\-conditioned distribution over completeness levels\.

## Appendix EEnsemble of Decoding Configurations

The hybrid router \(Section[3\.5](https://arxiv.org/html/2606.24414#S3.SS5)\) operates over an ensemble of 37 decoding configurations applied to a single trained model\. These configurations vary along three axes: \(i\) epoch checkpoint \(epochs 39 and 48, selected by cycle loss\), \(ii\) beam width \(1, 3, 5, 7, 9, 11\), and \(iii\) constrained vs\. unconstrained decoding \(restricting the output vocabulary to valid certificate tokens\)\. Figure[5](https://arxiv.org/html/2606.24414#A5.F5)illustrates this ensemble\. Each configuration produces a candidate explanation–reconstruction pair\(Ei,Ci′\)\(E\_\{i\},C^\{\\prime\}\_\{i\}\), and the router selects among them based on the certificate’s category and the verifier’s coverage score\.

Trained Model\(NN1\+NN2\\mathrm\{NN\}\_\{1\}\+\\mathrm\{NN\}\_\{2\}\)CertificateCCEpoch 39, greedyEpoch 39, beam 5Epoch 39, beam 9Epoch 48, constrained⋮\\vdots\(37 configs\)E1,C1′E\_\{1\},\\;C^\{\\prime\}\_\{1\}E2,C2′E\_\{2\},\\;C^\{\\prime\}\_\{2\}E3,C3′E\_\{3\},\\;C^\{\\prime\}\_\{3\}E4,C4′E\_\{4\},\\;C^\{\\prime\}\_\{4\}Figure 5:Ensemble of decoding configurations\. A single trained model is evaluated under 37 decoding configurations \(varying epoch checkpoint, beam width, and constrained/unconstrained decoding\), producing multiple candidate explanation–reconstruction pairs\(Ei,Ci′\)\(E\_\{i\},C^\{\\prime\}\_\{i\}\)per certificate\.
## Appendix FFact Extraction Schema

The symbolic verifier’s soundness and coverage metrics \(Eq\.[2](https://arxiv.org/html/2606.24414#S3.E2)\) depend on the fact extractorℱ\(⋅\)\\mathcal\{F\}\(\\cdot\), which decomposes a certificate into a set of discrete, comparable facts\. The extractor is certificate\-type\-specific because different verification techniques produce structurally different certificates: a reachability witness contains an ordered path, while an inductive invariant contains an unordered state set\. Designingℱ\\mathcal\{F\}to respect these structural differences ensures that the verifier’s comparison is semantically meaningful—for instance, set\-based certificates use order\-insensitive comparison, while sequence\-based certificates require positional alignment\. Each fact is a \(role, value\) pair where the role identifies the structural position \(e\.g\.,prefix\_2,base\_state,target\) and the value is a state name or structural element\. The per\-kind extraction rules are:

- •Reachability and witness pair:ℱ\\mathcal\{F\}extracts the ordered state sequence and endpoint identities\.
- •Lasso:ℱ\\mathcal\{F\}extracts prefix states, cycle states, and their partition\.
- •kk\-induction:ℱ\\mathcal\{F\}extracts base states, step states, counterexample states \(if present\), and the induction parameterkk\.
- •Inductive invariant:ℱ\\mathcal\{F\}extracts the invariant state set and any counterexample transition\.
- •Bounded proof:ℱ\\mathcal\{F\}extracts the trace, checked states, and the depth bound\.

## Appendix GAnalysis by Certificate Structure

#### Single\-Sequence Kinds\.

Reachability and witness pair certificates contain a single state sequence\. The pointer\-generator copies state names directly from the certificate encoding\. However, generation\-heavy categories in this group \(YES/reachability, YES/witness\_pair\) require the model to produce correct explanatory text*around*the copied names, which introduces errors when the generated scaffolding misattributes states\.

#### Multi\-Sequence Kinds\.

Lasso andkk\-induction certificates contain two or more state sequences that must be kept distinct through the natural language channel\. This introduces a*sequence attribution*challenge: NN2must determine from the text which states belong to which sequence\. The hybrid router is particularly effective here, as different decoding strategies resolve attribution ambiguities differently\.

#### Set\-Based Kinds\.

Inductive invariant and bounded proof certificates are evaluated using set\-based comparison\. This is a natural match for the cycle architecture, as the text need not preserve exact ordering—only set membership\.

#### Copy\-Dominated vs\. Generation\-Heavy\.

The key structural distinction is between categories where the explanation template is largely fixed and the model’s primary task is copying state names \(copy\-dominated\), versus categories where significant text generation is required \(generation\-heavy\)\. In copy\-dominated categories, a single good configuration suffices; the hybrid router’s category\-level selection handles these\. In generation\-heavy categories, per\-example variation is high, and coverage\-based selection provides substantial gains\.

## Appendix HExtended Limitations and Future Work

### H\.1Limitations

#### Emptiness SCC certificates\.

The current architecture does not support emptiness SCC certificates, which require counting and enumerating strongly connected components in product automata\. Emptiness certificates encode nested set structures whose cardinality \(number of SCCs, nodes per SCC\) must be accurately reflected in the explanation\. Our enriched encoding experiments—explicitly representing counts as digit tokens—did not improve soundness on this category: the model architecture atdmodel=256d\_\{\\mathrm\{model\}\}=256cannot reliably learn to copy numeric count information into the generated text\. We exclude this category from our main evaluation and discuss it as future work\.

#### Bounded certificate complexity\.

Soundness degrades as the number of distinct state sequences in a certificate increases\. Withdmodel=256d\_\{\\mathrm\{model\}\}=256and 8 attention heads \(32\-dimensional per\-head representations\), the model reaches capacity limits on certificates with more than approximately 15 states distributed across 3\+ sequences\. Scaling todmodel=512d\_\{\\mathrm\{model\}\}=512would double per\-head capacity and may resolve this bottleneck\.

#### Evaluation scope\.

While certificates are derived from deployed financial compliance workflows, the explanation templates and the fixed 207\-state vocabulary constrain the evaluation scope\. Production model checkers may produce certificates with longer variable names and deeper nesting that fall outside the current vocabulary\.

#### Fixed vocabulary\.

The 207\-state vocabulary is domain\-specific and fixed at training time\. Extension to new domains requires retraining\. A subword or character\-level encoding could address this at the cost of longer sequences\.

### H\.2Future Work

#### Emptiness SCC via architectural extensions\.

The counting limitation motivating the exclusion of emptiness certificates may be addressed by architectural changes: explicit count embeddings, hierarchical encoding of nested structures, or a two\-stage generation approach that first predicts SCC counts and then generates per\-SCC descriptions\. Alternatively, a larger model \(dmodel=512d\_\{\\mathrm\{model\}\}=512\) with increased attention capacity may learn the required numerical patterns from data alone\.

#### Scaling and the complexity frontier\.

The hybrid router’s success on 12 categories suggests that the same approach applied at larger model scale—with relaxed complexity caps \(e\.g\.,kk\-induction with base/step sizes up to 8, lasso prefix up to 10\)—could extend the architecture’s advantage over LLMs to increasingly complex certificates\. As certificate complexity grows, the LLM’s transcription strategy is expected to degrade faster than the trained model’s learned attribution patterns\.

#### Real verification tool integration\.

The training protocol requires only\(C,E\)\(C,E\)pairs and can operate without reference explanations when using cycle consistency as the sole training signal\. This enables direct fine\-tuning on certificates from additional model checkers, extending coverage beyond the current deployment domain\.

#### Interactive and multi\-language explanation\.

The cycle consistency mechanism could support interactive explanation—users requesting different detail levels or focusing on specific certificate components—while maintaining soundness guarantees\. The architecture is language\-agnostic in principle; replacing English templates with another language’s equivalents would produce explanations in that language with the same faithfulness properties\.

#### Guarding against private language at scale\.

While our perturbation tests \(Section[4\.7](https://arxiv.org/html/2606.24414#S4.SS7)\) confirm that the current architecture does not develop a private language, this risk may increase with larger models that have greater capacity to encode hidden signals\. A natural defense is*asymmetric training*: freezing NN2\(or training it on human\-written explanations from a separate corpus\) so that NN1must produce text that a*fixed*reader can parse, rather than co\-adapting with its partner\. Alternatively, periodically replacing NN2with a freshly initialized network during training would prevent long\-term co\-adaptation\.

#### Adversarial robustness and prompt injection\.

A significant but unexplored advantage of the neural architecture over LLM\-based approaches is its inherent resistance to*prompt injection attacks*\. In an LLM\-based pipeline, a maliciously crafted certificate could embed adversarial content that manipulates the LLM’s generation behavior\. The neural architecture’s fixed computation graph and learned token embeddings are fundamentally immune to this class of attack\. Formal analysis of this robustness property is an important direction for future work in safety\-critical deployment contexts\.

### H\.3Generality of the Approach

The cycle\-consistent recipe—verbalize a structured artifact, reconstruct it from the explanation alone, and accept the explanation iff a symbolic checker confirms round\-trip equivalence—is not specific to the certificates studied here\. It applies wherever two conditions hold: \(i\) the artifact decomposes into a discrete, comparable fact set, and \(ii\) an exact notion of equivalence between an original and a reconstruction is definable\. Under these conditions, faithfulness becomes a*verifiable*property rather than an approximated one, and explanations can be generated and validated without human annotation\.

#### Formal methods and verification\.

The most immediate opportunities lie within formal methods, where structured, machine\-checkable artifacts are pervasive and their opacity to non\-specialists is a recurring obstacle\. Natural extensions include model\-checking outputs\[[24](https://arxiv.org/html/2606.24414#bib.bib32),[6](https://arxiv.org/html/2606.24414#bib.bib33)\]beyond the six kinds studied here—fairness witnesses, assume\-guarantee contracts, and the emptiness\-SCC certificates we currently exclude\. The opportunity is especially pronounced forinfinite\-state model checking\[[20](https://arxiv.org/html/2606.24414#bib.bib86),[21](https://arxiv.org/html/2606.24414#bib.bib79),[53](https://arxiv.org/html/2606.24414#bib.bib61)\]—where verification proceeds over unbounded data domains rather than finite Boolean state—since the resulting certificates \(e\.g\., inductive invariants over first\-order theories\) are richer and correspondingly harder for a human to read directly\.

Other candidates includeSMT and SAT artifactssuch as unsat cores, Craig interpolants\[[52](https://arxiv.org/html/2606.24414#bib.bib34)\], and proof certificates from solvers such as Z3\[[27](https://arxiv.org/html/2606.24414#bib.bib35)\]or CVC5\[[7](https://arxiv.org/html/2606.24414#bib.bib36)\], where a faithful explanation could render an unsatisfiability argument intelligible to an engineer;theorem\-proving artifacts—proof scripts and tactic traces from Lean\[[28](https://arxiv.org/html/2606.24414#bib.bib37)\], Isabelle\[[56](https://arxiv.org/html/2606.24414#bib.bib38)\], or Coq\[[8](https://arxiv.org/html/2606.24414#bib.bib39)\]—explained step\-by\-step while guaranteeing that no inference is misrepresented;static\-analysis and abstract\-interpretation results\[[25](https://arxiv.org/html/2606.24414#bib.bib40)\]—abstract counterexamples, invariants, and alarm traces; andequivalence\-checking and program\-verification witnesses\. In each case the symbolic verifier is domain\-specific but the architecture is unchanged: only the fact extractorℱ\(⋅\)\\mathcal\{F\}\(\\cdot\)and the certificate encoder must be re\-instantiated\. Because the training signal requires only\(C,E\)\(C,E\)pairs—and can operate on cycle consistency alone when reference explanations are unavailable—the method can be fine\-tuned directly on the output of existing verification tools, extending explainability across the formal\-methods toolchain\.

A particularly promising direction is the explanation of*synthesized*artifacts, where the output is itself a formal object carrying a correctness guarantee\. Inprogram synthesis, a synthesized program together with its specification admits structured fact extraction\[[38](https://arxiv.org/html/2606.24414#bib.bib45),[75](https://arxiv.org/html/2606.24414#bib.bib46),[19](https://arxiv.org/html/2606.24414#bib.bib64)\], so a cycle\-consistent explanation could describe*what*the synthesized program does while verifiably preserving its specification\-level behavior\. Inreactive synthesis, where a controller or strategy is automatically derived from a temporal\-logic specification\[[60](https://arxiv.org/html/2606.24414#bib.bib47),[12](https://arxiv.org/html/2606.24414#bib.bib48)\], the synthesized strategy is a structured automaton whose transitions and winning conditions form a natural fact set—making the resulting controller’s behavior explainable to operators who did not author the specification\. This is especially timely given the recent traction of synthesis for infinite\-state systems\[[46](https://arxiv.org/html/2606.24414#bib.bib43),[70](https://arxiv.org/html/2606.24414#bib.bib59),[40](https://arxiv.org/html/2606.24414#bib.bib50),[3](https://arxiv.org/html/2606.24414#bib.bib51)\]andmodulo theoriessystems\[[67](https://arxiv.org/html/2606.24414#bib.bib41),[68](https://arxiv.org/html/2606.24414#bib.bib60),[66](https://arxiv.org/html/2606.24414#bib.bib44)\]—same\-spirit approaches where synthesized strategies are expressed over rich first\-order data domains rather than finite Boolean state: such artifacts are markedly harder for a human to read directly, so a faithful, automatically\-validated explanation layer is correspondingly more valuable \(note how this opportunity mirrors that of the infinite\-state variant of model checking discussed above\)\. The connection is most consequential forshield synthesis\[[13](https://arxiv.org/html/2606.24414#bib.bib52),[1](https://arxiv.org/html/2606.24414#bib.bib54),[39](https://arxiv.org/html/2606.24414#bib.bib65)\], where a formally\-synthesized*shield*monitors and corrects a system’s actions at runtime to enforce safety\. Shields are increasingly proposed as*guardrails for AI and reinforcement\-learning agents*\[[1](https://arxiv.org/html/2606.24414#bib.bib54),[17](https://arxiv.org/html/2606.24414#bib.bib66),[47](https://arxiv.org/html/2606.24414#bib.bib63)\], overriding unsafe actions while provably maintaining a safety specification, which is again increasingly more important due to recent extension to rich data and infinite\-state settings\[[26](https://arxiv.org/html/2606.24414#bib.bib42),[65](https://arxiv.org/html/2606.24414#bib.bib49)\]\. A cycle\-consistent explanation layer could make a shield’s interventions intelligible—*why*an action was blocked and*which*safety property it would have violated—while guaranteeing the explanation faithfully reflects the shield’s formal logic\. As AI guardrails move toward formally\-verified enforcement, faithful and automatically\-validated explanations of*why*a guardrail fired become a safety requirement in their own right, not merely a convenience\.

#### Beyond formal methods\.

The same principle reaches any field that produces rigorous structured outputs, provided both conditions above hold\.Automated planningis perhaps the closest neighbor: a plan is an ordered sequence of actions and intermediate states\[[33](https://arxiv.org/html/2606.24414#bib.bib56),[41](https://arxiv.org/html/2606.24414#bib.bib57),[54](https://arxiv.org/html/2606.24414#bib.bib87)\], structurally analogous to the reachability paths studied here, and plan validators such as VAL\[[44](https://arxiv.org/html/2606.24414#bib.bib58)\]provide exactly the fact\-level equivalence check our method requires—a verbalized plan could be accepted iff its reconstruction recovers the same action sequence and causal structure\. The connection extends to the artifacts planning produces beyond plans themselves:certificates of unsolvability\[[30](https://arxiv.org/html/2606.24414#bib.bib62)\]andcost\-optimality proofs\[[55](https://arxiv.org/html/2606.24414#bib.bib88)\]are structured objects whose faithful explanation would make a planner’s negative or optimality claims intelligible to a human operator\. Indatabases and data systems, query execution plans\[[72](https://arxiv.org/html/2606.24414#bib.bib67)\], schema mappings, and provenance/lineage graphs\[[18](https://arxiv.org/html/2606.24414#bib.bib68)\]decompose into relational\-algebra trees and dependency graphs with well\-defined equivalence, satisfying both conditions directly\. Inoperations research, the certificate of optimality or infeasibility accompanying an LP/MILP solution—e\.g\., a Farkas certificate or an optimal dual basis\[[78](https://arxiv.org/html/2606.24414#bib.bib55)\]—is a structured artifact whose correctness is exactly checkable\. Incompilers and systems, type\-inference derivations, translation\-validation certificates\[[61](https://arxiv.org/html/2606.24414#bib.bib69)\], and configuration or policy artifacts \(access\-control rules, network\-routing tables\) are structured objects whose explanations carry real consequences if unfaithful\. Inknowledge representation, ontology subsumption proofs and the justifications produced by description\-logic reasoners\[[43](https://arxiv.org/html/2606.24414#bib.bib70)\]are fact sets over which set\-level equivalence is naturally defined\.

What unifies these cases—and what distinguishes them from open\-ended generation tasks such as summarization or dialogue—is precisely what makes our setting tractable: the output is structured, a fact set is extractable, and equivalence is checkable\. Where these conditions fail, the cycle no longer yields an*exact*faithfulness signal, and the method degrades to the approximate proxies it was designed to replace\. The boundary of applicability is therefore sharp and principled: the technique applies exactly when faithfulness can be reduced to round\-trip structural equivalence\.

## Appendix IComplete Examples

We present four concrete examples illustrating the full cycle\-consistent pipeline with different soundness and coverage outcomes\. In each example,bold monospacetokens in the explanation are copied via the pointer\-generator mechanism \(pgen≈0p\_\{\\mathrm\{gen\}\}\\approx 0\); regular text is generated from vocabulary \(pgen≈1p\_\{\\mathrm\{gen\}\}\\approx 1\)\.

### Example 1: Perfect — Sound with Full Coverage

CertificateCC\(YES/reachability\)verdict: YES, kind: reachabilitypath: \[KYCVerification→\\rightarrowSanctionsScreening→\\rightarrowCollateralCheck→\\rightarrowSettlementFinal\]target: SettlementFinal

↓\\downarrowNN1\(pointer\-generator\)

ExplanationEEThe verification confirms that the property holds: a reachability path exists fromKYCVerificationthroughSanctionsScreeningandCollateralCheckto the target stateSettlementFinal\.

↓\\downarrowNN2\(inverse Transformer\)

ReconstructedC′C^\{\\prime\}verdict: YES, kind: reachabilitypath: \[KYCVerification→\\rightarrowSanctionsScreening→\\rightarrowCollateralCheck→\\rightarrowSettlementFinal\]target: SettlementFinal

↓\\downarrowSymbolic VerifierV\(C,C′\)\\mathrm\{V\}\(C,C^\{\\prime\}\)

Fact Extraction:ℱ\(C\)=ℱ\(C′\)=\{\\mathcal\{F\}\(C\)=\\mathcal\{F\}\(C^\{\\prime\}\)=\\\{path\_0: KYCVerification, path\_1: SanctionsScreening, path\_2: CollateralCheck, path\_3: SettlementFinal, target: SettlementFinal\}\\\}Soundness:5/5 = 1\.00 ✓Coverage:5/5 = 1\.00 ✓Result: SOUND

### Example 2: Sound but Incomplete \(Partial Omission\)

CertificateCC\(NO/lasso\)verdict: NO, kind: lassoprefix: \[InitState→\\rightarrowRiskAssessment→\\rightarrowExposureCalc→\\rightarrowLimitBreach\]cycle: \[LimitBreach→\\rightarrowAlertGeneration→\\rightarrowEscalationReview→\\rightarrowLimitBreach\]

↓\\downarrowNN1\(pointer\-generator\)

ExplanationEEThe property is refuted by a lasso\-shaped counterexample\. The prefix path proceeds fromInitStatethroughRiskAssessmenttoLimitBreach, which then enters a cycle visitingAlertGenerationandEscalationReviewbefore returning toLimitBreach\.

↓\\downarrowNN2\(inverse Transformer\)

ReconstructedC′C^\{\\prime\}verdict: NO, kind: lassoprefix: \[InitState→\\rightarrowRiskAssessment→\\rightarrowLimitBreach\]←\\leftarrowmissing ExposureCalccycle: \[LimitBreach→\\rightarrowAlertGeneration→\\rightarrowEscalationReview→\\rightarrowLimitBreach\]

↓\\downarrowSymbolic VerifierV\(C,C′\)\\mathrm\{V\}\(C,C^\{\\prime\}\)

Fact Extraction:\|ℱ\(C\)\|=8\|\\mathcal\{F\}\(C\)\|=8,\|ℱ\(C′\)\|=7\|\\mathcal\{F\}\(C^\{\\prime\}\)\|=7,\|ℱ\(C\)∩ℱ\(C′\)\|=7\|\\mathcal\{F\}\(C\)\\cap\\mathcal\{F\}\(C^\{\\prime\}\)\|=7Soundness:7/7 = 1\.00 ✓Coverage:7/8 = 0\.875Result: SOUND\(but incomplete\)Note: The explanation omittedExposureCalcfrom the prefix\. No hallucinated facts, but one fact is missing\. The verifier marks this as sound \(no wrong facts\) with reduced coverage\.

### Example 3: Unsound — Sequence Attribution Error

CertificateCC\(NO/kk\-induction\)verdict: NO, kind: k\_inductionbase\_states: \{ComplianceGate, AuditTrail, RegCheck\}step\_states: \{TransactionLog, FinalApproval\}counterexample: \[RegCheck→\\rightarrowTransactionLog→\\rightarrowViolationState\]

↓\\downarrowNN1\(pointer\-generator\)

ExplanationEEThekk\-induction proof fails\. The base set containsComplianceGate,AuditTrail, andRegCheck\. The step set containsRegCheckandFinalApproval\. A counterexample trace leads fromRegCheckthroughTransactionLogtoViolationState\.←\\leftarrowError:RegCheckattributed to step set instead ofTransactionLog

↓\\downarrowNN2\(inverse Transformer\)

ReconstructedC′C^\{\\prime\}base\_states: \{ComplianceGate, AuditTrail, RegCheck\}step\_states: \{RegCheck, FinalApproval\}←\\leftarrowwrong: should be TransactionLogcounterexample: \[RegCheck→\\rightarrowTransactionLog→\\rightarrowViolationState\]

↓\\downarrowSymbolic VerifierV\(C,C′\)\\mathrm\{V\}\(C,C^\{\\prime\}\)

Fact Extraction:ℱ\(C′\)\\mathcal\{F\}\(C^\{\\prime\}\)containsstep: RegCheckwhich is*not*inℱ\(C\)\\mathcal\{F\}\(C\)Soundness:7/8 = 0\.875 ✗Coverage:7/8 = 0\.875Result: UNSOUND The verifier detects thatRegCheckin the step set is a hallucinated fact—it belongs to the base set in the original certificate\. One wrong fact is sufficient to flag the explanation as unsound\.

### Example 4: LLM Baseline Failure — Entity Misspelling

CertificateCC\(NO/bounded\_proof\)verdict: NO, kind: bounded\_proofchecked\_states: \{DerivativePricing, CounterpartyExposure, MarginCallTrigger, NettingAgreement, CreditLimitCheck\}

↓\\downarrowBest LLM pair \(Opus 4\.5/Opus 4\.5\)

LLM ExplanationThe bounded model checking found no violation within the bound\. The checked state set includesDerivativePricing,CounterPartyExposure,MarginCallTrigger,NettingAgreement, andCreditLimitCheck\.←\\leftarrowMisspelling: “CounterPartyExposure” \(capital P\) vs\. correct “CounterpartyExposure”

↓\\downarrowLLM Reconstruction \(Opus 4\.5\)

ReconstructedC′C^\{\\prime\}checked\_states: \{DerivativePricing,CounterPartyExposure, MarginCallTrigger, NettingAgreement, CreditLimitCheck\}

↓\\downarrowSymbolic VerifierV\(C,C′\)\\mathrm\{V\}\(C,C^\{\\prime\}\)

Soundness:4/5 = 0\.80 ✗Coverage:4/5 = 0\.80Result: UNSOUND The LLM generated the state name from its vocabulary rather than copying it, introducing a capitalization error\. The symbolic verifier performs exact string matching, correctly flaggingCounterPartyExposureas a hallucinated entity not present in the original certificate\. The pointer\-generator mechanism in NN1prevents this class of error entirely by copying state tokens directly from the source\.

Summary\.These four examples illustrate the key behaviors of the cycle\-consistent pipeline: \(1\) perfect faithfulness when all facts are correctly copied and attributed; \(2\) sound but incomplete explanations that omit facts without hallucinating; \(3\) unsound explanations caused by sequence attribution errors \(the dominant neural failure mode\); and \(4\) the LLM baseline’s vulnerability to entity\-level errors that the pointer\-generator mechanism eliminates by design\.
Cycle-Consistent Neural Explanation of Formal Verification Certificates

Similar Articles

Neuro-Symbolic Verification of LLM Outputs for Data-Sensitive Domains (extended preprint)

Closing the Loop: Formally Verified Law as a Reward Signal for Self-Improving Legal AI

Improving LLM Code Reasoning via Semantic Equivalence Self-Play with Formal Verification

Logic-Regularized Verifier Elicits Reasoning from LLMs

Towards Verifiable Transformers: Solver-Checkable Circuit Explanations

Submit Feedback

Similar Articles

Neuro-Symbolic Verification of LLM Outputs for Data-Sensitive Domains (extended preprint)
Closing the Loop: Formally Verified Law as a Reward Signal for Self-Improving Legal AI
Improving LLM Code Reasoning via Semantic Equivalence Self-Play with Formal Verification
Logic-Regularized Verifier Elicits Reasoning from LLMs
Towards Verifiable Transformers: Solver-Checkable Circuit Explanations