Towards Verifiable Transformers: Solver-Checkable Circuit Explanations

arXiv cs.LG 05/26/26, 04:00 AM Papers
Summary
This paper introduces Verifiable Transformers, a framework that converts task-localized Transformer circuits into bounded, solver-checkable claims, enabling formal verification of properties such as functional equivalence, edge necessity, and robustness.
arXiv:2605.24033v1 Announce Type: new Abstract: Mechanistic interpretability often identifies circuits inside Transformer models, but explanations of those circuits are usually validated through examples, ablations, and manual reasoning. This leaves a gap between finding a plausible circuit and proving what the circuit does. We introduce Verifiable Transformers, a framework for converting task-localized Transformer circuits into bounded, solver-checkable claims. Given a behavior, a finite task domain, and a candidate-token projection, we extract a task circuit and verify properties such as projected functional equivalence, edge necessity, task-relevant invariance, and final-residual robustness. Direct verification encodes the extracted circuit itself into an SMT solver. When a circuit contains operators that are not exactly or tractably encodable, surrogate-mediated verification fits an SMT-encodable surrogate, validates it against the extracted circuit over the bounded domain, and verifies symbolic explanations against the surrogate. We instantiate direct verification with a GPT-style architecture using Signed L1 BandNorm, sparsemax attention, and LeakyReLU. On small symbolic sequence tasks, we train an SMT-representable Transformer, extract sparse circuits for quote closing and bracket type tracking, and exhaustively verify projected functional equivalence, content invariance, edge necessity, and final-residual robustness. At GPT-2 scale, the same operator stack trains stably on OpenWebText, although naive direct SMT verification remains intractable. We also demonstrate surrogate-mediated verification on task-localized circuits with hard-to-encode attention, showing both verified symbolic explanations and solver-generated counterexamples. The goal is not full-model verification, but a concrete path for turning mechanistic circuit explanations into formal propositions that can be proven or refuted.
Original Article
View Cached Full Text
Cached at: 05/26/26, 08:58 AM
# Towards Verifiable Transformers: Solver-Checkable Circuit Explanations
Source: [https://arxiv.org/html/2605.24033](https://arxiv.org/html/2605.24033)
\(May 18, 2026\)

###### Abstract

Mechanistic interpretability often identifies circuits inside Transformer models, but explanations of those circuits are usually validated through examples, ablations, and manual reasoning\. This leaves a gap between finding a plausible circuit and proving what the circuit does\. We introduce*Verifiable Transformers*, a framework for converting task\-localized Transformer circuits into bounded, solver\-checkable claims\. Given a behavior, a finite task domain, and a candidate\-token projection, we extract a task circuit and verify properties such as projected functional equivalence, edge necessity, task\-relevant invariance, and final\-residual robustness\.

The method first extracts a task\-localized circuit\. Direct verification encodes the extracted circuit itself into an SMT solver\. When a circuit contains operators that are not exactly or tractably encodable, surrogate\-mediated verification fits an SMT\-encodable surrogate, validates it against the extracted circuit over the bounded domain, and verifies symbolic explanations against the surrogate\. We instantiate direct verification with a GPT\-style architecture using Signed L1 BandNorm, sparsemax attention, and LeakyReLU\. This architecture is designed for SMT\-based circuit verification; most components are piecewise\-affine, while standard QK sparsemax attention remains SMT\-representable but introduces nonlinear arithmetic through query\-key products and attention\-weighted value aggregation\.

On small symbolic sequence tasks, we train an SMT\-representable Transformer, extract sparse circuits for quote closing and bracket type tracking, and exhaustively verify projected functional equivalence, content invariance, edge necessity, and final\-residual robustness\. At GPT\-2 scale, the same operator stack trains stably on OpenWebText, although naive direct SMT verification remains intractable\. We also demonstrate surrogate\-mediated verification on task\-localized circuits with hard\-to\-encode attention, showing both verified symbolic explanations and solver\-generated counterexamples\. The guarantees in this paper are bounded, projected, and circuit\-level\. The goal is not full\-model verification, but a concrete path for turning mechanistic circuit explanations into formal propositions that can be proven or refuted\.

## 1Introduction

Much of mechanistic interpretability treats trained Transformers as natural objects to be reverse engineered after training\. This empirical approach has produced important results, including localized attention heads, sparse features, and hand\-analyzed circuits for algorithmic behaviors\[[14](https://arxiv.org/html/2605.24033#bib.bib14),[13](https://arxiv.org/html/2605.24033#bib.bib13),[6](https://arxiv.org/html/2605.24033#bib.bib6)\]\. But Transformer architectures are engineered artifacts\. Instead of only developing better post\-hoc analysis tools for existing models, we can also ask how to design models and circuit\-analysis pipelines so that mechanistic claims become easier to state, extract, and verify\.

This paper studies that question under the name*Verifiable Transformers*\. The goal is not to verify arbitrary large language models end\-to\-end\. Rather, the goal is to make a narrower but concrete class of mechanistic claims solver\-checkable\. Examples include:

- •Does this extracted circuit choose the same task token as a symbolic reference program on every input in a finite domain?
- •Is each retained edge behaviorally necessary under a specified ablation semantics?
- •Is the circuit invariant to task\-irrelevant token changes?
- •Is the circuit’s task decision stable under bounded perturbations to a specified internal interface?
- •Does a compact symbolic program exactly describe the circuit’s bounded\-domain behavior, or can a solver find a counterexample?

These claims are common in spirit in mechanistic interpretability, but they are often evaluated on sampled inputs or argued from qualitative circuit analysis\. We instead formulate them as bounded verification problems\.

### 1\.1Circuit extraction and verification

The workflow begins with a trained modelMM, a task behavior, a finite task domainDD, and a candidate\-token setTT\. We extract a task\-localized circuitCEC\_\{E\}, whereEEis the retained edge set, and evaluate claims about the projected task decision

dT\(F,x\)=argmaxt∈T⁡Ft\(x\),d\_\{T\}\(F,x\)=\\operatorname\*\{arg\\,max\}\_\{t\\in T\}F\_\{t\}\(x\),whereFt\(x\)F\_\{t\}\(x\)is the logit assigned to tokenttby model or circuitFF\. The projected decision is the relevant object for tasks such as quote closing, where the question is whether the model chooses a single quote or a double quote, and bracket type tracking, where the question is whether the model chooses a square or curly closing delimiter\.

After circuit extraction, the verification method depends on whether the extracted circuit can be encoded directly\. Figure[1](https://arxiv.org/html/2605.24033#S1.F1)summarizes the resulting verification paths\.

Full modelMM↓\\downarrowTask\-domain specification\(D,T\)\(D,T\)and circuit extraction↓\\downarrowExtracted circuitCEC\_\{E\}Direct verificationSurrogate\-mediated verificationCE→C\_\{E\}\\rightarrowSMT proof or counterexampleCE→S→P→C\_\{E\}\\rightarrow S\\rightarrow P\\rightarrowSMT proof or counterexample

Figure 1:Circuit verification paths\. Starting from a task\-localized extracted circuit, direct verification encodesCEC\_\{E\}itself\. Surrogate\-mediated verification fits an SMT\-encodable surrogateSS, synthesizes or specifies a symbolic programPP, and verifies program\-surrogate equivalence over the bounded domain\.In direct verification,CEC\_\{E\}itself is encoded into an SMT solver, so proofs and counterexamples apply to the encoded circuit directly\. In surrogate\-mediated verification, used whenCEC\_\{E\}contains operations that are not exactly or tractably encodable, we fit an SMT\-encodable surrogateSS, validate it againstCEC\_\{E\}over the bounded domain, and then verify symbolic hypotheses aboutSS\.

### 1\.2Why architecture matters

Standard Transformer components were not designed for exact symbolic reasoning\. Softmax attention uses exponentials and normalization; LayerNorm divides by a data\-dependent standard deviation\[[1](https://arxiv.org/html/2605.24033#bib.bib1)\]; GELU is a smooth non\-piecewise\-linear activation\[[8](https://arxiv.org/html/2605.24033#bib.bib8)\]; and standard attention contains bilinear query\-key interactions\[[16](https://arxiv.org/html/2605.24033#bib.bib16)\]\. These operations are not impossible to reason about in principle, but they make direct formal verification of even small extracted circuits difficult\.

We therefore instantiate direct verification with a GPT\-style architecture in which the difficult components are replaced by more SMT\-friendly alternatives:

- •Signed L1 BandNorm replaces LayerNorm\.
- •Sparsemax replaces softmax in attention\[[11](https://arxiv.org/html/2605.24033#bib.bib11)\]\.
- •LeakyReLU replaces GELU\[[8](https://arxiv.org/html/2605.24033#bib.bib8)\]\.

The resulting model is*SMT\-representable*, not automatically SMT\-tractable\. This distinction is important\. Sparsemax itself is piecewise\-linear in the score vector, but full sparsemax attention with standard QK scoring still contains products of variables:qi⊤kjq\_\{i\}^\{\\top\}k\_\{j\}andaijvja\_\{ij\}v\_\{j\}\. These terms push the encoding into nonlinear real arithmetic or require special abstraction\. In our experiments, the architecture is trainable at GPT\-2 scale, but direct verification at that scale remains intractable\.

### 1\.3Contributions

This paper makes six contributions\.

1. 1\.We formalize projected verification of extracted Transformer circuits over finite bounded domains\.
2. 2\.We define circuit\-level properties for projected functional equivalence, edge necessity, task\-relevant invariance, and final\-residual robustness\.
3. 3\.We introduce a direct SMT verification method for circuits built from representable operators, and instantiate it with a GPT\-style architecture using Signed L1 BandNorm, sparsemax attention, and LeakyReLU\.
4. 4\.We demonstrate end\-to\-end direct verification on two symbolic sequence tasks\. The small\-model circuits for quote closing and bracket type tracking are exhaustively verified for all target properties\.
5. 5\.We evaluate the verifiable operator stack at GPT\-2 scale on OpenWebText and WikiText\-103\. The full stack trains stably, but naive direct SMT verification remains intractable\.
6. 6\.We introduce a surrogate\-mediated symbolic verification method for extracted circuits with hard\-to\-encode components\. The method verifies symbolic explanations when possible and returns concrete counterexamples when the explanation fails\.

### 1\.4Claim discipline

The guarantees in this paper are bounded, projected, and circuit\-level\. We do not claim full\-model verification, full\-vocabulary equivalence, global circuit minimality, arbitrary\-length correctness, or scalable GPT\-2\-scale SMT verification\. The task domain, candidate\-token projection, circuit extraction semantics, and surrogate validation procedure are part of the specification\. These restrictions are not incidental; they are what make the claims precise enough to be checked\.

## 2Formal Framework for Verifying Extracted Circuits

### 2\.1Task domains and projected decisions

LetDtaskD\_\{\\mathrm\{task\}\}be a finite bounded task domain\. For example,DtaskD\_\{\\mathrm\{task\}\}may be the set of all token sequences of a fixed template family with a bounded length and a small finite vocabulary\. LetTTbe a task\-specific candidate\-token set, such as the two quote tokens for quote closing or the two closing delimiter tokens for bracket type\.

For a model or circuitFF, letFT\(x\)F\_\{T\}\(x\)denote the vector of logits restricted toTT\. We define the projected task decision

dT\(F,x\):=argmaxt∈T⁡Ft\(x\)\.d\_\{T\}\(F,x\):=\\operatorname\*\{arg\\,max\}\_\{t\\in T\}F\_\{t\}\(x\)\.\(1\)LetP:Dtask→TP:D\_\{\\mathrm\{task\}\}\\rightarrow Tbe a symbolic reference program\. For syntax\-like behaviors, this projected decision is the relevant target\. We do not require equality of all vocabulary logits\.

### 2\.2Extracted circuits and zero\-ablation semantics

LetMMbe the full trained model\. A circuitCEC\_\{E\}is obtained by retaining an edge setEEin a coarse computational graph and setting deleted edge contributions to zero\. In the experiments here, the coarse graph contains residual\-stream components such as embeddings, attention outputs, MLP outputs, and logits\. The retained circuit uses the original trained weights and activations along retained paths\.

For a retained edgee∈Ee\\in E, letCE∖\{e\}C\_\{E\\setminus\\\{e\\\}\}be the same circuit with edgeeeremoved\. We call this*zero\-ablation semantics*\. This matters because edge necessity and faithfulness claims depend on how deleted computation is represented\.

Circuit extraction is not claim\-neutral\. IfCEC\_\{E\}is extracted to preserve the full model’s projected decision over a domainDD, then the relevant faithfulness condition is

∀x∈D,dT\(CE,x\)=dT\(M,x\)\.\\forall x\\in D,\\quad d\_\{T\}\(C\_\{E\},x\)=d\_\{T\}\(M,x\)\.\(2\)IfCEC\_\{E\}is extracted to optimize task accuracy againstPPrather than to matchMM, it should be called a task circuit, not necessarily a faithful model circuit\.

### 2\.3Properties

Table[1](https://arxiv.org/html/2605.24033#S2.T1)summarizes the core properties\.

Table 1:Circuit\-level properties verified by direct SMT verification\. All properties are over a finite bounded task domain and a task\-specific candidate\-token projection\.In the final\-residual robustness property,rE\(x\)r\_\{E\}\(x\)is the final residual of circuitCEC\_\{E\}before final normalization and unembedding, andGT\(r\)G\_\{T\}\(r\)is the projected task decision obtained by applying final normalization and unembedding to residualrrand restricting toTT\.

### 2\.4Direct SMT verification

WhenCEC\_\{E\}is directly SMT\-representable and tractable, the verifier encodes the circuit and negates the target property\. For functional equivalence, for example, the solver checks satisfiability of

∃x∈Dtask:dT\(CE,x\)≠P\(x\)\.\\exists x\\in D\_\{\\mathrm\{task\}\}:\\;d\_\{T\}\(C\_\{E\},x\)\\neq P\(x\)\.\(3\)If the query is unsatisfiable, the property is proven overDtaskD\_\{\\mathrm\{task\}\}\. If it is satisfiable, the solver returns a concrete counterexample\. The proof is with respect to the exported exact real or rational circuit representation used by the verifier, not bit\-for\-bit IEEE floating\-point execution in PyTorch; the exhaustive PyTorch sanity check is used to catch mismatches between the implementation and the encoded circuit\. For branchy operators such as sparsemax and BandNorm, the implementation uses branch certificates\. If the traced branch cannot be certified throughout the relevant domain or perturbation set, the verifier returns unknown rather than a proof\.

### 2\.5Surrogate\-mediated verification

When the extracted circuit contains operations that cannot be encoded directly or tractably, we introduce an SMT\-encodable surrogateSS\. A symbolic explanation programPPis then verified againstSSover the bounded domain\.

The guarantee consists of links:

circuit faithfulness:∀x∈D,dT\(CE,x\)=dT\(M,x\),\\displaystyle\\forall x\\in D,\\;d\_\{T\}\(C\_\{E\},x\)=d\_\{T\}\(M,x\),\(4\)surrogate fidelity:∀x∈D,dT\(S,x\)=dT\(CE,x\),\\displaystyle\\forall x\\in D,\\;d\_\{T\}\(S,x\)=d\_\{T\}\(C\_\{E\},x\),\(5\)program equivalence:∀x∈D,P\(x\)=dT\(S,x\)\.\\displaystyle\\forall x\\in D,\\;P\(x\)=d\_\{T\}\(S,x\)\.\(6\)HereP\(x\)∈TP\(x\)\\in Tdenotes the token returned by the symbolic program, whiledT\(S,x\)d\_\{T\}\(S,x\)denotes the surrogate’s projected token decision\. If the last two links hold on the same domain, thenPPcaptures the extracted circuit’s projected behavior overDD\. If all three hold, thenPPcaptures the full model’s projected behavior overDD\. In our surrogate\-mediated experiments, the symbolic proof is over the surrogate or finite teacher relation; the original hard\-to\-encode attention computation is validated behaviorally over the bounded domain rather than directly encoded\.

## 3SMT\-Representable Transformer Architecture

### 3\.1Design goal

Direct SMT verification benefits from architectures whose extracted circuits can be encoded exactly\. We therefore define a GPT\-style decoder\-only Transformer that preserves the high\-level residual structure of GPT\-2\[[15](https://arxiv.org/html/2605.24033#bib.bib15)\], while replacing components that are difficult to encode with more representable alternatives\. The architecture is designed for SMT\-based circuit verification, not for immediate full\-model proof at large scale\.

### 3\.2Primitive operations

The preferred primitive set includes affine maps, comparisons, max/min, piecewise\-linear activations, thresholding, top\-k or support selection, and finite disjunctions over linear regions\. These components can often be encoded in quantifier\-free linear real arithmetic with branch structure\. However, standard attention introduces bounded\-degree polynomial terms through variable products\. We therefore use the broader phrase*SMT\-representable*\. In particular, sparsemax attention with QK scoring is exactly representable in nonlinear real arithmetic, but this is a major scaling bottleneck\.

### 3\.3Signed L1 BandNorm

LayerNorm normalizes by a data\-dependent standard deviation\[[1](https://arxiv.org/html/2605.24033#bib.bib1)\]\. This introduces division and square roots, which are awkward for exact SMT encodings\. Signed L1 BandNorm replaces variance normalization with projection\-style mass control\.

A natural alternative is to remove normalization rather than replace it\. Baroni et al\. show that all LayerNorm layers can be removed from GPT\-2\-family models after fine\-tuning with only a small validation\-loss increase, including about\+0\.03\+0\.03cross\-entropy loss for GPT\-2 XL\[[2](https://arxiv.org/html/2605.24033#bib.bib2)\]\. That result is directly relevant to this work\. If the goal is post\-hoc analysis of an existing GPT\-2 model, LayerNorm removal may be a stronger practical route than training a new normalizer from scratch\. Signed L1 BandNorm instead targets the architecture\-design question: can a Transformer be trained with a normalization\-like operator whose computation is explicitly structured for SMT\-based circuit verification? The GPT\-2\-scale results below show that BandNorm is trainable, but also that it is not yet competitive with either LayerNorm or normalization\-removal approaches\.

Given an input vectorx∈ℝdx\\in\\mathbb\{R\}^\{d\}, BandNorm computes:

c\\displaystyle c=x−mean⁡\(x\),\\displaystyle=x\-\\operatorname\{mean\}\(x\),\(7\)p\\displaystyle p=max⁡\(c,0\),\\displaystyle=\\max\(c,0\),\(8\)n\\displaystyle n=max⁡\(−c,0\)\.\\displaystyle=\\max\(\-c,0\)\.\(9\)The positive and negative masses are controlled separately\. If the L1 mass of a side is too large, the vector is projected onto an L1 ball using thresholding\. If the mass is too small, the operator applies a bounded lift over active coordinates\. The signed vector is then recombined, recentered, and passed through learned affine parameters:

z=p′−n′,z←z−mean⁡\(z\),BandNorm⁡\(x\)=γ⊙z\+β\.z=p^\{\\prime\}\-n^\{\\prime\},\\qquad z\\leftarrow z\-\\operatorname\{mean\}\(z\),\\qquad\\operatorname\{BandNorm\}\(x\)=\\gamma\\odot z\+\\beta\.\(10\)
The key design choice is projection\-based scale control rather than elementwise saturation\. Earlier clamp\-based normalization variants stabilized some runs but destroyed dynamic range\. BandNorm preserves a global notion of scale while remaining expressible through affine operations, comparisons, max operations, and threshold\-defined projection branches\.

### 3\.4Sparsemax attention

Softmax attention computes

Attn⁡\(Q,K,V\)=softmax⁡\(QK⊤dk\)V\.\\operatorname\{Attn\}\(Q,K,V\)=\\operatorname\{softmax\}\\left\(\\frac\{QK^\{\\top\}\}\{\\sqrt\{d\_\{k\}\}\}\\right\)V\.\(11\)Softmax introduces exponentials and normalization\. Sparsemax replaces the softmax map with Euclidean projection onto the probability simplex\[[11](https://arxiv.org/html/2605.24033#bib.bib11)\]\. For a score vectors∈ℝms\\in\\mathbb\{R\}^\{m\}, wheremmis the number of scores in the attention row, sparsemax is

sparsemax⁡\(s\)=argminp∈Δm−1⁡‖p−s‖22,Δm−1=\{p∈ℝm:pi≥0,∑i=1mpi=1\}\.\\operatorname\{sparsemax\}\(s\)=\\operatorname\*\{arg\\,min\}\_\{p\\in\\Delta^\{m\-1\}\}\\left\\lVert p\-s\\right\\rVert\_\{2\}^\{2\},\\qquad\\Delta^\{m\-1\}=\\left\\\{p\\in\\mathbb\{R\}^\{m\}:p\_\{i\}\\geq 0,\\;\\sum\_\{i=1\}^\{m\}p\_\{i\}=1\\right\\\}\.\(12\)It is piecewise\-linear inssand can produce exact zeros\. Sparse attention mechanisms based on sparsemax and entmax have also been studied as trainable alternatives to dense softmax attention\[[4](https://arxiv.org/html/2605.24033#bib.bib4)\]\.

However, sparsemax does not make the entire attention layer piecewise\-affine\. Standard attention still computes scoressij=qi⊤kjs\_\{ij\}=q\_\{i\}^\{\\top\}k\_\{j\}and outputsoi=∑jaijvjo\_\{i\}=\\sum\_\{j\}a\_\{ij\}v\_\{j\}\. Both terms contain products of variables whenqq,kk,aa, andvvare symbolic\. Thus, sparsemax attention improves the branch structure relative to softmax, but full QK sparsemax attention remains a nonlinear real arithmetic encoding problem\.

### 3\.5LeakyReLU MLP

GPT\-2 uses GELU activations in the MLP block\[[8](https://arxiv.org/html/2605.24033#bib.bib8)\]\. We replace GELU with LeakyReLU:

LeakyReLU⁡\(x\)=\{x,x≥0,αx,x<0\.\\operatorname\{LeakyReLU\}\(x\)=\\begin\{cases\}x,&x\\geq 0,\\\\ \\alpha x,&x<0\.\\end\{cases\}\(13\)This is exactly piecewise\-linear and avoids the dead\-activation issue of a pure ReLU\.

Together, the direct\-verification architecture keeps the GPT\-style residual\-block structure, but replaces the components that most directly obstruct exact symbolic reasoning\. LayerNorm is replaced by Signed L1 BandNorm to avoid division by a data\-dependent variance estimate\. Softmax is replaced by sparsemax to remove exponentials and softmax normalization\. GELU is replaced by LeakyReLU to make the MLP activation exactly piecewise\-linear\. Verification checks are performed over candidate\-token projections rather than the full vocabulary distribution\.

The main remaining caveat is attention\. Sparsemax is piecewise\-linear as a map from scores to attention weights, but the full attention layer is not purely piecewise\-affine because standard QK scoring and value weighting introduce products of symbolic variables\. Thus the architecture is SMT\-representable, while QK sparsemax attention remains the principal scaling bottleneck\.

## 4Direct Verification Experiments on Small Circuits

### 4\.1Tasks

We evaluate direct SMT verification on two symbolic syntax\-like tasks\.

#### Quote closing\.

The model must distinguish whether the next token should be a single quote or a double quote\. The candidate\-token setTTcontains the two quote tokens\. The reference program returns the closing quote matching the unmatched opening quote\.

#### Bracket type\.

The model must distinguish whether the next token should be a square closing delimiter or a curly closing delimiter\. The candidate\-token setTTcontains the two closing delimiter tokens\. The reference program returns the closing delimiter matching the opening delimiter\.

These tasks are deliberately small\. They have finite exhaustive domains, known symbolic reference programs, and crisp task projections\. They are intended to test the verification machinery rather than to demonstrate natural\-language capability\.

The two direct\-verification domains contain only 128 discrete sequences each\. For projected functional equivalence alone, brute\-force enumeration is sufficient: one can simply run every input and compare the circuit’s candidate\-token decision to the symbolic reference program\. We therefore treat finite\-domain equivalence as an encoding and extraction sanity check rather than as the main scientific result\. The SMT formulation matters because it provides a common representation for stronger circuit claims\. Edge\-necessity queries ask the solver for existential witnesses, failed explanations return concrete counterexamples, and final\-residual robustness quantifies over every real perturbation in anℓ∞\\ell\_\{\\infty\}ball around the final residual rather than over a finite list of token sequences\. The small experiments are therefore a minimal end\-to\-end test of whether extracted Transformer circuits can be converted into universal, existential, and continuous solver queries\.

### 4\.2Model and extraction

We train a small decoder\-only Transformer with a custom 32\-token vocabulary and SMT\-representable components\. The model has approximately 8K parameters and is trained in a multitask setting until both tasks reach 100 percent candidate\-token accuracy\. We then extract task\-specific circuits and run an exhaustive PyTorch sanity check of the controlled circuit before invoking the SMT verifier\. The sanity check enumerates the discrete token domain; the SMT queries then check the formal properties, including existential edge witnesses and continuous final\-residual robustness\.

### 4\.3Verification results

Table[2](https://arxiv.org/html/2605.24033#S4.T2)shows the direct verification results\. In both tasks, the verifier proves projected functional equivalence, content invariance, edge necessity, and final\-residual robustness over the finite domain\.

Table 2:Small end\-to\-end direct verification results\. The circuit decision is projected to the task candidate set\. Equivalence, invariance, and edge necessity are checked over the finite task domain; robustness additionally quantifies over bounded continuous perturbations to the final residual\.These experiments demonstrate direct verification end to end: train a Transformer\-like model, extract a sparse task circuit, and prove bounded claims about the circuit’s projected behavior\. The result is not large\-scale Transformer verification\. It is a proof of concept that mechanistic circuit claims can be turned into solver\-checkable propositions\.

## 5GPT\-2\-Scale Trainability

The small experiments test direct proof in a deliberately tiny setting\. This section instead asks whether the same operator replacements train at nontrivial scale\. We use a GPT\-2\-small\-scale model with the same tokenizer, context length, optimization recipe, and evaluation code across variants\. The local GPT\-2 baseline is the primary comparison anchor\. These experiments are trainability ablations, not verification results\.

### 5\.1Protocol

We train on OpenWebText, an open\-source replication of WebText\[[7](https://arxiv.org/html/2605.24033#bib.bib7)\], and evaluate on OpenWebText validation and WikiText\-103 validation\[[12](https://arxiv.org/html/2605.24033#bib.bib12)\]\. A pretrained GPT\-2 reference evaluated under the same local protocol gives OpenWebText validation loss 3\.1187 and WikiText\-103 perplexity 31\.0403\. The local baseline run reaches OpenWebText validation loss 3\.1340 at step 260K and WikiText\-103 perplexity 52\.9820\. Because the local baseline is optimized for this OpenWebText training pipeline and stopping criterion, relative comparisons against the local baseline are the primary signal\.

### 5\.2Results

We compare three architecture variants against the local GPT\-2 baseline:

1. 1\.sparsemax\-only: replace softmax attention with sparsemax while keeping standard LayerNorm;
2. 2\.BandNorm\-only: replace LayerNorm with Signed L1 BandNorm while keeping standard attention;
3. 3\.full stack: Signed L1 BandNorm, sparsemax attention, and LeakyReLU\.

Table[3](https://arxiv.org/html/2605.24033#S5.T3)reports the main results\.

Table 3:GPT\-2\-scale replacement results\. Sparsemax attention is close to the local baseline relative to normalization replacements\. The full SMT\-representable stack trains stably, with the remaining performance gap mainly due to normalization\.Figures[2](https://arxiv.org/html/2605.24033#S5.F2)and[3](https://arxiv.org/html/2605.24033#S5.F3)visualize the final performance gaps\.

![Refer to caption](https://arxiv.org/html/2605.24033v1/x1.png)Figure 2:OpenWebText validation loss delta versus the local GPT\-2 baseline\.![Refer to caption](https://arxiv.org/html/2605.24033v1/x2.png)Figure 3:WikiText\-103 perplexity delta versus the local GPT\-2 baseline\.The sparsemax\-only run is near the local baseline on OpenWebText and modestly worse on WikiText\-103\. BandNorm introduces the largest gap\. Combining sparsemax and LeakyReLU with BandNorm adds only a small degradation relative to BandNorm alone\. This suggests that normalization, not sparsemax attention, is the main remaining performance bottleneck\.

This comparison should be read against recent LayerNorm\-removal results\. Baroni et al\. show that GPT\-2\-family models can be fine\-tuned after removing all LayerNorm layers with much smaller loss degradation than the BandNorm replacement reported here\[[2](https://arxiv.org/html/2605.24033#bib.bib2)\]\. Thus, BandNorm is not presented as the best current way to obtain a near\-baseline LayerNorm\-free GPT\-2\. Its value in this paper is as a fully specified normalization\-like operator for training SMT\-representable circuits from scratch\. For practical large\-model interpretability, normalization removal from pretrained models may currently be the more promising path\.

### 5\.3Verification tractability

We do not report GPT\-2\-scale circuit verification as a main result\. Even when a GPT\-2\-scale model exhibits simple behaviors, current direct encodings are not tractable at that scale\. The bottlenecks are bilinear QK attention, attention\-weighted value products, sparsemax support branching, BandNorm branch structure, residual graph density, hidden width, and sequence length\. The GPT\-2\-scale experiments are therefore evidence about trainability of the operator replacements, not evidence of scalable formal proof\.

## 6Surrogate\-Mediated Symbolic Circuit Verification

Direct SMT verification applies when the extracted circuit can be encoded directly and the resulting solver problem is tractable\. Some circuits of interest contain operations that are not directly encodable under the chosen theory, or are encodable only with prohibitive nonlinear or branch complexity\. In that case we use a surrogate\-mediated route: fit an SMT\-encodable surrogate to the extracted circuit’s bounded\-domain behavior, then verify a symbolic explanation against the surrogate\.

### 6\.1Method

The researcher supplies the task domain, candidate\-token projection, teacher circuit or teacher relation, and DSL or template family\. The method does not automatically discover the behavior represented by the circuit\. Its role is narrower: given a task\-localized circuit, test whether the circuit’s bounded\-domain behavior is equivalent to a compact symbolic explanation, and return counterexamples when the explanation fails\. This distinction is important\. Selecting a quote\-closing domain already says that quote behavior is relevant, but it does not determine the exact algorithm\. A circuit could implement first\-quote tracking, last\-quote tracking, parity, count thresholds, position\-weighted heuristics, or accidental finite\-domain special cases\.

Let the extracted circuit define a deterministic teacher function

fcircuit:X→Y,f\_\{\\mathrm\{circuit\}\}:X\\rightarrow Y,\(14\)whereXXis a bounded token domain andYYis a task output or projected decision\. We fit a small ReLU surrogateS:X→YS:X\\rightarrow Y\. ReLU networks are piecewise\-linear and have a long history in neural\-network verification\[[9](https://arxiv.org/html/2605.24033#bib.bib9),[10](https://arxiv.org/html/2605.24033#bib.bib10)\]\. If the domain is finite and small enough, we exhaustively validateSSagainst the teacher on allx∈Xx\\in X\. If validation is sampled, the resulting claim is correspondingly weaker\.

We then search over a constrained DSL of symbolic programs\. The DSL includes token predicates, counters, toggles, threshold detectors, guarded updates, and small state machines\. A typical template has the form

```
state = INIT
for tok in input:
    if predicate(tok):
        state = update(state)
return output_fn(state)
```

The search is complete only over the chosen template family, not over all possible programs\.

Given a candidate programPPthat returns a token inTTand a surrogateSSwith projected decisiondT\(S,x\)d\_\{T\}\(S,x\), the verifier asks whether there exists an input where they disagree:

∃x∈D:P\(x\)≠dT\(S,x\)\.\\exists x\\in D:\\;P\(x\)\\neq d\_\{T\}\(S,x\)\.\(15\)If the formula is unsatisfiable, then the program and the projected surrogate decision are equivalent overDD\. If satisfiable, the solver returns a concrete counterexample\. Counterexamples are useful interpretability artifacts: they identify specific inputs where the proposed explanation fails\.

### 6\.2Results

We evaluate surrogate\-mediated verification on two task\-localized circuit behaviors derived from sparse\-circuit artifacts and teacher functions\. The tasks are quote tracking and bracket counting\. Table[4](https://arxiv.org/html/2605.24033#S6.T4)summarizes the results\. “Finite\-dataset agreement” refers to agreement on the finite task examples used for fitting or selecting candidate symbolic programs before exhaustive bounded\-domain checking\. It is distinct from surrogate\-teacher agreement over the bounded abstract domain used by the SMT verifier\.

Table 4:Surrogate\-mediated symbolic verification\. Finite\-dataset agreement is measured on the finite task examples used for program fitting or selection\. Surrogate\-teacher agreement is measured over the bounded abstract domain used for verification\. Finite\-dataset agreement is not sufficient for formal equivalence: the quote task returns a counterexample despite dataset\-level agreement, while the bracket task admits a verified compact program over the bounded domain\.For the quote task, no program in the restricted template family is formally equivalent to the teacher behavior over the bounded domain\. The solver returns the following counterexample:

```
[QUOTE] counterexample (abstract domain): [2, 0, 2]
[QUOTE] counterexample (actual tokens): [0, 1, 0]
[QUOTE] rendered: [’!’, ’"’, ’!’]
[QUOTE] teacher: single-quote, program: double-quote, surrogate: single-quote
```

This result illustrates the value of exhaustive bounded checks\. A proposed explanation can agree on a standard dataset while failing on an edge case found by the solver\. For the bracket task, the synthesized program matches the teacher and surrogate on all bounded\-domain inputs, and the SMT check proves equivalence\.

### 6\.3Interpretation

Surrogate\-mediated verification is weaker than direct verification because the proof is about the surrogate or finite teacher relation, not necessarily about the original floating\-point circuit encoded end\-to\-end\. But it applies to circuits whose original operators are not tractable for direct SMT\. It turns symbolic explanations into refutable hypotheses and supplies concrete counterexamples when those explanations fail\.

## 7Discussion

### 7\.1Strength of the guarantees

Direct SMT verification proves properties of the encoded circuit itself\. Surrogate\-mediated verification validates an encodable abstractionSSand verifies a symbolic programPPagainst it\. The direct method gives stronger guarantees but requires representable operators and small enough circuits\. The surrogate\-mediated method is weaker but applies more broadly\. In both cases, the explanation must be stated as a bounded proposition that can be proven or refuted\.

### 7\.2Bounded domains and solver checks

Bounded verification is not a substitute for arbitrary\-length generalization\. Nor is SMT needed merely to compare two functions on a tiny finite list of discrete inputs; brute\-force enumeration is enough for that subclaim\. The value of the solver formulation is that it gives a uniform language for universal and existential circuit claims, symbolic counterexamples, and continuous perturbation properties such as final\-residual robustness\. Many mechanistic interpretability claims are already about finite synthetic domains, task templates, or specific projected behaviors\. In those settings, bounded solver checks are stronger than sampled validation, and counterexamples are especially valuable: even when a proof is not obtained, a concrete failing input can reveal that the proposed explanation is incomplete\.

### 7\.3Architecture as an interpretability tool

The GPT\-2\-scale experiments suggest that architectural changes for verifiability need not make training impossible\. Sparsemax attention remains close to the local baseline, and the full stack trains stably\. However, the normalization results should be interpreted alongside work showing that LayerNorm can be removed from pretrained GPT\-2 models with much smaller loss degradation\[[2](https://arxiv.org/html/2605.24033#bib.bib2)\]\. For future systems, the choice may not be between LayerNorm and a new normalizer; it may be between replacement, removal, and training procedures that make normalization unnecessary\. Standard QK attention also remains a verification bottleneck, motivating attention mechanisms that avoid continuous variable\-weighted aggregation or replace QK with a more tractable selection mechanism\.

## 8Related Work

#### Transformers and sparse attention\.

Transformers use scaled dot\-product attention and residual blocks as their core architecture\[[16](https://arxiv.org/html/2605.24033#bib.bib16)\]\. GPT\-2 scales decoder\-only Transformers for language modeling\[[15](https://arxiv.org/html/2605.24033#bib.bib15)\]\. Sparsemax replaces softmax with projection onto the simplex and can produce exact zeros\[[11](https://arxiv.org/html/2605.24033#bib.bib11)\]\. Entmax generalizes sparsemax and softmax and has been used in adaptively sparse Transformers\[[4](https://arxiv.org/html/2605.24033#bib.bib4)\]\.

#### Normalization and interpretability\.

LayerNorm was introduced to stabilize hidden\-state dynamics during training\[[1](https://arxiv.org/html/2605.24033#bib.bib1)\]\. Recent work shows that LayerNorm can be removed from GPT\-2\-family models after fine\-tuning with only small validation\-loss degradation, and argues that LayerNorm\-free GPT\-2 variants can simplify certain mechanistic interpretability analyses\[[2](https://arxiv.org/html/2605.24033#bib.bib2)\]\. This work is complementary: Signed L1 BandNorm explores a training\-time replacement designed for exact symbolic representation, while normalization removal explores whether pretrained models can be made easier to analyze after training\.

#### Mechanistic interpretability and circuit discovery\.

Prior work has reverse engineered specific Transformer mechanisms, including induction heads\[[14](https://arxiv.org/html/2605.24033#bib.bib14)\]and modular arithmetic circuits in grokking models\[[13](https://arxiv.org/html/2605.24033#bib.bib13)\]\. ACDC systematizes the process of finding sparse circuits in neural networks\[[3](https://arxiv.org/html/2605.24033#bib.bib3)\]\. Weight\-sparse Transformers expose more human\-understandable circuits and release pruned circuit artifacts for algorithmic tasks\[[6](https://arxiv.org/html/2605.24033#bib.bib6)\]\.

#### Formal verification of neural networks\.

SMT solvers such as Z3 support verification over combinations of logical theories\[[5](https://arxiv.org/html/2605.24033#bib.bib5)\]\. Reluplex and Marabou show how SMT\-style methods can verify ReLU networks and return counterexamples\[[9](https://arxiv.org/html/2605.24033#bib.bib9),[10](https://arxiv.org/html/2605.24033#bib.bib10)\]\. Our work differs in focusing on extracted Transformer circuits and projected mechanistic claims rather than robustness properties of feed\-forward classifiers\.

#### Permutation\-invariant views of attention\.

Deep Sets characterize permutation\-invariant functions as sum\-decomposable under suitable assumptions\[[17](https://arxiv.org/html/2605.24033#bib.bib17)\]\. This perspective motivates a broader design space of query\-conditioned aggregation operators for attention, discussed in Appendix[F](https://arxiv.org/html/2605.24033#A6)\. In this paper, we instantiate a simple point in this space using QK scoring followed by sparsemax\.

## 9Limitations

The results have several important limitations\.

1. 1\.All formal guarantees are bounded\-domain guarantees\.
2. 2\.In the small direct\-verification experiments, projected functional equivalence over the 128 discrete inputs is itself an exhaustive\-testable sanity check; the stronger use of SMT is for logical circuit properties and continuous residual robustness\.
3. 3\.Verification is over projected candidate\-token decisions, not full vocabulary logits\.
4. 4\.Circuit extraction does not prove global minimality\. Edge necessity is local to retained edges under zero\-ablation semantics\.
5. 5\.GPT\-2\-scale direct SMT verification is not currently tractable\.
6. 6\.Sparsemax attention is not purely piecewise\-affine when used with standard QK scores and attention\-weighted values\.
7. 7\.Signed L1 BandNorm still underperforms LayerNorm in the GPT\-2\-scale experiments and is not competitive with recent post\-hoc LayerNorm\-removal results\.
8. 8\.Surrogate\-mediated verification depends on surrogate fidelity\. If surrogate validation is not exhaustive over the relevant domain, the claim is weaker\.
9. 9\.The symbolic DSL searches only a chosen template family\. Failure to find an equivalent program does not rule out more complex equivalent programs\.
10. 10\.The task domain and candidate\-token projection are supplied by the researcher\. The method does not automatically discover what behavior a circuit represents\.
11. 11\.SMT proofs are about the exported exact real or rational circuit encoding, not bit\-for\-bit PyTorch floating\-point execution\.
12. 12\.Branch\-certified verification can return unknown if branch stability cannot be certified\.

## 10Future Work

Several directions follow naturally from these results\. First, the most direct next milestone is to prove nontrivial properties of circuits extracted from a GPT\-2\-scale variant\. The current encodings are too large and branchy for this, so progress likely requires both smaller extracted circuits and tighter solver encodings\. Training with enforced sparsity, following the Sparse Circuits line of work\[[6](https://arxiv.org/html/2605.24033#bib.bib6)\], is one natural way to make the extracted graphs smaller before verification\.

Second, normalization remains unresolved\. Signed L1 BandNorm gives an exactly specified normalization\-like operator, but its performance gap is large\. An alternative path is to remove normalization entirely after training, following LayerNorm\-removal work on GPT\-2\-family models\[[2](https://arxiv.org/html/2605.24033#bib.bib2)\]\. A useful future comparison would train or fine\-tune models under both approaches and measure not only language\-modeling loss, but also circuit sparsity, attribution quality, and SMT encoding cost\.

Third, attention operators without bilinear QK are a named architectural direction\. Appendix[F](https://arxiv.org/html/2605.24033#A6)sketches a Deep Sets view of attention as query\-conditioned aggregation\. Future work should search this design space for operators that preserve the algorithmic role of attention while avoiding bilinear query\-key products and continuous attention\-weighted value products\. Hard selection, Boolean gates, and Deep Sets\-style aggregators with tractable coefficients are plausible starting points\.

Finally, counterexample\-guided DSL expansion could make surrogate\-mediated verification more useful: when the solver refutes a proposed program, the counterexample could automatically refine the template family\. The framework should also be applied to richer algorithmic behaviors, such as string transformations, addition, and code\-like syntax, where ground\-truth reference programs are known\.

## 11Conclusion

Verifiable Transformers reframes mechanistic interpretability as a process of turning circuit explanations into bounded formal claims\. Direct SMT verification proves properties of extracted circuits when the circuit is encodable and small enough\. Surrogate\-mediated verification provides a weaker but broader route when direct encoding is not tractable\. Together, these results show a concrete path from circuit discovery to solver\-checkable explanations: specify a behavior, extract a circuit, state the claim, and ask the solver for a proof or counterexample\.

## Appendix AFull Notation and Property Definitions

The main text states the four properties\. The robustness check is branch\-certified: for each input, the verifier first checks that the traced final BandNorm branch remains stable throughout theϵ\\epsilon\-ball\. It then checks that no task\-decision flip is possible within that branch\. If branch stability cannot be certified, the result is unknown rather than verified\.

## Appendix BSigned L1 BandNorm Details

Signed L1 BandNorm was motivated by failures of simpler SMT\-friendly normalization variants\. Affine\-clamp\-affine normalizers were stable only briefly because they lacked data\-dependent centering and scale control\. MAD\-based piecewise\-linear normalizers restored centering and adaptive scaling but introduced discontinuous gain buckets\. Unbounded leaky clamps improved gradient flow but allowed residual magnitudes to grow\. Bounded elementwise clamps stabilized activations but damaged dynamic range\.

BandNorm instead controls global signed mass\. Letc=x−mean⁡\(x\)c=x\-\\operatorname\{mean\}\(x\)and splitccinto positive and negative partsppandnn\. Each side is constrained to lie within a target L1 mass band\. If‖p‖1\\left\\lVert p\\right\\rVert\_\{1\}is above the upper threshold,ppis projected onto the L1 ball\. If‖p‖1\\left\\lVert p\\right\\rVert\_\{1\}is below the lower threshold, mass is lifted over active coordinates\. The same procedure is applied tonn\. The signed vector is recombined and recentered\.

Projection onto an L1 ball can be expressed through a threshold over sorted coordinates, analogously to the support threshold used in sparsemax\. This introduces branch structure, but the branches are finite and piecewise\-affine\.

## Appendix CSmall Model Experimental Details

The small direct\-verification experiments use a custom 32\-token vocabulary and finite exhaustive task domains\. Training stops when both tasks achieve 100 percent candidate\-token accuracy\. Circuit extraction is task\-specific\. The verifier first runs an exhaustive PyTorch sanity check of the controlled circuit over the task domain before constructing SMT queries\.

The implemented properties are projected functional equivalence, content invariance, edge necessity, and continuous robustness\. The result files contain per\-property solver status, counterexamples when present, and branch\-certificate diagnostics\.

## Appendix DGPT\-2\-Scale Normalization Ablations

Table 5:Summary of normalization replacement experiments\.The progression suggests that SMT\-friendly normalization replacements must provide both centering and scale control without destroying coordinate\-level dynamic range\. Among the replacements tested here, Signed L1 BandNorm is the strongest candidate, but the performance gap relative to LayerNorm remains significant\. Recent LayerNorm\-removal results also suggest that removing normalization may be a stronger practical baseline than replacing it\.

## Appendix ESurrogate\-Mediated Symbolic Verification Details

The surrogate\-mediated verification uses a small ReLU network to match the teacher circuit’s bounded\-domain behavior\. Candidate symbolic programs are enumerated from a DSL with counters, toggles, token predicates, threshold detectors, and small state machines\. The verifier encodes both the surrogate and the candidate program and checks for a disagreement input\. Equivalence is certified only within the bounded domain and template family\.

For the quote task, the restricted template family has no equivalent program over the bounded domain\. The solver returns a counterexample, demonstrating that dataset\-level agreement is insufficient\. For the bracket task, the synthesized program is equivalent over the bounded abstract domain\.

## Appendix FAttention as Verifiable Aggregation

Attention can be viewed as query\-conditioned aggregation over key\-value pairs\. Letqqbe fixed and let each element bexi=\(q,ki,vi\)x\_\{i\}=\(q,k\_\{i\},v\_\{i\}\)\. A permutation\-invariant attention operator over the key\-value multiset can be written in a Deep Sets form under suitable assumptions\[[17](https://arxiv.org/html/2605.24033#bib.bib17)\]:

A\(q,K,V\)=ρ\(∑iϕ\(q,ki,vi\)\)\.A\(q,K,V\)=\\rho\\left\(\\sum\_\{i\}\\phi\(q,k\_\{i\},v\_\{i\}\)\\right\)\.\(16\)A convenient factorization is

ϕ\(q,ki,vi\)=u\(q,ki\)v\(ki,vi\),\\phi\(q,k\_\{i\},v\_\{i\}\)=u\(q,k\_\{i\}\)\\,v\(k\_\{i\},v\_\{i\}\),\(17\)whereuuis a scalar relevance function andvvis a content map\. The factorization is not forced by the Deep Sets theorem; it is a design choice that aligns with standard attention\.

This perspective suggests a family of verifiable attention operators obtained by constraininguu,vv, andρ\\rhoto finite expression trees built from affine maps, piecewise\-linear functions, thresholding, and tractable aggregation\. The present paper does not run a full search over that space\. It uses standard QK scoring followed by sparsemax as a simple trainable point in the design space, while identifying QK and value\-weight products as the remaining verification bottleneck\.

## Appendix GReproducibility Notes

The public artifact layout is split across two repositories\.

Artifacts\.

- •verifiable\-transformers/artifacts/gpt2\-baseline: local GPT\-2 baseline metrics and outputs\.
- •verifiable\-transformers/artifacts/small\_circuits: small extracted circuits and verification outputs\.
- •symbolic\-circuit\-distillation/results: surrogate\-mediated symbolic verification outputs\.

## Acknowledgments

This manuscript was drafted with assistance from OpenAI GPT\-5\.5 Pro\. All technical content, experiments, and claims are the author’s\.

## References

- \[1\]Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E\. Hinton\.Layer Normalization\.arXiv:1607\.06450, 2016\.
- \[2\]Luca Baroni, Galvin Khara, Joachim Schaeffer, Marat Subkhankulov, and Stefan Heimersheim\.Transformers Don’t Need LayerNorm at Inference Time: Scaling LayerNorm Removal to GPT\-2 XL and the Implications for Mechanistic Interpretability\.arXiv:2507\.02559, 2025\.
- \[3\]Arthur Conmy, Augustine N\. Mavor\-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga\-Alonso\.Towards Automated Circuit Discovery for Mechanistic Interpretability\.NeurIPS, 2023\.
- \[4\]Gonçalo M\. Correia, Vlad Niculae, and André F\. T\. Martins\.Adaptively Sparse Transformers\.Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 2174–2184, 2019\.
- \[5\]Leonardo de Moura and Nikolaj Bjørner\.Z3: An Efficient SMT Solver\.TACAS, 2008\.
- \[6\]Leo Gao, Achyuta Rajaram, Jacob Coxon, Soham V\. Govande, Bowen Baker, and Dan Mossing\.Weight\-Sparse Transformers Have Interpretable Circuits\.arXiv:2511\.13653, 2025\.
- \[7\]Aaron Gokaslan and Vanya Cohen\.OpenWebText Corpus\.2019\.
- \[8\]Dan Hendrycks and Kevin Gimpel\.Gaussian Error Linear Units \(GELUs\)\.arXiv:1606\.08415, 2016\.
- \[9\]Guy Katz, Clark Barrett, David L\. Dill, Kyle Julian, and Mykel J\. Kochenderfer\.Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks\.CAV, 2017\.
- \[10\]Guy Katz, Derek A\. Huang, Duligur Ibeling, Kyle Julian, Christopher Lazarus, Rachel Lim, Parth Shah, Shantanu Thakoor, Haoze Wu, Aleksandar Zeljić, David L\. Dill, Mykel J\. Kochenderfer, and Clark Barrett\.The Marabou Framework for Verification and Analysis of Deep Neural Networks\.CAV, 2019\.
- \[11\]André F\. T\. Martins and Ramón Fernandez Astudillo\.From Softmax to Sparsemax: A Sparse Model of Attention and Multi\-Label Classification\.ICML, 2016\.
- \[12\]Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher\.Pointer Sentinel Mixture Models\.arXiv:1609\.07843, 2016\.
- \[13\]Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt\.Progress Measures for Grokking via Mechanistic Interpretability\.ICLR, 2023\.
- \[14\]Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Dawn Drain, Deep Ganguli, Zac Hatfield\-Dodds, Danny Hernandez, Scott Johnston, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah\.In\-Context Learning and Induction Heads\.arXiv:2209\.11895, 2022\.
- \[15\]Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever\.Language Models are Unsupervised Multitask Learners\.OpenAI blog, 1\(8\):9, 2019\.
- \[16\]Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N\. Gomez, Lukasz Kaiser, and Illia Polosukhin\.Attention Is All You Need\.NeurIPS, 2017\.
- \[17\]Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Ruslan Salakhutdinov, and Alexander J\. Smola\.Deep Sets\.NeurIPS, 2017\.
Towards Verifiable Transformers: Solver-Checkable Circuit Explanations

Similar Articles

Precise Verification of Transformers through ReLU-Catalyzed Abstraction Refinement

Transformers Linearly Represent Highly Structured World Models

Vertex-Softmax: Tight Transformer Verification via Exact Softmax Optimization

Transformers Are Inherently Succinct

Certification from Examples is Hard for Circuits and Transformers under Minimal Overparametrization

Submit Feedback

Similar Articles

Precise Verification of Transformers through ReLU-Catalyzed Abstraction Refinement
Transformers Linearly Represent Highly Structured World Models
Vertex-Softmax: Tight Transformer Verification via Exact Softmax Optimization
Transformers Are Inherently Succinct
Certification from Examples is Hard for Circuits and Transformers under Minimal Overparametrization