Revisiting Padded Transformer Expressivity: Which Architectural Choices Matter and Which Don't

arXiv cs.LG Papers

Summary

This theoretical paper analyzes the expressivity of padded transformers, showing that attention type, width, and uniformity have little impact compared to numeric precision and model depth. It establishes equivalences between transformer variants and circuit complexity classes like AC0 and TC0, providing a robust characterization.

arXiv:2605.30523v1 Announce Type: new Abstract: Recent work describes what transformers can and cannot compute through connections to boolean circuits, but existing results lack exact characterizations and are sensitive to modeling choices. Padded transformers -- to whose input filler symbols such as ``...'' are appended -- emerge as a useful gadget for establishing equivalences to circuit classes by providing polynomial space for adaptive parallel computation. However, only a limited set of padded transformer idealizations has been studied, leaving open how robustly these equivalences hold under changes to attention type, model width, and uniformity. We find that, under practical assumptions, padded transformers are surprisingly robust to all of these, and identify numeric precision and model depth as the main factors affecting expressivity. Concretely, we prove that polynomially padded $\text{L-uniform}$ constant-precision transformers are equivalent to $\text{L-uniform AC}^0$, while growing-precision ones achieve $\text{L-uniform TC}^0$ regardless of width. Furthermore, looping enables sequential processing analogous to circuits: $\log^d N$-looped constant-precision transformers reach $\text{FO-uniform AC}^d$, and growing-precision ones reach $\text{FO-uniform TC}^d$. Interestingly, growing width or precision beyond logarithmic does not increase expressivity, and all our results hold for both softmax and average hard attention transformers.

Cached at: 06/01/26, 09:26 AM

# Which Architectural Choices Matter and Which Don’t
Source: [https://arxiv.org/html/2605.30523](https://arxiv.org/html/2605.30523)
## Revisiting Padded Transformer Expressivity: Which Architectural Choices Matter and Which Don’t

###### Abstract

Recent work describes what transformers can and cannot compute through connections to boolean circuits, but existing results lack exact characterizations and are sensitive to modeling choices. *Padded* transformers, to whose input filler symbols such as “…” are appended, emerge as a useful gadget for establishing *equivalences* to circuit classes by providing polynomial space for adaptive parallel computation. However, only a limited set of padded transformer idealizations has been studied, leaving open how robustly these equivalences hold under changes to attention type, model width, and uniformity. We find that, under practical assumptions, padded transformers are surprisingly *robust* to all of these, and identify numeric precision and model depth as the main factors affecting expressivity. Concretely, we prove that polynomially padded $\mathtt{L}\text{-uniform}$ constant-precision transformers are equivalent to $\mathtt{L}\text{-uniform}\ \mathtt{AC}^{\mathtt{0}}$, while growing-precision ones achieve $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{\mathtt{0}}$ regardless of width. Furthermore, *looping* enables sequential processing analogous to circuits: $\log^{d} N$-looped constant-precision transformers reach $\mathtt{FO}\text{-uniform}\ \mathtt{AC}^{\mathtt{d}}$, and growing-precision ones reach $\mathtt{FO}\text{-uniform}\ \mathtt{TC}^{\mathtt{d}}$. Interestingly, growing width or precision beyond logarithmic does not increase expressivity, and all our results hold for *both* softmax and average hard attention transformers.

Machine Learning, ICML

## 1 Introduction

A large body of work explores transformer expressivity, i.e., what functions transformers can and cannot compute. Because such work is formal, one must mathematically pin down all aspects of the model. However, it has become apparent that transformer expressivity can be highly sensitive to some design choices; for example, the type of attention (soft versus hard) has severe consequences on what languages transformers can recognize (Hao et al., [2022](https://arxiv.org/html/2605.30523#bib.bib21); Jerad et al., [2025](https://arxiv.org/html/2605.30523#bib.bib24)). This has led to diverse and sometimes difficult-to-reconcile results on transformer expressivity, where seemingly similar transformer variants (differing only slightly in, say, numeric precision assumptions) achieve substantially different expressivity (Strobl et al., [2024](https://arxiv.org/html/2605.30523#bib.bib47)).

In stark contrast to this brittleness, we show that *padded transformers* (Pfau et al., [2024](https://arxiv.org/html/2605.30523#bib.bib42)), which append polynomially many dedicated filler symbols $\square$ to the input before processing it, are surprisingly *robust* to a variety of changes to model specification, including attention type, model width, and parameter uniformity. Numeric precision and model depth, in contrast, emerge as major factors determining expressivity; log-precision padded transformers are always more expressive than constant-precision ones, and expressivity grows with model depth. Once logarithmic precision is achieved in an $\mathtt{L}\text{-uniform}$ padded transformer (a transformer that can be constructed by a logspace Turing machine; cf. [Def. 2.4](https://arxiv.org/html/2605.30523#S2.Thmdefinition4)), however, other aspects like attention type, model width, or further increase in precision do not affect expressivity. This may be attractive to both theorists and practitioners: On the one hand, it simplifies theoretical analyses, which can focus on whichever equivalent specification is easiest to study; on the other hand, it suggests that the derived characterizations are more likely to describe real-world models.

Figure 1: Polynomially padded transformer expressivity across depth ($\downarrow$), precision ($\downarrow$), uniformity ($\rightarrow$), and width ($\rightarrow$) regimes, ignoring parameterizations that do not satisfy the sufficient volume constraint (cf. [Def. 2.3](https://arxiv.org/html/2605.30523#S2.Thmdefinition3)). In contrast to most existing results on transformer expressivity, these results are *exact* for padded transformers. The purple lines mark the *$\mathtt{AC}/\mathtt{TC}$ divide*: Constant-precision transformers are limited to $\mathtt{AC}^{\mathtt{d}}$, while growing-precision transformers achieve $\mathtt{TC}^{\mathtt{d}}$. $\dagger$ marks the results of Merrill & Sabharwal ([2025a](https://arxiv.org/html/2605.30523#bib.bib35)) on fully uniform AHATs (average hard attention transformers) and $\star$ marks the results of London & Kanade ([2025](https://arxiv.org/html/2605.30523#bib.bib32)) on $\mathtt{L}\text{-uniform}$ SMATs (soft attention transformers).

Intuitively, padding abstracts ways of adaptively increasing parallel inference-time computation, such as pause symbols (Pfau et al., [2024](https://arxiv.org/html/2605.30523#bib.bib42)) and text diffusion models (Svete & Sabharwal, [2026](https://arxiv.org/html/2605.30523#bib.bib48)), which improve the empirical performance of transformers on a variety of tasks (Pfau et al., [2024](https://arxiv.org/html/2605.30523#bib.bib42); Goyal et al., [2024](https://arxiv.org/html/2605.30523#bib.bib17); London & Kanade, [2025](https://arxiv.org/html/2605.30523#bib.bib32)). Padding has also proven to be a useful theoretical gadget for studying transformer expressivity, yielding *exact* expressivity characterizations, but only for specific choices of uniformity, precision, width, and attention type (Li et al., [2024b](https://arxiv.org/html/2605.30523#bib.bib30); Merrill & Sabharwal, [2025a](https://arxiv.org/html/2605.30523#bib.bib35); London & Kanade, [2025](https://arxiv.org/html/2605.30523#bib.bib32)), leaving open how robust these characterizations are. Our comprehensive analysis of a large set of possible transformer idealizations (cf. [Fig. 1](https://arxiv.org/html/2605.30523#S1.F1)) reveals the previously unknown robustness of padded transformers to these differences.

Padding facilitates a particularly convenient connection to boolean circuits, computational models that process fixed-length inputs through layers of logic gates in the form of an acyclic graph (Hao et al., [2022](https://arxiv.org/html/2605.30523#bib.bib21); Merrill & Sabharwal, [2023](https://arxiv.org/html/2605.30523#bib.bib33); Li et al., [2024b](https://arxiv.org/html/2605.30523#bib.bib30); London & Kanade, [2025](https://arxiv.org/html/2605.30523#bib.bib32), inter alia). Natural and well-understood examples of circuit classes are $\mathtt{AC}^{\mathtt{d}}$ (circuits with AND, OR, and NOT gates whose number of gates scales polynomially with string length $N$ and depth with $\log^{d} N$) and $\mathtt{TC}^{\mathtt{d}}$, which add threshold gates that test whether the number of active inputs exceeds some threshold. While connections to circuits have been a fruitful avenue toward understanding transformers, establishing *equivalences* to natural circuit classes is difficult: The attention mechanism’s famously *quadratic* complexity in $N$ fundamentally *caps* the parallel computation a transformer can perform at a quadratic amount in $N$. It is unclear how a transformer could execute a cubic or higher-degree polynomial number of parallel computations, making equivalences to natural circuit classes unlikely. Most transformer-to-circuit connections, thus, take the form of (loose) *upper bounds*, showing that transformers can be simulated by either $\mathtt{AC}^{\mathtt{0}}$ or $\mathtt{TC}^{\mathtt{0}}$ circuits, without matching lower bounds.

We study how prominent aspects of the transformer architecture affect the exact expressivity of polynomially padded transformers to establish a set of equivalences robust to changes in model specification. We focus on (1) the type of attention (softmax attention (SMAT) and average hard attention (AHAT) transformers), (2) the scaling of numeric precision, width, and depth, and (3) the uniformity of the transformer constructions. We find order in existing literature by shifting focus to padded uniform transformer families. Studying *families* is necessary since letting transformer parameters depend on context length requires constructing a separate model for each length. (Footnote 1: Without uniformity constraints, transformers and circuits are unrealistically powerful. For example, consider the unary language $\{1^{N} \mid \text{the } N\text{th Turing machine halts}\}$ under some fixed enumeration of Turing machines. This undecidable language is recognizable by a non-uniform circuit family, since we can hard-code the correct answer for each input length $N$ into the circuit $C_{N}$. Uniformity conditions prevent such pathological cases by requiring a single, feasible algorithm to construct all circuits in the family.) A uniform family describes how each model is built. We study $\mathtt{L}\text{-uniform}$ families (London & Kanade, [2025](https://arxiv.org/html/2605.30523#bib.bib32)), which enforce that the transformers can be built by a simple computational model, namely a logspace Turing machine. We build on the characterizations of padded $\mathtt{L}\text{-uniform}$ SMATs by London & Kanade ([2025](https://arxiv.org/html/2605.30523#bib.bib32)) and of padded log-precision *fully* uniform AHATs (where one set of parameters works for every input length) by Merrill & Sabharwal ([2025a](https://arxiv.org/html/2605.30523#bib.bib35)) (cf. $\dagger$- and $\star$-marked cells in [Fig. 1](https://arxiv.org/html/2605.30523#S1.F1)). By connecting AHATs and SMATs, translating the results of London & Kanade ([2025](https://arxiv.org/html/2605.30523#bib.bib32)) to AHATs, and extending the results of Merrill & Sabharwal ([2025a](https://arxiv.org/html/2605.30523#bib.bib35)) to constant-precision transformers, we establish the following results, summarized in [Fig. 1](https://arxiv.org/html/2605.30523#S1.F1):

1. Constant vs. growing precision and the $\mathtt{AC}/\mathtt{TC}$ expressivity divide: A consistent trend in the effect of numeric precision appears: It determines whether the equivalence is to $\mathtt{L}\text{-uniform}\ \mathtt{AC}^{\mathtt{0}}$ (with constant precision) or to $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{\mathtt{0}}$ circuits (with growing precision).
2. We find robustness with padding: A particularly important quantity turns out to be the transformer’s volume $V(N) \stackrel{\text{def}}{=} b \cdot D$, the number of bits available at each symbol at each layer of the transformer, where $b$ denotes the numeric precision and $D$ the model width. As long as the volume is at least $\Omega(\log N)$ (which is required to distinguish $N$ input positions), $\mathtt{L}\text{-uniform}$ padded transformers are *robust* to changes in their specification: They either match $\mathtt{L}\text{-uniform}\ \mathtt{AC}^{\mathtt{0}}$ (with constant precision) or $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{\mathtt{0}}$ (with growing precision). The type of attention, width, and precision beyond logarithmic *do not affect* expressivity.
3. Natural scaling with looping: *Looping* endows transformers with both parallel and sequential processing, enabling them to recognize regular (Merrill & Sabharwal, [2025b](https://arxiv.org/html/2605.30523#bib.bib36)) and context-free (Jerad et al., [2026](https://arxiv.org/html/2605.30523#bib.bib25)) languages that constant-depth transformers cannot. We extend the characterizations of constant-depth transformers to looped ones, showing that their expressivity scales analogously to circuits under all regimes: $\Theta(\log^{d} N)$-looped constant-precision transformers achieve $\mathtt{FO}\text{-uniform}\ \mathtt{AC}^{\mathtt{d}}$ while growing-precision ones achieve $\mathtt{FO}\text{-uniform}\ \mathtt{TC}^{\mathtt{d}}$. As $d \to \infty$, both approach $\mathtt{NC} = \bigcup_{d \geq 0} \mathtt{AC}^{\mathtt{d}} = \bigcup_{d \geq 0} \mathtt{TC}^{\mathtt{d}}$.
4. With growing precision, there is no benefit to growing width or weakening uniformity: Fully uniform and $\mathtt{L}\text{-uniform}$ growing-precision padded transformers with $\Theta(\log^{d} N)$ looping are equivalent to $\mathtt{FO}\text{-uniform}\ \mathtt{TC}^{\mathtt{d}}$ regardless of the width. There is also no benefit to polynomial precision over logarithmic precision.
5. Describing the expressivity of transformers with insufficient ($o(\log N)$) volume is difficult with natural circuit classes. Understanding sub-classes of $\mathtt{AC}^{\mathtt{d}}$ is likely required.

## 2 Preliminaries

Here, we introduce the core objects of the paper: the two attention variants (SMAT and AHAT), the width, precision, and volume of a transformer, uniform transformer families, fixed-point arithmetic, and looped padded transformers. We follow Merrill & Sabharwal ([2025a](https://arxiv.org/html/2605.30523#bib.bib35)) and London & Kanade ([2025](https://arxiv.org/html/2605.30523#bib.bib32)) in our formal exposition. We outline the setup here and give additional details in [App. A](https://arxiv.org/html/2605.30523#A1). We study softmax attention transformers (SMATs), which are commonly used in practice, and average hard attention transformers (AHATs), which are often preferred in the theoretical literature. Both can be defined via temperature-scaled softmax attention, which computes the attention weights for input of length $N \in \mathbb{N}$ based on unnormalized attention scores $\boldsymbol{x} \in \mathbb{R}^{N}$, position $n \in [N]$, and temperature $\tau > 0$ as

$$\mathrm{softmax}_{\tau}(\boldsymbol{x})_{n} = \frac{\exp(x_{n}/\tau)}{\sum_{n'=1}^{N} \exp(x_{n'}/\tau)}. \tag{1}$$

We treat the temperature $\tau = \tau(N)$ as a model parameter that may depend on the input length, analogously to the parameter matrices and positional encodings (PEs) of $\mathcal{T}_{N}$; concretely, we require $\tau(N)$ to be computable in the same complexity class as the rest of the model (cf. [Def. A.5](https://arxiv.org/html/2605.30523#A1.Thmdefinition5)). This treatment of the temperature as a computable parameter mirrors the role that it plays in practical attention implementations, and captures the standard way of approximating AHATs with SMATs (Yang et al., [2026a](https://arxiv.org/html/2605.30523#bib.bib54)), since the $\tau \to 0$ limit of SMATs yields AHATs:

$$\lim_{\tau \to 0} \mathrm{softmax}_{\tau}(\boldsymbol{x})_{n} = \begin{cases} \frac{1}{|\operatorname{argmax}(\boldsymbol{x})|} & \text{if } x_{n} = \max(\boldsymbol{x}), \\ 0 & \text{otherwise}. \end{cases} \tag{2}$$

One of the main aims of the paper is to unify the literature on SMATs and AHATs. Throughout the paper, all statements referencing transformers hold for both attention types; we only specify the type when the distinction is necessary.
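The limit in Eq. (2) can be checked numerically. The following is a minimal sketch in plain Python with real-valued (floating-point) arithmetic, not the fixed-point model used in the paper; the function names `softmax_tau` and `average_hard_attention` and the example scores are illustrative assumptions.

```python
import math

def softmax_tau(x, tau):
    """Temperature-scaled softmax of Eq. (1), for a list of real scores."""
    m = max(x)  # subtracting the max leaves the result unchanged and avoids overflow
    exps = [math.exp((xi - m) / tau) for xi in x]
    z = sum(exps)
    return [e / z for e in exps]

def average_hard_attention(x):
    """Right-hand side of Eq. (2): uniform weight on all maximal positions."""
    m = max(x)
    k = sum(1 for xi in x if xi == m)
    return [1.0 / k if xi == m else 0.0 for xi in x]

# Hypothetical attention scores with a tie for the maximum.
scores = [2.0, 5.0, 5.0, 1.0]
hard = average_hard_attention(scores)
for tau in (1.0, 0.1, 0.01):
    soft = softmax_tau(scores, tau)
    gap = max(abs(s - h) for s, h in zip(soft, hard))
    print(f"tau={tau}: max deviation from hard attention = {gap:.3e}")
```

As the temperature shrinks, the softmax weights concentrate on the tied maximal positions and split the mass evenly between them, which is the averaging behavior of AHATs.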

We use $D$ to denote the model width, i.e., the size of each symbol’s internal representation (the residual stream), and $b$ to denote the numeric precision of the model’s computations, i.e., the number of bits with which each scalar in the model is stored and manipulated (cf. [§A.3](https://arxiv.org/html/2605.30523#A1.SS3)).

###### Definition 2.1 (Width and precision).

The width $D(N) \in \mathbb{N}$ of a transformer is the dimensionality of each symbol’s residual-stream representation. The numeric precision $b(N) \in \mathbb{N}$ is the number of bits used to store and manipulate every scalar in the model under fixed-point arithmetic (cf. [§A.3](https://arxiv.org/html/2605.30523#A1.SS3)). Both may depend on the input length $N$.

As is standard in the theoretical literature (e.g., Li et al., [2024b](https://arxiv.org/html/2605.30523#bib.bib30)), we allow $D$ and $b$ to *grow* with input length $N$. This is required even just to *name* the $N$ input positions: Distinguishing $N$ positions requires $\log N$ bits per symbol, so a transformer must satisfy $D(N) \cdot b(N) = \Omega(\log N)$ to encode uncompressed positional information. This motivates the definition of the transformer (representation) volume.

###### Definition 2.2 (Volume).

The volume of a transformer is

$$V(N) \stackrel{\text{def}}{=} D(N) \cdot b(N). \tag{3}$$

Intuitively, this corresponds to the logarithm of the number of distinct attention query vectors (each of whose $D$ entries is stored as a $b$-bit number) at any position in a string of length $N$. One of our main results reveals that, as long as a polynomially padded transformer has volume $V(N) = \Omega(\log N)$, additional model growth does not increase expressivity. We capture this in the following definition.

###### Definition 2.3 (Sufficient volume).

A transformer family has sufficient volume if its volume satisfies $V(N) = \Omega(\log N)$, and insufficient volume otherwise.

What matters, though, is where the growing volume comes from—growing\-precision transformers can be more expressive than growing\-width ones\.
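The counting argument behind the volume can be made concrete for a single input length. The sketch below is only illustrative: $\Omega(\log N)$ is an asymptotic condition on a whole family, whereas the check here fixes one $N$ and uses a constant factor of 1, and the helper names are assumptions, not the paper's.

```python
def bits_to_name_positions(n_positions: int) -> int:
    """Smallest number of bits that distinguishes n_positions values (ceil(log2 N))."""
    return (n_positions - 1).bit_length()

def volume(width: int, precision: int) -> int:
    """Volume V = D * b of Def. 2.2: bits available per symbol per layer."""
    return width * precision

def can_name_positions(width: int, precision: int, n_positions: int) -> bool:
    """Finite-N stand-in for the sufficient-volume condition (constant factor 1)."""
    return volume(width, precision) >= bits_to_name_positions(n_positions)

N = 1024  # needs 10 bits to name every position
for D, b in [(1, 10), (10, 1), (2, 4)]:
    print(f"D={D}, b={b}, V={volume(D, b)}, enough for N={N}: {can_name_positions(D, b, N)}")
```

The first two settings have the same volume but reach it through precision and through width respectively, which is the distinction the preceding sentence draws.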

Transformer families. Making the width a function of the input length means that each input length is processed by a *separate* transformer $\mathcal{T}_{N}$, yielding a family $\{\mathcal{T}_{N}\}_{N \in \mathbb{N}}$ of transformers. In a general transformer family, there need not be any relationship between the transformers of different sizes, resulting in similar pathologies as in non-uniform circuit classes (cf. [Footnote 1](https://arxiv.org/html/2605.30523#footnote1)). Thus, we enforce $\mathtt{L}\text{-uniformity}$ in our constructions, which requires that a logspace Turing machine can construct $\mathcal{T}_{N}$ from input $1^{N}$ (cf. [§A.4](https://arxiv.org/html/2605.30523#A1.SS4)). This means that all transformers are easily algorithmically constructible. We contrast $\mathtt{L}\text{-uniform}$ transformer families with fully uniform families, in which a *single* transformer $\mathcal{T}$ (with one fixed set of parameters and a single positional-encoding scheme) processes inputs of every length $N \in \mathbb{N}$. In the language of [§A.4](https://arxiv.org/html/2605.30523#A1.SS4), this means that the Turing machine computing the transformer parameters is a *constant function* in $N$. Fully uniform families are thus more restrictive than $\mathtt{L}\text{-uniform}$ ones, since the latter allow the description of $\mathcal{T}_{N}$ to depend on $N$ (via a logspace Turing machine), whereas the former require a length-independent description.

###### Definition 2.4 (Uniform transformer families; variant of London & Kanade, [2025](https://arxiv.org/html/2605.30523#bib.bib32), Def. 3.6).

A transformer family $\{\mathcal{T}_{N}\}_{N \in \mathbb{N}}$ is $\mathtt{L}\text{-uniform}$ if there exist logspace Turing machines $\mathcal{M}_{1}$ and $\mathcal{M}_{2}$ such that:

1. $\mathcal{M}_{1}$ outputs a description of $\mathcal{T}_{N}$ on input $1^{N}$, and
2. $\mathcal{M}_{2}$ outputs $\boldsymbol{p}(n, N)$ on input $(1^{N}, \mathtt{B}(n))$,

where $\boldsymbol{p}(n, N)$ is the positional encoding (PE) at position $n$ and $\mathtt{B}(n) \in \{0, 1\}^{\lceil \log N \rceil}$ is the binary encoding of $n$.

Fixed-point arithmetic. We model the transformer’s computations with fixed-point arithmetic of growing precision. This specifies how model parameters and activations are stored and manipulated. The transformer $\mathcal{T}_{N}$ with precision $b$ operates over the set $\mathbb{F}_{b} \stackrel{\text{def}}{=} \{x_{\pm} \cdot a \cdot 2^{-b} \mid x_{\pm} \in \{-1, 1\},\ a \in \{0, 1, \ldots, 2^{2b} - 1\}\}$, i.e., signed fixed-point numbers with $b$ bits each for the integer and fractional parts (cf. [§A.3](https://arxiv.org/html/2605.30523#A1.SS3)), following the convention of Li et al. ([2024b](https://arxiv.org/html/2605.30523#bib.bib30)) and London & Kanade ([2025](https://arxiv.org/html/2605.30523#bib.bib32)). Fixed-point arithmetic both limits the precision of operations and enables the implementation of useful attention gadgets; in particular, the *non-associativity* of fixed-point arithmetic operations enables the implementation of non-linear functions in otherwise linear parts of the model; see [§A.3](https://arxiv.org/html/2605.30523#A1.SS3) for details.
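To illustrate the set $\mathbb{F}_{b}$ and the non-associativity mentioned above, here is a small sketch using exact rationals. The rounding rule (truncation toward zero with saturation at the largest magnitude) is an assumption made for this example; the paper's exact rule is specified in its §A.3, which is not reproduced here. The helper names are likewise hypothetical.

```python
from fractions import Fraction

def fixed_point_values(b: int) -> set:
    """All elements of F_b: sign * a * 2^-b with a in {0, ..., 2^(2b) - 1}."""
    return {Fraction(s * a, 2 ** b) for s in (-1, 1) for a in range(2 ** (2 * b))}

def quantize(x: Fraction, b: int) -> Fraction:
    """Map x into F_b. ASSUMED rule: truncate toward zero, saturate on overflow."""
    sign = -1 if x < 0 else 1
    a = int(abs(x) * 2 ** b)          # floor of the scaled magnitude
    a = min(a, 2 ** (2 * b) - 1)      # clamp to the largest representable magnitude
    return Fraction(sign * a, 2 ** b)

def fx_mul(x: Fraction, y: Fraction, b: int) -> Fraction:
    """Multiply exactly, then round the product back into F_b."""
    return quantize(x * y, b)

b = 2  # two integer bits and two fractional bits; the grid step is 1/4
x, y, z = Fraction(1, 4), Fraction(1, 2), Fraction(2)
left = fx_mul(fx_mul(x, y, b), z, b)   # (x*y) = 1/8 is truncated to 0 first
right = fx_mul(x, fx_mul(y, z, b), b)  # (y*z) = 1 is exact, so the result is 1/4
print(len(fixed_point_values(b)), left, right)
```

Under this assumed rounding rule, the two bracketings of the same product disagree, which is the kind of effect that a construction can exploit to obtain non-linear behavior from nominally linear operations.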

Looped padded transformers (LPTs). Looped transformers repeatedly apply a fixed block of transformer layers to the input (Dehghani et al., [2019](https://arxiv.org/html/2605.30523#bib.bib12)). This dynamically increases model depth, enabling more complex reasoning, while keeping model size constant, thus reducing the memory footprint (Bae et al., [2026](https://arxiv.org/html/2605.30523#bib.bib3)). Padded transformers additionally pad the input with blank symbols, which can be used to perform additional computations *in parallel*. This additional padding space is analogous to increasing the circuit width in circuit complexity. Concretely, a looped padded transformer (LPT) is a triple $(\mathcal{T}, r, P)$ in which $\mathcal{T}$ is a transformer with a designated block of layers, $r \colon \mathbb{N} \to \mathbb{N}$ is the number of times that block is applied (the loop count), and $P \colon \mathbb{N} \to \mathbb{N}$ is the padding length: On input $\boldsymbol{w} \in \Sigma^{N}$, the model runs $\mathcal{T}$ on $\boldsymbol{w} \circ \underbrace{\square \cdots \square}_{P(N)}$ and applies the looped block of layers $r(N)$ times. See [§A.5](https://arxiv.org/html/2605.30523#A1.SS5) for details.
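The control flow of the triple $(\mathcal{T}, r, P)$ can be sketched as follows. This is a schematic under simplifying assumptions: the "block" is an arbitrary function on a list of symbols rather than actual transformer layers, any non-looped layers are omitted, and all names (`run_looped_padded`, `toy_block`, the particular choices of `r` and `P`) are hypothetical.

```python
from typing import Callable, List, Sequence

PAD = "□"  # filler symbol appended to the input

def run_looped_padded(
    w: Sequence[str],
    block: Callable[[List[str]], List[str]],
    r: Callable[[int], int],
    P: Callable[[int], int],
) -> List[str]:
    """Pad w with P(N) filler symbols, then apply the looped block r(N) times."""
    n = len(w)
    state = list(w) + [PAD] * P(n)
    for _ in range(r(n)):
        state = block(state)
    return state

# Toy stand-in for the looped block: it only counts how often it is applied.
stats = {"calls": 0}

def toy_block(state: List[str]) -> List[str]:
    stats["calls"] += 1
    return state

out = run_looped_padded(
    "abcd",
    toy_block,
    r=lambda n: max(1, (n - 1).bit_length()),  # about log2(N) loops
    P=lambda n: n ** 2,                        # polynomial (quadratic) padding
)
print(len(out), stats["calls"])
```

The loop count and padding length are functions of the input length only, so the same block is reused at every length while depth and parallel workspace grow with $N$.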

Boolean circuits provide a useful abstraction of transformers. They process binary strings by passing them through a directed acyclic graph of nodes that represent boolean operations. Particularly interesting are (1) $\mathtt{AC}^{\mathtt{d}}$ circuits (Li et al., [2024b](https://arxiv.org/html/2605.30523#bib.bib30)), i.e., boolean circuits of depth $\mathcal{O}(\log^{d} N)$ ($N$ being the input length), size $\mathtt{poly}(N)$, and with AND, OR, and NOT gates of unbounded fan-in; and (2) $\mathtt{TC}^{\mathtt{d}}$ circuits, the class of *threshold circuits* of depth $\mathcal{O}(\log^{d} N)$ and size $\mathtt{poly}(N)$ that additionally allow threshold gates (which determine whether the number of active inputs exceeds some threshold). [§§A.2](https://arxiv.org/html/2605.30523#A1.SS2) and [A.4](https://arxiv.org/html/2605.30523#A1.SS4) provide more details.
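The gate types named above can be written down directly. The sketch below models a single unbounded fan-in threshold gate and shows that AND, OR, and MAJORITY are special cases; it uses the non-strict convention "at least $k$ active inputs", which is an assumption of this example (conventions differ on strict versus non-strict comparison), and the function names are illustrative.

```python
from typing import Sequence

def threshold_gate(inputs: Sequence[int], k: int) -> bool:
    """Fires when at least k of the 0/1 inputs are active (assumed convention)."""
    return sum(inputs) >= k

def and_gate(inputs: Sequence[int]) -> bool:
    """Unbounded fan-in AND: every input must be active."""
    return threshold_gate(inputs, len(inputs))

def or_gate(inputs: Sequence[int]) -> bool:
    """Unbounded fan-in OR: at least one input must be active."""
    return threshold_gate(inputs, 1)

def majority_gate(inputs: Sequence[int]) -> bool:
    """MAJORITY: strictly more than half of the inputs are active."""
    return threshold_gate(inputs, len(inputs) // 2 + 1)

bits = [1, 0, 1, 1]
print(and_gate(bits), or_gate(bits), majority_gate(bits))
```

A threshold gate subsumes AND and OR, while MAJORITY is the standard example of a function that constant-depth AND/OR/NOT circuits of polynomial size cannot compute, which is why the precision-driven split between the two circuit families matters.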

Notation. We focus on *polylogarithmically looped* and *polynomially padded* transformers. (Footnote 2: In the following, we will use the term padded transformer to specifically refer to *polynomially* padded transformers.) We use $\mathtt{LPT}^{\mathtt{d}}_{b,D}$ to refer to transformers with polynomial padding, $\Theta(\log^{d} N)$ looping, numeric precision $b$, and width $D$. Rather than writing $\Theta(1)$, $\Theta(\log N)$, or $\mathtt{poly}(N)$ for numeric precision and width, we use the shorthand $\mathtt{c}$ (constant) for $\Theta(1)$, $\mathtt{l}$ (log) for $\Theta(\log N)$, and $\mathtt{p}$ (polynomial) for $\mathtt{poly}(N)$, respectively. Thus, e.g., $\mathtt{LPT}^{\mathtt{d}}_{\mathtt{l},\mathtt{c}}$ refers to $\log^{d} N$-looped log-precision constant-width LPTs with polynomial padding.

## 3 SMATs Can Simulate AHATs

Early work on transformer expressivity focused largely on AHATs (Hao et al., [2022](https://arxiv.org/html/2605.30523#bib.bib21); Merrill et al., [2022](https://arxiv.org/html/2605.30523#bib.bib37); Svete et al., [2024](https://arxiv.org/html/2605.30523#bib.bib49)) since their sparser activations make them easier to connect to boolean circuits. Recently, SMATs under different precision regimes have received more attention, providing a new perspective on the expressivity of practical transformers (Merrill & Sabharwal, [2023](https://arxiv.org/html/2605.30523#bib.bib33); Li et al., [2024b](https://arxiv.org/html/2605.30523#bib.bib30); London & Kanade, [2025](https://arxiv.org/html/2605.30523#bib.bib32); Li & Cotterell, [2025](https://arxiv.org/html/2605.30523#bib.bib27); Svete & Sabharwal, [2026](https://arxiv.org/html/2605.30523#bib.bib48)). In specific cases, the relationship between the two attention variants is known; for example, Li & Cotterell ([2025](https://arxiv.org/html/2605.30523#bib.bib27)) show their equivalence in the constant-precision constant-width unpadded regime. Yang et al. ([2026a](https://arxiv.org/html/2605.30523#bib.bib54)) study *approximating* real-valued AHATs with real-valued SMATs, showing that, by scaling the *temperature* of the softmax normalization in SMATs, one can approximate AHATs arbitrarily well. However, beyond the constant-precision constant-depth regime, no *exact* equivalence results are known.

We bring the two frameworks closer together by showing that their expressivity coincides under logarithmic-precision fixed-point arithmetic. We begin by showing that any AHAT can be simulated by a sufficiently precise SMAT *exactly*. (Footnote 3: The proofs of all claims in the main text appear in [App. C](https://arxiv.org/html/2605.30523#A3).)

###### Lemma 3\.1\.

Let $\{\mathcal{T}\}_{N \in \mathbb{N}}$ be a logarithmic-precision $\mathtt{L}\text{-uniform}$ AHAT family. There exists a logarithmic-precision $\mathtt{L}\text{-uniform}$ SMAT family $\{\mathcal{T}'\}_{N \in \mathbb{N}}$ such that for any $N \in \mathbb{N}$ and any input $\bm{w} \in \Sigma^{N}$, the outputs of $\mathcal{T}_{N}$ and $\mathcal{T}'_{N}$ match.

See [§C.1](https://arxiv.org/html/2605.30523#A3.SS1) for the full proof. Thus:

###### Corollary 3\.1\.

For any $d \in \mathbb{N}$, we have $\mathtt{L}\text{-uniform}\ \texttt{SMAT}\ \mathtt{LPT}^{d}_{*,\dagger} \supseteq \mathtt{L}\text{-uniform}\ \texttt{AHAT}\ \mathtt{LPT}^{d}_{*,\dagger}$, where $*$ is a placeholder for l or p, $\dagger$ is a placeholder for c, l, or p, and SMAT/AHAT refer to the type of attention in the transformer family.

[Lem. 3.1](https://arxiv.org/html/2605.30523#S3.Thmlemma1) lets us translate any (fully uniform or $\mathtt{L}\text{-uniform}$) AHAT into an $\mathtt{L}\text{-uniform}$ SMAT, formalizing the intuition that $\mathtt{L}\text{-uniform}$ SMATs are at least as powerful as fully-uniform AHATs. This lets us transfer existing expressivity lower bounds on fully-uniform AHATs to $\mathtt{L}\text{-uniform}$ SMATs as well. It remains unclear whether *fully uniform* SMATs can simulate (fully uniform) AHATs; some degree of non-uniformity seems necessary to implement the temperature scaling required to focus on individual positions.
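As an informal illustration of the temperature-scaling intuition (not the construction used in the proof of Lem. 3.1), the following Python sketch compares average-hard attention with softmax attention on a toy score vector. The function names, the toy data, the scale values, and the round-to-nearest fixed-point model `round_fixed` are assumptions made for this example only.

```python
import math


def average_hard_attention(scores, values):
    """Average the values at all positions that attain the maximal score."""
    top = max(scores)
    winners = [v for s, v in zip(scores, values) if s == top]
    return sum(winners) / len(winners)


def softmax_attention(scores, values, scale=1.0):
    """Softmax-weighted average; `scale` plays the role of an inverse temperature."""
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(scale * (s - top)) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def round_fixed(x, bits):
    """Round x to `bits` fractional bits (a hypothetical fixed-point model)."""
    return round(x * 2**bits) / 2**bits


if __name__ == "__main__":
    scores = [3.0, 1.0, 3.0, 2.0]  # positions 0 and 2 tie for the maximum
    values = [1.0, 0.0, 0.0, 1.0]
    print("hard:", average_hard_attention(scores, values))
    for scale in (1.0, 8.0, 64.0):
        print("softmax, scale", scale, ":", softmax_attention(scores, values, scale))
```

With a small scale, the softmax output visibly deviates from the average over the maximal positions. With a sufficiently large scale, the deviation falls below the fixed-point resolution, so the rounded outputs coincide.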

## 4 Padded Constant-depth Transformers Are Constant-depth Circuits

We now describe the expressivity of constant-depth transformers. Their expressivity under different idealizations has been studied extensively. It is known that fully-uniform log-precision transformers fall under $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}$ (Merrill & Sabharwal, [2023](https://arxiv.org/html/2605.30523#bib.bib33)). Recent work has refined these results, showing that polynomially padded ($\mathtt{L}\text{-uniform}$) constant-precision log-width SMATs are equivalent to ($\mathtt{L}\text{-uniform}$) $\mathtt{AC}^{0}$, while polynomially padded ($\mathtt{L}\text{-uniform}$) log-precision log-width SMATs are equivalent to ($\mathtt{L}\text{-uniform}$) $\mathtt{TC}^{0}$ (Li et al., [2024b](https://arxiv.org/html/2605.30523#bib.bib30); London & Kanade, [2025](https://arxiv.org/html/2605.30523#bib.bib32)). We restate these results here for reference (see also [Fig. 1](https://arxiv.org/html/2605.30523#S1.F1)).

###### Theorem 4.1 (London & Kanade, [2025](https://arxiv.org/html/2605.30523#bib.bib32), Thms. 4.1 and 4.5).

$\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{0}_{\texttt{c},\texttt{l}} = \mathtt{L}\text{-uniform}\ \mathtt{AC}^{0}$ and $\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{0}_{\texttt{l},\texttt{l}} = \mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}$.

Missing, however, is the placement of $\mathtt{L}\text{-uniform}$ AHATs and *constant-width* SMATs. Here, we extend the known relationships by showing:

- A lower bound ([§4.1](https://arxiv.org/html/2605.30523#S4.SS1)): We tighten the $\mathtt{TC}^{0}$ direction of [Thm. 4.1](https://arxiv.org/html/2605.30523#S4.Thmtheorem1) (i.e., $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{0} \subseteq \mathtt{L}\text{-uniform}\ \mathtt{LPT}^{0}_{\texttt{l},\texttt{l}}$) by showing that *constant*-width log-precision $\mathtt{L}\text{-uniform}$ transformers already suffice: $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{0} \subseteq \mathtt{L}\text{-uniform}\ \mathtt{LPT}^{0}_{\texttt{l},\texttt{c}}$. This shows that precision can compensate for width. In contrast to Merrill & Sabharwal ([2025a](https://arxiv.org/html/2605.30523#bib.bib35)), who establish the equivalence of fully-uniform growing-precision AHATs to $\mathtt{FO}\text{-uniform}\ \mathtt{TC}^{0}$, we connect $\mathtt{L}\text{-uniform}$ transformers to $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}$.
- An upper bound ([§4.2](https://arxiv.org/html/2605.30523#S4.SS2)): We show that even with maximal resources (polynomial width in the constant-precision case, and polynomial width and polynomial precision in the growing-precision case) transformers cannot exceed $\mathtt{L}\text{-uniform}\ \mathtt{AC}^{0}$ or $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}$, respectively: We have $\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{0}_{\texttt{c},\texttt{p}} \subseteq \mathtt{L}\text{-uniform}\ \mathtt{AC}^{0}$ in the constant-precision case and $\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{0}_{\texttt{p},\texttt{p}} \subseteq \mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}$ in the growing-precision case.

We show these bounds for *both* AHATs and SMATs, revealing that the distinction between the two is not important in this regime. Together, these bounds sandwich all intermediate regimes, characterizing their expressivity.

### 4.1 Lower Bound

Here, we outline how constant-width, log-precision transformers can compute $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}$. (Footnote 4: We generalize the constructions of London & Kanade ([2025](https://arxiv.org/html/2605.30523#bib.bib32)) rather than using [Lem. 3.1](https://arxiv.org/html/2605.30523#S3.Thmlemma1) and the equivalences of Merrill & Sabharwal ([2025a](https://arxiv.org/html/2605.30523#bib.bib35)) as lower bounds, since the results of Merrill & Sabharwal ([2025a](https://arxiv.org/html/2605.30523#bib.bib35)) concern $\mathtt{FO}\text{-uniform}\ \mathtt{TC}^{0}$ rather than its superset $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}$. By extending the constructions of London & Kanade ([2025](https://arxiv.org/html/2605.30523#bib.bib32)) directly, we arrive at the tighter lower bound of $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}$.)

##### Intuition: Constant\-width PEs\.

The key challenge in simulating $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}$ with constant-width transformers is encoding positional information in constant space; London & Kanade ([2025](https://arxiv.org/html/2605.30523#bib.bib32)) use logarithmic width to store binary PEs. We use *unit-length* PEs (cf. [Lem. B.4](https://arxiv.org/html/2605.30523#A2.Thmlemma4)) that encode positions in two dimensions while maintaining sufficient separation for attention-based indexing. The encoding maps a position $n$ to $\begin{pmatrix}\sqrt{1/n_{\textrm{tgt}}} & \sqrt{1 - 1/n_{\textrm{tgt}}}\end{pmatrix}^{\top}$, where $n_{\textrm{tgt}}$ is the position that the $n$th symbol has to attend to (e.g., if the $n$th symbol encodes an input to a gate, $n_{\textrm{tgt}}$ would correspond to the position of that gate). This encoding has approximately unit length (the approximation is due to fixed-point rounding) while ensuring a sufficient *gap* between attending to $n_{\textrm{tgt}}$ and to $n' \neq n_{\textrm{tgt}}$ to be able to focus on $n_{\textrm{tgt}}$ exclusively. Following London & Kanade ([2025](https://arxiv.org/html/2605.30523#bib.bib32)), this allows us to encode any $\mathtt{L}\text{-uniform}$ circuit into the PEs, enabling simulation by a transformer. This contrasts with the construction of Merrill & Sabharwal ([2025a](https://arxiv.org/html/2605.30523#bib.bib35)), which stores an $\mathtt{FO}\text{-uniform}\ \mathtt{TC}^{0}$ circuit family (with $\mathtt{FO}\text{-uniform}\ \mathtt{TC}^{0} \subseteq \mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}$) in the parameters of a fully uniform transformer family. Thus:

###### Lemma 4.1.

$\mathtt{L}\text{-uniform}\ \mathtt{TC}^{0} \subseteq \mathtt{L}\text{-uniform}\ \mathtt{LPT}^{0}_{\texttt{l},\texttt{c}}$.
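The separation property of the unit-length PEs can be checked numerically. The sketch below is an illustration in floating-point arithmetic, not the fixed-point construction of Lem. B.4; the helper names and the choice $N = 64$ are assumptions made for this example. It scores each target encoding against the encodings of all positions $1, \ldots, N$ by dot product and reports the smallest margin between the matching position and the best non-matching one.

```python
import math


def unit_pe(n):
    """Two-dimensional unit-length encoding (sqrt(1/n), sqrt(1 - 1/n)) of position n >= 1."""
    return (math.sqrt(1.0 / n), math.sqrt(1.0 - 1.0 / n))


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def attention_gap(num_positions):
    """Return (all_correct, worst_margin) over all target positions 1..num_positions.

    all_correct: whether the dot product is maximized at the target position.
    worst_margin: smallest difference between the target's score and the
    best score among all other positions. Requires num_positions >= 2.
    """
    pes = {n: unit_pe(n) for n in range(1, num_positions + 1)}
    all_correct = True
    worst_margin = float("inf")
    for tgt, query in pes.items():
        scores = {m: dot(query, key) for m, key in pes.items()}
        best = max(scores, key=scores.get)
        all_correct = all_correct and (best == tgt)
        runner_up = max(s for m, s in scores.items() if m != tgt)
        worst_margin = min(worst_margin, scores[tgt] - runner_up)
    return all_correct, worst_margin


if __name__ == "__main__":
    print(attention_gap(64))
```

Since both vectors have unit length, the dot product equals the cosine of the angle between them and is maximized only at the matching position. A Taylor expansion of the cosine suggests that the worst-case margin decays roughly like $1/N^{3}$, i.e., only polynomially in $N$, which is consistent with logarithmically many bits of precision being enough to resolve it.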

##### Precision compensates for width\.

An immediate corollary of [Lem. 4.1](https://arxiv.org/html/2605.30523#S4.Thmlemma1) is that logarithmic precision can compensate for logarithmic width: log-precision constant-width transformers ($\mathtt{LPT}^{0}_{\texttt{l},\texttt{c}}$) are strictly more powerful than those with constant precision but logarithmic width ($\mathtt{LPT}^{0}_{\texttt{c},\texttt{l}}$):

###### Corollary 4.1.

$\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{0}_{\texttt{l},\texttt{c}} \supset \mathtt{L}\text{-uniform}\ \mathtt{LPT}^{0}_{\texttt{c},\texttt{l}}$.

To the best of our knowledge, this is the first result formally showing that precision can compensate for width in transformer expressivity, for both AHATs and SMATs. While the results of Merrill & Sabharwal ([2025a](https://arxiv.org/html/2605.30523#bib.bib35)) on fully-uniform AHATs lower bound the expressivity of constant-width transformers, they do not address growing width. Moreover, the results of Li et al. ([2024b](https://arxiv.org/html/2605.30523#bib.bib30)) and London & Kanade ([2025](https://arxiv.org/html/2605.30523#bib.bib32)) only consider growing-width transformers, leaving the expressivity of constant-width ones open.

### 4.2 Upper Bounds

The following lemma formalizes that, even with polynomial width, constant-depth transformers cannot exceed $\mathtt{L}\text{-uniform}\ \mathtt{AC}^{0}$ with constant precision and $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}$ with growing precision. Intuitively, a constant-depth transformer only performs a constant number of attention operations. Each attention layer computes a weighted average of input representations, which can, even with polynomial resources, be simulated by circuits of constant depth. Polynomial precision and width do not enable the transformer to perform fundamentally more powerful computations; they only provide more space to store intermediate values. Thus:

###### Lemma 4.2.

$\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{0}_{\texttt{c},\texttt{p}} \subseteq \mathtt{L}\text{-uniform}\ \mathtt{AC}^{0}$ and $\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{0}_{\texttt{p},\texttt{p}} \subseteq \mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}$.

The separation intuitively arises because constant-precision activations cannot perform unbounded *counting*, which is needed for computing threshold gates. Thus, constant-precision transformers remain limited to $\mathtt{AC}^{0}$.
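To make the counting intuition concrete, the following sketch models a uniform attention average over $N$ input bits that is stored with a fixed number of fractional bits. The truncating fixed-point model and the function name are assumptions for illustration, not the paper's arithmetic definition. Two inputs on opposite sides of the majority threshold become indistinguishable at constant precision once $N$ is large, but remain distinguishable with about $\log N$ bits.

```python
def fixed_point_mean(bits, precision):
    """Mean of a 0/1 list, truncated to `precision` fractional bits.

    The result is returned as the integer numerator k of k / 2**precision,
    computed exactly with integer arithmetic.
    """
    count = sum(bits)
    return (count << precision) // len(bits)


if __name__ == "__main__":
    n = 1024
    tie = [1] * (n // 2) + [0] * (n // 2)                # exactly half ones
    majority = [1] * (n // 2 + 1) + [0] * (n // 2 - 1)   # one more than half
    for precision in (4, 11):
        print(precision,
              fixed_point_mean(tie, precision),
              fixed_point_mean(majority, precision))
```

In this toy model, 4 fractional bits map both inputs to the same stored value, so no later computation can tell them apart, whereas $\log_2 N + 1 = 11$ bits keep them separate.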

##### The constant\-depth picture\.

Combining [Thm. 4.1](https://arxiv.org/html/2605.30523#S4.Thmtheorem1), the constant-width lower bound in [Lem. 4.1](https://arxiv.org/html/2605.30523#S4.Thmlemma1), and the upper bound in [Lem. 4.2](https://arxiv.org/html/2605.30523#S4.Thmlemma2) closes both sandwiches and yields the headline characterization of the constant-depth regime:

###### Theorem 4.2 (Constant-depth expressivity collapse).

As long as the transformer has sufficient volume ([Def. 2.3](https://arxiv.org/html/2605.30523#S2.Thmdefinition3)) and width at most polynomial, for any width $D(N)$:

$$\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{0}_{\texttt{c},D} = \mathtt{L}\text{-uniform}\ \mathtt{AC}^{0}, \tag{4a}$$

$$\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{0}_{\texttt{l},D} = \mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}. \tag{4b}$$

Note that the volume constraint imposes a precision-dependent floor on width: In the constant-precision case it requires $D(N) = \Omega(\log N)$, whereas in the log-precision case any width suffices since the precision already supplies $\Omega(\log N)$ volume. Thus, for any volume $V(N) = \Omega(\log N)$, the expressivity of constant-depth $\mathtt{L}\text{-uniform}$ polynomially padded transformers depends only on whether precision is constant or grows. (Footnote 5: Contrast this to polynomially padded *fully* uniform AHATs, equivalent to $\mathtt{FO}\text{-uniform}\ \mathtt{TC}^{0}$ (Merrill & Sabharwal, [2025a](https://arxiv.org/html/2605.30523#bib.bib35)).) This is precisely the $\mathtt{AC}/\mathtt{TC}$ divide marked by the purple lines in [Fig. 1](https://arxiv.org/html/2605.30523#S1.F1): Once volume is sufficient, the only thing that matters for expressivity is which side of the divide the precision falls on.

## 5 Looped Padded Transformers Are Highly Uniform Growing-depth Circuits

We now extend the results to the equivalence between $\Theta(\log^{d} N)$-*looped* transformers and circuit complexity classes $\mathtt{AC}^{d}$ and $\mathtt{TC}^{d}$ for $d \geq 1$. As in the constant-depth regime, we find a *precision-dependent separation*: Constant precision leads to $\mathtt{FO}\text{-uniform}\ \mathtt{AC}^{d}$, and growing precision achieves $\mathtt{FO}\text{-uniform}\ \mathtt{TC}^{d}$. We first discuss a general upper bound for looping circuit layers ([§5.1](https://arxiv.org/html/2605.30523#S5.SS1); [Lem. 5.1](https://arxiv.org/html/2605.30523#S5.Thmlemma1)). Perhaps surprisingly, applying this lemma to our constant-depth transformer expressivity results shows that even looped $\mathtt{L}\text{-uniform}$ transformers are *at most* as powerful as *fully uniform* ones. This then conveniently allows us to apply the *lower bounds* on logarithmic-precision fully-uniform transformers from Merrill & Sabharwal ([2025a](https://arxiv.org/html/2605.30523#bib.bib35)) to the $\mathtt{L}\text{-uniform}$ growing-precision case, establishing the equivalence of $\log^{d} N$-looped padded growing-precision transformers to $\mathtt{FO}\text{-uniform}\ \mathtt{TC}^{d}$. Existing lower bounds, however, do not cover *constant-precision* transformers, which we consider separately. We extend the lower bounds of Merrill & Sabharwal ([2025a](https://arxiv.org/html/2605.30523#bib.bib35)) to the constant-precision case, showing that $\mathtt{L}\text{-uniform}$ polynomially padded constant-precision transformers are equivalent to $\mathtt{FO}\text{-uniform}\ \mathtt{AC}^{d}$ ([§5.2](https://arxiv.org/html/2605.30523#S5.SS2)). (Footnote 6: To facilitate a comparison to the results of Merrill & Sabharwal ([2025a](https://arxiv.org/html/2605.30523#bib.bib35)), the brackets in our definitions and results in this section list the analog in Merrill & Sabharwal ([2025a](https://arxiv.org/html/2605.30523#bib.bib35)).)

### 5.1 A General Upper Bound

We begin with a general lemma showing that looping can be captured by very uniform growing-depth circuit classes, which gives an upper bound on the expressivity of looped transformers. Intuitively, an $r(N)$-looped transformer repeatedly applies the same transformation $f$ to its input, computing $f^{r(N)}$. If $f$ can be computed by a depth-$d(N)$ circuit, then $f^{r(N)} \stackrel{\text{def}}{=} \underbrace{f \circ \cdots \circ f}_{r(N)}$ can be computed by a depth-$r(N) \cdot d(N)$ circuit. For transformers with $r(N) = \Theta(\log^{d} N)$ loops over $d(N) = \Theta(1)$ layers, this yields depth $\Theta(\log^{d} N)$.
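The depth accounting above can be sketched in a few lines of Python. The helper names are hypothetical, and taking the hidden constant in $\Theta(\log^{d} N)$ to be 1 (with base-2 logarithms) is an assumption made only to obtain concrete numbers.

```python
import math


def loop(f, r):
    """Return the r-fold composition of f with itself."""
    def looped(x):
        for _ in range(r):
            x = f(x)
        return x
    return looped


def unrolled_depth(n, d, block_depth):
    """Loop count and unrolled circuit depth for input length n.

    Assumes r(n) = ceil(log2 n) ** d loops (constant factor 1) over a
    block that a circuit of depth `block_depth` can compute.
    """
    r = math.ceil(math.log2(n)) ** d
    return r, r * block_depth


if __name__ == "__main__":
    increment_five_times = loop(lambda x: x + 1, 5)
    print(increment_five_times(0))
    print(unrolled_depth(1024, 2, 3))
```

Unrolling only stacks copies of the same block, so the depth grows by the factor $r(N)$ while the per-copy description stays fixed.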

###### Lemma 5.1 (Lem. 9).

Let $m(N)$, $d(N)$, and $r(N)$ be functions at most polynomial in $N$ and logspace-computable given $1^{N}$ such that $r(N) \cdot d(N) \geq 1$. Let $f\colon \{0,1\}^{m(N)} \to \{0,1\}^{m(N)}$ be computed by an $\mathtt{L}\text{-uniform}$ polynomial-size circuit family of depth $\mathcal{O}(d(N))$ with AND, OR, NOT gates (resp. additionally with THR gates). Then $f^{r(N)}$ is computed by an $\mathtt{FO}\text{-uniform}$ polynomial-size circuit family of depth $\mathcal{O}(r(N)\, d(N))$ with the same gate set.

This leads to the following upper bounds:

###### Lemma 5.2 (Lem. 6).

For $d \geq 1$:

$$\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{d}_{\texttt{c},\texttt{p}} \subseteq \mathtt{FO}\text{-uniform}\ \mathtt{AC}^{d}, \tag{5a}$$

$$\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{d}_{\texttt{l},\texttt{p}} \subseteq \mathtt{FO}\text{-uniform}\ \mathtt{TC}^{d}. \tag{5b}$$

##### Uniformity collapse\.

[Lem. 5.2](https://arxiv.org/html/2605.30523#S5.Thmlemma2) shows that $\mathtt{L}\text{-uniform}$ looped transformers are upper-bounded by $\mathtt{FO}\text{-uniform}$ circuit classes. Intuitively, this occurs because the looped architecture has a fixed set of parameters that are reused at each iteration of the loop and thus result in simpler circuits. More formally, this is a consequence of "*uniformity collapse*" resulting from the fact that $\mathtt{L} \subseteq \mathtt{FO}\text{-uniform}\ \mathtt{AC}^{d}$ for $d \geq 1$, meaning that $\mathtt{FO}\text{-uniform}\ \mathtt{AC}^{d}$ circuits can themselves implement any $\mathtt{L}$ construction. This contrasts with the constant-depth regime, where $\mathtt{L}\text{-uniform}$ transformer families achieve $\mathtt{L}\text{-uniform}$ circuit classes.

### 5.2 Lower Bounds via Circuit Evaluation

##### Equivalence for growing\-precision transformers\.

[§5.1](https://arxiv.org/html/2605.30523#S5.SS1) provides an upper bound for $\mathtt{L}\text{-uniform}$ looped transformers. Merrill & Sabharwal ([2025a](https://arxiv.org/html/2605.30523#bib.bib35)) show the *matching lower bound* for (fully uniform) growing-precision AHATs. This lower bound naturally applies to $\mathtt{L}\text{-uniform}$ AHATs, and, using [Lem. 3.1](https://arxiv.org/html/2605.30523#S3.Thmlemma1), to $\mathtt{L}\text{-uniform}$ SMATs too. This immediately gives us a complete characterization of growing-precision $\mathtt{L}\text{-uniform}$ padded looped transformers for $d \geq 1$:

$$\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{d}_{\texttt{l},\texttt{c}} = \mathtt{FO}\text{-uniform}\ \mathtt{TC}^{d}. \tag{6}$$

The rest of the section establishes the matching lower bound for *constant-precision* transformers, showing, for $d \geq 1$:

$$\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{d}_{\texttt{c},\texttt{l}} = \mathtt{FO}\text{-uniform}\ \mathtt{AC}^{d}. \tag{7}$$

We establish the lower bound using a reduction-based technique. The strategy has three steps, each of which adapts a step in Merrill & Sabharwal ([2025a](https://arxiv.org/html/2605.30523#bib.bib35)) to constant precision.

##### Step 1: The reduction framework\.

We show that if $\mathtt{L}\text{-uniform}$ looped constant-precision transformers can recognize a $\mathcal{C}$-complete language $\mathcal{L}$ under $\mathcal{R}$ reductions and can compute all $\mathcal{R}$ reductions, then they can recognize all of $\mathcal{C}$. To do so, we recall the framework of Merrill & Sabharwal ([2025a](https://arxiv.org/html/2605.30523#bib.bib35)) for reductions with transformers.

###### Definition 5.1 (Def. 9).

Let $\mathcal{R}$ be a class of languages. $\texttt{t}\colon \Sigma^{*} \to \Sigma^{*}$ is an $\mathcal{R}$ reduction if $\lvert \texttt{t}(\bm{w}) \rvert$ is polynomial in $\lvert \bm{w} \rvert$ and $\mathcal{L}_{\texttt{t}} \stackrel{\text{def}}{=} \{(\bm{w}, \texttt{B}(n), w) \mid \texttt{t}(\bm{w})_{n} = w\}$ is in $\mathcal{R}$.

###### Definition 5.2 (Def. 10).

A transformer family computes an $\mathcal{R}$ reduction $\texttt{t}$ if it recognizes the language $\mathcal{L}_{\texttt{t}}$.

The following lemma shows that transformers can use reductions to recognize any language in a complexity class if they can recognize a language complete for that class and can compute any relevant reduction\.

###### Lemma 5.3 (Lem. 3).

Let $\mathcal{C}, \mathcal{R}$ be language classes and let $\mathcal{L}$ be $\mathcal{C}$-complete under $\mathcal{R}$ reductions. If $\mathcal{L} \in \mathtt{L}\text{-uniform}\ \mathtt{LPT}^{d}_{\texttt{c},\texttt{l}}$ and $\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{d}_{\texttt{c},\texttt{l}}$ can compute every $\mathcal{R}$ reduction, then $\mathcal{C} \subseteq \mathtt{L}\text{-uniform}\ \mathtt{LPT}^{d}_{\texttt{c},\texttt{l}}$. The analogous claim holds for $\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{d}_{\texttt{l},\texttt{c}}$.

[Lem. 5.3](https://arxiv.org/html/2605.30523#S5.Thmlemma3) allows us to show transformer expressivity lower bounds (for example, that transformers can recognize a class $\mathcal{C}$) by showing that (1) transformers can compute all reductions of a particular complexity class $\mathcal{R}$ (we will consider $\mathcal{R} = \mathtt{L}$ reductions in Step 2) and (2) transformers can recognize a language $\mathcal{L}$ complete for $\mathcal{C}$ (we will consider the language $\mathcal{L}$ of (wide) $\mathtt{AC}^{d}$ circuit evaluation, which is complete for $\mathcal{C} = \mathtt{AC}^{d}$ under $\mathtt{L}$ reductions, in Step 3).

##### Step 2: Computing $\mathtt{L}$ reductions.

We show that *graph connectivity*, which is complete for $\mathtt{NL}$ (and thus hard for $\mathtt{L}$) under $\mathtt{FO}$ reductions, can be solved by $\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{1}_{\texttt{c},\texttt{l}}$, i.e., by $\mathtt{L}\text{-uniform}$ constant-precision log-width transformers with $\Theta(\log N)$ looping. Looped transformers can solve graph connectivity by iteratively computing *reachability*, where each iteration propagates reachability information along the graph's edges. After a number of iterations logarithmic in the number of vertices, all reachable vertices are discovered; [Thm. C.1](https://arxiv.org/html/2605.30523#A3.Thmtheorem1) formalizes this. Thus, [Lem. 5.3](https://arxiv.org/html/2605.30523#S5.Thmlemma3) combined with [Thm. 4.1](https://arxiv.org/html/2605.30523#S4.Thmtheorem1), which gives $\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{0}_{\texttt{c},\texttt{l}} = \mathtt{L}\text{-uniform}\ \mathtt{AC}^{0} \supseteq \mathtt{FO}\text{-uniform}\ \mathtt{AC}^{0} = \mathtt{FO}$, yields:

###### Lemma 5.4 (Lem. 4).

$\mathtt{NL} \subseteq \mathtt{L}\text{-uniform}\ \mathtt{LPT}^{1}_{\texttt{c},\texttt{l}}$.

Thus, $\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{1}_{\texttt{c},\texttt{l}}$ can compute all $\mathtt{L} \subseteq \mathtt{NL}$ reductions.
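One standard way to discover all reachable vertices in logarithmically many rounds is path doubling on the boolean reachability matrix. Whether the construction in Thm. C.1 proceeds in exactly this way is not stated in this section, so the sketch below should be read as an illustration under that assumption; the function name and the adjacency-matrix input format are likewise hypothetical.

```python
import math


def reachability_by_doubling(adj):
    """All-pairs reachability for a directed graph given as a boolean adjacency matrix.

    Invariant: after t rounds, reach[i][j] is True exactly when j can be
    reached from i along a path of at most 2**t edges.
    """
    n = len(adj)
    # Round 0: paths of length at most 1 (including the empty path).
    reach = [[bool(adj[i][j]) or i == j for j in range(n)] for i in range(n)]
    rounds = math.ceil(math.log2(n)) if n > 1 else 0
    for _ in range(rounds):
        # Join two paths of length <= 2**t through any midpoint k.
        reach = [
            [any(reach[i][k] and reach[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
    return reach, rounds


if __name__ == "__main__":
    n = 8
    path_graph = [[j == i + 1 for j in range(n)] for i in range(n)]  # 0 -> 1 -> ... -> 7
    reach, rounds = reachability_by_doubling(path_graph)
    print(rounds, reach[0][n - 1], reach[n - 1][0])
```

Each round is the same update applied to the current matrix, which mirrors how a looped transformer reuses one block of layers. Since simple paths have at most $n - 1$ edges, $\lceil \log_2 n \rceil$ rounds suffice.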

##### Step 3: Transformers can evaluate circuits\.

Finally, we show that $\mathtt{L}\text{-uniform}$ looped constant-precision transformers can solve wide $\mathtt{AC}^{d}$ circuit evaluation (cf. [§C.3.1](https://arxiv.org/html/2605.30523#A3.SS3.SSS1)), which is complete for $\mathtt{FO}\text{-uniform}\ \mathtt{AC}^{d}$ under $\mathtt{L}$ reductions. This involves the transformer receiving the encoding of a circuit ([Def. C.1](https://arxiv.org/html/2605.30523#A3.Thmdefinition1)) and a string to evaluate it on as input, and computing the circuit's output on that string. Intuitively, a transformer can evaluate a circuit by encoding it in its residual stream and evaluating gates level-by-level. In each loop, it identifies gates whose inputs have been computed, evaluates them, and propagates results forward. After $\Theta(L + \log N)$ iterations (where $L$ is the circuit depth and $\log N$ loops are required for circuit pre-processing), the output gate's value is computed. This is formalized in [Lem. C.3](https://arxiv.org/html/2605.30523#A3.Thmlemma3) and implies:

###### Corollary 5.1 (Cor. 7.1).

For $d \geq 1$, the wide-$\mathtt{AC}^{d}$ circuit evaluation problem is in $\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{d}_{\texttt{c},\texttt{l}}$.

From this, we deduce the inclusion of $\mathtt{FO}\text{-uniform}\ \mathtt{AC}^{d}$ in $\mathtt{L}\text{-uniform}$ looped constant-precision transformers. With [Lem. 5.2](https://arxiv.org/html/2605.30523#S5.Thmlemma2), this characterizes padded looped transformers:

###### Theorem 5\.1\(Thm\. 2\)\.

For any $d \geq 1$:

$$
\begin{aligned}
\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{d}_{\texttt{c},\texttt{l}} &= \mathtt{FO}\text{-uniform}\ \mathtt{AC}^{d} &\qquad \text{(8a)}\\
\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{d}_{\texttt{l},\texttt{c}} &= \mathtt{FO}\text{-uniform}\ \mathtt{TC}^{d}. &\qquad \text{(8b)}
\end{aligned}
$$
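The level-by-level evaluation described in Step 3 can be illustrated with a minimal Python sketch. This is not the transformer construction of Lem. C.3; it only mirrors the loop structure, in which every gate whose inputs are already known is evaluated in one round. The gate encoding (a dictionary from gate names to `(kind, inputs)` pairs) and the function name `evaluate_circuit` are assumptions made for this illustration.

```python
from typing import Dict, List, Tuple

# A gate is (kind, input_names); kind is "AND", "OR", or "NOT" (hypothetical encoding).
Gate = Tuple[str, List[str]]


def evaluate_circuit(gates: Dict[str, Gate], inputs: Dict[str, bool], output: str):
    """Evaluate a circuit in rounds; each round stands in for one loop iteration."""
    values = dict(inputs)  # known wire values, seeded with the input bits
    rounds = 0
    while output not in values:
        # Gates that are not yet evaluated but whose inputs are all known.
        ready = [
            name
            for name, (_, ins) in gates.items()
            if name not in values and all(i in values for i in ins)
        ]
        if not ready:
            raise ValueError("circuit is cyclic or references an undefined wire")
        # Evaluate all ready gates "in parallel", using only earlier rounds' values.
        new_values = {}
        for name in ready:
            kind, ins = gates[name]
            args = [values[i] for i in ins]
            if kind == "AND":
                new_values[name] = all(args)
            elif kind == "OR":
                new_values[name] = any(args)
            elif kind == "NOT":
                new_values[name] = not args[0]
            else:
                raise ValueError(f"unknown gate kind: {kind}")
        values.update(new_values)
        rounds += 1
    return values[output], rounds


# (x1 AND x2) OR (NOT x3), a depth-2 circuit.
example_gates = {
    "g1": ("AND", ["x1", "x2"]),
    "g2": ("NOT", ["x3"]),
    "g3": ("OR", ["g1", "g2"]),
}
print(evaluate_circuit(example_gates, {"x1": True, "x2": False, "x3": False}, "g3"))
# (True, 2)
```

The number of rounds equals the circuit depth, which matches the $L$ term in the $\Theta(L + \log N)$ iteration count; the additional $\log N$ pre-processing loops are not modeled here.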

## 6Discussion

##### Transformer volume and the interplay of precision and width\.

[Fig. 1](https://arxiv.org/html/2605.30523#S1.F1) reveals the importance of *sufficient volume* for padded transformer expressivity: As long as the volume grows at least logarithmically with string length ($V(N) = \Omega(\log N)$), expressivity depends only on whether or not precision grows, which gives rise to the $\mathtt{AC}/\mathtt{TC}$ expressivity divide. Constant-precision transformers are constrained to $\mathtt{L}\text{-uniform}\ \mathtt{AC}^{d}$ while growing-precision transformers span $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{d}$ at every depth level $d$. This cannot be compensated for with a polynomial increase in model *width*. We note that the constructions connecting transformers to $\mathtt{TC}^{0}$ only require growing precision for representing the model *activations*. This aligns with the common technique of quantizing model weights to lower precision while keeping activations high-precision (Groeneveld et al., [2024](https://arxiv.org/html/2605.30523#bib.bib18); OpenAI, [2025](https://arxiv.org/html/2605.30523#bib.bib40)). Transformers quantized in this way still achieve the entire ($\mathtt{L}\text{-uniform}$) $\mathtt{TC}^{0}$ expressivity.
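To make the logarithmic volume threshold concrete, the following sketch checks whether a model has enough bits per position to give each of $N$ positions a distinct code. It assumes that volume is the product of width $D$ and precision $b$ (in bits), which is consistent with the two regimes in Conj. 6.1 but is an assumption of this illustration; the function names are hypothetical.

```python
def min_bits_to_index(n: int) -> int:
    """Smallest number of bits that gives n positions pairwise distinct codes."""
    return max(1, (n - 1).bit_length())


def volume(width: int, precision_bits: int) -> int:
    """Bits available per position, assuming volume = width * precision."""
    return width * precision_bits


def can_index_positions(width: int, precision_bits: int, n: int) -> bool:
    """True if the per-position volume suffices for injective position codes."""
    return volume(width, precision_bits) >= min_bits_to_index(n)


n = 1024
print(can_index_positions(width=10, precision_bits=1, n=n))  # True: log width, constant precision
print(can_index_positions(width=1, precision_bits=10, n=n))  # True: constant width, log precision
print(can_index_positions(width=2, precision_bits=2, n=n))   # False: constant volume
```

Both ways of reaching $\log N$ bits pass the check, while a constant-volume model fails it for large enough $N$. The check says nothing about the $\mathtt{AC}/\mathtt{TC}$ divide itself, which depends on precision rather than on total volume.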

Padded transformers with *insufficient* volume ($V(N) = o(\log N)$) remain difficult to describe. Existing work describes *unpadded* variants of fully-uniform constant-precision AHATs and SMATs as equivalent to $\mathtt{PFO}^{2}\mathtt{[<]}$ (two-variable first-order logic extended with the linear order $<$, a limited fragment of first-order logic; Li & Cotterell, [2025](https://arxiv.org/html/2605.30523#bib.bib27)), but it remains unknown whether padding or looser uniformity increases their expressivity. Crucially, results with $V(N) = o(\log N)$ rely on *causal masking* to provide positional information in the absence of PEs; since constant volume prevents using injective PEs, unmasked attention is particularly limited in this case. In contrast, our constructions are agnostic to the use of masking as, with sufficient volume, masked transformers can be simulated by unmasked ones and vice versa (Merrill & Sabharwal, [2025a](https://arxiv.org/html/2605.30523#bib.bib35), Lem. 1 and Svete & Sabharwal, [2026](https://arxiv.org/html/2605.30523#bib.bib48), Lem. D.14).
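A toy example shows why causal masking can supply positional information when no PEs are available. The sketch below applies uniform (averaging) attention to a sequence of scalar values, with and without a causal mask. It is an illustration only, not one of the cited constructions: it uses exact rationals and therefore sidesteps the precision constraints discussed above.

```python
from fractions import Fraction
from typing import List


def uniform_attention(values: List[int], causal: bool) -> List[Fraction]:
    """Average the visible values at each position (no positional encodings)."""
    outputs = []
    for i in range(len(values)):
        # With a causal mask, position i only sees positions 0..i.
        visible = values[: i + 1] if causal else values
        outputs.append(Fraction(sum(visible), len(visible)))
    return outputs


x = [1, 1, 0]
y = [0, 1, 1]  # a permutation of x

# Unmasked: every position sees the same multiset, so order is invisible.
print(uniform_attention(x, causal=False) == uniform_attention(y, causal=False))  # True

# Masked: the prefixes differ, so the two orderings are distinguishable.
print(uniform_attention(x, causal=True) == uniform_attention(y, causal=True))  # False
```

Without PEs, unmasked attention of this kind is invariant to permutations of the input, whereas the masked variant is not.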

##### The role of padding\.

We conjecture that any “reasonable” parameterization of *unpadded* transformers will be unable to simulate full $\mathtt{AC}^{0}$ or $\mathtt{TC}^{0}$ classes. (Note that all equivalences in [Fig. 1](https://arxiv.org/html/2605.30523#S1.F1) naturally apply as *upper bounds* to unpadded transformers.) By *reasonable*, we mean transformers with volume $V(N) = \Theta(\log N)$, i.e., models with sufficient volume to capture *entire* $\mathtt{AC}^{0}$ and $\mathtt{TC}^{0}$ classes when padded, but whose size scales slowly with $N$.

###### Conjecture 6\.1\.

Let $\{\mathcal{T}_N\}_{N \in \mathbb{N}}$ be an $\mathtt{X}\text{-uniform}$ unpadded transformer family with $V(N) = \Theta(\log N)$. Then,

1. if $D = \Theta(\log N)$ and $b = \Theta(1)$, there exist $\mathtt{X}\text{-uniform}$ $\mathtt{AC}^{0}$ circuits that cannot be simulated by the family, and
2. if $D = \Theta(1)$ and $b = \Theta(\log N)$, there exist $\mathtt{X}\text{-uniform}$ $\mathtt{TC}^{0}$ circuits that cannot be simulated by the family,

unless the $\mathtt{X}\text{-uniform}\ \mathtt{AC}^{0}/\mathtt{TC}^{0}$ size hierarchies collapse.

The intuition behind this conjecture stems from *size hierarchy* results in circuit complexity, which state that entire $\mathtt{AC}^{0}$ and $\mathtt{TC}^{0}$ classes cannot be captured by circuits whose size (the number of gates or wires) is bounded by some fixed polynomial. In the non-uniform $\mathtt{AC}^{0}$ setting, the polynomial-size hierarchy is strict: For any fixed $K \in \mathbb{N}$, there exist functions in (non-uniform) $\mathtt{AC}^{0}$ that cannot be computed by circuits of size $\mathcal{O}(N^{K})$ (Arora & Barak, [2009](https://arxiv.org/html/2605.30523#bib.bib1)). In the $\mathtt{TC}^{0}$ and uniform $\mathtt{AC}^{0}$ settings, however, no analogous size hierarchies are known (London & Kanade, [2025](https://arxiv.org/html/2605.30523#bib.bib32)). Because any unpadded transformer processing inputs of length $N$ can be simulated by a circuit of size $\mathcal{O}(N^{K})$ for some fixed polynomial determined by the transformer's architecture, we can conclude that unpadded non-uniform constant-precision transformers will miss some parts of $\mathtt{AC}^{0}$ (as already shown by London & Kanade ([2025](https://arxiv.org/html/2605.30523#bib.bib32))). [Conj. 6.1](https://arxiv.org/html/2605.30523#S6.Thmconjecture1) extends this separation to uniform $\mathtt{AC}^{0}$ and $\mathtt{TC}^{0}$ circuits.

Existing constructions also do not quantify the *minimal* padding required to solve specific problems beyond using $N^{K}$ padding symbols to compute the values of $N^{K}$ circuit gates. This further suggests connecting transformers with limited padding to restricted circuit classes such as physically-realizable circuits (Prada & Mali, [2025a](https://arxiv.org/html/2605.30523#bib.bib43), [b](https://arxiv.org/html/2605.30523#bib.bib44)), whose size cannot scale as an arbitrary polynomial.

It remains open to what degree polynomial model*width*and numeric*precision*can compensate for or even eliminate the need for padding\. We argue, however, that those regimes might not be as interesting, since padded transformers more naturally describe*uniform*and*distributed*models of computation\. Growing width forces some degree of non\-uniformity in the transformer family, while padding constant\-width transformers allows for strict parameter uniformity\. Simultaneously, padding ensures*distributed*processing of information rather than gathering all information in a few positions of a polynomially\-wide transformer\.

##### The role of uniformity and positional encodings\.

Our focus on $\mathtt{L}\text{-uniform}$ transformer families allows a direct connection to existing results on $\mathtt{L}\text{-uniform}$ families (London & Kanade, [2025](https://arxiv.org/html/2605.30523#bib.bib32)). (Note that the equivalences of London & Kanade ([2025](https://arxiv.org/html/2605.30523#bib.bib32)) generalize to any uniformity class, as their construction directly uses the computational resources of one computational model, e.g., a circuit, to construct the other one, the equivalent transformer.) The power of logspace Turing machines is particularly useful for computing PEs. For example, our constructions that generalize the constant-precision results to looping rely on functions such as division and modulo, which the $\mathtt{AC}^{0}$-upper-bounded constant-precision transformers cannot compute. In this case, PEs, despite only depending on the position and not the input string, provide the model with information that it cannot itself compute. This suggests that a constant-precision transformer family must have $\mathtt{L}$- (or less) uniform PEs for looping to lead to the expected scaling behavior analogous to circuits (i.e., to increase expressivity). In other words, more uniform (and thus simpler) PEs might not contain enough information for constant-precision transformers to benefit from looping.
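The idea that PEs can carry precomputed arithmetic can be sketched as follows. The encoding below is hypothetical and is not the PE used in our constructions: it simply concatenates the binary representations of a position, its quotient, and its remainder with respect to an assumed modulus `m`, all computed outside the model.

```python
from typing import List


def to_bits(value: int, width: int) -> List[int]:
    """Big-endian binary representation of value using exactly width bits."""
    return [(value >> k) & 1 for k in reversed(range(width))]


def positional_encoding(i: int, n: int, m: int) -> List[int]:
    """Hypothetical PE for position i in a length-n input with modulus m.

    The division and modulo are computed here, by the (logspace) construction
    procedure, rather than by the transformer itself.
    """
    width = max(1, (n - 1).bit_length())  # bits needed to index n positions
    return to_bits(i, width) + to_bits(i // m, width) + to_bits(i % m, width)


print(positional_encoding(5, 8, 3))
# [1, 0, 1, 0, 0, 1, 0, 1, 0]  (5 = 101, 5 // 3 = 001, 5 % 3 = 010)
```

The encoding depends only on the position and the input length, never on the input string, yet it hands the model the values of functions that lie beyond what a constant-precision transformer could compute from the position on its own.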

The definition of transformer family uniformity also allows us to study the complexity of constructing the transformer parameters and the PEs separately. This is useful, as the interest in length generalization and algorithm learning motivates the study of *fully* uniform families (which reuse identical parameters for all input lengths). While some work is beginning to look at the expressivity enabled by different PEs (Yang et al., [2024](https://arxiv.org/html/2605.30523#bib.bib52); Li et al., [2024a](https://arxiv.org/html/2605.30523#bib.bib29); Li & Cotterell, [2025](https://arxiv.org/html/2605.30523#bib.bib27), [2026](https://arxiv.org/html/2605.30523#bib.bib28)), it is an open question how changing the complexity of PEs changes transformer expressivity. Uniform transformer families allow one to study such questions precisely. For example, describing the expressivity of transformers with no or $\mathtt{FO}\text{-uniform}$ PEs would quantify the importance and the expressivity gain afforded by $\mathtt{L}\text{-uniform}$ PEs.

##### Looping vs\. depth\.

The distinction between *looping* and *depth* matters because the two scale expressivity through different mechanisms: An $L$-*deep* transformer has $L$ *independent* sets of parameters, so increasing depth grows both the computation graph *and* the parameter count, while an $L$-*looped* transformer reuses a single fixed set of layers $L$ times, growing the computation graph without growing the parameter count and yielding a more uniform computational model. Our results nevertheless carry over to deep transformers. Since any $L$-looped transformer is a special case of an $L$-deep one (with all layers tied), our lower bounds (constructions of $\mathtt{FO}\text{-uniform}\ \mathtt{AC}^{d}$ and $\mathtt{FO}\text{-uniform}\ \mathtt{TC}^{d}$ circuits) transfer to $\Theta(\log^{d} N)$-deep $\mathtt{L}\text{-uniform}$ transformers. The matching upper bounds transfer as well: A $\Theta(\log^{d} N)$-deep $\mathtt{L}\text{-uniform}$ family has a number of *independent* layers that grows with $N$, but for the family to be $\mathtt{L}\text{-uniform}$ the logspace construction machine must itself emit the parameters of all $\Theta(\log^{d} N)$ layers from input $1^{N}$. Then, due to uniformity collapse (cf. [§5.1](https://arxiv.org/html/2605.30523#S5.SS1)), the same $\mathtt{FO}\text{-uniform}\ \mathtt{AC}^{d}$ and $\mathtt{FO}\text{-uniform}\ \mathtt{TC}^{d}$ upper bounds apply here. Looping is thus the more parameter-efficient (and, arguably, the more natural) way of obtaining the scaling behavior we describe, but the same expressivity is recovered if the additional layers carry independent parameters.
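The relationship between looped and deep models can be sketched with a toy layer. The scalar "layer" below (an affine map followed by a ReLU) and all names are stand-ins chosen for this illustration; they are not a transformer layer. The sketch shows that a looped model coincides with a deep model whose layers are tied, while its parameter count stays fixed.

```python
from typing import List, Tuple

Params = Tuple[float, float]  # toy stand-in for one layer's weights: (scale, shift)


def layer(x: float, params: Params) -> float:
    """A toy layer: affine map followed by a ReLU."""
    scale, shift = params
    return max(0.0, scale * x + shift)


def run_deep(x: float, layers: List[Params]) -> float:
    """Apply len(layers) layers, each with its own parameters."""
    for params in layers:
        x = layer(x, params)
    return x


def run_looped(x: float, shared: Params, num_loops: int) -> float:
    """Apply the same layer num_loops times."""
    for _ in range(num_loops):
        x = layer(x, shared)
    return x


def count_params(layers: List[Params]) -> int:
    """Number of scalar parameters stored for the given distinct layers."""
    return 2 * len(layers)


shared = (0.5, 1.0)
independent = [(0.5, 1.0), (2.0, -1.0), (1.0, 0.0), (0.5, 0.5)]

# A 4-looped model equals a 4-deep model with all layers tied.
print(run_looped(0.0, shared, 4) == run_deep(0.0, [shared] * 4))  # True

# Stored parameters: constant under looping, linear in depth otherwise.
print(count_params([shared]), count_params(independent))  # 2 8
```

Any computation a looped model performs is therefore available to a deep model of the same length, which is why the lower bounds transfer from looping to depth.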

##### Resource allocation\.

Our results suggest expressivity-driven *resource allocation* in real-world transformers. Model depth and precision increase expressivity, but both incur a degree of non-parallelizable *sequentiality*. However, expressivity *scales better with depth than with precision*: Given a budget of, e.g., $\log^{d} N$ sequential steps, our results suggest devoting only $\log N$ of those to numeric precision and the remaining $\log^{d-1} N$ ones to looping, since additional precision would not increase expressivity, whereas expressivity scales gracefully with looping. We note, however, that we only consider expressivity; while different uniformity regimes have implications for generalization, we do not make any claims about the learnability of the considered classes of languages. A growing body of literature considers this question both theoretically (Hahn & Rofin, [2024](https://arxiv.org/html/2605.30523#bib.bib19); Yang et al., [2025](https://arxiv.org/html/2605.30523#bib.bib53); Merrill et al., [2026](https://arxiv.org/html/2605.30523#bib.bib38); Kövér et al., [2026](https://arxiv.org/html/2605.30523#bib.bib26), *inter alia*) and practically (Weiss et al., [2018](https://arxiv.org/html/2605.30523#bib.bib51); Bhattamishra et al., [2020](https://arxiv.org/html/2605.30523#bib.bib5); van der Poel et al., [2024](https://arxiv.org/html/2605.30523#bib.bib50); Delétang et al., [2023](https://arxiv.org/html/2605.30523#bib.bib13); Someya et al., [2024](https://arxiv.org/html/2605.30523#bib.bib46); Borenstein et al., [2024](https://arxiv.org/html/2605.30523#bib.bib6); Svete et al., [2024](https://arxiv.org/html/2605.30523#bib.bib49); Butoi et al., [2025](https://arxiv.org/html/2605.30523#bib.bib7), *inter alia*); the connection of model uniformity to learnability is an exciting avenue for future work.

##### The role of transformer arithmetic\.

Throughout the paper, we use *rounding* of fixed-point arithmetic results as a “hammer” that allows transformers to implement useful gadgets (e.g., focusing on individual positions). Specifically, under fixed-point arithmetic, every arithmetic operation is followed by a projection onto the closest representable value in $\mathbb{F}_{b}$ (cf. [§A.3](https://arxiv.org/html/2605.30523#A1.SS3)), which lets us implement non-linear behavior (e.g., thresholding and exact comparisons) inside the otherwise affine layers of a transformer. This is true for *both* constant and growing precision: In either regime, rounding turns small numerical gaps into sharp combinatorial decisions, and our constructions rely on choosing weights large enough that the gap between attended and non-attended positions is wider than the rounding error. The fact that rounding applies across different transformer idealizations is part of why our results compose cleanly across constant and growing precision. Note that our results (particularly upper bounds) may not transfer to *floating point* transformers (Li et al., [2024b](https://arxiv.org/html/2605.30523#bib.bib30); Park et al., [2026](https://arxiv.org/html/2605.30523#bib.bib41)).
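The effect of rounding on attention weights can be illustrated with a small sketch. It simplifies the arithmetic model of §A.3 in two ways that should be kept in mind: it rounds only the final softmax weights rather than every intermediate operation, and it ignores the bounded range of fixed-point numbers. The function names and the choice of 4 fractional bits are assumptions made for this example.

```python
import math
from typing import List


def fixed_point_round(x: float, frac_bits: int) -> float:
    """Round x to the nearest multiple of 2**(-frac_bits)."""
    scale = 2 ** frac_bits
    return round(x * scale) / scale


def rounded_softmax(scores: List[float], frac_bits: int) -> List[float]:
    """Softmax whose output weights are rounded to fixed-point values."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift for numerical stability
    total = sum(exps)
    return [fixed_point_round(e / total, frac_bits) for e in exps]


# Small score gap: attention stays spread over all positions.
print(rounded_softmax([1.0, 0.0, 0.0], frac_bits=4))   # [0.5625, 0.1875, 0.1875]

# Large score gap (larger weights): non-attended weights fall below half the
# resolution 2**-4 and round to zero, so attention becomes exactly one-hot.
print(rounded_softmax([10.0, 0.0, 0.0], frac_bits=4))  # [1.0, 0.0, 0.0]
```

Once the score gap is large relative to the rounding resolution, softmax attention focuses on a single position exactly rather than approximately.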

## 7Conclusion

We show that polynomially padded transformers align well to natural circuit complexity classes and are surprisingly robust to variations in attention type, model width, and uniformity, as long as the model has enough memory per symbol to index individual positions. This sets them apart from unpadded transformers, whose expressivity is difficult to link to well-known circuit classes and is brittle with respect to parametrization. Concretely, $\mathtt{L}\text{-uniform}$ padded constant-precision transformers span $\mathtt{L}\text{-uniform}\ \mathtt{AC}^{0}$ while growing-precision ones reach $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}$. Moreover, they scale analogously to circuits when looped, showing that looping is a viable way of using additional inference-time compute to increase model expressivity. The most pressing question these results leave open is a tight characterization of *insufficient-volume* ($V(N) = o(\log N)$) transformers: the standard $\mathtt{AC}^{0}$/$\mathtt{TC}^{0}$ classes appear too coarse, and a characterization will likely require sub-$\mathtt{AC}^{0}$ classes such as $\mathtt{PFO}^{2}\mathtt{[<]}$. Altogether, these results establish a more unified view of transformer expressivity and solidify its connection to natural circuit classes.

## Impact Statement

This paper presents work whose goal is to advance the field of machine learning\. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here\.

## Acknowledgments

Anej Svete is supported by the ETH AI Center doctoral fellowship\. Many ideas and the motivation for this work were first developed during the 2025 Dagstuhl Seminar 25282 “Theory of Neural Language Models”\(Barceló et al\.,[2026](https://arxiv.org/html/2605.30523#bib.bib4)\), particularly in the “Uniformity” working group attended by Satwik Bhattamishra, Michaël Cadilhac, David Chiang, Will Merrill, Ashish Sabharwal, Clayton Sanford, Howard Straubing, Laura Strieker, and Anej Svete\. We thank the attendees for insightful discussions\. We also thank Reda Boumasmoud for helpful discussions about fixed\-point arithmetic\.

## References

- Arora & Barak \(2009\)Arora, S\. and Barak, B\.*Computational Complexity: A Modern Approach*\.Cambridge University Press, USA, 1st edition, 2009\.ISBN 0521424267\.URL[https://dl\.acm\.org/doi/10\.5555/1540612](https://dl.acm.org/doi/10.5555/1540612)\.
- Ba et al\. \(2016\)Ba, J\. L\., Kiros, J\. R\., and Hinton, G\. E\.Layer normalization\.*arXiv preprint arXiv:1607\.06450*, 2016\.URL[https://arxiv\.org/abs/1607\.06450](https://arxiv.org/abs/1607.06450)\.
- Bae et al\. \(2026\)Bae, S\., Kim, Y\., Bayat, R\., Kim, S\., Ha, J\., Schuster, T\., Fisch, A\., Harutyunyan, H\., Ji, Z\., Courville, A\., and Yun, S\.\-Y\.Mixture\-of\-recursions: Learning dynamic recursive depths for adaptive token\-level computation\.In*The Thirty\-ninth Annual Conference on Neural Information Processing Systems*, 2026\.URL[https://openreview\.net/forum?id=QuqsEIVWIG](https://openreview.net/forum?id=QuqsEIVWIG)\.
- Barceló et al\. \(2026\)Barceló, P\., Chiang, D\., Cybenko, G\., Strobl, L\., and Yang, A\.Theory of Neural Language Models \(Dagstuhl Seminar 25282\)\.*Dagstuhl Reports*, 15\(7\):22–52, 2026\.ISSN 2192\-5283\.doi:10\.4230/DagRep\.15\.7\.22\.URL[https://drops\.dagstuhl\.de/entities/document/10\.4230/DagRep\.15\.7\.22](https://drops.dagstuhl.de/entities/document/10.4230/DagRep.15.7.22)\.
- Bhattamishra et al\. \(2020\)Bhattamishra, S\., Ahuja, K\., and Goyal, N\.On the Ability and Limitations of Transformers to Recognize Formal Languages\.In Webber, B\., Cohn, T\., He, Y\., and Liu, Y\. \(eds\.\),*Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\)*, pp\. 7096–7116, Online, November 2020\. Association for Computational Linguistics\.doi:10\.18653/v1/2020\.emnlp\-main\.576\.URL[https://aclanthology\.org/2020\.emnlp\-main\.576/](https://aclanthology.org/2020.emnlp-main.576/)\.
- Borenstein et al\. \(2024\)Borenstein, N\., Svete, A\., Chan, R\., Valvoda, J\., Nowak, F\., Augenstein, I\., Chodroff, E\., and Cotterell, R\.What languages are easy to language\-model? a perspective from learning probabilistic regular languages\.In*ACL*, 2024\.URL[https://aclanthology\.org/2024\.acl\-long\.807/](https://aclanthology.org/2024.acl-long.807/)\.
- Butoi et al\. \(2025\)Butoi, A\., Khalighinejad, G\., Svete, A\., Valvoda, J\., Cotterell, R\., and DuSell, B\.Training neural networks as recognizers of formal languages\.In*The Thirteenth International Conference on Learning Representations*, 2025\.URL[https://openreview\.net/forum?id=aWLQTbfFgV](https://openreview.net/forum?id=aWLQTbfFgV)\.
- Chan et al\. \(2024\)Chan, R\., Boumasmoud, R\., Svete, A\., Ren, Y\., Guo, Q\., Jin, Z\., Ravfogel, S\., Sachan, M\., Schölkopf, B\., El\-Assady, M\., and Cotterell, R\.On affine homotopy between language encoders\.In*The Thirty\-eighth Annual Conference on Neural Information Processing Systems*, 2024\.URL[https://openreview\.net/forum?id=FTpOwIaWUz](https://openreview.net/forum?id=FTpOwIaWUz)\.
- Chen et al\. \(2025\)Chen, Y\., Shang, J\., Zhang, Z\., Xie, Y\., Sheng, J\., Liu, T\., Wang, S\., Sun, Y\., Wu, H\., and Wang, H\.Inner thinking transformer: Leveraging dynamic depth scaling to foster adaptive internal thinking\.In Che, W\., Nabende, J\., Shutova, E\., and Pilehvar, M\. T\. \(eds\.\),*Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pp\. 28241–28259, Vienna, Austria, July 2025\. Association for Computational Linguistics\.ISBN 979\-8\-89176\-251\-0\.doi:10\.18653/v1/2025\.acl\-long\.1369\.URL[https://aclanthology\.org/2025\.acl\-long\.1369/](https://aclanthology.org/2025.acl-long.1369/)\.
- Chiang \(2025\)Chiang, D\.Transformers in uniform TC0\.*Transactions on Machine Learning Research*, 2025\.ISSN 2835\-8856\.URL[https://openreview\.net/forum?id=ZA7D4nQuQF](https://openreview.net/forum?id=ZA7D4nQuQF)\.
- Cotterell et al\. \(2024\)Cotterell, R\., Svete, A\., Meister, C\., Liu, T\., and Du, L\.Formal aspects of language modeling\.*arXiv preprint arXiv:2311\.04329*, 2024\.URL[https://arxiv\.org/abs/2311\.04329](https://arxiv.org/abs/2311.04329)\.
- Dehghani et al\. \(2019\)Dehghani, M\., Gouws, S\., Vinyals, O\., Uszkoreit, J\., and Kaiser, L\.Universal transformers\.In*International Conference on Learning Representations*, 2019\.URL[https://openreview\.net/forum?id=HyzdRiR9Y7](https://openreview.net/forum?id=HyzdRiR9Y7)\.
- Delétang et al\. \(2023\)Delétang, G\., Ruoss, A\., Grau\-Moya, J\., Genewein, T\., Wenliang, L\. K\., Catt, E\., Cundy, C\., Hutter, M\., Legg, S\., Veness, J\., and Ortega, P\. A\.Neural networks and the Chomsky hierarchy\.In*Proceedings of the Eleventh International Conference on Learning Representations \(ICLR\)*, 2023\.URL[https://openreview\.net/forum?id=WbxHAzkeQcn](https://openreview.net/forum?id=WbxHAzkeQcn)\.
- Furst et al\. \(1984\)Furst, M\., Saxe, J\. B\., and Sipser, M\.Parity, circuits, and the polynomial\-time hierarchy\.*Mathematical systems theory*, 17\(1\):13–27, Dec 1984\.ISSN 1433\-0490\.doi:10\.1007/BF01744431\.URL[https://doi\.org/10\.1007/BF01744431](https://doi.org/10.1007/BF01744431)\.
- Geiping et al\. \(2025\)Geiping, J\., McLeish, S\. M\., Jain, N\., Kirchenbauer, J\., Singh, S\., Bartoldson, B\. R\., Kailkhura, B\., Bhatele, A\., and Goldstein, T\.Scaling up test\-time compute with latent reasoning: A recurrent depth approach\.In*ES\-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models*, 2025\.URL[https://openreview\.net/forum?id=D6o6Bwtq7h](https://openreview.net/forum?id=D6o6Bwtq7h)\.
- Giannou et al\. \(2023\)Giannou, A\., Rajput, S\., Sohn, J\.\-Y\., Lee, K\., Lee, J\. D\., and Papailiopoulos, D\.Looped transformers as programmable computers\.2023\.URL[https://proceedings\.mlr\.press/v202/giannou23a\.html](https://proceedings.mlr.press/v202/giannou23a.html)\.
- Goyal et al\. \(2024\)Goyal, S\., Ji, Z\., Rawat, A\. S\., Menon, A\. K\., Kumar, S\., and Nagarajan, V\.Think before you speak: Training language models with pause tokens\.In*The Twelfth International Conference on Learning Representations*, 2024\.URL[https://openreview\.net/forum?id=ph04CRkPdC](https://openreview.net/forum?id=ph04CRkPdC)\.
- Groeneveld et al\. \(2024\)Groeneveld, D\., Beltagy, I\., Walsh, E\., Bhagia, A\., Kinney, R\., Tafjord, O\., Jha, A\., Ivison, H\., Magnusson, I\., Wang, Y\., Arora, S\., Atkinson, D\., Authur, R\., Chandu, K\., Cohan, A\., Dumas, J\., Elazar, Y\., Gu, Y\., Hessel, J\., Khot, T\., Merrill, W\., Morrison, J\., Muennighoff, N\., Naik, A\., Nam, C\., Peters, M\., Pyatkin, V\., Ravichander, A\., Schwenk, D\., Shah, S\., Smith, W\., Strubell, E\., Subramani, N\., Wortsman, M\., Dasigi, P\., Lambert, N\., Richardson, K\., Zettlemoyer, L\., Dodge, J\., Lo, K\., Soldaini, L\., Smith, N\., and Hajishirzi, H\.OLMo: Accelerating the science of language models\.In Ku, L\.\-W\., Martins, A\., and Srikumar, V\. \(eds\.\),*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pp\. 15789–15809, Bangkok, Thailand, August 2024\. Association for Computational Linguistics\.doi:10\.18653/v1/2024\.acl\-long\.841\.URL[https://aclanthology\.org/2024\.acl\-long\.841/](https://aclanthology.org/2024.acl-long.841/)\.
- Hahn & Rofin \(2024\)Hahn, M\. and Rofin, M\.Why are sensitive functions hard for transformers?In Ku, L\.\-W\., Martins, A\., and Srikumar, V\. \(eds\.\),*Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pp\. 14973–15008, Bangkok, Thailand, August 2024\. Association for Computational Linguistics\.doi:10\.18653/v1/2024\.acl\-long\.800\.URL[https://aclanthology\.org/2024\.acl\-long\.800/](https://aclanthology.org/2024.acl-long.800/)\.
- Hao et al\. \(2024\)Hao, S\., Sukhbaatar, S\., Su, D\., Li, X\., Hu, Z\., Weston, J\., and Tian, Y\.Training large language models to reason in a continuous latent space\.*arXiv preprint arXiv:2412\.06769*, 2024\.URL[https://arxiv\.org/abs/2412\.06769](https://arxiv.org/abs/2412.06769)\.
- Hao et al\. \(2022\)Hao, Y\., Angluin, D\., and Frank, R\.Formal language recognition by hard attention transformers: Perspectives from circuit complexity\.*Transactions of the Association for Computational Linguistics*, 10:800–810, 2022\.doi:10\.1162/tacl\_a\_00490\.URL[https://aclanthology\.org/2022\.tacl\-1\.46/](https://aclanthology.org/2022.tacl-1.46/)\.
- Hesse et al\. \(2002\)Hesse, W\., Allender, E\., and Mix Barrington, D\. A\.Uniform constant\-depth threshold circuits for division and iterated multiplication\.*Journal of Computer and System Sciences*, 65\(4\):695–716, 2002\.ISSN 0022\-0000\.doi:https://doi\.org/10\.1016/S0022\-0000\(02\)00025\-9\.URL[https://www\.sciencedirect\.com/science/article/pii/S0022000002000259](https://www.sciencedirect.com/science/article/pii/S0022000002000259)\.Special Issue on Complexity 2001\.
- Immerman \(1999\)Immerman, N\.*Descriptive Complexity*\.Graduate Texts in Computer Science\. Springer, New York, 1999\.ISBN 978\-0\-387\-98600\-5\.doi:10\.1007/978\-1\-4612\-0539\-5\.URL[https://link\.springer\.com/book/10\.1007/978\-1\-4612\-0539\-5](https://link.springer.com/book/10.1007/978-1-4612-0539-5)\.
- Jerad et al\. \(2025\)Jerad, S\., Svete, A\., Li, J\., and Cotterell, R\.Unique hard attention: A tale of two sides\.In Che, W\., Nabende, J\., Shutova, E\., and Pilehvar, M\. T\. \(eds\.\),*Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\)*, pp\. 977–996, Vienna, Austria, July 2025\. Association for Computational Linguistics\.ISBN 979\-8\-89176\-252\-7\.doi:10\.18653/v1/2025\.acl\-short\.76\.URL[https://aclanthology\.org/2025\.acl\-short\.76/](https://aclanthology.org/2025.acl-short.76/)\.
- Jerad et al\. \(2026\)Jerad, S\., Svete, A\., Hao, S\., Cotterell, R\., and Merrill, W\.Context\-free recognition with transformers\.In*Forty\-third International Conference on Machine Learning*, 2026\.URL[https://openreview\.net/forum?id=JOyxs9ElI7](https://openreview.net/forum?id=JOyxs9ElI7)\.
- Kövér et al\. \(2026\)Kövér, B\., Butoi, A\., Svete, A\., Hahn, M\., and Cotterell, R\.A framework for understanding learnability in transformers\.In*Forty\-third International Conference on Machine Learning*, 2026\.URL[https://openreview\.net/forum?id=rUZVcnQjYW](https://openreview.net/forum?id=rUZVcnQjYW)\.
- Li & Cotterell \(2025\)Li, J\. and Cotterell, R\.Characterizing the expressivity of fixed\-precision transformer language models\.In*The Thirty\-ninth Annual Conference on Neural Information Processing Systems*, 2025\.URL[https://openreview\.net/forum?id=29LwAgLFpj](https://openreview.net/forum?id=29LwAgLFpj)\.
- Li & Cotterell \(2026\)Li, J\. and Cotterell, R\.Characterizing the expressivity of local attention in transformers\.*arXiv preprint arXiv:2605\.00768*, 2026\.URL[https://arxiv\.org/abs/2605\.00768](https://arxiv.org/abs/2605.00768)\.
- Li et al\. \(2024a\)Li, X\., Liang, Y\., Shi, Z\., Song, Z\., and Wan, M\.Theoretical constraints on the expressive power of𝖱𝗈𝖯𝖤\\mathsf\{RoPE\}\-based tensor attention transformers\.*arXiv preprint arXiv:2412\.18040*, 2024a\.URL[https://arxiv\.org/abs/2412\.18040](https://arxiv.org/abs/2412.18040)\.
- Li et al\. \(2024b\)Li, Z\., Liu, H\., Zhou, D\., and Ma, T\.Chain of thought empowers transformers to solve inherently serial problems\.In*ICLR*, 2024b\.URL[https://openreview\.net/forum?id=3EWTEy9MTM](https://openreview.net/forum?id=3EWTEy9MTM)\.
- Li et al\. \(2026\)Li, Z\., Li, Y\., and Zhou, T\.Skip a layer or loop it? learning program\-of\-layers in LLMs\.In*Forty\-third International Conference on Machine Learning*, 2026\.URL[https://openreview\.net/forum?id=pl10b6EQAN](https://openreview.net/forum?id=pl10b6EQAN)\.
- London & Kanade \(2025\)London, C\. and Kanade, V\.Pause tokens strictly increase the expressivity of constant\-depth transformers\.In*The Thirty\-ninth Annual Conference on Neural Information Processing Systems*, 2025\.URL[https://openreview\.net/forum?id=eG5oh8l1WZ](https://openreview.net/forum?id=eG5oh8l1WZ)\.
- Merrill & Sabharwal \(2023\)Merrill, W\. and Sabharwal, A\.The parallelism tradeoff: Limitations of log\-precision transformers\.*TACL*, 11:531–545, 2023\.URL[https://aclanthology\.org/2023\.tacl\-1\.31/](https://aclanthology.org/2023.tacl-1.31/)\.
- Merrill & Sabharwal \(2024\)Merrill, W\. and Sabharwal, A\.The expressive power of transformers with chain of thought\.In*ICLR*, 2024\.URL[https://openreview\.net/forum?id=NjNGlPh8Wh](https://openreview.net/forum?id=NjNGlPh8Wh)\.
- Merrill & Sabharwal \(2025a\)Merrill, W\. and Sabharwal, A\.Exact expressive power of transformers with padding\.In*The Thirty\-ninth Annual Conference on Neural Information Processing Systems*, 2025a\.URL[https://openreview\.net/forum?id=O1abxStFcy](https://openreview.net/forum?id=O1abxStFcy)\.
- Merrill & Sabharwal \(2025b\)Merrill, W\. and Sabharwal, A\.A little depth goes a long way: The expressive power of log\-depth transformers\.In*The Thirty\-ninth Annual Conference on Neural Information Processing Systems*, 2025b\.URL[https://openreview\.net/forum?id=5pHfYe10iX](https://openreview.net/forum?id=5pHfYe10iX)\.
- Merrill et al\. \(2022\)Merrill, W\., Sabharwal, A\., and Smith, N\. A\.Saturated transformers are constant\-depth threshold circuits\.*TACL*, 10:843–856, 2022\.URL[https://aclanthology\.org/2022\.tacl\-1\.49/](https://aclanthology.org/2022.tacl-1.49/)\.
- Merrill et al\. \(2026\)Merrill, W\., Li, Y\., Romero, T\., Svete, A\., Costello, C\., Dasigi, P\., Groeneveld, D\., Heineman, D\., Kuehl, B\., Lambert, N\., Li, C\., Lo, K\., Malik, S\., Matusz, D\., Minixhofer, B\., Morrison, J\., Soldaini, L\., Timbers, F\., Walsh, P\., Smith, N\. A\., Hajishirzi, H\., and Sabharwal, A\.Olmo hybrid: From theory to practice and back\.*arXiv preprint arXiv:2604\.03444*, 2026\.URL[https://arxiv\.org/abs/2604\.03444](https://arxiv.org/abs/2604.03444)\.
- Mix Barrington et al\. \(1990\)Mix Barrington, D\. A\., Immerman, N\., and Straubing, H\.On uniformity within NC$^1$\.*Journal of Computer and System Sciences*, 41\(3\):274–306, 1990\.ISSN 0022\-0000\.doi:https://doi\.org/10\.1016/0022\-0000\(90\)90022\-D\.URL[https://www\.sciencedirect\.com/science/article/pii/002200009090022D](https://www.sciencedirect.com/science/article/pii/002200009090022D)\.
- OpenAI \(2025\)OpenAI\.gpt\-oss\-120b & gpt\-oss\-20b model card\.Model card, OpenAI, August 2025\.URL[https://openai\.com/gpt\-oss\-models](https://openai.com/gpt-oss-models)\.Available under Apache 2\.0 license\.
- Park et al\. \(2026\)Park, S\., Park, Y\., and Hwang, G\.On the expressive power of floating\-point transformers\.*arXiv preprint arXiv:2601\.16450*, 2026\.URL[https://arxiv\.org/abs/2601\.16450](https://arxiv.org/abs/2601.16450)\.
- Pfau et al\. \(2024\)Pfau, J\., Merrill, W\., and Bowman, S\. R\.Let’s think dot by dot: Hidden computation in transformer language models\.In*COLM*, 2024\.URL[https://openreview\.net/forum?id=NikbrdtYvG](https://openreview.net/forum?id=NikbrdtYvG)\.
- Prada & Mali \(2025a\)Prada, B\. and Mali, A\.Realizable circuit complexity: Embedding computation in space\-time\.*arXiv preprint arXiv:2509\.19161*, 2025a\.URL[https://arxiv\.org/abs/2509\.19161](https://arxiv.org/abs/2509.19161)\.
- Prada & Mali \(2025b\)Prada, B\. and Mali, A\.Circuit complexity from physical constraints: Scaling limitations of attention\.*arXiv preprint arXiv:2509\.19161v1*, 2025b\.URL[https://arxiv\.org/abs/2509\.19161v1](https://arxiv.org/abs/2509.19161v1)\.
- Saunshi et al\. \(2025\)Saunshi, N\., Dikkala, N\., Li, Z\., Kumar, S\., and Reddi, S\. J\.Reasoning with latent thoughts: On the power of looped transformers\.In*The Thirteenth International Conference on Learning Representations*, 2025\.URL[https://openreview\.net/forum?id=din0lGfZFd](https://openreview.net/forum?id=din0lGfZFd)\.
- Someya et al\. \(2024\)Someya, T\., Yoshida, R\., and Oseki, Y\.Targeted syntactic evaluation on the Chomsky hierarchy\.In*Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\)*, pp\. 15595–15605, 2024\.URL[https://aclanthology\.org/2024\.lrec\-main\.1356/](https://aclanthology.org/2024.lrec-main.1356/)\.
- Strobl et al\. \(2024\)Strobl, L\., Merrill, W\., Weiss, G\., Chiang, D\., and Angluin, D\.What formal languages can transformers express? a survey\.*TACL*, 12:543–561, 2024\.URL[https://aclanthology\.org/2024\.tacl\-1\.30/](https://aclanthology.org/2024.tacl-1.30/)\.
- Svete & Sabharwal \(2026\)Svete, A\. and Sabharwal, A\.On the reasoning abilities of masked diffusion language models\.In*The Fourteenth International Conference on Learning Representations*, 2026\.URL[https://openreview\.net/forum?id=BVnIsh4Nz1](https://openreview.net/forum?id=BVnIsh4Nz1)\.
- Svete et al\. \(2024\)Svete, A\., Borenstein, N\., Zhou, M\., Augenstein, I\., and Cotterell, R\.Can transformers learn $n$\-gram language models?In*EMNLP*, 2024\.URL[https://aclanthology\.org/2024\.emnlp\-main\.550/](https://aclanthology.org/2024.emnlp-main.550/)\.
- van der Poel et al\. \(2024\)van der Poel, S\., Lambert, D\., Kostyszyn, K\., Gao, T\., Verma, R\., Andersen, D\., Chau, J\., Peterson, E\., Clair, C\. S\., Fodor, P\., Shibata, C\., and Heinz, J\.MLRegTest: A benchmark for the machine learning of regular languages\.*Journal of Machine Learning Research*, 25\(283\):1–45, 2024\.URL[https://jmlr\.org/papers/v25/23\-0518\.html](https://jmlr.org/papers/v25/23-0518.html)\.
- Weiss et al\. \(2018\)Weiss, G\., Goldberg, Y\., and Yahav, E\.On the practical computational power of finite precision RNNs for language recognition\.In Gurevych, I\. and Miyao, Y\. \(eds\.\),*Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics \(Volume 2: Short Papers\)*, pp\. 740–745, Melbourne, Australia, July 2018\. Association for Computational Linguistics\.doi:10\.18653/v1/P18\-2117\.URL[https://aclanthology\.org/P18\-2117/](https://aclanthology.org/P18-2117/)\.
- Yang et al\. \(2024\)Yang, A\., Chiang, D\., and Angluin, D\.Masked hard\-attention transformers recognize exactly the star\-free languages\.In*NeurIPS*, 2024\.URL[https://openreview\.net/forum?id=FBMsBdH0yz](https://openreview.net/forum?id=FBMsBdH0yz)\.
- Yang et al\. \(2025\)Yang, A\., Cadilhac, M\., and Chiang, D\.Knee\-deep in c\-RASP: A transformer depth hierarchy\.In*The Thirty\-ninth Annual Conference on Neural Information Processing Systems*, 2025\.URL[https://openreview\.net/forum?id=jPduiyxyfw](https://openreview.net/forum?id=jPduiyxyfw)\.
- Yang et al\. \(2026a\)Yang, A\., Strobl, L\., Chiang, D\., and Angluin, D\.Simulating hard attention using soft attention\.*Transactions of the Association for Computational Linguistics*, 14:147–166, 2026a\.doi:10\.1162/tacl\.a\.597\.URL[https://aclanthology\.org/2026\.tacl\-1\.8/](https://aclanthology.org/2026.tacl-1.8/)\.
- Yang et al\. \(2026b\)Yang, A\., Watson, C\., Xue, A\., Bhattamishra, S\., Llarena, J\., Merrill, W\., Ferreira, E\. D\. S\., Svete, A\., and Chiang, D\.The transformer cookbook\.*Transactions on Machine Learning Research*, 2026b\.ISSN 2835\-8856\.URL[https://openreview\.net/forum?id=sPshCSvDrX](https://openreview.net/forum?id=sPshCSvDrX)\.
- Zeng et al\. \(2026\)Zeng, B\., Song, S\., Huang, S\., Wang, Y\., Li, H\., He, Z\., Wang, X\., li, Z\., and Lin, Z\.PonderLM: Pretraining language models to ponder in continuous space\.In*The Fourteenth International Conference on Learning Representations*, 2026\.URL[https://openreview\.net/forum?id=UrM4MNRYZm](https://openreview.net/forum?id=UrM4MNRYZm)\.

## Appendix A Preliminaries

### A.1 Notation

Let $\Sigma$ be an alphabet, a finite, non-empty set of *symbols*. A *language* $\mathcal{L}$ is a subset of $\Sigma^* \stackrel{\text{def}}{=} \bigcup_{N \in \mathbb{N}} \Sigma^N$, the set of all strings. We denote the concatenation of two strings $\bm{w}_1, \bm{w}_2 \in \Sigma^*$ as $\bm{w}_1 \circ \bm{w}_2$ or simply $\bm{w}_1 \bm{w}_2$. We use $\mathtt{poly}(N) \stackrel{\text{def}}{=} \{f \colon \mathbb{N} \to \mathbb{N} \mid \exists K > 0, f(N) = \mathcal{O}(N^K)\}$ to denote the set of functions with at most polynomial growth rate.

### A.2 Circuit Complexity

Boolean circuits are a model of parallel computation that processes binary input strings through a series of logical operations to produce binary outputs. [Footnote 9: By representing symbols from any alphabet with binary encodings, circuits (or circuit functions) can be used to process strings over any finite alphabet. We focus on binary strings for simplicity.] Formally, a *boolean circuit* is a directed acyclic graph where source nodes represent the $N$-bit input, and a single sink node represents the output. Non-source vertices are called *gates* and are labeled with logical operations (e.g., AND, OR, NOT). The *size* of a circuit is the number of gates, and its *depth* is the length of the longest path from any input to the output.

A circuit $C_N$ computes a function $C_N \colon \{0,1\}^N \to \{0,1\}$ for some $N \in \mathbb{N}$; we attach the subscript $N$ to be compatible with the circuit-family notation $\{C_N\}_{N \in \mathbb{N}}$ used below. The value $C_N(\bm{w})$ for input string $\bm{w} \in \{0,1\}^N$ is computed by evaluating the gates in topological order starting from the input bits. We say that the circuit $C_N$ *accepts* a string $\bm{w}$ if $C_N(\bm{w}) = 1$.

*Circuit families* process input strings of variable length. A circuit family is a sequence of circuits $\mathcal{C} \stackrel{\text{def}}{=} \{C_N\}_{N \in \mathbb{N}}$ where $C_N$ processes inputs of length $N$. A circuit family is said to recognize a language if, for any given input string, the corresponding circuit outputs $1$ if and only if the string is in the language.

A *circuit complexity class* is a set of circuit families that satisfy certain constraints on size, depth, and the types of gates used. This paper focuses on two common classes:

- $\mathtt{AC}^d$: Circuits with NOT, AND, and OR gates that have unbounded fan-in, size polynomial in $N$, and depth $\mathcal{O}(\log^d N)$.
- $\mathtt{TC}^d$: The extension of $\mathtt{AC}^d$ that adds *threshold gates* THR, which output $1$ if the sum of their inputs exceeds a given threshold. It is known that $\mathtt{AC}^0 \subsetneq \mathtt{TC}^0$ and $\mathtt{AC}^d \subseteq \mathtt{TC}^d$. For example, Parity, the language of binary strings with an even number of 1s, is in $\mathtt{TC}^0$ but not in $\mathtt{AC}^0$ (Furst et al., [1984](https://arxiv.org/html/2605.30523#bib.bib14)).
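To make the gate types concrete, the following is a minimal Python sketch of evaluating a circuit whose gates are listed in topological order. The tuple encoding of gates and the names `eval_circuit` and `majority_circuit` are illustrative assumptions and do not come from the paper; the threshold gate follows the "sum exceeds the threshold" convention stated above.

```python
from typing import List, Tuple

# Hypothetical gate encoding (an assumption for illustration):
# (kind, input_indices, threshold). Indices 0..N-1 refer to input bits;
# larger indices refer to previously evaluated gates.
Gate = Tuple[str, List[int], int]


def eval_circuit(gates: List[Gate], bits: List[int]) -> int:
    """Evaluate gates in topological order; the last gate is the output."""
    values = list(bits)
    for kind, inputs, threshold in gates:
        args = [values[i] for i in inputs]
        if kind == "AND":
            out = int(all(args))
        elif kind == "OR":
            out = int(any(args))
        elif kind == "NOT":
            out = 1 - args[0]
        elif kind == "THR":
            # Threshold gate: 1 iff the sum of the inputs exceeds the threshold.
            out = int(sum(args) > threshold)
        else:
            raise ValueError(f"unknown gate kind: {kind}")
        values.append(out)
    return values[-1]


def majority_circuit(n: int) -> List[Gate]:
    """One unbounded fan-in THR gate: accepts iff more than half the bits are 1."""
    return [("THR", list(range(n)), n // 2)]


# A depth-3 AND/OR/NOT circuit for XOR of two bits (threshold field unused).
XOR2: List[Gate] = [
    ("NOT", [0], 0),     # gate 2: not x0
    ("NOT", [1], 0),     # gate 3: not x1
    ("AND", [0, 3], 0),  # gate 4: x0 and not x1
    ("AND", [2, 1], 0),  # gate 5: not x0 and x1
    ("OR", [4, 5], 0),   # gate 6: output
]
```

Here `majority_circuit(n)` has size one and depth one for every `n`, which illustrates what a single threshold gate contributes beyond AND, OR, and NOT.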

Without additional constraints, circuit families can recognize undecidable languages by having arbitrary solutions for each input length. To avoid this and ensure the model of computation is realistic, we can impose a *uniformity* condition. A circuit family is *uniform* if there exists a Turing machine that, on input $1^N$, generates a description of the circuit $C_N$. In particular, a circuit class is $\mathtt{L}$-uniform if a Turing machine using $\mathcal{O}(\log N)$ space can construct its description from the input $1^N$. This ensures the circuits for different input lengths are related by a systematic procedure.

A finer notion of uniformity, used throughout our main results, is $\mathtt{FO}$-uniformity. To state it, fix a standard encoding of the gates of $C_N$ by binary strings of length $\mathcal{O}(\log N)$; the *connection language* of the family $\{C_N\}_{N \in \mathbb{N}}$ then consists of the tuples $(1^N, \texttt{B}(i), \texttt{B}(j), t)$ such that gate $i$ in $C_N$ has type $t$ (input, AND, OR, NOT, THR, or output) and is wired to gate $j$. A circuit family is $\mathtt{FO}$-uniform if its connection language is definable in first-order logic over strings with the predicates $+$, $\times$, and $\leq$ on position indices (Mix Barrington et al., [1990](https://arxiv.org/html/2605.30523#bib.bib39)); equivalently, if a $\mathtt{DLOGTIME}$ Turing machine decides it. Because $\mathtt{FO} \subseteq \mathtt{L}$, every $\mathtt{FO}$-uniform family is $\mathtt{L}$-uniform.

### A.3 Fixed-point Arithmetic

Our computation models perform operations with *fixed-point arithmetic* (Li et al., [2024b](https://arxiv.org/html/2605.30523#bib.bib30); Saunshi et al., [2025](https://arxiv.org/html/2605.30523#bib.bib45); London & Kanade, [2025](https://arxiv.org/html/2605.30523#bib.bib32); Svete & Sabharwal, [2026](https://arxiv.org/html/2605.30523#bib.bib48)).

###### Definition A.1 (Fixed-point representation).

Let $b \in \mathbb{N}$ be the number of bits devoted to each of the integer and fractional parts. We use $\mathbb{F}_b$ to denote the set

$$\mathbb{F}_b \stackrel{\text{def}}{=} \{x_\pm \cdot a \cdot 2^{-b} \mid x_\pm \in \{-1, 1\},\ a \in \{0, 1, \ldots, 2^{2b} - 1\}\}. \tag{9}$$

We define $B_{\mathbb{F}} \stackrel{\text{def}}{=} \max \mathbb{F}_b = 2^b - 2^{-b}$. All values exceeding $B_{\mathbb{F}}$ are considered out of range and are rounded to $B_{\mathbb{F}}$. Note, however, that $B_{\mathbb{F}}$ does *not* behave like infinity: it is not an annihilator (in the algebraic sense), i.e., it does not absorb all subsequent operations. For example, for any positive $x \in \mathbb{F}_b$, $B_{\mathbb{F}} - x \neq B_{\mathbb{F}}$ is a valid number. To handle the results of arithmetic operations that may not be exactly representable in the fixed-point format, we define a standard for rounding.

###### Definition A.2 (Rounding).

For any $x \in \mathbb{R}$ and any closed subset $\mathbb{F}$ of $\mathbb{R}$ containing $0$, we define $\texttt{round}(x, \mathbb{F})$ as the closest number to $x$ in $\mathbb{F}$. In case of a tie, the value with the larger absolute value is chosen.

We denote the rounding operation as $[\cdot]_b \stackrel{\text{def}}{=} \texttt{round}(\cdot, \mathbb{F}_b)$. This operation is applied to vectors and matrices element-wise. All binary operations are defined by first performing the ideal mathematical operation and then rounding the result to the nearest representable value in $\mathbb{F}_b$. Division by zero is considered an error and results in an incorrect output.
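The two definitions above can be sketched in a few lines of Python with exact rational arithmetic. This is only an illustration under stated assumptions: the function names `max_value` and `round_fixed` are hypothetical, and the symmetric saturation of large negative values at $-B_{\mathbb{F}}$ is our reading of rounding to the closest element of $\mathbb{F}_b$ rather than something the text spells out.

```python
import math
from fractions import Fraction


def max_value(b: int) -> Fraction:
    """B_F = 2^b - 2^(-b), the largest element of F_b."""
    return Fraction(2) ** b - Fraction(1, 2 ** b)


def round_fixed(x: Fraction, b: int) -> Fraction:
    """Round x to the closest element of F_b.

    Ties go to the value with the larger absolute value, and values whose
    magnitude exceeds B_F saturate at +B_F or -B_F (assumed symmetric).
    """
    sign = -1 if x < 0 else 1
    scaled = abs(x) * 2 ** b                  # position on the grid of step 2^(-b)
    a = math.floor(scaled + Fraction(1, 2))   # round half up on the magnitude
    a = min(a, 2 ** (2 * b) - 1)              # clamp to the largest numerator
    return sign * Fraction(a, 2 ** b)


if __name__ == "__main__":
    b = 2
    big = max_value(b)
    # Saturation: an out-of-range value is rounded to B_F ...
    print(round_fixed(Fraction(100), b) == big)
    # ... but B_F is not absorbing: subtracting a positive x moves away from it.
    print(round_fixed(big - Fraction(1, 4), b) != big)
```

Using `Fraction` avoids mixing the fixed-point rounding being modeled with the binary floating-point rounding of the host language.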

For operations involving more than two numbers, rounding is applied iteratively\.

###### Definition A.3 (Summation with iterative rounding).

For $b \in \mathbb{N}$, we define *summation with iterative rounding* to $b$ fractional bits as the function $\textsc{sum}_b \colon \bigcup_{N \in \mathbb{N}} (\mathbb{F}_b)^N \to \mathbb{F}_b$, where for any $N \in \mathbb{N}^+$ and $\bm{x} \in (\mathbb{F}_b)^N$:

$$\textsc{sum}_b(\bm{x}) \stackrel{\text{def}}{=} \left[\dots\left[\left[x_1 + x_2\right]_b + x_3\right]_b + \dots + x_N\right]_b \tag{10}$$

This iterative rounding process is not associative, and the order of operations can affect the final result. Based on this, we can also define more complex operations such as the *fixed-point inner product* $\langle \bm{x}, \bm{y} \rangle_b \stackrel{\text{def}}{=} \textsc{sum}_b(\bm{x} \odot \bm{y})$, where $\odot$ denotes the element-wise product of two vectors, and the *fixed-point matrix product* for matrices $\bm{A}$ and $\bm{B}$, where $(\bm{A} \times_b \bm{B})_{i,j} \stackrel{\text{def}}{=} \langle (\bm{A}_{i,:})^\top, \bm{B}_{:,j} \rangle_b$. These operations will be used by fixed-point arithmetic transformers; throughout, the rounding step $[\cdot]_b$ that appears in every binary operation and in the iterative summation above is applied after every (sub-)step of the operation in the transformer's computation.
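A small Python sketch may help show why the order of operations matters in Eq. (10). The helper names (`round_fixed`, `sum_fixed`, `inner_fixed`) are hypothetical, and the rounding helper repeats the assumptions of Definitions A.1 and A.2 (ties away from zero, symmetric saturation at $\pm B_{\mathbb{F}}$). Since sums of grid points stay on the grid, the order-dependence in this example comes from saturation.

```python
import math
from fractions import Fraction
from typing import Sequence


def round_fixed(x: Fraction, b: int) -> Fraction:
    """Closest element of F_b; ties away from zero; saturates at +-B_F."""
    sign = -1 if x < 0 else 1
    a = math.floor(abs(x) * 2 ** b + Fraction(1, 2))
    return sign * Fraction(min(a, 2 ** (2 * b) - 1), 2 ** b)


def sum_fixed(xs: Sequence[Fraction], b: int) -> Fraction:
    """Left-to-right summation, rounding after every addition (Eq. 10)."""
    acc = round_fixed(xs[0], b)
    for x in xs[1:]:
        acc = round_fixed(acc + x, b)
    return acc


def inner_fixed(xs: Sequence[Fraction], ys: Sequence[Fraction], b: int) -> Fraction:
    """Fixed-point inner product: round each product, then sum iteratively."""
    products = [round_fixed(x * y, b) for x, y in zip(xs, ys)]
    return sum_fixed(products, b)


if __name__ == "__main__":
    b = 1                       # F_1 has step 1/2 and B_F = 3/2
    top = Fraction(3, 2)
    # Same multiset of summands, different order, different result:
    print(sum_fixed([top, top, -top], b))   # first sum saturates at 3/2
    print(sum_fixed([top, -top, top], b))   # cancellation happens first
```

With `b = 1`, the first ordering saturates before the negative term is added, while the second cancels first, so the two results differ.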

### A.4 Transformers and Transformer Families

#### A.4.1 The Transformer Architecture

For a fixed input length $N \in \mathbb{N}$ and model *width* $D \in \mathbb{N}$, a transformer $\mathcal{T}_N$ consists of:

1. a *symbol embedding* $\bm{e} \colon \Sigma \to \mathbb{F}^D$, defined for each symbol $w \in \Sigma$,
2. a *positional encoding* (PE) $\bm{p} \colon \mathbb{N} \times \mathbb{N} \to \mathbb{F}^D$,
3. $L$ layers $\bm{\tau}^{(1)}, \ldots, \bm{\tau}^{(L)}$, each of which consists of two sub-layers: a self-attention layer and a position-wise feed-forward network $\bm{f}$, and
4. a classification *output layer* $\bm{o}$ of the form $\bm{o} \colon \mathbb{F}^D \to \{0, 1\}$.

A transformer with layers $\bm{\tau}^{(1)}, \ldots, \bm{\tau}^{(L)}$ computes $\bm{h}^{(l)}_n \in \mathbb{F}^D$ for $l \in \{1, \ldots, L\}$ and each position $n \in [N]$ in the input string $\bm{w} = w_1 \cdots w_N \in \Sigma^*$ as follows. [Footnote 10: We focus on transformers with a single head per layer. As multi-head attention can be simulated by single-head attention with a constant factor increase in depth (Yang et al., [2026b](https://arxiv.org/html/2605.30523#bib.bib55)), our results extend to multi-head transformers as well.]

$$
\begin{align}
\bm{h}^{(0)}_n &\stackrel{\text{def}}{=} \bm{e}(w_n) + \bm{p}(n, N) \in \mathbb{F}^D \text{ for } n \in [N] \tag{11a} \\
\bm{H}^{(l)} &\stackrel{\text{def}}{=} \begin{pmatrix} \bm{h}^{(l)\top}_1 & \cdots & \bm{h}^{(l)\top}_N \end{pmatrix}^\top \in \mathbb{F}^{N \times D} \tag{11b} \\
\bm{Q}^{(l)} &\stackrel{\text{def}}{=} \bm{H}^{(l)} \bm{W}_Q^{(l)}, \quad \bm{K}^{(l)} \stackrel{\text{def}}{=} \bm{H}^{(l)} \bm{W}_K^{(l)}, \quad \bm{V}^{(l)} \stackrel{\text{def}}{=} \bm{H}^{(l)} \bm{W}_V^{(l)} \quad \in \mathbb{F}^{N \times D} \tag{11c} \\
\bm{G}^{(l)} &\stackrel{\text{def}}{=} \bm{\nu}\left(M\left(\bm{\Delta}\left(\bm{Q}^{(l)} \bm{K}^{(l)\top}\right)\right) \bm{V}^{(l)}\right) + \bm{H}^{(l)} \in \mathbb{F}^{N \times D} \tag{11d} \\
\bm{H}^{(l+1)} &\stackrel{\text{def}}{=} \bm{G}^{(l)} + \bm{\nu}\left(\bm{f}\left(\bm{G}^{(l)}\right)\right) \in \mathbb{F}^{N \times D} \tag{11e}
\end{align}
$$

Here, $\bm{\Delta} \colon \mathbb{F}^* \to \mathbb{F}^*$ is a length-preserving function that computes attention weights and is applied row-wise, and $\bm{\nu} \colon \mathbb{F}^D \to \mathbb{F}^D$ is the layer normalization function (Ba et al., [2016](https://arxiv.org/html/2605.30523#bib.bib2)) applied position-wise. [Footnote 11: Following previous work, we assume that layer normalization can be applied to *parts* of the vector individually to enable gadgets such as the *layer hash norm* (Merrill & Sabharwal, [2024](https://arxiv.org/html/2605.30523#bib.bib34), [2025a](https://arxiv.org/html/2605.30523#bib.bib35)).] Further, $M \colon \mathbb{F}^{N \times N} \to (\mathbb{F} \cup \{-\infty\})^{N \times N}$ is the *masking function*. [Footnote 12: Following Li et al. ([2024b](https://arxiv.org/html/2605.30523#bib.bib30)) and Saunshi et al. ([2025](https://arxiv.org/html/2605.30523#bib.bib45)), we define masking with a function rather than an additive matrix since subtracting $B_{\mathbb{F}}$ from an arbitrary number in $\mathbb{F}$ does not necessarily result in $-B_{\mathbb{F}}$.]

We say that the $l$th layer $\bm{\tau}^{(l)}$ computes the function $\bm{\tau}^{(l)} \colon \mathbb{F}^{N \times D} \to \mathbb{F}^{N \times D}$, defined as $\bm{\tau}^{(l)} \colon \bm{H}^{(l-1)} \mapsto \bm{H}^{(l)}$ for $l \in \{1, \ldots, L\}$. We also denote with $\mathcal{T}$ the function $\mathcal{T} \colon \Sigma^* \to \mathbb{F}^{N \times D}$, defined as $\mathcal{T} \colon \bm{w} \mapsto \bm{H}^{(L)}$.

Both attention variants studied in the theoretical literature are special cases of *temperature-scaled softmax attention*:

$$\bm{\Delta}(\bm{x})_n \stackrel{\text{def}}{=} \mathrm{softmax}_\tau(\bm{x})_n = \frac{\exp(x_n / \tau)}{\sum_{n'=1}^{N} \exp(x_{n'} / \tau)} \text{ for } \bm{x} \in \mathbb{R}^N,\ n \in [N], \text{ and } \tau > 0. \tag{12}$$

The *temperature* $\tau = \tau(N) > 0$ is a per-model parameter that may depend on the input length and is included alongside the parameter matrices in the description of $\mathcal{T}_N$ produced by the construction machine $\mathcal{M}_1$ of [Def. A.5](https://arxiv.org/html/2605.30523#A1.Thmdefinition5); in particular, $\tau(N)$ must be computable within the same complexity class as the rest of the model. We view this as a natural strengthening of the standard definition: practical implementations apply softmax to scaled scores ($\bm{q}^\top \bm{k} / \sqrt{D}$, with the scaling absorbable into a temperature), and a length-dependent temperature is the standard mechanism by which SMATs approximate AHATs (Yang et al., [2026a](https://arxiv.org/html/2605.30523#bib.bib54)).

1. *Softmax attention* (with a general function $\tau$): We call such transformers *softmax-attention transformers* (SMATs).
2. *Average hard attention* ($\tau \to 0$): The limit yields
   $$\lim_{\tau \to 0} \mathrm{softmax}_\tau(\bm{x})_n = \begin{cases} \frac{1}{|\operatorname{argmax}(\bm{x})|} & \text{if } x_n = \max(\bm{x}), \\ 0 & \text{otherwise.} \end{cases} \tag{13}$$
   We call such transformers *average hard-attention transformers* (AHATs).
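The relation between Eqs. (12) and (13) can be checked numerically. The sketch below is an illustration only: it uses ordinary Python floats rather than the fixed-point arithmetic of §A.3, and the function names `softmax_temp` and `average_hard_attention` are hypothetical.

```python
import math
from typing import List


def softmax_temp(x: List[float], tau: float) -> List[float]:
    """Temperature-scaled softmax of Eq. (12), in floating point."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    m = max(x)
    # Subtracting the maximum leaves the result unchanged and avoids overflow.
    exps = [math.exp((v - m) / tau) for v in x]
    z = sum(exps)
    return [e / z for e in exps]


def average_hard_attention(x: List[float]) -> List[float]:
    """The tau -> 0 limit of Eq. (13): uniform weight on the maximizers."""
    m = max(x)
    k = sum(1 for v in x if v == m)
    return [1.0 / k if v == m else 0.0 for v in x]


if __name__ == "__main__":
    scores = [1.0, 3.0, 3.0, 0.0]
    for tau in (1.0, 0.1, 0.01):
        print(tau, [round(p, 4) for p in softmax_temp(scores, tau)])
    print("limit", average_hard_attention(scores))
```

As `tau` shrinks, the softmax weights concentrate on the two tied maximal scores and approach the uniform split over them.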

##### Fixed\-point arithmetic transformers\.

All transformer operations are performed using fixed-point arithmetic from [§A.3](https://arxiv.org/html/2605.30523#A1.SS3) for some precision $b$, which can depend on the input length $N$. We focus on three regimes:

- Constant precision: $b = \Theta(1)$.
- Logarithmic precision: $b = \Theta(\log N)$.
- Polynomial precision: $b = \mathtt{poly}(N)$.

We also allow our transformers to use *higher* precision when computing individual components, such as the attention sub-layer or the position-wise MLP: a component may perform its internal arithmetic at a precision $b' \geq b$ as long as $b' = \Theta(b)$ and its output is then *clamped* back to the residual-stream precision $\mathbb{F}_b$ when written between components. This allows, for instance, the attention weights (which can be sensitive to high or low activations) to be computed more precisely, and the construction of equivalent transformers to manipulate intermediate quantities (such as rescaled query matrices) that may not be exactly representable in $\mathbb{F}_b$. At the granularity of the precision regimes we consider (constant, logarithmic, or polynomial precision), this mixed-precision allowance does *not* change the expressivity of the transformer, since the residual-stream precision $b$ remains the regime's defining quantity and $b' = \Theta(b)$ stays inside the same regime. It does, however, considerably simplify some constructions by removing the need to fit auxiliary computations exactly into $\mathbb{F}_b$. This aligns with the modern practice of training large models with quantized weights while allowing certain activations (such as the attention weights) to use higher precision (Groeneveld et al., [2024](https://arxiv.org/html/2605.30523#bib.bib18); Merrill & Sabharwal, [2025b](https://arxiv.org/html/2605.30523#bib.bib36)).

##### $b$-precision.

Many existing expressivity results rely on the exact nature of the arithmetic used. Merrill & Sabharwal ([2025b](https://arxiv.org/html/2605.30523#bib.bib36)) (and subsequent work) abstract the datatype, requiring only that the operations are $b$-precise in the following sense.

###### Definition A\.4\.

Let $f \colon \mathbb{R}^N \to \mathbb{R}$ be a function with the $\mathbb{F}_b$ realization $\widetilde{f} \colon \mathbb{F}_b^N \to \mathbb{F}_b$. We say that $\widetilde{f}$ is $b$-precise if, for any $x_1, \ldots, x_N \in \mathbb{F}_b$,

$$\texttt{round}(f(x_1, \ldots, x_N)) = \widetilde{f}(x_1, \ldots, x_N). \tag{14}$$

By definition of rounding to the nearest representable number, the components of a transformer are $b$-precise. Moreover, for any growing precision ($b(N) = \Omega(\log N)$), attention is also $b$-precise (Merrill & Sabharwal, [2025b](https://arxiv.org/html/2605.30523#bib.bib36)). [Footnote 13: Since we derive all results for constant-precision transformers without relying on $b$-precision, we do not require it from constant-precision transformers. $b$-precision of growing-precision transformers, however, allows us to apply all results from Merrill & Sabharwal ([2025b](https://arxiv.org/html/2605.30523#bib.bib36)) and Merrill & Sabharwal ([2025a](https://arxiv.org/html/2605.30523#bib.bib35)) to that case.]

Fixed-point arithmetic thus both limits what can be computed and enables the construction of various gadgets that leverage the arithmetic to implement logical operations and attention patterns. We collect some useful gadgets in [App. B](https://arxiv.org/html/2605.30523#A2). The proofs of our main results then use these constructions to implement the necessary computations for simulating circuit classes with transformers and vice versa.

#### A.4.2 Transformer Families and Uniformity

Analogously to circuit families, each string length $N$ is processed by a separate transformer. To process all of $\Sigma^*$, we therefore define a *transformer family* $\{\mathcal{T}_N\}_{N \in \mathbb{N}}$ as a sequence of transformers where each $\mathcal{T}_N$ processes strings of length $N$. Further, we impose a uniformity condition on the family, which will relate the transformers for different input lengths.

###### Definition A.5 (Uniform transformer families; variant of London & Kanade, [2025](https://arxiv.org/html/2605.30523#bib.bib32), Def. 3.6).

Let $\mathtt{X}$ be a computational complexity class. A transformer family $\{\mathcal{T}_N\}$ is *$\mathtt{X}$-uniform* if there exist Turing machines $\mathcal{M}_1$ and $\mathcal{M}_2$ whose resource usage is constrained by the complexity class $\mathtt{X}$ such that:

1. $\mathcal{M}_1$ takes input $1^N$ and outputs a description of $\mathcal{T}_N$, and
2. $\mathcal{M}_2$ takes input $(1^N, \texttt{B}(n))$ and outputs $\bm{p}(n, N)$.

[Def. A.5](https://arxiv.org/html/2605.30523#A1.Thmdefinition5) allows for size-dependent transformers while keeping them related (as the same Turing machines must construct them for all $N$). It also facilitates natural connections to uniform circuit classes (cf. [§A.2](https://arxiv.org/html/2605.30523#A1.SS2)) (London & Kanade, [2025](https://arxiv.org/html/2605.30523#bib.bib32)). All our results concern $\mathtt{L}$-uniform transformer families, in which case the Turing machines in [Def. A.5](https://arxiv.org/html/2605.30523#A1.Thmdefinition5) operate in logspace.

##### Information\-rich positional encodings\.

[Def. A.5](https://arxiv.org/html/2605.30523#A1.Thmdefinition5) defines two components of a transformer based on the notion of uniform computability: the transformer model itself and the PE. Although superficially similar, these two components play distinctly different roles in our constructions. The transformer model intuitively defines the "algorithm" that processes the input string, and its operations are limited to those implementable by the attention mechanism. For example, although it is *constructed* by a logspace Turing machine, it cannot *compute* arbitrary logspace functions. The PEs, in contrast, provide a way to inject additional information into the model that it cannot compute itself *on the fly*. They can, for example, provide direct access to the binary representation of the position index $n$ and pre-compute arithmetic operations such as division and modulo. This way, PEs provide the transformer with information that it cannot necessarily compute itself, but that is still computable in logspace.

##### Language recognition\.

We treat a transformer $\mathcal{T}$ as a *language encoder*, that is, a length-preserving function $\Sigma^* \to (\mathbb{F}^D)^*$ (Cotterell et al., [2024](https://arxiv.org/html/2605.30523#bib.bib11); Chan et al., [2024](https://arxiv.org/html/2605.30523#bib.bib8)), and regard the output $\bm{H}^{(L)}$ on a string $\bm{w}$ as a $|\bm{w}| \times D$ matrix, where each row corresponds to the contextual representation of the symbol at the corresponding position. To convert this into a language recognizer, we use the output layer $\bm{o}$ that maps the contextual representation of the *final symbol* to the membership (classification) decision $0$ or $1$.

### A.5 Looped Padded Transformers

Looped (or universal) transformers use a fixed block of transformer layers that is applied repeatedly to the input string (Dehghani et al., [2019](https://arxiv.org/html/2605.30523#bib.bib12); Giannou et al., [2023](https://arxiv.org/html/2605.30523#bib.bib16); Hao et al., [2024](https://arxiv.org/html/2605.30523#bib.bib20); Goyal et al., [2024](https://arxiv.org/html/2605.30523#bib.bib17); Chen et al., [2025](https://arxiv.org/html/2605.30523#bib.bib9); Zeng et al., [2026](https://arxiv.org/html/2605.30523#bib.bib56); Geiping et al., [2025](https://arxiv.org/html/2605.30523#bib.bib15)). Looping increases the effective depth of the model, enabling more complex reasoning by applying layers multiple times, without increasing the model size: the same block is reused in each iteration, which reduces the memory footprint and computational cost (Bae et al., [2026](https://arxiv.org/html/2605.30523#bib.bib3)). We define looped transformers as follows.

###### Definition A.6 (Looped transformer).

Let $L, T \in \mathbb{N}$ and let $1 \leq l_1 \leq l_2 \leq L$. Given a depth-$L$ transformer, a *looped transformer* computes symbol contextual representations $\bm{H}$ by

1. Computing the initial hidden states $\bm{H}^{(0)}$ for the input string $\bm{w} = w_1 \cdots w_N$ and computing $\bm{H}^{(l_1)}$ using the first $l_1$ layers of the transformer.
2. Applying the transformer layers $l_1 + 1, \ldots, l_2$ a total of $T$ times to the hidden states $\bm{H}^{(l_1)}$ to obtain $\bm{H}^{(l_1 + T(l_2 - l_1))}$.
3. Applying the transformer layers $l_2 + 1, \ldots, L$ to the hidden states $\bm{H}^{(l_1 + T(l_2 - l_1))}$ to obtain the final representations $\bm{H}$ that are passed to the output layer.

The dynamic computational depth of looped transformers lets them perform more complex reasoning tasks by iteratively refining their hidden states across multiple timesteps. Importantly, these reasoning steps combine sequential and parallel processing of the input symbols, yielding both parallel efficiency and reasoning depth. For brevity, [Def. A.6](https://arxiv.org/html/2605.30523#A1.Thmdefinition6) states a single looped block, but the definition extends in the natural way to a finite number of sequentially composed looped blocks $[l_1^{(k)}, l_2^{(k)}]$ with their own iteration counts $T^{(k)}$; this form resembles that learned by program-of-layers architectures (Li et al., [2026](https://arxiv.org/html/2605.30523#bib.bib31)) and is the form we use in our constructions when convenient.
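The layer schedule of Def. A.6 can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each layer is treated as an opaque function on hidden states, and the names `run_looped` and `effective_depth` as well as the toy "trace" layers are hypothetical, not part of the paper's formalism.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in: a "layer" is any function from hidden states to hidden states.
Layer = Callable[[list], list]


def run_looped(layers: Sequence[Layer], h0: list, l1: int, l2: int, T: int) -> list:
    """Apply layers 1..l1 once, layers l1+1..l2 T times, then layers l2+1..L once."""
    L = len(layers)
    assert 1 <= l1 <= l2 <= L, "Def. A.6 requires 1 <= l1 <= l2 <= L"
    h = h0
    for layer in layers[:l1]:        # step 1: prefix layers
        h = layer(h)
    for _ in range(T):               # step 2: the looped block, reused T times
        for layer in layers[l1:l2]:
            h = layer(h)
    for layer in layers[l2:]:        # step 3: suffix layers
        h = layer(h)
    return h


def effective_depth(L: int, l1: int, l2: int, T: int) -> int:
    """Total number of layer applications, including the suffix layers."""
    return l1 + T * (l2 - l1) + (L - l2)


# Toy layers that only record which layer index was applied.
trace_layers: List[Layer] = [lambda h, i=i: h + [i] for i in (1, 2, 3)]
print(run_looped(trace_layers, [], l1=1, l2=2, T=3))  # -> [1, 2, 2, 2, 3]
```

Note that the parameter count depends only on $L$, while the number of layer applications grows with $T$.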

Padded transformers additionally pad the input string with padding (pause) symbols (Pfau et al., [2024](https://arxiv.org/html/2605.30523#bib.bib42); Goyal et al., [2024](https://arxiv.org/html/2605.30523#bib.bib17); London & Kanade, [2025](https://arxiv.org/html/2605.30523#bib.bib32)). The number of padding symbols can depend on the input length.

###### Definition A.7 (Padded Transformer).

Given a (looped) transformer $\mathcal{T}$ and a *padding length function* $P \colon \mathbb{N} \to \mathbb{N}$, a *padded transformer* is the pair $(\mathcal{T}, P)$ that computes the contextual representations $\bm{H}$ of a string $\bm{w} \in \Sigma^*$ by running $\mathcal{T}$ on the padded input $\bm{w} \circ \underbrace{\square \cdots \square}_{P(|\bm{w}|)}$, where $\square \notin \Sigma$ is a designated padding symbol.

Instead of being restricted to the contextual representations of the $N$ input symbols, a padded transformer can determine string membership or symbol probabilities based on the contextual representations of the $P(N)$ additional padded symbols as well. This additional space can be used to perform more operations and is analogous to increasing the circuit width in circuit complexity.
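As a small illustration of Def. A.7, the padded input can be built as follows. The token `PAD`, the helper `pad_input`, and the quadratic padding function are assumptions made for this sketch only; the definition itself only requires a padding symbol outside $\Sigma$ and some function $P$.

```python
from typing import Callable, List

PAD = "<pad>"  # hypothetical stand-in for the padding symbol, assumed not to be in the alphabet


def pad_input(w: List[str], P: Callable[[int], int]) -> List[str]:
    """Append P(|w|) padding symbols to the input string w."""
    assert PAD not in w, "the padding symbol must lie outside the alphabet"
    return list(w) + [PAD] * P(len(w))


# Quadratic padding, chosen only for illustration.
padded = pad_input(list("abba"), lambda n: n * n)
print(len(padded))  # 4 input symbols + 16 padding symbols -> 20
```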

## Appendix B Theoretical Gadgets

This section contains various (known) theoretical gadgets that we use in the proofs of the main results. In the following, $N \in \mathbb{N}$ always refers to the length of the original input string. We define the *interleaving* of the vectors $\bm{x}, \bm{y} \in \mathbb{R}^D$ as $\bm{x}^\frown \bm{y} \in \mathbb{R}^{2D}$ where

$$(\bm{x}^\frown \bm{y})_d \stackrel{\text{def}}{=} \begin{cases} x_{(d+1)/2} & \text{if } d \text{ is odd}, \\ y_{d/2} & \text{otherwise}. \end{cases} \tag{15}$$
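For concreteness, Eq. (15) corresponds to the following sketch; the function name `interleave` is illustrative. Python lists are 0-indexed, so the 1-indexed odd entries of the result come from $\bm{x}$ and the even entries from $\bm{y}$.

```python
from typing import List, Sequence


def interleave(x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Return (x_1, y_1, x_2, y_2, ..., x_D, y_D) for two vectors of equal length D."""
    assert len(x) == len(y), "interleaving is defined for vectors of the same dimension"
    out: List[float] = []
    for xi, yi in zip(x, y):
        out.extend([xi, yi])
    return out


print(interleave([1, 2, 3], [10, 20, 30]))  # -> [1, 10, 2, 20, 3, 30]
```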
The following lemma, due to Merrill & Sabharwal ([2023](https://arxiv.org/html/2605.30523#bib.bib33), Lem. 1) (where it was adapted from Hao et al. ([2022](https://arxiv.org/html/2605.30523#bib.bib21))) and used as Lem. C.2 by London & Kanade ([2025](https://arxiv.org/html/2605.30523#bib.bib32)), lets us turn any logspace-computable function into an $\mathtt{L}$-uniform $\mathtt{AC}^0$ circuit.

###### Lemma B.1 (Merrill & Sabharwal, [2023](https://arxiv.org/html/2605.30523#bib.bib33), Lem. 1).

Let $f \colon \{0,1\}^* \to \{0,1\}^m$ be a function computable in linear space. Then, for any constant $c \in \mathbb{R}_{>0}$, there is a Turing machine that, on input $1^N$, uses at most $c \log N + \log m$ space to output the description of a depth-$3$ $\mathtt{AC}^0$ circuit of size at most $N^c + c \log N + m$ computing $f$ on inputs of size at most $c \log N$.

### B.1 Positional Encodings

The following lemmata follow from the definition of fixed\-point arithmetic and the rounding and thresholding applied therein\.

###### Lemma B\.2\.

Let $x \in \mathbb{F}_b$ for some $b \in \mathbb{N}$ such that $x > \log 2\,(b+1)$. Then, it holds that

$$\begin{aligned} \exp(x) &= B_{\mathbb{F}}, && \text{(16a)} \\ \exp(-x) &= 0. && \text{(16b)} \end{aligned}$$

###### Lemma B.3 (Generalization of Li et al., [2024b](https://arxiv.org/html/2605.30523#bib.bib30), Lem. E.3).

For $N \in \mathbb{N}$, $n \in [N]$, define the vectors $\bm{q}_n \in \mathbb{R}^{2\lceil \log N \rceil}$ and $\bm{k}_n \in \mathbb{R}^{2\lceil \log N \rceil}$ as follows:

$$\begin{aligned} \bm{q}_n &\stackrel{\text{def}}{=} \frac{B_{\mathbb{F}}}{m} \cdot \left( \texttt{B}^{\pm}(n_{\textrm{tgt}})^\frown \bm{1}_{\lceil \log N \rceil} \right) && \text{(17a)} \\ \bm{k}_{n'} &\stackrel{\text{def}}{=} \texttt{B}^{\pm}(n')^\frown \left( -\bm{1}_{\lceil \log N \rceil} \right). && \text{(17b)} \end{aligned}$$

Then, it holds that

$$\langle \bm{q}_n, \bm{k}_{n'} \rangle_b = \begin{cases} 0 & \text{if } n' = n_{\textrm{tgt}} \\ x & \text{otherwise}, \end{cases} \tag{18}$$

where $x \leq -\frac{2 B_{\mathbb{F}}}{m}$.
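The mechanism behind Eq. (18) can be checked numerically in exact arithmetic. The sketch below rests on assumptions not stated in this section: `signed_bits` is a hypothetical realization of $\texttt{B}^{\pm}$ that maps each bit of the $\lceil \log N \rceil$-bit binary representation to $\pm 1$, `scale` stands in for the factor $B_{\mathbb{F}}/m$, positions range over $1, \ldots, N$, and rounding and saturation are ignored. Under these assumptions the score equals `-2 * scale` times the Hamming distance between the two binary representations.

```python
import math
from typing import List


def signed_bits(n: int, width: int) -> List[int]:
    """Hypothetical B^±: the width-bit binary representation of n with 1 -> +1 and 0 -> -1."""
    return [1 if (n >> i) & 1 else -1 for i in reversed(range(width))]


def interleave(x: List[float], y: List[float]) -> List[float]:
    """(x_1, y_1, x_2, y_2, ...), as in Eq. (15)."""
    out: List[float] = []
    for xi, yi in zip(x, y):
        out.extend([xi, yi])
    return out


def score(n_tgt: int, n_prime: int, N: int, scale: float = 1.0) -> float:
    """Exact-arithmetic inner product <q_n, k_n'>; `scale` plays the role of B_F / m."""
    width = max(1, math.ceil(math.log2(N)))
    q = [scale * v for v in interleave(signed_bits(n_tgt, width), [1] * width)]
    k = interleave(signed_bits(n_prime, width), [-1] * width)
    return sum(a * b for a, b in zip(q, k))


N = 10
print(score(5, 5, N))                                   # -> 0.0 on the target position
print(max(score(5, n, N) for n in range(1, N + 1) if n != 5))  # -> -2.0 (a one-bit difference)
```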

###### Lemma B.4 (Constant-sized positional encodings).

Let $N \in \mathbb{N}$, $b \geq 5 \log N$, and $\mu \colon [N] \to \mathbb{F}_b^2$ be defined by

$$\mu \colon n \mapsto \begin{pmatrix} \sqrt{\frac{1}{n}} \\ \sqrt{1 - \frac{1}{n}} \end{pmatrix} \tag{19}$$

where all operations are computed in $b$-bit fixed-point arithmetic. Then, for all $n, n' \in [N]$ with $n \neq n'$:

$$\langle \mu(n), \mu(n) \rangle_b - \langle \mu(n), \mu(n') \rangle_b \geq \frac{1}{2N^4} - C\sqrt{N} \cdot 2^{-b} \geq \frac{1}{4N^4}, \tag{20}$$

where $C > 0$ is an absolute constant.

###### Proof\.

By Merrill & Sabharwal ([2024](https://arxiv.org/html/2605.30523#bib.bib34), Lem. 8), in exact arithmetic:

$$\mu(n)^\top \mu(n) - \mu(n)^\top \mu(n') \geq \frac{1}{2\max(n, n')^4} \geq \frac{1}{2N^4}. \tag{21}$$

We now bound the error introduced by fixed-point computation. Let $\texttt{round}(\cdot)$ denote rounding to $b$ bits of precision, satisfying $|\texttt{round}(x) - x| \leq 2^{-b-1} \stackrel{\text{def}}{=} \zeta$. We compute $\langle \mu(n), \mu(n') \rangle_b$ for $n, n' \in [N]$ and track errors through each operation. Let $a = \texttt{round}(\frac{1}{n})$ and $a' = \texttt{round}(\frac{1}{n'})$, so $|a - \frac{1}{n}| \leq \zeta$ and $|a' - \frac{1}{n'}| \leq \zeta$.

##### Square root\.

For $x \in [\delta, 1]$ with $\delta > 0$, the function $\sqrt{x}$ has derivative $\frac{1}{2\sqrt{x}} \leq \frac{1}{2\sqrt{\delta}}$. By the mean value theorem, if $|x' - x| \leq \epsilon$, then $|\sqrt{x'} - \sqrt{x}| \leq \frac{\epsilon}{2\sqrt{\delta}}$. Since $n \geq 1$, we have $\frac{1}{n} \in (0, 1]$ and $1 - \frac{1}{n} \in [0, 1)$. For $n \geq 2$, both arguments to $\sqrt{\cdot}$ are bounded away from 0 by at least $1/N$. For $n = 1$, we have $\mu(1) = \begin{pmatrix} 1 & 0 \end{pmatrix}^\top$ exactly. Thus, for $n \geq 2$ (where, in the remainder of this proof, the plain symbols $b$ and $b'$ name rounded square roots and are distinct from the precision $b$ that appears in exponents and in $\langle \cdot, \cdot \rangle_b$):

- Let $b = \texttt{round}(\sqrt{a})$. Then $|b - \sqrt{\frac{1}{n}}| \leq \frac{1}{2\sqrt{1/N}} \cdot \zeta + \zeta = (\frac{\sqrt{N}}{2} + 1)\zeta$.
- Let $c = \texttt{round}(1 - a)$. Then $|c - (1 - \frac{1}{n})| \leq 2\zeta$.
- Let $d = \texttt{round}(\sqrt{c})$. Then $|d - \sqrt{1 - \frac{1}{n}}| \leq \frac{\sqrt{N}}{2} \cdot 2\zeta + \zeta = (\sqrt{N} + 1)\zeta$.

##### Inner product\.

The inner product $\langle \mu(n), \mu(n') \rangle_b = b \cdot b' + d \cdot d'$ involves multiplying values in $[0, 1]$. For $x, y, x', y' \in [0, 1]$:

$$|x'y' - xy| = |x'y' - x'y + x'y - xy| \leq |x'||y' - y| + |y||x' - x| \leq |y' - y| + |x' - x|. \tag{22}$$

Combining all error terms and accounting for final rounding in the multiplications and addition, there exists an absolute constant $C' > 0$ such that

$$\left\lvert \langle \mu(n), \mu(n') \rangle_b - \mu(n)^\top \mu(n') \right\rvert \leq C'\sqrt{N} \cdot \zeta \tag{23}$$

for all $n, n' \in [N]$.

##### Combining the bounds\.

For $n = 1$, the arithmetic is exact and the bounds hold trivially. For $n \geq 2$: In the worst case, rounding decreases the self-product and increases the cross-product:

$$\begin{aligned} \langle \mu(n), \mu(n) \rangle_b &\geq \mu(n)^\top \mu(n) - C'\sqrt{N} \cdot \zeta = 1 - C'\sqrt{N} \cdot \zeta, && \text{(24a)} \\ \langle \mu(n), \mu(n') \rangle_b &\leq \mu(n)^\top \mu(n') + C'\sqrt{N} \cdot \zeta. && \text{(24b)} \end{aligned}$$

Therefore,

$$\begin{aligned} \langle \mu(n), \mu(n) \rangle_b - \langle \mu(n), \mu(n') \rangle_b &\geq \mu(n)^\top \mu(n) - \mu(n)^\top \mu(n') - 2C'\sqrt{N} \cdot \zeta && \text{(25a)} \\ &\geq \frac{1}{2N^4} - 2C'\sqrt{N} \cdot 2^{-b-1} && \text{(25b)} \\ &= \frac{1}{2N^4} - C'\sqrt{N} \cdot 2^{-b}. && \text{(25c)} \end{aligned}$$

Taking $C = C'$ and noting that $b \geq 5 \log N$ implies $C\sqrt{N} \cdot 2^{-b} \leq C\sqrt{N} \cdot N^{-5} = CN^{-9/2} \leq \frac{1}{4N^4}$ for sufficiently large $N$, we obtain

$$\langle \mu(n), \mu(n) \rangle_b - \langle \mu(n), \mu(n') \rangle_b \geq \frac{1}{4N^4}. \tag{26}$$

∎

The following follows readily\.

###### Corollary B.1.

Let $N \in \mathbb{N}$, $n, n' \in [N]$, $b \geq 6 \log N$, and let $n_{\textrm{tgt}} \in [N]$ be a target position. Define $\bm{q}_n, \bm{k}_{n'} \in \mathbb{R}^3$ by

$$\bm{q}_n \stackrel{\text{def}}{=} \begin{pmatrix} N^5 \sqrt{\frac{1}{n_{\textrm{tgt}}}} \\ N^5 \sqrt{1 - \frac{1}{n_{\textrm{tgt}}}} \\ -N^5 \end{pmatrix} \quad \text{and} \quad \bm{k}_{n'} \stackrel{\text{def}}{=} \begin{pmatrix} \sqrt{\frac{1}{n'}} \\ \sqrt{1 - \frac{1}{n'}} \\ 1 \end{pmatrix}, \tag{27}$$

where all operations are computed in $b$-bit fixed-point arithmetic. Then, for all $n' \in [N]$:

$$\begin{cases} \left\lvert \langle \bm{q}_n, \bm{k}_{n'} \rangle_b \right\rvert \leq C N^5 \sqrt{N} \cdot 2^{-b-1} \leq C N^{-\frac{1}{2}} & \text{if } n' = n_{\textrm{tgt}} \\ \langle \bm{q}_n, \bm{k}_{n'} \rangle_b \leq -C' N & \text{otherwise} \end{cases} \tag{28}$$

for some absolute constants $C, C' > 0$ and sufficiently large $N$.

###### Proof\.

Write $\bm{u}_n \stackrel{\text{def}}{=} \mu(n) = (u_{n,1}, u_{n,2}) \in \mathbb{R}^2$ for the unit-length PE of [Lem. B.4](https://arxiv.org/html/2605.30523#A2.Thmlemma4) and $\bm{u}'_n = (u'_{n,1}, u'_{n,2}) \in \mathbb{R}^2$ for its fixed-point realization (so $\bm{k}_{n'} = (u'_{n',1}, u'_{n',2}, 1)$ and $\bm{q}_n = (N^5 u'_{n_{\textrm{tgt}},1}, N^5 u'_{n_{\textrm{tgt}},2}, -N^5)$). By construction, $\langle \bm{q}_n, \bm{k}_{n'} \rangle_b = N^5 \cdot \big( \langle \bm{u}'_{n_{\textrm{tgt}}}, \bm{u}'_{n'} \rangle_b - 1 \big)$ up to the final rounding step.

For $n' = n_{\textrm{tgt}}$, the proof of [Lem. B.4](https://arxiv.org/html/2605.30523#A2.Thmlemma4) gives $\left\lvert \langle \bm{u}'_{n_{\textrm{tgt}}}, \bm{u}'_{n_{\textrm{tgt}}} \rangle_b - 1 \right\rvert \leq C_0 \sqrt{N} \cdot 2^{-b-1}$ for an absolute constant $C_0 > 0$ (the fixed-point error accumulated in the inner product). Scaling by $N^5$ and absorbing the outer rounding step into the constant gives $\left\lvert \langle \bm{q}_n, \bm{k}_{n_{\textrm{tgt}}} \rangle_b \right\rvert \leq C N^5 \sqrt{N} \cdot 2^{-b-1}$ for some absolute $C > 0$. For $b \geq 6 \log N$, this is at most $C N^{11/2 - 6} = C N^{-1/2}$.

For $n' \neq n_{\textrm{tgt}}$, [Lem. B.4](https://arxiv.org/html/2605.30523#A2.Thmlemma4) gives $\langle \bm{u}'_{n_{\textrm{tgt}}}, \bm{u}'_{n_{\textrm{tgt}}} \rangle_b - \langle \bm{u}'_{n_{\textrm{tgt}}}, \bm{u}'_{n'} \rangle_b \geq \tfrac{1}{4N^4}$, and by the above $\langle \bm{u}'_{n_{\textrm{tgt}}}, \bm{u}'_{n_{\textrm{tgt}}} \rangle_b \leq 1 + C_0 \sqrt{N} \cdot 2^{-b-1}$. Combining these using $b \geq 6 \log N$, $\langle \bm{u}'_{n_{\textrm{tgt}}}, \bm{u}'_{n'} \rangle_b - 1 \leq -\tfrac{1}{4N^4} + C_0 \sqrt{N} \cdot 2^{-b-1} \leq -\tfrac{1}{8N^4}$ for sufficiently large $N$. Multiplying by $N^5$ yields $\langle \bm{q}_n, \bm{k}_{n'} \rangle_b \leq -\tfrac{N}{8}$, which gives the claim with $C' = \tfrac{1}{8}$ (absorbing the final rounding into $C'$). ∎
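The bounds of Lem. B.4 and Cor. B.1 can be sanity-checked numerically for one small instance. The following is a sketch under explicit assumptions rather than the paper's arithmetic model: fixed-point numbers are modelled as Python floats rounded to multiples of $2^{-b}$ with no overflow or saturation, logarithms are taken base 2, and the helper names (`fx`, `mu_fixed`, `dot_fixed`, `min_gap`, `score`) are hypothetical. It checks the stated inequalities for a single $N$ and says nothing about the absolute constants.

```python
import math
from typing import Tuple

Vec2 = Tuple[float, float]


def fx(x: float, b: int) -> float:
    """Round x to the nearest multiple of 2**-b (assumed fixed-point model, no saturation)."""
    return round(x * 2**b) / 2**b


def mu_fixed(n: int, b: int) -> Vec2:
    """Fixed-point realization of mu(n) = (sqrt(1/n), sqrt(1 - 1/n)), rounding after each step."""
    a = fx(1.0 / n, b)
    return (fx(math.sqrt(a), b), fx(math.sqrt(fx(1.0 - a, b)), b))


def dot_fixed(u: Vec2, v: Vec2, b: int) -> float:
    """Inner product with rounding after each multiplication and after the addition."""
    return fx(fx(u[0] * v[0], b) + fx(u[1] * v[1], b), b)


def min_gap(N: int, b: int) -> float:
    """Smallest value of <mu(n), mu(n)>_b - <mu(n), mu(n')>_b over all n != n' in 1..N."""
    gaps = []
    for n in range(1, N + 1):
        un = mu_fixed(n, b)
        self_product = dot_fixed(un, un, b)
        for m in range(1, N + 1):
            if m != n:
                gaps.append(self_product - dot_fixed(un, mu_fixed(m, b), b))
    return min(gaps)


def score(n_tgt: int, n_prime: int, N: int, b: int) -> float:
    """Attention score <q_n, k_n'>_b for the 3-dimensional vectors of Eq. (27)."""
    s = float(N**5)
    ut, un = mu_fixed(n_tgt, b), mu_fixed(n_prime, b)
    q = (fx(s * ut[0], b), fx(s * ut[1], b), -s)
    k = (un[0], un[1], 1.0)
    return fx(fx(q[0] * k[0], b) + fx(q[1] * k[1], b) + q[2] * k[2], b)


N = 16
b = 6 * int(math.log2(N))  # b = 24 satisfies b >= 6 log N for this N
print(min_gap(N, b) >= 1 / (4 * N**4))
print(score(7, 7, N, b), max(score(7, m, N, b) for m in range(1, N + 1) if m != 7))
```

In this instance the on-target score stays close to zero while every off-target score is strongly negative, which is the separation that the subsequent softmax argument relies on.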

The following lemma shows that the PEs defined above can be used to focus on individual positions of interest\.

###### Lemma B\.5\.

Let $\bm{q}_n$ and $\bm{k}_{n'}$ be defined either as in [Lem. B.3](https://arxiv.org/html/2605.30523#A2.Thmlemma3) or [Cor. B.1](https://arxiv.org/html/2605.30523#A2.Thmcorollary1) with target index $n_{\textrm{tgt}} \in [N]$ (with corresponding precision requirements). Let $\bm{s} = \left( \langle \bm{q}_n, \bm{k}_{n'} \rangle_b \right)_{n' \in [N]}$ for some $N \in \mathbb{N}$ and $n \in [N]$. Then,

$$\mathrm{softmax}(\bm{s})_{n'} = \begin{cases} 1 & \text{if } n' = n_{\textrm{tgt}} \\ 0 & \text{otherwise}. \end{cases} \tag{29}$$

###### Proof\.

The case of [Lem. B.3](https://arxiv.org/html/2605.30523#A2.Thmlemma3) follows directly from $\langle \bm{q}_n, \bm{k}_{n'} \rangle_b \leq -\frac{2 B_{\mathbb{F}}}{m}$ for $n' \neq n_{\textrm{tgt}}$.

To ensure that no attention is placed on positions $n' \neq n_{\textrm{tgt}}$ in [Cor. B.1](https://arxiv.org/html/2605.30523#A2.Thmcorollary1), it suffices to ensure $\langle \bm{q}_n, \bm{k}_{n'} \rangle_b \leq -\log 2\,(b+1)$ (cf. [Lem. B.2](https://arxiv.org/html/2605.30523#A2.Thmlemma2)) while $\langle \bm{q}_n, \bm{k}_{n_{\textrm{tgt}}} \rangle_b > -\log 2\,(b+1)$. Substituting the bounds from [Cor. B.1](https://arxiv.org/html/2605.30523#A2.Thmcorollary1), the off-target case satisfies $\langle \bm{q}_n, \bm{k}_{n'} \rangle_b \leq -C'N$, which falls below $-\log 2\,(b+1) = -\mathcal{O}(\log N)$ for sufficiently large $N$; the on-target case satisfies $\left\lvert \langle \bm{q}_n, \bm{k}_{n_{\textrm{tgt}}} \rangle_b \right\rvert \leq C N^{-1/2}$, so the on-target score is greater than $-\log 2\,(b+1)$ for sufficiently large $N$. Thus, with logarithmic precision of at least $6 \log N$, softmax attention exclusively attends to individual positions with large attention scores. ∎
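The thresholding effect used in this proof can be illustrated with a rounded softmax. The sketch assumes a simplified model that is not taken from the paper: each exponential and each quotient is rounded to the nearest multiple of $2^{-b}$, and saturation at $B_{\mathbb{F}}$ is ignored. The names `fx` and `rounded_softmax` are hypothetical. With $b = 24$, the cutoff $\log 2\,(b+1)$ is roughly $17.3$, so scores below about $-17.3$ contribute exactly zero.

```python
import math
from typing import List


def fx(x: float, b: int) -> float:
    """Round x to the nearest multiple of 2**-b (assumed fixed-point model, no saturation)."""
    return round(x * 2**b) / 2**b


def rounded_softmax(scores: List[float], b: int) -> List[float]:
    """Softmax in which each exponential and each quotient is rounded to b fractional bits."""
    exps = [fx(math.exp(s), b) for s in scores]
    z = sum(exps)
    assert z > 0, "at least one score must survive rounding"
    return [fx(e / z, b) for e in exps]


b = 24
threshold = math.log(2) * (b + 1)  # scores below -threshold have exponentials that round to 0
print(threshold)
print(rounded_softmax([-20.0, 0.05, -37.0], b))  # -> [0.0, 1.0, 0.0]
```

A score such as $-10$, which lies above the cutoff, would still receive a small nonzero weight in this model, which is why the proof needs the off-target scores to fall below the cutoff.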

### B.2 Useful Attention Patterns

###### Lemma B.6 (Detecting a symbol occurrence).

There exists an $\mathtt{L}$-uniform $\mathtt{LPT}^{\mathtt{0}}_{\texttt{c},\texttt{l}}$ family of transformers $\{\mathcal{T}_N\}_{N \in \mathbb{N}}$ such that, for any $N \in \mathbb{N}$, on input $\bm{w} \in \Sigma^*$ of length $N$ and $w \in \Sigma$, $\mathcal{T}_N$'s residual stream at position $N$ contains the entry $\mathbb{1}\{w \in \bm{w}\}$.

###### Proof sketch.

Note that $\mathcal{T}_N$ cannot use the commonly used *exact* uniform attention over all symbols to detect $\mathbb{1}\{w \in \bm{w}\}$ due to constant precision. Nevertheless, constant-precision rounded uniform attention suffices. By attending to all symbols in the string with weight 1, the denominator of the attention scores is at most $B_{\mathbb{F}}$. Using one-hot encodings of symbols $w_n$ as the attention values $\bm{v}_n$, it is easy to see that the contextual representation at the final position will have a positive value at the entry corresponding to $w$ if and only if $w \in \bm{w}$, since $\frac{c}{B_{\mathbb{F}}} > 0$ for any $c \geq 1$. This condition can be checked by the MLP applied after the attention aggregation operation. This construction requires parameters that do not change with input length (since the size of the one-hot encodings is constant), and is thus logspace computable. ∎

###### Lemma B.7 (Reading binary positional encodings into the residual stream).

There exists an $\mathtt{L}$-uniform $\mathtt{LPT}^{\mathtt{1}}_{\texttt{c},\texttt{l}}$ family of transformers $\{\mathcal{T}_N\}_{N \in \mathbb{N}}$ such that, on input $\texttt{\&}\,\texttt{B}(n)$, $\mathcal{T}_N$'s residual stream at position $\lceil \log N \rceil + 1$ contains the value $\texttt{B}(n)$ in a designated $\lceil \log N \rceil$-dimensional sub-block for any $N \in \mathbb{N}$ and $n \in [N]$.

###### Proof sketch\.

The transformer𝒯N\{\{\\mathcal\{T\}\}\}\_\{\{N\}\}has to convert the binary representationB​\(n\)\{\{\{\\texttt\{B\}\}\\left\(\{\{n\}\}\\right\)\}\}ofn\{\{n\}\}contained across⌈log⁡N⌉\\lceil\\log\{\{N\}\}\\rceilpositions in the input string into a single⌈log⁡N⌉\\lceil\\log\{\{N\}\}\\rceil\-dimensional binary vector in the residual stream\. The construction uses a fixed single\-layer block that is looped⌈log⁡N⌉\\lceil\\log\{\{N\}\}\\rceiltimes—i\.e\., constant depth and𝒪​\(log⁡N\)\{\{\{\{\\mathcal\{O\}\}\}\(\\log\{\{N\}\}\)\}\}loop iterations of a single block, as required by𝙻𝙿𝚃c,l𝟷\{\{\\mathtt\{LPT\}\}^\{\\mathtt\{1\}\}\_\{\{\\texttt\{c\}\},\{\\texttt\{l\}\}\}\}\. We index loop iterations by a timestept∈\{1,…,⌈log⁡N⌉\}t\\in\{\\\{1,\\ldots,\\lceil\\log\{\{N\}\}\\rceil\\\}\}\.

1. At timestep $t = 1$ (the first application of the looped block), each symbol $w_{n'} \in \{0, 1\}$ checks if it is immediately preceded by the $\texttt{\&}$ symbol, which denotes the beginning of the pointer in the string. If it is, $w_{n'}$ stores $\bm{e}_1$ and $\bm{d}_1 \stackrel{\text{def}}{=} w_{n'} \bm{e}_1$ in designated parts of its residual stream. Here, $\bm{e}_1$ is the first unit vector of $\mathbb{R}^{\lceil \log N \rceil}$.
2. At each subsequent timestep $t \in \{2, \ldots, \lceil \log N \rceil\}$ (i.e., the $t$-th application of the same single-layer block), each symbol $w_{n'}$ checks if the entry $\bm{e}_{t-1}$ has already been written to the designated space of the previous symbol's residual stream. If it has, $w_{n'}$ copies and shifts $\bm{e}_{t-1}$ into $\bm{e}_t$, and stores $\bm{e}_t$ and $\bm{d}_t \stackrel{\text{def}}{=} \bm{d}_{t-1} + w_{n'} \bm{e}_t$ in designated parts of its residual stream.

After $\lceil \log N \rceil$ timesteps (i.e., $\lceil \log N \rceil$ applications of the looped block), the residual stream at position $\lceil \log N \rceil + 1$ thus contains $\texttt{B}(n)$.
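The two steps above can be traced with a small simulation. This is a minimal sketch, not the transformer itself: it assumes the input is the token `&` followed by the pointer bits, models each position's designated sub-blocks as Python lists, and uses the hypothetical helper names `unit` and `run_looped_block`. Positions are 0-indexed in the code, so the last pointer bit sits at index $\lceil \log N \rceil$, i.e., position $\lceil \log N \rceil + 1$ in the text.

```python
def unit(t, width):
    """Return the t-th unit vector (1-indexed) of the given width."""
    return [1 if i == t - 1 else 0 for i in range(width)]


def run_looped_block(tokens):
    """Apply the same block once per timestep and return the final d-slot.

    tokens: ['&', b_1, ..., b_k] with bits b_i in {0, 1}.
    e[pos] and d[pos] model the designated sub-blocks of each position.
    """
    width = len(tokens) - 1            # plays the role of ceil(log N)
    e = [None] * len(tokens)
    d = [None] * len(tokens)
    for t in range(1, width + 1):
        # All positions update synchronously from the previous state.
        new_e, new_d = list(e), list(d)
        for pos in range(1, len(tokens)):
            w = tokens[pos]
            if t == 1 and tokens[pos - 1] == "&":
                # Step 1: the symbol right after '&' writes e_1 and d_1 = w * e_1.
                new_e[pos] = unit(1, width)
                new_d[pos] = [w * u for u in new_e[pos]]
            elif t > 1 and e[pos - 1] == unit(t - 1, width):
                # Step 2: copy and shift e_{t-1} into e_t, then d_t = d_{t-1} + w * e_t.
                new_e[pos] = [0] + e[pos - 1][:-1]
                new_d[pos] = [a + w * u for a, u in zip(d[pos - 1], new_e[pos])]
        e, d = new_e, new_d
    return d[-1]
```

Calling `run_looped_block(["&", 1, 0, 1])` walks the unit vector one position to the right per iteration, so the last pointer position ends up holding all three bits in a single vector.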

This construction only requires PEs that contain $\texttt{B}(n)$ and $\texttt{B}(N-1)$, which are computable in logarithmic space. Moreover, the parameters of the transformer $\mathcal{T}_N$ only change with $N$ in terms of the size of the matrices, while their structure remains the same: they either project onto specific coordinates (whose indices can be computed with counters in a logspace Turing machine) or shift the coordinates of vectors, which can also be done in logspace. Thus, the family $\{\mathcal{T}_N\}_{N \in \mathbb{N}}$ is in $\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{\mathtt{1}}_{\texttt{c},\texttt{l}}$. ∎

### B.3 Layer Normalization

For compactness, our constructions ignore layer normalization (apart from those extending the constructions of Merrill & Sabharwal ([2025a](https://arxiv.org/html/2605.30523#bib.bib35)), which rely on layer normalization for implementing the layer hash norm). However, it is not difficult to account for it. For upper bounds, it suffices to see that, in the case of *constant precision* $b = \Theta(1)$, any position-wise operation (such as layer normalization) can be implemented by a finite lookup table, which can be hardcoded into the MLPs of the transformer. In the case of growing precision, all layer normalization operations (addition, division, square root, etc.) can be implemented in $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{\mathtt{0}}$ (Chiang, [2025](https://arxiv.org/html/2605.30523#bib.bib10)), ensuring that the same upper bounds still hold. For lower bounds, we can use the same construction as Li et al. ([2024b](https://arxiv.org/html/2605.30523#bib.bib30), §F.1), which minimally changes the constructed transformers to ensure that the layer normalization operation does not affect the computations.

Whenever we do refer to layer normalization, we use the standard numerically\-stable variant of layer normalization

$$\bm{\nu}(\bm{x}) \stackrel{\text{def}}{=} \frac{\bm{x} - \mu(\bm{x})}{\sqrt{\sigma^{2}(\bm{x}) + \varepsilon}}, \tag{30}$$

where $\mu(\bm{x})$ and $\sigma^{2}(\bm{x})$ denote the empirical mean and variance of the coordinates of $\bm{x}$, and $\varepsilon \geq 2^{-b}$ is a fixed stabilizer.
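As a concrete reading of Eq. 30, the following minimal sketch computes the stabilized layer normalization in ordinary floating point. It assumes the empirical variance divides by the number of coordinates, and it does not model the fixed-point rounding used elsewhere in the paper; the function name `layer_norm` is ours.

```python
import math


def layer_norm(x, eps):
    """Numerically stable layer normalization of a list of floats (Eq. 30)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)   # empirical variance
    # eps > 0 keeps the denominator away from zero even when var == 0.
    return [(v - mean) / math.sqrt(var + eps) for v in x]
```

The stabilizer matters exactly when all coordinates agree: `layer_norm([3.0, 3.0, 3.0], 2 ** -8)` has zero variance, and a positive `eps` turns what would be a division by zero into an all-zero output.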

## Appendix C Proofs

### C.1 Proofs of the Results on the Relationship between AHATs and SMATs

See Lemma [3.1](https://arxiv.org/html/2605.30523#S3.Thmlemma1)

###### Proof\.

The construction is to keep all parameters of $\mathcal{T}_N$ fixed and replace every average-hard attention layer by a temperature-scaled softmax attention layer with a suitably small temperature $\tau(N)$. We first establish, in $\mathbb{R}$-valued arithmetic, a per-layer error bound between $\mathcal{T}_N$ and its softmax counterpart in terms of the attention gap and the temperature. We then choose the temperature small enough that the per-coordinate error is below the fixed-point rounding margin, so that the rounding step in [§A.3](https://arxiv.org/html/2605.30523#A1.SS3) collapses the two models onto identical outputs. For brevity, write $b \stackrel{\text{def}}{=} b(N) = \Theta(\log N)$, $\mathbb{F} \stackrel{\text{def}}{=} \mathbb{F}_b$, and let $L$ denote the number of layers of $\mathcal{T}_N$.

##### Step 1: Gap and activation bounds\.

Following Yang et al. ([2026a](https://arxiv.org/html/2605.30523#bib.bib54), Def. 5), for an attention layer with scores $s_{n,1}, \ldots, s_{n,N} \in \mathbb{F}$ at position $n$ with maximum $s_{n,\max} \stackrel{\text{def}}{=} \max_{n'} s_{n,n'}$, the layer has gap $\gamma(N)$ at position $n$ if $s_{n,n'} \leq s_{n,\max} - \gamma(N)$ for every $n'$ with $s_{n,n'} < s_{n,\max}$. The gap of the layer is the largest such $\gamma(N)$ that works for all $n$ and all length-$N$ inputs, and the gap of the transformer is the minimum over its $L$ layers. Because every score is an element of $\mathbb{F}$ and any two distinct elements of $\mathbb{F}$ differ by at least $2^{-b}$, we have

$$\gamma(N) \;\geq\; 2^{-b} \;=\; \Omega(N^{-c}) \tag{31}$$

for some constant $c$ depending only on the constant hidden in $b = \Theta(\log N)$. Similarly, every activation $\bm{h}^{(l)}_{n} \in \mathbb{F}^{D}$ has every coordinate in $[-B_{\mathbb{F}}, B_{\mathbb{F}}]$ with $B_{\mathbb{F}} = 2^{b} - 2^{-b}$, so the analogue of the quantity $x_{\max}(N)$ of Yang et al. ([2026a](https://arxiv.org/html/2605.30523#bib.bib54)) (the maximum value in any entry of the residual stream on length-$N$ strings) satisfies

$$x_{\max}(N) \;\leq\; B_{\mathbb{F}} \;=\; \mathcal{O}(N^{c}). \tag{32}$$

The same bound applies to the parameter matrices $\bm{W}_Q^{(l)}, \bm{W}_K^{(l)}, \bm{W}_V^{(l)}$ and the MLP parameters, since they too are in $\mathbb{F}$.
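The gap definition can be checked numerically on a single score matrix. The sketch below is illustrative only: it computes the gap for one fixed input, whereas the definition quantifies over all length-$N$ inputs, and the names `gap_at_position` and `layer_gap` are ours. Scores are taken on the grid of multiples of $2^{-b}$, which is why the result can never fall below $2^{-b}$.

```python
def gap_at_position(scores):
    """Smallest distance between the maximum score and any strictly smaller score."""
    top = max(scores)
    below = [s for s in scores if s < top]
    # If every score ties with the maximum, no constraint applies.
    return min(top - s for s in below) if below else float("inf")


def layer_gap(score_rows):
    """Gap of a layer on one input: the minimum over all query positions."""
    return min(gap_at_position(row) for row in score_rows)
```

Ties with the maximum are excluded from the gap by definition, which matches average-hard attention splitting its weight evenly among tied maxima.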

Let $\mathcal{T}'_N$ be the candidate softmax model: it has the same parameters as $\mathcal{T}_N$ except that every attention layer uses $\mathrm{softmax}_{\tau}$ with temperature $\tau(N)$ in place of $\mathrm{ahardmax}$. Both models operate under the same fixed-point semantics from [§A.3](https://arxiv.org/html/2605.30523#A1.SS3), so every layer's output is rounded coordinate-wise to $\mathbb{F}^{D}$.

##### Step 2: Per-layer error in $\mathbb{R}$-valued arithmetic.

Fix any layer $l \in \{1, \ldots, L\}$ and any input $\bm{H}^{(l-1)} \in \mathbb{F}^{N \times D}$ shared by both models. Let $\bm{h}^{(l)}_{n}$ and $\widetilde{\bm{h}}^{(l)}_{n}$ denote the layer-$l$ outputs of $\mathcal{T}_N$ and $\mathcal{T}'_N$, respectively, when both are computed in idealized $\mathbb{R}$-valued arithmetic on this shared input (i.e., without the rounding step). By the per-layer bound of Yang et al. ([2026a](https://arxiv.org/html/2605.30523#bib.bib54), Lem. 25), specialized to zero input error since the two models share their layer input, there is a constant $K_1 \geq 1$ depending polynomially on $D$ and the maximum parameter magnitude such that, for every position $n$,

$$\left\lVert \bm{h}^{(l)}_{n} - \widetilde{\bm{h}}^{(l)}_{n} \right\rVert_{1} \;\leq\; K_1\, x_{\max}(N)\, N\, e^{-\gamma(N)/\tau(N)}. \tag{33}$$

##### Step 3: Choosing the temperature so a single layer’s error rounds away\.

We want the output of the implemented attention sub-layer to deviate from $\mathcal{T}_N$'s hardmax-attention output by strictly less than half the spacing $2^{-b}$ between adjacent elements of $\mathbb{F}$; this guarantees that every coordinate of $\widetilde{\bm{h}}^{(l)}_{n}$ lies strictly closer to the corresponding coordinate of $\bm{h}^{(l)}_{n} \in \mathbb{F}$ than to any other element of $\mathbb{F}$, and therefore rounds to that same element. Let $K_2 \geq 1$ be a constant bound on the per-layer Lipschitz constant of the remaining sub-layers (value projection, residual sum, MLP, Lipschitz layer normalization with stabilizer $\varepsilon$; cf. [§B.3](https://arxiv.org/html/2605.30523#A2.SS3)) so that they amplify $\mathbb{R}$-arithmetic errors by at most a factor of $K_2$. We solve $K_1 K_2\, x_{\max}(N)\, N\, e^{-\gamma(N)/\tau(N)} < 2^{-(b+2)}$ for $\tau(N)$. The threshold is tightened to $2^{-(b+2)}$ so that the Step 4 mixed-precision rounding (which contributes at most $2^{-(b+2)}$ per coordinate; cf. Step 4(ii)) and the Step 5 sub-layer amplification still leave a margin strictly below $2^{-(b+1)}$. This yields the condition

$$\frac{1}{\tau(N)} \;>\; \frac{1}{\gamma(N)} \log\left(2^{b+2}\, K_1 K_2\, x_{\max}(N)\, N\right). \tag{34}$$

Plugging in $b = \Theta(\log N)$, $\gamma(N) = \Omega(N^{-c})$ from [Eq. 31](https://arxiv.org/html/2605.30523#A3.E31), and $x_{\max}(N) = \mathcal{O}(N^{c})$ from [Eq. 32](https://arxiv.org/html/2605.30523#A3.E32) gives

$$\frac{1}{\tau(N)} \;\in\; \mathcal{O}(N^{c} \log N) \;\subseteq\; \mathcal{O}(N^{c+1}), \tag{35}$$

so $1/\tau(N)$ is polynomial in $N$ and logspace-computable. We fix $\tau(N) \in \mathbb{F}$ to be any value satisfying [Eq. 34](https://arxiv.org/html/2605.30523#A3.E34).
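To see the temperature condition of Eq. 34 at work, the following sketch compares temperature-scaled softmax weights with average-hard attention weights on a toy score vector. It is an illustration under stated assumptions, not the proof's construction: the constants `k1` and `k2` default to 1, the logarithm is taken to be natural (matching the $e^{-\gamma/\tau}$ term), everything runs in double precision rather than in $\mathbb{F}_b$, and all function names are ours.

```python
import math


def softmax_temp(scores, tau):
    """Softmax of scores / tau, shifted by the maximum for numerical stability."""
    top = max(scores)
    exps = [math.exp((s - top) / tau) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def ahardmax(scores):
    """Average-hard attention: uniform weight on all maximal scores."""
    top = max(scores)
    ties = [1.0 if s == top else 0.0 for s in scores]
    count = sum(ties)
    return [t / count for t in ties]


def inv_temperature_bound(b, gamma, x_max, n, k1=1.0, k2=1.0):
    """Right-hand side of Eq. 34; 1/tau must be strictly larger than this."""
    return math.log(2 ** (b + 2) * k1 * k2 * x_max * n) / gamma


def pick_inv_tau(bound):
    """Smallest power of two strictly above the bound, so tau is a negative power of two."""
    return 2 ** (math.floor(math.log2(bound)) + 1)
```

For instance, with `b = 4`, `gamma = 2 ** -4`, `x_max = 2 ** 4`, and `n = 8`, the bound evaluates to roughly 144, so `pick_inv_tau` returns 256. At that temperature the softmax weights on a score vector with gap $2^{-4}$ differ from the `ahardmax` weights by far less than $2^{-(b+2)}$.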

##### Step 4: $\mathcal{T}'_N$ as a simulator of $\mathcal{T}_N$.

We define $\mathcal{T}'_N$ to have the same parameters as $\mathcal{T}_N$ (same $\bm{W}_Q, \bm{W}_K, \bm{W}_V$, same MLP, same layer normalization, same PEs), at the same residual-stream precision $b$, with each attention layer using $\mathrm{softmax}_{\tau}$ at the temperature $\tau(N)$ from Step 3. Since $\tau(N)$ from Step 3 satisfies $1/\tau(N) \in \mathcal{O}(N^{c+1})$, choosing it to be a negative power of two (permitted since [Eq. 34](https://arxiv.org/html/2605.30523#A3.E34) only constrains $1/\tau(N)$ from below) makes $\tau(N)$ logspace-computable from $1^{N}$.

However, the unnormalized attention scores $s_{n,n'} = \bm{q}_{n}^{\top} \bm{k}_{n'} \in \mathbb{F}_b$ are scaled by $1/\tau(N)$ before being passed through the softmax, meaning that the scaled scores $s_{n,n'}/\tau(N)$ can be polynomially larger in magnitude than $B_{\mathbb{F}}$, and thus not representable in $\mathbb{F}_b$. We handle this with mixed precision (cf. [§A.4](https://arxiv.org/html/2605.30523#A1.SS4)), narrowed to a single component: the attention sub-layer computes the scaled scores and their softmax at a higher precision $b' \stackrel{\text{def}}{=} \kappa \cdot b$ for a constant $\kappa > 1$, after which the resulting attention weights and the value-weighted sum are clamped back to $\mathbb{F}_b$ when written to the residual stream. All other components of the transformer remain at the original precision $b$: the mixed precision is used only for the internal score-and-softmax computation. We pick $\kappa$ large enough that

1. (i) the scaled scores $s_{n,n'}/\tau(N)$ are exactly representable in $\mathbb{F}_{b'}$, and
2. (ii) the accumulated rounding error in computing the softmax and the value-weighted sum at precision $b'$ is at most $2^{-(b+2)}$ per output coordinate.

Using $1/\tau(N) \in \mathcal{O}(N^{c+1})$, $B_{\mathbb{F}} \in \mathcal{O}(N^{c})$, and $D \leq \mathtt{poly}(N)$, any constant $\kappa$ with $\kappa b \geq b + \log_2(N D/\tau(N)) + \Theta(1)$ suffices for both (i) and (ii); such a $\kappa$ depends only on the constant hidden in $b = \Theta(\log N)$.
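A quick calculation illustrates why a constant $\kappa$ is enough. The sketch below evaluates the smallest $\kappa$ allowed by the inequality above for a family of sizes $N = 2^{k}$. All concrete choices are assumptions made for illustration: $b = 2 \log_2 N$, $D = N$, $1/\tau(N) = N^{3}$, and the $\Theta(1)$ term modeled by a hypothetical `slack_bits` parameter.

```python
import math


def min_kappa(b, n, d, inv_tau, slack_bits=2):
    """Smallest kappa with kappa * b >= b + log2(n * d * inv_tau) + slack_bits."""
    needed_bits = b + math.log2(n * d * inv_tau) + slack_bits
    return needed_bits / b


def kappa_profile(exponents):
    """Evaluate min_kappa for N = 2**k under the assumed scalings b = 2k, D = N, 1/tau = N**3."""
    profile = []
    for k in exponents:
        n = 2 ** k
        profile.append(min_kappa(b=2 * k, n=n, d=n, inv_tau=n ** 3))
    return profile
```

Under these assumptions the required number of bits is $7k + 2$ against $b = 2k$, so the ratio stays bounded as $N$ grows, which is the sense in which $\kappa$ is a constant.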

##### Step 5: Layer-by-layer agreement in $\mathbb{F}_b$ and $\mathtt{L}\text{-uniformity}$ of $\mathcal{T}'_N$.

Bound (ii) above plays the same role as Step 2's per-layer error bound, but at the level of the implemented attention sub-layer rather than its idealized $\mathbb{R}$-arithmetic counterpart: at every layer $l$, the clamped output of $\mathcal{T}'_N$'s attention sub-layer matches $\mathcal{T}_N$'s hardmax-attention output to within $2^{-(b+2)}$ in $\mathbb{R}$. The remaining components amplify this by at most a factor of $K_2$, which is absorbed by the $K_2$ factor already in [Eq. 34](https://arxiv.org/html/2605.30523#A3.E34) (Step 3); the resulting total per-layer error remains strictly below the rounding threshold $2^{-(b+1)}$. Standard induction on $l$ then yields

$$\bm{H}^{\prime(l)} \;=\; \bm{H}^{(l)} \qquad \text{in } \mathbb{F}_b^{N \times D}, \tag{36}$$

for every $l$: the base case $l = 0$ holds because both models share the same embedding and PEs, and the inductive step uses the per-layer bound above with the shared input $\bm{H}^{(l-1)}$ given by the induction hypothesis to conclude that the $\mathbb{F}_b$ rounding of $\mathcal{T}'_N$'s layer-$l$ output collapses onto $\mathcal{T}_N$'s layer-$l$ output. In particular, the final outputs of $\mathcal{T}'_N$ and $\mathcal{T}_N$ coincide in $\mathbb{F}_b^{N \times D}$.
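The rounding argument behind the inductive step is elementary and can be checked directly. The sketch below assumes round-to-nearest onto the grid of multiples of $2^{-b}$ and ignores the clamping to $[-B_{\mathbb{F}}, B_{\mathbb{F}}]$; the function name `round_to_grid` is ours.

```python
def round_to_grid(x, b):
    """Round x to the nearest multiple of 2**-b (grid of b fractional bits)."""
    step = 2.0 ** -b
    return round(x / step) * step


def collapses(grid_value, error, b):
    """True if a perturbed value rounds back to the original grid value."""
    return round_to_grid(grid_value + error, b) == grid_value
```

Any perturbation of magnitude strictly below $2^{-(b+1)}$ leaves the rounded value unchanged, while a perturbation above that threshold moves it to a neighboring grid point. This is why the per-layer error budget in Step 3 is kept strictly under half the grid spacing.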

For $\mathtt{L}\text{-uniformity}$ of $\mathcal{T}'_N$: by construction, $\mathcal{T}'_N$ has the same parameters as $\mathcal{T}_N$ except for the additional temperature $\tau(N)$, which is logspace-computable from $1^{N}$ (a negative power of two with $\log_2(1/\tau(N)) = \mathcal{O}(\log N)$ bits). The construction machine $\mathcal{M}_1$ of [Def. A.5](https://arxiv.org/html/2605.30523#A1.Thmdefinition5) is therefore unchanged for the weight matrices and the MLP, and emits one additional logspace-computable quantity (the temperature); the PE machine $\mathcal{M}_2$ is unchanged. Hence $\mathcal{T}'_N$ is an $\mathtt{L}\text{-uniform}$ logarithmic-precision $\tau$-SMAT family whose outputs coincide with $\mathcal{T}_N$'s on every input by [Eq. 36](https://arxiv.org/html/2605.30523#A3.E36), as the lemma claims. ∎

### C.2 Proofs of the Constant-depth Results

See Lemma [4.1](https://arxiv.org/html/2605.30523#S4.Thmlemma1)

###### Proof\.

The proof follows London & Kanade ([2025](https://arxiv.org/html/2605.30523#bib.bib32), Thm. C.7), which constructs an $\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{\mathtt{0}}_{\texttt{l},\texttt{l}}$ SMAT family simulating a given $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{\mathtt{0}}$ circuit family $\{C_N\}_{N \in \mathbb{N}}$. Our construction reuses theirs verbatim except for the PE: wherever the original construction stores or queries a binary-valued (i.e., constant-precision) pointer of width $\lceil \log N \rceil$ (containing $\texttt{B}(n)$), we substitute the two-dimensional unit-length PE $\mu$ of [Eq. 19](https://arxiv.org/html/2605.30523#A2.E19). We describe the substitution concretely below.

##### Symbol types and original residual\-stream layout\.

Following London & Kanade ([2025](https://arxiv.org/html/2605.30523#bib.bib32), §C.1), the input to the transformer is a residual stream with entries of three types: input symbols $\mathsf{Inp}(n)$ (entered from the input string), argument symbols $\mathsf{Arg}(n_g, n_a)$ (encoding that gate $n_g$ takes the value at position $n_a$ as one of its arguments; entered from the PEs), and gate symbols $\mathsf{Type}(n_g)$ (encoding the type and threshold of gate $n_g$; also entered from the PEs). Here, $n, n_g, n_a \in [N]$ are positions in the padded input sequence: $n_g$ is the position of the gate symbol itself, and $n_a$ is the position of the source symbol (an $\mathsf{Inp}(\cdot)$ or a previously-placed $\mathsf{Type}(\cdot)$) feeding into it. In their construction, each symbol's PE has the form

$$\phi_{\textrm{bin}}(\cdot) = \left(c_1, c_2, c_3,\, \bm{k}_{{n_{\textrm{tgt}}}_1}, \bm{q}_{{n_{\textrm{tgt}}}_1},\, \bm{k}_{{n_{\textrm{tgt}}}_2}, \bm{q}_{{n_{\textrm{tgt}}}_2}\right), \tag{37}$$

where $c_1, c_2, c_3 \in \{0, 1\}$ are constant-size flags, and each $(\bm{k}_{\bullet}, \bm{q}_{\bullet})$ pair is the signed-binary key–query construction of [Lem. B.3](https://arxiv.org/html/2605.30523#A2.Thmlemma3), i.e., $\bm{k}_{n_{\bullet}} = \texttt{B}^{\pm}(n_{\bullet})^{\frown} \bm{1}_{\lceil \log N \rceil}$ and $\bm{q}_{n_{\bullet}} = B_{\mathbb{F}} \cdot \texttt{B}^{\pm}(n_{\bullet})^{\frown} (-\bm{1}_{\lceil \log N \rceil})$. The two stored positions ${n_{\textrm{tgt}}}_1, {n_{\textrm{tgt}}}_2 \in [N]$ are the two attention targets a symbol must address, one per attention layer of the two-layer-per-circuit-layer construction, each an instance of the target $n_{\textrm{tgt}}$ from [Lem. B.4](https://arxiv.org/html/2605.30523#A2.Thmlemma4). Specifically, in the first layer, an $\mathsf{Arg}(n_g, n_a)$ symbol *reads* the value at the source position ${n_{\textrm{tgt}}}_1 = n_a$ into its own slot, and in the second layer, a $\mathsf{Type}(n_g)$ symbol *collects* and *computes* the value of the gate at its own index ${n_{\textrm{tgt}}}_2 = n_g$. Each binary key/query block occupies $2 \lceil \log N \rceil$ coordinates, so the residual stream has width $\Theta(\log N)$.
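The signed-binary key–query pair can be illustrated with a short sketch. Two points are assumptions, since Lem. B.3 is not reproduced in this appendix section: $\texttt{B}^{\pm}$ is taken to map bit 0 to $-1$ and bit 1 to $+1$, and the factor $B_{\mathbb{F}}$ is taken to scale the whole concatenated query vector. The helper names and the parameter `scale` (standing in for $B_{\mathbb{F}}$) are ours.

```python
def signed_binary(n, width):
    """Assumed B^pm: binary digits of n mapped from {0, 1} to {-1, +1}."""
    return [1 if (n >> i) & 1 else -1 for i in range(width)]


def key(n, width):
    """Key: signed-binary code of n concatenated with an all-ones block."""
    return signed_binary(n, width) + [1] * width


def query(n, width, scale):
    """Query: scaled signed-binary code of n concatenated with a minus-ones block."""
    return [scale * v for v in signed_binary(n, width) + [-1] * width]


def score(q, k):
    """Plain inner product of a query and a key."""
    return sum(a * b for a, b in zip(q, k))
```

Under these assumptions the score equals `-2 * scale` times the Hamming distance between the two binary codes, so it is 0 exactly at the target position and at most `-2 * scale` everywhere else. The price is a block of $2 \lceil \log N \rceil$ coordinates per key or query, which is the log-width cost the constant-width substitution removes.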

##### Constant\-width substitution\.

We replace [Eq. 37](https://arxiv.org/html/2605.30523#A3.E37) by

$$\phi_{\textrm{unit}}(\cdot) = \left(c_1, c_2, c_3,\, \mu({n_{\textrm{tgt}}}_1),\, \mu({n_{\textrm{tgt}}}_2)\right), \tag{38}$$

so each of the two log-width $(\bm{k}_{\bullet}, \bm{q}_{\bullet})$ blocks in [Eq. 37](https://arxiv.org/html/2605.30523#A3.E37) collapses to a single two-dimensional unit-length PE. The total PE width is $3 + 2 + 2 = 7$, i.e., constant in $N$. The machine $\mathcal{M}_1$ from [Def. A.5](https://arxiv.org/html/2605.30523#A1.Thmdefinition5) that generates the transformer's parameters is identical to that of London & Kanade. The machine $\mathcal{M}_2$ that computes the PEs, in contrast, writes $\mu(n_{\bullet})$ whenever their construction would write a $(\bm{k}_{n_{\bullet}}, \bm{q}_{n_{\bullet}})$ block into the PE; this PE is logspace-computable from $n_{\bullet}$ and $N$ by [Lem. B.4](https://arxiv.org/html/2605.30523#A2.Thmlemma4).

##### Attention with unit\-length PEs\.

The substitution preserves the attention pattern of London & Kanade ([2025](https://arxiv.org/html/2605.30523#bib.bib32)). Each attention layer must focus exclusively on a single target, namely ${n_{\textrm{tgt}}}_1$ in the first layer and ${n_{\textrm{tgt}}}_2$ in the second. [Cor. B.1](https://arxiv.org/html/2605.30523#A2.Thmcorollary1) provides the query–key construction: scaling yields a query $\bm{q}_{n} \stackrel{\text{def}}{=} (B_{\mathbb{F}}/m) \cdot \mu(n_{\textrm{tgt}})$ and key $\bm{k}_{n'} \stackrel{\text{def}}{=} \mu(n')$ such that the inner-product gap between $n' = n_{\textrm{tgt}}$ and $n' \neq n_{\textrm{tgt}}$ exceeds the saturation threshold required by [Lem. B.5](https://arxiv.org/html/2605.30523#A2.Thmlemma5); we instantiate this with $n_{\textrm{tgt}} = {n_{\textrm{tgt}}}_1$ in the first layer and $n_{\textrm{tgt}} = {n_{\textrm{tgt}}}_2$ in the second. Both targets are already coordinates of the symbol's PE (cf. [Eq. 38](https://arxiv.org/html/2605.30523#A3.E38)), so the query and key matrices are simple projections onto those two coordinates, with the $B_{\mathbb{F}}/m$ scaling absorbed into them, as in the proof of [Lem. 3.1](https://arxiv.org/html/2605.30523#S3.Thmlemma1). Concretely, in the first attention layer the query of each symbol projects out $\mu({n_{\textrm{tgt}}}_1)$ from its PE while every symbol's key projects out $\mu({n_{\textrm{tgt}}}_2)$; since each non-pause symbol's ${n_{\textrm{tgt}}}_2$ equals its own position in the encoding of London & Kanade ([2025](https://arxiv.org/html/2605.30523#bib.bib32)), an argument symbol with ${n_{\textrm{tgt}}}_1 = n_a$ attends exactly to the source at position $n_a$. The second layer swaps the roles (query from $\mu({n_{\textrm{tgt}}}_2)$, key from $\mu({n_{\textrm{tgt}}}_1)$) so that every argument symbol with ${n_{\textrm{tgt}}}_1 = n_g$ is attended to by the gate at position $n_g$, matching the second-layer pattern of London & Kanade. By [Lem. B.5](https://arxiv.org/html/2605.30523#A2.Thmlemma5), the resulting post-softmax weights are exactly $1$ on the target position and $0$ elsewhere in $\mathbb{F}_b$ arithmetic. The value matrix, MLP, and the rest of the residual-stream bookkeeping are inherited verbatim from London & Kanade ([2025](https://arxiv.org/html/2605.30523#bib.bib32), Thm. C.7), since they act position-wise on the constant-size flags $c_1, c_2, c_3$ and the (already constant-size) gate-value coordinate, none of which touch the PE block.
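The idea of addressing a position with a two-dimensional unit-length PE can be illustrated as follows. This is only a sketch under a loud assumption: Eq. 19 is not reproduced in this section, so `mu` below is a hypothetical stand-in that places position $n$ at angle $2\pi n / N$ on the unit circle and may differ from the paper's $\mu$. The parameter `scale` stands in for the $B_{\mathbb{F}}/m$ factor, and the function names are ours.

```python
import math


def mu(n, total):
    """HYPOTHETICAL unit-length PE: position n as a point on the unit circle."""
    angle = 2.0 * math.pi * n / total
    return (math.cos(angle), math.sin(angle))


def attention_target(n_tgt, total, scale):
    """Return the position with the highest score and its margin over the runner-up.

    Query is scale * mu(n_tgt); the key at position p is mu(p).
    """
    q = tuple(scale * c for c in mu(n_tgt, total))
    scores = [q[0] * k[0] + q[1] * k[1] for k in (mu(p, total) for p in range(total))]
    best = max(range(total), key=scores.__getitem__)
    runner_up = max(s for p, s in enumerate(scores) if p != best)
    return best, scores[best] - runner_up
```

With this stand-in, the score is `scale` times the cosine of the angle between the two positions, so the target wins by a margin of `scale * (1 - cos(2 * pi / total))`. The margin shrinks as the number of positions grows, which is why the query has to be scaled up before the softmax saturates on the target.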

##### Residual\-stream contents remain constant\-width\.

Beyond the PEs in [Eq. 38](https://arxiv.org/html/2605.30523#A3.E38), the residual stream stores only a constant number of additional coordinates: the computed gate value (one coordinate with value in $\mathbb{F}_b$) and the temporary scratch coordinates used by the MLP to recombine flags into thresholds, all inherited verbatim from London & Kanade ([2025](https://arxiv.org/html/2605.30523#bib.bib32)) and already constant-size there. What changes is only the *PE block*: from $\Theta(\log N)$ coordinates of binary key/query pairs to $4$ coordinates of unit-length PEs. The overall width is therefore $\mathcal{O}(1)$.

Finally, we note that the construction of London & Kanade ([2025](https://arxiv.org/html/2605.30523#bib.bib32)) effectively constructs an AHAT (an SMAT in which the non-max values are small enough to map to 0 via the softmax; cf. [Lem. B.5](https://arxiv.org/html/2605.30523#A2.Thmlemma5)), meaning that this construction shows the same relationship for AHATs as well. ∎

See Lemma [4.2](https://arxiv.org/html/2605.30523#S4.Thmlemma2)

###### Proof\.

The proof follows London & Kanade ([2025](https://arxiv.org/html/2605.30523#bib.bib32), Thms. C.5 and C.11), who handle $\mathtt{LPT}^{\mathtt{0}}_{\texttt{c},\texttt{l}}$ and $\mathtt{LPT}^{\mathtt{0}}_{\texttt{l},\texttt{l}}$, respectively, by decomposing each transformer layer into components that are individually in $\mathtt{L}\text{-uniform}\ \mathtt{AC}^{\mathtt{0}}$ (constant-precision case) or $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{\mathtt{0}}$ (log-precision case) and composing a constant number of them. We follow the same decomposition but track how each component scales when both width and precision are polynomial. Fix an $\mathtt{L}\text{-uniform}$ SMAT family $\{\mathcal{T}_N\}_{N \in \mathbb{N}}$ of constant depth and let $b(N), D(N) \in \mathtt{poly}(N)$ denote its precision and width, respectively.

##### Positional encodings and per\-coordinate parameters\.

[Def. A.5](https://arxiv.org/html/2605.30523#A1.Thmdefinition5) guarantees that PE coordinates and parameter-matrix entries are produced by a logspace TM, given the input length $N$, the relevant position $n \in [N]$, and the coordinate indices. Concatenating all $N \cdot D(N) \cdot b(N) = \mathtt{poly}(N)$ output bits gives a logspace-computable function of $1^{N}$, so [Lem. B.1](https://arxiv.org/html/2605.30523#A2.Thmlemma1) yields an $\mathtt{L}\text{-uniform}\ \mathtt{AC}^{\mathtt{0}}$ sub-circuit of polynomial size and depth $3$ that produces every PE and parameter bit consumed by the attention and feedforward components, just as the proofs of Thms. C.5 and C.11 of London & Kanade ([2025](https://arxiv.org/html/2605.30523#bib.bib32)) use the same lemma.

##### Attention layer\.

Per head and per query position $n$, the layer computes attention scores $s_{n,n'} \stackrel{\text{def}}{=} \langle \bm{q}_{n}, \bm{k}_{n'} \rangle_{b(N)}$ for $n' \in [N]$, normalizes them via softmax, and returns the weighted value sum. Each $s_{n,n'}$ is an inner product of two $D(N)$-dimensional vectors with $b(N)$-bit entries, i.e., a sum of $\mathtt{poly}(N)$ products of $\mathtt{poly}(N)$-bit numbers; multiplication of two $\mathtt{poly}(N)$-bit numbers is in $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}$ (Hesse et al., [2002](https://arxiv.org/html/2605.30523#bib.bib22)), and iterated addition of $\mathtt{poly}(N)$ such products is in $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}$ by Hesse et al. ([2002](https://arxiv.org/html/2605.30523#bib.bib22), Thm. 2.1).
Evaluating $\exp$ on a $\mathtt{poly}(N)$-bit fixed-point input to $\mathtt{poly}(N)$-bit precision is also in $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}$: It is computed by a truncated Taylor series whose terms reduce to iterated multiplication and iterated addition, both of which are in $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}$ (Chiang, [2025](https://arxiv.org/html/2605.30523#bib.bib10)). Hence the unnormalized weights $\exp(s_{n,n'})$ and their iterated sum (the normalizer) are in $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}$ as well. Dividing each $\exp(s_{n,n'})$ by the normalizer and computing the new value involves $\mathtt{poly}(N)$ divisions and one more iterated sum per output coordinate, both in $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}$. Composing a constant number of $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}$ subroutines keeps the layer in $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}$. In the constant-precision case the same decomposition collapses to $\mathtt{L}\text{-uniform}\ \mathtt{AC}^{0}$: Each product becomes a finite lookup table (London & Kanade, [2025](https://arxiv.org/html/2605.30523#bib.bib32), Lem. C.4), exponentiation reduces to a lookup table as well (London & Kanade, [2025](https://arxiv.org/html/2605.30523#bib.bib32), Lem. D.4), and iterated addition of $\mathtt{poly}(N)$ constant-precision numbers is in $\mathtt{L}\text{-uniform}\ \mathtt{AC}^{0}$ by London & Kanade ([2025](https://arxiv.org/html/2605.30523#bib.bib32), Thm. C.3).
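The decomposition above can be mirrored in ordinary code. The following is a minimal Python sketch, not the circuit construction itself: exact rationals (`fractions.Fraction`) stand in for $b(N)$-bit fixed-point numbers, and the names `exp_taylor` and `softmax_attention` as well as the default number of Taylor terms are illustrative assumptions that do not appear in the paper. Each step of `softmax_attention` corresponds to one of the subroutines named in the proof (products, iterated sums, exponentiation, division).

```python
from fractions import Fraction

def exp_taylor(x, terms=30):
    """Truncated Taylor series sum_{k < terms} x^k / k! (iterated products and sums)."""
    total, term = Fraction(0), Fraction(1)
    for k in range(terms):
        total += term
        term = term * x / (k + 1)  # next term: x^(k+1) / (k+1)!
    return total

def softmax_attention(query, keys, values, terms=30):
    """One attention head for one query position, in exact rational arithmetic."""
    # Scores s_{n,n'}: inner products, i.e. iterated sums of products.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Unnormalized weights exp(s_{n,n'}) and their iterated sum (the normalizer).
    weights = [exp_taylor(s, terms) for s in scores]
    normalizer = sum(weights)
    # One division and one more iterated sum per output coordinate.
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / normalizer
            for d in range(dim)]
```

The sketch only illustrates which arithmetic primitives are composed; the truncation length needed for a given precision is not analyzed here.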

##### Feedforward layer and residual connection\.

A feedforward layer applies a position-wise affine map followed by a $\mathrm{ReLU}$ nonlinearity. Each output coordinate is a sum of $D(N)$ products of $b(N)$-bit numbers plus a $\mathrm{ReLU}$ thresholding, both in $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}$ at polynomial precision and in $\mathtt{L}\text{-uniform}\ \mathtt{AC}^{0}$ at constant precision by the same reasoning as above. The residual connection adds two $b(N)$-bit numbers position-wise, which is in $\mathtt{L}\text{-uniform}\ \mathtt{AC}^{0}$ even at polynomial precision (Hesse et al., [2002](https://arxiv.org/html/2605.30523#bib.bib22)).

##### Composition\.

A constant-depth transformer applies a constant number of attention and feedforward layers (each in $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}$ for polynomial precision, $\mathtt{L}\text{-uniform}\ \mathtt{AC}^{0}$ for constant precision), so the whole computation lies in the same class. This yields $\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{0}_{\texttt{c},\texttt{p}} \subseteq \mathtt{L}\text{-uniform}\ \mathtt{AC}^{0}$ and $\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{0}_{\texttt{p},\texttt{p}} \subseteq \mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}$.

##### Extension to AHATs.

For AHATs, the softmax-and-value-sum step is replaced by argmax-then-average: For each query position $n$, identify the set of positions in $\operatorname{argmax}_{n'} s_{n,n'}$ and return the average of their values. Computing the maximum of $N$ polynomial-precision integers and counting the argmax positions are both in $\mathtt{FO}\text{-uniform}\ \mathtt{AC}^{0} \subseteq \mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}$ (Chiang, [2025](https://arxiv.org/html/2605.30523#bib.bib10), Thm. 2), and the resulting average is again iterated addition followed by one division, which is in $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}$ at polynomial precision and in $\mathtt{L}\text{-uniform}\ \mathtt{AC}^{0}$ at constant precision. Hence both inclusions extend to AHATs. ∎
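To make the argmax-then-average step concrete, here is a small Python sketch for a single query position. It is an illustration under simplifying assumptions: scores and values are plain Python numbers rather than fixed-precision encodings, and the function name `average_hard_attention` is hypothetical.

```python
from fractions import Fraction

def average_hard_attention(scores, values):
    """Average the value vectors of all positions that attain the maximum score."""
    top = max(scores)                                   # maximum of the scores
    winners = [i for i, s in enumerate(scores) if s == top]
    count = Fraction(len(winners))                      # number of argmax positions
    dim = len(values[0])
    # Iterated addition over the winners followed by one division per coordinate.
    return [sum(values[i][d] for i in winners) / count for d in range(dim)]
```

When the maximum is attained at a single position, the head simply copies that position's value; ties are resolved by uniform averaging.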

### C.3 Proofs of Polylogarithmic-depth Results

See [Lem. 5.1](https://arxiv.org/html/2605.30523#S5.Thmlemma1)

###### Proof\.

We follow the strategy of Merrill & Sabharwal ([2025a](https://arxiv.org/html/2605.30523#bib.bib35), Lem. 9), who cover $\mathtt{FO}\text{-uniform}$ input families. In contrast, we assume an $\mathtt{L}\text{-uniform}$ circuit family for $f$, so we first construct an $\mathtt{L}\text{-uniform}$ iterated circuit family for $f^{r(N)}$ by a logspace stitching argument, and then invoke uniformity collapse to strengthen it to $\mathtt{FO}\text{-uniform}$.

##### Step 1: an $\mathtt{L}\text{-uniform}$ iterated family.

Fix the input length $N$. Let $\mathcal{C}^{f} = \{C^{f}_{N}\}_{N=0}^{\infty}$ be the assumed polynomial-size $\mathtt{L}\text{-uniform}$ circuit family for $f$, and let $M_{f}$ be a logspace Turing machine that, given $1^{N}$ and a gate or edge query, decides membership in the connection language (cf. [§ A.2](https://arxiv.org/html/2605.30523#A1.SS2)) of $C^{f}_{N}$. Since $C^{f}_{N}$ has polynomially many gates, we may pad $C^{f}_{N}$ with dummy gates (handled by a fixed default response from $M_{f}$) so that it has exactly $\sigma \stackrel{\text{def}}{=} 2^{s(N)}$ gates for some integer $s(N) = \mathcal{O}(\log N)$; this preserves $\mathtt{L}$-uniformity and at most doubles the circuit size. We further assume, without loss of generality, that the $m(N)$ input gates of $C^{f}_{N}$ occupy indices $0, \ldots, m(N)-1$ and the $m(N)$ output gates occupy indices $\sigma - m(N), \ldots, \sigma - 1$; if they do not, we prepend a layer of $m(N)$ identity gates at indices $0, \ldots, m(N)-1$ wired to the original input gates and append a layer of $m(N)$ identity gates at indices $\sigma - m(N), \ldots, \sigma - 1$ fed by the original output gates. This adds $2$ to the depth and $2m(N)$ to the size, and preserves $\mathtt{L}$-uniformity since the rewiring can be carried out by $M_{f}$ in logspace.

We construct the circuit $C_{N}$ as $r(N)$ copies of $C^{f}_{N}$ stacked sequentially, with the output gates of each copy wired to the input positions of the next. Indices in $C_{N}$ range over $\{0, 1, \ldots, r(N) \cdot \sigma - 1\}$; we decompose any such index $i$ uniquely as $i = q_{i} \cdot \sigma + i'$ with the block index $q_{i} \in \{0, \ldots, r(N)-1\}$ and the within-block index $i' \in \{0, \ldots, \sigma - 1\}$. Both components are computable from the $\mathcal{O}(\log N)$-bit representation of $i$ in logspace.

The connection language of $C_{N}$ is then decided by a logspace machine $M$ that, on input $1^{N}$ and a query about $C_{N}$, proceeds as follows:

1. (a) On a gate query at index $i$, $M$ rejects if $i \geq r(N) \cdot \sigma$; otherwise, it computes $(q_{i}, i')$ and queries $M_{f}$ for the gate type of index $i'$ in $C^{f}_{N}$. If $M_{f}$ reports an input gate and $q_{i} \geq 1$, $M$ overrides the response to an identity gate, so that block $q_{i}$’s “input” positions become internal pass-through gates fed by block $q_{i} - 1$; otherwise it returns $M_{f}$’s response unchanged.
2. (b) On an edge query for the pair $(i, j)$, $M$ computes $(q_{i}, i')$ and $(q_{j}, j')$ and proceeds as follows. If $q_{i} = q_{j}$, $M$ forwards the edge query $(i', j')$ to $M_{f}$, reproducing the intra-block edges of $C^{f}_{N}$ inside every block. If $q_{j} = q_{i} + 1$, $i' \geq \sigma - m(N)$, $j' < m(N)$, and $i' - j' = \sigma - m(N)$ (equivalently, the $k$-th output gate of block $q_{i}$ is paired with the $k$-th input position of block $q_{i} + 1$ for the same $k$), $M$ returns $1$, wiring exactly the intended $m(N)$ edges per block boundary. All remaining edge queries return $0$.

Every test performed by $M$ is a comparison or addition on $\mathcal{O}(\log N)$-bit numbers, and answering queries about $C^{f}_{N}$ is delegated to $M_{f}$; hence $M$ runs in logspace. By construction, $C_{N}$ has size $r(N) \cdot \sigma = \mathtt{poly}(N)$, depth $\mathcal{O}(d(N) \cdot r(N))$, and computes $f^{r(N)}$. Since $r(N) \cdot d(N)$ is at most polynomial, $\mathcal{C}$ is an $\mathtt{L}\text{-uniform}$ polynomial-size circuit family of depth $\mathcal{O}(r(N)\,d(N))$ over the same gate set as $\mathcal{C}^{f}$.
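The stitching machine $M$ can be sketched in a few lines of Python. This is a hedged illustration rather than a logspace implementation: the callables `gate_type_f` and `edge_f` stand in for queries to $M_{f}$, the gate-type strings `"INPUT"` and `"ID"` are hypothetical labels, and the out-of-range check on edge queries is an added assumption (the text only states it for gate queries).

```python
def make_iterated_machine(gate_type_f, edge_f, sigma, m, r):
    """Build gate/edge query functions for r stacked copies of a sigma-gate circuit.

    gate_type_f(i') returns the gate type of index i' in the base circuit;
    edge_f(i', j') returns True iff gate i' feeds gate j' in the base circuit.
    """
    def gate_query(i):
        if i >= r * sigma:
            return None                      # reject: index outside the stacked circuit
        q, i_local = divmod(i, sigma)        # block index and within-block index
        gate_type = gate_type_f(i_local)
        if gate_type == "INPUT" and q >= 1:
            return "ID"                      # later blocks' inputs become pass-through gates
        return gate_type

    def edge_query(i, j):
        if i >= r * sigma or j >= r * sigma:
            return False                     # assumption: out-of-range pairs are non-edges
        q_i, i_local = divmod(i, sigma)
        q_j, j_local = divmod(j, sigma)
        if q_i == q_j:
            return edge_f(i_local, j_local)  # intra-block edge, delegated to the base machine
        # k-th output of block q_i feeds the k-th input position of block q_i + 1.
        return (q_j == q_i + 1 and i_local >= sigma - m
                and j_local < m and i_local - j_local == sigma - m)

    return gate_query, edge_query
```

Every branch uses only comparisons, additions, and one `divmod` on indices bounded by $r(N) \cdot \sigma$, which matches the arithmetic the proof attributes to $M$.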

##### Step 2: upgrading $\mathtt{L}$-uniformity to $\mathtt{FO}$-uniformity.

Since $r(N) \cdot d(N) \geq 1$ whenever $r(N), d(N) \geq 1$, the *uniformity collapse* result (Merrill & Sabharwal, [2025a](https://arxiv.org/html/2605.30523#bib.bib35), Thm. 3), namely $\mathtt{FO}\text{-uniform}\ \mathtt{AC}^{d} = \mathtt{L}\text{-uniform}\ \mathtt{AC}^{d}$ and $\mathtt{FO}\text{-uniform}\ \mathtt{TC}^{d} = \mathtt{L}\text{-uniform}\ \mathtt{TC}^{d}$ for $d \geq 1$, applies, yielding the desired $\mathtt{FO}\text{-uniform}$ circuit family for $f^{r(N)}$ at depth $d(N) \cdot r(N)$. ∎

See [Lem. 5.2](https://arxiv.org/html/2605.30523#S5.Thmlemma2)

###### Proof\.

The argument adapts Merrill & Sabharwal ([2025a](https://arxiv.org/html/2605.30523#bib.bib35), Lem. 6), which proves an analogous statement for fully-uniform log-precision AHATs. Despite the different starting point (we begin from $\mathtt{L}\text{-uniform}$ constant-depth transformers in $\mathtt{L}\text{-uniform}\ \mathtt{AC}^{0}$ or $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}$ instead of fully-uniform log-precision AHATs in $\mathtt{FO}\text{-uniform}\ \mathtt{TC}^{0}$), we arrive at the same $\mathtt{FO}\text{-uniform}$ upper bounds for $\mathtt{L}\text{-uniform}$ looped transformers. Additionally, the argument now applies to both AHATs and SMATs at constant and growing precision. Fix any family $\{\mathcal{T}_{N}\}_{N \in \mathbb{N}}$ in $\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{d}_{\texttt{c},\texttt{p}}$ or $\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{d}_{\texttt{l},\texttt{p}}$ for some $d \geq 1$, and let $\mathcal{L}$ be the language it recognizes.
Write $\Theta(\log^{d} N)$ for the loop count and let $\mathcal{T}_{N}$ have a designated looped block of constant depth (cf. [Def. A.6](https://arxiv.org/html/2605.30523#A1.Thmdefinition6)). Following Merrill & Sabharwal ([2025a](https://arxiv.org/html/2605.30523#bib.bib35)), we partition the $L$ layers $\bm{\tau}^{(1)}, \ldots, \bm{\tau}^{(L)}$ of $\mathcal{T}_{N}$ into three pieces using the loop boundaries $1 \leq l_{1} \leq l_{2} \leq L$ of [Def. A.6](https://arxiv.org/html/2605.30523#A1.Thmdefinition6): The pre-loop layers $\mathcal{T}^{A}_{N} \stackrel{\text{def}}{=} \bm{\tau}^{(l_{1})} \circ \cdots \circ \bm{\tau}^{(1)}$, the looped constant-depth block $\mathcal{T}^{B}_{N} \stackrel{\text{def}}{=} \bm{\tau}^{(l_{2})} \circ \cdots \circ \bm{\tau}^{(l_{1}+1)}$ iterated $r(N) \stackrel{\text{def}}{=} \Theta(\log^{d} N) = T$ times, and the post-loop layers $\mathcal{T}^{C}_{N} \stackrel{\text{def}}{=} \bm{\tau}^{(L)} \circ \cdots \circ \bm{\tau}^{(l_{2}+1)}$, so that $\mathcal{T}_{N} = \mathcal{T}^{C}_{N} \circ (\mathcal{T}^{B}_{N})^{r(N)} \circ \mathcal{T}^{A}_{N}$.
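The three-piece decomposition corresponds to a simple control flow. The Python sketch below is illustrative only: layers are modeled as arbitrary callables on a state, and the names `run_looped` and `unrolled_depth` are hypothetical. The slices follow the loop boundaries $l_{1}$ and $l_{2}$ as defined above.

```python
def run_looped(layers, l1, l2, r, state):
    """Apply T^C o (T^B)^r o T^A, where layers[k - 1] plays the role of tau^(k)."""
    pre, loop, post = layers[:l1], layers[l1:l2], layers[l2:]
    for layer in pre:                 # T^A: layers 1 .. l1
        state = layer(state)
    for _ in range(r):                # (T^B)^r: layers l1+1 .. l2, repeated r times
        for layer in loop:
            state = layer(state)
    for layer in post:                # T^C: layers l2+1 .. L
        state = layer(state)
    return state

def unrolled_depth(num_layers, l1, l2, r):
    """Number of layer applications once the loop is unrolled."""
    return l1 + r * (l2 - l1) + (num_layers - l2)
```

With a constant number of layers and $r(N) = \Theta(\log^{d} N)$, `unrolled_depth` grows as $\Theta(\log^{d} N)$ whenever the looped block is non-empty.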

##### Step 1: Each piece is a constant\-depth padded transformer\.

By construction, each of $\mathcal{T}^{A}_{N}$, $\mathcal{T}^{B}_{N}$, and $\mathcal{T}^{C}_{N}$ is a constant-depth padded transformer with the same width, precision, and attention type as $\mathcal{T}_{N}$. All three are themselves $\mathtt{L}\text{-uniform}$: Their parameter matrices are an $\mathtt{L}$-computable projection (selecting a contiguous range of layers) of those of $\mathcal{T}_{N}$, and the construction machine $\mathcal{M}_{1}$ of $\mathcal{T}_{N}$ only needs to be augmented with logspace-computable counters marking the layer ranges $A$, $B$, $C$. The PE machine $\mathcal{M}_{2}$ is unchanged. Hence each piece lies in $\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{0}_{\texttt{c},\texttt{p}}$ or $\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{0}_{\texttt{p},\texttt{p}}$ depending on the precision regime.

##### Step 2: Simulating each piece by a constant\-depth circuit\.

Applying [Lem. 4.2](https://arxiv.org/html/2605.30523#S4.Thmlemma2) to $\mathcal{T}^{A}_{N}$, $\mathcal{T}^{B}_{N}$, and $\mathcal{T}^{C}_{N}$ yields $\mathtt{L}\text{-uniform}$ circuit families $\{C^{A}_{N}\}_{N \in \mathbb{N}}$, $\{C^{B}_{N}\}_{N \in \mathbb{N}}$, and $\{C^{C}_{N}\}_{N \in \mathbb{N}}$ of polynomial size and constant depth that simulate them; the gate set is $\{\texttt{AND}, \texttt{OR}, \texttt{NOT}\}$ in the constant-precision case (so each family is in $\mathtt{L}\text{-uniform}\ \mathtt{AC}^{0}$) and $\{\texttt{AND}, \texttt{OR}, \texttt{NOT}, \texttt{THR}\}$ in the growing-precision case (so each is in $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{0}$). In both cases the constructed circuits act on the residual-stream representation of the padded input, which has polynomially many positions of width $\mathtt{poly}(N)$ bits.

##### Step 3: Looping $C^{B}_{N}$ via [Lem. 5.1](https://arxiv.org/html/2605.30523#S5.Thmlemma1).

The composition $(\mathcal{T}^{B}_{N})^{r(N)}$ corresponds to iterating $C^{B}_{N}$ for $r(N) = \Theta(\log^{d} N)$ steps. Invoking [Lem. 5.1](https://arxiv.org/html/2605.30523#S5.Thmlemma1) with depth function $d_{B}(N) = \Theta(1)$, loop function $r(N) = \Theta(\log^{d} N)$, and the appropriate gate set produces an $\mathtt{FO}\text{-uniform}$ circuit family of polynomial size and depth $\mathcal{O}(\log^{d} N)$ computing $(\mathcal{T}^{B}_{N})^{r(N)}$: an $\mathtt{FO}\text{-uniform}\ \mathtt{AC}^{d}$ family in the constant-precision case and an $\mathtt{FO}\text{-uniform}\ \mathtt{TC}^{d}$ family in the growing-precision case. The uniformity *tightens* from $\mathtt{L}\text{-uniform}$ to $\mathtt{FO}\text{-uniform}$ via the uniformity collapse (cf. [Lem. 5.1](https://arxiv.org/html/2605.30523#S5.Thmlemma1)), since $\mathtt{L} \subseteq \mathtt{FO}\text{-uniform}\ \mathtt{AC}^{d}$ for $d \geq 1$.

##### Step 4: Composing the three pieces\.

The looped circuit sits between $C^{A}_{N}$ and $C^{C}_{N}$. The result is a serial composition of an $\mathtt{FO}\text{-uniform}\ \mathtt{AC}^{d}$ (resp. $\mathtt{FO}\text{-uniform}\ \mathtt{TC}^{d}$) family with two $\mathtt{L}\text{-uniform}\ \mathtt{AC}^{0} \subseteq \mathtt{FO}\text{-uniform}\ \mathtt{AC}^{d}$ (resp. $\mathtt{L}\text{-uniform}\ \mathtt{TC}^{0} \subseteq \mathtt{FO}\text{-uniform}\ \mathtt{TC}^{d}$) families, and these classes are closed under fixed serial composition for $d \geq 1$ (Merrill & Sabharwal, [2025a](https://arxiv.org/html/2605.30523#bib.bib35), Lem. 8), so the composed family lies in $\mathtt{FO}\text{-uniform}\ \mathtt{AC}^{d}$ (resp. $\mathtt{FO}\text{-uniform}\ \mathtt{TC}^{d}$). It recognizes the same language $\mathcal{L}$ as $\mathcal{T}_{N}$, yielding $\mathcal{L} \in \mathtt{FO}\text{-uniform}\ \mathtt{AC}^{d}$ in the constant-precision case and $\mathcal{L} \in \mathtt{FO}\text{-uniform}\ \mathtt{TC}^{d}$ in the growing-precision case.
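As a quick check of the depth bookkeeping behind this composition, write $c_{A}$ and $c_{C}$ for the constant depths of $C^{A}_{N}$ and $C^{C}_{N}$ (these two symbols are introduced here for illustration only and are not part of the original notation). Serial composition adds depths, so for $d \geq 1$

$$\underbrace{c_{A}}_{\text{pre-loop}} + \underbrace{\mathcal{O}(\log^{d} N)}_{\text{looped block}} + \underbrace{c_{C}}_{\text{post-loop}} = \mathcal{O}(\log^{d} N),$$

and the total size is a sum of three polynomials in $N$, hence polynomial.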

The argument is agnostic to attention type: [Lem. 4.2](https://arxiv.org/html/2605.30523#S4.Thmlemma2) is stated and proved for both AHATs and SMATs, and the remaining steps only manipulate the resulting circuit families. Hence the lemma holds for both attention patterns. ∎

See [Lem. 5.3](https://arxiv.org/html/2605.30523#S5.Thmlemma3)

###### Proof\.

The proof for $\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{d}_{\texttt{l},\texttt{c}}$ follows directly from Merrill & Sabharwal ([2025a](https://arxiv.org/html/2605.30523#bib.bib35), Lem. 3), which gives the construction in the log-precision AHAT regime, combined with [Lem. 3.1](https://arxiv.org/html/2605.30523#S3.Thmlemma1) of this paper, which lifts it to log-precision SMATs while preserving $\mathtt{L}$-uniformity. We therefore focus on the $\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{d}_{\texttt{c},\texttt{l}}$ case, where the main adaptation is the absence of the layer hash norm at constant precision; we adapt the construction of Merrill & Sabharwal ([2025a](https://arxiv.org/html/2605.30523#bib.bib35)) step by step.

Fix any language $\mathcal{L}' \in \mathcal{C}$. By the assumed completeness of $\mathcal{L}$ for $\mathcal{C}$ under $\mathcal{R}$ reductions, there is an $\mathcal{R}$ reduction $\texttt{t}\colon \Sigma^{*} \to \Sigma^{*}$ such that $\bm{w} \in \mathcal{L}'$ if and only if $\texttt{t}(\bm{w}) \in \mathcal{L}$, with $\lvert \texttt{t}(\bm{w}) \rvert \in \mathtt{poly}(\lvert \bm{w} \rvert)$ (cf. [Def. 5.1](https://arxiv.org/html/2605.30523#S5.Thmdefinition1)). Let $\mathcal{T}_{\mathcal{L}} \in \mathtt{L}\text{-uniform}\ \mathtt{LPT}^{d}_{\texttt{c},\texttt{l}}$ be the family recognizing $\mathcal{L}$, and let $\mathcal{T}_{\texttt{t}} \in \mathtt{L}\text{-uniform}\ \mathtt{LPT}^{d}_{\texttt{c},\texttt{l}}$ be the family computing $\texttt{t}$ (which exists by the second precondition: $\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{d}_{\texttt{c},\texttt{l}}$ recognizes the language $\mathcal{L}_{\texttt{t}}$ of [Def. 5.2](https://arxiv.org/html/2605.30523#S5.Thmdefinition2)).
We construct an $\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{d}_{\texttt{c},\texttt{l}}$ family $\mathcal{T}$ for $\mathcal{L}'$ by stacking $\mathcal{T}_{\mathcal{L}}$ on top of $\mathcal{T}_{\texttt{t}}$: The first layers of $\mathcal{T}$ compute $\texttt{t}(\bm{w})$ symbol-by-symbol in parallel inside disjoint blocks of padding, and the remaining layers run $\mathcal{T}_{\mathcal{L}}$ on the resulting string.

##### Step 1: Layout of the padding space\.

Let $N \stackrel{\text{def}}{=} \lvert \bm{w} \rvert$ and fix a polynomial bound $\lvert \texttt{t}(\bm{w}) \rvert \leq N^{c}$ on the reduction output (with $c \in \mathbb{N}$ depending on $\texttt{t}$). Suppose $\mathcal{T}_{\texttt{t}}$ uses at most $N^{K}$ padding positions to compute one output symbol $\texttt{t}(\bm{w})_{n}$ on input $(\bm{w}, \texttt{B}(n))$; for sufficiently large $N$ we upper-bound this by $N^{K+1}$, and for small $N$ we assume $\mathcal{T}_{\texttt{t}}$ uses a finite lookup table and no padding. The constructed transformer $\mathcal{T}$ divides its padding space into $N^{c}$ blocks of size $N^{K+1}$ each, so that block $n \in [N^{c}]$ holds the workspace for computing $\texttt{t}(\bm{w})_{n}$. The total padding is polynomial: $N^{c+K+1} \in \mathtt{poly}(N)$.

##### Step 2: Identifying the block index $n$ at every position (simulating the role of the layer hash norm).

The log-precision construction of Merrill & Sabharwal ([2025a](https://arxiv.org/html/2605.30523#bib.bib35)) uses the *layer hash norm* to compute, at every padding position $n'$, the index $n \stackrel{\text{def}}{=} \lfloor (n' - N)/N^{K+1} \rfloor$ of the block containing $n'$. This identifier is what tells each padding position which output symbol of $\texttt{t}$ its block is supposed to compute, and it is also what lets attention be restricted to a single block. At constant precision the layer hash norm is unavailable, and the transformer cannot itself compute $\lfloor (n' - N)/N^{K+1} \rfloor$ on the fly. However, this quantity depends only on $n'$ and $N$, both visible to the PE machine $\mathcal{M}_{2}$ of [Def. A.5](https://arxiv.org/html/2605.30523#A1.Thmdefinition5). We therefore extend the PE at every position $n'$ with the additional logspace-computable fields

$$\bm{p}_{\textrm{blk}}(n', N) \stackrel{\text{def}}{=} \texttt{B}^{\pm}\left(\left\lfloor (n' - N)/N^{K+1} \right\rfloor\right), \qquad \bm{p}_{\textrm{off}}(n', N) \stackrel{\text{def}}{=} \texttt{B}^{\pm}\left((n' - N) \bmod N^{K+1}\right), \tag{39}$$

encoding the block index and within-block offset; both are computable in $\mathcal{O}(\log N)$ space since division and modulo on $\mathcal{O}(\log N)$-bit numbers are in $\mathtt{L}$. This precomputation injects *at the time of PE computation* exactly the information the layer hash norm would have provided on the fly in the log-precision construction.
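The two PE fields are plain integer division and remainder. The following Python sketch computes them as integers; it assumes 0-based positions with the input occupying positions $0, \ldots, N-1$ and blocks numbered from $0$, and it omits the $\texttt{B}^{\pm}$ encoding step. The function names `block_fields` and `padding_length` are hypothetical.

```python
def block_fields(n_prime, N, K):
    """Return (block index, within-block offset) of padding position n_prime >= N."""
    block_size = N ** (K + 1)
    # floor((n' - N) / N^(K+1)) and (n' - N) mod N^(K+1) in a single step
    return divmod(n_prime - N, block_size)

def padding_length(N, c, K):
    """Total padding: N^c blocks of N^(K+1) positions each, i.e. N^(c+K+1)."""
    return (N ** c) * (N ** (K + 1))
```

For example, with $N = 2$, $K = 1$, and $c = 2$ there are $4$ blocks of size $4$, and padding position $n' = 6$ is the first position of block $1$.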

##### Step 3 (Stage 1): computing the reduction in parallel across blocks.

Using the per-position block index from Step 2, every position inside block $n$ can recognize itself as belonging to block $n$ and can read off the block-local offset. Attention can therefore be routed within a single block by matching block indices: By Svete & Sabharwal ([2026](https://arxiv.org/html/2605.30523#bib.bib48), Lem. D.7), a single attention head can be made to attend exclusively to positions whose PE block index equals a query block index, ignoring all other positions. This lets us simulate $\mathcal{T}_{\texttt{t}}$ inside block $n$: The input symbols $w_{1} \cdots w_{N}$ are addressed unconditionally, the $N^{K+1}$ padding positions of block $n$ play the role of $\mathcal{T}_{\texttt{t}}$’s padding tape, and the query index $n$ is available from $\bm{p}_{\textrm{blk}}$ at every position in block $n$. Running this in parallel inside every block uses the same fixed set of layers (every block executes the same computation on different binary inputs $\texttt{B}(n)$), so $\mathcal{T}_{\texttt{t}}$’s contribution to $\mathcal{T}$’s depth is exactly $\mathcal{T}_{\texttt{t}}$’s own depth. After Stage 1, the final position of block $n$ contains the symbol $\texttt{t}(\bm{w})_{n}$.
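The effect of block-matched routing can be illustrated by listing which key positions a query is allowed to see. This Python sketch shows only the resulting visibility pattern, not the attention-head construction of the cited lemma; representing input positions by a block id of `None` and the name `visible_positions` are assumptions made for the example.

```python
def visible_positions(block_ids, query_pos):
    """Key positions visible to query_pos under block-matched routing.

    block_ids[j] is the block index of padding position j, or None when j is
    one of the N input positions (these are addressed unconditionally).
    """
    target = block_ids[query_pos]
    return [j for j, b in enumerate(block_ids) if b is None or b == target]
```

With two input positions and two blocks of size two, a query in block $0$ sees the input plus its own block, and nothing from block $1$.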

##### Step 4 (Stage 2): recognizing $\texttt{t}(\bm{w}) \in \mathcal{L}$ with $\mathcal{T}_{\mathcal{L}}$.

We now run𝒯ℒ\{\{\\mathcal\{T\}\}\}\_\{\\mathcal\{L\}\}on top of Stage 1’s output\. The stringt​\(𝒘\)\{\\texttt\{t\}\}\(\{\{\\bm\{w\}\}\}\)that𝒯ℒ\{\{\\mathcal\{T\}\}\}\_\{\\mathcal\{L\}\}consumes is spread across theNc\{\{N\}\}^\{c\}block\-final positions: Positionn′=defN\+n⋅NK\+1\+\(NK\+1−1\)\{\{n\}\}^\{\\prime\}\\mathrel\{\\stackrel\{\{\\scriptstyle\\textnormal\{\\tiny def\}\}\}\{\{=\}\}\}\{\{N\}\}\+\{\{n\}\}\\cdot\{\{N\}\}^\{K\+1\}\+\(\{\{N\}\}^\{K\+1\}\-1\)holdst​\(𝒘\)n\{\\texttt\{t\}\}\(\{\{\\bm\{w\}\}\}\)\_\{\{n\}\}\. We modify every attention head of𝒯ℒ\{\{\\mathcal\{T\}\}\}\_\{\\mathcal\{L\}\}in exactly the wayMerrill & Sabharwal \([2025a](https://arxiv.org/html/2605.30523#bib.bib35)\)construction does: Add a large negative bias to the attention score of any key position that is*not*block\-final, so that each head attends only over theNc\{\{N\}\}^\{c\}block\-final positions and effectively sees the stringt​\(𝒘\)\{\\texttt\{t\}\}\(\{\{\\bm\{w\}\}\}\)\. Whether a position is block\-final is a function ofn′\{\{n\}\}^\{\\prime\}andN\{\{N\}\}alone, so it can also be precomputed as a binary PE flag𝒑end​\(n′,N\)∈\{0,1\}\{\\bm\{p\}\}\_\{\\textrm\{end\}\}\(\{\{n\}\}^\{\\prime\},\{\{N\}\}\)\\in\{\\\{0,1\\\}\}byℳ2\{\{\\mathcal\{M\}\}\}\_\{2\}in logspace\. Likewise,𝒯ℒ\{\{\\mathcal\{T\}\}\}\_\{\\mathcal\{L\}\}’s PEs are reinterpreted: At block\-final positionn′\{\{n\}\}^\{\\prime\}we expose𝒑𝒯ℒ​\(n,Nc\)\{\\bm\{p\}\}\_\{\{\{\\mathcal\{T\}\}\}\_\{\\mathcal\{L\}\}\}\(\{\{n\}\},\{\{N\}\}^\{c\}\)\(withn\{\{n\}\}read from𝒑blk\{\\bm\{p\}\}\_\{\\textrm\{blk\}\}\), so that𝒯ℒ\{\{\\mathcal\{T\}\}\}\_\{\\mathcal\{L\}\}sees the same PE pattern it would on a length\-Nc\{\{N\}\}^\{c\}input\. 
Since𝒯ℒ∈𝙻​\-uniform​𝙻𝙿𝚃c,l𝚍\{\{\\mathcal\{T\}\}\}\_\{\\mathcal\{L\}\}\\in\{\{\\mathtt\{L\}\}\\text\{\-uniform\}\}\\penalty 10000\\ \{\{\\mathtt\{LPT\}\}^\{\\mathtt\{d\}\}\_\{\{\\texttt\{c\}\},\{\\texttt\{l\}\}\}\}, the modified𝒯ℒ\{\{\\mathcal\{T\}\}\}\_\{\\mathcal\{L\}\}contributes exactly𝒯ℒ\{\{\\mathcal\{T\}\}\}\_\{\\mathcal\{L\}\}’s depth, and the resulting computation accepts if and only ift​\(𝒘\)∈ℒ\{\\texttt\{t\}\}\(\{\{\\bm\{w\}\}\}\)\\in\{\\mathcal\{L\}\}, i\.e\.,𝒘∈ℒ′\{\{\\bm\{w\}\}\}\\in\{\\mathcal\{L\}\}^\{\\prime\}\.
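The index arithmetic behind the block index, within-block offset, and block-final flag can be illustrated with a small sketch. It assumes 0-indexed positions and blocks, with the $N$ input positions first and consecutive blocks of $N^{K+1}$ padding positions after them; the helper names `block_layout` and `block_final_position` are hypothetical and this is not the logspace machine $\mathcal{M}_{2}$ itself.

```python
def block_final_position(n: int, N: int, K: int) -> int:
    """Position n' = N + n * N^(K+1) + (N^(K+1) - 1) of the last slot of block n."""
    block_len = N ** (K + 1)
    return N + n * block_len + (block_len - 1)


def block_layout(pos: int, N: int, K: int):
    """Return (block index, within-block offset, block-final flag) for a padding position.

    Assumption: positions are 0-indexed; positions 0..N-1 hold the input and
    therefore belong to no block (None is returned for them).
    """
    block_len = N ** (K + 1)
    if pos < N:
        return None
    blk, off = divmod(pos - N, block_len)
    end = int(off == block_len - 1)  # 1 exactly at the last position of a block
    return blk, off, end
```

Under these assumptions, `block_layout(block_final_position(n, N, K), N, K)` always returns block index `n` with the flag set, which is the property the negative attention bias relies on.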

##### Step 5:𝙻​\-uniformity\{\{\\mathtt\{L\}\}\\text\{\-uniformity\}\}and resource budget\.

By construction,𝒯\{\{\\mathcal\{T\}\}\}’s parameter matrices are obtained by gluing the matrices of𝒯t\{\{\\mathcal\{T\}\}\}\_\{\\texttt\{t\}\}and𝒯ℒ\{\{\\mathcal\{T\}\}\}\_\{\\mathcal\{L\}\}\(both𝙻​\-uniform\{\{\\mathtt\{L\}\}\\text\{\-uniform\}\}\) with constant\-size routing weights for the block\-matching and block\-final attention biases, soℳ1\{\{\\mathcal\{M\}\}\}\_\{1\}remains logspace\. The PE machineℳ2\{\{\\mathcal\{M\}\}\}\_\{2\}emits, in addition to𝒯t\{\{\\mathcal\{T\}\}\}\_\{\\texttt\{t\}\}’s and𝒯ℒ\{\{\\mathcal\{T\}\}\}\_\{\\mathcal\{L\}\}’s PEs, the signed\-binary block index𝒑blk\{\\bm\{p\}\}\_\{\\textrm\{blk\}\}, the within\-block offset𝒑off\{\\bm\{p\}\}\_\{\\textrm\{off\}\}, and the block\-final flag𝒑end\{\\bm\{p\}\}\_\{\\textrm\{end\}\}; all three are computable in𝒪​\(log⁡N\)\{\{\{\{\\mathcal\{O\}\}\}\(\\log\{\{N\}\}\)\}\}space\. The depth, width, precision, and loop count of𝒯\{\{\\mathcal\{T\}\}\}are the sum of those of𝒯t\{\{\\mathcal\{T\}\}\}\_\{\\texttt\{t\}\}and𝒯ℒ\{\{\\mathcal\{T\}\}\}\_\{\\mathcal\{L\}\}plus a constant overhead for the routing layers; the two looped blocks of𝒯t\{\{\\mathcal\{T\}\}\}\_\{\\texttt\{t\}\}and𝒯ℒ\{\{\\mathcal\{T\}\}\}\_\{\\mathcal\{L\}\}fit into the multi\-block formulation of[Def\.˜A\.6](https://arxiv.org/html/2605.30523#A1.Thmdefinition6), each loopedΘ​\(logd⁡N\)\{\{\{\{\\Theta\}\}\(\\log^\{d\}\{\{N\}\}\)\}\}times, or equivalently merge into a single block run forrt​\(N\)\+rℒ​\(N\)=Θ​\(logd⁡N\)r\_\{\\texttt\{t\}\}\(\{\{N\}\}\)\+r\_\{\\mathcal\{L\}\}\(\{\{N\}\}\)=\{\{\{\{\\Theta\}\}\(\\log^\{d\}\{\{N\}\}\)\}\}iterations under a one\-bit phase flag selecting which sub\-block executes\. Either way,𝒯∈𝙻​\-uniform​𝙻𝙿𝚃c,l𝚍\{\{\\mathcal\{T\}\}\}\\in\{\{\\mathtt\{L\}\}\\text\{\-uniform\}\}\\penalty 10000\\ \{\{\\mathtt\{LPT\}\}^\{\\mathtt\{d\}\}\_\{\{\\texttt\{c\}\},\{\\texttt\{l\}\}\}\}\. 
Finally, the Stage 1 construction builds anSMATthat, by the focusing argument of[Lem\.˜B\.5](https://arxiv.org/html/2605.30523#A2.Thmlemma5), behaves identically to anAHAT; the same applies to Stage 2 via𝒯ℒ\{\{\\mathcal\{T\}\}\}\_\{\\mathcal\{L\}\}\. The lemma therefore holds for both attention patterns, completing the constant\-precision case\. ∎

###### Theorem C\.1\(Analog ofMerrill & Sabharwal,[2025b](https://arxiv.org/html/2605.30523#bib.bib36), Thm\. 2\)\.

There exists an𝙻​\-uniform​𝙻𝙿𝚃c,l𝟷\{\{\\mathtt\{L\}\}\\text\{\-uniform\}\}\\penalty 10000\\ \{\{\\mathtt\{LPT\}\}^\{\\mathtt\{1\}\}\_\{\{\\texttt\{c\}\},\{\\texttt\{l\}\}\}\}transformer family\{𝒯N\}N∈ℕ\{\\\{\{\{\\mathcal\{T\}\}\}\_\{\{N\}\}\\\}\}\_\{\{\{N\}\}\\in\{\{\\mathbb\{N\}\}\}\}such that𝒯N\{\{\\mathcal\{T\}\}\}\_\{\{N\}\}solves connectivity on \(directed or undirected\) graphs overN\{\{N\}\}vertices: Given theN×N\{\{N\}\}\\times\{\{N\}\}adjacency matrix of a graph𝒢\{\\mathcal\{G\}\},N3\{\{N\}\}^\{3\}padding positions, ands,t∈\[N\]s,t\\in\{\\left\[\{\{N\}\}\\right\]\}in binary,𝒯N\{\{\\mathcal\{T\}\}\}\_\{\{N\}\}checks whether𝒢\{\\mathcal\{G\}\}has a path from vertexssto vertextt\.

###### Proof\.

We consider a directed graph𝒢\{\\mathcal\{G\}\}overN\{\{N\}\}vertices and follow the construction fromMerrill & Sabharwal,[2025b](https://arxiv.org/html/2605.30523#bib.bib36), Thm\. 2\. We again only highlight the differences in the construction\.

Let𝑨∈\{0,1\}N×N\{\{\{\\bm\{A\}\}\}\}\\in\\\{0,1\\\}^\{\{\{N\}\}\\times\{\{N\}\}\}be𝒢\{\\mathcal\{G\}\}’s adjacency matrix\. The input to the transformer is a string over the alphabetΣ=def\{0,1,□,&\}\{\{\\Sigma\}\}\\mathrel\{\\stackrel\{\{\\scriptstyle\\textnormal\{\\tiny def\}\}\}\{\{=\}\}\}\\\{0,1,\{\\square\},\\texttt\{\\&\}\\\}, where□\{\\square\}is the padding symbol and&is a dedicated*separator*symbol used to delimit the binary encodings of the source and target nodes; the adjacency matrix𝑨\{\{\{\\bm\{A\}\}\}\}occupiesN2\{\{N\}\}^\{2\}positions over\{0,1\}\\\{0,1\\\}, followed byN3\{\{N\}\}^\{3\}padding positions□\{\\square\}, and finally the source and target nodess,t∈\{1,…,N\}s,t\\in\\\{1,\\ldots,\{\{N\}\}\\\}\. In contrast toMerrill & Sabharwal \([2025b](https://arxiv.org/html/2605.30523#bib.bib36), Thm\. 2\),ssandttare represented in binary with⌈log⁡N⌉\\lceil\\log\{\{N\}\}\\rceilbits each:

$$\underbrace{A_{1,1} \ldots A_{N,N}}_{N^{2}}\ \underbrace{\square \ldots \square}_{N^{3}}\ \texttt{\&}\ \texttt{B}(s)\ \texttt{\&}\ \texttt{B}(t) \tag{40}$$
In contrast toMerrill & Sabharwal \([2025b](https://arxiv.org/html/2605.30523#bib.bib36), Thm\. 2\), we cannot rely on the layer hash norm to identify positions\. To account for that, we provide the transformer with the following PEs:

$$\texttt{PE}(n, N) \stackrel{\text{def}}{=} \begin{pmatrix} \texttt{B}^{\pm}(n) \\ \texttt{B}^{\pm}(n \bmod N) \\ \texttt{B}^{\pm}(\lfloor n / N \rfloor) \\ \texttt{B}^{\pm}(n') \\ \texttt{B}^{\pm}(n' \bmod N) \\ \texttt{B}^{\pm}(\lfloor n' / N^{2} \rfloor) \end{pmatrix} \in \{0,1\}^{\mathcal{O}(\log N)}, \tag{41}$$

where $n' \stackrel{\text{def}}{=} \max(0, n - N^{2})$. $\texttt{B}(n)$ denotes the binary encoding of $n$ using $\lceil \log N \rceil$ bits and $\texttt{B}^{\pm}(n)$ denotes the signed binary encoding $2\,\texttt{B}(n) - \bm{1}$, where $\bm{1}$ is the $\lceil \log N \rceil$-dimensional vector of all ones. These PEs can be computed in logspace.
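A minimal sketch of these encodings may help to fix ideas. The function names, the most-significant-bit-first ordering, and the explicit `width` parameter are assumptions of the sketch; `width` must be chosen large enough to hold the largest index that is encoded.

```python
def B(n: int, width: int) -> list[int]:
    """Binary encoding of n with `width` bits (assumed most significant bit first)."""
    return [(n >> (width - 1 - b)) & 1 for b in range(width)]


def B_pm(n: int, width: int) -> list[int]:
    """Signed binary encoding 2 * B(n) - 1, i.e. each bit 0/1 is mapped to -1/+1."""
    return [2 * bit - 1 for bit in B(n, width)]


def pe(n: int, N: int, width: int) -> list[list[int]]:
    """The six signed-binary fields of the positional encoding, in the order of Eq. (41)."""
    n_prime = max(0, n - N * N)
    fields = [n, n % N, n // N, n_prime, n_prime % N, n_prime // (N * N)]
    return [B_pm(f, width) for f in fields]
```

Each field only needs integer division, remainder, and a subtraction on $\mathcal{O}(\log N)$-bit numbers, which is consistent with the claim that the PEs are logspace-computable.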

##### Initial layers: Reading the source and target\.

The transformer𝒯N\{\{\\mathcal\{T\}\}\}\_\{\{N\}\}first uses⌈log⁡N⌉\+1\\lceil\\log\{\{N\}\}\\rceil\+1layers to read the binary encodings of the source and target nodesssandttstored in the string into a single dimension of the residual stream, using the𝙻​\-uniform\{\{\\mathtt\{L\}\}\\text\{\-uniform\}\}construction from[Lem\.˜B\.7](https://arxiv.org/html/2605.30523#A2.Thmlemma7)\. After these layers, the residual stream at the dedicated positions forssandttcontainsB​\(s\)\{\{\{\\texttt\{B\}\}\\left\(s\\right\)\}\}andB​\(t\)\{\{\{\\texttt\{B\}\}\\left\(t\\right\)\}\}, respectively, in a single coordinate block; combined with the PEs, every position can now refer to both endpoints of the requested path\.

##### Repeated layers: Iteratively doubling reachability\.

The transformer maintains, in its residual stream, two families of predicates indexed byℓ∈\{0,1,…,⌈log⁡N⌉\}\\ell\\in\{\\\{0,1,\\ldots,\\lceil\\log\{\{N\}\}\\rceil\\\}\}:

1. $B_{\ell}(i,j) \in \{0,1\}$, stored at the input position with coordinates $(i,j)$ (one of the first $N^{2}$ positions), encoding whether $\mathcal{G}$ has a path of length at most $2^{\ell}$ from $i$ to $j$.
2. $C_{\ell}(i,k,j) \in \{0,1\}$, stored at the padding position with coordinates $(i,k,j)$ (one of the $N^{3}$ padding positions), encoding whether $\mathcal{G}$ has paths of length at most $2^{\ell-1}$ from $i$ to $k$ *and* from $k$ to $j$.

The base predicateB0​\(i,j\)=𝑨​\(i,j\)∨𝟙​\{i=j\}\{\{B\}\}\_\{0\}\(i,j\)=\{\{\{\\bm\{A\}\}\}\}\(i,j\)\\lor\{\\mathbbm\{1\}\}\\left\\\{i=j\\right\\\}is computed at every input position\(i,j\)\(i,j\)in a single layer: Position\(i,j\)\(i,j\)already holds the adjacency bit𝑨​\(i,j\)\{\{\{\\bm\{A\}\}\}\}\(i,j\), and the testi=ji=jis a function of the PE coordinates stored at the position\. The iterative step then alternates between two layers, each of which applies once perℓ∈\{1,…,⌈log⁡N⌉\}\\ell\\in\{\\\{1,\\ldots,\\lceil\\log\{\{N\}\}\\rceil\\\}\}:

1. Computing $C_{\ell}$ in the padding positions. A padding position with coordinates $(i,k,j)$ uses two attention heads to retrieve $B_{\ell-1}(i,k)$ from the input position $(i,k)$ and $B_{\ell-1}(k,j)$ from the input position $(k,j)$. Both retrievals are facilitated by binary PEs: Head 1 attends with query $\texttt{B}^{\pm}(i \cdot N + k)$ and head 2 with query $\texttt{B}^{\pm}(k \cdot N + j)$, both against keys $\texttt{B}^{\pm}(i' \cdot N + j')$ at input position $(i',j')$. [Lemmata B.3](https://arxiv.org/html/2605.30523#A2.Thmlemma3) and [B.5](https://arxiv.org/html/2605.30523#A2.Thmlemma5) ensure each head focuses on the unique target input position in $\mathbb{F}_{b}$ arithmetic. The MLP then computes $C_{\ell}(i,k,j) = B_{\ell-1}(i,k) \land B_{\ell-1}(k,j)$ from the retrieved bits.
2. Computing $B_{\ell}$ in the input positions. An input position with coordinates $(i,j)$ sets $B_{\ell}(i,j) = 1$ iff there exists $k$ with $C_{\ell}(i,k,j) = 1$, i.e., $B_{\ell}(i,j) = \bigvee_{k=1}^{N} C_{\ell}(i,k,j)$. We compute this disjunction by detection ([Lem. B.6](https://arxiv.org/html/2605.30523#A2.Thmlemma6)): One attention head at the input position $(i,j)$ attends over the $N$ padding positions $(i,k,j)$ for $k \in [N]$ (selected via the PE block $\texttt{B}^{\pm}(n')$ that exposes the full padding position index) with value $C_{\ell}(i,k,j)$ encoded as a one-hot vector; the aggregated value is positive iff some $k$ has $C_{\ell}(i,k,j) = 1$, and the MLP thresholds it to recover $B_{\ell}(i,j) \in \{0,1\}$.
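The two alternating layers can be mirrored by a plain sequential sketch of the same recurrence. It only illustrates the logic of the predicates $B_{\ell}$ and $C_{\ell}$, not the attention-based implementation; the function name and the 0-indexed vertices are assumptions of the sketch.

```python
def reachability_by_doubling(A):
    """Return the matrix B_{ceil(log2 N)}, where entry (i, j) is True iff j is reachable from i.

    A is an N x N adjacency matrix given as nested lists of 0/1 (vertices 0-indexed).
    """
    N = len(A)
    # Base predicate: B_0(i, j) = A(i, j) OR [i == j].
    B = [[bool(A[i][j]) or i == j for j in range(N)] for i in range(N)]
    iterations = (N - 1).bit_length()  # equals ceil(log2 N) for N >= 1
    for _ in range(iterations):
        # Padding positions: C_l(i, k, j) = B_{l-1}(i, k) AND B_{l-1}(k, j).
        C = [[[B[i][k] and B[k][j] for j in range(N)] for k in range(N)] for i in range(N)]
        # Input positions: B_l(i, j) = OR over k of C_l(i, k, j).
        B = [[any(C[i][k][j] for k in range(N)) for j in range(N)] for i in range(N)]
    return B
```

The intermediate array `C` has $N^{3}$ entries, matching the $N^{3}$ padding positions, and every entry of `C` and `B` within one iteration depends only on the previous iteration, which is why each step fits into a constant number of layers.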

Each pair of layers doubles the reachability radius, so after $\lceil \log N \rceil$ iterations every reachable pair is captured by $B_{\lceil \log N \rceil}$. Both the initial $\lceil \log N \rceil + 1$ layers that read $s, t$ via [Lem. B.7](https://arxiv.org/html/2605.30523#A2.Thmlemma7) and the $\lceil \log N \rceil$ doubling iterations are obtained by looping a constant-depth block (depth $1$ for the reader and depth $2$ for the doubler), with the iteration counter $\ell$ tracked in a dedicated residual-stream coordinate. Both blocks fit into the multi-block formulation of [Def. A.6](https://arxiv.org/html/2605.30523#A1.Thmdefinition6), each looped $\mathcal{O}(\log N)$ times; equivalently, the two blocks can be merged into a single block run for $\mathcal{O}(\log N)$ iterations by adding a one-bit phase flag to the residual stream and selecting which sub-block to execute via the MLP. Either way, the construction lies in $\mathtt{LPT}^{\mathtt{1}}_{\texttt{c},\texttt{l}}$.

##### Final layer: Reading offB⌈log⁡N⌉​\(s,t\)\{\{B\}\}\_\{\\lceil\\log\{\{N\}\}\\rceil\}\(s,t\)\.

OnceB⌈log⁡N⌉\{\{B\}\}\_\{\\lceil\\log\{\{N\}\}\\rceil\}has been populated at every input position, a single additional layer at the last position reads off the answer: It issues an attention head with queryB±​\(s⋅N\+t\)\{\{\{\\texttt\{B\}\}^\{\\pm\}\}\\left\(s\\cdot\{\{N\}\}\+t\\right\)\}against the keysB±​\(i⋅N\+j\)\{\{\{\\texttt\{B\}\}^\{\\pm\}\}\\left\(i\\cdot\{\{N\}\}\+j\\right\)\}stored at input positions\(i,j\)\(i,j\), attending uniquely to the position with coordinates\(s,t\)\(s,t\), and copies itsB⌈log⁡N⌉\{\{B\}\}\_\{\\lceil\\log\{\{N\}\}\\rceil\}value into the residual stream of the final symbol\. The output layer then accepts if and only if this value is11, i\.e\.,𝒢\{\\mathcal\{G\}\}has anss\-ttpath\.
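The reason a signed-binary query singles out one key can be seen from the dot product: two signed-binary vectors of width $b$ have inner product $b$ iff they encode the same integer, and at most $b - 2$ otherwise. The sketch below uses a hard argmax in place of the focusing lemmata, with hypothetical helper names and 0-indexed vertices.

```python
def signed_bits(n: int, width: int) -> list[int]:
    """Signed binary code of n: bit 0/1 mapped to -1/+1 (least significant bit first)."""
    return [2 * ((n >> b) & 1) - 1 for b in range(width)]


def lookup(values, query_index: int, width: int):
    """Return (value at the best-matching key, all attention scores)."""
    q = signed_bits(query_index, width)
    scores = [
        sum(a * b for a, b in zip(q, signed_bits(j, width)))
        for j in range(len(values))
    ]
    best = max(range(len(values)), key=scores.__getitem__)  # hard-attention argmax
    return values[best], scores


def read_off(B, s: int, t: int):
    """Read B(s, t) from the flattened matrix, where position i * N + j holds B(i, j)."""
    N = len(B)
    width = max(1, (N * N - 1).bit_length())
    flat = [B[i][j] for i in range(N) for j in range(N)]
    value, _ = lookup(flat, s * N + t, width)
    return value
```

For example, `lookup([10, 20, 30, 40], 2, 2)` scores the four keys as `[0, -2, 2, 0]`, so only the matching key attains the maximum score `width`.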

##### Constant precision and attention\-type\.

Every gadget above uses constant\-precision arithmetic\. Because the only attention pattern used is a saturated lookup or a binary\-detection aggregation, the constructedSMATbehaves identically to anAHAT, so the construction works for both attention types\. ∎

See [Lem. 5.4](https://arxiv.org/html/2605.30523#S5.Thmlemma4)

###### Proof\.

We follow the proof ofMerrill & Sabharwal \([2025a](https://arxiv.org/html/2605.30523#bib.bib35), Lem\. 4\)\. Letℒ\{\\mathcal\{L\}\}be the graph connectivity problem, class𝒞\{\\mathcal\{C\}\}be𝙽𝙻\{\\mathtt\{NL\}\}, and classℛ\{\\mathcal\{R\}\}be𝙵𝙾\{\\mathtt\{FO\}\}\. We will show that the preconditions of[Lem\.˜5\.3](https://arxiv.org/html/2605.30523#S5.Thmlemma3)are met, which will give us𝙽𝙻⊆𝙻​\-uniform​𝙻𝙿𝚃c,l𝟷\{\\mathtt\{NL\}\}\\subseteq\{\{\\mathtt\{L\}\}\\text\{\-uniform\}\}\\penalty 10000\\ \{\{\\mathtt\{LPT\}\}^\{\\mathtt\{1\}\}\_\{\{\\texttt\{c\}\},\{\\texttt\{l\}\}\}\}:

1. 1\.First, graph connectivity is known to be𝙽𝙻\{\\mathtt\{NL\}\}\-complete under𝙵𝙾\{\\mathtt\{FO\}\}reductions\(Immerman,[1999](https://arxiv.org/html/2605.30523#bib.bib23)\)\.
2. 2\.Second,[Thm\.˜C\.1](https://arxiv.org/html/2605.30523#A3.Thmtheorem1)shows that log\-depth transformers with cubic padding can recognize the graph connectivity problemℒ\{\\mathcal\{L\}\}, i\.e\.,ℒ∈𝙻​\-uniform​𝙻𝙿𝚃c,l𝟷\{\\mathcal\{L\}\}\\in\{\{\\mathtt\{L\}\}\\text\{\-uniform\}\}\\penalty 10000\\ \{\{\\mathtt\{LPT\}\}^\{\\mathtt\{1\}\}\_\{\{\\texttt\{c\}\},\{\\texttt\{l\}\}\}\}\.
3. 3\.Finally,[Thm\.˜4\.1](https://arxiv.org/html/2605.30523#S4.Thmtheorem1)gives𝙻​\-uniform​𝙻𝙿𝚃c,l𝟶=𝙻​\-uniform​𝙰𝙲𝟶\{\{\\mathtt\{L\}\}\\text\{\-uniform\}\}\\penalty 10000\\ \{\{\\mathtt\{LPT\}\}^\{\\mathtt\{0\}\}\_\{\{\\texttt\{c\}\},\{\\texttt\{l\}\}\}\}=\{\{\\mathtt\{L\}\}\\text\{\-uniform\}\}\\penalty 10000\\ \{\{\{\{\{\{\\mathtt\{AC\}\}\}^\{\\mathtt\{0\}\}\}\}\}\}\. Since𝙻​\-uniform​𝙰𝙲𝟶⊇𝙵𝙾​\-uniform​𝙰𝙲𝟶=𝙵𝙾\{\{\\mathtt\{L\}\}\\text\{\-uniform\}\}\\penalty 10000\\ \{\{\{\{\{\{\\mathtt\{AC\}\}\}^\{\\mathtt\{0\}\}\}\}\}\}\\supseteq\{\{\\mathtt\{FO\}\}\\text\{\-uniform\}\}\\ \{\{\{\{\{\{\\mathtt\{AC\}\}\}^\{\\mathtt\{0\}\}\}\}\}\}=\{\\mathtt\{FO\}\}\(Mix Barrington et al\.,[1990](https://arxiv.org/html/2605.30523#bib.bib39)\),𝙻​\-uniform​𝙻𝙿𝚃c,l𝟶\{\{\\mathtt\{L\}\}\\text\{\-uniform\}\}\\penalty 10000\\ \{\{\\mathtt\{LPT\}\}^\{\\mathtt\{0\}\}\_\{\{\\texttt\{c\}\},\{\\texttt\{l\}\}\}\}can recognize languages in𝙵𝙾\{\\mathtt\{FO\}\}and can therefore compute𝙵𝙾\{\\mathtt\{FO\}\}reductions\.

∎

#### C\.3\.1Circuit Evaluation

We now introduce the circuit evaluation problem and show its completeness under𝙻\{\\mathtt\{L\}\}reductions\. This section adaptsMerrill & Sabharwal,[2025a](https://arxiv.org/html/2605.30523#bib.bib35), App\. Dto our setting\.

###### Definition C\.1\(𝙰𝙲\{\{\\mathtt\{AC\}\}\}circuit encoding\)\.

LetC\{C\}be an𝙰𝙲\{\{\\mathtt\{AC\}\}\}circuit overN\{\{N\}\}inputs\. We define the encoding

$$\langle C \rangle \stackrel{\text{def}}{=} \underbrace{\texttt{X} \ldots \texttt{X}}_{N \text{ times}} \circ \langle G_{1} \rangle \circ \ldots \circ \langle G_{M} \rangle \tag{42}$$

where $\texttt{X}$ is a placeholder symbol for holding the input of the circuit (note that, although the first $N$ positions contain placeholders for input strings, the circuit encoding is independent of the input string: since all compatible strings are of the same length, the circuit encoding does not change depending on the input string) and $\langle G_{m} \rangle$ for $m \in [M]$ encodes the $m$th gate with $K$ inputs as:

$$\langle G_{m} \rangle \stackrel{\text{def}}{=} \texttt{T}(G_{m})\ \texttt{\&}\,\texttt{B}(g_{1})\ \texttt{\#}\,\texttt{B}(m)\ \ldots\ \texttt{\&}\,\texttt{B}(g_{K})\ \texttt{\#}\,\texttt{B}(m), \tag{43}$$

where $\texttt{T}(G_{m}) \in \{\texttt{AND}, \texttt{OR}, \texttt{NOT}\}$ denotes $G_{m}$'s type, $\texttt{B}(g_{i})$ is the binary encoding of the position of the $i$th argument of the gate $G_{m}$, and $\texttt{B}(m)$ is the binary encoding of the gate's index $m$.

###### Example C\.1\.

The encoding of the circuitC​\(x1,x2,x3\)=\(x1∧x2\)∨x3\{C\}\(x\_\{1\},x\_\{2\},x\_\{3\}\)=\(x\_\{1\}\\land x\_\{2\}\)\\lor x\_\{3\}is

$$\langle C \rangle \stackrel{\text{def}}{=} \underbrace{\texttt{X X X}}_{\text{input}}\ \texttt{AND}\ \underbrace{\texttt{\&000 \#010 \&001 \#010}}_{\text{arguments to AND}}\ \texttt{OR}\ \underbrace{\texttt{\&011 \#100 \&010 \#100}}_{\text{arguments to OR}}. \tag{44}$$
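The serialization format can be sketched as a small string builder. The function names are hypothetical, tokens are assumed to be whitespace-separated, and the pointer values and gate indices are passed in explicitly (here copied verbatim from Example C.1) so that the sketch does not commit to any particular indexing convention.

```python
def encode_gate(gate_type: str, arg_positions, gate_index: int, width: int) -> str:
    """Serialize one gate: its type, then '&pointer #gate_index' for every argument."""
    parts = [gate_type]
    for g in arg_positions:
        parts.append("&" + format(g, f"0{width}b"))           # binary pointer to the argument
        parts.append("#" + format(gate_index, f"0{width}b"))  # replicated gate index
    return " ".join(parts)


def encode_circuit(num_inputs: int, gates, width: int) -> str:
    """Serialize a circuit: input placeholders X followed by the gate encodings.

    `gates` is a list of (type, argument pointers, gate index) triples.
    """
    tokens = ["X"] * num_inputs
    tokens += [encode_gate(t, args, m, width) for t, args, m in gates]
    return " ".join(tokens)


example = encode_circuit(3, [("AND", [0, 1], 2), ("OR", [3, 2], 4)], width=3)
```

With the values from Example C.1, `example` is the string `X X X AND &000 #010 &001 #010 OR &011 #100 &010 #100`. Each pointer occupies a fixed number of bits, in contrast to a unary encoding whose length grows with the pointer value.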

There are two main differences between the definitions of Merrill & Sabharwal ([2025a](https://arxiv.org/html/2605.30523#bib.bib35)) and ours:

1. \(1\)We encode the argument pointers of each gate using their*binary encodings*instead of the unary ones used byMerrill & Sabharwal \([2025a](https://arxiv.org/html/2605.30523#bib.bib35)\)\. We require this to ensure that a shallow \(in our case, log\-depth\) constant\-precision transformer can convert the circuit encoding into the contents of the residual stream\. It is not clear how to do this with unary encodings\.
2. \(2\)Second, we replicate the positions of the gates after the pointer to each argument \(prefixed by the special\#symbol\)\. This is needed to avoid the use of the layer hash norm to compute the pointer to the argument’s gate, which is done byMerrill & Sabharwal \([2025a](https://arxiv.org/html/2605.30523#bib.bib35)\)\.

###### Definition C\.2\(ℱ\{\{\{\{\\mathcal\{F\}\}\}\}\}circuit evaluation; Def\. 15\)\.

Theℱ\{\{\{\{\\mathcal\{F\}\}\}\}\}circuit evaluation problem takes as input\(𝐰,⟨C⟩\)\(\{\{\\bm\{w\}\}\},\{\\langle\{C\}\\rangle\}\)where𝐰∈\{0,1\}∗\{\{\\bm\{w\}\}\}\\in\{\{\\\{0,1\\\}^\{\*\}\}\}and⟨C⟩\{\\langle\{C\}\\rangle\}is a serialized circuit, and outputsC​\(𝐰\)\{C\}\(\{\{\\bm\{w\}\}\}\)\.

We refer to the special case whereℱ=𝙰𝙲𝚍\{\{\{\{\\mathcal\{F\}\}\}\}\}=\{\{\{\{\{\{\\mathtt\{AC\}\}\}^\{\\mathtt\{d\}\}\}\}\}\}as the𝙰𝙲𝚍\{\{\{\{\{\{\\mathtt\{AC\}\}\}^\{\\mathtt\{d\}\}\}\}\}\}circuit evaluation problem\. We further definewide\-𝙰𝙲𝚍⊆𝙰𝙲𝚍\{\{\{\{\{\{\\mathtt\{AC\}\}\}^\{\\mathtt\{d\}\}\}\}\}\}\\subseteq\{\{\{\{\{\{\\mathtt\{AC\}\}\}^\{\\mathtt\{d\}\}\}\}\}\}as the class of circuit families\{CN\}N=0∞\\\{\{C\}\_\{\{N\}\}\\\}\_\{\{\{N\}\}=0\}^\{\\infty\}such that there exists someccsuch that, for largeN\{\{N\}\}, the depth ofCN\{C\}\_\{\{N\}\}is at mostc​logd⁡Nc\\log^\{d\}\{\{N\}\}and, crucially, the size is*at least*Nc\{\{N\}\}^\{c\}\. That is, the size \(and hence the width\) of the circuit is large relative to its depth\. We define the correspondingwide\-𝙰𝙲𝚍\{\{\{\{\{\{\\mathtt\{AC\}\}\}^\{\\mathtt\{d\}\}\}\}\}\}circuit evaluation problemas a variant of the circuit evaluation problem withℱ=wide\-​𝙰𝙲𝚍\{\{\{\{\\mathcal\{F\}\}\}\}\}=\\text\{wide\-\}\{\{\{\{\{\{\\mathtt\{AC\}\}\}^\{\\mathtt\{d\}\}\}\}\}\}\. The𝚃𝙲𝚍\{\{\{\{\{\{\\mathtt\{TC\}\}\}^\{\\mathtt\{d\}\}\}\}\}\}analogs of these concepts are defined inMerrill & Sabharwal,[2025a](https://arxiv.org/html/2605.30523#bib.bib35), App\. D\.

The following lemmata show that both $\mathtt{AC}^{\mathtt{d}}$ and wide-$\mathtt{AC}^{\mathtt{d}}$ circuit evaluation are hard for $\mathtt{FO}\text{-uniform}\ \mathtt{AC}^{\mathtt{d}}$ under $\mathtt{L}$ reductions. Their proofs are identical to the ones in Merrill & Sabharwal ([2025a](https://arxiv.org/html/2605.30523#bib.bib35)), except that we replace $\mathtt{TC}^{\mathtt{d}}$ with $\mathtt{AC}^{\mathtt{d}}$ (the proofs are not affected by the differences in circuit serializations, since our serialization is also logspace-computable).

###### Lemma C\.1\(Lem\. 11\)\.

Ford≥1d\\geq 1,𝙰𝙲𝚍\{\{\{\{\{\{\\mathtt\{AC\}\}\}^\{\\mathtt\{d\}\}\}\}\}\}circuit evaluation is hard for𝙻​\-uniform\{\{\\mathtt\{L\}\}\\text\{\-uniform\}\}𝙰𝙲𝚍\{\{\{\{\{\{\\mathtt\{AC\}\}\}^\{\\mathtt\{d\}\}\}\}\}\}under𝙻\{\\mathtt\{L\}\}reductions\.

###### Lemma C\.2\(Lem\. 12\)\.

Ford≥1d\\geq 1, wide\-𝙰𝙲𝚍\{\{\{\{\{\{\\mathtt\{AC\}\}\}^\{\\mathtt\{d\}\}\}\}\}\}circuit evaluation is hard for𝙻​\-uniform\{\{\\mathtt\{L\}\}\\text\{\-uniform\}\}𝙰𝙲𝚍\{\{\{\{\{\{\\mathtt\{AC\}\}\}^\{\\mathtt\{d\}\}\}\}\}\}under𝙻\{\\mathtt\{L\}\}reductions\.

###### Lemma C\.3\(Constant\-precision path; Lem\. 13\)\.

Let⟨C⟩\{\\langle\{C\}\\rangle\}be the serialization of a depth\-L\{\{L\}\}circuit withAND,OR, andNOTgates over𝐰∈\{0,1\}∗\{\{\\bm\{w\}\}\}\\in\{\{\{\\\{0,1\\\}\}^\{\*\}\}\}\. There exist two𝙻​\-uniform\{\{\\mathtt\{L\}\}\\text\{\-uniform\}\}constant\-precision log\-width transformer families\{𝒯Nread\}N∈ℕ\{\\\{\{\{\\mathcal\{T\}\}\}^\{\\textup\{read\}\}\_\{\{N\}\}\\\}\}\_\{\{\{N\}\}\\in\{\{\\mathbb\{N\}\}\}\}and\{𝒯Nstep\}N∈ℕ\{\\\{\{\{\\mathcal\{T\}\}\}^\{\\textup\{step\}\}\_\{\{N\}\}\\\}\}\_\{\{\{N\}\}\\in\{\{\\mathbb\{N\}\}\}\}such that:

1. \(1\)\{𝒯Nread\}N∈ℕ\{\\\{\{\{\\mathcal\{T\}\}\}^\{\\textup\{read\}\}\_\{\{N\}\}\\\}\}\_\{\{\{N\}\}\\in\{\{\\mathbb\{N\}\}\}\}is in𝙻​\-uniform​𝙻𝙿𝚃c,l𝟷\{\{\\mathtt\{L\}\}\\text\{\-uniform\}\}\\penalty 10000\\ \{\{\\mathtt\{LPT\}\}^\{\\mathtt\{1\}\}\_\{\{\\texttt\{c\}\},\{\\texttt\{l\}\}\}\}and converts the input𝒘∘⟨C⟩\{\{\\bm\{w\}\}\}\\circ\{\\langle\{C\}\\rangle\}into internal pointer representations in𝒪​\(log⁡N\)\{\{\{\{\\mathcal\{O\}\}\}\(\\log\{\{N\}\}\)\}\}loop iterations;
2. \(2\)\{𝒯Nstep\}N∈ℕ\{\\\{\{\{\\mathcal\{T\}\}\}^\{\\textup\{step\}\}\_\{\{N\}\}\\\}\}\_\{\{\{N\}\}\\in\{\{\\mathbb\{N\}\}\}\}is a constant\-depth𝙻​\-uniform\{\{\\mathtt\{L\}\}\\text\{\-uniform\}\}family in𝙻𝙿𝚃c,l𝟶\{\{\\mathtt\{LPT\}\}^\{\\mathtt\{0\}\}\_\{\{\\texttt\{c\}\},\{\\texttt\{l\}\}\}\}that, when unrolledL\{\{L\}\}times on the output of𝒯Nread\{\{\\mathcal\{T\}\}\}^\{\\textup\{read\}\}\_\{\{N\}\}, computesC​\(𝒘\)\{C\}\(\{\{\\bm\{w\}\}\}\)\.

Their composition𝒯Nstep∘𝒯Nread\{\{\\mathcal\{T\}\}\}^\{\\textup\{step\}\}\_\{\{N\}\}\\circ\{\{\\mathcal\{T\}\}\}^\{\\textup\{read\}\}\_\{\{N\}\}, run for𝒪​\(log⁡N\)\+L\{\{\{\{\\mathcal\{O\}\}\}\(\\log\{\{N\}\}\)\}\}\+\{\{L\}\}total loop iterations, computesC​\(𝐰\)\{C\}\(\{\{\\bm\{w\}\}\}\)\. In particular, whenL=𝒪​\(logd⁡N\)\{\{L\}\}=\{\{\{\{\\mathcal\{O\}\}\}\(\\log^\{d\}\{\{N\}\}\)\}\}ford≥1d\\geq 1, the composed family is in𝙻​\-uniform​𝙻𝙿𝚃c,l𝚍\{\{\\mathtt\{L\}\}\\text\{\-uniform\}\}\\penalty 10000\\ \{\{\\mathtt\{LPT\}\}^\{\\mathtt\{d\}\}\_\{\{\\texttt\{c\}\},\{\\texttt\{l\}\}\}\}\.

###### Proof sketch\.

The high\-level idea of the construction is similar to that ofMerrill & Sabharwal,[2025a](https://arxiv.org/html/2605.30523#bib.bib35), Lem\. 7, but we need to adapt it to the constant\-precision setting\. LetC\{C\}be a circuit of depthL\{\{L\}\}overN\{\{N\}\}inputs\. The simulation ofC\{C\}is split across the two transformers𝒯Nread\{\{\\mathcal\{T\}\}\}^\{\\textup\{read\}\}\_\{\{N\}\}and𝒯Nstep\{\{\\mathcal\{T\}\}\}^\{\\textup\{step\}\}\_\{\{N\}\}, each responsible for one stage:

1. 1\.Stage 1, implemented by𝒯Nread\{\{\\mathcal\{T\}\}\}^\{\\textup\{read\}\}\_\{\{N\}\}: Converting the input𝒘∘⟨C⟩\{\{\\bm\{w\}\}\}\\circ\{\\langle\{C\}\\rangle\}into internal representations that will allow the transformer to process it\. This stage requires𝒪​\(log⁡N\)\{\{\{\{\\mathcal\{O\}\}\}\(\\log\{\{N\}\}\)\}\}layers\.
2. 2\.Stage 2, implemented by𝒯Nstep\{\{\\mathcal\{T\}\}\}^\{\\textup\{step\}\}\_\{\{N\}\}unrolledL\{\{L\}\}times: Iteratively applying the circuit operations to the internal representations to compute the final output\. This stage requiresL\{\{L\}\}layers\.

Stage 1 converts the binary encodings of the argument pointers stored in the input string⟨C⟩\{\\langle\{C\}\\rangle\}into binary encodings stored in the residual stream as per the𝙻​\-uniform\{\{\\mathtt\{L\}\}\\text\{\-uniform\}\}construction in[Lem\.˜B\.7](https://arxiv.org/html/2605.30523#A2.Thmlemma7)\. These pointers allow the model to retrieve the values stored in those positions \(once they become available\)\. Importantly, this conversion only has to happen once for all gates and inputs in parallel, even if the inputs have not been computed yet—this is possible since the encoding of the entire circuit is available at the beginning\. This means that the computation of pointers adds a fixed overhead of𝒪​\(log⁡N\)\{\{\{\{\\mathcal\{O\}\}\}\(\\log\{\{N\}\}\)\}\}layers to the simulation\.

At the same time, the positions containing the*gate*addressesB​\(m\)\{\{\{\\texttt\{B\}\}\\left\(m\\right\)\}\}compute their binary encodings in the same way as the input arguments \(cf\.[Lem\.˜B\.7](https://arxiv.org/html/2605.30523#A2.Thmlemma7)\), storing the encoding in another designated part of the residual stream\. The final position of the input argument can then attend⌈log⁡N⌉\+1\\lceil\\log\{\{N\}\}\\rceil\+1positions forward to retrieve the position of the gate it belongs to—this can be done by storing the \(signed\) binary encodings of bothn\{\{n\}\}andn\+⌈log⁡N⌉\+1\{\{n\}\}\+\\lceil\\log\{\{N\}\}\\rceil\+1in the PEs\. With this, each input argumentgNg\_\{\{N\}\}contains both the pointer to its value as well as the pointer to the gate it belongs to—this will be used at a later stage of the simulation, when each gate has to read its input argument values before computing its own value\.

The rest of the proof (Stage 2) closely follows that of Merrill & Sabharwal, [2025a](https://arxiv.org/html/2605.30523#bib.bib35), Lem. 7. In contrast to the fully uniform transformer of Merrill & Sabharwal ([2025a](https://arxiv.org/html/2605.30523#bib.bib35)), ours is $\mathtt{L}\text{-uniform}$; the only aspect of the transformer family that depends on $N$ is the size of the matrices, which needs to grow with the growing PEs. The counters required to construct such matrices can be implemented in logspace, which is why the transformer family is $\mathtt{L}\text{-uniform}$. The transformer uses $L$ layers to simulate the circuit layer by layer, using the pointers constructed in Stage 1 to retrieve the values of the input arguments of each gate. The only difference is that, to avoid issues with constant precision, the model computes the AND gates by detecting a $0$ among the inputs and the OR gates by detecting a $1$ among the inputs as per [Lem. B.6](https://arxiv.org/html/2605.30523#A2.Thmlemma6); this is enough to determine the truth value of the gate.
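The layer-by-layer simulation with detection-based gates can be sketched sequentially as follows. This is an illustration of the logic only: the function names are hypothetical, and argument pointers are assumed to be 0-indexed positions into the concatenation of the inputs and the gates.

```python
def eval_gate(gate_type: str, inputs) -> int:
    """Evaluate one gate by detection rather than by counting."""
    if gate_type == "AND":
        return 0 if 0 in inputs else 1  # AND is false iff a 0 is detected
    if gate_type == "OR":
        return 1 if 1 in inputs else 0  # OR is true iff a 1 is detected
    if gate_type == "NOT":
        return 1 - inputs[0]
    raise ValueError(f"unknown gate type: {gate_type}")


def eval_circuit(w, gates, depth: int):
    """Run `depth` synchronous rounds; return the value of the last gate (None if not ready).

    w: list of input bits. gates: list of (type, argument pointers) pairs.
    """
    N = len(w)
    values = list(w) + [None] * len(gates)
    for _ in range(depth):
        new_values = list(values)
        for m, (gate_type, args) in enumerate(gates):
            ins = [values[g] for g in args]
            if None not in ins:  # all arguments were available before this round
                new_values[N + m] = eval_gate(gate_type, ins)
        values = new_values
    return values[-1]
```

For the circuit of Example C.1, `eval_circuit([1, 1, 0], [("AND", [0, 1]), ("OR", [3, 2])], depth=2)` returns `1`, while a single round leaves the output gate undetermined because its AND argument is not yet available.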

As above, this constructs anSMATthat behaves like anAHAT, meaning that it holds for both types of models\.∎

See [Thm. 5.1](https://arxiv.org/html/2605.30523#S5.Thmtheorem1)

###### Proof\.

Letℒ\{\\mathcal\{L\}\}be the wide\-𝙰𝙲𝚍\{\{\{\{\{\{\\mathtt\{AC\}\}\}^\{\\mathtt\{d\}\}\}\}\}\}circuit evaluation problem,𝒞\{\\mathcal\{C\}\}be the class𝙵𝙾​\-uniform​𝙰𝙲𝚍\{\{\\mathtt\{FO\}\}\\text\{\-uniform\}\}\\penalty 10000\\ \{\{\{\{\{\{\\mathtt\{AC\}\}\}^\{\\mathtt\{d\}\}\}\}\}\}, andℛ\{\\mathcal\{R\}\}be the class𝙻\{\\mathtt\{L\}\}\. We prove the two equalities in turn\.

Part \(a\):𝙻​\-uniform​𝙻𝙿𝚃c,l𝚍=𝙵𝙾​\-uniform​𝙰𝙲𝚍\{\{\\mathtt\{L\}\}\\text\{\-uniform\}\}\\penalty 10000\\ \{\{\\mathtt\{LPT\}\}^\{\\mathtt\{d\}\}\_\{\{\\texttt\{c\}\},\{\\texttt\{l\}\}\}\}=\{\{\\mathtt\{FO\}\}\\text\{\-uniform\}\}\\penalty 10000\\ \{\{\{\{\{\{\\mathtt\{AC\}\}\}^\{\\mathtt\{d\}\}\}\}\}\}\.The upper bound𝙻​\-uniform​𝙻𝙿𝚃c,l𝚍⊆𝙵𝙾​\-uniform​𝙰𝙲𝚍\{\{\\mathtt\{L\}\}\\text\{\-uniform\}\}\\penalty 10000\\ \{\{\\mathtt\{LPT\}\}^\{\\mathtt\{d\}\}\_\{\{\\texttt\{c\}\},\{\\texttt\{l\}\}\}\}\\subseteq\{\{\\mathtt\{FO\}\}\\text\{\-uniform\}\}\\penalty 10000\\ \{\{\{\{\{\{\\mathtt\{AC\}\}\}^\{\\mathtt\{d\}\}\}\}\}\}follows directly from[Lem\.˜5\.2](https://arxiv.org/html/2605.30523#S5.Thmlemma2)\. For the lower bound, the preconditions of[Lem\.˜5\.3](https://arxiv.org/html/2605.30523#S5.Thmlemma3)are met:

1. \(1\)[Cor\.˜5\.1](https://arxiv.org/html/2605.30523#S5.Thmcorollary1)shows thatℒ∈𝙻​\-uniform​𝙻𝙿𝚃c,l𝚍\{\\mathcal\{L\}\}\\in\{\{\\mathtt\{L\}\}\\text\{\-uniform\}\}\\penalty 10000\\ \{\{\\mathtt\{LPT\}\}^\{\\mathtt\{d\}\}\_\{\{\\texttt\{c\}\},\{\\texttt\{l\}\}\}\}ford≥1d\\geq 1\. Together with[Lem\.˜5\.2](https://arxiv.org/html/2605.30523#S5.Thmlemma2), this impliesℒ∈𝙵𝙾​\-uniform​𝙰𝙲𝚍\{\\mathcal\{L\}\}\\in\{\{\\mathtt\{FO\}\}\\text\{\-uniform\}\}\\penalty 10000\\ \{\{\{\{\{\{\\mathtt\{AC\}\}\}^\{\\mathtt\{d\}\}\}\}\}\}\.
2. \(2\)[Lem\.˜C\.2](https://arxiv.org/html/2605.30523#A3.Thmlemma2)shows thatℒ\{\\mathcal\{L\}\}is hard for𝙰𝙲𝚍\{\{\{\{\{\{\\mathtt\{AC\}\}\}^\{\\mathtt\{d\}\}\}\}\}\}under𝙻\{\\mathtt\{L\}\}reductions\. Together with step \(1\), we get thatℒ\{\\mathcal\{L\}\}is complete for𝙵𝙾​\-uniform​𝙰𝙲𝚍\{\{\\mathtt\{FO\}\}\\text\{\-uniform\}\}\\penalty 10000\\ \{\{\{\{\{\{\\mathtt\{AC\}\}\}^\{\\mathtt\{d\}\}\}\}\}\}\.
3. \(3\)[Lem\.˜5\.4](https://arxiv.org/html/2605.30523#S5.Thmlemma4)gives us𝙻⊆𝙻​\-uniform​𝙻𝙿𝚃c,l𝚍\{\\mathtt\{L\}\}\\subseteq\{\{\\mathtt\{L\}\}\\text\{\-uniform\}\}\\penalty 10000\\ \{\{\\mathtt\{LPT\}\}^\{\\mathtt\{d\}\}\_\{\{\\texttt\{c\}\},\{\\texttt\{l\}\}\}\}, meaning that𝙻​\-uniform​𝙻𝙿𝚃c,l𝚍\{\{\\mathtt\{L\}\}\\text\{\-uniform\}\}\\penalty 10000\\ \{\{\\mathtt\{LPT\}\}^\{\\mathtt\{d\}\}\_\{\{\\texttt\{c\}\},\{\\texttt\{l\}\}\}\}can compute any𝙻\{\\mathtt\{L\}\}reduction\.

Thus,[Lem\.˜5\.3](https://arxiv.org/html/2605.30523#S5.Thmlemma3)applies, yielding𝙵𝙾​\-uniform​𝙰𝙲𝚍⊆𝙻​\-uniform​𝙻𝙿𝚃c,l𝚍\{\{\\mathtt\{FO\}\}\\text\{\-uniform\}\}\\penalty 10000\\ \{\{\{\{\{\{\\mathtt\{AC\}\}\}^\{\\mathtt\{d\}\}\}\}\}\}\\subseteq\{\{\\mathtt\{L\}\}\\text\{\-uniform\}\}\\penalty 10000\\ \{\{\\mathtt\{LPT\}\}^\{\\mathtt\{d\}\}\_\{\{\\texttt\{c\}\},\{\\texttt\{l\}\}\}\}\.

Note that, unlikeMerrill & Sabharwal \([2025a](https://arxiv.org/html/2605.30523#bib.bib35), Thm\. 3\), which constructs a transformer with depth𝒪​\(logd⁡N\)\{\{\{\{\\mathcal\{O\}\}\}\(\\log^\{d\}\{\{N\}\}\)\}\}to simulate𝚃𝙲𝚍\{\{\{\{\{\{\\mathtt\{TC\}\}\}^\{\\mathtt\{d\}\}\}\}\}\}circuits, our application of[Lem\.˜C\.3](https://arxiv.org/html/2605.30523#A3.Thmlemma3)yields a transformer with depth𝒪​\(log⁡N\+logd⁡N\)\{\{\{\{\\mathcal\{O\}\}\}\(\\log\{\{N\}\}\+\\log^\{d\}\{\{N\}\}\)\}\}to simulate an𝙰𝙲𝚍\{\{\{\{\{\{\\mathtt\{AC\}\}\}^\{\\mathtt\{d\}\}\}\}\}\}circuit\. Ford≥1d\\geq 1, this reduces to𝒪​\(logd⁡N\)\{\{\{\{\\mathcal\{O\}\}\}\(\\log^\{d\}\{\{N\}\}\)\}\}\.
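For completeness, the simplification can be spelled out, assuming base-2 logarithms and $N \geq 2$ so that $\log N \geq 1$:

$$\log N \le \log^{d} N \quad (d \ge 1) \;\Longrightarrow\; \log N + \log^{d} N \le 2 \log^{d} N \;\Longrightarrow\; \mathcal{O}(\log N + \log^{d} N) = \mathcal{O}(\log^{d} N).$$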

Part (b): $\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{\mathtt{d}}_{\texttt{l},\texttt{c}} = \mathtt{FO}\text{-uniform}\ \mathtt{TC}^{\mathtt{d}}$.

*Upper bound.* For SMATs, $\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{\mathtt{d}}_{\texttt{l},\texttt{c}} \subseteq \mathtt{FO}\text{-uniform}\ \mathtt{TC}^{\mathtt{d}}$ (cf. [Lem. 5.2](https://arxiv.org/html/2605.30523#S5.Thmlemma2)). For AHATs, the same inclusion applies because [Lem. 4.2](https://arxiv.org/html/2605.30523#S4.Thmlemma2) and [Lem. 5.2](https://arxiv.org/html/2605.30523#S5.Thmlemma2) both hold for AHATs as well.

*Lower bound.* The matching lower bound $\mathtt{FO}\text{-uniform}\ \mathtt{TC}^{\mathtt{d}} \subseteq \mathtt{L}\text{-uniform}\ \mathtt{LPT}^{\mathtt{d}}_{\texttt{l},\texttt{c}}$ traces the steps outlined in [§5](https://arxiv.org/html/2605.30523#S5). The starting point is Merrill & Sabharwal ([2025a](https://arxiv.org/html/2605.30523#bib.bib35), Thm. 3), which establishes that fully-uniform log-precision AHATs with $\Theta(\log^{d} N)$ looping simulate $\mathtt{FO}\text{-uniform}\ \mathtt{TC}^{\mathtt{d}}$:

$$\mathtt{FO}\text{-uniform}\ \mathtt{TC}^{\mathtt{d}} \subseteq \mathtt{FO}\text{-uniform}\ \mathtt{LPT}^{\mathtt{d}}_{\texttt{l},\texttt{c}} \text{ for AHATs.} \tag{45}$$

Merrill & Sabharwal's ([2025a](https://arxiv.org/html/2605.30523#bib.bib35)) proof of [Eq. 45](https://arxiv.org/html/2605.30523#A3.E45) is based on the reduction framework restated in [Lem. 5.3](https://arxiv.org/html/2605.30523#S5.Thmlemma3). They show that fully-uniform log-precision AHATs can compute every $\mathtt{L}$ reduction (via the $\mathtt{NL}$ lower bound, the AHAT analog of [Lem. 5.4](https://arxiv.org/html/2605.30523#S5.Thmlemma4)) and can recognize the wide-$\mathtt{TC}^{\mathtt{d}}$ circuit evaluation problem (the AHAT/$\mathtt{TC}^{\mathtt{d}}$ analog of [Cor. 5.1](https://arxiv.org/html/2605.30523#S5.Thmcorollary1)), and they invoke their reduction lemma (the AHAT/log-precision analog of [Lem. 5.3](https://arxiv.org/html/2605.30523#S5.Thmlemma3)) to conclude. Since wide-$\mathtt{TC}^{\mathtt{d}}$ circuit evaluation is $\mathtt{FO}\text{-uniform}\ \mathtt{TC}^{\mathtt{d}}$-complete under $\mathtt{L}$ reductions (Merrill & Sabharwal, [2025a](https://arxiv.org/html/2605.30523#bib.bib35), Lem. 12, App. D), which is the $\mathtt{TC}^{\mathtt{d}}$ analog of [Lem. C.2](https://arxiv.org/html/2605.30523#A3.Thmlemma2), this yields [Eq. 45](https://arxiv.org/html/2605.30523#A3.E45).

We lift [Eq. 45](https://arxiv.org/html/2605.30523#A3.E45) to $\mathtt{L}\text{-uniform}$ SMATs in two steps. First, since fully uniform families are a special case of $\mathtt{L}\text{-uniform}$ families, [Eq. 45](https://arxiv.org/html/2605.30523#A3.E45) gives

$$\mathtt{FO}\text{-uniform}\ \mathtt{TC}^{\mathtt{d}} \subseteq \mathtt{L}\text{-uniform}\ \mathtt{LPT}^{\mathtt{d}}_{\texttt{l},\texttt{c}} \text{ for AHATs.} \tag{46}$$

Second, we apply [Lem. 3.1](https://arxiv.org/html/2605.30523#S3.Thmlemma1) to each AHAT family produced by [Eq. 46](https://arxiv.org/html/2605.30523#A3.E46). The lemma constructs an $\mathtt{L}\text{-uniform}$ logarithmic-precision SMAT family that matches the AHAT's outputs on every input, uses the same number of layers, and preserves the loop count of the original looped block (the SMAT replacement is layer-by-layer, with the looped block of the AHAT corresponding to the looped block of the SMAT). The width and PE width are preserved up to a constant factor. Hence the resulting SMAT family lies in $\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{\mathtt{d}}_{\texttt{l},\texttt{c}}$, transferring the lower bound to SMATs:

$$\mathtt{FO}\text{-uniform}\ \mathtt{TC}^{\mathtt{d}} \subseteq \mathtt{L}\text{-uniform}\ \mathtt{LPT}^{\mathtt{d}}_{\texttt{l},\texttt{c}} \text{ for SMATs.} \tag{47}$$

Combining [Eqs. 46](https://arxiv.org/html/2605.30523#A3.E46) and [47](https://arxiv.org/html/2605.30523#A3.E47) with the upper bound established above yields $\mathtt{L}\text{-uniform}\ \mathtt{LPT}^{\mathtt{d}}_{\texttt{l},\texttt{c}} = \mathtt{FO}\text{-uniform}\ \mathtt{TC}^{\mathtt{d}}$ for both attention types. ∎
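As a reading aid, the argument for Part (b) can be laid out as one chain of inclusions. The annotations "(AHAT)" and "(SMAT)" are shorthand introduced here to mark which attention type a class is instantiated with; they are not notation from the paper.

$$
\begin{aligned}
\mathtt{FO}\text{-uniform}\ \mathtt{TC}^{\mathtt{d}}
&\subseteq \mathtt{FO}\text{-uniform}\ \mathtt{LPT}^{\mathtt{d}}_{\texttt{l},\texttt{c}}\ \text{(AHAT)} && \text{Eq. 45} \\
&\subseteq \mathtt{L}\text{-uniform}\ \mathtt{LPT}^{\mathtt{d}}_{\texttt{l},\texttt{c}}\ \text{(AHAT)} && \text{Eq. 46: fully uniform is a special case of } \mathtt{L}\text{-uniform} \\
&\subseteq \mathtt{L}\text{-uniform}\ \mathtt{LPT}^{\mathtt{d}}_{\texttt{l},\texttt{c}}\ \text{(SMAT)} && \text{Lem. 3.1, giving Eq. 47} \\
&\subseteq \mathtt{FO}\text{-uniform}\ \mathtt{TC}^{\mathtt{d}} && \text{upper bound, cf. Lem. 5.2}
\end{aligned}
$$

Because the chain starts and ends at the same class, every inclusion in it is an equality, which is how the result covers both attention types at once.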
