Source: [https://arxiv.org/html/2605.08123](https://arxiv.org/html/2605.08123)
## Block-Wise Differentiable Sinkhorn Attention: Tail-Refinement Gradients with a Gap-Aware Dustbin Bridge

###### Abstract

We study long-context balanced entropic optimal transport (OT) attention on TPU hardware through a stopped-base, fixed-depth tail-refinement surrogate. After a stopped $T$-step Sinkhorn solve, we unroll a short refinement tail and differentiate that surrogate exactly. For the production $R=2$ case, the backward pass contains four staircase plan factors. We prove an exact one-reference-tile schedule: the $R=2$ score cotangent is a single reference plan tile times an explicit modifier field built from vector cotangents and dual differences. This yields block-wise cost $O((T+R)LW)$, $O(Ld)$ input storage, and $O(L)$ additional HBM usage for fixed head dimension $d$ and band width $W$. We also formalize the current `dustbin_block` path as the same balanced surrogate on an augmented support, so the schedule lifts to the gap-aware transport path used in our TPU runs. We provide a local surrogate-bias bound, an a posteriori bias certificate, and a projective contraction certificate for strictly positive active blocks. On synthetic masked problems, the optimized kernel matches exact autodiff of the same centered surrogate to within $10^{-5}$–$10^{-10}$. On TPU v6e-8, a four-configuration Pfam screen completes end-to-end, and a promoted balanced $R=2$ run sustains roughly 8.5 examples per second through a three-hour budget, reaching step 1437. Held-out Pfam test shards improve reconstruction from 3.17 to 0.99 and sparse CE from 5.86 to 5.69 relative to step 0. These results support exact fixed-depth backward theory, a theorem-matching gap-aware bridge, and trainability evidence for the production path.

## 1 Introduction

Pairwise alignment and structural matching remain central problems for sequence modeling. Classical dynamic-programming methods such as Smith–Waterman (Smith and Waterman, [1981](https://arxiv.org/html/2605.08123#bib.bib10)) provide optimal alignments but are not differentiable, so they cannot be inserted directly into end-to-end neural architectures. Entropic optimal transport (OT) offers a natural differentiable relaxation (Cuturi, [2013](https://arxiv.org/html/2605.08123#bib.bib5); Peyré and Cuturi, [2019](https://arxiv.org/html/2605.08123#bib.bib9)): it replaces a discrete assignment with a smooth transport plan computed by Sinkhorn scaling.

For long-context models, however, two bottlenecks dominate. First, dense OT attention materializes $L\times L$ scores or plans, which is prohibitive at large $L$. Second, backward-pass design is difficult. Exact autodiff through many Sinkhorn steps is expensive and memory-hungry, while implicit differentiation can require large and unstable linear solves (Luise et al., [2018](https://arxiv.org/html/2605.08123#bib.bib7); Bai et al., [2019](https://arxiv.org/html/2605.08123#bib.bib1)). The key question is therefore not only how to make OT attention differentiable, but how to make its differentiated object explicit, exact for the surrogate actually used, and compatible with accelerator kernels.

We address this with a stopped-base, fixed-depth *tail-refinement surrogate*. After a stopped $T$-step balanced Sinkhorn solve, we append a short differentiable tail of depth $R$ and differentiate that surrogate exactly. The production TPU path uses $R=2$. A direct implementation of the exact $R=2$ adjoint must evaluate four distinct staircase plan factors. The signature implementation observation of the paper is that these factors lie in a single row/column scaling orbit, so each block can be evaluated from one reference transport tile plus vector modifiers, avoiding separate plan-tile formation for the remaining three terms.

#### Gap-aware bridge.

The central factorization is balanced and fixed-support. The first extension carried in the unified manuscript is intentionally narrow: the current `dustbin_block` path is not treated as a new OT solver, but as the same balanced surrogate on an augmented state space. This keeps the gap-aware transport path inside the algebraic orbit of the main factorization, rather than forcing a new theorem prematurely.

#### Related context.

Our work is closest in spirit to recent efforts to make OT-style attention practical at accelerator scale, including FlashSinkhorn (Ye et al., [2026](https://arxiv.org/html/2605.08123#bib.bib11)). The comparison separates along two axes. FlashSinkhorn is an IO-aware solver for streaming entropic OT updates; our contribution is the differentiated stopped-base backward object. In particular, a direct $R=2$ adjoint must evaluate four distinct staircase plan factors, whereas our transport-orbit calculus evaluates each block from one reference transport tile plus vector modifiers. This does not replace a direct solver-level benchmark against FlashSinkhorn or a direct four-plan-adjoint kernel benchmark, both of which remain future work, and we do not claim wall-clock dominance without such benchmarks. The present paper instead establishes the exact factorized backward object and its TPU realization on the alignment stack used in our experiments. The approach is also complementary to long-context transformer architectures that obtain efficiency through structured sparsity rather than transport constraints, such as Longformer (Beltagy et al., [2020](https://arxiv.org/html/2605.08123#bib.bib2)) and BigBird (Zaheer et al., [2020](https://arxiv.org/html/2605.08123#bib.bib12)).

The paper makes three main contributions:

- an explicit balanced fixed-support tail-refinement surrogate together with its exact $R=2$ adjoint and one-reference-tile backward schedule;
- a theorem-aligned dustbin augmentation showing that the current gap-aware transport path is the same balanced surrogate on an enlarged state space;
- an empirical bridge from exactness validation to TPU-scale training, comprising a completed four-configuration Pfam screen, a promoted long-budget balanced run, and a held-out checkpoint evaluation on Pfam test shards.

## 2 Method

### 2.1 Forward Pass: Block-Wise Balanced Sinkhorn

Let $Q,K,V$ be query, key, and value tensors with head dimension $d$, and let $\Omega$ denote a fixed active support with band width $W$. We absorb the entropic temperature $\varepsilon$ into the score matrix

$$S_{ij}=\frac{Q_iK_j^{\top}}{\sqrt{d}\,\varepsilon}.$$

The stopped base solve and the refinement tail both use the same masked balanced half-steps

$$u_i^{(t+1)}=-\log\Bigl[\sum_{j:(i,j)\in\Omega}\exp\bigl(S_{ij}+v_j^{(t)}\bigr)\Bigr],\qquad(1)$$

$$v_j^{(t+1)}=-\log\Bigl[\sum_{i:(i,j)\in\Omega}\exp\bigl(S_{ij}+u_i^{(t+1)}\bigr)\Bigr].\qquad(2)$$

The implementation computes these updates in tiles, in the same spirit as IO-aware attention kernels. For fixed band width $W$, a $T$-step base solve followed by an $R$-step tail has cost $O((T+R)LW)$ time, $O(Ld)$ input storage, and $O(L)$ additional HBM usage.
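
For concreteness, the following dense pure-JAX sketch implements the masked half-steps (1)–(2); the names (`masked_half_steps`, `mask`) are illustrative and the production kernel instead evaluates the same updates tile-by-tile over the band, so this is a reference semantics, not the TPU implementation.

```python
import jax
import jax.numpy as jnp

NEG_INF = -1e30  # additive mask for pairs outside the active support Omega

def masked_half_steps(S, mask, u, v, num_steps):
    """Run `num_steps` balanced half-step pairs, Eqs. (1)-(2), on `mask`."""
    S_m = jnp.where(mask, S, NEG_INF)

    def step(carry, _):
        u, v = carry
        u = -jax.scipy.special.logsumexp(S_m + v[None, :], axis=1)  # Eq. (1)
        v = -jax.scipy.special.logsumexp(S_m + u[:, None], axis=0)  # Eq. (2)
        return (u, v), None

    (u, v), _ = jax.lax.scan(step, (u, v), None, length=num_steps)
    return u, v
```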

### 2.2 Balanced Tail Refinement and the $R=2$ Staircase

Let $(u^{(0)},v^{(0)})$ be the stopped potentials produced by the base $T$-step solve. The differentiated object is not the full $T$-step map; it is the fixed-depth surrogate

$$\widetilde{O}^{[R]}(Q,K,V;u^{(0)},v^{(0)})=\mathcal{A}\bigl(Q,K,V,u^{(R)},v^{(R)}\bigr),$$

where the refinement iterates are recomputed from the stopped base pair. The implementation stores centered dual representatives for numerical stability, while the algebra below uses ungauged representatives; Appendix [A](https://arxiv.org/html/2605.08123#A1) proves that the centered and ungauged forward recurrences are gauge-equivalent.
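
A minimal sketch of the stopped-base construction, reusing `masked_half_steps` from the sketch above: the base pair is wrapped in `stop_gradient`, so autodiff sees only the $R$-step tail plus the transport application. Here `apply_transport` is an assumed helper standing in for $\mathcal{A}$; it is not part of the paper's stated API.

```python
import jax
import jax.numpy as jnp

def surrogate_output(S, mask, Q, K, V, T, R, apply_transport):
    """Fixed-depth surrogate: T stopped base steps, then an R-step tail."""
    u0 = jnp.zeros(S.shape[0])
    v0 = jnp.zeros(S.shape[1])
    u, v = masked_half_steps(S, mask, u0, v0, T)   # stopped base solve
    u, v = jax.lax.stop_gradient((u, v))           # (u^(0), v^(0)) in the text
    u, v = masked_half_steps(S, mask, u, v, R)     # differentiable tail
    return apply_transport(Q, K, V, u, v)          # A(Q, K, V, u^(R), v^(R))
```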

Define staircase plans

$$P_{ij}^{(a,b)}=\begin{cases}\exp\bigl(S_{ij}+u_i^{(a)}+v_j^{(b)}\bigr),&(i,j)\in\Omega,\\ 0,&(i,j)\notin\Omega.\end{cases}$$

Let $\mathcal{L}$ be a scalar loss depending on $\widetilde{O}^{[R]}$, let $G=\partial\mathcal{L}/\partial\widetilde{O}^{[R]}$, and define

$$Z_{ij}=\langle G_i,V_j\rangle,\qquad g_u=\bigl(P^{(R,R)}\odot Z\bigr)\mathbf{1},\qquad g_v=\bigl(P^{(R,R)}\odot Z\bigr)^{\top}\mathbf{1}.$$
###### Proposition 1 (Exact Fixed-Depth Tail Adjoint).

The cotangents of the refinement duals satisfy

$$\bar{v}^{(R)}=g_v,\qquad(3)$$

$$\bar{u}^{(R)}=g_u-P^{(R,R)}\bar{v}^{(R)},\qquad(4)$$

$$\bar{v}^{(t-1)}=-\bigl(P^{(t,t-1)}\bigr)^{\top}\bar{u}^{(t)},\qquad t=R,R-1,\dots,1,\qquad(5)$$

$$\bar{u}^{(t-1)}=-P^{(t-1,t-1)}\bar{v}^{(t-1)},\qquad t=R,R-1,\dots,2,\qquad(6)$$

where the final line stops at $t=2$ because the base state is stop-gradient. The score-space cotangent is

$$\bar{S}=P^{(R,R)}\odot Z-\sum_{t=1}^{R}\Bigl[P^{(t,t)}\odot\bigl(\mathbf{1}\,\bar{v}^{(t)\top}\bigr)+P^{(t,t-1)}\odot\bigl(\bar{u}^{(t)}\mathbf{1}^{\top}\bigr)\Bigr].\qquad(7)$$
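
Read as code, the recurrences (3)–(7) give the following dense sketch; `P` is assumed to map a staircase index pair $(a,b)$ to the plan $P^{(a,b)}$, and all names are local to this illustration.

```python
import jax.numpy as jnp

def tail_adjoint(P, Z, g_u, g_v, R):
    """Dense transcription of Eqs. (3)-(7); P[(a, b)] is the plan P^(a,b)."""
    vbar = g_v                            # Eq. (3)
    ubar = g_u - P[(R, R)] @ vbar         # Eq. (4)
    S_bar = P[(R, R)] * Z                 # direct terminal term of Eq. (7)
    for t in range(R, 0, -1):             # reverse-time accumulation
        S_bar = S_bar - P[(t, t)] * vbar[None, :] - P[(t, t - 1)] * ubar[:, None]
        if t > 1:                         # base state is stop-gradient
            vbar = -P[(t, t - 1)].T @ ubar      # Eq. (5)
            ubar = -P[(t - 1, t - 1)] @ vbar    # Eq. (6)
    return S_bar
```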

For the production kernel we use $R=2$, which yields the staircase

$$\bigl(P^{(2,2)},\,P^{(2,1)},\,P^{(1,1)},\,P^{(1,0)}\bigr).$$
###### Proposition 2 (One-Reference-Tile $R=2$ Backward Schedule).

For any staircase pair $(a,b)\in\{(2,2),(2,1),(1,1),(1,0)\}$,

$$P^{(a,b)}=P^{(2,2)}\odot\exp\bigl(u^{(a)}-u^{(2)}\bigr)\odot\exp\bigl(v^{(b)}-v^{(2)}\bigr).$$

Define

$$\alpha_i=\exp\bigl(u_i^{(1)}-u_i^{(2)}\bigr),\qquad\beta_j=\exp\bigl(v_j^{(1)}-v_j^{(2)}\bigr),\qquad\delta_j=\exp\bigl(v_j^{(0)}-v_j^{(2)}\bigr).$$

Then on the active support the exact $R=2$ score-space VJP is

$$\bar{S}_{ij}=P^{(2,2)}_{ij}\Bigl[Z_{ij}-\bar{v}^{(2)}_{j}-\beta_j\bar{u}^{(2)}_{i}-\alpha_i\beta_j\bar{v}^{(1)}_{j}-\alpha_i\delta_j\bar{u}^{(1)}_{i}\Bigr].$$

Hence, for a $B\times B$ score tile, the exact backward can be scheduled with one resident reference plan tile $P^{(2,2)}_{I,J}$, the corresponding $O(B)$ vector slices, and the $O(Bd)$ activation/cotangent slices needed to form $S_{I,J}$ and $Z_{I,J}$. The remaining staircase factors are never formed as resident plan tiles. For fixed block size, head dimension, band width, $T$, and $R=2$, the schedule uses $O(Ld)$ input storage, $O(L)$ global vector state, and $O(LW)$ tile arithmetic for the tail backward.

For $R=2$, equation ([7](https://arxiv.org/html/2605.08123#S2.E7)) expands to

$$\begin{aligned}\bar{S}&=P^{(2,2)}\odot Z-P^{(2,2)}\odot\bigl(\mathbf{1}\,\bar{v}^{(2)\top}\bigr)-P^{(2,1)}\odot\bigl(\bar{u}^{(2)}\mathbf{1}^{\top}\bigr)\\&\qquad-P^{(1,1)}\odot\bigl(\mathbf{1}\,\bar{v}^{(1)\top}\bigr)-P^{(1,0)}\odot\bigl(\bar{u}^{(1)}\mathbf{1}^{\top}\bigr).\end{aligned}\qquad(8)$$

The reconstruction identity above then replaces separate formation of the last three plan tiles with row and column rescalings of $P^{(2,2)}$.

Figure 1: The $R=2$ staircase. Only the base plan $P^{(2,2)}$ is materialized; the remaining three plans ($P^{(2,1)}$, $P^{(1,1)}$, $P^{(1,0)}$) are reconstructed from dual differences by row and column rescaling.

This schedule is the signature implementation observation in the manuscript. It converts the explicit four-factor $R=2$ backward into evaluation from one reference plan tile plus cheap row and column modifiers, which is why the method maps cleanly to TPU kernels.
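
As a dense illustration of Proposition 2 (not the tiled TPU kernel itself), the exact $R=2$ score VJP can be written as one materialized plan times a modifier field; all argument names are local to this sketch.

```python
import jax.numpy as jnp

def score_vjp_r2(S, mask, u1, u2, v0, v1, v2, Z, vbar1, vbar2, ubar1, ubar2):
    """One-reference form of the exact R=2 score cotangent (Proposition 2)."""
    P22 = jnp.where(mask, jnp.exp(S + u2[:, None] + v2[None, :]), 0.0)
    alpha = jnp.exp(u1 - u2)   # row modifier alpha_i
    beta = jnp.exp(v1 - v2)    # column modifier beta_j
    delta = jnp.exp(v0 - v2)   # column modifier delta_j
    modifier = (Z
                - vbar2[None, :]
                - beta[None, :] * ubar2[:, None]
                - alpha[:, None] * beta[None, :] * vbar1[None, :]
                - alpha[:, None] * delta[None, :] * ubar1[:, None])
    return P22 * modifier      # \bar{S} on the active support
```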

#### Why $R=2$.

The multiplicative orbit identity itself is not unique to $R=2$. Relative to any chosen base staircase plan, every other staircase plan factors by the corresponding dual differences in the same way. What is special about $R=2$ is that the exact adjoint closes into the compact four-factor formula above, so the identity yields a concrete one-reference-tile schedule rather than a larger family of rescaled plans.

### 2.3 Gap-Aware Dustbin Bridge

The current gap-aware production path does not use a new OT solver. Instead, it augments the balanced problem. Let the base sequence length be $L$ and let $B=128$ denote the dustbin block size used in the implementation. Define augmented tensors

$$\widetilde{Q},\widetilde{K},\widetilde{V}\in\mathbb{R}^{(L+B)\times d}$$

by appending one learned dustbin token per side together with $B-1$ inactive filler tokens. The masks are augmented analogously: one appended token is active, the rest are inactive fillers. The implementation then applies the same balanced tail-refinement solver to this augmented system, with effective band width

$$\widetilde{W}=\max(W,L).$$
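
A hypothetical sketch of the augmentation step; the actual `dustbin_block` construction lives in the production kernel, and the names `dustbin_q`/`dustbin_k` and the zero-valued fillers here are illustrative choices, not the paper's stated implementation.

```python
import jax.numpy as jnp

def augment_with_dustbin(Q, K, V, q_mask, k_mask, dustbin_q, dustbin_k, B=128):
    """Append one active dustbin token plus B-1 inactive fillers per side."""
    d = Q.shape[-1]
    pad = jnp.zeros((B - 1, d), Q.dtype)                   # inactive fillers
    Q_aug = jnp.concatenate([Q, dustbin_q[None, :], pad], axis=0)
    K_aug = jnp.concatenate([K, dustbin_k[None, :], pad], axis=0)
    V_aug = jnp.concatenate([V, jnp.zeros((B, d), V.dtype)], axis=0)
    # Only the first appended token per side is active; fillers stay masked.
    extra = jnp.concatenate([jnp.ones(1, bool), jnp.zeros(B - 1, bool)])
    q_mask_aug = jnp.concatenate([q_mask, extra])
    k_mask_aug = jnp.concatenate([k_mask, extra])
    return Q_aug, K_aug, V_aug, q_mask_aug, k_mask_aug
```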
###### Proposition 3 (Dustbin Augmentation Bridge).

Let $(\widetilde{Q},\widetilde{K},\widetilde{V},\widetilde{\Omega})$ denote the dustbin-augmented tensors and active support constructed above. Then the current `dustbin_block` path is exactly the same balanced fixed-support tail-refinement surrogate as in Section 2.2, but applied on the augmented state space.

###### Corollary 1 (Exact $R=2$ Adjoint on the Augmented State Space).

Apply Propositions [1](https://arxiv.org/html/2605.08123#Thmproposition1) and [2](https://arxiv.org/html/2605.08123#Thmproposition2) to the augmented system. Then the current dustbin-augmented transport path inherits the exact fixed-depth adjoint and the same one-reference-tile schedule on the augmented support.

This is the first gap-aware extension carried into the unified paper because it already matches the cloud path. It is a bridge theorem, not a departure from the central balanced factorization.

### 2.4 A Local Surrogate-Bias Bound

The fixed-depth backward is exact for the surrogate it differentiates. A separate question is how far that surrogate gradient can be from the gradient through the full stopped-base computation. Let

$$z=(u,v),\qquad\Phi_x(z)=z^{+},$$

be the centered one-step refinement map at differentiable parameters $x$, and let $A_x(z)$ denote the transport application map. Write

$$z_T(x)=\Phi_x^{T}(z_0),\qquad\Psi_x^{R}(z)=A_x\bigl(\Phi_x^{R}(z)\bigr).$$

Then the full fixed-depth gradient is

$$g_{\mathrm{full}}^{T,R}(x)=\nabla_x\Psi_x^{R}\bigl(z_T(x)\bigr),$$

while the stop-gradient surrogate uses

$$g_{\mathrm{surr}}^{T,R}(x)=\partial_x\Psi_x^{R}(z)\big|_{z=z_T(x)}.$$
###### Theorem 1 (Local Surrogate-Bias Bound).

Assume fixed support, $C^1$ dependence of $\Phi_x$ and $A_x$ on $(x,z)$, and constants $\rho\in[0,1)$, $L_\Phi>0$, and $L_A>0$ such that

$$\|D_z\Phi_x(z)\|\leq\rho,\qquad\|D_x\Phi_x(z)\|\leq L_\Phi,\qquad\|D_zA_x(z)\|\leq L_A$$

on a forward-invariant neighborhood. Then

$$g_{\mathrm{full}}^{T,R}(x)-g_{\mathrm{surr}}^{T,R}(x)=D_z\Psi_x^{R}\bigl(z_T(x)\bigr)\,D_xz_T(x),$$

and

$$\bigl\|g_{\mathrm{full}}^{T,R}(x)-g_{\mathrm{surr}}^{T,R}(x)\bigr\|\leq\frac{L_AL_\Phi}{1-\rho}\,\rho^{R}.$$

The point of this bound is modest but useful: once the support is fixed and the refinement map is locally contractive, the stop-gradient bias decays geometrically in the tail depth $R$. We also record the finite-instance version used by the validation harness.

###### Proposition 4 (Projective Sinkhorn Contraction Certificate).

Consider a strictly positive active block of the balanced or dustbin-augmented support, and let

$$K_{ij}=\exp(S_{ij})$$

be the positive kernel on that block, using the scaled score convention from Section 2.1. For positive column scalings $c,c'$, define the two-half-step Sinkhorn scaling map

$$\mathcal{T}(c)=b\oslash\bigl[K^{\top}\bigl(a\oslash(Kc)\bigr)\bigr],$$

with positive marginals $a,b$; in our normalized update $a=b=\mathbf{1}$. Let $d_H$ denote Hilbert's projective metric on the positive cone and let $\Delta(A)$ be the projective diameter of a positive linear map $A$. Then

$$d_H\bigl(\mathcal{T}(c),\mathcal{T}(c')\bigr)\leq\rho_H\,d_H(c,c'),\qquad\rho_H=\tanh\Bigl(\frac{\Delta(K)}{4}\Bigr)\tanh\Bigl(\frac{\Delta(K^{\top})}{4}\Bigr)<1.$$

Equivalently, in log scalings, with $\Phi_{\mathrm{col}}(v)=\log\mathcal{T}(e^{v})$ modulo additive gauge, the centered half-step pair is a contraction in oscillation seminorm:

$$\operatorname{osc}\bigl(\Phi_{\mathrm{col}}(v)-\Phi_{\mathrm{col}}(v')\bigr)\leq\rho_H\,\operatorname{osc}(v-v').$$

If $\Omega_S=\max_{(i,j)}S_{ij}-\min_{(i,j)}S_{ij}$ on the active block, then a computable sufficient bound is

$$\rho_H\leq\tanh^{2}\Bigl(\frac{\Omega_S}{2}\Bigr)<1.$$

Equivalently, for unscaled logits $\bar{S}$ with $S=\bar{S}/\varepsilon$, this reads

$$\rho_H\leq\tanh^{2}\Bigl(\frac{\Omega_{\bar{S}}}{2\varepsilon}\Bigr).$$

This proposition is a support-local certificate, not a claim about masked zeros. The Birkhoff–Hopf argument (Birkhoff, [1957](https://arxiv.org/html/2605.08123#bib.bib3); Bushell, [1973](https://arxiv.org/html/2605.08123#bib.bib4)) applies on strictly positive active blocks after masked entries and inactive dustbin fillers have been excluded. The dustbin bridge preserves the statement because the active dustbin token is part of the augmented positive block, while the inactive filler tokens are outside the cone on which the contraction is asserted.
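
The range-only sufficient bound is directly computable per block; a minimal sketch, assuming the block's scores are already scaled by $1/\varepsilon$ as in Section 2.1:

```python
import jax.numpy as jnp

def contraction_bound(S_block):
    """Sufficient bound rho_H <= tanh^2(Omega_S / 2) on a positive block."""
    omega = jnp.max(S_block) - jnp.min(S_block)  # score range Omega_S
    return jnp.tanh(omega / 2.0) ** 2
```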

###### Proposition 5 (A Posteriori Bias Certificate).

Let $F_R(x,z)$ denote the scalar tail objective obtained by applying the terminal loss to $A_x(\Phi_x^{R}(z))$, and define the measured base-state cotangent

$$\eta_R(x)=\nabla_zF_R(x,z)\big|_{z=z_T(x)}.$$

Then the full-gradient remainder omitted by the stopped-base surrogate is exactly

$$\nabla_xF_R\bigl(x,z_T(x)\bigr)-\partial_xF_R(x,z)\big|_{z=z_T(x)}=D_xz_T(x)^{\top}\eta_R(x).$$

Consequently, for any compatible norm,

$$\|\Delta g_R\|\leq\|D_xz_T(x)^{\top}\|\,\|\eta_R(x)\|.$$

This certificate is operational: $\eta_R$ is the base cotangent produced by the same fixed-depth tail reverse pass before it is discarded by the stop-gradient rule, and $D_xz_T(x)^{\top}\eta_R$ is a single VJP through the base Sinkhorn solve.

###### Corollary 2 (Certified Tail-Depth Selection).

Fix a tolerance $\tau>0$ and a maximum tail depth $R_{\max}$. For each $R\leq R_{\max}$, let

$$c_R(x)=D_xz_T(x)^{\top}\eta_R(x).$$

Let $C(R)$ be any nondecreasing implementation cost over the candidate depths $\{0,\ldots,R_{\max}\}$. If the implementation chooses the first $R$ such that $\|c_R(x)\|\leq\tau$ and returns the stopped-base gradient at that depth, then the returned gradient differs from full BPTT through the same $T$-step base solve by at most $\tau$. Moreover, this first feasible $R$ solves the discrete certified optimization problem

$$\min_{0\leq R\leq R_{\max}}C(R)\quad\text{subject to}\quad\|c_R(x)\|\leq\tau.$$

If an upper bound $C_T(x)\geq\|D_xz_T(x)^{\top}\|$ is available, the same guarantee and optimality statement hold with the sufficient constraint $C_T(x)\,\|\eta_R(x)\|\leq\tau$.

This turns tail-depth choice into a small certified resource-allocation problem rather than a purely heuristic hyperparameter. Instead of solving an adjoint linear system, one can either use a fixed depth such as $R=2$ when the certificate has already been calibrated, or periodically run the certificate VJP to verify that the stopped-base omission remains below a chosen tolerance.
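
A sketch of the selector, assuming a callable `certificate_vjp(R)` that returns the measured term $c_R(x)$ via one VJP through the base solve (the callable and the max-absolute norm choice are illustrative):

```python
import jax.numpy as jnp

def select_tail_depth(certificate_vjp, tau, R_max):
    """First certified depth; cheapest feasible R since C(R) is nondecreasing."""
    for R in range(R_max + 1):
        c_R = certificate_vjp(R)             # omitted-gradient term c_R(x)
        if jnp.max(jnp.abs(c_R)) <= tau:     # max-absolute tolerance test
            return R
    return R_max  # no depth certified; fall back to the deepest candidate
```

Because the per-depth cost is monotone, the first feasible depth returned by this loop is also the cheapest certified one, which is exactly the optimality statement of Corollary 2.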

###### Proposition 6 (Transport-Orbit Calculus).

Consider any finite collection of transport plans $P^{(a,b)}$ generated from the same masked score tensor $S$ and the same active support, including the plans used by the $R=2$ staircase, the $T$ stopped base reverse pass, the $R$ differentiable tail reverse pass, and the certificate VJP $D_xz_T^{\top}\eta_R$. Fix any reference plan $P^{\star}=P^{(a_\star,b_\star)}$ in this collection and define

$$r^{(a)}=\exp\bigl(u^{(a)}-u^{(a_\star)}\bigr),\qquad s^{(b)}=\exp\bigl(v^{(b)}-v^{(b_\star)}\bigr).$$

Then every plan in the collection lies in the row/column scaling orbit of $P^{\star}$:

$$P^{(a,b)}=\operatorname{diag}\bigl(r^{(a)}\bigr)\,P^{\star}\,\operatorname{diag}\bigl(s^{(b)}\bigr).$$

Equivalently, for any finite sum of plan-weighted score cotangent terms

$$G=\sum_{\ell=1}^{m}P^{(a_\ell,b_\ell)}\odot H_\ell,$$

one has the single-reference representation

$$G=P^{\star}\odot\sum_{\ell=1}^{m}\bigl(r^{(a_\ell)}\bigl(s^{(b_\ell)}\bigr)^{\top}\bigr)\odot H_\ell.$$

Consequently, for fixed $T$, $R$, band width $W$, and head dimension $d$, the stopped gradient, the certified omitted-gradient VJP, and the certified tail-depth test can all be evaluated blockwise with $O((T+R)LW)$ arithmetic, $O(Ld)$ input storage, and $O((T+R)L)$ vector state, while materializing only one transport tile at a time. The compiler abstraction is therefore not a special four-plan trick: it is accumulation over a single transport orbit.
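
The orbit identity is one line of code per plan; this dense sketch mirrors the log-domain reconstruction check reported in Section 3.1 (names local to the sketch):

```python
import jax.numpy as jnp

def reconstruct_plan(P_ref, u_a, u_ref, v_b, v_ref):
    """Rebuild P^(a,b) = diag(r^(a)) P* diag(s^(b)) from dual differences."""
    r = jnp.exp(u_a - u_ref)   # row scaling r^(a)
    s = jnp.exp(v_b - v_ref)   # column scaling s^(b)
    return r[:, None] * P_ref * s[None, :]
```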

## 3 Validation and Experiments

### 3.1 Exactness Validation of the Masked Surrogate

We first separate surrogate correctness from systems scale. On synthetic masked problems for which exact autodiff through the same centered surrogate remains tractable, we compare the optimized $R=2$ kernel against a pure-JAX reference. Table [1](https://arxiv.org/html/2605.08123#S3.T1) reports representative maximum absolute gradient deviations under the default validation setting ($\varepsilon=1$, band width 256, $R=2$).

Table 1: Maximum absolute gradient deviations between the optimized masked $R=2$ kernel and exact autodiff through the same centered surrogate.

Additional forward-consistency checks run through lengths 512, 1024, 2048, 4096, 8192, and 16384, with extended spot checks at 32768, 65536, and 131072. We treat the larger checks as forward-consistency evidence rather than as exact-gradient checks.

To calibrate the local surrogate-bias discussion, we also compared tail-refinement gradients against full BPTT through the stopped-base solve on the same synthetic masked setup at $L=512$, $d=8$, $T=15$, $W=256$, and $\varepsilon=1$. Over seeds 0, 1, 2, the worst-case gradient discrepancy $\max\{|\Delta Q|,|\Delta K|,|\Delta V|\}$ was $1.49\times10^{-4}$ for $R=0$, $1.63\times10^{-5}$ for $R=1$, $1.42\times10^{-5}$ for $R=2$, and $1.42\times10^{-5}$ for $R=4$. In this local exact-gradient calibration, the main empirical drop therefore occurs between zero tail refinement and the first one or two refinement steps, while $R=2$ and $R=4$ are already effectively tied at the scale of the masked validation harness.

Table [2](https://arxiv.org/html/2605.08123#S3.T2) reports the a posteriori certificate from Proposition [5](https://arxiv.org/html/2605.08123#Thmproposition5) on a smaller full-BPTT setting where the base-solve VJP can be checked directly ($L=128$, $d=8$, $T=15$, $W=128$, seeds 0, 1, 2). The certificate reconstructs the full-gradient remainder to numerical precision and shows rapid decay of the measured base-state cotangent.

Table 2: A posteriori surrogate-bias certificate. The residual is the maximum absolute difference between the actual full-vs-stopped gradient gap and the certified base-solve VJP $D_xz_T^{\top}\eta_R$.

Using the certified selector from Corollary [2](https://arxiv.org/html/2605.08123#Thmcorollary2) with max-absolute tolerance $10^{-5}$ selects $R=2$ on all three seeds in this calibration. Since the per-depth tail cost is monotone in $R$, this is the cheapest certified depth among the tested candidates and gives a certificate-level justification for using $R=2$ as the production depth in the TPU kernel.

We also validate the transport-orbit calculus from Proposition [6](https://arxiv.org/html/2605.08123#Thmproposition6) directly over all $2(T+R)$ base and tail half-step plans at $T=15$, $R=2$, $L=128$. Reconstructing every plan from the final reference plan gives maximum log-domain reconstruction error $1.91\times10^{-6}$. Appendix [C](https://arxiv.org/html/2605.08123#A3) reports two additional diagnostics: a controlled direct-four-plan score-adjoint benchmark and projective contraction certificates computed from the promoted Pfam checkpoint.

### 3.2 Short Pfam Screen on TPU v6e-8

The first empirical bridge is the short Pfam screen on TPU v6e-8. All four core configurations complete end-to-end: `k2_eps1`, `k4_eps1`, `muot_dustbin_mono`, and `muot_dustbin_hybrid`. Each uses 50,000 training pairs, $T=15$ base iterations, band width 1024, and global batch size 64. Table [3](https://arxiv.org/html/2605.08123#S3.T3) reports the step-50 screen snapshot.

Table 3: Short TPU v6e-8 Pfam screen across the four core balanced and dustbin-augmented configurations.

Two immediate conclusions follow. First, on the balanced path we observe no meaningful difference between $R=2$ and $R=4$ at this short-screen budget, while $R=2$ remains slightly faster. Second, the theorem-aligned dustbin bridge is operational at the same systems scale as the balanced path.

### 3.3 Promoted Balanced Run

We then promoted the balanced `k2_eps1` configuration to the longer-budget TPU training path. The full objective details are given in the protocol notes in the appendix; for the interpretation of the main text, the important point is that this promotion is a trainability and systems-stability test of the factorization-backed balanced $R=2$ path. The run reached step 1437, repeatedly persisted checkpoints at steps 250, 500, 750, 1000, and 1250, and sustained approximately 8.5 examples per second after startup warmup. The initial throughput at step 0 was 2.67 examples per second, after which the run stabilized near 8.5 examples per second for the remainder of the budget.

Table 4: Held-out Pfam test-shard evaluation for the promoted balanced run.

This promoted run should therefore be interpreted narrowly but positively. It is strong evidence that the factorization-backed balanced $R=2$ path is stable, efficient, and checkpointable under a realistic TPU budget. Table [4](https://arxiv.org/html/2605.08123#S3.T4) shows that the final promoted checkpoint also improves held-out reconstruction substantially and improves sparse CE modestly relative to step 0, even though sparse CE is not optimized directly in this run. The monotonicity violation metric does not improve in this reconstruction-only run. The minibatch training loss is not monotone step-by-step, so late fluctuations in scalar loss should be read as ordinary stochastic optimization behavior rather than as an instability signal. The role of this experiment is still not to serve as a final downstream comparison against alternative attention mechanisms; it is to show that the production factorization-backed path runs cleanly over a long TPU budget and yields nontrivial held-out reconstruction improvement on the benchmark family used in the study.

## 4 Discussion

The unified picture is now clear. The core contribution remains the balanced fixed-support tail-refinement surrogate together with the exact $R=2$ one-reference-tile backward schedule. The dustbin bridge matters because it keeps the first gap-aware extension inside the same theorem family and already matches the current cloud path. The empirical evidence aligns with that theory stack: exactness holds on small masked problems, the four core balanced and dustbin-augmented screen configurations complete on TPU, and the promoted balanced path both trains cleanly through a long budget and improves held-out reconstruction and sparse CE relative to its initialization.

What remains outside the present theorem spine should stay outside it. In particular, a genuinely KL-unbalanced transport theorem, differentiable support geometry, and multimodal routing all remain reasonable follow-ons, but they are not required for the central claim of this manuscript.

#### Limitations and scope.

The strongest limitations are deliberate. The main factorization is balanced and fixed-support; the dustbin bridge is an augmented balanced construction, not a full KL-unbalanced OT theorem. The local surrogate-bias result is local and contraction-based rather than global, and the projective contraction certificate applies to strictly positive active blocks rather than to masked zeros or disconnected supports. Empirically, the promoted long-budget run improves held-out reconstruction and sparse CE, but it is still best read as evidence of trainability and systems stability rather than as a final downstream comparison against alternative attention mechanisms, a direct FlashSinkhorn benchmark, a direct four-plan-adjoint kernel benchmark, or a CE-optimized alignment objective.

#### Broader context.

This work is a methods and systems contribution for long-context differentiable alignment. Its immediate positive impact is to make factorization-backed OT attention practical on accelerator hardware. Its main practical costs are computational: the method still requires substantial accelerator time, and the empirical conclusions should be interpreted only within the stated benchmark scope rather than as deployment claims about downstream decision-making systems.

## 5 Conclusion

We presented a one-reference-tile backward schedule for long-context Sinkhorn attention and carried it through the first gap-aware extension that already matches the production code path. The balanced fixed-support $R=2$ schedule remains the centerpiece. The dustbin augmentation bridge lifts that schedule onto the current gap-aware transport path without changing the underlying surrogate. Exactness validation, a completed four-way TPU screen, and a promoted long-budget balanced run with held-out reconstruction and sparse-CE improvement together show that the result is not merely algebraic: it maps to a practical TPU training path.

## References

- Bai et al. (2019) Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Deep equilibrium models. In *Advances in Neural Information Processing Systems*, volume 32, 2019.
- Beltagy et al. (2020) Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. *arXiv preprint arXiv:2004.05150*, 2020. doi:10.48550/arXiv.2004.05150.
- Birkhoff (1957) Garrett Birkhoff. Extensions of Jentzsch's theorem. *Transactions of the American Mathematical Society*, 85(1):219–227, 1957. doi:10.1090/S0002-9947-1957-0087058-6.
- Bushell (1973) P. J. Bushell. Hilbert's metric and positive contraction mappings in a Banach space. *Archive for Rational Mechanics and Analysis*, 52(4):330–338, 1973. doi:10.1007/BF00247467.
- Cuturi (2013) Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In *Advances in Neural Information Processing Systems*, volume 26, 2013.
- Lin et al. (2023) Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. *Science*, 379(6637):1123–1130, 2023. doi:10.1126/science.ade2574.
- Luise et al. (2018) Giulia Luise, Alessandro Rudi, Massimiliano Pontil, and Carlo Ciliberto. Differential properties of Sinkhorn approximation for learning with Wasserstein distance. In *Advances in Neural Information Processing Systems*, volume 31, 2018.
- Mistry et al. (2021) Jaina Mistry, Sara Chuguransky, Lowri Williams, Matloob Qureshi, Gustavo A. Salazar, et al. Pfam: The protein families database in 2021. *Nucleic Acids Research*, 49(D1):D412–D419, 2021. doi:10.1093/nar/gkaa913.
- Peyré and Cuturi (2019) Gabriel Peyré and Marco Cuturi. Computational optimal transport: With applications to data science. *Foundations and Trends in Machine Learning*, 11(5-6):355–607, 2019. doi:10.1561/2200000073.
- Smith and Waterman (1981) Temple F. Smith and Michael S. Waterman. Identification of common molecular subsequences. *Journal of Molecular Biology*, 147(1):195–197, 1981. doi:10.1016/0022-2836(81)90087-5.
- Ye et al. (2026) Felix X.-F. Ye, Xingjie Li, An Yu, Ming-Ching Chang, Linsong Chu, and Davis Wertheimer. FlashSinkhorn: IO-aware entropic optimal transport. *arXiv preprint arXiv:2602.03067*, 2026. doi:10.48550/arXiv.2602.03067.
- Zaheer et al. (2020) Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontañón, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. Big Bird: Transformers for longer sequences. In *Advances in Neural Information Processing Systems*, volume 33, pages 17283–17297, 2020.

## Appendix A Balanced Tail-Refinement Proofs

#### Notation.

For the appendix proofs we work with ungauged representatives satisfying the raw masked logsumexp half-step equations. Let $\Omega$ denote the active support and define

$$P^{(a,b)}_{ij}=\begin{cases}\exp\bigl(S_{ij}+u_i^{(a)}+v_j^{(b)}\bigr),&(i,j)\in\Omega,\\ 0,&(i,j)\notin\Omega.\end{cases}$$

$$u_i^{(t+1)}=-\log\Bigl[\sum_{j:(i,j)\in\Omega}\exp\bigl(S_{ij}+v_j^{(t)}\bigr)\Bigr],\qquad(9)$$

$$v_j^{(t+1)}=-\log\Bigl[\sum_{i:(i,j)\in\Omega}\exp\bigl(S_{ij}+u_i^{(t+1)}\bigr)\Bigr].\qquad(10)$$

#### Centering map.

Let $\mathbf{1}_q$ and $\mathbf{1}_k$ denote the all-ones vectors on valid queries and keys, and let

$$m=\frac{\mathbf{1}_q}{\mathbf{1}_q^{\top}\mathbf{1}_q}.$$

The implementation stores centered representatives via

$$\mathcal{C}(u,v)=\bigl(u-(m^{\top}u)\mathbf{1}_q,\;v+(m^{\top}u)\mathbf{1}_k\bigr).$$
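
A minimal sketch of the centering map $\mathcal{C}$, taking float masks that select valid queries and keys (names are illustrative):

```python
import jax.numpy as jnp

def center(u, v, q_valid, k_valid):
    """Drop the mean valid-query potential and move it onto the key side."""
    m = q_valid / jnp.sum(q_valid)   # the vector m from the text
    s = jnp.dot(m, u)                # gauge scalar m^T u
    return u - s * q_valid, v + s * k_valid
```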
###### Lemma 1 (Centering Pullback).

If $(\bar{u},\bar{v})$ are cotangents on the outputs of $\mathcal{C}$, then

$$\mathcal{C}^{*}(\bar{u},\bar{v})=\Bigl(\bar{u}+m\bigl(\mathbf{1}_k^{\top}\bar{v}-\mathbf{1}_q^{\top}\bar{u}\bigr),\;\bar{v}\Bigr).$$

###### Proof.

Write $s=m^{\top}u$, so $\delta s=m^{\top}\delta u$. Then

$$\delta\mathcal{C}(u,v)=\bigl(\delta u-(\delta s)\mathbf{1}_q,\;\delta v+(\delta s)\mathbf{1}_k\bigr).$$

Pairing this variation with $(\bar{u},\bar{v})$ and collecting coefficients of $\delta u$ and $\delta v$ yields the stated pullback. ∎

###### Lemma 2 (Gauge Equivariance).

Let $\Gamma_c(u,v)=(u-c\mathbf{1}_q,\;v+c\mathbf{1}_k)$. Then the raw two-half-step map defined by equations ([9](https://arxiv.org/html/2605.08123#A1.E9))–([10](https://arxiv.org/html/2605.08123#A1.E10)) is gauge equivariant.

###### Proof.

Adding $c$ to every valid key potential shifts every logsumexp argument in the $u$-update by the same constant, so the update subtracts $c$ from every valid query potential. The $v$-update is identical with rows and columns swapped. ∎

###### Proposition 7 (Centered-Ungauged Forward Equivalence).

Let $(\widehat{u}^{(t)},\widehat{v}^{(t)})$ be the ungauged iterates and $(u^{(t)},v^{(t)})$ the iterates produced by the centered implementation, both started from the same centered base pair. Then for every $t\geq 1$ there exists a scalar $c_t$ such that

$$(u^{(t)},v^{(t)})=\Gamma_{c_t}\bigl(\widehat{u}^{(t)},\widehat{v}^{(t)}\bigr).$$

Consequently the centered and ungauged implementations produce exactly the same staircase transport plans.

###### Proof.

The claim holds at $t=1$ by definition of the centering step. Inductively, if the two previous representatives differ by a gauge transform, then gauge equivariance of the raw refinement step shows that the next raw outputs also differ by a gauge transform; applying $\mathcal{C}$ merely chooses another representative of the same orbit. ∎

###### Lemma 3 (Half-Step Jacobians).

For every refinement step $t\geq 1$,

$$\frac{\partial u_i^{(t)}}{\partial v_j^{(t-1)}}=-P_{ij}^{(t,t-1)},\qquad(11)$$

$$\frac{\partial v_j^{(t)}}{\partial u_i^{(t)}}=-P_{ij}^{(t,t)}.\qquad(12)$$

###### Proof.

Differentiate the masked logsumexp updates. For example,

$$\frac{\partial u_i^{(t)}}{\partial v_j^{(t-1)}}=-\frac{\exp\bigl(S_{ij}+v_j^{(t-1)}\bigr)}{\sum_{k:(i,k)\in\Omega}\exp\bigl(S_{ik}+v_k^{(t-1)}\bigr)}=-\exp\bigl(S_{ij}+u_i^{(t)}+v_j^{(t-1)}\bigr)=-P_{ij}^{(t,t-1)}.$$

The second identity is identical with rows and columns swapped. ∎

#### Proof of Proposition [1](https://arxiv.org/html/2605.08123#Thmproposition1).

###### Proof.

The direct terminal transport application contributes $P^{(R,R)}\odot Z$ to $\bar{S}$ together with

$$\bar{v}^{(R)}=g_v,\qquad\bar{u}^{(R)}=g_u-P^{(R,R)}\bar{v}^{(R)}.$$

Now fix a reverse step $t\in\{1,\dots,R\}$. Pulling $\bar{u}^{(t)}$ back through the $u$-half-step uses the Jacobian $-P^{(t,t-1)}$:

$$\bar{v}^{(t-1)}=-\bigl(P^{(t,t-1)}\bigr)^{\top}\bar{u}^{(t)},\qquad\bar{S}\mathrel{-}=P^{(t,t-1)}\odot\bigl(\bar{u}^{(t)}\mathbf{1}^{\top}\bigr).$$

For $t\geq 2$, pulling $\bar{v}^{(t-1)}$ back through the $v$-half-step uses the Jacobian $-P^{(t-1,t-1)}$:

$$\bar{u}^{(t-1)}=-P^{(t-1,t-1)}\bar{v}^{(t-1)},\qquad\bar{S}\mathrel{-}=P^{(t-1,t-1)}\odot\bigl(\mathbf{1}\,\bar{v}^{(t-1)\top}\bigr).$$

The recursion stops at $\bar{u}^{(1)}$ because $u^{(0)}$ belongs to the stopped base solve. Summing the direct terminal term with the reverse-time half-step contributions yields equation ([7](https://arxiv.org/html/2605.08123#S2.E7)). ∎

#### Proof of Proposition [2](https://arxiv.org/html/2605.08123#Thmproposition2).

###### Proof.

For any two staircase plans,

$$\frac{P_{ij}^{(a,b)}}{P_{ij}^{(2,2)}}=\exp\bigl(u_i^{(a)}-u_i^{(2)}+v_j^{(b)}-v_j^{(2)}\bigr),$$

which gives the exact reconstruction formula immediately. For the three non-reference factors in the $R=2$ staircase this gives

$$P^{(2,1)}_{ij}=P^{(2,2)}_{ij}\beta_j,\qquad P^{(1,1)}_{ij}=P^{(2,2)}_{ij}\alpha_i\beta_j,\qquad P^{(1,0)}_{ij}=P^{(2,2)}_{ij}\alpha_i\delta_j.$$

Substituting these three identities into the expanded VJP in equation ([8](https://arxiv.org/html/2605.08123#S2.E8)) and factoring out $P^{(2,2)}_{ij}$ gives

$$\bar{S}_{ij}=P^{(2,2)}_{ij}\Bigl[Z_{ij}-\bar{v}^{(2)}_{j}-\beta_j\bar{u}^{(2)}_{i}-\alpha_i\beta_j\bar{v}^{(1)}_{j}-\alpha_i\delta_j\bar{u}^{(1)}_{i}\Bigr].$$

Thus every entry of the direct four-factor adjoint is reproduced exactly from the reference plan and vector modifiers.

For a tile $I\times J$, compute $P^{(2,2)}_{I,J}$ once from $S_{I,J}$, $u_I^{(2)}$, and $v_J^{(2)}$, compute $Z_{I,J}$ from the resident slices of $G_I$ and $V_J$, and form the bracketed modifier using only the slices of $\alpha_I$, $\bar{u}_I^{(2)}$, $\alpha_I\bar{u}_I^{(1)}$, $\beta_J$, $\delta_J$, $\bar{v}_J^{(2)}$, and $\beta_J\bar{v}_J^{(1)}$. Multiplying the single resident reference tile by this modifier yields the exact score-space VJP tile. The value-gradient term uses the same resident reference plan tile, since $\partial\mathcal{L}/\partial V$ receives $(P^{(2,2)}_{I,J})^{\top}G_I$. All subsequent propagation to $Q$ and $K$ is linear in the score cotangent tile, so it can be accumulated before the tile is evicted. The stated storage and arithmetic bounds follow by streaming over the $O(LW/B^2)$ active tiles with fixed $B$, $d$, $W$, and fixed tail depth. ∎

## Appendix B Proofs for the Dustbin Bridge and Surrogate Bias

#### Proof of Proposition [3](https://arxiv.org/html/2605.08123#Thmproposition3).

###### Proof.

After constructing the augmented tensors and masks, the implementation calls the same balanced Sinkhorn forward and backward path as in Section 2.2. Because only the first appended token in each dustbin block is active, every transport entry touching the remaining fillers is identically zero. Because the effective band width is $\widetilde{W}=\max(W,L)$, every active base token can reach the active dustbin token on the opposite side. Thus the implemented path is exactly the balanced fixed-support surrogate on an enlarged state space. ∎

#### Proof of Corollary [1](https://arxiv.org/html/2605.08123#Thmcorollary1).

###### Proof.

Apply Propositions [1](https://arxiv.org/html/2605.08123#Thmproposition1) and [2](https://arxiv.org/html/2605.08123#Thmproposition2) to the augmented tensors and support from Proposition [3](https://arxiv.org/html/2605.08123#Thmproposition3). No new algebra is introduced after the augmentation itself. ∎

#### Proof of Theorem [1](https://arxiv.org/html/2605.08123#Thmtheorem1).

###### Proof.

By the chain rule,

$$\nabla_x\Psi_x^{R}\bigl(z_T(x)\bigr)=\partial_x\Psi_x^{R}(z)\big|_{z=z_T(x)}+D_z\Psi_x^{R}\bigl(z_T(x)\bigr)\,D_xz_T(x),$$

which gives the exact bias identity after subtracting the partial derivative term.

Next,

$$D_z\Psi_x^{R}\bigl(z_T(x)\bigr)=D_zA_x\bigl(z_{T+R}(x)\bigr)\prod_{r=0}^{R-1}D_z\Phi_x\bigl(z_{T+r}(x)\bigr),$$

so

$$\bigl\|D_z\Psi_x^{R}\bigl(z_T(x)\bigr)\bigr\|\leq L_A\rho^{R}.$$

Differentiate the base recursion $z_{t+1}(x)=\Phi_x(z_t(x))$:

$$D_xz_{t+1}(x)=D_x\Phi_x\bigl(z_t(x)\bigr)+D_z\Phi_x\bigl(z_t(x)\bigr)\,D_xz_t(x).$$

Because $z_0$ is independent of $x$, repeated application of the norm bounds yields

$$\|D_xz_T(x)\|\leq L_\Phi\sum_{j=0}^{T-1}\rho^{j}\leq\frac{L_\Phi}{1-\rho}.$$

Combining the two estimates with the exact bias identity gives

$$\bigl\|g_{\mathrm{full}}^{T,R}(x)-g_{\mathrm{surr}}^{T,R}(x)\bigr\|\leq\frac{L_AL_\Phi}{1-\rho}\,\rho^{R}.\qquad\text{∎}$$

#### Proof of Proposition [4](https://arxiv.org/html/2605.08123#Thmproposition4).

###### Proof.

Hilbert's projective metric on the positive cone is

$$d_H(x,y)=\log\frac{\max_i x_i/y_i}{\min_i x_i/y_i}.$$

Positive diagonal scaling and coordinatewise inversion preserve this metric: diagonal scaling cancels in the coordinate ratios, and inversion swaps the maximum and minimum ratios. By the Birkhoff–Hopf theorem, any positive linear map $A$ with finite projective diameter $\Delta(A)$ satisfies

$$d_H(Ax,Ay)\leq\tanh\Bigl(\frac{\Delta(A)}{4}\Bigr)d_H(x,y).$$

Decompose the two-half-step map as

$$c\mapsto Kc\mapsto a\oslash(Kc)\mapsto K^{\top}\bigl(a\oslash(Kc)\bigr)\mapsto b\oslash K^{\top}\bigl(a\oslash(Kc)\bigr).$$

The two linear maps contribute Birkhoff–Hopf contraction factors $\tanh(\Delta(K)/4)$ and $\tanh(\Delta(K^{\top})/4)$, while the marginal scalings and inversions are projective isometries. Therefore

$$d_H\bigl(\mathcal{T}(c),\mathcal{T}(c')\bigr)\leq\tanh\Bigl(\frac{\Delta(K)}{4}\Bigr)\tanh\Bigl(\frac{\Delta(K^{\top})}{4}\Bigr)d_H(c,c').$$

Strict positivity on the active block gives finite projective diameters, hence $\rho_H<1$.

For log scalings,

$$d_H\bigl(e^{v},e^{v'}\bigr)=\operatorname{osc}(v-v'),\qquad\operatorname{osc}(w)=\max_i w_i-\min_i w_i,$$

and additive centering does not change oscillation. This gives the log-potential contraction.

Finally, for a positive matrix $K_{ij}=\exp(S_{ij})$,

$$\Delta(K)=\log\max_{i,i',j,j'}\frac{K_{ij}K_{i'j'}}{K_{ij'}K_{i'j}}=\max_{i,i',j,j'}\bigl(S_{ij}+S_{i'j'}-S_{ij'}-S_{i'j}\bigr).$$

If $\Omega_S=\max S-\min S$ on the active block, the last expression is at most $2\Omega_S$. The same bound holds for $K^{\top}$. Substituting into the Birkhoff–Hopf coefficient yields

$$\rho_H\leq\tanh^{2}\Bigl(\frac{\Omega_S}{2}\Bigr),$$

and the unscaled-logit form follows from $S=\bar{S}/\varepsilon$. ∎

#### Proof of Proposition [5](https://arxiv.org/html/2605.08123#Thmproposition5).

###### Proof.

By definition,

$$g_{\mathrm{full}}(x)=\nabla_xF_R\bigl(x,z_T(x)\bigr),\qquad g_{\mathrm{surr}}(x)=\partial_xF_R(x,z)\big|_{z=z_T(x)}.$$

Applying the chain rule to the first expression gives

$$\nabla_xF_R\bigl(x,z_T(x)\bigr)=\partial_xF_R(x,z)\big|_{z=z_T(x)}+D_xz_T(x)^{\top}\nabla_zF_R(x,z)\big|_{z=z_T(x)}.$$

Subtracting the stopped-base partial derivative gives the identity with $\eta_R=\nabla_zF_R(x,z_T(x))$. The norm bound follows immediately from submultiplicativity for the chosen compatible operator norm. ∎

#### Proof of Corollary [2](https://arxiv.org/html/2605.08123#Thmcorollary2).

###### Proof.

Proposition [5](https://arxiv.org/html/2605.08123#Thmproposition5) gives the exact identity $\Delta g_R=c_R(x)$. Therefore $\|c_R(x)\|\leq\tau$ implies $\|\Delta g_R\|\leq\tau$. If only an operator-norm bound $C_T(x)\geq\|D_xz_T(x)^{\top}\|$ is used, the bound in Proposition [5](https://arxiv.org/html/2605.08123#Thmproposition5) gives

$$\|\Delta g_R\|\leq\|D_xz_T(x)^{\top}\|\,\|\eta_R(x)\|\leq C_T(x)\|\eta_R(x)\|\leq\tau.$$

For the optimization claim, let $R^{\star}$ be the first feasible depth. Every smaller depth is infeasible by definition of $R^{\star}$, and every larger feasible depth has cost at least $C(R^{\star})$ because $C$ is nondecreasing. Hence $R^{\star}$ minimizes $C(R)$ over the certified feasible set. ∎

#### Proof of Proposition [6](https://arxiv.org/html/2605.08123#Thmproposition6).

###### Proof.

All plans in the collection use the same masked score tensor $S$ and the same active support. For any two dual pairs on that support,

$$\frac{P_{ij}^{(a,b)}}{P_{ij}^{(a_\star,b_\star)}}=\exp\bigl(u_i^{(a)}-u_i^{(a_\star)}+v_j^{(b)}-v_j^{(b_\star)}\bigr),$$

which gives the orbit formula entrywise. Multiplying this identity by $H_\ell$ and summing over $\ell$ gives the single-reference representation for $G$. The reverse recurrences in Propositions [1](https://arxiv.org/html/2605.08123#Thmproposition1) and [5](https://arxiv.org/html/2605.08123#Thmproposition5) require only products of these plans with row or column cotangent vectors and score-space outer factors, all of which are instances of this representation. Each such product can therefore be evaluated by streaming one reference plan tile and applying the appropriate row and column multipliers. With fixed $T$, $R$, $W$, and $d$, each half-step contributes $O(LW)$ arithmetic and $O(L)$ vector state, while the input activations require $O(Ld)$ storage. ∎

## Appendix C Experimental Protocol Notes

The current manuscript uses three empirical layers:

1. exactness checks against exact autodiff on small masked problems;
2. a four-configuration TPU v6e-8 Pfam screen across balanced and dustbin-augmented paths;
3. a promoted long-budget balanced run for the factorization-backed $R=2$ path.

All TPU results use a single Cloud TPU v6e-8 host. The short screen uses 50,000 Pfam training pairs with the production benchmark stack, and the promoted run uses the same benchmark family at longer budget. The biological inputs and language-model-derived embeddings are pre-existing assets rather than new data collection.

For the promoted balanced run, the objective weights were

$$\lambda_{\mathrm{rec}}=1,\qquad\lambda_{\mathrm{ce}}=0,\qquad\lambda_{\mathrm{mono}}=0,$$

so reconstruction loss was optimized directly while CE and monotonicity were logged diagnostically.
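A minimal sketch of this weighting, assuming hypothetical per-term scalars `rec_loss`, `ce_loss`, and `mono_loss` already computed for a batch:

```python
# Objective weights for the promoted balanced run; the CE and monotonicity
# terms carry zero weight and are logged purely as diagnostics.
LAMBDA_REC, LAMBDA_CE, LAMBDA_MONO = 1.0, 0.0, 0.0

def total_loss(rec_loss, ce_loss, mono_loss):
    # Only reconstruction drives the gradients under these weights.
    return LAMBDA_REC * rec_loss + LAMBDA_CE * ce_loss + LAMBDA_MONO * mono_loss
```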

The anonymized supplementary code bundle contains the single-file balanced and dustbin-augmented transport kernel analyzed by the theory, together with a Sinkhorn-only Pfam training harness whose defaults match the reported v6e-8 promoted-run settings. It intentionally omits cloud launchers, project-specific storage paths, checkpoints, logs, FlashAttention branches, and external data or embedding generation. The reported empirical runs additionally depend on precomputed Pfam/ESM-2 shards and TPU environment setup, so the empirical claims are scoped to the disclosed protocol notes and reported artifacts rather than to a complete public data pipeline.

The main external assets used in the experiments are public Pfam family data (Mistry et al., [2021](https://arxiv.org/html/2605.08123#bib.bib8)) and ESM-2 embeddings from the public ESM-2 checkpoint (Lin et al., [2023](https://arxiv.org/html/2605.08123#bib.bib6)). Pfam is distributed under CC0 through the Pfam/InterPro documentation, and the `esm2_t6_8M_UR50D` weights used for the embeddings are distributed under the MIT license.

The present cloud-aligned gap-aware path is the dustbin-augmented balanced surrogate. Genuinely KL-unbalanced transport and differentiable support geometry remain natural follow-on directions, but they are not needed for the main claim carried by this manuscript.

#### Direct four-plan adjoint diagnostic.

Table [5](https://arxiv.org/html/2605.08123#A3.T5) isolates the $R=2$ score-adjoint stage after the base and tail potentials are already available. The direct baseline forms the four staircase factors $(P^{(2,2)},P^{(2,1)},P^{(1,1)},P^{(1,0)})$ separately; the one-reference implementation forms only $P^{(2,2)}$ and applies the row/column modifiers from Proposition [2](https://arxiv.org/html/2605.08123#Thmproposition2). This is a local dense-JAX CPU diagnostic rather than a TPU systems benchmark, so it should be read as a controlled materialization check, not as a FlashSinkhorn comparison or a production wall-clock claim.

Table 5: Controlled score-adjoint benchmark for direct four-plan materialization versus the one-reference representation. Logical active plan-factor storage is reduced by exactly $4\times$ in all rows: $3.15/0.79$ MB, $7.35/1.84$ MB, and $15.76/3.94$ MB for direct/one-reference respectively. Timings are mean post-compilation CPU times over 20 repetitions.
#### Projective contraction diagnostics on Pfam checkpoints.

Table [6](https://arxiv.org/html/2605.08123#A3.T6) reports the Birkhoff–Hopf certificate from Proposition [4](https://arxiv.org/html/2605.08123#Thmproposition4) on the local Pfam test shards for the step-$0$ and step-$1437$ Sinkhorn checkpoints. For each strictly positive active $128$-row block we compute the exact projective-diameter coefficient $\rho_{H}=\tanh^{2}(\Delta/4)$ on the scaled scores at $\varepsilon=1$, excluding masked rows and columns. We also report the coarser range-only sufficient bound $\rho_{\mathrm{range}}=\tanh^{2}(\Omega/2)$. Across $335$ active blocks from $100$ held-out examples, both certificates remain nontrivial; the final checkpoint has median $\rho_{H}=0.081$ and maximum $\rho_{H}=0.492$.

Table 6: Production-checkpoint projective contraction diagnostics on local Pfam test shards. The exact projective certificate is computed from the active block score matrix; the range-only bound is the simpler sufficient bound stated in Proposition [4](https://arxiv.org/html/2605.08123#Thmproposition4).
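A hedged sketch of both certificates on one strictly positive active block, under the standard Birkhoff–Hopf convention that the projective diameter of the kernel $K=\exp(S)$ is $\Delta=\max_{i,j,k,l}(S_{ik}+S_{jl}-S_{jk}-S_{il})$ and $\Omega=\max S-\min S$ is the score range; the paper's exact masking and blocking logic may differ:

```python
import jax
import jax.numpy as jnp

def contraction_certificates(S):
    # S: scaled scores of one strictly positive active block (kernel K = exp(S)).
    # Exact projective diameter: Delta = max_{i,j,k,l} (S_ik + S_jl - S_jk - S_il).
    D = S[:, :, None] - S[:, None, :]            # D[i, k, l] = S_ik - S_il
    delta = jnp.max(jnp.max(D, axis=0) - jnp.min(D, axis=0))
    rho_H = jnp.tanh(delta / 4.0) ** 2           # exact certificate tanh^2(Delta/4)
    # Range-only sufficient bound: Delta <= 2 * Omega with Omega = max S - min S.
    omega = jnp.max(S) - jnp.min(S)
    rho_range = jnp.tanh(omega / 2.0) ** 2       # coarser bound tanh^2(Omega/2)
    return rho_H, rho_range

key = jax.random.PRNGKey(0)
S = jax.random.normal(key, (16, 16))
rho_H, rho_range = contraction_certificates(S)   # rho_H <= rho_range < 1
```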

## Appendix D Kernel-Oriented Pseudocode

#### Balanced stopped-base and $R=2$ tail.

1. Form masked scaled scores $S_{ij}=Q_{i}K_{j}^{\top}/(\sqrt{d}\,\varepsilon)$ on the active support $\Omega$.
2. Run the balanced masked Sinkhorn half-steps for $T$ stopped iterations to obtain the base pair $(u^{(0)},v^{(0)})$.
3. Recompute two differentiable refinement steps from the stopped base pair, producing $(u^{(1)},v^{(1)})$ and $(u^{(2)},v^{(2)})$.
4. Materialize only the base staircase plan $P^{(2,2)}$ blockwise.
5. Reconstruct $P^{(2,1)}$, $P^{(1,1)}$, and $P^{(1,0)}$ by row and column rescaling of $P^{(2,2)}$ using the dual differences $(u^{(a)}-u^{(2)},v^{(b)}-v^{(2)})$.
6. Evaluate the exact $R=2$ score-space VJP from Propositions [1](https://arxiv.org/html/2605.08123#Thmproposition1) and [2](https://arxiv.org/html/2605.08123#Thmproposition2), then propagate to $(Q,K,V)$; a dense sketch of steps 1–5 appears after this list.
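The following dense JAX sketch covers steps 1–5 of this recipe under simplifying assumptions (unit marginals, a dense mask with no empty rows or columns, no banding or tiling); the production kernel instead streams $P^{(2,2)}$ tiles and uses the hand-derived VJP of step 6.

```python
import jax
import jax.numpy as jnp

def balanced_tail_forward(Q, K, mask, T=20, eps=1.0):
    d = Q.shape[-1]
    # Step 1: masked scaled scores on the active support.
    S = jnp.where(mask, (Q @ K.T) / (jnp.sqrt(d) * eps), -jnp.inf)
    # Step 2: T stopped balanced half-step pairs for the base pair (u0, v0).
    u, v = jnp.zeros(Q.shape[0]), jnp.zeros(K.shape[0])
    for _ in range(T):
        u = -jax.nn.logsumexp(S + v[None, :], axis=1)
        v = -jax.nn.logsumexp(S + u[:, None], axis=0)
    u0, v0 = jax.lax.stop_gradient(u), jax.lax.stop_gradient(v)
    # Step 3: two differentiable refinement steps from the stopped base pair.
    u1 = -jax.nn.logsumexp(S + v0[None, :], axis=1)
    v1 = -jax.nn.logsumexp(S + u1[:, None], axis=0)
    u2 = -jax.nn.logsumexp(S + v1[None, :], axis=1)
    v2 = -jax.nn.logsumexp(S + u2[:, None], axis=0)
    # Step 4: materialize only the reference staircase plan P^(2,2).
    P22 = jnp.exp(S + u2[:, None] + v2[None, :])
    # Step 5: the other staircase factors are row/column rescalings, e.g.
    #   P^(1,0) = exp(u1 - u2)[:, None] * P22 * exp(v0 - v2)[None, :].
    return P22, (u0, v0, u1, v1, u2, v2)
```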

#### Dustbin-augmented bridge.

1. Append one active dustbin token per side together with inactive fillers to construct the augmented tensors $(\widetilde{Q},\widetilde{K},\widetilde{V})$.
2. Augment the masks analogously and enlarge the effective band width to $\widetilde{W}=\max(W,L)$.
3. Apply the same balanced stopped-base and $R=2$ tail procedure on the augmented state space (see the sketch after this list).
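A minimal sketch of the augmentation step alone, assuming zero embeddings for the dustbin tokens (an illustrative choice not fixed by the recipe) and omitting the inactive tile fillers:

```python
import jax.numpy as jnp

def augment_with_dustbin(Q, K, V, mask):
    # Append one active dustbin token per side; zeros serve as illustrative
    # dustbin embeddings here, and tile-size fillers would stay fully masked.
    Lq, d = Q.shape
    Lk = K.shape[0]
    Q_aug = jnp.concatenate([Q, jnp.zeros((1, d))], axis=0)
    K_aug = jnp.concatenate([K, jnp.zeros((1, d))], axis=0)
    V_aug = jnp.concatenate([V, jnp.zeros((1, d))], axis=0)
    # Dustbin row and column are active against everything, so real tokens
    # can route unmatched mass to the dustbin on either side.
    mask_aug = jnp.ones((Lq + 1, Lk + 1), dtype=bool)
    mask_aug = mask_aug.at[:Lq, :Lk].set(mask)
    # The balanced stopped-base + R=2 tail procedure then runs unchanged on
    # the augmented state with effective band width max(W, L).
    return Q_aug, K_aug, V_aug, mask_aug
```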
