PostDeg: Placement Beats Parameterization in LayerNorm GNNs

arXiv cs.LG Papers

Summary

This paper shows that in LayerNorm-based GNNs, positive per-node scalars such as node degree are erased when placed before LayerNorm but survive when placed after it. The authors propose PostDeg, a parameter-free post-LayerNorm inverse-degree scale, which gains +3.5%, +2.5%, and +5.6% over the LN backbone on influence maximization, network dismantling, and maximum independent set, respectively.

arXiv:2606.14022v1 Announce Type: new Abstract: LayerNorm-based GNNs routinely erase the topology signals (degree, centrality, $k$-core) that node-selection policies should depend on, but the literature has not located where in the residual block the erasure happens. We answer that question: a positive per-node scalar inserted before LayerNorm is divided out up to a stabilizer term, while the same scalar inserted after LayerNorm reaches the score head as representation magnitude. The surviving slot is the post-LayerNorm position. We instantiate it with PostDeg, a parameter-free post-LayerNorm inverse-degree scale, and pre-register four falsifiers (graphwise scalars, extra LayerNorm, expressive same-slot capacity, backbone-agnostic source) that would reject the rule. PostDeg gains $+3.5\%/+2.5\%/+5.6\%$ over the LN backbone on influence maximization, network dismantling, and maximum independent set, with $10/10$ paired-seed wins per task; none of the four falsifiers fires. The takeaway is that placement, not parameterization, carries the gain -- a small invariance check that generalizes to any positive topology scalar in any normalized residual stack.

Cached at: 06/15/26, 09:09 AM

# PostDeg: Placement Beats Parameterization in LayerNorm GNNs
Source: [https://arxiv.org/html/2606.14022](https://arxiv.org/html/2606.14022)
Yash Vardhan Tomar (Purdue University, tomar4@purdue.edu), Aryav Das (Park Tudor High School, aryav30das@gmail.com)

###### Abstract

LayerNorm-based GNNs routinely erase the topology signals (degree, centrality, $k$-core) that node-selection policies should depend on, but the literature has not located *where* in the residual block the erasure happens. We answer that question: a positive per-node scalar inserted *before* LayerNorm is divided out up to a stabilizer term, while the same scalar inserted *after* LayerNorm reaches the score head as representation magnitude. The surviving slot is the post-LayerNorm position. We instantiate it with PostDeg, a parameter-free post-LayerNorm inverse-degree scale, and pre-register four falsifiers (graphwise scalars, extra LayerNorm, expressive same-slot capacity, backbone-agnostic source) that would reject the rule. PostDeg gains $+3.5\%/+2.5\%/+5.6\%$ over the LN backbone on influence maximization, network dismantling, and maximum independent set, with $10/10$ paired-seed wins per task; none of the four falsifiers fires. The takeaway is that placement, not parameterization, carries the gain: a small invariance check that generalizes to any positive topology scalar in any normalized residual stack.

## 1 Introduction

Many graph-learning policies (influence maximization, network dismantling, maximum independent set, epidemic containment) should rank nodes by topology: high-degree hubs spread an epidemic faster, low-degree nodes are easier to include in an independent set, and so on. The dominant deep-learning recipe for these tasks is a residual GAT block stacked with LayerNorm \[[24](https://arxiv.org/html/2606.14022#bib.bib24),[4](https://arxiv.org/html/2606.14022#bib.bib4),[15](https://arxiv.org/html/2606.14022#bib.bib15),[17](https://arxiv.org/html/2606.14022#bib.bib17),[13](https://arxiv.org/html/2606.14022#bib.bib13)\]. LayerNorm stabilizes training but, on degree-sensitive node selection, it has the side effect of erasing the very topology signals the policy should use: GraphNorm, PairNorm, and follow-ups all observe this empirically and propose new feature-statistic normalizers as the cure \[[5](https://arxiv.org/html/2606.14022#bib.bib5),[28](https://arxiv.org/html/2606.14022#bib.bib28),[30](https://arxiv.org/html/2606.14022#bib.bib30)\]. None of those papers locates *where in the residual block* the topology signal dies.

Whether a positive topology multiplier survives or is absorbed depends entirely on which side of LayerNorm it sits on; by *placement* we mean exactly which side, and the placement rule is the central object of this paper. We turn the LayerNorm absorption identity \[[1](https://arxiv.org/html/2606.14022#bib.bib1),[21](https://arxiv.org/html/2606.14022#bib.bib21),[19](https://arxiv.org/html/2606.14022#bib.bib19)\] into a placement diagnostic for positive per-node scalars in GNN residual blocks. A positive multiplier $a_i$ inserted before LayerNorm is divided out up to the stabilizer term: $\mathrm{LN}(a_i z_i) \approx \mathrm{LN}(z_i)$, with relative residual bounded by $\varepsilon_{\mathrm{LN}}/(a_i^2 \sigma^2)$ and empirically $\leq 2.44 \times 10^{-5}$ across nodes, layers, and seeds at convergence on every task (Section [3](https://arxiv.org/html/2606.14022#S3), Appendix [D.5](https://arxiv.org/html/2606.14022#A4.SS5)). The same multiplier inserted after LayerNorm reaches the score head as representation magnitude. The diagnostic therefore prescribes a *position*, not a functional form: put topology magnitude after LayerNorm if the scorer is supposed to see it. Figure [1](https://arxiv.org/html/2606.14022#S1.F1) illustrates the picture on a small mixed-degree graph: LayerNorm normalizes per-node magnitudes so the feature variation reaching the score head no longer encodes degree, and a post-LayerNorm inverse-degree scale restores the degree-conditioned magnitude contrast that node-selection policies should depend on.

![Refer to caption](https://arxiv.org/html/2606.14022v1/illustration/motivating_example.png)

Figure 1: Why placement matters. Left: a graph with mixed degree (low-degree cycle nodes labeled $2$, cluster nodes labeled $3$, one hub labeled $5$); node colors index degree. Middle: after LayerNorm, every node's representation magnitude is normalized to the same scale, so the per-node feature variation reaching the score head no longer encodes degree contrast. Right: PostDeg multiplies each post-LayerNorm representation by $s_i = (\widehat{c}_i + \varepsilon)^{-1/2}$ (legend: $d_i = 2 \to 1.58$, $3 \to 1.29$, $5 \to 1.00$), restoring a monotone-in-$1/d_i$ magnitude that the score head can use to rank nodes by topology. The same post-LN positive scalar is what the absorption identity in Section [3.1](https://arxiv.org/html/2606.14022#S3.SS1) permits to survive; the empirical absorption envelope on every task is $\leq 2.44 \times 10^{-5}$ (Appendix [D.5](https://arxiv.org/html/2606.14022#A4.SS5)).

PostDeg is the parameter-free operator that occupies that slot: $\widetilde{h}_i = (\widehat{c}_i + \varepsilon)^{-1/2}\,\operatorname{LN}(h_i)$, with $\widehat{c}_i = \max(d_i, 1)/\max_j \max(d_j, 1)$ and $\varepsilon = 10^{-8}$. Because $s_i = (\widehat{c}_i + \varepsilon)^{-1/2}$ is a monotone function of $1/d_i$, post-LayerNorm low-degree nodes receive larger representation magnitude, and the score head can use that contrast. Implementation details (isolated-node handling, $\varepsilon$ ablation, code reference) are in Section [3.2](https://arxiv.org/html/2606.14022#S3.SS2) and Appendix [D.3](https://arxiv.org/html/2606.14022#A4.SS3).

To separate placement from parameterization we also train PostDeg-L-FG and PostDeg-L-Adaptive, two learned variants in the same slot.

#### Falsifiability\.

The placement rule is more useful for what it forbids than for what it predicts\. We attach four explicit anti\-claims, each with a control whose result would falsify the rule:

- Graphwise spectral source. If the gain came from a graph-level spectral term rather than per-node degree, GraphScalar (a single graphwise multiplier $1/\sqrt{\lambda_{\max} + \varepsilon'}$ in the same slot) would match PostDeg. Observed: $|\Delta\%| < 1\%$ across all tasks.
- Extra feature normalization. If the gain came from an additional layer of feature normalization, Extra LayerNorm in the same slot would match PostDeg. Observed: $|\Delta\%| < 0.1\%$ across all tasks.
- Same-slot capacity. If the slot needed expressive capacity, the learned variants (PostDeg-L-FG, PostDeg-L-Adaptive, where "PostDeg-L" abbreviates "PostDeg-Learned") would beat the parameter-free PostDeg. Observed: TOST-equivalent at $\pm 1\%$ on every task.
- Backbone-agnostic source. If the gain came from a backbone-agnostic source, adding PostDeg to a PNA backbone (which already injects a degree channel inside aggregation) would still help. Observed: paired-equivalent on InfluMax and Dismantle.

Each control would falsify the placement rule had it produced the outcome named above. None of the four falsifiers fires.

#### Findings\.

- The placement rule is empirically tight: the absorption envelope $\varepsilon_{\rm LN}/(a_i^2 \sigma^2)$ is bounded by $2.44 \times 10^{-5}$ across all nodes, layers, seeds, and tasks at convergence (Section [3.1](https://arxiv.org/html/2606.14022#S3.SS1), Appendix [D.5](https://arxiv.org/html/2606.14022#A4.SS5)).
- A parameter-free post-LN inverse-degree scale (PostDeg) gains $+3.5\%/+2.5\%/+5.6\%$ on InfluMax/Dismantle/MIS with paired-seed wins on 10 of 10 seeds; learned same-slot variants are paired-equivalent under TOST at $\pm 1\%$ on every task (Section [4.2](https://arxiv.org/html/2606.14022#S4.SS2), Section [4.3](https://arxiv.org/html/2606.14022#S4.SS3)).
- Four pre-registered falsifiers all fail to reject the placement rule: a graphwise scalar in the same slot, extra LayerNorm, expressive learned capacity, and a backbone-agnostic source via PNA (Section [4.2](https://arxiv.org/html/2606.14022#S4.SS2), Section [4.7](https://arxiv.org/html/2606.14022#S4.SS7)).

The remainder of the paper defines the controlled slot, reports the main empirical tests, and discusses the boundaries of the placement rule.

## 2 Related Work

BatchNorm, LayerNorm, InstanceNorm, GraphNorm, PairNorm, and NodeNorm answer a feature-statistic question: which activations should be centered, scaled, or constrained \[[11](https://arxiv.org/html/2606.14022#bib.bib11),[1](https://arxiv.org/html/2606.14022#bib.bib1),[23](https://arxiv.org/html/2606.14022#bib.bib23),[5](https://arxiv.org/html/2606.14022#bib.bib5),[28](https://arxiv.org/html/2606.14022#bib.bib28),[30](https://arxiv.org/html/2606.14022#bib.bib30)\]. The mechanism behind BatchNorm itself remains debated \[[41](https://arxiv.org/html/2606.14022#bib.bib41)\], but the placement question we ask is orthogonal to that debate. GraphNorm is the closest GNN-normalization reference point. It studies graphwise feature-statistic normalization for GNN optimization; PostDeg studies post-LN topology magnitude. The controlled main grid uses BatchNorm1d over node features within each processed graph, Extra LayerNorm in the post-block slot, InstanceNorm, scalar-shift GraphNorm, PairNorm, GraphScalar, PostDeg, PostDeg-L-FG, PostDeg-L-Adaptive, and the LN backbone (Table [A4](https://arxiv.org/html/2606.14022#A4.T4)). NodeNorm appears only as a supplemental diagnostic because its comparison pipeline has configuration drift (Appendix [F](https://arxiv.org/html/2606.14022#A6)). DiffGroupNorm \[[29](https://arxiv.org/html/2606.14022#bib.bib29)\] alters the per-block normalization slot with learned group-wise feature statistics; we use it only for positioning in Table [A2](https://arxiv.org/html/2606.14022#A2.T2).

Degree-aware aggregators inject degree inside message passing. GCN uses $D^{-1/2} A D^{-1/2}$ as a graph operator \[[16](https://arxiv.org/html/2606.14022#bib.bib16)\]; SGC removes the nonlinearities to isolate this normalization \[[33](https://arxiv.org/html/2606.14022#bib.bib33)\]; GraphSAGE mean aggregation averages over neighborhoods \[[10](https://arxiv.org/html/2606.14022#bib.bib10)\]; GIN replaces mean with sum to match the Weisfeiler–Leman test \[[31](https://arxiv.org/html/2606.14022#bib.bib31),[32](https://arxiv.org/html/2606.14022#bib.bib32)\]; PNA scales each aggregator by $\delta(d_i) = \log(d_i + 1)/\bar{\delta}$ inside the message-passing step \[[8](https://arxiv.org/html/2606.14022#bib.bib8)\]. PNA's degree scaler changes messages before they are combined and normalized, so the LayerNorm in our backbone sees a mixed representation instead of a free scalar multiplier. PostDeg leaves aggregation untouched and acts on per-node magnitudes after LayerNorm. The PNA test matches this distinction: PostDeg + PNA is paired-equivalent to PNA without PostDeg on InfluMax and Dismantle ($|\Delta\%| < 1\%$, Section [4.7](https://arxiv.org/html/2606.14022#S4.SS7) and Appendix [F](https://arxiv.org/html/2606.14022#A6), Table [A34](https://arxiv.org/html/2606.14022#A6.T34)).

Structural encodings inject degree, centrality, shortest-path, random-feature, subgraph, or transformer positional information as representation content \[[9](https://arxiv.org/html/2606.14022#bib.bib9),[22](https://arxiv.org/html/2606.14022#bib.bib22),[2](https://arxiv.org/html/2606.14022#bib.bib2),[26](https://arxiv.org/html/2606.14022#bib.bib26),[20](https://arxiv.org/html/2606.14022#bib.bib20)\]. A degree-content feature such as $\log d_i$ enters before LayerNorm and is re-centered and re-scaled with the rest of the representation. PostDeg modifies post-normalization magnitude of an already-encoded representation. Appendix [H](https://arxiv.org/html/2606.14022#A8) formalizes the content-vs-magnitude distinction and reports the $\log d_i$ baseline.

This paper asks where a positive topology scalar can survive LayerNorm. The algebraic identity in Eq. ([1](https://arxiv.org/html/2606.14022#S3.E1)) is standard in the LayerNorm/RMSNorm/weight-tying literature \[[1](https://arxiv.org/html/2606.14022#bib.bib1),[27](https://arxiv.org/html/2606.14022#bib.bib27),[25](https://arxiv.org/html/2606.14022#bib.bib25),[40](https://arxiv.org/html/2606.14022#bib.bib40)\]. The complementary literature on GNN depth (oversmoothing, oversquashing, jumping-knowledge connections) studies how representations degrade as the network grows deeper \[[36](https://arxiv.org/html/2606.14022#bib.bib36),[38](https://arxiv.org/html/2606.14022#bib.bib38),[37](https://arxiv.org/html/2606.14022#bib.bib37),[39](https://arxiv.org/html/2606.14022#bib.bib39),[35](https://arxiv.org/html/2606.14022#bib.bib35),[34](https://arxiv.org/html/2606.14022#bib.bib34),[42](https://arxiv.org/html/2606.14022#bib.bib42)\]; placement is orthogonal, since we hold depth fixed and vary the slot. We use the absorption identity as a placement rule for GNN policies trained on combinatorial optimization tasks \[[15](https://arxiv.org/html/2606.14022#bib.bib15),[17](https://arxiv.org/html/2606.14022#bib.bib17),[13](https://arxiv.org/html/2606.14022#bib.bib13),[12](https://arxiv.org/html/2606.14022#bib.bib12),[6](https://arxiv.org/html/2606.14022#bib.bib6)\]. All main experiments use a fixed GAT backbone \[[24](https://arxiv.org/html/2606.14022#bib.bib24),[4](https://arxiv.org/html/2606.14022#bib.bib4)\] so the normalization slot is the variable under test.

## 3 Method

The method has two parts. First, we identify the normalization slot where a positive topology scalar can survive LayerNorm. Second, we place a minimal degree scale in that slot and test whether additional capacity is necessary. Figure [2](https://arxiv.org/html/2606.14022#S3.F2) gives the end-to-end controlled setup: graph-level statistics are computed once, every GAT layer exposes the same post-LayerNorm slot, and the task heads, budgets, data generator, and seeds are fixed while only that slot changes.

[Figure 2 diagram. Panel (a), pre-LayerNorm placement: the scalar is absorbed, $\mathrm{LN}(a_i z_i) \approx \mathrm{LN}(z_i)$ (Eq. [1](https://arxiv.org/html/2606.14022#S3.E1)), with residual $\leq \varepsilon_{\mathrm{LN}}/(a_i^2 \sigma^2) \leq 2.44 \times 10^{-5}$, so the scorer cannot use $a_i$ as a topology signal. Panel (b), post-LayerNorm placement (this paper): the scalar survives, $s_i \odot \mathrm{LN}(z_i)$ reaches the score head as representation magnitude, so a positive per-node $s_i$ becomes a usable degree-conditioned signal. Pipeline: $H^{(\ell-1)}$, residual GAT block, $z_i$, LayerNorm, post-LN controlled slot, score head; shaded components are held fixed across the main grid.]

Figure 2: Method overview. (a) Inserting a positive per-node scalar $a_i$ before LayerNorm is absorbed up to a stabilizer term (residual $\leq 2.44 \times 10^{-5}$ empirically). (b) Inserting the same scalar after LayerNorm reaches the score head as representation magnitude. The shaded backbone (residual GAT block, LayerNorm, score head, budgets, data generator, seeds) is held fixed across the main grid; only the dashed post-LN slot varies. The variants tested in that slot (LN backbone, GraphScalar, Extra LayerNorm, PostDeg, PostDeg-L-FG, PostDeg-L-Adaptive) are listed in Table [1](https://arxiv.org/html/2606.14022#S3.T1) and described in Section [3.2](https://arxiv.org/html/2606.14022#S3.SS2).

### 3.1 Placement: why post-LayerNorm

With LayerNorm stabilizer $\varepsilon_{\rm LN}$ and per-coordinate gain/bias $(g, b)$, the LayerNorm of a per-node-multiplied input $a_i z_i$ is

$$\operatorname{LN}(a_i z_i) = g \odot \frac{z_i - \mu(z_i)}{\sqrt{\sigma^2(z_i) + \varepsilon_{\rm LN}/a_i^2}} + b, \tag{1}$$

so the absorption $\operatorname{LN}(a_i z_i) = \operatorname{LN}(z_i)$ is exact in the zero-stabilizer limit and tight whenever $a_i^2 \sigma^2(z_i) \gg \varepsilon_{\rm LN}$ \[[1](https://arxiv.org/html/2606.14022#bib.bib1)\]. Multiplying a node representation by $a_i > 0$ before LayerNorm changes only the stabilizer term, with relative effect bounded by $\varepsilon_{\rm LN}/(a_i^2 \sigma^2(z_i))$. The empirical worst-case envelope is reported in Appendix [D.5](https://arxiv.org/html/2606.14022#A4.SS5) (Table [A10](https://arxiv.org/html/2606.14022#A4.T10)); the headline value (§[1](https://arxiv.org/html/2606.14022#S1)) is reached on MIS and the bound is $\leq 2.5 \times 10^{-5}$ on every task. Within the LayerNorm block in Eq. ([1](https://arxiv.org/html/2606.14022#S3.E1)), a positive degree multiplier survives on the post-LN side. The remaining question is what operator should occupy that slot.
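As a rough numerical illustration of Eq. (1), the sketch below applies a plain LayerNorm (gain 1 and bias 0, both simplifying assumptions) to a hypothetical feature vector and measures how far $\mathrm{LN}(a_i z_i)$ sits from the stabilizer-free normalization $(z_i - \mu)/\sigma$. The feature values, the function names, and the choice $\varepsilon_{\rm LN} = 10^{-5}$ are illustrative assumptions, not taken from the paper's code.

```python
import math

EPS_LN = 1e-5  # assumed LayerNorm stabilizer; not specified in this section


def mean_and_variance(z):
    """Per-node feature mean and (biased) variance, as LayerNorm uses them."""
    mu = sum(z) / len(z)
    var = sum((x - mu) ** 2 for x in z) / len(z)
    return mu, var


def layer_norm(z, eps_ln):
    """Plain LayerNorm over one node's features, with gain 1 and bias 0."""
    mu, var = mean_and_variance(z)
    return [(x - mu) / math.sqrt(var + eps_ln) for x in z]


def absorption_residual(z, a, eps_ln=EPS_LN):
    """Relative L2 gap between LN(a * z) and the stabilizer-free normalization."""
    ideal = layer_norm(z, 0.0)  # zero-stabilizer limit, independent of a
    scaled = layer_norm([a * x for x in z], eps_ln)
    gap = math.sqrt(sum((s - i) ** 2 for s, i in zip(scaled, ideal)))
    return gap / math.sqrt(sum(i * i for i in ideal))


# Hypothetical feature vector for a single node.
z = [0.3, -1.2, 0.8, 2.1, -0.5, 0.0, 1.4, -0.9]
_, sigma2 = mean_and_variance(z)

for a in (0.5, 2.0, 10.0):
    bound = EPS_LN / (a * a * sigma2)
    print(f"a={a}: residual={absorption_residual(z, a):.3e}, bound={bound:.3e}")
```

Under these assumptions the measured gap stays below $\varepsilon_{\rm LN}/(a_i^2 \sigma^2)$ for each multiplier, which is the sense in which a pre-LN scalar is absorbed.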

#### Placement rule\.

The absorption identity applies to positive per-node scalar multipliers such as degree, PageRank, $k$-core, clustering, or learned positive attention scores. Content features behave differently: a concatenated $\log d_i$ feature is re-centered and re-scaled with the rest of the representation, so we treat content baselines separately in Appendix [H](https://arxiv.org/html/2606.14022#A8). The rule therefore prescribes a *position*, not a functional form: put topology magnitude after LayerNorm if the scorer is supposed to see it.

Table 1: Placement ledger. The LayerNorm identity applies to positive scalar multipliers. Content features and aggregation-side degree enter through different mechanisms, so we treat them as separate tests. "✔" = LayerNorm absorbs the multiplier; "✘" = the multiplier survives.

### 3.2 PostDeg operator

The simplest occupant of the surviving slot is a fixed post-LN degree scale. Let $G = (V, E)$ have degree sequence $d_1, \ldots, d_n$, and let $\widehat{c}_i = \max(d_i, 1)/\max_j \max(d_j, 1)$. Given the output $H^{(\ell-1)}$ of layer $\ell - 1$, the PostDeg layer is

$$H^{(\ell)} = (\widehat{c} + \varepsilon)^{-1/2} \odot \operatorname{LN}\!\left(H^{(\ell-1)} + \mathrm{GAT}_{\ell}(H^{(\ell-1)}, A)\right), \qquad \varepsilon = 10^{-8}, \tag{2}$$

where the broadcast $\odot$ multiplies the per-node scalar $(\widehat{c}_i + \varepsilon)^{-1/2}$ across the feature dimension. The experimental code uses the same operator: isolated nodes are assigned degree $1$ before normalization, so they use $\widehat{c}_i = 1/d_{\max}$ on graphs with at least one edge. PostDeg has zero learned parameters and adds $\mathcal{O}(|V|)$ per-layer work; Appendix [D.3](https://arxiv.org/html/2606.14022#A4.SS3) states the layer in pseudocode and Appendix [D](https://arxiv.org/html/2606.14022#A4) reports the $\varepsilon$ ablation (endpoint differences below $0.05\%$).
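The scale in Eq. (2) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the LayerNorm output is assumed to be already computed and passed in as a list of per-node feature rows, and the degrees are taken from the Figure 1 legend.

```python
def postdeg_scale(degrees, eps=1e-8):
    """Per-node scale s_i = (c_i + eps)^(-1/2), with c_i = max(d_i, 1) / d_max."""
    clamped = [max(d, 1) for d in degrees]  # isolated nodes are treated as degree 1
    d_max = max(clamped)
    return [(d / d_max + eps) ** -0.5 for d in clamped]


def apply_postdeg(ln_out, degrees, eps=1e-8):
    """Broadcast each node's scale across its (already LayerNorm-ed) feature row."""
    scale = postdeg_scale(degrees, eps)
    return [[s * x for x in row] for s, row in zip(scale, ln_out)]


# Degrees from the Figure 1 legend: cycle node, cluster node, hub.
print([round(s, 2) for s in postdeg_scale([2, 3, 5])])  # -> [1.58, 1.29, 1.0]
```

The printed values reproduce the $1.58$, $1.29$, and $1.00$ factors in the Figure 1 legend. The scale depends only on the degree sequence, so in this sketch it could be computed once per graph and reused at every layer.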

#### Learned same\-slot variants\.

For placement-rule ablations we also train two learned operators in the same post-LN slot, PostDeg-L-FG and PostDeg-L-Adaptive, which generalize Eq. ([2](https://arxiv.org/html/2606.14022#S3.E2)) with a learned exponent, a graph-level spectral term, and a gate (scalar for FG, per-node MLP for Adaptive). PostDeg is recovered as the minimum-parameter instance of this family; the full parameterization, parameter domains, and clamps are in Appendix [C](https://arxiv.org/html/2606.14022#A3), Eq. ([A1](https://arxiv.org/html/2606.14022#A3.E1)). The empirical claim is that the parameter-free choice suffices: the learned variants are paired-equivalent to PostDeg under TOST at $\pm 1\%$ on every task (Section [4.3](https://arxiv.org/html/2606.14022#S4.SS3)).

### 3.3 Predictions tested in the experiments

The formal statements and proofs are deferred to Appendix [C](https://arxiv.org/html/2606.14022#A3). The main consequence needed for the experiments is the degree-separation ratio. For two nodes with $c_i > c_j$ and post-LN exponent $\gamma > 0$,

$$R_{ij} \;:=\; \frac{s_j}{s_i} \;=\; \left(\frac{c_i + \varepsilon}{c_j + \varepsilon}\right)^{\gamma} \;>\; 1. \tag{3}$$

Low-degree nodes therefore receive larger post-LN magnitude, the slope of $\log s_i$ versus $\log d_i$ is negative, and the effect weakens as degree heterogeneity vanishes. The experiments test six visible predictions: pre-LN scalars should be numerically absorbed; PostDeg should help when degree-conditioned magnitude aligns with the reward; a graphwise scalar should not recover node ranking; learned same-slot variants should add little if placement is the key issue; low-heterogeneity or task-null cases should be neutral; and PNA should leave less room for PostDeg because it already injects degree inside aggregation.
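As a worked instance of Eq. (3), take the degrees from the Figure 1 legend: a hub with $d_i = 5 = d_{\max}$ and a cycle node with $d_j = 2$, with $\gamma = 1/2$ and $\varepsilon = 10^{-8}$ (negligible at this precision). This is only an arithmetic illustration under those assumed values:

$$c_i = \frac{5}{5} = 1, \qquad c_j = \frac{2}{5} = 0.4, \qquad R_{ij} = \left(\frac{1 + \varepsilon}{0.4 + \varepsilon}\right)^{1/2} \approx \sqrt{2.5} \approx 1.58.$$

The low-degree node therefore carries about $1.58\times$ the post-LN magnitude of the hub, consistent with the $1.58$ and $1.00$ scale factors in the Figure 1 legend.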

## 4 Experiments

#### Headline\.

On three degree-sensitive node-selection tasks (InfluMax, Dismantle, MIS), PostDeg improves over the LN backbone by $+3.5\%$, $+2.5\%$, and $+5.6\%$ respectively, with paired-seed wins on 10 of 10 seeds per task (Table [4](https://arxiv.org/html/2606.14022#S4.T4)). On the two boundary tasks (Epidemic = task-objective null; DD = low-degree heterogeneity), the PostDeg effect is near zero. Learned same-slot variants are paired-equivalent to PostDeg under TOST at $\pm 1\%$ on every task (Table [A13](https://arxiv.org/html/2606.14022#A4.T13)); the parameter-free choice suffices. The remainder of this section reports the controlled grid (§[4.1](https://arxiv.org/html/2606.14022#S4.SS1)), the relative effect (§[4.2](https://arxiv.org/html/2606.14022#S4.SS2)), the operator-family equivalence (§[4.3](https://arxiv.org/html/2606.14022#S4.SS3)), the mechanism (§[4.4](https://arxiv.org/html/2606.14022#S4.SS4)), the size-transfer behavior (§[4.5](https://arxiv.org/html/2606.14022#S4.SS5)), and the two boundary cases (§[4.6](https://arxiv.org/html/2606.14022#S4.SS6), §[4.7](https://arxiv.org/html/2606.14022#S4.SS7)).

### 4.1 Experimental Setup

Tasks. We evaluate four node-selection tasks and one graph-classification boundary task: influence maximization (InfluMax) on Barabási–Albert graphs \[[14](https://arxiv.org/html/2606.14022#bib.bib14)\], network dismantling (Dismantle) on stochastic block model graphs \[[3](https://arxiv.org/html/2606.14022#bib.bib3)\], epidemic containment (Epidemic) on SBM contact graphs, maximum independent set (MIS) on SBM graphs, and DD graph classification on the TU protein benchmark \[[18](https://arxiv.org/html/2606.14022#bib.bib18)\]. Per-task CV ranges from $0.33$ to $0.81$ (Table [2](https://arxiv.org/html/2606.14022#S4.T2)); the gain also depends on whether the task reward is aligned with degree-conditioned magnitude (Epidemic is a task-objective null even though its degree statistics resemble Dismantle).

Table 2: Per-task degree distribution summary statistics on the training-graph distribution (200 graphs per task). Tasks are grouped by heterogeneity. The rightmost column reports the headline PostDeg gain over LN backbone, paired by 10 seeds at the largest evaluation size; heavier-tailed degree distributions realize larger PostDeg effect sizes.

Controlled slot. Each method occupies the same slot $\mathrm{Norm}(\cdot): \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}$ applied to the residual GAT output $Z^{(\ell)} = H^{(\ell-1)} + \mathrm{GAT}_{\ell}(H^{(\ell-1)}, A)$. The LN backbone is the identity in this post-block slot; the residual block still contains LayerNorm before the slot under study. GraphScalar is a graphwise multiplier $1/\sqrt{\lambda_{\max} + \varepsilon'}$ derived from the normalized Laplacian (no per-node degree dependence; tests whether a graph-level spectral term explains the gain). PNA and NodeNorm relax the controlled slot and are reported separately in Appendix [B](https://arxiv.org/html/2606.14022#A2).
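One way to see why a graphwise multiplier is a useful control: a single factor shared by all nodes leaves every between-node magnitude ratio unchanged, while a per-node degree scale does not. The sketch below illustrates this with hypothetical post-LN rows of equal norm and an arbitrary graphwise factor of 0.7, which is a stand-in only; it does not compute $\lambda_{\max}$ from a Laplacian, and all names are illustrative.

```python
import math


def l2_norm(row):
    return math.sqrt(sum(x * x for x in row))


def magnitude_ratios(rows):
    """Each node's L2 norm divided by the first node's norm."""
    base = l2_norm(rows[0])
    return [l2_norm(row) / base for row in rows]


# Hypothetical post-LayerNorm rows with identical magnitude, one per node.
ln_out = [
    [1.0, -1.0, 0.5, -0.5],
    [0.5, 1.0, -1.0, -0.5],
    [-1.0, 0.5, 1.0, -0.5],
]
degrees = [2, 3, 5]

graph_scalar = 0.7  # assumed stand-in for a graph-level factor
graphwise = [[graph_scalar * x for x in row] for row in ln_out]

d_max = max(degrees)
per_node = [
    [(d / d_max + 1e-8) ** -0.5 * x for x in row]
    for d, row in zip(degrees, ln_out)
]

print("graphwise:", [round(r, 3) for r in magnitude_ratios(graphwise)])
print("per-node: ", [round(r, 3) for r in magnitude_ratios(per_node)])
```

In this toy setting the graphwise rows keep ratios of 1 across nodes, whereas the per-node scale produces ratios that fall as degree rises. How a trained score head responds to that contrast is an empirical question the sketch does not address.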

Architecture. All methods use a 3-layer GAT with 4 attention heads, hidden dimension 128, dropout 0.1, and a task-specific MLP head.

Training. 200 epochs with online graph generation using each task reward as the training signal: spread for InfluMax, fragmentation for Dismantle, negative infected fraction for Epidemic, and valid selected-set size for MIS. Training sizes are $n = 100$ for InfluMax and $n = 150$ for the others. 10 independent seeds per method and task. Selection budgets: $K = \lfloor 0.1n \rfloor$ for InfluMax, $\lfloor 0.2n \rfloor$ for Dismantle and Epidemic, $\lfloor 0.3n \rfloor$ for MIS.
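For concreteness, the selection budgets at the stated training sizes can be tabulated as below. This is a small arithmetic sketch; the dictionary and function names are hypothetical, and exact rationals are used only so that the floor is not affected by floating-point rounding.

```python
import math
from fractions import Fraction

# Budget fractions and training sizes as stated in the Training paragraph.
BUDGET_FRACTION = {
    "InfluMax": Fraction(1, 10),
    "Dismantle": Fraction(1, 5),
    "Epidemic": Fraction(1, 5),
    "MIS": Fraction(3, 10),
}
TRAIN_SIZE = {"InfluMax": 100, "Dismantle": 150, "Epidemic": 150, "MIS": 150}


def selection_budget(task, n):
    """K = floor(fraction * n) for a graph with n nodes."""
    return math.floor(BUDGET_FRACTION[task] * n)


for task, n in TRAIN_SIZE.items():
    print(f"{task}: n={n}, K={selection_budget(task, n)}")
```

At the training sizes this gives $K = 10$ for InfluMax, $K = 30$ for Dismantle and Epidemic, and $K = 45$ for MIS.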

Evaluation. The learned scorer is used greedily (Algorithm [2](https://arxiv.org/html/2606.14022#alg2), Appendix [D.3](https://arxiv.org/html/2606.14022#A4.SS3)). Evaluation sizes extend to $n = 300$ for the transfer check. We report mean $\pm$ seed standard deviation, paired Wilcoxon tests, and TOST equivalence tests in Appendix [D](https://arxiv.org/html/2606.14022#A4). At 10 seeds, the paired Wilcoxon test detects effects of size $\sim 0.7\sigma$ at $\alpha = 0.05$ (one-sided), and TOST detects equivalence within $\pm 1\%$ relative. Cross-backbone numbers (GCN, GIN, SAGE, PNA at training-graph size) are in Appendix [F](https://arxiv.org/html/2606.14022#A6).

The placement rule gives a compact prediction ledger (Table [3](https://arxiv.org/html/2606.14022#S4.T3)). Each row names the test and the observed outcome.

Table 3: Verdict ledger for the placement rule. Each row names a placement-rule prediction, the test that would falsify it, and the observed outcome. Glyphs: ✔ the prediction is supported, $\approx$ a null prediction is observed (boundary case), ✘ the prediction is falsified. Larger raw tables and statistical matrices are in the appendix.

### 4.2 Relative effects match the placement rule

Table 4: Relative improvement over the LN backbone on node-selection tasks (%, mean $\pm$ seed std, 10 seeds). The parameter-free PostDeg row (top) is the recommended operator; learned same-slot variants are paired-equivalent under TOST at $\pm 1\%$ on every task (App. [D](https://arxiv.org/html/2606.14022#A4), Table [A13](https://arxiv.org/html/2606.14022#A4.T13)). The Epidemic column is a task-objective null where near-zero is the desired pattern. Last two columns: $W_{10}$ = paired-seed wins out of 10 (geometric mean across the three positive tasks); TOST $\equiv$ marks operators that are paired-equivalent to PostDeg at $\pm 1\%$. Controls are grouped by mechanism (graphwise, capacity-only, feature-statistic).

Table [4](https://arxiv.org/html/2606.14022#S4.T4) is the main comparison because every entry is measured against the same LN backbone, paired by seed. PostDeg has positive gains on the three degree-sensitive node-selection tasks: $+3.5\%$ on InfluMax, $+2.5\%$ on Dismantle, and $+5.6\%$ on MIS. It beats the LN backbone on 10 of 10 paired seeds on all three tasks, a unanimous win-rate that no other operator in Table [4](https://arxiv.org/html/2606.14022#S4.T4) achieves. Epidemic is a task-objective null: its reward (negative infected fraction under SIR dynamics) is not aligned with the post-LN degree contrast that PostDeg supplies, so near-zero is the expected outcome regardless of degree heterogeneity. Per-seed absolute reward differences are in Table [A12](https://arxiv.org/html/2606.14022#A4.T12).

Learned same-slot variants track PostDeg closely. GraphNorm misses the Dismantle sign ($-1.34\%$ on Dismantle vs. $+2.53\%$ for PostDeg, Table [4](https://arxiv.org/html/2606.14022#S4.T4)): the placement effect is not generic feature-statistic normalization. GraphScalar and Extra LayerNorm stay near zero or below the backbone, ruling out graphwise scalars and extra feature normalization as explanations. Absolute largest-size metrics are in Appendix Table [A4](https://arxiv.org/html/2606.14022#A4.T4); the appendix expands the per-baseline statistical tests (Wilcoxon $p < 0.01$ for PostDeg vs. Extra LayerNorm on every positive task; PairNorm/InstanceNorm/BatchNorm match PostDeg on some node-selection tasks but collapse to the majority-class predictor on DD; full heatmap in Figure [A3](https://arxiv.org/html/2606.14022#A4.F3)).

### 4.3 Operator-family equivalence: any reasonable post-LN topology operator gives the same gain

PostDeg is the minimum-parameter member of a broader post-LN scalar family. The placement rule prescribes the slot; several operators can occupy it. Replacing the fixed exponent $1/2$ with a learned $\gamma \in [0, 1.5]$, adding an adaptive gate, and adding a graph-level spectral term yields paired-equivalent results under TOST at $\pm 1\%$ on every task ($p_{\rm InfluMax} = 0.001$, $p_{\rm Dismantle} = 0.000$, $p_{\rm Epidemic} = 0.000$, $p_{\rm MIS} = 0.006$; Appendix Table [A13](https://arxiv.org/html/2606.14022#A4.T13)). In natural units, the $\pm 1\%$ margin is about $\pm 0.55$ spread units for InfluMax, $\pm 0.0024$ fragmentation for Dismantle, $\pm 0.0078$ infected fraction for Epidemic, and $\pm 0.40$ selected nodes for MIS. The complementary Wilcoxon test does not reject equality between PostDeg and either learned variant on any task ($p \geq 0.32$, Appendix Table [A14](https://arxiv.org/html/2606.14022#A4.T14)). PostDeg-L-FG and PostDeg-L-Adaptive learn parameters that agree to four decimals on every task (Table [A23](https://arxiv.org/html/2606.14022#A4.T23)); the optimizer leaves the additional capacity unused. Cross-backbone replication on GCN, GIN, SAGE, and PNA reproduces the same equivalence (Appendix [F](https://arxiv.org/html/2606.14022#A6)). We recommend $\gamma = 1/2$ because it is parameter-free and matches the GCN-symmetric exponent in $D^{-1/2} A D^{-1/2}$; the learned variants settle at $\beta \in [0.74, 0.77]$ instead, paired-equivalent to $1/2$ at $\pm 1\%$ at our seed budget (Appendix [D.12](https://arxiv.org/html/2606.14022#A4.SS12)). Other choices in the same slot, including post-LN variants of eigenvector centrality or learned attention scores, are valid predictions for future tests.

### 4.4 Mechanism: scale tracks degree on a log-log plot

#### Scale\-versus\-degree\.

On InfluMax, the grouped scalar-factor log-log fit has slope $-0.78$ with signed Pearson $r=-0.96$ (Appendix Table [A20](https://arxiv.org/html/2606.14022#A4.T20)); the all-node regression reported in the appendix gives slope $-0.83$. On MIS, the appendix fit gives slope $-0.62$ with $r=-0.79$ (Appendix Table [A20](https://arxiv.org/html/2606.14022#A4.T20)). Both signs match Eq. ([3](https://arxiv.org/html/2606.14022#S3.E3)). Per-task signed Pearson correlations and slopes for all four node-selection tasks are in Appendix Table [A20](https://arxiv.org/html/2606.14022#A4.T20); the predicted-vs-empirical scale-extreme ratio per task is in Appendix Table [A21](https://arxiv.org/html/2606.14022#A4.T21).
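The slope and signed Pearson $r$ quoted above come from a least-squares fit in log-log coordinates. A minimal sketch of that computation follows, assuming plain per-node (degree, scale) pairs. The helper name `loglog_fit` and the synthetic inverse-square-root data are illustrative and are not taken from the paper's pipeline.

```python
import math


def loglog_fit(degrees, scales):
    """Least-squares slope and signed Pearson r of ln(scale) against ln(degree)."""
    xs = [math.log(d) for d in degrees]
    ys = [math.log(s) for s in scales]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx, sxy / math.sqrt(sxx * syy)


# Synthetic check: an exact inverse-square-root scale should give slope -1/2.
degrees = [1, 2, 4, 8, 16, 32]
scales = [d ** -0.5 for d in degrees]
slope, r = loglog_fit(degrees, scales)
```

On an exact power law the fit recovers the exponent and a correlation of $-1$. Empirical fits such as the ones reported above deviate from the nominal exponent because the measured scale passes through further layers.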

On a stratified MIS evaluation, PostDeg improves over the LN backbone by $+8.77\%$ on BA graphs and $+5.58\%$ on moderately heterogeneous SBM graphs (Figure [3](https://arxiv.org/html/2606.14022#S4.F3), panel (e); Appendix Table [A30](https://arxiv.org/html/2606.14022#A4.T30)); the larger gain on the more heterogeneous family is consistent with the heterogeneity prediction.

### 4.5 The advantage holds under size transfer

[Figure 3 graphic omitted. Panels (a) InfluMax, (b) Dismantle, (c) Epidemic, and (d) MIS plot $\Delta$ vs. LN backbone (%) against evaluation size $n$, with the training size marked, for PostDeg, GraphNorm, GraphScalar, and the LN backbone (zero line). Panel (e), "Within-MIS dose-response: BA (heavy-tailed) vs. SBM (mild)", shows $\Delta\%$ over LN backbone on BA ($n=200$, skew $2.9$) and SBM ($n=300$, skew $0.4$) for PostDeg ($+8.8$, $+5.6$), PostDeg-L-FG ($+11.4$, $+6.1$), and PostDeg-L-Adaptive ($+11.0$, $+5.9$).]

Figure 3: Robustness of PostDeg across evaluation size and degree heterogeneity. (a-d) Size transfer relative to the LN backbone on each task (mean $\pm$ seed std over 10 seeds; shaded bands are $\pm 1$ seed std; vertical dashed line marks $n_{\rm train}$). The blue curve is PostDeg; gray curves are controls that do not isolate per-node post-LN degree scale. PostDeg preserves its advantage on InfluMax, Dismantle, and MIS, while Epidemic remains a task-objective null around the zero line. Full transfer curves including the learned same-slot variants are in Appendix Figure [A24](https://arxiv.org/html/2606.14022#A4.F24). (e) Within-MIS BA-vs-SBM dose-response (10 seeds; bars are mean $\Delta\%$ over LN backbone, whiskers are $\pm 1$ seed std). Heavier-tailed BA graphs (skewness $2.9$) yield larger PostDeg-family gains than moderately heterogeneous SBM graphs (skewness $0.4$): BA/SBM ratios $1.57\times$, $1.88\times$, $1.88\times$ for PostDeg, PostDeg-L-FG, PostDeg-L-Adaptive respectively (Appendix Table [A30](https://arxiv.org/html/2606.14022#A4.T30)). Evaluation sizes differ by family ($n=200$ for BA vs. $n=300$ for SBM); this is a directional within-task finding consistent with the heterogeneity prediction (Lemma [6](https://arxiv.org/html/2606.14022#Thmtheorem6)).

Figure [3](https://arxiv.org/html/2606.14022#S4.F3) (panels (a-d)) shows the transfer result in the same relative units as the main table. The PostDeg advantage is preserved within the tested $2$–$3\times$ train-to-test range. On InfluMax and MIS, the gap to the strongest non-degree-aware baseline grows with evaluation size; on Dismantle the gap is preserved; on Epidemic it remains within one seed-to-seed standard deviation. The full per-method relative-improvement curves with seed-std bands are in Appendix Figure [A24](https://arxiv.org/html/2606.14022#A4.F24).

Across every (task, evaluation size) pair, paired Wilcoxon tests reject equality between the trained PostDeg policy and the canonical classical heuristic at $\alpha=0.05$ (Appendix Table [A32](https://arxiv.org/html/2606.14022#A4.T32)). PostDeg also has the smallest train-evaluation gap among the methods tested (Appendix Figure [A22](https://arxiv.org/html/2606.14022#A4.F22)); the full transfer table is in Appendix Table [A11](https://arxiv.org/html/2606.14022#A4.T11).
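For intuition about paired Wilcoxon tests at a 10-seed budget, the sketch below computes an exact one-sided signed-rank p-value by enumerating all sign patterns. It assumes no zero differences and no tied magnitudes. The function name and the per-seed differences are hypothetical, and the paper's exact test variant may differ.

```python
from itertools import product


def wilcoxon_exact_one_sided(diffs):
    """Exact P(W+ >= observed W+) under H0 of symmetric paired differences.

    Sketch simplification: assumes no zero differences and no ties in |diff|.
    """
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    w_obs = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under H0 every sign pattern is equally likely, so count the extreme ones.
    hits = 0
    total = 0
    for signs in product((0, 1), repeat=len(diffs)):
        total += 1
        if sum(r for r, s in zip(ranks, signs) if s) >= w_obs:
            hits += 1
    return hits / total


# Hypothetical per-seed improvements with 10 of 10 paired wins.
all_wins = [0.5, 1.2, 0.8, 2.1, 1.7, 0.3, 0.9, 1.4, 2.6, 1.1]
p_all_wins = wilcoxon_exact_one_sided(all_wins)
```

With ten positive differences only one of the $2^{10}$ sign patterns is as extreme as the observed one, so the exact one-sided p-value is $1/1024$, well below $\alpha=0.05$.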

### 4.6 Boundary 1: DD as an adversarial control for variance-zeroing normalizers

DD has degree skewness $0.27$ (moderate, not near-regular), so we use it as an *adversarial control* for variance-zeroing normalizers rather than as a heterogeneity-floor case. The variance-zeroing normalizers PairNorm, InstanceNorm, and BatchNorm are known on small graph-classification benchmarks to collapse to majority-class prediction [[28](https://arxiv.org/html/2606.14022#bib.bib28),[30](https://arxiv.org/html/2606.14022#bib.bib30)]; DD is the standard test for this collapse. PostDeg, Extra LayerNorm, GraphScalar, and the LN backbone all hold near $80\%$ on the main 10-seed DD run.

PairNorm ($58.7\%$, seed std $<0.05\%$), InstanceNorm ($58.7\%$, $<0.05\%$), and BatchNorm ($58.8\pm 0.2\%$) collapse to within seed noise of the majority-class predictor ($58.65\%$) on every seed and fold. GraphNorm has two DD numbers because it was run under two protocols: $78.5\%$ in the main 10-seed DD table and $77.4\%$ in the supplemental 5-fold NodeNorm comparison (Appendix Table [A36](https://arxiv.org/html/2606.14022#A6.T36)). NodeNorm holds at $78.8\%$ in that supplemental comparison. We report the protocols separately. The boundary separates by variance-zeroing behavior rather than by feature-statistic family.

### 4.7 Boundary 2: when the topology signal is already in aggregation, PostDeg is paired-equivalent

PNA [[8](https://arxiv.org/html/2606.14022#bib.bib8)] is the closest published approach to a built-in degree normalizer: the aggregator scales each message by $\delta(d_i)=\log(d_i+1)/\bar{\delta}$ *inside* message passing. PNA injects degree inside aggregation rather than as a free post-LN scalar. Under our setup, training budget, and seed count, the same backbone trained with PostDeg added returns paired-equivalent results to the same backbone without PostDeg: $|\Delta\%|<1\%$ on InfluMax and Dismantle in the 5-seed PNA group, and differences remain within seed noise on every task in the 3-seed PNA-controlled group (Appendix Table [A34](https://arxiv.org/html/2606.14022#A6.T34)). This is a paired-equivalence finding, not a redundancy proof: the result marks a boundary for our backbone and training setup, and does not claim algebraic equivalence between PNA and PostDeg.
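As a small illustration of the PNA-style degree scaler $\delta(d_i)=\log(d_i+1)/\bar{\delta}$, the sketch below assumes $\bar{\delta}$ is the mean of $\log(d+1)$ over training-set degrees, as in the PNA paper [8]. The function name and the example degrees are hypothetical.

```python
import math


def pna_amplification_scalers(degrees, train_degrees):
    """Per-node scaler log(d + 1) / delta_bar applied inside aggregation.

    Assumption: delta_bar is the mean of log(d + 1) over training degrees.
    """
    delta_bar = sum(math.log(d + 1) for d in train_degrees) / len(train_degrees)
    return [math.log(d + 1) / delta_bar for d in degrees]


# Degrees 1, 3, 7 give log(2), log(4), log(8), i.e. 1x, 2x, 3x log(2).
scalers = pna_amplification_scalers([1, 3, 7], train_degrees=[1, 3, 7])
```

Here the scaler grows with degree and multiplies messages before any normalization, whereas PostDeg shrinks with degree and is applied after LayerNorm.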

## 5 Discussion

#### Headline results\.

A parameter-free post-LN inverse-degree scale (one extra positive scalar per node, no learnable weights) delivers $+3.5\%$ on InfluMax, $+2.5\%$ on Dismantle, and $+5.6\%$ on MIS over a strong LayerNorm GNN backbone, with 10/10 paired-seed wins on each (Table [4](https://arxiv.org/html/2606.14022#S4.T4)), while preserving DD accuracy at $79.9\%$. Two learned same-slot variants (PostDeg-L-FG, PostDeg-L-Adaptive) are paired-equivalent under TOST at $\pm 1\%$ on every positive task: the additional capacity does not help, consistent with the claim that placement, not parameterization, carries the gain. None of the four pre-registered falsifiers (graphwise scalar, extra LayerNorm, expressive same-slot capacity, PNA backbone-agnosticity) fires; the within-MIS dose-response (Figure [3](https://arxiv.org/html/2606.14022#S4.F3)(e)) shows BA gains are $1.6\times$–$1.9\times$ larger than SBM gains, mirroring the per-graph degree skew.

The placement rule says *position*, not *parameterization*, carries the gain. The LayerNorm absorption identity is exact for any positive pre-LN scalar (Section [3.1](https://arxiv.org/html/2606.14022#S3.SS1)); the empirical absorption envelope on every task is at most $2.44\times 10^{-5}$ (Appendix [D.5](https://arxiv.org/html/2606.14022#A4.SS5)). PostDeg helps when a LayerNorm GNN lacks a nodewise topology-magnitude channel and the task admits degree heterogeneity; on PNA (which carries a degree channel inside aggregation) and on Epidemic (reward not aligned with degree-conditioned magnitude), the gain vanishes as the rule predicts.
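The placement rule can be checked numerically on a single feature vector. The following is a minimal sketch, assuming a LayerNorm without learned gain or bias and a stabilizer of $10^{-5}$. The vector `h` and the scalar `s` are made up for illustration.

```python
import math


def layer_norm(x, eps=1e-5):
    """LayerNorm over one feature vector, without affine gain/bias (assumption)."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]


def l2_norm(x):
    return math.sqrt(sum(v * v for v in x))


h = [0.3, -1.2, 2.0, 0.7, -0.5, 1.1, -0.9, 0.4]  # hypothetical node features
s = 0.25                                          # e.g. inverse sqrt of degree 16

base = layer_norm(h)
pre = layer_norm([s * v for v in h])   # scalar inserted before LayerNorm
post = [s * v for v in base]           # scalar inserted after LayerNorm

# Pre-LN: the scalar is divided out, up to the stabilizer term.
pre_gap = max(abs(a - b) for a, b in zip(pre, base))
# Post-LN: the scalar survives as representation magnitude.
post_ratio = l2_norm(post) / l2_norm(base)
```

`pre_gap` is small and shrinks further as the stabilizer goes to zero, while `post_ratio` equals `s` up to floating-point error, which is the magnitude channel that reaches the score head.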

#### Limitations\.

- Theory. Single-layer; positive scalars only. Multi-layer compounding, RMSNorm variants, and sign-changing or content-side multipliers are out of scope (Appendix [C.10](https://arxiv.org/html/2606.14022#A3.SS10)).
- Scalar signal. We test degree centrality. PageRank, $k$-core, eigenvector centrality, and learned positive attention scalars are natural drop-ins for the same slot; we do not run them.
- External validity. The main grid uses a fixed GAT backbone on synthetic graph families plus DD; cross-backbone replication (GCN/GIN/SAGE/PNA) uses a supplemental pipeline at a smaller seed budget.

## 6 Conclusion

This paper isolates where a topology scalar can survive LayerNorm in a GNN: the post-LayerNorm slot. We turn the LayerNorm absorption identity into a placement diagnostic and instantiate the surviving slot with PostDeg, a parameter-free inverse-degree scale. PostDeg improves over the LN backbone by $+3.5\%$, $+2.5\%$, and $+5.6\%$ on InfluMax, Dismantle, and MIS, with 10/10 paired-seed wins, while preserving $79.9\%$ DD accuracy. Four pre-registered falsifiers all fail to reject the rule, and the optimizer leaves the additional capacity of learned variants unused: placement, not parameterization, carries the effect. The diagnostic applies in principle to other positive topology signals and other normalized residual stacks; testing them is a natural next step.

## References

- [1] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. *arXiv:1607.06450*, 2016.
- [2] G. Bouritsas, F. Frasca, S. P. Zafeiriou, and M. M. Bronstein. Improving graph neural network expressivity via subgraph isomorphism counting. *IEEE Transactions on Pattern Analysis and Machine Intelligence*, 45(1):657–668, 2022.
- [3] A. Braunstein, L. Dall’Asta, G. Semerjian, and L. Zdeborová. Network dismantling. *Proceedings of the National Academy of Sciences*, 113(44):12368–12373, 2016.
- [4] S. Brody, U. Alon, and E. Yahav. How attentive are graph attention networks? In *International Conference on Learning Representations*, 2022.
- [5] T. Cai, S. Luo, K. Xu, D. He, T.-Y. Liu, and L. Wang. GraphNorm: A principled approach to accelerating graph neural network training. In *International Conference on Machine Learning*, 2021.
- [6] Q. Cappart, D. Chételat, E. Khalil, A. Lodi, C. Morris, and P. Veličković. Combinatorial optimization and reasoning with graph neural networks. *Journal of Machine Learning Research*, 24(130):1–61, 2023.
- [7] F. R. K. Chung. *Spectral Graph Theory*. American Mathematical Society, 1997.
- [8] G. Corso, L. Cavalleri, D. Beaini, P. Liò, and P. Veličković. Principal neighbourhood aggregation for graph nets. In *Advances in Neural Information Processing Systems*, 2020.
- [9] N. Dehmamy, A.-L. Barabási, and R. Yu. Understanding the representation power of graph neural networks in learning graph topology. In *Advances in Neural Information Processing Systems*, 2019.
- [10] W. L. Hamilton, R. Ying, and J. Leskovec. Inductive representation learning on large graphs. In *Advances in Neural Information Processing Systems*, 2017.
- [11] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In *International Conference on Machine Learning*, 2015.
- [12] C. K. Joshi, Q. Cappart, L.-M. Rousseau, and T. Laurent. Learning the travelling salesperson problem requires rethinking generalization. *Constraints*, 27(1–2):70–98, 2022.
- [13] N. Karalias and A. Loukas. Erdős goes neural: An unsupervised learning framework for combinatorial optimization on graphs. In *Advances in Neural Information Processing Systems*, 2020.
- [14] D. Kempe, J. Kleinberg, and É. Tardos. Maximizing the spread of influence through a social network. In *Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*, 2003.
- [15] E. B. Khalil, H. Dai, Y. Zhang, B. Dilkina, and L. Song. Learning combinatorial optimization algorithms over graphs. In *Advances in Neural Information Processing Systems*, 2017.
- [16] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. In *International Conference on Learning Representations*, 2017.
- [17] W. Kool, H. van Hoof, and M. Welling. Attention, learn to solve routing problems! In *International Conference on Learning Representations*, 2019.
- [18] C. Morris, N. M. Kriege, F. Bause, K. Kersting, P. Mutzel, and M. Neumann. TUDataset: A collection of benchmark datasets for learning with graphs. *arXiv:2007.08663*, 2020.
- [19] O. Press and L. Wolf. Using the output embedding to improve language models. In *Conference of the European Chapter of the Association for Computational Linguistics*, 2017.
- [20] L. Rampasek, M. Galkin, V. P. Dwivedi, A. T. Luu, G. Wolf, and D. Beaini. Recipe for a general, powerful, scalable graph transformer. In *Advances in Neural Information Processing Systems*, 2022.
- [21] T. Salimans and D. P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In *Advances in Neural Information Processing Systems*, 2016.
- [22] R. Sato, M. Yamada, and H. Kashima. Random features strengthen graph neural networks. In *SIAM International Conference on Data Mining*, 2020.
- [23] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. *arXiv:1607.08022*, 2016.
- [24] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio. Graph attention networks. In *International Conference on Learning Representations*, 2018.
- [25] R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T.-Y. Liu. On layer normalization in the transformer architecture. In *International Conference on Machine Learning*, 2020.
- [26] C. Ying, T. Cai, S. Luo, S. Zheng, G. Ke, D. He, Y. Shen, and T.-Y. Liu. Do transformers really perform badly for graph representation? In *Advances in Neural Information Processing Systems*, 2021.
- [27] B. Zhang and R. Sennrich. Root mean square layer normalization. In *Advances in Neural Information Processing Systems*, 2019.
- [28] L. Zhao and L. Akoglu. PairNorm: Tackling oversmoothing in GNNs. In *International Conference on Learning Representations*, 2020.
- [29] K. Zhou, X. Huang, Y. Li, D. Zha, R. Chen, and X. Hu. Towards deeper graph neural networks with differentiable group normalization. In *Advances in Neural Information Processing Systems*, 2020.
- [30] K. Zhou, Y. Dong, K. Wang, W. S. Lee, B. Hooi, H. Xu, and J. Feng. Understanding and resolving performance degradation in deep graph convolutional networks. In *Proceedings of the 30th ACM International Conference on Information & Knowledge Management*, 2021.
- [31] K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? In *International Conference on Learning Representations*, 2019.
- [32] C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe. Weisfeiler and Leman go neural: higher-order graph neural networks. In *AAAI*, 2019.
- [33] F. Wu, A. Souza, T. Zhang, C. Fifty, T. Yu, and K. Weinberger. Simplifying graph convolutional networks. In *International Conference on Machine Learning*, 2019.
- [34] G. Li, M. Müller, A. Thabet, and B. Ghanem. DeepGCNs: can GCNs go as deep as CNNs? In *IEEE International Conference on Computer Vision*, 2019.
- [35] M. Chen, Z. Wei, Z. Huang, B. Ding, and Y. Li. Simple and deep graph convolutional networks. In *International Conference on Machine Learning*, 2020.
- [36] K. Oono and T. Suzuki. Graph neural networks exponentially lose expressive power for node classification. In *International Conference on Learning Representations*, 2020.
- [37] J. Topping, F. Di Giovanni, B. P. Chamberlain, X. Dong, and M. M. Bronstein. Understanding over-squashing and bottlenecks on graphs via curvature. In *International Conference on Learning Representations*, 2022.
- [38] U. Alon and E. Yahav. On the bottleneck of graph neural networks and its practical implications. In *International Conference on Learning Representations*, 2021.
- [39] K. Xu, C. Li, Y. Tian, T. Sonobe, K. Kawarabayashi, and S. Jegelka. Representation learning on graphs with jumping knowledge networks. In *International Conference on Machine Learning*, 2018.
- [40] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In *Advances in Neural Information Processing Systems*, 2017.
- [41] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry. How does batch normalization help optimization? In *Advances in Neural Information Processing Systems*, 2018.
- [42] T. K. Rusch, M. M. Bronstein, and S. Mishra. A survey on oversmoothing in graph neural networks. *arXiv:2303.10993*, 2023.

Table of Contents in Appendix

- [A](https://arxiv.org/html/2606.14022#A1): Reading guide and summary of evidence
- [B](https://arxiv.org/html/2606.14022#A2): Normalization design-space table
- [C](https://arxiv.org/html/2606.14022#A3): Theory Appendix
  - [C.1](https://arxiv.org/html/2606.14022#A3.SS1): Boundedness
  - [C.2](https://arxiv.org/html/2606.14022#A3.SS2): Degree separation and log-log scaling
  - [C.3](https://arxiv.org/html/2606.14022#A3.SS3): Spectral-gap modulation (Cheeger)
  - [C.4](https://arxiv.org/html/2606.14022#A3.SS4): Star-graph one-step amplification
  - [C.5](https://arxiv.org/html/2606.14022#A3.SS5): Post-LayerNorm Jacobian and residual gradient floor
  - [C.6](https://arxiv.org/html/2606.14022#A3.SS6): Gate Lipschitzness and parameter gradients
  - [C.7](https://arxiv.org/html/2606.14022#A3.SS7): Topology-agnostic normalizers and ablation hierarchy
  - [C.8](https://arxiv.org/html/2606.14022#A3.SS8): Additional proofs
  - [C.9](https://arxiv.org/html/2606.14022#A3.SS9): Full proofs lifted from the theorem bank
  - [C.10](https://arxiv.org/html/2606.14022#A3.SS10): Theory limitations
- [D](https://arxiv.org/html/2606.14022#A4): Empirical Appendix
  - [D.1](https://arxiv.org/html/2606.14022#A4.SS1): Full main-grid metrics
  - [D.2](https://arxiv.org/html/2606.14022#A4.SS2): Predictions ledger
  - [D.3](https://arxiv.org/html/2606.14022#A4.SS3): Reproducibility and setup
  - [D.4](https://arxiv.org/html/2606.14022#A4.SS4): Data dictionary
  - [D.5](https://arxiv.org/html/2606.14022#A4.SS5): Variance regime: pre-LayerNorm multipliers are absorbed
  - [D.6](https://arxiv.org/html/2606.14022#A4.SS6): Full numerical results
  - [D.7](https://arxiv.org/html/2606.14022#A4.SS7): Equivalence and ablations (TOST, Wilcoxon)
  - [D.8](https://arxiv.org/html/2606.14022#A4.SS8): Convergence speed at multiple thresholds
  - [D.9](https://arxiv.org/html/2606.14022#A4.SS9): Post-LayerNorm multipliers reach the score head
  - [D.10](https://arxiv.org/html/2606.14022#A4.SS10): Learned-ablation parameters
  - [D.11](https://arxiv.org/html/2606.14022#A4.SS11): Ablation staircase
  - [D.12](https://arxiv.org/html/2606.14022#A4.SS12): Why the optimizer settles at $\beta\in[0.75,0.77]$, not $0.5$
  - [D.13](https://arxiv.org/html/2606.14022#A4.SS13): Spectral quantities
  - [D.14](https://arxiv.org/html/2606.14022#A4.SS14): Effect attenuates with degree heterogeneity
  - [D.15](https://arxiv.org/html/2606.14022#A4.SS15): Transfer and supplementary visualizations
- [E](https://arxiv.org/html/2606.14022#A5): Distinction from PNA and other degree-aware aggregators
- [F](https://arxiv.org/html/2606.14022#A6): Cross-backbone replication and predicted-baseline runs
  - [F.1](https://arxiv.org/html/2606.14022#A6.SS1): Reference implementation
- [G](https://arxiv.org/html/2606.14022#A7): Symbolic walkthrough of one PostDeg layer
- [H](https://arxiv.org/html/2606.14022#A8): Why feature concatenation does not substitute for post-LayerNorm scaling
- [I](https://arxiv.org/html/2606.14022#A9): DD collapse
- [J](https://arxiv.org/html/2606.14022#A10): Code-name reproducibility map

## Appendix A Reading guide and summary of evidence

This appendix is a reference manual rather than a continuation of the body narrative\. We open with a navigable summary so that a reviewer who has finished the body can check any single claim without searching\. The body contains the controlled main grid; this appendix contains the full statistical apparatus, the theory, supplementary runs, and the predictions that the data verify\.

#### What would falsify the placement rule\.

The placement rule is more useful for what it forbids than for what it predicts. Four pre-registered falsifiers are tested in this appendix. (i) If a graphwise spectral term explained the gain, GraphScalar in the same slot would match PostDeg (§[B](https://arxiv.org/html/2606.14022#A2), Table [A4](https://arxiv.org/html/2606.14022#A4.T4)); it does not. (ii) If extra feature normalization explained the gain, Extra LayerNorm in the same slot would match PostDeg (§[B](https://arxiv.org/html/2606.14022#A2), Table [A14](https://arxiv.org/html/2606.14022#A4.T14)); it does not. (iii) If post-LN scalar capacity beyond a fixed exponent helped, PostDeg-L-FG/Adaptive would beat PostDeg (§[D](https://arxiv.org/html/2606.14022#A4), Table [A13](https://arxiv.org/html/2606.14022#A4.T13)); they are TOST-equivalent at $\pm 1\%$. (iv) If the gain were backbone-agnostic, PostDeg on top of PNA would still help (§[E](https://arxiv.org/html/2606.14022#A5), Table [A34](https://arxiv.org/html/2606.14022#A6.T34)); it is paired-equivalent. None of the four falsifiers fires.

#### Appendix layout\.

- §[B](https://arxiv.org/html/2606.14022#A2): normalization design-space table.
- §[C](https://arxiv.org/html/2606.14022#A3): theorem/proof appendix (LayerNorm absorption identity, degree-separation, near-regular attenuation, Jacobian and gate properties, parameter gradients, theory limitations).
- §[D](https://arxiv.org/html/2606.14022#A4): empirical appendix (full main-grid metrics, predictions ledger, reproducibility, variance regime, statistical tables, mechanism, learned-ablation parameters, boundaries, transfer/supplementary).
- §[E](https://arxiv.org/html/2606.14022#A5): distinction from PNA and other degree-aware aggregators.
- §[F](https://arxiv.org/html/2606.14022#A6): cross-backbone replication and predicted-baseline runs.
- §[H](https://arxiv.org/html/2606.14022#A8): content vs. magnitude and the $\log d_i$ baseline.

Table A1: Glossary of symbols, definitions, default values, and the section or equation where each is introduced.

#### How to read the statistical tables\.

A TOST table tests *equivalence* (lower one-sided $p$ $\Rightarrow$ method paired-equivalent within margin); a Wilcoxon table tests *difference* (lower one-sided $p$ $\Rightarrow$ method paired-better). Cells in heatmaps reporting Cohen’s $d$ use a diverging colormap centered at $0$: dark red is large positive effect, dark blue is large negative effect, neutral white is near zero. Win-rate cells are the fraction of paired seeds out of 10.

## Appendix B Normalization design-space table

Table A2: Normalization design space. PNA and NodeNorm are included for positioning but are not evaluated in the controlled main grid because they change aggregation or feature-norm variables rather than only swapping the post-block normalization slot.

## Appendix C Theory Appendix

This appendix gives the numbered statements and proofs used by the body\.

#### Setup\.

We adopt the notation of Section [3.2](https://arxiv.org/html/2606.14022#S3.SS2). The learned same-slot variants PostDeg-L-FG and PostDeg-L-Adaptive use the parameterized scale

$$s_i=\alpha\,(c_i+\varepsilon)^{-\gamma}(\lambda_{\max}+\varepsilon')^{-1/2},\quad\gamma=\beta+\delta\lambda_2,\quad\widetilde{h}_i=m_i\,s_i\,\bar{h}_i+(1-m_i)\,h_i,\tag{A1}$$

with $\lambda_2,\lambda_{\max}$ the second-smallest and largest eigenvalues of the normalized graph Laplacian, $\varepsilon'=10^{-4}$, and $(\alpha,\beta,\delta)$ learned per layer. The two variants differ only in the gate $m_i\in(0,1)$: PostDeg-L-FG sets $m_i=\sigma(\rho)$ for a single learned scalar $\rho$, and PostDeg-L-Adaptive sets $m_i=\sigma(\phi(c_i))$ for a small two-layer MLP $\phi$. Parameter domains are $\alpha\in[0.1,3.0]$, $\beta\in[0,1.5]$, $\delta\in[0,0.5]$, $c_i=d_i/d_{\max}$ clipped to $[10^{-6},1]$, spectral clamps $\lambda_2,\lambda_{\max}\in[0.1,2]$, and a final scale clamp $s_i\in[0.01,100]$ that is inactive on every node in our experiments (Lemma [14](https://arxiv.org/html/2606.14022#Thmtheorem14)). Learned-variant runs also use weight spectral normalization on the GAT projections through `use_weight_sn`; PostDeg and the alternative-normalization baselines do not. PostDeg is the special case $\alpha=1$, $\beta=1/2$, $\delta=0$, $m_i\equiv 1$, with the graph-wise constant $(\lambda_{\max}+\varepsilon')^{-1/2}$ absorbed into the LayerNorm gain.
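The scale map and gate of Eq. (A1) can be sketched in a few lines. This is an illustrative reading of the setup, not the reference implementation: the function names are hypothetical, the learned parameters are assumed to lie already inside their stated domains, and $\varepsilon=10^{-8}$ is an assumption (its value is not stated in this section).

```python
def clamp(x, lo, hi):
    return max(lo, min(hi, x))


def postdeg_l_scale(degree, d_max, lam2, lam_max,
                    alpha=1.0, beta=0.5, delta=0.0,
                    eps=1e-8, eps_prime=1e-4):
    """Per-node scale s_i following Eq. (A1) with the stated clips and clamps.

    eps=1e-8 is an assumed value; alpha, beta, delta default to the PostDeg
    special case (alpha=1, beta=1/2, delta=0).
    """
    c = clamp(degree / d_max, 1e-6, 1.0)        # c_i = d_i / d_max, clipped
    lam2 = clamp(lam2, 0.1, 2.0)                # spectral clamps
    lam_max = clamp(lam_max, 0.1, 2.0)
    gamma = beta + delta * lam2
    s = alpha * (c + eps) ** (-gamma) * (lam_max + eps_prime) ** (-0.5)
    return clamp(s, 0.01, 100.0)                # final scale clamp


def gated_output(m, s, h_bar, h):
    """h_tilde = m * s * h_bar + (1 - m) * h, elementwise on plain lists."""
    return [m * s * hb + (1.0 - m) * hv for hb, hv in zip(h_bar, h)]


s_leaf = postdeg_l_scale(degree=1, d_max=16, lam2=0.5, lam_max=2.0)
s_hub = postdeg_l_scale(degree=16, d_max=16, lam2=0.5, lam_max=2.0)
out = gated_output(1.0, 2.0, [1.0, -1.0], [5.0, 5.0])
```

With the gate fixed at $m_i\equiv 1$ the second term drops out and the output is just the scaled vector, which is the PostDeg special case. Low-degree nodes receive the larger scale.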

#### Disconnected or weakly connected graphs\.

If $G$ is disconnected, the mathematical normalized Laplacian value is $\lambda_2=0$, so the spectral-gap contribution vanishes and the learned superset reduces to degree-aware scaling with exponent $\beta$. The implementation lower clamp replaces this value by $0.1$ to avoid a hard discontinuity and to keep gradients through $\delta$ observable on weakly connected batches.

#### Stability constants\.

The numerical constants used throughout this appendix are collected in Table[A3](https://arxiv.org/html/2606.14022#A3.T3); they fix the offsets, clips, and clamps that make every quantity below finite and differentiable on the implemented domain\.

Table A3: Numerical constants used by the PostDeg-L scale map and the post-computation clamp. All values are fixed throughout the experiments; the clamps are operative only at the boundaries of the parameter rectangle.

### C.1 Boundedness

###### Lemma 1 (Well-definedness).

On the parameter domains in the setup, every $s_i$ in Eq. ([A1](https://arxiv.org/html/2606.14022#A3.E1)) is finite and strictly positive.

###### Proof.

$\alpha>0$, $c_i+\varepsilon\geq c_{\min}+\varepsilon>0$, and $\lambda_{\max}+\varepsilon'\geq\lambda_{\max}^{\min}+\varepsilon'>0$. Since $\gamma$ is bounded on the setup domains, each factor of $s_i$ is a finite positive number, and so is their product. ∎

###### Lemma 2 (Operational bound).

If $\alpha\leq\bar{\alpha}$, $\gamma\leq\gamma_{\rm op}$, $c_i\geq c_0$, $\lambda_{\max}\geq\ell_0>0$, then

$$s_i\leq S_{\max}(\bar{\alpha},\gamma_{\rm op},c_0,\ell_0)=\frac{\bar{\alpha}}{\sqrt{\ell_0+\varepsilon'}}\max\{1,(c_0+\varepsilon)^{-\gamma_{\rm op}}\}.\tag{A2}$$

###### Proof\.

Monotone maximization on the rectangular learned\-regime domain\. ∎
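One way to unpack the monotone-maximization step is to bound each factor of Eq. (A1) separately. The sketch below additionally assumes $\gamma\geq 0$, which holds on the setup domains because $\gamma=\beta+\delta\lambda_2$ with $\beta,\delta\geq 0$ and $\lambda_2\geq 0.1$:

$$
\begin{aligned}
\alpha &\leq \bar{\alpha},\\
(\lambda_{\max}+\varepsilon')^{-1/2} &\leq (\ell_0+\varepsilon')^{-1/2},\\
(c_i+\varepsilon)^{-\gamma} &\leq
\begin{cases}
1, & c_i+\varepsilon\geq 1,\\
(c_i+\varepsilon)^{-\gamma_{\rm op}}\leq(c_0+\varepsilon)^{-\gamma_{\rm op}}, & c_i+\varepsilon<1.
\end{cases}
\end{aligned}
$$

Multiplying the three bounds gives $s_i\leq\bar{\alpha}\,(\ell_0+\varepsilon')^{-1/2}\max\{1,(c_0+\varepsilon)^{-\gamma_{\rm op}}\}$, which is the right-hand side of Eq. (A2).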

### C.2 Degree separation and log-log scaling

###### Theorem 3 (Degree separation, full version).

Let $i,j\in V$ with $c_i>c_j$ and $\gamma>0$. Then:

1. (a) $s_i<s_j$.
2. (b) $R_{ij}=s_j/s_i=((c_i+\varepsilon)/(c_j+\varepsilon))^{\gamma}>1$.
3. (c) $\partial R_{ij}/\partial\gamma=\ln((c_i+\varepsilon)/(c_j+\varepsilon))\,R_{ij}>0$.
4. (d) If $\delta>0$, $\partial R_{ij}/\partial\lambda_2=\delta\ln((c_i+\varepsilon)/(c_j+\varepsilon))\,R_{ij}>0$.

###### Proof\.

Cancel the factorsα\\alphaand\(λmax\+ε′\)−1/2\(\\lambda\_\{\\max\}\+\\varepsilon^\{\\prime\}\)^\{\-1/2\}in the ratio,sj/si=\(\(ci\+ε\)/\(cj\+ε\)\)γs\_\{j\}/s\_\{i\}=\(\(c\_\{i\}\+\\varepsilon\)/\(c\_\{j\}\+\\varepsilon\)\)^\{\\gamma\}\. The base exceeds one andγ\>0\\gamma\>0, soRi​j\>1R\_\{ij\}\>1andsj\>sis\_\{j\}\>s\_\{i\}\. \(c\) and \(d\) follow by writingRi​j=exp⁡\(γ​ai​j\)R\_\{ij\}=\\exp\(\\gamma a\_\{ij\}\)withai​j=ln⁡\(\(ci\+ε\)/\(cj\+ε\)\)\>0a\_\{ij\}=\\ln\(\(c\_\{i\}\+\\varepsilon\)/\(c\_\{j\}\+\\varepsilon\)\)\>0and applying the chain rule with∂γ/∂λ2=δ\\partial\\gamma/\\partial\\lambda\_\{2\}=\\delta\. ∎
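The ratio in Theorem 3 can be checked numerically. The following is a minimal sketch, not the authors' code: the function names are hypothetical, and the values of `lam_max` and `eps_prime` are placeholders chosen for illustration (they cancel in the ratio, as the proof states).

```python
import math

def scale(c, alpha=1.0, gamma=0.5, lam_max=1.0, eps=1e-8, eps_prime=1e-8):
    """s_i = alpha * (c_i + eps)^(-gamma) * (lam_max + eps')^(-1/2).
    lam_max and eps_prime are illustrative placeholders."""
    return alpha * (c + eps) ** (-gamma) / math.sqrt(lam_max + eps_prime)

def separation_ratio(c_i, c_j, gamma, eps=1e-8):
    """Closed form of Theorem 3(b): R_ij = ((c_i + eps) / (c_j + eps))^gamma."""
    return ((c_i + eps) / (c_j + eps)) ** gamma

c_i, c_j, gamma = 1.0, 0.25, 0.5  # c_i > c_j, gamma > 0

# (a), (b): the ratio of scales matches the closed form and exceeds one.
ratio_from_scales = scale(c_j, gamma=gamma) / scale(c_i, gamma=gamma)
ratio_closed_form = separation_ratio(c_i, c_j, gamma)

# (c): central finite difference in gamma against ln(base) * R_ij.
h = 1e-6
fd_gamma = (separation_ratio(c_i, c_j, gamma + h)
            - separation_ratio(c_i, c_j, gamma - h)) / (2 * h)
analytic_gamma = math.log((c_i + 1e-8) / (c_j + 1e-8)) * ratio_closed_form
```

With a degree ratio of four and $\gamma = 1/2$, the lower-degree node's scale is about twice that of the higher-degree node.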

###### Lemma 4 (Effect of the post-computation clamp).

The scalar projection $\operatorname{clamp}(\cdot, s_{\min}^{\rm clip}, s_{\max}^{\rm clip})$ is monotone nondecreasing, so it preserves the weak ordering of scales; strict separation is preserved exactly whenever both unclipped scales lie inside $[0.01, 100]$.

###### Proof.

Apply a monotone map to the strict ordering in Theorem [3](https://arxiv.org/html/2606.14022#Thmtheorem3). ∎

###### Corollary 5 (Log-log slope, with caveat).

For $\varepsilon \ll c_i$, $\ln s_i = -\gamma \ln c_i + \mathrm{const}$. The exact local slope is $-\gamma c_i/(c_i + \varepsilon)$, which differs from $-\gamma$ by a factor $c_{\min}/(c_{\min} + \varepsilon) \approx 0.99$ at the clipped minimum. Any larger discrepancy between the predicted slope and an empirical slope reflects downstream layer interactions, not a failure of the scalar power law.

###### Lemma 6 (Near-regular attenuation).

On a $k$-regular graph, $c_i = 1$ for all $i$ and $s_i$ is constant. If $d_{\max}/d_{\min} \leq 1 + \eta$, then $R_{ij} \leq (1 + \eta)^{\gamma} = 1 + \gamma\eta + O(\eta^2)$ as $\eta \to 0$.

###### Proof.

$c_i \in [(1 + \eta)^{-1}, 1]$, so $(c_i + \varepsilon)/(c_j + \varepsilon) \leq 1 + \eta$. Raise to $\gamma$. ∎

### C.3 Spectral-gap modulation

###### Proposition 7 (Cheeger-based separation lower bound).

On a connected graph with Cheeger constant $h(G) \geq h_0 > 0$, the inequality $\lambda_2 \geq h_0^2/2$ [[7](https://arxiv.org/html/2606.14022#bib.bib7)] gives

$$R_{ij} \geq \left(\frac{c_i + \varepsilon}{c_j + \varepsilon}\right)^{\beta + \delta h_0^2/2}.$$

###### Proof.

$\gamma = \beta + \delta\lambda_2 \geq \beta + \delta h_0^2/2$ on the stated event; raise both sides of Eq. ([3](https://arxiv.org/html/2606.14022#S3.E3)). ∎

### C.4 Star-graph one-step amplification

###### Proposition 8 (One-step amplification on $S_n$).

Let $S_n$ be the star with hub $0$ and $n-1$ leaves. Then:

1. (a) $s_{\rm hub} = \alpha(1 + \varepsilon)^{-\gamma}(\lambda_{\max} + \varepsilon')^{-1/2}$, $s_{\rm leaf} = \alpha((1/(n-1)) + \varepsilon)^{-\gamma}(\lambda_{\max} + \varepsilon')^{-1/2}$, and $R \approx (n-1)^{\gamma}$.
2. (b) Under mean aggregation, all leaf representations are equal after one layer; with post-LN degree scaling, leaf amplitude is multiplied by $R$ relative to the hub class.
3. (c) Suppose $h_{\rm hub}^{(1)} = \kappa h_{\rm leaf}^{(1)}$ for some $\kappa \in [0, 1)$. Then

$$\frac{\|\widetilde{h}_{\rm leaf}^{(1)} - \widetilde{h}_{\rm hub}^{(1)}\|}{s_{\rm hub}\,\|h_{\rm leaf}^{(1)} - h_{\rm hub}^{(1)}\|} = \frac{R - \kappa}{1 - \kappa} \geq R.$$

###### Proof.

$d_0 = n-1$, $d_j = 1$ for leaves, so $c_0 = 1$, $c_j = 1/(n-1)$, giving (a). (b) is mean aggregation. For (c), $\|s_{\rm leaf} h_{\rm leaf}^{(1)} - s_{\rm hub}\kappa h_{\rm leaf}^{(1)}\| = s_{\rm hub}\,|R - \kappa|\,\|h_{\rm leaf}^{(1)}\|$ and $\|h_{\rm leaf}^{(1)} - h_{\rm hub}^{(1)}\| = (1 - \kappa)\|h_{\rm leaf}^{(1)}\|$; dividing gives the formula. The inequality holds because $(R - \kappa) - R(1 - \kappa) = \kappa(R - 1) \geq 0$. The collinearity $h_{\rm hub}^{(1)} = \kappa h_{\rm leaf}^{(1)}$ is a modeling hypothesis, not a one-step consequence; without post-LN degree scaling, $R = 1$. ∎
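As a worked instance of Proposition 8, the sketch below evaluates the star-graph ratio and the amplification factor for one choice of $n$, $\gamma$, and $\kappa$. The function names and the example feature vector are hypothetical; the collinearity of hub and leaf features is the same modeling hypothesis used in part (c), and the common factors $\alpha$ and $(\lambda_{\max} + \varepsilon')^{-1/2}$ are dropped because they cancel.

```python
import math

def star_ratio(n, gamma, eps=1e-8):
    """R = s_leaf / s_hub on the star S_n, with c_hub = 1 and c_leaf = 1/(n-1)."""
    c_hub, c_leaf = 1.0, 1.0 / (n - 1)
    return ((c_hub + eps) / (c_leaf + eps)) ** gamma

def norm(v):
    return math.sqrt(sum(t * t for t in v))

n, gamma, kappa = 101, 0.5, 0.3
R = star_ratio(n, gamma)             # close to (n - 1)^gamma = 10

# Hypothetical one-layer features with h_hub = kappa * h_leaf (part (c)).
h_leaf = [1.0, -2.0, 0.5]
h_hub = [kappa * t for t in h_leaf]

# Post-LN scaling with s_hub normalized to 1 and s_leaf = R.
s_hub, s_leaf = 1.0, R
scaled_gap = norm([s_leaf * a - s_hub * b for a, b in zip(h_leaf, h_hub)])
unscaled_gap = norm([a - b for a, b in zip(h_leaf, h_hub)])

amplification = scaled_gap / (s_hub * unscaled_gap)
closed_form = (R - kappa) / (1.0 - kappa)
```

Here the leaf-hub gap grows by roughly a factor of 13.9, which is above $R \approx 10$ as the proposition states.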

### C.5 Post-LayerNorm Jacobian and residual gradient floor

###### Proposition 9 (Jacobian of the gated learned layer).

For a fixed graph, $s_i$ depends only on the topology and learned scale parameters. The Jacobian of $\widetilde{h}_i = s_i \operatorname{LN}(h_i)$ is $s_i J_{\operatorname{LN}}(h_i)$ with $\|J_{\operatorname{LN}}(h_i)\|_{\rm op} \leq \|g\|_{\infty}/\tau_{\operatorname{LN}}(h_i)$. For the gated layer $T_i(h) = m_i s_i \operatorname{LN}(h_i) + (1 - m_i) h_i$,

$$\frac{\partial T_i}{\partial h_i} = m_i s_i J_{\operatorname{LN}}(h_i) + (1 - m_i) I.$$

###### Proof.

Standard quotient-rule computation of the LayerNorm Jacobian followed by differentiation of the residual branch. ∎

###### Proposition 10 (Residual gradient floor).

Under nonnegative LayerNorm gain $g \geq 0$, every eigenvalue of $\partial T_i/\partial h_i$ is at least $1 - m_i > 0$.

###### Proof.

$J_{\operatorname{LN}}(h_i)$ is positive semidefinite under $g \geq 0$; multiplication by $m_i s_i \geq 0$ preserves nonnegativity. Adding $(1 - m_i) I$ shifts every eigenvalue by $1 - m_i$. ∎

### C.6 Gate Lipschitzness and parameter gradients

###### Lemma 11 (Gate Lipschitzness).

(a) $X_{\rm out} = m\widetilde{H} + (1 - m)X$ is a convex combination, so $\|X_{\rm out}\| \leq m\|\widetilde{H}\| + (1 - m)\|X\|$. (b) For an adaptive gate $m_i = \sigma(\phi(c_i))$ with two-layer MLP $\phi$, $|m_i - m_j| \leq (1/4)\,\|W_2\|_{\rm op}\,\|W_1\|_{\rm op}\,|c_i - c_j|$.

###### Proof.

(a) Triangle inequality. (b) Sigmoid is $1/4$-Lipschitz, ReLU is $1$-Lipschitz, affines have Lipschitz constant equal to their operator norm; compose. ∎

###### Corollary 12 (Parameter gradients).

$\partial s_i/\partial\alpha = s_i/\alpha$, $\partial s_i/\partial\beta = -\ln(c_i + \varepsilon)\,s_i$, $\partial s_i/\partial\delta = -\lambda_2 \ln(c_i + \varepsilon)\,s_i$. With the post-computation clamp $s_i \leq 100$ and $|\ln(c_{\min} + \varepsilon)| \approx 13.8$, the $\beta$- and $\delta$-gradients are bounded by $1380$ and $\lambda_2^{\max} \cdot 1380 \leq 2760$ respectively.

###### Proof.

Differentiate $s_i = \alpha \exp\{-(\beta + \delta\lambda_2)\ln(c_i + \varepsilon)\}(\lambda_{\max} + \varepsilon')^{-1/2}$. ∎
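The three derivatives in Corollary 12 can be compared against central finite differences. This is a sketch under stated assumptions: the helper names are hypothetical, `lam_max` and `eps_prime` are placeholder values, and the parameter point reuses the initialization quoted for PostDeg-L-Adaptive ($\alpha = 1.55$, $\beta = 0.75$, $\delta = 0.25$) with an arbitrary $c_i$ and $\lambda_2$.

```python
import math

def scale(c, alpha, beta, delta, lam2, lam_max=1.0, eps=1e-8, eps_prime=1e-8):
    """s_i = alpha * exp(-(beta + delta*lam2) * ln(c + eps)) * (lam_max + eps')^(-1/2)."""
    gamma = beta + delta * lam2
    return alpha * math.exp(-gamma * math.log(c + eps)) / math.sqrt(lam_max + eps_prime)

def analytic_grads(c, alpha, beta, delta, lam2, eps=1e-8):
    """Closed-form gradients from Corollary 12."""
    s = scale(c, alpha, beta, delta, lam2, eps=eps)
    log_c = math.log(c + eps)
    return {"alpha": s / alpha, "beta": -log_c * s, "delta": -lam2 * log_c * s}

def finite_diff(name, params, h=1e-6):
    """Central difference of the scale with respect to one named parameter."""
    up, down = dict(params), dict(params)
    up[name] += h
    down[name] -= h
    return (scale(**up) - scale(**down)) / (2 * h)

params = dict(c=0.2, alpha=1.55, beta=0.75, delta=0.25, lam2=0.4)
grads = analytic_grads(**params)
errors = {k: abs(grads[k] - finite_diff(k, params)) for k in grads}
```

For $c_i < 1$ the logarithm is negative, so both the $\beta$- and $\delta$-gradients are positive at this point.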

###### Corollary 13 (Raw-parameter chain rule).

Let $\alpha = \alpha_{\min} + (\alpha_{\max} - \alpha_{\min})\,\sigma(a)$ for an unconstrained raw parameter $a \in \mathbb{R}$, and define $\beta$ and $\delta$ analogously through their own raw parameters. Then

$$\frac{\partial s_i}{\partial a} = \frac{s_i}{\alpha}\,(\alpha_{\max} - \alpha_{\min})\,\sigma(a)(1 - \sigma(a)), \tag{A4}$$

and similarly for the raw $\beta$ and $\delta$ parameters. Since $\sigma(z)(1 - \sigma(z)) \in (0, 1/4]$ for finite $z$, the raw-parameter gradients are strictly nonzero whenever the scale-factor gradients of Corollary [12](https://arxiv.org/html/2606.14022#Thmtheorem12) are. In particular, the optimizer never reaches a flat region in the raw parameterization unless the corresponding $s_i$-gradient already vanishes.

###### Proof.

Chain rule with $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ and the bound $\sigma'(z) \in (0, 1/4]$. ∎

### C.7 Topology-agnostic normalizers and ablation hierarchy

### C.8 Additional proofs

We collect supporting results that strengthen specific paragraph\-level claims in the body or in the appendix proofs above\.

###### Lemma 14 (Effect of the post-computation clamp, full statement).

Let $\widehat{s}_i = \alpha(c_i + \varepsilon)^{-\gamma}(\lambda_{\max} + \varepsilon')^{-1/2}$ and let $s_i^{\rm clip} = \operatorname{clamp}(\widehat{s}_i, s_{\min}^{\rm clip}, s_{\max}^{\rm clip})$. The clamp preserves the weak ordering $s_i^{\rm clip} \leq s_j^{\rm clip}$ for $c_i > c_j$, and is inactive on every node whenever

$$\alpha(1 + \varepsilon)^{-\gamma}(\lambda_{\max} + \varepsilon')^{-1/2} \geq s_{\min}^{\rm clip} \quad\text{and}\quad \alpha(c_{\min} + \varepsilon)^{-\gamma}(\lambda_{\max} + \varepsilon')^{-1/2} \leq s_{\max}^{\rm clip}.$$

###### Proposition 15 (Dependence on the spectral gap).

Fix $\alpha, \beta, \delta, c_i, \lambda_{\max}$. Then $\partial s_i/\partial\lambda_2 = -\delta \ln(c_i + \varepsilon)\,s_i$. If $\delta > 0$ and $c_i < 1 - \varepsilon$, the scale of a non-hub node strictly increases with the spectral gap.

### C.9 Full proofs lifted from the theorem bank

###### Proposition 16 (LayerNorm absorption identity, exact form; cf. [[21](https://arxiv.org/html/2606.14022#bib.bib21),[19](https://arxiv.org/html/2606.14022#bib.bib19)]).

Let $\operatorname{LN}$ have stabilizer $\varepsilon_{\rm LN}$ and per-coordinate affine gain/bias $(g, b)$. For any $a > 0$ and any $x \in \mathbb{R}^d$,

$$\operatorname{LN}(a\,x) = g \odot \frac{a\,x_c}{\sqrt{a^2 d^{-1}\|x_c\|^2 + \varepsilon_{\rm LN}}} + b, \tag{A5}$$

where $x_c = Px$ and $P = I - d^{-1}\mathbf{1}\mathbf{1}^{\top}$ is the centering projector. In the zero-stabilizer limit $\varepsilon_{\rm LN} \to 0$, $\operatorname{LN}(ax) = \operatorname{LN}(x)$ for every $a > 0$. With $\varepsilon_{\rm LN} > 0$, the relative deviation is

$$\left\|\operatorname{LN}(ax) - \operatorname{LN}(x)\right\|_2 \leq \|g\|_{\infty}\,\frac{\varepsilon_{\rm LN}/a^2}{a^2 d^{-1}\|x_c\|^2 + \varepsilon_{\rm LN}}\,\|x_c\|/\sqrt{d^{-1}\|x_c\|^2}.$$

In particular, in the regime $a^2 d^{-1}\|x_c\|^2 \gg \varepsilon_{\rm LN}$, the dependence on $a$ is suppressed to first order by $\varepsilon_{\rm LN}/(a^2 d^{-1}\|x_c\|^2)$.

###### Proof.

Direct substitution: $\operatorname{LN}(ax) = g \odot (ax - a\mu(x)\mathbf{1})/\sqrt{a^2 d^{-1}\|x_c\|^2 + \varepsilon_{\rm LN}} + b$ since $\mu(ax) = a\mu(x)$ and $ax_c = aPx$. The relative-deviation bound follows from differentiating the rational form with respect to $a$ at large $a$. ∎
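The absorption identity is easy to observe on a single feature vector. The sketch below is illustrative only: it uses a plain-Python LayerNorm with unit gain and zero bias (an assumption made for brevity), a hypothetical feature vector, and a hypothetical positive scalar `a` standing in for a per-node topology scale.

```python
import math

def layer_norm(x, eps_ln=1e-5):
    """LayerNorm over one node's feature vector, with g = 1 and b = 0 (assumed)."""
    d = len(x)
    mu = sum(x) / d
    xc = [t - mu for t in x]
    tau = math.sqrt(sum(t * t for t in xc) / d + eps_ln)
    return [t / tau for t in xc]

def norm(v):
    return math.sqrt(sum(t * t for t in v))

x = [0.5, -1.2, 2.0, 0.3]   # hypothetical node features
a = 0.25                    # positive per-node scalar

base = layer_norm(x)
pre = layer_norm([a * t for t in x])   # scalar applied before LN
post = [a * t for t in base]           # scalar applied after LN

# Pre-LN placement: the scalar is divided out up to the stabilizer term.
pre_gap = norm([p - q for p, q in zip(pre, base)])

# Post-LN placement: the scalar survives as representation magnitude.
post_norm_ratio = norm(post) / norm(base)
```

In this instance the pre-LN output is nearly indistinguishable from `base`, while the post-LN output has its norm scaled by `a`.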

###### Proposition 17 (Full LayerNorm Jacobian, with PSD decomposition).

For $\operatorname{LN}(x) = g \odot x_c/\tau_{\operatorname{LN}}(x) + b$ with $\tau_{\operatorname{LN}}(x) = (d^{-1}\|x_c\|^2 + \varepsilon_{\rm LN})^{1/2}$,

$$J_{\operatorname{LN}}(x) = \operatorname{diag}(g)\Big[\tfrac{1}{\tau_{\operatorname{LN}}(x)} P - \tfrac{1}{d\,\tau_{\operatorname{LN}}(x)^3}\,x_c x_c^{\top}\Big].$$

The bracketed matrix $B$ has eigenvalues $\{0,\; 1/\tau_{\operatorname{LN}}(x)\ (\text{multiplicity } d-2),\; \varepsilon_{\rm LN}/\tau_{\operatorname{LN}}(x)^3\}$ on the orthogonal subspaces $\operatorname{span}\{\mathbf{1}\}$, $\mathbf{1}^{\perp} \cap x_c^{\perp}$, and $\operatorname{span}\{x_c\}$ respectively. With $g \geq 0$ coordinate-wise, $\operatorname{diag}(g)B$ has nonnegative eigenvalues; consequently $\|J_{\operatorname{LN}}(x)\|_{\rm op} \leq \|g\|_{\infty}/\tau_{\operatorname{LN}}(x)$, and the gated block $T_i = m_i s_i \operatorname{LN}(h_i) + (1 - m_i) h_i$ has eigenvalue floor $1 - m_i > 0$ on every direction.

###### Proof.

Quotient rule on $x \mapsto x_c/\tau_{\operatorname{LN}}(x)$ gives the bracketed expression. The eigenvalue claim follows by checking each subspace: $B\mathbf{1} = 0$ since $P\mathbf{1} = 0$; $Bv = \tau_{\operatorname{LN}}(x)^{-1} v$ for $v \in \mathbf{1}^{\perp} \cap x_c^{\perp}$ since $x_c^{\top} v = 0$; $Bx_c = \tau_{\operatorname{LN}}(x)^{-1} x_c - d^{-1}\tau_{\operatorname{LN}}(x)^{-3}\|x_c\|^2 x_c = \varepsilon_{\rm LN}/\tau_{\operatorname{LN}}(x)^3\,x_c$ using $\|x_c\|^2 = d(\tau_{\operatorname{LN}}(x)^2 - \varepsilon_{\rm LN})$. For the diagonal gain, $\operatorname{diag}(g)B$ is similar (in the strictly positive case) to $\operatorname{diag}(g)^{1/2} B \operatorname{diag}(g)^{1/2}$, which is PSD; continuity in $g$ extends this to $g \geq 0$. Adding $(1 - m_i) I$ shifts every eigenvalue by $1 - m_i$. ∎
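The three eigen-directions of the bracketed matrix $B$ can be verified by applying $B$ to one vector from each subspace. This is a minimal sketch with hypothetical names and an arbitrary test vector; it forms only matrix-vector products, so no eigen-solver is needed. The gate value `m` and scale `s` at the end are illustrative numbers, not trained values.

```python
import math

def centered(v):
    """Apply the centering projector P = I - (1/d) 1 1^T."""
    mu = sum(v) / len(v)
    return [t - mu for t in v]

def tau_ln(x, eps_ln=1e-5):
    xc = centered(x)
    return math.sqrt(sum(t * t for t in xc) / len(x) + eps_ln)

def apply_B(x, v, eps_ln=1e-5):
    """Compute B v with B = P / tau - x_c x_c^T / (d tau^3)."""
    d = len(x)
    xc = centered(x)
    tau = tau_ln(x, eps_ln)
    pv = centered(v)
    dot = sum(a * b for a, b in zip(xc, v))
    return [p / tau - dot * c / (d * tau ** 3) for p, c in zip(pv, xc)]

eps_ln = 1e-5
x = [0.5, -1.2, 2.0, 0.3]
xc = centered(x)
tau = tau_ln(x, eps_ln)

on_ones = apply_B(x, [1.0] * 4, eps_ln)   # expected: zero vector
on_xc = apply_B(x, xc, eps_ln)            # expected: (eps_ln / tau^3) * x_c
v_perp = [1.0, -1.0, -1.0, 1.0]           # orthogonal to both 1 and x_c here
on_perp = apply_B(x, v_perp, eps_ln)      # expected: v_perp / tau

# Gated block along x_c with unit gain: m*s*(eps_ln/tau^3) + (1 - m).
m, s = 0.6, 2.0
gated_eig_xc = m * s * eps_ln / tau ** 3 + (1 - m)
```

The $x_c$ direction is the one LayerNorm nearly flattens, which is why the residual term $(1 - m_i)I$ supplies the gradient floor there.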

###### Lemma 18 (Adaptive gate Lipschitzness, full chain).

For $m(c) = \sigma(\phi(c))$ with $\phi(c) = W_2\,\operatorname{ReLU}(W_1 c + b_1) + b_2$, $W_1 \in \mathbb{R}^{H \times 1}$, $W_2 \in \mathbb{R}^{1 \times H}$,

$$|m(c_i) - m(c_j)| \leq \tfrac{1}{4}\,\|W_2\|_{\rm op}\,\|W_1\|_{\rm op}\,|c_i - c_j|.$$

###### Proof.

$\sigma$ is $1/4$-Lipschitz, $\operatorname{ReLU}$ is $1$-Lipschitz, and the affines have Lipschitz constants equal to their operator norms; biases do not affect Lipschitz constants. Compose: $\operatorname{Lip}(m) \leq \operatorname{Lip}(\sigma)\,\operatorname{Lip}(W_2)\,\operatorname{Lip}(\operatorname{ReLU})\,\operatorname{Lip}(W_1) = \tfrac{1}{4}\|W_2\|_{\rm op}\|W_1\|_{\rm op}$. ∎
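The composed bound can be spot-checked on a randomly initialized gate. The sketch below is an assumption-laden illustration: the hidden width, the weight ranges, and the sampling of $c$ in $[0, 1]$ are all arbitrary choices, and the operator norm of an $H \times 1$ or $1 \times H$ matrix is computed as the Euclidean norm of its entries.

```python
import math
import random

def gate(c, W1, b1, W2, b2):
    """m(c) = sigmoid(W2 . ReLU(W1 * c + b1) + b2) for scalar input c."""
    hidden = [max(0.0, w * c + b) for w, b in zip(W1, b1)]
    phi = sum(w * h for w, h in zip(W2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-phi))

def lipschitz_bound(W1, W2):
    """(1/4) * ||W2||_op * ||W1||_op; for vectors the operator norm is the 2-norm."""
    vec_norm = lambda w: math.sqrt(sum(t * t for t in w))
    return 0.25 * vec_norm(W2) * vec_norm(W1)

rng = random.Random(0)
H = 8  # hypothetical hidden width
W1 = [rng.uniform(-1, 1) for _ in range(H)]
b1 = [rng.uniform(-1, 1) for _ in range(H)]
W2 = [rng.uniform(-1, 1) for _ in range(H)]
b2 = rng.uniform(-1, 1)

L = lipschitz_bound(W1, W2)
pairs = [(rng.random(), rng.random()) for _ in range(1000)]

# Smallest margin between the bound and the observed gate difference.
worst_slack = min(
    L * abs(ci - cj) - abs(gate(ci, W1, b1, W2, b2) - gate(cj, W1, b1, W2, b2))
    for ci, cj in pairs
)
```

A nonnegative `worst_slack` means no sampled pair violates the bound; the bound is typically loose because the sigmoid rarely operates at its maximum slope.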

###### Proposition 19 (Spectral-gap dependence with explicit derivative).

Fix $\alpha, \beta, \delta, c_i, \lambda_{\max}$ and view $s_i$ as a function of $\lambda_2$ through $\gamma = \beta + \delta\lambda_2$. Then

$$\frac{\partial s_i}{\partial\lambda_2} = -\delta \ln(c_i + \varepsilon)\,s_i.$$

If $\delta > 0$ and $c_i < 1 - \varepsilon$, then $\partial s_i/\partial\lambda_2 > 0$. At $c_i = 1$, the derivative is $O(\varepsilon)$, numerically negligible for $\varepsilon = 10^{-8}$.

###### Proof.

Write $s_i = \alpha \exp(-\gamma \ln(c_i + \varepsilon))(\lambda_{\max} + \varepsilon')^{-1/2}$. Differentiate w.r.t. $\gamma$: $\partial s_i/\partial\gamma = -\ln(c_i + \varepsilon)\,s_i$. By chain rule with $\partial\gamma/\partial\lambda_2 = \delta$, we obtain the formula. For $c_i < 1 - \varepsilon$, $c_i + \varepsilon < 1$ so $\ln(c_i + \varepsilon) < 0$ and the product is positive; at $c_i = 1$, $\ln(1 + \varepsilon) = \varepsilon + O(\varepsilon^2)$. ∎

###### Proposition 20 (Double-star pairwise ratios).

On the double-star $S_{a,b}^{(2)}$ with hubs $u, v$ (degrees $a+1$, $b+1$, $a > b \geq 1$) and pendant leaves of degree $1$, the pairwise scale ratios are

$$\frac{s_v}{s_u} = \left(\frac{1 + \varepsilon}{(b+1)/(a+1) + \varepsilon}\right)^{\gamma}, \quad \frac{s_{\rm leaf}}{s_v} = \left(\frac{(b+1)/(a+1) + \varepsilon}{1/(a+1) + \varepsilon}\right)^{\gamma}.$$

Both ratios exceed $1$. Without post-LN degree scaling, all pairwise message magnitudes are equal.

###### Proof.

$d_{\max} = a+1$, so $c_u = 1$, $c_v = (b+1)/(a+1)$, $c_{\rm leaf} = 1/(a+1)$. Substitute into Theorem [3](https://arxiv.org/html/2606.14022#Thmtheorem3). ∎

### C.10 Theory limitations and what this section does *not* say

The placement diagnostic and the gradient-floor result above are deterministic statements about a single residual GAT block at a specific point in training, not claims about the full multi-layer dynamics. The scope is delineated below.

#### In scope.

(i) The absorption identity for positive per-node scalars (Proposition [16](https://arxiv.org/html/2606.14022#Thmtheorem16)) is exact in the $\varepsilon_{\mathrm{LN}} \to 0$ limit and tight under our empirical $\sigma^2$ regime. (ii) The degree-separation identity (Theorem [3](https://arxiv.org/html/2606.14022#Thmtheorem3)) is a deterministic ratio for a fixed $\gamma > 0$. (iii) The Jacobian and residual gradient floor (Proposition [9](https://arxiv.org/html/2606.14022#Thmtheorem9), Proposition [10](https://arxiv.org/html/2606.14022#Thmtheorem10)) are single-layer statements under nonnegative LayerNorm gain. (iv) The clamp-inactivity claim (Lemma [14](https://arxiv.org/html/2606.14022#Thmtheorem14)) is a domain-coverage check, not a probabilistic statement.

#### Out of scope.

- *Sign-changing or complex-valued multipliers.* The absorption identity uses positivity to push $a_i$ inside the square root; signed or complex attention weights need a separate analysis.
- *RMSNorm and other zero-mean normalizers.* RMSNorm uses $\|x\|/\sqrt{d}$ instead of standard deviation; the absorption argument adapts straightforwardly (the same factor of $a_i^2$ moves into the denominator) but the $\sigma^2$-regime constants differ. We do not run experiments on RMSNorm and do not claim PostDeg numbers transfer.
- *Multi-layer compounding.* A formal mean-field analysis of feature-variance propagation through alternating PostDeg-L and LayerNorm layers, including the residual gates, learned message-passing matrices $W_{\ell}$, and self-information, would quantify how the one-step amplification of Proposition [8](https://arxiv.org/html/2606.14022#Thmtheorem8) compounds with depth. We do not have such an analysis. The empirical claim is single-layer and per-node; the multi-layer effect is observed but not proven.
- *Aggregation-side multipliers.* PNA's $\delta(d_i) = \log(d_i + 1)/\bar{\delta}$ is mixed with the message before LayerNorm. Our placement rule does not apply directly; the appropriate boundary is the paired-equivalence reported empirically (Section [4.7](https://arxiv.org/html/2606.14022#S4.SS7), Appendix [E](https://arxiv.org/html/2606.14022#A5)).
- *Aggregation-free architectures.* The diagnostic is stated for residual GAT blocks. Transformers without graph-structured aggregation, set-based encoders, and position-encoded backbones use LayerNorm in different blocks; the diagnostic should be re-derived for each.

#### What is well-defined but unproven here.

The empirical observation that low-heterogeneity tasks (DD, Epidemic) attenuate the post-LN advantage is qualitatively predicted by Lemma [6](https://arxiv.org/html/2606.14022#Thmtheorem6) (near-regular attenuation), but the lemma gives a strict-regular-graph statement; an interpolating "effective heterogeneity" bound that quantifies attenuation as a function of $\mathrm{CV}(d)$ on non-regular families is left to future work.

## Appendix D Empirical Appendix

This section is the empirical companion to the body. It collects the full numeric tables behind every claim made in Section [4.2](https://arxiv.org/html/2606.14022#S4.SS2)–[4.7](https://arxiv.org/html/2606.14022#S4.SS7), the reproducibility material needed to re-run the controlled grid, and the supplementary diagnostics (variance regime, scale-versus-degree fits, learned-ablation parameter trajectories, boundary cases, transfer curves) that support the placement-rule narrative. Each subsection below opens with a one-sentence prediction tied to the placement rule and closes with a one-sentence verdict; tables and figures are referenced by their target subsection so a reviewer can check any single claim by name.

### D.1 Full main-grid metrics

Table [A4](https://arxiv.org/html/2606.14022#A4.T4) reports the absolute metric at the largest evaluation size for every method on every task in the controlled main grid (mean $\pm$ seed std over 10 seeds). The body Table [4](https://arxiv.org/html/2606.14022#S4.T4) re-expresses these as paired-by-seed relative improvements over the LN backbone in the same controlled slot; the absolute numbers below let a reader read off raw task scores, sanity-check the implied differences, and confirm that the no-collapse baselines (PostDeg, Extra LayerNorm, GraphScalar, LN backbone, GraphNorm) are within seed-noise of one another on DD while the variance-zeroing baselines (PairNorm, InstanceNorm, BatchNorm) collapse to the majority-class predictor. PostDeg and the two learned same-slot variants are bolded. The DD column reports the main 10-fold cross-validation protocol; a separate 5-fold NodeNorm comparison is in Table [A36](https://arxiv.org/html/2606.14022#A6.T36).

Table A4: Full main-grid results at the largest evaluation size (mean $\pm$ seed std, 10 seeds). PostDeg is the recommended operator and is bolded. Learned same-slot variants use weight spectral normalization and are reported as capacity diagnostics. GraphScalar is a graphwise scalar control. LN backbone is the identity in the controlled post-block slot; the residual block still contains LayerNorm. $\uparrow$/$\downarrow$: higher/lower is better. $^{*}$DD Acc is from the main 10-fold cross-validation protocol; the supplemental 5-fold NodeNorm comparison reports separate numbers (Appendix Table [A36](https://arxiv.org/html/2606.14022#A6.T36)).

### D.2 Predictions ledger

The unified predictions ledger below pairs each body claim with where it is stated, where it is tested, the empirical effect size, the verdict, and the supporting artifact. We dropped the redundant theory$\leftrightarrow$experiment map; this ledger replaces it.

Table A5: Predictions and verdicts master table. Each row pairs a body claim, where it is stated, the empirical effect size, the verdict, and the supporting artifact.

We also restructure the appendix around the predictions of the placement rule (Sections [D.5](https://arxiv.org/html/2606.14022#A4.SS5), [D.7](https://arxiv.org/html/2606.14022#A4.SS7), [D.9](https://arxiv.org/html/2606.14022#A4.SS9), [D.14](https://arxiv.org/html/2606.14022#A4.SS14)). Each subsection opens with a one-sentence prediction and closes with a one-sentence verdict. A one-page decision tree summarizing the principle's predictions is in Figure [A1](https://arxiv.org/html/2606.14022#A4.F1).

![Refer to caption](https://arxiv.org/html/2606.14022v1/x1.png)

Figure A1: Decision tree of placement-principle predictions.

Table A6: Consolidated statistical summary against LN backbone at the largest evaluation size: Cohen's $d$ / per-seed win rate / paired Wilcoxon $p$. The post-LN scale family (PostDeg, PostDeg-L-FG, PostDeg-L-Adaptive) is uniformly above LN backbone with $p < 0.01$ and win rate $\geq 0.9$ on every degree-heterogeneous task. Caveat: Cohen's $d$ values are inflated by unusually low between-seed variance; the absolute reward gap (Table [A12](https://arxiv.org/html/2606.14022#A4.T12)) is the more interpretable quantity. Cell format: Cohen's $d$ / win rate / Wilcoxon $p$.

### D.3 Reproducibility and setup

This subsection states the per-layer PostDeg operator (Algorithm [1](https://arxiv.org/html/2606.14022#alg1)), the greedy evaluation procedure shared by every method (Algorithm [2](https://arxiv.org/html/2606.14022#alg2)), the supplementary code and data layout, the hyperparameters used by every (method, task) cell (Table [A7](https://arxiv.org/html/2606.14022#A4.T7)), and the per-task data-generation parameters and rewards. Together they define a single (method, task, seed) run end-to-end.

#### The PostDeg layer in pseudocode.

Algorithm [1](https://arxiv.org/html/2606.14022#alg1) states the recommended PostDeg layer as a thin wrapper around the existing residual GAT block plus LayerNorm: the only addition is the per-node multiplication by $(\widehat{c}_i + \varepsilon)^{-1/2}$ on line 5. The layer has zero learned parameters, adds $\mathcal{O}(|V|)$ work per forward pass, and is a drop-in replacement for the LN backbone (which is line 5 with the multiplier set to $1$). Implementation: `repo/experiments/shared/dsn_core.py::DegreeScaleNorm`; the body of `forward()` mirrors lines 1–6 verbatim and the constructor is empty (no `nn.Parameter`). The cached degree vector $\widehat{c}$ is computed once per graph in the data loader and broadcast; the scale is fused with the LayerNorm output via a single `torch.mul` (no extra allocations).

Algorithm 1 PostDeg layer (one residual block of a LayerNorm GNN). The LN backbone is the same code with line 5 set to $s_i \equiv 1$.

1. Input: node features $H \in \mathbb{R}^{n \times d}$, edge index $E$, cached normalized in-degree $\widehat{c} \in \mathbb{R}^{n}_{\geq 0}$ with $\widehat{c}_i = d_i/\bar{d}$, GAT residual block $\mathrm{GAT}_{\ell}$, LN stabilizer $\varepsilon_{\mathrm{LN}} = 10^{-5}$, scale stabilizer $\varepsilon = 10^{-8}$.
2. $A \leftarrow \mathrm{GAT}_{\ell}(H, E)$ $\triangleright$ multi-head attention aggregation
3. $Z \leftarrow H + A$ $\triangleright$ residual sum
4. $\bar{H} \leftarrow \operatorname{LN}(Z;\, \varepsilon_{\mathrm{LN}})$ $\triangleright$ per-node LayerNorm
5. $s_i \leftarrow (\widehat{c}_i + \varepsilon)^{-1/2}$ for each $i$ $\triangleright$ positive post-LN scalar (parameter-free)
6. $\widetilde{H}_i \leftarrow s_i \cdot \bar{H}_i$ for each $i$ $\triangleright$ multiply applied *after* LN
7. return $\widetilde{H}$
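Lines 3 to 6 of Algorithm 1 can be sketched in plain Python as follows. This is not the `DegreeScaleNorm` implementation from the supplementary package: the function names are hypothetical, LayerNorm is written with unit gain and zero bias, and the attention output `A` is passed in by the caller rather than computed by a GAT block.

```python
import math

def layer_norm(row, eps_ln=1e-5):
    """Per-node LayerNorm with g = 1 and b = 0 (assumed for brevity)."""
    d = len(row)
    mu = sum(row) / d
    xc = [t - mu for t in row]
    tau = math.sqrt(sum(t * t for t in xc) / d + eps_ln)
    return [t / tau for t in xc]

def postdeg_layer(H, A, c_hat, eps_ln=1e-5, eps=1e-8):
    """H: node features, A: aggregation output (stand-in for GAT_l(H, E)),
    c_hat: cached normalized degree per node."""
    out = []
    for h, a, c in zip(H, A, c_hat):
        z = [hv + av for hv, av in zip(h, a)]   # line 3: residual sum
        h_bar = layer_norm(z, eps_ln)           # line 4: per-node LayerNorm
        s = (c + eps) ** -0.5                   # line 5: parameter-free scale
        out.append([s * v for v in h_bar])      # line 6: multiply after LN
    return out

def norm(v):
    return math.sqrt(sum(t * t for t in v))

# Two nodes with identical features but different normalized degree.
H = [[1.0, 2.0, 3.0], [1.0, 2.0, 3.0]]
A = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
c_hat = [4.0, 1.0]
out = postdeg_layer(H, A, c_hat)
norm_ratio = norm(out[1]) / norm(out[0])   # low-degree over high-degree
```

The two nodes share the same direction after LayerNorm; only the magnitude differs, by the factor $\sqrt{4/1} = 2$ that the score head can then read.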

#### Greedy node selection at evaluation\.

Every method in the main grid uses the same greedy decoder at evaluation time (Algorithm [2](https://arxiv.org/html/2606.14022#alg2)): the trained scorer $\pi_{\theta}$ is called once per selection, the highest-scoring admissible candidate is added to the selected set $S$, and the GNN features are recomputed at the next step with $S$ encoded in the input feature column reserved for the "selected" indicator. The decoder is the standard inference protocol for budgeted node-selection policies trained with the Erdős-style relaxation [[13](https://arxiv.org/html/2606.14022#bib.bib13),[15](https://arxiv.org/html/2606.14022#bib.bib15)]. Per-task admissibility differs only at MIS, which additionally rejects neighbors of already-selected nodes (line 5); budgets $K$ per task are in Section [4.1](https://arxiv.org/html/2606.14022#S4.SS1). Because every method shares this decoder verbatim, comparisons isolate the trained scorer, not the search. Cost is $\mathcal{O}(K \cdot |V| \cdot \mathrm{forward})$ per evaluation.

Algorithm 2 Greedy node selection at evaluation, shared by every method. Identical control flow across InfluMax/Dismantle/Epidemic/MIS; only $K$ and admissibility differ.

1. Input: graph $G = (V, E)$, trained policy $\pi_{\theta}$, budget $K$, admissibility predicate $\mathrm{adm}(\cdot, S)$ (default $\{v \notin S\}$; MIS adds $\{v \notin N(S)\}$).
2. $S \leftarrow \emptyset$, $C \leftarrow V$
3. for $k = 1, \dots, K$ do
4. $H \leftarrow \mathrm{GNN}_{\theta}(G;\, \mathrm{selected\_indicator}(S))$ $\triangleright$ recompute features with $S$ encoded in input
5. $C_k \leftarrow \{v \in C : \mathrm{adm}(v, S)\}$ $\triangleright$ task-specific filter; identity for non-MIS tasks
6. if $C_k = \emptyset$ then break
7. end if $\triangleright$ early exit when MIS saturates
8. $\hat{v} \leftarrow \arg\max_{v \in C_k} \pi_{\theta}(H_v)$
9. $S \leftarrow S \cup \{\hat{v}\}$, $C \leftarrow C \setminus \{\hat{v}\}$
10. end for
11. return $S$
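The control flow of Algorithm 2 can be sketched as below. The trained scorer is replaced by a hypothetical `score_fn` that returns a score per node given the current selection, standing in for $\pi_{\theta}(\mathrm{GNN}_{\theta}(G; \mathrm{selected\_indicator}(S)))$; ties are broken toward the smallest node index, which is a choice made here for determinism and is not specified in the source.

```python
def greedy_select(nodes, neighbors, score_fn, budget, mis=False):
    """Greedy decoder: pick the best admissible node, up to `budget` times.

    nodes: iterable of integer node ids
    neighbors: dict mapping node id -> set of adjacent node ids
    score_fn: callable(selected_list) -> dict of node id -> score (assumed stand-in)
    mis: if True, also reject neighbors of already-selected nodes
    """
    selected = []
    candidates = set(nodes)
    for _ in range(budget):
        scores = score_fn(selected)          # line 4: recompute with S encoded
        blocked = set()
        if mis:
            for u in selected:
                blocked |= neighbors[u]
        admissible = [v for v in candidates
                      if v not in selected and v not in blocked]   # line 5
        if not admissible:                   # lines 6-7: early exit
            break
        best = max(admissible, key=lambda v: (scores[v], -v))      # line 8
        selected.append(best)                # line 9
        candidates.discard(best)
    return selected

# Toy path graph 0-1-2-3-4 with a degree-based stand-in scorer.
nodes = [0, 1, 2, 3, 4]
neighbors = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
degree_scores = lambda selected: {v: len(neighbors[v]) for v in nodes}

picked_default = greedy_select(nodes, neighbors, degree_scores, budget=3)
picked_mis = greedy_select(nodes, neighbors, degree_scores, budget=3, mis=True)
```

On this toy graph the MIS variant stops after two picks because every remaining node is adjacent to a selected one, which is the early exit on lines 6 and 7.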

#### The learned same\-slot superset\.

Algorithm [3](https://arxiv.org/html/2606.14022#alg3) is the learned variant PostDeg-L-Adaptive used as a capacity diagnostic. It keeps the post-LN slot of Algorithm [1](https://arxiv.org/html/2606.14022#alg1) but replaces the fixed exponent $1/2$ by a learned $\gamma_{\ell} = \beta_{\ell} + \delta_{\ell}\lambda_2$ with a learned amplitude $\alpha_{\ell}$ (three per-layer scalars; $\lambda_2$ is the graph algebraic connectivity, computed once at data-load time). PostDeg-L-FG is the special case $\delta_{\ell} \equiv 0$; PostDeg is recovered at $(\alpha_{\ell}, \beta_{\ell}, \delta_{\ell}) = (1, \tfrac{1}{2}, 0)$. Spectral normalization on the GAT weights bounds the learnable Lipschitz constant (Appendix [C.5](https://arxiv.org/html/2606.14022#A3.SS5)); combined with the post-computation clamp $s_i \in [0.01, 100]$ on line 5 the layer is $1$-Lipschitz in the relevant regime. Initialization: $\alpha_{\ell} = 1.55$, $\beta_{\ell} = 0.75$, $\delta_{\ell} = 0.25$ (matched across layers); the optimizer keeps all three within $\pm 0.005$ across seeds (Table [A23](https://arxiv.org/html/2606.14022#A4.T23)).

Algorithm 3 PostDeg-L-Adaptive layer: learned same-slot superset of Algorithm [1](https://arxiv.org/html/2606.14022#alg1). Same placement, same control flow; the only change is the construction of $s_i$ on lines 3–5.

1: Node features $H$, edge index $E$, normalized in-degree $\widehat{c}$, GAT block $\mathrm{GAT}_\ell$, learned scalars $\alpha_\ell, \beta_\ell, \delta_\ell \in \mathbb{R}$, algebraic connectivity $\lambda_2 \geq 0$ (cached per graph), stabilizer $\varepsilon = 10^{-8}$, scale clamp $[s_{\min}, s_{\max}] = [0.01, 100]$.

2: $Z \leftarrow H + \mathrm{GAT}_\ell(H, E)$, $\bar{H} \leftarrow \operatorname{LN}(Z)$ ⊳ residual GAT + LayerNorm (same as Alg. [1](https://arxiv.org/html/2606.14022#alg1))

3: $\gamma_\ell \leftarrow \beta_\ell + \delta_\ell \lambda_2$ ⊳ effective exponent, graph-conditioned

4: $s_i \leftarrow \alpha_\ell \cdot (\widehat{c}_i + \varepsilon)^{-\gamma_\ell}$ for each $i$ ⊳ learned post-LN scalar

5: $s_i \leftarrow \mathrm{clamp}(s_i, s_{\min}, s_{\max})$ ⊳ numerical-safety clamp; inactive at trained $(\alpha, \beta, \delta)$

6: $\widetilde{H}_i \leftarrow s_i \cdot \bar{H}_i$ for each $i$

7: return $\widetilde{H}$
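The scale construction on lines 3–5 of Algorithm 3 can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `learned_scale`, the example degrees, and the value of `lambda2` are assumptions, and the additional clip of $c_i$ used by the learned variants is omitted.

```python
def learned_scale(c_hat, alpha, beta, delta, lambda2,
                  eps=1e-8, s_min=0.01, s_max=100.0):
    """Per-node post-LN scale, following lines 3-5 of Algorithm 3 (sketch)."""
    gamma = beta + delta * lambda2                        # line 3: effective exponent
    raw = [alpha * (c + eps) ** (-gamma) for c in c_hat]  # line 4: power-law scale
    return [min(max(s, s_min), s_max) for s in raw]       # line 5: safety clamp


# Hypothetical normalized degrees in (0, 1].
c_hat = [0.1, 0.25, 0.5, 1.0]

# Setting (alpha, beta, delta) = (1, 1/2, 0) recovers the parameter-free PostDeg scale.
postdeg = learned_scale(c_hat, alpha=1.0, beta=0.5, delta=0.0, lambda2=0.7)
reference = [(c + 1e-8) ** -0.5 for c in c_hat]

# A very small normalized degree with a large exponent hits the upper clamp.
clamped = learned_scale([1e-6], alpha=1.55, beta=0.75, delta=0.25, lambda2=4.0)
```

With `delta=0.0` the value of `lambda2` has no effect, which mirrors the PostDeg-L-FG special case described above.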

#### Code and data\.

The supplementary package contains the experiment code, configuration shell scripts, rendered tables and figures, and the CSV files used in the paper. Synthetic graph data are generated from the scripts; DD is downloaded from the TU graph-kernel dataset distribution during preprocessing, used under its GPL license terms, and not redistributed as a modified dataset. The full CSV-to-artifact map and provenance split (which evaluation CSV produces which table or figure) is in the supplementary README.md; the only structural fact required by the body is that the original completed-run logs (eval_results_all.csv, new_experiments_eval_results.csv) produce the main tables, while data/supplemental_runs/ produces cross-backbone, NodeNorm, PNA, and supplemental DD tables.

#### Hyperparameters\.

Table A7:Consolidated training hyperparameters\. All methods share the same backbone, optimizer, training budget, seeds, and evaluation protocol; hyperparameters were fixed from pilot runs and not tuned per method\.
#### Forward pass\.

The PostDeg forward pass instantiates Eq. ([2](https://arxiv.org/html/2606.14022#S3.E2)) once per layer: at layer $\ell$, $Z^{(\ell)} = H^{(\ell-1)} + \mathrm{GAT}_\ell(H^{(\ell-1)}, A)$ and $H^{(\ell)} = (\widehat{c} + \varepsilon)^{-1/2} \odot \operatorname{LN}(Z^{(\ell)})$. The normalized degree vector $\widehat{c}$ is computed once and cached per graph. The learned same-position superset (Eq. ([A1](https://arxiv.org/html/2606.14022#A3.E1))) keeps the same placement and substitutes the parameterized scale; we call these learned variants PostDeg-L-FG and PostDeg-L-Adaptive. Algorithm [2](https://arxiv.org/html/2606.14022#alg2) records the greedy node-selection procedure used at evaluation time; it is the same for every method.
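As a rough illustration of the forward pass, the sketch below runs one PostDeg layer on a four-node star graph in plain Python. It rests on several assumptions: `mean_neighbors` is a hypothetical stand-in for the GAT block, LayerNorm is used without a learned affine, and the feature values are made up. The point is only that the post-LN scale shows up as row magnitude.

```python
import math


def layer_norm(row, eps_ln=1e-5):
    """Per-node LayerNorm without a learned affine (assumption)."""
    mu = sum(row) / len(row)
    var = sum((x - mu) ** 2 for x in row) / len(row)
    return [(x - mu) / math.sqrt(var + eps_ln) for x in row]


def normalized_degree(adj):
    """c_hat_i = max(d_i, 1) / max_j max(d_j, 1), as in the numerical-clamps paragraph."""
    deg = [max(len(nbrs), 1) for nbrs in adj]
    top = max(deg)
    return [d / top for d in deg]


def mean_neighbors(H, adj):
    """Hypothetical stand-in for GAT_l: mean of neighbor features."""
    width = len(H[0])
    return [[sum(H[j][k] for j in nbrs) / len(nbrs) if nbrs else 0.0
             for k in range(width)] for nbrs in adj]


def postdeg_layer(H, adj, c_hat, eps=1e-8):
    """Z = H + block(H);  H_next = (c_hat + eps)^(-1/2) * LN(Z)."""
    out = []
    for h, m, c in zip(H, mean_neighbors(H, adj), c_hat):
        z = [a + b for a, b in zip(h, m)]            # residual connection
        scale = (c + eps) ** -0.5                    # applied after LayerNorm
        out.append([scale * x for x in layer_norm(z)])
    return out


adj = [[1, 2, 3], [0], [0], [0]]                     # star: node 0 is the hub
H = [[1.0, 2.0, 0.5, -1.0], [0.3, -0.7, 1.1, 0.2],
     [2.0, 0.1, -0.4, 0.9], [-1.2, 0.8, 0.6, 0.0]]   # made-up features
c_hat = normalized_degree(adj)
out = postdeg_layer(H, adj, c_hat)
norms = [math.sqrt(sum(x * x for x in row)) for row in out]
```

In this toy setting each leaf has normalized degree $1/3$, so its row norm comes out about $\sqrt{3}$ times the hub's.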

#### Tasks and rewards\.

- InfluMax uses Barabási–Albert graphs with training size $n = 100$, attachment parameter $m = 3$, evaluation sizes $n \in \{50, 100, 150, 200\}$, budget $K = \lfloor 0.1n \rfloor$, and independent-cascade probability $p = 0.1$ for $10$ rounds, averaged over $10$ Monte Carlo simulations per reward/evaluation.
- Network dismantling uses stochastic block model graphs with target training size $n = 150$, $5$–$7$ communities, community-size jitter of $\pm 20\%$, $p_{\mathrm{in}} = 0.15$ with per-community uniform noise in $[-0.05, 0.05]$ clipped to $[0.1, 0.9]$, $p_{\mathrm{out}} = 0.02$, evaluation sizes $n \in \{100, 150, 200, 300\}$, and budget $K = \lfloor 0.2n \rfloor$. Its reward is fragmentation $1 - |\mathrm{LCC}|/n$ after removing the selected nodes.
- Epidemic containment uses the same SBM contact-graph generator and budget as Dismantle; selected nodes are vaccinated before an SIR simulation with infection probability $\beta = 0.3$, recovery probability $\gamma = 0.1$, $50$ steps, $5$ initial infections, and $10$ Monte Carlo samples during training or $20$ during final evaluation.
- MIS uses SBM graphs with training size $n = 150$, $5$–$7$ communities, $p_{\mathrm{in}} = 0.3$, $p_{\mathrm{out}} = 0.02$, evaluation sizes $n \in \{100, 150, 200, 300\}$, and budget $K = \lfloor 0.3n \rfloor$.
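The dismantling reward $1 - |\mathrm{LCC}|/n$ can be sketched with a breadth-first search over the surviving nodes. This is a minimal sketch under one explicit assumption: $n$ is taken to be the original node count before removal, which the text does not state. The toy graph and the function name `fragmentation` are illustrative only.

```python
from collections import deque


def fragmentation(adj, removed):
    """Return 1 - |LCC|/n after deleting `removed` (n = original node count, assumed)."""
    n = len(adj)
    seen = set(removed)          # removed nodes are never visited
    largest = 0
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:             # BFS over one surviving component
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        largest = max(largest, size)
    return 1.0 - largest / n


# Two triangles joined through bridge node 3 (toy example).
adj = [[1, 2], [0, 2], [0, 1, 3], [2, 4], [3, 5, 6], [4, 6], [4, 5]]
intact = fragmentation(adj, [])
bridge_removed = fragmentation(adj, [3])
```

Removing the bridge leaves two components of size three, so the reward rises from $0$ to $4/7$.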

#### Excluded inputs\.

Raw node degree and eigenvector centrality are excluded from all input features\.

#### Included structural features\.

InfluMax and MIS use clustering coefficient, normalized average-neighbor degree, selection indicator, and remaining-budget fraction. Dismantle and Epidemic use clustering coefficient, normalized $k$-core number, PageRank, and removal/vaccination indicator. On BA training graphs at $n = 100$, PageRank, $k$-core, and average-neighbor degree have Pearson $|r| \geq 0.78$ with raw degree, so they are strong content surrogates; the PostDeg gain over LN backbone is over and above degree-correlated input content.

#### DD setup\.

DD graph classification uses the TU DD protein dataset with structural features (clustering, normalized $k$-core, PageRank, constant bias) concatenated with node labels/attributes when available, global mean pooling, and stratified $10$-fold cross-validation with fixed shuffled folds per seed.

#### Compute and runtime\.

Table A8: Per-method runtime envelope, learnable parameters per layer, and asymptotic per-layer cost. The four post-LN scale operators add $\mathcal{O}(|V|)$ work per layer; PostDeg adds zero learnable parameters. Walltime per (method, task, seed) on a $2$-CPU, $8$-GB worker.

Each sweep job used one worker with $2$ CPU cores and $8$ GB RAM. Training one (method, task, seed) combination takes roughly $30$–$90$ minutes. The reported main grid contains $10$ methods $\times$ $10$ seeds $\times$ $5$ tasks $= 500$ completed runs and took approximately $400$ worker-hours.

#### Numerical clamps\.

$\varepsilon = 10^{-8}$ in Eq. ([2](https://arxiv.org/html/2606.14022#S3.E2)), $\varepsilon' = 10^{-4}$ in the spectral term of Eq. ([A1](https://arxiv.org/html/2606.14022#A3.E1)), and $\varepsilon_{\rm LN} = 10^{-5}$ in LayerNorm. PostDeg uses $\widehat{c}_i = \max(d_i, 1) / \max_j \max(d_j, 1)$ in DegreeScaleNorm; the learned variants clip $c_i$ to $[10^{-6}, 1]$. Spectral quantities are clamped to $[0.1, 10.0]$, and the final learned-variant scale is clamped to $[0.01, 100]$.

#### $\varepsilon$ ablation.

Table [A9](https://arxiv.org/html/2606.14022#A4.T9) reports PostDeg sensitivity to the $\varepsilon$ floor; the floor never binds because $c_{\min} = 10^{-6}$ dominates.

Table A9: $\varepsilon$-floor ablation for PostDeg. Rows are the floor value, columns are evaluation sizes. The InfluMax scores are identical across rows because the lower clamp on $c_i$ (set to $10^{-6}$) dominates $\varepsilon$ for every $\varepsilon \leq 10^{-6}$, so the floor never binds; the right-hand block reports the worst-case relative perturbation $\varepsilon_{\rm LN}/\sigma^2_{\min}$ (with $\sigma^2_{\min} = 0.41$ from Table [A10](https://arxiv.org/html/2606.14022#A4.T10)), which is below $10^{-4}$ for every $\varepsilon$. Endpoint differences across $\varepsilon \in \{10^{-10}, 10^{-8}, 10^{-6}\}$ are below $0.05\%$ relative on every (task, size) we measured.

### D.4 Data dictionary

For each CSV in paper/data/ we list the columns used, with a short description.

- eval_results_all.csv (1200 rows): backbone, experiment, mode, seed, eval_size; reward columns are task-specific (policy_mean_spread, policy_frag_mean, policy_infected_mean); heuristic comparators are greedy_mean_spread, heuristic_frag_mean, heuristic_infected_mean.
- new_experiments_eval_results.csv (700 rows): same schema for MIS; the comparator is greedy_mean_size.
- dsn_spectral_weights.csv (18000 rows): per-(seed, epoch) learned-variant parameters per layer ($\alpha, \beta, \delta, \lambda_{\max}$, spectral gap, for $\ell \in \{0, 1, 2\}$).
- new_experiments_dsn_weights.csv (6000 rows): same schema for MIS.
- node_scale_factors.csv (165 000 rows): per-node, per-graph, per-task: degree, centrality, scale_layer{0,1,2}, learned $\alpha, \beta, \delta$ per layer, $\lambda_2, \lambda_{\max}$.
- node_scale_summary.csv (8 rows): per-(task, mode) aggregate mean_scale, std_scale, min_scale, max_scale, degree_scale_corr.
- degree_distributions.csv (160 000 rows): per-graph degree samples.
- degree_summary.csv (5 rows): per-task degree statistics (mean, std, min, max, percentiles, skewness).
- training_curves.csv (60 000 rows): per-(seed, epoch) training and evaluation columns; eval_\* columns are populated every 10 epochs.
- new_experiments_training_curves.csv (20 000 rows): same schema for MIS and the DD graph-classifier sweep (10 fold-accuracy columns).
- transfer_sweep_logs_only.tar.gz: raw per-job logs from the transfer sweep, retained for auditability. Final tables use the completed-run CSVs listed above.
- data/supplemental_runs/aggregate_by_config.csv (182 rows): seed-aggregated mean/std for additional experiments (cross-backbone on GAT/GCN/GIN/SAGE, PNA, NodeNorm); columns include run_tag, backbone, experiment, mode, n_seeds, mean_final_primary, std_final_primary. Source for Tables [A33](https://arxiv.org/html/2606.14022#A6.T33), [A34](https://arxiv.org/html/2606.14022#A6.T34), [A35](https://arxiv.org/html/2606.14022#A6.T35).
- data/supplemental_runs/final_runs_best_per_seed.csv: per-(run_tag, backbone, experiment, mode, seed) final eval value, used for the paired-wins counts in Tables [A33](https://arxiv.org/html/2606.14022#A6.T33) and [A34](https://arxiv.org/html/2606.14022#A6.T34).
- data/supplemental_runs/metrics_epochs.csv: per-epoch training metrics for these additional experiments in wide format.
- data/supplemental_runs/cv_results_summary.csv (187 rows, after deduplication): 5-fold DD CV results, 5 seeds, per mode. Source for Table [A36](https://arxiv.org/html/2606.14022#A6.T36); do not pool with the paper’s main 10-fold DD runs.

### D.5 Variance regime: pre-LayerNorm multipliers are absorbed

#### Prediction\.

The absorption identity (Proposition [16](https://arxiv.org/html/2606.14022#Thmtheorem16)) is sharp whenever $\varepsilon_{\rm LN}/\sigma^2_{\min}$ is small; we verify the regime.

The placement argument relies on $a_i^2 \sigma^2(z_i) \gg \varepsilon_{\rm LN}$. We measure $\sigma^2(z_i)$ at every layer and every epoch on 10 seeds for each task. The largest value of the ratio $\varepsilon_{\rm LN}/\sigma^2_{\min}$ across all four tasks is $2.44 \times 10^{-5}$ (MIS, the worst-case task), so the absorption regime is sharp throughout training; the ratio is $\leq 2.5 \times 10^{-5}$ on every task.
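A quick numerical check makes the absorption regime concrete. The sketch below compares a multiplier placed before LayerNorm with the same multiplier placed after it, using a made-up feature vector and LayerNorm without a learned affine (both are assumptions); the last line reproduces the ratio $10^{-5}/0.41 \approx 2.44 \times 10^{-5}$ from the quoted $\varepsilon_{\rm LN}$ and $\sigma^2_{\min}$.

```python
import math


def layer_norm(row, eps_ln=1e-5):
    """Per-node LayerNorm without a learned affine (assumption)."""
    mu = sum(row) / len(row)
    var = sum((x - mu) ** 2 for x in row) / len(row)
    return [(x - mu) / math.sqrt(var + eps_ln) for x in row]


def max_abs_gap(z, a, eps_ln=1e-5):
    """Largest entrywise change in LN output when z is pre-multiplied by a."""
    pre = layer_norm([a * x for x in z], eps_ln)   # multiplier before LayerNorm
    base = layer_norm(z, eps_ln)                   # no multiplier at all
    return max(abs(p - b) for p, b in zip(pre, base))


def norm(row):
    return math.sqrt(sum(x * x for x in row))


z = [0.9, -0.4, 0.1, -0.6]      # made-up pre-LN features, variance 0.335
a = 3.0                         # positive per-node multiplier

pre_gap = max_abs_gap(z, a)                     # tiny: the multiplier is absorbed
base_norm = norm(layer_norm(z))
post_norm = norm([a * x for x in layer_norm(z)])  # multiplier after LayerNorm survives

worst_case_ratio = 1e-5 / 0.41                  # eps_LN / sigma^2_min from the text
```

Here the pre-LN placement changes the output by roughly $2 \times 10^{-5}$ per entry, while the post-LN placement scales the row norm by exactly the multiplier.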

Table A10: Distribution of $\sigma^2(z_i)$ at convergence across nodes, layers, and seeds (10 seeds, 3 layers, evaluated on 200 fresh test graphs per task). The rightmost column is the worst-case relative perturbation in the LayerNorm absorption identity, $\varepsilon_{\rm LN}/\sigma^2_{\min}$, which bounds the deviation between pre- and post-LN placement of a multiplier. All values are below $10^{-4}$, so the absorption regime is sharp throughout training.

![Refer to caption](https://arxiv.org/html/2606.14022v1/x2.png)

Figure A2: Empirical backing for the placement argument. (a) Per-layer feature variance during training (representative InfluMax run; $y$-axis log-scaled): variance grows in the first $\sim$30 epochs and stabilizes well above $\varepsilon_{\rm LN} = 10^{-5}$ (dashed line). (b) Relative LayerNorm-absorption magnitude as a function of the per-node multiplier $a_i$ for several feature variances: in our regime ($\sigma^2 \approx 0.5$), placing the multiplier before LayerNorm has effect $< 10^{-5}$ across the PostDeg range (grey band).
#### Verdict\.

Sharp. The empirical worst case is $\varepsilon_{\rm LN}/\sigma^2_{\min} = 2.44 \times 10^{-5}$ (MIS) across all four tasks (Table [A10](https://arxiv.org/html/2606.14022#A4.T10)).

### D.6 Full numerical results

Table [A11](https://arxiv.org/html/2606.14022#A4.T11) reports mean $\pm$ seed std at every (method, task, evaluation size) combination. We pair it with two follow-on artifacts. The heatmap in Figure [A3](https://arxiv.org/html/2606.14022#A4.F3) compresses the rightmost column of the table into a single page-width grid so that the sign and magnitude of each (method, task) cell is readable at a glance, with the lower-is-better Epidemic column flipped. Per-seed reward at the largest evaluation size sits in Table [A12](https://arxiv.org/html/2606.14022#A4.T12), where each row is a single seed and the columns can be read directly against the per-method win-rate counts in Table [A17](https://arxiv.org/html/2606.14022#A4.T17). The four per-task panels (Figures [A4](https://arxiv.org/html/2606.14022#A4.F4)–[A7](https://arxiv.org/html/2606.14022#A4.F7)) consolidate convergence, transfer, evaluation distribution, and per-seed comparison on a single page each.

Table A11: Full transfer table: mean $\pm$ seed std over 10 seeds for every method, every node-selection task, every evaluation size. The dashed line separates the post-LN scale family from the baselines. Source: eval_results_all.csv and new_experiments_eval_results.csv. Panels: InfluMax ($\uparrow$), Dismantle ($\uparrow$), Epidemic ($\downarrow$), MIS ($\uparrow$).

The heatmap below is the same data as Table[A11](https://arxiv.org/html/2606.14022#A4.T11)at the largest evaluation size, with sign flipped for the lower\-is\-better Epidemic column\. The DD column is also included so that the boundary collapse of PairNorm/InstanceNorm/BatchNorm is visible against the same color scale\.

![Refer to caption](https://arxiv.org/html/2606.14022v1/x3.png)

Figure A3: Heatmap version of Table [A11](https://arxiv.org/html/2606.14022#A4.T11) at the largest evaluation size: relative improvement over LN backbone per (method, task), with sign flipped for lower-is-better tasks. PostDeg and its learned variants form the only rows with consistently positive entries; PairNorm, InstanceNorm, and BatchNorm are positive on three node-selection tasks and crash on DD.

Three patterns are visible in the heatmap. The top three rows (PostDeg and the two PostDeg-L variants) carry positive entries on InfluMax, Dismantle, and MIS, near-zero entries on Epidemic, and small negative entries on DD; this is the cross-task signature of a placement effect that attenuates as the degree distribution becomes more regular. The middle block (LayerNorm, GraphScalar, LN backbone) sits near zero on every column, consistent with the absorption identity: a graph-blind normalizer at the post-LN slot does not introduce a topology-conditioned scale. The bottom block (PairNorm, InstanceNorm, BatchNorm) has the most variable behavior. These three feature-statistic normalizers are positive on InfluMax and MIS, mixed on Dismantle and Epidemic, and crash on DD, where the strong negative entries reflect the boundary collapse to majority-class prediction (Section [4.6](https://arxiv.org/html/2606.14022#S4.SS6)).

The DD column is also the only place where PostDeg and its learned variants deviate from LN backbone by a perceptible amount: $-0.7\%$ for PostDeg, $-0.4\%$ for PostDeg-L-FG, $-0.2\%$ for PostDeg-L-Adaptive. These small negatives sit within seed std and align with the low-heterogeneity boundary reading: when the degree distribution is tight (DD skewness $0.27$), the post-LN scale collapses toward a graphwise constant, so adding the operator on top of LayerNorm carries negligible information. The boundary regime is also the regime where feature-statistic normalizers can hurt rather than help, because their per-graph statistics drift as a function of the scaled magnitudes that PairNorm and InstanceNorm depend on.

The per\-seed reward table that follows is the most direct evidence for the win\-rate count in Table[A17](https://arxiv.org/html/2606.14022#A4.T17): ten rows per task, one for each seed, with PostDeg and its learned variants on the left and the LN backbone baseline on the right\. Reading any row, the gap between PostDeg and LN backbone matches the per\-task summary in Section[4\.2](https://arxiv.org/html/2606.14022#S4.SS2)\.

Table A12: Per-seed reward for the post-LN family and the unnormalized baseline at the largest evaluation size. Each row is a single seed; values can be checked against the win-rate table ([A17](https://arxiv.org/html/2606.14022#A4.T17)) directly. Panels: InfluMax, Dismantle, Epidemic, MIS.

#### Per\-task panels\.

Figures [A4](https://arxiv.org/html/2606.14022#A4.F4)–[A7](https://arxiv.org/html/2606.14022#A4.F7) give a one-page diagnostic per task in the order InfluMax, Dismantle, MIS, Epidemic, which is the order in which the gap between PostDeg and the LN backbone shrinks as the degree distribution becomes more regular. Each figure combines the convergence curve of every method, relative improvement vs. evaluation size, the score distribution at the largest evaluation size, and a paired per-seed scatter of PostDeg against LN backbone.

The narrative across the four panels: InfluMax (most heterogeneous) shows PostDeg and PostDeg-L separating from every baseline within the first $50$ training epochs, holding the lead across all four evaluation sizes, and producing a per-seed scatter that sits entirely above the diagonal. Network dismantling shows the same separation pattern, with the additional feature that PairNorm, InstanceNorm, and BatchNorm fall *below* LN backbone at the largest evaluation size. MIS shows the largest absolute reward gap of any node-selection task, about $2.2$ at $n = 300$, with GraphNorm and InstanceNorm capturing about $70\%$ of this gap. Epidemic is the boundary task at the low-heterogeneity end of the ladder: every method clusters within a $0.005$-wide band on infected fraction, and the per-seed PostDeg-vs-LN backbone scatter straddles the diagonal. When the degree distribution is close to uniform, the post-LN scale becomes a near-constant graphwise multiplier and the placement effect attenuates.

![Refer to caption](https://arxiv.org/html/2606.14022v1/x4.png)

Figure A4: InfluMax: per-task diagnostic (mean $\pm$ seed std, 10 seeds). Top-left: relative improvement vs. evaluation size; top-right: raw scores at $n = 200$; bottom-left: per-seed PostDeg-vs-LN-backbone scatter; bottom-right: training-curve overlay. Source: eval_results_all.csv and training_curves.csv.

![Refer to caption](https://arxiv.org/html/2606.14022v1/x5.png)

Figure A5: Network dismantling: per-task diagnostic (mean $\pm$ seed std, 10 seeds). Top-left: relative improvement vs. evaluation size; top-right: raw scores at $n = 300$; bottom-left: per-seed PostDeg-vs-LN-backbone scatter; bottom-right: training-curve overlay. Source: eval_results_all.csv and training_curves.csv.

![Refer to caption](https://arxiv.org/html/2606.14022v1/x6.png)

Figure A6: Maximum independent set: per-task diagnostic (mean $\pm$ seed std, 10 seeds). Top-left: relative improvement vs. evaluation size; top-right: raw scores at $n = 300$; bottom-left: per-seed PostDeg-vs-LN-backbone scatter; bottom-right: training-curve overlay. Source: eval_results_all.csv and training_curves.csv.

![Refer to caption](https://arxiv.org/html/2606.14022v1/x7.png)

Figure A7: Epidemic containment: per-task diagnostic (mean $\pm$ seed std, 10 seeds). Top-left: relative improvement vs. evaluation size; top-right: raw scores at $n = 300$; bottom-left: per-seed PostDeg-vs-LN-backbone scatter; bottom-right: training-curve overlay. Source: eval_results_all.csv and training_curves.csv.

### D.7 Equivalence and ablations

#### Prediction\.

If the gain is a placement effect, post\-LN scale operators with different parameter capacities should be paired\-equivalent; we test by TOST and by FG\-vs\-adaptive parameter dispersion\.

#### Ablation hierarchy\.

Figure[A8](https://arxiv.org/html/2606.14022#A4.F8)arranges the learned superset and its ablations by parameter count\. In descending capacity: PostDeg\-L\-Adaptive, PostDeg\-L\-FG, PostDeg, GraphScalar, LN backbone\. The first three cluster within seed\-noise of each other; GraphScalar and LN backbone sit at the bottom\. The empirical numbers do not move with extra same\-slot capacity\.

![Refer to caption](https://arxiv.org/html/2606.14022v1/x8.png)

Figure A8: Ablation hierarchy of the learned superset and its empirical $\Delta\%$ over LN backbone at the largest evaluation size, with capacity at each level. PostDeg, PostDeg-L-FG, and PostDeg-L-Adaptive cluster within seed-noise; GraphScalar and LN backbone sit at the bottom.

#### TOST equivalence tests \(PostDeg vs\. learned ablation\)\.

The TOST table is the main evidence for the placement rule. We test whether PostDeg and the learned superset are paired-equivalent within $\pm 1\%$ at every task; values below $0.05$ reject inequivalence at the corresponding margin. The reading: all four tasks produce a TOST $p$-value below $0.05$ at the $\pm 1\%$ margin, so we can conclude pair-equivalence at this margin in every comparison we run. The capacity-rich learned superset adds no measurable headroom over the parameter-free PostDeg.
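For readers unfamiliar with TOST, the sketch below shows one common paired formulation: two one-sided $t$-tests on the per-seed differences against a margin set to a fraction of the PostDeg mean. It is an assumption that the paper uses this $t$-based form; the reward values are invented, and the hard-coded critical value ($1.833$, one-sided $5\%$ for $9$ degrees of freedom) restricts the sketch to $10$ pairs.

```python
import math
from statistics import mean, stdev

T_CRIT_DF9 = 1.833  # one-sided 5% critical value of Student's t, 9 degrees of freedom


def tost_paired(x, y, margin_frac=0.01, t_crit=T_CRIT_DF9):
    """Paired TOST at level 0.05 for 10 pairs; margin is a fraction of mean(y)."""
    diffs = [a - b for a, b in zip(x, y)]
    margin = margin_frac * abs(mean(y))
    se = stdev(diffs) / math.sqrt(len(diffs))
    t_lower = (mean(diffs) + margin) / se   # tests H0: mean diff <= -margin
    t_upper = (mean(diffs) - margin) / se   # tests H0: mean diff >= +margin
    # Equivalence is concluded only if both one-sided nulls are rejected.
    return t_lower > t_crit and t_upper < -t_crit


# Invented per-seed rewards for the parameter-free method (10 seeds).
y = [20.0, 20.1, 19.9, 20.2, 19.8, 20.0, 20.1, 19.9, 20.05, 19.95]
offsets = [0.01, -0.02, 0.00, 0.02, -0.01, 0.01, -0.01, 0.00, 0.02, -0.02]

x_close = [b + o for b, o in zip(y, offsets)]          # differs only by seed noise
x_shifted = [b + o + 0.5 for b, o in zip(y, offsets)]  # systematic gap of 0.5

equivalent = tost_paired(x_close, y)
not_equivalent = tost_paired(x_shifted, y)
```

The shifted series fails the test because its mean difference ($0.5$) lies outside the $\pm 1\%$ margin ($0.2$ here), regardless of how small the seed noise is.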

Table A13: Two one-sided tests (TOST) for paired equivalence between PostDeg-L-Adaptive (the learned PostDeg ablation) and PostDeg, one row per task. Margins are taken as $\pm 1\%$, $\pm 2\%$, $\pm 5\%$ of the PostDeg mean. $p_{\text{TOST}}$ values below $0.05$ reject inequivalence. Source: eval_results_all.csv, new_experiments_eval_results.csv.
#### Wilcoxon signed\-rank tests \(PostDeg\-L\-Adaptive vs\. representative baselines\)\.

The Wilcoxon table is the complementary one\-sided test: it asks whether PostDeg\-L\-Adaptive and each baseline are statistically distinguishable, paired by seed\. The PostDeg\-L\-Adaptive\-vs\-PostDeg row is non\-significant on every task, which is consistent with the TOST equivalence above; the comparisons against LN backbone and GraphScalar are significant on every task, confirming that PostDeg and its learned variants give a real lift over the LN\-backbone and graphwise\-scalar baselines\.

Table A14: Wilcoxon signed-rank test: PostDeg-L-Adaptive vs. baselines at largest evaluation size, paired by seed. The PostDeg-L-Adaptive-vs-PostDeg rows are not significant at $\alpha = 0.05$; this is a non-rejection at $n = 10$ rather than positive evidence of equivalence (see Table [A13](https://arxiv.org/html/2606.14022#A4.T13) for TOST equivalence tests).

| Experiment | Baseline | $p$-value | Significance |
|---|---|---|---|
| InfluMax | LN backbone | 0.0020 | ∗∗ |
| InfluMax | PostDeg | 0.8457 | n.s. |
| InfluMax | GraphScalar | 0.0020 | ∗∗ |
| InfluMax | BatchNorm | 0.0020 | ∗∗ |
| Dismantle | LN backbone | 0.0020 | ∗∗ |
| Dismantle | PostDeg | 0.9219 | n.s. |
| Dismantle | GraphScalar | 0.0020 | ∗∗ |
| Dismantle | BatchNorm | 0.0020 | ∗∗ |
| Epidemic | LN backbone | 0.0098 | ∗∗ |
| Epidemic | PostDeg | 0.3223 | n.s. |
| Epidemic | GraphScalar | 0.0039 | ∗∗ |
| Epidemic | BatchNorm | 0.0098 | ∗∗ |
| MIS | LN backbone | 0.0020 | ∗∗ |
| MIS | PostDeg | 0.3750 | n.s. |
| MIS | GraphScalar | 0.0020 | ∗∗ |
| MIS | BatchNorm | 0.0020 | ∗∗ |

Note: ∗∗∗ $p < 0.001$, ∗∗ $p < 0.01$, ∗ $p < 0.05$, n.s. = not significant.
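With only $10$ paired seeds, the signed-rank statistic has a small discrete null distribution that can be enumerated directly. The sketch below computes an exact two-sided $p$-value by listing all $2^{10}$ sign patterns; whether the paper used the exact two-sided form is an assumption, and the differences are invented. Under that assumption, a clean $10/10$ sweep gives $2/1024 \approx 0.0020$ and a single loss on the smallest difference gives $4/1024 \approx 0.0039$, two of the values that appear in the table.

```python
from itertools import product


def wilcoxon_exact_two_sided(diffs):
    """Exact two-sided signed-rank p-value; assumes no zero and no tied |diffs|."""
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for r, i in enumerate(order, start=1):
        ranks[i] = r                                   # rank of |diff|, 1 = smallest
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    stat = min(w_plus, total - w_plus)
    count = 0
    for signs in product((0, 1), repeat=len(diffs)):   # every sign pattern is equally likely under H0
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= stat:
            count += 1
    return count / 2 ** len(diffs)


# Invented per-seed differences (method minus baseline), 10 seeds.
all_wins = [0.5, 0.7, 0.3, 0.9, 0.4, 0.6, 0.8, 1.0, 0.2, 0.1]
one_small_loss = all_wins[:-1] + [-0.1]

p_all_wins = wilcoxon_exact_two_sided(all_wins)
p_one_loss = wilcoxon_exact_two_sided(one_small_loss)
```

One consequence of this discreteness is that a two-sided exact test on $10$ pairs cannot produce a $p$-value below $2/1024$.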
#### Full pairwise Wilcoxon matrix\.

The pairwise matrix expands the row scope: each post-LN scale operator (PostDeg, PostDeg-L) is paired against every other baseline, per task. Cells with $+$ direction and $p < 0.01$ are wins for the post-LN target. The pattern is nearly uniform: significant wins on InfluMax, Dismantle, and MIS, and mostly non-significant results on Epidemic at the boundary regime. No cell is flagged as a degenerate tie, because the per-seed differences are non-degenerate.

Table A15: Full pairwise Wilcoxon signed-rank test for the post-LN family (PostDeg, PostDeg-L-FG, PostDeg-L-Adaptive) against every baseline at the largest evaluation size, paired by 10 seeds. Each cell is a $p$-value with a $\pm$ sign indicating the direction (target beats / loses to baseline). Significance markers: ∗∗∗ $p < 0.001$, ∗∗ $p < 0.01$, ∗ $p < 0.05$. Cells marked with a dash are degenerate ties (zero seed differences). Source: eval_results_all.csv, new_experiments_eval_results.csv. Panels: InfluMax, Dismantle, Epidemic, MIS.

#### Effect sizes and per\-seed wins\.

Cohen’s $d$ values relative to LN backbone are in Table [A16](https://arxiv.org/html/2606.14022#A4.T16). The values for PostDeg and its learned variants exceed $6$ on three tasks because between-seed variance is unusually low at our seed budget, not because the absolute reward gap is exceptional; the per-seed reward gap in Table [A12](https://arxiv.org/html/2606.14022#A4.T12) is the more interpretable quantity.
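The dependence of Cohen's $d$ on seed variance can be seen in a small example. The sketch below uses the pooled-standard-deviation form of $d$, which is an assumption about the variant reported in the paper, and invented reward values: the same mean gap of $0.2$ yields $d$ above $12$ with tight seeds and about $1.3$ with noisier ones.

```python
import math
from statistics import mean, stdev


def cohens_d(method, baseline):
    """Cohen's d with a pooled standard deviation (equal group sizes assumed)."""
    pooled = math.sqrt((stdev(method) ** 2 + stdev(baseline) ** 2) / 2)
    return (mean(method) - mean(baseline)) / pooled


# Invented rewards: both pairs have the same mean gap of 0.2.
baseline_tight = [10.00, 10.02, 9.98, 10.01, 9.99]
method_tight = [10.20, 10.22, 10.18, 10.21, 10.19]

baseline_noisy = [10.0, 10.2, 9.8, 10.1, 9.9]
method_noisy = [10.2, 10.4, 10.0, 10.3, 10.1]

d_tight = cohens_d(method_tight, baseline_tight)   # low seed variance inflates d
d_noisy = cohens_d(method_noisy, baseline_noisy)   # same gap, ten times the spread
```

Because the spread differs by a factor of ten between the two settings, so does $d$, even though the reward gap is unchanged.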

Table A16: Cohen’s $d$ against LN backbone at the largest evaluation size. Positive values favor the method (sign-flipped for lower-is-better tasks). Source: eval_results_all.csv, new_experiments_eval_results.csv.

The win-rate table reads as paired evidence for the same claim: on each of the three degree-sensitive node-selection tasks, PostDeg and its learned variants each win on $10$ of $10$ paired seeds against LN backbone, and on the boundary task (Epidemic) they win on $9$ of $10$. Figure [A9](https://arxiv.org/html/2606.14022#A4.F9) renders the Cohen’s $d$ matrix as a heatmap so the (method, task) sign and magnitude can be read at a glance.

Table A17: Per-seed win rate against LN backbone at the largest evaluation size: fraction of 10 paired seeds in which the method beats LN backbone. Source: eval_results_all.csv, new_experiments_eval_results.csv.

![Refer to caption](https://arxiv.org/html/2606.14022v1/x9.png)

Figure A9: Cohen’s $d$ effect sizes (visual companion to Table [A16](https://arxiv.org/html/2606.14022#A4.T16)).

#### Verdict\.

Confirmed. TOST rejects inequivalence between PostDeg and the learned superset at $\pm 1\%$ on every task; the FG and adaptive variants agree to 3 decimal places on every parameter (Table [A27](https://arxiv.org/html/2606.14022#A4.T27)).

### D.8 Convergence speed at multiple thresholds

Two convergence tables are paired here. The first reports the median epoch at which each method reaches $80\%$, $90\%$, and $95\%$ of its own final evaluation metric, where each row is endpoint-relative to that method’s own final number. PostDeg and its learned variants reach $80\%$ at epoch $0$ on InfluMax (already at $80\%$ of their final score by the first checkpoint), while LayerNorm and the LN backbone need at least $10$ epochs.

Table A18: Convergence speed: epoch to reach 95% of the method’s own best final performance. Endpoint-relative; read alongside Table [A4](https://arxiv.org/html/2606.14022#A4.T4), since a method that plateaus at a worse final policy can appear “fast” here. $\downarrow$ lower is better. $\dagger$ marks our methods.

The endpoint-relative reading above can mask the case where one method’s final score is much lower. The multi-threshold table that follows uses absolute reward thresholds instead, so a method that converges quickly to a low ceiling is no longer flattered. PostDeg and its learned variants stay at the top of every absolute threshold on InfluMax and MIS.

Table A19: Convergence speed at $80\%$, $90\%$, and $95\%$ of each method’s own final evaluation metric (median epoch over seeds). All values are endpoint-relative; read alongside Table [A4](https://arxiv.org/html/2606.14022#A4.T4). Source: training_curves.csv, new_experiments_training_curves.csv.
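The difference between endpoint-relative and absolute thresholds is easy to see on toy curves. The sketch below is illustrative only: the helper names and the two invented curves are assumptions, it handles a single seed (the tables report a median over seeds), and it assumes a higher-is-better metric checkpointed every 10 epochs.

```python
def first_epoch_at(curve, target):
    """First checkpoint epoch whose score reaches `target`; None if never reached."""
    for epoch, score in curve:
        if score >= target:
            return epoch
    return None


def endpoint_relative(curve, frac):
    """Epoch at which the curve reaches `frac` of its own final score."""
    final_score = curve[-1][1]
    return first_epoch_at(curve, frac * final_score)


# Invented (epoch, score) curves for two hypothetical methods.
low_ceiling = [(0, 4.6), (10, 4.9), (20, 5.0)]                # plateaus early at a low score
high_ceiling = [(0, 3.0), (10, 7.0), (20, 9.5), (30, 10.0)]   # slower, much better endpoint

rel_low = endpoint_relative(low_ceiling, 0.9)     # looks "fast" relative to its own endpoint
rel_high = endpoint_relative(high_ceiling, 0.9)
abs_low = first_epoch_at(low_ceiling, 6.0)        # never reaches the absolute threshold
abs_high = first_epoch_at(high_ceiling, 6.0)
```

The low-ceiling curve wins under the endpoint-relative reading and never qualifies under the absolute one, which is the masking effect the text warns about.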
### D.9 Post-LayerNorm multipliers reach the score head

#### Prediction\.

Theorem [3](https://arxiv.org/html/2606.14022#Thmtheorem3) predicts a strict scale ordering and Corollary [5](https://arxiv.org/html/2606.14022#Thmtheorem5) a log-log slope $-\gamma$; we test both, per task and per layer.

#### Per\-task degree\-distribution summary\.

See body Table[2](https://arxiv.org/html/2606.14022#S4.T2)for per\-task degree distribution summary statistics; not duplicated here\.

#### Per\-task log\-log Pearson and slope\.

Per-task and per-layer signed Pearson correlations between $\ln s_i$ and $\ln d_i$.

Table A20: Per-layer log-log Pearson $r$ and OLS slope of the *learned-superset scalar scale* $\ln s_i$ (the per-node PostDeg-L-Adaptive factor) on $\ln d_i$, on every node-selection task. These regressions test the closed-form scalar power law (Corollary [5](https://arxiv.org/html/2606.14022#Thmtheorem5)); they are not the regressions on *post-network downstream representation magnitudes* that may differ in a task-dependent way (cf. Table [A22](https://arxiv.org/html/2606.14022#A4.T22) which reports the latter). All values are negative as predicted. Source: node_scale_factors.csv.
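The log-log regression can be reproduced on synthetic data to see why the slope equals $-\gamma$ for a pure power-law scale. In the sketch below the degrees, $\alpha$, and $\gamma$ are invented, and the stabilizer $\varepsilon$ and the clamp are dropped so the relationship is exact; both simplifications are assumptions made for clarity.

```python
import math


def ols_slope_and_r(x, y):
    """Ordinary least-squares slope of y on x, plus the Pearson correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx, sxy / math.sqrt(sxx * syy)


# Invented degrees and scale parameters.
degrees = [1, 2, 3, 5, 8, 13, 21]
d_max = max(degrees)
alpha, gamma = 1.55, 0.75

# s_i = alpha * c_hat_i^(-gamma) with c_hat_i = d_i / d_max.
scales = [alpha * (d / d_max) ** (-gamma) for d in degrees]

log_d = [math.log(d) for d in degrees]
log_s = [math.log(s) for s in scales]
slope, r = ols_slope_and_r(log_d, log_s)
```

Taking logs gives $\ln s_i = \ln \alpha + \gamma \ln d_{\max} - \gamma \ln d_i$, so the fitted slope is $-\gamma$ and the correlation is $-1$; deviations from this in measured data reflect the clamp, the stabilizer, or graph-dependent exponents.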
#### Predicted vs\. observed scale\-extreme ratio\.

Figure [A10](https://arxiv.org/html/2606.14022#A4.F10) plots the predicted ratio $R_{\max}^{\rm pred} = (c_{\max}/c_{\min})^{\gamma}$ against the empirical ratio $R_{\max}^{\rm obs} = s_{\max}/s_{\min}$, one point per task. With four points this is a consistency check, not a fit. The predicted ratio undershoots on the boundary tasks (Dismantle and Epidemic sit visibly above $y = x$) and tracks the empirical ratio on the heterogeneous tasks (InfluMax sits on $y = x$; MIS slightly below). We do not claim a quantitative fit; the pattern is consistent with the LayerNorm/residual interaction documented in Appendix [C](https://arxiv.org/html/2606.14022#A3), which lowers the empirical ratio relative to the closed-form scalar prediction.

![Refer to caption](https://arxiv.org/html/2606.14022v1/x10.png)

Figure A10: Consistency check between predicted and observed scale-extreme ratio per task.

#### Quantitative test of the separation identity\.

We compare the empirical scale extremes per task to predictions at both $\gamma = 1/2$ (PostDeg) and $\gamma_{\text{learned}}$ (the learned superset). The empirical extremes come from the learned-superset scale at convergence; the $\gamma = 1/2$ row undershoots the empirical ratio for purely arithmetic reasons, while the $\gamma_{\text{learned}}$ row tracks the empirical ratio in shape and direction.

Table A21: Quantitative test of the degree-separation identity (Theorem [3](https://arxiv.org/html/2606.14022#Thmtheorem3)) on each node-selection task. We predict the scale-ratio extreme $R_{\max} = (d_{\max}/d_{\min})^{q}$ at two exponents: $q = 1/2$ (PostDeg, the recommended operator) and $q = \gamma_{\text{learned}}$ (the learned same-position superset, with $\gamma = \beta + \delta \lambda_2$ averaged across seeds and layers). The empirical scale ratio is computed on the learned-superset scale at convergence; predictions at $q = 1/2$ undershoot the empirical ratio because the recommended operator uses a smaller exponent than the optimum, while predictions at $q = \gamma_{\text{learned}}$ track the empirical ratio in shape and direction. Tasks with $d_{\min} = 0$ (Dismantle, Epidemic, MIS) are reported with the implementation clamp $d_{\min} \to \max(d_{\min}, 1)$ used by the operator. Pearson $r(\ln s, \ln d)$ is computed on the per-node learned-scale samples. Source: node_scale_summary.csv, degree_summary.csv, dsn_spectral_weights.csv.
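The arithmetic behind the two prediction rows is small enough to check by hand. The sketch below evaluates $R_{\max} = (d_{\max}/d_{\min})^{q}$ with the $d_{\min} \to \max(d_{\min}, 1)$ clamp; the degree extremes and the larger exponent are invented placeholders, not values from the paper.

```python
def predicted_ratio(d_min, d_max, q):
    """Predicted scale-extreme ratio (d_max / d_min)^q with the d_min >= 1 clamp."""
    d_min = max(d_min, 1)          # isolated nodes are treated as degree 1
    return (d_max / d_min) ** q


# Invented degree extremes for a graph that contains isolated nodes.
d_min, d_max = 0, 25

r_postdeg = predicted_ratio(d_min, d_max, q=0.5)    # fixed PostDeg exponent
r_learned = predicted_ratio(d_min, d_max, q=0.75)   # hypothetical larger learned exponent
```

For any $d_{\max}/d_{\min} > 1$, a larger exponent gives a larger predicted ratio, which is the arithmetic reason the $q = 1/2$ prediction falls below a ratio measured on a scale trained with a larger exponent.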
#### Per\-task scale extremes\.

Empirical post-network scale extremes per task with the predicted $\gamma=1/2$ ratio, and the signed Pearson correlation between $\ln s_i$ and $\ln d_i$ on the post-network representation magnitudes.

Table A22: Empirical scale extremes per task (*post-network representation magnitudes* after the gated PostDeg-L-Adaptive layer composes with LayerNorm and the residual gate) and the predicted ratio at $q=1/2$. The Pearson column is the signed correlation between $\ln s_i$ and $\ln d_i$ on these post-network magnitudes; these values differ from the Pearson correlations of the scalar PostDeg-L-Adaptive factor reported in Table [A20](https://arxiv.org/html/2606.14022#A4.T20) because of the downstream layer interactions documented in Remark [2](https://arxiv.org/html/2606.14022#Thmremark2).
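
The signed Pearson correlation in Table A22 can be reproduced from per-node samples. The sketch below assumes paired lists of scales and degrees and clamps degrees to at least 1 before taking logs (mirroring the clamp in Table A21; whether the paper's script does the same is an assumption). The name `pearson_log_log` is hypothetical.

```python
import math


def pearson_log_log(scales, degrees):
    """Signed Pearson correlation between ln(s_i) and ln(d_i)."""
    xs = [math.log(max(d, 1)) for d in degrees]  # assumed clamp for degree 0
    ys = [math.log(s) for s in scales]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Idealized case: an exact inverse-square-root degree scale.
degrees = [1, 2, 4, 8]
scales = [d ** -0.5 for d in degrees]
r = pearson_log_log(scales, degrees)
```

For an exact power-law scale the correlation is $-1$; downstream LayerNorm and residual interactions would move measured values away from that idealized limit.
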
#### Cohen’s $d$ vs. evaluation size.

The Cohen’s $d$ between PostDeg and the LN backbone grows with evaluation size on InfluMax and is non-decreasing on Dismantle and MIS within seed noise; on the boundary task (Epidemic) it does not grow. Faint grey lines are the non-family baselines.

![Refer to caption](https://arxiv.org/html/2606.14022v1/x11.png) Figure A11: Cohen’s $d$ vs. evaluation size, one panel per task. PostDeg and its learned variants (bold) grow with size on InfluMax and remain non-decreasing on Dismantle and MIS within seed noise; on the boundary task (Epidemic) the effect does not grow. Baselines (faint) saturate near $0$ on the boundary task.
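
For readers who want to recompute the effect size, here is a minimal sketch of Cohen's $d$ using the pooled sample standard deviation. This section does not state which variant of $d$ the paper uses, so the pooled form is an assumption, and `cohens_d` is a hypothetical helper name.

```python
import math


def cohens_d(treatment, baseline):
    """Cohen's d with pooled sample standard deviation (assumed variant)."""
    n1, n2 = len(treatment), len(baseline)
    m1, m2 = sum(treatment) / n1, sum(baseline) / n2
    v1 = sum((x - m1) ** 2 for x in treatment) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in baseline) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd


# Hypothetical per-seed rewards, not values from the paper.
d_wide = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])      # large seed spread
d_tight = cohens_d([4.0, 4.1, 4.2], [3.0, 3.1, 3.2])     # small seed spread
```

The same mean gap produces a much larger $d$ when the seed spread is small, so very tight across-seed variance can by itself push $d$ to large values.
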
#### Heterogeneity dose\-response\.

Figure [A12](https://arxiv.org/html/2606.14022#A4.F12) reports the PostDeg gain over LN backbone against training-graph degree skewness, one point per node-selection task. With only four points we do not claim a monotone cross-task trend: MIS at skewness $0.35$ has a larger gain than Dismantle and Epidemic at skewness $\approx 0.45$, so the relationship is task-and-skewness mediated rather than purely heterogeneity-driven. The clean dose-response is the within-task BA-vs-SBM stratification on MIS (Figure [A21](https://arxiv.org/html/2606.14022#A4.F21)); the cross-task scatter is included only as a task-level descriptor.

![Refer to caption](https://arxiv.org/html/2606.14022v1/x12.png) Figure A12: PostDeg $\Delta\%$ over LN backbone vs. training-graph degree skewness, one point per node-selection task. With four points the scatter is a task-level descriptor rather than a monotone trend: MIS shows a larger gain than Dismantle and Epidemic despite lower skewness.
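
Degree skewness, the x-axis of Figure A12, can be computed directly from a degree list. The sketch below uses the population moment coefficient $m_3/m_2^{3/2}$; the paper does not specify its estimator in this section, so that choice is an assumption, and the two degree lists are hypothetical illustrations.

```python
def skewness(values):
    """Moment coefficient of skewness m3 / m2**1.5 (population moments).

    Undefined (division by zero) for a constant list, e.g. a regular graph.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical degree lists: one heavy right tail, one symmetric.
heavy_tailed = [1, 1, 1, 2, 2, 3, 10]
symmetric = [4, 5, 5, 6, 6, 7]
skew_heavy = skewness(heavy_tailed)
skew_symmetric = skewness(symmetric)
```
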
#### Per\-task degree CDFs\.

The same picture seen at the distribution level is in Figure [A13](https://arxiv.org/html/2606.14022#A4.F13): log-log degree survival per task. InfluMax (BA) carries the heaviest right tail; SBM tasks sit closer to a single concentrated mode.

![Refer to caption](https://arxiv.org/html/2606.14022v1/x13.png) Figure A13: Per-task degree survival $1-F(d)$ on log-log axes. InfluMax (BA) sits well above the SBM tasks at the right tail, consistent with skewness $2.90$.
#### Effect size scales with degree heterogeneity\.

Within MIS, the BA-vs-SBM stratification gives a clean dose-response (Figure [A14](https://arxiv.org/html/2606.14022#A4.F14)): BA graphs yield a $+10.98\%$ improvement over the LN backbone, and SBM graphs yield $+5.85\%$.

![Refer to caption](https://arxiv.org/html/2606.14022v1/x14.png) Figure A14: Within-MIS dose-response by graph family. (a) Cohen’s $d$ between each method and LN backbone on BA ($n=200$) vs. moderately heterogeneous SBM ($n=300$). (b) Mean relative improvement over LN backbone on the same two families. BA produces larger gains, in line with the heterogeneity prediction.
#### Per\-layer scale\-vs\-degree\.

Figure [A15](https://arxiv.org/html/2606.14022#A4.F15) plots $\ln s_i$ against $\ln d_i$ separately for each of the three GAT layers and each of the four node-selection tasks; slopes and signed correlations agree to two decimals across layers on every task.

![Refer to caption](https://arxiv.org/html/2606.14022v1/x15.png)Figure A15:Log\-log scale\-vs\-degree scatter per layer per task\. The optimizer treats the three GAT layers nearly identically: slopes and signed Pearson correlations agree to two decimals across layers on every task\.
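
The per-layer agreement check can be sketched as an ordinary least-squares fit in log-log space, one fit per layer, followed by a comparison at two decimals. The per-layer scales below are synthetic (the same power law with slightly different prefactors) and the helper name `log_log_slope` is hypothetical; degrees are assumed to be already clamped to at least 1.

```python
import math


def log_log_slope(scales, degrees):
    """Least-squares slope of ln(s_i) on ln(d_i); degrees are assumed >= 1."""
    xs = [math.log(d) for d in degrees]
    ys = [math.log(s) for s in scales]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


degrees = [1, 2, 4, 8, 16]
# Synthetic per-layer scales, not measurements from the paper.
layer_scales = {
    "layer1": [1.00 * d ** -0.75 for d in degrees],
    "layer2": [1.02 * d ** -0.75 for d in degrees],
    "layer3": [0.98 * d ** -0.75 for d in degrees],
}
slopes = {name: log_log_slope(s, degrees) for name, s in layer_scales.items()}
# True when all layers share the same slope after rounding to two decimals.
agree = len({round(v, 2) for v in slopes.values()}) == 1
```

A constant prefactor shifts the intercept but not the slope, so layers that differ only in overall scale still agree on the fitted exponent.
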
#### Empirical scale distributions\.

Figure [A16](https://arxiv.org/html/2606.14022#A4.F16) reports the empirical distribution of the post-LN scale $s_i$ at convergence on each task. Heavy-tailed InfluMax has the widest support; Epidemic concentrates near a single value, consistent with the low-heterogeneity boundary regime.

![Refer to caption](https://arxiv.org/html/2606.14022v1/x16.png) Figure A16: Empirical post-LN scale $s_i$ per task at convergence (one panel per task, one bar per node bin).

The degree distribution that drives this scale shape is in Figure [A17](https://arxiv.org/html/2606.14022#A4.F17). Compared with the scale histogram above, the heavy-tail-vs-low-heterogeneity axis is preserved: InfluMax shows a long upper degree tail, Epidemic a tight unimodal cluster, and the post-LN scale tracks each.

![Refer to caption](https://arxiv.org/html/2606.14022v1/x17.png)Figure A17:Per\-task degree distribution\. Used together with Figure[A16](https://arxiv.org/html/2606.14022#A4.F16)to read the post\-LN scale shape against the degree shape that produces it\.
#### Per\-task scale\-vs\-degree\.

The per-task and per-layer log-log scatter of $s_i$ against $d_i$ is consolidated in Figure [A15](https://arxiv.org/html/2606.14022#A4.F15) above; that $4\times 3$ grid (one row per task, one column per layer) makes the per-layer agreement explicit and supersedes a separate task-only plot. Slopes are negative on every task and the absolute slope is largest on InfluMax (the most heterogeneous) and smallest on Epidemic (the most regular), as predicted by Corollary [5](https://arxiv.org/html/2606.14022#Thmtheorem5).

The corresponding learned-ablation parameter trajectories are in Figure [A18](https://arxiv.org/html/2606.14022#A4.F18). The exponent $\beta$ rises within the first $\sim 25$ epochs and then sits in a tight band; the gate parameter $\alpha$ and the spectral coefficient $\delta$ behave similarly. Across-seed standard deviations are $\leq 0.002$ on every task (Table [A26](https://arxiv.org/html/2606.14022#A4.T26)).

![Refer to caption](https://arxiv.org/html/2606.14022v1/x18.png) Figure A18: Learned-ablation parameter trajectories $(\alpha,\beta,\delta)$ across training, per task, averaged across the three GAT layers (10 seeds per task per parameter). Initialized at $\alpha=1.55$, $\beta=0.75$, $\delta=0.25$; final values per task in Table [A23](https://arxiv.org/html/2606.14022#A4.T23). Y-axis ranges are deliberately tight (movement is in the third decimal, $|\Delta|\leq 0.018$ over 200 epochs); per Corollary [12](https://arxiv.org/html/2606.14022#Thmtheorem12) this is consistent with bounded gradients at our learning rate, and supports the TOST equivalence between the parameter-free PostDeg and the learned same-slot variants (Section [4.3](https://arxiv.org/html/2606.14022#S4.SS3)). The figure is not evidence that the optimizer fails to train; it is evidence that the optimizer settles in a tight band.

### D.10 Learned-ablation parameters

This subsection looks at where the learned superset settles. The story is that the optimizer is highly consistent: across seeds, layers, and tasks, $(\alpha,\beta,\delta)$ end up in narrow bands, and the resulting effective exponent $\gamma=\beta+\delta\lambda_{2}$ is very close to $1/2$ on BA tasks and slightly above on the SBM tasks. The five tables below trace this consistency from coarse-grained averages to per-seed dispersion.

#### Final values\.

Table [A23](https://arxiv.org/html/2606.14022#A4.T23) reports final-epoch $(\alpha,\beta,\delta)$ averaged across the three GAT layers and 10 seeds, per task. The exponent $\beta$ sits between $0.74$ and $0.77$ on every task; together with $\delta$ this gives an effective post-LN exponent within $0.05$ of the PostDeg default of $1/2$ on the BA tasks.

Table A23: Learned same-slot parameters at final epoch, averaged across all 3 GAT layers. Mean $\pm$ std over 10 seeds. $\dagger$ = our method.
#### Per\-layer values\.

A finer-grained breakdown is in Table [A24](https://arxiv.org/html/2606.14022#A4.T24): the same parameter values, now reported per GAT layer. The three layers agree to two decimal places on every parameter, on every task. The optimizer therefore is not specializing layers to play different roles in the post-LN scale; this is consistent with the placement rule in which the relevant axis is the slot, not the layer.

Table A24: Per-layer PostDeg-L-Adaptive parameter values at the final training epoch (mean $\pm$ seed std over 10 seeds). The optimizer treats the three GAT layers nearly identically. Source: dsn\_spectral\_weights.csv, new\_experiments\_dsn\_weights.csv.
#### Values at training\-epoch checkpoints\.

Snapshotting the parameters at training-epoch checkpoints (Table [A25](https://arxiv.org/html/2606.14022#A4.T25)) shows the time course. The exponent $\beta$ rises within the first $25$–$50$ epochs and then stabilizes; the gate parameter $\alpha$ and the spectral coefficient $\delta$ track similar curves. The per-seed visual companion is in Figure [A19](https://arxiv.org/html/2606.14022#A4.F19) below.

Table A25: PostDeg-L-Adaptive learned parameters $(\alpha,\beta,\delta)$ at six training-epoch checkpoints, averaged across all 3 GAT layers and 10 seeds (mean $\pm$ std, 3 decimal places). The exponent rises sharply within the first $25$ epochs and then stabilizes.
#### Per-seed final $\beta$.

Table [A26](https://arxiv.org/html/2606.14022#A4.T26) reports per-seed final $\beta$ values, one row per seed, averaged across the three GAT layers. Across-seed standard deviations are $\leq 0.002$ on every task; this is the source of the unusually low between-seed variance that drives Cohen’s $d$ above $6$ in Table [A16](https://arxiv.org/html/2606.14022#A4.T16).

Table A26: Per-seed final $\beta$ values for the learned PostDeg-L-Adaptive ablation, averaged across the 3 GAT layers. Spread across seeds is below $0.005$ for every task. Source: dsn\_spectral\_weights.csv, new\_experiments\_dsn\_weights.csv.
#### Per-seed dispersion of $(\alpha,\beta,\delta)$.

Table [A27](https://arxiv.org/html/2606.14022#A4.T27) expands the previous view to all three learned parameters, with the per-seed standard deviation reported alongside the across-seed mean. The dispersion in $\alpha$ and $\delta$ is comparable to that in $\beta$; no parameter is the source of a long-tail seed effect.

Table A27: Per-seed final $(\alpha,\beta,\delta)$ for the learned ablation. Across-seed standard deviations on every task are below $0.005$ for $\beta$ and $\delta$, below $0.01$ for $\alpha$; this is the dispersion that supports the equivalence statement in Section [4.3](https://arxiv.org/html/2606.14022#S4.SS3).

Sub-table labels: InfluMax, Dismantle, Epidemic, MIS.

#### Per-seed $\beta$ curve.

The 10-line plot below shows that $\beta$ converges within the first $\sim 25$ epochs and sits in a tight band thereafter, on every task and every seed.

![Refer to caption](https://arxiv.org/html/2606.14022v1/x19.png) Figure A19: Per-seed $\beta$ curve through training (10 seeds per task, averaged across the 3 GAT layers). The exponent rises rapidly in the first $\sim 25$ epochs and then sits in a tight band, on every task and every seed.

### D.11 Ablation staircase

We walk through the staircase of progressively dropping degrees of freedom from the learned superset and report the mean reward at the largest evaluation size\.

- PostDeg-L-Adaptive: adaptive gate $m_i=\sigma(\phi(c_i))$, learned $(\alpha,\beta,\delta)$. Three node-selection deltas: $+3.61\%$ / $+2.61\%$ / $+5.85\%$.
- PostDeg-L: scalar gate $m=\sigma(\rho)$, learned $(\alpha,\beta,\delta)$. $+3.50\%$ / $+2.65\%$ / $+6.09\%$.
- PostDeg: $\alpha=1,\beta=1/2,\delta=0,m=1$. $+3.50\%$ / $+2.53\%$ / $+5.58\%$.
- GraphScalar: $\alpha=1,\beta=\delta=0$ (graphwise scalar only). $+0.31\%$ / $-0.90\%$ / $-0.57\%$.
- LN backbone: $m=0$ (identity). $0\%$ baseline.

The first three rows cluster within seed\-noise of each other \(TOST equivalence, Table[A13](https://arxiv.org/html/2606.14022#A4.T13)\); GraphScalar falls off; LN backbone is the baseline\. The placement rule predicts this exact ordering: each added parameter inside the post\-LN slot is paired\-equivalent to the simpler operator in the same slot\.
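
To make the difference between the PostDeg and GraphScalar rows concrete, the sketch below evaluates only the degree-dependent factor $c_i^{-\gamma}$ with $\gamma=\beta+\delta\lambda_2$. It deliberately leaves out $\alpha$ and the gate $m$, whose exact composition is not spelled out in this section; the helper names, the centrality values, and the placeholder $\lambda_2=1$ are all assumptions for illustration.

```python
def effective_exponent(beta, delta, lam2):
    """Effective post-LN exponent gamma = beta + delta * lambda_2."""
    return beta + delta * lam2


def degree_factor(c, gamma):
    """Degree-dependent part of the scale, c ** (-gamma); alpha and gate omitted."""
    return c ** (-gamma)


centralities = [0.25, 1.0]  # hypothetical clamped centralities for two nodes
lam2 = 1.0                  # placeholder spectral gap

gammas = {
    "PostDeg": effective_exponent(beta=0.5, delta=0.0, lam2=lam2),
    "GraphScalar": effective_exponent(beta=0.0, delta=0.0, lam2=lam2),
}
factors = {
    name: [degree_factor(c, gamma) for c in centralities]
    for name, gamma in gammas.items()
}
```

With $\gamma=0$ every node receives the same factor, so the GraphScalar row carries no per-node degree signal, whereas $\gamma=1/2$ separates low-centrality from high-centrality nodes.
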

### D.12 Why the optimizer settles at $\beta\in[0.75,0.77]$, not $0.5$

The recommended PostDeg uses $\gamma=1/2$, while the learned superset converges to $\beta\approx 0.75$ on every task (initialized at $\beta=0.75$, final values $0.752$–$0.768$ across tasks with seed std $<0.002$; Table [A26](https://arxiv.org/html/2606.14022#A4.T26)). We do not claim a structural reason for the exact value of the converged $\beta$.

*Honest reading.* At our seed budget of 10, we cannot statistically distinguish $\beta=0.5$ from $\beta=0.75$: TOST rejects inequivalence at $\pm 1\%$ on every task (Table [A13](https://arxiv.org/html/2606.14022#A4.T13)); the empirical paired-mean difference is below $0.11$ on every task and below the seed std on three of four. We keep $1/2$ as the recommended default because it is parameter-free, matches the GCN-symmetric exponent in $D^{-1/2}AD^{-1/2}$, and produces results paired-equivalent to the learned variant under our setup.

Figure [A18](https://arxiv.org/html/2606.14022#A4.F18) may suggest the optimizer never moved $\beta$ much from initialization. The 0.005–0.018 range over 200 epochs is consistent with bounded gradients (Corollary [12](https://arxiv.org/html/2606.14022#Thmtheorem12): $|\partial s_i/\partial\beta|\leq\lambda_{2}\ln(c_{\min}+\varepsilon)\,s_i\leq 13.8\,s_i$ on our parameter domain) times learning rate $10^{-4}$ times $\sim 4\times 10^{4}$ gradient steps; we do not over-interpret the small movement.

### D.13 Spectral quantities

Table A28: Spectral quantities through training: $\lambda_{\max}$ and the spectral gap $\lambda_{2}$ of the normalized Laplacian (mean $\pm$ std across 10 seeds and 3 layers), and the rightmost column gives the across-graph coefficient of variation of $\lambda_{\max}$ in percent. Dismantle and Epidemic SBM graphs have CV $\approx 2.3\%$, an order of magnitude tighter than InfluMax BA graphs ($\approx 0.8\%$): a graphwise spectral scalar (Spec) therefore acts almost identically across Dismantle graphs and miscalibrates the score head when fragmentation differences are small. Source: dsn\_spectral\_weights.csv, new\_experiments\_dsn\_weights.csv.

![Refer to caption](https://arxiv.org/html/2606.14022v1/x20.png) Figure A20: Spectral quantities through training (10 seeds per task, hairy-band rendering with one trace per seed). $\lambda_{\max}$ and the spectral gap $\lambda_{2}$ stabilize within the first few epochs and remain in a tight band thereafter; the y-axis ranges are deliberately tight to make stability visible (the per-seed std is reported in Table [A28](https://arxiv.org/html/2606.14022#A4.T28)). The seed-noise band visible in the rendering is therefore the magnitude of the actual movement, not noise around a trend.
### D.14 Effect attenuates with degree heterogeneity

#### Prediction\.

Lemma [6](https://arxiv.org/html/2606.14022#Thmtheorem6) gives the strict regular limit. Epidemic and DD test the lower-heterogeneity boundary around that limit.

The two boundary regimes share a structural property: lower degree heterogeneity. Epidemic SBM contact graphs have skewness $0.46$, DD protein graphs $0.27$; the BA InfluMax graphs sit at $2.90$. The placement rule predicts attenuation as the operating regime moves from high-heterogeneity toward low-heterogeneity, and Figures [A14](https://arxiv.org/html/2606.14022#A4.F14) and [A21](https://arxiv.org/html/2606.14022#A4.F21) confirm this.

#### Scale dispersion vs\. degree dispersion\.

The sufficient-statistic table for Lemma [6](https://arxiv.org/html/2606.14022#Thmtheorem6) reports per-task scale CV and degree CV; the relationship is monotone across tasks.

Table A29: Sufficient-statistic table for Lemma [6](https://arxiv.org/html/2606.14022#Thmtheorem6): per-task scale dispersion (CV $=$ std/mean of $s_i$) tracks per-task degree dispersion (CV of $d$). The relationship is monotone across tasks, supporting the qualitative form of the near-regular attenuation result.
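
The two dispersion statistics in Table A29 are coefficients of variation. A minimal sketch follows, assuming the population standard deviation (the caption says only "std/mean") and an idealized inverse-square-root scale; the degree lists are hypothetical.

```python
import math


def coefficient_of_variation(values):
    """CV = std / mean, using the population standard deviation (assumed)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return std / mean


def inverse_sqrt_scale(degrees):
    """Idealized PostDeg-style scale d ** -1/2 with the degree clamp at 1."""
    return [max(d, 1) ** -0.5 for d in degrees]


regular = [3, 3, 3, 3]          # hypothetical regular graph
heterogeneous = [1, 1, 2, 16]   # hypothetical heavy-tailed graph

cv_degree = {
    "regular": coefficient_of_variation(regular),
    "heterogeneous": coefficient_of_variation(heterogeneous),
}
cv_scale = {
    "regular": coefficient_of_variation(inverse_sqrt_scale(regular)),
    "heterogeneous": coefficient_of_variation(inverse_sqrt_scale(heterogeneous)),
}
```

On a regular graph both CVs are zero, so the scale is constant across nodes and adds no per-node signal, which is the regular limit that Lemma 6 describes.
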
#### MIS by graph distribution\.

Stratifying MIS by graph family (BA vs. SBM) gives a within-task heterogeneity dose-response: the BA family produces a larger PostDeg gain than the more regular SBM family. Table [A30](https://arxiv.org/html/2606.14022#A4.T30) reports the values; Figure [A21](https://arxiv.org/html/2606.14022#A4.F21) renders the same picture visually.

Table A30: MIS performance stratified by evaluation graph distribution (mean $\pm$ std, 10 seeds). Post-LN scale gains are larger on heavy-tailed BA graphs than on moderately heterogeneous SBM graphs. Evaluation sizes differ by family ($n=200$ for BA vs. $n=300$ for SBM), so we report this as a directional finding, not a matched-size estimate.

![Refer to caption](https://arxiv.org/html/2606.14022v1/x21.png) Figure A21: MIS performance stratified by graph distribution. Effect size shrinks on lower-heterogeneity graphs, in line with Lemma [6](https://arxiv.org/html/2606.14022#Thmtheorem6).
#### Train\-evaluation gap\.

Figure [A22](https://arxiv.org/html/2606.14022#A4.F22) reports the train-evaluation generalization gap by method, computed as the difference between training-graph reward at convergence and reward at the largest evaluation size. PostDeg has the smallest gap on every task; the variance-zeroing baselines (PairNorm, InstanceNorm, BatchNorm) have the largest.

![Refer to caption](https://arxiv.org/html/2606.14022v1/x22.png)Figure A22:Train\-evaluation generalization gap by method\.
#### DD: per\-fold breakdown\.

Per-fold DD accuracies for every (method, fold, seed) cell are in Table [A31](https://arxiv.org/html/2606.14022#A4.T31). The variance-zeroing methods are at the per-fold majority-class rate on every fold; PostDeg, its learned variants, and LayerNorm sit near $80\%$ on every fold.

Table A31: Per-fold final accuracy on DD (10-fold cross-validation, mean across 10 seeds). Folds are listed in order. The three feature-statistic normalizers (PairNorm, InstanceNorm, BatchNorm) collapse to within seed-noise of the per-fold majority-class rate (each fold’s larger class is in the high-$58\%$ range), uniformly across folds. Source: new\_experiments\_training\_curves.csv.

PairNorm, InstanceNorm, and BatchNorm each reach the per-fold majority-class rate within the first epoch on every fold and every seed (median epoch $0$, 90th-percentile epoch $0$); we omit the table since every cell is the same.

![Refer to caption](https://arxiv.org/html/2606.14022v1/x23.png) Figure A23: Per-fold DD final accuracy (mean $\pm$ seed std over 10 seeds; dashed line at the per-fold majority-class rate $0.5865$). One panel per method; ten bars per panel, one for each cross-validation fold. PairNorm, InstanceNorm, and BatchNorm collapse onto the majority-class line on every fold; PostDeg, Extra LayerNorm, and LN backbone separate from the dashed line and converge near $80\%$.
#### Verdict\.

Confirmed. Per-task learned-superset scalar-factor regressions return negative slopes ($-0.83$ to $-0.62$) on every node-selection task (Table [A20](https://arxiv.org/html/2606.14022#A4.T20), Figure [A15](https://arxiv.org/html/2606.14022#A4.F15)), and the optimizer treats the three GAT layers nearly identically.

### D.15 Transfer and supplementary visualizations

![Refer to caption](https://arxiv.org/html/2606.14022v1/x24.png) Figure A24: Relative improvement over LN backbone against evaluation graph size, per task and method (mean $\pm$ seed std). The horizontal zero line marks the LN backbone baseline.

![Refer to caption](https://arxiv.org/html/2606.14022v1/x25.png) Figure A25: Supplementary rank and relative-improvement view. Ranks are shown only as a compact diagnostic; the body uses paired relative effects as the main comparison.

![Refer to caption](https://arxiv.org/html/2606.14022v1/x26.png) Figure A26: Multi-task profile across all five tasks ($\Delta\%$ over LN backbone). Each axis is one task; further outward is better. The post-LN family (PostDeg, PostDeg-L-FG, PostDeg-L-Adaptive) traces a uniformly outward profile across the four node-selection tasks and the DD accuracy axis; Extra LayerNorm is shown as a representative non-degree-aware baseline that collapses near zero on every axis. Baselines not plotted here (BatchNorm, PairNorm, and InstanceNorm collapse on DD; GraphNorm dips on Dismantle) are reported in Table [A11](https://arxiv.org/html/2606.14022#A4.T11). *Caveat:* radar polygon area scales with the square of the radius, so visual area differences exaggerate small rank differences; rely on Table [A11](https://arxiv.org/html/2606.14022#A4.T11) for quantitative comparisons.

#### Policy vs. classical heuristics (paired Wilcoxon).

Across every (task, evaluation size) cell on the three node-selection tasks that admit a canonical classical heuristic (InfluMax, Dismantle, Epidemic), paired Wilcoxon tests reject equality between the trained policy and the heuristic at $\alpha=0.05$. MIS has no single canonical heuristic at our regime, so the per-cell test is reported only for the three. Table [A32](https://arxiv.org/html/2606.14022#A4.T32) reports the test results and effect sizes; Figure [A27](https://arxiv.org/html/2606.14022#A4.F27) is the visual companion (three panels, one per task).

Table A32: Trained policy versus the canonical classical heuristic, paired by 10 seeds. We report the mean and seed-std of the per-seed gain (policy minus heuristic, sign-flipped for lower-is-better tasks) and the paired Wilcoxon $p$-value. Heuristic is greedy spread for InfluMax and the degree heuristic for Dismantle and Epidemic. Source: eval\_results\_all.csv.

![Refer to caption](https://arxiv.org/html/2606.14022v1/x27.png) Figure A27: Visual companion to Table [A32](https://arxiv.org/html/2606.14022#A4.T32): trained policies match or surpass the strongest classical centrality heuristic on the three node-selection tasks that admit a canonical heuristic (InfluMax, Dismantle, Epidemic). MIS has no canonical centrality heuristic at our regime and is reported only in the table.
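
A paired Wilcoxon signed-rank test on 10 seeds is small enough to evaluate exactly by enumerating all sign assignments. The sketch below is one such exact two-sided version; whether the paper used an exact or normal-approximation test is not stated, so this is an assumption, and the function name and sample values are hypothetical.

```python
from itertools import product


def wilcoxon_signed_rank_exact(policy, heuristic):
    """Exact two-sided paired Wilcoxon signed-rank test.

    Returns (W_plus, p_value). Zero differences are dropped and tied
    absolute differences receive average ranks.
    """
    diffs = [p - h for p, h in zip(policy, heuristic) if p != h]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    center = sum(ranks) / 2
    observed = abs(w_plus - center)
    # Count sign assignments whose statistic is at least as far from the center.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - center) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Hypothetical per-seed scores: the policy wins on all 10 seeds.
heuristic_scores = [0] * 10
policy_scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
w_plus, p_value = wilcoxon_signed_rank_exact(policy_scores, heuristic_scores)
```

With 10 paired seeds the smallest attainable two-sided $p$-value is $2/2^{10}\approx 0.002$, reached only when one side wins on every seed.
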

## Appendix E Distinction from PNA and other degree-aware aggregators

This appendix expands the position argument of Section [2](https://arxiv.org/html/2606.14022#S2) into a side-by-side derivation. The point is that PNA’s degree scaler and PostDeg’s degree scaler operate on different objects in the forward pass, even though both are continuous functions of $d_i$.

#### PNA distinction \(one\-paragraph remark\)\.

PNA [[8](https://arxiv.org/html/2606.14022#bib.bib8)] scales each aggregator by $S_p(d_i)=\log(d_i+1)/\bar{\delta}$ *inside* the message-passing step, before LayerNorm. That degree term changes messages before they are combined; it is not a free scalar multiplier applied to the LayerNorm input or output. PostDeg leaves aggregation untouched and acts on per-node magnitudes after LayerNorm. The placement diagnostic predicts little headroom for PostDeg on top of PNA because PNA already gives the backbone an aggregation-side degree channel. Table [A34](https://arxiv.org/html/2606.14022#A6.T34) confirms this empirical redundancy under our setup.
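
The two degree scalars can be evaluated side by side. In the sketch below, $\bar{\delta}$ is taken as the mean of $\log(d+1)$ over a reference set of degrees, following PNA's definition, and the PostDeg factor is the inverse square root of the clamped degree. The function names and the two-node degree list are hypothetical, and the code evaluates the scalars only; it does not model where each one enters the forward pass.

```python
import math


def pna_amplification(d, delta_bar):
    """PNA scaler S_p(d) = log(d + 1) / delta_bar, applied inside aggregation."""
    return math.log(d + 1) / delta_bar


def postdeg_scale(d):
    """PostDeg-style factor: clamped degree to the power -1/2, applied after LayerNorm."""
    return max(d, 1) ** -0.5


degrees = [1, 3]  # hypothetical reference degrees
delta_bar = sum(math.log(d + 1) for d in degrees) / len(degrees)

pna_values = [pna_amplification(d, delta_bar) for d in degrees]
postdeg_values = [postdeg_scale(d) for d in degrees]
```

The PNA scaler grows with degree and multiplies aggregated messages, while the PostDeg factor shrinks with degree and multiplies the normalized representation, so the two are not interchangeable even though both are functions of $d_i$.
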

#### Comparison with other degree\-aware schemes\.

GCN’s symmetric normalization $D^{-1/2}AD^{-1/2}$ [[16](https://arxiv.org/html/2606.14022#bib.bib16)] is a pre-aggregation degree scaler, so its degree dependence enters through messages. GraphSAGE’s mean aggregator [[10](https://arxiv.org/html/2606.14022#bib.bib10)] averages over neighborhoods and is degree-aware through that mean. Structural encodings [[9](https://arxiv.org/html/2606.14022#bib.bib9),[22](https://arxiv.org/html/2606.14022#bib.bib22),[2](https://arxiv.org/html/2606.14022#bib.bib2),[26](https://arxiv.org/html/2606.14022#bib.bib26),[20](https://arxiv.org/html/2606.14022#bib.bib20)] concatenate degree- or centrality-based features as content; these enter as pre-LN content, the alternative discussed in Appendix [H](https://arxiv.org/html/2606.14022#A8). PostDeg acts only on representation magnitude after LayerNorm.

## Appendix F Cross-backbone replication and predicted-baseline runs

The data in this appendix come from additional cross-backbone experiments (CSVs under data/supplemental\_runs/). The pipeline has three independent groups: backbone\_generalization runs the same controlled-slot protocol on GAT, GCN, GIN, and SAGE at training-graph size only; pna\_nodenorm\_compare adds NodeNorm and a PNA backbone; and pna\_controlled reruns the PNA cells at 3 seeds for replication. Caveats applied throughout: training-size eval only (no transfer eval); 5-fold CV on DD (do not pool with the paper’s 10-fold DD runs); MIS is saturated at training size and reported for completeness only. The GAT no\_dsn baseline cited below is from the backbone\_generalization group; the parallel pna\_nodenorm\_compare GAT runs are excluded due to a baseline configuration drift (mean $0.110$ on Dismantle vs. $0.168$ in the matched group, which is the figure that aligns with the paper’s existing GAT runs).

#### Cross\-backbone replication\.

Table [A33](https://arxiv.org/html/2606.14022#A6.T33) reports PostDeg vs. LN backbone on each (backbone, task) cell. PostDeg replicates the InfluMax gain on GAT (4/5), GCN (5/5), GIN (4/5), and SAGE (5/5), and the Dismantle gain on GAT (4/5), GCN (4/5), and SAGE (5/5). GIN Dismantle is mixed (1/5, $\Delta=-0.42\%$, within seed-noise of zero). The body’s claim is verified on every backbone on InfluMax and on all but GIN on Dismantle.

Table A33: Cross-backbone replication of the PostDeg gain over LN backbone at training-graph size, run on the supplemental-run pipeline. “Wins” is the number of paired seeds (out of 5) where PostDeg beats LN backbone. “$\Delta\%$” is the mean signed relative gap to LN backbone (positive favors PostDeg). InfluMax and Dismantle replicate on GAT, GCN, and SAGE; GIN InfluMax replicates but GIN Dismantle is mixed ($\Delta=-0.42\%$, within seed-noise). DD is reported as a low-heterogeneity boundary; MIS is saturated at training size and not reported. Source: aggregate\_by\_config.csv, group backbone\_generalization.
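
The two columns of Table A33 can be computed from paired per-seed scores. The sketch below assumes the relative gap is taken per seed against the LN-backbone value and then averaged, with the sign flipped for lower-is-better tasks; the paper's exact aggregation is not given here, so treat this as one plausible reading. The function name and scores are hypothetical.

```python
def paired_wins_and_gain(postdeg, backbone, higher_is_better=True):
    """Return (wins, mean signed relative gap in percent) over paired seeds."""
    sign = 1.0 if higher_is_better else -1.0
    gains = [
        sign * (p - b) / abs(b) * 100.0
        for p, b in zip(postdeg, backbone)
    ]
    wins = sum(g > 0 for g in gains)  # ties do not count as wins
    return wins, sum(gains) / len(gains)


# Hypothetical scores for 5 paired seeds, not values from the paper.
backbone_scores = [1.0, 1.0, 1.0, 1.0, 1.0]
postdeg_scores = [1.1, 1.0, 1.2, 0.9, 1.3]
wins, delta_pct = paired_wins_and_gain(postdeg_scores, backbone_scores)
```
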
#### PNA redundancy prediction\.

Section [2](https://arxiv.org/html/2606.14022#S2) predicts redundancy of PostDeg on PNA. Table [A34](https://arxiv.org/html/2606.14022#A6.T34) confirms it on every (group, task) cell: $|\Delta\%|<1\%$ on InfluMax and Dismantle in both PNA groups, and within seed-noise of zero on Epidemic and MIS in the controlled group. The combination of (i) the prescription on GAT/GCN/GIN/SAGE and (ii) the redundancy prediction on PNA, both confirmed empirically, is the most direct test of the placement diagnostic the new pipeline supports.

Table A34: Redundancy prediction on PNA. Section [2](https://arxiv.org/html/2606.14022#S2) predicts PostDeg should be redundant on PNA because PNA already injects a degree signal in aggregation, mixed downstream by Extra LayerNorm. We test this on two independent PNA groups: pna\_nodenorm\_compare (5 seeds) and pna\_controlled (3 seeds). On every (group, task) cell, $|\Delta\%|<1\%$ and the paired wins do not exceed chance. The redundancy prediction is confirmed.
#### Predicted\-baseline operators we now run\.

Table [A35](https://arxiv.org/html/2606.14022#A6.T35) reports two baselines that the placement rule predicts should leave headroom for PostDeg: NodeNorm (post-block, magnitude-rescaling, graph-blind) and $\log d_i$ concatenation (degree as content rather than as magnitude). Both run alongside PostDeg in the same pipeline. NodeNorm is essentially LN backbone on InfluMax ($-0.05\%$ on GAT) and shows high variance on Dismantle due to a configuration drift in that group (excluded from the GAT baseline per the caveat above); $\log d_i$ concatenation captures part of the InfluMax gain ($+1.56\%$ on GAT) and part of the Dismantle gain ($+1.15\%$), but is below PostDeg ($+2.05\%$ on InfluMax, $+2.49\%$ on Dismantle via the same pipeline). The $\log d_i$ result confirms a content-side residual, as predicted in Appendix [H](https://arxiv.org/html/2606.14022#A8).

Table A35: Predicted-baseline operators that we now run. NodeNorm (post-block, magnitude-rescaling, graph-blind) and $\log d$ concatenation (degree as content rather than as magnitude) are the two baselines that the placement rule predicts should leave headroom for PostDeg. The signed gap to LN backbone at training size on GAT is reported below; PostDeg dominates both on InfluMax and Dismantle. Source: aggregate\_by\_config.csv, group backbone\_generalization for PostDeg, $\log d$ concat, and GraphNorm; group pna\_nodenorm\_compare (GAT rows) for NodeNorm. PNA controls in Table [A34](https://arxiv.org/html/2606.14022#A6.T34).
#### DD with new feature\-statistic rows\.

Table [A36](https://arxiv.org/html/2606.14022#A6.T36) adds NodeNorm and GraphNorm rows to the DD picture (5-fold CV, do not pool with the paper’s 10-fold DD runs). NodeNorm holds at $78.8\%$ and GraphNorm at $77.4\%$. This separates the boundary collapse by variance-zeroing behavior rather than by feature-statistic family: variance-zeroing methods (PairNorm, InstanceNorm, BatchNorm) collapse to majority class; non-variance-zeroing methods hold.

Table A36: DD graph classification with NodeNorm and GraphNorm rows added (5-fold cross-validation, mean test accuracy across folds, then aggregated over 5 seeds). Variance-zeroing feature-statistic normalizers (PairNorm, InstanceNorm, BatchNorm in the main 10-fold runs) collapse to majority-class on DD; non-variance-zeroing feature-statistic normalizers (NodeNorm, GraphNorm here) do not. NodeNorm holds at $78.8\%$, GraphNorm at $77.4\%$. The boundary regime separates by variance-zeroing behavior, not by feature-statistic family. Source: cv\_results\_summary.csv; 5-fold CV, do not pool with the paper’s main 10-fold DD runs.
#### Predicted outcomes on benchmarks not evaluated here\.

The placement rule predicts that PostDeg should help on degree-heterogeneous citation and protein graphs (skew $\gg 1$) and should be neutral on low-heterogeneity knowledge-graph snapshots. Table [A38](https://arxiv.org/html/2606.14022#A6.T38) records this as a prediction sheet for standard benchmarks; we have not run any of them. The remaining hypotheses on PNA, $\log d_i$, and NodeNorm in the original Table [A37](https://arxiv.org/html/2606.14022#A6.T37) have now been run (Tables [A33](https://arxiv.org/html/2606.14022#A6.T33), [A34](https://arxiv.org/html/2606.14022#A6.T34), [A35](https://arxiv.org/html/2606.14022#A6.T35)); we keep the original table here because it documents the algebraic-position predictions before the new data was available.

Table A37: Pre-experiment predictions for baselines that were later run; included for transparency about the predictive structure of the placement rule. The realized empirical comparisons (PNA, $\log d_i$ concat, NodeNorm) are reported in Tables [A34](https://arxiv.org/html/2606.14022#A6.T34), [A33](https://arxiv.org/html/2606.14022#A6.T33), and [A35](https://arxiv.org/html/2606.14022#A6.T35); the qualitative directions predicted in this table all hold on the realized data.

#### External-benchmark predictions.

Table [A38](https://arxiv.org/html/2606.14022#A6.T38) extends the same algebraic position argument to standard graph benchmarks we have not yet run. Each row pairs a benchmark with its expected post-LN slot behavior under the placement rule: PostDeg should help on heavy-tailed citation and protein graphs and remain neutral on low-heterogeneity knowledge graphs. The table is recorded here as a public prediction sheet, not as evidence; running these benchmarks would either reinforce or falsify the rule.

Table A38: Predictions for unrun benchmarks. We list publicly reported mean degree and pooled-skewness estimates and predict whether PostDeg should help based on the placement principle: stronger expected gain for higher heterogeneity. We have not run any of these; this table is a falsifiable prediction sheet, not a measurement.

### F.1 Reference implementation

A PyTorch reference implementation of PostDeg matching Algorithm [1](https://arxiv.org/html/2606.14022#alg1) is included in the supplementary code archive at experiments/shared/dsn_core.py::DegreeScaleNorm; the layer is approximately 12 lines.

## Appendix G Symbolic walkthrough of one PostDeg layer

Consider a 3-node line graph with degrees $(1,2,1)$ and feature dimension $d=4$. The clamped centralities are $c=(0.5,1,0.5)$. With $\beta=0.75$, $\delta=0.25$, and $\lambda_2=1$ on a path graph, $\gamma=\beta+\delta\lambda_2=1.0$. Take $\sigma^2(z_i)=0.78$ at convergence (median value, Table [A10](https://arxiv.org/html/2606.14022#A4.T10)). Apply the absorption identity for an intentionally small $a_i=0.1$ (in practice $a_i\approx 1$; we deliberately pick a small $a_i$ to demonstrate worst-case behavior):

$$\frac{\varepsilon_{\rm LN}}{a_i^{2}\,\sigma^{2}(z_i)}=\frac{10^{-5}}{0.01\cdot 0.78}=1.28\times 10^{-3}.$$

The pre-LN multiplier is therefore numerically erased, up to a relative correction of order $10^{-3}$. Now apply PostDeg post-LN:

$$s_1=(0.5+\varepsilon)^{-1/2}\approx 1.414,\quad s_2=1,\quad s_3\approx 1.414.$$

The leaf-to-hub ratio is $\sqrt{2}\approx 1.414$, while the same multiplier applied pre-LN would have been absorbed to $\sim 10^{-3}$.
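The arithmetic in this walkthrough can be checked with a few lines of plain Python. The following is a minimal sketch, not the reference implementation: `layer_norm` uses unit gain and zero bias, the feature vector `z` is an arbitrary illustration, and the PostDeg clamp value `EPS = 1e-8` is an assumption, since the walkthrough does not state it.

```python
import math

EPS_LN = 1e-5   # LayerNorm stabilizer used in the walkthrough
EPS = 1e-8      # assumed PostDeg epsilon (hypothetical value)

def layer_norm(z, eps=EPS_LN):
    """LayerNorm over one feature vector (unit gain, zero bias)."""
    mu = sum(z) / len(z)
    var = sum((v - mu) ** 2 for v in z) / len(z)
    return [(v - mu) / math.sqrt(var + eps) for v in z]

def post_deg_scale(c, eps=EPS):
    """Post-LN inverse-square-root scale s_i = (c_i + eps)^(-1/2)."""
    return (c + eps) ** -0.5

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

# Stabilizer ratio eps_LN / (a_i^2 * sigma^2(z_i)) with the walkthrough values.
a_i, sigma2 = 0.1, 0.78
stabilizer_ratio = EPS_LN / (a_i ** 2 * sigma2)

# Post-LN scales for the clamped centralities c = (0.5, 1, 0.5).
scales = [post_deg_scale(c) for c in (0.5, 1.0, 0.5)]

# Same scalar placed before vs. after LayerNorm on an arbitrary vector (d = 4).
z = [0.9, -1.3, 0.4, 0.1]
base = layer_norm(z)
pre = layer_norm([a_i * v for v in z])   # scalar inserted before LN
post = [scales[0] * v for v in base]     # scalar inserted after LN

print(f"stabilizer ratio = {stabilizer_ratio:.3e}")
print("scales =", [round(s, 3) for s in scales])
print(f"|pre| / |base|  = {l2_norm(pre) / l2_norm(base):.4f}")
print(f"|post| / |base| = {l2_norm(post) / l2_norm(base):.4f}")
```

Under these assumptions the pre-LN norm ratio stays close to 1 even for the deliberately small multiplier, while the post-LN ratio equals the leaf scale.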

## Appendix H Why feature concatenation does not substitute for post-LayerNorm scaling

This appendix expands the explicit-degree-as-content discussion from Section [4.2](https://arxiv.org/html/2606.14022#S4.SS2) into a full position argument. The argument is that pre-LayerNorm content and post-LN magnitude are not equivalent representations of the same scalar quantity, even when both encode the same value of $d_i$.

#### A small thought experiment\.

Consider a single-layer model on a star graph $S_n$ with hub $0$ and leaves $1,\ldots,n-1$. Two encodings of degree are available:

1. (a) *Content:* concatenate $\log d_i$ to the input features, so $x_i=(x_i^{\rm raw},\,\log d_i)$, then run the GAT block, then LayerNorm.
2. (b) *Magnitude:* run the GAT block with $x_i^{\rm raw}$ alone, then LayerNorm, then apply PostDeg $\widetilde{h}_i=(c_i+\varepsilon)^{-1/2}\,\bar{h}_i$.

After LayerNorm, encoding (a) sets $\bar{h}_i^{(a)}=g\odot\bigl(z_i^{(a)}-\mu(z_i^{(a)})\bigr)/\sqrt{\sigma^{2}(z_i^{(a)})+\varepsilon_{\rm LN}}+b$, where $z_i^{(a)}$ is the GAT output for the content-augmented inputs. The $\log d_i$ feature contributes to both $\mu$ and $\sigma^{2}$ and is therefore re-centered and re-scaled by the LayerNorm. Encoding (b) leaves $\bar{h}_i^{(b)}=g\odot\bigl(z_i^{(b)}-\mu(z_i^{(b)})\bigr)/\sqrt{\sigma^{2}(z_i^{(b)})+\varepsilon_{\rm LN}}+b$ untouched and then multiplies by $(c_i+\varepsilon)^{-1/2}$, which is preserved exactly by the score head.
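The difference between the two encodings can be made concrete on the hub and one leaf of a star graph. The sketch below is an illustration under several assumptions that are not in the paper: the GAT block is replaced by the identity, LayerNorm uses unit gain and zero bias, the raw features `x_raw` are hypothetical, the centrality is taken as $c_i=d_i/d_{\max}$, and the PostDeg epsilon is set to `1e-8`.

```python
import math

def layer_norm(z, eps=1e-5):
    """LayerNorm over one feature vector (unit gain, zero bias)."""
    mu = sum(z) / len(z)
    var = sum((v - mu) ** 2 for v in z) / len(z)
    return [(v - mu) / math.sqrt(var + eps) for v in z]

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

x_raw = [0.3, -0.8, 1.1]           # hypothetical raw features, shared by both nodes
degrees = {"leaf": 1, "hub": 9}    # star graph S_10: one hub, nine leaves
d_max = max(degrees.values())
eps = 1e-8                         # assumed PostDeg epsilon

norm_a, norm_b = {}, {}
for name, d_i in degrees.items():
    # (a) Content: append log d_i, then LayerNorm (GAT block omitted here).
    h_a = layer_norm(x_raw + [math.log(d_i)])
    norm_a[name] = l2_norm(h_a)
    # (b) Magnitude: LayerNorm first, then scale by (c_i + eps)^(-1/2),
    # using the assumed centrality c_i = d_i / d_max.
    c_i = d_i / d_max
    h_b = [(c_i + eps) ** -0.5 * v for v in layer_norm(x_raw)]
    norm_b[name] = l2_norm(h_b)

print("content   norms:", {k: round(v, 3) for k, v in norm_a.items()})
print("magnitude norms:", {k: round(v, 3) for k, v in norm_b.items()})
```

In this simplified setting the content encoding yields nearly the same output norm for hub and leaf, because LayerNorm fixes the per-node scale, whereas the magnitude encoding separates them by the ratio of their PostDeg scales.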

#### Where the encodings diverge\.

Encodings (a) and (b) coincide only in the strict regime where (i) the score head is positively homogeneous in its input, (ii) the LayerNorm gain $g$ is uniform across coordinates, and (iii) the GAT output $z_i^{(a)}$ is rank-one in the $\log d_i$ direction. None of these hold in our backbone: (i) the score head includes a softmax-like greedy argmax over candidate nodes which is not positively homogeneous, (ii) $g$ is learned per coordinate, and (iii) the GAT output mixes $\log d_i$ with $x_i^{\rm raw}$ through learned weights. The two encodings therefore implement different functions of $d_i$ in the score head, and there is no a priori reason to expect (a) to capture the magnitude channel that (b) provides.

#### What the $\log d_i$ run constrains.

The supplemental run now compares (a) and (b) directly on the cross-backbone pipeline. The $\log d_i$ concatenation baseline captures part of the InfluMax and Dismantle gain but remains below PostDeg (Appendix Table [A35](https://arxiv.org/html/2606.14022#A6.T35)). The result fits the content-vs-magnitude distinction: degree content helps, but it does not close the gap to the post-LN magnitude channel. The main grid also keeps degree-correlated content features in every method (Section [4.1](https://arxiv.org/html/2606.14022#S4.SS1)), so the PostDeg gain is measured on top of structural content.

#### Empirical algebra\.

On the same training graphs, the empirical post-LN scale variance is $\mathrm{Var}(s_i)\approx 5.26$ on InfluMax (Appendix Table [A22](https://arxiv.org/html/2606.14022#A4.T22)). For a $\log d_i$-concatenation baseline (encoding (a)) to match the magnitude channel injected by PostDeg (encoding (b)) in expectation, the corresponding learned coefficient $\beta_{\log d}$ on the $\log d_i$ feature would have to satisfy $\beta_{\log d}^{2}\,\mathrm{Var}(\log d_i)\approx\mathrm{Var}(s_i)$, i.e., $|\beta_{\log d}|\gtrsim\sqrt{5.26/0.61}\approx 2.94$ for InfluMax (where $\mathrm{Var}(\log d_i)\approx 0.61$ on training graphs). An explicit content baseline would need a coefficient of that order to recover the magnitude channel before LayerNorm renormalizes it. Placing a multiplier on the LayerNorm output side gives the score head that magnitude channel directly.
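The coefficient bound follows from a one-line variance-matching computation. A short sketch, where the helper name `required_coefficient` is hypothetical and the two variances are the values quoted in the text:

```python
import math

def required_coefficient(var_target, var_feature):
    """Smallest |beta| such that beta^2 * Var(feature) matches Var(target)."""
    return math.sqrt(var_target / var_feature)

var_scale = 5.26   # Var(s_i) on InfluMax, as quoted in the text
var_log_d = 0.61   # Var(log d_i) on training graphs, as quoted in the text

beta_min = required_coefficient(var_scale, var_log_d)
print(f"|beta_log_d| >~ {beta_min:.2f}")   # prints 2.94
```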

## Appendix I DD collapse

DD has 1178 graphs, 691 in the larger class and 487 in the smaller, giving a majority-class rate of $691/1178\approx 58.66\%$. The observed accuracies for PairNorm ($58.7\pm 0.0\%$), InstanceNorm ($58.7\pm 0.0\%$), and BatchNorm ($58.8\pm 0.2\%$) are within seed-noise of this rate; the BatchNorm value is slightly above because BatchNorm occasionally flips per-fold to the other constant predictor. The (near-)zero seed standard deviation on the first two methods is consistent with a constant predictor. Per-fold values are in Table [A31](https://arxiv.org/html/2606.14022#A4.T31); the collapse is fold-uniform across all 10 folds and 10 seeds.
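The majority-class rate is what a constant predictor scores on the full label set. A small sketch using only the class counts quoted above (the label encoding `0`/`1` is an arbitrary choice for illustration):

```python
# Class counts for DD as quoted in the text.
n_major, n_minor = 691, 487
labels = [0] * n_major + [1] * n_minor

# A constant predictor that always outputs the majority class.
predictions = [0] * len(labels)
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

print(f"constant-predictor accuracy = {100 * accuracy:.1f}%")   # prints 58.7%
```

Per-fold rates can differ slightly from the pooled rate depending on how the folds are stratified, which this sketch does not model.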

## Appendix J Code-name reproducibility map

The notation glossary (symbols, default values, source locations) is given at the front of this appendix in Section [A](https://arxiv.org/html/2606.14022#A1). Several method names recur in the codebase under legacy identifiers; the table below maps those legacy code names to paper names.

Table A39: Reproducibility map from code names to paper names. Legacy names are retained only to identify existing CSV rows, checkpoints, and scripts.

- PostDeg and its learned variants: the set {PostDeg, PostDeg-L-FG, PostDeg-L-Adaptive}. These three operators all act on the post-LN representation magnitude.
- PostDeg: the recommended fixed operator with exponent $1/2$ and no learned parameters.
- Degree-separation identity: Eq. ([3](https://arxiv.org/html/2606.14022#S3.E3)), $R_{ij}=\bigl((c_i+\varepsilon)/(c_j+\varepsilon)\bigr)^{\gamma}$.
- PostDeg-L-FG, PostDeg-L-Adaptive: learned PostDeg ablations with fixed-gate and adaptive-gate variants. Diagnostics, not recommended operators.

## NeurIPS Paper Checklist

1. Claims
   - Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
   - Answer: [Yes]
   - Justification: The abstract and introduction now frame the contribution as post-LayerNorm degree scaling, with PostDeg as the recommended fixed operator and learned same-slot variants used as diagnostics. The stated empirical claims match Table [A4](https://arxiv.org/html/2606.14022#A4.T4), Table [4](https://arxiv.org/html/2606.14022#S4.T4), and the limitations discussion.
2. Limitations
   - Question: Does the paper discuss the limitations of the work performed by the authors?
   - Answer: [Yes]
   - Justification: Section [5](https://arxiv.org/html/2606.14022#S5) discusses the controlled GAT-only evaluation, mostly synthetic graph regimes, limited transfer range, unresolved spectral modulation, and missing explicit degree-feature controls.
3. Theory assumptions and proofs
   - Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
   - Answer: [Yes]
   - Justification: Section [3](https://arxiv.org/html/2606.14022#S3) states the placement rule and the main empirical consequence (Eq. [3](https://arxiv.org/html/2606.14022#S3.E3)); Appendix [C](https://arxiv.org/html/2606.14022#A3) gives assumptions and proofs in named-environment form (Theorem [3](https://arxiv.org/html/2606.14022#Thmtheorem3), Proposition [16](https://arxiv.org/html/2606.14022#Thmtheorem16), Corollary [5](https://arxiv.org/html/2606.14022#Thmtheorem5), Lemma [6](https://arxiv.org/html/2606.14022#Thmtheorem6), plus Lemmas [1](https://arxiv.org/html/2606.14022#Thmtheorem1)–[11](https://arxiv.org/html/2606.14022#Thmtheorem11), Propositions [7](https://arxiv.org/html/2606.14022#Thmtheorem7)–[10](https://arxiv.org/html/2606.14022#Thmtheorem10), and a “what this work does not say” Remark).
4. Experimental result reproducibility
   - Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
   - Answer: [Yes]
   - Justification: Appendix [D](https://arxiv.org/html/2606.14022#A4) specifies graph generators, budgets, IC/SIR parameters, reward definitions, evaluation sizes, seeds, features, training hyperparameters, and the DD cross-validation protocol.
5. Open access to data and code
   - Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
   - Answer: [Yes]
   - Justification: The supplementary package includes anonymized experiment code, table/figure scripts, configuration shell scripts, and the CSV files used to generate the reported results. Synthetic graph data are generated by code; DD is downloaded from the public TU dataset distribution.
6. Experimental setting/details
   - Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer) necessary to understand the results?
   - Answer: [Yes]
   - Justification: Section [4.1](https://arxiv.org/html/2606.14022#S4.SS1) gives the controlled setup, and Appendix [D](https://arxiv.org/html/2606.14022#A4) provides optimizer settings, training budgets, evaluation graph sizes, seed lists, fold construction for DD, and the fixed hyperparameter protocol.
7. Experiment statistical significance
   - Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
   - Answer: [Yes]
   - Justification: Tables report mean $\pm$ standard deviation over 10 seeds or folds. Appendix Table [A14](https://arxiv.org/html/2606.14022#A4.T14) reports paired Wilcoxon signed-rank tests, Table [A13](https://arxiv.org/html/2606.14022#A4.T13) reports two one-sided TOST equivalence tests at three relative margins, Tables [A16](https://arxiv.org/html/2606.14022#A4.T16) and [A17](https://arxiv.org/html/2606.14022#A4.T17) report Cohen’s $d$ and per-seed win rates, and Table [A15](https://arxiv.org/html/2606.14022#A4.T15) reports the full pairwise post-LN-vs-baseline matrix.
8. Experiments compute resources
   - Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
   - Answer: [Yes]
   - Justification: Appendix [D](https://arxiv.org/html/2606.14022#A4) reports per-job CPU and memory allocation, approximate runtime per run, total completed-run compute, and whether preliminary/failed runs are included.
9. Code of ethics
   - Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics?
   - Answer: [Yes]
   - Justification: The work uses generated graphs and a public benchmark dataset, does not involve human subjects or sensitive personal data, and is presented as a general modeling method rather than a deployed decision system.
10. Broader impacts
    - Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
    - Answer: [Yes]
    - Justification: Section [6](https://arxiv.org/html/2606.14022#S6) discusses potential benefits for graph optimization and possible misuse in targeting influential or bridge nodes in human networks.
11. Safeguards
    - Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pre-trained language models, image generators, or scraped datasets)?
    - Answer: [N/A]
    - Justification: The paper does not release a high-risk pretrained model, scraped dataset, or generative model. The released assets are experiment code and generated-result tables.
12. Licenses for existing assets
    - Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
    - Answer: [Yes]
    - Justification: The paper credits the existing methods and identifies DD as coming from the TU graph-kernel/TUDataset distribution. DD is downloaded from the upstream distribution during preprocessing, used under its GPL license terms, and not redistributed as a modified dataset; the supplementary instructions point users to the upstream source.
13. New assets
    - Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
    - Answer: [Yes]
    - Justification: The new assets are code, configuration scripts, generated-result CSV files, and paper figures/tables; the supplementary package documents how they are generated and used.
14. Crowdsourcing and research with human subjects
    - Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
    - Answer: [N/A]
    - Justification: The paper does not involve crowdsourcing or human-subject experiments.
15. Institutional review board (IRB) approvals or equivalent for research with human subjects
    - Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
    - Answer: [N/A]
    - Justification: The paper does not involve human-subject experiments, so IRB review is not applicable.
16. Declaration of LLM usage
    - Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does *not* impact the core methodology, scientific rigor, or originality of the research, declaration is not required.
    - Answer: [N/A]
    - Justification: LLMs are not part of the PostDeg method, experiments, evaluation pipeline, or scientific claims. Any writing or formatting assistance does not affect the core methodology or originality.
