
# Oversmoothing as Representation Degeneracy in Neural Sheaf Diffusion
Source: [https://arxiv.org/html/2605.11178](https://arxiv.org/html/2605.11178)
Arif Dönmez$^{1,2,*}$, Axel Mosig$^{2,3}$, Ellen Fritsche$^{1,2,4}$, Katharina Koch$^{1,2}$

$^{1}$IUF – Leibniz Research Institute for Environmental Medicine, Düsseldorf, Germany; $^{2}$DNTOX GmbH, Düsseldorf, Germany; $^{3}$Bioinformatics Group, Ruhr University Bochum, Bochum, Germany; $^{4}$Swiss Centre for Applied Human Toxicology (SCAHT), Basel, Switzerland

$^{*}$Corresponding author: arif.doenmez@ruhr-uni-bochum.de

###### Abstract

Neural Sheaf Diffusion (NSD) generalizes diffusion-based Graph Neural Networks by replacing scalar graph Laplacians with sheaf Laplacians whose learned restriction maps define a task-adapted geometry. While the diffusion limit of NSD is known to be the space of global sections, the representation-theoretic structure of this harmonic space remains largely implicit. In this paper, we develop a quiver-theoretic interpretation of NSD by identifying cellular sheaves on graphs with representations of the associated incidence quiver. Under this correspondence, learned sheaf geometries become points in a finite-dimensional representation space. We prove that direct-sum decompositions of the underlying incidence-quiver representation induce corresponding decompositions of the harmonic space reached in the diffusion limit. This provides an algebraic interpretation of oversmoothing as representation degeneration: a conceptual framing where learned sheaves collapse toward trivial or low-complexity summands whose global sections fail to preserve discriminative information. Building on this viewpoint, we connect sheaf diffusion to stability, moduli, and moment-map principles from Geometric Invariant Theory. We introduce moment-map-inspired regularizers that bias learned restriction maps toward more balanced representation geometries, and we identify a structural obstruction in standard equal-stalk architectures: when $d_v = d_e$, the admissibility condition for learnable stability parameters forces the trivial all-object summand onto a stability wall. We show that non-uniform stalk dimensions remove this obstruction, making adaptive stability meaningful in principle. Empirical evaluations on heterophilic benchmarks are consistent with this mechanism: breaking stalk symmetry can reduce variance or improve validation behavior on some datasets, and adaptive stability regularization becomes more effective in selected rectangular settings. These results support the view that moment-map regularization is a structured but dataset-dependent geometric bias rather than a universal performance booster. Overall, our framework interprets oversmoothing not only as a spectral pathology, but as a degeneration phenomenon in the underlying representation geometry.

## 1 Introduction

Many Graph Neural Networks (GNNs) can be understood as discrete diffusion processes on graphs (Kipf and Welling, [2017](https://arxiv.org/html/2605.11178#bib.bib6); Veličković et al., [2018](https://arxiv.org/html/2605.11178#bib.bib7)). In message-passing architectures, nodes repeatedly aggregate information from their neighbors, effectively implementing a form of Laplacian smoothing. While this diffusion perspective explains the success of GNNs in propagating local information, it also exposes a fundamental limitation: repeated smoothing drives node representations toward indistinguishable limits. This phenomenon, known as oversmoothing, is especially problematic in deep architectures where node features collapse to low-dimensional or nearly constant subspaces (Li et al., [2018](https://arxiv.org/html/2605.11178#bib.bib4); Chen et al., [2020](https://arxiv.org/html/2605.11178#bib.bib5)). In the classical graph Laplacian setting, this collapse is a structural consequence of the underlying diffusion geometry: the heat equation asymptotically projects features onto the kernel of the Laplacian, which for standard connected graphs consists solely of the uninformative constant signal. However, this classical assumption that connected nodes should become identical is severely mismatched for heterophilic graphs, motivating the search for richer geometric structures that avoid collapsing into overly simple harmonic states.

Neural Sheaf Diffusion (NSD) addresses this by lifting the diffusion process into a vector-bundle-like structure, replacing the standard Laplacian with a sheaf Laplacian whose restriction maps are learned from the data (Hansen and Ghrist, [2019](https://arxiv.org/html/2605.11178#bib.bib8); Bodnar et al., [2022](https://arxiv.org/html/2605.11178#bib.bib1)). By learning a task-adapted geometry, NSD allows the network to natively model heterophily, as the diffusion limit (the space of global sections) can encode complex, signed, and multi-dimensional feature agreements rather than simple constants. However, optimizing these highly parameterized general sheaves is notoriously fragile (Barbero et al., [2022b](https://arxiv.org/html/2605.11178#bib.bib2), [a](https://arxiv.org/html/2605.11178#bib.bib3)). Furthermore, unconstrained gradient descent inherently exhibits an implicit bias (Rahaman et al., [2019](https://arxiv.org/html/2605.11178#bib.bib20); Shah et al., [2020](https://arxiv.org/html/2605.11178#bib.bib21); Arora et al., [2019](https://arxiv.org/html/2605.11178#bib.bib23)). Left unregularized, the optimizer may implicitly converge to trivial, identity-like restriction maps, collapsing the expressive capacity of the sheaf back into classical oversmoothing.

In this work, we formalize this failure mode by interpreting the learned sheaf geometry through the lens of quiver representation theory (Schiffler, [2014](https://arxiv.org/html/2605.11178#bib.bib12); Derksen and Weyman, [2005](https://arxiv.org/html/2605.11178#bib.bib13)). By identifying cellular sheaves on graphs as representations of an associated incidence quiver, we map the continuous parameters of the neural network directly into a finite-dimensional representation space. This allows us to provide a formal conceptual interpretation of asymptotic oversmoothing in NSD not merely as a spectral inevitability, but as a homological degeneration: it corresponds to the optimizer converging toward the basin of attraction of the trivial subrepresentation, $\mathcal{F}_{\mathrm{triv}}$.

To mitigate this representation collapse, we invoke Geometric Invariant Theory (GIT) (King, [1994](https://arxiv.org/html/2605.11178#bib.bib11); Mumford et al., [1994](https://arxiv.org/html/2605.11178#bib.bib14)). We use polystability as an algebraic model of nondegenerate sheaf geometry: representations that avoid certain destabilizing subobjects provide a principled target for regularizing learned sheaves. We introduce stability-aware moment-map regularizers that bias the learned restriction maps toward well-conditioned, expressive geometries. Crucially, our algebraic analysis uncovers a structural obstruction in standard sheaf architectures: when vertex and edge stalks have equal dimensions ($d_v = d_e$), the admissibility constraints of GIT force the trivial subrepresentation onto a stability wall. Consequently, learnable adaptive stabilization cannot strictly exclude this collapse mode.

By identifying and breaking this architectural symmetry through rectangular stalks, we remove the obstruction that prevents adaptive stability parameters from assigning nonzero weight to the trivial all-object summand. Empirically, our results suggest that this symmetry breaking changes the optimization geometry of General-NSD: rectangular architectures reduce variance or improve validation behavior on some WebKB datasets, and adaptive moment-map regularization becomes beneficial in settings where it was ineffective in the equal-stalk regime. These findings support the representation-geometric view that oversmoothing is sensitive to the decomposition structure of the learned sheaf representation, while also showing that moment-map regularization is a structured but dataset-dependent inductive bias rather than a universal performance booster.

Contributions. In summary, our core contributions are:

- Quiver-theoretic oversmoothing: We identify cellular sheaves on graphs with incidence-quiver representations and show that direct-sum decompositions of the learned representation induce decompositions of the harmonic space reached by sheaf diffusion.
- Representation degeneracy: We interpret oversmoothing as degeneration toward trivial or low-complexity representation summands whose global sections fail to preserve discriminative information.
- Moment-map regularization: We introduce non-gauge-fixed moment-map-inspired regularizers, including a central moment penalty, acting directly on the raw incidence restriction maps.
- Architectural symmetry breaking: We show that equal-stalk architectures place the trivial all-object summand on a stability wall for learnable $\theta$-stability, and we propose rectangular stalks as a natural extension that removes this obstruction. Preliminary experiments support the usefulness and dataset dependence of this representation-geometric bias.

## 2 Background

Graph Diffusion and Oversmoothing. Let $G=(V,E)$ be an undirected graph with $n=|V|$ nodes, and let $\mathbf{X}\in\mathbb{R}^{n\times d}$ denote a node signal matrix. Standard graph diffusion is governed by the heat equation $\dot{\mathbf{X}}(t) = -\Delta\mathbf{X}(t)$, where $\Delta$ is the normalized graph Laplacian. As $t\to\infty$, the signal converges to the orthogonal projection onto $\ker(\Delta)$. For a connected graph, the scalar graph Laplacian has a one-dimensional kernel corresponding to globally constant signals, up to the usual degree weighting in the symmetrically normalized case. This drives all node features toward a common global average, structurally explaining the oversmoothing phenomenon in deep Graph Neural Networks (GNNs) (Li et al., [2018](https://arxiv.org/html/2605.11178#bib.bib4); Chen et al., [2020](https://arxiv.org/html/2605.11178#bib.bib5)).
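
To make the classical collapse concrete, the following minimal numpy sketch (a toy path graph and forward-Euler integration; not code from the paper) integrates $\dot{\mathbf{X}}(t) = -\Delta\mathbf{X}(t)$ and shows the per-feature spread of the node signals shrinking toward zero as features converge to the constant kernel.

```python
import numpy as np

# Toy path graph on 4 nodes (illustrative only; the paper's experiments use WebKB graphs).
A = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
deg = A.sum(axis=1)
L = np.diag(deg) - A                      # combinatorial graph Laplacian

X = np.random.default_rng(0).standard_normal((4, 2))   # random node features
dt = 0.05
for _ in range(2000):                     # forward-Euler integration of X'(t) = -L X(t)
    X -= dt * (L @ X)

print(np.ptp(X, axis=0))                  # per-feature spread ≈ 0: all nodes collapse together
```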

Cellular Sheaves. A *cellular sheaf* $\mathcal{F}$ over a graph $G$ equips the network with a richer structure by assigning a vector space $\mathcal{F}(v)$ to each vertex $v\in V$, a vector space $\mathcal{F}(e)$ to each edge $e\in E$, and a linear restriction map $\mathcal{F}_{v\trianglelefteq e}:\mathcal{F}(v)\to\mathcal{F}(e)$ for every incidence $v\trianglelefteq e$ (Curry, [2014](https://arxiv.org/html/2605.11178#bib.bib10); Shepard, [1985](https://arxiv.org/html/2605.11178#bib.bib24)). The space of $0$-cochains, or vertex-level sheaf signals, is $C^0(G;\mathcal{F})=\bigoplus_{v\in V}\mathcal{F}(v)$, and the space of $1$-cochains is $C^1(G;\mathcal{F})=\bigoplus_{e\in E}\mathcal{F}(e)$. Choosing an arbitrary edge orientation $e=(u,v)$, the coboundary operator $\delta_{\mathcal{F}}:C^0(G;\mathcal{F})\to C^1(G;\mathcal{F})$ is defined component-wise as $(\delta_{\mathcal{F}}\mathbf{x})_e = \mathcal{F}_{v\trianglelefteq e}\mathbf{x}_v - \mathcal{F}_{u\trianglelefteq e}\mathbf{x}_u$. It measures the disagreement of neighboring vertex data after both have been transported into the corresponding edge stalk.
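
The block structure of $\delta_{\mathcal{F}}$ is easy to assemble explicitly. The sketch below is an illustrative numpy construction; the helper name `sheaf_coboundary`, the toy triangle graph, and the random restriction maps are our own assumptions, not the paper's implementation.

```python
import numpy as np

def sheaf_coboundary(edges, restriction, dv, de):
    """Assemble the block matrix of delta_F for a cellular sheaf on a graph.

    edges       : list of oriented edges (u, v), indexed by position
    restriction : dict mapping (vertex, edge_index) -> (de x dv) matrix F_{v ⊴ e}
    """
    n_vertices = 1 + max(max(e) for e in edges)
    delta = np.zeros((de * len(edges), dv * n_vertices))
    for idx, (u, v) in enumerate(edges):
        rows = slice(de * idx, de * (idx + 1))
        delta[rows, dv * v: dv * (v + 1)] += restriction[(v, idx)]   # + F_{v ⊴ e} x_v
        delta[rows, dv * u: dv * (u + 1)] -= restriction[(u, idx)]   # - F_{u ⊴ e} x_u
    return delta

# Toy data: a triangle graph with random 2x3 restriction maps (dv = 3, de = 2).
rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (0, 2)]
dv, de = 3, 2
F = {(w, i): rng.standard_normal((de, dv)) for i, (u, v) in enumerate(edges) for w in (u, v)}
delta = sheaf_coboundary(edges, F, dv, de)
print(delta.shape)    # (de*|E|, dv*|V|) = (6, 9)
```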

Sheaf Laplacians and NSD. The sheaf Laplacian is defined as $\Delta_{\mathcal{F}} = \delta_{\mathcal{F}}^{\ast}\delta_{\mathcal{F}}$ (Hansen and Ghrist, [2019](https://arxiv.org/html/2605.11178#bib.bib8)). It is positive semidefinite, and its kernel coincides exactly with the space of *global sections* $H^0(G;\mathcal{F}) = \ker(\delta_{\mathcal{F}}) = \ker(\Delta_{\mathcal{F}})$. A global section represents an assignment of local vertex data that is perfectly consistent across all edges after transport through the sheaf maps. Neural Sheaf Diffusion (NSD) generalizes classical graph diffusion by treating the restriction maps $\mathcal{F}_{v\trianglelefteq e}$ as learnable parameters (Bodnar et al., [2022](https://arxiv.org/html/2605.11178#bib.bib1)). The continuous sheaf diffusion equation $\dot{\mathbf{X}}(t) = -\Delta_{\mathcal{F}}\mathbf{X}(t)$ converges asymptotically to $\Pi_{H^0(G;\mathcal{F})}\mathbf{X}(0)$. Thus, the expressive capacity of deep sheaf diffusion is governed by the algebraic structure of $H^0(G;\mathcal{F})$.
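
Continuing the sketch above (same hypothetical `sheaf_coboundary`, `edges`, `F`, `dv`, `de`, and `rng`), the sheaf Laplacian and the diffusion limit follow directly: the limit is the orthogonal projection of a stacked 0-cochain onto $\ker(\Delta_{\mathcal{F}})$.

```python
# Sheaf Laplacian Δ_F = δ_F^T δ_F (real-valued convention) and its kernel H^0(G; F).
delta = sheaf_coboundary(edges, F, dv, de)
L_F = delta.T @ delta                            # positive semidefinite block matrix

w, U = np.linalg.eigh(L_F)
H0 = U[:, w < 1e-9]                              # orthonormal basis of ker(Δ_F) = H^0(G; F)

x0 = rng.standard_normal(L_F.shape[0])           # initial stacked vertex features (one channel)
x_inf = H0 @ (H0.T @ x0)                         # limit of x'(t) = -Δ_F x(t) as t → ∞
print(H0.shape[1], np.linalg.norm(L_F @ x_inf))  # dim H^0 and residual ≈ 0
```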

Quiver Representations. To analyze this algebraic structure, we use quiver representation theory (Schiffler, [2014](https://arxiv.org/html/2605.11178#bib.bib12); Derksen and Weyman, [2005](https://arxiv.org/html/2605.11178#bib.bib13)). A quiver $Q=(Q_0,Q_1)$ consists of vertices $Q_0$ and directed arrows $Q_1$. A representation $M$ of $Q$ assigns a vector space $M_i$ to each $i\in Q_0$ and a linear map $M_a: M_{s(a)}\to M_{t(a)}$ to each arrow $a\in Q_1$, where $s(a)$ and $t(a)$ denote source and target. For a fixed dimension vector $\mathbf{d}=(\dim M_i)_{i\in Q_0}$, the space of all such representations is $\operatorname{Rep}(Q,\mathbf{d}) = \prod_{a\in Q_1}\operatorname{Hom}\big(k^{d_{s(a)}}, k^{d_{t(a)}}\big)$. The base-change, or gauge, group $G_{\mathbf{d}} = \prod_{i\in Q_0} GL(d_i)$ acts on this space by changing bases in the assigned vector spaces. The central observation of this work is that cellular sheaves on a graph $G$ can be identified exactly with representations of the graph's *incidence quiver*. This links the diffusion limit of NSD to direct-sum decompositions, stability, and moduli of quiver representations.

## 3 Cellular Sheaves as Quiver Representations

To analyze the representation geometry of Neural Sheaf Diffusion, we first make explicit the dictionary between cellular sheaves on a graph and quiver representations. This correspondence identifies the learnable restriction maps of an NSD model with representation data in a finite-dimensional algebraic space.

The Incidence Quiver. For an undirected graph $G=(V,E)$, define its *incidence quiver* $Q_G=(Q_0,Q_1)$ as the bipartite quiver with vertex set $Q_0 = V \sqcup E$. For every incidence $v\trianglelefteq e$, we introduce one directed arrow $a_{v,e}: v\to e$. Thus, $Q_1 = \{a_{v,e}: v\to e \mid v\trianglelefteq e\}$. The quiver $Q_G$ should not be confused with an orientation of the original graph: graph edges become objects of the quiver, and each such edge-object receives arrows from its incident endpoints. For a graph edge $e=\{u,v\}$, the incidence quiver therefore contains two arrows $u\to e$ and $v\to e$.

The Correspondence. A representation $M$ of $Q_G$ assigns a vector space $M_v$ to each graph vertex $v\in V$, a vector space $M_e$ to each graph edge $e\in E$, and a linear map $M_{a_{v,e}}: M_v\to M_e$ to each incidence arrow. This is exactly the data of a cellular sheaf $\mathcal{F}$ on $G$: we set $M_v=\mathcal{F}(v)$, $M_e=\mathcal{F}(e)$, and $M_{a_{v,e}}=\mathcal{F}_{v\trianglelefteq e}$. Conversely, every representation of $Q_G$ defines a cellular sheaf by the same assignment. Hence the category of finite-dimensional cellular sheaves on $G$ is equivalent to the category of finite-dimensional representations of the incidence quiver $Q_G$.

Dimension Vectors and Parameter Space. In Neural Sheaf Diffusion, the stalk dimensions are architectural choices. Writing $d_v=\dim\mathcal{F}(v)$ and $d_e=\dim\mathcal{F}(e)$ gives a dimension vector $\mathbf{d} = \big((d_v)_{v\in V}, (d_e)_{e\in E}\big)$. For this fixed dimension vector, the space of cellular sheaves with these stalk dimensions is the quiver representation space $\operatorname{Rep}(Q_G,\mathbf{d}) = \prod_{v\trianglelefteq e}\operatorname{Hom}\big(k^{d_v}, k^{d_e}\big)$. Thus, training the restriction maps of an NSD model amounts to optimizing a point in $\operatorname{Rep}(Q_G,\mathbf{d})$.
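
As a rough sanity check on the size of this space (a back-of-the-envelope sketch; actual NSD models predict restriction maps from node features rather than storing them as free parameters), each undirected edge contributes two incidences, so for uniform stalk dimensions the raw count is $2|E|\,d_v d_e$.

```python
# dim Rep(Q_G, d) = Σ_{v ⊴ e} d_v · d_e = 2 · |E| · d_v · d_e for uniform stalk dimensions.
def rep_space_dim(num_edges, dv, de):
    return 2 * num_edges * dv * de

print(rep_space_dim(num_edges=5, dv=3, de=2))   # 60 raw restriction-map entries
```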

Gauge Equivalence and Moduli. The matrix representation of a sheaf depends on choices of bases in the stalks. Changing bases by $g_v\in GL(d_v)$ and $g_e\in GL(d_e)$ transforms each restriction map by $\mathcal{F}_{v\trianglelefteq e}\longmapsto g_e\,\mathcal{F}_{v\trianglelefteq e}\,g_v^{-1}$. This is precisely the base-change action of the gauge group $G_{\mathbf{d}} = \prod_{v\in V} GL(d_v)\times\prod_{e\in E} GL(d_e)$ on $\operatorname{Rep}(Q_G,\mathbf{d})$. Therefore, the intrinsic learned sheaf geometry is not a single matrix tuple, but its orbit under this gauge action. The corresponding quotient, or moduli problem, is the natural representation-theoretic object underlying learned sheaf diffusion.
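
Numerically, the gauge action and its invariance are easy to check on the toy sheaf from the earlier sketches (hypothetical objects `edges`, `F`, `dv`, `de`, `rng`, and `sheaf_coboundary`): random base changes transform every restriction map as $g_e\,\mathcal{F}_{v\trianglelefteq e}\,g_v^{-1}$ and leave the rank of $\delta_{\mathcal{F}}$, hence $\dim H^0$, unchanged.

```python
# Random base changes in every stalk, shifted by 3·I so they are safely invertible.
g_v = {v: rng.standard_normal((dv, dv)) + 3 * np.eye(dv) for v in range(3)}        # 3 vertices
g_e = {i: rng.standard_normal((de, de)) + 3 * np.eye(de) for i in range(len(edges))}

F_gauged = {(v, i): g_e[i] @ F[(v, i)] @ np.linalg.inv(g_v[v]) for (v, i) in F}

# The rank of δ_F, hence dim H^0 = dv·|V| - rank, is a gauge invariant.
d1 = sheaf_coboundary(edges, F, dv, de)
d2 = sheaf_coboundary(edges, F_gauged, dv, de)
assert np.linalg.matrix_rank(d1) == np.linalg.matrix_rank(d2)
```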

## 4 The Kernel Decomposition Theorem

Having established the equivalence between cellular sheaves and representations of the incidence quiver $Q_G$, we can recast the diffusion limit in the language of homological algebra.

Global Sections as Categorical Limits. For a cellular sheaf $\mathcal{F}$ viewed as a representation diagram $M: Q_G\to\mathbf{Vect}$, the space of global sections is canonically isomorphic to the categorical limit of this diagram. Indeed, the limit consists of compatible assignments to all vertex and edge stalks, while $H^0(G;\mathcal{F}) = \ker(\delta_{\mathcal{F}})$ records the corresponding vertex assignments satisfying the compatibility equations. Thus the operation sending a sheaf to its harmonic space is the global section functor $\Gamma: \operatorname{Rep}(Q_G)\to\mathbf{Vect}$, $\mathcal{F}\longmapsto H^0(G;\mathcal{F})$. Equivalently, $\Gamma$ is the limit functor applied to the incidence-quiver representation. As a limit functor, it is left-exact; in particular, it preserves finite limits.

Functorial Decomposition. By the Krull–Schmidt theorem, every finite-dimensional representation in the Abelian category $\operatorname{Rep}(Q_G)$ admits a direct-sum decomposition into indecomposable objects, unique up to isomorphism and permutation: $\mathcal{F}\cong\bigoplus_{k=1}^K\mathcal{F}^{(k)}$. Since finite direct sums in $\mathbf{Vect}$ are biproducts, and since $\Gamma$ preserves finite products, the global section functor preserves this decomposition up to natural isomorphism. Applying $\Gamma$ gives $H^0(G;\mathcal{F})\cong\bigoplus_{k=1}^K H^0\big(G;\mathcal{F}^{(k)}\big)$. Thus the decomposition of the diffusion limit does not require a separate spectral argument or manual block-diagonalization of the sheaf Laplacian. It is a functorial consequence of viewing learned cellular sheaves as objects of the incidence-quiver representation category.
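
A small numerical check of this decomposition, reusing the hypothetical toy sheaf from the Section 3 sketches together with a second random sheaf `F2`: the restriction maps of $\mathcal{F}\oplus\mathcal{F}'$ are block-diagonal, so the kernel dimensions of the corresponding coboundaries add.

```python
# Second random sheaf on the same triangle graph, with stalk dimensions (dv2, de2).
dv2, de2 = 2, 2
F2 = {key: rng.standard_normal((de2, dv2)) for key in F}

# Direct sum: block-diagonal restriction maps of shape (de + de2, dv + dv2).
F_sum = {key: np.block([[F[key], np.zeros((de, dv2))],
                        [np.zeros((de2, dv)), F2[key]]]) for key in F}

def h0_dim(delta):                 # dim H^0 = #columns of δ_F minus its rank
    return delta.shape[1] - np.linalg.matrix_rank(delta)

assert h0_dim(sheaf_coboundary(edges, F_sum, dv + dv2, de + de2)) == \
       h0_dim(sheaf_coboundary(edges, F, dv, de)) + h0_dim(sheaf_coboundary(edges, F2, dv2, de2))
```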

Homological Interpretation of Oversmoothing. Continuous sheaf diffusion asymptotically projects node features onto the global-section space $H^0(G;\mathcal{F})$. Since the global section functor preserves finite direct sums, the information that survives diffusion decomposes into the harmonic contributions of the individual Krull–Schmidt summands. If the learned sheaf degenerates toward summands whose global sections are trivial, low-dimensional, or poorly aligned with the task, then the diffusion limit becomes correspondingly non-expressive. In this sense, oversmoothing is not merely a spectral artifact of repeated message passing, but a homological degeneration: the network learns a representation whose categorical limit is too simple to support heterophilic, task-relevant features.

Subrepresentations and Harmonic Injection. While the Krull–Schmidt theorem describes direct-sum decompositions, the relationship between subrepresentations and global sections is governed by the exactness properties of the limit functor.

###### Lemma 4.1 (Harmonic Injection).

Let $\mathcal{F}$ be a cellular sheaf on $G$ and let $\mathcal{F}'\hookrightarrow\mathcal{F}$ be a subrepresentation of the corresponding incidence quiver. Then there is an induced injection of global sections $H^0(G;\mathcal{F}')\hookrightarrow H^0(G;\mathcal{F})$.

###### Proof.

The global section functor $\Gamma:\operatorname{Rep}(Q_G)\to\mathbf{Vect}$ acts by taking the categorical limit of the representation diagram. Because the limit functor is left exact, it preserves monomorphisms. Therefore, the inclusion of the subrepresentation $\mathcal{F}'\hookrightarrow\mathcal{F}$ induces an injective linear map $\Gamma(\mathcal{F}')\hookrightarrow\Gamma(\mathcal{F})$. ∎

This lemma fundamentally links stability to oversmoothing: if the trivial all-object representation $\mathcal{F}_{\mathrm{triv}}$ exists merely as a subrepresentation of $\mathcal{F}$, the left exactness of $\Gamma$ guarantees that the constant-signal harmonic space of $\mathcal{F}_{\mathrm{triv}}$ injects directly into $H^0(G;\mathcal{F})$. Thus, excluding $\mathcal{F}_{\mathrm{triv}}$ via stability provides an algebraic way to rule out this specific constant-signal collapse mode.

## 5 Representation Stability and Oversmoothing

Having interpreted oversmoothing as degeneration of the learned incidence-quiver representation, we now turn to stability and moduli as a language for distinguishing controlled sheaf geometries from degenerate ones.

The Trivial Subrepresentation. To understand why representation degeneracy relates to oversmoothing, consider the extreme case of the *trivial subrepresentation*, denoted $\mathcal{F}_{\mathrm{triv}}$. If $\mathcal{F}_{\mathrm{triv}}$ exists as a subrepresentation of a learned sheaf $\mathcal{F}$, there exists a one-dimensional subspace $W_v\subset\mathbb{R}^{d_v}$ at every vertex $v$ and a one-dimensional subspace $W_e\subset\mathbb{R}^{d_e}$ at every edge $e$ such that the restriction maps act as an isomorphism between these subspaces. In the idealized trivial-line case, we may choose compatible bases $w_v\in W_v$ and $w_e\in W_e$ so that $\mathcal{F}_{v\trianglelefteq e}w_v = w_e$. For any signal $\mathbf{x}$ restricted to this subrepresentation, the node features take the form $\mathbf{x}_v = c_v w_v$ for some scalars $c_v\in\mathbb{R}$. The sheaf Dirichlet energy along an edge $e=(u,v)$ then evaluates to:

$$\left\|\mathcal{F}_{v\trianglelefteq e}\mathbf{x}_v - \mathcal{F}_{u\trianglelefteq e}\mathbf{x}_u\right\|^2 = \left\|c_v\big(\mathcal{F}_{v\trianglelefteq e}w_v\big) - c_u\big(\mathcal{F}_{u\trianglelefteq e}w_u\big)\right\|^2 = \left\|c_v w_e - c_u w_e\right\|^2 = (c_v - c_u)^2\,\|w_e\|^2.$$

Thus, on $\mathcal{F}_{\mathrm{triv}}$, the sheaf Dirichlet energy collapses entirely into the ordinary scalar graph Laplacian energy. For a connected graph, global sections within this subspace must satisfy $c_v = c_u$, forcing the harmonic space to consist entirely of constant signals.

Optimization Bias and Collapse. While the algebraic existence of $\mathcal{F}_{\mathrm{triv}}$ does not strictly dictate that the entire harmonic space is trivial, the empirical reality of training neural networks makes it a prominent failure mode. Neural Sheaf Diffusion models are optimized via gradient descent, which exhibits a well-documented implicit bias toward low-complexity and low-rank solutions (Rahaman et al., [2019](https://arxiv.org/html/2605.11178#bib.bib20); Gunasekar et al., [2017](https://arxiv.org/html/2605.11178#bib.bib22)). Because $\mathcal{F}_{\mathrm{triv}}$ provides a trivially smooth energy minimum, where globally constant signals yield zero Dirichlet energy, optimizers are prone to relying on these trivial features at the expense of more complex ones (Shah et al., [2020](https://arxiv.org/html/2605.11178#bib.bib21)). If $\mathcal{F}_{\mathrm{triv}}$ appears prominently in the learned Krull–Schmidt summands, these constant harmonic signals inject directly into the diffusion limit, reproducing classical oversmoothing rather than learning a complex, task-adapted geometry.

Geometric Intuition for Degeneration. From the viewpoint of quiver invariant theory, learned sheaf parameters live in an affine representation space $\operatorname{Rep}(Q_G,\mathbf{d})$ equipped with a gauge-group action. Geometrically, the gauge orbits of representations in this space are often not closed; their boundaries (orbit closures) can contain direct sums of simpler, lower-complexity objects. Because $\mathcal{F}_{\mathrm{triv}}$ yields a globally constant harmonic space with zero Dirichlet energy, it represents a trivially smooth minimum in the loss landscape. If the orbit closures containing such trivial components are geometrically accessible or dense in the relevant parameter regimes, the simplicity bias of gradient descent can naturally pull the optimization trajectory toward these degenerate boundaries. While we do not claim that unregularized learning universally converges to a pure direct sum $\bigoplus\mathcal{F}_{\mathrm{triv}}$, viewing this collapse mode as a geometric degeneration provides a concrete algebraic model for how classical oversmoothing manifests. This intuition motivates the use of stability and moment-map regularizers to explicitly bias the learned geometry away from these degenerate regions.

Representation Degeneracy. By the kernel decomposition theorem, any direct-sum decomposition $\mathcal{F}\cong\bigoplus_k\mathcal{F}^{(k)}$ induces a decomposition $H^0(G;\mathcal{F})\cong\bigoplus_k H^0\big(G;\mathcal{F}^{(k)}\big)$. Therefore, if many summands in the learned geometry contribute only constant-like or low-dimensional global sections due to this implicit optimization bias, the diffusion limit becomes non-expressive. We refer to this interpretation as *representation degeneracy*: the learned sheaf geometry decomposes into summands whose harmonic spaces are too simple or poorly aligned with the prediction task.

King Stability. King stability provides an algebraic way to exclude certain subrepresentations from a representation class (King, [1994](https://arxiv.org/html/2605.11178#bib.bib11)). Fix a stability parameter $\theta\in\mathbb{R}^{Q_0}$ satisfying $\theta\cdot\dim\mathcal{F} = 0$. Using the convention that a representation is $\theta$-semistable if every proper subrepresentation $\mathcal{F}'\subset\mathcal{F}$ satisfies $\theta\cdot\dim\mathcal{F}'\geq 0$, a subrepresentation $\mathcal{F}'$ with $\theta\cdot\dim\mathcal{F}' < 0$ is forbidden. Thus, if $\theta$ is chosen so that $\theta(\mathcal{F}_{\mathrm{triv}}) < 0$, then a $\theta$-semistable sheaf cannot contain the trivial representation even as a subrepresentation. Setting $\theta(\mathcal{F}_{\mathrm{triv}}) < 0$ structurally ensures that the representation is algebraically devoid of this specific low-complexity object. This allows us to construct regularizers that bias the learned representation away from chambers that permit constant-signal collapse.
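
The bookkeeping behind King stability is plain linear arithmetic on dimension vectors. The sketch below (illustrative names and a toy two-vertex, one-edge quiver with rectangular stalks; not taken from the paper) checks admissibility of $\theta$ and whether a given subrepresentation dimension vector is forbidden.

```python
def theta_weight(theta, dim_vec):
    """θ-weight of a dimension vector over the quiver objects V ⊔ E."""
    return sum(theta[i] * dim_vec[i] for i in dim_vec)

def is_admissible(theta, dim_full, tol=1e-9):
    """Admissibility for the full representation: θ · dim F = 0."""
    return abs(theta_weight(theta, dim_full)) < tol

def forbidden_by(theta, dim_sub):
    """A θ-semistable representation cannot contain a subrep with θ · dim F' < 0."""
    return theta_weight(theta, dim_sub) < 0

# Toy quiver: two vertices and one edge with rectangular stalks d_v = 3, d_e = 2.
dim_full = {'v1': 3, 'v2': 3, 'e1': 2}
dim_triv = {'v1': 1, 'v2': 1, 'e1': 1}      # trivial all-object summand
theta = {'v1': 1.0, 'v2': 1.0, 'e1': -3.0}  # θ · dim_full = 3 + 3 - 6 = 0
print(is_admissible(theta, dim_full), forbidden_by(theta, dim_triv))  # True, True
```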

King stability controls subrepresentations directly. In degeneration limits, such destabilizing subobjects may appear as direct summands of the associated graded or polystable representative, which is the setting where the kernel decomposition theorem becomes directly visible. Thus, the stability viewpoint and the harmonic-space decomposition theorem address complementary aspects of the same representation-geometric collapse mechanism.

Caveats of Degeneration. We emphasize that excluding $\mathcal{F}_{\mathrm{triv}}$ prevents only one specific mode of homological collapse. A learned sheaf may also degenerate into a geometry where $H^0(G;\mathcal{F}) = 0$. In this scenario, all features decay to zero during diffusion. While $\theta$-stability provides a targeted tool to mathematically exclude the trivial constant-signal mode, it does not universally guarantee that the surviving harmonic space is highly expressive or perfectly aligned with the downstream task.

## 6 Moment-Map Regularization and Architectural Constraints

For the moment-map discussion, we use the standard complexified quiver-representation setting over $k=\mathbb{C}$. The real-valued implementation used in our experiments follows the same matrix formulas, replacing Hermitian adjoints by transposes.

To translate the stability viewpoint into a differentiable learning principle, we use the moment-map formulation of quiver moduli. After choosing inner products on all stalks, the compact gauge group $K_{\mathbf{d}} = \prod_{v\in V} U(d_v)\times\prod_{e\in E} U(d_e)$ acts unitarily on the representation space. For each incidence map $A_{v,e} := \mathcal{F}_{v\trianglelefteq e}$, the corresponding moment-map components are, up to sign convention, $\mu_v(\mathcal{F}) = -\sum_{e:\,v\trianglelefteq e} A_{v,e}^{\ast}A_{v,e}$ and $\mu_e(\mathcal{F}) = \sum_{v:\,v\trianglelefteq e} A_{v,e}A_{v,e}^{\ast}$. By the Kempf–Ness correspondence, solutions to the shifted moment-map equation $\mu_i(\mathcal{F}) = \theta_i I_{d_i}$ describe balanced representatives of polystable orbits in the complex gauge quotient (King, [1994](https://arxiv.org/html/2605.11178#bib.bib11); Kempf and Ness, [1979](https://arxiv.org/html/2605.11178#bib.bib19); Mumford et al., [1994](https://arxiv.org/html/2605.11178#bib.bib14); Kirwan, [1984](https://arxiv.org/html/2605.11178#bib.bib15)). This provides a mechanism for biasing the neural network toward $\theta$-polystable representation geometry during training.
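
In the real-valued convention used in the experiments (adjoints become transposes), the moment-map components are straightforward to accumulate; the sketch below reuses the hypothetical toy objects (`edges`, `F`, `dv`, `de`) and numpy import from the earlier sections.

```python
def moment_maps(edges, restriction, dv, de, n_vertices):
    """Real-valued moment-map components: μ_v = -Σ_{e: v⊴e} A^T A and μ_e = +Σ_{v: v⊴e} A A^T."""
    mu_v = {v: np.zeros((dv, dv)) for v in range(n_vertices)}
    mu_e = {i: np.zeros((de, de)) for i in range(len(edges))}
    for i, (u, v) in enumerate(edges):
        for w_ in (u, v):
            A = restriction[(w_, i)]
            mu_v[w_] -= A.T @ A       # vertex component accumulates -A^T A over incident edges
            mu_e[i] += A @ A.T        # edge component accumulates +A A^T over incident vertices
    return mu_v, mu_e

mu_v, mu_e = moment_maps(edges, F, dv, de, n_vertices=3)
```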

Central Moment Regularization. In our experiments, we use a conservative moment-map-inspired regularizer that does not require choosing or learning a specific stability chamber. Instead, we penalize the non-central part of the moment map: $\mathcal{R}_{\mathrm{cent}}(\mathcal{F}) = \sum_{i\in V\sqcup E}\big\|\mu_i(\mathcal{F}) - \tfrac{\operatorname{tr}(\mu_i(\mathcal{F}))}{d_i} I_{d_i}\big\|_F^2$. The training objective becomes $\mathcal{L} = \mathcal{L}_{\mathrm{task}} + \lambda_\mu\,\mathcal{R}_{\mathrm{cent}}(\mathcal{F})$. This penalty does not guarantee strict formal polystability, nor does it strictly forbid collapse. Rather, it acts as a soft geometric bias: it shifts the optimization landscape by pulling the learned incidence-quiver representation toward a more balanced region of the representation space. Empirically, this provides a robust geometric regularizer for the fully general equal-stalk sheaf model.
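
A minimal sketch of $\mathcal{R}_{\mathrm{cent}}$ under the same assumptions (numpy, the hypothetical `mu_v`, `mu_e` dictionaries from the previous sketch); in an actual training loop the penalty would of course be computed on differentiable tensors.

```python
def central_moment_penalty(mu_v, mu_e):
    """R_cent: squared Frobenius distance of each μ_i to its trace-normalized identity part."""
    penalty = 0.0
    for mu in list(mu_v.values()) + list(mu_e.values()):
        d = mu.shape[0]
        centered = mu - (np.trace(mu) / d) * np.eye(d)   # non-central part of μ_i
        penalty += np.sum(centered ** 2)                 # squared Frobenius norm
    return penalty

# Schematic training objective: loss = task_loss + lambda_mu * central_moment_penalty(mu_v, mu_e)
print(central_moment_penalty(mu_v, mu_e))
```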

Adaptive Stability and Implementation (ThetaMM). A more expressive variant dynamically navigates GIT chambers by learning a stability parameter $\theta_i$ for each graph object $i\in V\sqcup E$. We penalize the shifted residual $\mathcal{R}_{\theta\text{-}\mu}(\mathcal{F};\theta) = \sum_i\|\mu_i(\mathcal{F}) - \theta_i I_{d_i}\|_F^2$. Similar to the central penalty, minimizing this continuous loss encourages the restriction maps to approach a balanced moment-map condition, but does not strictly guarantee exact formal GIT semistability. To ensure the learned $\theta$ parameters remain mathematically valid for King stability, we explicitly enforce the admissibility condition $\theta\cdot\mathbf{d} = \sum_i d_i\theta_i = 0$. In practice, this is achieved during each forward pass by taking the unconstrained learnable parameters and projecting them onto the orthogonal complement of the dimension vector $\mathbf{d}$.
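
The admissibility projection is a one-line orthogonal projection onto the hyperplane $\theta\cdot\mathbf{d}=0$; the sketch below (hypothetical helper name, numpy in place of the training framework) illustrates it.

```python
import numpy as np

def project_admissible(theta_raw, dim_vec):
    """Project unconstrained θ onto the admissibility hyperplane θ · d = 0."""
    theta = np.asarray(theta_raw, dtype=float)
    d = np.asarray(dim_vec, dtype=float)
    return theta - (theta @ d) / (d @ d) * d     # subtract the component along d

theta = project_admissible([0.3, -0.1, 0.7], [3, 3, 2])
print(theta @ np.array([3., 3., 2.]))            # ≈ 0: admissible after projection
```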

Architectural Constraint. This adaptive setup exposes a strict architectural obstruction. To successfully destabilize the trivial subrepresentation, one needs $\theta(\mathcal{F}_{\mathrm{triv}}) = \sum_{v\in V}\theta_v + \sum_{e\in E}\theta_e < 0$ under the sign convention above. Simultaneously, the enforced admissibility condition requires $\sum_{v\in V} d_v\theta_v + \sum_{e\in E} d_e\theta_e = 0$. If the network uses uniform stalk dimensions, $d_v = d_e = d$, this condition reduces to $d\big(\sum_{v\in V}\theta_v + \sum_{e\in E}\theta_e\big) = 0$, and therefore strictly forces $\theta(\mathcal{F}_{\mathrm{triv}}) = 0$. Thus, in the standard equal-stalk architecture, the trivial subrepresentation lies exactly on a stability wall and cannot be excluded by any admissible $\theta$. To make adaptive chamber selection effective against this collapse mode, the architecture must break the proportionality between the full dimension vector and the trivial dimension vector by using non-uniform stalk dimensions, for instance $d_v\neq d_e$. In that case, the constraints $\theta\cdot\mathbf{d} = 0$ and $\theta(\mathcal{F}_{\mathrm{triv}}) < 0$ can hold simultaneously. This mathematically motivates rectangular restriction maps, such as $A_{v,e}:\mathbb{R}^{d_v}\to\mathbb{R}^{d_e}$ with $d_e < d_v$, as a natural structural extension for adaptive stability-aware sheaf diffusion.
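
The stability wall can be read off directly from that projection step: when all stalk dimensions are equal, the dimension vector is proportional to the all-ones trivial dimension vector, so any admissible $\theta$ assigns the trivial summand weight exactly zero. A small sketch, reusing the hypothetical `project_admissible` helper above:

```python
rng = np.random.default_rng(1)
d_equal = 3 * np.ones(7)                         # 7 quiver objects, all stalks of dimension 3
theta = project_admissible(rng.standard_normal(7), d_equal)
print(theta.sum())                               # ≈ 0: θ(F_triv) is pinned to the wall

d_rect = np.array([3., 3., 3., 3., 2., 2., 2.])  # rectangular stalks (d_v = 3, d_e = 2)
theta = project_admissible(rng.standard_normal(7), d_rect)
print(theta.sum())                               # generically ≠ 0: the wall is removed
```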

Summary: Oversmoothing in sheaf diffusion can be understood as a homological degeneration: the learned representation may contain low-complexity summands whose global sections are too simple to preserve task-relevant features. Moment-map regularization biases the learned sheaf toward more balanced representation geometry, while non-uniform stalk dimensions remove the equal-stalk obstruction that prevents adaptive stability parameters from assigning nonzero weight to the trivial all-object summand.

## 7 Empirical Evaluation of Representation Stability

We view these experiments as a targeted empirical probe of the representation-geometric principles derived in Section [6](https://arxiv.org/html/2605.11178#S6). Rather than serving as universal performance boosters, the results suggest that moment-map regularizers provide a structured and sensitive inductive bias whose effect depends on the underlying stalk-dimensional architecture.

Experimental Setup. We augment the GeneralSheaf model, i.e. Gen-NSD, with our proposed regularizers and evaluate on the standard WebKB heterophilic node classification datasets: Texas, Cornell, and Wisconsin. We compare two architectural regimes: the standard Square architecture, with $d_v = d_e = 3$, and a symmetry-broken Rectangular architecture, with $d_v = 3$ and $d_e = 2$. For explicit regularization, we test the soft central penalty (CentMM), which penalizes the non-central part of the moment map, $\mu_i(\mathcal{F}) - \tfrac{\operatorname{tr}(\mu_i(\mathcal{F}))}{d_i} I_{d_i}$, and the adaptive shifted moment-map penalty (ThetaMM), which learns stability parameters $\theta_i$ through residuals of the form $\mu_i(\mathcal{F}) - \theta_i I_{d_i}$. Both regularizers are applied directly to the raw incidence restriction maps $A_{v,e} = \mathcal{F}_{v\trianglelefteq e}$ during the forward pass, before constructing the sheaf Laplacian.

All models are evaluated using a fixed set of hyperparameters to rigorously test the structural constraints across datasets.

Table 1: Effect of architectural symmetry breaking and stability-aware regularization. Rectangular stalks ($d_v=3$, $d_e=2$) remove the equal-stalk obstruction. For ThetaMM, the penalty is more beneficial in the rectangular regime (achieving $81.76\%$ on Wisconsin), consistent with the stability-wall hypothesis.

Empirical Evidence for the Stability-Wall Mechanism. As shown in Table [1](https://arxiv.org/html/2605.11178#S7.T1), the interaction between architecture and regularization is strongly dataset-dependent, but consistent with the stability-wall mechanism discussed in Section [6](https://arxiv.org/html/2605.11178#S6). In the standard equal-stalk architecture, $d_v = d_e$, the admissibility condition $\theta\cdot\mathbf{d} = 0$ forces the trivial all-object subrepresentation to have zero $\theta$-weight. Thus, adaptive $\theta$-regularization cannot strictly exclude this collapse mode. In the square regime, ThetaMM decreases Texas performance from $78.65\%$ to $77.57\%$, suggesting that adaptive stability is not effective when this obstruction is present.

Breaking Symmetry Changes the Optimization Geometry. By using rectangular incidence maps with $d_v=3$ and $d_e=2$, the admissibility condition $\theta\cdot\mathbf{d} = 0$ and the destabilization condition $\theta(\mathcal{F}_{\mathrm{triv}}) < 0$ can hold simultaneously. Thus, non-uniform stalk dimensions make it possible in principle for adaptive stability parameters to assign a nonzero destabilizing weight to the trivial subrepresentation. The empirical effect depends on the dataset. On Cornell, the architectural intervention alone provides a useful geometric bias: the unregularized rectangular baseline matches the square baseline in mean test accuracy, while reducing variance from $\pm 8.82$ to $\pm 6.86$ and increasing validation accuracy from $80.17\%$ to $82.54\%$. On Texas, the rectangular bottleneck slightly decreases the unregularized baseline, but adding central moment regularization improves performance to $80.00\pm 5.01$. Most notably, Wisconsin gives the clearest positive signal for adaptive stability: in the rectangular regime, ThetaMM achieves the strongest result among the tested variants, improving over the square baseline from $80.20\pm 3.97$ to $81.76\pm 4.81$ and increasing validation accuracy from $80.25\%$ to $81.75\%$.

Together, these results support the central representation-geometric claim of the paper: oversmoothing is sensitive to the decomposition structure of the learned sheaf representation, and both architectural asymmetry and moment-map-inspired regularization can bias learning away from low-complexity harmonic geometries. The effect is not universal, but the rectangular experiments provide evidence that adaptive stability becomes more meaningful once the equal-stalk stability wall is removed.

Effect of Network Depth. Because oversmoothing is fundamentally a pathology of depth, we performed an ablation study on Wisconsin, scaling the number of message-passing layers $L\in\{2,4,8,16,32,64,128,256\}$. As shown in Figure [1](https://arxiv.org/html/2605.11178#S7.F1), at standard depths ($L=2,4$), the Rectangular ThetaMM model strictly outperforms the unregularized square baseline, demonstrating that adaptive stability actively guides the geometry before deep expressivity dominates. Interestingly, as depth increases up to $L=128$, both architectures exhibit remarkable resilience, maintaining test accuracies of $\sim 85\%$. Finally, attempting to scale to extreme depth ($L=256$) resulted in catastrophic numerical instability (NaN values) during the forward pass across all models, marking the numerical precision limits of the unconstrained sheaf Laplacian formulation. Ultimately, the rectangular model remains highly competitive up to this boundary ($85.88\%$ at $L=128$), confirming that the $d_e < d_v$ bottleneck constrains the representation space without destroying the network's capacity to learn expressive deep representations.

Figure 1: Test accuracy (%) across network depth $L$ on Wisconsin. The stability-aware Rectangular ThetaMM model outperforms the baseline (Square) at standard depths. Remarkably, both models demonstrate extreme depth resilience up to $L=128$. Attempting to scale to $L=256$ resulted in numerical explosion (NaN) across both architectures, establishing the numerical precision limit of the formulation.

Geometric Diagnostics. These diagnostics suggest that the regularizers alter the learned sheaf geometry. On Cornell, CentMM and ThetaMM reduce the final moment-map residual relative to unregularized training. We also observe that regularized runs tend to maintain larger Dirichlet energies than degenerate unregularized runs, consistent with reduced constant-signal collapse. We leave a more systematic spectral and harmonic-space analysis for future work.

## 8 Related Work

Oversmoothing in Graph Neural Networks. The success of message-passing GNNs (Kipf and Welling, [2017](https://arxiv.org/html/2605.11178#bib.bib6); Veličković et al., [2018](https://arxiv.org/html/2605.11178#bib.bib7)) is often accompanied by the oversmoothing problem, where repeated feature aggregation drives node representations toward indistinguishable limits (Li et al., [2018](https://arxiv.org/html/2605.11178#bib.bib4)). This phenomenon is commonly analyzed through spectral graph theory and Dirichlet energy (Chen et al., [2020](https://arxiv.org/html/2605.11178#bib.bib5)): repeated diffusion asymptotically projects features onto the kernel of the graph Laplacian. Various architectural interventions, such as residual connections, normalization, dropout, or edge dropping, have been proposed to delay or mitigate this collapse. However, these methods typically act at the level of feature propagation, whereas our work studies the algebraic structure of the harmonic space itself.

Topological Deep Learning and Neural Sheaves. Topological and geometric deep learning extend message passing beyond scalar graph Laplacians by enriching the spaces on which signals live (Bronstein et al., [2017](https://arxiv.org/html/2605.11178#bib.bib18)). Cellular sheaves (Hansen and Ghrist, [2019](https://arxiv.org/html/2605.11178#bib.bib8); Curry, [2014](https://arxiv.org/html/2605.11178#bib.bib10); Bredon, [1997](https://arxiv.org/html/2605.11178#bib.bib9)) provide a natural framework for this: they assign vector spaces to vertices and edges and compare local data through restriction maps. Neural Sheaf Diffusion (NSD) (Bodnar et al., [2022](https://arxiv.org/html/2605.11178#bib.bib1)) learns these restriction maps end-to-end, allowing graph diffusion to adapt to heterophilic structure. Subsequent variants, including Sheaf Attention Networks (Barbero et al., [2022b](https://arxiv.org/html/2605.11178#bib.bib2)) and sheaf neural networks with connection Laplacians (Barbero et al., [2022a](https://arxiv.org/html/2605.11178#bib.bib3)), further develop this perspective. However, fully general learned sheaves remain numerically delicate and can be difficult to optimize. Our work complements this line by studying the representation geometry of the learned restriction maps and by introducing moment-map-inspired regularization as a geometric bias on this representation space.

Implicit Bias and Representation Theory. Gradient-based optimization in overparameterized neural networks is known to exhibit implicit bias toward low-complexity or smooth solutions (Rahaman et al., [2019](https://arxiv.org/html/2605.11178#bib.bib20); Shah et al., [2020](https://arxiv.org/html/2605.11178#bib.bib21)). In matrix factorization and related linear models, this often appears as implicit regularization toward low-rank or low-complexity structure (Gunasekar et al., [2017](https://arxiv.org/html/2605.11178#bib.bib22); Arora et al., [2019](https://arxiv.org/html/2605.11178#bib.bib23)). We translate this perspective to Neural Sheaf Diffusion by viewing the learned restriction maps as a representation of the graph's incidence quiver (Schiffler, [2014](https://arxiv.org/html/2605.11178#bib.bib12); Derksen and Weyman, [2005](https://arxiv.org/html/2605.11178#bib.bib13)). This allows us to interpret oversmoothing as degeneration toward low-complexity harmonic summands.

Stability, Moduli, and Moment Maps. To formalize nondegenerate learned sheaf geometries, we draw on Geometric Invariant Theory (GIT) (Mumford et al., [1994](https://arxiv.org/html/2605.11178#bib.bib14); Kirwan, [1984](https://arxiv.org/html/2605.11178#bib.bib15)), Kempf–Ness theory (Kempf and Ness, [1979](https://arxiv.org/html/2605.11178#bib.bib19)), and King's stability for quiver representations (King, [1994](https://arxiv.org/html/2605.11178#bib.bib11)). In this language, stability distinguishes representation classes with controlled subrepresentation structure from degenerate ones, while moment maps provide a differentiable notion of balancedness under the unitary gauge action. To our knowledge, this work is the first to connect oversmoothing in Neural Sheaf Diffusion to quiver-representation degeneration and to propose moment-map-inspired regularization as a way to bias learned sheaf geometries away from trivial harmonic collapse.

## 9 Limitations

While this work provides a formal representation-geometric framework for understanding oversmoothing, we emphasize that mapping oversmoothing to representation degeneracy is a conceptual interpretation rather than a universal mechanism. Standard GNN failure modes are multifaceted, and while bounding the trivial subrepresentation limits one specific form of homological collapse, it does not guarantee high accuracy if the target task is uncorrelated with the underlying heterophilic graph structure. Furthermore, our empirical results serve as preliminary evidence for the stability-wall hypothesis; the improvements shown on the WebKB benchmarks indicate that moment-map penalties and rectangular stalks act as structured, dataset-dependent inductive biases rather than universal performance boosters. Finally, while our regularizers encourage the learned parameters toward balanced moment-map configurations, the continuous penalty $\mathcal{R}_{\theta\text{-}\mu}$ does not strictly enforce exact algebraic GIT semistability during gradient descent.

## 10 Conclusion

In this work, we developed a representation-theoretic perspective on oversmoothing in Neural Sheaf Diffusion. By identifying learned cellular sheaves with representations of the graph's incidence quiver, we showed that direct-sum decompositions of the learned representation induce corresponding decompositions of the harmonic space reached in the diffusion limit. This yields an algebraic interpretation of oversmoothing as representation degeneration: collapse toward trivial or low-complexity summands whose global sections fail to preserve discriminative information.

We connected this viewpoint to stability, moduli, and moment-map ideas from Geometric Invariant Theory. In particular, we identified the trivial all-object subrepresentation as a canonical collapse mode and showed that, in equal-stalk architectures, the admissibility constraint for learnable stability parameters forces this trivial summand onto a stability wall. This explains why adaptive $\theta$-based regularization is structurally limited in standard square-stalk NSD architectures. To translate these ideas into learning, we introduced moment-map-inspired regularizers acting directly on raw incidence restriction maps. Our empirical results suggest that central moment regularization and rectangular stalk architectures provide meaningful but dataset-dependent geometric biases. Rectangular stalks remove the equal-stalk obstruction and make adaptive stability more effective in some settings, while central moment regularization offers a robust soft regularizer for learned general sheaf geometry. Overall, our framework reframes oversmoothing not only as a spectral pathology of repeated message passing, but as a degeneration phenomenon in the representation geometry underlying learned sheaf diffusion. We hope this perspective opens a path toward moduli-aware graph neural networks with more principled control over their harmonic limits.

## References

- S. Arora, N. Cohen, W. Hu, and Y. Luo (2019) Implicit regularization in deep matrix factorization. In Advances in Neural Information Processing Systems, Vol. 32.
- F. Barbero, C. Bodnar, H. Sáez de Ocáriz Borde, M. M. Bronstein, P. Veličković, and P. Liò (2022a) Sheaf neural networks with connection Laplacians. In Topological, Algebraic and Geometric Learning Workshops 2022, Proceedings of Machine Learning Research, Vol. 196, pp. 28–36.
- F. Barbero, C. Bodnar, H. Sáez de Ocáriz Borde, and P. Liò (2022b) Sheaf attention networks. In NeurIPS 2022 Workshop on Symmetry and Geometry in Neural Representations.
- C. Bodnar, F. Di Giovanni, B. P. Chamberlain, P. Liò, and M. M. Bronstein (2022) Neural sheaf diffusion: a topological perspective on heterophily and oversmoothing in GNNs. In Advances in Neural Information Processing Systems, Vol. 35, pp. 18527–18541.
- G. E. Bredon (1997) Sheaf theory. 2nd edition, Springer.
- M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst (2017) Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34(4), pp. 18–42.
- D. Chen, Y. Lin, W. Li, P. Li, J. Zhou, and X. Sun (2020) Measuring and relieving the over-smoothing problem for graph neural networks from the topological view. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 3438–3445.
- J. M. Curry (2014) Sheaves, cosheaves and applications. Ph.D. Thesis, University of Pennsylvania.
- H. Derksen and J. Weyman (2005) Quiver representations. Notices of the American Mathematical Society 52(2), pp. 200–206.
- S. Gunasekar, B. E. Woodworth, S. Bhojanapalli, B. Neyshabur, and N. Srebro (2017) Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, Vol. 30.
- J. Hansen and R. Ghrist (2019) Toward a spectral theory of cellular sheaves. Journal of Applied and Computational Topology 3, pp. 315–358.
- G. R. Kempf and L. Ness (1979) The length of vectors in representation spaces. In Algebraic Geometry, pp. 233–243.
- A. D. King (1994) Moduli of representations of finite dimensional algebras. The Quarterly Journal of Mathematics 45(4), pp. 515–530. doi: [10.1093/qmath/45.4.515](https://dx.doi.org/10.1093/qmath/45.4.515).
- T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations.
- F. C. Kirwan (1984) Cohomology of quotients in symplectic and algebraic geometry. Princeton University Press.
- Q. Li, Z. Han, and X. Wu (2018) Deeper insights into graph convolutional networks for semi-supervised learning. In Proceedings of the AAAI Conference on Artificial Intelligence.
- D. Mumford, J. Fogarty, and F. Kirwan (1994) Geometric invariant theory. 3rd edition, Springer.
- H. Pei, B. Wei, K. C. Chang, Y. Lei, and B. Yang (2020) Geom-GCN: geometric graph convolutional networks. In International Conference on Learning Representations.
- N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. Hamprecht, Y. Bengio, and A. Courville (2019) On the spectral bias of neural networks. In International Conference on Machine Learning, pp. 5301–5310.
- R. Schiffler (2014) Quiver representations. Springer.
- H. Shah, K. Tamuly, A. Raghunathan, P. Jain, and P. Netrapalli (2020) The pitfalls of simplicity bias in neural networks. In Advances in Neural Information Processing Systems, Vol. 33, pp. 9573–9585.
- A. D. Shepard (1985) A cellular description of the derived category of a stratified space. Ph.D. Thesis, Brown University.
- P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, and Y. Bengio (2018) Graph attention networks. In International Conference on Learning Representations.

## Appendix A: Extended Proofs and Mathematical Details

In this section, we provide the formal proofs for the claims made in the main text regarding the homological decomposition of the diffusion limit and the moment-map formulation of stability.

### A.1 Proof of the Kernel Decomposition Theorem

In Section 4, we claimed that direct-sum decompositions of the learned incidence-quiver representation induce corresponding decompositions of the harmonic space. Here we formalize this functorial argument.

###### Theorem A.1.

Let $\mathcal{F}\cong\bigoplus_{k=1}^{K}\mathcal{F}^{(k)}$ be a finite direct-sum decomposition of a cellular sheaf, viewed as a representation of the incidence quiver $Q_G$ in the Abelian category $\operatorname{Rep}(Q_G)$. Let $H^{0}(G;\mathcal{F})=\ker(\Delta_{\mathcal{F}})$ denote the space of global sections (the diffusion limit). Then there is a canonical isomorphism of vector spaces:

$$H^{0}(G;\mathcal{F})\;\cong\;\bigoplus_{k=1}^{K}H^{0}\big(G;\mathcal{F}^{(k)}\big).$$

###### Proof.

Let $\mathbf{Vect}$ denote the category of finite-dimensional vector spaces over $\mathbb{R}$. A representation of the incidence quiver $Q_G$ is a functor $M:\mathcal{C}_{Q_G}\to\mathbf{Vect}$, where $\mathcal{C}_{Q_G}$ is the path category of $Q_G$. The space of global sections of the corresponding sheaf $\mathcal{F}$ is given by the categorical limit of this diagram:

$$H^{0}(G;\mathcal{F})=\lim_{\longleftarrow}M.$$

This assignment defines the global-section functor $\Gamma:\operatorname{Rep}(Q_G)\to\mathbf{Vect}$.

Because $\mathbf{Vect}$ is an Abelian category, finite products and finite coproducts coincide, forming biproducts (direct sums). The category of representations $\operatorname{Rep}(Q_G)$ is also Abelian, and finite direct sums of representations are constructed pointwise (stalk-wise).

Since limits commute with limits, and a finite product is a limit, the limit functor $\Gamma$ preserves finite products. Because finite products are biproducts in $\mathbf{Vect}$, $\Gamma$ acts as an additive functor that strictly preserves finite direct sums. Therefore:

$$\Gamma\left(\bigoplus_{k=1}^{K}\mathcal{F}^{(k)}\right)\cong\bigoplus_{k=1}^{K}\Gamma\big(\mathcal{F}^{(k)}\big).$$

Substituting $\Gamma(\mathcal{F})=H^{0}(G;\mathcal{F})$ yields the stated isomorphism. ∎

This proof confirms that manual spectral block-diagonalization of the sheaf Laplacian is unnecessary; the structural decomposition of the diffusion limit is purely a consequence of homological algebra.
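
To make the decomposition concrete, the following NumPy sketch assembles sheaf Laplacians for two summand sheaves on a small triangle graph and checks that the harmonic dimension of their direct sum equals the sum of the harmonic dimensions. The graph, restriction maps, and helper names (`sheaf_laplacian`, `harmonic_dim`) are our own illustrative choices, not code from the paper.

```python
import numpy as np

def sheaf_laplacian(edges, restriction, d_v, d_e, n_nodes):
    """Assemble the sheaf Laplacian L = delta^T delta for a graph whose node
    stalks are d_v-dimensional and edge stalks are d_e-dimensional.
    restriction[(v, e_idx)] is the (d_e x d_v) map F_{v <= e}."""
    delta = np.zeros((len(edges) * d_e, n_nodes * d_v))
    for e_idx, (u, v) in enumerate(edges):
        rows = slice(e_idx * d_e, (e_idx + 1) * d_e)
        # Coboundary on edge e = (u, v): F_{v <= e} x_v - F_{u <= e} x_u.
        delta[rows, u * d_v:(u + 1) * d_v] = -restriction[(u, e_idx)]
        delta[rows, v * d_v:(v + 1) * d_v] = restriction[(v, e_idx)]
    return delta.T @ delta

def harmonic_dim(L, tol=1e-9):
    """dim ker(L): the dimension of the space of global sections H^0."""
    return int(np.sum(np.linalg.eigvalsh(L) < tol))

rng = np.random.default_rng(0)
edges = [(0, 1), (1, 2), (0, 2)]   # triangle graph
n = 3

# F1: the trivial rank-1 (constant) sheaf -- every restriction map is the identity.
F1 = {(v, e): np.eye(1) for e, (u, w) in enumerate(edges) for v in (u, w)}
# F2: a hypothetical generic rank-2 summand with random restriction maps.
F2 = {(v, e): rng.standard_normal((2, 2)) for e, (u, w) in enumerate(edges) for v in (u, w)}
# Direct sum F1 (+) F2: block-diagonal restriction maps on 3-dimensional stalks.
Fsum = {k: np.block([[F1[k], np.zeros((1, 2))],
                     [np.zeros((2, 1)), F2[k]]]) for k in F1}

L1, L2 = sheaf_laplacian(edges, F1, 1, 1, n), sheaf_laplacian(edges, F2, 2, 2, n)
Lsum = sheaf_laplacian(edges, Fsum, 3, 3, n)

# Theorem A.1: dim H^0(F1 (+) F2) = dim H^0(F1) + dim H^0(F2).
print(harmonic_dim(L1), harmonic_dim(L2), harmonic_dim(Lsum))   # e.g. 1 0 1
assert harmonic_dim(Lsum) == harmonic_dim(L1) + harmonic_dim(L2)
```

In this toy example the trivial summand contributes the single constant global section, while the generic summand contributes none, matching the degeneration picture described in the main text.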

### A.2 Derivation of the Moment Map for Neural Sheaves

We formalize the connection between the unitary gauge group action and the moment-map regularizers introduced in Section 6.

Let $\mathbf{d}=((d_{v})_{v\in V},(d_{e})_{e\in E})$ be the dimension vector of the sheaf. In the complexified setting, we equip every stalk $\mathcal{F}(v)$ and $\mathcal{F}(e)$ with a standard Hermitian inner product, restricting the complex general linear gauge group $G_{\mathbf{d}}$ to the maximal compact subgroup $K_{\mathbf{d}}=\prod_{v}U(d_{v})\times\prod_{e}U(d_{e})$.

The Lie algebra of $K_{\mathbf{d}}$ is $\mathfrak{k}=\bigoplus_{v}\mathfrak{u}(d_{v})\oplus\bigoplus_{e}\mathfrak{u}(d_{e})$, which consists of skew-Hermitian matrices. The representation space $\operatorname{Rep}(Q_{G},\mathbf{d})$ is a flat Kähler manifold. The action of $K_{\mathbf{d}}$ on a point (a set of restriction maps $A_{v,e}$) is Hamiltonian, and therefore admits a moment map $\mu:\operatorname{Rep}(Q_{G},\mathbf{d})\to\mathfrak{k}^{*}$.

By the standard formula for quiver representations, the $i$-th component of the moment map (where $i$ is a vertex or edge in the graph) is given by the difference between the maps entering $i$ and leaving $i$. For the bipartite incidence quiver $Q_{G}$, arrows only go from graph vertices $v$ to graph edges $e$. Therefore:

1. For a graph vertex $v$ (which only has outgoing arrows in $Q_{G}$): $\mu_{v}(A)=-\sum_{e:\,v\trianglelefteq e}A_{v,e}^{*}A_{v,e}$
2. For a graph edge $e$ (which only has incoming arrows in $Q_{G}$): $\mu_{e}(A)=\sum_{v:\,v\trianglelefteq e}A_{v,e}A_{v,e}^{*}$

According to Kempf–Ness theory, the intersection of a complex gauge orbit with the zero locus of the shifted moment map $\mu_{i}(A)-\theta_{i}I_{d_{i}}=0$ corresponds exactly to $\theta$-polystable representations. Our regularizers (CentMM and ThetaMM) penalize the Frobenius norm of these exact moment-map residuals to bias gradient descent toward these stable orbits.
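
As a minimal sketch (our own, not the paper's released implementation), the shifted moment-map residuals and the Frobenius-norm penalty described above could be computed from a dictionary of restriction maps as follows; the container layout, function names, and the handling of the $\theta$-shifts are illustrative assumptions.

```python
import torch

def moment_map_residuals(A, theta_v, theta_e):
    """Shifted moment-map components for the bipartite incidence quiver.
    A[(v, e)] is the (d_e x d_v) restriction map F_{v <= e};
    theta_v[v] and theta_e[e] are the stability shifts."""
    residuals = []
    for v in {v for (v, _) in A}:
        maps = [M for (u, _), M in A.items() if u == v]
        mu_v = -sum(M.conj().T @ M for M in maps)        # mu_v(A) = -sum A* A
        residuals.append(mu_v - theta_v[v] * torch.eye(maps[0].shape[1]))
    for e in {e for (_, e) in A}:
        maps = [M for (_, f), M in A.items() if f == e]
        mu_e = sum(M @ M.conj().T for M in maps)         # mu_e(A) = sum A A*
        residuals.append(mu_e - theta_e[e] * torch.eye(maps[0].shape[0]))
    return residuals

def moment_map_penalty(A, theta_v, theta_e):
    """Frobenius-norm penalty on the residuals (CentMM/ThetaMM-style term)."""
    return sum(torch.linalg.norm(R, ord="fro") ** 2
               for R in moment_map_residuals(A, theta_v, theta_e))
```

In training, a term of this form would simply be added to the task loss with the regularization weights reported in Appendix B.2.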

## Appendix B: Experimental Details

### B.1 Dataset Statistics

We evaluate our proposed models on standard heterophilic datasets, primarily focusing on the WebKB benchmark alongside the larger Wikipedia network, squirrel, both introduced to the graph neural network community by Pei et al. [[2020](https://arxiv.org/html/2605.11178#bib.bib25)]. The WebKB datasets (Texas, Cornell, Wisconsin) represent webpage networks from university computer science departments, where nodes are web pages, edges are hyperlinks, and features are bag-of-words representations. The squirrel dataset represents a page-page network from Wikipedia focusing on specific topics, where nodes are articles, edges are mutual links, and features are informative nouns extracted from the text. The node classification task for all datasets is to classify the pages into five distinct categories. As shown in Table [2](https://arxiv.org/html/2605.11178#A2.T2), all of these datasets exhibit extremely low edge homophily.

Table 2: Summary statistics for the heterophilic node classification datasets.

### B.2 Training Protocol and Reproducibility

All experiments were implemented using PyTorch Geometric and optimized with the Adam optimizer. We evaluated the models over 10 random data splits (48% training, 32% validation, and 20% testing), which is standard for the WebKB benchmarks. Standard deviations reported in Table [1](https://arxiv.org/html/2605.11178#S7.T1) are computed across these 10 splits. Training was conducted for a maximum of 1500 epochs with an early-stopping patience of 200 epochs based on validation loss. The base network architecture across all runs used 4 layers (unless depth was explicitly ablated), a hidden channel dimension of 20, a learning rate of 0.01, a weight decay of $5\times 10^{-3}$, and a dropout rate of 0.7.

Rather than performing a continuous hyperparameter search, we used a fixed configuration across datasets to isolate the architectural effects. The central moment regularization strength was set to $\lambda_{\mu}=2\times 10^{-3}$ and the adaptive stability penalty to $\lambda_{\theta}=1\times 10^{-4}$. The rectangular models were not strictly parameter-count matched to the square models in the main table; however, a dedicated parameter-matched square control (18 hidden channels) evaluated on Texas achieved $80.27\%\pm 4.69$, confirming that capacity reduction alone does not explain the distinct stability behavior of the rectangular architecture.
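
For reference, the fixed configuration above can be summarized as a single settings mapping; the key names below are our own shorthand rather than identifiers from the released code.

```python
# Illustrative summary of the fixed training configuration reported above;
# key names are hypothetical and do not reflect the authors' code.
CONFIG = {
    "optimizer": "Adam",
    "max_epochs": 1500,
    "early_stopping_patience": 200,   # epochs, monitored on validation loss
    "num_layers": 4,                  # unless depth is explicitly ablated
    "hidden_channels": 20,
    "learning_rate": 1e-2,
    "weight_decay": 5e-3,
    "dropout": 0.7,
    "lambda_mu": 2e-3,                # central moment (CentMM) strength
    "lambda_theta": 1e-4,             # adaptive stability (ThetaMM) penalty
    "splits": {"train": 0.48, "val": 0.32, "test": 0.20, "num_splits": 10},
}
```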

**Enforcing Admissibility for ThetaMM.** For adaptive stability, the admissibility condition $\theta\cdot\mathbf{d}=0$ is strictly enforced during each forward pass. We parameterize unconstrained weights $\tilde{\theta}_{i}$ and project them orthogonally onto the hyperplane defined by the dimension vector $\mathbf{d}$. The projected stability parameters are computed as:

$$\theta_{i}=\tilde{\theta}_{i}-\frac{\sum_{j}d_{j}\tilde{\theta}_{j}}{\sum_{j}d_{j}^{2}}\,d_{i}.$$

This guarantees that $\sum_{i}d_{i}\theta_{i}=0$, ensuring the parameter remains a valid King stability weight.
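
A one-function sketch of this projection (our own naming; assuming the stability weights and the dimension vector are stored as flat tensors) would be:

```python
import torch

def project_theta(theta_tilde: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Orthogonally project unconstrained stability weights onto the admissible
    hyperplane {theta : theta . d = 0}, as in the displayed formula. Both inputs
    are 1-D tensors indexed by the objects (vertices and edges) of the quiver."""
    coeff = (d * theta_tilde).sum() / (d * d).sum()
    return theta_tilde - coeff * d

# Example: dimension vector d = (3, 3, 2) for two vertex stalks and one edge stalk.
d = torch.tensor([3.0, 3.0, 2.0])
theta = project_theta(torch.tensor([0.5, -0.1, 0.2]), d)
assert torch.isclose((d * theta).sum(), torch.tensor(0.0), atol=1e-6)
```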

### B.3 Compute Infrastructure

The models were trained on a machine equipped with dual NVIDIA Tesla V100S-PCIE GPUs (32 GB VRAM). Due to the lightweight nature of the WebKB graphs, the runtime per experiment was negligible (typically under 2 minutes per split).

## Appendix C: Further Empirical Diagnostics and Controls

To address the specific dynamics of our proposed architectural and regularization interventions, we performed additional geometric diagnostics and parameter-matched control experiments.

### C.1 Parameter-Matched Control

A natural question regarding the $3\to 2$ rectangular architecture is whether its performance differences stem from the structural removal of the stability wall, or simply from the fact that it contains fewer trainable parameters in the restriction maps than the $3\times 3$ square baseline.

To isolate this, we evaluated a parameter-matched square control on the Texas dataset. We reduced the `hidden_channels` of the equal-stalk model from 20 to 18, exactly matching the total parameter count of the $3\to 2$ rectangular model. This matched control achieved a test accuracy of $80.27\%\pm 4.69$, outperforming the standard $78.65\%$ baseline. This is consistent with the broader premise in Section [5](https://arxiv.org/html/2605.11178#S5): restricting model capacity may provide an implicit regularization effect that helps reduce reliance on trivial representations.

However, capacity reduction alone does not explain the full mechanism. As demonstrated on Wisconsin (Table [1](https://arxiv.org/html/2605.11178#S7.T1)), simply applying the ThetaMM penalty to the equal-stalk architecture fails to improve performance, regardless of parameter count. It is specifically the geometric intervention of breaking the stalk symmetry ($d_{e}<d_{v}$) that allows the learnable stability parameters to actively navigate GIT chambers and reach peak performance ($81.76\%$).

### C.2 Geometric Diagnostics: Dirichlet Energy

To verify that our moment-map regularizers actively alter the harmonic geometry of the diffusion limit, rather than merely acting as standard weight decay, we tracked the Dirichlet energy of the node features during the forward passes of our experiments.

As established in Section [5](https://arxiv.org/html/2605.11178#S5), if a learned sheaf contains the trivial subrepresentation $\mathcal{F}_{\mathrm{triv}}$, the sheaf Dirichlet energy collapses toward zero, mirroring classical oversmoothing where features become globally constant. Our logs are consistent with this mechanism: in degenerate, unregularized equal-stalk runs, the tracked Dirichlet energy frequently collapses to near-zero levels. In contrast, the application of moment-map penalties (CentMM and ThetaMM) tends to maintain larger Dirichlet energies than degenerate unregularized runs, consistent with reduced reliance on trivial constant-signal collapse and improved retention of discriminative global sections.
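
For concreteness, the tracked quantity is the quadratic form of the features against the sheaf Laplacian. A minimal sketch, assuming access to a dense sheaf Laplacian $\Delta_{\mathcal{F}}$ and stacked node features (names are our own), is:

```python
import torch

def sheaf_dirichlet_energy(x: torch.Tensor, L_sheaf: torch.Tensor) -> torch.Tensor:
    """Sheaf Dirichlet energy E(x) = x^T L_F x of stacked node features.
    `x` has shape (n_nodes * d_v,) or (n_nodes * d_v, channels); `L_sheaf` is the
    dense sheaf Laplacian. A collapse of E(x) toward zero indicates the features
    are approaching (near-)harmonic, oversmoothed states."""
    x = x.reshape(L_sheaf.shape[0], -1)
    return torch.einsum("ic,ic->", x, L_sheaf @ x)
```

Logging this scalar once per forward pass suffices to distinguish the degenerate collapse described above from the regularized runs.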

### C.3 Effect of Network Depth and Generalization

To verify the interaction between our structural constraints and network depth, we performed an ablation study on the Wisconsin dataset, scaling the number of message-passing layers $L\in\{2,4,8,16,32,64,128,256\}$. At shallow depths ($L=2,4$), the Rectangular ThetaMM model consistently outperforms the unregularized square baseline. As depth increases up to $L=128$, both architectures exhibit remarkable resilience before experiencing identical catastrophic numerical explosions (NaN values) at $L=256$. The rectangular model remains highly competitive at these extreme depths ($85.88\%$ at $L=128$), confirming that the $d_{e}<d_{v}$ bottleneck constrains the representation space without destroying the network's capacity to learn expressive deep representations.

**Generalization to Cornell.** To confirm that depth resilience is not unique to the Wisconsin topology, we extended the extreme-depth evaluation to the Cornell dataset (Figure [2](https://arxiv.org/html/2605.11178#A3.F2)). Both architectures scale efficiently at shallow depths, peaking at $\sim 85.4\%$ at $L=8$. The impact of the stability constraints emerges as we push the network into the extreme-depth regime. While the unregularized baseline steadily degrades down to $81.62\%$ at $L=64$, the Rectangular ThetaMM model prevents this homological decay, maintaining its peak test accuracy of $85.68\%$. Both models remain stable up to $L=128$ before experiencing identical catastrophic numerical explosions at $L=256$. This confirms that across multiple heterophilic topologies, the $d_{e}<d_{v}$ constraint actively shields the learned representation from degeneration in intermediate deep regimes, while maintaining the capacity to scale up to the numerical precision limit of the sheaf Laplacian.

Figure 2: Test accuracy across network depth on Cornell (x-axis: number of layers $L$, from $2^{1}$ to $2^{7}$; y-axis: test accuracy, %). Both architectures achieve peak performance at $L=8$. As depth increases into the oversmoothing regime ($L=64$), the unregularized baseline's performance degrades, while the Rectangular ThetaMM model acts as a structural safeguard, maintaining peak expressivity ($85.68\%$). Attempting to scale to $L=256$ resulted in numerical explosion for both architectures.
### C.4 Extended Evaluation on Wikipedia Networks (Squirrel)

To ensure our findings generalize beyond the WebKB benchmarks, we extended our depth ablation to the squirrel dataset, a significantly larger and denser heterophilic graph (5,201 nodes). As shown in Table [3](https://arxiv.org/html/2605.11178#A3.T3), squirrel presents a notoriously difficult optimization landscape where unconstrained models struggle to avoid trivial harmonic representations.

The Rectangular ThetaMM model consistently outperforms the unregularized baseline at every standard architectural depth ($L=2,4,8$). However, as the network scales to extreme depth ($L=16$), the high edge density of the squirrel graph eventually overwhelms both architectures, causing both models to undergo homological collapse and degrade to $\sim 19.6\%$. These results confirm that while the geometric bias provided by $d_{e}<d_{v}$ and moment-map stability consistently improves representation learning and delays degeneration on large-scale heterophilic networks, it acts as a structured inductive bias rather than an absolute panacea against collapse in ultra-dense, extreme-depth regimes.

Table 3: Test accuracy across network depth on the squirrel dataset. The stability-aware model consistently outperforms the baseline at intermediate depths ($L=2,4,8$). At $L=16$, the extreme density of the graph induces homological collapse in both architectures.
