Neural Collapse by Design: Learning Class Prototypes on the Hypersphere
Summary
This paper shows that cross-entropy and supervised contrastive learning are both forms of prototype learning on the hypersphere and proposes normalized losses (NTCE and NONL) that achieve Neural Collapse by design, outperforming standard methods.
# Neural Collapse by Design: Learning Class Prototypes on the Hypersphere
Source: [https://arxiv.org/html/2605.20302](https://arxiv.org/html/2605.20302)
###### Abstract
Supervised classification has a theoretical optimum, Neural Collapse (NC), yet neither of its two dominant paradigms reaches it in practice. Cross entropy (CE) leaves radial degrees of freedom unconstrained and converges to a degenerate geometry, while supervised contrastive learning (SCL) drives features toward NC during pretraining but discards this structure in a post hoc linear probing phase. We show that both paradigms are different appearances of the same method, prototype contrast on the unit hypersphere, and that closing the gap requires fixing each at its specific point of failure. From the CE side, we propose NTCE and NONL, two normalized losses that import contrastive optimization’s missing ingredients into classifier learning: a large effective negative set and decoupled alignment and uniformity terms. From the SCL side, we prove that SCL’s objective already optimizes throughout training for a principled classifier whose weights are the class mean embeddings, making linear probing both redundant and harmful. Empirically, on four benchmarks including ImageNet-1K, NTCE and NONL surpass CE accuracy, closely approximate NC ($\geq 95\%$), and match CE’s converged NC on 4/5 metrics in under $7.5\%$ of its iterations, while SCL with fixed prototypes matches linear probing without the hours-long classifier training phase. The learned geometry yields $+5.5\%$ mean relative improvement in transfer learning, up to $+8.7\%$ under severe class imbalance, and lower mCE on ImageNet-C, recasting supervised learning as prototype learning on the hypersphere, with NC reached *by design* on both paths. Code: [https://github.com/pakoromilas/nc_by_design](https://github.com/pakoromilas/nc_by_design)
Machine Learning, ICML, Representation Learning, Neural Collapse, Supervised Contrastive Learning, Supervised Learning
Figure 1: Supervised learning as learning class prototypes on the hypersphere. (a) Cross-entropy with unconstrained features $\mathbf{z}$, weights $\mathbf{W}$, and biases $\mathbf{b}$ leaves radial degrees of freedom unconstrained, preventing convergence to NC. (b) SCL pretraining maps features onto $\mathcal{S}^{d-1}$ via a projection head, producing representations that approach within-class collapse (NC1) and maximal between-class separation (NC2). (c) Standard practice discards the projection head and trains a linear probe on unconstrained $\mathbf{z}$, reintroducing free $\|\mathbf{w}\|$ and $\mathbf{b}$ that destroy the NC geometry learned during pretraining. (d) We show that both paradigms learn class prototypes on the hypersphere, converging to the same simplex ETF. From classifier learning (CL): normalizing to $\mathcal{S}^{d-1}$ and applying contrastive optimization (NTCE/NONL) yields learnable prototypes $\hat{\mathbf{w}}_c$ that converge to class means (Theorem 4.1). From SCL: the class-mean prototypes $\hat{\bm{\mu}}_c$ are already the optimal classifier throughout training, making linear probing unnecessary (Theorem 4.2). Both paths achieve NC1–NC4 in theory and closely approximate it in practice, with $\hat{\mathbf{w}}_c=\hat{\bm{\mu}}_c$ at the global optimum.

## 1 Introduction
Despite theoretical proofs that Neural Collapse (NC) is the global optimum of supervised learning objectives (Lu and Steinerberger, [2022](https://arxiv.org/html/2605.20302#bib.bib24); Zhou et al., [2022a](https://arxiv.org/html/2605.20302#bib.bib29); Graf et al., [2021](https://arxiv.org/html/2605.20302#bib.bib42)), standard supervised pipelines rarely attain it in practice. This failure is particularly striking because NC delivers precisely the properties we seek: when neural networks do approach this geometric configuration, where within class representations collapse to their means, class means form an equiangular tight frame (ETF), and classifier weights align with these prototypes, they demonstrate improved generalization (Papyan et al., [2020](https://arxiv.org/html/2605.20302#bib.bib15); Bartlett et al., [2017](https://arxiv.org/html/2605.20302#bib.bib17); Neyshabur et al., [2018](https://arxiv.org/html/2605.20302#bib.bib18)), adversarial robustness (Fawzi et al., [2016](https://arxiv.org/html/2605.20302#bib.bib21); Ding et al., [2020](https://arxiv.org/html/2605.20302#bib.bib22)), enhanced transfer learning (Galanti et al., [2021](https://arxiv.org/html/2605.20302#bib.bib2); Khosla et al., [2020](https://arxiv.org/html/2605.20302#bib.bib20)), and converge toward max margin classifiers (Soudry et al., [2018](https://arxiv.org/html/2605.20302#bib.bib16)) with stronger robustness guarantees (Hein and Andriushchenko, [2017](https://arxiv.org/html/2605.20302#bib.bib10)). If NC is provably optimal and empirically beneficial, why do standard supervised pipelines consistently fail to achieve it?
We identify the core issue as unconstrained radial degrees of freedom. Cross entropy (CE) optimization allows features and weights to be jointly rescaled without changing predictions (Soudry et al., [2018](https://arxiv.org/html/2605.20302#bib.bib16)), leaving radial directions underconstrained and preventing convergence to a unique geometry. While explicit regularization of features, weights, and biases may resolve this (Zhu et al., [2021](https://arxiv.org/html/2605.20302#bib.bib30)), it introduces multiple hyperparameters that complicate practical adoption. A principled solution is to eliminate radial freedom entirely by constraining optimization to the unit hypersphere, where NC becomes the unique global optimum (Yaras et al., [2022](https://arxiv.org/html/2605.20302#bib.bib23)). This is exactly what normalized softmax losses such as NormFace (Wang et al., [2017](https://arxiv.org/html/2605.20302#bib.bib33)) do: by normalizing both features and classifier weights, they project classification onto the hypersphere and reformulate it as angular similarity between data representations and learnable class prototypes.
Yet CE is not the only supervised paradigm that fails to reach NC. Supervised contrastive learning (SCL) (Khosla et al., [2020](https://arxiv.org/html/2605.20302#bib.bib20)) does drive features toward the NC geometry during pretraining, mapping them onto the unit hypersphere through a projection head and producing within class collapse and ETF aligned class means (Graf et al., [2021](https://arxiv.org/html/2605.20302#bib.bib42)). The standard pipeline, however, then lifts to a different representation space by discarding the projection head and trains a linear classifier on the unnormalized encoder representations with free weights and biases, reintroducing exactly the radial and bias pathologies that broke CE in the first place. The two paradigms therefore fail for opposite reasons: CE never builds the geometry, while SCL approximates it during pretraining only to discard it during linear probing.
These complementary diagnoses converge on a single conceptual lens: prototype contrast on the unit hypersphere ([Figure 1](https://arxiv.org/html/2605.20302#S0.F1)). Once classifier learning (CL) adopts normalized softmax, it optimizes angular similarity between normalized features and explicit class weight vectors that serve as prototypes. We prove that SCL, on the other side, optimizes angular similarity among normalized instances using implicit class mean embeddings as prototypes. CL and SCL are therefore different appearances of the same method, differing only in whether the prototype is parameterized or emergent, and both can reach the same simplex ETF.
Despite this shared geometric foundation, normalized softmax recasts CE in contrastive form but does not yet bring the corresponding optimization with it: it contrasts against only $K$ class prototypes (He et al., [2020](https://arxiv.org/html/2605.20302#bib.bib49)), giving a small effective negative set, and couples positive and negative similarity terms through a shared normalization (Yeh et al., [2022](https://arxiv.org/html/2605.20302#bib.bib51)). We close the remaining gaps to NC by applying proper contrastive optimization and making the following five contributions:
C1. We unify normalized softmax and SCL under a single geometric framework, revealing both as prototype contrast methods on the unit hypersphere that differ only in whether prototypes are explicit (learned weights) or implicit (class means). This framework explains why both can achieve NC in practice while standard CE cannot.
C2. We propose two supervised objectives that overcome existing computational limitations. NTCE (Normalized Temperature scaled Cross Entropy) increases the effective number of negatives from $K$ classes to $M$ batch samples, strengthening inter class separation. NONL (Negatives Only Normalization Loss) eliminates interference between intra class alignment and inter class repulsion by normalizing only over negatives, accelerating NC convergence.
C3. We prove that the SCL objective already optimizes for an optimal prototype classifier throughout pretraining, eliminating the need for linear probing. The class-mean embeddings learned by SCL form the principled SCL classifier regardless of whether NC is attained.
C4. We validate our approach across four benchmarks including ImageNet-1K. NTCE and NONL achieve $\geq 95\%$ on NC metrics while surpassing standard CE accuracy, and match CE’s NC metrics with substantially fewer training iterations. Our prototype classifier maintains SCL’s accuracy while eliminating hours of linear probing computation, a significant practical saving for large scale deployments.
C5. We empirically demonstrate that the representations learned by our objectives translate into practical benefits, yielding improved performance on transfer learning (+5.5% mean relative improvement), long tailed classification (up to +8.7% relative improvement), and robustness (lower mCE).
These results suggest a fundamental shift in how supervised learning should be understood: not as unconstrained optimization in Euclidean space, but as prototype based classification on the hypersphere.
## 2 Related Work
Neural Collapse. Neural Collapse (NC) describes a limiting geometry in which within-class features collapse to their means (NC1), class means form a centered simplex ETF (NC2), classifier weights align with the means (NC3), and biases collapse (NC4) (Papyan et al., [2020](https://arxiv.org/html/2605.20302#bib.bib15)). Variants of this structure characterize global minimizers for several objectives and modeling assumptions, including MSE (Han et al., [2022](https://arxiv.org/html/2605.20302#bib.bib37); Zhou et al., [2022a](https://arxiv.org/html/2605.20302#bib.bib29)), cross-entropy (CE) (Lu and Steinerberger, [2022](https://arxiv.org/html/2605.20302#bib.bib24)), supervised contrastive learning (SCL) (Graf et al., [2021](https://arxiv.org/html/2605.20302#bib.bib42)), and CE variants such as label smoothing and focal loss (Zhou et al., [2022b](https://arxiv.org/html/2605.20302#bib.bib1)). In finite training, however, standard CE with weight decay often fails to realize the optimal geometry: the loss is *scale-noncoercive* and can be driven toward zero by inflating logit magnitudes without improving angular structure (Albert and Anderson, [1984](https://arxiv.org/html/2605.20302#bib.bib31); Soudry et al., [2018](https://arxiv.org/html/2605.20302#bib.bib16)). Class imbalance further distorts the ETF and slows convergence (Thrampoulidis et al., [2022](https://arxiv.org/html/2605.20302#bib.bib28); Hong and Ling, [2024](https://arxiv.org/html/2605.20302#bib.bib53)); free bias terms obstruct NC4 and can exacerbate miscalibration unless controlled (e.g., logit adjustment) (Menon et al., [2021](https://arxiv.org/html/2605.20302#bib.bib46)).
While simultaneously penalizing features, weights, and biases can restore coercivity and yield NC in principle (Zhu et al., [2021](https://arxiv.org/html/2605.20302#bib.bib30); Zhou et al., [2022a](https://arxiv.org/html/2605.20302#bib.bib29)), tuning multiple regularizers is brittle. *We show that contrasting instances against class prototypes on the hypersphere operationalizes NC in practice.*
Learning on the hypersphere. Constraining radial freedom is a principled route to NC. When both features and classifier lie on the unit hypersphere, CE over the product of spheres exhibits a benign strict-saddle landscape whose minima realize perfect NC (Yaras et al., [2022](https://arxiv.org/html/2605.20302#bib.bib23)). Related evidence appears in contrastive objectives: SCL yields within-class collapse and simplex class means (Graf et al., [2021](https://arxiv.org/html/2605.20302#bib.bib42)), while in self-supervised contrastive learning batch-level optima form a simplex ETF (Koromilas et al., [2024](https://arxiv.org/html/2605.20302#bib.bib50)). A long line of face-recognition work, including SphereFace, CosFace, ArcFace, and NormFace (Liu et al., [2017](https://arxiv.org/html/2605.20302#bib.bib34); Wang et al., [2018](https://arxiv.org/html/2605.20302#bib.bib35); Deng et al., [2019](https://arxiv.org/html/2605.20302#bib.bib36); Wang et al., [2017](https://arxiv.org/html/2605.20302#bib.bib33)), operationalizes direction-only discrimination by using angular/cosine margins. *We unify these approaches by showing that both normalized softmax and SCL perform prototype contrast on the hypersphere.* Building on this bridge, we extend normalized softmax with NTCE/NONL to import two properties of contrastive optimization: a large effective negative set and decoupled alignment and uniformity terms.
Prototype-based classification and ETF classifiers. Prototype methods classify via distances to learned representatives (Snell et al., [2017](https://arxiv.org/html/2605.20302#bib.bib47)). Motivated by NC, several works fix or guide the classifier toward ETF-like prototypes and learn only the encoder, for example by (i) fixing a simplex ETF head and training the backbone (ETF+DR) (Yang et al., [2022](https://arxiv.org/html/2605.20302#bib.bib3)), (ii) using hyperspherical prototype networks (Mettes et al., [2019](https://arxiv.org/html/2605.20302#bib.bib4)), or (iii) constructing equiangular basis vectors (EBVs) (Shen et al., [2023](https://arxiv.org/html/2605.20302#bib.bib5)). Other approaches enforce (non-negative) orthogonality (Kim and Kim, [2024](https://arxiv.org/html/2605.20302#bib.bib14)) or guide the classifier toward the nearest ETF via a Riemannian inner optimization (Markou et al., [2024](https://arxiv.org/html/2605.20302#bib.bib48)). Recently NC structure has been exploited in a teacher–student setting (Zhang et al., [2025](https://arxiv.org/html/2605.20302#bib.bib7)): given a trained teacher that already exhibits NC, they compute *teacher* class centroids and use them as an NC3-inspired classifier for the *student*. *Our perspective is that CL and SCL already operate with prototypes: we modify the objectives to realize NC in practice, and we show that SCL’s class-mean prototypes form an effective classifier, making linear probing unnecessary.*
## 3 Preliminaries
Notation. Scalars are denoted by lowercase letters $u$, vectors by lowercase bold letters $\bm{u}$, and matrices by uppercase bold letters $\bm{U}$. Sets are represented by uppercase calligraphic letters $\mathcal{U}$. Individual elements are accessed using subscript notation: $u_i$ for the $i$-th element of vector $\bm{u}$ and $U_{i,j}$ for the element at row $i$ and column $j$ of matrix $\bm{U}$. To denote vertical (row-wise) concatenation of matrices $\mathbf{X}$ and $\mathbf{Y}$, we use $[\mathbf{X};\mathbf{Y}]$. We denote normalized vectors with $\hat{\bm{u}}_j=\bm{u}_j/\|\bm{u}_j\|$.
### 3.1 Learning Paradigms
Classifier Learning with Cross-Entropy. The cross-entropy loss is the standard Classifier Learning (CL) objective, optimizing representations and classifier weights simultaneously. An encoder $f_{\bm{\theta}}:\mathcal{X}\to\mathcal{Z}$, parameterized by $\bm{\theta}\in\Theta$, maps an input $\mathbf{x}\in\mathcal{X}$ to its representation $\mathbf{z}=f_{\bm{\theta}}(\mathbf{x})\in\mathcal{Z}$. For a $K$-class task, $y_i$ denotes the class assignment of sample $\mathbf{x}_i$. A linear classifier is placed on top of the encoder, with weight matrix $\mathbf{W}\in\mathbb{R}^{K\times h}$ and bias $\mathbf{b}\in\mathbb{R}^{K}$, where $h$ is the embedding dimension. For a mini-batch of $M$ samples with $\{\mathbf{z}_i\}_{i=1}^{M}$, the cross-entropy loss is defined as

$$\mathcal{L}_{\text{CE}}(\mathbf{Z},\mathbf{W})=\frac{1}{M}\sum_{i=1}^{M}-\log\left(\frac{e^{\mathbf{z}_i^{\top}\mathbf{w}_{y_i}+b_{y_i}}}{\sum_{j=1}^{K}e^{\mathbf{z}_i^{\top}\mathbf{w}_j+b_j}}\right), \tag{1}$$

where $\mathbf{w}_j$ denotes the $j$-th row of $\mathbf{W}$ and $b_j$ the $j$-th component of $\mathbf{b}$.
Supervised Contrastive Learning. *Supervised Contrastive Learning (SCL)* takes a seemingly different direction: it learns class-invariant representations by exploiting similarities between instances. Building on our notation, the contrastive framework augments the encoder $f_{\bm{\theta}}:\mathcal{X}\to\mathcal{Z}$ with a projection head $g_{\bm{\phi}}:\mathcal{Z}\to\mathcal{U}$, parameterized by $\bm{\phi}\in\Phi$, which maps representations onto the unit hypersphere, $\mathcal{U}=\mathbb{S}^{d-1}=\{\bm{u}\in\mathbb{R}^{d}\mid\|\bm{u}\|=1\}$. We denote the projected representations as $\bm{u},\bm{v}\in\mathcal{U}$, where $\bm{u}_i$ comes from instance $\bm{x}_i$ and $\bm{v}_i$ from its alternative view produced via augmentation, a typical process in contrastive learning.
For SCL the objective is to pull together positive pairs while pushing apart negative pairs in the projection space. Typically, alternative views of the same data point that originate from augmentation are treated as new data points, *i.e.* $\bm{A}=[\bm{U};\bm{V}]$, and the supervised contrastive loss becomes:

$$\mathcal{L}_{\text{SCL}}(\bm{A})=\frac{1}{2M}\sum_{i=1}^{2M}-\frac{1}{|\mathcal{C}(i)|}\sum_{l\in\mathcal{C}(i)}\log\left(\frac{e^{\bm{a}_i^{\top}\bm{a}_l/\tau}}{\sum_{\substack{j=1\\ j\neq i}}^{2M}e^{\bm{a}_i^{\top}\bm{a}_j/\tau}}\right), \tag{2}$$

where $\mathcal{C}(i)$ denotes the set of indices corresponding to positive examples sharing the same class as $\bm{x}_i$ and $\tau>0$ is a temperature parameter that controls the concentration of the distribution.
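To make Equation 2 concrete, the following is a minimal plain-Python sketch of the loss, written for readability rather than speed. It assumes that the rows of `A` are already unit-norm, that $\mathcal{C}(i)$ excludes the anchor $i$ itself, and that every sample has at least one positive (which holds when the batch is $[\bm{U};\bm{V}]$). The function name `scl_loss` and the default `tau=0.1` are illustrative choices, not taken from the paper.

```python
import math

def scl_loss(A, labels, tau=0.1):
    """Supervised contrastive loss of Equation 2 for unit-norm rows of A."""
    n = len(A)
    total = 0.0
    for i in range(n):
        # Temperature-scaled similarities of anchor i to every other sample.
        sims = {j: sum(a * b for a, b in zip(A[i], A[j])) / tau
                for j in range(n) if j != i}
        # Numerically stable log of the denominator (sum over j != i).
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        # Positives: other samples with the same label (assumed to exclude i).
        positives = [j for j in sims if labels[j] == labels[i]]
        total += sum(log_denom - sims[l] for l in positives) / len(positives)
    return total / n

# Two classes, two views each; views coincide and the classes are orthogonal.
A = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
labels = [0, 0, 1, 1]
loss = scl_loss(A, labels, tau=1.0)
```

In this toy batch each anchor sees one positive at similarity 1 and two negatives at similarity 0, so every per-sample term equals $\log(1 + 2e^{-1})$.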
A crucial distinction emerges post-training: while learning with cross-entropy directly produces a classifier, contrastive learning requires an additional step. After optimizing [Equation 2](https://arxiv.org/html/2605.20302#S3.E2), the projection head is discarded and a linear classifier $(\bm{W},\bm{b})$ is trained on the frozen encoder representations $\bm{z}$ using [Equation 1](https://arxiv.org/html/2605.20302#S3.E1), a process known as linear probing.
### 3.2 Neural Collapse (NC)
*Neural Collapse* (Papyan et al., [2020](https://arxiv.org/html/2605.20302#bib.bib15)) is the late-training regime (on balanced data) where last-layer features and the linear classifier converge to a highly structured limit. Let $\bm{z}_i=f(\bm{x}_i)\in\mathbb{R}^{h}$ be the features, with class means $\bm{\mu}_c=\frac{1}{n_c}\sum_{i:y_i=c}\bm{z}_i$, weights $\bm{w}_c$, and bias $\bm{b}$. NC holds when, up to common scalings:
- (NC1) Within-class collapse: $\bm{z}_i=\bm{\mu}_{y_i}$ for all $i$.
- (NC2) Simplex ETF of class means: the centered means $\tilde{\bm{\mu}}_c=\bm{\mu}_c-\frac{1}{K}\sum_{k=1}^{K}\bm{\mu}_k$ have equal norms and equal pairwise angles, so the means span a centered $(K-1)$-simplex ETF.
- (NC3) Alignment of class representation and classifier: the classifier weight vectors align with the class means, $\bm{w}_c\parallel\bm{\mu}_c$ (there exists $\gamma>0$ with $\bm{w}_c=\gamma\,\bm{\mu}_c$).
- (NC4) Bias collapse: $\bm{b}=\beta\,\mathbf{1}$ for some scalar $\beta$.
Under NC, the decision rule reduces to nearest-class-mean classification. We assume balanced classes and $h\geq K-1$ so a centered simplex ETF is feasible (Lu and Steinerberger, [2022](https://arxiv.org/html/2605.20302#bib.bib24)).
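The NC2 condition can be checked numerically. The sketch below builds a reference simplex ETF with the standard construction $\sqrt{K/(K-1)}\,(\mathbf{I}_K-\tfrac{1}{K}\mathbf{1}\mathbf{1}^{\top})$ and measures how far a set of class means deviates from equal norms and from the pairwise cosine $-1/(K-1)$ that equal-norm, equiangular centered means must have. The helper names `simplex_etf` and `nc2_report` are hypothetical, and the two deviation scores are illustrative diagnostics, not necessarily the NC metrics reported in the paper.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def simplex_etf(K):
    """Rows are the K unit-norm vertices of a centered simplex ETF in R^K."""
    scale = math.sqrt(K / (K - 1))
    return [[scale * ((1.0 if i == j else 0.0) - 1.0 / K) for j in range(K)]
            for i in range(K)]

def nc2_report(means):
    """Return (norm deviation, cosine deviation) of the centered class means."""
    K, dim = len(means), len(means[0])
    # Center the means by subtracting the global mean.
    g = [sum(m[d] for m in means) / K for d in range(dim)]
    centered = [[m[d] - g[d] for d in range(dim)] for m in means]
    norms = [math.sqrt(dot(c, c)) for c in centered]
    avg = sum(norms) / K
    norm_dev = max(abs(n - avg) for n in norms)
    # Pairwise cosine of a centered simplex ETF is -1 / (K - 1).
    target = -1.0 / (K - 1)
    cos_dev = max(abs(dot(centered[i], centered[j]) / (norms[i] * norms[j]) - target)
                  for i in range(K) for j in range(i + 1, K))
    return norm_dev, cos_dev

etf_dev = nc2_report(simplex_etf(4))
skewed_dev = nc2_report([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Both deviations vanish (up to floating-point error) for the reference ETF and are clearly non-zero for the skewed set of means.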
Practical challenges in reaching Neural Collapse. Neural Collapse (NC) is now well documented in deep nets (Papyan et al., [2020](https://arxiv.org/html/2605.20302#bib.bib15)) and characterizes global minima of balanced cross-entropy (Lu and Steinerberger, [2022](https://arxiv.org/html/2605.20302#bib.bib24)). However, standard pipelines do not enforce it in practice. For the typical paradigm of cross-entropy with classifier weight decay, the objective admits an *unbounded rescaling direction*: shrinking the classifier while amplifying features leaves logits unchanged, reduces the penalty, and drives the loss toward zero without achieving NC (Soudry et al., [2018](https://arxiv.org/html/2605.20302#bib.bib16); Albert and Anderson, [1984](https://arxiv.org/html/2605.20302#bib.bib31)). Zhu et al. ([2021](https://arxiv.org/html/2605.20302#bib.bib30)) show that a well-posed objective arises when all radial degrees of freedom are constrained by penalizing weights, features, and biases simultaneously. However, this is brittle in practice because multiple regularizers must be tuned.
Supervised contrastive training, on the other hand, can drive representations toward NC geometry (Graf et al., [2021](https://arxiv.org/html/2605.20302#bib.bib42)). However, the subsequent *linear probing* step typically fits a softmax classifier with cross-entropy on *frozen* features, allowing free weight magnitudes and biases. This reintroduces the same scale and bias pathologies as cross-entropy even when training has already reached NC.
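The rescaling direction can be seen on a toy example. The sketch below assumes a single two-dimensional feature, a bias-free two-class classifier, and a squared-norm weight penalty; all names and numbers are illustrative. It shows that scaling features by `alpha` and weights by `1 / alpha` keeps the logits fixed while shrinking the penalty, and that inflating the feature norm alone lowers the cross-entropy of a correctly classified sample without changing any angle.

```python
import math

def logits(z, W):
    """Bias-free logits: one inner product per class row of W."""
    return [sum(a * b for a, b in zip(z, w)) for w in W]

def cross_entropy(logit_vec, y):
    """Numerically stable softmax cross-entropy for a single sample."""
    m = max(logit_vec)
    return m + math.log(sum(math.exp(l - m) for l in logit_vec)) - logit_vec[y]

def weight_penalty(W):
    """Squared Frobenius norm, as used by weight decay."""
    return sum(x * x for w in W for x in w)

z = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, -1.0]]
alpha = 10.0
z_big = [alpha * x for x in z]                  # amplified features
W_small = [[x / alpha for x in w] for w in W]   # shrunken classifier

same_logits = all(abs(a - b) < 1e-12
                  for a, b in zip(logits(z, W), logits(z_big, W_small)))
penalty_ratio = weight_penalty(W_small) / weight_penalty(W)  # 1 / alpha**2
loss_before = cross_entropy(logits(z, W), 0)
loss_after = cross_entropy(logits(z_big, W), 0)  # same directions, larger norm
```

Neither transformation changes the direction of `z` or of any row of `W`, which is why a lower loss here says nothing about progress toward the NC geometry.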
## 4 Supervised Learning on the Hypersphere
In this section we bridge classifier learning and supervised contrastive learning under a common viewpoint. Our approach uses similarity-based optimization while eliminating radial degrees of freedom by constraining both feature and classifier norms to the hypersphere. This constraint transforms the optimization landscape into a benign geometry in which every local minimizer is a global optimum and all other critical points are strict saddles (Yaras et al., [2022](https://arxiv.org/html/2605.20302#bib.bib23)), enabling direct convergence to NC.
### 4.1 Revisiting Cross Entropy: Contrasting Class Prototypes to Instances
The weight matrix of the final linear classifier in CL methods can be expressed as $\bm{W}=[\bm{w}_1;\bm{w}_2;\dots;\bm{w}_K]\in\mathbb{R}^{K\times h}$, where each $\bm{w}_c$ represents a learnable class prototype. This formulation reveals an important insight: we can treat the classifier weights as learnable prototypes that evolve through gradient descent to capture class-specific geometric structures. Building on this we design objectives that use such prototypes to arrive at the optimal NC geometry.
Normalized Softmax Losses. Standard cross-entropy and contrastive learning represent two seemingly distinct paradigms: the former discriminates through learned magnitudes and biases in unconstrained space, while the latter operates purely on angular similarities on the hypersphere. This fundamental difference leads to a critical inefficiency: while both methods theoretically converge to neural collapse configurations, cross-entropy introduces unnecessary radial degrees of freedom that slow convergence to this optimal geometry (Yaras et al., [2022](https://arxiv.org/html/2605.20302#bib.bib23); Zhu et al., [2021](https://arxiv.org/html/2605.20302#bib.bib30)).
Normalized softmax losses resolve this inefficiency by reformulating cross-entropy as a pure geometric objective. NormFace (Wang et al., [2017](https://arxiv.org/html/2605.20302#bib.bib33)), a prominent example, achieves this through three coordinated modifications: (i) eliminating biases that merely translate decision boundaries without encoding semantic structure, (ii) projecting representations onto the hypersphere to focus exclusively on angular geometry, and (iii) introducing temperature scaling to control concentration of the softmax distribution. Formally, with $\bm{u}_i=\bm{z}_i/\|\bm{z}_i\|_2$ as the normalized representation and $\hat{\bm{w}}_j=\bm{w}_j/\|\bm{w}_j\|_2$ as the normalized classifier weight for class $j$, NormFace minimizes:

$$L_{\text{NormFace}}(\bm{U},\bm{W})=-\frac{1}{M}\sum_{i=1}^{M}\log\left(\frac{e^{\bm{u}_i^{\top}\hat{\bm{w}}_{y_i}/\tau}}{\sum_{j=1}^{K}e^{\bm{u}_i^{\top}\hat{\bm{w}}_j/\tau}}\right). \tag{3}$$
This reformulation transforms classification into contrastive learning between data instances and learnable class prototypes while maintaining CE’s computational efficiency\.
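A minimal plain-Python sketch of Equation 3 follows, assuming raw features `Z` and raw weights `W` given as lists of rows that are normalized inside the function. The name `normface_loss` and the default `tau=0.1` are illustrative choices rather than values from the paper.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def normface_loss(Z, W, labels, tau=0.1):
    """Equation 3: each instance is contrasted against the K class prototypes."""
    U = [normalize(z) for z in Z]   # unit-norm features
    P = [normalize(w) for w in W]   # unit-norm class prototypes
    total = 0.0
    for u, y in zip(U, labels):
        sims = [sum(a * b for a, b in zip(u, p)) / tau for p in P]
        m = max(sims)  # subtract the max for a stable log-sum-exp
        total += m + math.log(sum(math.exp(s - m) for s in sims)) - sims[y]
    return total / len(U)

Z = [[2.0, 0.1], [0.2, 3.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 1]
loss = normface_loss(Z, W, labels)
# Rescaling features and weights leaves the loss unchanged: no radial freedom.
rescaled = normface_loss([[5 * x for x in z] for z in Z],
                         [[0.2 * x for x in w] for w in W], labels)
```

Because both sides are normalized, the loss depends only on angles, which is the property the three modifications above are meant to enforce.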
Normalized Temperature-scaled Cross Entropy (NTCE). When viewing CL from a contrastive learning perspective through NormFace we encounter an inherent limitation of cross entropy: the number of negatives in the objective is limited to $K$, the number of class prototypes. Contrastive objectives typically need a large number of negatives to train well (He et al., [2020](https://arxiv.org/html/2605.20302#bib.bib49)), since fewer negatives provide a worse estimate of the expectation of the actual contrastive objective (Koromilas et al., [2024](https://arxiv.org/html/2605.20302#bib.bib50)).
By inverting the contrastive direction from instance-to-class to class-to-instance discrimination we address this limitation through the Normalized Temperature-scaled Cross Entropy (NTCE). This modification fundamentally alters the learning dynamics: rather than each instance contrasting against $K$ class prototypes, each class prototype now contrasts against $M$ batch representations.
The key insight underlying NTCE is that class prototypes themselves can serve as anchors in the contrastive formulation. By anchoring on the class weight vector corresponding to each instance’s ground-truth label and contrasting it against all batch representations, we dramatically expand the negative sampling space. Formally, NTCE takes the form:

$$L_{\text{NTCE}}(\bm{U},\bm{W})=\frac{1}{M}\sum_{i=1}^{M}-\log\left(\frac{e^{\hat{\bm{w}}_{y_i}^{\top}\bm{u}_i/\tau}}{\sum_{j=1}^{M}e^{\hat{\bm{w}}_{y_i}^{\top}\bm{u}_j/\tau}}\right), \tag{4}$$

where $\hat{\bm{w}}_{y_i}$ serves as the anchor for instance $i$, and critically, the denominator sums over all $M$ instances in the batch rather than over $K$ classes.
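Equation 4 can be sketched in the same plain-Python style. The only change relative to a normalized softmax is that the anchor is the prototype of the sample's own class and the log-sum-exp runs over the batch. The function name `ntce_loss`, the default temperature, and the toy batches are illustrative assumptions.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def ntce_loss(Z, W, labels, tau=0.1):
    """Equation 4: the prototype of class y_i is contrasted against all M samples."""
    U = [normalize(z) for z in Z]
    P = [normalize(w) for w in W]
    total = 0.0
    for i, y in enumerate(labels):
        # Similarity of the anchor prototype to every batch representation.
        sims = [sum(a * b for a, b in zip(P[y], u)) / tau for u in U]
        m = max(sims)
        total += m + math.log(sum(math.exp(s - m) for s in sims)) - sims[i]
    return total / len(U)

W = [[1.0, 0.0], [0.0, 1.0]]
# One sample per class, each perfectly aligned with its prototype.
mixed = ntce_loss([[1.0, 0.0], [0.0, 1.0]], W, [0, 1], tau=1.0)
# Two perfectly aligned samples of the same class.
same_class = ntce_loss([[1.0, 0.0], [1.0, 0.0]], W, [0, 0], tau=1.0)
```

In the second batch the loss equals $\log 2$ even though both samples sit exactly on their prototype, because each sample appears in the other's denominator.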
Negatives Only Normalization Loss (NONL). NTCE adds enhanced negative sampling on top of NormFace to directly transfer the principles of contrastive learning to cross entropy training. However, it also inherits a fundamental drawback of popular contrastive objectives that compromises its optimization dynamics. The denominator in [Equation 4](https://arxiv.org/html/2605.20302#S4.E4) aggregates all batch instances indiscriminately, including those that share the anchor’s class. That is, the denominator (also known as the uniformity term) is optimized when instances of the same class have maximum distance (Wang and Isola, [2020](https://arxiv.org/html/2605.20302#bib.bib9)), which contradicts the optimality of the numerator (the alignment term). More specifically, positive pairs explicitly appear as negative samples in the normalization term, generating gradients that actively repel instances from their own class prototype. When instance $i$ and instance $j$ share a class, $y_i=y_j$, the term $e^{\hat{\bm{w}}_{y_i}^{\top}\bm{u}_j/\tau}$ in the denominator produces gradients that decrease $\hat{\bm{w}}_{y_i}^{\top}\bm{u}_j$, directly opposing the alignment objective. This is a known behavior called alignment-uniformity coupling (Koromilas et al., [2024](https://arxiv.org/html/2605.20302#bib.bib50); Yeh et al., [2022](https://arxiv.org/html/2605.20302#bib.bib51)).
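As a sketch of where this repulsion comes from, write $s_{ij}=\hat{\bm{w}}_{y_i}^{\top}\bm{u}_j$ (a shorthand introduced here for illustration) and differentiate the per-sample term $\ell_i$ of Equation 4 with respect to the similarities, treating them as free variables:

```latex
\ell_i = -\frac{s_{ii}}{\tau} + \log\sum_{j=1}^{M} e^{s_{ij}/\tau},
\qquad
\frac{\partial \ell_i}{\partial s_{ij}}
  = \frac{1}{\tau}\,\frac{e^{s_{ij}/\tau}}{\sum_{k=1}^{M} e^{s_{ik}/\tau}} > 0
  \quad \text{for } j \neq i .
```

The derivative is positive for every $j\neq i$ regardless of the label $y_j$, so a descent step lowers $s_{ij}$ even when $y_j=y_i$, that is, even when $\bm{u}_j$ should be pulled toward the same prototype.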
To resolve this conflict we introduce the Negatives-Only Normalization Loss (NONL), which explicitly excludes same-class instances from the denominator:

$$L_{\text{NONL}}(\bm{U},\bm{W})=\frac{1}{M}\sum_{i=1}^{M}-\log\left(\frac{e^{\hat{\bm{w}}_{y_i}^{\top}\bm{u}_i/\tau}}{\sum_{\substack{j=1\\ j\notin\mathcal{C}(i)}}^{M}e^{\hat{\bm{w}}_{y_i}^{\top}\bm{u}_j/\tau}}\right). \tag{5}$$
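A plain-Python sketch of Equation 5 is given below. It assumes that every sample with the anchor's label, including the anchor sample itself, is dropped from the denominator, and it raises an error when a sample has no negatives in the batch; both choices are assumptions of this sketch, as are the name `nonl_loss` and the default temperature.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def nonl_loss(Z, W, labels, tau=0.1):
    """Equation 5: the denominator sums only over samples of other classes."""
    U = [normalize(z) for z in Z]
    P = [normalize(w) for w in W]
    total = 0.0
    for i, y in enumerate(labels):
        positive = dot(P[y], U[i]) / tau
        # Negatives: batch samples whose label differs from the anchor's class.
        negatives = [dot(P[y], U[j]) / tau
                     for j, y_j in enumerate(labels) if y_j != y]
        if not negatives:
            raise ValueError("sample %d has no negatives in the batch" % i)
        m = max(negatives)
        total += m + math.log(sum(math.exp(s - m) for s in negatives)) - positive
    return total / len(labels)

W = [[1.0, 0.0], [0.0, 1.0]]
two = nonl_loss([[1.0, 0.0], [0.0, 1.0]], W, [0, 1], tau=1.0)
# Adding a second class-0 sample leaves the class-0 terms unchanged.
three = nonl_loss([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], W, [0, 0, 1], tau=1.0)
```

Since the numerator no longer appears in the denominator, the value is not bounded below by zero, and a same-class neighbor contributes nothing to a sample's term.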
### 4.2 Revisiting Supervised Contrastive Learning: Contrasting Mean-Class Prototypes to Instances
Standard SCL pipelines train an encoder with [Equation 2](https://arxiv.org/html/2605.20302#S3.E2) on the unit hypersphere, then discard the projection head and fit a linear classifier with cross-entropy on the unnormalized encoder representations. This phase, known as linear probing, introduces unnecessary degrees of freedom that disrupt the geometric optimality achieved during pretraining: (i) geometric mismatch: SCL features live on the hypersphere with collapsed, ETF-structured class means; linear probing operates in unconstrained Euclidean space, allowing weight rescaling and bias shifts that break classifier-feature alignment (Soudry et al., [2018](https://arxiv.org/html/2605.20302#bib.bib16)), and (ii) computational redundancy: it is a separate training phase, typically of $T$ epochs over the whole dataset.
Prior work (Graf et al., [2021](https://arxiv.org/html/2605.20302#bib.bib42)) shows that SCL minimizers exhibit within-class collapse and a centered simplex ETF of class means. If both SCL and a hypothetical classifier converge to the same NC geometry, then the class-mean prototypes $\hat{\bm{\mu}}_c$ themselves already satisfy NC3 by construction. A natural alternative to linear probing is therefore Fixed Prototypes (FP): keep the projection head (where SCL features are optimized), and set the classifier weights directly to the prototypes, $\bm{w}_c=\hat{\bm{\mu}}_c$, with no additional training.
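The FP idea can be sketched in a few lines of plain Python. The sketch assumes that projected features are unit-norm, that every class appears at least once, that prototypes are re-normalized to unit length, and that prediction picks the prototype with the largest inner product; the re-normalization and the helper names `fixed_prototypes` and `predict` are assumptions of this illustration.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def fixed_prototypes(A, labels, num_classes):
    """Class means of projected features, re-normalized to the unit sphere."""
    dim = len(A[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    counts = [0] * num_classes
    for a, y in zip(A, labels):
        counts[y] += 1
        for d in range(dim):
            sums[y][d] += a[d]
    # Assumes every class occurs at least once, so no count is zero.
    return [normalize([s / counts[c] for s in sums[c]])
            for c in range(num_classes)]

def predict(a, prototypes):
    """Index of the prototype with the largest inner product (no training)."""
    sims = [sum(x * p for x, p in zip(a, proto)) for proto in prototypes]
    return sims.index(max(sims))

A = [normalize(v) for v in [[1.0, 0.1], [0.9, -0.1], [0.1, 1.0], [-0.1, 0.8]]]
labels = [0, 0, 1, 1]
prototypes = fixed_prototypes(A, labels, num_classes=2)
predictions = [predict(a, prototypes) for a in A]
new_prediction = predict(normalize([2.0, 0.3]), prototypes)
```

Computing the prototypes needs a single pass over the features, which is where the saving relative to a separately trained linear probe would come from.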
Although this idea follows directly from the theoretical results of Graf et al. ([2021](https://arxiv.org/html/2605.20302#bib.bib42)), there is reason to expect it not to work. Exact NC is never reached in practice: SCL-trained models do not realize the ETF (see [Table 3](https://arxiv.org/html/2605.20302#S4.T3)). In this non-optimal regime we therefore have no *a priori* reason to believe that class-mean prototypes are the classifier this objective drives toward. [Theorem 4.2](https://arxiv.org/html/2605.20302#S4.Thmtheorem2) answers whether FP is a proper classifier for SCL.
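The Fixed Prototypes classifier described in this subsection can be sketched in a few lines. This is an illustrative NumPy sketch under stated assumptions, not the released code: `Z` is assumed to hold unit-norm projection-head embeddings, every class is assumed to appear at least once, and the function names are hypothetical. Following the text, the class means are used as weights directly, with no bias and no extra rescaling.

```python
import numpy as np

def fit_fixed_prototypes(Z, y, num_classes):
    """Set w_c to the class mean of the embeddings: one pass over the data, no training.

    Z: (N, d) unit-norm projection-head embeddings, y: (N,) integer labels.
    """
    W = np.zeros((num_classes, Z.shape[1]))
    for c in range(num_classes):
        W[c] = Z[y == c].mean(axis=0)        # class-mean prototype
    return W

def predict_fixed_prototypes(Z, W):
    """Assign each embedding to the prototype with the largest inner product."""
    return np.argmax(Z @ W.T, axis=1)
```

In this sketch, fitting costs a single pass over the $N$ training embeddings, in contrast to the $T$ epochs of a trained probe.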
### 4.3 A Unified View of Supervised Learning
We first show in [Theorem 4.1](https://arxiv.org/html/2605.20302#S4.Thmtheorem1) that, in the balanced UFM/LPM setting (Tirer and Bruna, [2022](https://arxiv.org/html/2605.20302#bib.bib41); Yaras et al., [2022](https://arxiv.org/html/2605.20302#bib.bib23)), the three normalized losses (NormFace, NTCE, and NONL) are globally optimized by Neural Collapse (NC) geometry.
###### Theorem 4.1 (Neural Collapse optimality of normalized losses).
In the balanced UFM/LPM setting above with $d\geq K$, every global minimizer of $L_{\mathrm{NF}}$, $L_{\mathrm{NTCE}}$, and $L_{\mathrm{NONL}}$ satisfies NC1–NC3 (within-class collapse, simplex ETF class means, and classifier–feature alignment), up to a global rotation and permutation of class labels.
We now return to the FP classifier from [Section 4.2](https://arxiv.org/html/2605.20302#S4.SS2). We follow [Equation 2](https://arxiv.org/html/2605.20302#S3.E2) to treat alternative views produced by data augmentation as distinct samples, $\bm{A}=[\bm{U};\bm{V}]$. Let $\mathcal{B}_c=\{j\in[2M]:y_j=c\}$ denote the within-batch index set for class $c$, $n_c=|\mathcal{B}_c|$ its size, and $\hat{\bm{\mu}}_c=\tfrac{1}{n_c}\sum_{j\in\mathcal{B}_c}\bm{a}_j$ the corresponding batch prototype (class mean). We define the *prototype loss*

$$L_{\text{proto}}(\bm{A})=-\frac{1}{2M}\sum_{i=1}^{2M}\log\left(\frac{e^{\bm{a}_i^{\top}\hat{\bm{\mu}}_{y_i}/\tau}}{-e^{\bm{a}_i^{\top}\hat{\bm{\mu}}_{y_i}/\tau}+\sum_{c=1}^{K}n_c\cdot e^{\bm{a}_i^{\top}\hat{\bm{\mu}}_c/\tau}}\right), \tag{6}$$

where the numerator encourages alignment with the correct class prototype, while the denominator includes both positive and negative prototypes weighted by their batch frequencies $n_c$. [Theorem 4.2](https://arxiv.org/html/2605.20302#S4.Thmtheorem2) connects the optima of this loss to those of SCL. The proof can be found in [Appendix B](https://arxiv.org/html/2605.20302#A2).
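To make the terms of Equation 6 concrete, here is a minimal NumPy sketch of the prototype loss for a single batch. It is an illustration under stated assumptions, not the authors' code: `A` is assumed to stack both augmented views with unit-norm rows, `y` holds the corresponding labels, and the function name and default temperature are hypothetical. The exponentials are not numerically stabilized, which is acceptable for unit-norm inputs and moderate $\tau$.

```python
import numpy as np

def prototype_loss(A, y, num_classes, tau=0.1):
    """Sketch of Equation 6.

    A: (2M, d) unit-norm embeddings of both views, y: (2M,) integer labels.
    """
    counts = np.bincount(y, minlength=num_classes).astype(float)   # n_c
    sums = np.zeros((num_classes, A.shape[1]))
    np.add.at(sums, y, A)                                          # per-class sums
    mu = sums / np.maximum(counts, 1.0)[:, None]                   # batch prototypes
    S = A @ mu.T / tau                                             # S[i, c] = a_i^T mu_c / tau
    pos = S[np.arange(A.shape[0]), y]                              # a_i^T mu_{y_i} / tau
    # Denominator: frequency-weighted sum over all prototypes, minus one
    # copy of the positive term (classes absent from the batch get weight 0).
    denom = (counts[None, :] * np.exp(S)).sum(axis=1) - np.exp(pos)
    return float(np.mean(np.log(denom) - pos))
```

The subtraction leaves $(n_{y_i}-1)$ copies of the positive term in the denominator, so the denominator stays positive whenever the batch contains another sample of the same class or any sample of another class.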
###### Theorem 4.2 (Equivalence of SCL and prototype–softmax minimizers).
For unit-norm representations and balanced labels, the supervised contrastive loss $L_{\text{SCL}}$ and the prototype loss $L_{\text{proto}}$ in [Equation 6](https://arxiv.org/html/2605.20302#S4.E6) share the same set of global minimizers (up to rotation and label permutation). In particular, at every global minimizer the representations exhibit in-class collapse and the class means form a centered simplex ETF.
[Theorem 4.2](https://arxiv.org/html/2605.20302#S4.Thmtheorem2) closes the gaps raised in [Section 4.2](https://arxiv.org/html/2605.20302#S4.SS2). It proves the minimizer-set equality $\arg\min L_{\text{SCL}}=\arg\min L_{\text{proto}}$, meaning that SCL's objective is equivalent to optimizing the prototype-softmax classifier of [Equation 6](https://arxiv.org/html/2605.20302#S4.E6). The class-mean prototypes are therefore not a post-hoc choice but the solution the SCL objective drives toward across the entire optimization trajectory, not only at the unreachable exact minimum. This lets us discard linear probing by setting $\bm{w}_c=\hat{\bm{\mu}}_c$ directly. In our FP method, classifier weights are the learned class means, taken from the projection-head output where SCL features actually live, with no additional magnitude or bias. We show empirically ([Section 5](https://arxiv.org/html/2605.20302#S5)) that FP matches linear probing accuracy while eliminating a separate training phase. We discuss the relation to Graf et al. ([2021](https://arxiv.org/html/2605.20302#bib.bib42)) and to Kini et al. ([2024](https://arxiv.org/html/2605.20302#bib.bib61)), who analyze SCL minimizers under specific architectural assumptions, in [Sections D.1](https://arxiv.org/html/2605.20302#A4.SS1) and [D.4](https://arxiv.org/html/2605.20302#A4.SS4).
The unified view. The $n_c$ weighting in the denominator of [Equation 6](https://arxiv.org/html/2605.20302#S4.E6) captures the effect of using multiple negative instances, matching the structure of [Equation 4](https://arxiv.org/html/2605.20302#S4.E4). When the $n_c$ weights are discarded, this loss reduces to [Equation 3](https://arxiv.org/html/2605.20302#S4.E3), establishing a direct correspondence between the prototype weights and class means. The optimum of [Equation 3](https://arxiv.org/html/2605.20302#S4.E3) is attained when $\bm{w}_c=\hat{\bm{\mu}}_c$ (Yaras et al., [2022](https://arxiv.org/html/2605.20302#bib.bib23)), and our results above extend this correspondence to NTCE and NONL on the classifier learning (CL) side and to SCL on the contrastive side. Classifier learning with normalized losses and supervised contrastive learning are therefore two appearances of the same object: prototype contrast on the unit hypersphere. The two paradigms differ only in whether the prototype is parameterized (the learnable $\bm{w}_c$ in CL) or emergent (the class mean $\hat{\bm{\mu}}_c$ in SCL). Under this view, supervised learning becomes a question of *directions*, not magnitudes, with a single optimal geometry, the simplex ETF, reachable from both sides.
Table 1: Classifier Learning Performance Comparison. Bold: overall best per dataset.

| Loss | CIFAR-10 | CIFAR-100 | ImageNet-100 | ImageNet-1K |
|---|---|---|---|---|
| CE | 94.6 | 72.1 | 84.4 | 75.4 |
| ETF + DR | 94.4 | 72.1 | 84.5 | 75.4 |
| NormFace | 94.8 | 72.4 | 84.4 | 76.4 |
| NTCE (ours) | 94.7 | 72.9 | 84.7 | **76.7** |
| NONL (ours) | **94.9** | **73.6** | **84.9** | 76.5 |

Table 2: Supervised Contrastive Learning Performance Comparison. Bold: best per dataset. $T$: epochs; LP: Linear Probing; NLP: Normalized Linear Probing; FP: Fixed Prototypes.

| Method | Loss | Passes | CIFAR-10 | CIFAR-100 | IN-100 | IN-1K |
|---|---|---|---|---|---|---|
| LP | SCL | $T \times N$ | **95.0** | **73.9** | 84.8 | **75.1** |
| NLP | SCL | $T \times N$ | 94.9 | 73.6 | 84.8 | **75.1** |
| FP (ours) | SCL | $N$ | **95.0** | **73.9** | **86.8** | **75.1** |

Table 3: NC metrics on CIFAR-10/100 (training). Each cell reports CIFAR-10 / CIFAR-100. ER: effective rank. Theoretical optima: Intra ER 0/0, Inter ER 9/99, Weights ER 9/99, Weight Align 0/0, Instance Align 0/0, MIR 1/1, HDR 0/0.

| Learning Family | Method | Intra ER ↓ | Inter ER ↑ | Weights ER ↑ | Weight Align ↓ | Instance Align ↓ | MIR ↑ | HDR ↓ |
|---|---|---|---|---|---|---|---|---|
| Classifier Learning | CE | 22.5 / 96.4 | 8.6 / 57.1 | 8.9 / 89.7 | 0.59 / 0.83 | 0.69 / 1.05 | 0.98 / 0.97 | 0.03 / 0.13 |
| Classifier Learning | ETF + DR | 9.00 / 18.4 | 8.90 / 94.8 | 9.00 / 99.00 | 0.58 / 0.59 | 0.59 / 0.61 | 0.98 / 1.00 | 0.02 / 0.11 |
| Classifier Learning | NormFace | 10.5 / 13.6 | 9.0 / 96.2 | 9.0 / 96.1 | 0.12 / 0.01 | 0.14 / 0.06 | 0.95 / 1.00 | 0.04 / 0.30 |
| Classifier Learning | NTCE | 9.0 / 12.6 | 9.0 / 99.0 | 8.9 / 98.9 | 0.08 / 0.01 | 0.10 / 0.05 | 0.96 / 1.00 | 0.05 / 0.30 |
| Classifier Learning | NONL | 4.0 / 11.4 | 9.0 / 99.0 | 9.0 / 99.0 | 0.11 / 0.01 | 0.16 / 0.06 | 0.95 / 1.00 | 0.05 / 0.30 |
| Contrastive Learning | SCL (w probing) | 4.5 / 7.5 | 9.0 / 66.7 | 8.3 / 77.8 | 0.99 / 1.03 | 0.10 / 0.34 | 0.99 / 0.95 | 0.07 / 0.11 |
| Contrastive Learning | SCL (w/o probing) | 4.5 / 7.5 | 9.0 / 66.7 | 9.0 / 66.5 | 0.00 / 0.00 | 0.10 / 0.34 | 1.00 / 0.87 | 0.09 / 0.14 |
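Table 4: Transfer Learning Results. Numbers are top-1 accuracy (%) for all datasets except VOC2007 (mAP). Best per column in bold.

| Method | Food | CIFAR10 | CIFAR100 | Cars | DTD | Pets | Flowers | VOC2007 | Mean |
|---|---|---|---|---|---|---|---|---|---|
| CE | 68.0 | 88.6 | 67.7 | 25.9 | 69.9 | 67.0 | 81.4 | 67.1 | 67.0 |
| ETF + DR | 57.2 | 83.9 | 54.3 | 9.8 | 64.0 | 49.2 | 58.3 | 60.0 | 54.6 |
| NormFace | 69.8 | 89.8 | 69.7 | 29.7 | **70.1** | 69.7 | 83.2 | 67.9 | 68.7 |
| NTCE | 69.3 | 89.9 | 69.8 | 28.1 | 70.0 | 69.4 | 81.9 | 67.8 | 68.3 |
| NONL | **70.7** | **90.0** | **71.0** | **38.1** | 69.4 | **72.9** | **85.2** | **68.3** | **70.7** |
| $\Delta$ (NONL − CE) | +4.0% | +1.6% | +4.9% | +47.1% | −0.7% | +8.8% | +4.7% | +1.8% | +5.5% |

Table 5: Robustness Results. Clean error, mCE, and corruption error (%) on ImageNet-C. Corruption groups: Noise (Gauss, Shot, Impulse), Blur (Defoc, Glass, Motion, Zoom), Weather (Snow, Frost, Fog, Bright), Digital (Contrast, Elastic, Pixel, JPEG). Best per column in bold.

| Network | Error | mCE | Gauss | Shot | Impulse | Defoc | Glass | Motion | Zoom | Snow | Frost | Fog | Bright | Contrast | Elastic | Pixel | JPEG |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CE | 25.0 | 80.1 | 75 | 78 | 83 | **84** | 94 | 86 | 88 | 80 | 78 | 68 | 62 | 66 | 96 | 84 | 80 |
| ETF + DR | 24.6 | 79.2 | 75 | 78 | 82 | 86 | 94 | 85 | 86 | **76** | 77 | 66 | 62 | 67 | 93 | 80 | 82 |
| NormFace | 23.6 | 77.8 | 74 | 76 | 82 | **84** | **93** | **81** | **85** | **76** | **75** | **64** | **60** | **65** | **91** | 81 | 80 |
| NTCE | **23.3** | **77.6** | 73 | 76 | **80** | **84** | **93** | 83 | **85** | **76** | **75** | 65 | **60** | **65** | 93 | **78** | **79** |
| NONL | 23.5 | 77.8 | **72** | **75** | **80** | 85 | **93** | 83 | **85** | **76** | **75** | **64** | **60** | 66 | **91** | 81 | 81 |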
## 5 Empirical Validation and Discussion
In this section we empirically validate our methods against cross-entropy (CE), ETF + DR (Yang et al., [2022](https://arxiv.org/html/2605.20302#bib.bib3)), and NormFace (Wang et al., [2017](https://arxiv.org/html/2605.20302#bib.bib33)) for the classifier learning paradigm, and on supervised contrastive learning (SCL), evaluating: (i) classification accuracy, (ii) proximity to neural collapse geometry, and (iii) NC convergence speed. Experiments are conducted on four standard datasets, CIFAR-10, CIFAR-100, ImageNet-100, and ImageNet-1K, following common representation learning benchmarking practices (Khosla et al., [2020](https://arxiv.org/html/2605.20302#bib.bib20); Markou et al., [2024](https://arxiv.org/html/2605.20302#bib.bib48); Wang et al., [2021](https://arxiv.org/html/2605.20302#bib.bib52); Yeh et al., [2022](https://arxiv.org/html/2605.20302#bib.bib51)). We use ResNet-50 for the ImageNet datasets and ResNet-18 for CIFAR. Implementation details are provided in [Appendix E](https://arxiv.org/html/2605.20302#A5).
### 5.1 Classification Performance
Classifier Learning Methods. As shown in [Table 1](https://arxiv.org/html/2605.20302#S4.T1), normalized losses match or outperform cross-entropy (CE) on every dataset, and our losses further improve on NormFace in all but one case (NTCE on CIFAR-10). NONL achieves the strongest gains on datasets with a small (10) to medium (100) number of classes, and has the second-best score on ImageNet-1K. Note that on ImageNet-1K our objectives exhibit the typical behavior of contrastive-style objectives: they benefit from larger batch sizes, whereas smaller batches degrade performance due to insufficient in-batch negatives (see [Section G.4](https://arxiv.org/html/2605.20302#A7.SS4)).
Supervised Contrastive Learning Methods. [Table 2](https://arxiv.org/html/2605.20302#S4.T2) reports the accuracy of three classifier learning strategies on SCL representations: (i) standard linear probing with learnable weights and bias, (ii) normalized linear probing using the NormFace loss, and (iii) fixed prototypes computed as class-mean embeddings. Fixed prototypes match linear probing on 3 of 4 datasets and yield a considerable +2.0% (absolute) improvement on ImageNet-100, while requiring only $N$ forward passes versus $T \times N$ for training-based methods, where $T$ is the number of epochs. Normalized linear probing achieves accuracy comparable to standard linear probing, indicating that the discriminative information in SCL features resides primarily in their angular structure rather than in magnitude or biases. These findings show that angular structure alone suffices for discrimination in well-trained representations, enabling training-free classification in SCL via fixed prototypes, which eliminates a large computational cost by discarding a training phase that typically takes hours.
### 5.2 Quantifying Neural Collapse
We quantify NC1–NC3 with complementary, condition\-specific metrics; we omit NC4 \(bias collapse\) as our models enforce zero bias by design\.
Effective Rank (NC1, NC2). For a matrix $\mathbf{A}$ with singular values $\{\sigma_i\}$, the effective rank (Roy and Vetterli, [2007](https://arxiv.org/html/2605.20302#bib.bib26)) is defined as $\text{erank}(\mathbf{A})=\exp\{-\sum_i p_i\log p_i\}$, where $p_i=\sigma_i/\sum_j\sigma_j$. We compute the intra- and inter-class effective ranks (Zhang et al., [2024](https://arxiv.org/html/2605.20302#bib.bib27)) as $\text{erank}_{\text{intra}}=\frac{1}{K}\sum_{c=1}^{K}\text{erank}(\text{Cov}[\mathbf{z}_i-\mu_c\mid y_i=c])$ and $\text{erank}_{\text{inter}}=\text{erank}(\text{Cov}[\mu_c-\mu_G])$, where Cov is the covariance matrix. These metrics quantify NC1 (within-class variability collapse), where $\text{erank}_{\text{intra}}\to 0$ indicates $\bm{z}_i\to\mu_{y_i}$, and NC2 (ETF structure): Zhang et al. ([2024](https://arxiv.org/html/2605.20302#bib.bib27)) proved that when $\text{erank}_{\text{inter}}=K-1$ the class means form a simplex with equal pairwise angles. We also report $\mathrm{erank}(\mathbf{W})$ to assess whether the classifier weights approximate an equiangular tight frame (ETF).
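The effective-rank metrics above can be sketched as follows. This is an illustrative NumPy sketch, not the evaluation code behind Table 3. The function names are hypothetical, each class is assumed to have at least two samples, and returning 0 for an all-zero covariance is an assumed convention, chosen to match the stated optimum $\text{erank}_{\text{intra}}\to 0$ (the entropy-based definition is undefined for a zero matrix).

```python
import numpy as np

def erank(A, eps=1e-12):
    """Effective rank: exp of the entropy of the normalised singular values."""
    s = np.linalg.svd(np.atleast_2d(A), compute_uv=False)
    total = s.sum()
    if total <= eps:
        return 0.0                     # assumed convention for full collapse
    p = s / total
    p = p[p > eps]                     # drop zeros so that 0 * log(0) contributes 0
    return float(np.exp(-(p * np.log(p)).sum()))

def erank_intra(Z, y, num_classes):
    """Mean effective rank of the within-class covariance matrices (NC1)."""
    vals = [erank(np.cov(Z[y == c], rowvar=False)) for c in range(num_classes)]
    return float(np.mean(vals))

def erank_inter(Z, y, num_classes):
    """Effective rank of the covariance of the class means (NC2)."""
    means = np.stack([Z[y == c].mean(axis=0) for c in range(num_classes)])
    # np.cov centres its input, which plays the role of subtracting mu_G
    # (the mean of the class means; equal to the global mean for balanced data).
    return erank(np.cov(means, rowvar=False))
```

For example, three fully collapsed classes sitting on the vertices of a simplex give an intra-class value of 0 and an inter-class value of $K-1=2$.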
Alignment (NC3). We quantify feature–classifier alignment by $\frac{1}{N}\sum_{i=1}^{N}\|\mathbf{z}_i-\mathbf{w}_{y_i}\|_2^2$ and also report instance-to-instance alignment to probe per-class collapse.
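The feature–classifier alignment metric is a mean squared distance and can be computed directly. The sketch below is illustrative only; the function name is hypothetical, and it assumes that `Z` and `W` are already on the scale at which the metric is reported. For unit-norm vectors, $\|\mathbf{z}-\mathbf{w}\|_2^2 = 2-2\,\mathbf{z}^{\top}\mathbf{w}$, so the value ranges from 0 (perfect alignment) to 4 (antipodal).

```python
import numpy as np

def weight_alignment(Z, W, y):
    """Mean squared distance between each feature and its class weight.

    Z: (N, d) features, W: (K, d) classifier weights, y: (N,) integer labels.
    """
    return float(np.mean(np.sum((Z - W[y]) ** 2, axis=1)))
```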
Information Metrics (NC2, NC3). For normalized Gram matrices $\mathbf{G}_W$ (weights) and $\mathbf{G}_M$ (class means), with $\text{H}$ denoting the matrix entropy, Song et al. ([2024](https://arxiv.org/html/2605.20302#bib.bib25)) connect Neural Collapse to the metrics:

$$\text{MIR}=\frac{\text{H}(\mathbf{G}_W)+\text{H}(\mathbf{G}_M)-\text{H}(\mathbf{G}_W\odot\mathbf{G}_M)}{\min\{\text{H}(\mathbf{G}_W),\text{H}(\mathbf{G}_M)\}},\qquad \text{HDR}=\frac{\lvert\text{H}(\mathbf{G}_W)-\text{H}(\mathbf{G}_M)\rvert}{\max\{\text{H}(\mathbf{G}_W),\text{H}(\mathbf{G}_M)\}}.$$

These capture the information-theoretic signatures of NC2 and NC3: under full collapse $\text{MIR}\to 1$ and $\text{HDR}\to 0$, reflecting perfect structural alignment.
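A rough sketch of the two information metrics follows. The exact matrix-entropy definition of Song et al. (2024) is not reproduced in this section, so the code makes a loud assumption: $\text{H}$ is taken to be the von Neumann entropy of the trace-normalized Gram matrix, and the Gram matrices are built from L2-normalized rows. All function names are hypothetical, and the values this produces need not match those in Table 3.

```python
import numpy as np

def normalized_gram(X):
    """Gram matrix of the L2-normalised rows of X (unit diagonal)."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X @ X.T

def matrix_entropy(G, eps=1e-12):
    """ASSUMED definition: von Neumann entropy of G / trace(G)."""
    lam = np.linalg.eigvalsh(G / np.trace(G))
    lam = lam[lam > eps]               # drop zero or slightly negative eigenvalues
    return float(-(lam * np.log(lam)).sum())

def mir_hdr(W, M):
    """MIR and HDR between classifier weights W (K, d) and class means M (K, d)."""
    Gw, Gm = normalized_gram(W), normalized_gram(M)
    Hw, Hm = matrix_entropy(Gw), matrix_entropy(Gm)
    Hjoint = matrix_entropy(Gw * Gm)   # element-wise (Hadamard) product
    mir = (Hw + Hm - Hjoint) / min(Hw, Hm)
    hdr = abs(Hw - Hm) / max(Hw, Hm)
    return mir, hdr
```

Under this assumed definition, identical simplex-ETF weights and class means give an HDR of 0 and an MIR close to, but slightly below, 1 for a finite number of classes.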
[Table 3](https://arxiv.org/html/2605.20302#S4.T3) reveals four key findings. (i) CE fails to achieve NC: high intra-class variance (erank 22.5/96.4), suboptimal inter-class separation (erank 8.6/57.1 vs. the theoretical $K-1$ = 9/99), and poor weight-feature alignment (w-inst 0.59/0.83, inst-inst 0.69/1.05). (ii) Normalized softmax losses satisfy NC2–NC3, since they achieve perfect inter-class separation (erank 9.0/99.0), near-zero alignment errors (NTCE: w-inst 0.08/0.01, inst-inst 0.10/0.05), and optimal weight dimensionality matching the simplex ETF, with NONL being the overall best, mostly due to its better intra-class structure. (iii) SCL with linear probing violates NC3: despite superior within-class collapse (erank 4.5/7.5), inter-class structure degrades (erank 9.0/66.7) and classifier-feature alignment fails (w-inst 0.99/1.03). (iv) Fixed prototypes restore NC3 in SCL: removing the trainable classifier enforces perfect alignment by construction, though inter-class separation remains suboptimal. In appendix [Table 7](https://arxiv.org/html/2605.20302#A6.T7) we further show that (v) our objectives match cross-entropy's NC metrics with substantially fewer training iterations.
While CE and ETF + DR attain slightly better MIR/HDR values than our normalized losses, these information-theoretic metrics primarily reflect the overall entropy/redundancy of the representation, not the NC geometry itself. In our case, CE appears to preserve a bit more raw variability, but organizes it in a less NC-like, less prototype-structured way (higher intra-class effective rank, weaker alignment), whereas our normalized losses reshape the same information into a cleaner NC geometry. As our downstream experiments show in [Section 5.3](https://arxiv.org/html/2605.20302#S5.SS3), this structured organization is more beneficial for transfer, long-tailed performance, and robustness, even though CE may capture slightly more "information" by these metrics.
Table 6: Class Imbalance. Performance on CIFAR-10-LT/100-LT vs. imbalance ratio $\tau$ (Yang et al., [2022](https://arxiv.org/html/2605.20302#bib.bib3)). Best per column in bold.

| Method | CIFAR-10-LT, $\tau=0.1$ | CIFAR-10-LT, $\tau=0.02$ | CIFAR-10-LT, $\tau=0.01$ | CIFAR-100-LT, $\tau=0.1$ | CIFAR-100-LT, $\tau=0.02$ | CIFAR-100-LT, $\tau=0.01$ |
|---|---|---|---|---|---|---|
| CE | 88.1 | 76.8 | 70.2 | 55.5 | 41.5 | 37.4 |
| ETF + DR | 88.0 | 77.8 | 71.3 | 54.4 | 40.6 | 36.2 |
| NormFace | 88.0 | 79.0 | 74.3 | 54.6 | 40.1 | 35.9 |
| NTCE | 88.8 | **80.8** | **77.3** | 56.8 | 43.7 | 39.0 |
| NONL | **89.2** | 80.5 | 76.3 | **57.4** | **44.7** | **40.0** |
| $\Delta(\text{NONL}-\text{CE})$ | +1.2% | +4.8% | +8.7% | +3.4% | +7.7% | +7.0% |
### 5.3 Practical Benefits of Collapsed Representations
Transfer learning. We first ask whether representations that lie closer to the NC regime generalize better to unseen tasks. Following typical pipelines (Khosla et al., [2020](https://arxiv.org/html/2605.20302#bib.bib20)), we freeze the pretrained encoder for each loss and train a linear classifier (or detection head for VOC07) on eight diverse downstream datasets. As shown in [Table 4](https://arxiv.org/html/2605.20302#S4.T4), NONL consistently yields strong transfer performance: it attains the best accuracy on 7/8 datasets and delivers a +5.5% relative improvement in mean performance over CE, while NTCE also exceeds CE on every dataset. These results are consistent with prior work (Galanti et al., [2021](https://arxiv.org/html/2605.20302#bib.bib2); Papyan et al., [2020](https://arxiv.org/html/2605.20302#bib.bib15); Bartlett et al., [2017](https://arxiv.org/html/2605.20302#bib.bib17); Neyshabur et al., [2018](https://arxiv.org/html/2605.20302#bib.bib18)) indicating that explicitly encouraging NC-like geometry produces features that generalize better beyond the pretraining distribution.
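The $\Delta$ row of Table 4 is consistent with a relative improvement, $(\text{NONL}-\text{CE})/\text{CE}$, rather than a difference in percentage points. The short check below recomputes it from three columns of the table; the dictionary layout and function name are only for illustration.

```python
# Values copied from the CE and NONL rows of Table 4.
ce = {"Cars": 25.9, "DTD": 69.9, "Mean": 67.0}
nonl = {"Cars": 38.1, "DTD": 69.4, "Mean": 70.7}

def relative_gain(new, base):
    """Relative improvement in percent, rounded to one decimal place."""
    return round(100.0 * (new - base) / base, 1)

gains = {name: relative_gain(nonl[name], ce[name]) for name in ce}
print(gains)
```

For the mean column, the absolute gap is 3.7 points (70.7 vs. 67.0), which corresponds to the reported +5.5% relative gain.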
Long-tailed classification. We next examine robustness to class imbalance using standard evaluation pipelines (Yang et al., [2022](https://arxiv.org/html/2605.20302#bib.bib3)). For CIFAR-10-LT and CIFAR-100-LT, we construct long-tailed versions with three imbalance ratios and train all models directly on the imbalanced data. [Table 6](https://arxiv.org/html/2605.20302#S5.T6) shows that our NC-inducing objectives substantially improve performance under imbalance: on CIFAR-100-LT, NONL outperforms CE by +3.4%, +7.7%, and +7.0% under increasing imbalance, with gains up to +8.7% across CIFAR-10/100-LT, and also surpasses the ETF + DR baseline (Yang et al., [2022](https://arxiv.org/html/2605.20302#bib.bib3)). This supports the finding of Yang et al. ([2022](https://arxiv.org/html/2605.20302#bib.bib3)) that enforcing NC-like geometric structure helps maintain class separability even when minority classes are severely underrepresented, complementing the improvements observed in transfer.
Out-of-distribution robustness. Finally, we evaluate robustness to common corruptions on ImageNet-C (Hendrycks and Dietterich, [2019](https://arxiv.org/html/2605.20302#bib.bib8)). Models are trained on clean ImageNet-1K only and evaluated on corrupted variants, reporting clean top-1 error and mean Corruption Error (mCE) normalized as in Hendrycks and Dietterich ([2019](https://arxiv.org/html/2605.20302#bib.bib8)). As summarized in [Table 5](https://arxiv.org/html/2605.20302#S4.T5), our normalized losses reduce mCE compared to CE while also improving clean accuracy. Thus, NC-inducing objectives not only improve in-distribution performance but, as reported in the literature (Fawzi et al., [2016](https://arxiv.org/html/2605.20302#bib.bib21); Ding et al., [2020](https://arxiv.org/html/2605.20302#bib.bib22); Hein and Andriushchenko, [2017](https://arxiv.org/html/2605.20302#bib.bib10)), also yield representations that are more robust to distribution shift, in line with their benefits for transfer and long-tailed recognition.
### 5.4 Limitations
Our framework inherits several limitations from the related literature. (i) Like other contrastive methods, NTCE and NONL benefit from larger batches to provide sufficient in-batch negatives ([Section G.4](https://arxiv.org/html/2605.20302#A7.SS4)); needing larger batches is a practical drawback, but it is a typical limitation of contrastive methods (including SCL (Khosla et al., [2020](https://arxiv.org/html/2605.20302#bib.bib20)) and many SSL methods (Chen et al., [2020](https://arxiv.org/html/2605.20302#bib.bib11))). Recent SSL techniques (memory banks, queues, momentum encoders (He et al., [2020](https://arxiv.org/html/2605.20302#bib.bib49))) suggest promising ways to increase the number of negatives without linearly scaling batch size, which we view as a natural direction for future extensions of NONL. (ii) Our theory rests on balanced-class UFM/LPM assumptions; while recent work formally justifies UFM as a faithful proxy for deep ResNet/Transformer training (Súkeník et al., [2024](https://arxiv.org/html/2605.20302#bib.bib63), [2025](https://arxiv.org/html/2605.20302#bib.bib64)), end-to-end training dynamics are not directly characterized. (iii) For SCL specifically, Kini et al. ([2024](https://arxiv.org/html/2605.20302#bib.bib61)) show that with ReLU activations the optimum is an orthogonal frame with collapsed in-class representations rather than a centered simplex ETF; the minimizer-set equivalence in [Theorem 4.2](https://arxiv.org/html/2605.20302#S4.Thmtheorem2) still holds with this alternative minimizer, but the resulting geometry differs from the standard NC simplex. (iv) In the standard transfer setting we evaluate, knowledge transfers from large-scale to smaller downstream tasks (Khosla et al., [2020](https://arxiv.org/html/2605.20302#bib.bib20); Chen et al., [2020](https://arxiv.org/html/2605.20302#bib.bib11)); the recent line of work on coarse-to-fine transfer (Chen et al., [2022](https://arxiv.org/html/2605.20302#bib.bib65); Kornblith et al., [2021](https://arxiv.org/html/2605.20302#bib.bib66)) argues that collapsed representations may struggle to distinguish fine-grained sub-classes within a coarse class.
## 6 Conclusion
In this work, we address the mismatch between the theoretical optima of supervised objectives and their behavior in practice. Constraining learning to the unit hypersphere removes the radial degeneracy of cross-entropy and unifies normalized softmax and supervised contrastive learning (SCL) as a single prototype-contrast paradigm. Building on this view, we propose two objectives (NTCE and NONL) that accelerate convergence to Neural Collapse. Theoretically, we prove that SCL already yields an optimal prototype classifier during contrastive training, eliminating the typical linear probing phase. Empirically, across four benchmarks including ImageNet-1K, our methods surpass CE accuracy, reach $\geq 95\%$ on NC metrics, and translate to practical benefits in transfer learning, long-tailed classification, and robustness. Overall, supervised learning is recast as prototype-based classification on the hypersphere, narrowing the theory–practice gap while simplifying and speeding up training.
## Acknowledgements
Panagiotis Koromilas was supported by the Hellenic Foundation for Research and Innovation \(HFRI\) under the 4th Call for HFRI PhD Fellowships \(Fellowship Number: 10816\)\. Yannis Panagakis was supported by the project MIS 5154714 of the National Recovery and Resilience Plan Greece 2\.0 funded by the European Union under the NextGenerationEU Program\.
## Impact Statement
This paper presents work whose goal is to advance the field of machine learning\. Specifically, our work improves the understanding of the inner workings of supervised representation learning and proposes methods to improve several aspects of performance and optimization\. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here\. This work was supported by computing time awarded on Cyclone of the HPC Facility of the Cyprus Institute\.
## References
- A\. Albert and J\. A\. Anderson \(1984\)On the existence of maximum likelihood estimates in logistic regression models\.Biometrika71\(1\),pp\. 1–10\.Cited by:[§2](https://arxiv.org/html/2605.20302#S2.p1.1),[§3\.2](https://arxiv.org/html/2605.20302#S3.SS2.p2.1)\.
- N\. Bansal, X\. Chen, and Z\. Wang \(2018\)Can we gain more from orthogonality regularizations in training deep networks?\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[§D\.2](https://arxiv.org/html/2605.20302#A4.SS2.p1.1)\.
- P\. L\. Bartlett, D\. J\. Foster, and M\. Telgarsky \(2017\)Spectrally\-normalized margin bounds for neural networks\.InAdvances in Neural Information Processing Systems \(NeurIPS\),pp\. 6241–6250\.Cited by:[Appendix C](https://arxiv.org/html/2605.20302#A3.SS0.SSS0.Px6.p1.1),[§D\.4](https://arxiv.org/html/2605.20302#A4.SS4.p3.1),[§1](https://arxiv.org/html/2605.20302#S1.p1.1),[§5\.3](https://arxiv.org/html/2605.20302#S5.SS3.p1.1)\.
- S\. Carbonnelle and C\. De Vleeschouwer \(2019\)Layer rotation: a surprisingly simple indicator of generalization in deep networks?\.ICML Workshop on Identifying and Understanding Deep Learning Phenomena\.Cited by:[§D\.2](https://arxiv.org/html/2605.20302#A4.SS2.p1.1)\.
- M\. F\. Chen, D\. Y\. Fu, A\. Narayan, M\. Zhang, Z\. Song, K\. Fatahalian, and C\. Ré \(2022\)Perfectly balanced: improving transfer and robustness of supervised contrastive learning\.InInternational Conference on Machine Learning \(ICML\),Cited by:[§D\.4](https://arxiv.org/html/2605.20302#A4.SS4.p3.1),[§5\.4](https://arxiv.org/html/2605.20302#S5.SS4.p1.1)\.
- T\. Chen, S\. Kornblith, M\. Norouzi, and G\. Hinton \(2020\)A simple framework for contrastive learning of visual representations\.InInternational conference on machine learning,pp\. 1597–1607\.Cited by:[§G\.2](https://arxiv.org/html/2605.20302#A7.SS2.p2.3),[§G\.2](https://arxiv.org/html/2605.20302#A7.SS2.p3.2),[§G\.3](https://arxiv.org/html/2605.20302#A7.SS3.p1.3),[§G\.4](https://arxiv.org/html/2605.20302#A7.SS4.p1.1),[§G\.6](https://arxiv.org/html/2605.20302#A7.SS6.p1.7),[§G\.7](https://arxiv.org/html/2605.20302#A7.SS7.p1.3),[§5\.4](https://arxiv.org/html/2605.20302#S5.SS4.p1.1)\.
- B\. De Brabandere, D\. Neven, and L\. Van Gool \(2017\)Semantic instance segmentation with a discriminative loss function\.arXiv preprint arXiv:1708\.02551\.Cited by:[§D\.3](https://arxiv.org/html/2605.20302#A4.SS3.p1.1)\.
- J\. Deng, J\. Guo, N\. Xue, and S\. Zafeiriou \(2019\)ArcFace: additive angular margin loss for deep face recognition\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition \(CVPR\),pp\. 4690–4699\.Cited by:[§2](https://arxiv.org/html/2605.20302#S2.p2.1)\.
- G\. W\. Ding, Y\. Sharma, K\. Y\. C\. Lui, and R\. Huang \(2020\)MMA Training: direct input space margin maximization through adversarial training\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=HkeryxBtPB)Cited by:[Appendix C](https://arxiv.org/html/2605.20302#A3.SS0.SSS0.Px6.p1.1),[§D\.4](https://arxiv.org/html/2605.20302#A4.SS4.p3.1),[§1](https://arxiv.org/html/2605.20302#S1.p1.1),[§5\.3](https://arxiv.org/html/2605.20302#S5.SS3.p3.1)\.
- A\. Fawzi, S\. Moosavi\-Dezfooli, and P\. Frossard \(2016\)Robustness of classifiers: from adversarial to random noise\.InAdvances in Neural Information Processing Systems \(NeurIPS\),Cited by:[Appendix C](https://arxiv.org/html/2605.20302#A3.SS0.SSS0.Px6.p1.1),[§D\.4](https://arxiv.org/html/2605.20302#A4.SS4.p3.1),[§1](https://arxiv.org/html/2605.20302#S1.p1.1),[§5\.3](https://arxiv.org/html/2605.20302#S5.SS3.p3.1)\.
- T\. Galanti, A\. György, and M\. Hutter \(2021\)On the role of neural collapse in transfer learning\.InInternational Conference on Learning Representations,Cited by:[Appendix C](https://arxiv.org/html/2605.20302#A3.SS0.SSS0.Px6.p1.1),[§D\.4](https://arxiv.org/html/2605.20302#A4.SS4.p3.1),[§1](https://arxiv.org/html/2605.20302#S1.p1.1),[§5\.3](https://arxiv.org/html/2605.20302#S5.SS3.p1.1)\.
- J\. Gill, V\. Vakilian, and C\. Thrampoulidis \(2024\)Cited by:[§D\.3](https://arxiv.org/html/2605.20302#A4.SS3.p4.1)\.
- F\. Graf, C\. Hofer, M\. Niethammer, and R\. Kwitt \(2021\)Dissecting supervised contrastive learning\.InInternational Conference on Machine Learning,pp\. 3821–3830\.Cited by:[Appendix B](https://arxiv.org/html/2605.20302#A2.8.p8.6),[§D\.1](https://arxiv.org/html/2605.20302#A4.SS1),[§D\.1](https://arxiv.org/html/2605.20302#A4.SS1.p1.1),[§1](https://arxiv.org/html/2605.20302#S1.p1.1),[§1](https://arxiv.org/html/2605.20302#S1.p3.1),[§2](https://arxiv.org/html/2605.20302#S2.p1.1),[§2](https://arxiv.org/html/2605.20302#S2.p2.1),[§3\.2](https://arxiv.org/html/2605.20302#S3.SS2.p3.1),[§4\.2](https://arxiv.org/html/2605.20302#S4.SS2.p2.2),[§4\.2](https://arxiv.org/html/2605.20302#S4.SS2.p3.1),[§4\.3](https://arxiv.org/html/2605.20302#S4.SS3.p3.2)\.
- X\.Y\. Han, V\. Papyan, and D\. L\. Donoho \(2022\)Neural collapse under MSE loss: proximity to and dynamics on the central path\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=w1UbdvWH_R3)Cited by:[§2](https://arxiv.org/html/2605.20302#S2.p1.1)\.
- M. Y. Harun, K. Lee, J. Gallardo, G. Krishnan, and C. Kanan (2025). Controlling neural collapse enhances out-of-distribution detection and transfer learning. In International Conference on Machine Learning (ICML).
- K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020). Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738.
- M. Hein and M. Andriushchenko (2017). Formal guarantees on the robustness of a classifier against adversarial manipulation. Advances in Neural Information Processing Systems 30.
- D. Hendrycks and T. Dietterich (2019). Benchmarking neural network robustness to common corruptions and perturbations. Proceedings of the International Conference on Learning Representations.
- Y. Hong and S. Ling (2024). Neural collapse for unconstrained feature model under class-imbalance. Journal of Machine Learning Research 25(180), pp. 1–48. [Link](https://www.jmlr.org/papers/volume25/23-1215/23-1215.pdf)
- L. Huang, X. Liu, B. Lang, A. W. Yu, Y. Wang, and B. Li (2018). Orthogonal weight normalization: solution to optimization over multiple dependent Stiefel manifolds in deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence.
- P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan (2020). Supervised contrastive learning. In Advances in Neural Information Processing Systems (NeurIPS). [Link](https://proceedings.neurips.cc/paper/2020/file/d89a66c7c80a29b1bdbab0f2a1a94af8-Paper.pdf)
- H. Kim and K. Kim (2024). Fixed non-negative orthogonal classifier: inducing zero-mean neural collapse with feature dimension separation. In International Conference on Learning Representations (ICLR). [Link](https://openreview.net/pdf?id=F4bmOrmUwc)
- G. R. Kini, V. Vakilian, T. Behnia, J. Gilani Tehrani-Saadi, and C. Thrampoulidis (2024). Symmetric neural-collapse representations with supervised contrastive loss: the impact of ReLU and batching. In International Conference on Learning Representations (ICLR).
- S. Kornblith, T. Chen, H. Lee, and M. Norouzi (2021). Why do better loss functions lead to less transferable features? In Advances in Neural Information Processing Systems (NeurIPS).
- P. Koromilas, G. Bouritsas, T. Giannakopoulos, M. Nicolaou, and Y. Panagakis (2024). Bridging mini-batch and asymptotic analysis in contrastive learning: from InfoNCE to kernel-based losses. In International Conference on Machine Learning, pp. 25276–25301.
- W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song (2017). SphereFace: deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6738–6746.
- J. Lu and S. Steinerberger (2022). Neural collapse under cross-entropy loss. Applied and Computational Harmonic Analysis 59, pp. 224–241. [Document](https://dx.doi.org/10.1016/j.acha.2021.12.011)
- E. Markou, T. Ajanthan, and S. Gould (2024). Guiding neural collapse: optimising towards the nearest simplex equiangular tight frame. Advances in Neural Information Processing Systems 37, pp. 35544–35573.
- A. K. Menon, S. Jayasumana, A. S. Rawat, H. Jain, A. Veit, and S. Kumar (2021). Long-tail learning via logit adjustment. In International Conference on Learning Representations. [Link](https://openreview.net/forum?id=37nvvqkCo5)
- P. Mettes, E. van der Pol, and C. G. M. Snoek (2019). Hyperspherical prototype networks. In Advances in Neural Information Processing Systems, Vol. 32.
- T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018). Spectral normalization for generative adversarial networks. In International Conference on Learning Representations (ICLR).
- D. Neven, B. De Brabandere, M. Proesmans, and L. Van Gool (2019). Instance segmentation by jointly optimizing spatial embeddings and clustering bandwidth. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- A. Newell, Z. Huang, and J. Deng (2017). Associative embedding: end-to-end learning for joint detection and grouping. In Advances in Neural Information Processing Systems (NeurIPS).
- B. Neyshabur, S. Bhojanapalli, and N. Srebro (2018). A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations (ICLR). [Link](https://openreview.net/forum?id=Skz_WfbCZ)
- V. Papyan, X. Y. Han, and D. L. Donoho (2020). Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences 117(40), pp. 24652–24663. [Document](https://dx.doi.org/10.1073/pnas.2015509117)
- O. Roy and M. Vetterli (2007). The effective rank: a measure of effective dimensionality. In 2007 15th European Signal Processing Conference, pp. 606–610.
- T. Salimans and D. P. Kingma (2016). Weight normalization: a simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS).
- Y. Shen, X. Sun, and X. Wei (2023). Equiangular basis vectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16880–16890.
- J. Snell, K. Swersky, and R. S. Zemel (2017). Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, Vol. 30.
- K. Song, Z. Tan, B. Zou, H. Ma, and W. Huang (2024). Unveiling the dynamics of information interplay in supervised learning. In International Conference on Machine Learning, pp. 46156–46167.
- D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro (2018). The implicit bias of gradient descent on separable data. Journal of Machine Learning Research 19(70), pp. 1–57. [Link](https://jmlr.org/papers/v19/18-188.html)
- P. Súkeník, M. Mondelli, and C. H. Lampert (2024). Neural collapse versus low-rank bias: is deep neural collapse really optimal? In Advances in Neural Information Processing Systems (NeurIPS).
- P. Súkeník, M. Mondelli, and C. H. Lampert (2025). Deep neural collapse is provably optimal for the deep unconstrained features model. In Advances in Neural Information Processing Systems (NeurIPS).
- C. Thrampoulidis, G. R. Kini, V. Vakilian, and T. Behnia (2022). Imbalance trouble: revisiting neural-collapse geometry. Advances in Neural Information Processing Systems 35, pp. 27225–27238.
- T. Tirer and J. Bruna (2022). Extended unconstrained features model for exploring deep neural collapse. In Proceedings of the 39th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 162, pp. 21478–21505. [Link](https://proceedings.mlr.press/v162/tirer22a.html)
- F. Wang, X. Xiang, J. Cheng, and A. L. Yuille (2017). NormFace: l2 hypersphere embedding for face verification. In Proceedings of the 25th ACM International Conference on Multimedia (MM), pp. 1041–1049. [Document](https://dx.doi.org/10.1145/3123266.3123359)
- H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu (2018). CosFace: large margin cosine loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5265–5274.
- T. Wang and P. Isola (2020). Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pp. 9929–9939.
- X. Wang, Z. Liu, and S. X. Yu (2021). Unsupervised feature learning by cross-level instance-group discrimination. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12586–12595.
- Y. Yang, S. Chen, X. Li, L. Xie, Z. Lin, and D. Tao (2022). Inducing neural collapse in imbalanced learning: do we really need a learnable classifier at the end of deep neural network? In Advances in Neural Information Processing Systems, Vol. 35, pp. 37991–38002.
- C. Yaras, P. Wang, Z. Zhu, L. Balzano, and Q. Qu (2022). Neural collapse with normalized features: a geometric analysis over the Riemannian manifold. In Advances in Neural Information Processing Systems (NeurIPS).
- C. Yeh, C. Hong, Y. Hsu, T. Liu, Y. Chen, and Y. LeCun (2022). Decoupled contrastive learning. In European Conference on Computer Vision, pp. 668–684.
- S. Zhang, Z. Song, and K. He (2025). Neural collapse inspired knowledge distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 22542–22550.
- Y. Zhang, Z. Tan, J. Yang, W. Huang, and Y. Yuan (2024). Matrix information theory for self-supervised learning. In Proceedings of the 41st International Conference on Machine Learning, pp. 59897–59918.
- J. Zhou, X. Li, T. Ding, C. You, Q. Qu, and Z. Zhu (2022a). On the optimization landscape of neural collapse under MSE loss: global optimality with unconstrained features. In International Conference on Machine Learning, pp. 27179–27202.
- J. Zhou, C. You, X. Li, K. Liu, S. Liu, Q. Qu, and Z. Zhu (2022b). Are all losses created equal: a neural collapse perspective. In Advances in Neural Information Processing Systems (NeurIPS).
- Z. Zhu, T. Ding, J. Zhou, X. Li, C. You, J. Sulam, and Q. Qu (2021). A geometric analysis of neural collapse with unconstrained features. Advances in Neural Information Processing Systems 34, pp. 29820–29834.
## Appendix A Neural Collapse Optimality of Normalized Objectives
We adopt the balanced unconstrained-features / layer-peeled model (UFM/LPM) (Tirer and Bruna, [2022](https://arxiv.org/html/2605.20302#bib.bib41); Yaras et al., [2022](https://arxiv.org/html/2605.20302#bib.bib23)). The last-layer features $\bm{z}_i \in \mathbb{R}^d$ and classifier weights $\bm{w}_c \in \mathbb{R}^d$ are free optimization variables. We work with their $\ell_2$-normalized versions

$$\bm{u}_i = \frac{\bm{z}_i}{\|\bm{z}_i\|}, \qquad \hat{\bm{w}}_c = \frac{\bm{w}_c}{\|\bm{w}_c\|},$$

so that $\|\bm{u}_i\| = \|\hat{\bm{w}}_c\| = 1$ and

$$S_{ic} := \bm{u}_i^\top \hat{\bm{w}}_c.$$

There are $K$ classes and $M$ training samples, and the dataset is balanced: each class $c$ has index set

$$I_c := \{i : y_i = c\} \quad\text{with}\quad |I_c| = n = M/K.$$

We assume $d \geq K$.
#### Normalized CE–based losses\.
We consider three normalized cross-entropy–based losses with temperature $\tau > 0$:

$$\mathcal{L}_{\mathrm{NF}} = -\frac{1}{M}\sum_{i=1}^{M}\log\frac{\exp(S_{i,y_i}/\tau)}{\sum_{c=1}^{K}\exp(S_{ic}/\tau)}, \tag{7}$$

$$\mathcal{L}_{\mathrm{NTCE}} = -\frac{1}{M}\sum_{i=1}^{M}\log\frac{\exp(S_{i,y_i}/\tau)}{\sum_{j=1}^{M}\exp(S_{j,y_i}/\tau)}, \tag{8}$$

$$\mathcal{L}_{\mathrm{NONL}} = -\frac{1}{M}\sum_{i=1}^{M}\log\frac{\exp(S_{i,y_i}/\tau)}{\sum_{j:\,y_j\neq y_i}\exp(S_{j,y_i}/\tau)}. \tag{9}$$

The three losses differ only in the denominator: NF sums over the $K$ prototypes for a fixed sample $i$, NTCE sums over all $M$ samples for the fixed prototype $\hat{\bm{w}}_{y_i}$, and NONL restricts that sum to samples from other classes.
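To make the difference between the three denominators concrete, here is a minimal NumPy sketch of equations (7)–(9). It is an illustration under stated assumptions, not the authors' implementation: the function names `logsumexp` and `normalized_losses` are hypothetical, the losses are evaluated over the full index set $\{1,\dots,M\}$ exactly as written above (a practical implementation would presumably use mini-batches), and rows of `W` play the role of the $\bm{w}_c$.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def normalized_losses(Z, W, y, tau=0.1):
    """Return (L_NF, L_NTCE, L_NONL) for features Z (M, d), weights W (K, d), labels y (M,)."""
    U = Z / np.linalg.norm(Z, axis=1, keepdims=True)       # u_i
    W_hat = W / np.linalg.norm(W, axis=1, keepdims=True)   # \hat{w}_c
    S = U @ W_hat.T / tau                                   # S[i, c] = S_ic / tau
    M, K = S.shape
    pos = S[np.arange(M), y]                                # S_{i, y_i} / tau

    # NF (7): for each sample i, normalize over the K prototypes (row of S).
    l_nf = np.mean(-pos + logsumexp(S, axis=1))

    # NTCE (8): for each prototype y_i, normalize over all M samples (column of S).
    l_ntce = np.mean(-pos + logsumexp(S, axis=0)[y])

    # NONL (9): same column, but keep only samples j with y_j != y_i.
    is_negative = y[:, None] != np.arange(K)[None, :]       # is_negative[j, c] = (y_j != c)
    S_neg = np.where(is_negative, S, -np.inf)
    l_nonl = np.mean(-pos + logsumexp(S_neg, axis=0)[y])
    return l_nf, l_ntce, l_nonl

rng = np.random.default_rng(0)
M, K, d = 12, 3, 5
y = np.repeat(np.arange(K), M // K)                         # balanced labels
Z, W = rng.normal(size=(M, d)), rng.normal(size=(K, d))
l_nf, l_ntce, l_nonl = normalized_losses(Z, W, y)
```

Because features and weights are normalized inside the function, rescaling `Z` or `W` by any positive constant leaves all three values unchanged. The NONL value is always below the NTCE value for the same inputs, since its denominator drops the same-class terms.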
#### Neural Collapse properties\.
We say a configuration exhibits *Neural Collapse* (NC) if there exist unit vectors $\bm{\mu}_1, \dots, \bm{\mu}_K \in \mathbb{R}^d$ such that:

- (NC1) (Within-class collapse) $\bm{u}_i = \bm{\mu}_{y_i}$ for all $i$.
- (NC2) (Simplex ETF) the vectors $\{\bm{\mu}_c\}$ form a centered regular simplex in a $(K-1)$-dimensional subspace: $\|\bm{\mu}_c\| = 1$ and $\bm{\mu}_c^\top \bm{\mu}_{c'} = -\frac{1}{K-1}$ for all $c \neq c'$.
- (NC3) (Classifier–mean alignment) $\hat{\bm{w}}_c = \bm{\mu}_c$ for all $c$.

At such a configuration the (normalized-feature) class means $\hat{\bm{\mu}}_c := \frac{1}{n}\sum_{i\in I_c}\bm{u}_i$ coincide with $\bm{\mu}_c$ and are therefore unit norm.
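The NC1–NC3 conditions can be checked numerically. The sketch below builds one simplex ETF in $\mathbb{R}^d$ (for $d \geq K$) by centering and rescaling the identity and applying a random partial rotation, then tests the three properties. The helper names `simplex_etf` and `nc_checks` are hypothetical, and this particular construction is one standard choice among many, not something specified in the paper.

```python
import numpy as np

def simplex_etf(K, d, seed=0):
    """Return a (K, d) array whose rows form a unit-norm simplex ETF (assumes d >= K)."""
    if d < K:
        raise ValueError("this construction assumes d >= K")
    rng = np.random.default_rng(seed)
    # P has orthonormal columns (d, K): an arbitrary rotation into R^d.
    P, _ = np.linalg.qr(rng.normal(size=(d, K)))
    # Centered, rescaled identity: Gram matrix has 1 on the diagonal, -1/(K-1) elsewhere.
    C = np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)
    return (P @ C).T

def nc_checks(U, W_hat, y, atol=1e-8):
    """Return (NC1, NC2, NC3) as booleans for unit features U, unit weights W_hat, labels y."""
    K = W_hat.shape[0]
    mu = np.stack([U[y == c].mean(axis=0) for c in range(K)])     # class means
    nc1 = np.allclose(U, mu[y], atol=atol)                         # within-class collapse
    target = (K * np.eye(K) - np.ones((K, K))) / (K - 1)           # simplex ETF Gram matrix
    nc2 = np.allclose(mu @ mu.T, target, atol=atol)
    nc3 = np.allclose(W_hat, mu, atol=atol)                        # classifier-mean alignment
    return nc1, nc2, nc3

K, d, n = 4, 6, 3
mu = simplex_etf(K, d)
y = np.repeat(np.arange(K), n)
collapsed = nc_checks(mu[y], mu, y)   # features placed exactly on their prototypes
```

Perturbing the features away from `mu[y]` and renormalizing breaks NC1 first, which is a quick way to see that the three checks are independent.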
###### Theorem A.1 (NC optimality of normalized CE–based losses).
In the balanced UFM/LPM setting above with $d \geq K$, every global minimizer of each of $\mathcal{L}_{\mathrm{NF}}$, $\mathcal{L}_{\mathrm{NTCE}}$, and $\mathcal{L}_{\mathrm{NONL}}$ satisfies NC1–NC3, up to a global rotation and permutation of class labels.
We now analyze the three losses in turn\.
### NormFace
Yaras et al. ([2022](https://arxiv.org/html/2605.20302#bib.bib23)) study the constrained UFM problem

$$\min_{H,W}\ \frac{1}{M}\sum_{i=1}^{M}\mathrm{CE}\big(\tau' W^\top \bm{h}_i,\, y_i\big) \quad \text{s.t. } \|\bm{h}_i\| = 1,\ \|\bm{w}_c\| = 1, \tag{10}$$

with $\tau' > 0$, where $\mathrm{CE}$ is the standard cross-entropy.
###### Lemma A.2 (NormFace $\equiv$ Yaras et al.).
Set $\tau' = 1/\tau$, and identify $\bm{h}_i = \bm{u}_i$ and $\bm{w}_c = \hat{\bm{w}}_c$. Then $\mathcal{L}_{\mathrm{NF}}$ coincides with the objective of equation [10](https://arxiv.org/html/2605.20302#A1.E10), and $\operatorname*{arg\,min}\mathcal{L}_{\mathrm{NF}}$ equals the set of global minimizers of equation [10](https://arxiv.org/html/2605.20302#A1.E10) over unit-norm features and weights.
###### Proof\.
With $\bm{h}_i = \bm{u}_i$, $\bm{w}_c = \hat{\bm{w}}_c$ and $\tau' = 1/\tau$, we have $\big(\tau' W^\top \bm{h}_i\big)_c = S_{ic}/\tau$, so the summand in equation [10](https://arxiv.org/html/2605.20302#A1.E10) is exactly $-\log\frac{\exp(S_{i,y_i}/\tau)}{\sum_{c}\exp(S_{ic}/\tau)}$, which is the $i$th summand in $\mathcal{L}_{\mathrm{NF}}$. Averaging over $i$ gives the claim. ∎
Theorem 3.1 of Yaras et al. ([2022](https://arxiv.org/html/2605.20302#bib.bib23)) states that, under balanced labels and $d \geq K$, every global minimizer of equation [10](https://arxiv.org/html/2605.20302#A1.E10) satisfies NC1–NC3. Together with Lemma [A.2](https://arxiv.org/html/2605.20302#A1.Thmtheorem2), this implies that every global minimizer of $\mathcal{L}_{\mathrm{NF}}$ satisfies NC1–NC3.
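The identification in Lemma A.2 is purely algebraic, and a small numerical check makes it tangible. The sketch below evaluates the objective of equation (10) with $\tau' = 1/\tau$ and the NormFace loss on the same random unit-norm configuration. The helper `cross_entropy` and all variable names are illustrative assumptions; rows of `W` store the vectors $\bm{w}_c$, so `W @ H[i]` plays the role of $W^\top \bm{h}_i$.

```python
import numpy as np

def cross_entropy(logits, label):
    """Standard cross-entropy of one logit vector against an integer label."""
    shifted = logits - logits.max()                 # stabilize the softmax
    return -shifted[label] + np.log(np.exp(shifted).sum())

rng = np.random.default_rng(0)
M, K, d, tau = 8, 4, 6, 0.2
y = np.repeat(np.arange(K), M // K)

H = rng.normal(size=(M, d))
H /= np.linalg.norm(H, axis=1, keepdims=True)       # h_i = u_i (unit norm)
W = rng.normal(size=(K, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)       # rows are w_c = \hat{w}_c (unit norm)

# Objective of equation (10) with tau' = 1 / tau.
tau_prime = 1.0 / tau
objective_10 = np.mean([cross_entropy(tau_prime * (W @ H[i]), y[i]) for i in range(M)])

# L_NF written directly in terms of S_ic = u_i^T \hat{w}_c.
S = H @ W.T
l_nf = np.mean(-S[np.arange(M), y] / tau + np.log(np.exp(S / tau).sum(axis=1)))
```

The two quantities agree up to floating-point rounding for any unit-norm inputs, which is all the lemma asserts.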
### NTCE
We now show that every global minimizer of $\mathcal{L}_{\mathrm{NTCE}}$ satisfies NC1–NC3. The proof follows the same three-step pattern we later use for NONL: we first reduce to a class-level objective depending only on class means and weights, then view this objective as a contrastive loss of La/Lc type from Koromilas et al. ([2024](https://arxiv.org/html/2605.20302#bib.bib50)) and apply the corresponding minimizer characterization at the class level, and finally lift the resulting structure back to the sample level.
Recall that
$$\mathcal{L}_{\mathrm{NTCE}} = -\frac{1}{M}\sum_{i=1}^{M}\log\frac{\exp(\bm{u}_i^\top \hat{\bm{w}}_{y_i}/\tau)}{\sum_{j=1}^{M}\exp(\bm{u}_j^\top \hat{\bm{w}}_{y_i}/\tau)} \;=\; \frac{1}{M}\sum_{i=1}^{M}\ell_i^{\mathrm{NTCE}},$$

with per-sample loss

$$\ell_i^{\mathrm{NTCE}} := -\log\frac{\exp(\bm{u}_i^\top \hat{\bm{w}}_{y_i}/\tau)}{\sum_{j=1}^{M}\exp(\bm{u}_j^\top \hat{\bm{w}}_{y_i}/\tau)}.$$

We again work in the balanced setting $|I_c| = n = M/K$ with $\|\bm{u}_i\| = \|\hat{\bm{w}}_c\| = 1$.
#### Step 1: reduction to class means\.
###### Lemma A.3 (NTCE reduction via class means).
Assume balanced labels, $|I_c| = n = M/K$ for all $c$. For any configuration $\{\bm{u}_i\}, \{\hat{\bm{w}}_c\}$ with $\|\bm{u}_i\| = \|\hat{\bm{w}}_c\| = 1$, define the normalized-feature class means

$$\hat{\bm{\mu}}_c := \frac{1}{n}\sum_{j\in I_c}\bm{u}_j.$$

Then

$$\mathcal{L}_{\mathrm{NTCE}}(\{\bm{u}_i\},\{\hat{\bm{w}}_c\}) \;\geq\; L_{\mathrm{NTCE}}^{\mathrm{cls}}(\{\hat{\bm{\mu}}_c\},\{\hat{\bm{w}}_c\}),$$

where the class-level loss is

$$L_{\mathrm{NTCE}}^{\mathrm{cls}} := -\frac{1}{K\tau}\sum_{c=1}^{K}\hat{\bm{w}}_c^\top \hat{\bm{\mu}}_c + \frac{1}{K}\sum_{c=1}^{K}\log\Bigg(\sum_{c'=1}^{K} n\,\exp\big(\hat{\bm{w}}_c^\top \hat{\bm{\mu}}_{c'}/\tau\big)\Bigg). \tag{11}$$

Moreover, the inequality is tight if and only if, for every ordered pair $(c,c')$, the logits $\hat{\bm{w}}_c^\top \bm{u}_j$ are constant over $j \in I_{c'}$, i.e.

$$\hat{\bm{w}}_c^\top \bm{u}_j = \hat{\bm{w}}_c^\top \hat{\bm{\mu}}_{c'} \quad \text{for all } j \in I_{c'}.$$
###### Proof\.
Fix a configuration $\{\bm{u}_i\}, \{\hat{\bm{w}}_c\}$ and define the class means $\hat{\bm{\mu}}_c$ as above. Using the balanced labels, write $\mathcal{L}_{\mathrm{NTCE}}$ as an average over classes. For $i \in I_c$ we have

$$\ell_i^{\mathrm{NTCE}} = -\frac{1}{\tau}\hat{\bm{w}}_c^\top \bm{u}_i + \log\Big(\sum_{j=1}^{M}\exp(\hat{\bm{w}}_c^\top \bm{u}_j/\tau)\Big),$$

so

$$\begin{aligned}
\mathcal{L}_{\mathrm{NTCE}} &= \frac{1}{M}\sum_{c=1}^{K}\sum_{i\in I_c}\ell_i^{\mathrm{NTCE}} \\
&= -\frac{1}{M\tau}\sum_{c=1}^{K}\sum_{i\in I_c}\hat{\bm{w}}_c^\top \bm{u}_i + \frac{1}{M}\sum_{c=1}^{K}\sum_{i\in I_c}\log\Big(\sum_{j=1}^{M}\exp(\hat{\bm{w}}_c^\top \bm{u}_j/\tau)\Big).
\end{aligned}$$

The denominator term inside the logarithm depends only on the anchor class $c$, not on $i$, so $\sum_{i\in I_c}$ introduces a factor of $|I_c| = n$. Using $M = nK$ and the definition of $\hat{\bm{\mu}}_c$ we obtain

$$\begin{aligned}
\mathcal{L}_{\mathrm{NTCE}} &= -\frac{1}{K\tau}\sum_{c=1}^{K}\hat{\bm{w}}_c^\top\Big(\frac{1}{n}\sum_{i\in I_c}\bm{u}_i\Big) + \frac{1}{K}\sum_{c=1}^{K}\log\Big(\sum_{j=1}^{M}\exp(\hat{\bm{w}}_c^\top \bm{u}_j/\tau)\Big) \\
&= -\frac{1}{K\tau}\sum_{c=1}^{K}\hat{\bm{w}}_c^\top \hat{\bm{\mu}}_c + \frac{1}{K}\sum_{c=1}^{K}\log\Big(\sum_{j=1}^{M}\exp(\hat{\bm{w}}_c^\top \bm{u}_j/\tau)\Big).
\end{aligned}$$
For each fixed anchor class $c$, split the denominator over classes:

$$\sum_{j=1}^{M}\exp(\hat{\bm{w}}_c^\top \bm{u}_j/\tau) = \sum_{c'=1}^{K}\sum_{j\in I_{c'}}\exp(\hat{\bm{w}}_c^\top \bm{u}_j/\tau).$$

For fixed $(c,c')$, the function $f_c(\bm{x}) := \exp(\hat{\bm{w}}_c^\top \bm{x}/\tau)$ is convex in $\bm{x}$, so by Jensen's inequality over $j \in I_{c'}$,

$$\frac{1}{n}\sum_{j\in I_{c'}}\exp(\hat{\bm{w}}_c^\top \bm{u}_j/\tau) = \frac{1}{n}\sum_{j\in I_{c'}}f_c(\bm{u}_j) \;\geq\; f_c\Big(\frac{1}{n}\sum_{j\in I_{c'}}\bm{u}_j\Big) = \exp(\hat{\bm{w}}_c^\top \hat{\bm{\mu}}_{c'}/\tau).$$

Multiplying by $n$ and summing over $c'$ yields

$$\sum_{j=1}^{M}\exp(\hat{\bm{w}}_c^\top \bm{u}_j/\tau) \;\geq\; \sum_{c'=1}^{K} n\,\exp(\hat{\bm{w}}_c^\top \hat{\bm{\mu}}_{c'}/\tau).$$

Taking logs (the logarithm is increasing), averaging over $c$, and substituting into the expression for $\mathcal{L}_{\mathrm{NTCE}}$ obtained above gives

$$\mathcal{L}_{\mathrm{NTCE}} \;\geq\; -\frac{1}{K\tau}\sum_{c=1}^{K}\hat{\bm{w}}_c^\top \hat{\bm{\mu}}_c + \frac{1}{K}\sum_{c=1}^{K}\log\Bigg(\sum_{c'=1}^{K} n\,\exp(\hat{\bm{w}}_c^\top \hat{\bm{\mu}}_{c'}/\tau)\Bigg) = L_{\mathrm{NTCE}}^{\mathrm{cls}}.$$
Because $t \mapsto \exp(t/\tau)$ is strictly convex, Jensen's inequality is tight for a given pair $(c,c')$ if and only if the scalar logits $\hat{\bm{w}}_c^\top \bm{u}_j$ are constant in $j$ on $I_{c'}$. In that case this constant must equal $\hat{\bm{w}}_c^\top \hat{\bm{\mu}}_{c'}$. Tightness for all $c, c'$ gives the stated condition. ∎
Thus, for any configuration of unit features and weights, the NTCE loss is lower-bounded by the class-level objective $L_{\mathrm{NTCE}}^{\mathrm{cls}}$ depending only on the $K$ class means $\hat{\bm{\mu}}_c$ and the $K$ classifier weights $\hat{\bm{w}}_c$, and Lemma [A.3](https://arxiv.org/html/2605.20302#A1.Thmtheorem3) precisely characterizes when this lower bound is tight (blockwise constant logits).
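The bound of Lemma A.3 and its tightness condition can be observed numerically. The following sketch compares the sample-level NTCE loss with the class-level objective (11) on two configurations: scattered unit features, where Jensen's inequality is strict, and features collapsed onto one arbitrary unit vector per class, where the logits are blockwise constant. All function and variable names are hypothetical, and the collapsed vectors are deliberately not a simplex ETF, since tightness only requires within-class constancy.

```python
import numpy as np

def ntce_sample_loss(U, W_hat, y, tau):
    """Sample-level NTCE loss for unit rows U (M, d) and W_hat (K, d)."""
    S = U @ W_hat.T / tau                                   # S[j, c] = w_c^T u_j / tau
    log_denom = np.log(np.exp(S).sum(axis=0))               # one denominator per anchor class c
    return np.mean(-S[np.arange(len(y)), y] + log_denom[y])

def ntce_class_loss(U, W_hat, y, tau):
    """Class-level lower bound (11), computed from the class means of U."""
    K = W_hat.shape[0]
    n = len(y) // K                                         # balanced: n samples per class
    mu_hat = np.stack([U[y == c].mean(axis=0) for c in range(K)])
    A = W_hat @ mu_hat.T / tau                              # A[c, c'] = w_c^T mu_c' / tau
    return -np.mean(np.diag(A)) + np.mean(np.log(n * np.exp(A).sum(axis=1)))

def unit_rows(X):
    """Normalize each row of X to unit Euclidean norm."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

rng = np.random.default_rng(0)
K, n, d, tau = 4, 6, 8, 0.5
y = np.repeat(np.arange(K), n)
W_hat = unit_rows(rng.normal(size=(K, d)))

# Generic configuration: features scattered within each class, so the bound is strict.
U = unit_rows(rng.normal(size=(K * n, d)))
gap_generic = ntce_sample_loss(U, W_hat, y, tau) - ntce_class_loss(U, W_hat, y, tau)

# Collapsed configuration: every sample of class c sits on the single unit vector V[c],
# so all logits w_c^T u_j are constant within each class and the bound is tight.
V = unit_rows(rng.normal(size=(K, d)))
gap_collapsed = ntce_sample_loss(V[y], W_hat, y, tau) - ntce_class_loss(V[y], W_hat, y, tau)
```

Here `gap_generic` is strictly positive while `gap_collapsed` vanishes up to rounding, matching the "if and only if" in the lemma.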
It is convenient to separate out the additive constant $\log n$, and to view the class means and weights abstractly as unit vectors. Define the *normalized* class-level NTCE loss

$$\tilde{L}_{\mathrm{NTCE}}^{\mathrm{cls}}(\{\hat{\bm{\mu}}_c\},\{\hat{\bm{w}}_c\}) := -\frac{1}{K\tau}\sum_{c=1}^{K}\hat{\bm{w}}_c^\top \hat{\bm{\mu}}_c + \frac{1}{K}\sum_{c=1}^{K}\log\Bigg(\sum_{c'=1}^{K}\exp\big(\hat{\bm{w}}_c^\top \hat{\bm{\mu}}_{c'}/\tau\big)\Bigg),$$

so that, pulling the common factor $n$ out of each logarithm in equation (11),

$$L_{\mathrm{NTCE}}^{\mathrm{cls}} = \log n + \tilde{L}_{\mathrm{NTCE}}^{\mathrm{cls}}.$$
In what follows, we treat the pairs $(\hat{\bm{\mu}}_c, \hat{\bm{w}}_c)$ as free variables on the unit sphere and, to lighten notation, write $\bm{\mu}_c := \hat{\bm{\mu}}_c$ and $\bm{w}_c := \hat{\bm{w}}_c$.
#### Step 2: analysis of the class\-level problem\.
For each class $c$ we view the $c$th summand in $\tilde{L}_{\mathrm{NTCE}}^{\mathrm{cls}}$ as a standard contrastive loss of La/Lc type (Koromilas et al., [2024](https://arxiv.org/html/2605.20302#bib.bib50)), with

$$q_c = \bm{w}_c \quad \text{(anchor)}, \qquad k_c^{+} = \bm{\mu}_c \quad \text{(positive)}, \qquad \{k_c^{-} = \bm{\mu}_{c'} : c' \neq c\} \quad \text{(negatives)}.$$

The per-class alignment and contrastive terms are

$$L_{\mathrm{a}}(q_c, k_c^{+}) = -\frac{1}{\tau}q_c^\top k_c^{+}, \qquad L_{\mathrm{c}}(q_c, \{k_c^{-}\}) = \log\Big(\sum_{c'=1}^{K}\exp(q_c^\top k_{c'}^{-}/\tau)\Big).$$

The La/Lc framework requires $L_{\mathrm{a}}$ to be strictly decreasing in similarity and $L_{\mathrm{c}}$ to be convex and strictly increasing in similarity. These conditions hold here:

- $q_c^\top k_c^{+}$ enters $L_{\mathrm{a}}$ linearly with a negative coefficient, so $L_{\mathrm{a}}$ decreases as $q_c^\top k_c^{+}$ increases.
- $L_{\mathrm{c}}$ is a log-sum-exp of the similarities $q_c^\top k_c^{-}/\tau$, hence convex and strictly increasing in each similarity argument.
Therefore we may invoke the minimizer characterization for La/Lc losses. By Theorem 4.1 and Appendix B.1 of Koromilas et al. ([2024](https://arxiv.org/html/2605.20302#bib.bib50)), provided $d \geq K$, the global minimizers of $\tilde{L}_{\mathrm{NTCE}}^{\mathrm{cls}}$ over unit vectors satisfy:

- Perfect alignment: $\bm{\mu}_c = \bm{w}_c$ for all $c$.
- Simplex ETF structure: the directions $\{\bm{\mu}_c\}_{c=1}^{K}$ form a centered regular simplex equiangular tight frame in a $(K-1)$-dimensional subspace: $\|\bm{\mu}_c\| = 1$ and $\bm{\mu}_c^\top \bm{\mu}_{c'} = -\frac{1}{K-1}$ for all $c \neq c'$.
In particular, there exists a simplex ETF $\{\bm{\mu}_c\}_{c=1}^{K} \subset \mathbb{R}^d$ such that the configuration $\bm{w}_c = \bm{\mu}_c$ for all $c$ is a global minimizer of $\tilde{L}_{\mathrm{NTCE}}^{\mathrm{cls}}$, unique up to a global rotation and permutation of the class indices.
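As a sanity check on Step 2, the sketch below evaluates the normalized class-level loss at an aligned simplex ETF and compares it with the loss at many randomly drawn pairs of unit-vector configurations. This is a numerical illustration under assumptions, not a proof: the names `class_level_ntce`, `simplex_etf`, and `random_unit_rows` are hypothetical, and random search only probes the landscape rather than certifying global optimality.

```python
import numpy as np

def class_level_ntce(mu, w, tau):
    """Normalized class-level NTCE loss for unit rows mu (K, d) and w (K, d)."""
    A = w @ mu.T / tau                                      # A[c, c'] = w_c^T mu_c' / tau
    return -np.mean(np.diag(A)) + np.mean(np.log(np.exp(A).sum(axis=1)))

def simplex_etf(K, d, seed=0):
    """Rows form a unit-norm simplex ETF in R^d (assumes d >= K)."""
    rng = np.random.default_rng(seed)
    P, _ = np.linalg.qr(rng.normal(size=(d, K)))            # orthonormal columns
    C = np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)
    return (P @ C).T

def random_unit_rows(rng, K, d):
    """Draw K random directions on the unit sphere in R^d."""
    X = rng.normal(size=(K, d))
    return X / np.linalg.norm(X, axis=1, keepdims=True)

K, d, tau = 5, 8, 0.5
etf = simplex_etf(K, d)
loss_at_etf = class_level_ntce(etf, etf, tau)               # aligned: w_c = mu_c

rng = np.random.default_rng(1)
random_losses = [
    class_level_ntce(random_unit_rows(rng, K, d), random_unit_rows(rng, K, d), tau)
    for _ in range(2000)
]
best_random = min(random_losses)
```

In this experiment `best_random` stays above `loss_at_etf`, consistent with the aligned simplex ETF being the global minimizer.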
#### Step 3: lifting back to the sample level\.
We now relate these class\-level minimizers back to the original sample\-level NTCE objective and derive the NC structure of its global minimizers\.
*Existence of Neural Collapse minimizers.* Let $\{\bm{\mu}_c\}_{c=1}^{K}$ be a simplex ETF and set

$$\hat{\bm{w}}_c:=\bm{\mu}_c,\qquad\bm{u}_i:=\bm{\mu}_{y_i}\quad\text{for all }i.$$

This configuration satisfies NC1–NC3 by construction: within each class $c$, all normalized features collapse to $\bm{\mu}_c$ (NC1), the vectors $\{\bm{\mu}_c\}$ form a centered simplex ETF (NC2), and $\hat{\bm{w}}_c=\bm{\mu}_c$ (NC3). In particular, the feature class means are $\hat{\bm{\mu}}_c=\bm{\mu}_c$.

Moreover, for this configuration the Jensen inequalities in Lemma [A.3](https://arxiv.org/html/2605.20302#A1.Thmtheorem3) are tight: for any anchor class $c$ and any class $c'$, we have $\hat{\bm{w}}_c^{\top}\bm{u}_j=\bm{\mu}_c^{\top}\bm{\mu}_{c'}$ for all $j\in I_{c'}$, so the logits are constant within each class. Hence

$$\mathcal{L}_{\mathrm{NTCE}}=L_{\mathrm{NTCE}}^{\mathrm{cls}}(\{\hat{\bm{\mu}}_c\},\{\hat{\bm{w}}_c\})=\log n+\tilde{L}_{\mathrm{NTCE}}^{\mathrm{cls}}(\{\bm{\mu}_c\},\{\bm{\mu}_c\}).$$

Since $(\{\bm{\mu}_c\},\{\bm{\mu}_c\})$ is a global minimizer of $\tilde{L}_{\mathrm{NTCE}}^{\mathrm{cls}}$, this shows that

$$\inf_{\{\bm{u}_i\},\{\hat{\bm{w}}_c\}}\mathcal{L}_{\mathrm{NTCE}}\;\leq\;\log n+\inf_{\{\bm{\mu}_c\},\{\bm{w}_c\}}\tilde{L}_{\mathrm{NTCE}}^{\mathrm{cls}}.$$
*Structure of arbitrary global minimizers.* Conversely, let $(\{\bm{u}_i^{\star}\},\{\hat{\bm{w}}_c^{\star}\})$ be any global minimizer of $\mathcal{L}_{\mathrm{NTCE}}$, and let

$$\hat{\bm{\mu}}_c^{\star}:=\frac{1}{n}\sum_{j\in I_c}\bm{u}_j^{\star}$$

be the corresponding class means. Lemma [A.3](https://arxiv.org/html/2605.20302#A1.Thmtheorem3) gives

$$\mathcal{L}_{\mathrm{NTCE}}(\{\bm{u}_i^{\star}\},\{\hat{\bm{w}}_c^{\star}\})\;\geq\;L_{\mathrm{NTCE}}^{\mathrm{cls}}(\{\hat{\bm{\mu}}_c^{\star}\},\{\hat{\bm{w}}_c^{\star}\})=\log n+\tilde{L}_{\mathrm{NTCE}}^{\mathrm{cls}}(\{\hat{\bm{\mu}}_c^{\star}\},\{\hat{\bm{w}}_c^{\star}\}).$$

On the other hand, from the ETF construction above we know that

$$\inf_{\{\bm{u}_i\},\{\hat{\bm{w}}_c\}}\mathcal{L}_{\mathrm{NTCE}}\;\leq\;\log n+\inf_{\{\bm{\mu}_c\},\{\bm{w}_c\}}\tilde{L}_{\mathrm{NTCE}}^{\mathrm{cls}}.$$

Since $(\{\bm{u}_i^{\star}\},\{\hat{\bm{w}}_c^{\star}\})$ achieves the global infimum, the two displays must be equalities. Therefore:
- • $\tilde{L}_{\mathrm{NTCE}}^{\mathrm{cls}}(\{\hat{\bm{\mu}}_c^{\star}\},\{\hat{\bm{w}}_c^{\star}\})$ attains the global minimum of $\tilde{L}_{\mathrm{NTCE}}^{\mathrm{cls}}$, so by the La/Lc minimizer characterization we must have, up to a global rotation and permutation of class labels, $\hat{\bm{\mu}}_c^{\star}=\hat{\bm{w}}_c^{\star}$ for all $c$, and $\{\hat{\bm{\mu}}_c^{\star}\}$ form a centered simplex ETF.
- • Lemma [A.3](https://arxiv.org/html/2605.20302#A1.Thmtheorem3) must be tight at the minimizer, so the Jensen equalities hold for all $(c,c')$: for every anchor class $c$ and every class $c'$, the logits $\hat{\bm{w}}_c^{\star\top}\bm{u}_j^{\star}$ are constant over $j\in I_{c'}$, equal to $\hat{\bm{w}}_c^{\star\top}\hat{\bm{\mu}}_{c'}^{\star}$.
Let $S:=\mathrm{span}\{\hat{\bm{w}}_1^{\star},\dots,\hat{\bm{w}}_K^{\star}\}$, which is the $(K-1)$-dimensional simplex-ETF subspace. Fix a class $c'$ and $j\in I_{c'}$. For every $c\neq c'$, tightness of Jensen gives

$$\hat{\bm{w}}_c^{\star\top}\big(\bm{u}_j^{\star}-\hat{\bm{\mu}}_{c'}^{\star}\big)=0.$$

Since $\{\hat{\bm{w}}_c^{\star}\}_{c=1}^{K}$ form a centered simplex ETF in the $(K-1)$-dimensional subspace $S$ and satisfy $\sum_{c=1}^{K}\hat{\bm{w}}_c^{\star}=0$, any $K-1$ of them are linearly independent and thus span $S$. In particular, the set $\{\hat{\bm{w}}_c^{\star}:c\neq c'\}$ spans $S$, so $\bm{u}_j^{\star}-\hat{\bm{\mu}}_{c'}^{\star}$ is orthogonal to $S$, and hence the orthogonal projection of $\bm{u}_j^{\star}$ onto $S$ equals $\hat{\bm{\mu}}_{c'}^{\star}$.

But both $\bm{u}_j^{\star}$ and $\hat{\bm{\mu}}_{c'}^{\star}=\hat{\bm{w}}_{c'}^{\star}$ are unit vectors, and $\hat{\bm{\mu}}_{c'}^{\star}\in S$. The only way for a unit vector to have a unit-norm projection onto $S$ is to lie in $S$ itself and coincide with its projection, so we must have

$$\bm{u}_j^{\star}=\hat{\bm{\mu}}_{c'}^{\star}\quad\text{for all }j\in I_{c'}.$$

Thus within each class all features collapse to a single unit direction (NC1), these $K$ directions form a centered simplex ETF (NC2), and the classifier weights align with the class means (NC3). Therefore every global minimizer of $\mathcal{L}_{\mathrm{NTCE}}$ exhibits Neural Collapse, up to a global rotation and permutation of the class labels.
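The linear-independence claim used in this step (any $K-1$ vectors of a centered simplex ETF span $S$) can be checked numerically. This is a minimal sketch, assuming the standard ETF construction in $\mathbb{R}^{K}$; the function names are hypothetical.

```python
import numpy as np

def simplex_etf(K: int) -> np.ndarray:
    """K unit vectors in R^K forming a centered simplex ETF (illustrative construction)."""
    return np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)

def leave_one_out_ranks(W: np.ndarray) -> list:
    """Rank of the weight matrix after deleting each row in turn."""
    return [int(np.linalg.matrix_rank(np.delete(W, c, axis=0))) for c in range(W.shape[0])]

K = 6
W = simplex_etf(K)
print(leave_one_out_ranks(W))             # each entry should equal K - 1
print(np.allclose(W.sum(axis=0), 0.0))    # the K vectors sum to zero, so all K together are dependent
```

Dropping any single class weight leaves a set of full rank $K-1$, which is why orthogonality to the remaining $K-1$ weights already implies orthogonality to all of $S$.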
### NONL
We finally treat the NONL objective, equation [9](https://arxiv.org/html/2605.20302#A1.E9). The proof proceeds by first bounding the sample-level loss by a class-level objective depending only on class means and weights, then applying the La/Lc minimizer characterization at the class level, and finally lifting this structure back to the sample level to obtain NC1–NC3.

Recall that

$$\mathcal{L}_{\mathrm{NONL}}=-\frac{1}{M}\sum_{i=1}^{M}\log\frac{\exp(S_{i,y_i}/\tau)}{\sum_{j:\,y_j\neq y_i}\exp(S_{j,y_i}/\tau)}\;=\;\frac{1}{M}\sum_{i=1}^{M}\ell_i^{\mathrm{NONL}},$$

with per-sample loss

$$\ell_i^{\mathrm{NONL}}:=-\log\frac{\exp(\bm{u}_i^{\top}\hat{\bm{w}}_{y_i}/\tau)}{\sum_{j:\,y_j\neq y_i}\exp(\bm{u}_j^{\top}\hat{\bm{w}}_{y_i}/\tau)}.$$

We again work in the balanced setting $|I_c|=n=M/K$ and $\|\bm{u}_i\|=\|\hat{\bm{w}}_c\|=1$.
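To make the indexing in the per-sample loss concrete, here is a minimal NumPy sketch of $\mathcal{L}_{\mathrm{NONL}}$ written directly from the definition above, with $S_{j,c}=\bm{u}_j^{\top}\hat{\bm{w}}_c$. The function name `nonl_loss`, its signature, and the default temperature are assumptions for illustration; this is not the authors' implementation, and it assumes every sample has at least one negative in the batch.

```python
import numpy as np

def nonl_loss(U: np.ndarray, W: np.ndarray, y: np.ndarray, tau: float = 0.1) -> float:
    """Sample-level NONL loss for unit features U (M, d), unit weights W (K, d), labels y (M,)."""
    M = U.shape[0]
    S = (U @ W.T) / tau                       # S[j, c] = u_j^T w_c / tau
    pos = S[np.arange(M), y]                  # positive logit of each sample
    logits = S[:, y].T                        # logits[i, j] = S[j, y_i]
    is_neg = y[None, :] != y[:, None]         # is_neg[i, j] is True when y_j != y_i
    logits = np.where(is_neg, logits, -np.inf)
    m = logits.max(axis=1, keepdims=True)     # stabilise the log-sum-exp
    lse = m[:, 0] + np.log(np.exp(logits - m).sum(axis=1))
    return float(np.mean(lse - pos))

# Usage on a random balanced batch (3 classes, 4 samples each).
rng = np.random.default_rng(0)
U = rng.normal(size=(12, 16))
U /= np.linalg.norm(U, axis=1, keepdims=True)
W = rng.normal(size=(3, 16))
W /= np.linalg.norm(W, axis=1, keepdims=True)
y = np.repeat(np.arange(3), 4)
print(nonl_loss(U, W, y, tau=0.1))
```

Note that the denominator for sample $i$ ranges over other samples scored against the anchor weight $\hat{\bm{w}}_{y_i}$, not over other class weights scored against $\bm{u}_i$.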
#### Step 1: reduction to class means\.
We first show that the sample-level NONL loss admits a lower bound that depends only on the $K$ normalized-feature class means and the $K$ classifier weights.
###### Lemma A.4 (NONL reduction via class means).
Assume balanced labels, $|I_c|=n=M/K$ for all $c$. For any configuration $\{\bm{u}_i\},\{\hat{\bm{w}}_c\}$ with $\|\bm{u}_i\|=\|\hat{\bm{w}}_c\|=1$, define the normalized-feature class means

$$\hat{\bm{\mu}}_c:=\frac{1}{n}\sum_{j\in I_c}\bm{u}_j.$$

Then

$$\mathcal{L}_{\mathrm{NONL}}(\{\bm{u}_i\},\{\hat{\bm{w}}_c\})\;\geq\;L_{\mathrm{NONL}}^{\mathrm{cls}}(\{\hat{\bm{\mu}}_c\},\{\hat{\bm{w}}_c\}),$$

where the class-level loss is

$$L_{\mathrm{NONL}}^{\mathrm{cls}}:=-\frac{1}{K\tau}\sum_{c=1}^{K}\hat{\bm{w}}_c^{\top}\hat{\bm{\mu}}_c+\frac{1}{K}\sum_{c=1}^{K}\log\Bigg(\sum_{c'\neq c}n\,\exp\big(\hat{\bm{w}}_c^{\top}\hat{\bm{\mu}}_{c'}/\tau\big)\Bigg).\tag{12}$$

Moreover, the inequality is tight if and only if, for every ordered pair $(c,c')$ with $c'\neq c$, the “negative” logits $\hat{\bm{w}}_c^{\top}\bm{u}_j$ are constant over $j\in I_{c'}$, i.e.

$$\hat{\bm{w}}_c^{\top}\bm{u}_j=\hat{\bm{w}}_c^{\top}\hat{\bm{\mu}}_{c'}\quad\text{for all }j\in I_{c'},\;c'\neq c.$$
###### Proof\.
Fix a sample index $i$ with label $y_i=c$. Its NONL denominator is

$$D_i^{\mathrm{neg}}:=\sum_{j:\,y_j\neq c}\exp(\bm{u}_j^{\top}\hat{\bm{w}}_c/\tau)=\sum_{c'\neq c}\sum_{j\in I_{c'}}\exp(\bm{u}_j^{\top}\hat{\bm{w}}_c/\tau).$$

For each anchor class $c$ and negative class $c'\neq c$, consider the function

$$f_c(\bm{x}):=\exp(\hat{\bm{w}}_c^{\top}\bm{x}/\tau),$$

which is convex in $\bm{x}$. Applying Jensen’s inequality over the negative-class samples $\{\bm{u}_j:j\in I_{c'}\}$ gives

$$\frac{1}{n}\sum_{j\in I_{c'}}\exp(\hat{\bm{w}}_c^{\top}\bm{u}_j/\tau)=\frac{1}{n}\sum_{j\in I_{c'}}f_c(\bm{u}_j)\;\geq\;f_c\Big(\frac{1}{n}\sum_{j\in I_{c'}}\bm{u}_j\Big)=\exp\big(\hat{\bm{w}}_c^{\top}\hat{\bm{\mu}}_{c'}/\tau\big).$$

Multiplying by $n$ and summing over $c'\neq c$ yields

$$D_i^{\mathrm{neg}}=\sum_{c'\neq c}\sum_{j\in I_{c'}}\exp(\bm{u}_j^{\top}\hat{\bm{w}}_c/\tau)\;\geq\;\sum_{c'\neq c}n\,\exp(\hat{\bm{w}}_c^{\top}\hat{\bm{\mu}}_{c'}/\tau).$$
By definition of $\ell_i^{\mathrm{NONL}}$,

$$\begin{aligned}
\ell_i^{\mathrm{NONL}}&=-\log\frac{\exp(\bm{u}_i^{\top}\hat{\bm{w}}_c/\tau)}{D_i^{\mathrm{neg}}}\\
&=-\frac{1}{\tau}\hat{\bm{w}}_c^{\top}\bm{u}_i+\log D_i^{\mathrm{neg}}\\
&\geq-\frac{1}{\tau}\hat{\bm{w}}_c^{\top}\bm{u}_i+\log\Bigg(\sum_{c'\neq c}n\,\exp(\hat{\bm{w}}_c^{\top}\hat{\bm{\mu}}_{c'}/\tau)\Bigg)=:\tilde{\ell}_i.
\end{aligned}$$

This inequality is tight if and only if all Jensen steps above are equalities. For a fixed $(c,c')$ with $c'\neq c$, equality in Jensen requires that the arguments of $f_c$ be constant over $j\in I_{c'}$, i.e. $\hat{\bm{w}}_c^{\top}\bm{u}_j$ is constant on $I_{c'}$. Using the definition of $\hat{\bm{\mu}}_{c'}$, this constant must then equal $\hat{\bm{w}}_c^{\top}\hat{\bm{\mu}}_{c'}$, giving the stated tightness condition.
Finally, average $\tilde{\ell}_i$ over all samples. Using the balanced labels $|I_c|=n$ and the definition of $\hat{\bm{\mu}}_c$,

$$\begin{aligned}
\frac{1}{M}\sum_{i=1}^{M}\tilde{\ell}_i&=\frac{1}{M}\sum_{c=1}^{K}\sum_{i\in I_c}\Bigg[-\frac{1}{\tau}\hat{\bm{w}}_c^{\top}\bm{u}_i+\log\Bigg(\sum_{c'\neq c}n\,\exp(\hat{\bm{w}}_c^{\top}\hat{\bm{\mu}}_{c'}/\tau)\Bigg)\Bigg]\\
&=-\frac{1}{M\tau}\sum_{c=1}^{K}\sum_{i\in I_c}\hat{\bm{w}}_c^{\top}\bm{u}_i+\frac{1}{M}\sum_{c=1}^{K}|I_c|\,\log\Bigg(\sum_{c'\neq c}n\,\exp(\hat{\bm{w}}_c^{\top}\hat{\bm{\mu}}_{c'}/\tau)\Bigg)\\
&=-\frac{1}{K\tau}\sum_{c=1}^{K}\hat{\bm{w}}_c^{\top}\hat{\bm{\mu}}_c+\frac{1}{K}\sum_{c=1}^{K}\log\Bigg(\sum_{c'\neq c}n\,\exp(\hat{\bm{w}}_c^{\top}\hat{\bm{\mu}}_{c'}/\tau)\Bigg)\\
&=L_{\mathrm{NONL}}^{\mathrm{cls}}(\{\hat{\bm{\mu}}_c\},\{\hat{\bm{w}}_c\}).
\end{aligned}$$

Since $\mathcal{L}_{\mathrm{NONL}}$ is the average of the $\ell_i^{\mathrm{NONL}}$ and each $\ell_i^{\mathrm{NONL}}\geq\tilde{\ell}_i$, we obtain $\mathcal{L}_{\mathrm{NONL}}\geq L_{\mathrm{NONL}}^{\mathrm{cls}}$ with the stated equality condition. ∎
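The bound of Lemma A.4 can be checked numerically on a random balanced configuration. The sketch below is illustrative only: the function names, the sizes, and the temperature are assumptions, and the sample-level loss is written as a plain loop for readability.

```python
import numpy as np

def normalize(X: np.ndarray) -> np.ndarray:
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def nonl_sample_loss(U, W, y, tau):
    """Sample-level NONL loss, following the per-sample definition."""
    S = (U @ W.T) / tau                                   # S[j, c] = u_j^T w_c / tau
    losses = []
    for i, c in enumerate(y):
        neg = S[y != c, c]                                # negatives scored against w_c
        losses.append(-S[i, c] + np.log(np.exp(neg).sum()))
    return float(np.mean(losses))

def nonl_class_loss(U, W, y, tau):
    """Class-level loss of equation (12), assuming balanced labels 0..K-1."""
    K = W.shape[0]
    n = len(y) // K
    mu = np.stack([U[y == c].mean(axis=0) for c in range(K)])   # class means of unit features
    G = (W @ mu.T) / tau                                        # G[c, c'] = w_c^T mu_c' / tau
    off = ~np.eye(K, dtype=bool)
    align = -np.trace(G) / K
    contrast = np.mean([np.log(n * np.exp(G[c, off[c]]).sum()) for c in range(K)])
    return float(align + contrast)

rng = np.random.default_rng(0)
K, n, d, tau = 4, 5, 8, 0.5
y = np.repeat(np.arange(K), n)
U = normalize(rng.normal(size=(K * n, d)))
W = normalize(rng.normal(size=(K, d)))

gap_random = nonl_sample_loss(U, W, y, tau) - nonl_class_loss(U, W, y, tau)
gap_collapsed = nonl_sample_loss(W[y], W, y, tau) - nonl_class_loss(W[y], W, y, tau)
print(gap_random, gap_collapsed)   # the first gap is non-negative, the second vanishes up to rounding
```

The collapsed case sets every feature equal to its class weight, so all negative logits are constant within each negative class and the Jensen step is an equality.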
As before, it is convenient to separate out the factor $n$ from the logarithm and to treat the class means and weights abstractly as unit vectors. Define the *normalized* class-level NONL loss

$$\tilde{L}_{\mathrm{NONL}}^{\mathrm{cls}}(\{\hat{\bm{\mu}}_c\},\{\hat{\bm{w}}_c\}):=-\frac{1}{K\tau}\sum_{c=1}^{K}\hat{\bm{w}}_c^{\top}\hat{\bm{\mu}}_c+\frac{1}{K}\sum_{c=1}^{K}\log\Bigg(\sum_{c'\neq c}\exp\big(\hat{\bm{w}}_c^{\top}\hat{\bm{\mu}}_{c'}/\tau\big)\Bigg),$$

so that

$$L_{\mathrm{NONL}}^{\mathrm{cls}}=\log n+\tilde{L}_{\mathrm{NONL}}^{\mathrm{cls}}.$$

In what follows we again treat $(\hat{\bm{\mu}}_c,\hat{\bm{w}}_c)$ as free unit vectors and write $\bm{\mu}_c:=\hat{\bm{\mu}}_c$ and $\bm{w}_c:=\hat{\bm{w}}_c$.
#### Step 2: analysis of the class\-level problem\.
For each class $c$ we can view the $c$th summand in $\tilde{L}_{\mathrm{NONL}}^{\mathrm{cls}}$ as a standard decoupled alignment/uniformity loss of La/Lc type (Koromilas et al., [2024](https://arxiv.org/html/2605.20302#bib.bib50)), with:

$$q_c=\bm{w}_c\quad\text{(anchor)},\qquad k_c^{+}=\bm{\mu}_c\quad\text{(positive)},\qquad\{k_c^{-}=\bm{\mu}_{c'}:c'\neq c\}\quad\text{(negatives)}.$$

The per-class alignment and contrastive terms are

$$L_{\mathrm{a}}(q_c,k_c^{+})=-\frac{1}{\tau}q_c^{\top}k_c^{+},\qquad L_{\mathrm{c}}(q_c,\{k_c^{-}\})=\log\Big(\sum_{c'\neq c}\exp(q_c^{\top}k_{c'}^{-}/\tau)\Big).$$

As in the NTCE case, $L_{\mathrm{a}}$ is strictly decreasing in similarity and $L_{\mathrm{c}}$ is convex and strictly increasing in the similarities. Therefore we may again invoke the La/Lc minimizer characterization. By Theorem 4.1 and Appendix B.1 of Koromilas et al. ([2024](https://arxiv.org/html/2605.20302#bib.bib50)), provided $d\geq K$, the global minimizers of $\tilde{L}_{\mathrm{NONL}}^{\mathrm{cls}}$ over unit vectors satisfy:
- • Perfect alignment: $\bm{\mu}_c=\bm{w}_c$ for all $c$.
- • Simplex ETF structure: the directions $\{\bm{\mu}_c\}_{c=1}^{K}$ form a centered regular simplex equiangular tight frame in a $(K-1)$-dimensional subspace: $\|\bm{\mu}_c\|=1,\qquad\bm{\mu}_c^{\top}\bm{\mu}_{c'}=-\frac{1}{K-1}\quad\forall c\neq c'.$

In particular, there exists a simplex ETF $\{\bm{\mu}_c\}_{c=1}^{K}\subset\mathbb{R}^{d}$ such that $\bm{\mu}_c=\bm{w}_c$ is a global minimizer of $\tilde{L}_{\mathrm{NONL}}^{\mathrm{cls}}$, unique up to a global rotation and permutation of the class indices.
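For concreteness, the value of the normalized class-level loss at such a configuration follows by direct substitution of $\bm{w}_c^{\top}\bm{\mu}_c=1$ and $\bm{w}_c^{\top}\bm{\mu}_{c'}=-\frac{1}{K-1}$ into its definition. This worked evaluation is added for illustration and is not part of the cited theorem:

$$\tilde{L}_{\mathrm{NONL}}^{\mathrm{cls}}(\{\bm{\mu}_c\},\{\bm{\mu}_c\})=-\frac{1}{K\tau}\sum_{c=1}^{K}1+\frac{1}{K}\sum_{c=1}^{K}\log\Big((K-1)\,e^{-1/((K-1)\tau)}\Big)=\log(K-1)-\frac{K}{(K-1)\tau}.$$

For example, with the hypothetical values $K=10$ and $\tau=0.1$ this gives $\log 9-100/9\approx-8.91$.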
#### Step 3: lifting back to the sample level\.
We now relate these class\-level minimizers back to the original sample\-level NONL objective and derive the NC structure of its global minimizers\.
*Existence of Neural Collapse minimizers.* Let $\{\bm{\mu}_c\}_{c=1}^{K}$ be a simplex ETF and set

$$\hat{\bm{w}}_c:=\bm{\mu}_c,\qquad\bm{u}_i:=\bm{\mu}_{y_i}\quad\text{for all }i.$$

This configuration satisfies NC1–NC3 by construction: within each class $c$, all normalized features collapse to $\bm{\mu}_c$ (NC1), the vectors $\{\bm{\mu}_c\}$ form a centered simplex ETF (NC2), and $\hat{\bm{w}}_c=\bm{\mu}_c$ (NC3). In particular, the feature class means are $\hat{\bm{\mu}}_c=\bm{\mu}_c$.

Moreover, for this configuration the Jensen inequalities in Lemma [A.4](https://arxiv.org/html/2605.20302#A1.Thmtheorem4) are tight for all $(c,c')$: for any anchor class $c$ and negative class $c'\neq c$ we have $\hat{\bm{w}}_c^{\top}\bm{u}_j=\bm{\mu}_c^{\top}\bm{\mu}_{c'}$ for all $j\in I_{c'}$, so the negative logits are constant within each negative class. Hence

$$\mathcal{L}_{\mathrm{NONL}}=L_{\mathrm{NONL}}^{\mathrm{cls}}(\{\hat{\bm{\mu}}_c\},\{\hat{\bm{w}}_c\})=\log n+\tilde{L}_{\mathrm{NONL}}^{\mathrm{cls}}(\{\bm{\mu}_c\},\{\bm{\mu}_c\}).$$

Since $(\{\bm{\mu}_c\},\{\bm{\mu}_c\})$ is a global minimizer of $\tilde{L}_{\mathrm{NONL}}^{\mathrm{cls}}$, this shows that

$$\inf_{\{\bm{u}_i\},\{\hat{\bm{w}}_c\}}\mathcal{L}_{\mathrm{NONL}}\;\leq\;\log n+\inf_{\{\bm{\mu}_c\},\{\bm{w}_c\}}\tilde{L}_{\mathrm{NONL}}^{\mathrm{cls}}.$$
*Structure of arbitrary global minimizers.* Conversely, let $(\{\bm{u}_i^{\star}\},\{\hat{\bm{w}}_c^{\star}\})$ be any global minimizer of $\mathcal{L}_{\mathrm{NONL}}$, and let

$$\hat{\bm{\mu}}_c^{\star}:=\frac{1}{n}\sum_{j\in I_c}\bm{u}_j^{\star}$$

be the corresponding class means. Lemma [A.4](https://arxiv.org/html/2605.20302#A1.Thmtheorem4) gives

$$\mathcal{L}_{\mathrm{NONL}}(\{\bm{u}_i^{\star}\},\{\hat{\bm{w}}_c^{\star}\})\;\geq\;L_{\mathrm{NONL}}^{\mathrm{cls}}(\{\hat{\bm{\mu}}_c^{\star}\},\{\hat{\bm{w}}_c^{\star}\})=\log n+\tilde{L}_{\mathrm{NONL}}^{\mathrm{cls}}(\{\hat{\bm{\mu}}_c^{\star}\},\{\hat{\bm{w}}_c^{\star}\}).$$

On the other hand, from the ETF construction above we know that

$$\inf_{\{\bm{u}_i\},\{\hat{\bm{w}}_c\}}\mathcal{L}_{\mathrm{NONL}}\;\leq\;\log n+\inf_{\{\bm{\mu}_c\},\{\bm{w}_c\}}\tilde{L}_{\mathrm{NONL}}^{\mathrm{cls}}.$$

Since $(\{\bm{u}_i^{\star}\},\{\hat{\bm{w}}_c^{\star}\})$ achieves this infimum, the two displays must be equalities. Therefore:
- • $\tilde{L}_{\mathrm{NONL}}^{\mathrm{cls}}(\{\hat{\bm{\mu}}_c^{\star}\},\{\hat{\bm{w}}_c^{\star}\})$ attains the global minimum of $\tilde{L}_{\mathrm{NONL}}^{\mathrm{cls}}$, so by the La/Lc minimizer characterization we must have, up to a global rotation and permutation of class labels, $\hat{\bm{\mu}}_c^{\star}=\hat{\bm{w}}_c^{\star}$ for all $c$, and $\{\hat{\bm{\mu}}_c^{\star}\}$ form a centered simplex ETF.
- • Lemma [A.4](https://arxiv.org/html/2605.20302#A1.Thmtheorem4) must be tight at the minimizer, so the Jensen equalities hold for all $(c,c')$: for every anchor class $c$ and negative class $c'\neq c$, the logits $\hat{\bm{w}}_c^{\star\top}\bm{u}_j^{\star}$ are constant over $j\in I_{c'}$, equal to $\hat{\bm{w}}_c^{\star\top}\hat{\bm{\mu}}_{c'}^{\star}$.
Let $S:=\mathrm{span}\{\hat{\bm{w}}_1^{\star},\dots,\hat{\bm{w}}_K^{\star}\}$, which is the $(K-1)$-dimensional simplex-ETF subspace. Fix a class $c'$ and $j\in I_{c'}$. For every $c\neq c'$, tightness of Jensen gives

$$\hat{\bm{w}}_c^{\star\top}\big(\bm{u}_j^{\star}-\hat{\bm{\mu}}_{c'}^{\star}\big)=0.$$

As before, since $\{\hat{\bm{w}}_c^{\star}\}_{c=1}^{K}$ form a centered simplex ETF in $S$ and $\sum_{c=1}^{K}\hat{\bm{w}}_c^{\star}=0$, any $K-1$ of them are linearly independent and thus span $S$. In particular, the set $\{\hat{\bm{w}}_c^{\star}:c\neq c'\}$ spans $S$, so $\bm{u}_j^{\star}-\hat{\bm{\mu}}_{c'}^{\star}$ is orthogonal to $S$, and hence the orthogonal projection of $\bm{u}_j^{\star}$ onto $S$ equals $\hat{\bm{\mu}}_{c'}^{\star}$.

But both $\bm{u}_j^{\star}$ and $\hat{\bm{\mu}}_{c'}^{\star}=\hat{\bm{w}}_{c'}^{\star}$ are unit vectors, and $\hat{\bm{\mu}}_{c'}^{\star}\in S$. As in the NTCE case, the only way for a unit vector to have a unit-norm projection onto $S$ is to lie in $S$ itself and coincide with its projection, so we must have

$$\bm{u}_j^{\star}=\hat{\bm{\mu}}_{c'}^{\star}\quad\text{for all }j\in I_{c'}.$$

Thus within each class all features collapse to a single unit direction (NC1), these $K$ directions form a centered simplex ETF (NC2), and the classifier weights align with the class means (NC3). Therefore every global minimizer of $\mathcal{L}_{\mathrm{NONL}}$ exhibits Neural Collapse, up to a global rotation and permutation of the class labels.
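The projection step in this argument (a unit vector whose projection onto $S$ has unit norm must lie in $S$) can be illustrated numerically. In this sketch the sizes, the zero-padded ETF, and the helper `projected_norm` are assumptions chosen for illustration.

```python
import numpy as np

K, d = 4, 6
E = np.zeros((K, d))
E[:, :K] = np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)  # ETF rows; S is their span

# Orthonormal basis of S from any K-1 ETF vectors, then the orthogonal projector onto S.
Q, _ = np.linalg.qr(E[:K - 1].T)   # Q has shape (d, K-1)
P = Q @ Q.T

s = E[0]                           # a unit vector inside S
t = np.zeros(d)
t[-1] = 1.0                        # a unit vector orthogonal to S

def projected_norm(theta: float) -> float:
    """Norm of the projection onto S of the unit vector cos(theta) * s + sin(theta) * t."""
    u = np.cos(theta) * s + np.sin(theta) * t
    return float(np.linalg.norm(P @ u))

for theta in (0.0, 0.3, 1.0):
    print(theta, projected_norm(theta), np.cos(theta))   # the two norms agree with cos(theta)
```

Any component of the unit vector outside $S$ strictly shrinks the projected norm, so a projected norm of one leaves no room for such a component.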
###### Proof of Theorem [A.1](https://arxiv.org/html/2605.20302#A1.Thmtheorem1).
NormFace: Lemma [A.2](https://arxiv.org/html/2605.20302#A1.Thmtheorem2) together with Theorem 3.1 of Yaras et al. ([2022](https://arxiv.org/html/2605.20302#bib.bib23)) shows that every global minimizer of $\mathcal{L}_{\mathrm{NF}}$ satisfies NC1–NC3.
NTCE: Lemma [A.3](https://arxiv.org/html/2605.20302#A1.Thmtheorem3) bounds $\mathcal{L}_{\mathrm{NTCE}}$ by the class-level loss $L_{\mathrm{NTCE}}^{\mathrm{cls}}$, while the La/Lc minimizer characterization (Step 2) identifies the global minimizers of $\tilde{L}_{\mathrm{NTCE}}^{\mathrm{cls}}$ as simplex ETF configurations with $\bm{\mu}_c=\bm{w}_c$. Step 3 shows that any global minimizer of $\mathcal{L}_{\mathrm{NTCE}}$ must both attain this class-level minimum and satisfy the tightness conditions in Lemma [A.3](https://arxiv.org/html/2605.20302#A1.Thmtheorem3), which enforces NC1. Together these yield NC1–NC3 for all NTCE minimizers.
NONL: Lemma [A.4](https://arxiv.org/html/2605.20302#A1.Thmtheorem4) bounds $\mathcal{L}_{\mathrm{NONL}}$ by the class-level loss $L_{\mathrm{NONL}}^{\mathrm{cls}}$, while the La/Lc minimizer characterization (Step 2) identifies the global minimizers of $\tilde{L}_{\mathrm{NONL}}^{\mathrm{cls}}$ as simplex ETF configurations with $\bm{\mu}_c=\bm{w}_c$. Step 3 shows that any global minimizer of $\mathcal{L}_{\mathrm{NONL}}$ must both attain this class-level minimum and satisfy the tightness conditions in Lemma [A.4](https://arxiv.org/html/2605.20302#A1.Thmtheorem4), which again enforces NC1. Together these yield NC1–NC3 for all NONL minimizers.
In all three cases, the resulting NC configuration is unique up to a global rotation and permutation of the class labels. This proves the theorem. ∎
## Appendix B Equivalence of SCL and prototype–softmax minimizers
Here we provide the proof of [Theorem 4.2](https://arxiv.org/html/2605.20302#S4.Thmtheorem2).
###### Proof.
Fix $i\in[2M]$ with label $y_i$. Let $\mathcal{C}(i)=\{j\in[2M]:j\neq i,\ y_j=y_i\}$, $\mathcal{B}_c=\{j\in[2M]:y_j=c\}$, $n_c=|\mathcal{B}_c|$, and $\hat{\bm{\mu}}_c=\frac{1}{n_c}\sum_{j\in\mathcal{B}_c}\bm{a}_j$.
(A) SCL lower bound. By unfolding the SCL loss defined in [Equation 2](https://arxiv.org/html/2605.20302#S3.E2), the per-example loss term can be written as

$$\ell_i^{\mathrm{SCL}}=-\frac{1}{|\mathcal{C}(i)|}\sum_{l\in\mathcal{C}(i)}\frac{\bm{a}_i^{\top}\bm{a}_l}{\tau}+\log\sum_{j\in[2M]\setminus\{i\}}\exp(\bm{a}_i^{\top}\bm{a}_j/\tau).$$
For the first term, using $\frac{1}{|\mathcal{C}(i)|}\sum_{l\in\mathcal{C}(i)}\bm{a}_l=\frac{n_{y_i}\hat{\bm{\mu}}_{y_i}-\bm{a}_i}{n_{y_i}-1}$ and $\|\bm{a}_i\|=1$ gives

$$-\frac{\bm{a}_i^{\top}}{\tau}\Big(\frac{1}{|\mathcal{C}(i)|}\sum_{l\in\mathcal{C}(i)}\bm{a}_l\Big)\geq-\frac{\bm{a}_i^{\top}\hat{\bm{\mu}}_{y_i}}{\tau}.$$
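The intermediate algebra behind this bound is short enough to spell out. Writing $x:=\bm{a}_i^{\top}\hat{\bm{\mu}}_{y_i}$ (a shorthand introduced here for illustration) and using $\bm{a}_i^{\top}\bm{a}_i=1$,

$$\bm{a}_i^{\top}\,\frac{n_{y_i}\hat{\bm{\mu}}_{y_i}-\bm{a}_i}{n_{y_i}-1}=\frac{n_{y_i}x-1}{n_{y_i}-1}=x-\frac{1-x}{n_{y_i}-1}\;\leq\;x,$$

where the last step uses $x\leq 1$, which holds because $\bm{a}_i$ is a unit vector and $\|\hat{\bm{\mu}}_{y_i}\|\leq 1$ as an average of unit vectors. Multiplying by $-1/\tau$ reverses the inequality and gives the stated bound, with equality exactly when $x=1$.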
For the second term, we group by class, subtract the self term, and then apply Jensen classwise due to convexity of the exponential function:

$$\sum_{j\in[2M]\setminus\{i\}}e^{\bm{a}_i^{\top}\bm{a}_j/\tau}=\sum_{c=1}^{K}\sum_{l\in\mathcal{B}_c}e^{\bm{a}_i^{\top}\bm{a}_l/\tau}-e^{1/\tau}\ \geq\ \sum_{c=1}^{K}n_c\,e^{\bm{a}_i^{\top}\hat{\bm{\mu}}_c/\tau}-e^{1/\tau}.$$

Combining,

$$\ell_i^{\mathrm{SCL}}\ \geq\ -\frac{\bm{a}_i^{\top}\hat{\bm{\mu}}_{y_i}}{\tau}+\log\Bigg(\sum_{c=1}^{K}n_c\,e^{\bm{a}_i^{\top}\hat{\bm{\mu}}_c/\tau}-e^{1/\tau}\Bigg)\ =:\ \ell_i^{\star}.\tag{13}$$
Equality in equation [13](https://arxiv.org/html/2605.20302#A2.E13) holds iff every class-wise sum is collapsed, i.e., $\bm{a}_j=\hat{\bm{\mu}}_c$ for all $j\in\mathcal{B}_c$, because the positive-term bound is tight only when $\bm{a}_i^{\top}\hat{\bm{\mu}}_{y_i}=1$ (so $\bm{a}_i=\hat{\bm{\mu}}_{y_i}$) and the classwise Jensen step is tight only when all within-class logits $\{\bm{a}_i^{\top}\bm{a}_l:l\in\mathcal{B}_c\}$ are equal.
(B) Prototype loss lower bound. Since $\bm{a}_i^{\top}\hat{\bm{\mu}}_{y_i}\leq 1$ for unit vectors, $e^{\bm{a}_i^{\top}\hat{\bm{\mu}}_{y_i}/\tau}\leq e^{1/\tau}$. Therefore

$$\underbrace{\sum_{c=1}^{K}n_c\,e^{\bm{a}_i^{\top}\hat{\bm{\mu}}_c/\tau}-e^{\bm{a}_i^{\top}\hat{\bm{\mu}}_{y_i}/\tau}}_{=:D_i^{\mathrm{proto}}}\ \geq\ \underbrace{\sum_{c=1}^{K}n_c\,e^{\bm{a}_i^{\top}\hat{\bm{\mu}}_c/\tau}-e^{1/\tau}}_{=:D_i^{\star}},$$

and thus, with the *same* numerator $e^{\bm{a}_i^{\top}\hat{\bm{\mu}}_{y_i}/\tau}$,

$$\ell_i^{\mathrm{proto}}=-\frac{\bm{a}_i^{\top}\hat{\bm{\mu}}_{y_i}}{\tau}+\log D_i^{\mathrm{proto}}\ \geq\ -\frac{\bm{a}_i^{\top}\hat{\bm{\mu}}_{y_i}}{\tau}+\log D_i^{\star}=\ell_i^{\star}.$$

Averaging over $i$ gives the following inequalities for any batch $\bm{A}$:

$$L_{\mathrm{SCL}}(\bm{A})\;\geq\;L_{\star}(\bm{A}),\qquad L_{\mathrm{proto}}(\bm{A})\;\geq\;L_{\star}(\bm{A}).$$
(C) Collapse–simplex makes all three equal. By Graf et al. ([2021](https://arxiv.org/html/2605.20302#bib.bib42), Theorem 2), any SCL global minimizer exhibits class-wise collapse, $\bm{a}_j = \bm{\zeta}_{y_j}$, and the directions $\{\bm{\zeta}_c\}$ form a centered regular $(K-1)$-simplex. Hence $\hat{\bm{\mu}}_c = \bm{\zeta}_c$ and $\bm{a}_i^\top \hat{\bm{\mu}}_{y_i} = 1$ for all $i$, making both inequalities above tight:

$$L_{\text{SCL}}(\bm{A}^{\star}) = L_{\star}(\bm{A}^{\star}) = L_{\text{proto}}(\bm{A}^{\star}).$$

Therefore $\min L_{\text{SCL}} = \min L_{\star} = \min L_{\text{proto}}$, all attained at the collapsed-simplex configurations.
(D) Equality of argmin sets. Let $\bm{A}$ minimize $L_{\text{proto}}$. Then $L_{\text{proto}}(\bm{A}) = \min L_{\text{proto}} = \min L_{\star}$. Since $\min L_{\star} \leq L_{\star}(\bm{A}) \leq L_{\text{proto}}(\bm{A})$ by (B), it follows that $L_{\star}(\bm{A}) = L_{\text{proto}}(\bm{A})$, which forces $e^{\bm{a}_i^\top \hat{\bm{\mu}}_{y_i}/\tau} = e^{1/\tau}$ for every $i$, i.e., $\bm{a}_i^\top \hat{\bm{\mu}}_{y_i} = 1$ and hence $\bm{a}_i = \hat{\bm{\mu}}_{y_i}$ (class-wise collapse). Moreover, $L_{\text{SCL}}(\bm{A}) = L_{\star}(\bm{A}) = \min L_{\text{SCL}}$, so $\bm{A}$ also minimizes SCL.
Graf's theorem then implies the class means form a centered simplex ETF. Thus the argmin sets of $L_{\text{SCL}}$ and $L_{\text{proto}}$ coincide (up to rotation and label permutation).
∎
## Appendix C From Theory to Empirical Behavior
[Theorem A.1](https://arxiv.org/html/2605.20302#A1.Thmtheorem1) and [Theorem 4.2](https://arxiv.org/html/2605.20302#S4.Thmtheorem2), combined with the framing of our paper relative to the literature, imply specific behaviors that should manifest in practice. We organize this section around six such expected behaviors (EB1–EB6) and verify each empirically against the corresponding table or figure.
#### EB1: Normalized losses converge to NC geometry faster than CE\.
Normalized losses make NC the unique global optimum on a benign strict-saddle landscape (Yaras et al., [2022](https://arxiv.org/html/2605.20302#bib.bib23)), whereas CE admits unbounded rescaling that prevents NC even at its global optimum (Soudry et al., [2018](https://arxiv.org/html/2605.20302#bib.bib16)).
*Validated:* [Table 7](https://arxiv.org/html/2605.20302#A6.T7) shows our methods reach CE-equivalent NC values in under $7.5\%$ of CE's iterations on 4/5 metrics. [Figure 2](https://arxiv.org/html/2605.20302#A6.F2) shows CE plateauing while our losses converge to $\geq 95\%$ NC ([Table 3](https://arxiv.org/html/2605.20302#S4.T3)).
#### EB2: NONL converges to NC faster than NormFace and NTCE\.
Decoupling alignment from uniformity removes competing gradient directions (Yeh et al., [2022](https://arxiv.org/html/2605.20302#bib.bib51)), so NONL should follow a more direct optimization path.
*Validated:* [Table 7](https://arxiv.org/html/2605.20302#A6.T7) shows NONL converges $1.2$–$3.1\times$ faster than NormFace on the rank metrics.
#### EB3: NTCE improves over NormFace\.
Expanding the negative set from $K$ prototypes to $M$ batch instances yields better contrastive objective estimates (Koromilas et al., [2024](https://arxiv.org/html/2605.20302#bib.bib50)), strengthening inter-class separation.
*Validated:* [Table 1](https://arxiv.org/html/2605.20302#S4.T1) and [Table 3](https://arxiv.org/html/2605.20302#S4.T3) show NTCE outperforming NormFace on both accuracy and NC metrics.
#### EB4: Class\-mean prototypes from SCL are effective classifiers throughout training\.
[Theorem 4.2](https://arxiv.org/html/2605.20302#S4.Thmtheorem2) proves the minimizer sets of $\mathcal{L}_{\mathrm{SCL}}$ and $\mathcal{L}_{\mathrm{proto}}$ coincide, which means SCL already optimizes for classifier–feature alignment, not only at the exact (unreached) optimum.
*Validated:* [Table 2](https://arxiv.org/html/2605.20302#S4.T2) shows fixed prototypes match linear probing on 3/4 datasets and exceed it on ImageNet-100 by $+2.0$ pp, with a single forward pass replacing the full LP training phase.
#### EB5: Classifier Learning with normalized losses and SCL reach the same NC geometry\.
Both are prototype-contrast methods on the hypersphere converging to the same simplex ETF.
*Validated:* In [Table 3](https://arxiv.org/html/2605.20302#S4.T3), both families arrive significantly closer to the NC structure than CE does.
#### EB6: Better NC geometry yields better downstream performance\.
The simplex ETF maximizes inter-class margins, which prior work links to improved generalization, transferability, and robustness (Galanti et al., [2021](https://arxiv.org/html/2605.20302#bib.bib2); Papyan et al., [2020](https://arxiv.org/html/2605.20302#bib.bib15); Bartlett et al., [2017](https://arxiv.org/html/2605.20302#bib.bib17); Neyshabur et al., [2018](https://arxiv.org/html/2605.20302#bib.bib18); Hein and Andriushchenko, [2017](https://arxiv.org/html/2605.20302#bib.bib10); Fawzi et al., [2016](https://arxiv.org/html/2605.20302#bib.bib21); Ding et al., [2020](https://arxiv.org/html/2605.20302#bib.bib22); Yang et al., [2022](https://arxiv.org/html/2605.20302#bib.bib3)).
*Validated:* [Tables 4](https://arxiv.org/html/2605.20302#S4.T4), [6](https://arxiv.org/html/2605.20302#S5.T6) and [5](https://arxiv.org/html/2605.20302#S4.T5) show consistent gains in transfer ($+5.5\%$ mean), long-tail (up to $+8.7\%$), and robustness (lower mCE).
#### Where theory meets practical approximation\.
Our theorems characterize the geometry of global minimizers but do not analyze the optimization dynamics that reach them. In practice, our methods closely approach the theoretical NC optimum across all metrics ([Table 3](https://arxiv.org/html/2605.20302#S4.T3), [Figure 2](https://arxiv.org/html/2605.20302#A6.F2)), substantially more so than CE or SCL with linear probing. The theory also does not characterize how batch size, learning-rate schedules, temperature $\tau$, or data augmentation affect convergence. For instance, our theorems hold for any $\tau > 0$, yet [Figure 3](https://arxiv.org/html/2605.20302#A7.F3) shows that temperature impacts finite-time performance in some cases; similarly, larger batches provide better approximations of the contrastive objective (Koromilas et al., [2024](https://arxiv.org/html/2605.20302#bib.bib50)), confirmed by our [Table 11](https://arxiv.org/html/2605.20302#A7.T11). Understanding these optimization-level interactions remains an open direction. Our contribution is identifying the right target geometry and the losses that make it uniquely optimal, while the dynamics of reaching it deserve further study.
## Appendix D Relation to Other Literature
This section expands on the related\-work discussion in the main text, situating our framework relative to three additional bodies of work that our rebuttal discussion brought into focus: alternative routes to constraining radial freedom, the associative\-embedding line of work, and recent NC results under specific architectures\.
### D.1 Relation to Graf et al. ([2021](https://arxiv.org/html/2605.20302#bib.bib42)) NC Results on Supervised Contrastive Learning
Our [Theorem 4.2](https://arxiv.org/html/2605.20302#S4.Thmtheorem2) strengthens the SCL minimizer characterization of Graf et al. ([2021](https://arxiv.org/html/2605.20302#bib.bib42)), which establishes that SCL minimizers exhibit NC1 (within-class collapse) and NC2 (simplex ETF of class means). It is natural to ask whether the existence of an optimal class-mean classifier follows trivially from NC1 + NC2; we show here that it does not, and that this matters in practice.
#### Why the class\-mean classifier does not follow trivially from NC1 \+ NC2\.
At the exact global minimum, where NC1 + NC2 are realized, the best linear classifier indeed has weight vectors corresponding to the class means; this much is straightforward. However, unlike NTCE and NONL, which closely approximate NC ($\geq 95\%$ on all NC metrics in [Table 3](https://arxiv.org/html/2605.20302#S4.T3)), SCL does not in practice reach this exact minimum ([Table 3](https://arxiv.org/html/2605.20302#S4.T3): inter/intra erank $66.7/7.5$ on CIFAR-100 vs. the theoretical optimum $99/0$). When NC1 + NC2 is not realized, there is no prior theoretical basis to expect class-mean classifiers to work.
[Theorem 4.2](https://arxiv.org/html/2605.20302#S4.Thmtheorem2) fills this gap. It proves the minimizer-set equality $\arg\min \mathcal{L}_{\mathrm{SCL}} = \arg\min \mathcal{L}_{\mathrm{proto}}$, meaning SCL's objective is equivalent to optimizing the prototype-softmax classifier of [Equation 6](https://arxiv.org/html/2605.20302#S4.E6). That is, the prototype classifier is the principled SCL classifier across the entire optimization trajectory, not only at the unreachable exact minimum. This is why fixed prototypes (FP) work even at $\sim 95\%$ NC and why linear probing, which discards the projection head, returns to unnormalized features, and fits unconstrained weights and biases, is unnecessary.
#### What is and is not proved by prior work\.
SCL contains no classifier weights in its formulation. Prior theorems therefore establish NC1 + NC2 for SCL minimizers but cannot address NC3 (classifier–feature alignment) or NC4 (zero bias), which are not even expressible without a classifier. [Theorem 4.2](https://arxiv.org/html/2605.20302#S4.Thmtheorem2) resolves this by proving that a classifier is already implicit in SCL: the class-mean prototypes *are* the classifier weights, enabling NC3 and NC4. Concretely, prior results characterize *what SCL solutions look like* (collapsed ETF); [Theorem 4.2](https://arxiv.org/html/2605.20302#S4.Thmtheorem2) proves *what SCL optimizes for* (prototype classification). It establishes that class-mean prototypes are not merely post-hoc classifier weights but the solution the SCL objective drives toward, making their direct use principled rather than heuristic.
#### Practical consequence\.
The linear probing phase of SCL discards the projection head and trains on unnormalized encoder representations with unconstrained weights and biases. Fixed prototypes instead use the normalized representations from the projection head (the optimization space) and add no magnitude or bias. This rests on two contributions: (i) [Theorem 4.2](https://arxiv.org/html/2605.20302#S4.Thmtheorem2), which proves the optimal classifier already lies in the optimization space, and (ii) our normalized classifier framework, which enables classification on the normalized space rather than via post-hoc unnormalized probing.
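The fixed-prototype classifier described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function names (`l2_normalize`, `class_mean_prototypes`, `predict`) and the toy embeddings are hypothetical, and re-normalizing each class mean onto the unit sphere is an assumption made here so that scores are cosine similarities.

```python
import math

def l2_normalize(v):
    """Project a vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def class_mean_prototypes(embeddings, labels, num_classes):
    """Average the normalized embeddings of each class, then re-normalize.

    Assumes every class has at least one sample and a nonzero mean.
    """
    dim = len(embeddings[0])
    sums = [[0.0] * dim for _ in range(num_classes)]
    for z, y in zip(embeddings, labels):
        a = l2_normalize(z)
        for d in range(dim):
            sums[y][d] += a[d]
    return [l2_normalize(s) for s in sums]

def predict(embedding, prototypes):
    """Classify by highest cosine similarity; no bias, no learned magnitude."""
    a = l2_normalize(embedding)
    scores = [sum(x * w for x, w in zip(a, p)) for p in prototypes]
    return max(range(len(prototypes)), key=scores.__getitem__)

# Toy projection-head outputs (hypothetical values) for two classes.
train_z = [[2.0, 0.1], [1.5, -0.2], [-0.1, 3.0], [0.2, 0.9]]
train_y = [0, 0, 1, 1]
protos = class_mean_prototypes(train_z, train_y, num_classes=2)
preds = [predict(z, protos) for z in [[5.0, 0.5], [0.3, 0.4]]]
```

Building the prototypes takes one pass over the training embeddings and involves no gradient steps, which is the practical contrast with a separately trained linear probe.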
### D.2 Alternative Approaches to Constraining Radial Freedom
There are several approaches to controlling radial degrees of freedom beyond normalized softmax: Riemannian optimization on the Stiefel manifold (Huang et al., [2018](https://arxiv.org/html/2605.20302#bib.bib71); Bansal et al., [2018](https://arxiv.org/html/2605.20302#bib.bib72)), weight normalization (Salimans and Kingma, [2016](https://arxiv.org/html/2605.20302#bib.bib73)), spectral normalization (Miyato et al., [2018](https://arxiv.org/html/2605.20302#bib.bib74)), and layer rotation (Carbonnelle and De Vleeschouwer, [2019](https://arxiv.org/html/2605.20302#bib.bib75)).
The critical observation is that all these methods are applied on top of standard CE, inheriting its fundamental limitations. Stiefel optimization constrains weight vectors to be mutually orthogonal (pairwise cosine $0$) by definition, whereas the simplex ETF that NC requires has pairwise cosine $-1/(K-1)$, thus constituting a distinct geometric structure. Our normalized softmax can be understood as optimization on the oblique manifold (product of spheres), which Yaras et al. ([2022](https://arxiv.org/html/2605.20302#bib.bib23)) prove has a benign strict-saddle landscape specifically for NC. Weight and spectral normalization control magnitudes but still operate with unnormalized features and learnable biases. Layer rotation is a useful diagnostic but does not modify the training objective. In all cases, applying these constraints to CE still leaves: (i) additional regularization hyperparameters to eliminate radial freedom, (ii) no temperature to control the similarity distribution, and (iii) CE's fundamental loss-level limitations, *i.e.*, a small effective negative set and alignment–uniformity coupling, that no geometric constraint alone can resolve.
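The geometric distinction between an orthogonal frame and a simplex ETF can be checked numerically. The sketch below uses one standard construction of a simplex ETF (centering the standard basis of $\mathbb{R}^K$ and normalizing); the helper names and the choice $K = 10$ are illustrative assumptions, not taken from the paper's code.

```python
import math

def simplex_etf(K):
    """Unit vectors obtained by centering the standard basis of R^K."""
    rows = []
    for i in range(K):
        v = [(1.0 if j == i else 0.0) - 1.0 / K for j in range(K)]
        norm = math.sqrt(sum(x * x for x in v))
        rows.append([x / norm for x in v])
    return rows

def dot(u, v):
    """Inner product; equals the cosine when both inputs are unit vectors."""
    return sum(a * b for a, b in zip(u, v))

K = 10
etf = simplex_etf(K)
orthogonal = [[1.0 if j == i else 0.0 for j in range(K)] for i in range(K)]

pairs = [(i, j) for i in range(K) for j in range(i + 1, K)]
etf_cosines = [dot(etf[i], etf[j]) for i, j in pairs]
orth_cosines = [dot(orthogonal[i], orthogonal[j]) for i, j in pairs]

# Every ETF pair sits at cosine -1/(K-1); every orthogonal pair at cosine 0.
target = -1.0 / (K - 1)
```

For $K = 10$ the ETF cosine is $-1/9$, so the two structures differ noticeably for small $K$ and approach each other only as $K$ grows.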
Normalized softmax addresses the constraint side cleanly by utilizing per\-vector normalization on the right manifold, zero bias, and temperature control, while NTCE and NONL address the loss\-level problems that persist even under perfect geometric constraints\.
### D.3 Associative Embedding and Prototype-Augmented SCL
Our work also connects to the associative-embedding literature (Newell et al., [2017](https://arxiv.org/html/2605.20302#bib.bib68); De Brabandere et al., [2017](https://arxiv.org/html/2605.20302#bib.bib69); Neven et al., [2019](https://arxiv.org/html/2605.20302#bib.bib70)), where embeddings are learned by contrasting in-group positives with out-of-group negatives and are applied to tasks where linear probing is not needed. While our work focuses on the principled SCL pipeline for classification, where linear probing is standard practice and where NC theory provides formal optimality guarantees, this conceptual link is meaningful and worth making explicit.
Concretely, eliminating linear probing is not relevant to associative methods (which already skip it), but our [Theorem 4.2](https://arxiv.org/html/2605.20302#S4.Thmtheorem2) is: it shows that optimizing per-instance alignment and uniformity (as in SCL) yields the same optima as optimizing with class-mean prototypes (as associative methods do). This gives theoretical room to apply SCL in associative method applications. However, associative methods operate in unconstrained Euclidean space, where precisely the radial degree-of-freedom problem we identify prevents convergence to optimal geometry. Our hyperspherical framework addresses this: normalizing embeddings to the unit sphere eliminates scale ambiguity and provides geometric guarantees that associative methods lack. Our results suggest one could replace associative losses with SCL on the hypersphere, achieving the same optima more easily thanks to the benign loss landscape (Yaras et al., [2022](https://arxiv.org/html/2605.20302#bib.bib23)).
Beyond SCL, NTCE and NONL open room for principled and more efficient CE\-style training that uses learnable group embeddings \(the classifier weights\), which we leave as a natural direction for future work on associative embedding methods\.
Closer in spirit, Gill et al. ([2024](https://arxiv.org/html/2605.20302#bib.bib62)) augment SCL with explicit prototype representations. Our perspective is complementary: rather than adding explicit prototypes to SCL, [Theorem 4.2](https://arxiv.org/html/2605.20302#S4.Thmtheorem2) proves that SCL already implicitly optimizes for them. The class-mean prototypes are not added on top of SCL; they are the classifier the SCL objective drives toward.
### D.4 NC under Specific Architectural and Training Choices
The standard NC theory builds on the unconstrained-features (UFM) and layer-peeled (LPM) models, which abstract away architectural details. Súkeník et al. ([2024](https://arxiv.org/html/2605.20302#bib.bib63)) and Súkeník et al. ([2025](https://arxiv.org/html/2605.20302#bib.bib64)) have begun to formally justify these idealizations for end-to-end training: proving deep UFM optimality for non-linear models and, in follow-up work, formally reducing end-to-end ResNet/Transformer training to an equivalent UFM, with the approximation tightening as depth grows. These results provide grounding for the UFM/LPM assumptions in [Theorem A.1](https://arxiv.org/html/2605.20302#A1.Thmtheorem1) and [Theorem 4.2](https://arxiv.org/html/2605.20302#S4.Thmtheorem2).
A second strand examines NC under specific activation functions. Kini et al. ([2024](https://arxiv.org/html/2605.20302#bib.bib61)) show that when using ReLU, the optimum of SCL is an orthogonal frame with collapsed in-class representations, rather than a centered simplex ETF. This minimizer can be plugged directly into our proof ([Appendix B](https://arxiv.org/html/2605.20302#A2), Part C and Step 3) in place of the Graf et al. minimizer. When changing the minimizer, the argmin equivalence from [Theorem 4.2](https://arxiv.org/html/2605.20302#S4.Thmtheorem2) still holds: $\arg\min \mathcal{L}_{\mathrm{SCL}} = \arg\min \mathcal{L}_{\mathrm{proto}}$ is structural and does not depend on which minimizer shape is realized. The geometric implications, however, do depend on the activation, and we acknowledge this in [Section 5.4](https://arxiv.org/html/2605.20302#S5.SS4). We also note that, contrary to NTCE and NONL, SCL does not achieve $\geq 95\%$ NC in our experiments ([Table 3](https://arxiv.org/html/2605.20302#S4.T3)), so the gap to the activation-dependent optimum remains a separate empirical concern.
A third thread examines the relationship between NC and transfer. Chen et al. ([2022](https://arxiv.org/html/2605.20302#bib.bib65)) and Kornblith et al. ([2021](https://arxiv.org/html/2605.20302#bib.bib66)) report a trade-off between collapse and downstream transfer in the coarse-to-fine setting, where collapsed representations may fail to preserve fine-grained sub-class structure within a coarse class. Harun et al. ([2025](https://arxiv.org/html/2605.20302#bib.bib67)) propose explicitly controlling the degree of NC at intermediate layers. Their method enforces NC at the classification layer, which is the layer we operate on, and reports strong transfer. This validates that NC at the final layer is compatible with strong transfer in the standard large-to-small protocol of Khosla et al. ([2020](https://arxiv.org/html/2605.20302#bib.bib20)), which is the protocol we evaluate. The broader NC literature similarly establishes that collapsed, maximally separated representations improve generalization, robustness, and transfer in this standard protocol (Papyan et al., [2020](https://arxiv.org/html/2605.20302#bib.bib15); Bartlett et al., [2017](https://arxiv.org/html/2605.20302#bib.bib17); Neyshabur et al., [2018](https://arxiv.org/html/2605.20302#bib.bib18); Fawzi et al., [2016](https://arxiv.org/html/2605.20302#bib.bib21); Ding et al., [2020](https://arxiv.org/html/2605.20302#bib.bib22); Galanti et al., [2021](https://arxiv.org/html/2605.20302#bib.bib2); Khosla et al., [2020](https://arxiv.org/html/2605.20302#bib.bib20); Soudry et al., [2018](https://arxiv.org/html/2605.20302#bib.bib16); Hein and Andriushchenko, [2017](https://arxiv.org/html/2605.20302#bib.bib10)). The coarse-to-fine regime is outside our evaluation scope; we list it in [Section 5.4](https://arxiv.org/html/2605.20302#S5.SS4).
## Appendix E Implementation Details
Experiments are conducted on four standard image classification datasets: CIFAR10, CIFAR100, ImageNet-100, and ImageNet1K, following common representation learning benchmarking practices (Khosla et al., [2020](https://arxiv.org/html/2605.20302#bib.bib20); Markou et al., [2024](https://arxiv.org/html/2605.20302#bib.bib48); Wang et al., [2021](https://arxiv.org/html/2605.20302#bib.bib52); Yeh et al., [2022](https://arxiv.org/html/2605.20302#bib.bib51)). We use ResNet50 for ImageNet-100/ImageNet1K and ResNet18 for CIFAR10/CIFAR100. All models are trained with the SGD optimizer for 500 epochs on ImageNet1K (temperature 0.1) and ImageNet-100 (batch size 1024, temperature 0.2), and for 1000 epochs on CIFAR10/CIFAR100. For ImageNet1K, to enable a fair comparison, we report for each method its best accuracy across training runs with batch sizes 2048, 4096, and 8192. For CIFAR10/100 we set the batch size to 512 and evaluate all 11 temperatures in the set [0.07, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]. In [Table 1](https://arxiv.org/html/2605.20302#S4.T1) and [Table 3](https://arxiv.org/html/2605.20302#S4.T3) we report, for each method, the best-performing temperature. For Supervised Contrastive Learning we perform the linear probing phase for the typical 90 epochs. Our code is publicly available at [https://github.com/pakoromilas/nc_by_design](https://github.com/pakoromilas/nc_by_design).
### E.1 Classifier Learning Methods (CE, NormFace, NTCE, NONL)
For the family of classifier learning methods, we employ the following hyperparameters across datasets:
#### CIFAR10/CIFAR100\.
Models are trained for 1000 epochs with batch size 512. We use the SGD optimizer with momentum 0.9, weight decay $10^{-4}$, and initial learning rate 0.2. The learning rate follows a cosine annealing schedule throughout training, decaying to a minimum value of $\eta_{\mathrm{min}} = \eta_0 \times 0.1^{3}$, where $\eta_0$ is the initial learning rate. Data augmentation consists of RandomResizedCrop with scale (0.2, 1.0), RandomHorizontalFlip, and standard normalization with dataset-specific mean and standard deviation values.
#### ImageNet\-100\.
ResNet50 models are trained for 500 epochs with batch size 1024 (256 per GPU with 4 GPUs). We employ the SGD optimizer with momentum 0.9, weight decay $10^{-4}$, and initial learning rate 0.1, which is automatically scaled based on the total batch size. We use a cosine annealing scheduler with 10 epochs of linear warmup from 0.01 to the target learning rate. After warmup, the learning rate follows a cosine decay to $\eta_{\mathrm{min}} = \eta_0 \times 0.1^{3}$. Synchronized BatchNorm is enabled across GPUs. Data augmentation includes RandomResizedCrop(224) with scale (0.2, 1.0), RandomHorizontalFlip, and standard ImageNet normalization.
#### ImageNet1K\.
ResNet50 models are trained for 500 epochs with batch size 2048 (256 per GPU with 8 GPUs). Hyperparameters follow the same configuration as ImageNet-100, with the SGD optimizer (momentum 0.9, weight decay $10^{-4}$) and initial learning rate 0.1 with automatic scaling based on batch size. We apply 10 epochs of linear warmup followed by cosine annealing to $\eta_{\mathrm{min}} = \eta_0 \times 0.1^{3}$. Data augmentation and normalization follow ImageNet-100 settings.
### E.2 Supervised Contrastive Learning
For supervised contrastive methods, we implement a two\-phase training procedure:
#### Phase 1: Contrastive Training.
CIFAR10/CIFAR100: Models are trained for 1000 epochs with batch size 512. The SGD optimizer is used with momentum 0.9, weight decay $10^{-4}$, and initial learning rate 0.05. The learning rate follows a cosine annealing schedule throughout training, decaying to $\eta_{\mathrm{min}} = \eta_0 \times 0.1^{3}$. We use extensive data augmentation including RandomResizedCrop with scale (0.2, 1.0), RandomHorizontalFlip, ColorJitter(0.4, 0.4, 0.4, 0.1) with probability 0.8, and RandomGrayscale with probability 0.2. Each image generates two augmented views for contrastive learning.
ImageNet-100: A ResNet50 encoder with a 128-dimensional projection head is trained for 500 epochs with batch size 1024. We use the SGD optimizer with momentum 0.9, weight decay $10^{-4}$, and base learning rate 0.8 (automatically scaled by batch size). The learning rate follows cosine annealing with 10 epochs of linear warmup from 0.01, then decays following a cosine schedule to $\eta_{\mathrm{min}} = \eta_0 \times 0.1^{3}$. Data augmentation extends the CIFAR settings with the addition of Gaussian blur for ImageNet-scale images.
ImageNet1K: Training spans 500 epochs with batch size 2048 using the same optimizer configuration as ImageNet-100. The base learning rate is set to 0.1 with automatic scaling. We employ cosine annealing with 5 epochs of warmup from 0.01, followed by cosine decay to $\eta_{\mathrm{min}} = \eta_0 \times 0.1^{3}$. The same augmentation pipeline as ImageNet-100 is used.
#### Phase 2: Linear Evaluation\.
For all datasets, we freeze the learned encoder and train a linear classifier on top of the representations:
CIFAR10/CIFAR100: The linear classifier is trained for 100 epochs using SGD with learning rate 5.0, momentum 0.9, and zero weight decay. The learning rate is decayed by a factor of 0.2 at epochs 60, 75, and 90 using a step scheduler.
ImageNet-100: Linear evaluation runs for 90 epochs with the SGD optimizer, learning rate 2.0, momentum 0.9, and zero weight decay. The learning rate is decayed by a factor of 0.2 at epochs 30, 60, and 80 using a step scheduler.
ImageNet1K: Linear classifier training spans 90 epochs with SGD, learning rate 0.8, momentum 0.9, and zero weight decay. The same step decay schedule as ImageNet-100 is applied.
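The step schedules listed for the linear evaluation phase can be written as a small function. This is a sketch under one assumption not stated in the text: the decay takes effect as soon as the 0-indexed epoch counter reaches a milestone. The function name `step_lr` is hypothetical.

```python
from bisect import bisect_right

def step_lr(epoch, base_lr, milestones, gamma=0.2):
    """Learning rate at a 0-indexed epoch under step decay.

    Assumption: the rate is multiplied by `gamma` once for every
    milestone that is less than or equal to the current epoch.
    """
    num_decays = bisect_right(milestones, epoch)
    return base_lr * gamma ** num_decays

# CIFAR10/CIFAR100 linear probe: lr 5.0, decays at epochs 60, 75, 90.
cifar_schedule = [step_lr(e, 5.0, [60, 75, 90]) for e in range(100)]

# ImageNet-100 linear probe: lr 2.0, decays at epochs 30, 60, 80.
imagenet100_schedule = [step_lr(e, 2.0, [30, 60, 80]) for e in range(90)]
```

Under this reading, the CIFAR probe finishes at $5.0 \times 0.2^{3} = 0.04$ and the ImageNet-100 probe at $2.0 \times 0.2^{3} = 0.016$.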
### E.3 Additional Implementation Details
For distributed training on ImageNet datasets, we employ DistributedDataParallel with one process per GPU. The random seed is fixed at 42 for reproducibility. The cosine annealing scheduler is implemented following the standard formulation: $\eta_t = \eta_{\mathrm{min}} + \frac{1}{2}(\eta_0 - \eta_{\mathrm{min}})\left(1 + \cos\left(\frac{\pi t}{T}\right)\right)$, where $t$ is the current epoch and $T$ is the total number of epochs. For experiments with warmup, the warmup period linearly interpolates from the warmup starting learning rate to the initial learning rate before transitioning to cosine annealing. The temperature parameter $\tau$ is searched over the range [0.07, 0.1, 0.2, …, 1.0] for CIFAR experiments, while ImageNet experiments use the optimal temperature found through preliminary experiments (0.1 for supervised contrastive, 0.2 for classifier learning methods). All models use standard weight initialization and no additional regularization beyond weight decay.
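The schedule formula above, together with the linear warmup, can be sketched as follows. The function name `lr_at_epoch` is hypothetical, batch-size-based learning-rate scaling is omitted, and one detail is an assumption: after warmup, the cosine phase is taken to run over the remaining epochs (with no warmup this reduces exactly to the stated formula).

```python
import math

def lr_at_epoch(t, total_epochs, eta0, warmup_epochs=0, warmup_start=0.01):
    """Learning rate at epoch t for cosine annealing with optional warmup."""
    eta_min = eta0 * 0.1 ** 3  # minimum rate as given in the text
    if t < warmup_epochs:
        # Linear interpolation from the warmup start to the initial rate.
        return warmup_start + (eta0 - warmup_start) * t / warmup_epochs
    # Assumption: cosine progress is measured over the post-warmup epochs.
    progress = (t - warmup_epochs) / (total_epochs - warmup_epochs)
    return eta_min + 0.5 * (eta0 - eta_min) * (1 + math.cos(math.pi * progress))

# CIFAR classifier-learning setting: eta0 = 0.2, 1000 epochs, no warmup.
cifar_start = lr_at_epoch(0, 1000, 0.2)
cifar_mid = lr_at_epoch(500, 1000, 0.2)
cifar_end = lr_at_epoch(1000, 1000, 0.2)

# ImageNet-100 classifier-learning setting: eta0 = 0.1, 10 warmup epochs.
inet_start = lr_at_epoch(0, 500, 0.1, warmup_epochs=10)
inet_after_warmup = lr_at_epoch(10, 500, 0.1, warmup_epochs=10)
```

With $\eta_0 = 0.2$ the schedule ends at $\eta_{\mathrm{min}} = 2 \times 10^{-4}$, three orders of magnitude below the initial rate.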
## Appendix F Convergence Dynamics
Table 7: Convergence speed (% of training iters): (I) time to reach the 95% NC threshold; (II) time to match CE's final value; "0%" indicates the target is met at the first logged eval.

| Method | Instance alignment | Weight alignment | Weights erank | Intra erank | Inter erank |
|---|---|---|---|---|---|
| *(I) NC convergence to $95\%$ threshold (ratio to max iterations)* | | | | | |
| NormFace | 79.4% | 8.2% | 52.6% | 45.4% | 56.2% |
| NTCE | 79.4% | 6.8% | 56.4% | 36.6% | 52.4% |
| NONL | 79.4% | 7.4% | 34.6% | 14.6% | 47.2% |
| *(II) CE convergence to converged value (ratio to CE converged iteration)* | | | | | |
| NormFace | 2.2% | 2.0% | 66.3% | 0% | 7.4% |
| NTCE | 2.2% | 1.8% | 73.9% | 0% | 7.4% |
| NONL | 2.2% | 1.8% | 35.4% | 0% | 6.0% |

On CIFAR-100, we track NC metrics and define convergence as the earliest iteration where the exponentially-weighted moving average enters and remains within a metric-specific tolerance around the 95% NC threshold.
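The convergence criterion just described can be sketched as below. This is an illustration only: the smoothing factor `alpha`, the tolerance `tol`, and the reading of "within a tolerance around the threshold" as a symmetric band are assumptions, since the text does not give the metric-specific values.

```python
def ewma(values, alpha):
    """Exponentially-weighted moving average, seeded with the first value."""
    smoothed, state = [], None
    for v in values:
        state = v if state is None else alpha * v + (1 - alpha) * state
        smoothed.append(state)
    return smoothed

def convergence_index(values, target, tol, alpha=0.1):
    """Earliest index from which the EWMA stays within `tol` of `target`.

    Returns None if the smoothed curve is outside the band at the end.
    """
    smoothed = ewma(values, alpha)
    first = None
    # Walk backwards so that a later excursion out of the band resets the answer.
    for i in reversed(range(len(smoothed))):
        if abs(smoothed[i] - target) <= tol:
            first = i
        else:
            break
    return first

# Hypothetical metric trace logged at successive evaluations.
trace = [0.0, 0.5, 0.9, 1.0, 1.0, 1.0, 1.0, 1.0]
idx = convergence_index(trace, target=1.0, tol=0.1, alpha=0.5)
```

Dividing the returned index by the number of logged evaluations would give a percentage comparable in spirit to the entries of Table 7.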
[Table 7](https://arxiv.org/html/2605.20302#A6.T7) (I) quantifies the convergence speed to 95% of the theoretical NC thresholds. Normalized losses reach these thresholds, typically early in training. NONL converges faster, with gains over NormFace on the rank metrics ($1.2$–$3.1\times$ speedup), benefiting from simplified optimization without competing terms. [Table 7](https://arxiv.org/html/2605.20302#A6.T7) (II) benchmarks against CE's converged values. The acceleration is dramatic: normalized losses reach CE-equivalent values in under 7.5% of CE's required iterations on 4/5 metrics, with NONL converging at least as fast as the other normalized losses on every metric. This demonstrates that normalized losses fundamentally restructure the optimization landscape, enabling direct paths to neural collapse.
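The quoted speedup range and the "4/5 metrics" count follow directly from the Table 7 entries. The short check below recomputes them; the dictionary layout and variable names are illustrative, while the numbers are copied from the table.

```python
# Part (I): % of training iterations to reach the 95% NC threshold (rank metrics).
normface_rank = {"weights_erank": 52.6, "intra_erank": 45.4, "inter_erank": 56.2}
nonl_rank = {"weights_erank": 34.6, "intra_erank": 14.6, "inter_erank": 47.2}

# Speedup of NONL over NormFace = ratio of iterations needed.
speedup = {k: normface_rank[k] / nonl_rank[k] for k in normface_rank}
speedup_range = (round(min(speedup.values()), 1), round(max(speedup.values()), 1))

# Part (II): % of CE's iterations needed to match CE's converged value, in the
# column order instance alignment, weight alignment, weights erank,
# intra erank, inter erank.
part_two = {
    "NormFace": [2.2, 2.0, 66.3, 0.0, 7.4],
    "NTCE": [2.2, 1.8, 73.9, 0.0, 7.4],
    "NONL": [2.2, 1.8, 35.4, 0.0, 6.0],
}
under_threshold = {m: sum(v < 7.5 for v in vals) for m, vals in part_two.items()}
```

The one metric that exceeds 7.5% for every method is the weights effective rank.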
[Figure 2 plots, six panels over training iterations (0 to 5k): (a) Training Accuracy, (b) Weight–Instance Alignment, (c) Weight–Class Alignment, (d) Weight Effective Rank, (e) Intra-Class Effective Rank, (f) Inter-Class Effective Rank.]
Figure 2: NC convergence on CIFAR-100. Six metrics vs. training iterations; red dashed lines mark the 95% NC threshold and circles denote each method's convergence.

[Figure 2](https://arxiv.org/html/2605.20302#A6.F2) illustrates the training dynamics. While cross-entropy (CE) achieves perfect training accuracy, it fails to reach neural collapse geometry, plateauing at suboptimal metric values. CE's accuracy improvements appear to derive solely from magnitude and bias adjustments rather than geometric reorganization. In contrast, our methods simultaneously optimize all NC metrics throughout training, converging to proper NC geometry while maintaining optimal accuracy.
## Appendix G Extra Ablation Studies
### G.1 Role of the Projection Head
Table 8: Contrastive Learning Results - Without Projection Head. Performance comparison across different classifier learning approaches without projection head.

| Classifier Learning | Loss | CIFAR-10 | CIFAR-100 | ImageNet-100 | ImageNet-1K |
|---|---|---|---|---|---|
| Linear Probing | SCL | 95 | 70.6 | 84.1 | 71 |
| Normalized Linear Probing | SCL | 95 | 71.4 | 84.3 | 72.1 |
| Fixed Prototypes | SCL | 95 | 71.4 | 84.7 | 70.1 |

In [Table 8](https://arxiv.org/html/2605.20302#A7.T8) we demonstrate the importance of the projection head in contrastive training. On every dataset except the relatively simple CIFAR-10 benchmark, removing the head consistently reduces accuracy by more than 2 points. At first glance, one might expect the opposite: discarding the head should let the loss act directly on the final encoder embeddings on the unit hypersphere. We hypothesize that the projection head helps primarily by imposing a beneficial dimensionality bottleneck. With ResNet-50, the encoder's representation is 2048-dimensional, whereas the projection head maps it to 128 dimensions. For a $K$-class problem (e.g., $K = 100$), the ideal equiangular tight frame (ETF) geometry lives in a $(K-1)$-dimensional subspace. Encouraging embeddings to adopt this structure is plausibly easier in a 128-dimensional space than in a 2048-dimensional one, where the optimizer has many more irrelevant directions to explore.
[Figure 3 panels: (a) CIFAR-10, (b) CIFAR-100.]
Figure 3: Validation Accuracy (%) Phase Diagrams. Classification accuracy on validation set. Higher values indicate better generalization performance. Each subplot shows the performance landscape across temperature and batch size hyperparameters for different loss functions: NormFace, NTCE, and NONL. Brighter regions indicate superior performance. White contour lines indicate iso-performance curves for detailed analysis. Red dashed contours highlight optimal parameter regions (top 10% performance). Scatter points represent individual experimental runs with performance-based sizing. Each dataset uses its own optimal colorbar range. Results originate from grid runs across temperature values in [0.07, 1.0] and batch sizes in {32, 64, 128, 256, 512, 1024, 2048}.
### G.2 Effective hyperparameter ranges
Normalized softmax losses introduce two hyperparameters inherited from contrastive learning: the temperature $\tau$, which controls the sharpness of the similarity distribution, and the (typically larger) batch size $B$, which governs the number of in-batch negatives. We assess their impact by grid-searching $\tau \in [0.07, 1.0]$ (11 values) and $B \in \{32, 64, 128, 256, 512, 1024, 2048\}$ on CIFAR-10 and CIFAR-100 with NormFace, NTCE, and NONL.
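For concreteness, the size of this grid can be enumerated as follows. The sketch assumes the 11 temperatures are the same values listed in Appendix E; the configuration keys are hypothetical and do not reflect the repository's actual config format.

```python
from itertools import product

# Assumption: the 11 grid temperatures match the set listed in Appendix E.
temperatures = [0.07, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
batch_sizes = [32, 64, 128, 256, 512, 1024, 2048]
methods = ["NormFace", "NTCE", "NONL"]
datasets = ["CIFAR-10", "CIFAR-100"]

# One configuration per (dataset, method, temperature, batch size) combination.
grid = [
    {"dataset": d, "method": m, "temperature": tau, "batch_size": b}
    for d, m, tau, b in product(datasets, methods, temperatures, batch_sizes)
]

# Number of runs behind each individual phase diagram in Figure 3.
runs_per_panel = len(temperatures) * len(batch_sizes)
```

Each method and dataset pair therefore corresponds to 77 runs, and the full sweep to 462.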
[Figure 3](https://arxiv.org/html/2605.20302#A7.F3) shows consistent "sweet spots" across methods: accuracy forms a pronounced band at moderate temperatures, with performance degrading for overly large $\tau$ and, to a lesser extent, for very small $\tau$. The location of this band shifts toward slightly smaller temperatures as the number of classes increases (CIFAR-100 vs. CIFAR-10), mirroring observations in self-supervised contrastive learning (Chen et al., [2020](https://arxiv.org/html/2605.20302#bib.bib11)). Within the effective temperature range, performance is comparatively insensitive to $B$, yielding a broad plateau over batch sizes: large batches can help, but are not strictly required.
In practice, these trends provide the same reliable defaults as in self-supervised contrastive learning (Chen et al., [2020](https://arxiv.org/html/2605.20302#bib.bib11)): $\tau \in \{0.1, 0.2\}$ works well for small- to medium-class datasets, while $\tau \in \{0.07, 0.1\}$ is preferable for larger-class settings. Thus, although normalized softmax losses expose additional hyperparameters, their effective ranges are narrow and stable, so a small amount of tuning (or even these defaults) is typically sufficient to reach near-peak accuracy.
### G.3 Temperature Ablation at Batch Size 512
[Tables 9](https://arxiv.org/html/2605.20302#A7.T9) and [10](https://arxiv.org/html/2605.20302#A7.T10) report validation accuracy across temperatures $\tau \in \{0.07, 0.1, 0.2, \ldots, 1.0\}$ at batch size 512, the configuration used for the CIFAR results in [Table 1](https://arxiv.org/html/2605.20302#S4.T1). **Bold** marks the temperature reported in [Table 1](https://arxiv.org/html/2605.20302#S4.T1) (i.e., the best $\tau$ per method). We observe that the optimal $\tau$ shifts toward smaller values as the number of classes grows, consistent with self-supervised contrastive learning (Chen et al., [2020](https://arxiv.org/html/2605.20302#bib.bib11)).
Table 9: CIFAR-100, Validation Accuracy (%) at batch size 512. Bold = value reported in [Table 1](https://arxiv.org/html/2605.20302#S4.T1) (best $\tau$ per method).

| Objective | $\tau{=}0.07$ | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| NormFace | 71.7 | 71.7 | 72.0 | 72.4 | 63.6 | 52.0 | 42.5 | 37.0 | 33.6 | 28.9 | 27.6 |
| NTCE | 72.0 | 71.7 | 72.9 | 71.4 | 64.0 | 52.3 | 42.7 | 38.6 | 30.2 | 29.0 | 28.2 |
| NONL | 73.2 | 73.6 | 72.5 | 71.6 | 63.4 | 52.6 | 42.8 | 37.2 | 31.2 | 28.2 | 26.8 |

Table 10: CIFAR-10, Validation Accuracy (%) at batch size 512. Bold = value reported in [Table 1](https://arxiv.org/html/2605.20302#S4.T1) (best $\tau$ per method).

| Objective | $\tau{=}0.07$ | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 | 1.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| NormFace | 94.6 | 94.3 | 94.5 | 94.5 | 94.6 | 94.7 | 94.8 | 94.6 | 85.9 | 85.9 | 85.7 |
| NTCE | 94.2 | 94.3 | 94.3 | 94.4 | 94.5 | 94.6 | 94.7 | 94.5 | 86.2 | 85.9 | 85.7 |
| NONL | 92.9 | 94.7 | 94.9 | 94.6 | 94.4 | 94.3 | 94.1 | 94.3 | 86.0 | 85.9 | 85.9 |
### G.4 Need for large batch size
[Table 11](https://arxiv.org/html/2605.20302#A7.T11) shows that NONL benefits from larger batches, reaching +1.1% over CE on ImageNet-1K with an appropriate batch size. We explicitly acknowledge that needing larger batches is a practical drawback, but this is a typical limitation of contrastive methods (including SCL (Khosla et al., [2020](https://arxiv.org/html/2605.20302#bib.bib20)) and many SSL methods (Chen et al., [2020](https://arxiv.org/html/2605.20302#bib.bib11))). Our work highlights that supervised normalized classifiers are already performing a kind of contrastive learning with class prototypes, and that fully reaping the generalization and robustness benefits naturally points toward larger effective negative sets. Recent SSL methods (memory banks, queues, momentum encoders) suggest promising ways to increase the number of negatives without linearly scaling batch size; we highlight this as a natural direction for future extensions of NONL.
Table 11: Top-1 accuracy (%) on ImageNet-1K for different batch sizes and loss functions.

| Loss | 2048 | 4096 | 8192 |
|---|---|---|---|
| CE | 75.4 | 75.4 | 75.1 |
| NormFace | 75.6 | 76.4 | 76.3 |
| NTCE (ours) | 76.0 | 76.7 | 76.7 |
| NONL (ours) | 75.0 | 76.2 | 76.5 |
### G.5 Applicability to larger architectures
In [Table 12](https://arxiv.org/html/2605.20302#A7.T12) we test deeper and higher-capacity models and observe the same consistent trend: our normalized prototype-based losses (NTCE and NONL) improve upon CE in terms of accuracy while inducing significantly stronger NC geometry. These results indicate that the benefits of our objectives are not limited to small or medium-scale architectures, but extend to larger backbones as well.
Table 12: ImageNet-100 top-1 accuracy (%) for different backbones. Best results per column are highlighted in green. The last row reports the relative improvement of NONL over CE.

| Method | ResNet-50 | ResNet-101 | ResNet-152 | Mean |
|---|---|---|---|---|
| CE | 84.4 | 85.3 | 85.5 | 85.1 |
| NormFace | 84.4 | 85.4 | 85.6 | 85.1 |
| NTCE | 84.7 | 85.4 | 85.4 | 85.2 |
| NONL | 84.9 | 85.5 | 85.8 | 85.4 |
| $\Delta$ (NONL $-$ CE) | +0.6% | +0.2% | +0.4% | +0.4% |
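The last row of Table 12 can be reproduced from the CE and NONL rows if "relative improvement" is read as $100 \cdot (\text{NONL} - \text{CE}) / \text{CE}$. That reading is an assumption, although it is the one that matches the reported values. A short check:

```python
# Top-1 accuracies copied from Table 12: ResNet-50, ResNet-101, ResNet-152, Mean.
ce = [84.4, 85.3, 85.5, 85.1]
nonl = [84.9, 85.5, 85.8, 85.4]

# Relative improvement in percent, rounded to one decimal as in the table.
relative_gain = [round(100.0 * (n - c) / c, 1) for c, n in zip(ce, nonl)]
print(relative_gain)
```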
### G.6 Statistical Robustness
ImageNet-1K runs require ~200 GPU-hours each, and works of this scale typically report single runs (Khosla et al., [2020](https://arxiv.org/html/2605.20302#bib.bib20); Chen et al., [2020](https://arxiv.org/html/2605.20302#bib.bib11); He et al., [2020](https://arxiv.org/html/2605.20302#bib.bib49)), allocating compute to scaling rather than repetition. To address concerns about statistical robustness we provide 5-seed results on the smaller datasets, with temperature fixed to the value reported in [Table 1](https://arxiv.org/html/2605.20302#S4.T1) (i.e., the best $\tau$ per method). [Table 13](https://arxiv.org/html/2605.20302#A7.T13) reports mean $\pm$ standard deviation. On CIFAR-100, NONL exceeds CE by roughly four standard deviations, and the relative ordering CE < NormFace < NTCE < NONL is consistent at fixed configuration, validating the superiority of our methods. We note that CIFAR-10 with a ResNet-18 is near-saturated at ~94.6%, leaving limited headroom for separation between methods on this benchmark; nevertheless NONL still leads.
On the role of temperature: $\tau$ does not grant normalized losses extra expressivity over CE. Standard CE can replicate the effect of any $\tau$ through weight magnitudes (setting $\|\bm{w}_c\| \propto 1/\tau$) and additionally has free biases and unconstrained feature norms, rendering it strictly more flexible. NormFace, which is not one of our proposed methods, already uses $\tau$ and already outperforms CE, isolating the benefit of temperature from our contributions; on ImageNet-1K we use a single typical $\tau$ without grid search ([Appendix E](https://arxiv.org/html/2605.20302#A5)), yet gains persist.
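The claim that CE can absorb $\tau$ into its weight norms can be checked numerically. The sketch below assumes unit-norm features and zero biases for the CE classifier (CE is free to choose both), and uses made-up vectors; `normalize` and `dot` are illustrative helpers.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

tau = 0.1
# Hypothetical unit-norm feature and three unit-norm class prototypes.
feature = normalize([0.3, -1.2, 0.8])
prototypes = [normalize(p) for p in ([1.0, 0.0, 0.5],
                                     [-0.2, -1.0, 0.4],
                                     [0.1, 0.3, -0.9])]

# Normalized-softmax logits: cosine similarity divided by the temperature.
normalized_logits = [dot(p, feature) / tau for p in prototypes]

# Plain CE logits with weights of norm 1/tau pointing along the prototypes.
ce_weights = [[x / tau for x in p] for p in prototypes]
ce_logits = [dot(w, feature) for w in ce_weights]

print(normalized_logits)
print(ce_logits)
```

The two logit vectors coincide, so any temperature available to a normalized loss is also reachable by an unnormalized CE classifier.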
Table 13: Mean $\pm$ standard deviation of validation accuracy (%) across 5 seeds, with temperature fixed to the value reported in [Table 1](https://arxiv.org/html/2605.20302#S4.T1).

| Method | CIFAR-100 | CIFAR-10 |
|---|---|---|
| CE | 72.16 $\pm$ 0.20 | 94.52 $\pm$ 0.15 |
| NormFace | 72.32 $\pm$ 0.17 | 94.64 $\pm$ 0.10 |
| NTCE | 72.78 $\pm$ 0.21 | 94.68 $\pm$ 0.07 |
| NONL | 73.38 $\pm$ 0.28 | 94.84 $\pm$ 0.15 |
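The "roughly four standard deviations" figure for CIFAR-100 can be recovered from Table 13. The sketch below assumes the gap is measured against the larger of the two standard deviations (NONL's 0.28); the text does not state which one is used, so this is one plausible reading.

```python
# CIFAR-100 results copied from Table 13: method -> (mean, std) over 5 seeds.
cifar100 = {
    "CE": (72.16, 0.20),
    "NormFace": (72.32, 0.17),
    "NTCE": (72.78, 0.21),
    "NONL": (73.38, 0.28),
}

ce_mean, ce_std = cifar100["CE"]
nonl_mean, nonl_std = cifar100["NONL"]

gap = nonl_mean - ce_mean
# Conservative choice (assumption): divide by the larger standard deviation.
gap_in_stds = gap / max(ce_std, nonl_std)

# Methods sorted by mean accuracy, lowest first.
ordering = sorted(cifar100, key=lambda name: cifar100[name][0])

print(round(gap, 2), round(gap_in_stds, 2), ordering)
```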
### G.7 Computational Cost of Linear Probing vs. Fixed Prototypes
The standard linear probing protocol (Khosla et al., [2020](https://arxiv.org/html/2605.20302#bib.bib20); Chen et al., [2020](https://arxiv.org/html/2605.20302#bib.bib11)) applies data augmentation at every epoch, so representations change and cannot be cached; the full encoder forward pass must repeat $T$ times for $T$ probing epochs, giving $T \times N$ forward passes, where $N$ is the number of training samples. A single-epoch cache-and-train approach is possible but deviates from the established protocol and yields lower accuracy, since the LP classifier no longer sees the augmentation-induced variance that the standard protocol relies on. [Table 14](https://arxiv.org/html/2605.20302#A7.T14) reports the gap.
On 8×B200 GPUs, 500 epochs of pretraining take ~25 h (~200 GPU-hours), while 90 LP epochs add ~3 h (~24 GPU-hours), roughly \$342 on AWS per run. FP requires exactly one forward pass per sample, eliminating this overhead while matching LP accuracy ([Table 2](https://arxiv.org/html/2605.20302#S4.T2)).
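As a rough illustration of why FP needs only one pass over the data, the sketch below builds prototypes as class-mean embeddings and classifies by highest cosine similarity. It operates on toy, precomputed embeddings; the function names are hypothetical, and re-normalizing the class means onto the unit sphere is an assumption rather than a detail confirmed by the text.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def class_mean_prototypes(embeddings, labels):
    """One pass over the data: sum unit-norm embeddings per class, then normalize."""
    sums = {}
    for z, y in zip(embeddings, labels):
        z = normalize(z)
        acc = sums.setdefault(y, [0.0] * len(z))
        for i, x in enumerate(z):
            acc[i] += x
    # Normalizing the sum gives the same direction as normalizing the mean.
    return {y: normalize(acc) for y, acc in sums.items()}

def predict(z, prototypes):
    """Assign the class whose prototype has the highest cosine similarity."""
    z = normalize(z)
    return max(prototypes,
               key=lambda y: sum(a * b for a, b in zip(prototypes[y], z)))

# Toy 2-D embeddings standing in for encoder outputs (made-up values).
train_embeddings = [[1.0, 0.1], [0.9, -0.1], [-0.1, 1.0], [0.1, 0.9]]
train_labels = [0, 0, 1, 1]

prototypes = class_mean_prototypes(train_embeddings, train_labels)
predictions = [predict(z, prototypes) for z in ([0.8, 0.2], [0.0, 1.0])]
print(predictions)
```

No gradient steps are involved: the prototypes are fixed once the $N$ embeddings have been computed, in contrast to the $T \times N$ forward passes of standard LP.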
Table 14: Effect of caching encoder features during linear probing on SCL representations. Standard LP performs augmentation at every epoch (requiring $T \times N$ forward passes); cached LP computes encoder features once and trains a classifier on the cached representations ($N$ forward passes total but no augmentation during LP). FP eliminates LP entirely.

| Dataset | Standard LP | Cached LP | FP (ours) |
|---|---|---|---|
| CIFAR-100 | 73.9 | 72.3 | 73.9 |
| CIFAR-10 | 95.0 | 94.4 | 95.0 |