The role of class encoding in neural collapse
Summary
This paper investigates how class label encoding influences neural collapse in neural network classifiers, showing that with one-hot encoding and balanced data, uncentered mean features transition from a simplex equiangular tight frame to an orthogonal frame as bias regularization increases.
# The role of class encoding in neural collapse
Source: [https://arxiv.org/html/2606.00344](https://arxiv.org/html/2606.00344)
###### Abstract
Neural collapse is a structural property of the last-hidden-layer activations in neural network classification models that emerges when training continues beyond zero classification error. In this work, we explore the role of label encoding in neural collapse by relying on the unconstrained features model with mean squared error training loss. We demonstrate that, for one-hot encoded labels and balanced data, the uncentered mean features associated with each class transition from a simplex equiangular tight frame to an orthogonal frame when increasing the bias regularization coefficient associated with the final classifier. These structures are reminiscent of the orthogonal frame structure of one-hot encoded labels. For any arbitrary encoding, we also show that the final classifier’s bias aims at centering the labels, compensating for the discrepancy between the global mean of the labels and the origin. We further discuss the role of the encoding in other neural collapse properties.
## I Introduction
Neural collapse \(NC\) refers to a phenomenon occurring during the terminal stage of learning deep neural network classifiers\. By terminal stage of learning, we refer to the last epochs, where the training loss is further reduced while the training error is zero \(i\.e\., all inputs are correctly classified\)\. It was noticed in\[[17](https://arxiv.org/html/2606.00344#bib.bib24)\]that some structure appears in the activations of the last hidden layer; namely, the activations corresponding to inputs of a same class concentrate around a mean value, leading to the*neural collapse*terminology, and these mean vectors, after centering, localize at the vertices of a simplex equiangular tight frame \(ETF\)\. This nice geometric structure was exploited in the design of out\-of\-distribution detection algorithms\[[8](https://arxiv.org/html/2606.00344#bib.bib10)\]and transfer learning methods\[[5](https://arxiv.org/html/2606.00344#bib.bib7)\]among others; see\[[12](https://arxiv.org/html/2606.00344#bib.bib16)\]for a review on NC\.
Understanding the emergence of NC has been the focus of many works over the past years\. The original work\[[17](https://arxiv.org/html/2606.00344#bib.bib24)\]numerically observed NC on classifiers trained with the cross\-entropy loss across various datasets and architectures\. Other works further explored the occurrence of NC for different losses\[[9](https://arxiv.org/html/2606.00344#bib.bib11),[28](https://arxiv.org/html/2606.00344#bib.bib35)\], and aimed at generalizing NC to intermediate layers\[[18](https://arxiv.org/html/2606.00344#bib.bib25),[20](https://arxiv.org/html/2606.00344#bib.bib27),[19](https://arxiv.org/html/2606.00344#bib.bib28)\], to regression models\[[1](https://arxiv.org/html/2606.00344#bib.bib1)\]or to account for imbalanced data\[[4](https://arxiv.org/html/2606.00344#bib.bib5),[22](https://arxiv.org/html/2606.00344#bib.bib32),[14](https://arxiv.org/html/2606.00344#bib.bib20)\]\. Additionally, motivated by large language models, the works\[[11](https://arxiv.org/html/2606.00344#bib.bib14),[25](https://arxiv.org/html/2606.00344#bib.bib33)\]investigated NC in the regime where the number of classes outnumbers the last\-hidden\-layer dimension\.
Arguably the most common approach towards a theoretical characterization of NC relies on the*unconstrained features model*\(UFM\)\[[16](https://arxiv.org/html/2606.00344#bib.bib23)\], or the related*layer\-peeled model*\[[4](https://arxiv.org/html/2606.00344#bib.bib5)\]\. These models formulate the training problem as the joint optimization of last\-hidden\-layer features and a final linear classifier, i\.e\., the last\-hidden\-layer features are not restricted by the expressivity of the previous layers\. These simplified models, which can be seen as \(overparametrized\) matrix factorization problems, were used to prove that global minimizers satisfy the NC properties for various loss functions\[[6](https://arxiv.org/html/2606.00344#bib.bib8),[3](https://arxiv.org/html/2606.00344#bib.bib4),[9](https://arxiv.org/html/2606.00344#bib.bib11),[4](https://arxiv.org/html/2606.00344#bib.bib5),[16](https://arxiv.org/html/2606.00344#bib.bib23),[28](https://arxiv.org/html/2606.00344#bib.bib35),[23](https://arxiv.org/html/2606.00344#bib.bib30),[15](https://arxiv.org/html/2606.00344#bib.bib22)\]\. Subsequent works provide global landscape analyses, showing that the UFM/layer\-peeled model has a benign loss landscape and, consequently, that the training algorithm will converge to a point satisfying the NC properties as soon as it is able to avoid saddle points with a strictly negative curvature in some directions\[[29](https://arxiv.org/html/2606.00344#bib.bib37),[26](https://arxiv.org/html/2606.00344#bib.bib34),[27](https://arxiv.org/html/2606.00344#bib.bib36)\]\.
In this paper, we focus on the mean squared error (MSE) loss function. The use of this loss function has gained traction in recent years for classification problems \[[2](https://arxiv.org/html/2606.00344#bib.bib3),[10](https://arxiv.org/html/2606.00344#bib.bib12)\]. In \[[23](https://arxiv.org/html/2606.00344#bib.bib30)\], the authors showed that the geometric structure of the optimal solutions of the UFM for the MSE loss with one-hot encoding differs depending on the use of a bias term in the final classifier: the (uncentered) mean activations are organized as a simplex ETF when some (unregularized) bias is used, while they form an orthogonal frame (OF) in the absence of bias. In this work, we build a bridge between these two structures by showing that OFs and simplex ETFs are merely shifted versions of one another; see Figure [1](https://arxiv.org/html/2606.00344#S3.F1). The transition between these two structures is due to the final classifier’s bias, which aims at compensating for the possible non-centering of the labels and whose magnitude depends on the strength of its regularization. Note that these structures are reminiscent of the OF structure of one-hot encoded labels. We further explore the role of the encoding in NC, and demonstrate that, although variability collapse does not depend on the encoding, other NC properties do.
The structure of the paper is as follows\. We recall fundamental definitions in Section[II](https://arxiv.org/html/2606.00344#S2), show the equivalence between simplex ETFs and OFs, and discuss the role of the bias of the final classifier for general encodings in Section[III](https://arxiv.org/html/2606.00344#S3), and address other NC properties in Section[IV](https://arxiv.org/html/2606.00344#S4)\.
## II Preliminaries
Let us consider a set of input data $x_1,\dots,x_N\in\mathbb{R}^{d_x}$, to be classified into $K$ classes. Let us write $\bar{y}_k\in\mathbb{R}^{K}$ for the label of class $k\in\{1,\dots,K\}$ and define $\bar{Y}:=[\bar{y}_1\cdots\bar{y}_K]\in\mathbb{R}^{K\times K}$. Note that we consider the dimension of the encoding to be equal to the number of classes $K$, which includes one-hot encoding and label smoothing, among others. Let $y_i\in\{\bar{y}_1,\dots,\bar{y}_K\}$ be the target label vector associated with input $x_i$ and define $Y:=[y_1\cdots y_N]\in\mathbb{R}^{K\times N}$. For notational convenience, we assume that the samples are ordered by class:

$$Y=\bar{Y}C,\tag{1}$$

where $C:=\begin{bmatrix}e_1 1_{n_1}^{\top}&\cdots&e_K 1_{n_K}^{\top}\end{bmatrix}\in\mathbb{R}^{K\times N}$ is a class assignment matrix, where $e_k$ denotes the $k^{\text{th}}$ canonical vector, $n_k$ the number of occurrences of the target label $\bar{y}_k$ in the dataset, and $1_{n_k}$ the vector of ones of length $n_k$. This notation generalizes the Kronecker product formulation widely used in the literature for balanced classes \[[23](https://arxiv.org/html/2606.00344#bib.bib30),[24](https://arxiv.org/html/2606.00344#bib.bib31)\].
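The class assignment matrix can be made concrete with a small NumPy sketch. The helper name `class_assignment_matrix` and the class sizes below are illustrative assumptions, not part of the paper; the sketch only shows how $C$ replicates each class label $\bar{y}_k$ exactly $n_k$ times when samples are ordered by class.

```python
import numpy as np

def class_assignment_matrix(counts):
    """Build C = [e_1 1_{n_1}^T ... e_K 1_{n_K}^T], of shape (K, N)."""
    K = len(counts)
    # Column k of the identity is e_k; repeat it n_k times along the sample axis.
    return np.repeat(np.eye(K), counts, axis=1)

counts = [2, 1, 3]                  # hypothetical class sizes n_1, n_2, n_3
C = class_assignment_matrix(counts)
Y_bar = np.eye(3)                   # one-hot encoding, used here as an example
Y = Y_bar @ C                       # labels of all N = 6 samples, ordered by class
```

Note that $CC^{\top}=\mathrm{diag}(n_1,\dots,n_K)$, which reduces to $nI_K$ for balanced classes.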
NC refers to four properties that occur simultaneously\.
- $\mathcal{NC}1$ (Variability collapse): the last-hidden-layer features concentrate around their class mean.
- $\mathcal{NC}2$ (Convergence to a simplex ETF): the class means, after centering by their global mean, converge to a simplex ETF (see Definition [II.1](https://arxiv.org/html/2606.00344#S2.Thmdefinition1)).
- $\mathcal{NC}3$ (Convergence to self-duality): the last layer’s classifier weights converge to the class means, up to rescaling.
- $\mathcal{NC}4$ (Simplification to nearest-class-center classification): the linear classifier’s decision rule reduces to choosing the class whose mean is the nearest to the last-hidden-layer activations.
NC is tightly related to the notion of simplex ETF\[[17](https://arxiv.org/html/2606.00344#bib.bib24)\]\.
###### Definition II.1 (Simplex ETF).
A simplex equiangular tight frame \[[17](https://arxiv.org/html/2606.00344#bib.bib24)\] is a collection of $K$ vectors in $\mathbb{R}^{d}$ (with $d\geqslant K$) specified by the columns of

$$M=\alpha P\left(I_K-\frac{1_K1_K^{\top}}{K}\right),$$

where $P\in\mathbb{R}^{d\times K}$ is semi-orthogonal (i.e., has orthonormal columns) and $\alpha>0$. Note that, equivalently, $M$ satisfies

$$M^{\top}M=\alpha^{2}\left(I_K-\frac{1_K1_K^{\top}}{K}\right).$$
###### Definition II.2 (OF).
By orthogonal frame, we refer to a collection of $K$ vectors in $\mathbb{R}^{d}$ (with $d\geqslant K$) specified by the columns of

$$M=\alpha P,$$

where $P\in\mathbb{R}^{d\times K}$ is semi-orthogonal and $\alpha>0$.
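As a sanity check on the two definitions, the following NumPy sketch builds an OF and a simplex ETF from a random semi-orthogonal $P$ and computes their Gram matrices. The function names, the dimensions, and the use of a QR factorization to obtain $P$ are assumptions made for illustration only.

```python
import numpy as np

def orthogonal_frame(d, K, alpha=1.0, seed=0):
    """Return M = alpha * P with P a d x K matrix with orthonormal columns."""
    rng = np.random.default_rng(seed)
    # Reduced QR of a random d x K matrix yields a semi-orthogonal P (d >= K).
    P, _ = np.linalg.qr(rng.standard_normal((d, K)))
    return alpha * P

def simplex_etf(d, K, alpha=1.0, seed=0):
    """Return M = alpha * P (I_K - 1 1^T / K), built from the same P as above."""
    centering = np.eye(K) - np.ones((K, K)) / K
    return orthogonal_frame(d, K, alpha, seed) @ centering

d, K, alpha = 5, 3, 2.0
M_of = orthogonal_frame(d, K, alpha)
M_etf = simplex_etf(d, K, alpha)
gram_of = M_of.T @ M_of       # should match alpha^2 * I_K
gram_etf = M_etf.T @ M_etf    # should match alpha^2 * (I_K - 1 1^T / K)
```

The columns of `M_etf` sum to zero, whereas those of `M_of` do not; this is the only difference between the two structures.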
In this work, we analyze NC through the lens of the Unconstrained Features Model (UFM) proposed in \[[16](https://arxiv.org/html/2606.00344#bib.bib23)\], expressed here for the MSE loss:

$$\min_{W,H,b}\ \frac{1}{2N}\|WH+b1_N^{\top}-Y\|_{\mathrm{F}}^{2}+\frac{\lambda_W}{2}\|W\|_{\mathrm{F}}^{2}+\frac{\lambda_H}{2}\|H\|_{\mathrm{F}}^{2}+\frac{\lambda_b}{2}\|b\|_{2}^{2},\tag{2}$$

where $H\in\mathbb{R}^{d\times N}$ with $d\geqslant K$ are unconstrained features, which model here the last-hidden-layer activations, omitting expressivity restrictions due to the network architecture, $W\in\mathbb{R}^{K\times d}$ and $b\in\mathbb{R}^{K}$ are respectively the final classifier’s weights and bias, and $\lambda_W>0$, $\lambda_H>0$ and $\lambda_b\geqslant 0$ are regularization parameters. Note that the linearity assumption on the classifier is standard \[[16](https://arxiv.org/html/2606.00344#bib.bib23)\].
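For readers who want to experiment numerically, the objective in (2) can be evaluated with a few lines of NumPy. This is a minimal sketch; the function name `ufm_objective` and the toy inputs are assumptions introduced here, not taken from the paper.

```python
import numpy as np

def ufm_objective(W, H, b, Y, lam_W, lam_H, lam_b):
    """Evaluate the regularized MSE objective of the UFM, equation (2)."""
    N = Y.shape[1]
    residual = W @ H + b[:, None] - Y      # b 1_N^T is realized by broadcasting
    fit = np.sum(residual ** 2) / (2 * N)  # squared Frobenius norm, scaled by 1/(2N)
    reg = 0.5 * (lam_W * np.sum(W ** 2)
                 + lam_H * np.sum(H ** 2)
                 + lam_b * np.sum(b ** 2))
    return fit + reg

# Toy example with K = d = N = 2: a perfect fit, so only the regularizers remain.
W = np.eye(2)
H = np.eye(2)
b = np.zeros(2)
Y = np.eye(2)
value = ufm_objective(W, H, b, Y, lam_W=0.1, lam_H=0.2, lam_b=0.3)
```

In this toy case the fit term vanishes and the value is $\frac{1}{2}(0.1\cdot 2+0.2\cdot 2)=0.3$.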
## III From OFs to simplex ETFs: the role of bias
### III-A One-hot encoded labels and balanced data
We address here the UFM problem ([2](https://arxiv.org/html/2606.00344#S2.Ex4)) with one-hot encoded labels and balanced data, in line with \[[23](https://arxiv.org/html/2606.00344#bib.bib30),[27](https://arxiv.org/html/2606.00344#bib.bib36)\].
###### Theorem III.1 (Theorem 3.1 in \[[27](https://arxiv.org/html/2606.00344#bib.bib36)\], shortened).
Assume that the labels are one-hot encoded, i.e., $\bar{Y}=I_K$, that the data are balanced, i.e., $n_1=\cdots=n_K=:n=\frac{N}{K}$, and define $c:=K\sqrt{n\lambda_W\lambda_H}$. Then, if $c<1$, any global minimizer $(W^{*},H^{*},b^{*})$ of ([2](https://arxiv.org/html/2606.00344#S2.Ex4)) satisfies $\mathcal{NC}1$ and $\mathcal{NC}3$:

$$H^{*}=\bar{H}^{*}C\quad\text{and}\quad{W^{*}}^{\top}=\sqrt{\frac{n\lambda_H}{\lambda_W}}\,\bar{H}^{*},\tag{3}$$

where $\bar{H}^{*}\in\mathbb{R}^{d\times K}$ is such that

$$\bar{H}^{*\top}\bar{H}^{*}=\begin{cases}\alpha_1\left(I_K-\frac{1_K1_K^{\top}}{K}\right)&\text{if }\lambda_b\leqslant\frac{c}{1-c},\\[4pt] \alpha_2\left(I_K-\frac{c}{\lambda_b(1-c)}\frac{1_K1_K^{\top}}{K}\right)&\text{otherwise,}\end{cases}\tag{4}$$

for some $\alpha_1>0$ and $\alpha_2>0$ that depend on $\lambda_W$, $\lambda_H$ and $\lambda_b$. Moreover, the optimal bias $b^{*}$ satisfies

$$b^{*}=\begin{cases}\frac{1}{1+\lambda_b}\frac{1_K}{K}&\text{if }\lambda_b\leqslant\frac{c}{1-c},\\[4pt] \frac{c}{\lambda_b}\frac{1_K}{K}&\text{otherwise.}\end{cases}\tag{5}$$
Theorem [III.1](https://arxiv.org/html/2606.00344#S3.Thmtheorem1) reveals a transition in the structure of $\bar{H}^{*}$ as $\lambda_b$ decreases, shifting from an OF (for $\lambda_b\to\infty$) to a simplex ETF (for $\lambda_b\leqslant c/(1-c)$). This observation suggests a link between OFs and simplex ETFs. Indeed, centering any orthogonal frame $M=\alpha P$ gives

$$M_c=\alpha P-(\alpha P)\frac{1_K1_K^{\top}}{K},$$

which, according to Definition [II.1](https://arxiv.org/html/2606.00344#S2.Thmdefinition1), is a simplex ETF; see Figure [1](https://arxiv.org/html/2606.00344#S3.F1) for an illustration. This led us to propose the following definition.
Figure 1: Centering an OF (blue) always results in a simplex ETF (green). For this figure, the parameters in Definitions [II.1](https://arxiv.org/html/2606.00344#S2.Thmdefinition1) and [II.2](https://arxiv.org/html/2606.00344#S2.Thmdefinition2) are chosen as $d=K=3$, $P=I_K$ and $\alpha=1$.

###### Definition III.1 (Shifted simplex ETF).
A shifted simplex equiangular tight frame (SSETF) is a collection of $K$ vectors in $\mathbb{R}^{d}$ (with $d\geqslant K$) specified by the columns of

$$M=\alpha P\left(I_K-\beta\frac{1_K1_K^{\top}}{K}\right),$$

where $P\in\mathbb{R}^{d\times K}$ is semi-orthogonal, $\alpha>0$ and $\beta\in\mathbb{R}$.

This proposed structure generalizes both OFs ($\beta=0$) and simplex ETFs ($\beta=1$). Note that centering any SSETF leads to a simplex ETF.
###### Proposition III\.2\.
If $M\in\mathbb{R}^{d\times K}$ is an SSETF, then $M_c=M\left(I_K-\frac{1_K1_K^{\top}}{K}\right)$ is a simplex ETF.
###### Proof\.
By Definition [III.1](https://arxiv.org/html/2606.00344#S3.Thmdefinition1), $M=\alpha P\left(I_K-\beta\frac{1_K1_K^{\top}}{K}\right)$ for some $\alpha>0$ and $\beta\in\mathbb{R}$. By definition, $M_c=M\left(I_K-\frac{1_K1_K^{\top}}{K}\right)$. Since $\frac{1_K1_K^{\top}}{K}$ is idempotent, $\left(I_K-\beta\frac{1_K1_K^{\top}}{K}\right)\left(I_K-\frac{1_K1_K^{\top}}{K}\right)=I_K-\frac{1_K1_K^{\top}}{K}$, so $M_c$ simplifies to $M_c=\alpha P\left(I_K-\frac{1_K1_K^{\top}}{K}\right)$ for any $\beta$, i.e., $M_c$ is a simplex ETF. ∎
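Proposition III.2 is easy to check numerically. The sketch below uses the parameters of Figure 1 ($d=K=3$, $P=I_K$, $\alpha=1$) and a few arbitrary values of $\beta$; the helper names `ssetf` and `center_columns` are assumptions made for this illustration.

```python
import numpy as np

def ssetf(P, alpha, beta):
    """Columns of alpha * P (I_K - beta * 1 1^T / K), as in Definition III.1."""
    K = P.shape[1]
    return alpha * P @ (np.eye(K) - beta * np.ones((K, K)) / K)

def center_columns(M):
    """Right-multiply by I_K - 1 1^T / K, i.e., subtract the mean column."""
    return M - M.mean(axis=1, keepdims=True)

K = 3
P = np.eye(K)
etf = ssetf(P, alpha=1.0, beta=1.0)          # simplex ETF (beta = 1)
betas = (0.0, 0.4, -2.5)                     # beta = 0 is the orthogonal frame
centered = [center_columns(ssetf(P, 1.0, beta)) for beta in betas]
```

Every entry of `centered` coincides with `etf`, regardless of the value of $\beta$.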
Notice in ([5](https://arxiv.org/html/2606.00344#S3.E5)) how the optimal bias $b^{*}$ evolves as $\lambda_b$ decreases: it transitions continuously from $0_K$ in the bias-free UFM ($\lambda_b\to\infty$) to $\frac{1}{K}1_K$ in the unregularized-bias UFM ($\lambda_b=0$), namely, the mean of the labels. In other words, for one-hot encoded labels, the bias in the unregularized-bias UFM compensates for the non-centering of the labels.
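The piecewise formulas (4) and (5) can be tabulated to see the transition. The following sketch assumes a hypothetical value $c=0.5$, so that the threshold $c/(1-c)$ equals 1; the function names are illustrative and not from the paper.

```python
def bias_scale(lam_b, c):
    """Scalar s with b* = s * 1_K / K, following equation (5); assumes 0 < c < 1."""
    if lam_b <= c / (1.0 - c):
        return 1.0 / (1.0 + lam_b)
    return c / lam_b

def gram_shift(lam_b, c):
    """Coefficient of 1_K 1_K^T / K in the Gram matrix of equation (4)."""
    if lam_b <= c / (1.0 - c):
        return 1.0
    return c / (lam_b * (1.0 - c))

c = 0.5                                   # hypothetical value of K * sqrt(n * lam_W * lam_H)
lam_grid = [0.0, 0.5, 1.0, 2.0, 100.0]    # the threshold c / (1 - c) equals 1.0 here
scales = [bias_scale(lam_b, c) for lam_b in lam_grid]
shifts = [gram_shift(lam_b, c) for lam_b in lam_grid]
```

Below the threshold the Gram matrix keeps the simplex ETF shape (shift equal to 1) while the bias shrinks as $1/(1+\lambda_b)$; above it, both the shift and the bias decay towards 0, which is the OF regime.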
### III-B Arbitrary labels
We generalize here the results of the previous section to an arbitrary label encoding $\bar{Y}\in\mathbb{R}^{K\times K}$ and to imbalanced data. In this setting, the authors of \[[27](https://arxiv.org/html/2606.00344#bib.bib36)\] showed that, for the bias-free UFM, the optimal unconstrained feature matrix $H^{*}$ is closely related to the right singular vectors of $Y$, i.e., depends on the encoding. To address the UFM with bias ([2](https://arxiv.org/html/2606.00344#S2.Ex4)), we first derive optimality conditions in Lemma [III.3](https://arxiv.org/html/2606.00344#S3.Thmtheorem3), from which Theorem [III.4](https://arxiv.org/html/2606.00344#S3.Thmtheorem4) is proven.
###### Lemma III\.3\.
Let $\bar{Y}\in\mathbb{R}^{K\times K}$ be an arbitrary encoding matrix and let $Y=\bar{Y}C\in\mathbb{R}^{K\times N}$ for some class assignment matrix $C$, see ([1](https://arxiv.org/html/2606.00344#S2.E1)). Then, any global minimizer $(W^{*},H^{*},b^{*})$ of ([2](https://arxiv.org/html/2606.00344#S2.Ex4)) satisfies the first-order optimality conditions:

$$YM_{\gamma}{H^{*}}^{\top}=W^{*}H^{*}M_{\gamma}{H^{*}}^{\top}+\lambda_WNW^{*},\tag{6}$$

$${W^{*}}^{\top}YM_{\gamma}={W^{*}}^{\top}W^{*}H^{*}M_{\gamma}+\lambda_HNH^{*},\tag{7}$$

$$b^{*}=\frac{\gamma}{N}(Y-W^{*}H^{*})1_N,\tag{8}$$

where $\gamma=\frac{1}{1+\lambda_b}$ and $M_{\gamma}=I_N-\gamma\frac{1_N1_N^{\top}}{N}$.
###### Proof\.
Since ([2](https://arxiv.org/html/2606.00344#S2.Ex4)) is strictly convex and coercive in $b$, setting the gradient of ([2](https://arxiv.org/html/2606.00344#S2.Ex4)) w.r.t. $b$ to $0$ allows us to find $b^{*}$, leading to ([8](https://arxiv.org/html/2606.00344#S3.E8)). Then, requiring the gradients w.r.t. $W$ and $H$ to be $0$ gives the stationarity conditions ([6](https://arxiv.org/html/2606.00344#S3.E6)) and ([7](https://arxiv.org/html/2606.00344#S3.E7)), after plugging in ([8](https://arxiv.org/html/2606.00344#S3.E8)). As a result, any global minimizer $(W^{*},H^{*},b^{*})$ of ([2](https://arxiv.org/html/2606.00344#S2.Ex4)) must satisfy ([6](https://arxiv.org/html/2606.00344#S3.E6))-([8](https://arxiv.org/html/2606.00344#S3.E8)). ∎
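The closed form (8) and the role of $M_{\gamma}$ can be verified on random data. The sketch below fixes arbitrary $W$, $H$ and $Y$ (they are not minimizers, and all sizes and seeds are assumptions), computes the bias from (8), and checks that the gradient of (2) with respect to $b$ vanishes there and that the residual then factors through $M_{\gamma}$.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d, N, lam_b = 3, 4, 7, 0.5
W = rng.standard_normal((K, d))
H = rng.standard_normal((d, N))
Y = rng.standard_normal((K, N))
ones = np.ones(N)

gamma = 1.0 / (1.0 + lam_b)
b_star = gamma / N * (Y - W @ H) @ ones                  # equation (8)
M_gamma = np.eye(N) - gamma * np.outer(ones, ones) / N

# Residual of the fit term at b = b_star.
residual = W @ H + np.outer(b_star, ones) - Y
# Gradient of the objective (2) with respect to b, evaluated at b_star.
grad_b = residual @ ones / N + lam_b * b_star
```

Here `grad_b` is numerically zero, and `residual` equals `(W @ H - Y) @ M_gamma`, which is why $M_{\gamma}$ appears in (6) and (7) once the bias has been eliminated.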
###### Theorem III\.4\.
Let $\bar{Y}\in\mathbb{R}^{K\times K}$ be an arbitrary encoding matrix and let $Y=\bar{Y}C\in\mathbb{R}^{K\times N}$ for some class assignment matrix $C$, see ([1](https://arxiv.org/html/2606.00344#S2.E1)).
- The UFM ([2](https://arxiv.org/html/2606.00344#S2.Ex4)) with $\lambda_b=0$ has optimal bias $b^{*}=Y\frac{1_N}{N}$. It reduces to the UFM with no bias for centered labels $Y-Y\frac{1_N1_N^{\top}}{N}$.
- If $Y$ is centered, i.e., $Y1_N=0_K$, then for all $\lambda_b\geqslant 0$, the optimal bias of the UFM ([2](https://arxiv.org/html/2606.00344#S2.Ex4)) is $b^{*}=0_K$.
###### Proof\.
In Lemma [III.3](https://arxiv.org/html/2606.00344#S3.Thmtheorem3), $M_{\gamma}$ satisfies $M_{\gamma}1_N=(1-\gamma)1_N$. Multiplying ([7](https://arxiv.org/html/2606.00344#S3.E7)) on the right by $1_N$ thus gives

$$(1-\gamma){W^{*}}^{\top}Y1_N=\left((1-\gamma){W^{*}}^{\top}W^{*}+\lambda_HNI_d\right)(H^{*}1_N).$$

If $\lambda_b=0$ (i.e., $\gamma=1$) or $Y$ is centered, the left-hand side of the above equation vanishes. Moreover, since $(1-\gamma){W^{*}}^{\top}W^{*}+\lambda_HNI_d$ is always invertible for $\lambda_H>0$ and $\lambda_b\geqslant 0$, it follows that $H^{*}1_N=0_d$ in both cases. Hence, by ([8](https://arxiv.org/html/2606.00344#S3.E8)), $b^{*}=\frac{\gamma}{N}Y1_N$. On the one hand, if $\lambda_b=0$, the bias simplifies to $b^{*}=\frac{1}{N}Y1_N$ and the objective in ([2](https://arxiv.org/html/2606.00344#S2.Ex4)) becomes

$$\frac{1}{2N}\biggl\|WH+Y\frac{1_N1_N^{\top}}{N}-Y\biggr\|_{\mathrm{F}}^{2}+\frac{\lambda_W}{2}\|W\|_{\mathrm{F}}^{2}+\frac{\lambda_H}{2}\|H\|_{\mathrm{F}}^{2}.$$

Thus, solving the UFM with $\lambda_b=0$ and labels $Y$ reduces to solving the bias-free UFM with centered labels $Y-Y\frac{1_N1_N^{\top}}{N}$. On the other hand, if $Y$ is centered, i.e., $Y1_N=0_K$, then $b^{*}=0_K$, regardless of the value of $\lambda_b\geqslant 0$. ∎
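One way to see the centering role of the unregularized bias is to translate every label by the same vector and observe what happens once $b$ has been optimized out. The sketch below does this for fixed, arbitrary $W$ and $H$ at $\lambda_b=0$; the function `profiled_fit`, the shift vector, and the sizes are assumptions for illustration, not the paper's procedure.

```python
import numpy as np

def profiled_fit(W, H, Y):
    """Fit term of (2) at lam_b = 0 after minimizing over b; returns (b_star, value)."""
    N = Y.shape[1]
    b_star = (Y - W @ H).mean(axis=1)           # equation (8) with gamma = 1
    residual = W @ H + b_star[:, None] - Y
    return b_star, np.sum(residual ** 2) / (2 * N)

rng = np.random.default_rng(1)
K, d, N = 3, 4, 6
W = rng.standard_normal((K, d))
H = rng.standard_normal((d, N))
Y = np.repeat(np.eye(K), 2, axis=1)             # balanced one-hot labels
shift = np.array([5.0, -1.0, 2.0])              # arbitrary translation of every label

b0, f0 = profiled_fit(W, H, Y)
b1, f1 = profiled_fit(W, H, Y + shift[:, None])
```

The optimal bias absorbs the translation (`b1 - b0` equals `shift`) and the profiled fit value is unchanged, so only the centered labels matter when $\lambda_b=0$.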
Theorem[III\.4](https://arxiv.org/html/2606.00344#S3.Thmtheorem4)confirms that the bias in the UFM aims at centering the labels beyond one\-hot encoding and balanced data\.
## IV Impact of encoding on NC properties
We explore in this section the impact of the encoding on the NC properties, from the viewpoint of the UFM with MSE loss. First, we show that variability collapse $\mathcal{NC}1$ holds for any encoding. Then, for balanced scenarios, we relate convergence of the centered features to a simplex ETF (i.e., $\mathcal{NC}2$) to SSETF encodings and demonstrate that a weaker version of self-duality, denoted $\mathcal{NC}3'$, is preserved regardless of the encoding.
### IV-A Robustness of variability collapse ($\mathcal{NC}1$)
$\mathcal{NC}1$ has been shown to hold for multiple instances of the UFM under MSE loss, such as in the bias-free case with one-hot encoded labels and imbalanced data \[[13](https://arxiv.org/html/2606.00344#bib.bib21)\] or for $\lambda_b\geqslant 0$ with one-hot labels and balanced data \[[27](https://arxiv.org/html/2606.00344#bib.bib36)\]. To our knowledge, our next result is the first to ensure that $\mathcal{NC}1$ holds more generally for any bias regularization $\lambda_b\geqslant 0$, for any encoding, and for imbalanced scenarios. The independence of variability collapse from the encoding and from the value of $\lambda_b$ can be explained intuitively: once the model has found an optimal feature representation for one sample of a given class, all samples of the same class should be represented identically, because the problem is separable across samples.
###### Theorem IV\.1\.
Let $\bar{Y}\in\mathbb{R}^{K\times K}$ be an arbitrary encoding matrix and let $Y=\bar{Y}C\in\mathbb{R}^{K\times N}$ for some class assignment matrix $C$, see ([1](https://arxiv.org/html/2606.00344#S2.E1)). Let $(W^{*},H^{*},b^{*})$ be an optimal solution of ([2](https://arxiv.org/html/2606.00344#S2.Ex4)). Then, for any $\lambda_H,\lambda_W>0$ and $\lambda_b\geqslant 0$, $H^{*}$ satisfies $\mathcal{NC}1$, i.e.,

$$H^{*}=\bar{H}^{*}C\tag{9}$$

for some $\bar{H}^{*}\in\mathbb{R}^{d\times K}$.
###### Proof\.
Let $h_{k,i}$ be the feature vector corresponding to the $i^{\text{th}}$ sample of class $k$, with $1\leqslant i\leqslant n_k$ and $1\leqslant k\leqslant K$. We decompose the two terms involving $H$ in ([2](https://arxiv.org/html/2606.00344#S2.Ex4)) as

$$\frac{1}{2N}\sum_{k=1}^{K}\sum_{i=1}^{n_k}\|Wh_{k,i}+b-\bar{y}_k\|_{2}^{2}+\frac{\lambda_H}{2}\sum_{k=1}^{K}\sum_{i=1}^{n_k}\|h_{k,i}\|_{2}^{2}.$$

For the $k^{\text{th}}$ sum of the second term, we get

$$\sum_{i=1}^{n_k}\|h_{k,i}\|_{2}^{2}\geqslant\frac{1}{n_k}\biggl\|\sum_{i=1}^{n_k}h_{k,i}\biggr\|_{2}^{2}=n_k\|\bar{h}_k\|_{2}^{2},\tag{10}$$

where we first used Jensen’s inequality applied to the convex function $f(x)=\|x\|_{2}^{2}$ for $x\in\mathbb{R}^{d}$, and where we defined $\bar{h}_k:=\bigl(\sum_{i=1}^{n_k}h_{k,i}\bigr)/n_k$. Note that, for each class $k$, equality can always be reached in ([10](https://arxiv.org/html/2606.00344#S4.E10)) and holds if and only if $h_{k,i}=\bar{h}_k$ for all $1\leqslant i\leqslant n_k$. Then, we apply the same reasoning for the $k^{\text{th}}$ sum of the first term:

$$\sum_{i=1}^{n_k}\|Wh_{k,i}+b-\bar{y}_k\|_{2}^{2}\geqslant n_k\|W\bar{h}_k+b-\bar{y}_k\|_{2}^{2},\tag{11}$$

with equality in ([11](https://arxiv.org/html/2606.00344#S4.E11)) if and only if $Wh_{k,i}+b-\bar{y}_k=W\bar{h}_k+b-\bar{y}_k$ for all $1\leqslant i\leqslant n_k$, or equivalently $Wh_{k,i}=W\bar{h}_k$. These conditions are weaker than the previously derived conditions $h_{k,i}=\bar{h}_k$, and consequently do not impose any additional constraints on $H$. Then, any optimal $H^{*}$ in ([2](https://arxiv.org/html/2606.00344#S2.Ex4)) can be written as $H^{*}=\bar{H}^{*}C$ with $\bar{H}^{*}=[\bar{h}_1\cdots\bar{h}_K]$. ∎
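The argument above says that replacing every feature by its class mean can only decrease the objective. The following sketch checks this on random, imbalanced data with an arbitrary encoding; the class sizes, seeds, regularization values and the helper `ufm_objective` are all assumptions made for this illustration.

```python
import numpy as np

def ufm_objective(W, H, b, Y, lam_W, lam_H, lam_b):
    """Regularized MSE objective of the UFM, equation (2)."""
    N = Y.shape[1]
    fit = np.sum((W @ H + b[:, None] - Y) ** 2) / (2 * N)
    reg = 0.5 * (lam_W * np.sum(W ** 2) + lam_H * np.sum(H ** 2) + lam_b * np.sum(b ** 2))
    return fit + reg

rng = np.random.default_rng(2)
counts = np.array([1, 3, 2])                    # imbalanced class sizes (hypothetical)
K, d = 3, 4
C = np.repeat(np.eye(K), counts, axis=1)        # class assignment matrix, K x N
Y = rng.standard_normal((K, K)) @ C             # arbitrary encoding, Y = Y_bar C
W = rng.standard_normal((K, d))
H = rng.standard_normal((d, counts.sum()))
b = rng.standard_normal(K)

H_bar = H @ C.T / counts                        # class means, one column per class
H_collapsed = H_bar @ C                         # every sample replaced by its class mean
before = ufm_objective(W, H, b, Y, 0.1, 0.1, 0.3)
after = ufm_objective(W, H_collapsed, b, Y, 0.1, 0.1, 0.3)
```

For any $W$ and $b$, `after` is no larger than `before`, in line with inequalities (10) and (11).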
### IV-B Convergence to a simplex ETF ($\mathcal{NC}2$) for SSETF encodings
Currently, the complete determination of which encodings result in $\mathcal{NC}2$ for the UFM with MSE loss remains an open question, both in the balanced and imbalanced cases. We next show that a sufficient condition for $\mathcal{NC}2$ is that the optimal mean activations $\bar{H}^{*}$ form an SSETF; this directly follows from Proposition [III.2](https://arxiv.org/html/2606.00344#S3.Thmtheorem2).
###### Corollary IV\.2\.
Let $\bar{Y}\in\mathbb{R}^{K\times K}$ be an arbitrary class encoding matrix. If, for any $\lambda_W,\lambda_H>0$ and $\lambda_b\geqslant 0$, the optimal mean features $\bar{H}^{*}$ of ([2](https://arxiv.org/html/2606.00344#S2.Ex4)) defined in ([9](https://arxiv.org/html/2606.00344#S4.E9)) form an SSETF, then their centered counterpart $\bar{H}^{*}_{c}=\bar{H}^{*}\left(I_K-\frac{1_K1_K^{\top}}{K}\right)$ forms a simplex ETF, in agreement with $\mathcal{NC}2$.
This corollary shows that the optimal solutions of the UFM with MSE loss for balanced one-hot encoded data, provided in Theorem [III.1](https://arxiv.org/html/2606.00344#S3.Thmtheorem1), satisfy $\mathcal{NC}2$, as empirically observed by the authors of \[[23](https://arxiv.org/html/2606.00344#bib.bib30)\]. Future work will aim at deriving necessary and sufficient conditions on the encoding matrix $\bar{Y}\in\mathbb{R}^{K\times K}$ to result in mean features $\bar{H}^{*}$ with an SSETF structure, and consequently, to lead to $\mathcal{NC}2$. We conjecture that, in balanced scenarios, $\bar{H}^{*}\in\mathbb{R}^{d\times K}$ is an SSETF if and only if the encoding $\bar{Y}\in\mathbb{R}^{K\times K}$ is an SSETF. The family of SSETFs encompasses two widely used encodings: one-hot encoding (for $\beta=0$, $\alpha=1$, $d=K$ and $P=I_K$ in Definition [III.1](https://arxiv.org/html/2606.00344#S3.Thmdefinition1)), and label smoothing, a “smoothed” variant of one-hot encoding. Indeed, smoothed labels can be written as $\bar{Y}^{\delta}=(1-\delta)I_K+\frac{\delta}{K}1_K1_K^{\top}$ for some smoothing parameter $0<\delta<1$ \[[21](https://arxiv.org/html/2606.00344#bib.bib29),[7](https://arxiv.org/html/2606.00344#bib.bib9)\], and form an SSETF; simply take $\alpha=1-\delta$, $\beta=-\frac{\delta}{1-\delta}$, $d=K$ and $P=I_K$ in Definition [III.1](https://arxiv.org/html/2606.00344#S3.Thmdefinition1), so that $\alpha\bigl(I_K-\beta\frac{1_K1_K^{\top}}{K}\bigr)=(1-\delta)I_K+\frac{\delta}{K}1_K1_K^{\top}$. While we showed earlier that $\mathcal{NC}2$ holds for one-hot encoded labels (see Corollary [IV.2](https://arxiv.org/html/2606.00344#S4.Thmtheorem2)), this is still unproven for label smoothing, though empirical results supporting [IV.2](https://arxiv.org/html/2606.00344#S4.Thmtheorem2) were obtained using the cross-entropy loss \[[7](https://arxiv.org/html/2606.00344#bib.bib9)\].
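The SSETF parametrization of label smoothing can be checked directly. Matching $\alpha\bigl(I_K-\beta\frac{1_K1_K^{\top}}{K}\bigr)$ with $(1-\delta)I_K+\frac{\delta}{K}1_K1_K^{\top}$ requires $\alpha=1-\delta$ and $-\alpha\beta=\delta$, hence a negative $\beta=-\delta/(1-\delta)$. The sketch below uses hypothetical values $K=4$ and $\delta=0.1$, and the helper `ssetf` is an assumption for illustration.

```python
import numpy as np

def ssetf(P, alpha, beta):
    """Columns of alpha * P (I_K - beta * 1 1^T / K), as in Definition III.1."""
    K = P.shape[1]
    return alpha * P @ (np.eye(K) - beta * np.ones((K, K)) / K)

K, delta = 4, 0.1
# Smoothed labels: (1 - delta) on the true class plus delta / K spread over all classes.
Y_smooth = (1 - delta) * np.eye(K) + delta / K * np.ones((K, K))
# Same matrix written as an SSETF with P = I_K.
Y_ssetf = ssetf(np.eye(K), alpha=1 - delta, beta=-delta / (1 - delta))
```

Both matrices agree entrywise, and each column of `Y_smooth` sums to one, as expected for a smoothed probability vector.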
### IV-C Failure of self-duality ($\mathcal{NC}3$) and success of a weaker form of self-duality ($\mathcal{NC}3'$)
For balanced data and one-hot encoded labels, Theorem [III.1](https://arxiv.org/html/2606.00344#S3.Thmtheorem1) showed that ${W^{*}}^{\top}$ and $\bar{H}^{*}$ are necessarily aligned, i.e., the global minimizers of the UFM satisfy $\mathcal{NC}3$. We show here that $\mathcal{NC}3$ may fail, depending on the encoding. In particular, applying an orthogonal transformation $Q$ to the labels destroys the alignment between ${W^{*}}^{\top}$ and $\bar{H}^{*}$.
###### Proposition IV\.3\.
Let $\bar{Y}\in\mathbb{R}^{K\times K}$ be arbitrary and let $Y=\bar{Y}C\in\mathbb{R}^{K\times N}$ for some class assignment matrix $C$, see ([1](https://arxiv.org/html/2606.00344#S2.E1)). Let $Q\in\mathbb{R}^{K\times K}$ be an orthogonal matrix and let $Y_Q=QY$. Then, the global minimizers of the UFM with labels $Y_Q$ are of the form $(QW^{*},H^{*},Qb^{*})$, where $(W^{*},H^{*},b^{*})$ is any global minimizer of the UFM with labels $Y$.
###### Proof\.
Replacing $(W,H,b)$ in the UFM ([2](https://arxiv.org/html/2606.00344#S2.Ex4)) with modified labels $Y_Q$ by $(QW',H,Qb')$ and using the invariance of the Frobenius norm under multiplication with an orthogonal matrix yields the result. Since $Q$ is invertible, this change of variables is reversible, providing a bijection between the set of global minimizers of the original problem and those of the problem with new labels. ∎
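The invariance used in this proof holds for every point $(W,H,b)$, not only at minimizers, which makes it simple to test. In the sketch below, the random orthogonal matrix is obtained from a QR factorization; the sizes, seeds, regularization values and the helper `ufm_objective` are assumptions for illustration.

```python
import numpy as np

def ufm_objective(W, H, b, Y, lam_W, lam_H, lam_b):
    """Regularized MSE objective of the UFM, equation (2)."""
    N = Y.shape[1]
    fit = np.sum((W @ H + b[:, None] - Y) ** 2) / (2 * N)
    reg = 0.5 * (lam_W * np.sum(W ** 2) + lam_H * np.sum(H ** 2) + lam_b * np.sum(b ** 2))
    return fit + reg

rng = np.random.default_rng(3)
K, d, N = 3, 4, 6
W = rng.standard_normal((K, d))
H = rng.standard_normal((d, N))
b = rng.standard_normal(K)
Y = rng.standard_normal((K, N))
Q, _ = np.linalg.qr(rng.standard_normal((K, K)))   # random K x K orthogonal matrix

original = ufm_objective(W, H, b, Y, 0.1, 0.2, 0.3)
rotated = ufm_objective(Q @ W, H, Q @ b, Q @ Y, 0.1, 0.2, 0.3)
```

The two values coincide, so $(W,H,b)\mapsto(QW,H,Qb)$ maps the landscape for labels $Y$ onto the landscape for labels $QY$ while leaving $H$ untouched.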
The optimal classification weights $W^{*}$ (and the bias $b^{*}$) directly follow the encoding $Y$ under any left orthogonal transformation, indicating that the left singular vectors of $W^{*}$ are strongly related to those of $\bar{Y}$. This observation matches the result from \[[27](https://arxiv.org/html/2606.00344#bib.bib36)\] in the bias-free case. Meanwhile, the mean class features $\bar{H}^{*}$ stay unchanged, so the self-duality between ${W^{*}}^{\top}$ and $\bar{H}^{*}$ is broken. However, inverting the isometry restores their alignment, indicating that the property is not violated at a more fundamental level. Based on this remark, we propose the following weaker variant of $\mathcal{NC}3$, which we will show to hold in a broader setting.
- $\mathcal{NC}3'$ (Variant of $\mathcal{NC}3$): the last layer’s classifier weights converge to the class means, up to rescaling and rotation/reflections.

Mathematically, $\mathcal{NC}3'$ is satisfied if and only if there exists an orthogonal matrix $Q\in\mathbb{R}^{K\times K}$ and a scalar $\kappa>0$ such that

$${W^{*}}^{\top}=\kappa\bar{H}^{*}Q^{\top}.\tag{12}$$

This equality can be equivalently stated in terms of the Gram matrices of ${W^{*}}^{\top}$ and $\bar{H}^{*}$:

$${W^{*}}^{\top}W^{*}=\kappa^{2}\bar{H}^{*}\bar{H}^{*\top}.$$
As stated in Theorem [IV.4](https://arxiv.org/html/2606.00344#S4.Thmtheorem4), $\mathcal{NC}3'$ always holds in the UFM with MSE loss and balanced classes. Moreover, the theorem implies that there always exists an orthogonal matrix $Q\in\mathbb{R}^{K\times K}$ such that the original self-duality property $\mathcal{NC}3$ is recovered when considering the labels $Q^{\top}Y$.
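To make the relation between (12), its Gram-matrix form, and the recovery of $\mathcal{NC}3$ concrete, here is a small numerical sketch. It does not solve the UFM: `H_bar` is a random stand-in for the optimal class means, `kappa` is an arbitrary positive value, and the orthogonal matrix is recovered with an orthogonal Procrustes step, which is a choice made for this illustration rather than a construction taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
K, d = 4, 6                                    # classes, feature dimension (d >= K)
kappa = 1.7                                    # arbitrary positive scale for the demo

H_bar = rng.normal(size=(d, K))                # stand-in for the optimal class means
Q_true, _ = np.linalg.qr(rng.normal(size=(K, K)))
W = kappa * Q_true @ H_bar.T                   # so that W^T = kappa * H_bar @ Q_true^T

# Gram-matrix form of NC3': W^T W = kappa^2 * H_bar H_bar^T.
gram_gap = np.linalg.norm(W.T @ W - kappa**2 * H_bar @ H_bar.T)

# Recover Q from (W, H_bar) alone via orthogonal Procrustes:
# minimize ||W^T - kappa * H_bar Q^T||_F over orthogonal Q.
U, _, Vt = np.linalg.svd(W @ H_bar)            # SVD of a K x K matrix
Q_hat = U @ Vt

# Rotating the classifier with Q_hat^T (the minimizer for labels Q^T Y)
# restores plain self-duality: (Q_hat^T W)^T = kappa * H_bar.
W_back = Q_hat.T @ W
duality_gap = np.linalg.norm(W_back.T - kappa * H_bar)

print(gram_gap < 1e-10, np.allclose(Q_hat, Q_true), duality_gap < 1e-10)
```

The Procrustes step returns the planted matrix here because `H_bar` has full column rank, so the orthogonal factor in the polar decomposition of `W @ H_bar` is unique.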
###### Theorem IV\.4\.
Let $\bar{Y}\in\mathbb{R}^{K\times K}$ be an arbitrary class encoding matrix, and assume that the data are balanced. Then, for any $\lambda_{W},\lambda_{H}>0$ and $\lambda_{b}\geqslant 0$, the optimal mean features $\bar{H}^{*}$ of ([II](https://arxiv.org/html/2606.00344#S2.Ex4)) defined in ([9](https://arxiv.org/html/2606.00344#S4.E9)) satisfy $\mathcal{NC}3'$, with scaling coefficient $\kappa=\sqrt{\frac{\lambda_{H}n}{\lambda_{W}}}$ in ([12](https://arxiv.org/html/2606.00344#S4.E12)).
###### Proof\.
Multiply the optimality conditions ([6](https://arxiv.org/html/2606.00344#S3.E6)) and ([7](https://arxiv.org/html/2606.00344#S3.E7)) respectively by ${W^{*}}^{\top}$ on the left and by ${H^{*}}^{\top}$ on the right. Subtracting these two equations gives $\lambda_{W}N{W^{*}}^{\top}W^{*}=\lambda_{H}NH^{*}{H^{*}}^{\top}$. By Theorem [IV.1](https://arxiv.org/html/2606.00344#S4.Thmtheorem1), we can substitute $H^{*}=\bar{H}^{*}C$, where $C$ is the class assignment matrix. Finally, leveraging the equality $CC^{\top}=nI_{K}$ for balanced classes yields ${W^{*}}^{\top}W^{*}=\frac{\lambda_{H}n}{\lambda_{W}}\bar{H}^{*}{\bar{H}^{*}}^{\top}$, where we identify $\kappa^{2}=\frac{\lambda_{H}n}{\lambda_{W}}$. ∎
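The last two steps of this proof rely only on the balanced structure of $C$, which is easy to check numerically. The following sketch uses random stand-ins for the class means and hypothetical regularization values. It verifies $CC^{\top}=nI_{K}$ and the resulting Gram identity, not the optimality conditions themselves.

```python
import numpy as np

rng = np.random.default_rng(2)
K, d, n = 3, 5, 4                              # classes, feature dimension, samples per class
N = K * n
lam_W, lam_H = 0.05, 0.02                      # hypothetical regularization weights

C = np.kron(np.eye(K), np.ones((1, n)))        # K x N balanced class assignment matrix
H_bar = rng.normal(size=(d, K))                # stand-in for the class means
H = H_bar @ C                                  # collapsed features: each mean repeated n times

# Balanced classes give C C^T = n I_K, hence H H^T = n * H_bar H_bar^T.
balanced_identity = np.allclose(C @ C.T, n * np.eye(K))
gram_identity = np.allclose(H @ H.T, n * H_bar @ H_bar.T)

# A classifier that is self-dual with the scale from the theorem then satisfies
# lam_W * W^T W = lam_H * H H^T, the relation obtained in the proof.
kappa = np.sqrt(lam_H * n / lam_W)
W = kappa * H_bar.T
stationarity_relation = np.allclose(lam_W * W.T @ W, lam_H * H @ H.T)

print(balanced_identity, gram_identity, stationarity_relation)
```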
## V Conclusion
This work studies the role of class encoding in NC. It demonstrates that, within the UFM under MSE loss, the final classifier’s bias compensates for the non-centering of the labels. In the case of balanced, one-hot encoded labels, increasing the bias regularization parameter yields a continuous transition of the uncentered mean features from an ETF to an OF. We finally discuss the dependency of other NC properties on class encoding.
## Acknowledgment
R. M. is a FRIA grantee of the Fonds de la Recherche Scientifique - FNRS. E. M.'s work is partly funded by the FRS-FNRS Research Project NTTN (grant number TW02223) and the Concerted Research Action (ARC) “Gravit-AI”.
## References
- [1] G. Andriopoulos, Z. Dong, L. Guo, Z. Zhao, and K. Ross (2024). The prevalence of neural collapse in neural multivariate regression. In Proceedings of the 38th Conference on Neural Information Processing Systems.
- [2] A. Demirkaya, J. Chen, and S. Oymak (2020). Exploring the role of loss functions in multiclass classification. In Proceedings of the 54th Annual Conference on Information Sciences and Systems.
- [3] W. E and S. Wojtowytsch (2021). On the emergence of simplex symmetry in the final and penultimate layers of neural network classifiers. In Proceedings of 2nd Annual Conference on Mathematical and Scientific Machine Learning.
- [4] C. Fang, H. He, Q. Long, and W. J. Su (2021). Exploring deep neural networks via layer-peeled model: minority collapse in imbalanced training. Proceedings of the National Academy of Sciences 118(43). ISSN 1091-6490, [DOI](https://dx.doi.org/10.1073/pnas.2103091118).
- [5] T. Galanti, A. György, and M. Hutter (2022). On the role of neural collapse in transfer learning. In Proceedings of the 10th International Conference on Learning Representations.
- [6] F. Graf, C. D. Hofer, M. Niethammer, and R. Kwitt (2021). Dissecting supervised constrastive learning. In Proceedings of the 38th International Conference on Machine Learning.
- [7] L. Guo, G. Andriopoulos, Z. Zhao, S. Ling, Z. Dong, and K. Ross (2025). Cross entropy versus label smoothing: a neural collapse perspective. arXiv:2402.03979. [Link](https://arxiv.org/abs/2402.03979).
- [8] J. Haas, W. Yolland, and B. T. Rabus (2022). Linking neural collapse and l2 normalization with improved out-of-distribution detection in deep neural networks. Transactions on Machine Learning Research.
- [9] X. Y. Han, V. Papyan, and D. L. Donoho (2022). Neural collapse under MSE loss: proximity to and dynamics on the central path. In Proceedings of the 10th International Conference on Learning Representations.
- [10] L. Hui and M. Belkin (2021). Evaluation of neural architectures trained with square loss vs cross-entropy in classification tasks. In Proceedings of the 9th International Conference on Learning Representations.
- [11] J. Jiang, J. Zhou, P. Wang, Q. Qu, D. G. Mixon, C. You, and Z. Zhu (2024). Generalized neural collapse for a large number of classes. In Proceedings of the 41st International Conference on Machine Learning.
- [12] V. Kothapalli (2023). Neural collapse: a review on modelling principles and generalization. Transactions on Machine Learning Research.
- [13] H. Liu (2024). The exploration of neural collapse under imbalanced data. arXiv:2411.17278. [Link](https://arxiv.org/abs/2411.17278).
- [14] X. Liu, J. Zhang, T. Hu, H. Cao, L. Pan, and Y. Yao (2023). Inducing neural collapse in deep long-tailed learning. In Proceedings of the 26th International Conference on Artificial Intelligence and Statistics.
- [15] J. Lu and S. Steinerberger (2022). Neural collapse under cross-entropy loss. Applied and Computational Harmonic Analysis 59, pp. 224–241. ISSN 1063-5203, [DOI](https://dx.doi.org/10.1016/j.acha.2021.12.011).
- [16] D. G. Mixon, H. Parshall, and J. Pi (2020). Neural collapse with unconstrained features. Sampling Theory, Signal Processing, and Data Analysis 20(11). [DOI](https://dx.doi.org/10.1007/s43670-022-00027-5).
- [17] V. Papyan, X. Y. Han, and D. L. Donoho (2020). Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences 117(40), pp. 24652–24663. ISSN 1091-6490, [DOI](https://dx.doi.org/10.1073/pnas.2015509117).
- [18] A. Rangamani, M. Lindegaard, T. Galanti, and T. Poggio (2023). Feature learning in deep classifiers through intermediate neural collapse. In Proceedings of the 40th International Conference on Machine Learning.
- [19] P. Sukenik, C. H. Lampert, and M. Mondelli (2024). Neural collapse versus low-rank bias: is deep neural collapse really optimal? In Proceedings of the 38th Conference on Neural Information Processing Systems.
- [20] P. Sukenik, M. Mondelli, and C. H. Lampert (2023). Deep neural collapse is provably optimal for the deep unconstrained features model. In Proceedings of the 37th Conference on Neural Information Processing Systems.
- [21] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016). Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- [22] C. Thrampoulidis, G. R. Kini, V. Vakilian, and T. Behnia (2022). Imbalance trouble: revisiting neural-collapse geometry. In Proceedings of the 36th Conference on Neural Information Processing Systems.
- [23] T. Tirer and J. Bruna (2022). Extended unconstrained features model for exploring deep neural collapse. In Proceedings of the 39th International Conference on Machine Learning.
- [24] T. Tirer, H. Huang, and J. Niles-Weed (2023). Perturbation analysis of neural collapse. In Proceedings of the 40th International Conference on Machine Learning.
- [25] R. Wu and V. Papyan (2024). Linguistic collapse: neural collapse in (large) language models. In Proceedings of the 38th Conference on Neural Information Processing Systems.
- [26] C. Yaras, P. Wang, Z. Zhu, L. Balzano, and Q. Qu (2022). Neural collapse with normalized features: a geometric analysis over the Riemannian manifold. In Proceedings of the 36th Conference on Neural Information Processing Systems.
- [27] J. Zhou, X. Li, T. Ding, C. You, Q. Qu, and Z. Zhu (2022). On the optimization landscape of neural collapse under MSE loss: global optimality with unconstrained features. In Proceedings of the 39th International Conference on Machine Learning.
- [28] J. Zhou, C. You, X. Li, K. Liu, S. Liu, Q. Qu, and Z. Zhu (2022). Are all losses created equal: a neural collapse perspective. In Proceedings of the 36th Conference on Neural Information Processing Systems.
- [29] Z. Zhu, T. Ding, J. Zhou, X. Li, C. You, J. Sulam, and Q. Qu (2021). A geometric analysis of neural collapse with unconstrained features. In Proceedings of the 35th Conference on Neural Information Processing Systems.