A Unified Geometric Framework for Weighted Contrastive Learning

arXiv cs.LG Papers

Summary

This paper introduces a unified geometric framework showing that weighted InfoNCE objectives can be interpreted as Distance Geometry Problems, providing exact characterizations of optimal embeddings for supervised and weakly supervised contrastive learning methods and revealing when such embeddings are geometrically realizable, degenerate, or inconsistent.

arXiv:2605.13943v1 Announce Type: new Abstract: Contrastive learning (CL) aims to preserve relational structure between samples by learning representations that reflect a similarity graph. Yet, the geometry of the resulting embeddings remains poorly understood. Here we show that weighted InfoNCE objectives can be interpreted as Distance Geometry Problems, where the weighting scheme specifies the target geometry to be realized by the representation. This viewpoint yields exact characterizations of the optimal embeddings for several supervised and weakly supervised objectives. In supervised classification, both SupCon and Soft SupCon (a dense relaxation of it where pairs from distinct classes have small non-zero similarity) collapse samples within each class to a single prototype. However, while balanced SupCon recovers the classical regular simplex geometry, class imbalance breaks this symmetry: SupCon induces non-uniform inter-class similarities depending on class sizes, whereas Soft SupCon preserves a regular simplex geometry regardless of class imbalance. In continuous-label settings, our framework reveals a different failure mode: y-Aware CL generally cannot attain its entropic optimum unless the labels lie on a hypersphere, exposing a mismatch between Euclidean label weights and spherical latent similarity. By contrast, geometrically consistent choices such as Euclidean-Euclidean weighting or X-CLR admit unique optimal embeddings. Our results show that the choice of weighting scheme determines whether contrastive learning is geometrically realizable, degenerate, or inconsistent, providing a principled framework for designing contrastive objectives.

Cached at: 05/15/26, 06:25 AM

# A Unified Geometric Framework for Weighted Contrastive Learning
Source: [https://arxiv.org/html/2605.13943](https://arxiv.org/html/2605.13943)
Raphaël Vock, Edouard Duchesnay, Benoit Dufumier. GAIA Lab, NeuroSpin, CEA, CNRS Université Paris-Saclay, Gif-sur-Yvette, France. {raphael.vock, edouard.duchesnay, benoit.dufumier}@cea.fr

###### Abstract

Contrastive learning (CL) aims to preserve relational structure between samples by learning representations that reflect a similarity graph. Yet, the geometry of the resulting embeddings remains poorly understood. Here we show that weighted InfoNCE objectives can be interpreted as Distance Geometry Problems, where the weighting scheme specifies the target geometry to be realized by the representation. This viewpoint yields exact characterizations of the optimal embeddings for several supervised and weakly supervised objectives. In supervised classification, both SupCon and Soft SupCon (a dense relaxation of it where pairs from distinct classes have small non-zero similarity) collapse samples within each class to a single prototype. However, while balanced SupCon recovers the classical regular simplex geometry, class imbalance breaks this symmetry: SupCon induces non-uniform inter-class similarities depending on class sizes, whereas Soft SupCon preserves a regular simplex geometry regardless of class imbalance. In continuous-label settings, our framework reveals a different failure mode: $y$-Aware CL generally cannot attain its entropic optimum unless the labels lie on a hypersphere, exposing a mismatch between Euclidean label weights and spherical latent similarity. By contrast, geometrically consistent choices such as Euclidean–Euclidean weighting or $\mathbb{X}$-CLR admit unique optimal embeddings. Our results show that the choice of weighting scheme determines whether contrastive learning is geometrically realizable, degenerate, or inconsistent, providing a principled framework for designing contrastive objectives.

## 1 Introduction

![Refer to caption](https://arxiv.org/html/2605.13943v1/figures/fig1.png)

Figure 1: The weighting matrix (top) defines pairwise similarities that induce a target geometry (middle), which the embedding attempts to realize (bottom). Left: dense weights (Soft SupCon) yield a regular, symmetric geometry. Middle: sparse weights (SupCon) produce degenerate solutions, where class imbalance distorts prototype geometry. Right: continuous Euclidean label similarities ($y$-Aware) induce a geometry that is inconsistent with cosine similarities, leading to distortion.

Contrastive Learning (CL) plays a pivotal role in current supervised and self-supervised deep representation learning models. It was popularized through several successful frameworks: SimCLR Chen et al. ([2020](https://arxiv.org/html/2605.13943#bib.bib10)) and MoCo for self-supervised learning of visual representations, SupCon Khosla et al. ([2020](https://arxiv.org/html/2605.13943#bib.bib7)) for supervised classification, and CLIP Radford et al. ([2021](https://arxiv.org/html/2605.13943#bib.bib36)) for vision-language representation learning. It was later adapted to supervised regression with continuous targets Zha et al. ([2023](https://arxiv.org/html/2605.13943#bib.bib4)); Barbano et al. ([2023](https://arxiv.org/html/2605.13943#bib.bib35)), and to weakly supervised learning Dufumier et al. ([2021](https://arxiv.org/html/2605.13943#bib.bib26)); Schneider et al. ([2023](https://arxiv.org/html/2605.13943#bib.bib37)), where auxiliary information (or meta-data) is available during training as a means of constraining representations. Despite the empirical success of these methods, the geometry of the representations learned by contrastive objectives remains poorly understood. In particular, it is unclear what structure these objectives are implicitly trying to realize in the embedding space, and whether this structure can always be achieved.

From a geometric perspective, CL can be viewed as attempting to realize a set of pairwise relationships between samples in a low-dimensional space. These relationships are specified implicitly by the weighting scheme used in the loss, which encodes prior knowledge about which samples should be similar or dissimilar. This raises a fundamental question: *does there exist a representation satisfying these constraints, and if so, is it unique?*

In this work, we show that weighted contrastive learning can be interpreted as a Distance Geometry Problem (DGP) Liberti et al. ([2014](https://arxiv.org/html/2605.13943#bib.bib25)), where the weighting scheme defines a target pairwise geometry. Under this view, learning representations amounts to finding a geometric realization of this target structure. This perspective reveals three regimes: some objectives define realizable geometries with unique solutions, others lead to degenerate solutions, and some are geometrically inconsistent.

In supervised classification, we prove that both SupCon and a soft relaxation of it collapse samples within each class, but differ under class imbalance: SupCon induces class-size-dependent prototype configurations, whereas the soft variant preserves a regular simplex geometry independent of imbalance. In continuous-label settings, we show that $y$-Aware CL Dufumier et al. ([2021](https://arxiv.org/html/2605.13943#bib.bib26)) is generally inconsistent, while geometrically consistent formulations such as $\mathbb{X}$-CLR Sobal et al. ([2024](https://arxiv.org/html/2605.13943#bib.bib5)) admit realizable and unique optima.

Beyond theory, this perspective provides practical guidance for designing contrastive objectives by matching the geometry induced by the weights with that of the embedding space\. We also introduce metrics to evaluate convergence to the predicted optima and validate our results experimentally\.

In summary, we make the following contributions:

- We show that weighted InfoNCE defines a target pairwise geometry and that minimizing the loss amounts to solving a DGP.
- We prove that both SupCon and Soft SupCon (a smooth relaxation in which inter-class pairs are assigned a small but non-zero similarity) collapse each class to a prototype, but differ under imbalance: SupCon produces class-size-dependent prototype geometry, whereas Soft SupCon preserves regular simplex geometry.
- We show that the label-space geometry and latent-space similarity must be matched for the optimum to be realizable (e.g. as in $\mathbb{X}$-CLR). As a consequence, we prove that $y$-Aware CL generally cannot attain its entropic lower bound because it combines Euclidean label distances with spherical latent similarity.
- We derive a novel lower bound on the value of the loss and introduce three new metrics, all geometric in nature, as a principled means of evaluating the convergence of models and the optimality of their representations.

## 2 Related Work

Theoretical aspects of CL have mainly been studied from three perspectives: information theory, probability and graph theory, and metric learning\. Our approach is in essence a geometric one and mostly aligns with this last view of CL\.

Mutual information perspective. Early modern contrastive objectives were motivated by the fact that they can be understood as tractable proxies for mutual information (MI) maximization. Contrastive Predictive Coding van den Oord et al. ([2018](https://arxiv.org/html/2605.13943#bib.bib28)) introduced the InfoNCE loss and showed how negative sampling yields a scalable lower bound on MI in sequential prediction settings. In parallel, Deep InfoMax proposed MI maximization between global and local feature statistics for unsupervised representation learning Hjelm et al. ([2018](https://arxiv.org/html/2605.13943#bib.bib29)), and related multi-view MI objectives were developed for self-supervised learning Bachman et al. ([2019](https://arxiv.org/html/2605.13943#bib.bib30)). Subsequent work clarified the behavior and limitations of variational MI bounds (including InfoNCE) through bias–variance tradeoffs Poole et al. ([2019](https://arxiv.org/html/2605.13943#bib.bib31)). However, MI does not fully explain representation quality or geometry: empirical evidence Tschannen et al. ([2020](https://arxiv.org/html/2605.13943#bib.bib8)) shows that downstream performance strongly depends on architectural and estimator-induced inductive biases, rather than on MI estimation alone.

Probabilistic and graph-based view. Another line of research models positives as samples sharing an unobserved semantic factor (latent class), yielding generalization guarantees for downstream classification from contrastive pretraining Arora et al. ([2019](https://arxiv.org/html/2605.13943#bib.bib39)). More recent analyses shift from strong assumptions on latent-class sampling to augmentation-driven positive pairs by introducing an augmentation graph whose spectral structure is related to contrastive objectives HaoChen et al. ([2021](https://arxiv.org/html/2605.13943#bib.bib12)); HaoChen and Ma ([2022](https://arxiv.org/html/2605.13943#bib.bib38)). Stronger results on downstream performance were obtained with this formulation by assuming semantic clustering of the augmentation graph. Meanwhile, Balestriero and LeCun ([2022](https://arxiv.org/html/2605.13943#bib.bib27)) proposed to unify contrastive and non-contrastive approaches based on a “similarity graph”, a rough approximation of the augmentation graph. A complementary view is given in Wang and Isola ([2020](https://arxiv.org/html/2605.13943#bib.bib9)), which interprets contrastive learning as trading off an alignment term (pulling positives together) and a uniformity term (uniformly spreading representations out on a hypersphere), while also noting the inherent limitations to learning this trade-off with limited data.

Metric Learning. Before the MI and probabilistic/graph-based perspectives of CL, early works on dimensionality reduction Hadsell et al. ([2006](https://arxiv.org/html/2605.13943#bib.bib33)) proposed a more geometric formulation of CL with subsequent pairwise and triplet-based losses, which explicitly enforce geometric structure in embedding space. Modern self-supervised CL frameworks Chen et al. ([2020](https://arxiv.org/html/2605.13943#bib.bib10)); He et al. ([2020](https://arxiv.org/html/2605.13943#bib.bib34)) can be seen as scalable extensions of this metric learning view that implicitly shape the geometry of representations on the unit sphere. In the discrete supervised setting, SupCon Khosla et al. ([2020](https://arxiv.org/html/2605.13943#bib.bib7)) extends self-supervised objectives using class labels to define positive samples. A more explicit geometric characterization was provided by Graf et al. ([2021](https://arxiv.org/html/2605.13943#bib.bib1)), who analyze the loss-minimizing configurations of SupCon and state that, under idealized conditions, class representations approach a regular simplex arrangement on the hypersphere. In the regression case, kernel-based variants Dufumier et al. ([2021](https://arxiv.org/html/2605.13943#bib.bib26)); Barbano et al. ([2023](https://arxiv.org/html/2605.13943#bib.bib35)) and rank-based variants Zha et al. ([2023](https://arxiv.org/html/2605.13943#bib.bib4)) of the InfoNCE objective have been derived, but as far as we are aware a geometrical characterization of their optima has not been given until now.

Overall, while contrastive learning has been studied extensively from MI and probabilistic/graph-based viewpoints, a precise characterization of the geometry of optimal representations (including existence, uniqueness and the resulting distance structure) has so far remained unexplored.

## 3 Analysis of Weighted Contrastive Learning

### 3.1 Unified formulation of weighted Contrastive Learning losses

#### Problem setup\.

The goal of representation learning Bengio et al. ([2013](https://arxiv.org/html/2605.13943#bib.bib24)) is to learn a mapping $f:\mathcal{X}\rightarrow\mathcal{Z}$ from input data $X=(x_{i})_{i\in[1..n]}\in\mathcal{X}^{n}\subseteq\mathbb{R}^{n\times p}$ such that its representation $Z=(z_{i})_{i\in[1..n]}:=f(X)\in\mathcal{Z}^{n}\subseteq\mathbb{R}^{n\times q}$ retains useful information for a variety of downstream tasks.

###### Definition 3.1 ($w$-InfoNCE loss).

Given a similarity matrix $W=(w_{ij})\in\mathbb{R}^{n\times n}$ such that $w_{ij}\geq 0$ for $i\neq j$, and $S=(s_{ij})$ where $s_{ij}=s(z_{i},z_{j})=s(f(x_{i}),f(x_{j}))$ for $s:\mathbb{R}^{q}\times\mathbb{R}^{q}\to\mathbb{R}$ some measure of similarity in the latent space, the *weighted InfoNCE loss* is defined as

$$\mathcal{L}_{\mathrm{NCE}}^{W}:=-\frac{1}{n}\sum_{i=1}^{n}\sum_{j\neq i}\frac{w_{ij}}{\sum_{k\neq i}w_{ik}}\log\left(\frac{\exp{s_{ij}}}{\sum_{k\neq i}\exp{s_{ik}}}\right)$$
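The definition above can be read as an average cross-entropy between the row-normalized weights and a row-wise softmax of the similarities. The following is a minimal NumPy sketch of that reading, not the authors' implementation; the function name `weighted_infonce` is hypothetical, and the code assumes every row of $W$ has a positive off-diagonal sum.

```python
import numpy as np

def weighted_infonce(S, W):
    """Weighted InfoNCE loss of Definition 3.1 (illustrative sketch).

    S: (n, n) similarities s_ij; W: (n, n) nonnegative weights w_ij.
    Diagonal entries are ignored because all sums run over j != i.
    Assumes each row of W has a positive off-diagonal sum.
    """
    S = np.asarray(S, dtype=float)
    W = np.asarray(W, dtype=float)
    n = S.shape[0]
    off = ~np.eye(n, dtype=bool)                 # mask selecting j != i
    # Row-normalized weights p_ij = w_ij / sum_{k != i} w_ik.
    Wm = np.where(off, W, 0.0)
    P = Wm / Wm.sum(axis=1, keepdims=True)
    # Numerically stable row-wise log-softmax of S over k != i.
    Sm = np.where(off, S, -np.inf)
    row_max = Sm.max(axis=1, keepdims=True)
    log_q = Sm - row_max - np.log(np.exp(Sm - row_max).sum(axis=1, keepdims=True))
    # Cross-entropy between P and softmax(S), averaged over the n anchors.
    return -np.sum(P[off] * log_q[off]) / n

# With constant similarities the softmax is uniform over the n - 1 other samples.
loss_uniform = weighted_infonce(np.zeros((4, 4)), np.ones((4, 4)))
```

Because the softmax is invariant to adding a constant to a row of $S$, the loss is unchanged when a constant is added to all similarities.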

Remark. Originally, SupCon, $y$-Aware and $\mathbb{X}$-CLR all define the InfoNCE loss in terms of several augmentations of every given sample. We argue that this case can be recovered from the above formulation by concatenating several data augmentations in a single matrix $X$.

Let us recall the weighting schemes that correspond to several well-known frameworks:

- SupCon Khosla et al. ([2020](https://arxiv.org/html/2605.13943#bib.bib7)): $w_{ij}=1$ if $y_{i}=y_{j}$ and $0$ otherwise, for a labeled dataset $(X,Y)$ with $Y\in[1..C]^{n}$ and $C$ the number of classes.
- $y$-Aware Dufumier et al. ([2021](https://arxiv.org/html/2605.13943#bib.bib26)): $w_{ij}=\exp\left(-\|y_{i}-y_{j}\|^{2}\right)$ for a weakly-labeled dataset $(X,Y)$ with $Y\in\mathbb{R}^{n\times\ell}$ the meta-data.
- $\mathbb{X}$-CLR Sobal et al. ([2024](https://arxiv.org/html/2605.13943#bib.bib5)): $w_{ij}=\exp(\cos(y_{i},y_{j})/\tau^{\prime})$ for an image-caption dataset $(X,Y)$ with $Y$ the textual descriptions of the images encoded by a pretrained text encoder.

In each of the above cases, similarity in the latent space is measured with cosine similarity: $s_{ij}=\cos(z_{i},z_{j})/\tau$, with $\tau$ a temperature hyperparameter.
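The three weighting schemes and the cosine similarity can be written down directly. The sketch below is illustrative only: the function names are hypothetical, and for $\mathbb{X}$-CLR the label matrix `Y` is assumed to already contain the text-encoder embeddings.

```python
import numpy as np

def cosine_matrix(V):
    """Pairwise cosine similarities between the rows of V."""
    V = np.asarray(V, dtype=float)
    U = V / np.linalg.norm(V, axis=1, keepdims=True)
    return U @ U.T

def supcon_weights(labels):
    """Hard SupCon: w_ij = 1 if y_i == y_j, else 0."""
    y = np.asarray(labels)
    return (y[:, None] == y[None, :]).astype(float)

def y_aware_weights(Y):
    """y-Aware: w_ij = exp(-||y_i - y_j||^2) for labels Y of shape (n, l)."""
    Y = np.asarray(Y, dtype=float)
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists)

def xclr_weights(Y, tau_prime):
    """X-CLR: w_ij = exp(cos(y_i, y_j) / tau'), Y holding precomputed text embeddings."""
    return np.exp(cosine_matrix(Y) / tau_prime)

def cosine_scores(Z, tau):
    """Latent similarities s_ij = cos(z_i, z_j) / tau."""
    return cosine_matrix(Z) / tau
```

The diagonal entries of each matrix are never used by the loss, since all sums exclude $j=i$.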

###### Assumption 3.2 (Symmetry of $S$ and $W$).

The similarity function $s$ and similarity matrix $W$ are symmetric, so $s_{ij}=s_{ji}$ and $w_{ij}=w_{ji}$ for all $i,j\in[1..n]$.

We divide the analysis into two parts. First, we study the *soft* weighting regime, where $W$ is strictly positive (§[3.2](https://arxiv.org/html/2605.13943#S3.SS2)–[3.6](https://arxiv.org/html/2605.13943#S3.SS6)). We then turn to the *hard* or *sparse* regime, where $W$ may contain zeros (§[3.7](https://arxiv.org/html/2605.13943#S3.SS7)–[3.8](https://arxiv.org/html/2605.13943#S3.SS8)).

### 3.2 Entropic lower bound

Under a strict-positiveness assumption, we can state the entropic lower bound of the $w$-InfoNCE loss (proof in Appendix [A](https://arxiv.org/html/2605.13943#A1)):

###### Theorem 3.3 (Entropic lower bound of $w$-InfoNCE).

Under [Assumption 3.2](https://arxiv.org/html/2605.13943#S3.Thmtheorem2) and assuming $w_{ij}>0$ for $i\neq j$, viewing the weighted InfoNCE loss $\mathcal{L}_{\mathrm{NCE}}^{W}$ as a function of $S=(s_{ij})$ over the space of symmetric similarity matrices, the global minimum is attained precisely when

$$s_{ij}^{*}=\log(w_{ij})+c\qquad\text{for}\ i\neq j$$

for some constant $c\in\mathbb{R}$. Moreover, that minimum equals

$$\min_{S}\mathcal{L}_{\mathrm{NCE}}^{W}=-\frac{1}{n}\sum_{i\neq j}\frac{w_{ij}}{\sum_{k\neq i}w_{ik}}\log\left(\frac{w_{ij}}{\sum_{k\neq i}w_{ik}}\right).\tag{1}$$

This result echoes what has been previously obtained using a probabilistic framework Poole et al. ([2019](https://arxiv.org/html/2605.13943#bib.bib31)), with $s$ being the “critic” discriminating positive pairs against negative ones. In our case, we obtain a geometric characterization of $Z^{*}$ which is more precise than previous works (in Poole et al. ([2019](https://arxiv.org/html/2605.13943#bib.bib31)), the constant $c$ depends on $i$), in part thanks to Assumption [3.2](https://arxiv.org/html/2605.13943#S3.Thmtheorem2), and which does not depend on a dichotomy between positive and negative samples as in the classical view of contrastive learning.
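The entropic lower bound of Theorem 3.3 is easy to check numerically. The sketch below (hypothetical helper names, random symmetric positive weights chosen purely for illustration) evaluates Eq. (1), confirms that $S^{*}=\log W+c$ attains it, and shows that a symmetric perturbation of $S^{*}$ increases the loss.

```python
import numpy as np

def _row_probs(W):
    """Row-normalized off-diagonal weights and the off-diagonal mask."""
    W = np.asarray(W, dtype=float)
    off = ~np.eye(W.shape[0], dtype=bool)
    Wm = np.where(off, W, 0.0)
    return Wm / Wm.sum(axis=1, keepdims=True), off

def weighted_infonce(S, W):
    """Weighted InfoNCE loss (Definition 3.1), diagonal ignored."""
    P, off = _row_probs(W)
    Sm = np.where(off, np.asarray(S, dtype=float), -np.inf)
    m = Sm.max(axis=1, keepdims=True)
    log_q = Sm - m - np.log(np.exp(Sm - m).sum(axis=1, keepdims=True))
    return -np.sum(P[off] * log_q[off]) / P.shape[0]

def entropic_lower_bound(W):
    """Right-hand side of Eq. (1): mean entropy of the rows of normalized weights."""
    P, off = _row_probs(W)
    p = P[off]
    p = p[p > 0]                       # 0 * log 0 is treated as 0
    return -np.sum(p * np.log(p)) / P.shape[0]

rng = np.random.default_rng(0)
A = rng.uniform(0.1, 1.0, size=(6, 6))
W = (A + A.T) / 2                      # symmetric, strictly positive weights
S_star = np.log(W) + 0.7               # optimum of Theorem 3.3 with c = 0.7
E = rng.normal(size=(6, 6))
E = (E + E.T) / 2                      # symmetric perturbation

bound = entropic_lower_bound(W)
loss_opt = weighted_infonce(S_star, W)
loss_pert = weighted_infonce(S_star + 0.5 * E, W)
print(bound, loss_opt, loss_pert)
```

The value of the constant $c$ has no effect on `loss_opt`, which is why the optimum is only determined up to an additive shift of $S$.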

### 3.3 Connection with the Distance Geometry Problem

We can now express the relationship between the $w$-InfoNCE loss and the Distance Geometry Problem (DGP) Liberti et al. ([2014](https://arxiv.org/html/2605.13943#bib.bib25)) as immediate corollaries of [Theorem 3.3](https://arxiv.org/html/2605.13943#S3.Thmtheorem3):

###### Corollary 3.4.

When latent similarity is given by $s_{ij}=-\|z_{i}-z_{j}\|^{2}$ and $w_{ij}=\exp(-d_{ij})$ for a symmetric dissimilarity matrix $D=(d_{ij})$, attaining the entropic lower bound of $\mathcal{L}_{\mathrm{NCE}}^{W}$ stated in [Theorem 3.3](https://arxiv.org/html/2605.13943#S3.Thmtheorem3) amounts to solving the DGP

$$\|z_{i}^{*}-z_{j}^{*}\|^{2}=d_{ij}+c\qquad\textrm{for }i\neq j,\tag{2}$$

for some constant $c\in\mathbb{R}$.

###### Corollary 3.5.

When $s_{ij}=\cos(z_{i},z_{j})/\tau$ and $w_{ij}=\exp(-d_{ij})$ for a symmetric dissimilarity matrix $D=(d_{ij})$, attaining the entropic lower bound of $\mathcal{L}_{\mathrm{NCE}}^{W}$ amounts to solving the cosine problem

$$\cos(z_{i},z_{j})=-\tau d_{ij}+c^{\prime}\qquad\textrm{for }i\neq j,\tag{3}$$

for some constant $c^{\prime}\in\mathbb{R}$.

These corollaries show that minimizing $w$-InfoNCE amounts to finding a low-dimensional embedding with prescribed similarity, up to an additive constant $c$ or $c^{\prime}$. We will go on to show that a mild condition on $n$ and $\operatorname{rank}(D)$ suffices to entail that $c=0$ in (2) or $c^{\prime}=1$ in (3).
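For completeness, both corollaries follow from Theorem 3.3 by substituting the weights and the similarity into the optimality condition. A short worked derivation, using only the symbols already defined:

```latex
\begin{align*}
s_{ij}^{*} = \log(w_{ij}) + c,\quad w_{ij} = \exp(-d_{ij})
  &\;\Longrightarrow\; s_{ij}^{*} = -d_{ij} + c, \\
s_{ij} = -\|z_i - z_j\|^{2}
  &\;\Longrightarrow\; \|z_i^{*} - z_j^{*}\|^{2} = d_{ij} - c, \\
s_{ij} = \cos(z_i, z_j)/\tau
  &\;\Longrightarrow\; \cos(z_i^{*}, z_j^{*}) = -\tau\, d_{ij} + \tau c .
\end{align*}
```

Since $c$ ranges over all of $\mathbb{R}$, renaming $-c$ as $c$ in the second line gives Eq. (2), and setting $c^{\prime}:=\tau c$ in the third line gives Eq. (3).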

### 3.4 Existence and uniqueness of the optimal embedding

Having established the link between DGP and weighted InfoNCE, we now turn to the existence and uniqueness of solutions to these two families of problems. The answers involve a classical notion from computational geometry Alfakih and others ([2018](https://arxiv.org/html/2605.13943#bib.bib23)).

###### Definition 3.6 (EDM).

An $n\times n$ matrix $D=(d_{ij})$ is called a *Euclidean Distance Matrix* (EDM) if there exist points $z_{1}^{*},\dots,z_{n}^{*}\in\mathbb{R}^{q}$ such that

$$\|z_{i}^{*}-z_{j}^{*}\|^{2}=d_{ij}\qquad\text{for}\ i,j\in[1..n].\tag{4}$$

We call the points $z_{1}^{*},\ldots,z_{n}^{*}$ a *$q$-dimensional realization* of $D$. The *embedding dimension* is defined as the smallest dimension $r$ of a Euclidean realization, or equivalently as the dimension of the affine span of any given Euclidean realization. Distinct Euclidean realizations of an EDM in a given dimension are congruent, i.e. they differ by a Euclidean isometry. A minimal Euclidean realization (i.e. in $\mathbb{R}^{r}$) is called a set of *generating points* of $D$. Finally, we say that $D$ is *spherical* if those generating points can be chosen such that $\|z_{1}\|=\cdots=\|z_{n}\|=\rho$ for some radius $\rho>0$. The smallest such $\rho$ is called the *radius* of a spherical EDM.
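Whether a given matrix is an EDM can be tested with the classical double-centering criterion from multidimensional scaling: a symmetric matrix $D$ with zero diagonal is an EDM exactly when $B=-\tfrac{1}{2}JDJ$ is positive semidefinite, where $J=I-\mathbf{1}\mathbf{1}^{\top}/n$, and the embedding dimension equals $\operatorname{rank}(B)$. The sketch below applies this standard criterion (it is not taken from the paper); the function name and the tolerance are assumptions, and unlike the "up to diagonal elements" convention used in the theorems, it requires the diagonal of $D$ to be zero.

```python
import numpy as np

def edm_embedding(D, tol=1e-9):
    """Check whether D (squared distances) is an EDM via double centering.

    Returns (is_edm, r, Z): Z has shape (n, r) and is a set of generating
    points when D is an EDM; otherwise returns (False, None, None).
    """
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    if not (np.allclose(D, D.T) and np.allclose(np.diag(D), 0.0)):
        return False, None, None
    J = np.eye(n) - np.ones((n, n)) / n          # centering projector
    B = -0.5 * J @ D @ J                         # Gram matrix of centered points
    B = (B + B.T) / 2                            # remove rounding asymmetry
    evals, evecs = np.linalg.eigh(B)
    scale = max(1.0, np.abs(evals).max())
    if evals.min() < -tol * scale:               # a negative eigenvalue rules out an EDM
        return False, None, None
    keep = evals > tol * scale
    Z = evecs[:, keep] * np.sqrt(evals[keep])    # coordinates in R^r
    return True, int(keep.sum()), Z

# Example: squared distances between five generic points in the plane.
rng = np.random.default_rng(1)
pts = rng.normal(size=(5, 2))
D_plane = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
is_edm, r, Z = edm_embedding(D_plane)
```

The recovered points `Z` differ from `pts` only by a Euclidean isometry, which is the congruence property stated in the definition.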

###### Theorem 3.7 (Euclidean DGP).

Assume that the number of points $n$ is strictly greater than $2q+3$. Assume that $s_{ij}=-\|z_{i}-z_{j}\|^{2}$ and $W=(\exp(-d_{ij}))$ for $D=(d_{ij})$ an $n\times n$ dissimilarity matrix with $\operatorname{rank}(D)\leq q+2$. Then the entropic lower bound of $\mathcal{L}_{\mathrm{NCE}}^{W}$ is attained iff $D$ is an EDM of embedding dimension $r\leq q$, up to diagonal elements. In that case, the minimum $Z^{*}$ satisfies $\|z^{*}_{i}-z^{*}_{j}\|^{2}=d_{ij}$ for all $i,j\in[1..n]$ and is essentially unique (i.e. up to Euclidean isometry).

Next, we consider the case of Corollary [3.5](https://arxiv.org/html/2605.13943#S3.Thmtheorem5) where the similarity between latent vectors is given by cosine similarity. This is the most common use-case in CL Khosla et al. ([2020](https://arxiv.org/html/2605.13943#bib.bib7)); Dufumier et al. ([2021](https://arxiv.org/html/2605.13943#bib.bib26)); Sobal et al. ([2024](https://arxiv.org/html/2605.13943#bib.bib5)) and it has a rigidity property analogous to [Theorem 3.7](https://arxiv.org/html/2605.13943#S3.Thmtheorem7), which we prove in Appendix [B.2](https://arxiv.org/html/2605.13943#A2.SS2):

###### Theorem 3.8 (Spherical DGP).

Assume that $n>2q+4$. Assume that $s_{ij}=\cos(z_{i},z_{j})/\tau$ and $w_{ij}=\exp(-d_{ij})$ for $D=(d_{ij})$ an $n\times n$ dissimilarity matrix with $\operatorname{rank}(D)\leq q+1$. Then the entropic lower bound of $\mathcal{L}_{\mathrm{NCE}}^{W}$ is attained by some embedding $Z^{*}\in\mathbb{R}^{n\times q}$ iff $D$ is a spherical EDM (up to diagonal elements) with embedding dimension $r\leq q$ and radius $\rho\leq 1/\sqrt{2\tau}$ (with equality when $r=q$). In that case, $\cos(z_{i},z_{j})=1-\tau d_{ij}$ for all $i,j\in[1..n]$ and $Z^{*}$ is essentially unique (i.e. the normalized vectors $\widetilde{z}_{i}:=z_{i}^{*}/\|z_{i}^{*}\|$ are unique up to linear Euclidean isometry).

### 3.5 Supervised classification: Soft SupCon

We now characterize the optimum of the Soft SupCon weighting scheme as a direct application of [Theorem 3.8](https://arxiv.org/html/2605.13943#S3.Thmtheorem8).

###### Corollary 3.9 (Soft SupCon).

Let $y_{i}\in[1..C]$ be one of $C$ discrete classes associated with each sample and define $w_{ij}=1$ if $y_{i}=y_{j}$ and $w_{ij}=\varepsilon\in(0,1)$ otherwise. Assume $n>2q+4$ and consider $s_{ij}=\cos(z_{i},z_{j})/\tau$. If $C\leq q$ and $\tau\leq(C-1)/(-C\log\varepsilon)$, then the entropic lower bound of $\mathcal{L}_{\mathrm{NCE}}^{W}$ is attained by some $Z^{*}\in\mathbb{R}^{n\times q}$, which is unique up to linear Euclidean isometry of the normalized embeddings. At the optimum, one has $\cos(z_{i},z_{j})=1$ when $y_{i}=y_{j}$, and $\cos(z_{i},z_{j})=1+\tau\log\varepsilon$ otherwise. In particular, all points from the same class collapse to a single representation, and the $C$ classes form a simplex with uniform inter-class similarity $\beta=1+\tau\log\varepsilon$. The simplex is regular when $\tau^{*}=(C-1)/(-C\log\varepsilon)$, corresponding to $\beta^{*}=-1/(C-1)$.
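The collapsed configuration described in Corollary 3.9 can be constructed explicitly and checked against the optimality condition $s_{ij}=\log(w_{ij})+c$ of Theorem 3.3, with the cosine similarity divided by $\tau$ as in Corollary 3.5. The sketch below is illustrative: the values of `C`, `eps`, `tau` and the class sizes are assumptions chosen to satisfy the stated hypotheses, and the prototypes are obtained from a Cholesky factor of the target Gram matrix, which is one convenient construction among many.

```python
import numpy as np

C, eps, tau = 3, 0.1, 0.2                  # illustrative values (assumptions)
sizes = [6, 4, 2]                          # imbalanced on purpose; n = 12 > 2q + 4 for q = 3
labels = np.repeat(np.arange(C), sizes)

# Temperature condition as printed in Corollary 3.9.
assert tau <= (C - 1) / (-C * np.log(eps))

beta = 1.0 + tau * np.log(eps)             # predicted inter-class cosine
# Target Gram matrix of the prototypes: 1 on the diagonal, beta elsewhere.
G = (1.0 - beta) * np.eye(C) + beta * np.ones((C, C))
protos = np.linalg.cholesky(G)             # rows are unit vectors with pairwise dot product beta
Z = protos[labels]                         # class collapse: each sample sits on its prototype

same = labels[:, None] == labels[None, :]
W = np.where(same, 1.0, eps)               # Soft SupCon weights
S = (Z @ Z.T) / tau                        # rows of Z are unit-norm, so Z Z^T is the cosine matrix

off = ~np.eye(labels.size, dtype=bool)
shift = (S - np.log(W))[off]               # constant c of Theorem 3.3, expected to equal 1 / tau
print(beta, shift.min(), shift.max())
```

The class sizes never enter the construction of the prototypes, which matches the statement that the Soft SupCon geometry is agnostic to class imbalance.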

### 3.6 Supervised multivariate regression

Next, we state three results corresponding to various weighting schemes that can be encountered when using continuously labeled data (see Appendix [C.1](https://arxiv.org/html/2605.13943#A3.SS1) for the proofs). Let $(X,Y)$ be a labeled dataset with $\ell$ continuous labels $Y\in\mathbb{R}^{n\times\ell}$. We assume throughout that $n>2q+4$ so that [Theorem 3.7](https://arxiv.org/html/2605.13943#S3.Thmtheorem7) and [Theorem 3.8](https://arxiv.org/html/2605.13943#S3.Thmtheorem8) can be applied. Our first result concerns the case where Euclidean distance is used as a dissimilarity function in both label space and representation space.

###### Theorem 3.10 (Euclidean $w$-InfoNCE).

Assume $\ell\leq q$. If $w_{ij}=\exp(-\|y_{i}-y_{j}\|^{2})$ and $s_{ij}=-\|z_{i}-z_{j}\|^{2}$, the $w$-InfoNCE loss attains its entropic lower bound at $Z^{*}=(\widetilde{y}_{i})_{i=1}^{n}$, where $\widetilde{y}_{i}=(y_{i},0,\dots,0)\in\mathbb{R}^{q}$ for $i\in[1..n]$, which is essentially unique.
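In the Euclidean–Euclidean case the optimality condition of Theorem 3.3 holds with $c=0$ as soon as the embedding reproduces the label distances. The sketch below checks this for the zero-padded labels of Theorem 3.10 and for an isometric copy of them; the sizes `n`, `l`, `q` and the random labels are assumptions made for illustration.

```python
import numpy as np

def sq_dists(V):
    """Matrix of squared Euclidean distances between the rows of V."""
    return ((V[:, None, :] - V[None, :, :]) ** 2).sum(-1)

rng = np.random.default_rng(0)
n, l, q = 16, 2, 4                          # n > 2q + 4 and l <= q
Y = rng.normal(size=(n, l))                 # continuous labels

Z_star = np.hstack([Y, np.zeros((n, q - l))])   # labels padded with zeros
W = np.exp(-sq_dists(Y))                    # Euclidean label weights
S = -sq_dists(Z_star)                       # Euclidean latent similarity

# Any Euclidean isometry (rotation/reflection plus translation) leaves S unchanged.
Q, _ = np.linalg.qr(rng.normal(size=(q, q)))
Z_moved = Z_star @ Q + rng.normal(size=q)
S_moved = -sq_dists(Z_moved)
```

Here `S` coincides with `np.log(W)` entrywise, so the entropic lower bound is attained, and `S_moved` equals `S`, which illustrates why uniqueness can only hold up to isometry.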

Our next result is a negative one and concerns the Euclidean–spherical $w$-InfoNCE loss, i.e. the $y$-Aware loss Dufumier et al. ([2021](https://arxiv.org/html/2605.13943#bib.bib26)).

###### Theorem 3.11 ($y$-Aware).

Even if $\ell\leq q$, the $y$-Aware InfoNCE loss, which is the $w$-InfoNCE loss with $s_{ij}=\cos(z_{i},z_{j})/\tau$ and $w_{ij}=\exp(-\|y_{i}-y_{j}\|^{2})$, does not attain the entropic lower bound unless the $y_{i}$ happen to lie on a hypersphere in $\mathbb{R}^{\ell}$.
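The intuition behind the hypersphere condition can be sketched in a few lines. This is an informal argument consistent with the spherical DGP result stated earlier, not a substitute for the proof in the appendix. Suppose the bound is attained and that the additive constant of the cosine problem equals $1$, as stated for the cosine case. Writing $\widetilde{z}_i := z_i/\|z_i\|$:

```latex
\begin{align*}
\cos(z_i, z_j) &= 1 - \tau\,\|y_i - y_j\|^{2}, \\
\|\widetilde{z}_i - \widetilde{z}_j\|^{2} &= 2 - 2\cos(z_i, z_j) = 2\tau\,\|y_i - y_j\|^{2}
  = \bigl\|\sqrt{2\tau}\,y_i - \sqrt{2\tau}\,y_j\bigr\|^{2}.
\end{align*}
```

The rescaled labels $\sqrt{2\tau}\,y_i$ are therefore congruent to the unit vectors $\widetilde{z}_i$, which all lie on a sphere. Congruence preserves this property, so the labels $y_i$ must themselves lie on a sphere, of radius at most $1/\sqrt{2\tau}$.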

On the other hand, the spherical–spherical case, as in the $\mathbb{X}$-CLR loss Sobal et al. ([2024](https://arxiv.org/html/2605.13943#bib.bib5)), does attain the entropic lower bound and offers a spherical counterpart to [Theorem 3.10](https://arxiv.org/html/2605.13943#S3.Thmtheorem10).

###### Theorem 3.12 ($\mathbb{X}$-CLR).

If $\ell < q$ and $\tau\leq\tau^{\prime}$, the $\mathbb{X}$-CLR loss Sobal et al. ([2024](https://arxiv.org/html/2605.13943#bib.bib5)), which is the $w$-InfoNCE loss with $s_{ij}=\cos(z_{i},z_{j})/\tau$ and $w_{ij}=\exp(\cos(y_{i},y_{j})/\tau^{\prime})$, attains the entropic lower bound at the points $z_{i}^{*}=(y_{i}/\|y_{i}\|,\sqrt{\tau^{\prime}/\tau-1},0,\dots,0)$ for $i\in[1..n]$. Moreover, that minimum is essentially unique.
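The closed-form optimum of Theorem 3.12 can be verified numerically: with the extra coordinate $\sqrt{\tau^{\prime}/\tau-1}$, one gets $\cos(z_i^{*},z_j^{*})/\tau=\cos(y_i,y_j)/\tau^{\prime}+(1/\tau-1/\tau^{\prime})$, which is the optimality condition of Theorem 3.3 with $c=1/\tau-1/\tau^{\prime}$. The sketch below uses random labels and illustrative temperatures; all numerical values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, l, q = 16, 3, 4                         # l < q and n > 2q + 4
tau, tau_prime = 0.1, 0.25                 # tau <= tau'

Y = rng.normal(size=(n, l))
Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)   # y_i / ||y_i||

# z_i* = (y_i / ||y_i||, sqrt(tau'/tau - 1), 0, ..., 0)
Z = np.zeros((n, q))
Z[:, :l] = Yn
Z[:, l] = np.sqrt(tau_prime / tau - 1.0)

Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
cos_y = Yn @ Yn.T
cos_z = Zn @ Zn.T

log_W = cos_y / tau_prime                  # log of the X-CLR weights
S = cos_z / tau                            # latent similarities
c = 1.0 / tau - 1.0 / tau_prime            # predicted additive constant
```

The condition $\tau\leq\tau^{\prime}$ is what keeps the square root real. When $\tau=\tau^{\prime}$ the extra coordinate vanishes and the optimum is simply the normalized labels.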

In practice Sobal et al. ([2024](https://arxiv.org/html/2605.13943#bib.bib5)), the dimension of text embeddings in $\mathbb{X}$-CLR is $\ell=768$ for CLIP and the latent dimension is $q=2048$ for ResNet50. It was also found that $\tau^{\prime}\approx\tau$ gives the best classification accuracy on ImageNet.

Practical implications. [Theorem 3.11](https://arxiv.org/html/2605.13943#S3.Thmtheorem11) and [Theorem 3.12](https://arxiv.org/html/2605.13943#S3.Thmtheorem12) suggest that the pair of similarity functions used to define $(W,S)$ should be chosen consistently to allow for a realizable optimum. The spherical–spherical or Euclidean–Euclidean weighting schemes are therefore preferable to the Euclidean–spherical scheme (as in $y$-Aware).

### 3.7 Sparse weights and limiting properties

The assumption of positive weights, i.e. $w_{ij}>0$ in [Theorem 3.3](https://arxiv.org/html/2605.13943#S3.Thmtheorem3), may be relaxed to $w_{ij}\geq 0$ assuming that $\sum_{ik}w_{ik}>0$ so that $\mathcal{L}_{\mathrm{NCE}}^{W}$ is well defined: we term this the *sparse case*. Although any sparse matrix $W$ can be understood as a limit of a sequence of positive matrices $(W_{m})$, the optimum of $\mathcal{L}_{\mathrm{NCE}}^{W}$ need not be the limit of the optima of $\mathcal{L}_{\mathrm{NCE}}^{W_{m}}$ because in general convergence is not uniform. Moreover, the configurations of points which attain the minimum may behave qualitatively differently in the limit.

In Appendix [D.1](https://arxiv.org/html/2605.13943#A4.SS1) we prove that in the sparse case, the entropic lower bound continues to hold and that it is sharp. We give a necessary and sufficient condition for it to be optimal (i.e. impossible to improve upon). We go on to show that it is optimal in the case of the so-called Euclidean SupCon (i.e. the $w$-InfoNCE loss with a hard SupCon weighting scheme and Euclidean similarities $s_{ij}=-\|z_{i}-z_{j}\|^{2}$) but that quasi-optima can be produced in any number of qualitatively distinct configurations. This stands in stark contrast to the soft case, where optima are essentially unique ([Theorem 3.7](https://arxiv.org/html/2605.13943#S3.Thmtheorem7)).

### 3.8 SupCon

We now consider the case of SupCon proper (i.e. the $w$-InfoNCE loss with a hard SupCon weighting scheme and spherical similarities $s_{ij}=\cos(z_{i},z_{j})/\tau$). Using a compactness argument, we see in Appendix [D.2](https://arxiv.org/html/2605.13943#A4.SS2) that the entropic lower bound is sub-optimal. We go on to prove the following theorem:

###### Theorem 3.13.

Let $C$ denote the number of discrete classes and $\ell_{c}$ the size of the $c$th class. Assume that $C\leq q$.

1. The SupCon loss has a unique global minimum over the matrices of the form $G/\tau$, with $G$ an $n\times n$ cosine matrix.
2. That minimum has collapsed representations and can be expressed in terms of class prototypes $\mu_{1},\ldots,\mu_{C}\in\mathbb{S}^{q-1}$ which are unique up to linear Euclidean isometry.
3. If $c\neq c^{\prime}$, the value of $\cos(\mu_{c},\mu_{c^{\prime}})$ depends only on $(\ell_{c},\ell_{c^{\prime}})$.

In particular we see that SupCon suffers from representation collapse which generalizes bothNguyenet al\.\([2024](https://arxiv.org/html/2605.13943#bib.bib47)\)andBehnia and Thrampoulidis \([2024](https://arxiv.org/html/2605.13943#bib.bib48)\)in that we make no assumptions onτ\>0\\tau\>0nor do we require an infinite number of samples\.

In the case of balanced classes, we easily recover the result fromGrafet al\.\([2021](https://arxiv.org/html/2605.13943#bib.bib1)\)that SupCon optima correspond to regular simplices \(Appendix[D\.2](https://arxiv.org/html/2605.13943#A4.SS2),[Theorem˜D\.6](https://arxiv.org/html/2605.13943#A4.Thmtheorem6)\)\. Moreover, Hard and Soft SupCon have similar optima: both collapse classes and yield uniform inter\-class similarities\. For Hard SupCon,βhard=−1/\(C−1\)\\beta\_\{\\mathrm\{hard\}\}=\-1/\(C\-1\)depends onCC; for Soft SupCon,βsoft=1\+τ​log⁡ε\\beta\_\{\\mathrm\{soft\}\}=1\+\\tau\\log\\varepsilondepends on\(τ,ε\)\(\\tau,\\varepsilon\)but not onCC\. Choosingτ∗=\(C−1\)/\(−C​log⁡ε\)\\tau^\{\*\}=\(C\-1\)/\(\-C\\log\\varepsilon\)yields a regular simplex\.
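As a quick numerical illustration of the balanced case, the sketch below builds $C$ unit-norm prototypes whose pairwise cosine equals $\beta_{\mathrm{hard}}=-1/(C-1)$. The construction (centering and normalizing the one-hot vectors) is a standard one assumed here for illustration, not taken from the paper, and the helper name `simplex_prototypes` is hypothetical.

```python
import numpy as np

def simplex_prototypes(C):
    """Return C unit vectors in R^C forming a regular simplex (illustrative helper)."""
    centered = np.eye(C) - np.full((C, C), 1.0 / C)   # one-hot vectors minus their mean
    return centered / np.linalg.norm(centered, axis=1, keepdims=True)

C = 10
mu = simplex_prototypes(C)
G = mu @ mu.T                                  # cosine matrix of the prototypes
off_diag = G[~np.eye(C, dtype=bool)]           # all inter-class cosines
beta_hard = -1.0 / (C - 1)
print(np.allclose(off_diag, beta_hard))        # checks uniform inter-class similarity
```

The prototypes live in $\mathbb{R}^{C}$ but span only a $(C-1)$-dimensional subspace, since they sum to zero.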

This equivalence breaks under class imbalance: SupCon produces non\-uniform inter\-class similarities, grouping minority classes more closely, whereas Soft SupCon is agnostic to class imbalance\.

### 3.9 Metrics

To empirically evaluate whether the similarity structure of the learned encodings $Z=(z_{1};\ldots;z_{n})\in\mathbb{R}^{n\times q}$ is close to that of the theoretical optimum $Z^{*}=(z_{1}^{*};\ldots;z_{n}^{*})\in\mathbb{R}^{n\times q}$ as predicted by any of our theoretical results, we introduce three metrics.

###### Definition 3.14.

The $w$-InfoNCE *entropic loss gap* is defined as the nonnegative quantity $\Delta_{W}:=(\mathcal{L}_{\mathrm{NCE}}^{W}/\min_{S}\mathcal{L}_{\mathrm{NCE}}^{W})-1$, where $\min_{S}\mathcal{L}_{\mathrm{NCE}}^{W}$ is the entropic lower bound stated in [Theorem 3.3](https://arxiv.org/html/2605.13943#S3.Thmtheorem3).
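A minimal sketch of how $\Delta_{W}$ could be computed, assuming the loss is written as the cross-entropy $H(p_{W},p_{S})$ and the lower bound as the entropy $H(p_{W})$, as in Appendix A. The function names, the toy data and the choice $w_{ij}=\exp(-d_{ij})$ with Euclidean similarities are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def pair_distribution(W):
    """p_W(i, j) = w_ij / (n * sum_{k != i} w_ik) for i != j, and 0 on the diagonal."""
    n = W.shape[0]
    Wm = np.where(np.eye(n, dtype=bool), 0.0, W)
    return Wm / (n * Wm.sum(axis=1, keepdims=True))

def w_infonce(W, S):
    """Cross-entropy H(p_W, p_S), using a row-wise log-sum-exp over j != i."""
    n = W.shape[0]
    p_w = pair_distribution(W)
    S_off = np.where(np.eye(n, dtype=bool), -np.inf, S)
    row_max = S_off.max(axis=1, keepdims=True)
    log_z = row_max + np.log(np.exp(S_off - row_max).sum(axis=1, keepdims=True))
    log_p_s = S_off - log_z - np.log(n)
    keep = p_w > 0                               # convention 0 log 0 = 0
    return -(p_w[keep] * log_p_s[keep]).sum()

def entropic_lower_bound(W):
    """Entropy H(p_W), the lower bound of the loss."""
    p_w = pair_distribution(W)
    keep = p_w > 0
    return -(p_w[keep] * np.log(p_w[keep])).sum()

def entropic_loss_gap(W, S):
    return w_infonce(W, S) / entropic_lower_bound(W) - 1.0

def neg_sq_dists(Z):
    """Euclidean similarities s_ij = -||z_i - z_j||^2."""
    return -((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)

rng = np.random.default_rng(0)
Z_star = rng.normal(size=(8, 3))                 # toy target configuration
W = np.exp(neg_sq_dists(Z_star))                 # w_ij = exp(-d_ij)
gap_at_optimum = entropic_loss_gap(W, neg_sq_dists(Z_star))
gap_elsewhere = entropic_loss_gap(W, neg_sq_dists(rng.normal(size=(8, 3))))
print(gap_at_optimum, gap_elsewhere)
```

The gap vanishes (up to rounding) when the similarities reproduce the target geometry and is strictly positive for an unrelated configuration.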

Note that a small value of $\Delta_{W}$ is a necessary but not sufficient condition for the $z_{i}$ to be close to the optimum $z_{i}^{*}$. This motivates the introduction of a second metric, related to Kruskal's stress, which directly compares the pairwise similarities of the $z_{i}$ with those of the $z_{i}^{*}$.

###### Definition 3.15.

Let $s:\mathbb{R}^{q}\times\mathbb{R}^{q}\to\mathbb{R}$ denote a similarity function in latent space, e.g. $s(x,y)=-\|x-y\|^{2}$ or $s(x,y)=\cos(x,y)$. The *coefficient of similarity* between $Z$ and $Z^{*}$ is defined as the $r^{2}$ of $s(z_{i},z_{j})$ as a predictor of $s(z_{i}^{*},z_{j}^{*})$:

$$r^{2}_{s\text{-sim}}(Z,Z^{*}):=1-\frac{\widehat{\mu}_{ij}\left[(s(z_{i},z_{j})-s(z_{i}^{*},z_{j}^{*}))^{2}\right]}{\widehat{\sigma}^{2}_{ij}[s(z_{i}^{*},z_{j}^{*})]},$$

where $\widehat{\mu}_{ij}$ and $\widehat{\sigma}^{2}_{ij}$ denote, respectively, sample mean and sample variance over the set of pairs $(i,j)\in[1..n]^{2}$.
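The definition above can be sketched directly in NumPy. The function names are hypothetical, and the use of the population variance (`np.var` with its default `ddof=0`) is an assumption, since the normalization of the sample variance is not specified here.

```python
import numpy as np

def neg_sq_euclidean(Z):
    """Pairwise similarities s(z_i, z_j) = -||z_i - z_j||^2."""
    sq = (Z ** 2).sum(axis=1)
    return -(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T)

def cosine(Z):
    """Pairwise cosine similarities."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return Zn @ Zn.T

def r2_sim(Z, Z_star, sim=neg_sq_euclidean):
    S, S_star = sim(Z), sim(Z_star)
    mse = np.mean((S - S_star) ** 2)     # mean over all pairs (i, j) in [1..n]^2
    var = np.var(S_star)                 # variance over the same pairs (ddof=0 assumed)
    return 1.0 - mse / var

rng = np.random.default_rng(0)
Z_star = rng.normal(size=(20, 4))
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))     # random orthogonal matrix
Z_iso = Z_star @ Q + rng.normal(size=4)          # Euclidean isometry of Z_star
Z_other = rng.normal(size=(20, 4))               # unrelated configuration
print(r2_sim(Z_iso, Z_star), r2_sim(Z_other, Z_star))
```

Both similarity matrices are $n\times n$, which is where the quadratic cost in $n$ comes from.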

Note by standard results that if $s$ is the negative squared Euclidean distance then $r^{2}_{s\text{-sim}}=1$ iff $Z$ and $Z^{*}$ differ by a Euclidean isometry. The same result holds if $s$ is cosine similarity and $Z$ and $Z^{*}$ are taken on the unit hypersphere. Note that the computation of $r^{2}_{s\text{-sim}}$ is $O(n^{2})$ in time, and therefore impractical for large $n$. This motivates the following definition, which measures similarity at an intrinsic level in sub-quadratic $O(n\log n)$ time.

###### Definition 3.16.

Let $(\Omega,b)$ minimize $\sum_{i=1}^{n}\|\Omega z_{i}+b-z_{i}^{*}\|^{2}$ over $\Omega\in O_{q}(\mathbb{R})$, $b\in\mathbb{R}^{q}$. The *Procrustes similarity* is defined by

$$r^{2}_{\mathrm{Proc}}(Z,Z^{*})=1-\frac{\widehat{\mu}_{i}\left[\|\Omega z_{i}+b-z_{i}^{*}\|^{2}\right]}{\widehat{\sigma}_{i}^{2}[z_{i}^{*}]},$$

i.e. the coefficient of determination after optimal rigid alignment. Note that $r^{2}_{\mathrm{Proc}}(Z,Z^{*})=1$ iff $Z$ and $Z^{*}$ differ by a Euclidean isometry.
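One way to compute the Procrustes similarity is the classical SVD solution of the orthogonal Procrustes problem after centering both point sets. This is a sketch under stated assumptions: the function name is hypothetical, and $\widehat{\sigma}_{i}^{2}[z_{i}^{*}]$ is taken to be the total variance of the $z_{i}^{*}$ (mean squared distance to their centroid), which the definition does not spell out.

```python
import numpy as np

def r2_procrustes(Z, Z_star):
    # Centering both sets accounts for the optimal translation b.
    Zc = Z - Z.mean(axis=0)
    Zc_star = Z_star - Z_star.mean(axis=0)
    # Orthogonal Procrustes: R = argmin ||Zc R - Zc_star||_F over orthogonal R.
    U, _, Vt = np.linalg.svd(Zc.T @ Zc_star)
    R = U @ Vt                                   # R plays the role of Omega^T (points are rows)
    residual = Zc @ R - Zc_star
    mse = np.mean((residual ** 2).sum(axis=1))
    var = np.mean((Zc_star ** 2).sum(axis=1))    # total variance of the z_i^* (assumed normalization)
    return 1.0 - mse / var

rng = np.random.default_rng(0)
Z_star = rng.normal(size=(20, 4))
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))     # random orthogonal matrix
Z_iso = Z_star @ Q + rng.normal(size=4)          # rigid motion of Z_star
Z_stretched = Z_star * np.array([3.0, 1.0, 1.0, 1.0])   # not an isometry
print(r2_procrustes(Z_iso, Z_star), r2_procrustes(Z_stretched, Z_star))
```

In this sketch the work is dominated by forming the $q\times q$ cross-covariance matrix, so the cost grows linearly in $n$ for fixed $q$.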

## 4 Experiments

Armed with the metrics defined in §[3.9](https://arxiv.org/html/2605.13943#S3.SS9), we conduct a series of experiments to validate our theoretical claims and verify whether representations learned with the $w$-InfoNCE loss on practical datasets are in fact near their geometric optima. Implementation details are provided in Appendix [G](https://arxiv.org/html/2605.13943#A7).

### 4.1 Weighted InfoNCE on MNIST: discrete, continuous and mixed regime

![Refer to caption](https://arxiv.org/html/2605.13943v1/figures/mnist_cat_cropped.png)

Figure 2: Euclidean representations of MNIST learned with the discretely weighted InfoNCE loss. Left: coefficient of similarity between $Z$ and regular 9-simplex versus latent dimension $q$. Right: Procrustes similarity versus $q$.

We train a ConvNet with the $w$-InfoNCE loss on MNIST LeCun et al. ([1998](https://arxiv.org/html/2605.13943#bib.bib42)) across three regimes: (i) discrete classification, (ii) continuous regression, and (iii) mixed discrete–continuous labels.

Discrete case. We first consider the discrete classification setting, using Soft SupCon weights (as in §[3.5](https://arxiv.org/html/2605.13943#S3.SS5)) with the value $\varepsilon=e^{-1}$ between negative samples. The optimal geometry corresponds to a regular 9-simplex, realizable in $\mathbb{R}^{q}$ for $q\geq 9$. As shown in Fig. [2](https://arxiv.org/html/2605.13943#S4.F2), both $r^{2}_{s\text{-sim}}$ and $r^{2}_{\mathrm{Proc}}$ increase with $q$ up to a sharp transition at $q=9$, where the learned representation becomes nearly isometric ($r^{2}_{\mathrm{Proc}}\approx 0.9$).

![Refer to caption](https://arxiv.org/html/2605.13943v1/figures/mnist_2d.png)

(a) 2D Euclidean representations of MNIST learned using $w$-InfoNCE with continuous 2D labels. Top: morphometric weights and distance comparison. Bottom: learned embeddings colored by thickness (left) and slant (right).

![Refer to caption](https://arxiv.org/html/2605.13943v1/figures/mnist_11d.png)

(b) 2D Euclidean representations of MNIST using a mixed discrete–continuous weighted InfoNCE loss. Top row: loss error, coefficient of similarity and Procrustes similarity as a function of $q\in[7..13]$. Bottom row: embeddings colored by class, thickness and slant.

Figure 3: MNIST representations under different weighting schemes.

Multivariate regression. We then consider a continuous regression task based on two morphological attributes (thickness and slant). Training with weights derived from Euclidean distances in label space yields embeddings that closely approximate label geometry ($r^{2}=0.91$), with strong Procrustes alignment ($r^{2}_{\mathrm{Proc}}\approx 0.96$; Fig. [3(a)](https://arxiv.org/html/2605.13943#S4.F3.sf1)).

Mixed discrete-continuous labels. Finally, we study the mixed discrete–continuous regime by combining class labels with morphological features. The optimal structure corresponds to the product of a 9-simplex and a unit square, realizable for $q\geq 11$. As shown in Fig. [3(b)](https://arxiv.org/html/2605.13943#S4.F3.sf2), performance improves up to $q=11$. In low dimension ($q=2$), the representation is locally isometric within classes ($r^{2,\mathrm{local}}_{\mathrm{Proc}}\approx 0.93$) but not globally ($r^{2,\mathrm{global}}_{\mathrm{Proc}}\approx 0.46$).

Additional $\mathbb{X}$-CLR experiments in Appendix [E](https://arxiv.org/html/2605.13943#A5), using both one-hot and semantic label embeddings, further illustrate that consistent label–embedding geometries yield realizable optima.

### 4.2 Class imbalance breaks SupCon's regularity, but not Soft SupCon's

![Refer to caption](https://arxiv.org/html/2605.13943v1/figures/supcon_imbalanced_updated.png)

Figure 4: Similarity heatmaps representing $\cos(z_{i},z_{j})$ for 10-dimensional embeddings learned with SupCon (top) and Soft SupCon (bottom) for $C=10$ classes with varying degrees of imbalance (columns). Gray off-diagonal blocks indicate an inter-class similarity $\beta^{*}=-1/(C-1)\approx-0.111$ equivalent to a regular simplex.

To confirm the predictions made in §[3.8](https://arxiv.org/html/2605.13943#S3.SS8) regarding Hard and Soft SupCon, we optimized the $w$-InfoNCE loss with a variety of SupCon-like weighting schemes. We synthesized SupCon-like weight matrices (see Appendix [D](https://arxiv.org/html/2605.13943#A4)) with either hard zeros or the soft weight $\varepsilon=e^{-1}$ on the off-diagonal blocks. We considered $C=10$ classes with an assortment of relative sizes ranging from balanced classes to the case where all $10$ classes have a different size, in the proportions $1:2:\cdots:10$. In all cases we observed class collapse. In the soft case we chose the optimal temperature $\tau^{*}=(10-1)/10$ which, as predicted in [Corollary 3.9](https://arxiv.org/html/2605.13943#S3.Thmtheorem9), yields a regular simplex even in the imbalanced case. In contrast, class imbalance breaks the regularity of the optimal SupCon embeddings. The third part of [Theorem 3.13](https://arxiv.org/html/2605.13943#S3.Thmtheorem13) is confirmed since for SupCon the inter-class similarity associated with an off-diagonal block can be seen to depend only on that block's shape.
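A SupCon-like weight matrix of the kind described above might be synthesized as follows. This is an illustrative sketch, not the authors' code: the helper name `supcon_weights` is hypothetical, the within-class weight of 1 is an assumption, and the class sizes are the proportions $1:2:\cdots:10$ scaled by 2 (also an assumption) so that every sample has at least one positive.

```python
import numpy as np

def supcon_weights(class_sizes, eps=0.0):
    """Block weight matrix: 1 within a class, eps across classes, 0 on the diagonal."""
    labels = np.repeat(np.arange(len(class_sizes)), class_sizes)
    same = labels[:, None] == labels[None, :]
    W = np.where(same, 1.0, eps)         # within-class weight 1 is an assumption
    np.fill_diagonal(W, 0.0)             # self-pairs are excluded from the loss
    return W, labels

sizes = 2 * np.arange(1, 11)             # 10 classes in proportions 1:2:...:10
W_hard, labels = supcon_weights(sizes, eps=0.0)            # hard zeros off the blocks
W_soft, _ = supcon_weights(sizes, eps=np.exp(-1.0))        # soft weight eps = e^{-1}
print(W_hard.shape, W_soft[0, -1])
```

With hard zeros, a class of size one would yield an all-zero row and an ill-defined loss, which is why the sketch scales the proportions.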

### 4.3 Inconsistency of $y$-Aware and its correction

![Refer to caption](https://arxiv.org/html/2605.13943v1/figures/yaware_1d_mnist.png)

Figure 5: Euclidean representations of MNIST in $\mathbb{R}^{2}$ using the digit's thickness as a continuous label. Left: true 1D geometry of the problem. Center: $y$-Aware representation using $\cos$ as similarity function in $\mathcal{Z}$. Right: the corrected $y$-Aware representations learned using Euclidean distance instead.

We showed in [Theorem 3.11](https://arxiv.org/html/2605.13943#S3.Thmtheorem11) that the $y$-Aware weighting scheme generally does not reach its lower bound unless the labels lie on a hypersphere. To address this, we proposed using Euclidean distance in both the label and latent spaces. We confirm these predictions on MNIST with line thickness as a continuous label (§[4.1](https://arxiv.org/html/2605.13943#S4.SS1)). As shown in [Figure 5](https://arxiv.org/html/2605.13943#S4.F5), cosine similarity in $\mathcal{Z}$ constrains the embeddings to a spherical geometry. Replacing it with negative squared Euclidean distance (consistent with the label space) allows the encoder to recover the correct geometry ($r^{2}_{\mathrm{Proc}}=0.85$).

## 5 Discussion and Open Problems

This work advances the understanding of weighted contrastive learning in both weakly supervised and supervised settings by explicitly characterizing the global optimal representation using a rigorous geometric framework. We derived conditions on weighting schemes to guarantee the existence of optimal realizations and introduced three novel metrics for assessing them. We made several rigorous predictions which we went on to confirm in practical applications.

Limitations. The main limitation of this work is that we assumed an explicit dissimilarity matrix encoding prior knowledge about the problem through labels or meta-data. In the self-supervised case (e.g. SimCLR Chen et al. ([2020](https://arxiv.org/html/2605.13943#bib.bib10))), $D$ is a rough approximation of an *implicit* dissimilarity matrix across samples. Characterizing this matrix is difficult and should use empirical evidence and realistic assumptions Saunshi et al. ([2022](https://arxiv.org/html/2605.13943#bib.bib40)) as the basis for theoretical results. We leave this as an avenue for future research.

Overall, we believe that our geometric framework offers a fresh perspective on CL and SSL algorithms and a complementary approach to the mainstream probabilistic and graph\-based views\. We hope that our theoretical analysis of CL will be used by practitioners to better refine their weighting schemes, as well as by theorists to deepen their understanding of CL and SSL\.

## References

- A. Y. Alfakih (2018) Euclidean distance matrices and their applications in rigidity theory. Springer.
- S. Arora, H. Khandeparkar, M. Khodak, O. Plevrakis, and N. Saunshi (2019) A theoretical analysis of contrastive unsupervised representation learning. In International Conference on Machine Learning (ICML).
- P. Bachman, R. D. Hjelm, and W. Buchwalter (2019) Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems (NeurIPS).
- R. Balestriero and Y. LeCun (2022) Contrastive and non-contrastive self-supervised learning recover global and local spectral embedding methods. Advances in Neural Information Processing Systems 35, pp. 26671–26685.
- C. A. Barbano, B. Dufumier, E. Duchesnay, M. Grangetto, and P. Gori (2023) Contrastive learning for regression in multi-site brain age prediction. In 2023 IEEE 20th International Symposium on Biomedical Imaging (ISBI), pp. 1–4.
- T. Behnia and C. Thrampoulidis (2024) Supervised contrastive representation learning: landscape analysis with unconstrained features. In 2024 IEEE International Symposium on Information Theory (ISIT), pp. 575–580.
- Y. Bengio, A. Courville, and P. Vincent (2013) Representation learning: a review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35(8), pp. 1798–1828.
- T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607.
- B. Dufumier, P. Gori, J. Victor, A. Grigis, M. Wessa, P. Brambilla, P. Favre, M. Polosan, C. Mcdonald, C. M. Piguet, et al. (2021) Contrastive learning with continuous proxy meta-data for 3d mri classification. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part II 24, pp. 58–68.
- F. Graf, C. Hofer, M. Niethammer, and R. Kwitt (2021) Dissecting supervised contrastive learning. In International Conference on Machine Learning, pp. 3821–3830.
- R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06), Vol. 2, pp. 1735–1742.
- J. Z. HaoChen and T. Ma (2022) A theoretical study of inductive biases in contrastive learning. arXiv preprint arXiv:2211.14699.
- J. Z. HaoChen, C. Wei, A. Gaidon, and T. Ma (2021) Provable guarantees for self-supervised deep learning with spectral contrastive loss. Advances in Neural Information Processing Systems 34, pp. 5000–5011.
- K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9729–9738.
- G. E. Hinton and S. Roweis (2002) Stochastic neighbor embedding. Advances in neural information processing systems 15.
- R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, A. Trischler, and Y. Bengio (2018) Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.
- P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan (2020) Supervised contrastive learning. Advances in neural information processing systems 33, pp. 18661–18673.
- A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images.
- Y. LeCun, C. Cortes, and C. J.C. Burges (1998) The mnist database of handwritten digits. Note: [http://yann.lecun.com/exdb/mnist/](http://yann.lecun.com/exdb/mnist/)
- L. Liberti, C. Lavor, N. Maculan, and A. Mucherino (2014) Euclidean distance geometry and applications. SIAM review 56(1), pp. 3–69.
- T. Nguyen, R. Jiang, S. Aeron, P. Ishwar, and D. R. Brown (2024) On neural collapse in contrastive learning with imbalanced datasets. In Proceedings of the 2024 IEEE 34th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6.
- B. Poole, S. Ozair, A. Van Den Oord, A. Alemi, and G. Tucker (2019) On variational bounds of mutual information. In International conference on machine learning, pp. 5171–5180.
- A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. In International conference on machine learning, pp. 8748–8763.
- N. Saunshi, J. T. Ash, S. Goel, D. Misra, C. Zhang, S. Arora, S. Kakade, and A. Krishnamurthy (2022) Understanding contrastive learning requires incorporating inductive biases. In International Conference on Machine Learning (ICML).
- S. Schneider, J. H. Lee, and M. W. Mathis (2023) Learnable latent embeddings for joint behavioural and neural analysis. Nature 617(7960), pp. 360–368.
- V. Sobal, M. Ibrahim, R. Balestriero, V. Cabannes, D. Bouchacourt, P. Astolfi, K. Cho, and Y. LeCun (2024) X-sample contrastive loss: improving contrastive learning with sample similarity graphs. arXiv preprint arXiv:2407.18134.
- M. Tschannen, J. Djolonga, P. K. Rubenstein, S. Gelly, and M. Lucic (2020) On mutual information maximization for representation learning. In International Conference on Learning Representations (ICLR).
- A. van den Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
- T. Wang and P. Isola (2020) Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning (ICML).
- K. Zha, P. Cao, J. Son, Y. Yang, and D. Katabi (2023) Rank-n-contrast: learning continuous representations for regression. Advances in Neural Information Processing Systems 36, pp. 17882–17903.

## Appendix A Entropic lower bound

See [Theorem 3.3](https://arxiv.org/html/2605.13943#S3.Thmtheorem3).

###### Proof.

We define two 2D discrete distributions over $[1..n]^{2}$ via the density functions

$$p_{W}(i,j)=\frac{\delta_{i\neq j}w_{ij}}{n\sum_{k\neq i}w_{ik}},\qquad p_{S}(i,j)=\frac{\delta_{i\neq j}\exp{s_{ij}}}{n\sum_{k\neq i}\exp{s_{ik}}}.$$

Then it is easy to see that $\mathcal{L}_{\mathrm{NCE}}^{W}$ equals the cross-entropy of $p_{S}$ relative to $p_{W}$:

$$\mathcal{L}_{\mathrm{NCE}}^{W}=H(p_{W},p_{S}):=-\sum_{i,j}p_{W}(i,j)\log p_{S}(i,j),$$

where we make use of the convention that $0\log 0=0$. By Gibbs' inequality, $H(p_{W})\leq H(p_{W},p_{S})$, where $H(p_{W})$ is the entropy of $p_{W}$, with equality iff $p_{W}=p_{S}$. Suppose this is the case, then for each $i\neq j$,

$$\frac{w_{ij}}{\sum_{k\neq i}w_{ik}}=\frac{\exp{s_{ij}}}{\sum_{k\neq i}\exp{s_{ik}}},$$

which is equivalent to

$$\frac{\exp s_{ij}}{w_{ij}}=\frac{\sum_{k\neq i}\exp{s_{ik}}}{\sum_{k\neq i}w_{ik}}.\tag{5}$$

Note that the RHS of [Equation 5](https://arxiv.org/html/2605.13943#A1.E5) does not depend on $j$. But since the LHS is symmetric, the RHS must not depend on $i$ either. This means that ([5](https://arxiv.org/html/2605.13943#A1.E5)) is in fact a constant $C$. Since $w_{ij}>0$ when $i\neq j$, $C>0$. Thus $s_{ij}=\log(w_{ij})+c$ where $c=\log C$, as desired. Conversely, one easily verifies that if $s_{ij}=\log(w_{ij})+c$ then $p_{W}=p_{S}$, hence $\mathcal{L}_{\mathrm{NCE}}^{W}=H(p_{W},p_{S})$ is minimal.

Finally note that the RHS of ([1](https://arxiv.org/html/2605.13943#S3.E1)) is precisely $H(p_{W})$. ∎
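The converse step of the proof can be spelled out in one line. Substituting $s_{ij}=\log(w_{ij})+c$ into the definition of $p_{S}$, the factor $e^{c}$ cancels between numerator and denominator:

```latex
p_{S}(i,j)
  = \frac{\delta_{i\neq j}\, e^{c}\, w_{ij}}{n \sum_{k\neq i} e^{c}\, w_{ik}}
  = \frac{\delta_{i\neq j}\, w_{ij}}{n \sum_{k\neq i} w_{ik}}
  = p_{W}(i,j).
```

This also makes explicit why the optimal similarities are only determined up to the additive constant $c$.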

## Appendix B Characterizations of the DGP

### B.1 Euclidean DGP

We state a classical characterization of EDMs and their embedding dimensions. The proof can be found in Alfakih [[2018](https://arxiv.org/html/2605.13943#bib.bib23)] (Theorems 3.2 and 3.8).

###### Lemma B.1.

A symmetric matrix $D\in\mathbb{R}^{n\times n}$ with vanishing diagonal entries is an EDM if and only if the centered Gram matrix $B:=(-1/2)JDJ$, where $J=I_{n}-(1/n)E_{n}$ is the centering matrix, is positive semi-definite (PSD). If such is the case then the embedding dimension $r$ of $D$ is equal to $r=\operatorname{rank}(B)$, and moreover $\operatorname{rank}(D)$ is either $r+1$ or $r+2$.
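The criterion of Lemma B.1 translates into a short numerical test. The sketch below is illustrative only: the function names and the eigenvalue tolerance are assumptions, and the rank is computed numerically rather than exactly.

```python
import numpy as np

def centered_gram(D):
    """B = -(1/2) J D J with the centering matrix J = I_n - (1/n) E_n."""
    n = D.shape[0]
    J = np.eye(n) - np.full((n, n), 1.0 / n)
    return -0.5 * J @ D @ J

def edm_embedding_dim(D, tol=1e-9):
    """Return the embedding dimension r if D is an EDM, else None."""
    eigvals = np.linalg.eigvalsh(centered_gram(D))
    scale = max(np.abs(eigvals).max(), 1.0)
    if eigvals.min() < -tol * scale:
        return None                               # B is not PSD, so D is not an EDM
    return int((eigvals > tol * scale).sum())     # numerical rank of B

rng = np.random.default_rng(0)
P = rng.normal(size=(10, 3))                                  # 10 points in R^3
D = ((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=-1)       # squared distances
r = edm_embedding_dim(D)
rank_D = np.linalg.matrix_rank(D, tol=1e-8)

# Squared "distances" 1, 1, 9 violate the triangle inequality (1 + 1 < 3).
D_bad = np.array([[0.0, 1.0, 1.0], [1.0, 0.0, 9.0], [1.0, 9.0, 0.0]])
print(r, rank_D, edm_embedding_dim(D_bad))
```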

See [Theorem 3.7](https://arxiv.org/html/2605.13943#S3.Thmtheorem7).

###### Proof.

Recall that $E_{n}$ denotes the $n\times n$ matrix with unit coefficients and $I_{n}$ the $n\times n$ identity matrix, and $J\coloneqq I_{n}-(1/n)E_{n}$ is the so-called *centering matrix*. Since $\mathcal{L}_{\mathrm{NCE}}^{W}$ does not depend on the diagonal of $D$, we may assume WLOG that the diagonal of $D$ is identically zero.

$(\Rightarrow)$ Assume there exists $Z^{*}\in\mathbb{R}^{n\times q}$ which attains the entropic lower bound. By [Corollary 3.4](https://arxiv.org/html/2605.13943#S3.Thmtheorem4) there is a constant $c$ such that $\|z_{i}-z_{j}\|^{2}=d_{ij}+c$ for all $i\neq j$. It follows that $D^{\prime}=(d_{ij}+c\delta_{i\neq j})=D+c(E_{n}-I_{n})$ is an EDM with embedding dimension $r\leq q$. Assume for the sake of a contradiction that $c\neq 0$. Since $D^{\prime}$ is an EDM with embedding dimension $r$, it follows from [Lemma B.1](https://arxiv.org/html/2605.13943#A2.Thmtheorem1) that $r=\operatorname{rank}(JD^{\prime}J)$. But $J(E_{n}-I_{n})J=-J$, hence $JD^{\prime}J=JDJ-cJ$. Since $\operatorname{rank}(cJ)=\operatorname{rank}(J)=n-1$ and $\operatorname{rank}(JDJ)\leq\operatorname{rank}(D)\leq q+2$, it follows that

$$r=\operatorname{rank}(JD^{\prime}J)=\operatorname{rank}(JDJ-cJ)\geq\left|\operatorname{rank}(JDJ)-\operatorname{rank}(cJ)\right|\geq(n-1)-(q+2)>q,$$

which contradicts the fact that $r\leq q$. This proves that $c=0$, hence $D=D^{\prime}$ is an EDM with embedding dimension $r\leq q$.

$(\Leftarrow)$ Conversely, if $D$ is an EDM with embedding dimension $r\leq q$ then any realization $Z^{*}\in\mathbb{R}^{n\times q}$ solves [Equation 2](https://arxiv.org/html/2605.13943#S3.E2) with $c=0$.

Finally, note that any two solutions to Equation ([2](https://arxiv.org/html/2605.13943#S3.E2)) in $\mathbb{R}^{q}$ with $c=0$ have the same EDM, hence they differ by a Euclidean isometry. ∎
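The rank argument in the forward direction can be checked numerically on a toy example: shifting the off-diagonal entries of a low-dimensional EDM by a constant $c\neq 0$ makes the centered matrix nearly full rank. The values of `n`, `q` and `c` below are arbitrary illustrative choices, and `rank_centered` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 12, 3
Z = rng.normal(size=(n, q))
D = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)   # EDM realized in R^q

E, I = np.ones((n, n)), np.eye(n)
J = I - E / n                                             # centering matrix

def rank_centered(D_mat):
    """Numerical rank of J D J, i.e. the embedding dimension when D is an EDM."""
    return np.linalg.matrix_rank(J @ D_mat @ J, tol=1e-8)

c = 0.5
D_shift = D + c * (E - I)                                 # D' = D + c (E_n - I_n)
rank_D = rank_centered(D)            # equals q for generic points
rank_shift = rank_centered(D_shift)  # equals n - 1 when c > 0
print(rank_D, rank_shift)
```

For $c>0$ the matrix $JD^{\prime}J=JDJ-cJ$ is negative definite on the orthogonal complement of the all-ones vector, which is why the shifted rank reaches $n-1$ here.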

### B.2 Spherical DGP

We state without proof a useful characterization of spherical EDMs:

###### Lemma B.2.

An $n\times n$ EDM $D$ of embedding dimension $r\leq n-2$ is spherical iff $\operatorname{rank}(D)=r+1$, or equivalently if there exists a scalar $\beta$ such that $\beta E_{n}-D$ is PSD. Moreover, the smallest $\beta$ with this property is $\beta=2\rho^{2}$, where $\rho$ is the radius of $D$.

###### Proof.

See Alfakih [[2018](https://arxiv.org/html/2605.13943#bib.bib23)], Theorems 4.2 and 4.3. ∎
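The rank test of Lemma B.2 can be illustrated on two toy label sets: points on a circle, which are spherical, and scalar labels on a line, which are not. The helper names, data and tolerance are assumptions made for illustration; the embedding dimension $r$ is supplied by hand rather than computed.

```python
import numpy as np

def sq_dist_matrix(P):
    """EDM of the rows of P."""
    return ((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=-1)

def is_spherical_edm(D, r, tol=1e-8):
    """Lemma B.2 rank test; assumes D is an EDM of embedding dimension r <= n - 2."""
    return bool(np.linalg.matrix_rank(D, tol=tol) == r + 1)

# Labels on a unit circle (embedding dimension 2).
theta = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)

# Scalar labels on a line (embedding dimension 1), e.g. a 1D attribute.
line = np.array([0.0, 0.7, 1.1, 2.5, 3.0, 4.2])[:, None]

circle_ok = is_spherical_edm(sq_dist_matrix(circle), r=2)   # rank(D) = r + 1
line_ok = is_spherical_edm(sq_dist_matrix(line), r=1)       # rank(D) = r + 2
print(circle_ok, line_ok)
```

Three or more distinct collinear points can never lie on a common sphere, which matches the failure of the rank test for the scalar labels.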

See [Theorem 3.8](https://arxiv.org/html/2605.13943#S3.Thmtheorem8).

###### Proof.

Assume WLOG that the diagonal of $D$ is identically zero.

$(\Rightarrow)$ Assume there exists $Z^{*}$ which attains the entropic lower bound, then by [Corollary 3.5](https://arxiv.org/html/2605.13943#S3.Thmtheorem5) there exists $c^{\prime}\in\mathbb{R}$ such that $\cos(z_{i},z_{j})=-\tau d_{ij}+c^{\prime}$ for each $i\neq j$. Denote by $\widetilde{z}_{i}=z_{i}/\|z_{i}\|$ the normalized realizations. Then for each $i\neq j$,

$$\begin{aligned}\|\widetilde{z}_{i}-\widetilde{z}_{j}\|^{2}&=\|\widetilde{z}_{i}\|^{2}+\|\widetilde{z}_{j}\|^{2}-2\langle\widetilde{z}_{i},\widetilde{z}_{j}\rangle\\&=1+1-2\cos(z_{i},z_{j})\\&=2(\tau d_{ij}+1-c^{\prime}).\end{aligned}$$

Thus, $D^{\prime}:=2\tau D+2(1-c^{\prime})(E_{n}-I_{n})$ is an EDM with embedding dimension $r\leq q$. In particular $\operatorname{rank}(D^{\prime})\leq r+2\leq q+2$. It follows that $c^{\prime}=1$, else we would have

$$q+2\geq\operatorname{rank}(D^{\prime})\geq|\operatorname{rank}(D)-\operatorname{rank}(E_{n}-I_{n})|\geq(n-1)-(q+1)>q+2,$$

a contradiction. Thus, $c^{\prime}=1$ and $D=D^{\prime}/(2\tau)$ is an EDM with embedding dimension $r$.

Note that $2E_{n}-D^{\prime}=2\widetilde{Z}\widetilde{Z}^{T}$ since for each $i,j$, $\langle\widetilde{z}_{i},\widetilde{z}_{j}\rangle=\cos(z_{i},z_{j})=1-\tau d_{ij}$. In particular, $2E_{n}-D^{\prime}$ is PSD. Since $r\leq q\leq n-2$ (as $n>2q+4$), this means by [Lemma B.2](https://arxiv.org/html/2605.13943#A2.Thmtheorem2) that $D^{\prime}$ is spherical, and moreover that its radius $\rho^{\prime}$ satisfies $\rho^{\prime}\leq 1$. It follows that $D=D^{\prime}/(2\tau)$ is spherical and its radius $\rho$ satisfies $\rho=\rho^{\prime}/\sqrt{2\tau}\leq 1/\sqrt{2\tau}$. If $r=q$ then the $\widetilde{z}_{i}$ are generating points of $D^{\prime}$, hence $\rho^{\prime}=1$ and $\rho=1/\sqrt{2\tau}$.

$(\Leftarrow)$ Conversely, assume that $D$ is a spherical EDM with embedding dimension $r\leq q$ and radius $\rho$.

Assume first that $r=q$ and $\rho=1/\sqrt{2\tau}$. Thus, there exists a spherical realization $z_{1},\dots,z_{n}\in\mathbb{R}^{q}$ such that $\|z_{i}-z_{j}\|^{2}=d_{ij}$ and $\|z_{i}\|=1/\sqrt{2\tau}$. It follows that for all $i,j$,

$$\begin{aligned}\cos(z_{i},z_{j})&=2\tau\langle z_{i},z_{j}\rangle\\&=\tau(\|z_{i}\|^{2}+\|z_{j}\|^{2}-\|z_{i}-z_{j}\|^{2})\\&=1-\tau d_{ij},\end{aligned}$$

so $Z$ is a solution to [Equation 3](https://arxiv.org/html/2605.13943#S3.E3) with $c^{\prime}=1$.

On the other hand, if $r<q$ and $\rho\leq 1/\sqrt{2\tau}$ then there exists a realization $p_{1},\dots,p_{n}\in\mathbb{R}^{r}$ of $D$, with $\|p_{i}\|=\rho$ and $\|p_{i}-p_{j}\|^{2}=d_{ij}$. We set

$$z_{i}:=\left(p_{i},\sqrt{\frac{1}{2\tau}-\rho^{2}},0,\dots,0\right)\in\mathbb{R}^{q}.$$

These points obviously have the same EDM as the $p_{i}$, and since $\|z_{i}\|^{2}=\|p_{i}\|^{2}+1/(2\tau)-\rho^{2}=1/(2\tau)$ we can conclude as before that $Z$ is a solution to [Equation 3](https://arxiv.org/html/2605.13943#S3.E3).

Note finally that any two solutions to [Equation 3](https://arxiv.org/html/2605.13943#S3.E3) must satisfy $c^{\prime}=1$, hence they have the same cosine matrix, thus their normalizations differ by a linear Euclidean isometry. ∎

## Appendix C Practical cases

### C.1 Supervised multivariate regression case

See[3\.10](https://arxiv.org/html/2605.13943#S3.Thmtheorem10)

###### Proof\.

Sinceyi∈ℝℓy\_\{i\}\\in\\mathbb\{R\}^\{\\ell\}andℓ≤q\\ell\\leq q,DDis an EDM with embedding dimension at mostqq, sorank⁡\(D\)≤q\+2\\operatorname\{rank\}\(D\)\\leq q\+2\. By[Theorem˜3\.7](https://arxiv.org/html/2605.13943#S3.Thmtheorem7)there exists an essentially unique embeddingZ∗Z^\{\*\}which attains the entropic lower bound\. The statedZ∗Z^\{\*\}is one such embedding and any other differs from it by a Euclidean isometry\. ∎

See[3\.11](https://arxiv.org/html/2605.13943#S3.Thmtheorem11)

###### Proof\.

Note thatD=\(‖yi−yj‖2\)D=\(\\\|y\_\{i\}\-y\_\{j\}\\\|^\{2\}\)is an EDM with embedding dimensionr≤qr\\leq q\(becauseyi∈ℝℓy\_\{i\}\\in\\mathbb\{R\}^\{\\ell\}andℓ≤q\\ell\\leq q\) sorank⁡\(D\)≤q\+2\\operatorname\{rank\}\(D\)\\leq q\+2\. If there exists an embedding that attains the entropic lower bound then according to[Lemma˜B\.2](https://arxiv.org/html/2605.13943#A2.Thmtheorem2),DDis a spherical EDM so theyiy\_\{i\}lie on a hypersphere\. ∎

See[3\.12](https://arxiv.org/html/2605.13943#S3.Thmtheorem12)

###### Proof\.

Assume WLOG that theyiy\_\{i\}are normalized\. Borrowing the notation from[Theorem˜3\.8](https://arxiv.org/html/2605.13943#S3.Thmtheorem8), letD=\(di​j\)D=\(d\_\{ij\}\)withdi​j=−cos⁡\(yi,yj\)/τ′d\_\{ij\}=\-\\cos\(y\_\{i\},y\_\{j\}\)/\\tau^\{\\prime\}fori,j∈\[1\.\.n\]i,j\\in\[1\.\.n\]be the dissimilarity matrix associated toWW, withwi​j=exp⁡\(−di​j\)w\_\{ij\}=\\exp\(\-d\_\{ij\}\)\. Since theyiy\_\{i\}are normalized, it holds that

$$d_{ij} = \frac{1}{\tau'}\left(\frac{\|y_i-y_j\|^2}{2}-1\right).$$

Thus, $D' = (d'_{ij})$ with $d'_{ij} := d_{ij}+1/\tau'$ is a spherical EDM with embedding dimension $r\leq\ell$ and radius $\rho\leq 1/\sqrt{2\tau'}$. Since $D$ and $D'$ differ by an additive constant and owing to [Corollary 3.5](https://arxiv.org/html/2605.13943#S3.Thmtheorem5), replacing $D$ by $D'$ does not change the set of points that attain the entropic lower bound of the $w$-InfoNCE loss. Since $\operatorname{rank}(D') = \ell+1 < q+1$ and $\rho\leq 1/\sqrt{2\tau}$ (as $\tau\leq\tau'$), it follows from [Theorem 3.8](https://arxiv.org/html/2605.13943#S3.Thmtheorem8) that there is an essentially unique solution that attains the entropic lower bound for the dissimilarity matrix $D'$ (and hence for $D$).

It is easy to see that $z_i^{*} = (y_i,\sqrt{\tau'/\tau-1},0,\ldots,0)$ is one such solution since it satisfies ([3](https://arxiv.org/html/2605.13943#S3.E3)) with $c' = 1-\tau/\tau'$.
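For readers who prefer a numerical sanity check, the sketch below verifies the stated solution on random normalized labels. It assumes that Equation 3 has the form $\cos(z_i,z_j)=c'-\tau d_{ij}$; the sizes and temperatures are arbitrary illustrative values with $\tau\leq\tau'$ and $\ell<q$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, ell, q = 5, 3, 6                      # illustrative sizes with ell < q
tau, tau_p = 0.2, 0.5                    # tau <= tau' (tau_p stands for tau')

Y = rng.normal(size=(n, ell))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)   # normalized labels y_i
D = -(Y @ Y.T) / tau_p                   # d_ij = -cos(y_i, y_j) / tau'

Z_star = np.zeros((n, q))
Z_star[:, :ell] = Y
Z_star[:, ell] = np.sqrt(tau_p / tau - 1.0)     # constant extra coordinate

norms = np.linalg.norm(Z_star, axis=1)
Zn = Z_star / norms[:, None]
cos_z = Zn @ Zn.T

c_prime = 1.0 - tau / tau_p
residual = np.max(np.abs(cos_z - (c_prime - tau * D)))  # assumed form of Equation 3
```

Each $z_i^{*}$ has squared norm $\tau'/\tau$, which is what turns $\langle y_i,y_j\rangle$ into $(\tau/\tau')\cos(y_i,y_j)+1-\tau/\tau'$ after normalization.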

∎

## Appendix D Sparse weights

Previously, we restricted our attention to the case where $w_{ij}>0$ for any $i\neq j$ because it leads to a simple characterization of the minimum of $\mathcal{L}_{\mathrm{NCE}}^{W}$ ([Theorem 3.3](https://arxiv.org/html/2605.13943#S3.Thmtheorem3)). However, relaxing that condition to $w_{ij}\geq 0$ is of interest since in practical applications, $W$ is often sparse. Note, however, that $W$ must contain sufficiently many nonzero entries so as to avoid divisions by $0$ in the expression of $\mathcal{L}_{\mathrm{NCE}}^{W}$. Specifically, we require that $\sum_{k\neq i}w_{ik}>0$ for every $i$, which is equivalent to saying that each row or column of $W$ has at least one nonzero off-diagonal entry. Let us call symmetric nonnegatively-valued matrices with this property *well-conditioned*.

We now turn our attention to the sparse weight matrices $W=(w_{ij})$ encountered in practical applications, namely the SupCon loss of Khosla et al. [[2020](https://arxiv.org/html/2605.13943#bib.bib7)], where $w_{ij}$ is defined as $1$ if samples $i$ and $j$ are in the same class, and $0$ otherwise. By assuming WLOG that a given class is represented by a contiguous set of indices, we can represent $W$ in block form as follows:

$$W=\begin{pmatrix}\mathbf{1}_{\ell_1\times\ell_1}&\mathbf{0}&\cdots&\mathbf{0}\\ \mathbf{0}&\mathbf{1}_{\ell_2\times\ell_2}&\cdots&\mathbf{0}\\ \vdots&\vdots&\ddots&\vdots\\ \mathbf{0}&\mathbf{0}&\cdots&\mathbf{1}_{\ell_C\times\ell_C}\end{pmatrix}\in\mathbb{R}^{n\times n},\tag{6}$$

where $C$ is the number of classes, $\ell_c$ the size of the $c$th class, and $\mathbf{1}_{\ell_c\times\ell_c}=E_{\ell_c}$ the $\ell_c\times\ell_c$ matrix of ones. We will always assume that $\ell_c\geq 2$, so that $W$ is well-conditioned. We call matrices of this form *SupCon-like*.
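As an illustration, a SupCon-like matrix can be built directly from integer class labels. This is a minimal sketch; the function names `supcon_weights` and `is_well_conditioned` are hypothetical and only mirror the definitions given above.

```python
import numpy as np

def supcon_weights(labels):
    """w_ij = 1 if samples i and j share a class, 0 otherwise."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def is_well_conditioned(W):
    """Symmetric, nonnegative, and every row has a nonzero off-diagonal entry."""
    off_diag = W - np.diag(np.diag(W))
    return bool(np.allclose(W, W.T) and np.all(W >= 0)
                and np.all(off_diag.sum(axis=1) > 0))

W = supcon_weights([0, 0, 0, 1, 1])      # class sizes (3, 2): block-diagonal as in (6)
W_bad = supcon_weights([0, 0, 1])        # a singleton class violates ell_c >= 2
ok, bad = is_well_conditioned(W), is_well_conditioned(W_bad)
```

With sorted labels the result is exactly the block form (6); unsorted labels give the same matrix up to a simultaneous permutation of rows and columns.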

In what follows, we denote by𝒮\\mathscr\{S\}some space ofn×nn\\times ndissimilarity matrices whereSSis allowed to vary\. We make only one assumption about𝒮\\mathscr\{S\}, namely thatexp⁡\(𝒮\)\\exp\(\\mathscr\{S\}\), i\.e\. the set of matrices in𝒮\\mathscr\{S\}transformed pointwise by the exponential function, is bounded\. In practice,𝒮\\mathscr\{S\}will be chosen as one of the following sets, which correspond, respectively, to the Euclidean and spherical setting:

$$\begin{aligned}
\mathscr{S}_{\mathrm{Eucl.}} &:= \left\{-D \mid D\ \text{an EDM of embedding dimension}\leq q\right\},\\
\mathscr{S}_{\mathrm{sph.}} &:= \left\{\frac{1}{\tau}G \mid G\ \text{a cosine matrix of rank}\leq q\right\}.
\end{aligned}$$

### D.1 A condition for the optimality of the entropic lower bound

In the proof of[Theorem˜3\.3](https://arxiv.org/html/2605.13943#S3.Thmtheorem3), the existence of a minimizer crucially relied on the assumption thatwi​jw\_\{ij\}has positive entries in order forlog⁡\(wi​j\)\\log\(w\_\{ij\}\)to be well\-defined\. Recall the expression for the minimum value, namely

$$H(p_W) = -\frac{1}{n}\sum_{i\neq j}\frac{w_{ij}}{\sum_{k\neq i}w_{ik}}\log\left(\frac{w_{ij}}{\sum_{k\neq i}w_{ik}}\right),\tag{7}$$

where we borrow the notations $p_W$ and $H(-)$ used in the proof of that theorem. Remark that the RHS of ([7](https://arxiv.org/html/2605.13943#A4.E7)) has a well-defined meaning if the strict positivity assumption on $W$ is relaxed and we instead only require that $W$ be well-conditioned. Indeed, $t\log(t)\to 0$ as $t\to 0^{+}$, so $H(p_W)$ can be evaluated using the convention $0\log(0):=0$. Thus, we ask ourselves whether the lower bound $H(p_W)\leq\mathcal{L}_{\mathrm{NCE}}^{W}$ continues to hold.
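The convention $0\log(0):=0$ is straightforward to implement. The sketch below evaluates the right-hand side of (7) for a well-conditioned $W$; the function name `entropy_lower_bound` is a hypothetical choice. For a SupCon-like $W$ each row of $p_W$ is uniform over $\ell_c-1$ entries, so the value should equal $\frac{1}{n}\sum_c \ell_c\log(\ell_c-1)$.

```python
import numpy as np

def entropy_lower_bound(W):
    """Evaluate H(p_W) from (7), using the convention 0 * log(0) := 0."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    off = W.copy()
    np.fill_diagonal(off, 0.0)                  # only pairs i != j contribute
    P = off / off.sum(axis=1, keepdims=True)    # row-normalized weights
    terms = np.zeros_like(P)
    mask = P > 0
    terms[mask] = P[mask] * np.log(P[mask])     # zero entries contribute 0
    return -terms.sum() / n

labels = np.array([0, 0, 0, 1, 1])              # class sizes (3, 2)
W = (labels[:, None] == labels[None, :]).astype(float)
H = entropy_lower_bound(W)
H_closed_form = (3 * np.log(2) + 2 * np.log(1)) / 5
```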

The following theorem answers that question in the affirmative and shows that the bound is necessarily strict when $W$ contains zeros. Moreover, we give a necessary and sufficient condition for optimality of the entropic bound, which we characterize in terms of $\operatorname{cl}(\exp(\mathscr{S}))$, where $\operatorname{cl}(-)$ denotes the topological closure of a set.

###### Theorem D\.1\.

LetWWbe a well\-conditionedn×nn\\times nmatrix\.

1. 1\.The boundH​\(pW\)≤ℒNCEW​\(S\)H\(p\_\{W\}\)\\leq\\mathcal\{L\}\_\{\\mathrm\{NCE\}\}^\{W\}\(S\)holds for allS∈𝒮S\\in\\mathscr\{S\}\.
2. 2\.IfWWcontains at least one off\-diagonal zero then that bound is strict\.
3. 3\.If\(S\(m\)\)m∈ℕ\(S^\{\(m\)\}\)\_\{m\\in\\mathbb\{N\}\}is a sequence of matrices in𝒮\\mathscr\{S\}such thatexp⁡\(S\(m\)\)→W\\exp\(S^\{\(m\)\}\)\\to Wasm→∞m\\to\\infty, thenℒNCEW​\(S\(m\)\)→H​\(pW\)\.\\mathcal\{L\}\_\{\\mathrm\{NCE\}\}^\{W\}\(S^\{\(m\)\}\)\\to H\(p\_\{W\}\)\.
4. 4\.The boundH​\(pW\)≤ℒNCEW​\(S\)H\(p\_\{W\}\)\\leq\\mathcal\{L\}\_\{\\mathrm\{NCE\}\}^\{W\}\(S\)is optimal iffc​W∈cl⁡\(exp⁡\(𝒮\)\)cW\\in\\operatorname\{cl\}\(\\exp\(\\mathscr\{S\}\)\)up to diagonal elements, for somec\>0c\>0\.

###### Proof\.

The first statement follows from Gibbs’ inequality, as in the proof of[Theorem˜3\.3](https://arxiv.org/html/2605.13943#S3.Thmtheorem3)\. In that proof we saw that the condition for the lower bound being attained amounts to the equality

$$\frac{w_{ij}}{\sum_{k\neq i}w_{ik}} = \frac{\exp s_{ij}}{\sum_{k\neq i}\exp s_{ik}},$$

for each $i\neq j$. This is plainly impossible if $w_{ij}=0$ (since the RHS is nonzero), in which case the bound is strict.

For the third statement, let\(S\(m\)\)m∈ℕ\(S^\{\(m\)\}\)\_\{m\\in\\mathbb\{N\}\}be a sequence of matrices in𝒮\\mathscr\{S\}such thatW\(m\):=exp⁡\(S\(m\)\)→WW^\{\(m\)\}:=\\exp\(S^\{\(m\)\}\)\\to Wasm→∞m\\to\\infty\. Then,

ℒNCEW​\(S\(m\)\)=−1n​∑i≠jwi​j∑k≠iwi​k​log⁡\(wi​j\(m\)∑k≠iwi​k\(m\)\)\.\\mathcal\{L\}\_\{\\textrm\{NCE\}\}^\{W\}\(S^\{\(m\)\}\)=\-\\frac\{1\}\{n\}\\sum\_\{i\\neq j\}\\frac\{w\_\{ij\}\}\{\\sum\_\{k\\neq i\}w\_\{ik\}\}\\log\\left\(\\frac\{w^\{\(m\)\}\_\{ij\}\}\{\\sum\_\{k\\neq i\}w^\{\(m\)\}\_\{ik\}\}\\right\)\.
Denote byffthe function defined fort\>0t\>0byf​\(t\)=t​log⁡\(t\)f\(t\)=t\\log\(t\)and continuously extended att=0t=0by settingf​\(0\)=0f\(0\)=0\. We want to show that for eachi≠ji\\neq j,

wi​j∑k≠iwi​k​log⁡\(wi​j\(m\)∑k≠iwi​k\(m\)\)⟶f​\(wi​j∑k≠iwi​k\)as​m→∞\.\\frac\{w\_\{ij\}\}\{\\sum\_\{k\\neq i\}w\_\{ik\}\}\\log\\left\(\\frac\{w^\{\(m\)\}\_\{ij\}\}\{\\sum\_\{k\\neq i\}w^\{\(m\)\}\_\{ik\}\}\\right\)\\longrightarrow f\\left\(\\frac\{w\_\{ij\}\}\{\\sum\_\{k\\neq i\}w\_\{ik\}\}\\right\)\\quad\\textrm\{as \}m\\to\\infty\.Ifwi​j=0w\_\{ij\}=0the statement is trivially true sincef​\(0\)=0f\(0\)=0by definition\. On the other hand, ifwi​j\>0w\_\{ij\}\>0thenlog\\logis continuous atwi​jw\_\{ij\}, so

wi​j∑k≠iwi​k​log⁡\(wi​j\(m\)∑k≠iwi​k\(m\)\)⟶wi​j∑k≠iwi​k​log⁡\(wi​j∑k≠iwi​k\)=f​\(wi​j∑k≠iwi​k\)\.\\frac\{w\_\{ij\}\}\{\\sum\_\{k\\neq i\}w\_\{ik\}\}\\log\\left\(\\frac\{w^\{\(m\)\}\_\{ij\}\}\{\\sum\_\{k\\neq i\}w^\{\(m\)\}\_\{ik\}\}\\right\)\\longrightarrow\\frac\{w\_\{ij\}\}\{\\sum\_\{k\\neq i\}w\_\{ik\}\}\\log\\left\(\\frac\{w\_\{ij\}\}\{\\sum\_\{k\\neq i\}w\_\{ik\}\}\\right\)=f\\left\(\\frac\{w\_\{ij\}\}\{\\sum\_\{k\\neq i\}w\_\{ik\}\}\\right\)\.It follows that

ℒNCEW​\(S\(m\)\)⟶−1n​∑i≠jf​\(wi​j∑k≠iwi​k\)=H​\(pW\),\\mathcal\{L\}\_\{\\textrm\{NCE\}\}^\{W\}\(S^\{\(m\)\}\)\\longrightarrow\-\\frac\{1\}\{n\}\\sum\_\{i\\neq j\}f\\left\(\\frac\{w\_\{ij\}\}\{\\sum\_\{k\\neq i\}w\_\{ik\}\}\\right\)=H\(p\_\{W\}\),which proves the third statement\.

For the “if” part of the fourth statement, let us assume thatc​W∈cl⁡\(exp⁡\(𝒮\)\)cW\\in\\operatorname\{cl\}\(\\exp\(\\mathscr\{S\}\)\)up to diagonal elements, for somec\>0c\>0\. Then there is a sequence\(S\(m\)\)m∈ℕ\(S^\{\(m\)\}\)\_\{m\\in\\mathbb\{N\}\}such thatexp⁡\(S\(m\)\)→c​W\\exp\(S^\{\(m\)\}\)\\to cW, up to diagonal elements\. Sincec​WcWis well\-conditioned, it follows thatℒNCEW​\(S\(m\)\)→H​\(pc​W\)=H​\(pW\)\\mathcal\{L\}\_\{\\mathrm\{NCE\}\}^\{W\}\(S^\{\(m\)\}\)\\to H\(p\_\{cW\}\)=H\(p\_\{W\}\)\. This proves that the bound is optimal\.

Conversely, let us assume that the boundH​\(pW\)≤ℒNCEW​\(S\)H\(p\_\{W\}\)\\leq\\mathcal\{L\}\_\{\\mathrm\{NCE\}\}^\{W\}\(S\)is optimal, so there exists a sequence\(S\(m\)\)m∈ℕ\(S^\{\(m\)\}\)\_\{m\\in\\mathbb\{N\}\}such thatℒNCEW​\(S\(m\)\)→H​\(pW\)\\mathcal\{L\}\_\{\\mathrm\{NCE\}\}^\{W\}\(S^\{\(m\)\}\)\\to H\(p\_\{W\}\)\. LetV\(m\):=exp⁡\(S\(m\)\)V^\{\(m\)\}:=\\exp\(S^\{\(m\)\}\)\. SinceV\(m\)∈exp⁡\(𝒮\)V^\{\(m\)\}\\in\\exp\(\\mathscr\{S\}\)it is bounded, hence by the Bolzano–Weierstrass theorem there is a convergent subsequenceV\(mp\)→VV^\{\(m\_\{p\}\)\}\\to V\. It follows that

$$\begin{aligned}
H(p_W) &= \lim_{m\to\infty}\mathcal{L}_{\mathrm{NCE}}^{W}(S^{(m)}) = \lim_{m\to\infty}-\frac{1}{n}\sum_{i\neq j}\frac{w_{ij}}{\sum_{k\neq i}w_{ik}}\log\left(\frac{v^{(m)}_{ij}}{\sum_{k\neq i}v^{(m)}_{ik}}\right)\\
&= \lim_{p\to\infty}-\frac{1}{n}\sum_{i\neq j}\frac{w_{ij}}{\sum_{k\neq i}w_{ik}}\log\left(\frac{v^{(m_p)}_{ij}}{\sum_{k\neq i}v^{(m_p)}_{ik}}\right) = -\frac{1}{n}\sum_{i\neq j}\frac{w_{ij}}{\sum_{k\neq i}w_{ik}}\log\left(\frac{v_{ij}}{\sum_{k\neq i}v_{ik}}\right)\\
&= H(p_W,p_V),
\end{aligned}$$

where $p_V$ is the 2D discrete sequence defined on $[1..n]^2$ analogously to $p_W$ (see proof of [Theorem 3.3](https://arxiv.org/html/2605.13943#S3.Thmtheorem3)). By Gibbs' inequality this implies that $p_W=p_V$, and arguing by symmetry just as we did in [Theorem 3.3](https://arxiv.org/html/2605.13943#S3.Thmtheorem3) we see that there exists $c>0$ such that $V=cW$ up to diagonal elements. Since $V=\lim_{p\to\infty}V^{(m_p)}=\lim_{p\to\infty}\exp(S^{(m_p)})\in\operatorname{cl}(\exp(\mathscr{S}))$, this completes the proof. ∎

We now go on to characterize the so-called Euclidean SupCon, i.e. the case where $W$ is a SupCon-like weight matrix ([6](https://arxiv.org/html/2605.13943#A4.E6)), and $\mathscr{S}=\mathscr{S}_{\mathrm{Eucl.}}$ is the set of negated $n\times n$ EDMs of embedding dimension at most $q$. If $W$ encodes $C\geq 2$ classes then it contains zeros, thereby guaranteeing that the lower bound is strict. Moreover, quasi-optima (i.e. configurations of points such that $\mathcal{L}_{\mathrm{NCE}}^{W}\leq H(p_W)+\varepsilon$ for some small $\varepsilon>0$) can be produced in a wealth of qualitatively distinct geometries. This stands in stark contrast to what we observed previously with Soft SupCon, in which case we saw that the optima were usually constrained to a unique configuration, up to Euclidean isometry ([Theorem 3.7](https://arxiv.org/html/2605.13943#S3.Thmtheorem7)).

To see where this non\-uniqueness comes from in the Euclidean SupCon case, choose pairwise distinct pointsμ1,…,μC∈ℝq\\mu\_\{1\},\\ldots,\\mu\_\{C\}\\in\\mathbb\{R\}^\{q\}which will serve as class prototypes\. Letm\>0m\>0and consider the embedding which assigns each samplexix\_\{i\}tom​μcim\\mu\_\{c\_\{i\}\}, whereci∈\[1\.\.C\]c\_\{i\}\\in\[1\.\.C\]is that sample’s class\. Denoting byS\(m\)S^\{\(m\)\}the negative squared Euclidean distance matrix associated with that embedding, it follows thatexp⁡\(S\(m\)\)\\exp\(S^\{\(m\)\}\)can be expressed in block form as

$$\exp(S^{(m)})=\begin{pmatrix}\mathbf{1}_{\ell_1\times\ell_1}&e^{-m^2\|\mu_1-\mu_2\|^2}&\cdots&e^{-m^2\|\mu_1-\mu_C\|^2}\\ e^{-m^2\|\mu_2-\mu_1\|^2}&\mathbf{1}_{\ell_2\times\ell_2}&\cdots&e^{-m^2\|\mu_2-\mu_C\|^2}\\ \vdots&\vdots&\ddots&\vdots\\ e^{-m^2\|\mu_C-\mu_1\|^2}&e^{-m^2\|\mu_C-\mu_2\|^2}&\cdots&\mathbf{1}_{\ell_C\times\ell_C}\end{pmatrix},\tag{8}$$

where the off-diagonal coefficients represent constant-valued blocks of the appropriate shape.

Note in particular that $\exp(S^{(m)})\to W$ as $m\to\infty$, so $\mathcal{L}_{\mathrm{NCE}}^{W}(S^{(m)})\to H(p_W)$ by [Theorem D.1](https://arxiv.org/html/2605.13943#A4.Thmtheorem1). Thus, taking a large enough $m$ yields a near-optimal solution. Yet, the choice of the $\mu_1,\ldots,\mu_C$ was entirely unconstrained, so the class prototypes can be chosen in any number of qualitatively dissimilar configurations. For instance, when $C=3$ the triangle formed by the class prototypes can be chosen to be equilateral, right-angled or scalene and all the while minimize $\mathcal{L}_{\mathrm{NCE}}^{W}$ to within, say, $\varepsilon=10^{-10}$ of the theoretical lower bound.
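The non-uniqueness can be observed numerically. The sketch below is an illustration under stated assumptions: it implements the weighted InfoNCE loss in the log-sum-exp form used in the proof of Lemma D.2, picks three hypothetical prototype triangles, and measures the gap to $H(p_W)$ for a small and a large scale $m$. Class sizes, coordinates and the value $m=6$ are arbitrary choices.

```python
import numpy as np

def weighted_infonce(W, S):
    """Weighted InfoNCE loss: mean over i of sum_j p_ij * (LSE_{k != i}(s_ik) - s_ij)."""
    n = W.shape[0]
    total = 0.0
    for i in range(n):
        k = np.arange(n) != i                       # exclude the diagonal entry
        s = S[i, k]
        lse = s.max() + np.log(np.exp(s - s.max()).sum())   # stable log-sum-exp
        p = W[i, k] / W[i, k].sum()
        total += np.sum(p * (lse - s))
    return total / n

labels = np.repeat(np.arange(3), 3)                 # C = 3 classes of size 3
W = (labels[:, None] == labels[None, :]).astype(float)
counts = np.bincount(labels)
H = np.sum(counts * np.log(counts - 1)) / len(labels)   # H(p_W) for a SupCon-like W

def gap(mu, m):
    """Loss minus H(p_W) when every sample sits at m times its class prototype."""
    Z = m * mu[labels]
    D = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return weighted_infonce(W, -D) - H

triangles = {
    "equilateral": np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]]),
    "right": np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]),
    "scalene": np.array([[0.0, 0.0], [1.0, 0.0], [0.3, 1.7]]),
}
gaps_small_m = {name: gap(mu, 1.0) for name, mu in triangles.items()}
gaps_large_m = {name: gap(mu, 6.0) for name, mu in triangles.items()}
```

All three shapes should come within $10^{-10}$ of the bound at $m=6$, while remaining clearly above it at $m=1$.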

### D.2 Hard SupCon

We now go on to SupCon proper, that is to say the case where $W$ is a SupCon-like binary weight matrix ([6](https://arxiv.org/html/2605.13943#A4.E6)) and $\mathscr{S}=\mathscr{S}_{\mathrm{sph.}}$ consists of those $n\times n$ matrices of the form $G/\tau$, where $G$ is a cosine matrix of rank at most $q$. Note that $\mathscr{S}_{\mathrm{sph.}}$ is a compact set, so owing to [Theorem D.1](https://arxiv.org/html/2605.13943#A4.Thmtheorem1) the bound $H(p_W)\lneq\mathcal{L}_{\mathrm{NCE}}^{W}(S)$ is strict, and thus *a priori* sub-optimal, since a continuous function on a compact set attains its minimum.

We attack the problem of minimizing $\mathcal{L}_{\mathrm{NCE}}^{W}$ by exploiting its convexity as well as the many symmetries of $W$. In what follows we let $\mathscr{S}_0\supseteq\mathscr{S}$ denote the set of $n\times n$ symmetric matrices of the form $G/\tau$ where $G$ is a cosine matrix; in other words it is defined in the same way as $\mathscr{S}$ except that the rank condition is relaxed. Crucially, $\mathscr{S}_0$ is a convex set (which in general $\mathscr{S}$ is not).

We denote by $\mathfrak{S}_n$ the group of permutations of $[1..n]$. Then $\mathfrak{S}_n$ acts on the space of $n\times n$ matrices by simultaneous permutation of rows and columns, i.e. if $\sigma\in\mathfrak{S}_n$ and $X\in\mathbb{R}^{n\times n}$ then $\sigma X$ is the $n\times n$ matrix defined by $(\sigma X)_{ij}=X_{\sigma(i)\sigma(j)}$. Note that $\mathscr{S}_0$ is closed under this group action, i.e. if $S\in\mathscr{S}_0$ then $\sigma S\in\mathscr{S}_0$. Let $\mathfrak{S}_X$ denote the *stabilizer* of a given matrix $X\in\mathbb{R}^{n\times n}$, i.e. the subgroup of $\mathfrak{S}_n$ consisting of those permutations $\sigma$ such that $\sigma X=X$.

We say thatSSisWW\-symmetricif𝔖W⊆𝔖S\\mathfrak\{S\}\_\{W\}\\subseteq\\mathfrak\{S\}\_\{S\}, i\.e\. any symmetry ofWWis also a symmetry ofSS\.

###### Lemma D\.2\.

For anyS∈𝒮0S\\in\\mathscr\{S\}\_\{0\}, there existsS′∈𝒮0S^\{\\prime\}\\in\\mathscr\{S\}\_\{0\}which isWW\-symmetric such thatℒNCEW​\(S′\)≤ℒNCEW​\(S\)\\mathcal\{L\}\_\{\\mathrm\{NCE\}\}^\{W\}\(S^\{\\prime\}\)\\leq\\mathcal\{L\}\_\{\\mathrm\{NCE\}\}^\{W\}\(S\)\.

###### Proof\.

Note thatS↦ℒNCEW​\(S\)S\\mapsto\\mathcal\{L\}\_\{\\mathrm\{NCE\}\}^\{W\}\(S\)is a convex function on the convex set𝒮0\\mathscr\{S\}\_\{0\}; indeed this can be seen by expressing it as a nonnegatively\-weighted sum of convex functions:

ℒNCEW​\(S\)=∑i≠jwi​jn​∑k≠iwi​k​\(LSEk≠i​\(si​k\)−si​j\),\\mathcal\{L\}\_\{\\mathrm\{NCE\}\}^\{W\}\(S\)=\\sum\_\{i\\neq j\}\\frac\{w\_\{ij\}\}\{n\\sum\_\{k\\neq i\}w\_\{ik\}\}\\left\(\\mathrm\{LSE\}\_\{k\\neq i\}\(s\_\{ik\}\)\-s\_\{ij\}\\right\),whereLSE\\mathrm\{LSE\}denotes the convex log\-sum\-exp operation, i\.e\.LSEk​\(xk\)=log⁡\(∑kexp⁡\(xk\)\)\\mathrm\{LSE\}\_\{k\}\(x\_\{k\}\)=\\log\(\\sum\_\{k\}\\exp\(x\_\{k\}\)\)\. Now, we define the symmetrized version ofSSas

S′:=1\|𝔖W\|​∑σ∈𝔖Wσ​S∈𝒮0\.S^\{\\prime\}:=\\frac\{1\}\{\|\\mathfrak\{S\}\_\{W\}\|\}\\sum\_\{\\sigma\\in\\mathfrak\{S\}\_\{W\}\}\\sigma S\\in\\mathscr\{S\}\_\{0\}\.It follows from the convexity ofℒNCEW\\mathcal\{L\}\_\{\\mathrm\{NCE\}\}^\{W\}that

$$\mathcal{L}_{\mathrm{NCE}}^{W}(S')\leq\frac{1}{|\mathfrak{S}_W|}\sum_{\sigma\in\mathfrak{S}_W}\mathcal{L}_{\mathrm{NCE}}^{W}(\sigma S).\tag{9}$$

Note that the RHS is none other than $\mathcal{L}_{\mathrm{NCE}}^{W}(S)$. Indeed, it is an exercise in sum notation to see that $\mathcal{L}_{\mathrm{NCE}}^{W}(\sigma S)=\mathcal{L}_{\mathrm{NCE}}^{\sigma^{-1}W}(S)$ holds for any $\sigma\in\mathfrak{S}_n$. If $\sigma\in\mathfrak{S}_W$ then $\sigma^{-1}\in\mathfrak{S}_W$, i.e. $\sigma^{-1}W=W$, so $\mathcal{L}_{\mathrm{NCE}}^{W}(\sigma S)=\mathcal{L}_{\mathrm{NCE}}^{W}(S)$. Plugging that into ([9](https://arxiv.org/html/2605.13943#A4.E9)) yields $\mathcal{L}_{\mathrm{NCE}}^{W}(S')\leq\mathcal{L}_{\mathrm{NCE}}^{W}(S)$.

It remains to show thatS′S^\{\\prime\}isWW\-symmetric\. To this end, note that for anyπ∈𝔖W\\pi\\in\\mathfrak\{S\}\_\{W\},

π​S′=1\|𝔖W\|​∑σ∈𝔖Wπ​\(σ​S\)=1\|𝔖W\|​∑σ∈𝔖W\(π​σ\)​S=1\|𝔖W\|​∑σ′∈𝔖Wσ′​S=S′\\pi S^\{\\prime\}=\\frac\{1\}\{\|\\mathfrak\{S\}\_\{W\}\|\}\\sum\_\{\\sigma\\in\\mathfrak\{S\}\_\{W\}\}\\pi\(\\sigma S\)=\\frac\{1\}\{\|\\mathfrak\{S\}\_\{W\}\|\}\\sum\_\{\\sigma\\in\\mathfrak\{S\}\_\{W\}\}\(\\pi\\sigma\)S=\\frac\{1\}\{\|\\mathfrak\{S\}\_\{W\}\|\}\\sum\_\{\\sigma^\{\\prime\}\\in\\mathfrak\{S\}\_\{W\}\}\\sigma^\{\\prime\}S=S^\{\\prime\}where we re\-indexed the sum inσ′=π​σ\\sigma^\{\\prime\}=\\pi\\sigma, using the fact that𝔖W\\mathfrak\{S\}\_\{W\}is a subgroup of𝔖n\\mathfrak\{S\}\_\{n\}\. Thus,π∈𝔖S′\\pi\\in\\mathfrak\{S\}\_\{S^\{\\prime\}\}, as required\. ∎
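The symmetrization argument of Lemma D.2 can be reproduced on a toy instance. The sketch below is illustrative only: it enumerates the stabilizer of a small SupCon-like $W$ by brute force (feasible only for tiny $n$), averages a random $S\in\mathscr{S}_0$ over it, and compares losses. The helper names and the choices $n=4$, $\tau=0.5$ are assumptions made for the example.

```python
import itertools
import numpy as np

def weighted_infonce(W, S):
    """Weighted InfoNCE loss written with log-sum-exp, as in the proof of Lemma D.2."""
    n = W.shape[0]
    total = 0.0
    for i in range(n):
        k = np.arange(n) != i
        s = S[i, k]
        lse = s.max() + np.log(np.exp(s - s.max()).sum())
        p = W[i, k] / W[i, k].sum()
        total += np.sum(p * (lse - s))
    return total / n

def act(perm, X):
    """Group action (sigma X)_ij = X_{sigma(i) sigma(j)}."""
    return X[np.ix_(perm, perm)]

def stabilizer(W):
    """All permutations sigma with sigma W = W (brute force over S_n)."""
    n = W.shape[0]
    return [p for p in itertools.permutations(range(n))
            if np.array_equal(act(p, W), W)]

rng = np.random.default_rng(0)
labels = np.array([0, 0, 1, 1])                  # two classes of size 2
W = (labels[:, None] == labels[None, :]).astype(float)
tau = 0.5
Z = rng.normal(size=(4, 3))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
S = (Z @ Z.T) / tau                              # an element of S_0

group = stabilizer(W)
S_sym = sum(act(p, S) for p in group) / len(group)   # symmetrized S'
loss_S, loss_sym = weighted_infonce(W, S), weighted_infonce(W, S_sym)
```

Here the stabilizer has $2\cdot 2\cdot 2=8$ elements: swaps within each class, combined with the exchange of the two equally sized classes.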

The following lemma elucidates the structure ofWW\-symmetric matrices\.

###### Lemma D\.3\.

AWW\-symmetric matrixSSis constant on the blocks induced by the class partition, up to diagonal elements\. Any two off\-diagonal blocks with the same shape are equal\. If two diagonal blocks have the same shape then their off\-diagonal values are equal\.

###### Proof\.

Let $i\neq j$ and $i'\neq j'$ be two pairs of distinct indices in the same block, that is to say such that $i,i'$ belong to the same class and $j,j'$ belong to the same class. Suppose first that $i$ and $j$ belong to the same class $I\subseteq[1..n]$. Since $\mathfrak{S}_I$ acts 2-transitively on $I$, there exists $\sigma\in\mathfrak{S}_I$ such that $\sigma(i)=i'$ and $\sigma(j)=j'$. We extend $\sigma$ to $\mathfrak{S}_n$ by letting it act as the identity on $[1..n]\setminus I$. Then $\sigma\in\mathfrak{S}_W$, so $s_{ij}=s_{\sigma(i)\sigma(j)}=s_{i'j'}$ since $S$ is $W$-symmetric. On the other hand, if $i$ and $j$ belong to distinct classes $I$ and $J$, the transpositions $(i\ i')$ and $(j\ j')$ (each understood as the identity when its two indices coincide) have disjoint support, and composing them yields a permutation $\sigma\in\mathfrak{S}_W$ with $\sigma(i)=i'$ and $\sigma(j)=j'$. Again we get $s_{ij}=s_{\sigma(i)\sigma(j)}=s_{i'j'}$ since $S$ is $W$-symmetric. This shows that off-diagonal elements in the same block are equal.

Let $(I,J)$ and $(I',J')$ be pairs of classes representing off-diagonal blocks of the same shape, i.e. $|I|=|I'|$ and $|J|=|J'|$. Then there are bijections $\sigma_1:I\to I'$ and $\sigma_2:J\to J'$. Let $I_3,\ldots,I_C$ denote the classes not equal to $I$ or $J$, and let $I_3',\ldots,I_C'$ be those not equal to $I'$ or $J'$. The latter may be chosen in such an order that $|I_i|=|I'_i|$ for $3\leq i\leq C$, so there exist bijections $\sigma_i:I_i\to I_i'$. The bijections $\sigma_1,\sigma_2,\sigma_3,\ldots,\sigma_C$, whose domains are pairwise disjoint and whose codomains are pairwise disjoint, combine into a single permutation $\sigma:I\sqcup J\sqcup I_3\sqcup\cdots\sqcup I_C=[1..n]\to I'\sqcup J'\sqcup I_3'\sqcup\cdots\sqcup I_C'=[1..n]$ which permutes classes, and therefore belongs to $\mathfrak{S}_W$. It follows that $\sigma S=S$, hence $S_{IJ}=S_{\sigma(I)\sigma(J)}=S_{I'J'}$. This proves that the value of $S$ on an off-diagonal block depends only on its shape, as required.

Finally, if\(I,I\)\(I,I\)and\(J,J\)\(J,J\)are diagonal blocks such that\|I\|=\|J\|\|I\|=\|J\|, a bijectionσ:I→J\\sigma:I\\to Jcan be extended using the same idea as above to produceσ∈𝔖W\\sigma\\in\\mathfrak\{S\}\_\{W\}such thatσ​\(I\)=J\\sigma\(I\)=J\. Hence fori,i′∈Ii,i^\{\\prime\}\\in I, ifi≠i′i\\neq i^\{\\prime\}thensI​I=si​i′=sσ​\(i\)​σ​\(i′\)=sJ​Js\_\{II\}=s\_\{ii^\{\\prime\}\}=s\_\{\\sigma\(i\)\\sigma\(i^\{\\prime\}\)\}=s\_\{JJ\}\. ∎

###### Lemma D\.4\.

Let𝐚,𝐯∈ℝm\\mathbf\{a\},\\mathbf\{v\}\\in\\mathbb\{R\}^\{m\}\. Then the restriction of the log\-sum\-exp functionLSE:ℝm→ℝ\\operatorname\{LSE\}:\\mathbb\{R\}^\{m\}\\to\\mathbb\{R\}to the line𝐚\+ℝ​𝐯⊆ℝm\\mathbf\{a\}\+\\mathbb\{R\}\\mathbf\{v\}\\subseteq\\mathbb\{R\}^\{m\}is convex, and it is strictly convex iff the coefficients of𝐯\\mathbf\{v\}are not identical\.

We omit the proof which can easily be derived using elementary calculus and the Cauchy–Schwarz inequality\.
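For the reader's convenience, one way the omitted computation can go is sketched below. The symbols $g$ and $p_k$ are notation introduced here for the sketch and do not appear elsewhere in the paper.

```latex
% Restriction of LSE to the line a + t v, and the associated softmax weights.
\[
g(t) := \operatorname{LSE}(\mathbf{a}+t\mathbf{v}) = \log\sum_{k=1}^{m} e^{a_k+tv_k},
\qquad
p_k(t) := \frac{e^{a_k+tv_k}}{\sum_{l=1}^{m} e^{a_l+tv_l}} > 0,
\qquad \sum_{k=1}^{m} p_k(t) = 1 .
\]
% Differentiating twice, using p_k'(t) = p_k(t) ( v_k - \sum_l p_l(t) v_l ):
\[
g'(t) = \sum_{k} p_k(t)\,v_k,
\qquad
g''(t) = \sum_{k} p_k(t)\,v_k^{2} - \Big(\sum_{k} p_k(t)\,v_k\Big)^{2} \;\geq\; 0 .
\]
% The inequality is Cauchy--Schwarz applied to (\sqrt{p_k}) and (\sqrt{p_k}\,v_k).
% Since every p_k(t) > 0, equality holds iff v_1 = \dots = v_m.
% If the v_k are not all equal, g'' > 0 everywhere and g is strictly convex;
% if v = c (1,\dots,1), then g(t) = \operatorname{LSE}(\mathbf{a}) + ct is affine.
```

In other words, $g''(t)$ is the variance of the coefficients of $\mathbf{v}$ under the softmax distribution $p(t)$, which vanishes exactly when those coefficients are identical.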

We say that a matrix in𝒮0\\mathscr\{S\}\_\{0\}isWW\-collapsedif it isWW\-symmetric and the diagonal entries agree with their respective diagonal blocks, which is to say that the values of the diagonal blocks are identically and maximally equal to1/τ1/\\tau\. AWW\-collapsed matrix can be expressed in block form as\(βc​c′⋅𝟏ℓc×ℓc′\)c,c′=1C\(\\beta\_\{cc^\{\\prime\}\}\\cdot\\mathbf\{1\}\_\{\\ell\_\{c\}\\times\\ell\_\{c^\{\\prime\}\}\}\)\_\{c,c^\{\\prime\}=1\}^\{C\}whereβc​c=1/τ\\beta\_\{cc\}=1/\\taufor eachc∈\[1\.\.C\]c\\in\[1\.\.C\]\.[Lemma˜D\.3](https://arxiv.org/html/2605.13943#A4.Thmtheorem3)tells us that ifc≠c′c\\neq c^\{\\prime\}thenβc​c′\\beta\_\{cc^\{\\prime\}\}depends only on the shape of the block, i\.e\.\(ℓc,ℓc′\)\(\\ell\_\{c\},\\ell\_\{c^\{\\prime\}\}\)\.

###### Lemma D\.5\.

The functionℒNCEW\\mathcal\{L\}\_\{\\mathrm\{NCE\}\}^\{W\}has a unique global minimum over𝒮0\\mathscr\{S\}\_\{0\}\. Moreover, that minimum isWW\-collapsed\.

###### Proof\.

LetS∈𝒮0S\\in\\mathscr\{S\}\_\{0\}be a global minimum, which exists by compactness\. By[Lemma˜D\.2](https://arxiv.org/html/2605.13943#A4.Thmtheorem2), there is aWW\-symmetricS′∈𝒮0S^\{\\prime\}\\in\\mathscr\{S\}\_\{0\}such thatℒNCEW​\(S′\)≤ℒNCEW​\(S\)\\mathcal\{L\}\_\{\\mathrm\{NCE\}\}^\{W\}\(S^\{\\prime\}\)\\leq\\mathcal\{L\}\_\{\\mathrm\{NCE\}\}^\{W\}\(S\)\. But sinceSSis a global minimum over𝒮0\\mathscr\{S\}\_\{0\}, so must beS′S^\{\\prime\}\. Since the minimizers of a convex function over a convex set form a convex set,ℒNCEW\\mathcal\{L\}\_\{\\mathrm\{NCE\}\}^\{W\}is identically equal to its minimum on the line segment\[S,S′\]⊆𝒮0\[S,S^\{\\prime\}\]\\subseteq\\mathscr\{S\}\_\{0\}, so for anyt∈\[0,1\]t\\in\[0,1\]it holds thatℒNCEW​\(t​S\+\(1−t\)​S′\)=t​ℒNCEW​\(S\)\+\(1−t\)​ℒNCEW​\(S′\)\\mathcal\{L\}\_\{\\mathrm\{NCE\}\}^\{W\}\(tS\+\(1\-t\)S^\{\\prime\}\)=t\\mathcal\{L\}\_\{\\mathrm\{NCE\}\}^\{W\}\(S\)\+\(1\-t\)\\mathcal\{L\}\_\{\\mathrm\{NCE\}\}^\{W\}\(S^\{\\prime\}\)\. ExpressingℒNCEW\\mathcal\{L\}\_\{\\mathrm\{NCE\}\}^\{W\}in terms of theLSE\\operatorname\{LSE\}as at the start of the proof of[Lemma˜D\.2](https://arxiv.org/html/2605.13943#A4.Thmtheorem2), this implies that:

∑i≠jwi​j∑k≠iwi​k​LSEk≠i⁡\(t​si​k\+\(1−t\)​si​k′\)=∑i≠jwi​j∑k≠iwi​k​\(t​LSEk≠i⁡\(si​k\)\+\(1−t\)​LSEk≠i⁡\(si​k′\)\)\.\\sum\_\{i\\neq j\}\\frac\{w\_\{ij\}\}\{\\sum\_\{k\\neq i\}w\_\{ik\}\}\\operatorname\{LSE\}\_\{k\\neq i\}\(ts\_\{ik\}\+\(1\-t\)s^\{\\prime\}\_\{ik\}\)=\\sum\_\{i\\neq j\}\\frac\{w\_\{ij\}\}\{\\sum\_\{k\\neq i\}w\_\{ik\}\}\\left\(t\\operatorname\{LSE\}\_\{k\\neq i\}\(s\_\{ik\}\)\+\(1\-t\)\\operatorname\{LSE\}\_\{k\\neq i\}\(s^\{\\prime\}\_\{ik\}\)\\right\)\.\(10\)For eachi∈\[1\.\.n\]i\\in\[1\.\.n\], the family\(si​k−si​k′\)k≠i\(s\_\{ik\}\-s^\{\\prime\}\_\{ik\}\)\_\{k\\neq i\}must be a constantcic\_\{i\}, else by[Lemma˜D\.4](https://arxiv.org/html/2605.13943#A4.Thmtheorem4)the strict convexity ofLSE\\operatorname\{LSE\}restricted to a line would cause \([10](https://arxiv.org/html/2605.13943#A4.E10)\) to be a strict inequality\. SinceSSandS′S^\{\\prime\}are symmetric,ci=cc\_\{i\}=cdoes not depend onii\. It follows thatS=S′\+c​\(En−In\)S=S^\{\\prime\}\+c\(E\_\{n\}\-I\_\{n\}\)which isWW\-symmetric sinceS′,EnS^\{\\prime\},E\_\{n\}andInI\_\{n\}all are\. By[Lemma˜D\.3](https://arxiv.org/html/2605.13943#A4.Thmtheorem3),SScan be expressed simply in terms of the blocks induced by the class partition\. It remains to be seen thatSSis collapsed, i\.e\. the values of the diagonal blocks are identical and maximally equal to1/τ1/\\tau\. Let\(αc\)c=1C\(\\alpha\_\{c\}\)\_\{c=1\}^\{C\}denote the values of the diagonal blocks, i\.e\.intra\-class similarities, and\(βc​c′\)c≠c′C\(\\beta\_\{cc^\{\\prime\}\}\)\_\{c\\neq c^\{\\prime\}\}^\{C\}theinter\-class similarities\. Letℐc\\mathcal\{I\}\_\{c\}denote the set of indices belonging to classcc, andℓc=\|ℐc\|\\ell\_\{c\}=\|\\mathcal\{I\}\_\{c\}\|its size\. Then,

$$\mathcal{L}_{\mathrm{NCE}}^{W}(S) = -\frac{1}{n}\sum_{c=1}^{C}\frac{1}{\ell_c-1}\sum_{\substack{i,j\in\mathcal{I}_c\\ i\neq j}}\left(\alpha_c-\log\left((\ell_c-1)e^{\alpha_c}+\sum_{c'\neq c}\ell_{c'}e^{\beta_{cc'}}\right)\right).$$

For each $c$, this is a strictly decreasing function of $\alpha_c$. Indeed,

$$\frac{\partial\mathcal{L}_{\mathrm{NCE}}}{\partial\alpha_c} = -\frac{1}{n}\frac{1}{\ell_c-1}\sum_{\substack{i,j\in\mathcal{I}_c\\ i\neq j}}\frac{\sum_{c'\neq c}\ell_{c'}e^{\beta_{cc'}}}{(\ell_c-1)e^{\alpha_c}+\sum_{c'\neq c}\ell_{c'}e^{\beta_{cc'}}}<0.$$

Because of its block structure, replacing the diagonal entries of $S$ with $1/\tau$ yields a valid matrix in $\mathscr{S}_0$. By minimality of $\mathcal{L}_{\mathrm{NCE}}^{W}(S)$, it follows that $S$ must have the maximum value of $\alpha_c=1/\tau$ on the diagonal blocks, which shows that $S$ is $W$-collapsed.

We have shown that any global minimum isWW\-collapsed\. As to uniqueness, note that ifS~\\widetilde\{S\}is another global minimum then the same reasoning we carried out onS′S^\{\\prime\}shows thatS=S~\+c​\(En−In\)S=\\widetilde\{S\}\+c\(E\_\{n\}\-I\_\{n\}\)for some constantcc, which implies thatc=0c=0as bothSSandS~\\widetilde\{S\}areWW\-collapsed\. ∎

See[3\.13](https://arxiv.org/html/2605.13943#S3.Thmtheorem13)

###### Proof\.

By [Lemma D.5](https://arxiv.org/html/2605.13943#A4.Thmtheorem5), $\mathcal{L}_{\mathrm{NCE}}^{W}$ has a unique global minimum $S$ over $\mathscr{S}_0$. Moreover, $S$ is $W$-collapsed, so in particular it is expressible in block form in terms of $C\times C$ constant blocks with shapes $\ell_c\times\ell_{c'}$ for $c,c'\in[1..C]$. Because of its constant block structure, $\operatorname{rank}(S)\leq C\leq q$, so $S\in\mathscr{S}$ and it is *a fortiori* the unique global minimum of $\mathcal{L}_{\mathrm{NCE}}^{W}$ over $\mathscr{S}$. The existence of a geometric realization in terms of $C$ class prototypes follows from its constant block structure. Their essential uniqueness follows from the fact that points on a sphere are determined up to linear isometry by their cosine matrix. Finally, statement 3 follows from [Lemma D.3](https://arxiv.org/html/2605.13943#A4.Thmtheorem3). ∎

The upshot of [Theorem 3.13](https://arxiv.org/html/2605.13943#S3.Thmtheorem13) is that regularity in class sizes can be exploited to greatly reduce the number of parameters in the optimization of the SupCon loss. For instance, we easily recover the following result from Graf et al. [[2021](https://arxiv.org/html/2605.13943#bib.bib1)] in the case of balanced classes. Recall that vertices $\mu_{1},\ldots,\mu_{C}\in\mathbb{S}^{q-1}$ are said to form a *regular simplex* if $\mu_{1}+\cdots+\mu_{C}=\mathbf{0}$ and the value of $\cos(\mu_{c},\mu_{c'})$ is the same for all $c\neq c'$. This is equivalent to the condition $\cos(\mu_{c},\mu_{c'})=-1/(C-1)$ for all $c\neq c'$.

###### Theorem D.6 (Graf et al., 2021).

If $C>1$ and $\ell_{1}=\cdots=\ell_{C}$, then class prototypes $\mu_{1},\ldots,\mu_{C}\in\mathbb{S}^{q-1}$ which minimize the SupCon loss form the vertices of a regular simplex.

###### Proof\.

Let $\mu_{1},\ldots,\mu_{C}$ be the optimal class prototypes. Per [Theorem 3.13](https://arxiv.org/html/2605.13943#S3.Thmtheorem13) and since there is only one possible class size, $\cos(\mu_{c},\mu_{c'})=\beta$ for some constant $\beta$, for every $c\neq c'$. It remains to show that $\beta=-1/(C-1)$. Expressing the minimal value of the loss in terms of $\beta$ at the optimum $S^{*}=(\cos(\mu_{c},\mu_{c'})/\tau\cdot\mathbf{1}_{\ell_{c}\times\ell_{c'}})_{c,c'=1}^{C}$, we see that

$$\mathcal{L}_{\mathrm{NCE}}^{W}(S^{*})=-\frac{1}{n}\sum_{c=1}^{C}\frac{1}{\ell-1}\sum_{\substack{i,j\in\mathcal{I}_{c}\\ i\neq j}}\left(\frac{1}{\tau}-\log\left((\ell-1)e^{1/\tau}+(C-1)\ell e^{\beta/\tau}\right)\right),\tag{11}$$

where $\ell$ denotes the size of every class and $\mathcal{I}_{c}\subset[1..n]$ the set of indices belonging to class $c$. In particular we see that the RHS of ([11](https://arxiv.org/html/2605.13943#A4.E11)) is a strictly increasing function of $\beta$, so the value of $\beta$ must be as small as possible subject to the constraint that $S^{*}$ be PSD. Owing to the block structure of $S^{*}$, this amounts to the condition that $A_{\beta}:=I_{C}+\beta(E_{C}-I_{C})$ is PSD. We can express the quadratic form associated with $A_{\beta}$ as follows:

$$\mathbf{x}^{\top}A_{\beta}\mathbf{x}=(1-\beta)\sum_{c=1}^{C}x_{c}^{2}+\beta\left(\sum_{c=1}^{C}x_{c}\right)^{2}.$$

On the one hand, if $\beta=-1/(C-1)$ then $\mathbf{x}^{\top}A_{\beta}\mathbf{x}\geq 0$ by the Cauchy–Schwarz inequality, so $A_{\beta}$ is PSD. On the other hand, if $\beta<-1/(C-1)$ and $\mathbf{x}=(1,1,\ldots,1)$ then $\mathbf{x}^{\top}A_{\beta}\mathbf{x}=(1-\beta)C+\beta C^{2}=C(1+(C-1)\beta)<0$, so $A_{\beta}$ is not PSD.

Thus, the optimal inter-class similarity is $\beta=-1/(C-1)$, as desired. ∎
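The two ingredients of this proof, monotonicity of (11) in $\beta$ and the PSD threshold of $A_{\beta}$, can be checked numerically. The sketch below is illustrative only: the values of $C$, $\ell$ and $\tau$ are arbitrary, and the helper names (`a_beta`, `balanced_loss`, `simplex_prototypes`) are hypothetical. It also builds one explicit regular simplex (centered, normalized standard basis vectors in $\mathbb{R}^{C}$) whose cosines equal $-1/(C-1)$.

```python
import numpy as np

def a_beta(C, beta):
    """A_beta = I_C + beta * (E_C - I_C), where E_C is the all-ones matrix."""
    return np.eye(C) + beta * (np.ones((C, C)) - np.eye(C))

def min_eigenvalue(C, beta):
    """Smallest eigenvalue of A_beta; A_beta is PSD iff this is >= 0."""
    return float(np.linalg.eigvalsh(a_beta(C, beta)).min())

def balanced_loss(beta, C, ell, tau):
    """Right-hand side of (11) with C classes of equal size ell (n = C * ell)."""
    n = C * ell
    ordered_pairs = ell * (ell - 1)  # number of pairs (i, j), i != j, within one class
    log_partition = np.log((ell - 1) * np.exp(1 / tau) + (C - 1) * ell * np.exp(beta / tau))
    return -(1 / n) * C * ordered_pairs / (ell - 1) * (1 / tau - log_partition)

def simplex_prototypes(C):
    """Centered, normalized standard basis of R^C: one regular simplex on the sphere."""
    centered = np.eye(C) - np.ones((C, C)) / C
    return centered / np.linalg.norm(centered, axis=1, keepdims=True)

C, ell, tau = 5, 4, 0.5              # arbitrary illustrative values
beta_star = -1.0 / (C - 1)

betas = np.linspace(beta_star, 1.0, 50)
losses = balanced_loss(betas, C, ell, tau)   # increasing in beta

mu = simplex_prototypes(C)
gram = mu @ mu.T                              # off-diagonal entries equal beta_star

print(min_eigenvalue(C, beta_star), min_eigenvalue(C, beta_star - 0.05))
```

The eigenvalues of $A_{\beta}$ are $1-\beta$ (with multiplicity $C-1$) and $1+(C-1)\beta$, so the smallest one crosses zero exactly at $\beta=-1/(C-1)$, matching the argument with $\mathbf{x}=(1,\ldots,1)$.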

## Appendix E $\mathbb{X}$-CLR geometry

![Refer to caption](https://arxiv.org/html/2605.13943v1/figures/cifar.png)

Figure 6: Alignment between learned and target label geometries. Heatmap of $\max(0,r^{2}_{\mathrm{Proc}}(Z,Z'))$ where columns correspond to the geometry used for training and rows to evaluation geometries. Diagonal entries indicate successful recovery of the target geometry, while off-diagonal values highlight geometric mismatch.

| Target | $r^{2,\mathrm{test}}_{\mathrm{Proc}}$ | $\Delta_{W}^{\mathrm{test}}$ (%) | Top-1 (%) | Top-3 (%) |
|---|---|---|---|---|
| Simplex | 0.63 | 0.083 | 79.68 | 89.66 |
| CLIP | 0.63 | 0.090 | 73.38 | 86.67 |
| MiniLM | 0.66 | 0.038 | 78.37 | 89.15 |

Table 1: Generalization statistics for $\mathbb{X}$-CLR models trained on CIFAR-100. Top-$k$ accuracy is computed on the test set using a cross-validated ridge regression fitted on training representations.

We evaluate the $\mathbb{X}$-CLR contrastive loss Sobal et al. [[2024](https://arxiv.org/html/2605.13943#bib.bib5)] on CIFAR-100 Krizhevsky et al. [[2009](https://arxiv.org/html/2605.13943#bib.bib43)] using a ResNet50 encoder with a $q=512$-dimensional latent space. Labels $y\in\mathbb{R}^{\ell}$ are encoded in three ways: (i) one-hot class vectors (simplex, $\ell=100$), (ii) CLIP ViT-B/32 text embeddings of the prompt “a photo of <class>” ($\ell=512$), and (iii) MiniLM ([https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)) embeddings of the same prompt ($\ell=384$).

For each model, we compute $r^{2}_{\mathrm{Proc}}(Z,Z^{*})$, where $Z$ are the learned embeddings and $Z^{*}$ the target label embeddings used during training (zero-padded if needed). As a control, we also evaluate $r^{2}_{\mathrm{Proc}}(Z,Z')$ with alternative label geometries $Z'$. Results in [Figure 6](https://arxiv.org/html/2605.13943#A5.F6) show that the models recover their target geometry and that the different label geometries are mutually distinct.

Finally, we tested whether our proposed metrics could better explain the generalization properties of the pre-trained models. To do so, we computed the Procrustes and loss errors for the three previous models, pre-trained on three different label geometries, and we report them in [Table 1](https://arxiv.org/html/2605.13943#A5.T1) along with their top-1 and top-3 accuracies for predicting the class labels of CIFAR-100 (seen as generalization error). Strikingly, the $\mathbb{X}$-CLR model pre-trained with the MiniLM geometry generalizes very well while having the best Procrustes and loss errors. This suggests that strong performance on our metrics translates into better model generalization on new tasks.

## Appendix F Connection with Kernel PCA and SNE

### F.1 Kernel PCA

We state a special case that gives a precise connection between Kernel PCA and $w$-InfoNCE:

###### Theorem F.1 (Kernel $w$-InfoNCE).

Let $K=(k_{ij})$ be an $n\times n$ positive semi-definite kernel matrix inducing the squared kernel distance $d_{ij}:=k_{ii}+k_{jj}-2k_{ij}$. Assume the rank of the centered kernel $\widetilde{K}$ is at most $q$. The optimal embedding minimizing $w$-InfoNCE with $w_{ij}=\exp(-d_{ij})$ and $s_{ij}=-\|z_{i}-z_{j}\|^{2}$ corresponds to the kernel principal components of $K$, up to Euclidean isometry.

###### Proof\.

Since $K$ is a symmetric positive semi-definite kernel matrix, $D=(d_{ij})$ is an EDM with embedding dimension $r=\operatorname{rank}(\widetilde{K})\leq q$. It follows that $\operatorname{rank}(D)\leq q+2$ and one realization of this EDM is exactly the principal components of $K$, i.e. the eigenvectors of $\widetilde{K}$ scaled by the square roots of its eigenvalues, as per the characterization of EDMs in Alfakih and others [[2018](https://arxiv.org/html/2605.13943#bib.bib23)]. We then conclude with [Lemma B.1](https://arxiv.org/html/2605.13943#A2.Thmtheorem1). ∎
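The realization step of this proof can be illustrated with a short numerical sketch. It is not the authors' implementation: the kernel is a hypothetical low-rank linear kernel built from random features, and the sizes `n`, `r`, `q` are arbitrary values with $r\leq q$. The code computes the kernel principal components from the centered kernel and checks that their pairwise squared distances reproduce $d_{ij}$, so that $\exp(s_{ij})=w_{ij}$ holds entrywise.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, q = 12, 3, 4                       # hypothetical sizes with r <= q

features = rng.normal(size=(n, r))
K = features @ features.T                # PSD kernel matrix of rank at most r

# Squared kernel distances d_ij = k_ii + k_jj - 2 k_ij.
k_diag = np.diag(K)
D = k_diag[:, None] + k_diag[None, :] - 2 * K

# Centered kernel: K_centered = H K H with H = I - (1/n) E.
H = np.eye(n) - np.ones((n, n)) / n
K_centered = H @ K @ H

# Kernel principal components: top-q eigenvectors scaled by sqrt(eigenvalue).
eigvals, eigvecs = np.linalg.eigh(K_centered)
top = np.argsort(eigvals)[::-1][:q]
Z = eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))

# Pairwise squared Euclidean distances of the embedding.
z_sq = (Z ** 2).sum(axis=1)
D_Z = z_sq[:, None] + z_sq[None, :] - 2 * Z @ Z.T

max_gap = float(np.abs(D - D_Z).max())
print(max_gap)  # tiny, at the level of floating-point error
```

The match is exact only because the centered kernel has rank at most $q$. With a full-rank kernel and a smaller $q$, the discarded eigenvalues would make `D_Z` differ from `D`.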

### F.2 SNE

Stochastic Neighbor Embedding (SNE) Hinton and Roweis [[2002](https://arxiv.org/html/2605.13943#bib.bib32)] is the predecessor of t-SNE and was originally proposed for data visualization. It learns an embedding $Z$ of high-dimensional data $X$ by minimizing a $w$-InfoNCE objective with weights $w_{ij}=\exp\left(-\|x_{i}-x_{j}\|^{2}/(2\sigma_{i}^{2})\right)$ and similarities $s_{ij}=-\|z_{i}-z_{j}\|^{2}$. This objective violates Assumption [3.2](https://arxiv.org/html/2605.13943#S3.Thmtheorem2), since in general $w_{ij}\neq w_{ji}$. The asymmetry arises from non-uniform local densities in $X$, which induce point-dependent variance $\sigma_{i}$. If instead one assumes a constant variance $\sigma_{i}=\sigma$, the SNE objective enforces $\|z_{i}-z_{j}\|\approx\|x_{i}-x_{j}\|/\sigma$, as shown in Corollary [3.4](https://arxiv.org/html/2605.13943#S3.Thmtheorem4).

In this setting, SNE is equivalent to $w$-InfoNCE and recovers an approximation of the principal components of $X$. This approximation becomes exact when the rank of the covariance matrix of $X$ is at most the number of components (i.e. the embedding dimension), as shown in [Theorem F.1](https://arxiv.org/html/2605.13943#A6.Thmtheorem1).
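The role of the variance can be made concrete with a small sketch. The data, the bandwidth values and the helper names (`sq_dists`, `sne_weights`) are hypothetical. With point-dependent $\sigma_{i}$ the weight matrix is not symmetric, whereas with a constant $\sigma$ it is symmetric and the rescaled points $z_{i}=x_{i}/(\sqrt{2}\,\sigma)$ satisfy $\exp(s_{ij})=w_{ij}$ exactly under the weight formula above, so embedding distances scale as $\|x_{i}-x_{j}\|/\sigma$ up to the constant factor $1/\sqrt{2}$.

```python
import numpy as np

def sq_dists(points):
    """Pairwise squared Euclidean distances, clipped at zero for numerical safety."""
    sq = (points ** 2).sum(axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2 * points @ points.T, 0.0)

def sne_weights(X, sigma):
    """w_ij = exp(-||x_i - x_j||^2 / (2 sigma_i^2)); sigma is a scalar or one value per point."""
    sigma = np.broadcast_to(np.asarray(sigma, dtype=float), (X.shape[0],))
    return np.exp(-sq_dists(X) / (2 * sigma[:, None] ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                      # hypothetical input data

# Point-dependent bandwidths: row i uses sigma_i, so w_ij != w_ji in general.
W_local = sne_weights(X, sigma=np.linspace(0.5, 2.0, 6))

# Constant bandwidth: the weights become symmetric.
sigma_const = 1.5
W_const = sne_weights(X, sigma=sigma_const)

# Rescaled inputs realize the weights exactly: exp(-||z_i - z_j||^2) = w_ij.
Z = X / (np.sqrt(2) * sigma_const)
S = -sq_dists(Z)

print(np.allclose(W_local, W_local.T), np.allclose(W_const, W_const.T), np.allclose(np.exp(S), W_const))
```

Here `Z` has the same dimension as `X`. When the embedding dimension is smaller than the rank of the data, an exact match is no longer available and only an approximation can be expected.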

Below, we show that in that setting, PCA and SNE/$w$-InfoNCE representations are indeed quasi-isometric.

![Refer to caption](https://arxiv.org/html/2605.13943v1/figures/pca_vs_winfonce.png)

Figure 7: PCA against weighted InfoNCE representations of MNIST with $q$ latent dimensions (or components) using Euclidean input distance as dissimilarity measure.

We train a ConvNet on MNIST with latent dimension $q\in\{8,16,32,48,64\}$ and $\sigma=10^{3}$. In [Figure 7](https://arxiv.org/html/2605.13943#A6.F7) we report the coefficient of similarity $r^{2}_{s\text{-sim}}$ between $w$-InfoNCE representations and PCA of the test set with the same latent size $q$. We observe that $q=48$ gives an almost perfect score $r^{2}_{s\text{-sim}}=0.98$, which is also confirmed visually when plotting the distances in the ConvNet latent space against PCA space. When $q=48$, PCA explains 90% of variance, suggesting that $r\approx q$ in this case, in line with our theory.

## Appendix G Implementation details

The experiments in this paper were carried out in Python using the PyTorch package and a single NVIDIA RTX 4500 Graphics Processing Unit \(GPU\) with 24 GB of memory\.

The code required to reproduce all the experiments in this paper can be found in the anonymized repository with the following URL:

Key details and parameter values are summarized in the following table\.

| § | [4.1](https://arxiv.org/html/2605.13943#S4.SS1) | [4.2](https://arxiv.org/html/2605.13943#S4.SS2) | [4.3](https://arxiv.org/html/2605.13943#S4.SS3) | Appendix [E](https://arxiv.org/html/2605.13943#A5) |
|---|---|---|---|---|
| Dataset | MNIST | N/A | MNIST | CIFAR-100 |
| File(s) | `mnist_1d.ipynb`, `mnist_nd.ipynb` | `imbalanced_supcon.py` | `mnist_1d.ipynb` | `experiments/cifar100/` |
| Architecture | 3-layer ConvNet | N/A | 3-layer ConvNet | ResNet-50 |
| Loss | Eucl. $w$-InfoNCE | Hard/Soft SupCon | $y$-Aware | $\mathbb{X}$-CLR |
| Optimizer | Adam | Adam | Adam | Adam |
| LR | 1e-3 | 5e-2 | 1e-3 | 1e-4 |
| WD | 2e-6 | N/A | 2e-6 | 1e-6 |
| Epochs | 20–50 | $<20\mathrm{k}$ | 50 | 150 |
| Batch size | 512 | full | 512 | 512 |
| Key parameters | $d=2$ or $d\in[5..12]$ | Hard: $\tau=0.1$; Soft: $\varepsilon=e^{-1}$, $\tau=\tau^{*}$; $d=C=10$ | $d=2$, $\tau=0.1$ | $q=512$, $\ell\in\{100,512,384\}$, $\tau=0.1$ |

Table 2: Implementation details.
