Riemannian Archetypal Analysis: Interpretable non-linear data analysis on deformed star distributions

arXiv cs.LG 05/26/26, 04:00 AM Papers
Summary
This paper introduces a Riemannian version of archetypal analysis using data-driven pullback geometry to combine interpretability with non-linear expressiveness, proposing the Riemannian Archetypal Mapping (RAM) and demonstrating its effectiveness on synthetic data and MNIST.
arXiv:2605.24113v1 Announce Type: new Abstract: Classical archetypal analysis is appealing for its interpretability, but its linear geometry can limit performance on data with strongly non-linear structure; at the same time, existing neural extensions improve flexibility while often weakening the geometric meaning of archetypes and interpolations. In this work, we develop a Riemannian version of archetypal analysis based on data-driven pullback geometry for real-valued data, with the goal of combining the interpretability of classical archetypal analysis with the expressive power of modern non-linear models. We introduce a class of deformed star distributions together with associated pullback Riemannian geometry to provide a statistical interpretation of the resulting manifold mappings, define the Riemannian archetypal mapping (RAM) as a projection onto the manifold of geodesically convex combinations of archetypes, and propose a practical optimization scheme based on convex relaxation followed by non-convex refinement. We further propose a learning scheme that yields reasonable, albeit generally suboptimal, deformed star distributions from data. Experiments on synthetic examples and MNIST show that the resulting framework produces meaningful geodesics, useful denoising projections, and geometry-aware classifications, while also clarifying where current optimization limitations remain.
Original Article
View Cached Full Text
Cached at: 05/26/26, 09:01 AM
# Interpretable non-linear data analysis on deformed star distributions
Source: [https://arxiv.org/html/2605.24113](https://arxiv.org/html/2605.24113)
Willem Diepeveen Department of Mathematics University of California, Los Angeles Los Angeles, CA 90095, USA wdiepeveen@math\.ucla\.edu &Deanna Needell Department of Mathematics University of California, Los Angeles Los Angeles, CA 90095, USA deanna@math\.ucla\.edu

###### Abstract

Classical archetypal analysis is appealing for its interpretability, but its linear geometry can limit performance on data with strongly non\-linear structure; at the same time, existing neural extensions improve flexibility while often weakening the geometric meaning of archetypes and interpolations\. In this work, we develop a Riemannian version of archetypal analysis based on data\-driven pullback geometry for real\-valued data, with the goal of combining the interpretability of classical archetypal analysis with the expressive power of modern non\-linear models\. We introduce a class of deformed star distributions together with associated pullback Riemannian geometry to provide a statistical interpretation of the resulting manifold mappings, define the Riemannian archetypal mapping \(RAM\) as a projection onto the manifold of geodesically convex combinations of archetypes, and propose a practical optimization scheme based on convex relaxation followed by non\-convex refinement\. We further propose a learning scheme that yields reasonable, albeit generally suboptimal, deformed star distributions from data\. Experiments on synthetic examples and MNIST show that the resulting framework produces meaningful geodesics, useful denoising projections, and geometry\-aware classifications, while also clarifying where current optimization limitations remain\.

00footnotetext:Our code is available at[https://github\.com/wdiepeveen/Riemannian\-Archetypal\-Analysis](https://github.com/wdiepeveen/Riemannian-Archetypal-Analysis)\.## 1Introduction

Extracting interpretable features from data is a central task in exploratory analysis, understanding constituent components, and enabling downstream classification and decision\-making\. The usefulness of any such method is largely determined by how well its underlying geometric model fits the data\. For example, principal component analysis \(PCA\)[pearson1901liii](https://arxiv.org/html/2605.24113#bib.bib55)assumes that the data lie near a low\-dimensional linear subspace fitted in a least\-squares sense; when this assumption is violated, the resulting components can be difficult to interpret and of limited practical value\. In response to this limitation, classical but still widely used methods have adopted richer geometric models: independent component analysis \(ICA\)[comon1994independent](https://arxiv.org/html/2605.24113#bib.bib14)models data as a linear mixture of statistically independent components, non\-negative matrix factorization \(NMF\)[paatero1994positive](https://arxiv.org/html/2605.24113#bib.bib53)represents data as lying in the conical hull of non\-negative basis vectors in the positive orthant, and archetypal analysis \(AA\)[cutler1994archetypal](https://arxiv.org/html/2605.24113#bib.bib15)describes data as residing in a polytope whose vertices are themselves data points\. For different data types and end goals, particular combinations and refinements of these basic ideas are often preferred[abdolali2021simplex](https://arxiv.org/html/2605.24113#bib.bib1);[ding2008convex](https://arxiv.org/html/2605.24113#bib.bib24);[kleverov2026non](https://arxiv.org/html/2605.24113#bib.bib39);[li2008minimum](https://arxiv.org/html/2605.24113#bib.bib47);[lin2018maximum](https://arxiv.org/html/2605.24113#bib.bib48);[miao2007endmember](https://arxiv.org/html/2605.24113#bib.bib49);[nascimento2005vertex](https://arxiv.org/html/2605.24113#bib.bib52)\.

In applications where the goal is to identify representative data points \(archetypes\), to quantify how other observations relate to them, and to leverage these relationships in classification or other downstream tasks, AA enjoys several advantages\. Highlighting its advantages over ICA and NMF, AA yields interpretable factors in the form of extremal data points \(unlike ICA, whose statistically independent components are linear directions that need not correspond to specific representative observations\) and applies to general real\-valued data without non\-negativity constraints \(unlike NMF, which requires non\-negative data and typically yields basis elements that are not actual observations\)\. For a comprehensive overview of archetypal analysis and its variants, see[alcacer2025survey](https://arxiv.org/html/2605.24113#bib.bib4)\.

##### Towards non\-linear archetypal analysis

However, AA is inherently linear: it operates in an ambient Euclidean space and encodes geometric assumptions via linear combinations or convex mixtures of data points\. In many modern applications, data concentrate near highly non\-linear manifolds[fefferman2016testing](https://arxiv.org/html/2605.24113#bib.bib26);[whiteley2022statistical](https://arxiv.org/html/2605.24113#bib.bib69), and naive linear or conical models fail to capture the true geometry and to provide meaningful notions of distance or interpolation between archetypes and observations\.

Motivated by this limitation, several non\-linear, neural network\-based extensions of AA have been proposed\. Models such as AAnet[van2019finding](https://arxiv.org/html/2605.24113#bib.bib65)replace the linear archetypal map by a learned encoder\-decoder architecture while retaining an archetypal structure in the latent space, and subsequent adaptations such as Deep AA[keller2021learning](https://arxiv.org/html/2605.24113#bib.bib36), which ground archetypal representations in probabilistic generative models – variational autoencoders \(VAEs\)[kingma2013auto](https://arxiv.org/html/2605.24113#bib.bib37)– rather than standard autoencoders, further extend this idea\. Despite the empirical success of these models in applications such as single\-cell analysis[venkat2025aanet](https://arxiv.org/html/2605.24113#bib.bib66)and related extensions[tasissa2023k](https://arxiv.org/html/2605.24113#bib.bib64);[wieser2025revisiting](https://arxiv.org/html/2605.24113#bib.bib70), current non\-linear AA variants still treat the latent space as an ad hoc Euclidean space, without guarantees that distances or interpolations there reflect meaningful geometric relationships between archetypes and observations – even though such interpretability is a primary motivation for these methods\.

This lack of rigorous geometric interpretability is a common critique of non\-linear dimension reduction methods[chari2023specious](https://arxiv.org/html/2605.24113#bib.bib11), and it has motivated a growing body of work on endowing \(variational\) autoencoder latent spaces with meaningful geometric structure[bhasker2025uncovering](https://arxiv.org/html/2605.24113#bib.bib6);[hartwig2026geodesic](https://arxiv.org/html/2605.24113#bib.bib31);[kohli2021ldle](https://arxiv.org/html/2605.24113#bib.bib40);[lee2025geometry](https://arxiv.org/html/2605.24113#bib.bib42);[moon2019visualizing](https://arxiv.org/html/2605.24113#bib.bib51);[psenka2024representation](https://arxiv.org/html/2605.24113#bib.bib57);[vigouroux2026discovering](https://arxiv.org/html/2605.24113#bib.bib67)\. However, to the best of our knowledge, these efforts have so far not been extended to the non\-linear AA setting\.

##### Towards Riemannian archetypal analysis

Our goal in this work is to develop provably geometrically meaningful non\-linear AA methodologies by leveraging Riemannian geometry, rather than generalizing the above\-mentioned \(still heuristic\) approaches to endowing the non\-linear AA latent space with meaningful geometric structure\. Importantly, this should not be confused with simply choosing a non\-Euclidean latent manifold[cho2023hyperbolic](https://arxiv.org/html/2605.24113#bib.bib13);[davidson2018hyperspherical](https://arxiv.org/html/2605.24113#bib.bib17), which typically inherits the same interpretability issues as standard autoencoders when the latent geometry is not explicitly tied to the data distribution and archetypal structure\.

At a high level, the aim of a Riemannian reformulation of a machine learning method is to learn a Riemannian structure on the ambient space such that the data form \(or are well\-approximated by\) a low\-dimensional, totally geodesic submanifold\. The machine learning task is then phrased as an optimization problem over this data manifold, and solved using specialized Riemannian optimization techniques; see[Figure1](https://arxiv.org/html/2605.24113#S1.F1)for an illustration in the context of archetypal analysis\.

![Refer to caption](https://arxiv.org/html/2605.24113v1/results/river_cross/log_probs.png)\(a\)A deformed star distribution visualized by its level sets\.
![Refer to caption](https://arxiv.org/html/2605.24113v1/results/river_cross/samples_ram_projected_exact_Omega.png)\(b\)An illustration of Riemannian archetypal analysis\.

Figure 1:Data from the deformed star distribution \(blue\) are projected in theℓ2\\ell^\{2\}\-sense \(orange\) onto the manifold bounded by geodesics \(green\) connecting the four archetypes \(red\) via the Riemannian archetypal mapping \(RAM\)\. Each projected point admits a geodesic barycentric representation whose weight vector is sparse for points projected onto a boundary or corner, and points that already lie in the manifold approximation of the data set remain unchanged\.Learning such Riemannian structures and efficiently evaluating manifold maps \(geodesics, exponential, and logarithmic mappings\) from data has been an active research area[arvanitidis2016locally](https://arxiv.org/html/2605.24113#bib.bib5);[diepeveen2024pulling](https://arxiv.org/html/2605.24113#bib.bib18);[hauberg2012geometric](https://arxiv.org/html/2605.24113#bib.bib32);[peltonen2004improved](https://arxiv.org/html/2605.24113#bib.bib56);[Scarvelis2023](https://arxiv.org/html/2605.24113#bib.bib60);[sorrenson2025learning](https://arxiv.org/html/2605.24113#bib.bib62);[sun2024geometryaware](https://arxiv.org/html/2605.24113#bib.bib63); see also[gruffaz2025riemannian](https://arxiv.org/html/2605.24113#bib.bib28)for a recent overview\. Only recently have both ingredients been realized in a unified, data\-driven way[diepeveen2025scorebased](https://arxiv.org/html/2605.24113#bib.bib19), in which volume\-preserving normalizing flows are trained to “unroll” the data manifold, yielding a pullback Riemannian metric with closed\-form manifold mappings that provably follow high\-likelihood regions of the underlying deformed Gaussian density\. On top of such a geometry, Riemannian neural networks become practical\. In particular, Riemannian autoencoders \(RAEs\)[diepeveen2024pulling](https://arxiv.org/html/2605.24113#bib.bib18), which can be viewed as non\-linear generalizations of PCA, implement anℓ2\\ell^\{2\}\-projection onto a low\-dimensional submanifold, where the projection can often be approximated in closed form for sufficiently regular diffeomorphisms[diepeveen2024pulling](https://arxiv.org/html/2605.24113#bib.bib18)or efficiently solved via iso\-Riemannian optimization otherwise[diepeveen2025isoriemanopt](https://arxiv.org/html/2605.24113#bib.bib23)\.

However, just as classical PCA is well suited for \(approximately\) Gaussian data in Euclidean space, RAEs and more generally the framework proposed in[diepeveen2025scorebased](https://arxiv.org/html/2605.24113#bib.bib19)are primarily tailored to settings where the underlying geometric model is a deformed Gaussian distribution obtained via a volume\-preserving diffeomorphism \(or, more generally, a diffeomorphism with constant Jacobian determinant\)\. This is restrictive: not every data distribution can be pushed forward to a Gaussian under such diffeomorphisms, and in particular deformed star distributions – the natural geometric model underlying archetypal analysis – do not fit this paradigm\. In practice, the existing Riemannian geometry workflow has therefore only reached non\-linear PCA\-type methods and remains unable to capture the polytope\- and star\-like geometries \(even in linear settings\) that AA is designed for\.

### 1\.1Contributions

This gap motivates the development of Riemannian archetypal analysis, which is the focus of this work\. Our main contributions are as follows:

- •Pullback geometries for deformed star distributions\.We introduce a broad class of deformed star distributions and define a corresponding family of pullback Riemannian geometries with diffeomorphisms that do not have a constant Jacobian determinant\. We show that pullback geodesics remain within high\-likelihood regions of the distribution and that, by choosing a non\-trivial pullback metric within the proposed family, we obtain more stable and meaningful geodesics\.
- •Riemannian archetypal mappings \(RAMs\)\.Given a pullback structure induced by a deformed star distribution and a collection of archetypes, we define the Riemannian archetype mapping \(RAM\) as the projection of data points onto the manifold of geodesically convex combinations of the archetypes\. We reformulate this projection as a non\-convex constrained optimization problem and propose an algorithm that initializes from a convex relaxation to obtain reliable solutions, whose distances to each archetype can be directly used for classification and other downstream tasks\.
- •Learning archetypes and star distributions\.We explain why standard negative log\-likelihood training is ill\-suited for reliably recovering deformed star distributions, and instead propose a constructive learning scheme combined with classical normalizing flow training\. This scheme is motivated by the observation that multiscale normalizing flows with constant Jacobian determinant naturally induce star\-shaped latent distributions, which we exploit to learn both archetypes and their associated deformed star geometry\.

### 1\.2Related work and broader impact

##### Feature extraction on abstract manifolds

RAEs are built on ideas originally developed in the abstract manifold setting, both for constructing low\-dimensional data manifolds[fletcher2004principal](https://arxiv.org/html/2605.24113#bib.bib27)and for understanding their geometric behavior[diepeveen2025curvature](https://arxiv.org/html/2605.24113#bib.bib20)\. Although RAEs ultimately use a \(non\-intrinsic\)ℓ2\\ell^\{2\}\-projection in the ambient space, this distinction from the fully intrinsic setting can to a large extent be addressed by both learning the manifold and performing data analysis under a connection different from the Levi\-Civita connection called the iso\-connection[diepeveen2025manifold](https://arxiv.org/html/2605.24113#bib.bib22), chosen so that geodesics preserve their shape while acquiring constantℓ2\\ell^\{2\}\-speed and inducing notions of distance that align better with the ambient Euclidean geometry\.

From this perspective, it would be natural to first look for a generalization of AA in the abstract manifold setting and then specialize to data\-driven geometries\. However, to the best of our knowledge there is currently no abstract manifold analogue of AA, in contrast to ICA and NMF, for which manifold\-based generalizations have been proposed[ho2013nonlinear](https://arxiv.org/html/2605.24113#bib.bib34);[chew2025curvature](https://arxiv.org/html/2605.24113#bib.bib12)\. Having said that, these manifold ICA and NMF methods have not yet been adapted to the mixed \(intrinsic\-extrinsic\) metric setting that arises in machine learning applications\. All this suggests that looking for AA outside the purely intrinsic setting could indicate how to construct it for the intrinsic setting and provide a roadmap for extending ICA, NMF, and a broad family of manifold\-based methods[miolane2020geomstats](https://arxiv.org/html/2605.24113#bib.bib50)to a substantially wider range of applications\.

##### Signal processing on learned data manifolds

For RAEs induced by diffeomorphisms that are not highly regular \(in particular, that do not have constant Jacobian determinant as in our deformed star setting\), one generally has to resort to optimization rather than closed\-form approximations of projections\. Classical Riemannian optimization methods[absil2008optimization](https://arxiv.org/html/2605.24113#bib.bib2);[boumal2023introduction](https://arxiv.org/html/2605.24113#bib.bib8)rely on geodesic convexity to obtain efficient algorithms, but this assumption is typically too strong in the mixed\-metric setting\. One can still formally apply Riemannian gradient descent; however, for pullback geometries this is effectively equivalent to performing gradient descent in a chart\.

Such chart\-based optimization can be made to work[diepeveen2026riemannian](https://arxiv.org/html/2605.24113#bib.bib21)and underlies several non\-pullback, latent\-space or chart\-based approaches[alberti2024manifold](https://arxiv.org/html/2605.24113#bib.bib3);[daras2021intermediate](https://arxiv.org/html/2605.24113#bib.bib16);[hand2018phase](https://arxiv.org/html/2605.24113#bib.bib29);[hand2018global](https://arxiv.org/html/2605.24113#bib.bib30);[lei2019inverting](https://arxiv.org/html/2605.24113#bib.bib43);[robinett2025manifold](https://arxiv.org/html/2605.24113#bib.bib58);[shustin2022manifold](https://arxiv.org/html/2605.24113#bib.bib61), but strong convexity and Lipschitz properties of the objective often deteriorate under the chart, forcing very small step sizes in practice\. Very recently, geodesic convexity on the data manifold and Euclidean convexity of the objective were combined more effectively[diepeveen2025isoriemanopt](https://arxiv.org/html/2605.24113#bib.bib23)by switching the iso\-connection[diepeveen2025manifold](https://arxiv.org/html/2605.24113#bib.bib22), in a way analogous to the fix used for mixed\-metric dimension reduction mentioned before\. This leads to an iso\-Riemannian optimization framework that identifies the “right” vector field for descent and permits larger step sizes, yielding faster convergence than standard chart\-based methods\.

At present, however, only the iso\-Riemannian analogue of gradient descent has been systematically developed[diepeveen2025isoriemanopt](https://arxiv.org/html/2605.24113#bib.bib23), and there is no general iso\-Riemannian framework for handling constraints – precisely what is needed to accelerate evaluations of RAMs in our setting, where the underlying diffeomorphisms are far from having constant Jacobian determinant\. Developing constrained optimization under a star\-shaped geometry thus not only enables our Riemannian archetypal analysis, but also points to a natural next step for the iso\-Riemannian optimization literature\.

##### Star geometry in machine learning and data science

More broadly, star\-shaped data distributions and their deformations have recently begun to play a prominent role both in explaining the success of modern machine learning methods and in suggesting more principled design directions\. For instance, non\-convex regularizers based on star geometry can be shown to be optimal for certain data models[leong2025optimal](https://arxiv.org/html/2605.24113#bib.bib45), motivating the development of non\-convex regularization schemes in inverse problems and learning[hertrich2025learning](https://arxiv.org/html/2605.24113#bib.bib33);[leong2024star](https://arxiv.org/html/2605.24113#bib.bib44)\. Related geometric star models also underlie recent analyses of diffusion models, where such structures lead to efficient learning and recovery guarantees[wang2025diffusion](https://arxiv.org/html/2605.24113#bib.bib68);[leong2025recovery](https://arxiv.org/html/2605.24113#bib.bib46), and they appear in white\-box representations of deep networks: deformed star\-geometry\-based models for architectures such as ReduNet[chan2022redunet](https://arxiv.org/html/2605.24113#bib.bib10)and for transformers[yu2023white](https://arxiv.org/html/2605.24113#bib.bib71)reveal explicit links between star\-shaped latent structure and classification or sequence modeling performance\. All of this suggests that a deformed star\-based Riemannian model may simultaneously sharpen the statistical understanding of machine learning methods more broadly and provide a natural geometric foundation for Riemannian neural networks in realistic data regimes\.

### 1\.3Outline

[Section2](https://arxiv.org/html/2605.24113#S2)recalls the Riemannian and pullback geometry underlying our constructions and highlights limitations of current pullback manifold learning, further motivating the need for our approach\. We then develop a Riemannian formulation of archetypal analysis and its data\-driven instantiation in a step\-by\-step manner\. First,[Section3](https://arxiv.org/html/2605.24113#S3)introduces deformed star distributions as the statistical backbone of the assumed data geometry, together with their pullback structure, and shows how these induce stable, high\-likelihood geodesics adapted to archetypal structure, while iso\-Riemannian geometry restores an interpretable notion of time\. Next,[Section4](https://arxiv.org/html/2605.24113#S4)derives the Riemannian archetypal mapping \(RAM\) from these statistical and geometric assumptions, develops relaxed and refined optimization schemes for its evaluation, and explains how iso\-corrected weights can be used for classification\. Building on this,[Section5](https://arxiv.org/html/2605.24113#S5)proposes a practical three\-step learning procedure that produces reasonable, albeit suboptimal, deformed star models from data, thereby making the tools of[Sections3](https://arxiv.org/html/2605.24113#S3)and[4](https://arxiv.org/html/2605.24113#S4)accessible in practice\. To assess how well the star\-shaped assumption fits real data and to expose practical limitations,[Section6](https://arxiv.org/html/2605.24113#S6)reports experiments on MNIST, focusing on interpolation, denoising, and classification under the learned geometry\. Finally,[Section7](https://arxiv.org/html/2605.24113#S7)summarizes the main findings and outlines directions for future work\.

## 2Preliminaries

For the purposes of this work we will need the following notions and results, which we present in basic notations from differential and Riemannian geometry, see[boothby2003introduction](https://arxiv.org/html/2605.24113#bib.bib7);[carmo1992riemannian](https://arxiv.org/html/2605.24113#bib.bib9);[lee2013smooth](https://arxiv.org/html/2605.24113#bib.bib41);[sakai1996riemannian](https://arxiv.org/html/2605.24113#bib.bib59)for details\.

##### Smooth manifolds and tangent spaces

Letℳ\\mathcal\{M\}be a*dd\-dimensional smooth manifold*, i\.e\., a topological manifold of dimensionddequipped with a*maximal smooth atlas*, meaning a collection of charts whose transition functions are all smooth, making the manifold locally diffeomorphic toℝd\\mathbb\{R\}^\{d\}\. We writeC∞\(ℳ\)C^\{\\infty\}\(\\mathcal\{M\}\)for the space of smooth functions overℳ\\mathcal\{M\}\. The*tangent space*at𝐩∈ℳ\\mathbf\{p\}\\in\\mathcal\{M\}, which is defined as the space of all*derivations*at𝐩\\mathbf\{p\}, is denoted by𝒯𝐩ℳ\\mathcal\{T\}\_\{\\mathbf\{p\}\}\\mathcal\{M\}and for*tangent vectors*we writeΞ𝐩∈𝒯𝐩ℳ\\Xi\_\{\\mathbf\{p\}\}\\in\\mathcal\{T\}\_\{\\mathbf\{p\}\}\\mathcal\{M\}\. For the*tangent bundle*we write𝒯ℳ\\mathcal\{T\}\\mathcal\{M\}and smooth vector fields, which are defined as*smooth sections*of the tangent bundle, are written as𝒳\(ℳ\)⊂𝒯ℳ\\mathscr\{X\}\(\\mathcal\{M\}\)\\subset\\mathcal\{T\}\\mathcal\{M\}\.

##### Riemannian manifolds

A smooth manifoldℳ\\mathcal\{M\}becomes a*Riemannian manifold*if it is equipped with a smoothly varying*metric tensor field*\(⋅,⋅\):𝒳\(ℳ\)×𝒳\(ℳ\)→C∞\(ℳ\)\(\\cdot,\\cdot\):\\mathscr\{X\}\(\\mathcal\{M\}\)\\times\\mathscr\{X\}\(\\mathcal\{M\}\)\\to C^\{\\infty\}\(\\mathcal\{M\}\)\. This tensor field induces a*\(Riemannian\) metric*dℳ:ℳ×ℳ→ℝd\_\{\\mathcal\{M\}\}:\\mathcal\{M\}\\times\\mathcal\{M\}\\to\\mathbb\{R\}\. The metric tensor can also be used to construct a unique affine connection, the*Levi\-Civita connection*, that is denoted by∇\(⋅\)\(⋅\):𝒳\(ℳ\)×𝒳\(ℳ\)→𝒳\(ℳ\)\\nabla\_\{\(\\,\\cdot\\,\)\}\(\\,\\cdot\\,\):\\mathscr\{X\}\(\\mathcal\{M\}\)\\times\\mathscr\{X\}\(\\mathcal\{M\}\)\\to\\mathscr\{X\}\(\\mathcal\{M\}\)\. This connection is in turn the cornerstone of a myriad of manifold mappings\.

One is the notion of a*geodesic*, which for two points𝐩,𝐪∈ℳ\\mathbf\{p\},\\mathbf\{q\}\\in\\mathcal\{M\}is defined as a curveγ𝐩,𝐪:\[0,1\]→ℳ\\gamma\_\{\\mathbf\{p\},\\mathbf\{q\}\}:\[0,1\]\\to\\mathcal\{M\}with minimal length that connects𝐩\\mathbf\{p\}with𝐪\\mathbf\{q\}– that is, if such a curve exists\. To ensure existence, we often consider*\(geodesically\) convex*subsets, i\.e\., sets𝒟⊂ℳ\\mathcal\{D\}\\subset\\mathcal\{M\}such thatγ𝐩,𝐪⊂𝒟\\gamma\_\{\\mathbf\{p\},\\mathbf\{q\}\}\\subset\\mathcal\{D\}for any𝐩,𝐪∈𝒟\\mathbf\{p\},\\mathbf\{q\}\\in\\mathcal\{D\}\. In addition, when geodesics are also unique on𝒟\\mathcal\{D\}, we call𝒟\\mathcal\{D\}*strongly \(geodesically\) convex*\. Another closely related notion to geodesics is the curvet↦γ𝐩,Ξ𝐩\(t\)t\\mapsto\\gamma\_\{\\mathbf\{p\},\\Xi\_\{\\mathbf\{p\}\}\}\(t\)for a geodesic starting from𝐩∈ℳ\\mathbf\{p\}\\in\\mathcal\{M\}with velocityγ˙𝐩,Ξ𝐩\(0\)=Ξ𝐩∈𝒯𝐩ℳ\\dot\{\\gamma\}\_\{\\mathbf\{p\},\\Xi\_\{\\mathbf\{p\}\}\}\(0\)=\\Xi\_\{\\mathbf\{p\}\}\\in\\mathcal\{T\}\_\{\\mathbf\{p\}\}\\mathcal\{M\}\. This can be used to define the*exponential map*exp𝐩:𝒟𝐩→ℳ\\exp\_\{\\mathbf\{p\}\}:\\mathcal\{D\}\_\{\\mathbf\{p\}\}\\to\\mathcal\{M\}at𝐩\\mathbf\{p\}asexp𝐩⁡\(Ξ𝐩\):=γ𝐩,Ξ𝐩\(1\),\\exp\_\{\\mathbf\{p\}\}\(\\Xi\_\{\\mathbf\{p\}\}\):=\\gamma\_\{\\mathbf\{p\},\\Xi\_\{\\mathbf\{p\}\}\}\(1\),where𝒟𝐩⊂𝒯𝐩ℳ\\mathcal\{D\}\_\{\\mathbf\{p\}\}\\subset\\mathcal\{T\}\_\{\\mathbf\{p\}\}\\mathcal\{M\}is the set on whichγ𝐩,Ξ𝐩\(1\)\\gamma\_\{\\mathbf\{p\},\\Xi\_\{\\mathbf\{p\}\}\}\(1\)is defined\. The manifoldℳ\\mathcal\{M\}is said to be*\(geodesically\) complete*whenever𝒟𝐩=𝒯𝐩ℳ\\mathcal\{D\}\_\{\\mathbf\{p\}\}=\\mathcal\{T\}\_\{\\mathbf\{p\}\}\\mathcal\{M\}for all𝐩∈ℳ\\mathbf\{p\}\\in\\mathcal\{M\}\. Furthermore, the*logarithmic map*log𝐩:exp𝐩⁡\(𝒟𝐩′\)→𝒟𝐩′\\log\_\{\\mathbf\{p\}\}:\\exp\_\{\\mathbf\{p\}\}\(\\mathcal\{D\}^\{\\prime\}\_\{\\mathbf\{p\}\}\)\\to\\mathcal\{D\}^\{\\prime\}\_\{\\mathbf\{p\}\}at𝐩\\mathbf\{p\}is defined as the inverse ofexp𝐩\\exp\_\{\\mathbf\{p\}\}, so it is well\-defined on𝒟𝐩′⊂𝒟𝐩\\mathcal\{D\}^\{\\prime\}\_\{\\mathbf\{p\}\}\\subset\\mathcal\{D\}\_\{\\mathbf\{p\}\}whereexp𝐩\\exp\_\{\\mathbf\{p\}\}is a diffeomorphism\. Moreover, for*parallel transport*𝒫𝐪←𝐩:𝒯𝐩ℳ→𝒯𝐪ℳ\\mathcal\{P\}\_\{\\mathbf\{q\}\\leftarrow\\mathbf\{p\}\}:\\mathcal\{T\}\_\{\\mathbf\{p\}\}\\mathcal\{M\}\\to\\mathcal\{T\}\_\{\\mathbf\{q\}\}\\mathcal\{M\}of a vectorΞ𝐩∈𝒯𝐩ℳ\\Xi\_\{\\mathbf\{p\}\}\\in\\mathcal\{T\}\_\{\\mathbf\{p\}\}\\mathcal\{M\}along a geodesic from𝐩\\mathbf\{p\}to𝐪\\mathbf\{q\}we write𝒫𝐪←𝐩Ξ𝐩\\mathcal\{P\}\_\{\\mathbf\{q\}\\leftarrow\\mathbf\{p\}\}\\Xi\_\{\\mathbf\{p\}\}\.

##### Pullback manifolds

If\(ℳ,\(⋅,⋅\)\)\(\\mathcal\{M\},\(\\cdot,\\cdot\)\)is add\-dimensional Riemannian manifold,𝒩\\mathcal\{N\}is add\-dimensional smooth manifold andφ:𝒩→ℳ\\varphi:\\mathcal\{N\}\\to\\mathcal\{M\}is a diffeomorphism, the*pullback metric*

\(Ξ,Φ\)𝐩φ:=\(D𝐩φ\[Ξ𝐩\],D𝐩φ\[Φ𝐩\]\)φ\(𝐩\),𝐩∈𝒩,Ξ,Φ∈𝒳\(𝒩\)\(\\Xi,\\Phi\)\_\{\\mathbf\{p\}\}^\{\\varphi\}:=\(D\_\{\\mathbf\{p\}\}\\varphi\[\\Xi\_\{\\mathbf\{p\}\}\],D\_\{\\mathbf\{p\}\}\\varphi\[\\Phi\_\{\\mathbf\{p\}\}\]\)\_\{\\varphi\(\\mathbf\{p\}\)\},\\quad\\mathbf\{p\}\\in\\mathcal\{N\},\\Xi,\\Phi\\in\\mathscr\{X\}\(\\mathcal\{N\}\)\(1\)whereD𝐩φ:𝒯𝐩𝒩→𝒯φ\(𝐩\)ℳD\_\{\\mathbf\{p\}\}\\varphi:\\mathcal\{T\}\_\{\\mathbf\{p\}\}\\mathcal\{N\}\\to\\mathcal\{T\}\_\{\\varphi\(\\mathbf\{p\}\)\}\\mathcal\{M\}denotes the differential ofφ\\varphi, defines a Riemannian structure on𝒩\\mathcal\{N\}, which we denote by\(𝒩,\(⋅,⋅\)φ\)\(\\mathcal\{N\},\(\\cdot,\\cdot\)^\{\\varphi\}\)\. Pullback mappings are denoted similarly to \([1](https://arxiv.org/html/2605.24113#S2.E1)\) with the diffeomorphismφ\\varphias a superscript, i\.e\., we writed𝒩φ\(𝐩,𝐪\)d^\{\\varphi\}\_\{\\mathcal\{N\}\}\(\\mathbf\{p\},\\mathbf\{q\}\),γ𝐩,𝐪φ\\gamma^\{\\varphi\}\_\{\\mathbf\{p\},\\mathbf\{q\}\},exp𝐩φ⁡\(Ξ𝐩\)\\exp^\{\\varphi\}\_\{\\mathbf\{p\}\}\(\\Xi\_\{\\mathbf\{p\}\}\),log𝐩φ⁡𝐪\\log^\{\\varphi\}\_\{\\mathbf\{p\}\}\\mathbf\{q\}, and𝒫𝐪←𝐩φ\\mathcal\{P\}^\{\\varphi\}\_\{\\mathbf\{q\}\\leftarrow\\mathbf\{p\}\}for𝐩,𝐪∈𝒩\\mathbf\{p\},\\mathbf\{q\}\\in\\mathcal\{N\}andΞ𝐩∈𝒯𝐩𝒩\\Xi\_\{\\mathbf\{p\}\}\\in\\mathcal\{T\}\_\{\\mathbf\{p\}\}\\mathcal\{N\}\. Pullback metrics literally pull back all geometric information from the Riemannian structure onℳ\\mathcal\{M\}\. In particular, closed\-form manifold mappings on\(ℳ,\(⋅,⋅\)\)\(\\mathcal\{M\},\(\\cdot,\\cdot\)\)yield under mild assumptions closed\-form manifold mappings on\(𝒩,\(⋅,⋅\)φ\)\(\\mathcal\{N\},\(\\cdot,\\cdot\)^\{\\varphi\}\)\.

##### Data\-driven Euclidean pullback manifolds

Notably, for Euclidean pullback manifolds\(ℝd,\(⋅,⋅\)φ\)\(\\mathbb\{R\}^\{d\},\(\\cdot,\\cdot\)^\{\\varphi\}\)generated by a diffeomorphismφ:ℝd→ℝd\\varphi:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d\}pulling back the standard Euclidean structure\(ℝd,\(⋅,⋅\)2\)\(\\mathbb\{R\}^\{d\},\(\\cdot,\\cdot\)\_\{2\}\)– which is how scalable data\-driven Riemannian geometry is constructed for high\-dimensional data[diepeveen2025scorebased](https://arxiv.org/html/2605.24113#bib.bib19);[diepeveen2025manifold](https://arxiv.org/html/2605.24113#bib.bib22)–, we have\([diepeveen2024pulling,](https://arxiv.org/html/2605.24113#bib.bib18), Prop 2\.1\)

dℝdφ\(𝐱,𝐲\)\\displaystyle d\_\{\\mathbb\{R\}^\{d\}\}^\{\\varphi\}\(\\mathbf\{x\},\\mathbf\{y\}\)=‖φ\(𝐱\)−φ\(𝐲\)‖2,\\displaystyle=\\\|\\varphi\(\\mathbf\{x\}\)\-\\varphi\(\\mathbf\{y\}\)\\\|\_\{2\},\(2\)γ𝐱,𝐲φ\(t\)\\displaystyle\\gamma^\{\\varphi\}\_\{\\mathbf\{x\},\\mathbf\{y\}\}\(t\)=φ−1\(\(1−t\)φ\(𝐱\)\+tφ\(𝐲\)\),\\displaystyle=\\varphi^\{\-1\}\(\(1\-t\)\\varphi\(\\mathbf\{x\}\)\+t\\varphi\(\\mathbf\{y\}\)\),\(3\)exp𝐱φ⁡\(Ξ𝐱\)\\displaystyle\\exp^\{\\varphi\}\_\{\\mathbf\{x\}\}\(\\Xi\_\{\\mathbf\{x\}\}\)=φ−1\(φ\(𝐱\)\+D𝐱φ\[Ξ𝐱\]\),\\displaystyle=\\varphi^\{\-1\}\(\\varphi\(\\mathbf\{x\}\)\+D\_\{\\mathbf\{x\}\}\\varphi\[\\Xi\_\{\\mathbf\{x\}\}\]\),\(4\)log𝐱φ⁡\(𝐲\)\\displaystyle\\log^\{\\varphi\}\_\{\\mathbf\{x\}\}\(\\mathbf\{y\}\)=Dφ\(𝐱\)φ−1\[φ\(𝐲\)−φ\(𝐱\)\],\\displaystyle=D\_\{\\varphi\(\\mathbf\{x\}\)\}\\varphi^\{\-1\}\[\\varphi\(\\mathbf\{y\}\)\-\\varphi\(\\mathbf\{x\}\)\],\(5\)𝒫𝐲←𝐱φΞ𝐱\\displaystyle\\mathcal\{P\}^\{\\varphi\}\_\{\\mathbf\{y\}\\leftarrow\\mathbf\{x\}\}\\Xi\_\{\\mathbf\{x\}\}=Dφ\(𝐲\)φ−1\[D𝐱φ\[Ξ𝐱\]\],\\displaystyle=D\_\{\\varphi\(\\mathbf\{y\}\)\}\\varphi^\{\-1\}\[D\_\{\\mathbf\{x\}\}\\varphi\[\\Xi\_\{\\mathbf\{x\}\}\]\],\(6\)where𝐱,𝐲∈ℝd\\mathbf\{x\},\\mathbf\{y\}\\in\\mathbb\{R\}^\{d\}andΞ𝐱∈𝒯𝐱ℝd≅ℝd\\Xi\_\{\\mathbf\{x\}\}\\in\\mathcal\{T\}\_\{\\mathbf\{x\}\}\\mathbb\{R\}^\{d\}\\cong\\mathbb\{R\}^\{d\}, and have\([diepeveen2024pulling,](https://arxiv.org/html/2605.24113#bib.bib18), Prop 3\.7\)

argmin𝐱∈ℝd∑i=1ndℝdφ\(𝐱,𝐱i\)2=φ−1\(1n∑i=1nφ\(𝐱i\)\),\\operatorname\{argmin\}\_\{\\mathbf\{x\}\\in\\mathbb\{R\}^\{d\}\}\\sum\_\{i=1\}^\{n\}d^\{\\varphi\}\_\{\\mathbb\{R\}^\{d\}\}\(\\mathbf\{x\},\\mathbf\{x\}^\{i\}\)^\{2\}=\\varphi^\{\-1\}\(\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}\\varphi\(\\mathbf\{x\}^\{i\}\)\),\(7\)for the Riemannian barycentre[karcher1977riemannian](https://arxiv.org/html/2605.24113#bib.bib35), where𝐱1,…,𝐱n∈ℝd\\mathbf\{x\}^\{1\},\\ldots,\\mathbf\{x\}^\{n\}\\in\\mathbb\{R\}^\{d\}\.

In the context of a data\-driven pullback structure, the manifold mappings above gain a practical interpretation\. A well\-trainedφ\\varphiessentially flattens out the data space, i\.e\., it maps a data set – residing close to a non\-linear data manifold – into the vicinity of a \(low\-dimensional\) linear subspace ofℝd\\mathbb\{R\}^\{d\}\. Manifold mappings are essentially computed using Euclidean rules applied to points and tangent vectors mapped into this linear subspace byφ\\varphiand then mapped back to the original data domain usingφ−1\\varphi^\{\-1\}\. As a result, geodesics between two points will always move through regions with large amounts of data – or probabilistically speaking through regions with high likelihood\. For a more detailed discussion and the manifold mapping for the general pullback setting, we refer the reader to[diepeveen2024pulling](https://arxiv.org/html/2605.24113#bib.bib18)\.

##### Learning Euclidean pullback manifolds

To learn such a diffeomorphismφ:ℝd→ℝd\\varphi:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d\}that generates geodesics that interpolate through regions of high likelihood, normalizing flow training has shown to be a scalable approach[diepeveen2025scorebased](https://arxiv.org/html/2605.24113#bib.bib19);[diepeveen2025manifold](https://arxiv.org/html/2605.24113#bib.bib22)\. Following[diepeveen2025manifold](https://arxiv.org/html/2605.24113#bib.bib22), this boils down to minimizing the negative log likelihood loss

ℒ\(θ\):=𝔼𝐗∼pdata\[−log⁡pφθ\(𝐗\)\],\\mathcal\{L\}\(\\theta\):=\\mathbb\{E\}\_\{\\mathbf\{\\mathbf\{X\}\}\\sim p\_\{\\text\{data \}\}\}\\left\[\-\\log p\_\{\\varphi\_\{\\theta\}\}\(\\mathbf\{X\}\)\\right\],\(8\)wherepφθ:ℝd→ℝp\_\{\\varphi\_\{\\theta\}\}:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}is given by

pφθ\(𝐱\):=1\(2π\)de−12‖φθ\(𝐱\)‖22\|det\(D𝐱φθ\)\|,p\_\{\\varphi\_\{\\theta\}\}\(\\mathbf\{x\}\):=\\frac\{1\}\{\\sqrt\{\(2\\pi\)^\{d\}\}\}e^\{\-\\frac\{1\}\{2\}\\\|\\varphi\_\{\\theta\}\(\\mathbf\{x\}\)\\\|\_\{2\}^\{2\}\}\|\\det\(D\_\{\\mathbf\{x\}\}\\varphi\_\{\\theta\}\)\|,\(9\)and whereφθ:ℝd→ℝd\\varphi\_\{\\theta\}:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d\}is an invertible neural network with parametersθ\\thetasuch that the mapping𝐱↦\|det\(D𝐱φθ\)\|\\mathbf\{x\}\\mapsto\|\\det\(D\_\{\\mathbf\{x\}\}\\varphi\_\{\\theta\}\)\|is constant\. In practice, the latter constraint is typically guaranteed by using additive coupling layers[dinh2014nice](https://arxiv.org/html/2605.24113#bib.bib25)combined with invertible linear channel mixing and normalization strategies[kingma2018glow](https://arxiv.org/html/2605.24113#bib.bib38)\.

To get intuition as to why this approach yields a suitable diffeomorphism and pullback geometry by extension, we first note that the functiont↦−log⁡\(pφθ\(γ𝐱,𝐲φθ\(t\)\)\)t\\mapsto\-\\log\\bigl\(p\_\{\\varphi\_\{\\theta\}\}\(\\gamma^\{\\varphi\_\{\\theta\}\}\_\{\\mathbf\{x\},\\mathbf\{y\}\}\(t\)\)\\bigr\)is strongly convex for any combination of end points𝐱,𝐲∈ℝd\\mathbf\{x\},\\mathbf\{y\}\\in\\mathbb\{R\}^\{d\}and network parametrizationθ\\theta\(\([diepeveen2025scorebased,](https://arxiv.org/html/2605.24113#bib.bib19), Thm\. 3\.3\)\)\. Then, ifpdatap\_\{\\text\{data\}\}is feasible, i\.e\., there exists someθ∗\\theta^\{\*\}such thatpφθ∗=pdatap\_\{\\varphi\_\{\\theta^\{\*\}\}\}=p\_\{\\text\{data\}\}, minimizing \([8](https://arxiv.org/html/2605.24113#S2.E8)\) will find thisθ∗\\theta^\{\*\}\. In other words, if we havepφθ∗=pdatap\_\{\\varphi\_\{\\theta^\{\*\}\}\}=p\_\{\\text\{data\}\}this means that geodesicsγ𝐱,𝐲φθ∗\\gamma^\{\\varphi\_\{\\theta^\{\*\}\}\}\_\{\\mathbf\{x\},\\mathbf\{y\}\}between data points move through regions with higher likelihood than the end points, which is exactly what we set out to do\.

## 3Riemannian geometry for deformed star distributions

Before discussing how to formulate Riemannian archetypal analysis and learn the associated Riemannian structures, we first address[Remark1](https://arxiv.org/html/2605.24113#Thmremark1)and introduce a class of data distributions that provides a more suitable starting point for this endeavor and discuss several geometric considerations\. In what follows, we formalize this class as a family of deformed star distributions\. In the archetypal analysis setting, archetypes naturally concentrate near the “corners” of the deformed star\. We then systematically study the associated pullback structures they induce and address interpretability issues of geodesics via the iso\-Riemannian geometry generated by these pullback metrics\. All proofs are deferred to[AppendixA](https://arxiv.org/html/2605.24113#A1)\.

### 3\.1Deformed star distributions

First, we seek a class of distributions that generalizes \([9](https://arxiv.org/html/2605.24113#S2.E9)\)\. In particular, we will consider densitiespϕ,ρ:ℝd→ℝp\_\{\\phi,\\rho\}:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}of the form

pϕ,ρ\(𝐱\):=12d2−1Γ\(d2\)e−12ρ\(ϕ\(𝐱\)‖ϕ\(𝐱\)‖2\)−2‖ϕ\(𝐱\)‖22∫𝕊d−1ρ\(ω\)d𝑑σ\(ω\)\|detD𝐱ϕ\|,p\_\{\\phi,\\rho\}\(\\mathbf\{x\}\):=\\frac\{1\}\{2^\{\\frac\{d\}\{2\}\-1\}\\Gamma\\left\(\\frac\{d\}\{2\}\\right\)\}\\frac\{e^\{\-\\frac\{1\}\{2\}\\rho\(\\frac\{\\phi\(\\mathbf\{x\}\)\}\{\\\|\\phi\(\\mathbf\{x\}\)\\\|\_\{2\}\}\)^\{\-2\}\\\|\\phi\(\\mathbf\{x\}\)\\\|\_\{2\}^\{2\}\}\}\{\\int\_\{\\mathbb\{S\}^\{d\-1\}\}\\rho\(\\omega\)^\{d\}d\\sigma\(\\omega\)\}\|\\det D\_\{\\mathbf\{x\}\}\\phi\|,\(10\)whereϕ:ℝd→ℝd\\phi:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d\}is a diffeomorphism with constant Jacobian determinant,ρ:𝕊d−1→ℝ\>0\\rho:\\mathbb\{S\}^\{d\-1\}\\to\\mathbb\{R\}\_\{\>0\}is a radial function whose role is to modulate the radial scaling in each direction \(so that the “star corners” correspond to directionsω\\omegawith large values ofρ\(ω\)\\rho\(\\omega\)\), andσ\\sigmais the standard surface measure on the unit sphere𝕊d−1⊂ℝd\\mathbb\{S\}^\{d\-1\}\\subset\\mathbb\{R\}^\{d\}, i\.e\. the\(d−1\)\(d\-1\)\-dimensional Hausdorff measure restricted to𝕊d−1\\mathbb\{S\}^\{d\-1\}– so thatσ\(𝕊d−1\)=2πd2/Γ\(d2\)\\sigma\(\\mathbb\{S\}^\{d\-1\}\)=2\\pi^\{\\frac\{d\}\{2\}\}/\\Gamma\\left\(\\frac\{d\}\{2\}\\right\)\.

Densities of the form \([10](https://arxiv.org/html/2605.24113#S3.E10)\) exhibit a deformed star\-shaped structure, as depicted in[Figure1\(a\)](https://arxiv.org/html/2605.24113#S1.F1.sf1), and can be interpreted as the pushforward – viaϕ−1\\phi^\{\-1\}– of a latent star\-shaped distribution whose level sets are completely determined by the radial functionρ\\rho\. It is worth highlighting that \([10](https://arxiv.org/html/2605.24113#S3.E10)\) reduces to a distribution of the form \([9](https://arxiv.org/html/2605.24113#S2.E9)\) whenρ\(ω\):=1\\rho\(\\omega\):=1for allω∈𝕊d−1\\omega\\in\\mathbb\{S\}^\{d\-1\}and that for generalρ\\rhothis family of densities still defines valid probability distributions\.

###### Proposition 1\.

Letϕ:ℝd→ℝd\\phi:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d\}be a smooth diffeomorphism and letρ:𝕊d−1→ℝ\>0\\rho:\\mathbb\{S\}^\{d\-1\}\\to\\mathbb\{R\}\_\{\>0\}be a radial function\.

Then, the functionpϕ,ρp\_\{\\phi,\\rho\}defined as in \([10](https://arxiv.org/html/2605.24113#S3.E10)\) is a probability density onℝd\\mathbb\{R\}^\{d\}, i\.e\.,

∫ℝdpϕ,ρ\(𝐱\)𝑑𝐱=1\.\\int\_\{\\mathbb\{R\}^\{d\}\}p\_\{\\phi,\\rho\}\(\\mathbf\{x\}\)\\,d\\mathbf\{x\}=1\.\(11\)

### 3\.2Pullback geometry and stability

Assuming that the data distribution is of the form \([10](https://arxiv.org/html/2605.24113#S3.E10)\), our next goal is to construct a diffeomorphismφ:ℝd→ℝd\\varphi:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d\}– presumably depending onϕ\\phiandρ\\rho– such that, for any choice of endpoints𝐱,𝐲∈ℝd\\mathbf\{x\},\\mathbf\{y\}\\in\\mathbb\{R\}^\{d\}, the mappingt↦−log⁡\(pϕ,ρ\(γ𝐱,𝐲φ\(t\)\)\)t\\mapsto\-\\log\\bigl\(p\_\{\\phi,\\rho\}\(\\gamma^\{\\varphi\}\_\{\\mathbf\{x\},\\mathbf\{y\}\}\(t\)\)\\bigr\)is strongly convex\. In direct analogy with the deformed Gaussian setting, this would imply that geodesics between data points pass through regions of higher likelihood than the endpoints, but now with respect to a more realistic data distribution than the deformed Gaussian model\.

In contrast to the deformed Gaussian case, however, obtaining stable and meaningful geodesics is somewhat less straightforward here\. Before turning to the construction in detail, we first describe the basic building blocks of the pullback geometry: in addition to the diffeomorphismϕ\\phi, we will introduce two further mappings\.

###### Proposition 2\.

Letρ:𝕊d−1→ℝ\>0\\rho:\\mathbb\{S\}^\{d\-1\}\\to\\mathbb\{R\}\_\{\>0\}be a radial function and letv:ℝ\>0→ℝ\>0v:\\mathbb\{R\}\_\{\>0\}\\to\\mathbb\{R\}\_\{\>0\}be a strictly increasing function\.

1. \(i\)The mappingχρ:ℝd→ℝd\\chi\_\{\\rho\}:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d\}defined as χρ\(𝐱\):=\{ρ\(𝐱‖𝐱‖2\)−1𝐱‖𝐱‖2\>0𝟎‖𝐱‖2=0,\\chi\_\{\\rho\}\(\\mathbf\{x\}\):=\\left\\\{\\begin\{matrix\}\\rho\\bigl\(\\frac\{\\mathbf\{x\}\}\{\\\|\\mathbf\{x\}\\\|\_\{2\}\}\\bigr\)^\{\-1\}\\mathbf\{x\}&\\\|\\mathbf\{x\}\\\|\_\{2\}\>0\\\\ \\mathbf\{0\}&\\\|\\mathbf\{x\}\\\|\_\{2\}=0\\\\ \\end\{matrix\}\\right\.,\(12\)is invertible and its inverse is given by χρ−1\(𝐲\):=\{ρ\(𝐲‖𝐲‖2\)𝐲‖𝐲‖2\>0𝟎‖𝐲‖2=0\.\\chi\_\{\\rho\}^\{\-1\}\(\\mathbf\{y\}\):=\\left\\\{\\begin\{matrix\}\\rho\\bigl\(\\frac\{\\mathbf\{y\}\}\{\\\|\\mathbf\{y\}\\\|\_\{2\}\}\\bigr\)\\mathbf\{y\}&\\\|\\mathbf\{y\}\\\|\_\{2\}\>0\\\\ \\mathbf\{0\}&\\\|\\mathbf\{y\}\\\|\_\{2\}=0\\\\ \\end\{matrix\}\\right\.\.\(13\)
2. \(ii\)The mappingψv:ℝd→ℝd\\psi\_\{v\}:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d\}defined as ψv\(𝐱\):=\{v\(‖𝐱‖2\)𝐱‖𝐱‖2‖𝐱‖2\>0𝟎‖𝐱‖2=0,\\psi\_\{v\}\(\\mathbf\{x\}\):=\\left\\\{\\begin\{matrix\}v\\bigl\(\\\|\\mathbf\{x\}\\\|\_\{2\}\\bigr\)\\frac\{\\mathbf\{x\}\}\{\\\|\\mathbf\{x\}\\\|\_\{2\}\}&\\\|\\mathbf\{x\}\\\|\_\{2\}\>0\\\\ \\mathbf\{0\}&\\\|\\mathbf\{x\}\\\|\_\{2\}=0\\\\ \\end\{matrix\}\\right\.,\(14\)is invertible and its inverse is given by ψv−1\(𝐲\):=\{v\(‖𝐲‖2\)−1𝐲‖𝐲‖2‖𝐲‖2\>0𝟎‖𝐲‖2=0\.\\psi\_\{v\}^\{\-1\}\(\\mathbf\{y\}\):=\\left\\\{\\begin\{matrix\}v\\bigl\(\\\|\\mathbf\{y\}\\\|\_\{2\}\\bigr\)^\{\-1\}\\frac\{\\mathbf\{y\}\}\{\\\|\\mathbf\{y\}\\\|\_\{2\}\}&\\\|\\mathbf\{y\}\\\|\_\{2\}\>0\\\\ \\mathbf\{0\}&\\\|\\mathbf\{y\}\\\|\_\{2\}=0\\\\ \\end\{matrix\}\\right\.\.\(15\)

The pullback geometry we advocate for in this work is defined by the composition of the formφ:=ψv∘χρ∘ϕ\\varphi:=\\psi\_\{v\}\\circ\\chi\_\{\\rho\}\\circ\\phi\. While the role ofρ\\rhoinχρ\\chi\_\{\\rho\}is perhaps somewhat anticipated, the introduction ofvvandψv\\psi\_\{v\}requires additional motivation\. Before presenting the main result showing that this geometry indeed yields high\-likelihood geodesics, it is instructive to compare pullback geodesics obtained with and withoutψv\\psi\_\{v\}\.[Figure2\(a\)](https://arxiv.org/html/2605.24113#S3.F2.sf1)shows that pullback geodesics withψv\\psi\_\{v\}contract more strongly towards the center; this is particularly beneficial for geodesics connecting to the◆\{\\color\[rgb\]\{0\.8941,0\.1019,0\.1098\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.8941,0\.1019,0\.1098\}\\boldsymbol\{\\blacklozenge\}\}\-archetype, which otherwise swing out considerably without this adaptation, as illustrated in[Figure2\(b\)](https://arxiv.org/html/2605.24113#S3.F2.sf2)\. More generally, reduced swinging in the figure corresponds to trajectories that pass through regions of even higher likelihood than the default geodesics\. Thus, rather than compromising our initial goal of obtaining high\-likelihood geodesics, the inclusion ofψv\\psi\_\{v\}is expected to enhance it, while yielding more stable geodesics\.

![Refer to caption](https://arxiv.org/html/2605.24113v1/results/river_cross/geodesics_Omega_curves.png)\(a\)Pullback geodesics underφ:=ψv∘χρ∘ϕ\\varphi:=\\psi\_\{v\}\\circ\\chi\_\{\\rho\}\\circ\\phi\.
![Refer to caption](https://arxiv.org/html/2605.24113v1/results/river_cross/geodesics_phi_chi_curves.png)\(b\)Pullback geodesics underφ:=χρ∘ϕ\\varphi:=\\chi\_\{\\rho\}\\circ\\phi\.

Figure 2:Using the default pullback geometry for a data distributionpϕ,ρp\_\{\\phi,\\rho\}can cause geodesics to swing out in an undesirable way, whereas a modified pullback geometry obtained by composing withψv\\psi\_\{v\}forv\(s\)=log⁡\(10s\+1\)v\(s\)=\\log\(10s\+1\)eliminates this behavior\.To make this intuition precise and move toward a general statement, we first require the following lemma, which essentially indicates when incorporatingψv\\psi\_\{v\}into the diffeomorphism does not destroy strong convexity\.

###### Lemma 1\.

Letψv:ℝd→ℝd\\psi\_\{v\}:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d\}be an invertible mapping of the form \([14](https://arxiv.org/html/2605.24113#S3.E14)\) generated by a concave strictly increasing functionv:ℝ\>0→ℝ\>0v:\\mathbb\{R\}\_\{\>0\}\\to\\mathbb\{R\}\_\{\>0\}withlims→0v\(s\)=0\\lim\_\{s\\to 0\}v\(s\)=0andlims→0v′\(s\)\>0\\lim\_\{s\\to 0\}v^\{\\prime\}\(s\)\>0\.

Then, for any vectors𝐱≠𝐲∈ℝd\\mathbf\{x\}\\neq\\mathbf\{y\}\\in\\mathbb\{R\}^\{d\}, the mapping

t↦‖ψv−1\(\(1−t\)ψv\(𝐱\)\+tψv\(𝐲\)\)‖22t\\mapsto\\\|\\psi\_\{v\}^\{\-1\}\(\(1\-t\)\\psi\_\{v\}\(\\mathbf\{x\}\)\+t\\psi\_\{v\}\(\\mathbf\{y\}\)\)\\\|\_\{2\}^\{2\}\(16\)is strongly convex\.

With all these components in place, we can now state the main result, which formalizes the behavior illustrated in[Figure3](https://arxiv.org/html/2605.24113#S3.F3): geodesics under a pullback geometry that is possibly modified by a diffeomorphismψv\\psi\_\{v\}with additional structure onvvindeed pass through high\-likelihood regions in general\.

###### Theorem 1\(high\-likelihood geodesics\)\.

Letϕ:ℝd→ℝd\\phi:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d\}be a smooth diffeomorphism with constant Jacobian determinant, letρ:𝕊d−1→ℝ\>0\\rho:\\mathbb\{S\}^\{d\-1\}\\to\\mathbb\{R\}\_\{\>0\}be a radial function, and letv:ℝ\>0→ℝ\>0v:\\mathbb\{R\}\_\{\>0\}\\to\\mathbb\{R\}\_\{\>0\}be a concave strictly increasing function withlims→0v\(s\)=0\\lim\_\{s\\to 0\}v\(s\)=0andlims→0v′\(s\)\>0\\lim\_\{s\\to 0\}v^\{\\prime\}\(s\)\>0\. In addition, letpϕ,ρp\_\{\\phi,\\rho\}be a probability density of the form \([10](https://arxiv.org/html/2605.24113#S3.E10)\) generated byϕ\\phiandρ\\rho, and letχρ:ℝd→ℝd\\chi\_\{\\rho\}:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d\}andψv:ℝd→ℝd\\psi\_\{v\}:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d\}be invertible mappings of the form \([12](https://arxiv.org/html/2605.24113#S3.E12)\) and \([14](https://arxiv.org/html/2605.24113#S3.E14)\) generated byρ\\rhoandvv, respectively\.

Then, for any vectors𝐱≠𝐲∈ℝd\\mathbf\{x\}\\neq\\mathbf\{y\}\\in\\mathbb\{R\}^\{d\}andφ:=ψv∘χρ∘ϕ\\varphi:=\\psi\_\{v\}\\circ\\chi\_\{\\rho\}\\circ\\phi, mapping

t↦−log⁡\(pϕ,ρ\(γ𝐱,𝐲φ\(t\)\)\),t∈\[0,1\]t\\mapsto\-\\log\\bigl\(p\_\{\\phi,\\rho\}\(\\gamma\_\{\\mathbf\{x\},\\mathbf\{y\}\}^\{\\varphi\}\(t\)\)\\bigr\),\\quad t\\in\[0,1\]\(17\)
is strongly convex\.

###### Proof sketch\.

We rewrite the mapping \([17](https://arxiv.org/html/2605.24113#S3.E17)\) to the form

t↦12‖ψv−1\(\(1−t\)ψv\(𝐱′\)\+tψv\(𝐲′\)\)‖22\+const,t\\mapsto\\tfrac\{1\}\{2\}\\bigl\\\|\\psi\_\{v\}^\{\-1\}\\bigl\(\(1\-t\)\\,\\psi\_\{v\}\(\\mathbf\{x\}^\{\\prime\}\)\+t\\,\\psi\_\{v\}\(\\mathbf\{y\}^\{\\prime\}\)\\bigr\)\\bigr\\\|\_\{2\}^\{2\}\+\\text\{const\},for suitable𝐱′,𝐲′\\mathbf\{x\}^\{\\prime\},\\mathbf\{y\}^\{\\prime\}, and then invoke[Lemma1](https://arxiv.org/html/2605.24113#Thmlemma1), which shows that this squared norm is strongly convex intton\[0,1\]\[0,1\]\. ∎

### 3\.3Iso\-Riemannian geometry and interpretability

Before considering the pullback construction complete, it is important to note that, although the geodesics have the desired shape, their speed is not constant in the Euclidean \(ℓ2\\ell^\{2\}\) sense, mostly due to the diffeomorphismsχρ\\chi\_\{\\rho\}andψv\\psi\_\{v\}having non\-constant Jacobian determinants\. This leads to misleading interpolations – an effect predicted by theory\([diepeveen2024pulling,](https://arxiv.org/html/2605.24113#bib.bib18), Thm\. 3\.4\)and clearly visible in[Figures7\(a\)](https://arxiv.org/html/2605.24113#S5.F7.sf1)and[7\(b\)](https://arxiv.org/html/2605.24113#S5.F7.sf2), where time\-equidistant points along the geodesics can cluster towards the center of the distribution, even more so after introducingψv\\psi\_\{v\}\. As a result, one obtains the erroneous impression that most of the data between the endpoints looks very similar\.

![Refer to caption](https://arxiv.org/html/2605.24113v1/results/river_cross/geodesics_Omega.png)\(a\)Pullback geodesics underφ:=ψv∘χρ∘ϕ\\varphi:=\\psi\_\{v\}\\circ\\chi\_\{\\rho\}\\circ\\phi\.
![Refer to caption](https://arxiv.org/html/2605.24113v1/results/river_cross/geodesics_phi_chi.png)\(b\)Pullback geodesics underφ:=χρ∘ϕ\\varphi:=\\chi\_\{\\rho\}\\circ\\phi\.
![Refer to caption](https://arxiv.org/html/2605.24113v1/results/river_cross/geodesics_Omega_iso.png)\(c\)Iso\-pullback geodesics underφ:=ψv∘χρ∘ϕ\\varphi:=\\psi\_\{v\}\\circ\\chi\_\{\\rho\}\\circ\\phi\.
![Refer to caption](https://arxiv.org/html/2605.24113v1/results/river_cross/geodesics_phi_chi_iso.png)\(d\)Iso\-pullback geodesics underφ:=χρ∘ϕ\\varphi:=\\chi\_\{\\rho\}\\circ\\phi\.

Figure 3:Using either the default pullback geometry or its modified variant for a data distributionpϕ,ρp\_\{\\phi,\\rho\}produces geodesics that can spend most of their time near the center of the deformed star \(top\), whereas their iso\-Riemannian counterparts remove this effect \(bottom\)\.This issue was anticipated by[diepeveen2025manifold](https://arxiv.org/html/2605.24113#bib.bib22), in which the authors proposed iso\-Riemannian geometry as a remedy\. For our purposes, it suffices to recall that iso\-Riemannian geometry constructs the fundamental manifold mappings using the iso\-connection rather than the Levi\-Civita connection\. In practice, this implies that geodesics retain the same shape as those induced by the Levi\-Civita connection but are reparametrized to have constantℓ2\\ell^\{2\}\-speed, as illustrated in[Figures3\(c\)](https://arxiv.org/html/2605.24113#S3.F3.sf3)and[3\(d\)](https://arxiv.org/html/2605.24113#S3.F3.sf4)\. In particular, writingγ𝐱,𝐲φ,iso:\[0,1\]→ℝd\\gamma^\{\\varphi,\\operatorname\{iso\}\}\_\{\\mathbf\{x\},\\mathbf\{y\}\}:\[0,1\]\\to\\mathbb\{R\}^\{d\}for iso\-geodesics under\(ℝd,\(⋅,⋅\)φ\)\(\\mathbb\{R\}^\{d\},\(\\cdot,\\cdot\)^\{\\varphi\}\), we have

γ𝐱,𝐲φ,iso\(t\)=γ𝐱,𝐲φ\(τ𝐱,𝐲\(t\)\),t∈\[0,1\],\\gamma^\{\\varphi,\\operatorname\{iso\}\}\_\{\\mathbf\{x\},\\mathbf\{y\}\}\(t\)=\\gamma^\{\\varphi\}\_\{\\mathbf\{x\},\\mathbf\{y\}\}\(\\tau\_\{\\mathbf\{x\},\\mathbf\{y\}\}\(t\)\),\\quad t\\in\[0,1\],\(18\)whereγ𝐱,𝐲φ\\gamma^\{\\varphi\}\_\{\\mathbf\{x\},\\mathbf\{y\}\}is the standard pullback geodesic between𝐱\\mathbf\{x\}and𝐲\\mathbf\{y\}induced by\(ℝd,\(⋅,⋅\)φ\)\(\\mathbb\{R\}^\{d\},\(\\cdot,\\cdot\)^\{\\varphi\}\), and where the reparametrization mappingτ𝐱,𝐲:\[0,1\]→\[0,1\]\\tau\_\{\\mathbf\{x\},\\mathbf\{y\}\}:\[0,1\]\\to\[0,1\]is a diffeomorphism specified via its inverse

τ𝐱,𝐲−1\(t′\):=∫0t′‖γ˙𝐱,𝐲φ\(s\)‖2𝑑s∫01‖γ˙𝐱,𝐲φ\(s\)‖2𝑑s\.\\tau\_\{\\mathbf\{x\},\\mathbf\{y\}\}^\{\-1\}\(t^\{\\prime\}\):=\\frac\{\\displaystyle\\int\_\{0\}^\{t^\{\\prime\}\}\\bigl\\\|\\dot\{\\gamma\}^\{\\varphi\}\_\{\\mathbf\{x\},\\mathbf\{y\}\}\(s\)\\bigr\\\|\_\{2\}\\,ds\}\{\\displaystyle\\int\_\{0\}^\{1\}\\bigl\\\|\\dot\{\\gamma\}^\{\\varphi\}\_\{\\mathbf\{x\},\\mathbf\{y\}\}\(s\)\\bigr\\\|\_\{2\}\\,ds\}\.\(19\)

## 4Riemannian archetypal projections for deformed star distributions

Now that we have an appropriate statistical and geometric framework for a Riemannian version of archetypal analysis \(AA\) – a deformed star distribution together with its associated pullback geometry – we turn to formulating archetypal analysis itself in this setting\. Classical AA can be viewed as comprising two components: learning the archetypes and using these archetypes to project and classify the entire data set; in our framework, it will be more natural to tackle the second part first, since the archetypes and the underlying distribution will ultimately be learned jointly in the next section\. Accordingly, we assume the data are already endowed with a star\-shaped statistical model and a fixed collection of archetypes situated near the “corners” of the deformed star\. In what follows, we introduce the Riemannian archetypal mapping \(RAM\) as the projection of data points onto the manifold defined by geodesic convex combinations of the archetypes\. To address the nonconvexity of the RAM problem, we propose a convex relaxation that provides a strong initialization; solving the resulting RAM then yields a “denoised” approximation on this manifold\. Finally, to extract an interpretable weight vector encoding the corresponding archetypal mixture, we discuss the necessity of transitioning to the iso\-Riemannian geometry induced by the pullback metric and propose a simple way of doing so\. As before, we defer all proofs to[AppendixB](https://arxiv.org/html/2605.24113#A2)\.

### 4\.1The Riemannian archetypal mapping

Before turning to classification, we first seek to project data onto the set of convex combinations of the archetypes\. To this end, we define the*Riemannian archetypal mapping*\(RAM\) generated by archetypes𝐯\(1\),…,𝐯\(r\)∈ℝd\\mathbf\{v\}^\{\(1\)\},\\ldots,\\mathbf\{v\}^\{\(r\)\}\\in\\mathbb\{R\}^\{d\}and pullback structure\(ℝd,\(⋅,⋅\)φ\)\(\\mathbb\{R\}^\{d\},\(\\cdot,\\cdot\)^\{\\varphi\}\)– in practice arising from a diffeomorphism of the formφ=ψv∘χρ∘ϕ\\varphi=\\psi\_\{v\}\\circ\\chi\_\{\\rho\}\\circ\\phi– as the mappingTφ:ℝd→ℝdT^\{\\varphi\}:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d\}given by

Tφ\(𝐱\):=argmin𝐲∈ℳ¯φ‖𝐱−𝐲‖22,T^\{\\varphi\}\(\\mathbf\{x\}\):=\\operatorname\*\{argmin\}\_\{\\mathbf\{y\}\\in\\bar\{\\mathcal\{M\}\}^\{\\varphi\}\}\\\|\\mathbf\{x\}\-\\mathbf\{y\}\\\|\_\{2\}^\{2\},\(20\)where

ℳ¯φ:=\{𝐱∈ℝd∣𝐱=argmin𝐲∈ℝd∑j=1r𝐠jdℝdφ\(𝐲,𝐯\(j\)\)2for some𝐠∈Δr\},\\bar\{\\mathcal\{M\}\}^\{\\varphi\}:=\\bigl\\\{\\mathbf\{x\}\\in\\mathbb\{R\}^\{d\}\\;\\mid\\;\\mathbf\{x\}=\\operatorname\*\{argmin\}\_\{\\mathbf\{y\}\\in\\mathbb\{R\}^\{d\}\}\\sum\_\{j=1\}^\{r\}\\mathbf\{g\}\_\{j\}d^\{\\varphi\}\_\{\\mathbb\{R\}^\{d\}\}\(\\mathbf\{y\},\\mathbf\{v\}^\{\(j\)\}\)^\{2\}\\text\{ for some \}\\mathbf\{g\}\\in\\Delta^\{r\}\\bigr\\\},\(21\)and whereΔr⊂ℝr\\Delta^\{r\}\\subset\\mathbb\{R\}^\{r\}is the unit simplex inℝr\\mathbb\{R\}^\{r\}\.

The RAM can thus be viewed as theℓ2\\ell^\{2\}\-projection onto the constraint setℳ¯φ\\bar\{\\mathcal\{M\}\}^\{\\varphi\}, consisting of all weighted Riemannian barycentres \(with respect to\(ℝd,\(⋅,⋅\)φ\)\(\\mathbb\{R\}^\{d\},\(\\cdot,\\cdot\)^\{\\varphi\}\)\) of the archetypes\. This choice is natural: it reduces to the classical interpretation of linear AA in the Euclidean setting, while in our more general framework every geodesic segment between a pair of archetypes is contained inℳ¯φ\\bar\{\\mathcal\{M\}\}^\{\\varphi\}\. In addition, for points that already live in the constraint set, the RAM reduces to the identity mapping, i\.e\.,Tφ\(𝐱\)=𝐱T^\{\\varphi\}\(\\mathbf\{x\}\)=\\mathbf\{x\}for any𝐱∈ℳ¯φ\\mathbf\{x\}\\in\\bar\{\\mathcal\{M\}\}^\{\\varphi\}\.

However, in its present formulation, computing the RAM for general inputs appears intractable\. The first step toward making \([20](https://arxiv.org/html/2605.24113#S4.E20)\) more manageable is to simplify the constraint set, which turns out to be geodesically convex submanifold with corners, i\.e\., an embedded polytope\-like set shaped byφ\\varphi, and admits a more convenient representation\.

###### Theorem 2\(Properties ofℳ¯φ\\bar\{\\mathcal\{M\}\}^\{\\varphi\}\)\.

Letφ:ℝd→ℝd\\varphi:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d\}be a smooth diffeomorphism and let𝐯\(1\),…,𝐯\(r\)∈ℝd\\mathbf\{v\}^\{\(1\)\},\\ldots,\\mathbf\{v\}^\{\(r\)\}\\in\\mathbb\{R\}^\{d\}be anyrrvectors\.

Then, the setℳ¯φ\\bar\{\\mathcal\{M\}\}^\{\\varphi\}defined in \([21](https://arxiv.org/html/2605.24113#S4.E21)\)

1. \(i\)satisfies ℳ¯φ=\{𝐱∈ℝd∣𝐱=φ−1\(∑j=1r𝐠jφ\(𝐯\(j\)\)\)for some𝐠∈Δr\}\.\\bar\{\\mathcal\{M\}\}^\{\\varphi\}=\\\{\\mathbf\{x\}\\in\\mathbb\{R\}^\{d\}\\;\\mid\\;\\mathbf\{x\}=\\varphi^\{\-1\}\\Bigl\(\\sum\_\{j=1\}^\{r\}\\mathbf\{g\}\_\{j\}\\varphi\(\\mathbf\{v\}^\{\(j\)\}\)\\Bigr\)\\text\{ for some \}\\mathbf\{g\}\\in\\Delta^\{r\}\\\}\.\(22\)
2. \(ii\)is a strongly geodesically convex set of\(ℝd,\(⋅,⋅\)φ\)\(\\mathbb\{R\}^\{d\},\(\\cdot,\\cdot\)^\{\\varphi\}\)\.
3. \(iii\)is a smooth manifold with corners, whose interior has dimension dim\(int⁡\(ℳ¯φ\)\)=rank⁡\(\[φ\(𝐯\(1\)\)−φ\(𝐯\(r\)\),…,φ\(𝐯\(r−1\)\)−φ\(𝐯\(r\)\)\]\)≤r−1\.\\dim\(\\operatorname\{int\}\(\\bar\{\\mathcal\{M\}\}^\{\\varphi\}\)\)=\\operatorname\{rank\}\(\[\\varphi\(\\mathbf\{v\}^\{\(1\)\}\)\-\\varphi\(\\mathbf\{v\}^\{\(r\)\}\),\\ldots,\\varphi\(\\mathbf\{v\}^\{\(r\-1\)\}\)\-\\varphi\(\\mathbf\{v\}^\{\(r\)\}\)\]\)\\leq r\-1\.\(23\)

###### Proof sketch\.

For \(i\), each term𝐱↦dℝdφ\(𝐱,𝐲\)2\\mathbf\{x\}\\mapsto d\_\{\\mathbb\{R\}^\{d\}\}^\{\\varphi\}\(\\mathbf\{x\},\\mathbf\{y\}\)^\{2\}is strongly geodesically convex since it is just‖φ\(𝐱\)−φ\(𝐲\)‖22\\\|\\varphi\(\\mathbf\{x\}\)\-\\varphi\(\\mathbf\{y\}\)\\\|\_\{2\}^\{2\}in the embedding, so the weighted sum has a unique minimizer, and checking the first\-order optimality conditions shows that this minimizer is exactlyφ−1\(∑j𝐠jφ\(𝐯\(j\)\)\)\\varphi^\{\-1\}\\bigl\(\\sum\_\{j\}\\mathbf\{g\}\_\{j\}\\varphi\(\\mathbf\{v\}^\{\(j\)\}\)\\bigr\)for any𝐠∈Δr\\mathbf\{g\}\\in\\Delta\_\{r\}\.

For \(ii\), using the representation from \(i\) we write𝐱,𝐲\\mathbf\{x\},\\mathbf\{y\}as weighted barycenters in theφ\\varphi\-embedding and interpolate their weights linearly intt; the induced curve is both a geodesic by the explicit formulaγ𝐱,𝐲φ\(t\)=φ−1\(\(1−t\)φ\(𝐱\)\+tφ\(𝐲\)\)\\gamma^\{\\varphi\}\_\{\\mathbf\{x\},\\mathbf\{y\}\}\(t\)=\\varphi^\{\-1\}\(\(1\-t\)\\varphi\(\\mathbf\{x\}\)\+t\\varphi\(\\mathbf\{y\}\)\)and remains inℳ¯φ\\bar\{\\mathcal\{M\}\}^\{\\varphi\}, which shows geodesic convexity\.

For \(iii\), the mapφ′:ℳ¯φ→conv⁡\{φ\(𝐯\(1\)\),…,φ\(𝐯\(r\)\)\}\\varphi^\{\\prime\}\\colon\\bar\{\\mathcal\{M\}\}^\{\\varphi\}\\to\\operatorname\{conv\}\\\{\\varphi\(\\mathbf\{v\}^\{\(1\)\}\),\\ldots,\\varphi\(\\mathbf\{v\}^\{\(r\)\}\)\\\},φ′\(𝐱\)=φ\(𝐱\)\\varphi^\{\\prime\}\(\\mathbf\{x\}\)=\\varphi\(\\mathbf\{x\}\), is a diffeomorphism, soℳ¯φ\\bar\{\\mathcal\{M\}\}^\{\\varphi\}is a smooth manifold with corners whose interior dimension equals that of the convex polytope, namely the rank of the difference matrix of the embedded archetypes, which is at mostr−1r\-1\.

∎

Especially the first property in[Theorem2](https://arxiv.org/html/2605.24113#Thmtheorem2)is very useful in terms of evaluating the RAM\. That is, it allows us to rephrase the optimization problem \([20](https://arxiv.org/html/2605.24113#S4.E20)\) as a problem over a simplex\.

###### Theorem 3\(Equivalent formulation ofTφT^\{\\varphi\}\)\.

Letφ:ℝd→ℝd\\varphi:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d\}be a smooth diffeomorphism and let𝐯\(1\),…,𝐯\(r\)∈ℝd\\mathbf\{v\}^\{\(1\)\},\\ldots,\\mathbf\{v\}^\{\(r\)\}\\in\\mathbb\{R\}^\{d\}be anyrrvectors\.

Then, the Riemannian archetypal mappingTφ:ℝd→ℝdT^\{\\varphi\}:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d\}generated by𝐯\(1\),…,𝐯\(r\)\\mathbf\{v\}^\{\(1\)\},\\ldots,\\mathbf\{v\}^\{\(r\)\}and\(ℝd,\(⋅,⋅\)φ\)\(\\mathbb\{R\}^\{d\},\(\\cdot,\\cdot\)^\{\\varphi\}\)satisfies

Tφ\(𝐱\)=φ−1\(∑j=1r𝐠j∗φ\(𝐯\(j\)\)\),where𝐠∗∈argmin𝐠∈Δr‖𝐱−φ−1\(∑j=1r𝐠jφ\(𝐯\(j\)\)\)‖22\.T^\{\\varphi\}\(\\mathbf\{x\}\)=\\varphi^\{\-1\}\\Bigl\(\\sum\_\{j=1\}^\{r\}\\mathbf\{g\}\_\{j\}^\{\*\}\\varphi\(\\mathbf\{v\}^\{\(j\)\}\)\\Bigr\),\\quad\\text\{where \}\\mathbf\{g\}^\{\*\}\\in\\operatorname\*\{argmin\}\_\{\\mathbf\{g\}\\in\\Delta^\{r\}\}\\\|\\mathbf\{x\}\-\\varphi^\{\-1\}\\Bigl\(\\sum\_\{j=1\}^\{r\}\\mathbf\{g\}\_\{j\}\\varphi\(\\mathbf\{v\}^\{\(j\)\}\)\\Bigr\)\\\|\_\{2\}^\{2\}\.\(24\)

###### Proof\.

The claim follows directly from \(i\) in[Theorem2](https://arxiv.org/html/2605.24113#Thmtheorem2)\. ∎

At this point, it is important to note that the optimization problem \([24](https://arxiv.org/html/2605.24113#S4.E24)\) is generally non\-convex, so computing the RAM requires some care\. Before turning to algorithmic aspects, it is helpful to briefly relate this mapping to the specific setting of interest, namely the deformed star distributionspϕ,ρp\_\{\\phi,\\rho\}under the choiceφ:=ψv∘χρ∘ϕ\\varphi:=\\psi\_\{v\}\\circ\\chi\_\{\\rho\}\\circ\\phi\. Beyond providing a convenient reformulation of the RAM, this perspective also allows us to establish its statistical properties with respect topϕ,ρp\_\{\\phi,\\rho\}\. In particular, any RAM\-projected vector can be shown to be at least as likely as the corresponding weighted combination of the archetype likelihoods\.

###### Corollary 1\.

Letϕ:ℝd→ℝd\\phi:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d\}be a smooth diffeomorphism with constant Jacobian determinant, letρ:𝕊d−1→ℝ\>0\\rho:\\mathbb\{S\}^\{d\-1\}\\to\\mathbb\{R\}\_\{\>0\}be a radial function, and letv:ℝ\>0→ℝ\>0v:\\mathbb\{R\}\_\{\>0\}\\to\\mathbb\{R\}\_\{\>0\}be a concave strictly increasing function withlims→0v\(s\)=0\\lim\_\{s\\to 0\}v\(s\)=0andlims→0v′\(s\)\>0\\lim\_\{s\\to 0\}v^\{\\prime\}\(s\)\>0\. In addition, letpϕ,ρp\_\{\\phi,\\rho\}be a probability density of the form \([10](https://arxiv.org/html/2605.24113#S3.E10)\) generated byϕ\\phiandρ\\rho, and letχρ:ℝd→ℝd\\chi\_\{\\rho\}:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d\}andψv:ℝd→ℝd\\psi\_\{v\}:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d\}be invertible mappings of the form \([12](https://arxiv.org/html/2605.24113#S3.E12)\) and \([14](https://arxiv.org/html/2605.24113#S3.E14)\) generated byρ\\rhoandvv, respectively\.

Then, for any𝐱∈ℝd\\mathbf\{x\}\\in\\mathbb\{R\}^\{d\}andφ:=ψv∘χρ∘ϕ\\varphi:=\\psi\_\{v\}\\circ\\chi\_\{\\rho\}\\circ\\phi, the Riemannian archetypal mapping satisfies

pϕ,ρ\(Tφ\(𝐱\)\)≥exp⁡\(∑j=1r𝐠j∗log⁡\(pϕ,ρ\(𝐯j\)\)\),for any𝐠∗∈argmin𝐠∈Δr‖𝐱−φ−1\(∑j=1r𝐠jφ\(𝐯\(j\)\)\)‖22\.p\_\{\\phi,\\rho\}\(T^\{\\varphi\}\(\\mathbf\{x\}\)\)\\geq\\exp\\Bigl\(\\sum\_\{j=1\}^\{r\}\\mathbf\{g\}\_\{j\}^\{\*\}\\log\\bigl\(p\_\{\\phi,\\rho\}\(\\mathbf\{v\}^\{j\}\)\\bigr\)\\Bigr\),\\quad\\text\{for any \}\\mathbf\{g\}^\{\*\}\\in\\operatorname\*\{argmin\}\_\{\\mathbf\{g\}\\in\\Delta^\{r\}\}\\\|\\mathbf\{x\}\-\\varphi^\{\-1\}\\Bigl\(\\sum\_\{j=1\}^\{r\}\\mathbf\{g\}\_\{j\}\\varphi\(\\mathbf\{v\}^\{\(j\)\}\)\\Bigr\)\\\|\_\{2\}^\{2\}\.\(25\)

### 4\.2Solving the RAM problem

The optimization problem \([24](https://arxiv.org/html/2605.24113#S4.E24)\) in[Theorem3](https://arxiv.org/html/2605.24113#Thmtheorem3)remains non\-convex, so a good initialization strategy is essential\. In analogy with the relaxation used for Riemannian autoencoders \(RAEs\)[diepeveen2024pulling](https://arxiv.org/html/2605.24113#bib.bib18), we therefore introduce a relaxed variant\. Specifically, we define the relaxed Riemannian archetypal mapping \(relaxed RAM\) generated by archetypes𝐯\(1\),…,𝐯\(r\)∈ℝd\\mathbf\{v\}^\{\(1\)\},\\ldots,\\mathbf\{v\}^\{\(r\)\}\\in\\mathbb\{R\}^\{d\}and pullback structure\(ℝd,\(⋅,⋅\)φ\)\(\\mathbb\{R\}^\{d\},\(\\cdot,\\cdot\)^\{\\varphi\}\)as the mappingSφ:ℝd→ℝdS^\{\\varphi\}:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d\}given by

Sφ\(𝐱\):=argmin𝐲∈ℳ¯φdℝdφ\(𝐱,𝐲\)2\.S^\{\\varphi\}\(\\mathbf\{x\}\):=\\operatorname\*\{argmin\}\_\{\\mathbf\{y\}\\in\\bar\{\\mathcal\{M\}\}^\{\\varphi\}\}d\_\{\\mathbb\{R\}^\{d\}\}^\{\\varphi\}\(\\mathbf\{x\},\\mathbf\{y\}\)^\{2\}\.\(26\)The relaxed RAM can be interpreted as the Riemannian metric projection onto the constraint setℳ¯φ\\bar\{\\mathcal\{M\}\}^\{\\varphi\}\. For diffeomorphisms with non\-constant Jacobian determinant – again due to the presence ofχρ\\chi\_\{\\rho\}andψv\\psi\_\{v\}in a diffeomorphismφ=ψv∘χρ∘ϕ\\varphi=\\psi\_\{v\}\\circ\\chi\_\{\\rho\}\\circ\\phiwe would use in practice – this generally yields substantially different minimizers than the \(original\) RAM\. Nevertheless, for points already lying in the constraint set the relaxed RAM still reduces to the identity, i\.e\.,Sφ\(𝐱\)=𝐱S^\{\\varphi\}\(\\mathbf\{x\}\)=\\mathbf\{x\}for all𝐱∈ℳ¯φ\\mathbf\{x\}\\in\\bar\{\\mathcal\{M\}\}^\{\\varphi\}\.

As before,[Theorem2](https://arxiv.org/html/2605.24113#Thmtheorem2)provides a practical way to evaluate the relaxed RAM, with the key difference that the reformulated optimization problem is now convex\.

###### Theorem 4\(Equivalent formulation ofSφS^\{\\varphi\}\)\.

Letφ:ℝd→ℝd\\varphi:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d\}be a smooth diffeomorphism and let𝐯\(1\),…,𝐯\(r\)∈ℝd\\mathbf\{v\}^\{\(1\)\},\\ldots,\\mathbf\{v\}^\{\(r\)\}\\in\\mathbb\{R\}^\{d\}be anyrrvectors\.

Then, the relaxed Riemannian archetypal mappingSφ:ℝd→ℝdS^\{\\varphi\}:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d\}generated by𝐯\(1\),…,𝐯\(r\)\\mathbf\{v\}^\{\(1\)\},\\ldots,\\mathbf\{v\}^\{\(r\)\}and\(ℝd,\(⋅,⋅\)φ\)\(\\mathbb\{R\}^\{d\},\(\\cdot,\\cdot\)^\{\\varphi\}\)satisfies

Sφ\(𝐱\)=φ−1\(∑j=1r𝐠j∗φ\(𝐯\(j\)\)\),where𝐠∗∈argmin𝐠∈Δr‖φ\(𝐱\)−∑j=1r𝐠jφ\(𝐯\(j\)\)‖22,S^\{\\varphi\}\(\\mathbf\{x\}\)=\\varphi^\{\-1\}\\Bigl\(\\sum\_\{j=1\}^\{r\}\\mathbf\{g\}\_\{j\}^\{\*\}\\varphi\(\\mathbf\{v\}^\{\(j\)\}\)\\Bigr\),\\quad\\text\{where \}\\mathbf\{g\}^\{\*\}\\in\\operatorname\*\{argmin\}\_\{\\mathbf\{g\}\\in\\Delta^\{r\}\}\\\|\\varphi\(\\mathbf\{x\}\)\-\\sum\_\{j=1\}^\{r\}\\mathbf\{g\}\_\{j\}\\varphi\(\\mathbf\{v\}^\{\(j\)\}\)\\\|\_\{2\}^\{2\},\(27\)

###### Proof\.

The claim follows directly from \(i\) in[Theorem2](https://arxiv.org/html/2605.24113#Thmtheorem2)\. ∎

The optimization problem \([27](https://arxiv.org/html/2605.24113#S4.E27)\) can be solved efficiently with proximal gradient descent[parikh2014proximal](https://arxiv.org/html/2605.24113#bib.bib54), which yields the update scheme

𝐠rel\(ℓ\+1\):=ΠΔr\(𝐠rel\(ℓ\)−αφ\(𝐕\)⊤\(φ\(𝐕\)𝐠rel\(ℓ\)−φ\(𝐱\)\)\),withφ\(𝐕\):=\[φ\(𝐯\(1\)\),…,φ\(𝐯\(r\)\)\]∈ℝd×r,\\mathbf\{g\}\_\{\\operatorname\{rel\}\}^\{\(\\ell\+1\)\}:=\\Pi\_\{\\Delta^\{r\}\}\\Bigl\(\\mathbf\{g\}\_\{\\operatorname\{rel\}\}^\{\(\\ell\)\}\-\\alpha\\,\\varphi\(\\mathbf\{V\}\)^\{\\top\}\\bigl\(\\varphi\(\\mathbf\{V\}\)\\mathbf\{g\}\_\{\\operatorname\{rel\}\}^\{\(\\ell\)\}\-\\varphi\(\\mathbf\{x\}\)\\bigr\)\\Bigr\),\\quad\\text\{with \}\\varphi\(\\mathbf\{V\}\):=\[\\varphi\(\\mathbf\{v\}^\{\(1\)\}\),\\ldots,\\varphi\(\\mathbf\{v\}^\{\(r\)\}\)\]\\in\\mathbb\{R\}^\{d\\times r\},\(28\)whereΠΔr:ℝr→Δr\\Pi\_\{\\Delta^\{r\}\}:\\mathbb\{R\}^\{r\}\\to\\Delta^\{r\}is theℓ2\\ell^\{2\}\-orthogonal projection onto the simplex and the step size

α:=1‖φ\(𝐕\)⊤φ\(𝐕\)‖=1λmax\(φ\(𝐕\)⊤φ\(𝐕\)\)\\alpha:=\\frac\{1\}\{\\\|\\varphi\(\\mathbf\{V\}\)^\{\\top\}\\varphi\(\\mathbf\{V\}\)\\\|\}=\\frac\{1\}\{\\lambda\_\{\\max\}\(\\varphi\(\\mathbf\{V\}\)^\{\\top\}\\varphi\(\\mathbf\{V\}\)\)\}\(29\)is used to ensure convergence\. The scheme is initialized at𝐠rel\(0\):=1r𝟏r:=\[1r,…,1r\]⊤∈ℝr\\mathbf\{g\}\_\{\\operatorname\{rel\}\}^\{\(0\)\}:=\\frac\{1\}\{r\}\\mathbf\{1\}\_\{r\}:=\[\\frac\{1\}\{r\},\\ldots,\\frac\{1\}\{r\}\]^\{\\top\}\\in\\mathbb\{R\}^\{r\}\.

Upon convergence of proximal gradient descent for the relaxed RAM problem, the resulting weights𝐠rel∗∈ℝr\\mathbf\{g\}\_\{\\operatorname\{rel\}\}^\{\*\}\\in\\mathbb\{R\}^\{r\}serve as initialization for proximal gradient descent on the RAM objective111In practice, using𝐠rel\(0\)\\mathbf\{g\}\_\{\\operatorname\{rel\}\}^\{\(0\)\}as initialization can be beneficial if it yields a lower RAM objective value\., leading to the update scheme

𝐠\(ℓ\+1\):=ΠΔr\(𝐠\(ℓ\)−αℓφ\(𝐕\)⊤\(Dφ\(𝐕\)𝐠\(ℓ\)φ−1\)⊤\(φ−1\(φ\(𝐕\)𝐠\(ℓ\)\)−𝐱\)\),\\mathbf\{g\}^\{\(\\ell\+1\)\}:=\\Pi\_\{\\Delta^\{r\}\}\\Bigl\(\\mathbf\{g\}^\{\(\\ell\)\}\-\\alpha\_\{\\ell\}\\,\\varphi\(\\mathbf\{V\}\)^\{\\top\}\(D\_\{\\varphi\(\\mathbf\{V\}\)\\mathbf\{g\}^\{\(\\ell\)\}\}\\varphi^\{\-1\}\)^\{\\top\}\\bigl\(\\varphi^\{\-1\}\(\\varphi\(\\mathbf\{V\}\)\\mathbf\{g\}^\{\(\\ell\)\}\)\-\\mathbf\{x\}\\bigr\)\\Bigr\),\(30\)whereαℓ\\alpha\_\{\\ell\}is determined via line search to promote convergence\.

![Refer to caption](https://arxiv.org/html/2605.24113v1/results/river_cross/samples_ram_projected_relaxed_Omega.png)\(a\)RAM projections \(orange\) or deformed star data \(blue\)
![Refer to caption](https://arxiv.org/html/2605.24113v1/results/river_cross/errors_ram_Omega.png)\(b\)Squared distance of RAM and relaxed RAM to original data

Figure 4:The Riemannian archetypal mapping \(RAM\) – initialized from relaxed RAM weights – projects data points from a deformed star distribution onto the data manifold generated by four archetypes \(left\)\. RAM projections achieve substantially smaller distances to the original data compared to relaxed RAM projections, as evidenced by the reduction in reconstruction error \(right\)\.For pullback geometries induced by deformed star distributions, the RAM optimization scheme \([30](https://arxiv.org/html/2605.24113#S4.E30)\) produces substantial improvements over the relaxed RAM scheme \([28](https://arxiv.org/html/2605.24113#S4.E28)\)\.[Figure4](https://arxiv.org/html/2605.24113#S4.F4)illustrates this effect: RAM projections achieve significantly lower reconstruction errors by properly accounting for the pullback metric structure, whereas relaxed RAM incurs systematic projection errors due to neglecting the distance distortion introduced byχρ\\chi\_\{\\rho\}andψv\\psi\_\{v\}\.

### 4\.3Iso\-Riemannian geometry and classification

Given the strong empirical performance of RAM on the deformed star data in[Figure4](https://arxiv.org/html/2605.24113#S4.F4), one might be inclined to believe that the converged weights𝐠∗∈ℝr\\mathbf\{g\}^\{\*\}\\in\\mathbb\{R\}^\{r\}from proximal gradient descent yield a trustworthy classification of a projected point, and that the magnitude of each entry in𝐠∗\\mathbf\{g\}^\{\*\}is directly interpretable\.

However, such an interpretation is premature: due toχρ\\chi\_\{\\rho\}andψv\\psi\_\{v\}not having a constant Jacobian determinant, the pullback metric generally distorts distances, which affects the geometry experienced by the logarithmic map and hence the meaning of the weights\. To make this precise, observe first that

∑j=1r𝐠j∗logTφ\(𝐱\)φ⁡\(𝐯\(j\)\)=\([5](https://arxiv.org/html/2605.24113#S2.E5)\)∑j=1r𝐠j∗Dφ\(Tφ\(𝐱\)\)φ−1\[φ\(𝐯\(j\)\)−φ\(Tφ\(𝐱\)\)\]=Dφ\(Tφ\(𝐱\)\)φ−1\[∑j=1r𝐠j∗φ\(𝐯\(j\)\)−φ\(Tφ\(𝐱\)\)\]=[Theorem3](https://arxiv.org/html/2605.24113#Thmtheorem3)Dφ\(Tφ\(𝐱\)\)φ−1\[φ\(Tφ\(𝐱\)\)−φ\(Tφ\(𝐱\)\)\]=𝟎\.\\sum\_\{j=1\}^\{r\}\\mathbf\{g\}^\{\*\}\_\{j\}\\log^\{\\varphi\}\_\{T^\{\\varphi\}\(\\mathbf\{x\}\)\}\(\\mathbf\{v\}^\{\(j\)\}\)\\overset\{\\eqref\{eq:thm\-log\-remetrized\}\}\{=\}\\sum\_\{j=1\}^\{r\}\\mathbf\{g\}^\{\*\}\_\{j\}D\_\{\\varphi\(T^\{\\varphi\}\(\\mathbf\{x\}\)\)\}\\varphi^\{\-1\}\[\\varphi\(\\mathbf\{v\}^\{\(j\)\}\)\-\\varphi\(T^\{\\varphi\}\(\\mathbf\{x\}\)\)\]\\\\ =D\_\{\\varphi\(T^\{\\varphi\}\(\\mathbf\{x\}\)\)\}\\varphi^\{\-1\}\\Bigl\[\\sum\_\{j=1\}^\{r\}\\mathbf\{g\}^\{\*\}\_\{j\}\\varphi\(\\mathbf\{v\}^\{\(j\)\}\)\-\\varphi\(T^\{\\varphi\}\(\\mathbf\{x\}\)\)\\Bigr\]\\\\ \\overset\{\\text\{\\lx@cref\{creftypecap~refnum\}\{thm:ram\-mapping\-equivalence\}\}\}\{=\}D\_\{\\varphi\(T^\{\\varphi\}\(\\mathbf\{x\}\)\)\}\\varphi^\{\-1\}\\Bigl\[\\varphi\(T^\{\\varphi\}\(\\mathbf\{x\}\)\)\-\\varphi\(T^\{\\varphi\}\(\\mathbf\{x\}\)\)\\Bigr\]=\\mathbf\{0\}\.\(31\)Thus, the weights determine an affine combination of the tangent vectors aimed at the archetypes such that their weighted sum is zero\. Conceptually, ifTφ\(𝐱\)T^\{\\varphi\}\(\\mathbf\{x\}\)is close to a given archetype, the associated logarithmic vector is short relative to those pointing toward more distant archetypes\. To compensate for this disparity, the optimal solution assigns a larger weight to the nearby archetype and smaller weights to archetypes that are farther away\.

The subtlety is that “close”, “far”, “short” and “long” here are measured in the pullback geometry, not in the ambientℓ2\\ell^\{2\}\-sense\. Becauseφ\\varphiis not an isometry, neither the distances nor the lengths of these tangent vectors reflect Euclidean arc length along the underlying geodesics\. Consequently, a naiveℓ2\\ell^\{2\}\-based reading of𝐠∗\\mathbf\{g\}^\{\*\}can lead to misleading interpretations, analogous to the interpolation pathologies discussed earlier for pullback geodesics\.

This issue was also anticipated in the iso\-Riemannian geometry framework of[diepeveen2025manifold](https://arxiv.org/html/2605.24113#bib.bib22)\. For the logarithmic map, this means keeping the direction oflogTφ\(𝐱\)φ⁡\(𝐯\(j\)\)\\log^\{\\varphi\}\_\{T^\{\\varphi\}\(\\mathbf\{x\}\)\}\(\\mathbf\{v\}^\{\(j\)\}\)but rescaling its length so that it equals theℓ2\\ell^\{2\}\-arc length of the geodesic connecting the endpoints\. Specifically, writinglog𝐱φ,iso:ℝd→𝒯𝐱ℝd\\log\_\{\\mathbf\{x\}\}^\{\\varphi,\\operatorname\{iso\}\}:\\mathbb\{R\}^\{d\}\\to\\mathcal\{T\}\_\{\\mathbf\{x\}\}\\mathbb\{R\}^\{d\}for the iso\-logarithmic map under\(ℝd,\(⋅,⋅\)φ\)\(\\mathbb\{R\}^\{d\},\(\\cdot,\\cdot\)^\{\\varphi\}\), we have

log𝐱φ,iso⁡\(𝐲\)=∫01‖γ˙𝐱,𝐲φ\(s\)‖2𝑑s‖log𝐱φ⁡\(𝐲\)‖log𝐱φ⁡\(𝐲\),\\log\_\{\\mathbf\{x\}\}^\{\\varphi,\\operatorname\{iso\}\}\(\\mathbf\{y\}\)=\\frac\{\\displaystyle\\int\_\{0\}^\{1\}\\bigl\\\|\\dot\{\\gamma\}^\{\\varphi\}\_\{\\mathbf\{x\},\\mathbf\{y\}\}\(s\)\\bigr\\\|\_\{2\}\\,ds\}\{\\bigl\\\|\\log\_\{\\mathbf\{x\}\}^\{\\varphi\}\(\\mathbf\{y\}\)\\bigr\\\|\}\\,\\log\_\{\\mathbf\{x\}\}^\{\\varphi\}\(\\mathbf\{y\}\),\(32\)wherelog𝐱φ⁡\(𝐲\)\\log\_\{\\mathbf\{x\}\}^\{\\varphi\}\(\\mathbf\{y\}\)is the standard pullback log between𝐱\\mathbf\{x\}and𝐲∈ℝd\\mathbf\{y\}\\in\\mathbb\{R\}^\{d\}\.

Rather than using𝐠∗\\mathbf\{g\}^\{\*\}to weight the pullback logs, we therefore seek weights that balance the iso\-logs to the archetypes, i\.e\., weights𝐡∗\\mathbf\{h\}^\{\*\}on the simplex satisfying

∑j=1r𝐡j∗logTφ\(𝐱\)φ,iso⁡\(𝐯\(j\)\)=𝟎\.\\sum\_\{j=1\}^\{r\}\\mathbf\{h\}^\{\*\}\_\{j\}\\log^\{\\varphi,\\operatorname\{iso\}\}\_\{T^\{\\varphi\}\(\\mathbf\{x\}\)\}\(\\mathbf\{v\}^\{\(j\)\}\)=\\mathbf\{0\}\.\(33\)Such a solution always exists; in particular, one can verify that

𝐡j∗:=cj𝐠j∗∑j′=1rcj′𝐠j′∗,withcj:=\{‖logTφ\(𝐱\)φ⁡\(𝐯\(j\)\)‖∫01‖γ˙Tφ\(𝐱\),𝐯\(j\)φ\(s\)‖2𝑑sifTφ\(𝐱\)≠𝐯\(j\),1otherwise,\\mathbf\{h\}^\{\*\}\_\{j\}:=\\frac\{c\_\{j\}\\,\\mathbf\{g\}^\{\*\}\_\{j\}\}\{\\sum\_\{j^\{\\prime\}=1\}^\{r\}c\_\{j^\{\\prime\}\}\\mathbf\{g\}^\{\*\}\_\{j^\{\\prime\}\}\},\\quad\\text\{with\}\\quad c\_\{j\}:=\\begin\{cases\}\\displaystyle\\frac\{\\bigl\\\|\\log\_\{T^\{\\varphi\}\(\\mathbf\{x\}\)\}^\{\\varphi\}\(\\mathbf\{v\}^\{\(j\)\}\)\\bigr\\\|\}\{\\displaystyle\\int\_\{0\}^\{1\}\\bigl\\\|\\dot\{\\gamma\}^\{\\varphi\}\_\{T^\{\\varphi\}\(\\mathbf\{x\}\),\\mathbf\{v\}^\{\(j\)\}\}\(s\)\\bigr\\\|\_\{2\}\\,ds\}&\\text\{if \}T^\{\\varphi\}\(\\mathbf\{x\}\)\\neq\\mathbf\{v\}^\{\(j\)\},\\\\ \\\\ \\qquad\\quad\\quad 1&\\text\{otherwise\},\\end\{cases\}\(34\)indeed satisfies \([33](https://arxiv.org/html/2605.24113#S4.E33)\)\. Importantly, we never need to evaluate the iso\-logs themselves – only the scaling coefficientscjc\_\{j\}, which suffice to obtain the corrected classification – and always have𝐡∗=𝐠∗\\mathbf\{h\}^\{\*\}=\\mathbf\{g\}^\{\*\}in the classical Euclidean AA setting\.

[Figure5](https://arxiv.org/html/2605.24113#S4.F5)illustrates this effect for the deformed star distribution we have been considering\. Here, the pullback logarithms \(pink\) do not encode the trueℓ2\\ell^\{2\}\-distance to the archetypes, whereas the iso\-logarithms \(purple\) do\. As a consequence, naive classification overemphasizes the⋆\{\\color\[rgb\]\{0\.8941,0\.1019,0\.1098\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.8941,0\.1019,0\.1098\}\\boldsymbol\{\\star\}\}\-archetype for a point lying relatively close to the geodesic connecting the⋆\{\\color\[rgb\]\{0\.8941,0\.1019,0\.1098\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.8941,0\.1019,0\.1098\}\\boldsymbol\{\\star\}\}\- and▲\{\\color\[rgb\]\{0\.8941,0\.1019,0\.1098\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.8941,0\.1019,0\.1098\}\\boldsymbol\{\\blacktriangle\}\}\-archetype, while the corrected weights sketch a more tempered, and geometrically faithful, picture\.

![Refer to caption](https://arxiv.org/html/2605.24113v1/results/river_cross/log_Omega_val.png)\(a\)Pullback logs \(pink\) and iso\-pullback logs \(purple\)\.
![Refer to caption](https://arxiv.org/html/2605.24113v1/results/river_cross/weights_comparison.png)\(b\)Naive and iso\-corrected weights

Figure 5:The pullback logarithmic mappings \(pink\) do not carry information on how far away the archetypes are in anℓ2\\ell^\{2\}\-sense – unlike the iso\-logarithmic mappings \(purple\) –, which causes classification on the⋆\{\\color\[rgb\]\{0\.8941,0\.1019,0\.1098\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.8941,0\.1019,0\.1098\}\\boldsymbol\{\\star\}\}\-archetype to be unreasonably high for a data point that lives relatively close to the geodesic connecting the⋆\{\\color\[rgb\]\{0\.8941,0\.1019,0\.1098\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.8941,0\.1019,0\.1098\}\\boldsymbol\{\\star\}\}\- and▲\{\\color\[rgb\]\{0\.8941,0\.1019,0\.1098\}\\definecolor\[named\]\{pgfstrokecolor\}\{rgb\}\{0\.8941,0\.1019,0\.1098\}\\boldsymbol\{\\blacktriangle\}\}\-archetype\. When using the iso\-corrected weights, the classification is more modest and geometrically faithful\.

## 5Learning deformed star distributions

Finally, the remaining task is to \(jointly\) learn a distributionpϕ,ρ:ℝd→ℝp\_\{\\phi,\\rho\}:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}of the form \([10](https://arxiv.org/html/2605.24113#S3.E10)\) – whereϕ:ℝd→ℝd\\phi:\\mathbb\{R\}^\{d\}\\to\\mathbb\{R\}^\{d\}is a diffeomorphism with constant Jacobian determinant andρ:𝕊d−1→ℝ\>0\\rho:\\mathbb\{S\}^\{d\-1\}\\to\\mathbb\{R\}\_\{\>0\}is a radial function – and archetypes𝐯\(1\),…,𝐯\(r\)∈ℝd\\mathbf\{v\}^\{\(1\)\},\\ldots,\\mathbf\{v\}^\{\(r\)\}\\in\\mathbb\{R\}^\{d\}from data𝐱\(1\),…,𝐱\(n\)∼pdata\\mathbf\{x\}^\{\(1\)\},\\ldots,\\mathbf\{x\}^\{\(n\)\}\\sim p\_\{\\text\{data\}\}\. Assuming that the true data distributionpdatap\_\{\\text\{data\}\}is well\-approximated by such a distribution and well\-represented by such archetypes, the resulting pullback structure and Riemannian archetypal mapping \(RAM\) enable the data analysis tasks introduced earlier, i\.e\., interpolation \([Section3](https://arxiv.org/html/2605.24113#S3)\) and denoising and classification \([Section4](https://arxiv.org/html/2605.24113#S4)\)\. In what follows, we first discuss why classical negative log\-likelihood training for learningpϕ,ρp\_\{\\phi,\\rho\}is likely to encounter difficulties, and how a three\-step procedure can nevertheless yield a reasonable – albeit suboptimal – approximation\. We show that existing methods, after suitable adaptations, address all but one component in our three\-step approach: learning the radial function, to which we devote a more detailed treatment\. As before, all proofs are deferred to[AppendixC](https://arxiv.org/html/2605.24113#A3)\.

### 5\.1A three\-step approach

Focusing on the distributionpϕ,ρp\_\{\\phi,\\rho\}, one would ideally seek a model that is closest to the true data distribution in Kullback\-Leibler divergence\. In particular, if the true distribution were of this form, it could in principle be recovered by solving

\(ϕ∗,ρ∗\)∈argminϕ,ρ𝔼𝐗∼pdata\[−log⁡pϕ,ρ\(𝐗\)\]\.\(\\phi^\{\*\},\\rho^\{\*\}\)\\in\\operatorname\*\{argmin\}\_\{\\phi,\\rho\}\\,\\mathbb\{E\}\_\{\\mathbf\{\\mathbf\{X\}\}\\sim p\_\{\\text\{data\}\}\}\\left\[\-\\log p\_\{\\phi,\\rho\}\(\\mathbf\{X\}\)\\right\]\.\(35\)To understand why this objective is challenging, consider the expanded form of the negative log\-likelihood:

−log⁡pϕ,ρ\(𝐱\)=12ρ\(ϕ\(𝐱\)‖ϕ\(𝐱\)‖2\)−2‖ϕ\(𝐱\)‖22−log⁡\(\|detD𝐱ϕ\|\)\+log⁡\(∫𝕊d−1ρ\(ω\)d𝑑σ\(ω\)\)\+log⁡\(2d2−1Γ\(d2\)\)\.\-\\log p\_\{\\phi,\\rho\}\(\\mathbf\{x\}\)=\\frac\{1\}\{2\}\\,\\rho\\Bigl\(\\frac\{\\phi\(\\mathbf\{x\}\)\}\{\\\|\\phi\(\\mathbf\{x\}\)\\\|\_\{2\}\}\\Bigr\)^\{\-2\}\\\|\\phi\(\\mathbf\{x\}\)\\\|\_\{2\}^\{2\}\-\\log\\bigl\(\|\\det D\_\{\\mathbf\{x\}\}\\phi\|\\bigr\)\+\\log\\Bigl\(\\int\_\{\\mathbb\{S\}^\{d\-1\}\}\\rho\(\\omega\)^\{d\}\\,d\\sigma\(\\omega\)\\Bigr\)\+\\log\\Bigl\(2^\{\\frac\{d\}\{2\}\-1\}\\Gamma\\bigl\(\\tfrac\{d\}\{2\}\\bigr\)\\Bigr\)\.\(36\)This expression shows that optimizing the likelihood requires evaluating an integral over the high\-dimensional sphere\. For general radial functions, such integrals are not tractable in closed form, while Monte Carlo approximations become numerically unstable due to the exponentdd, especially in the high\-dimensional regimes of interest\.

Motivated by these difficulties, we instead adopt an alternative approach that allows us to learn the diffeomorphismϕ\\phi, the archetypes𝐯\(1\),…,𝐯\(r\)\\mathbf\{v\}^\{\(1\)\},\\ldots,\\mathbf\{v\}^\{\(r\)\}, and the radial functionρ\\rhoseparately\.

##### Step 1: finding a diffeomorphism

We begin by observing that any diffeomorphism with constant Jacobian determinant cannot alter the topology of the level sets of a distribution\. In particular, if the target distribution has the structure of a deformed star, such a diffeomorphism will map it to another deformed star distribution\. This observation provides guidance for selecting a suitable diffeomorphism\.

In particular, suppose we aim for the pushforward distribution to be as isotropic as possible\. Then we expect the latent distribution to exhibit predominantly isotropic behavior, while retaining star\-shaped structure in certain directions\. Enforcing an isotropic latent distribution corresponds precisely to standard normalizing flow training[dinh2014nice](https://arxiv.org/html/2605.24113#bib.bib25), which we know to scale well\. That is, we seekϕθ∗\\phi\_\{\\theta^\{\*\}\}, for a suitable parametrizationθ∗\\theta^\{\*\}that ensures a constant Jacobian determinant, solving

θ∗∈argminθ𝔼𝐗∼pdata\[−log⁡pϕθ\(𝐗\)\],wherepϕθ\(𝐱\):=1\(2π\)dexp⁡\(−12‖ϕθ\(𝐱\)‖22\)\|det\(D𝐱ϕθ\)\|\.\\theta^\{\*\}\\in\\operatorname\*\{argmin\}\_\{\\theta\}\\,\\mathbb\{E\}\_\{\\mathbf\{\\mathbf\{X\}\}\\sim p\_\{\\text\{data\}\}\}\\left\[\-\\log p\_\{\\phi\_\{\\theta\}\}\(\\mathbf\{X\}\)\\right\],\\quad\\text\{where\}\\quad p\_\{\\phi\_\{\\theta\}\}\(\\mathbf\{x\}\):=\\frac\{1\}\{\\sqrt\{\(2\\pi\)^\{d\}\}\}\\exp\\Bigl\(\-\\tfrac\{1\}\{2\}\\\|\\phi\_\{\\theta\}\(\\mathbf\{x\}\)\\\|\_\{2\}^\{2\}\\Bigr\)\\,\\bigl\|\\det\(D\_\{\\mathbf\{x\}\}\\phi\_\{\\theta\}\)\\bigr\|\.\(37\)

##### Step 2: finding a radial decomposition via archetypes

Next, rather than attempting to fit all branches simultaneously, it might be better to treat each branch separately and consider radial functions of the form

ρ\(ω\):=softmax⁡\(ρ1\(ω\),…,ρr\(ω\)\),\\rho\(\\omega\):=\\operatorname\{softmax\}\\bigl\(\\rho\_\{1\}\(\\omega\),\\ldots,\\rho\_\{r\}\(\\omega\)\\bigr\),\(38\)where eachρj:𝕊d−1→ℝ\>0\\rho\_\{j\}:\\mathbb\{S\}^\{d\-1\}\\to\\mathbb\{R\}\_\{\>0\}describes the radial behavior of a single “branch” of the data\. In[Figure6\(a\)](https://arxiv.org/html/2605.24113#S5.F6.sf1)– representing latent data𝐲\(i\):=ϕθ∗\(𝐱\(i\)\)\\mathbf\{y\}^\{\(i\)\}:=\\phi\_\{\\theta^\{\*\}\}\(\\mathbf\{x\}^\{\(i\)\}\)fori=1,…ni=1,\\ldots n–, this corresponds to the four individual arms of the distribution\.

There are two natural design choices for how to incorporate archetypes:

1. \(i\)each branch gets a single archetype,
2. \(ii\)each branch gets multiple archetypes\.

The first option is most appropriate when we lack prior information about which parts of the data belong to which branch, while the second fits labeled data where, within each class, we aim to identify several archetypes that represent it\. In practice the situation need not be this binary, but for the purposes of this work it is useful to analyze both scenarios separately\.

![Refer to caption](https://arxiv.org/html/2605.24113v1/results/tree/data.png)\(a\)Samples of a star distribution
![Refer to caption](https://arxiv.org/html/2605.24113v1/results/tree/labeled_data.png)\(b\)Sample archetypes and corresponding classification

Figure 6:Instead of fitting a single radial function to the entire dataset, we first decompose the latent data \(left\) using classical archetypal analysis \(right\) and then model each resulting branch individually\.To make \(i\) work, we must determine which points belong to which branch, after which it remains to find a radial functionρj\\rho\_\{j\}that captures the geometry of those data points alone\. This can be achieved by applying classical archetypal analysis \(AA\)[cutler1994archetypal](https://arxiv.org/html/2605.24113#bib.bib15)in the latent space, i\.e\., chooser∈ℕr\\in\\mathbb\{N\}and solve

inf𝐅∈ℝd×r,𝐆∈ℝr×n‖𝐘−𝐘𝐅𝐆‖F2,s\.t\.𝐟\(j\)∈Δn,𝐠\(i\)∈Δr,\\inf\_\{\\mathbf\{F\}\\in\\mathbb\{R\}^\{d\\times r\},\\;\\mathbf\{G\}\\in\\mathbb\{R\}^\{r\\times n\}\}\\bigl\\\|\\mathbf\{Y\}\-\\mathbf\{Y\}\\,\\mathbf\{F\}\\mathbf\{G\}\\bigr\\\|\_\{F\}^\{2\},\\quad\\text\{s\.t\.\}\\quad\\mathbf\{f\}^\{\(j\)\}\\in\\Delta^\{n\},\\;\\mathbf\{g\}^\{\(i\)\}\\in\\Delta^\{r\},\(39\)where𝐘:=\[𝐲\(1\),…,𝐲\(n\)\]∈ℝd×n\\mathbf\{Y\}:=\[\\mathbf\{y\}^\{\(1\)\},\\ldots,\\mathbf\{y\}^\{\(n\)\}\]\\in\\mathbb\{R\}^\{d\\times n\}collects the mapped data points, and𝐅:=\[𝐟\(1\),…,𝐟\(r\)\]\\mathbf\{F\}:=\[\\mathbf\{f\}^\{\(1\)\},\\ldots,\\mathbf\{f\}^\{\(r\)\}\]and𝐆:=\[𝐠\(1\),…,𝐠\(n\)\]\\mathbf\{G\}:=\[\\mathbf\{g\}^\{\(1\)\},\\ldots,\\mathbf\{g\}^\{\(n\)\}\]are the factor matrices whose columns lie in the corresponding unit simplices\.

This yields archetypes

𝐯θ∗\(j\):=ϕθ∗−1\(𝐘𝐟\(j\)\)=ϕθ∗−1\(∑i=1n𝐟i\(j\)ϕθ∗\(𝐱\(i\)\)\)∈ℝd,\\mathbf\{v\}\_\{\\theta^\{\*\}\}^\{\(j\)\}:=\\phi\_\{\\theta^\{\*\}\}^\{\-1\}\\bigl\(\\mathbf\{Y\}\\mathbf\{f\}^\{\(j\)\}\\bigr\)=\\phi\_\{\\theta^\{\*\}\}^\{\-1\}\\Bigl\(\\sum\_\{i=1\}^\{n\}\\mathbf\{f\}\_\{i\}^\{\(j\)\}\\,\\phi\_\{\\theta^\{\*\}\}\(\\mathbf\{x\}^\{\(i\)\}\)\\Bigr\)\\in\\mathbb\{R\}^\{d\},\(40\)and assigns each point𝐱\(i\)\\mathbf\{x\}^\{\(i\)\}a class label

cθ∗\(i\):=argmaxj=1,…,r𝐠j\(i\),c\_\{\\theta^\{\*\}\}^\{\(i\)\}:=\\operatorname\*\{argmax\}\_\{j=1,\\ldots,r\}\\mathbf\{g\}^\{\(i\)\}\_\{j\},\(41\)and is illustrated forr=4r=4in[Figure6\(b\)](https://arxiv.org/html/2605.24113#S5.F6.sf2)\.

For case \(ii\), we can proceed in an analogous way by solving \([39](https://arxiv.org/html/2605.24113#S5.E39)\) separately on each labeled subset of the data\. We then compute the corresponding archetypes as in \([40](https://arxiv.org/html/2605.24113#S5.E40)\), but now there is no need to perform an additional classification step within an already labeled data set\.

##### Step 3: finding radial functions

For each class, the remaining task is to define an appropriate radial functionρj\\rho\_\{j\}\. We propose to construct this function by assuming that the branch associated withρj\\rho\_\{j\}forms a convex set that tightly encloses the data while containing the origin\. Given such a convex set𝒦⊂ℝd\\mathcal\{K\}\\subset\\mathbb\{R\}^\{d\}– containing the origin – we can compute its radial function via[leong2025optimal](https://arxiv.org/html/2605.24113#bib.bib45)

ρ𝒦\(ω\):=sup\{t\>0∣t⋅ω∈𝒦\}\.\\rho\_\{\\mathcal\{K\}\}\(\\omega\):=\\sup\\\{t\>0\\,\\mid\\,t\\cdot\\omega\\in\\mathcal\{K\}\\\}\.\(45\)Although this construction may not be optimal with respect to log\-likelihood \([35](https://arxiv.org/html/2605.24113#S5.E35)\), it yields closed\-form expressions for the radial function for certain classes of convex sets and guarantees that all data points are assigned high likelihood\.

In contrast to Steps 1 and 2, we can no longer rely on existing methods and must instead design a dedicated procedure to identify suitable convex sets and their corresponding radial functions\.

### 5\.2Learning ellipsoidal radial functions

We propose an ellipsoid\-based construction of radial functions for each branchρj\\rho\_\{j\}\. Specifically, we model each branch using ellipsoidal sets of the form

ℰΣ\(μ\):=\{𝐲∈ℝd∣\(𝐲−μ\)⊤Σ−1\(𝐲−μ\)≤1\},\\mathcal\{E\}\_\{\\Sigma\}\(\\mu\):=\\\{\\mathbf\{y\}\\in\\mathbb\{R\}^\{d\}\\;\\mid\\;\(\\mathbf\{y\}\-\\mu\)^\{\\top\}\\Sigma^\{\-1\}\(\\mathbf\{y\}\-\\mu\)\\leq 1\\\},\(46\)whereΣ∈ℝd×d\\Sigma\\in\\mathbb\{R\}^\{d\\times d\}is symmetric positive definite andμ∈ℝd\\mu\\in\\mathbb\{R\}^\{d\}denotes the center\. To ensure that these sets admit a well\-defined radial function, we require𝟎∈int⁡\(ℰΣ\(μ\)\)\\mathbf\{0\}\\in\\operatorname\{int\}\(\\mathcal\{E\}\_\{\\Sigma\}\(\\mu\)\), which is equivalent to the conditionμ⊤Σ−1μ<1\\mu^\{\\top\}\\Sigma^\{\-1\}\\mu<1\.

Rather than relying on a single ellipsoid, we instead consider the \(soft\) intersection of two ellipsoids\. This leads to a radial function for each branch of the form

ρj\(ω\):=softmin⁡\(ρΣo\(j\),μ\(j\)\(ω\),ρΣc\(j\),𝟎\(ω\)\),\\rho\_\{j\}\(\\omega\):=\\operatorname\{softmin\}\\bigl\(\\rho\_\{\\Sigma^\{\(j\)\}\_\{o\},\\mu^\{\(j\)\}\}\(\\omega\),\\;\\rho\_\{\\Sigma^\{\(j\)\}\_\{c\},\\mathbf\{0\}\}\(\\omega\)\\bigr\),\(47\)where each component is given by

ρΣ,μ\(ω\):=ρℰΣ\(μ\)\(ω\)=sup\{t\>0∣t⋅ω∈ℰΣ\(μ\)\}\.\\rho\_\{\\Sigma,\\mu\}\(\\omega\):=\\rho\_\{\\mathcal\{E\}\_\{\\Sigma\}\(\\mu\)\}\(\\omega\)=\\sup\\\{t\>0\\,\\mid\\,t\\cdot\\omega\\in\\mathcal\{E\}\_\{\\Sigma\}\(\\mu\)\\\}\.\(48\)
##### Ellipsoidal radial functions

Before proceeding to the construction of suitable ellipsoids, we first note that \([48](https://arxiv.org/html/2605.24113#S5.E48)\) admits a more convenient closed\-form expression\.

###### Proposition 3\.

LetΣ∈ℝd×d\\Sigma\\in\\mathbb\{R\}^\{d\\times d\}be a symmetric positive definite matrix and letμ∈ℝd\\mu\\in\\mathbb\{R\}^\{d\}be a vector\. Furthermore, assume thatμ⊤Σ−1μ<1\\mu^\{\\top\}\\Sigma^\{\-1\}\\mu<1\.

Then, the radial function \([48](https://arxiv.org/html/2605.24113#S5.E48)\) of the ellipsoid generated byΣ\\Sigmaandμ\\musatisfies

ρΣ,μ\(ω\)=ω⊤Σ−1μ\+\(ω⊤Σ−1μ\)2\+\(ω⊤Σ−1ω\)\(1−μ⊤Σ−1μ\)ω⊤Σ−1ω\\rho\_\{\\Sigma,\\mu\}\(\\omega\)=\\frac\{\\omega^\{\\top\}\\Sigma^\{\-1\}\\mu\+\\sqrt\{\\left\(\\omega^\{\\top\}\\Sigma^\{\-1\}\\mu\\right\)^\{2\}\+\(\\omega^\{\\top\}\\Sigma^\{\-1\}\\omega\)\\left\(1\-\\mu^\{\\top\}\\Sigma^\{\-1\}\\mu\\right\)\}\}\{\\omega^\{\\top\}\\Sigma^\{\-1\}\\omega\}\(49\)In particular, forμ=𝟎\\mu=\\mathbf\{0\}the radial function reduces to

ρΣ,μ\(ω\)=\(ω⊤Σ−1ω\)−12\\rho\_\{\\Sigma,\\mu\}\(\\omega\)=\(\\omega^\{\\top\}\\Sigma^\{\-1\}\\omega\)^\{\-\\frac\{1\}\{2\}\}\(50\)

##### Data enclosing ellipsoids

Selecting an ellipsoid of the form \([46](https://arxiv.org/html/2605.24113#S5.E46)\) amounts to deciding how tightly the latent data𝐲\(1\),…,𝐲\(n\)\\mathbf\{y\}^\{\(1\)\},\\ldots,\\mathbf\{y\}^\{\(n\)\}associated with a given branch333Strictly speaking, one should write𝐲~\(1\),…,𝐲~\(m\)∈\{𝐲\(i\)∣cθ∗\(i\)=j\}\\tilde\{\\mathbf\{y\}\}^\{\(1\)\},\\ldots,\\tilde\{\\mathbf\{y\}\}^\{\(m\)\}\\in\\\{\\mathbf\{y\}^\{\(i\)\}\\,\\mid\\,c\_\{\\theta^\{\*\}\}^\{\(i\)\}=j\\\}, but we omit this for notational simplicity\.should be enclosed\. In the presence of outliers, enclosing every data point may be undesirable\. Instead, we aim to construct an off\-centered ellipsoid with a meaningful centerμ\\muand a symmetric positive definite matrixΣo\\Sigma\_\{o\}, such that the origin is guaranteed to lie in its interior while the data are captured in an average sense\. These requirements can be satisfied through a careful construction\.

###### Proposition 4\.

Let𝐲\(1\),…,𝐲\(n\)∈ℝd\\mathbf\{y\}^\{\(1\)\},\\ldots,\\mathbf\{y\}^\{\(n\)\}\\in\\mathbb\{R\}^\{d\}be anynnvectors and letα\>1\\alpha\>1andβ∈\(0,α\)\\beta\\in\(0,\\alpha\)be positive real numbers\. Furthermore, letμ:=1n∑i=1n𝐲\(i\)\\mu:=\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}\\mathbf\{y\}^\{\(i\)\}be the mean of the vectors and let𝐏μ:=μμ⊤‖μ‖22∈ℝd×d\\mathbf\{P\}\_\{\\mu\}:=\\frac\{\\mu\\mu^\{\\top\}\}\{\\\|\\mu\\\|\_\{2\}^\{2\}\}\\in\\mathbb\{R\}^\{d\\times d\}be the projection matrix onto the subspace generated byμ\\mu\. Finally, consider the singular value decomposition

\(𝐈−𝐏μ\)\[𝐲\(1\),…,𝐲\(n\)\]=𝐔𝐒𝐕⊤,𝐔∈ℝd×\(d−1\),𝐒∈ℝ\(d−1\)×\(d−1\),𝐕∈ℝn×\(d−1\),\(\\mathbf\{I\}\-\\mathbf\{P\}\_\{\\mu\}\)\[\\mathbf\{y\}^\{\(1\)\},\\ldots,\\mathbf\{y\}^\{\(n\)\}\]=\\mathbf\{U\}\\mathbf\{S\}\\mathbf\{V\}^\{\\top\},\\quad\\mathbf\{U\}\\in\\mathbb\{R\}^\{d\\times\(d\-1\)\},\\mathbf\{S\}\\in\\mathbb\{R\}^\{\(d\-1\)\\times\(d\-1\)\},\\mathbf\{V\}\\in\\mathbb\{R\}^\{n\\times\(d\-1\)\},\(52\)where we writeςk≥0\\varsigma\_\{k\}\\geq 0for the singular values on the diagonal of the matrix𝐒\\mathbf\{S\}\.

Then, the symmetric positive definite matrixΣo:=𝐖Λo𝐖⊤\\Sigma\_\{o\}:=\\mathbf\{W\}\\Lambda\_\{o\}\\mathbf\{W\}^\{\\top\}, defined through the orthogonal matrix

𝐖:=\[1‖μ‖2μ,𝐔\]=\[1‖μ‖2μ,𝐮\(1\),…𝐮\(d−1\)\]∈ℝd×d\\mathbf\{W\}:=\[\\frac\{1\}\{\\\|\\mu\\\|\_\{2\}\}\\mu,\\mathbf\{U\}\]=\[\\frac\{1\}\{\\\|\\mu\\\|\_\{2\}\}\\mu,\\mathbf\{u\}^\{\(1\)\},\\ldots\\mathbf\{u\}^\{\(d\-1\)\}\]\\in\\mathbb\{R\}^\{d\\times d\}\(53\)and the diagonal matrix

Λo:=diag⁡\(λ1,…,λd\),λk:=\{max⁡\{dn∑i=1n\(μ⊤\(𝐲\(i\)−μ\)\)2‖μ‖22,α\}k=1,max⁡\{dnςk−12,β\}k=2,…,d,\\Lambda\_\{o\}:=\\operatorname\{diag\}\(\\lambda\_\{1\},\\ldots,\\lambda\_\{d\}\),\\quad\\lambda\_\{k\}:=\\begin\{cases\}\\max\\Bigl\\\{\\frac\{d\}\{n\}\\sum\_\{i=1\}^\{n\}\\frac\{\(\\mu^\{\\top\}\(\\mathbf\{y\}^\{\(i\)\}\-\\mu\)\)^\{2\}\}\{\\\|\\mu\\\|\_\{2\}^\{2\}\},\\alpha\\Bigr\\\}&k=1,\\\\ \\quad\\quad\\max\\Bigl\\\{\\frac\{d\}\{n\}\\varsigma\_\{k\-1\}^\{2\},\\beta\\Bigr\\\}&k=2,\\ldots,d,\\end\{cases\}\(54\)satisfies

1n∑i=1n\(𝐲\(i\)−μ\)⊤Σo−1\(𝐲\(i\)−μ\)≤1,andμ⊤Σo−1μ<1\.\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}\(\\mathbf\{y\}^\{\(i\)\}\-\\mu\)^\{\\top\}\\Sigma\_\{o\}^\{\-1\}\(\\mathbf\{y\}^\{\(i\)\}\-\\mu\)\\leq 1,\\quad\\text\{and\}\\quad\\mu^\{\\top\}\\Sigma\_\{o\}^\{\-1\}\\mu<1\.\(55\)

To motivate the inclusion of a second, centered ellipsoid in the radial function \([47](https://arxiv.org/html/2605.24113#S5.E47)\), observe that geodesics under a radial function from the previously constructed ellipsoid can be numerically unstable, when used in isolation\. In particular, if its center lies far from the origin, the resulting set can extend excessively outward, which is undesirable for interpolation in terms of geodesics swinging out undesirably\. While this effect can be partially mitigated through a concave transformation, it is generally preferable to work with enclosing sets that remain as compact as possible\. Consequently, our primary interest lies in the portion of the set that is close to the origin, while still capturing the data in an average sense and ensuring that the data mean is contained\. This motivates the introduction of a second ellipsoid centered at the origin, which can be constructed in an analogous manner\.

###### Proposition 5\.

Let𝐲\(1\),…,𝐲\(n\)∈ℝd\\mathbf\{y\}^\{\(1\)\},\\ldots,\\mathbf\{y\}^\{\(n\)\}\\in\\mathbb\{R\}^\{d\}be anynnvectors and letα\>1\\alpha\>1andβ∈\(0,α\)\\beta\\in\(0,\\alpha\)be positive real numbers\. Furthermore, letμ:=1n∑i=1n𝐲\(i\)\\mu:=\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}\\mathbf\{y\}^\{\(i\)\}be the mean of the vectors and let𝐏μ:=μμ⊤‖μ‖22∈ℝd×d\\mathbf\{P\}\_\{\\mu\}:=\\frac\{\\mu\\mu^\{\\top\}\}\{\\\|\\mu\\\|\_\{2\}^\{2\}\}\\in\\mathbb\{R\}^\{d\\times d\}be the projection matrix onto the subspace generated byμ\\mu\. Finally, consider the singular value decomposition

\(𝐈−𝐏μ\)\[𝐲\(1\),…,𝐲\(n\)\]=𝐔𝐒𝐕⊤,𝐔∈ℝd×\(d−1\),𝐒∈ℝ\(d−1\)×\(d−1\),𝐕∈ℝn×\(d−1\),\(\\mathbf\{I\}\-\\mathbf\{P\}\_\{\\mu\}\)\[\\mathbf\{y\}^\{\(1\)\},\\ldots,\\mathbf\{y\}^\{\(n\)\}\]=\\mathbf\{U\}\\mathbf\{S\}\\mathbf\{V\}^\{\\top\},\\quad\\mathbf\{U\}\\in\\mathbb\{R\}^\{d\\times\(d\-1\)\},\\mathbf\{S\}\\in\\mathbb\{R\}^\{\(d\-1\)\\times\(d\-1\)\},\\mathbf\{V\}\\in\\mathbb\{R\}^\{n\\times\(d\-1\)\},\(56\)where we writeςk≥0\\varsigma\_\{k\}\\geq 0for the singular values on the diagonal of the matrix𝐒\\mathbf\{S\}\.

Then, the symmetric positive definite matrixΣc:=𝐖Λc𝐖⊤\\Sigma\_\{c\}:=\\mathbf\{W\}\\Lambda\_\{c\}\\mathbf\{W\}^\{\\top\}, defined through the orthogonal matrix

𝐖:=\[1‖μ‖2μ,𝐔\]=\[1‖μ‖2μ,𝐮\(1\),…𝐮\(d−1\)\]∈ℝd×d\\mathbf\{W\}:=\[\\frac\{1\}\{\\\|\\mu\\\|\_\{2\}\}\\mu,\\mathbf\{U\}\]=\[\\frac\{1\}\{\\\|\\mu\\\|\_\{2\}\}\\mu,\\mathbf\{u\}^\{\(1\)\},\\ldots\\mathbf\{u\}^\{\(d\-1\)\}\]\\in\\mathbb\{R\}^\{d\\times d\}\(57\)and the diagonal matrix

Λc:=diag⁡\(λ1,…,λd\),λk:=\{max⁡\{dn∑i=1n\(μ⊤𝐲\(i\)\)2‖μ‖22,α\}k=1,max⁡\{dnςk−12,β\}k=2,…,d,\\Lambda\_\{c\}:=\\operatorname\{diag\}\(\\lambda\_\{1\},\\ldots,\\lambda\_\{d\}\),\\quad\\lambda\_\{k\}:=\\begin\{cases\}\\max\\Bigl\\\{\\frac\{d\}\{n\}\\sum\_\{i=1\}^\{n\}\\frac\{\(\\mu^\{\\top\}\\mathbf\{y\}^\{\(i\)\}\)^\{2\}\}\{\\\|\\mu\\\|\_\{2\}^\{2\}\},\\alpha\\Bigr\\\}&k=1,\\\\ \\quad\\quad\\max\\Bigl\\\{\\frac\{d\}\{n\}\\varsigma\_\{k\-1\}^\{2\},\\beta\\Bigr\\\}&k=2,\\ldots,d,\\end\{cases\}\(58\)satisfies

1n∑i=1n\(𝐲\(i\)\)⊤Σc−1𝐲\(i\)≤1,andμ⊤Σc−1μ<1\.\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}\(\\mathbf\{y\}^\{\(i\)\}\)^\{\\top\}\\Sigma\_\{c\}^\{\-1\}\\mathbf\{y\}^\{\(i\)\}\\leq 1,\\quad\\text\{and\}\\quad\\mu^\{\\top\}\\Sigma\_\{c\}^\{\-1\}\\mu<1\.\(59\)

Returning to the example in[Figure6](https://arxiv.org/html/2605.24113#S5.F6), we observe that this construction produces geodesics with a well\-behaved shape – along with its variant incorporating a strongly concave functionv\(s\):=log⁡\(5s\+1\)v\(s\):=\\log\(5s\+1\)–, as illustrated in[Figure7](https://arxiv.org/html/2605.24113#S5.F7)\.

![Refer to caption](https://arxiv.org/html/2605.24113v1/results/tree/data_Omega_geodesics.png)\(a\)Pullback geodesics underφ:=ψv∘χρ\\varphi:=\\psi\_\{v\}\\circ\\chi\_\{\\rho\}\.
![Refer to caption](https://arxiv.org/html/2605.24113v1/results/tree/data_chi_geodesics.png)\(b\)Pullback geodesics underφ:=χρ\\varphi:=\\chi\_\{\\rho\}\.

Figure 7:The proposed construction \([38](https://arxiv.org/html/2605.24113#S5.E38)\) combined with \([47](https://arxiv.org/html/2605.24113#S5.E47)\) yields geodesics between archetypes that remain close to the data support under the naive construction \(right\)\. These can be further drawn inward by incorporating a second diffeomorphism \(left\)\.

## 6Numerical Experiments

The proposed framework has already shown to performs well on the low\-dimensional examples in previous sections, but a number of questions remain open\. In particular, it is not yet clear \(i\) how effectively the three\-step procedure in[Section5](https://arxiv.org/html/2605.24113#S5)can learn deformed star densities \(in both labeled and unlabeled settings\), \(ii\) how well the star geometry captures high\-dimensional data sets, for example in terms of the quality of interpolation, and \(iii\) whether our RAM\-efficient optimization scheme requires further refinement, either for optimization itself or for downstream classification performance\. Moreover, we would like to understand how iso\-Riemannian geometry influences the various data\-analysis tasks within this framework\.

Since no ground truth is available in high dimensions, we require a setting where we can visually assess whether geodesics meaningfully traverse the data cloud, and which naturally admits both labeled and unlabeled variants\. To this end, we consider the MNIST data set in two ways: first, we select a single digit class and apply our learning pipeline in the unlabeled setting, using one archetype per mode, and second, we use all digits together with their labels, grouping the data into ten class\-specific subsets\. In all experiments we further compose the learned star diffeomorphismχρ∘ϕθ∗\\chi\_\{\\rho\}\\circ\\phi\_\{\\theta^\{\*\}\}with the additional diffeomorphismψv\\psi\_\{v\}, wherev\(s\):=log⁡\(10s\+1\)v\(s\):=\\log\(10s\+1\), in order to regularize the resulting geodesics\. Consequently, all pullback geodesics are generated with respect to the composite mapφ:=ψv∘χρ∘ϕθ∗\\varphi:=\\psi\_\{v\}\\circ\\chi\_\{\\rho\}\\circ\\phi\_\{\\theta^\{\*\}\}\.

For a fair comparison with existing manifold\-learning approaches, we note that the only framework currently shown to learn meaningful \(and computationally tractable\) geodesics on MNIST is the method of[diepeveen2025scorebased](https://arxiv.org/html/2605.24113#bib.bib19), whose loss and parametrization were later refined in[diepeveen2025manifold](https://arxiv.org/html/2605.24113#bib.bib22)\. This construction reduces to normalizing flow training under a diffeomorphism with constant Jacobian determinant, which coincides with the first step in our star\-learning procedure and therefore provides a natural baseline\. To make the comparison as fair as possible, we adopt the same training routine for the normalizing flow component—using identical training parameters and network architectures in all settings—and we emphasize that these choices are not numerically optimized, but rather selected so that both the improvements over existing methods and the most pressing directions for further work are clearly visible\. All training details are deferred to[AppendixD](https://arxiv.org/html/2605.24113#A4)\.

### 6\.1Single digit MNIST

When considering the digit 3,[Figure8](https://arxiv.org/html/2605.24113#S6.F8)shows that step 1 of the three\-step approach indeed yields latent stars\. Here it should be noted that such projections can be misleading: the toy data set in[Figure6](https://arxiv.org/html/2605.24113#S5.F6)also yields cross\-shaped two\-dimensional views, yet the full higher\-dimensional structure is not a clean multidimensional cross with symmetric radial branches\. Similarly to the toy example, however, this ambiguity in the projections does not interfere with our automated procedure for extracting both archetypes and radial functions\.

![Refer to caption](https://arxiv.org/html/2605.24113v1/results/mnist_3/latent_scatter.png)Figure 8:After training the normalizing flow with loss \([37](https://arxiv.org/html/2605.24113#S5.E37)\) on the digit 3 from MNIST, the latent representations𝐲\(i\):=ϕθ∗\(𝐱\(i\)\)\\mathbf\{y\}^\{\(i\)\}:=\\phi\_\{\\theta^\{\*\}\}\(\\mathbf\{x\}^\{\(i\)\}\)fori=1,…,ni=1,\\ldots,nexhibit a pronounced star\-like structure in several two\-dimensional projections \(left and middle\), while appearing approximately isotropic in others \(right\)\.Advancing to step 2, selectingr=10r=10archetypes produces the digits shown in[Figure9](https://arxiv.org/html/2605.24113#S6.F9), which provide a reasonable summary of the main ways in which threes appear in the data\. Using the labels \([41](https://arxiv.org/html/2605.24113#S5.E41)\) induced by these archetypes, we then estimate a radial function \([47](https://arxiv.org/html/2605.24113#S5.E47)\) for each class, thereby completing step 3: this yields an estimated deformed star distribution and, together with the archetypes, equips us with a Riemannian structure for subsequent data analysis\.

![Refer to caption](https://arxiv.org/html/2605.24113v1/results/mnist_3/archetypes_phi.png)Figure 9:The archetypes𝐯θ∗\(j\)\\mathbf\{v\}\_\{\\theta^\{\*\}\}^\{\(j\)\}\([40](https://arxiv.org/html/2605.24113#S5.E40)\) forj=1,…,10j=1,\\ldots,10obtained from Riemannian Archetypal analysis \([39](https://arxiv.org/html/2605.24113#S5.E39)\) – under the pullback geometry generated byϕθ∗\\phi\_\{\\theta^\{\*\}\}– sketch a reasonable picture of the different types of threes in the data set\.Next,[Figure10](https://arxiv.org/html/2605.24113#S6.F10)shows that interpolation \([3](https://arxiv.org/html/2605.24113#S2.E3)\) under the reference diffeomorphismϕθ∗\\phi\_\{\\theta^\{\*\}\}already improves substantially on linear interpolation, while the star\-based geodesic associated withφ\\varphiyields paths that more closely follow the underlying data geometry, and the iso\-corrected variant \([18](https://arxiv.org/html/2605.24113#S3.E18)\) further restores a more interpretable notion of time along the trajectory\.

γ𝐱,𝐲ℓ2\(t\)\\gamma^\{\\ell^\{2\}\}\_\{\\mathbf\{x\},\\mathbf\{y\}\}\(t\)

γ𝐱,𝐲ϕθ∗\(t\)\\gamma^\{\\phi\_\{\\theta^\{\*\}\}\}\_\{\\mathbf\{x\},\\mathbf\{y\}\}\(t\)

γ𝐱,𝐲φ\(t\)\\gamma^\{\\varphi\}\_\{\\mathbf\{x\},\\mathbf\{y\}\}\(t\)

γ𝐱,𝐲φ,iso\(t\)\\gamma^\{\\varphi,\\operatorname\{iso\}\}\_\{\\mathbf\{x\},\\mathbf\{y\}\}\(t\)

![Refer to caption](https://arxiv.org/html/2605.24113v1/results/mnist_3/geodesics_comparison_3.png)

Figure 10:When interpolating between two validation images of the digit three \(left\- and right\-most columns\), the linear path \(first row\) simply causes one image to fade out while the other fades in\. In contrast, the baseline pullback geodesic \(second row\) produces a more nonlinear transformation, but the lower part of the three almost closes into a loop – an artifact not supported by the data\. The proposed geodesic \(third row\), which respects the learned star structure, first moves the left three toward a more “reference\-like” three \(closer to the star center\) before continuing to the target image\. The corresponding iso\-geodesic \(fourth row\) further mitigates the naive impression that the trajectory spends an inordinate amount of time near this reference three\.Finally, the Riemannian archetypal mapping \(RAM\) projections in[Figure11](https://arxiv.org/html/2605.24113#S6.F11)highlight the importance of refining the relaxed RAM initialization and demonstrate that this procedure yields very reasonable “denoised” instances of the digit three\.

𝐱val\(i\)\\mathbf\{x\}\_\{\\operatorname\{val\}\}^\{\(i\)\}

Sφ\(𝐱val\(i\)\)S^\{\\varphi\}\(\\mathbf\{x\}\_\{\\operatorname\{val\}\}^\{\(i\)\}\)

Tφ\(𝐱val\(i\)\)T^\{\\varphi\}\(\\mathbf\{x\}\_\{\\operatorname\{val\}\}^\{\(i\)\}\)

![Refer to caption](https://arxiv.org/html/2605.24113v1/results/mnist_3/projected_clusters.png)

Figure 11:Validation images \(top row\) are first mapped using the relaxed RAM \(middle row\), which then serves as an initialization for the RAM \(bottom row\)\. Without this additional refinement step, the relaxed RAM tends to produce highly similar projections\.
### 6\.2All digit MNIST

Similarly to the digit 3 case, when we consider the full MNIST data set,[Figure12](https://arxiv.org/html/2605.24113#S6.F12)illustrates that step 1 of the three\-step procedure again produces latent star structures\. As before, however, such projections should be interpreted with care, for the reasons discussed above\.

![Refer to caption](https://arxiv.org/html/2605.24113v1/results/mnist/latent_scatter.png)Figure 12:After training the normalizing flow with loss \([37](https://arxiv.org/html/2605.24113#S5.E37)\) on MNIST, the latent representations𝐲\(i\):=ϕθ∗\(𝐱\(i\)\)\\mathbf\{y\}^\{\(i\)\}:=\\phi\_\{\\theta^\{\*\}\}\(\\mathbf\{x\}^\{\(i\)\}\)fori=1,…,ni=1,\\ldots,nexhibit a pronounced star\-like structure in several two\-dimensional projections \(left and middle\), while appearing approximately isotropic in others \(right\)\.Advancing to step 2, we select 10 archetypes per digit class, giving a total ofr=100r=100archetypes and yielding the representative digits shown in[Figure13](https://arxiv.org/html/2605.24113#S6.F13), which form a reasonable summary of the data\. Using the induced class labels, we then estimate a radial function \([47](https://arxiv.org/html/2605.24113#S5.E47)\) for each class, thereby completing step 3: this provides an estimated deformed star distribution and, together with the archetypes – now multiple per class – endows the data with a Riemannian structure suitable for subsequent analysis\.

![Refer to caption](https://arxiv.org/html/2605.24113v1/results/mnist/archetypes_phi.png)Figure 13:The archetypes𝐯θ∗\(j\)\\mathbf\{v\}\_\{\\theta^\{\*\}\}^\{\(j\)\}\([40](https://arxiv.org/html/2605.24113#S5.E40)\) forj=1,…,100j=1,\\ldots,100obtained from Riemannian Archetypal analysis \([39](https://arxiv.org/html/2605.24113#S5.E39)\) – under the pullback geometry generated byϕθ∗\\phi\_\{\\theta^\{\*\}\}– on each of the labeled subsets of the data set sketch a reasonable picture of the different types of digits in the data set\.Next,[Figure14](https://arxiv.org/html/2605.24113#S6.F14)demonstrates that – much as before – interpolation \([3](https://arxiv.org/html/2605.24113#S2.E3)\) under the reference diffeomorphismϕθ∗\\phi\_\{\\theta^\{\*\}\}already represents a substantial improvement over linear interpolation, while the star\-based geodesic associated withφ\\varphiproduces trajectories that more faithfully follow the data geometry, and the iso\-corrected version \([18](https://arxiv.org/html/2605.24113#S3.E18)\) further recovers a more interpretable notion of time along the path\.

γ𝐱,𝐲ℓ2\(t\)\\gamma^\{\\ell^\{2\}\}\_\{\\mathbf\{x\},\\mathbf\{y\}\}\(t\)

γ𝐱,𝐲ϕθ∗\(t\)\\gamma^\{\\phi\_\{\\theta^\{\*\}\}\}\_\{\\mathbf\{x\},\\mathbf\{y\}\}\(t\)

γ𝐱,𝐲φ\(t\)\\gamma^\{\\varphi\}\_\{\\mathbf\{x\},\\mathbf\{y\}\}\(t\)

γ𝐱,𝐲φ,iso\(t\)\\gamma^\{\\varphi,\\operatorname\{iso\}\}\_\{\\mathbf\{x\},\\mathbf\{y\}\}\(t\)

![Refer to caption](https://arxiv.org/html/2605.24113v1/results/mnist/geodesics_comparison_1.png)

γ𝐱,𝐲ℓ2\(t\)\\gamma^\{\\ell^\{2\}\}\_\{\\mathbf\{x\},\\mathbf\{y\}\}\(t\)

γ𝐱,𝐲ϕθ∗\(t\)\\gamma^\{\\phi\_\{\\theta^\{\*\}\}\}\_\{\\mathbf\{x\},\\mathbf\{y\}\}\(t\)

γ𝐱,𝐲φ\(t\)\\gamma^\{\\varphi\}\_\{\\mathbf\{x\},\\mathbf\{y\}\}\(t\)

γ𝐱,𝐲φ,iso\(t\)\\gamma^\{\\varphi,\\operatorname\{iso\}\}\_\{\\mathbf\{x\},\\mathbf\{y\}\}\(t\)

![Refer to caption](https://arxiv.org/html/2605.24113v1/results/mnist/geodesics_comparison_5.png)

Figure 14:When interpolating between different digits \(left\- and right\-most columns\), the linear path \(first rows\) simply causes one image to fade out while the other fades in\. In contrast, the baseline pullback geodesic \(second rows\) produces a more nonlinear transformation, but it might move through images that are not always supported by the data\. The proposed geodesic \(third rows\), which respects the learned star structure, first moves the left digit toward a more “reference\-like” digit \(closer to the star center\) before continuing to the target digit\. In particular, these reference\-like digits seem to be eights\. The corresponding iso\-geodesic \(fourth rows\) further mitigates the naive impression that the trajectory spends an inordinate amount of time near this reference points\.Then, the Riemannian archetypal mapping \(RAM\) projections in[Figure15](https://arxiv.org/html/2605.24113#S6.F15)expose current limitations of the method\. In this setting, the algorithm fails to converge within a reasonable time, as the line search terminates once admissible step sizes fall below numerical precision\. In other words, optimizing through the diffeomorphism – already flagged in[Remark5](https://arxiv.org/html/2605.24113#Thmremark5)as a potential bottleneck – appears to have reached its limits, and further progress will realistically require a more developed iso\-Riemannian optimization theory\.

Finally, it is worth emphasizing what these limitations imply for classification\. Starting from the vector𝐠∗∈ℝ100\\mathbf\{g\}^\{\*\}\\in\\mathbb\{R\}^\{100\}obtained by solving \([24](https://arxiv.org/html/2605.24113#S4.E24)\), and its isometry\-corrected counterpart𝐡∗∈ℝ100\\mathbf\{h\}^\{\*\}\\in\\mathbb\{R\}^\{100\}from \([33](https://arxiv.org/html/2605.24113#S4.E33)\), we aggregate the components associated with each digit class \(for example,𝐠1∗\+⋯\+𝐠10∗\\mathbf\{g\}^\{\*\}\_\{1\}\+\\cdots\+\\mathbf\{g\}^\{\*\}\_\{10\}gives the net weight for the digit 0 class\)\. This allows us to examine more directly how the optimization issues impact downstream classification performance\. As shown in[Figure16](https://arxiv.org/html/2605.24113#S6.F16), points that are visibly well\-projected in[Figure15](https://arxiv.org/html/2605.24113#S6.F15), i\.e\., digits 0, 1, 2, 3, 4, 6, and 7, are classified reliably – often with additional improvement after the isometry correction – whereas digits 5, 8, and 9 are classified less accurately\. We stress once more that this shortcoming stems from optimization limits rather than from any intrinsic deficiency of the overall methodology\.

𝐱val\(i\)\\mathbf\{x\}\_\{\\operatorname\{val\}\}^\{\(i\)\}

Sφ\(𝐱val\(i\)\)S^\{\\varphi\}\(\\mathbf\{x\}\_\{\\operatorname\{val\}\}^\{\(i\)\}\)

Tφ\(𝐱val\(i\)\)T^\{\\varphi\}\(\\mathbf\{x\}\_\{\\operatorname\{val\}\}^\{\(i\)\}\)

![Refer to caption](https://arxiv.org/html/2605.24113v1/results/mnist/projected_clusters.png)

Figure 15:Validation data points \(top row\) are first mapped using the relaxed RAM \(middle row\), which then serves as an initialization for the RAM \(bottom row\)\. Without this additional refinement step, the relaxed RAM tends to produce highly similar projections\. Nevertheless, the RAM does not fully converge for all data points, because the line search step size eventually falls below numerical precision and the procedure can no longer make progress\.![Refer to caption](https://arxiv.org/html/2605.24113v1/results/mnist/weights_comparison_0.png)

![Refer to caption](https://arxiv.org/html/2605.24113v1/results/mnist/weights_comparison_1.png)

![Refer to caption](https://arxiv.org/html/2605.24113v1/results/mnist/weights_comparison_2.png)

![Refer to caption](https://arxiv.org/html/2605.24113v1/results/mnist/weights_comparison_3.png)

![Refer to caption](https://arxiv.org/html/2605.24113v1/results/mnist/weights_comparison_4.png)

![Refer to caption](https://arxiv.org/html/2605.24113v1/results/mnist/weights_comparison_5.png)

![Refer to caption](https://arxiv.org/html/2605.24113v1/results/mnist/weights_comparison_6.png)

![Refer to caption](https://arxiv.org/html/2605.24113v1/results/mnist/weights_comparison_7.png)

![Refer to caption](https://arxiv.org/html/2605.24113v1/results/mnist/weights_comparison_8.png)

![Refer to caption](https://arxiv.org/html/2605.24113v1/results/mnist/weights_comparison_9.png)

Figure 16:For each of the validation points in[Figure15](https://arxiv.org/html/2605.24113#S6.F15), we look at how its mass is distributed over the archetypes, add up the contributions belonging to the same digit, and read off a predicted class – before and after the isometry correction, i\.e\., considering𝐠\\mathbf\{g\}and𝐡\\mathbf\{h\}\. For the digits 0, 1, 2, 3, 4, 6, and 7, where the RAM projection is visually clean, this simple RAM\-based classifier performs well, often even better after isometry correction, while digits 5, 8, and 9 expose the current optimization limits of the approach rather than a failure of the underlying geometric model\.

## 7Conclusions

This work introduces Riemannian archetypal analysis as a geometrically grounded non\-linear extension of classical archetypal analysis for real\-valued data whose generating distribution can be approximated by a deformed star distribution\. The framework preserves the central interpretability goal of archetypal methods while endowing the ambient space with a data\-driven pullback Riemannian structure that supports fast interpolation, projection, denoising, and classification in a unified way\. Our experiments show that this perspective is effective on both synthetic and real data, but they also make clear that optimization through the learned diffeomorphism remains the main computational bottleneck, especially for Riemannian archetypal mappings \(RAMs\) in higher\-dimensional settings\.

Several natural directions follow from this work\. The most immediate is the development of constrained iso\-Riemannian optimization methods tailored to RAM\-type problems, since the current theory mainly covers first\-order unconstrained settings and the numerical results indicate that substantially better optimization schemes should be possible\. A second direction is to improve the learning of deformed star models beyond the present constructive scheme, ideally with stronger statistical guarantees and more expressive radial parametrizations\. More broadly, the present work suggests that star\-shaped geometric models may provide a useful foundation for other interpretable non\-linear factorization methods, and that iso\-Riemannian ideas may play an important role in making such models computationally viable\.

#### Acknowledgments

WD and DN were partially funded by Defense Health Agency CDMRP TB240022 and NSF DMS 2408912

## References

- \[1\]Maryam Abdolali and Nicolas Gillis\.Simplex\-structured matrix factorization: Sparsity\-based identifiability and provably correct algorithms\.SIAM Journal on Mathematics of Data Science, 3\(2\):593–623, 2021\.
- \[2\]P\-A Absil, Robert Mahony, and Rodolphe Sepulchre\.Optimization algorithms on matrix manifolds\.Princeton University Press, 2008\.
- \[3\]Giovanni S Alberti, Johannes Hertrich, Matteo Santacesaria, and Silvia Sciutto\.Manifold learning by mixture models of vaes for inverse problems\.Journal of Machine Learning Research, 25\(202\):1–35, 2024\.
- \[4\]Aleix Alcacer, Irene Epifanio, Sebastian Mair, and Morten Mørup\.A survey on archetypal analysis\.arXiv preprint arXiv:2504\.12392, 2025\.
- \[5\]Georgios Arvanitidis, Lars K Hansen, and Søren Hauberg\.A locally adaptive normal distribution\.Advances in Neural Information Processing Systems, 29, 2016\.
- \[6\]Nithya Bhasker, Hattie Chung, Louis Boucherie, Vladislav Kim, Stefanie Speidel, and Melanie Weber\.Uncovering developmental lineages from single\-cell data with contrastive poincaré maps\.bioRxiv, pages 2025–08, 2025\.
- \[7\]William M Boothby\.An introduction to differentiable manifolds and Riemannian geometry, Revised, volume 120\.Gulf Professional Publishing, 2003\.
- \[8\]Nicolas Boumal\.An introduction to optimization on smooth manifolds\.Cambridge University Press, 2023\.
- \[9\]Manfredo Perdigao do Carmo\.Riemannian geometry\.Birkhäuser, 1992\.
- \[10\]Kwan Ho Ryan Chan, Yaodong Yu, Chong You, Haozhi Qi, John Wright, and Yi Ma\.Redunet: A white\-box deep network from the principle of maximizing rate reduction\.Journal of machine learning research, 23\(114\):1–103, 2022\.
- \[11\]Tara Chari and Lior Pachter\.The specious art of single\-cell genomics\.PLOS Computational Biology, 19\(8\):e1011288, 2023\.
- \[12\]Joyce Chew, Willem Diepeveen, and Deanna Needell\.Curvature corrected nonnegative manifold data factorization\.arXiv preprint arXiv:2502\.15124, 2025\.
- \[13\]Seunghyuk Cho, Juyong Lee, and Dongwoo Kim\.Hyperbolic vae via latent gaussian distributions\.Advances in Neural Information Processing Systems, 36:569–588, 2023\.
- \[14\]Pierre Comon\.Independent component analysis, a new concept?Signal processing, 36\(3\):287–314, 1994\.
- \[15\]Adele Cutler and Leo Breiman\.Archetypal analysis\.Technometrics, 36\(4\):338–347, 1994\.
- \[16\]Giannis Daras, Joseph Dean, Ajil Jalal, and Alex Dimakis\.Intermediate layer optimization for inverse problems using deep generative models\.InInternational Conference on Machine Learning, pages 2421–2432\. PMLR, 2021\.
- \[17\]Tim R\. Davidson, Luca Falorsi, Nicola De Cao, Thomas Kipf, and Jakub M\. Tomczak\.Hyperspherical variational auto\-encoders\.34th Conference on Uncertainty in Artificial Intelligence \(UAI\-18\), 2018\.
- \[18\]Willem Diepeveen\.Pulling back symmetric riemannian geometry for data analysis\.arXiv preprint arXiv:2403\.06612, 2024\.
- \[19\]Willem Diepeveen, Georgios Batzolis, Zakhar Shumaylov, and Carola\-Bibiane Schönlieb\.Score\-based pullback riemannian geometry: Extracting the data manifold geometry using anisotropic flows\.InForty\-second International Conference on Machine Learning, 2025\.
- \[20\]Willem Diepeveen, Joyce Chew, and Deanna Needell\.Curvature\-corrected tangent space\-based approximation of manifold\-valued data\.Information and Inference: A Journal of the IMA, 14\(4\):iaaf031, 2025\.
- \[21\]Willem Diepeveen and Oscar Leong\.Riemannian ambientflow: Towards simultaneous manifold learning and generative modeling from corrupted data\.arXiv preprint arXiv:2601\.18728, 2026\.
- \[22\]Willem Diepeveen and Deanna Needell\.Manifold learning with normalizing flows: Towards regularity, expressivity and iso\-riemannian geometry\.arXiv preprint arXiv:2505\.08087, 2025\.
- \[23\]Willem Diepeveen and Melanie Weber\.Iso\-riemannian optimization on learned data manifolds\.arXiv preprint arXiv:2510\.21033, 2025\.
- \[24\]Chris HQ Ding, Tao Li, and Michael I Jordan\.Convex and semi\-nonnegative matrix factorizations\.IEEE transactions on pattern analysis and machine intelligence, 32\(1\):45–55, 2008\.
- \[25\]Laurent Dinh, David Krueger, and Yoshua Bengio\.Nice: Non\-linear independent components estimation\.arXiv preprint arXiv:1410\.8516, 2014\.
- \[26\]Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan\.Testing the manifold hypothesis\.Journal of the American Mathematical Society, 29\(4\):983–1049, 2016\.
- \[27\]P Thomas Fletcher, Conglin Lu, Stephen M Pizer, and Sarang Joshi\.Principal geodesic analysis for the study of nonlinear statistics of shape\.IEEE transactions on medical imaging, 23\(8\):995–1005, 2004\.
- \[28\]Samuel Gruffaz and Josua Sassen\.Riemannian metric learning: Closer to you than you imagine\.arXiv preprint arXiv:2503\.05321, 2025\.
- \[29\]Paul Hand, Oscar Leong, and Vlad Voroninski\.Phase retrieval under a generative prior\.Advances in Neural Information Processing Systems, 31, 2018\.
- \[30\]Paul Hand and Vladislav Voroninski\.Global guarantees for enforcing deep generative priors by empirical risk\.InConference On Learning Theory, pages 970–978\. PMLR, 2018\.
- \[31\]Florine Hartwig, Josua Sassen, Juliane Braunsmann, Martin Rumpf, and Benedikt Wirth\.Geodesic calculus on implicitly defined latent manifolds\.arXiv preprint arXiv:2510\.09468, 2026\.
- \[32\]Søren Hauberg, Oren Freifeld, and Michael Black\.A geometric take on metric learning\.Advances in Neural Information Processing Systems, 25, 2012\.
- \[33\]Johannes Hertrich, Hok Shing Wong, Alexander Denker, Stanislas Ducotterd, Zhenghan Fang, Markus Haltmeier, Željko Kereta, Erich Kobler, Oscar Leong, Mohammad Sadegh Salehi, et al\.Learning regularization functionals for inverse problems: A comparative study\.arXiv preprint arXiv:2510\.01755, 2025\.
- \[34\]Jeffrey Ho, Yuchen Xie, and Baba Vemuri\.On a nonlinear generalization of sparse coding and dictionary learning\.InInternational conference on machine learning, pages 1480–1488\. PMLR, 2013\.
- \[35\]Hermann Karcher\.Riemannian center of mass and mollifier smoothing\.Communications on pure and applied mathematics, 30\(5\):509–541, 1977\.
- \[36\]Sebastian Mathias Keller, Maxim Samarin, Fabricio Arend Torres, Mario Wieser, and Volker Roth\.Learning extremal representations with deep archetypal analysis\.International journal of computer vision, 129\(4\):805–820, 2021\.
- \[37\]Diederik P Kingma and Max Welling\.Auto\-encoding variational bayes\.arXiv preprint arXiv:1312\.6114, 2013\.
- \[38\]Durk P Kingma and Prafulla Dhariwal\.Glow: Generative flow with invertible 1x1 convolutions\.Advances in neural information processing systems, 31, 2018\.
- \[39\]Denis Kleverov, Ekaterina Aladyeva, Alexey Serdyukov, and Maxim N Artyomov\.Non\-negative matrix factorization and deconvolution as a dual simplex problem\.Genome Biology, 2026\.
- \[40\]Dhruv Kohli, Alexander Cloninger, and Gal Mishne\.Ldle: Low distortion local eigenmaps\.Journal of machine learning research, 22\(282\):1–64, 2021\.
- \[41\]John M Lee\.Smooth manifolds\.InIntroduction to Smooth Manifolds, pages 1–31\. Springer, 2013\.
- \[42\]Wonjun Lee, Riley CW O’Neill, Dongmian Zou, Jeff Calder, and Gilad Lerman\.Geometry\-preserving encoder/decoder in latent generative models\.arXiv preprint arXiv:2501\.09876, 2025\.
- \[43\]Qi Lei, Ajil Jalal, Inderjit S Dhillon, and Alexandros G Dimakis\.Inverting deep generative models, one layer at a time\.Advances in neural information processing systems, 32, 2019\.
- \[44\]Oscar Leong, Eliza O’Reilly, and Yong S Soh\.The star geometry of critic\-based regularizer learning\.Advances in Neural Information Processing Systems, 37:71240–71276, 2024\.
- \[45\]Oscar Leong, Eliza O’Reilly, Yong Sheng Soh, and Venkat Chandrasekaran\.Optimal regularization for a data source\.Foundations of Computational Mathematics, pages 1–50, 2025\.
- \[46\]Oscar Leong and Yann Traonmilin\.A recovery theory for diffusion priors: Deterministic analysis of the implicit prior algorithm\.arXiv preprint arXiv:2509\.20511, 2025\.
- \[47\]Jun Li and José M Bioucas\-Dias\.Minimum volume simplex analysis: A fast algorithm to unmix hyperspectral data\.InIGARSS 2008\-2008 IEEE International Geoscience and Remote Sensing Symposium, volume 3, pages III–250\. IEEE, 2008\.
- \[48\]Chia\-Hsiang Lin, Ruiyuan Wu, Wing\-Kin Ma, Chong\-Yung Chi, and Yue Wang\.Maximum volume inscribed ellipsoid: A new simplex\-structured matrix factorization framework via facet enumeration and convex optimization\.SIAM Journal on Imaging Sciences, 11\(2\):1651–1679, 2018\.
- \[49\]Lidan Miao and Hairong Qi\.Endmember extraction from highly mixed data using minimum volume constrained nonnegative matrix factorization\.IEEE Transactions on Geoscience and Remote Sensing, 45\(3\):765–777, 2007\.
- \[50\]Nina Miolane, Nicolas Guigui, Alice Le Brigant, Johan Mathe, Benjamin Hou, Yann Thanwerdas, Stefan Heyder, Olivier Peltre, Niklas Koep, Hadi Zaatiti, et al\.Geomstats: A python package for riemannian geometry in machine learning\.Journal of Machine Learning Research, 21\(223\):1–9, 2020\.
- \[51\]Kevin R Moon, David Van Dijk, Zheng Wang, Scott Gigante, Daniel B Burkhardt, William S Chen, Kristina Yim, Antonia van den Elzen, Matthew J Hirn, Ronald R Coifman, et al\.Visualizing structure and transitions in high\-dimensional biological data\.Nature biotechnology, 37\(12\):1482–1492, 2019\.
- \[52\]José MP Nascimento and José MB Dias\.Vertex component analysis: A fast algorithm to unmix hyperspectral data\.IEEE transactions on Geoscience and Remote Sensing, 43\(4\):898–910, 2005\.
- \[53\]Pentti Paatero and Unto Tapper\.Positive matrix factorization: A non\-negative factor model with optimal utilization of error estimates of data values\.Environmetrics, 5\(2\):111–126, 1994\.
- \[54\]Neal Parikh and Stephen Boyd\.Proximal algorithms\.Foundations and Trends in optimization, 1\(3\):127–239, 2014\.
- \[55\]Karl Pearson\.Liii\. on lines and planes of closest fit to systems of points in space\.The London, Edinburgh, and Dublin philosophical magazine and journal of science, 2\(11\):559–572, 1901\.
- \[56\]Jaakko Peltonen, Arto Klami, and Samuel Kaski\.Improved learning of riemannian metrics for exploratory analysis\.Neural Networks, 17\(8\-9\):1087–1100, 2004\.
- \[57\]Michael Psenka, Druv Pai, Vishal Raman, Shankar Sastry, and Yi Ma\.Representation learning via manifold flattening and reconstruction\.Journal of Machine Learning Research, 25\(132\):1–47, 2024\.
- \[58\]Ryan A Robinett, Lorenzo Orecchia, and Samantha J Riesenfeld\.Manifold learning and optimization using tangent space proxies\.arXiv preprint arXiv:2501\.12678, 2025\.
- \[59\]Takashi Sakai\.Riemannian geometry, volume 149\.American Mathematical Soc\., 1996\.
- \[60\]Christopher Scarvelis and Justin Solomon\.Riemannian metric learning via optimal transport\.InThe Eleventh International Conference on Learning Representations, 2023\.
- \[61\]Boris Shustin, Haim Avron, and Barak Sober\.Manifold free riemannian optimization\.arXiv preprint arXiv:2209\.03269, 2022\.
- \[62\]Peter Sorrenson, Daniel Behrend\-Uriarte, Christoph Schnoerr, and Ullrich Koethe\.Learning distances from data with normalizing flows and score matching\.InForty\-second International Conference on Machine Learning, 2025\.
- \[63\]Xingzhi Sun, Danqi Liao, Kincaid MacDonald, Yanlei Zhang, Guillaume Huguet, Guy Wolf, Ian Adelstein, Tim G\. J\. Rudner, and Smita Krishnaswamy\.Geometry\-aware autoencoders for metric learning and generative modeling on data manifolds\.InICML 2024 Workshop on Geometry\-grounded Representation Learning and Generative Modeling, 2024\.
- \[64\]Abiy Tasissa, Pranay Tankala, James M Murphy, and Demba Ba\.K\-deep simplex: Manifold learning via local dictionaries\.IEEE Transactions on Signal Processing, 71:3741–3754, 2023\.
- \[65\]David van Dijk, Daniel B Burkhardt, Matthew Amodio, Alexander Tong, Guy Wolf, and Smita Krishnaswamy\.Finding archetypal spaces using neural networks\.In2019 IEEE International Conference on Big Data \(Big Data\), pages 2634–2643\. IEEE, 2019\.
- \[66\]Aarthi Venkat, Scott E Youlten, Beatriz P San Juan, Carley A Purcell, Shabarni Gupta, Matthew Amodio, Daniel P Neumann, John G Lock, Anton E Westacott, Cerys S McCool, et al\.Aanet resolves a continuum of spatially localized cell states to unveil intratumoral heterogeneity\.Cancer Discovery, 15\(10\):2139–2165, 2025\.
- \[67\]David Vigouroux, Lucas Drumetz, Ronan Fablet, and François Rousseau\.Discovering data manifold geometry via non\-contracting flows\.arXiv preprint arXiv:2602\.02611, 2026\.
- \[68\]Peng Wang, Huijie Zhang, Zekai Zhang, Siyi Chen, Yi Ma, and Qing Qu\.Diffusion models learn low\-dimensional distributions via subspace clustering\.In2025 IEEE 10th International Workshop on Computational Advances in Multi\-Sensor Adaptive Processing \(CAMSAP\), pages 211–215\. IEEE, 2025\.
- \[69\]Nick Whiteley, Annie Gray, and Patrick Rubin\-Delanchy\.Statistical exploration of the manifold hypothesis\.arXiv preprint arXiv:2208\.11665, 2022\.
- \[70\]Mario Wieser, Daniel Siegismund, and Stephan Steigele\.Revisiting deep archetypal analysis for phenotype discovery in high content imaging\.InProceedings of the Winter Conference on Applications of Computer Vision, pages 3802–3811, 2025\.
- \[71\]Yaodong Yu, Sam Buchanan, Druv Pai, Tianzhe Chu, Ziyang Wu, Shengbang Tong, Benjamin Haeffele, and Yi Ma\.White\-box transformers via sparse rate reduction\.Advances in Neural Information Processing Systems, 36:9422–9457, 2023\.

## Appendix ASupplementary material to[section3](https://arxiv.org/html/2605.24113#S3)

##### Proof of[Proposition1](https://arxiv.org/html/2605.24113#Thmproposition1)

###### Proof\.

The statement follows from direct computation\. To see this, first note that sinceϕ\\phiis a diffeomorphism, the substitution𝐲=ϕ\(𝐱\)\\mathbf\{y\}=\\phi\(\\mathbf\{x\}\)is valid globally, and the Jacobian factor satisfies\|detD𝐱ϕ\|d𝐱=d𝐲\|\\det D\_\{\\mathbf\{x\}\}\\phi\|d\\mathbf\{x\}=d\\mathbf\{y\}\. So

∫ℝdpϕ,ρ\(𝐱\)𝑑𝐱=12d2−1Γ\(d2\)1∫𝕊d−1ρ\(ω\)d𝑑σ\(ω\)∫ℝdexp⁡\(−12ρ\(𝐲‖𝐲‖2\)−2‖𝐲‖22\)𝑑𝐲\.\\int\_\{\\mathbb\{R\}^\{d\}\}p\_\{\\phi,\\rho\}\(\\mathbf\{x\}\)\\,d\\mathbf\{x\}=\\frac\{1\}\{2^\{\\frac\{d\}\{2\}\-1\}\\,\\Gamma\\\!\\left\(\\frac\{d\}\{2\}\\right\)\}\\,\\frac\{1\}\{\\displaystyle\\int\_\{\\mathbb\{S\}^\{d\-1\}\}\\rho\(\\omega\)^\{d\}\\,d\\sigma\(\\omega\)\}\\int\_\{\\mathbb\{R\}^\{d\}\}\\exp\\\!\\left\(\-\\frac\{1\}\{2\}\\,\\rho\\\!\\left\(\\frac\{\\mathbf\{y\}\}\{\\\|\\mathbf\{y\}\\\|\_\{2\}\}\\right\)^\{\-2\}\\\|\\mathbf\{y\}\\\|\_\{2\}^\{2\}\\right\)d\\mathbf\{y\}\.\(60\)
Next, we can write𝐲=rω\\mathbf\{y\}=r\\omegawithr∈\(0,∞\)r\\in\(0,\\infty\)andω∈𝕊d−1\\omega\\in\\mathbb\{S\}^\{d\-1\}\. Under polar coordinates the Lebesgue measure decomposes as

d𝐲=rd−1drdσ\(ω\)\.d\\mathbf\{y\}=r^\{d\-1\}\\,dr\\,d\\sigma\(\\omega\)\.Since‖𝐲‖2=r\\\|\\mathbf\{y\}\\\|\_\{2\}=rand𝐲/‖𝐲‖2=ω\\mathbf\{y\}/\\\|\\mathbf\{y\}\\\|\_\{2\}=\\omega, the inner integral in \([60](https://arxiv.org/html/2605.24113#A1.E60)\) becomes

∫ℝdexp⁡\(−12ρ\(𝐲‖𝐲‖2\)−2‖𝐲‖22\)𝑑𝐲=∫𝕊d−1∫0∞exp⁡\(−12ρ\(ω\)−2r2\)rd−1𝑑r𝑑σ\(ω\)\.\\int\_\{\\mathbb\{R\}^\{d\}\}\\exp\\\!\\left\(\-\\frac\{1\}\{2\}\\,\\rho\\\!\\left\(\\frac\{\\mathbf\{y\}\}\{\\\|\\mathbf\{y\}\\\|\_\{2\}\}\\right\)^\{\-2\}\\\|\\mathbf\{y\}\\\|\_\{2\}^\{2\}\\right\)d\\mathbf\{y\}=\\int\_\{\\mathbb\{S\}^\{d\-1\}\}\\int\_\{0\}^\{\\infty\}\\exp\\\!\\left\(\-\\frac\{1\}\{2\}\\,\\rho\(\\omega\)^\{\-2\}\\,r^\{2\}\\right\)r^\{d\-1\}\\,dr\\,d\\sigma\(\\omega\)\.\(61\)
Then, defininga\(ω\):=12ρ\(ω\)−2\>0a\(\\omega\):=\\tfrac\{1\}\{2\}\\,\\rho\(\\omega\)^\{\-2\}\>0, we can Use the standard Gamma\-function identity

∫0∞e−ar2rd−1𝑑r=12a−d/2Γ\(d2\),a\>0,\\int\_\{0\}^\{\\infty\}e^\{\-ar^\{2\}\}r^\{d\-1\}\\,dr=\\frac\{1\}\{2\}\\,a^\{\-d/2\}\\,\\Gamma\\\!\\left\(\\frac\{d\}\{2\}\\right\),\\qquad a\>0,to obtain

∫0∞exp⁡\(−12ρ\(ω\)−2r2\)rd−1𝑑r=12\(12ρ\(ω\)−2\)−d/2Γ\(d2\)=122d/2ρ\(ω\)dΓ\(d2\)=2d2−1Γ\(d2\)ρ\(ω\)d\.\\int\_\{0\}^\{\\infty\}\\exp\\\!\\left\(\-\\frac\{1\}\{2\}\\,\\rho\(\\omega\)^\{\-2\}\\,r^\{2\}\\right\)r^\{d\-1\}\\,dr=\\frac\{1\}\{2\}\\,\\left\(\\frac\{1\}\{2\}\\,\\rho\(\\omega\)^\{\-2\}\\right\)^\{\-d/2\}\\Gamma\\\!\\left\(\\frac\{d\}\{2\}\\right\)\\\\ =\\frac\{1\}\{2\}\\,2^\{d/2\}\\,\\rho\(\\omega\)^\{d\}\\,\\Gamma\\\!\\left\(\\frac\{d\}\{2\}\\right\)=2^\{\\frac\{d\}\{2\}\-1\}\\,\\Gamma\\\!\\left\(\\frac\{d\}\{2\}\\right\)\\,\\rho\(\\omega\)^\{d\}\.\(62\)
Substituting \([62](https://arxiv.org/html/2605.24113#A1.E62)\) back into \([61](https://arxiv.org/html/2605.24113#A1.E61)\) gives

∫ℝdexp⁡\(−12ρ\(𝐲‖𝐲‖2\)−2‖𝐲‖22\)𝑑𝐲=2d2−1Γ\(d2\)∫𝕊d−1ρ\(ω\)d𝑑σ\(ω\)\.\\int\_\{\\mathbb\{R\}^\{d\}\}\\exp\\\!\\left\(\-\\frac\{1\}\{2\}\\,\\rho\\\!\\left\(\\frac\{\\mathbf\{y\}\}\{\\\|\\mathbf\{y\}\\\|\_\{2\}\}\\right\)^\{\-2\}\\\|\\mathbf\{y\}\\\|\_\{2\}^\{2\}\\right\)d\\mathbf\{y\}=2^\{\\frac\{d\}\{2\}\-1\}\\,\\Gamma\\\!\\left\(\\frac\{d\}\{2\}\\right\)\\int\_\{\\mathbb\{S\}^\{d\-1\}\}\\rho\(\\omega\)^\{d\}\\,d\\sigma\(\\omega\)\.\(63\)
Finally, inserting \([63](https://arxiv.org/html/2605.24113#A1.E63)\) into \([60](https://arxiv.org/html/2605.24113#A1.E60)\) gives

∫ℝdpϕ,ρ\(𝐱\)𝑑𝐱=12d2−1Γ\(d2\)⋅2d2−1Γ\(d2\)∫𝕊d−1ρ\(ω\)d𝑑σ\(ω\)∫𝕊d−1ρ\(ω\)d𝑑σ\(ω\)=1\.∎\\int\_\{\\mathbb\{R\}^\{d\}\}p\_\{\\phi,\\rho\}\(\\mathbf\{x\}\)\\,d\\mathbf\{x\}=\\frac\{1\}\{2^\{\\frac\{d\}\{2\}\-1\}\\,\\Gamma\\\!\\left\(\\frac\{d\}\{2\}\\right\)\}\\cdot\\frac\{2^\{\\frac\{d\}\{2\}\-1\}\\,\\Gamma\\\!\\left\(\\frac\{d\}\{2\}\\right\)\\displaystyle\\int\_\{\\mathbb\{S\}^\{d\-1\}\}\\rho\(\\omega\)^\{d\}\\,d\\sigma\(\\omega\)\}\{\\displaystyle\\int\_\{\\mathbb\{S\}^\{d\-1\}\}\\rho\(\\omega\)^\{d\}\\,d\\sigma\(\\omega\)\}=1\.\\qed\(64\)

##### Proof of[Proposition2](https://arxiv.org/html/2605.24113#Thmproposition2)

###### Proof\.

\(i\) Let𝐱∈ℝd\\mathbf\{x\}\\in\\mathbb\{R\}^\{d\}\. If𝐱=𝟎\\mathbf\{x\}=\\mathbf\{0\}, the claim is immediate from the definitions\.

If𝐱≠𝟎\\mathbf\{x\}\\neq\\mathbf\{0\}, set𝐲:=χρ\(𝐱\)=ρ\(𝐱‖𝐱‖2\)−1𝐱\\mathbf\{y\}:=\\chi\_\{\\rho\}\(\\mathbf\{x\}\)=\\rho\\\!\\left\(\\frac\{\\mathbf\{x\}\}\{\\\|\\mathbf\{x\}\\\|\_\{2\}\}\\right\)^\{\-1\}\\mathbf\{x\}\. Then𝐲≠𝟎\\mathbf\{y\}\\neq\\mathbf\{0\}, and

𝐲‖𝐲‖2=ρ\(𝐱‖𝐱‖2\)−1𝐱ρ\(𝐱‖𝐱‖2\)−1‖𝐱‖2=𝐱‖𝐱‖2\.\\frac\{\\mathbf\{y\}\}\{\\\|\\mathbf\{y\}\\\|\_\{2\}\}=\\frac\{\\rho\\left\(\\frac\{\\mathbf\{x\}\}\{\\\|\\mathbf\{x\}\\\|\_\{2\}\}\\right\)^\{\-1\}\\mathbf\{x\}\}\{\\rho\\left\(\\frac\{\\mathbf\{x\}\}\{\\\|\\mathbf\{x\}\\\|\_\{2\}\}\\right\)^\{\-1\}\\\|\\mathbf\{x\}\\\|\_\{2\}\}=\\frac\{\\mathbf\{x\}\}\{\\\|\\mathbf\{x\}\\\|\_\{2\}\}\.\(65\)Hence

χρ−1\(𝐲\)=ρ\(𝐲‖𝐲‖2\)𝐲=ρ\(𝐱‖𝐱‖2\)ρ\(𝐱‖𝐱‖2\)−1𝐱=𝐱\.\\chi\_\{\\rho\}^\{\-1\}\(\\mathbf\{y\}\)=\\rho\\\!\\left\(\\frac\{\\mathbf\{y\}\}\{\\\|\\mathbf\{y\}\\\|\_\{2\}\}\\right\)\\mathbf\{y\}=\\rho\\\!\\left\(\\frac\{\\mathbf\{x\}\}\{\\\|\\mathbf\{x\}\\\|\_\{2\}\}\\right\)\\,\\rho\\\!\\left\(\\frac\{\\mathbf\{x\}\}\{\\\|\\mathbf\{x\}\\\|\_\{2\}\}\\right\)^\{\-1\}\\mathbf\{x\}=\\mathbf\{x\}\.\(66\)Thusχρ−1∘χρ=Idℝd\\chi\_\{\\rho\}^\{\-1\}\\circ\\chi\_\{\\rho\}=\\operatorname\{Id\}\_\{\\mathbb\{R\}^\{d\}\}\. The identityχρ∘χρ−1=Idℝd\\chi\_\{\\rho\}\\circ\\chi\_\{\\rho\}^\{\-1\}=\\operatorname\{Id\}\_\{\\mathbb\{R\}^\{d\}\}is proved in exactly the same way, exchanging the roles of𝐱\\mathbf\{x\}and𝐲\\mathbf\{y\}\.

\(ii\) Let𝐱∈ℝd\\mathbf\{x\}\\in\\mathbb\{R\}^\{d\}\. If𝐱=𝟎\\mathbf\{x\}=\\mathbf\{0\}, the claim is immediate from the definitions\.

If𝐱≠𝟎\\mathbf\{x\}\\neq\\mathbf\{0\}, set𝐲:=ψv\(𝐱\)=v\(‖𝐱‖2\)𝐱‖𝐱‖2\\mathbf\{y\}:=\\psi\_\{v\}\(\\mathbf\{x\}\)=v\(\\\|\\mathbf\{x\}\\\|\_\{2\}\)\\,\\frac\{\\mathbf\{x\}\}\{\\\|\\mathbf\{x\}\\\|\_\{2\}\}\. Then‖𝐲‖2=v\(‖𝐱‖2\)\>0\\\|\\mathbf\{y\}\\\|\_\{2\}=v\(\\\|\\mathbf\{x\}\\\|\_\{2\}\)\>0and

𝐲‖𝐲‖2=𝐱‖𝐱‖2\.\\frac\{\\mathbf\{y\}\}\{\\\|\\mathbf\{y\}\\\|\_\{2\}\}=\\frac\{\\mathbf\{x\}\}\{\\\|\\mathbf\{x\}\\\|\_\{2\}\}\.\(67\)Using the inverse functionv−1:v\(ℝ\>0\)→ℝ\>0v^\{\-1\}:v\(\\mathbb\{R\}\_\{\>0\}\)\\to\\mathbb\{R\}\_\{\>0\}, we have

ψv−1\(𝐲\)=v−1\(‖𝐲‖2\)𝐲‖𝐲‖2=v−1\(v\(‖𝐱‖2\)\)𝐱‖𝐱‖2=‖𝐱‖2𝐱‖𝐱‖2=𝐱\.\\psi\_\{v\}^\{\-1\}\(\\mathbf\{y\}\)=v^\{\-1\}\(\\\|\\mathbf\{y\}\\\|\_\{2\}\)\\,\\frac\{\\mathbf\{y\}\}\{\\\|\\mathbf\{y\}\\\|\_\{2\}\}=v^\{\-1\}\(v\(\\\|\\mathbf\{x\}\\\|\_\{2\}\)\)\\,\\frac\{\\mathbf\{x\}\}\{\\\|\\mathbf\{x\}\\\|\_\{2\}\}=\\\|\\mathbf\{x\}\\\|\_\{2\}\\,\\frac\{\\mathbf\{x\}\}\{\\\|\\mathbf\{x\}\\\|\_\{2\}\}=\\mathbf\{x\}\.\(68\)Thusψv−1∘ψv=Idℝd\\psi\_\{v\}^\{\-1\}\\circ\\psi\_\{v\}=\\operatorname\{Id\}\_\{\\mathbb\{R\}^\{d\}\}\. The converse compositionψv∘ψv−1=Idℝd\\psi\_\{v\}\\circ\\psi\_\{v\}^\{\-1\}=\\operatorname\{Id\}\_\{\\mathbb\{R\}^\{d\}\}is proved in exactly the same way, exchanging the roles of𝐱\\mathbf\{x\}and𝐲\\mathbf\{y\}\. ∎

##### Proof of[Lemma1](https://arxiv.org/html/2605.24113#Thmlemma1)

###### Proof\.

Fix𝐱≠𝐲∈ℝd\\mathbf\{x\}\\neq\\mathbf\{y\}\\in\\mathbb\{R\}^\{d\}and write

f\(t\):=‖ψv−1\(\(1−t\)ψv\(𝐱\)\+tψv\(𝐲\)\)‖22\.f\(t\):=\\\|\\psi\_\{v\}^\{\-1\}\(\(1\-t\)\\psi\_\{v\}\(\\mathbf\{x\}\)\+t\\psi\_\{v\}\(\\mathbf\{y\}\)\)\\\|\_\{2\}^\{2\}\.\(69\)By the definition ofψv\\psi\_\{v\}and its inverse, for any𝐱′,𝐲′∈ℝd\\mathbf\{x\}^\{\\prime\},\\mathbf\{y\}^\{\\prime\}\\in\\mathbb\{R\}^\{d\}we have

‖ψv\(𝐱′\)‖2=v\(‖𝐱′‖2\),‖ψv−1\(𝐲′\)‖2=v−1\(‖𝐲′‖2\),\\\|\\psi\_\{v\}\(\\mathbf\{x\}^\{\\prime\}\)\\\|\_\{2\}=v\(\\\|\\mathbf\{x\}^\{\\prime\}\\\|\_\{2\}\),\\qquad\\\|\\psi\_\{v\}^\{\-1\}\(\\mathbf\{y\}^\{\\prime\}\)\\\|\_\{2\}=v^\{\-1\}\(\\\|\\mathbf\{y\}^\{\\prime\}\\\|\_\{2\}\),\(70\)with the conventionv\(0\)=0v\(0\)=0andv−1\(0\)=0v^\{\-1\}\(0\)=0usinglims→0v\(s\)=0\\lim\_\{s\\to 0\}v\(s\)=0and strict monotonicity\. Hence, for alltt,

f\(t\)=‖ψv−1\(\(1−t\)ψv\(𝐱\)\+tψv\(𝐲\)\)‖22=\(v−1\(‖\(1−t\)ψv\(𝐱\)\+tψv\(𝐲\)‖2\)\)2\.f\(t\)=\\\|\\psi\_\{v\}^\{\-1\}\(\(1\-t\)\\psi\_\{v\}\(\\mathbf\{x\}\)\+t\\psi\_\{v\}\(\\mathbf\{y\}\)\)\\\|\_\{2\}^\{2\}=\\bigl\(v^\{\-1\}\(\\\|\(1\-t\)\\psi\_\{v\}\(\\mathbf\{x\}\)\+t\\psi\_\{v\}\(\\mathbf\{y\}\)\\\|\_\{2\}\)\\bigr\)^\{2\}\.\(71\)
Set

r\(t\):=‖\(1−t\)ψv\(𝐱\)\+tψv\(𝐲\)‖2,u\(s\):=v−1\(s\),h\(s\):=u\(s\)2,r\(t\):=\\\|\(1\-t\)\\psi\_\{v\}\(\\mathbf\{x\}\)\+t\\psi\_\{v\}\(\\mathbf\{y\}\)\\\|\_\{2\},\\qquad u\(s\):=v^\{\-1\}\(s\),\\qquad h\(s\):=u\(s\)^\{2\},\(72\)so thatf\(t\)=h\(r\(t\)\)f\(t\)=h\(r\(t\)\)\. Sincevvis strictly increasing and concave,uuis strictly increasing and convex, withu′\>0u^\{\\prime\}\\\!\>0andu′′≥0u^\{\\prime\\prime\}\\\!\\geq 0on\(0,∞\)\(0,\\infty\)by the inverse function theorem\.

We compute the second derivative offfusing the chain rule\. On any interval where\(1−t\)ψv\(𝐱\)\+tψv\(𝐲\)≠𝟎\(1\-t\)\\psi\_\{v\}\(\\mathbf\{x\}\)\+t\\psi\_\{v\}\(\\mathbf\{y\}\)\\neq\\mathbf\{0\}for alltt,rris smooth, and

f′\(t\)=h′\(r\(t\)\)r′\(t\),f′′\(t\)=h′′\(r\(t\)\)\(r′\(t\)\)2\+h′\(r\(t\)\)r′′\(t\)\.f^\{\\prime\}\(t\)=h^\{\\prime\}\(r\(t\)\)\\,r^\{\\prime\}\(t\),\\qquad f^\{\\prime\\prime\}\(t\)=h^\{\\prime\\prime\}\(r\(t\)\)\\,\(r^\{\\prime\}\(t\)\)^\{2\}\+h^\{\\prime\}\(r\(t\)\)\\,r^\{\\prime\\prime\}\(t\)\.\(73\)We now computeh′h^\{\\prime\}andh′′h^\{\\prime\\prime\}in terms ofvv\. Fors\>0s\>0,

h′\(s\)=2u\(s\)u′\(s\),h′′\(s\)=2\(u′\(s\)2\+u\(s\)u′′\(s\)\),h^\{\\prime\}\(s\)=2\\,u\(s\)\\,u^\{\\prime\}\(s\),\\qquad h^\{\\prime\\prime\}\(s\)=2\\bigl\(u^\{\\prime\}\(s\)^\{2\}\+u\(s\)\\,u^\{\\prime\\prime\}\(s\)\\bigr\),\(74\)which immediately tells us thath′\(s\)\>0h^\{\\prime\}\(s\)\>0for alls\>0s\>0\. By the inverse function theorem,

u′\(s\)=1v′\(u\(s\)\),u′′\(s\)=−v′′\(u\(s\)\)v′\(u\(s\)\)3\.u^\{\\prime\}\(s\)=\\frac\{1\}\{v^\{\\prime\}\\bigl\(u\(s\)\\bigr\)\},\\qquad u^\{\\prime\\prime\}\(s\)=\-\\,\\frac\{v^\{\\prime\\prime\}\(u\(s\)\)\}\{v^\{\\prime\}\\bigl\(u\(s\)\\bigr\)^\{3\}\}\.\(75\)Thus

u′\(s\)2\+u\(s\)u′′\(s\)=1v′\(u\(s\)\)2−u\(s\)v′′\(u\(s\)\)v′\(u\(s\)\)3=v′\(u\(s\)\)−u\(s\)v′′\(u\(s\)\)v′\(u\(s\)\)3\.u^\{\\prime\}\(s\)^\{2\}\+u\(s\)\\,u^\{\\prime\\prime\}\(s\)=\\frac\{1\}\{v^\{\\prime\}\\bigl\(u\(s\)\\bigr\)^\{2\}\}\-\\frac\{u\(s\)\\,v^\{\\prime\\prime\}\\bigl\(u\(s\)\\bigr\)\}\{v^\{\\prime\}\\bigl\(u\(s\)\\bigr\)^\{3\}\}=\\frac\{v^\{\\prime\}\\bigl\(u\(s\)\\bigr\)\-u\(s\)\\,v^\{\\prime\\prime\}\\bigl\(u\(s\)\\bigr\)\}\{v^\{\\prime\}\\bigl\(u\(s\)\\bigr\)^\{3\}\}\.\(76\)Sincevvis strictly increasing,v′\>0v^\{\\prime\}\>0, and since it is concave,v′′≤0v^\{\\prime\\prime\}\\leq 0\. Therefore

v′\(u\(s\)\)−u\(s\)v′′\(u\(s\)\)≥v′\(u\(s\)\)\>0,v^\{\\prime\}\\bigl\(u\(s\)\\bigr\)\-u\(s\)\\,v^\{\\prime\\prime\}\\bigl\(u\(s\)\\bigr\)\\geq v^\{\\prime\}\\bigl\(u\(s\)\\bigr\)\>0,\(77\)so

h′′\(s\)=2\(u′\(s\)2\+u\(s\)u′′\(s\)\)\>0for alls\>0\.h^\{\\prime\\prime\}\(s\)=2\\bigl\(u^\{\\prime\}\(s\)^\{2\}\+u\(s\)\\,u^\{\\prime\\prime\}\(s\)\\bigr\)\>0\\quad\\text\{for all \}s\>0\.\(78\)
Next, observe

r\(t\)2=‖\(1−t\)ψv\(𝐱\)\+tψv\(𝐲\)‖22=‖t\(ψv\(𝐲\)−ψv\(𝐱\)\)\+ψv\(𝐱\)‖22r\(t\)^\{2\}=\\\|\(1\-t\)\\psi\_\{v\}\(\\mathbf\{x\}\)\+t\\psi\_\{v\}\(\\mathbf\{y\}\)\\\|\_\{2\}^\{2\}=\\\|t\(\\psi\_\{v\}\(\\mathbf\{y\}\)\-\\psi\_\{v\}\(\\mathbf\{x\}\)\)\+\\psi\_\{v\}\(\\mathbf\{x\}\)\\\|\_\{2\}^\{2\}\(79\)is a quadratic polynomial inttwith strictly positive leading coefficient, i\.e\.,r′′\(t\)r^\{\\prime\\prime\}\(t\)is a positive constant in that case, andr′\(t\)r^\{\\prime\}\(t\)is affine\.

Putting things together and going back to

f′′\(t\)=h′′\(r\(t\)\)\(r′\(t\)\)2\+h′\(r\(t\)\)r′′\(t\),f^\{\\prime\\prime\}\(t\)=h^\{\\prime\\prime\}\(r\(t\)\)\\,\(r^\{\\prime\}\(t\)\)^\{2\}\+h^\{\\prime\}\(r\(t\)\)\\,r^\{\\prime\\prime\}\(t\),\(80\)we haveh′′\(r\(t\)\)\>0h^\{\\prime\\prime\}\(r\(t\)\)\>0,r′′\(t\)\>0r^\{\\prime\\prime\}\(t\)\>0, andh′\(r\(t\)\)\>0h^\{\\prime\}\(r\(t\)\)\>0\. Hence

f′′\(t\)≥h′\(r\(t\)\)r′′\(t\)\>0\.f^\{\\prime\\prime\}\(t\)\\geq h^\{\\prime\}\(r\(t\)\)\\,r^\{\\prime\\prime\}\(t\)\>0\.\(81\)
If\(1−t\)ψv\(𝐱\)\+tψv\(𝐲\)=𝟎\(1\-t\)\\psi\_\{v\}\(\\mathbf\{x\}\)\+t\\psi\_\{v\}\(\\mathbf\{y\}\)=\\mathbf\{0\}for somett, the limitlims→0v\(s\)=0\\lim\_\{s\\to 0\}v\(s\)=0ensures thatffextends smoothly through that point with the same second\-derivative bound\.

Thust↦‖ψv−1\(\(1−t\)ψv\(𝐱\)\+tψv\(𝐲\)\)‖22t\\mapsto\\\|\\psi\_\{v\}^\{\-1\}\(\(1\-t\)\\psi\_\{v\}\(\\mathbf\{x\}\)\+t\\psi\_\{v\}\(\\mathbf\{y\}\)\)\\\|\_\{2\}^\{2\}is strongly convex on\[0,1\]\[0,1\]\. ∎

##### Proof of[Theorem1](https://arxiv.org/html/2605.24113#Thmtheorem1)

###### Proof\.

First, we will rewrite−log⁡pϕ,ρ\(γ𝐱,𝐲φ\(t\)\)\-\\log p\_\{\\phi,\\rho\}\(\\gamma\_\{\\mathbf\{x\},\\mathbf\{y\}\}^\{\\varphi\}\(t\)\)in a more convenient way\. From \([10](https://arxiv.org/html/2605.24113#S3.E10)\), using that\|detD𝐱ϕ\|\|\\det D\_\{\\mathbf\{x\}\}\\phi\|is constant, we can write

pϕ,ρ\(γ𝐱,𝐲φ\(t\)\)=C⋅exp⁡\(−12ρ\(ϕ\(γ𝐱,𝐲φ\(t\)\)‖ϕ\(γ𝐱,𝐲φ\(t\)\)‖2\)−2‖ϕ\(γ𝐱,𝐲φ\(t\)\)‖22\),p\_\{\\phi,\\rho\}\(\\gamma\_\{\\mathbf\{x\},\\mathbf\{y\}\}^\{\\varphi\}\(t\)\)=C\\cdot\\exp\\\!\\left\(\-\\frac\{1\}\{2\}\\,\\rho\\\!\\left\(\\frac\{\\phi\(\\gamma\_\{\\mathbf\{x\},\\mathbf\{y\}\}^\{\\varphi\}\(t\)\)\}\{\\\|\\phi\(\\gamma\_\{\\mathbf\{x\},\\mathbf\{y\}\}^\{\\varphi\}\(t\)\)\\\|\_\{2\}\}\\right\)^\{\-2\}\\\|\\phi\(\\gamma\_\{\\mathbf\{x\},\\mathbf\{y\}\}^\{\\varphi\}\(t\)\)\\\|\_\{2\}^\{2\}\\right\),\(82\)for some constantC\>0C\>0independent oftt\. Hence,

−log⁡pϕ,ρ\(γ𝐱,𝐲φ\(t\)\)=12ρ\(ϕ\(γ𝐱,𝐲φ\(t\)\)‖ϕ\(γ𝐱,𝐲φ\(t\)\)‖2\)−2‖ϕ\(γ𝐱,𝐲φ\(t\)\)‖22\+C′,\-\\log p\_\{\\phi,\\rho\}\(\\gamma\_\{\\mathbf\{x\},\\mathbf\{y\}\}^\{\\varphi\}\(t\)\)=\\frac\{1\}\{2\}\\,\\rho\\\!\\left\(\\frac\{\\phi\(\\gamma\_\{\\mathbf\{x\},\\mathbf\{y\}\}^\{\\varphi\}\(t\)\)\}\{\\\|\\phi\(\\gamma\_\{\\mathbf\{x\},\\mathbf\{y\}\}^\{\\varphi\}\(t\)\)\\\|\_\{2\}\}\\right\)^\{\-2\}\\\|\\phi\(\\gamma\_\{\\mathbf\{x\},\\mathbf\{y\}\}^\{\\varphi\}\(t\)\)\\\|\_\{2\}^\{2\}\+C^\{\\prime\},\(83\)for some constantC′C^\{\\prime\}independent oftt\. By definition ofχρ\\chi\_\{\\rho\},

χρ\(ϕ\(γ𝐱,𝐲φ\(t\)\)\)=ρ\(ϕ\(γ𝐱,𝐲φ\(t\)\)‖ϕ\(γ𝐱,𝐲φ\(t\)\)‖2\)−1ϕ\(γ𝐱,𝐲φ\(t\)\),\\chi\_\{\\rho\}\(\\phi\(\\gamma\_\{\\mathbf\{x\},\\mathbf\{y\}\}^\{\\varphi\}\(t\)\)\)=\\rho\\\!\\left\(\\frac\{\\phi\(\\gamma\_\{\\mathbf\{x\},\\mathbf\{y\}\}^\{\\varphi\}\(t\)\)\}\{\\\|\\phi\(\\gamma\_\{\\mathbf\{x\},\\mathbf\{y\}\}^\{\\varphi\}\(t\)\)\\\|\_\{2\}\}\\right\)^\{\-1\}\\phi\(\\gamma\_\{\\mathbf\{x\},\\mathbf\{y\}\}^\{\\varphi\}\(t\)\),\(84\)and therefore

‖χρ\(ϕ\(γ𝐱,𝐲φ\(t\)\)\)‖22=ρ\(ϕ\(γ𝐱,𝐲φ\(t\)\)‖ϕ\(γ𝐱,𝐲φ\(t\)\)‖2\)−2‖ϕ\(γ𝐱,𝐲φ\(t\)\)‖22\.\\\|\\chi\_\{\\rho\}\(\\phi\(\\gamma\_\{\\mathbf\{x\},\\mathbf\{y\}\}^\{\\varphi\}\(t\)\)\)\\\|\_\{2\}^\{2\}=\\rho\\\!\\left\(\\frac\{\\phi\(\\gamma\_\{\\mathbf\{x\},\\mathbf\{y\}\}^\{\\varphi\}\(t\)\)\}\{\\\|\\phi\(\\gamma\_\{\\mathbf\{x\},\\mathbf\{y\}\}^\{\\varphi\}\(t\)\)\\\|\_\{2\}\}\\right\)^\{\-2\}\\\|\\phi\(\\gamma\_\{\\mathbf\{x\},\\mathbf\{y\}\}^\{\\varphi\}\(t\)\)\\\|\_\{2\}^\{2\}\.\(85\)So we can rewrite

−log⁡pϕ,ρ\(γ𝐱,𝐲φ\(t\)\)=12‖χρ\(ϕ\(γ𝐱,𝐲φ\(t\)\)\)‖22\+C′\.\-\\log p\_\{\\phi,\\rho\}\(\\gamma\_\{\\mathbf\{x\},\\mathbf\{y\}\}^\{\\varphi\}\(t\)\)=\\frac\{1\}\{2\}\\,\\\|\\chi\_\{\\rho\}\(\\phi\(\\gamma\_\{\\mathbf\{x\},\\mathbf\{y\}\}^\{\\varphi\}\(t\)\)\)\\\|\_\{2\}^\{2\}\+C^\{\\prime\}\.\(86\)
Next, expandingχρ\(ϕ\(γ𝐱,𝐲φ\(t\)\)\)\\chi\_\{\\rho\}\(\\phi\(\\gamma\_\{\\mathbf\{x\},\\mathbf\{y\}\}^\{\\varphi\}\(t\)\)\)gives

χρ\(ϕ\(γ𝐱,𝐲φ\(t\)\)\)=\([3](https://arxiv.org/html/2605.24113#S2.E3)\)φ−1\(\(1−t\)φ\(𝐱\)\+tφ\(𝐲\)\)=φ=ψv∘χρ∘ϕψv−1\(\(1−t\)ψv\(χρ\(ϕ\(𝐱\)\)\)\+tψv\(χρ\(ϕ\(𝐲\)\)\)\),\\chi\_\{\\rho\}\(\\phi\(\\gamma\_\{\\mathbf\{x\},\\mathbf\{y\}\}^\{\\varphi\}\(t\)\)\)\\overset\{\\eqref\{eq:thm\-geodesic\-remetrized\}\}\{=\}\\varphi^\{\-1\}\\bigl\(\(1\-t\)\\,\\varphi\(\\mathbf\{x\}\)\+t\\,\\varphi\(\\mathbf\{y\}\)\\bigr\)\\overset\{\\varphi=\\psi\_\{v\}\\circ\\chi\_\{\\rho\}\\circ\\phi\}\{=\}\\psi\_\{v\}^\{\-1\}\\bigl\(\(1\-t\)\\,\\psi\_\{v\}\(\\chi\_\{\\rho\}\(\\phi\(\\mathbf\{x\}\)\)\)\+t\\,\\psi\_\{v\}\(\\chi\_\{\\rho\}\(\\phi\(\\mathbf\{y\}\)\)\)\\bigr\),\(87\)from which follows that

‖χρ\(ϕ\(γ𝐱,𝐲φ\(t\)\)\)‖22=‖ψv−1\(\(1−t\)ψv\(χρ\(ϕ\(𝐱\)\)\)\+tψv\(χρ\(ϕ\(𝐲\)\)\)\)‖22\.\\\|\\chi\_\{\\rho\}\(\\phi\(\\gamma\_\{\\mathbf\{x\},\\mathbf\{y\}\}^\{\\varphi\}\(t\)\)\)\\\|\_\{2\}^\{2\}=\\Bigl\\\|\\psi\_\{v\}^\{\-1\}\\bigl\(\(1\-t\)\\,\\psi\_\{v\}\(\\chi\_\{\\rho\}\(\\phi\(\\mathbf\{x\}\)\)\)\+t\\,\\psi\_\{v\}\(\\chi\_\{\\rho\}\(\\phi\(\\mathbf\{y\}\)\)\)\\bigr\)\\Bigr\\\|\_\{2\}^\{2\}\.\(88\)The above equation is exactly of the form in[Lemma1](https://arxiv.org/html/2605.24113#Thmlemma1), with the two vectors

𝐱′:=χρ\(ϕ\(𝐱\)\),𝐲′:=χρ\(ϕ\(𝐲\)\),\\mathbf\{x\}^\{\\prime\}:=\\chi\_\{\\rho\}\(\\phi\(\\mathbf\{x\}\)\),\\qquad\\mathbf\{y\}^\{\\prime\}:=\\chi\_\{\\rho\}\(\\phi\(\\mathbf\{y\}\)\),\(89\)and the mappingψv\\psi\_\{v\}generated by the concave strictly increasingvvsatisfyinglims→0v\(s\)=0\\lim\_\{s\\to 0\}v\(s\)=0andlims→0v′\(s\)\>0\\lim\_\{s\\to 0\}v^\{\\prime\}\(s\)\>0\.

Hence, we must have that the function

t↦−log⁡\(pϕ,ρ\(γ𝐱,𝐲φ\(t\)\)\)t\\mapsto\-\\log\\bigl\(p\_\{\\phi,\\rho\}\(\\gamma\_\{\\mathbf\{x\},\\mathbf\{y\}\}^\{\\varphi\}\(t\)\)\\bigr\)\(90\)is strongly convex on\[0,1\]\[0,1\], because

t↦‖ψv−1\(tψv\(𝐲′\)\+\(1−t\)ψv\(𝐱′\)\)‖22t\\mapsto\\Bigl\\\|\\psi\_\{v\}^\{\-1\}\\bigl\(t\\,\\psi\_\{v\}\(\\mathbf\{y\}^\{\\prime\}\)\+\(1\-t\)\\,\\psi\_\{v\}\(\\mathbf\{x\}^\{\\prime\}\)\\bigr\)\\Bigr\\\|\_\{2\}^\{2\}\(91\)is strongly convex on\[0,1\]\[0,1\]by[Lemma1](https://arxiv.org/html/2605.24113#Thmlemma1), which proves the claim\. ∎

## Appendix BSupplementary material to[section4](https://arxiv.org/html/2605.24113#S4)

### B\.1Proof of[Theorem2](https://arxiv.org/html/2605.24113#Thmtheorem2)

###### Proof\.

\(i\):To prove that

ℳ¯φ:=\{𝐱∈ℝd∣𝐱=argmin𝐲∈ℝd∑j=1r𝐠jdℝdφ\(𝐲,𝐯\(j\)\)2for some𝐠∈Δr\}=\{𝐱∈ℝd∣𝐱=φ−1\(∑j=1r𝐠jφ\(𝐯\(j\)\)\)for some𝐠∈Δr\},\\bar\{\\mathcal\{M\}\}^\{\\varphi\}:=\\bigl\\\{\\mathbf\{x\}\\in\\mathbb\{R\}^\{d\}\\;\\mid\\;\\mathbf\{x\}=\\operatorname\*\{argmin\}\_\{\\mathbf\{y\}\\in\\mathbb\{R\}^\{d\}\}\\sum\_\{j=1\}^\{r\}\\mathbf\{g\}\_\{j\}d^\{\\varphi\}\_\{\\mathbb\{R\}^\{d\}\}\(\\mathbf\{y\},\\mathbf\{v\}^\{\(j\)\}\)^\{2\}\\text\{ for some \}\\mathbf\{g\}\\in\\Delta\_\{r\}\\bigr\\\}\\\\ =\\\{\\mathbf\{x\}\\in\\mathbb\{R\}^\{d\}\\;\\mid\\;\\mathbf\{x\}=\\varphi^\{\-1\}\\Bigl\(\\sum\_\{j=1\}^\{r\}\\mathbf\{g\}\_\{j\}\\varphi\(\\mathbf\{v\}^\{\(j\)\}\)\\Bigr\)\\text\{ for some \}\\mathbf\{g\}\\in\\Delta\_\{r\}\\\},\(92\)we will show that every minimizer of𝐱↦∑j=1r𝐠jdℝdφ\(𝐱,𝐯\(j\)\)2\\mathbf\{x\}\\mapsto\\sum\_\{j=1\}^\{r\}\\mathbf\{g\}\_\{j\}d^\{\\varphi\}\_\{\\mathbb\{R\}^\{d\}\}\(\\mathbf\{x\},\\mathbf\{v\}^\{\(j\)\}\)^\{2\}is of the formφ−1\(∑j=1r𝐠jφ\(𝐯\(j\)\)\)\\varphi^\{\-1\}\\Bigl\(\\sum\_\{j=1\}^\{r\}\\mathbf\{g\}\_\{j\}\\varphi\(\\mathbf\{v\}^\{\(j\)\}\)\\Bigr\)\.

For that, we first note that the mapping𝐱↦dℝdφ\(𝐱,𝐲\)2\\mathbf\{x\}\\mapsto d\_\{\\mathbb\{R\}^\{d\}\}^\{\\varphi\}\(\\mathbf\{x\},\\mathbf\{y\}\)^\{2\}is strongly geodesically convex for every𝐲∈ℝd\\mathbf\{y\}\\in\\mathbb\{R\}^\{d\}\. Indeed, this directly follows from\[[18](https://arxiv.org/html/2605.24113#bib.bib18), Lem\. 3\.5\]after rewriting the function into the composition form, i\.e\.,dℝdφ\(⋅,𝐲\)2=\([2](https://arxiv.org/html/2605.24113#S2.E2)\)‖φ\(⋅\)−φ\(𝐲\)‖22d\_\{\\mathbb\{R\}^\{d\}\}^\{\\varphi\}\(\\cdot,\\mathbf\{y\}\)^\{2\}\\overset\{\\eqref\{eq:thm\-distance\-remetrized\}\}\{=\}\\\|\\varphi\(\\cdot\)\-\\varphi\(\\mathbf\{y\}\)\\\|\_\{2\}^\{2\}, and realizing that𝐱′↦‖𝐱′−φ\(𝐲\)‖2\\mathbf\{x\}^\{\\prime\}\\mapsto\\\|\\mathbf\{x\}^\{\\prime\}\-\\varphi\(\\mathbf\{y\}\)\\\|^\{2\}is strongly convex\. Since strong geodesic convexity is closed under addition and multiplication with non\-negative scalars, we conclude that the mapping

𝐱↦∑j=1r𝐠jdℝdφ\(𝐱,𝐯\(j\)\)2\\mathbf\{x\}\\mapsto\\sum\_\{j=1\}^\{r\}\\mathbf\{g\}\_\{j\}d^\{\\varphi\}\_\{\\mathbb\{R\}^\{d\}\}\(\\mathbf\{x\},\\mathbf\{v\}^\{\(j\)\}\)^\{2\}\(93\)is strongly geodesically convex for any𝐠∈Δr\\mathbf\{g\}\\in\\Delta\_\{r\}\.

Next, it is then easily checked that𝐱∗:=φ−1\(∑j=1r𝐠jφ\(𝐯\(j\)\)\)\\mathbf\{x\}^\{\*\}:=\\varphi^\{\-1\}\(\\sum\_\{j=1\}^\{r\}\\mathbf\{g\}\_\{j\}\\varphi\(\\mathbf\{v\}^\{\(j\)\}\)\)satisfies the first\-order optimality conditions:

grad∑j=1r𝐠jdℝdφ\(⋅,𝐯\(j\)\)2∣𝐱∗=−2∑j=1r𝐠jlog𝐱∗φ⁡\(𝐯\(j\)\)=\([5](https://arxiv.org/html/2605.24113#S2.E5)\)−2∑j=1r𝐠jD𝐱∗φ−1\[φ\(𝐯\(j\)\)−φ\(𝐱∗\)\]=−2D𝐱∗φ−1\[∑j=1r𝐠jφ\(𝐯\(j\)\)−φ\(𝐱∗\)\]=−2D𝐱∗φ−1\[φ\(𝐱∗\)−φ\(𝐱∗\)\]=𝟎\.\\operatorname\{grad\}\\sum\_\{j=1\}^\{r\}\\mathbf\{g\}\_\{j\}d^\{\\varphi\}\_\{\\mathbb\{R\}^\{d\}\}\(\\cdot,\\mathbf\{v\}^\{\(j\)\}\)^\{2\}\\mid\_\{\\mathbf\{x\}^\{\*\}\}=\-2\\sum\_\{j=1\}^\{r\}\\mathbf\{g\}\_\{j\}\\log^\{\\varphi\}\_\{\\mathbf\{x\}^\{\*\}\}\(\\mathbf\{v\}^\{\(j\)\}\)\\\\ \\overset\{\\eqref\{eq:thm\-log\-remetrized\}\}\{=\}\-2\\sum\_\{j=1\}^\{r\}\\mathbf\{g\}\_\{j\}D\_\{\\mathbf\{x\}^\{\*\}\}\\varphi^\{\-1\}\[\\varphi\(\\mathbf\{v\}^\{\(j\)\}\)\-\\varphi\(\\mathbf\{x\}^\{\*\}\)\]=\-2D\_\{\\mathbf\{x\}^\{\*\}\}\\varphi^\{\-1\}\[\\sum\_\{j=1\}^\{r\}\\mathbf\{g\}\_\{j\}\\varphi\(\\mathbf\{v\}^\{\(j\)\}\)\-\\varphi\(\\mathbf\{x\}^\{\*\}\)\]\\\\ =\-2D\_\{\\mathbf\{x\}^\{\*\}\}\\varphi^\{\-1\}\[\\varphi\(\\mathbf\{x\}^\{\*\}\)\-\\varphi\(\\mathbf\{x\}^\{\*\}\)\]=\\mathbf\{0\}\.\(94\)So we conclude that for any𝐠∈Δr\\mathbf\{g\}\\in\\Delta\_\{r\}the unique minimizer of \([93](https://arxiv.org/html/2605.24113#A2.E93)\) isφ−1\(∑j=1r𝐠jφ\(𝐯\(j\)\)\)\\varphi^\{\-1\}\(\\sum\_\{j=1\}^\{r\}\\mathbf\{g\}\_\{j\}\\varphi\(\\mathbf\{v\}^\{\(j\)\}\)\), which yields the claim

\(ii\):To prove thatℳ¯φ\\bar\{\\mathcal\{M\}\}^\{\\varphi\}is a geodesically convex set, we will show that for any𝐱,𝐲∈ℳ¯φ\\mathbf\{x\},\\mathbf\{y\}\\in\\bar\{\\mathcal\{M\}\}^\{\\varphi\}we have thatγ𝐱,𝐲φ∈ℳ¯φ\\gamma^\{\\varphi\}\_\{\\mathbf\{x\},\\mathbf\{y\}\}\\in\\bar\{\\mathcal\{M\}\}^\{\\varphi\}\.

First, we note that by \(i\), that there exist𝐠𝐱,𝐠𝐲∈Δr\\mathbf\{g\}^\{\\mathbf\{x\}\},\\mathbf\{g\}^\{\\mathbf\{y\}\}\\in\\Delta\_\{r\}such that

𝐱=φ−1\(∑j=1r𝐠j𝐱φ\(𝐯\(j\)\)\),𝐲=φ−1\(∑j=1r𝐠j𝐲φ\(𝐯\(j\)\)\)\.\\mathbf\{x\}=\\varphi^\{\-1\}\\Bigl\(\\sum\_\{j=1\}^\{r\}\\mathbf\{g\}^\{\\mathbf\{x\}\}\_\{j\}\\,\\varphi\(\\mathbf\{v\}^\{\(j\)\}\)\\Bigr\),\\qquad\\mathbf\{y\}=\\varphi^\{\-1\}\\Bigl\(\\sum\_\{j=1\}^\{r\}\\mathbf\{g\}^\{\\mathbf\{y\}\}\_\{j\}\\,\\varphi\(\\mathbf\{v\}^\{\(j\)\}\)\\Bigr\)\.\(95\)Next, fort∈\[0,1\]t\\in\[0,1\]define the interpolating weights

𝐠t:=\(1−t\)𝐠𝐱\+t𝐠𝐲∈Δr,\\mathbf\{g\}^\{t\}:=\(1\-t\)\\mathbf\{g\}^\{\\mathbf\{x\}\}\+t\\mathbf\{g\}^\{\\mathbf\{y\}\}\\in\\Delta\_\{r\},\(96\)and the curve

γ\(t\):=φ−1\(∑j=1r𝐠jtφ\(𝐯\(j\)\)\)\.\\gamma\(t\):=\\varphi^\{\-1\}\\Bigl\(\\sum\_\{j=1\}^\{r\}\\mathbf\{g\}^\{t\}\_\{j\}\\,\\varphi\(\\mathbf\{v\}^\{\(j\)\}\)\\Bigr\)\.\(97\)Then, by construction,γ\(0\)=𝐱\\gamma\(0\)=\\mathbf\{x\},γ\(1\)=𝐲\\gamma\(1\)=\\mathbf\{y\}, andγ\(t\)∈ℳ¯φ\\gamma\(t\)\\in\\bar\{\\mathcal\{M\}\}^\{\\varphi\}for allt∈\[0,1\]t\\in\[0,1\]\.

To prove the claim it suffices to show thatγ𝐱,𝐲φ\(t\)=γ\(t\)\\gamma^\{\\varphi\}\_\{\\mathbf\{x\},\\mathbf\{y\}\}\(t\)=\\gamma\(t\)for allt∈\[0,1\]t\\in\[0,1\]\. This follows from direct computation:

γ𝐱,𝐲φ\(t\)=\([3](https://arxiv.org/html/2605.24113#S2.E3)\)φ−1\(\(1−t\)φ\(𝐱\)\+tφ\(𝐲\)\)=\([95](https://arxiv.org/html/2605.24113#A2.E95)\)φ−1\(\(1−t\)∑j=1r𝐠j𝐱φ\(𝐯\(j\)\)\+t∑j=1r𝐠j𝐲φ\(𝐯\(j\)\)\)=φ−1\(∑j=1r\(\(1−t\)𝐠j𝐱\+t𝐠j𝐲\)φ\(𝐯\(j\)\)\)=\([96](https://arxiv.org/html/2605.24113#A2.E96)\)φ−1\(∑j=1r𝐠tφ\(𝐯\(j\)\)\)=\([97](https://arxiv.org/html/2605.24113#A2.E97)\)γ\(t\)\.\\gamma^\{\\varphi\}\_\{\\mathbf\{x\},\\mathbf\{y\}\}\(t\)\\overset\{\\eqref\{eq:thm\-geodesic\-remetrized\}\}\{=\}\\varphi^\{\-1\}\(\(1\-t\)\\varphi\(\\mathbf\{x\}\)\+t\\varphi\(\\mathbf\{y\}\)\)\\overset\{\\eqref\{eq:thm2\-xy\-rewrite\}\}\{=\}\\varphi^\{\-1\}\\Bigl\(\(1\-t\)\\sum\_\{j=1\}^\{r\}\\mathbf\{g\}^\{\\mathbf\{x\}\}\_\{j\}\\,\\varphi\(\\mathbf\{v\}^\{\(j\)\}\)\+t\\sum\_\{j=1\}^\{r\}\\mathbf\{g\}^\{\\mathbf\{y\}\}\_\{j\}\\,\\varphi\(\\mathbf\{v\}^\{\(j\)\}\)\\Bigr\)\\\\ =\\varphi^\{\-1\}\\Bigl\(\\sum\_\{j=1\}^\{r\}\\bigl\(\(1\-t\)\\mathbf\{g\}^\{\\mathbf\{x\}\}\_\{j\}\+t\\mathbf\{g\}^\{\\mathbf\{y\}\}\_\{j\}\\bigr\)\\,\\varphi\(\\mathbf\{v\}^\{\(j\)\}\)\\Bigr\)\\overset\{\\eqref\{eq:thm2\-gt\}\}\{=\}\\varphi^\{\-1\}\\Bigl\(\\sum\_\{j=1\}^\{r\}\\mathbf\{g\}^\{t\}\\,\\varphi\(\\mathbf\{v\}^\{\(j\)\}\)\\Bigr\)\\overset\{\\eqref\{eq:thm2\-gamma\}\}\{=\}\\gamma\(t\)\.\(98\)
\(iii\):Finally, the claim thatℳ¯φ\\bar\{\\mathcal\{M\}\}^\{\\varphi\}is a smooth manifold with corners, whose interior has dimension

dim\(int⁡\(ℳ¯φ\)\)=rank⁡\(\[φ\(𝐯\(1\)\)−φ\(𝐯\(r\)\),…,φ\(𝐯\(r−1\)\)−φ\(𝐯\(r\)\)\]\)≤r−1,\\dim\(\\operatorname\{int\}\(\\bar\{\\mathcal\{M\}\}^\{\\varphi\}\)\)=\\operatorname\{rank\}\(\[\\varphi\(\\mathbf\{v\}^\{\(1\)\}\)\-\\varphi\(\\mathbf\{v\}^\{\(r\)\}\),\\ldots,\\varphi\(\\mathbf\{v\}^\{\(r\-1\)\}\)\-\\varphi\(\\mathbf\{v\}^\{\(r\)\}\)\]\)\\leq r\-1,\(99\)follows directly, when seeing that the mappingφ′:ℳ¯φ→conv⁡\{φ\(𝐯\(1\)\),…,φ\(𝐯\(r\)\)\}\\varphi^\{\\prime\}:\\bar\{\\mathcal\{M\}\}^\{\\varphi\}\\to\\operatorname\{conv\}\\\{\\varphi\(\\mathbf\{v\}^\{\(1\)\}\),\\ldots,\\varphi\(\\mathbf\{v\}^\{\(r\)\}\)\\\}between the constraint set and the convex hull of the embedded archetypes given by

φ′\(𝐱\):=φ\(𝐱\),𝐱∈ℳ¯φ\\varphi^\{\\prime\}\(\\mathbf\{x\}\):=\\varphi\(\\mathbf\{x\}\),\\quad\\mathbf\{x\}\\in\\bar\{\\mathcal\{M\}\}^\{\\varphi\}\(100\)is a diffeomorphism between the two sets, i\.e\.,φ′\\varphi^\{\\prime\}is a chart\. In other words, since the polytopeconv⁡\{φ\(𝐯\(1\)\),…,φ\(𝐯\(r\)\)\}\\operatorname\{conv\}\\\{\\varphi\(\\mathbf\{v\}^\{\(1\)\}\),\\ldots,\\varphi\(\\mathbf\{v\}^\{\(r\)\}\)\\\}is a smooth manifold with corners, so isℳ¯φ\\bar\{\\mathcal\{M\}\}^\{\\varphi\}and

dim\(int\(ℳ¯φ\)\)=dim\(int\(conv\{φ\(𝐯\(1\)\),…,φ\(𝐯\(r\)\)\}\)=rank⁡\(\[φ\(𝐯\(1\)\)−φ\(𝐯\(r\)\),…,φ\(𝐯\(r−1\)\)−φ\(𝐯\(r\)\)\]\)≤r−1,\\dim\(\\operatorname\{int\}\(\\bar\{\\mathcal\{M\}\}^\{\\varphi\}\)\)=\\dim\(\\operatorname\{int\}\(\\operatorname\{conv\}\\\{\\varphi\(\\mathbf\{v\}^\{\(1\)\}\),\\ldots,\\varphi\(\\mathbf\{v\}^\{\(r\)\}\)\\\}\)\\\\ =\\operatorname\{rank\}\(\[\\varphi\(\\mathbf\{v\}^\{\(1\)\}\)\-\\varphi\(\\mathbf\{v\}^\{\(r\)\}\),\\ldots,\\varphi\(\\mathbf\{v\}^\{\(r\-1\)\}\)\-\\varphi\(\\mathbf\{v\}^\{\(r\)\}\)\]\)\\leq r\-1,\(101\)which yields the claim\.

∎

## Appendix CSupplementary material to[section5](https://arxiv.org/html/2605.24113#S5)

### C\.1Proof of[Proposition3](https://arxiv.org/html/2605.24113#Thmproposition3)

###### Proof\.

We prove the statement by direct evaluation of \([48](https://arxiv.org/html/2605.24113#S5.E48)\), i\.e\., we will solve for the supremum in

ρΣ,μ\(ω\)=sup\{t\>0∣t⋅ω∈ℰΣ\(μ\)\}\.\\rho\_\{\\Sigma,\\mu\}\(\\omega\)=\\sup\\\{t\>0\\,\\mid\\,t\\cdot\\omega\\in\\mathcal\{E\}\_\{\\Sigma\}\(\\mu\)\\\}\.\(102\)First, note that the supremum lives on the boundary, i\.e\., we want to find

t⋅ω∈∂ℰΣ\(μ\)⇔\(t⋅ω−μ\)⊤Σ\(t⋅ω−μ\)=1⇔t2ω⊤Σ−1ω−2tω⊤Σ−1μ\+μ⊤Σ−1μ−1=0\.t\\cdot\\omega\\in\\partial\\mathcal\{E\}\_\{\\Sigma\}\(\\mu\)\\quad\\Leftrightarrow\\quad\(t\\cdot\\omega\-\\mu\)^\{\\top\}\\Sigma\(t\\cdot\\omega\-\\mu\)=1\\quad\\Leftrightarrow\\quad t^\{2\}\\omega^\{\\top\}\\Sigma^\{\-1\}\\omega\-2t\\omega^\{\\top\}\\Sigma^\{\-1\}\\mu\+\\mu^\{\\top\}\\Sigma^\{\-1\}\\mu\-1=0\.\(103\)
The right\-hand side is a quadratic equation, which has solutions

t=ω⊤Σ−1μ±\(ω⊤Σ−1μ\)2\+\(ω⊤Σ−1ω\)\(1−μ⊤Σ−1μ\)ω⊤Σ−1ω\.t=\\frac\{\\omega^\{\\top\}\\Sigma^\{\-1\}\\mu\\pm\\sqrt\{\\left\(\\omega^\{\\top\}\\Sigma^\{\-1\}\\mu\\right\)^\{2\}\+\(\\omega^\{\\top\}\\Sigma^\{\-1\}\\omega\)\\left\(1\-\\mu^\{\\top\}\\Sigma^\{\-1\}\\mu\\right\)\}\}\{\\omega^\{\\top\}\\Sigma^\{\-1\}\\omega\}\.\(104\)Since we want positivett, we conclude that

ρΣ,μ\(ω\)=ω⊤Σ−1μ\+\(ω⊤Σ−1μ\)2\+\(ω⊤Σ−1ω\)\(1−μ⊤Σ−1μ\)ω⊤Σ−1ω,\\rho\_\{\\Sigma,\\mu\}\(\\omega\)=\\frac\{\\omega^\{\\top\}\\Sigma^\{\-1\}\\mu\+\\sqrt\{\\left\(\\omega^\{\\top\}\\Sigma^\{\-1\}\\mu\\right\)^\{2\}\+\(\\omega^\{\\top\}\\Sigma^\{\-1\}\\omega\)\\left\(1\-\\mu^\{\\top\}\\Sigma^\{\-1\}\\mu\\right\)\}\}\{\\omega^\{\\top\}\\Sigma^\{\-1\}\\omega\},\(105\)as claimed\. Then, forμ=𝟎\\mu=\\mathbf\{0\}, the above expression reduces to

ρΣ,μ\(ω\)=\(ω⊤Σ−1ω\)−12\.\\rho\_\{\\Sigma,\\mu\}\(\\omega\)=\(\\omega^\{\\top\}\\Sigma^\{\-1\}\\omega\)^\{\-\\frac\{1\}\{2\}\}\.\(106\)∎

### C\.2Proof of[Proposition4](https://arxiv.org/html/2605.24113#Thmproposition4)

###### Proof\.

Both claims \([55](https://arxiv.org/html/2605.24113#S5.E55)\) follow from direct evaluation\.

Starting with the first claim, we first rewrite

1n∑i=1n\(𝐲\(i\)−μ\)⊤Σo−1\(𝐲\(i\)−μ\)=1ntr\(\(𝐘−μ𝟏n⊤\)⊤Σo−1\(𝐘−μ𝟏n⊤\)\)=1ntr\(Σo−1\(𝐘−μ𝟏n⊤\)\(𝐘−μ𝟏n⊤\)⊤\)=1ntr\(Σo−1𝐖diag⁡\(ς02,ς12,…,ςd−12\)𝐖⊤\)\.\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}\(\\mathbf\{y\}^\{\(i\)\}\-\\mu\)^\{\\top\}\\Sigma\_\{o\}^\{\-1\}\(\\mathbf\{y\}^\{\(i\)\}\-\\mu\)=\\frac\{1\}\{n\}\\operatorname\*\{tr\}\\bigl\(\(\\mathbf\{Y\}\-\\mu\\mathbf\{1\}\_\{n\}^\{\\top\}\)^\{\\top\}\\Sigma\_\{o\}^\{\-1\}\(\\mathbf\{Y\}\-\\mu\\mathbf\{1\}\_\{n\}^\{\\top\}\)\\bigr\)\\\\ =\\frac\{1\}\{n\}\\operatorname\*\{tr\}\\bigl\(\\Sigma\_\{o\}^\{\-1\}\(\\mathbf\{Y\}\-\\mu\\mathbf\{1\}\_\{n\}^\{\\top\}\)\(\\mathbf\{Y\}\-\\mu\\mathbf\{1\}\_\{n\}^\{\\top\}\)^\{\\top\}\\bigr\)\\\\ =\\frac\{1\}\{n\}\\operatorname\*\{tr\}\\bigl\(\\Sigma\_\{o\}^\{\-1\}\\mathbf\{W\}\\operatorname\{diag\}\(\\varsigma\_\{0\}^\{2\},\\varsigma\_\{1\}^\{2\},\\ldots,\\varsigma\_\{d\-1\}^\{2\}\)\\mathbf\{W\}^\{\\top\}\\bigr\)\.\(107\)whereς0:=∑i=1n\(μ⊤\(𝐲\(i\)−μ\)\)2‖μ‖22\\varsigma\_\{0\}:=\\sum\_\{i=1\}^\{n\}\\frac\{\(\\mu^\{\\top\}\(\\mathbf\{y\}^\{\(i\)\}\-\\mu\)\)^\{2\}\}\{\\\|\\mu\\\|\_\{2\}^\{2\}\}\. Further rewriting yields

1ntr\(Σo−1𝐖diag⁡\(ς02,ς12,…,ςd−12\)𝐖⊤\)=1ntr\(𝐖Λo−1𝐖⊤𝐖diag⁡\(ς02,ς12,…,ςd−12\)𝐖⊤\)=1ntr\(Λo−1diag⁡\(ς02,ς12,…,ςd−12\)\)=1n∑k=1dςk−12λk,\\frac\{1\}\{n\}\\operatorname\*\{tr\}\\bigl\(\\Sigma\_\{o\}^\{\-1\}\\mathbf\{W\}\\operatorname\{diag\}\(\\varsigma\_\{0\}^\{2\},\\varsigma\_\{1\}^\{2\},\\ldots,\\varsigma\_\{d\-1\}^\{2\}\)\\mathbf\{W\}^\{\\top\}\\bigr\)=\\frac\{1\}\{n\}\\operatorname\*\{tr\}\\bigl\(\\mathbf\{W\}\\Lambda\_\{o\}^\{\-1\}\\mathbf\{W\}^\{\\top\}\\mathbf\{W\}\\operatorname\{diag\}\(\\varsigma\_\{0\}^\{2\},\\varsigma\_\{1\}^\{2\},\\ldots,\\varsigma\_\{d\-1\}^\{2\}\)\\mathbf\{W\}^\{\\top\}\\bigr\)\\\\ =\\frac\{1\}\{n\}\\operatorname\*\{tr\}\\bigl\(\\Lambda\_\{o\}^\{\-1\}\\operatorname\{diag\}\(\\varsigma\_\{0\}^\{2\},\\varsigma\_\{1\}^\{2\},\\ldots,\\varsigma\_\{d\-1\}^\{2\}\)\\bigr\)=\\frac\{1\}\{n\}\\sum\_\{k=1\}^\{d\}\\frac\{\\varsigma\_\{k\-1\}^\{2\}\}\{\\lambda\_\{k\}\},\(108\)which gives us an expression we can now bound:

1n∑k=1dςk−12λk=1d∑k=1ddnςk−12λk≤1d∑k=1dλkλk=1\.\\frac\{1\}\{n\}\\sum\_\{k=1\}^\{d\}\\frac\{\\varsigma\_\{k\-1\}^\{2\}\}\{\\lambda\_\{k\}\}=\\frac\{1\}\{d\}\\sum\_\{k=1\}^\{d\}\\frac\{\\frac\{d\}\{n\}\\varsigma\_\{k\-1\}^\{2\}\}\{\\lambda\_\{k\}\}\\leq\\frac\{1\}\{d\}\\sum\_\{k=1\}^\{d\}\\frac\{\\lambda\_\{k\}\}\{\\lambda\_\{k\}\}=1\.\(109\)So we conclude that

1n∑i=1n\(𝐲\(i\)−μ\)⊤Σo−1\(𝐲\(i\)−μ\)≤1,\\frac\{1\}\{n\}\\sum\_\{i=1\}^\{n\}\(\\mathbf\{y\}^\{\(i\)\}\-\\mu\)^\{\\top\}\\Sigma\_\{o\}^\{\-1\}\(\\mathbf\{y\}^\{\(i\)\}\-\\mu\)\\leq 1,\(110\)as claimed\.

For the second claim, we rewrite

μ⊤Σo−1μ=μ⊤𝐖Λo−1𝐖⊤μ=‖1‖μ‖22μ⊤μ‖22λ1=1λ1<1,\\mu^\{\\top\}\\Sigma\_\{o\}^\{\-1\}\\mu=\\mu^\{\\top\}\\mathbf\{W\}\\Lambda\_\{o\}^\{\-1\}\\mathbf\{W\}^\{\\top\}\\mu=\\frac\{\\\|\\frac\{1\}\{\\\|\\mu\\\|\_\{2\}^\{2\}\}\\mu^\{\\top\}\\mu\\\|\_\{2\}^\{2\}\}\{\\lambda\_\{1\}\}=\\frac\{1\}\{\\lambda\_\{1\}\}<1,\(111\)as claimed\. ∎

### C\.3Proof of[Proposition5](https://arxiv.org/html/2605.24113#Thmproposition5)

###### Proof\.

The proof is analogous to the proof of[Proposition4](https://arxiv.org/html/2605.24113#Thmproposition4)above\. ∎

## Appendix DSupplementary material to[section6](https://arxiv.org/html/2605.24113#S6)

### D\.1Normalizing flow construction

##### Architecture

In all experiments, the diffeomorphismϕθ\\phi\_\{\\theta\}is a multi\-scale convolutional image flow acting on images inℝ1×32×32\\mathbb\{R\}^\{1\\times 32\\times 32\}with a standard normal base distribution\. The flow hasLLlevels; at each level it:

1. 1\.Applies a fixed squeeze transform with factor 2 \(halving height and width, quadrupling channels\)\.
2. 2\.Applies a learned image transformTℓT\_\{\\ell\}\.
3. 3\.Splits the channels in half, passing one half to the next level and storing the other for the inverse pass\.

The dimensions are chosen so that the first level sees44channels of size16×1616\\times 16, and for eachℓ=2,…,L\\ell=2,\\dots,L,

Cℓ=2Cℓ−1,Hℓ=Hℓ−1/2,Wℓ=Wℓ−1/2,C\_\{\\ell\}=2C\_\{\\ell\-1\},\\quad H\_\{\\ell\}=H\_\{\\ell\-1\}/2,\\quad W\_\{\\ell\}=W\_\{\\ell\-1\}/2,i\.e\., spatial dimensions shrink by a factor of 2 while the number of channels doubles at every level\.

Each level transformTℓT\_\{\\ell\}is a composite of several invertible convolutional blocks\. For an input with shape\(C,H,W\)\(C,H,W\), the transform consists ofnflowsn\_\{\\text\{flows\}\}repeated “flow steps” followed by a final linear block\. Every flow step applies:

- •channel\-wise affine normalization \(ActNorm\-type layer\),
- •an invertible1×11\\times 1convolution \(channel mixing\),
- •two linear parity\-based convolutions with kernel sizekkand alternating parities,
- •a nonlinear parity\-based coupling layer withk×kk\\times kconvolutions and a hidden width of 64 channels\.

After thesenflowsn\_\{\\text\{flows\}\}steps the transform finishes with one more normalization layer, another invertible1×11\\times 1convolution, and two additional parity\-based linear convolutions\. All components are exactly invertible, and the log\-determinant contributions from squeezing, normalization,1×11\\times 1convolutions, and coupling layers are accumulated in the usual way\.

The forward pass of the multi\-level flow thus proceeds by repeatedly squeezing, applying a level transform, and splitting channels until all levels have been applied\. The inverse pass reverses these operations \(recombining stored channel splits and “unsqueezing”\) and is used both for likelihood evaluation and for sampling\.

##### Training setup

For both the single\-digit and full MNIST experiments, images are resized to32×3232\\times 32, normalized using the standard MNIST mean 0\.1307 and variance 0\.3081, and split into training and validation subsets with an 80/20 ratio\. Mini\-batches of size 128 are used throughout\.

The flow is parameterized with:

- •kernel sizek=3k=3,
- •hidden width 64 channels in the nonlinear coupling networks,
- •nflows=3n\_\{\\text\{flows\}\}=3flow steps per scale,
- •nscales=3n\_\{\\text\{scales\}\}=3multi\-scale levels\.

Training minimizes the negative log\-likelihood of the data under the flow using Adam with learning rate10−310^\{\-3\}for 50 epochs\.

### D\.2Radial construction

In both settings, radial parameters \(α=1\.1\\alpha=1\.1andβ=1\\beta=1\) are shared across branches and govern the star\-shaped radial deformation used in the deformed star distributions\.
Riemannian Archetypal Analysis: Interpretable non-linear data analysis on deformed star distributions

Similar Articles

FormalAnalyticGeo: A Neural-Symbolic Based Framework for Multimodal Analytic Geometry Problem Generation

Rethinking Structural Anomaly Detection: From Decision Boundaries to Projection Operators

Unlocking Latent Dimensions: Exploring Representations of Large-Scale X-ray Scattering Data using Variational Autoencoders

Tree-Structured Orthonormal Decomposition of the Aitchison Simplex

Geometry-Aware R-Structured Kolmogorov-Arnold Networks

Submit Feedback

Similar Articles

FormalAnalyticGeo: A Neural-Symbolic Based Framework for Multimodal Analytic Geometry Problem Generation
Rethinking Structural Anomaly Detection: From Decision Boundaries to Projection Operators
Unlocking Latent Dimensions: Exploring Representations of Large-Scale X-ray Scattering Data using Variational Autoencoders
Tree-Structured Orthonormal Decomposition of the Aitchison Simplex
Geometry-Aware R-Structured Kolmogorov-Arnold Networks