How Data Augmentation Shapes Neural Representations


Summary

This paper uses shape analysis tools to characterize how different data augmentation strategies reshape the geometry of neural network representations, finding that augmentation strength and type lead to distinct, well-behaved trajectories in shape space.

arXiv:2605.15306v1 (announce type: new)

Cached at: 05/18/26, 06:39 AM

# How Data Augmentation Shapes Neural Representations
Source: [https://arxiv.org/html/2605.15306](https://arxiv.org/html/2605.15306)
Tianxiao He (Department of Computer Science, New York University, New York, NY 10012, USA; th3129@nyu.edu), Alex H. Williams (Center for Neural Science, New York University, New York, NY 10012, USA; aw4614@nyu.edu), Sarah E. Harvey (Flatiron Institute, New York, NY 10010, USA; sharvey@flatironinstitute.org)

###### Abstract

Data augmentation is widely recognized for improving generalization in deep networks, yet its impact on the geometry of learned representations remains poorly understood. In this work, we characterize how different data augmentation strategies reshape internal representations in neural networks. Using tools from shape analysis, we embed network hidden representations into a metric space where distance is invariant to scaling, translation, rotation and reflection. We show that increasing augmentation strength leads to well-behaved trajectories in this space, and that different augmentation types steer representations in distinct directions. Moreover, we investigate how neural representation shapes are distorted along data augmentation trajectories, and show that insights from neural geometry can predict which representations provide the most improvement when ensembling models. Our results reveal shared geometric patterns across architectures and seeds, and suggest that analyzing shape-space trajectories offers a principled tool for understanding and comparing data augmentation methods. Open source code is available at https://anonymous.4open.science/r/netrep-analysis-00EC

## 1 Introduction

Due to the overparametrized nature of deep neural networks, many hyperparameter settings often achieve similar task performance. Tuning hyperparameters such as learning rate, batch size, and momentum is known to be crucial for performance [[24](https://arxiv.org/html/2605.15306#bib.bib17)], but there is little direct study of how these choices change the geometry of learned neural representations. In particular, when a single hyperparameter is isolated and gradually varied, how do features of hidden layer activity change?

A particularly rich area to study this phenomenon is data augmentation. Data augmentation (DA) is a foundational technique for improving the performance of deep networks by diversifying existing datasets with simple transformations [[23](https://arxiv.org/html/2605.15306#bib.bib32)]. In the context of computer vision, classic DA techniques include rotation, cropping, color jitter and addition of Gaussian noise to some fraction of input images. DA is often interpreted as an implicit regularization technique [[4](https://arxiv.org/html/2605.15306#bib.bib29)] or the incorporation of prior knowledge about invariances in the task [[18](https://arxiv.org/html/2605.15306#bib.bib22), [3](https://arxiv.org/html/2605.15306#bib.bib23)]. Despite their centrality in modern training pipelines, data augmentations are often tuned using simple heuristics or searches that optimize task performance. However, many distinct combinations of augmentation hyperparameters, such as strength, probability, or spatial extent, can yield near-identical accuracy on held-out data. This degeneracy suggests that performance metrics alone may obscure meaningful differences in how models learn to represent the world under different training protocols.

While the benefits of DA are intuitive, a deeper understanding remains elusive. For example, is there a sense in which DA methods move representations along well-defined trajectories as the hyperparameter is varied? Do distinct augmentation methods (e.g. image rotations and random cropping) sculpt hidden layer computations in fundamentally different ways, or are their effects redundant? How do two augmentation methods interact when they are simultaneously applied? More broadly, does diversity in representational geometry predict the extent of functional improvements when models are ensembled across different hyperparameter values? Can two augmentations synergistically interact or interfere with each other, and can this be observed from their hidden representational geometry?

As a first step towards addressing these questions, we leverage previously developed *metric spaces* on the geometry of hidden layer representations [[27](https://arxiv.org/html/2605.15306#bib.bib25), [17](https://arxiv.org/html/2605.15306#bib.bib26)]. This framework identifies the *geometry* of hidden layer representations as points on a smooth manifold of possible geometries, enabling tools from differential geometry to study the effect of hyperparameter changes on learned representations. In particular, gradual changes in augmentation strength can be conceptualized as moving the representational geometry along a smooth path on this manifold. Different types of augmentation give rise to different paths over the manifold, and we can compute the lengths and curvatures of these paths, as well as the angles formed at intersections. Altogether, this enables a systematic comparison of how augmentation strength and types reshape the representation geometry across network layers, initializations, and architectures. Our main contributions are:

- We propose a geometric framework for analyzing the effects of data augmentation hyperparameters on learned neural representations.
- We demonstrate that increasing data augmentation magnitude steers representations along structured trajectories in this *shape space*, but not in the naive Euclidean space.
- We study how the magnitude of representation shape change due to data augmentation compares with the magnitude of representation change due to a different random seed.
- We show that greater divergence between trajectories of different augmentations predicts larger gains from model ensembling.
- We find that data augmentation displaces shape landmarks non-uniformly, with effects that depend on network depth and data augmentation type.

## 2 Related Work

**Data augmentation.** Data augmentation (DA) is a central technique in modern machine learning pipelines, used to increase data diversity, improve generalization, and enhance robustness [[23](https://arxiv.org/html/2605.15306#bib.bib32), [26](https://arxiv.org/html/2605.15306#bib.bib36), [2](https://arxiv.org/html/2605.15306#bib.bib24), [4](https://arxiv.org/html/2605.15306#bib.bib29)]. It is generally thought that augmentations shape neural computation by “collapsing” the manifold of representations along nuisance dimensions, so that augmented views of the same input are mapped to similar responses [[18](https://arxiv.org/html/2605.15306#bib.bib22)]. The broader principle of view-invariant representations can be traced back decades within the computer vision and visual neuroscience literature [[21](https://arxiv.org/html/2605.15306#bib.bib19), [22](https://arxiv.org/html/2605.15306#bib.bib20), [7](https://arxiv.org/html/2605.15306#bib.bib21)].

Importantly, the intuition that training with augmentations “collapses” augmented views onto a single point only pertains to the representational geometry of the augmented dataset; it does *not* tell us how the manifold of representations over clean images is altered by training with augmentations. This distinction is particularly important for augmentation techniques that produce out-of-distribution training samples (e.g. adding a large amount of Gaussian noise), with the goal of improving performance on unseen within-distribution samples. In this paper, we investigate how different augmentations alter representations of non-augmented “clean images” in the test set.

**Representational similarity.** Comparing neural network representations is essential for understanding training dynamics, transferability, and model equivalence. Representational similarity metrics aim to characterize relationships between hidden layer activation patterns across layers and models. Metrics like centered kernel alignment (CKA) measure pairwise similarity between activation matrices using kernel-based alignment [[16](https://arxiv.org/html/2605.15306#bib.bib30)], and are widely used to study how layer representations change across seeds or architectures [[20](https://arxiv.org/html/2605.15306#bib.bib5)]. More recent work has proposed using formal metric spaces to capture more nuanced geometric structure in representations [[27](https://arxiv.org/html/2605.15306#bib.bib25)], which allows for more advanced analysis methods on representations such as regression and clustering. [[17](https://arxiv.org/html/2605.15306#bib.bib26)] leveraged this idea to study the trajectories of representations through layers of deep neural networks, and also introduced concepts from Riemannian geometry to study these paths. While our work draws on similar methodological ideas, we apply them to study the trajectories of neural representations *across models* as hyperparameters are tuned.

## 3 Methods

In this section, we define our notion of a metric space for neural representations. We then discuss geodesics of this metric and how to measure angles between trajectories of representations. Lastly, we discuss how to interpret the direction of trajectories in this space in terms of *landmark displacements*, which measure how much the hidden layer response changes to each image input after nuisance transformations are removed.

![Refer to caption](https://arxiv.org/html/2605.15306v1/x1.png)

Figure 1: Embedding into shape space enables more meaningful comparison of large ensembles of representations. (a) After two models are trained end-to-end using different data augmentation methods, $M$ unaltered images from the test dataset are randomly selected as probe images. The Riemannian shape distance ([eq. 2](https://arxiv.org/html/2605.15306#S3.E2)) between these two representations is measured as the angular distance up to translation, scale, rotation, and reflection. (b) The representation shapes from layer $\ell$ with the Riemannian shape distance form a metric space in which we may measure distance, compute geodesics, and measure angles as data augmentation parameters are varied.

**Problem Setup.** Let $f_L^A : \mathcal{Z} \mapsto \mathcal{Y}$ represent the function of a feedforward neural network with $L \in \mathbb{N}$ layers, where $\mathcal{Z}$ is the input space, $\mathcal{Y}$ is the output space, and the network is trained using DA strategy $A$. The neural network $f_L^B : \mathcal{Z} \mapsto \mathcal{Y}$ is identical in architecture and initialization to $f_L^A$, but is trained using DA strategy $B$.

We define the *representation mapping* from the input domain to a vector of activations at a hidden layer $\ell$ with $N$ units as $f_\ell^A : \mathcal{Z} \mapsto \mathbb{R}^N$ and similarly $f_\ell^B : \mathcal{Z} \mapsto \mathbb{R}^N$. How similar are the activation patterns produced in hidden layers by $f_\ell^A(\cdot)$ and $f_\ell^B(\cdot)$? More specifically, in what way is the change in training data induced by a DA method change $A \rightarrow B$, or an increase in the magnitude of a hyperparameter $A \rightarrow A + \Delta A$, present in the representations formed by $f_\ell^A(\cdot)$, $f_\ell^B(\cdot)$ and $f_\ell^{A+\Delta A}$? We measure these changes over a representative collection of unmodified probe inputs $z_1, \ldots, z_M \in \mathcal{Z}$ from the test set (i.e. clean images).

Following well-established practices for comparing neural representations (see e.g. [[16](https://arxiv.org/html/2605.15306#bib.bib30), [27](https://arxiv.org/html/2605.15306#bib.bib25)]), we proceed by measuring neural responses $f_\ell^A(z_1), \dots, f_\ell^A(z_M)$ and stacking them row-wise into a representation matrix $X_\ell^A \in \mathbb{R}^{M \times N}$. Likewise, we form a matrix $X_\ell^B \in \mathbb{R}^{M \times N}$ from the second network’s responses, $f_\ell^B(z_1), \dots, f_\ell^B(z_M)$. Intuitively, one can view these matrices as approximations to each network’s input-to-layer-$\ell$ mapping over a discrete set of $M$ “landmark” points. The representation matrices $X_\ell^A$ can be visualized as $M$ points in an $N$-dimensional space ([Fig. 1](https://arxiv.org/html/2605.15306#S3.F1)a).

**Riemannian Shape Distance.** Measuring the distance between representations $X_\ell^A$ and $X_\ell^B$ is complicated by the fact that there is no inherent correspondence between neural dimensions at layer $\ell$. A variety of metrics based on alignment or comparing intra-network distances have been proposed (see [[15](https://arxiv.org/html/2605.15306#bib.bib4)] for a review). In this paper, we use the *Riemannian shape distance* [[13](https://arxiv.org/html/2605.15306#bib.bib27)], which alongside the closely related *Procrustes distance* creates a framework for comparing deep network representations [[27](https://arxiv.org/html/2605.15306#bib.bib25), [8](https://arxiv.org/html/2605.15306#bib.bib13)]. This notion of distance considers two representations as equivalent if their point clouds (see [Fig. 1](https://arxiv.org/html/2605.15306#S3.F1)a) can be superimposed by translation, rotation, reflections, and rescaling. Intuitively, this would imply that the point clouds have the same geometry, and therefore carry the same linearly decodable information [[10](https://arxiv.org/html/2605.15306#bib.bib11)].

To compute the Riemannian shape distance between two representation matrices $X_i \in \mathbb{R}^{M \times N}$ and $X_j \in \mathbb{R}^{M \times N}$, we first center the neural responses at the origin and re-scale them to unit norm:

$$Z_i = \frac{C X_i}{\|C X_i\|_F} \quad \text{and} \quad Z_j = \frac{C X_j}{\|C X_j\|_F} \tag{1}$$

where $C = I_M - (1/M) 1_M 1_M^T$ is the *centering matrix*. The pre-processed matrices $Z_i$ and $Z_j$ are often called *pre-shapes*. The *shape* of the representation $S_i$ is defined by the set of all rotated and reflected versions of the pre-shape: $S_i = \{Z_i O : O \in \mathcal{O}(N)\}$, where $\mathcal{O}(N)$ is the orthogonal group of rotations and reflections in $N$ dimensions, which is the set of $N \times N$ matrices satisfying $O^T O = O O^T = I_N$ and $\det O = \pm 1$. *Shape space* $\Sigma_N^M$ is the space of all shapes with $M$ landmarks in $\mathbb{R}^N$. Then, the Riemannian shape distance is defined as:

$$\rho(X_i, X_j) = \arccos\Big(\sup_{O \in \mathcal{O}(N)} \text{Tr}[Z_j^\top Z_i O]\Big) \tag{2}$$

which can be interpreted as the distance along the pre-shape hypersphere between $Z_j$ and the aligned representation $Z_i^\ast = Z_i O^\ast$, where $O^\ast$ is the orthogonal matrix achieving the supremum in [eq. 2](https://arxiv.org/html/2605.15306#S3.E2). [Equation 2](https://arxiv.org/html/2605.15306#S3.E2) is an intrinsic geodesic distance through shape space. It is measured in radians, and is the smallest angle between pre-shapes $Z_i$ and $Z_j$ on the hypersphere that can be achieved by $N$-dimensional rotations of the pre-shapes.
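
The two steps above (pre-shape normalization, then optimal orthogonal alignment) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' released code: the function names `preshape` and `shape_distance` are hypothetical, and the sketch relies on the standard fact that the supremum of $\text{Tr}[Z_j^\top Z_i O]$ over orthogonal $O$ equals the sum of singular values of $Z_j^\top Z_i$.

```python
import numpy as np

def preshape(X):
    """Center the M landmarks (rows) and rescale to unit Frobenius norm (eq. 1)."""
    Xc = X - X.mean(axis=0, keepdims=True)  # equivalent to C @ X
    return Xc / np.linalg.norm(Xc)          # Frobenius norm for 2-D input

def shape_distance(Xi, Xj):
    """Riemannian shape distance in radians (eq. 2)."""
    Zi, Zj = preshape(Xi), preshape(Xj)
    # sup over orthogonal O of Tr[Zj^T Zi O] = sum of singular values of Zj^T Zi
    s = np.linalg.svd(Zj.T @ Zi, compute_uv=False)
    return float(np.arccos(np.clip(s.sum(), -1.0, 1.0)))
```

Because of the `arccos`, distances between nearly identical shapes are only accurate to roughly the square root of machine precision, so a translated, rescaled, and rotated copy of a representation yields a value close to, but not exactly, zero.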

The shape distance, $\rho$, is symmetric and satisfies the triangle inequality. Indeed, it is a true metric on the space of shapes (intuitively, a *shape* is the set of equivalent *pre-shapes* that can be achieved by $N$-dimensional orthogonal transformations; see [[14](https://arxiv.org/html/2605.15306#bib.bib12), [9](https://arxiv.org/html/2605.15306#bib.bib16)] for further background), and it defines closed-form geodesic paths between shapes as explained below. At the same time, it embodies an intuitive minimal invariance class (representations are aligned up to rotation, reflection, translation, and global scaling), mirroring other popular methods such as linear CKA while avoiding the tuning and interpretability issues that often arise with more flexible nonlinear metrics. Finally, unlike metrics derived from CKA, the shape space framework enables us to explicitly track how landmarks (i.e. responses to probe images) are displaced along a geodesic path, thereby enhancing human interpretability of representational change. However, the Riemannian shape distance is related to metrics on stimulus-by-stimulus similarity matrices like CKA through its equivalence to the normalized Bures similarity, described in appendix A.

**Shape geodesics and geodesic angles.** Given two pre-shapes $Z_i$ and $Z_j$, we can obtain a geodesic between them parametrized by $t$, $\gamma(t)$, by moving along a great circle on the pre-shape sphere. This curve, written explicitly in appendix A, represents the shortest smooth deformation from one neural representation to another, ignoring overall orientation, scale and reflection.
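
As an illustration of the great-circle construction, the sketch below assumes the standard spherical-interpolation form $\gamma(t) = [\sin((1-t)\rho)\, Z_0 + \sin(t\rho)\, Z_1^\ast] / \sin\rho$ after optimally aligning the second pre-shape to the first. Appendix A is not reproduced here, so this form and the helper names (`preshape`, `align`, `geodesic`) are assumptions, not the paper's exact expression.

```python
import numpy as np

def preshape(X):
    """Center rows and rescale to unit Frobenius norm."""
    Xc = X - X.mean(axis=0, keepdims=True)
    return Xc / np.linalg.norm(Xc)

def align(Z_ref, Z):
    """Return Z @ O with the orthogonal O maximizing Tr[Z_ref^T Z O]."""
    U, _, Vt = np.linalg.svd(Z.T @ Z_ref)
    return Z @ (U @ Vt)

def geodesic(Z0, Z1, t):
    """Point at fraction t in [0, 1] along the great circle from Z0 to aligned Z1."""
    Z1a = align(Z0, Z1)
    rho = np.arccos(np.clip(np.sum(Z0 * Z1a), -1.0, 1.0))
    if rho < 1e-12:  # shapes coincide, nothing to interpolate
        return Z0.copy()
    return (np.sin((1.0 - t) * rho) * Z0 + np.sin(t * rho) * Z1a) / np.sin(rho)
```

Every point returned by `geodesic` stays centered and has unit Frobenius norm, so it is itself a valid pre-shape.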

Let $\gamma_1$ and $\gamma_2$ be two geodesics between a reference pre-shape $Z_0$ and two comparison pre-shapes $Z_1$ and $Z_2$, respectively. Their optimally aligned pre-shapes to the initial pre-shape $Z_0$ are $Z_1^\ast$ and $Z_2^\ast$, and the initial tangent vectors along $\gamma_1$ and $\gamma_2$ are known to be

$$V_i = \frac{\rho_i}{\sin\rho_i}\left(Z_i^\ast - \cos(\rho_i)\, Z_0\right), \quad \text{with } \rho_i = \arccos(\operatorname{Tr} Z_0^\top Z_i^\ast),$$

for $i = 1, 2$. These tangent vectors represent the initial direction of each geodesic path in the shape manifold. The angle between the two shape geodesics is the angle between their tangent vectors at the reference shape:

$$\theta_{1,2} = \angle(\gamma_1, \gamma_2) = \arccos\left(\frac{\langle V_1, V_2\rangle}{\|V_1\|_F \cdot \|V_2\|_F}\right). \tag{3}$$

This angle will be used to quantify the similarity of two representation trajectories from the same initial shape.
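
The tangent vectors and the angle in eq. 3 translate directly into code. The following is a minimal sketch with hypothetical helper names (`preshape`, `align`, `tangent`, `geodesic_angle`); it takes raw representation matrices and performs the normalization and alignment internally.

```python
import numpy as np

def preshape(X):
    """Center rows and rescale to unit Frobenius norm."""
    Xc = X - X.mean(axis=0, keepdims=True)
    return Xc / np.linalg.norm(Xc)

def align(Z_ref, Z):
    """Return Z @ O with the orthogonal O maximizing Tr[Z_ref^T Z O]."""
    U, _, Vt = np.linalg.svd(Z.T @ Z_ref)
    return Z @ (U @ Vt)

def tangent(Z0, Z):
    """Initial tangent vector V at Z0 of the geodesic toward Z."""
    Za = align(Z0, Z)
    rho = np.arccos(np.clip(np.sum(Z0 * Za), -1.0, 1.0))
    if rho < 1e-12:  # identical shapes have no defined direction
        return np.zeros_like(Z0)
    return rho / np.sin(rho) * (Za - np.cos(rho) * Z0)

def geodesic_angle(X0, X1, X2):
    """Angle (radians) at X0 between geodesics toward X1 and X2 (eq. 3)."""
    Z0 = preshape(X0)
    V1, V2 = tangent(Z0, preshape(X1)), tangent(Z0, preshape(X2))
    c = np.sum(V1 * V2) / (np.linalg.norm(V1) * np.linalg.norm(V2))
    return float(np.arccos(np.clip(c, -1.0, 1.0)))
```

Two properties follow from the formula and serve as quick sanity checks: each tangent vector is orthogonal to $Z_0$ under the Frobenius inner product, and its norm equals the shape distance $\rho_i$.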

**Landmark displacement.** We can also investigate *how* representation shapes are deformed as DA is applied at varying levels, by measuring the landmark-level displacement. The displacement for each landmark $\Delta z_\mu \in \mathbb{R}^N$, $\mu \in \{1, \ldots, M\}$, is computed as the Euclidean displacement vector between corresponding landmarks across two optimally aligned pre-shapes ([Fig. 1](https://arxiv.org/html/2605.15306#S3.F1)a). These displacement vectors are the rows of the matrix $\Delta Z = Z_\ell^A - Z_\ell^B O^\ast$, with $O^\ast$ being the matrix that attains the supremum in [eq. 2](https://arxiv.org/html/2605.15306#S3.E2). We may then study which landmarks (network inputs) are maximally or minimally distorted along a step of a trajectory in shape space, and which landmarks are distorted in similar directions.
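
A short sketch of the displacement computation, under the assumption that both representation matrices use the same $M$ probe inputs in the same row order. The function name `landmark_displacements` is hypothetical.

```python
import numpy as np

def preshape(X):
    """Center rows and rescale to unit Frobenius norm."""
    Xc = X - X.mean(axis=0, keepdims=True)
    return Xc / np.linalg.norm(Xc)

def landmark_displacements(XA, XB):
    """Return (Delta Z, per-landmark displacement norms) after aligning B to A."""
    ZA, ZB = preshape(XA), preshape(XB)
    # O* maximizes Tr[ZA^T ZB O]; obtained from the SVD of ZB^T ZA
    U, _, Vt = np.linalg.svd(ZB.T @ ZA)
    dZ = ZA - ZB @ (U @ Vt)
    return dZ, np.linalg.norm(dZ, axis=1)
```

Sorting the returned norms (for example with `np.argsort`) identifies the probe inputs whose responses move most between the two models once nuisance transformations are removed.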

## 4 Experiments

![Refer to caption](https://arxiv.org/html/2605.15306v1/x2.png)

Figure 2: Orthogonal alignment reveals well-behaved DA trajectories in representation shape space that are not present without alignment. Using ResNet-18 layer 2 responses to 10,000 CIFAR-10 test set images from models trained with the same seed, pairwise representation distances show coherent organization across augmentation strengths only when the alignment step is included. (a, b) We compare additive Gaussian noise, parameterized by noise level $\sigma$, with patch Gaussian, parameterized by a cutout width $C$ and within-patch noise variance $s$. (c, d) Distance matrices compare aligned and unaligned shape distances for Gaussian noise and patch Gaussian data augmentations at various strengths. (e, f) MDS-PCA visualizations of these distances reveal smooth, ordered trajectories in shape space, and for patch Gaussian, an organized two-parameter surface over ($C, s$), which disappear when orthogonal alignment is not performed. (g, h) True vs. predicted augmentation hyperparameters from leave-one-out ridge regression on aligned representation shapes of clean probe images. Points near the diagonal indicate accurate decoding of augmentation strength from shape.

To demonstrate this framework, we studied image representations formed by various vision models trained on CIFAR-10 image classification (with ImageNet [[5](https://arxiv.org/html/2605.15306#bib.bib1)] fine-tuning examples given in appendix C). A spectrum of DA methods and magnitudes (described below) were applied during training. After training, we collected hidden layer responses to $M = 10{,}000$ clean (un-augmented) probe images from the test set, and used these as the basis for shape distance calculations. Full training details are listed in appendix B.

**DA steers representations along well-behaved trajectories in shape space, but not Euclidean space.** How necessary is the alignment step, and the shape space framework generally? One might imagine that orthogonal alignments are not necessary when comparing representations across two identical networks that are trained and initialized with the same random seed, and only differ by a small amount of DA. To investigate this, we trained ResNet-18 [[12](https://arxiv.org/html/2605.15306#bib.bib37)] models with Gaussian noise data augmentation governed by noise standard deviation parameter $\sigma$, as well as with patch Gaussian data augmentation with standard deviation $s$ over a $C \times C$ square patch of pixels [[19](https://arxiv.org/html/2605.15306#bib.bib14)]. When $C$ equals the width and height of the input images the method reduces to additive Gaussian noise ([Fig. 2](https://arxiv.org/html/2605.15306#S4.F2)a), while for large $s$, it resembles “CutOut” augmentation [[6](https://arxiv.org/html/2605.15306#bib.bib10)] with patch size $C$ ([Fig. 2](https://arxiv.org/html/2605.15306#S4.F2)b).
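
To make the two hyperparameters concrete, here is a rough sketch of a patch Gaussian transform. Several details are assumptions made only for illustration and may differ from the training setup in appendix B: pixel values are taken to lie in $[0, 1]$, the patch is placed uniformly at random so that it fits entirely inside the image, and the result is clipped back to $[0, 1]$. The function name `patch_gaussian` is hypothetical.

```python
import numpy as np

def patch_gaussian(img, C, s, rng):
    """Add N(0, s^2) noise inside a random C x C patch of an H x W x channels image."""
    H, W = img.shape[:2]
    C = min(C, H, W)                      # assumption: patch never exceeds the image
    y0 = rng.integers(0, H - C + 1)       # top-left corner, patch fully inside
    x0 = rng.integers(0, W - C + 1)
    out = img.astype(np.float64, copy=True)
    patch = out[y0:y0 + C, x0:x0 + C]     # view into `out`
    patch += rng.normal(0.0, s, size=patch.shape)
    return np.clip(out, 0.0, 1.0)         # assumption: pixel range is [0, 1]
```

With this placement rule, setting `C` to the image width perturbs every pixel, matching the limiting case of additive Gaussian noise described above, while a large `s` saturates the patch and so resembles an occlusion.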

Whether we restrict focus to the 1D space of Gaussian noise over the full image ([Fig. 2](https://arxiv.org/html/2605.15306#S4.F2)a, c, e, g) or the 2D space of patch Gaussian augmentations of varying patch sizes ([Fig. 2](https://arxiv.org/html/2605.15306#S4.F2)b, d, f, h), we observe consistent relational structure corresponding to DA parameters only when using the shape space framework. This is visible in distance matrices for representations collected after layer 2 of networks trained at different augmentation levels ([Fig. 2](https://arxiv.org/html/2605.15306#S4.F2)c, d). Each $(i, j)$ element in the distance matrix is equal to the shape distance between representations $i$ and $j$ trained with different DA hyperparameters, with (left) and without (right) the alignment step in the definition [eq. 2](https://arxiv.org/html/2605.15306#S3.E2). To visualize this relational structure, we use the MDS-PCA procedure of [[27](https://arxiv.org/html/2605.15306#bib.bib25)], using multidimensional scaling to find points in a 200-dimensional Euclidean space with pairwise distances that well-approximate the shape distances, and plotting the leading PCA axes. [Fig. 2](https://arxiv.org/html/2605.15306#S4.F2)e, f show that, for both DA methods, the organization of shapes by DA hyperparameter is destroyed when orthogonal alignment is not performed. [Fig. 1](https://arxiv.org/html/2605.15306#S3.F1) and [Fig. A-4](https://arxiv.org/html/2605.15306#A3.F4) replicate this experiment using the augmentation pairs CutOut with random crop and rotation with shear, and different geometric patterns can be observed across DA methods. [Fig. A-2](https://arxiv.org/html/2605.15306#A3.F2) replicates this experiment with pretrained ResNet18 [[28](https://arxiv.org/html/2605.15306#bib.bib2)] fine-tuned using the ImageNet dataset [[5](https://arxiv.org/html/2605.15306#bib.bib1)].
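
The MDS-PCA visualization step can be approximated as follows. The text does not say which MDS variant is used, so this sketch substitutes classical (Torgerson) MDS, which has a closed form; the function names `classical_mds` and `mds_pca` are hypothetical, and an iterative metric MDS solver could be swapped in for the first stage.

```python
import numpy as np

def classical_mds(D, k):
    """Embed n points in k dimensions from an n x n distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centered Gram matrix
    w, V = np.linalg.eigh(B)                     # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]                # keep the k largest
    return V[:, idx] * np.sqrt(np.clip(w[idx], 0.0, None))

def mds_pca(D, k=200, n_plot=2):
    """MDS into (at most) k dimensions, then project onto leading PCA axes."""
    Y = classical_mds(D, min(k, D.shape[0] - 1))
    Yc = Y - Y.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(Yc, full_matrices=False)
    return Yc @ Vt[:n_plot].T
```

Passing a matrix of pairwise shape distances as `D` returns one low-dimensional point per trained model, which can then be plotted and connected in order of augmentation strength.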

To quantitatively measure how regularly augmentation steers representations through shape space, we investigated how well the DA hyperparameter used during training for a held-out representation could be predicted linearly using the representations of the clean probe inputs alone. We first aligned each representation shape to the un-augmented shape by the optimal orthogonal transformation and then flattened the aligned shapes into vectors. We fit a ridge regressor to these vectors to predict the DA hyperparameter used to train each model, and evaluated prediction with leave-one-shape-out cross-validation, fitting on the remaining aligned shapes, and predicting the held-out hyperparameter value. Geometrically, this can be viewed as a first-order approximation to tangent-space regression at the un-augmented representation shape. We find that the Gaussian noise and patch size hyperparameters for held-out representations can be well-predicted linearly from representations trained with those augmentations ([Fig. 2](https://arxiv.org/html/2605.15306#S4.F2)g, h), indicating that representations are not scattered arbitrarily in shape space; they are organized in a systematic way with respect to the DA parameters, but only when the orthogonal alignment step is first performed.
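
The decoding analysis can be sketched as below. The ridge penalty `lam`, the use of the dual (kernel) form of ridge regression, and the function names are assumptions chosen for a compact NumPy-only example; the paper's regularization strength and implementation are not specified in this section.

```python
import numpy as np

def preshape(X):
    """Center rows and rescale to unit Frobenius norm."""
    Xc = X - X.mean(axis=0, keepdims=True)
    return Xc / np.linalg.norm(Xc)

def align(Z_ref, Z):
    """Return Z @ O with the orthogonal O maximizing Tr[Z_ref^T Z O]."""
    U, _, Vt = np.linalg.svd(Z.T @ Z_ref)
    return Z @ (U @ Vt)

def loo_ridge_predictions(shapes, params, ref, lam=1e-3):
    """Leave-one-shape-out ridge predictions of the DA hyperparameter."""
    Z0 = preshape(ref)
    # Align every shape to the un-augmented reference, then flatten to a vector
    feats = np.stack([align(Z0, preshape(X)).ravel() for X in shapes])
    y = np.asarray(params, dtype=float)
    preds = np.empty_like(y)
    for k in range(len(y)):
        tr = np.arange(len(y)) != k               # hold out shape k
        mu_x, mu_y = feats[tr].mean(axis=0), y[tr].mean()
        A = feats[tr] - mu_x
        # Dual ridge solution: cheap when features far outnumber models
        alpha = np.linalg.solve(A @ A.T + lam * np.eye(int(tr.sum())), y[tr] - mu_y)
        preds[k] = (feats[k] - mu_x) @ (A.T @ alpha) + mu_y
    return preds
```

Plotting `preds` against `params` reproduces the style of diagnostic in Fig. 2g, h: points near the diagonal mean the hyperparameter is linearly decodable from the aligned shape.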

![Refer to caption](https://arxiv.org/html/2605.15306v1/x3.png)

Figure 3: Augmentation-driven representation changes in shape space can be compared directly with seed-to-seed variability. Patch Gaussian DA is applied during training of ResNet-18 (top) and ViT (bottom) for a grid of ($C, s$) values. (a) Cross-seed distance matrices show repeated within-seed structure. (b) MDS-PCA suggests similar augmentation trajectories across three random seeds. (c) $D_{\text{aug}}$ and $D_{\text{seed}}$ averaged over three pairs of random seeds, with error bars corresponding to the standard deviations, as Gaussian noise DA strength $s$ is increased at fixed patch size of $C = 12$ (top plot) and $C = 24$ (bottom plot). Divergence of these two curves indicates when changes due to the Gaussian noise parameter exceed those due to random seed.

**The magnitude of representational shape change due to random seed gives a natural scale to evaluate the size of changes due to DA.** How sensitive are representations to small changes in DA, when compared with the effect of changing the random seed used for initialization? In light of recently observed “butterfly effects” in which small perturbations early in training lead to training trajectories that drive models into distinct loss minima [[1](https://arxiv.org/html/2605.15306#bib.bib34)], we are interested in understanding how the effect of random initialization compares with small changes in DA hyperparameter. The shape metric space framework gives us a natural way to compare the scales of these two sources of variability in terms of the Riemannian shape distance between their learned representations. The distance matrices in [Fig. 3](https://arxiv.org/html/2605.15306#S4.F3)(a) show the Riemannian shape distance between learned representations from the penultimate layer of ResNet-18 and ViT trained with DA method patch Gaussian over a range of ($C, s$) values, for three different random seeds 42, 43, 44. Each distance matrix shows repeated off-diagonal blocks corresponding to the three random seeds, suggesting that the representations share similar within-seed structure as the Gaussian noise variance and patch size are varied. This can be visualized in the MDS-PCA plots in [Fig. 3](https://arxiv.org/html/2605.15306#S4.F3)(b), where trajectories corresponding to changing the DA hyperparameter values at a fixed random seed are shown connected by dotted lines. From the MDS-PCA visualizations, it appears that there is a transition point between a regime in which the scale of representational change due to these data augmentation methods is similar to that induced by changing the random seed (small $C$, small $s$ regime), to a regime where the scale of change due to the data augmentation is larger than the changes due to random seed. To quantify this, we define two measured distance scales:

$$D_{\text{aug}} = \frac{1}{2}\left(\rho(X_0^{(i)}, X_p^{(i)}) + \rho(X_0^{(j)}, X_p^{(j)})\right), \qquad D_{\text{seed}} = \frac{1}{2}\left(\rho(X_0^{(i)}, X_0^{(j)}) + \rho(X_p^{(i)}, X_p^{(j)})\right) \tag{4}$$

where $X_0^{(i)}$ is a representation matrix learned without DA from random seed $i$, and $X_p^{(i)}$ is a representation matrix learned with a DA hyperparameter $p$. If $D_{\text{aug}} > D_{\text{seed}}$, this indicates that the scale of representation shape changes due to the change in data augmentation hyperparameter tends to be larger than changes due to random seed. As the data augmentation hyperparameter approaches $0$, $D_{\text{aug}} = D_{\text{seed}}$. The value of $p$ at which these two distance scales diverge from one another measures how large a DA parameter change is required before the resulting magnitude of representation shape change is distinguishable from a shape change due to random seed. In [Fig. 3](https://arxiv.org/html/2605.15306#S4.F3)(c) we compute $D_{\text{aug}}$ and $D_{\text{seed}}$ averaged over three pairs of random seeds for both ResNet-18 and ViT as $s$ is varied at constant $C$. We find that these distance scales indeed separate around $s = 0.2$, providing a sense of “natural units” for this hyperparameter.
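
For a single pair of seeds, the two scales in eq. 4 can be computed as in the sketch below, which follows the equation as written. The function names are hypothetical, and averaging over several seed pairs (as in Fig. 3c) would simply repeat the call for each pair.

```python
import numpy as np

def preshape(X):
    """Center rows and rescale to unit Frobenius norm."""
    Xc = X - X.mean(axis=0, keepdims=True)
    return Xc / np.linalg.norm(Xc)

def shape_distance(Xi, Xj):
    """Riemannian shape distance in radians (eq. 2)."""
    s = np.linalg.svd(preshape(Xj).T @ preshape(Xi), compute_uv=False)
    return float(np.arccos(np.clip(s.sum(), -1.0, 1.0)))

def d_aug_d_seed(X0_i, Xp_i, X0_j, Xp_j):
    """Return (D_aug, D_seed) for seeds i and j at DA hyperparameter p (eq. 4)."""
    d_aug = 0.5 * (shape_distance(X0_i, Xp_i) + shape_distance(X0_j, Xp_j))
    d_seed = 0.5 * (shape_distance(X0_i, X0_j) + shape_distance(Xp_i, Xp_j))
    return d_aug, d_seed
```

Evaluating this over a sweep of `p` and plotting both values gives the pair of curves whose separation point is discussed above.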

![Refer to caption](https://arxiv.org/html/2605.15306v1/x4.png)Figure 4:Different augmentation types produce distinct representational changes, whose diversity predicts ensemble gains\.\(a\) MDS\-PCA embedding of representation shapes for ResNet18 layer 2\.1 trained with 8 distinct augmentation types applied individually\. Each trajectory connected with dotted lines corresponds to a single augmentation method applied at different magnitudes listed in Table[2](https://arxiv.org/html/2605.15306#A2.T2)\. Increasing augmentation strength moves the model representations away from the representation shape measured after training with no data augmentation \(black circle\)\. \(b\) Heatmap of pairwise geodesic angles between the augmentation trajectories, measured with the un\-augmented model as the vertex as in[Fig\.˜1](https://arxiv.org/html/2605.15306#S3.F1)\(c\) Improvement in test accuracy of ensembled models over average accuracy of model ensemble\. Models trained separately with different augmentation methods, then ensembled pairwise across methods\. \(d\) Scatter plot showing the correlation between geodesic angle and the improvement in ensemble accuracy across pairs of augmentations, suggesting that geometric diversity is related with larger ensemble gains\. \(e,f,g\) Analogous MDS\-PCA embedding and geodesic angle measurements for AvgPool layer corresponding to panels \(a,b,d\)\.Different data augmentation methods move representations in different directions in shape space, and the angle between trajectories is predictive of model ensembling gains\.We now study the effect of classic DA methods on representation shape geometry when applied individually\. 
The embedding of representations into shape space established in [Section 3](https://arxiv.org/html/2605.15306#S3) allows us to study geometric properties of representation *trajectories* through shape space as a function of DA hyperparameter. [Fig. 4](https://arxiv.org/html/2605.15306#S4.F4)(a,e) show MDS-PCA visualizations of these trajectories for eight different DA methods in ResNet-18, each applied at four different magnitudes. Dotted lines connect points of sequential augmentation hyperparameter magnitude, listed in Table [2](https://arxiv.org/html/2605.15306#A2.T2) of Appendix B, starting from the black point corresponding to no DA. Panel (a) shows trajectories of representation shapes collected after layer 2, while panel (e) repeats the same analysis on the penultimate avgpool layer. The mean geodesic angle between two trajectories is computed as the angle, in the tangent space of the un-augmented shape, formed by the geodesics from the un-augmented shape to steps along the two DA trajectories ([Equation 3](https://arxiv.org/html/2605.15306#S3.E3)), averaged over all steps in the trajectories. [Fig. 4](https://arxiv.org/html/2605.15306#S4.F4)(b) and (f) show heatmaps of the mean geodesic angles measured between all DA trajectories for layer 2 (b) and avgpool (f). Some clustering by DA type (geometric, photometric, noise, occlusion) is observed in both the MDS-PCA trajectory visualizations and the geodesic angles.
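One way to picture the mean geodesic angle is the sketch below. It rests on several assumptions that are not confirmed by the text here: tangent vectors are obtained with the standard log map on the pre-shape sphere after Procrustes alignment to the un-augmented shape, the angle is the usual arccos of the normalized inner product, and "averaged over all steps" is read as averaging over matched steps of the two trajectories. All function names are hypothetical and the trajectories are synthetic.

```python
import numpy as np

def preshape(X):
    """Assumed pre-shape: center over probe inputs (rows), scale to unit norm."""
    Xc = X - X.mean(axis=0, keepdims=True)
    return Xc / np.linalg.norm(Xc)

def log_map(Z0, Z):
    """Tangent vector at Z0 along the geodesic toward the shape of Z.
    Undefined when Z has the same shape as Z0 (zero-length geodesic)."""
    U, S, Vt = np.linalg.svd(Z0.T @ Z)
    Z_aligned = Z @ (Vt.T @ U.T)            # best rotation/reflection of Z onto Z0
    cos_rho = np.clip(S.sum(), -1.0, 1.0)
    v = Z_aligned - cos_rho * Z0            # part of Z_aligned orthogonal to Z0
    return np.arccos(cos_rho) * v / np.linalg.norm(v)

def mean_geodesic_angle(X0, traj_a, traj_b):
    """Mean angle in degrees between matched steps of two trajectories."""
    Z0 = preshape(X0)
    angles = []
    for Xa, Xb in zip(traj_a, traj_b):
        va = log_map(Z0, preshape(Xa))
        vb = log_map(Z0, preshape(Xb))
        c = np.sum(va * vb) / (np.linalg.norm(va) * np.linalg.norm(vb))
        angles.append(np.degrees(np.arccos(np.clip(c, -1.0, 1.0))))
    return float(np.mean(angles))

# Synthetic trajectories leaving X0 in two independent random directions.
rng = np.random.default_rng(0)
X0 = rng.normal(size=(100, 5))
dir_a, dir_b = rng.normal(size=(2, 100, 5))
steps = [0.1, 0.2, 0.3, 0.4]
traj_a = [X0 + s * dir_a for s in steps]
traj_b = [X0 + s * dir_b for s in steps]
angle_ab = mean_geodesic_angle(X0, traj_a, traj_b)   # independent directions
angle_aa = mean_geodesic_angle(X0, traj_a, traj_a)   # a trajectory with itself
```

Because the two synthetic directions are independent draws in a high-dimensional space, `angle_ab` lands near 90 degrees, while a trajectory compared with itself gives an angle of essentially zero.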

It is important to note that the dimensionality of shape space is known [[9](https://arxiv.org/html/2605.15306#bib.bib16)] to be $N(M-1)-1-N(N-1)/2\approx 10^{6}$ for our deep neural network representations, even after dimensionality reduction was performed on the activations (Appendix B). In this high-dimensional space, two randomly chosen directions in the tangent space of a particular shape are highly likely to form an angle very close to $90^{\circ}$, and chance observations of significantly acute angles are overwhelmingly rare [[25](https://arxiv.org/html/2605.15306#bib.bib15)], suggesting that the measured angles between $\sim 60^{\circ}$ and $\sim 81^{\circ}$ could indicate non-trivial similarity between trajectory types. As a practical test of this idea, we measure the test accuracy of ensembles of pairs of DA methods. Each ensemble consists of two sets of trained models, $\mathcal{M}_{A}$ and $\mathcal{M}_{B}$, taken from two DA trajectories in [Fig. 4](https://arxiv.org/html/2605.15306#S4.F4)(a). Models are combined using soft voting by averaging their predicted class probabilities:

$$p_{\mathrm{ens}}(y\mid x;A,B)=\frac{1}{|\mathcal{M}_{A}|+|\mathcal{M}_{B}|}\sum_{f\in\mathcal{M}_{A}\cup\mathcal{M}_{B}}p_{f}(y\mid x),$$

where $p_{f}(y\mid x)=\operatorname{softmax}(z_{f}(x))_{y}$ for model $f$. We quantify the benefit of ensembling using the improvement of the ensembled classification accuracy over the average constituent model accuracy,

$$\Delta\mathrm{Acc}(A,B)=\mathrm{Acc}_{\mathrm{ens}}(A,B)-\frac{1}{|\mathcal{M}_{A}|+|\mathcal{M}_{B}|}\sum_{f\in\mathcal{M}_{A}\cup\mathcal{M}_{B}}\mathrm{Acc}(f).$$

We find that ensemble gains of DA pairs correlate with the mean geodesic angle between their trajectories ([Fig. 4](https://arxiv.org/html/2605.15306#S4.F4)(d,g)), suggesting that improvements from ensembling are connected with the geometric diversity of the learned representations of the individual models.
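The soft-voting ensemble and the accuracy gain defined above can be written compactly as follows. This is a small sketch with hypothetical names (`ensemble_gain`, `logits_A`, `logits_B`); the logits are hand-picked toy values, not results from the paper.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_gain(logits_A, logits_B, labels):
    """Delta Acc: soft-voting ensemble accuracy minus mean member accuracy.

    logits_A, logits_B: lists of (num_examples, num_classes) arrays,
    one array per trained model in each set.
    """
    all_logits = list(logits_A) + list(logits_B)
    probs = np.stack([softmax(z) for z in all_logits])   # (models, examples, classes)
    p_ens = probs.mean(axis=0)                            # average class probabilities
    acc_ens = np.mean(p_ens.argmax(axis=1) == labels)
    acc_members = [np.mean(p.argmax(axis=1) == labels) for p in probs]
    return float(acc_ens - np.mean(acc_members))

# Toy example: two models that each get a different example wrong.
labels = np.array([0, 1, 0])
model_a = np.array([[2.0, 0.0], [0.0, 2.0], [0.0, 1.0]])   # wrong on example 3
model_b = np.array([[2.0, 0.0], [1.0, 0.0], [3.0, 0.0]])   # wrong on example 2
gain = ensemble_gain([model_a], [model_b], labels)
```

Each toy model is correct on two of three examples, but their errors do not overlap and each is confident where the other is wrong, so the averaged probabilities classify all three correctly and the gain is 1/3.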

![Refer to caption](https://arxiv.org/html/2605.15306v1/figures/fig5_v3.png)
Figure 5: DA produces layer-dependent landmark distortions. Landmark displacement analysis is performed comparing ResNet-18 representations trained without DA with those trained with grayscale DA for layer 2 (top) and avgpool (bottom). (a,e) Histograms of landmark displacement magnitudes across probe images; (b,f) 25 minimally displaced landmarks; (c,g) maximally displaced landmarks, split into those contracting toward the origin and those expanding away from the origin; (d,h) first two principal components of the row-normalized displacement matrix $\Delta Z$.

Representation shape changes due to DA can be inspected with landmark displacements, which show layer-dependent structure. Over a step along a representation shape trajectory, each landmark (probe input) $\mu=1,\dots,M$ is displaced by $\Delta z_{\mu}\in\mathbb{R}^{N}$ ([Fig. 1](https://arxiv.org/html/2605.15306#S3.F1)), whose magnitude and direction encode which aspects of the representation are distorted after rotations, reflections and scalings are removed. This serves as a window into the layer-by-layer mechanism by which the DA process reshapes learned representations.

[Fig. 5](https://arxiv.org/html/2605.15306#S4.F5) studies properties of the landmark displacements for ResNet-18, comparing an un-augmented representation to those resulting from training from the same random seed but with a small amount of "grayscale" DA (3% chance of image grayscaling during training). Panels in the top row correspond to representations after layer 2, while the bottom row shows the same analysis for the penultimate avgpool representations. The distributions of displacement magnitudes show that in both layers a small number of landmarks are moved significantly more or less than the average displacement ([Fig. 5](https://arxiv.org/html/2605.15306#S4.F5)(a,e)). Image tiles in [Fig. 5](https://arxiv.org/html/2605.15306#S4.F5)(b, c, f, and g) (see also [Fig. A-5](https://arxiv.org/html/2605.15306#A3.F5), [Fig. A-6](https://arxiv.org/html/2605.15306#A3.F6) and [Fig. A-7](https://arxiv.org/html/2605.15306#A3.F7)) show the maximally and minimally moved landmarks from the distributions in panels (a) and (e), with the maximally displaced landmarks divided into those that move toward the origin ("contracted" images) and those that move away from the origin ("expanded" images). These expanding and contracting landmarks are of potential interest because they indicate sets of images whose features at this stage of the model are attenuated or amplified by the particular data augmentation used during training. [Fig. 5](https://arxiv.org/html/2605.15306#S4.F5)(d) and (h) show the top two principal components of the row-normalized landmark displacement matrix $\Delta Z$, as a way of probing whether groups of landmarks tend to be displaced in similar directions.
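The quantities discussed above can be illustrated with a short NumPy sketch. It assumes, as a simplification that may differ from the paper's exact definition, that $\Delta Z$ is the difference between the Procrustes-aligned augmented pre-shape and the un-augmented pre-shape, with one row per landmark. The function names are hypothetical and the inputs are synthetic.

```python
import numpy as np

def preshape(X):
    """Assumed pre-shape: center over probe inputs (rows), scale to unit norm."""
    Xc = X - X.mean(axis=0, keepdims=True)
    return Xc / np.linalg.norm(Xc)

def landmark_displacements(X0, Xp):
    """Per-landmark displacement after removing translation, scale, rotation/reflection."""
    Z0, Zp = preshape(X0), preshape(Xp)
    U, _, Vt = np.linalg.svd(Z0.T @ Zp)
    Zp_aligned = Zp @ (Vt.T @ U.T)                   # align Zp onto Z0
    dZ = Zp_aligned - Z0                             # one displacement row per landmark
    magnitude = np.linalg.norm(dZ, axis=1)
    # Positive: landmark moved away from the origin ("expanded"); negative: "contracted".
    radial_change = np.linalg.norm(Zp_aligned, axis=1) - np.linalg.norm(Z0, axis=1)
    return dZ, magnitude, radial_change

def top_two_pcs(dZ):
    """PCA scores of the row-normalized displacement matrix and variance explained."""
    rows = dZ / np.linalg.norm(dZ, axis=1, keepdims=True)
    rows = rows - rows.mean(axis=0, keepdims=True)
    _, S, Vt = np.linalg.svd(rows, full_matrices=False)
    scores = rows @ Vt[:2].T
    explained = float((S[:2] ** 2).sum() / (S ** 2).sum())
    return scores, explained

rng = np.random.default_rng(0)
X0 = rng.normal(size=(100, 6))                       # synthetic "no DA" representation
Xp = X0 + 0.05 * rng.normal(size=(100, 6))           # synthetic "with DA" representation
dZ, magnitude, radial_change = landmark_displacements(X0, Xp)
most_moved = np.argsort(magnitude)[-25:]             # indices of 25 most displaced landmarks
least_moved = np.argsort(magnitude)[:25]             # indices of 25 least displaced landmarks
scores, explained = top_two_pcs(dZ)
```

Sorting `magnitude` gives the minimally and maximally displaced landmarks, and the sign of `radial_change` separates contracting from expanding ones.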

From[Fig\.˜5](https://arxiv.org/html/2605.15306#S4.F5)we can see that in the avgpool layer, while the landmark displacements are lower dimensional in terms of variance explained by the top two principal components \(h\), patterns in the maximally or minimally displaced landmarks or their directions are not obvious\. However, in layer 2, the most affected landmarks seem related to each other in terms of lower\-level image features such as bright background colors and diagonal edges\. This, together with additional examples for different augmentation methods presented in appendix C, suggests that the effect of DA on representation is both layer and DA method dependent\. We conjecture that the properties and patterns of the landmark distortions could help elucidate the mechanisms by which DA or other hyperparameters deform representation shape layer by layer, and this would be an exciting avenue for future work\.

## 5 Discussion

Limitations. This work is limited to image classification, classical DA methods, and post-hoc analysis of a fixed sample of unaltered probe images from the test set; extending the approach to other modalities, larger models, learned augmentation policies, and training dynamics is left to future work. The Riemannian shape distance is also expensive to compute for high-dimensional convolutional representations, so in practice the representation matrices often need to be reduced to a manageable dimensionality while preserving a large fraction of their variance (see Appendix B).

Beyond the specific case of DA, this work suggests broader uses for shape space analysis as a general tool for studying how training choices steer learned functions\. This framework offers a common language for comparing models trained with different optimizers, regularizers, architectures, supervision, or fine\-tuning protocols, using tools from Riemannian geometry\. Rather than asking only whether two settings achieve similar accuracy, one can ask whether they move representations along similar paths, or whether small hyperparameter adjustments trigger abrupt transitions into new representational regimes\. In this sense, shape space geometry could become useful for hyperparameter search, augmentation policy design, principled ensemble construction, or understanding representational changes due to fine\-tuning, especially in settings where standard validation metrics are too coarse to distinguish between many near\-equivalent training choices\.

## References

- [1] (2025) The butterfly effect: neural network training trajectories are highly sensitive to initial conditions. In Forty-second International Conference on Machine Learning. External Links: [Link](https://openreview.net/forum?id=L1Bm396P0X)
- [2] C. M. Bishop (1995) Training with noise is equivalent to Tikhonov regularization. Neural Computation 7(1), pp. 108–116. External Links: [Document](https://dx.doi.org/10.1162/neco.1995.7.1.108), [Link](https://doi.org/10.1162/neco.1995.7.1.108)
- [3] S. Chen, E. Dobriban, and J. H. Lee (2020) A group-theoretic framework for data augmentation. Journal of Machine Learning Research 21(245), pp. 1–71.
- [4] T. Dao, A. Gu, A. J. Ratner, V. Smith, C. D. Sa, and C. Ré (2019) A kernel theory of modern data augmentation. External Links: 1803.06084, [Link](https://arxiv.org/abs/1803.06084)
- [5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
- [6] T. DeVries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.
- [7] J. J. DiCarlo and D. D. Cox (2007) Untangling invariant object recognition. Trends in Cognitive Sciences 11(8), pp. 333–341.
- [8] F. Ding, J. Denain, and J. Steinhardt (2021) Grounding representation similarity through statistical testing. Advances in Neural Information Processing Systems 34, pp. 1556–1568.
- [9] I. L. Dryden and K. V. Mardia (2016) Statistical shape analysis: with applications in R. 2nd edition, John Wiley & Sons. External Links: ISBN 978-0-470-69962-1
- [10] S. E. Harvey, D. Lipshutz, and A. H. Williams (2024-14 Dec) What representational similarity measures imply about decodable information. In Proceedings of UniReps: the Second Edition of the Workshop on Unifying Representations in Neural Models, M. Fumero, C. Domine, Z. Lähner, D. Crisostomi, L. Moschella, and K. Stachenfeld (Eds.), Proceedings of Machine Learning Research, Vol. 285, pp. 140–151.
- [11] S. E. Harvey, B. W. Larsen, and A. H. Williams (2024-15 Dec) Duality of Bures and shape distances with implications for comparing neural representations. In Proceedings of UniReps: the First Workshop on Unifying Representations in Neural Models, M. Fumero, E. Rodolá, C. Domine, F. Locatello, K. Dziugaite, and C. Mathilde (Eds.), Proceedings of Machine Learning Research, Vol. 243, pp. 11–26. External Links: [Link](https://proceedings.mlr.press/v243/harvey24a.html)
- [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- [13] D. G. Kendall (1977) The diffusion of shape. Adv. Appl. Probab. 9(3), pp. 428–430.
- [14] D. G. Kendall, D. Barden, T. K. Carne, and H. Le (2009) Shape and shape theory. John Wiley & Sons.
- [15] M. Klabunde, T. Schumacher, M. Strohmaier, and F. Lemmerich (2025) Similarity of neural network models: a survey of functional and representational measures. ACM Computing Surveys 57(9), pp. 1–52.
- [16] S. Kornblith, M. Norouzi, H. Lee, and G. Hinton (2019) Similarity of neural network representations revisited. External Links: 1905.00414, [Link](https://arxiv.org/abs/1905.00414)
- [17] R. D. Lange, D. Kwok, J. K. Matelsky, X. Wang, D. Rolnick, and K. Kording (2023-28 Jul) Deep networks as paths on the manifold of neural representations. In Proceedings of 2nd Annual Workshop on Topology, Algebra, and Geometry in Machine Learning (TAG-ML), T. Doster, T. Emerson, H. Kvinge, N. Miolane, M. Papillon, B. Rieck, and S. Sanborn (Eds.), Proceedings of Machine Learning Research, Vol. 221, pp. 102–133. External Links: [Link](https://proceedings.mlr.press/v221/lange23a.html)
- [18] K. Lenc and A. Vedaldi (2015-06) Understanding image representations by measuring their equivariance and equivalence. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- [19] R. G. Lopes, D. Yin, B. Poole, J. Gilmer, and E. D. Cubuk (2019) Improving robustness without sacrificing accuracy with patch Gaussian augmentation. arXiv preprint arXiv:1906.02611.
- [20] T. Nguyen, M. Raghu, and S. Kornblith (2020) Do wide and deep networks learn the same things? Uncovering how neural network representations vary with width and depth. CoRR abs/2010.15327. External Links: [Link](https://arxiv.org/abs/2010.15327), 2010.15327
- [21] B. Olshausen, C. Anderson, and D. Van Essen (1993) A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information. Journal of Neuroscience 13(11), pp. 4700–4719. External Links: [Document](https://dx.doi.org/10.1523/JNEUROSCI.13-11-04700.1993), ISSN 0270-6474, [Link](https://www.jneurosci.org/content/13/11/4700), https://www.jneurosci.org/content/13/11/4700.full.pdf
- [22] M. Riesenhuber and T. Poggio (1999) Hierarchical models of object recognition in cortex. Nature Neuroscience 2(11), pp. 1019–1025.
- [23] C. Shorten and T. Khoshgoftaar (2019-07) A survey on image data augmentation for deep learning. Journal of Big Data 6. External Links: [Document](https://dx.doi.org/10.1186/s40537-019-0197-0)
- [24] L. N. Smith (2018) A disciplined approach to neural network hyper-parameters: part 1 – learning rate, batch size, momentum, and weight decay. CoRR abs/1803.09820. External Links: [Link](https://arxiv.org/abs/1803.09820)
- [25] R. Vershynin (2018) High-dimensional probability: an introduction with applications in data science. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press. External Links: ISBN 9781108415194
- [26] Z. Wang, P. Wang, K. Liu, P. Wang, Y. Fu, C. Lu, C. C. Aggarwal, J. Pei, and Y. Zhou (2025) A comprehensive survey on data augmentation. External Links: 2405.09591, [Link](https://arxiv.org/abs/2405.09591)
- [27] A. H. Williams, E. Kunz, S. Kornblith, and S. Linderman (2021) Generalized shape metrics on neural representations. In Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. S. Liang, and J. W. Vaughan (Eds.), Vol. 34, pp. 4738–4750.
- [28] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. arXiv preprint arXiv:1605.07146.

## Appendix A Riemannian shape distance details

The solution to the optimization problem in [Equation 2](https://arxiv.org/html/2605.15306#S3.E2) is available in closed form as

$$\rho(X_{i},X_{j})=\arccos\Big(\sum_{k=1}^{N}\sigma_{k}\Big)\tag{5}$$

where $Z_{j}^{\top}Z_{i}=U\Sigma V^{\top}$ is the singular value decomposition of $Z_{j}^{\top}Z_{i}$ and $\sigma_{k}$ are the diagonal elements of $\Sigma$.
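A minimal NumPy sketch of this closed form is given below, with a check of the invariances the metric is meant to have. The pre-shape $Z$ is assumed here to be the column-centered representation scaled to unit Frobenius norm; the function names are illustrative and the data are random.

```python
import numpy as np

def preshape(X):
    """Assumed pre-shape: center over probe inputs (rows), scale to unit norm."""
    Xc = X - X.mean(axis=0, keepdims=True)
    return Xc / np.linalg.norm(Xc)

def shape_distance(Xi, Xj):
    """Equation 5: arccos of the sum of singular values of Zj^T Zi."""
    Zi, Zj = preshape(Xi), preshape(Xj)
    sigma = np.linalg.svd(Zj.T @ Zi, compute_uv=False)
    # Clip guards against round-off pushing the sum slightly above 1.
    return float(np.arccos(np.clip(sigma.sum(), -1.0, 1.0)))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))     # random orthogonal matrix
X_moved = 3.0 * X @ Q + 1.5                       # rotate/reflect, rescale, translate
Y = rng.normal(size=(50, 4))                      # unrelated representation
d_same = shape_distance(X, X_moved)
d_diff = shape_distance(X, Y)
```

Here `d_same` is zero up to floating-point error, since `X_moved` differs from `X` only by transformations the metric ignores, while `d_diff` is strictly positive.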

Shape geodesics\.

Given two pre-shapes $Z_{i}$ and $Z_{j}$, we can obtain a geodesic between them by moving along a great circle on the pre-shape sphere. Concretely, if $O^{*}$ is the orthogonal transformation achieving the supremum in [eq. 2](https://arxiv.org/html/2605.15306#S3.E2), then for any value of $t\in[0,1]$,

$$\gamma(t)=\frac{\sin((1-t)\rho)}{\sin\rho}\,Z_{j}+\frac{\sin(t\rho)}{\sin\rho}\,Z_{i}O^{\ast}$$

defines a new pre-shape along the geodesic curve that represents the shortest smooth deformation from one neural representation to another, ignoring overall orientation, scale and reflection. Note that $\gamma(t)$ represents another pre-shape, which can be viewed as an exemplar for the representation's shape.
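The geodesic formula can be checked numerically with the sketch below. It assumes that $O^{*}$ is the Procrustes solution $VU^{\top}$ obtained from the SVD $Z_{j}^{\top}Z_{i}=U\Sigma V^{\top}$, which maximizes $\langle Z_{j}, Z_{i}O\rangle$ over orthogonal $O$; the function names and the random pre-shapes are illustrative.

```python
import numpy as np

def preshape(X):
    """Assumed pre-shape: center over probe inputs (rows), scale to unit norm."""
    Xc = X - X.mean(axis=0, keepdims=True)
    return Xc / np.linalg.norm(Xc)

def geodesic(Zi, Zj, t):
    """Point gamma(t) on the shape geodesic, with gamma(0) = Zj and gamma(1) = Zi O*.
    Not defined when the two shapes coincide (sin(rho) = 0)."""
    U, S, Vt = np.linalg.svd(Zj.T @ Zi)
    O_star = Vt.T @ U.T                                  # optimal rotation/reflection
    rho = np.arccos(np.clip(S.sum(), -1.0, 1.0))
    w_j = np.sin((1.0 - t) * rho) / np.sin(rho)
    w_i = np.sin(t * rho) / np.sin(rho)
    return w_j * Zj + w_i * (Zi @ O_star)

rng = np.random.default_rng(0)
Zi = preshape(rng.normal(size=(40, 3)))
Zj = preshape(rng.normal(size=(40, 3)))
rho = float(np.arccos(np.clip(np.linalg.svd(Zj.T @ Zi, compute_uv=False).sum(), -1.0, 1.0)))
midpoint = geodesic(Zi, Zj, 0.5)
# Shape distance from Zj to the midpoint; it should be half of rho.
d_mid = float(np.arccos(np.clip(np.linalg.svd(Zj.T @ midpoint, compute_uv=False).sum(), -1.0, 1.0)))
```

The midpoint stays centered with unit Frobenius norm, so it is again a pre-shape, and its distance to `Zj` is `rho / 2`, as expected for a constant-speed geodesic.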

Connection to fidelity and normalized Bures similarity\.

The Riemannian shape distance also admits an equivalent formulation in terms of the fidelity between the corresponding stimulus\-by\-stimulus kernel matrices\. Let

$$K_{i}=CX_{i}X_{i}^{\top}C,\qquad K_{j}=CX_{j}X_{j}^{\top}C,$$

where $C=I_{M}-\frac{1}{M}\mathbf{1}\mathbf{1}^{\top}$ is the centering matrix used in Eq. (1). Define the fidelity between positive semidefinite matrices by

$$F(A,B)=\mathrm{Tr}\left[\left(A^{1/2}BA^{1/2}\right)^{1/2}\right].$$

Then the normalized Bures similarity between $K_{i}$ and $K_{j}$ is

$$\mathrm{NBS}(K_{i},K_{j})=\frac{F(K_{i},K_{j})}{\sqrt{\mathrm{Tr}(K_{i})\,\mathrm{Tr}(K_{j})}},$$

and Harvey et al. [[11](https://arxiv.org/html/2605.15306#bib.bib28)] show that this quantity is exactly the cosine of the Riemannian shape distance:

$$\mathrm{NBS}(K_{i},K_{j})=\cos\bigl(\rho(X_{i},X_{j})\bigr).$$

Equivalently,

$$\rho(X_{i},X_{j})=\arccos\left(\frac{F(K_{i},K_{j})}{\sqrt{\mathrm{Tr}(K_{i})\,\mathrm{Tr}(K_{j})}}\right).$$

Thus, our shape-space metric can also be interpreted as the angular form of a fidelity-based comparison between the centered linear kernel matrices induced by the two representations. In the special case where $X_{i}$ and $X_{j}$ are already centered across probe inputs, one may simply write $K_{i}=X_{i}X_{i}^{\top}$ and $K_{j}=X_{j}X_{j}^{\top}$.
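The identity between the normalized Bures similarity and the cosine of the shape distance can be verified numerically on small random matrices. The sketch below is illustrative only: the function names are hypothetical, the matrix square roots are computed by eigendecomposition (an implementation choice, not something the text prescribes), and the pre-shape is assumed to be the centered, unit-norm representation.

```python
import numpy as np

def psd_sqrt(A):
    """Symmetric square root of a positive semidefinite matrix via eigh."""
    w, V = np.linalg.eigh((A + A.T) / 2)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def fidelity(A, B):
    """F(A, B) = Tr[(A^{1/2} B A^{1/2})^{1/2}] for PSD matrices A and B."""
    rA = psd_sqrt(A)
    inner = rA @ B @ rA
    w = np.linalg.eigvalsh((inner + inner.T) / 2)
    return float(np.sqrt(np.clip(w, 0.0, None)).sum())

def nbs(Xi, Xj):
    """Normalized Bures similarity of the centered linear kernels."""
    M = Xi.shape[0]
    C = np.eye(M) - np.ones((M, M)) / M              # centering matrix
    Ki, Kj = C @ Xi @ Xi.T @ C, C @ Xj @ Xj.T @ C
    return fidelity(Ki, Kj) / np.sqrt(np.trace(Ki) * np.trace(Kj))

def cos_shape_distance(Xi, Xj):
    """cos(rho) from the SVD closed form, using assumed unit-norm centered pre-shapes."""
    Zi = Xi - Xi.mean(axis=0)
    Zj = Xj - Xj.mean(axis=0)
    Zi, Zj = Zi / np.linalg.norm(Zi), Zj / np.linalg.norm(Zj)
    return float(np.linalg.svd(Zj.T @ Zi, compute_uv=False).sum())

rng = np.random.default_rng(0)
Xi = rng.normal(size=(20, 3))
Xj = rng.normal(size=(20, 3))
nbs_value = nbs(Xi, Xj)
cos_value = cos_shape_distance(Xi, Xj)
```

The two values agree up to numerical error. The SVD route works with $N\times N$ matrices, whereas the kernel route works with $M\times M$ matrices, so which is cheaper depends on whether there are more units or more probe inputs.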

## Appendix B Training details

We train all our models on CIFAR-10 with four different architectures using 4 NVIDIA A100 GPUs and the standard train/test split. Unless otherwise specified, training hyperparameters are fixed for each architecture and are reported in Table [1](https://arxiv.org/html/2605.15306#A2.T1). Hyperparameters are tuned such that they reach a reasonable classification performance while also maintaining a smooth and stable learning curve.

Activations produced by convolutional layers of size $M\times C\times H\times W$, where $C$ is the number of channels and $H\times W$ are the spatial dimensions, were flattened into representation matrices $X_{i}\in\mathbb{R}^{M\times CHW}$ (as is common, see e.g. [[16](https://arxiv.org/html/2605.15306#bib.bib30)]).

Representations formed by hidden layers are often extremely large in terms of the number of activations $N=CHW$, and the decomposition required to compute [Equation 5](https://arxiv.org/html/2605.15306#A1.E5) can become computationally expensive. If $N>1000$, the dimensionality of the representations was reduced by discarding all but the top $1000$ principal components, with the observation that in all representations studied this projection preserved at least $75\%$ of the variance.
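The flattening and PCA truncation described above could look like the following sketch. The function names and the toy tensor sizes are assumptions made for illustration; the threshold `k` defaults to the value of 1000 stated in the text, and the demo uses a much smaller `k` so that the truncation actually triggers.

```python
import numpy as np

def flatten_activations(acts):
    """Reshape (M, C, H, W) activations into an (M, C*H*W) representation matrix."""
    return acts.reshape(acts.shape[0], -1)

def reduce_dim(X, k=1000):
    """Keep the top-k principal components if X has more than k columns.

    Returns the (possibly reduced) matrix and the fraction of variance kept.
    With M probe inputs the result has at most min(M, k) columns.
    """
    if X.shape[1] <= k:
        return X, 1.0
    Xc = X - X.mean(axis=0, keepdims=True)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :k] * S[:k]                        # PCA scores on the top-k axes
    kept = float((S[:k] ** 2).sum() / (S ** 2).sum())
    return scores, kept

rng = np.random.default_rng(0)
acts = rng.normal(size=(64, 8, 6, 6))                # toy conv activations: M=64, C=8, H=W=6
X = flatten_activations(acts)                        # shape (64, 288)
scores, kept = reduce_dim(X, k=16)                   # small k only for the toy example
unchanged, frac = reduce_dim(X, k=1000)              # N = 288 <= 1000, so nothing is dropped
```

Random toy data has no low-dimensional structure, so `kept` is far below the 75% reported for the trained networks; the check on variance preserved is the part worth carrying over to real activations.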

Table 1: Training Hyperparameters for CIFAR-10 Models

The data augmentations used in our experiments for [Fig. 4](https://arxiv.org/html/2605.15306#S4.F4) are listed in Table [2](https://arxiv.org/html/2605.15306#A2.T2). For each experiment, we fix the augmentation type and sweep over a predefined set of magnitudes, while keeping the model architecture, optimizer, learning rate schedule, and number of epochs unchanged. All augmentations are applied stochastically on-the-fly during training.

Table 2: Data Augmentation Hyperparameters

## Appendix C Supplementary figures

To test the generality and reproducibility of our results, we repeat the same experiment over different architectures (Fig. [A-1](https://arxiv.org/html/2605.15306#A3.F1)), a different dataset (Fig. [A-2](https://arxiv.org/html/2605.15306#A3.F2)), and different augmentation types (Fig. [A-3](https://arxiv.org/html/2605.15306#A3.F3), Fig. [A-4](https://arxiv.org/html/2605.15306#A3.F4)). In each case we observe similarly structured geometry in shape space.

![Refer to caption](https://arxiv.org/html/2605.15306v1/x5.png)
Figure A-1: Replicating the experiment of [Fig. 3](https://arxiv.org/html/2605.15306#S4.F3) in four different architectures (ResNet18, VGG, DenseNet, ViT) with the patch Gaussian data augmentation method.

![Refer to caption](https://arxiv.org/html/2605.15306v1/x6.png)
Figure A-2: Replication of [Fig. 2](https://arxiv.org/html/2605.15306#S4.F2) using ResNet18 on CIFAR-10 and ImageNet. Panels (a,b) show models trained on CIFAR-10, while (c,d) show pretrained ResNet18 fine-tuned on ImageNet. The top row shows pairwise distance matrices across Cutout sizes, measured at Layer 2.1 for (a,c) and avgpool for (b,d). The bottom row (e–h) shows the corresponding MDS embeddings after PCA, colored by Cutout size.

![Refer to caption](https://arxiv.org/html/2605.15306v1/x7.png)
Figure A-3: Replicating the experiment of [Fig. 2](https://arxiv.org/html/2605.15306#S4.F2) in ResNet18, but using the data augmentation methods CutOut and random cropping. (a) Distance matrices across pairs of augmentations; (b) MDS embeddings colored by augmentation strength.

![Refer to caption](https://arxiv.org/html/2605.15306v1/x8.png)
Figure A-4: Replicating the experiment of [Fig. 2](https://arxiv.org/html/2605.15306#S4.F2) in ResNet18, but using the data augmentation methods rotation and shear. (a) Distance matrices across pairs of augmentations; (b) MDS embeddings colored by augmentation strength.

![Refer to caption](https://arxiv.org/html/2605.15306v1/x9.png)
Figure A-5: Layer-by-layer variability in maximally, median, and minimally deformed landmarks. Landmark displacement analysis as in [Fig. 5](https://arxiv.org/html/2605.15306#S4.F5), comparing layers of ResNet18 trained on un-augmented images and trained using the shear data augmentation method at level s = 4 (one step along the shear trajectory in [Fig. 4](https://arxiv.org/html/2605.15306#S4.F4)(a)). PCA plots in the bottom row are identical projections of the displacement $\Delta Z$ onto its first two principal axes, except one is colored by CIFAR-10 class and the other displays the corresponding CIFAR-10 image over each datapoint.

![Refer to caption](https://arxiv.org/html/2605.15306v1/x10.png)
Figure A-6: Layer-by-layer variability in maximally, median, and minimally deformed landmarks. Landmark displacement analysis as in [Fig. 5](https://arxiv.org/html/2605.15306#S4.F5), comparing layers of ResNet18 trained on un-augmented images and trained using the "grayscale" data augmentation method, with probability of image grayscale = 0.1 (one step along the grayscale trajectory of [Fig. 4](https://arxiv.org/html/2605.15306#S4.F4)). PCA plots in the bottom row are identical projections of the displacement $\Delta Z$ onto its first two principal axes, except one is colored by CIFAR-10 class and the other displays the corresponding CIFAR-10 image over each datapoint.

![Refer to caption](https://arxiv.org/html/2605.15306v1/x11.png)
Figure A-7: Layer-by-layer variability in maximally, median, and minimally deformed landmarks. Landmark displacement analysis as in [Fig. 5](https://arxiv.org/html/2605.15306#S4.F5), comparing layers of ResNet18 trained on un-augmented images and trained using the Gaussian noise data augmentation method at level s = 0.02. PCA plots in the bottom row are identical projections of the displacement $\Delta Z$ onto its first two principal axes, except one is colored by CIFAR-10 class and the other displays the corresponding CIFAR-10 image over each datapoint.
