Equivariance and Augmentation for Bayesian Neural Networks
Summary
This paper studies data augmentation for Bayesian neural networks trained with variational inference, deriving conditions for exact equivariance and introducing novel symmetrization techniques like orbit expansion to improve symmetry and performance.
# Equivariance and Augmentation for Bayesian Neural Networks
Source: [https://arxiv.org/html/2606.26273](https://arxiv.org/html/2606.26273)
Miaowen Dong¹, Axel Flinth²˒³, Jan E. Gerken¹˒³

¹ Department of Mathematical Sciences, Chalmers University of Technology and the University of Gothenburg, SE-412 96 Gothenburg, Sweden. Emails: miaowen@chalmers.se, gerken@chalmers.se

² Department of Mathematics and Mathematical Statistics, Umeå University, Linnéus väg 49, 901 87 Umeå, Sweden. Email: axel.flinth@umu.se

³ Equal contribution
###### Abstract
Symmetries are important for many deep learning tasks, ranging from applications in the sciences to medical imaging\. However, there is an ongoing debate about whether to impose symmetry constraints on the neural network architecture \(yielding equivariant neural networks\) or learn them from augmented training data\. Although equivariant networks are well\-studied theoretically, much less is known about data augmentation, since analyzing augmentation requires control over the training dynamics\. Inspired by recent results that show that augmented infinite deep ensembles are exactly equivariant, we study data augmentation for Bayesian neural networks \(BNNs\) trained with variational inference\. We focus on variational distributions in the exponential family and derive conditions under which exact equivariance is reached\. We furthermore obtain bounds on the equivariance error and introduce three novel symmetrization techniques which boost the effect of data augmentation in this setting\. We conduct extensive numerical experiments which show that one of our symmetrization methods \(orbit expansion\) outperforms the baseline in both equivariance and overall performance\. Our code is available at[github\.com/dmw1998/augment\-BNNs](https://github.com/dmw1998/augment-BNNs)\.
## 1 Introduction
In recent years, symmetric learning tasks have become an important field of study. After an initial focus on specialized equivariant networks which impose the symmetry constraints layer-by-layer \[[5](https://arxiv.org/html/2606.26273#bib.bib23)\], attention has recently shifted towards learning symmetries from augmented training data \[[6](https://arxiv.org/html/2606.26273#bib.bib22)\]. Given an efficient mechanism for applying the symmetry transformations, this approach is straightforward to implement and can be combined with highly optimized, well-performing architectures \[[34](https://arxiv.org/html/2606.26273#bib.bib41)\]. However, as the symmetry is merely learned and not imposed, it is only achieved approximately. Therefore, novel techniques are required which improve the symmetry gain derived from augmented training.
Figure 1: Natural parameters $\eta$ for a variational distribution in the exponential family that lie in $H_G$ correspond to symmetric BNNs, here exemplified with a reflection symmetry. Our main Theorem [3.7](https://arxiv.org/html/2606.26273#S3.Thmtheorem7) implies that $H_G$ is invariant for augmented training. Through the symmetrization strategies described in Section [3.4](https://arxiv.org/html/2606.26273#S3.SS4), we can increase the equivariance of the final model.

The explicit layer-wise equivariant neural networks are readily accessible to theoretical analysis, while data augmentation is considerably harder to study since it involves the training dynamics (see the related works section). It is however possible to show that in *expectation over initializations*, data augmentation leads to exact equivariance \[[16](https://arxiv.org/html/2606.26273#bib.bib35),[28](https://arxiv.org/html/2606.26273#bib.bib37)\]. A practical but costly method for approximating such expected values is training a deep ensemble.
The purpose of this work is to investigate a cheaper way to realize such “equivariance in expectation”: training Bayesian neural networks (BNNs) with variational inference on augmented data. In this setting, sampling from the posterior predictive distribution replaces the inference step on the ensemble and yields Bayesian uncertainty estimates. Moreover, only one training run is necessary to obtain the variational posterior, in contrast to one training run per ensemble member for deep ensembles. Furthermore, since BNNs show stable out-of-distribution behavior, they are particularly well suited for small datasets, the regime in which data augmentation can be expected to have the largest effect.
Our main contributions are the following:
- We study BNNs trained on augmented data with a variational distribution from the exponential family. We show that, when the training starts from an invariant prior, the variational distribution stays invariant throughout training, under some mild assumptions. This generalizes a similar result from \[[29](https://arxiv.org/html/2606.26273#bib.bib18)\] for non-Bayesian network training.
- We derive bounds for the deviation of the variational distribution from equivariance if the prior is not equivariant. We furthermore prove bounds for the equivariance error in the predictions due to finite sampling. These theoretical results are validated numerically.
- We introduce three symmetrization methods (geometric averaging, projection and orbit expansion) that can be applied during training to improve the equivariance properties of the BNN. We test these techniques in extensive numerical experiments for image classification. *Orbit expansion* outperforms the baseline in both model performance and equivariance.
## 2 Related Work
#### Equivariant neural networks\.
The issue of symmetry, i\.e\.*invariance*and*equivariance*, of deep neural nets has grown into an entire subfield called*geometric deep learning*\[[5](https://arxiv.org/html/2606.26273#bib.bib23)\]\. The most prominent construction of equivariant networks is the layerwise one\. This strategy originates from GCNNs, Group Convolutional Neural Networks\[[8](https://arxiv.org/html/2606.26273#bib.bib14)\], but has by now been generalized to virtually any symmetry induced by any group\[[21](https://arxiv.org/html/2606.26273#bib.bib25),[19](https://arxiv.org/html/2606.26273#bib.bib45),[15](https://arxiv.org/html/2606.26273#bib.bib2)\]\. There are also other strategies, such as learning from invariants\[[32](https://arxiv.org/html/2606.26273#bib.bib40)\], frame averaging\[[30](https://arxiv.org/html/2606.26273#bib.bib44)\], fundamental domain projection\[[1](https://arxiv.org/html/2606.26273#bib.bib24)\]and group averaging\. There are also works in which symmetries are enforced approximately, for instance through so called weight annealing\[[33](https://arxiv.org/html/2606.26273#bib.bib46)\]\.
#### Augmentation and training dynamics\.
The question of how data augmentation affects the training dynamics of neural networks has been treated in several simplified contexts, such as feature-averaged models \[[23](https://arxiv.org/html/2606.26273#bib.bib28),[11](https://arxiv.org/html/2606.26273#bib.bib32)\] as well as linear neural networks \[[22](https://arxiv.org/html/2606.26273#bib.bib33),[7](https://arxiv.org/html/2606.26273#bib.bib30),[10](https://arxiv.org/html/2606.26273#bib.bib26)\]. In these cases, it is often possible to prove equivalence of augmentation and equivariance. Fully non-linear networks were treated in \[[29](https://arxiv.org/html/2606.26273#bib.bib18),[27](https://arxiv.org/html/2606.26273#bib.bib38)\], whose results we generalize to Bayesian networks here.
Empirical studies on the difference between augmentation and restrictions are plentiful\. More systematic treatments are\[[13](https://arxiv.org/html/2606.26273#bib.bib34),[14](https://arxiv.org/html/2606.26273#bib.bib39),[4](https://arxiv.org/html/2606.26273#bib.bib1)\]\.
#### Bayesian neural networks\.
Bayesian approaches to deep learning have been studied for many years\[[24](https://arxiv.org/html/2606.26273#bib.bib9)\]since they offer uncertainty estimates for neural networks that are otherwise black\-box models \(see the PhD thesis by\[[12](https://arxiv.org/html/2606.26273#bib.bib10)\]for an overview\)\. However, making BNNs practically applicable requires integrating variational inference into the deep learning training methodology\[[17](https://arxiv.org/html/2606.26273#bib.bib16),[3](https://arxiv.org/html/2606.26273#bib.bib7),[20](https://arxiv.org/html/2606.26273#bib.bib8)\]\. For a review of BNNs with a focus on practical applications, see\[[18](https://arxiv.org/html/2606.26273#bib.bib6)\]\.
Not many prior works consider symmetries in relation to BNNs\.\[[31](https://arxiv.org/html/2606.26273#bib.bib4)\]propose a probabilistic group\-averaging of BNNs in order to realize a soft symmetry constraint, optimized on the data\. Closest to our work is\[[26](https://arxiv.org/html/2606.26273#bib.bib5)\]which uses a specific prior that combines different weight\-sharing schemes and therefore symmetry constraints\. During training, the network learns which symmetry fits the data best\. In contrast, we consider training on augmented data with a generic prior that does not impose weight\-sharing\.
## 3 Theory
We develop a theoretical framework for understanding how data augmentation induces equivariance in variational Bayesian inference\. We proceed in three steps: first, we characterize when exponential families are closed under group actions \([Section3\.2](https://arxiv.org/html/2606.26273#S3.SS2)\); second, we show that data\-augmented training makes the ELBO invariant, and how that affects the training \([Section3\.3](https://arxiv.org/html/2606.26273#S3.SS3)\); third, we propose symmetrization mechanisms and analyze their properties \([Section3\.4](https://arxiv.org/html/2606.26273#S3.SS4)\)\.
### 3.1 Preliminaries
In this section, we introduce the mathematical tools used throughout this paper\. We begin with exponential families, which provide the structural backbone for our theoretical analysis, before reviewing variational inference and the group theoretic concepts needed to formalize symmetry\.
#### Exponential families\.
An exponential family of probability distributions is defined through a base measure $h(\theta)$, a sufficient statistic $T(\theta)\in\mathbb{R}^{k}$ and a log-partition function $A(\eta) := \log\int h(\theta)\exp(\eta^{\top}T(\theta))\,d\theta$. A distribution belongs to the exponential family if its density has the form

$$q_{\eta}(\theta)=h(\theta)\exp(\eta^{\top}T(\theta)-A(\eta))\,, \tag{1}$$

where $\eta\in H\subseteq\mathbb{R}^{k}$ is the *natural parameter*. Examples of such families include normal, exponential and log-normal distributions. An exponential family $\mathcal{Q}:=\{q_{\eta}(\theta)\mid\eta\in H\}$ is *regular* if $H$ is open, and *minimal* if the components of $T(\theta)$ are linearly independent.
#### Group actions and push\-forward distributions\.
We assume that a group $G$ acts on the input space $\mathcal{X}$, the output space $\mathcal{Y}$, and the parameter space $\Theta$ via representations $\rho_{\mathcal{X}}$, $\rho_{\mathcal{Y}}$, and $\rho_{\Theta}$, respectively. Throughout, we assume that $\rho_{\Theta}$ is compatible with the data representations $\rho_{\mathcal{X}}$ and $\rho_{\mathcal{Y}}$ in the following sense:

$$f(\rho_{\mathcal{X}}(g)x;\theta)=\rho_{\mathcal{Y}}(g)f(x;\rho_{\Theta}(g)^{-1}\theta). \tag{2}$$

Following the layered decomposition $\Theta=\bigoplus_{\ell}\Theta_{\ell}$ of a neural network's parameters, we assume that $\rho_{\Theta}$ acts per layer, i.e., $\rho_{\Theta}=\bigoplus_{\ell}\rho_{\Theta_{\ell}}$. There is a canonical construction of such a $\rho_{\Theta}$: one introduces representations $\rho_{\Theta_{\ell}}$ on the hidden layers and then imposes equivariance on each linear layer with respect to the two representations acting on the input and output of that layer. As long as the non-linearities are equivariant with respect to the same representations (which is always the case for pointwise non-linearities and permutation representations) one obtains ([2](https://arxiv.org/html/2606.26273#S3.E2)). For more details, see \[[29](https://arxiv.org/html/2606.26273#bib.bib18)\]. The choice of $\rho_{\Theta}$, or equivalently the choice of the hidden layer representations, is a modeling decision that we revisit in [Section 3.4](https://arxiv.org/html/2606.26273#S3.SS4). When there is no risk of confusion, we write $gx$ and $gy$ as shorthands for $\rho_{\mathcal{X}}(g)x$ and $\rho_{\mathcal{Y}}(g)y$, but retain the explicit notation $\rho(g)\theta$ for the parameter space, where the representation structure is central to our analysis. The push-forward of a distribution $q$ under $g\in G$ is defined as

$$\mathcal{T}_{g}\#q(B):=q(\rho(g)^{-1}B) \tag{3}$$

for any measurable set $B$, with density $\mathcal{T}_{g}\#q(\theta)=q(\rho(g)^{-1}\theta)\left|\det D(\rho(g)^{-1})(\theta)\right|$. A distribution $q$ is called $G$-invariant if $\mathcal{T}_{g}\#q=q$ for all $g\in G$.
#### Bayesian neural networks \(BNN\) and variational inference \(VI\)\.
A neural network is a parametrized function $f(\cdot;\theta):\mathcal{X}\rightarrow\mathcal{Y}$. It defines conditional label distributions $p(y\,|\,x,\theta)=p(y\,|\,f(x;\theta))$. In the Bayesian treatment, we put a prior $p_{0}(\theta)$ on the weights and calculate the posterior $p(\theta\mid\mathcal{D})$ given a dataset $\mathcal{D}=\{(x_{i},y_{i})\}_{i=1}^{N_{0}}$ via Bayes' rule. However, since the posterior for a neural network likelihood is typically intractable, one approximates it with a variational distribution $q_{\eta}\in\mathcal{Q}$ and then chooses the parameters $\eta$ to maximize the *evidence lower bound* \[[2](https://arxiv.org/html/2606.26273#bib.bib15)\]

$$\mathrm{ELBO}(\eta):=\mathbb{E}_{q_{\eta}}[\log p(\mathcal{D}\mid\theta)]-D_{\mathrm{KL}}(q_{\eta}\|p_{0})\,. \tag{4}$$

The posterior predictive is approximated by Monte Carlo sampling from $q_{\eta}$:

$$F_{\eta}(x):=\mathbb{E}_{\theta\sim q_{\eta}}[f(x;\theta)]\approx\frac{1}{T}\sum_{t=1}^{T}f(x;\theta^{(t)})\,,\quad\theta^{(t)}\sim q_{\eta}\,. \tag{5}$$
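The Monte Carlo estimate in (5) can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in rather than the paper's code: the one-neuron map `f`, the mean-field Gaussian sampler `sample_theta`, and the helper `predictive_mean` are assumptions chosen only to make the averaging step concrete.

```python
import math
import random


def f(x, theta):
    """Toy one-neuron 'network' (hypothetical stand-in for f(x; theta))."""
    w, b = theta
    return math.tanh(w * x + b)


def sample_theta(mu, sigma, rng):
    """Draw theta ~ N(mu, diag(sigma^2)), i.e. a mean-field Gaussian q_eta."""
    return [rng.gauss(m, s) for m, s in zip(mu, sigma)]


def predictive_mean(x, mu, sigma, T, rng):
    """Monte Carlo estimate (1/T) * sum_t f(x; theta^(t)) of F_eta(x)."""
    total = 0.0
    for _ in range(T):
        total += f(x, sample_theta(mu, sigma, rng))
    return total / T
```

With zero standard deviations the estimate collapses to the single network `f(x, mu)`; for positive standard deviations the estimate fluctuates around $F_\eta(x)$ with an error that shrinks as `T` grows.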
#### Data augmentation\.
Given a finite group $G$ and a dataset $\mathcal{D}=\{(x_{i},y_{i})\}_{i=1}^{N_{0}}$ drawn i.i.d. from a data distribution $P_{\mathcal{X}}$, data augmentation constructs an augmented dataset $\mathcal{D}_{\mathrm{aug}}=\{(gx_{i},gy_{i})\mid g\in G,(x_{i},y_{i})\in\mathcal{D}\}$, expanding the dataset by a factor of $|G|$. One could also consider a continuous compact group by sampling finitely many Haar-distributed group elements. This comes with some subtleties that we discuss in [appendix A](https://arxiv.org/html/2606.26273#A1). In the following, we instead restrict to the case of finite $G$. Following the mini-batch scaling convention of \[[17](https://arxiv.org/html/2606.26273#bib.bib16)\] and \[[3](https://arxiv.org/html/2606.26273#bib.bib7)\], instead of using the ELBO ([4](https://arxiv.org/html/2606.26273#S3.E4)) directly, we work with a $\beta$-weighted objective:

$$\mathcal{L}_{\beta}(\eta):=\underbrace{\tfrac{1}{N}\,\mathbb{E}_{q_{\eta}}[-\log p(\mathcal{D}_{\mathrm{aug}}\mid\theta)]}_{=:\,R_{\mathrm{aug}}(\eta)}+\beta D_{\mathrm{KL}}(q_{\eta}\|p_{0}), \tag{6}$$

where $N=|G|\cdot N_{0}$ is the augmented dataset size. $\mathcal{L}_{\beta}$ with $\beta=1/N$ is the rescaled negative ELBO.
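As a concrete illustration of the augmentation step and the scaling in (6), the following sketch builds $\mathcal{D}_{\mathrm{aug}}$ for $C_4$ acting on small square images by 90-degree rotations. It assumes an invariant label action ($gy=y$, as in classification) and keeps the augmented data as a list, so symmetric images contribute repeated entries. The helper names `rot90`, `c4_orbit`, `augment` and `beta_weighted_loss` are hypothetical, and the likelihood and KL terms are passed in as plain numbers.

```python
def rot90(img):
    """Rotate a square image (list of rows) by 90 degrees counter-clockwise."""
    n = len(img)
    return [[img[j][n - 1 - i] for j in range(n)] for i in range(n)]


def c4_orbit(img):
    """Return the four rotations [x, gx, g^2 x, g^3 x] of an image."""
    orbit = [img]
    for _ in range(3):
        orbit.append(rot90(orbit[-1]))
    return orbit


def augment(dataset):
    """Build D_aug = {(g x, y)}; labels are assumed invariant (g y = y)."""
    return [(gx, y) for x, y in dataset for gx in c4_orbit(x)]


def beta_weighted_loss(nll_aug, kl, n_aug, beta=None):
    """Objective (6): (1/N) * E_q[-log p(D_aug | theta)] + beta * KL.

    nll_aug and kl are assumed to be precomputed scalars; beta defaults
    to 1/N, which gives the rescaled negative ELBO.
    """
    if beta is None:
        beta = 1.0 / n_aug
    return nll_aug / n_aug + beta * kl
```

For a dataset of $N_0$ images, `augment` returns $N=4N_0$ pairs, matching $N=|G|\cdot N_0$ in the text.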
If we assume the following invariance of the forward model:

$$p(gy\,|\,gf(x;\theta))=p(y\,|\,f(x;\theta)), \tag{7}$$

the compatibility of $\rho_{\Theta}$ with the data representation implies that the augmented likelihood inherits the symmetry of the data. Note that condition ([7](https://arxiv.org/html/2606.26273#S3.E7)) is weak (it is, e.g., satisfied for $y\sim f(x;\theta)+e$ with a $G$-invariant error $e$) and furthermore independent of the architecture $f$.
###### Proposition 3\.1\(Invariant augmented likelihood\)\.
Let $L_{\mathrm{aug}}(\theta):=p(\mathcal{D}_{\mathrm{aug}}\mid\theta)$ be the likelihood of the augmented dataset. Under the compatibility assumption ([2](https://arxiv.org/html/2606.26273#S3.E2)) as well as the invariance assumption ([7](https://arxiv.org/html/2606.26273#S3.E7)), the likelihood is $G$-invariant:

$$L_{\mathrm{aug}}(\rho(g)^{-1}\theta)=L_{\mathrm{aug}}(\theta),\quad\forall g\in G,\,\theta\in\Theta\,. \tag{8}$$
The proof is given in[appendixB](https://arxiv.org/html/2606.26273#A2)\.
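For orientation, the argument can be sketched in a few lines. This is not the appendix proof; it is a short derivation under the additional assumption that the augmented likelihood factorizes over data points, $L_{\mathrm{aug}}(\theta)=\prod_{i}\prod_{h\in G}p(hy_i\mid f(hx_i;\theta))$.

$$
\begin{aligned}
L_{\mathrm{aug}}(\rho(g)^{-1}\theta)
&=\prod_{i=1}^{N_0}\prod_{h\in G}p\big(hy_i\mid f(hx_i;\rho(g)^{-1}\theta)\big)\\
&=\prod_{i=1}^{N_0}\prod_{h\in G}p\big(hy_i\mid g^{-1}f(ghx_i;\theta)\big)
&&\text{by (2) applied to the input } hx_i\\
&=\prod_{i=1}^{N_0}\prod_{h\in G}p\big(ghy_i\mid f(ghx_i;\theta)\big)
&&\text{by (7) applied with } g\\
&=\prod_{i=1}^{N_0}\prod_{h'\in G}p\big(h'y_i\mid f(h'x_i;\theta)\big)
=L_{\mathrm{aug}}(\theta)
&&\text{substituting } h'=gh.
\end{aligned}
$$

The last step uses that $h\mapsto gh$ is a bijection of the finite group $G$, which is exactly where the full orbit in $\mathcal{D}_{\mathrm{aug}}$ enters.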
#### Equivariance\.
A single network $f(\cdot\,;\theta)$ is $G$-equivariant if $f(gx;\theta)=gf(x;\theta)$ for all $x\in\mathcal{X}$, $g\in G$. More generally, the predictive map $F_{\eta}$ defined in ([5](https://arxiv.org/html/2606.26273#S3.E5)) is *$G_{\varepsilon}$-equivariant* if the equivariance defect $\Delta^{\mathrm{eq}}_{F}(\eta)$ is bounded by $\varepsilon$,

$$\Delta^{\mathrm{eq}}_{F}(\eta):=\mathbb{E}_{x\sim P_{\mathcal{X}},\,g\sim\nu_{G}}\!\left[\left\|F_{\eta}(gx)-gF_{\eta}(x)\right\|^{2}\right]\leq\varepsilon. \tag{9}$$

Here, $\nu_{G}$ denotes the normalized Haar measure on $G$. When $\varepsilon=0$, this reduces to exact $G$-equivariance of $F_{\eta}$, $P_{\mathcal{X}}$-a.s. One central question is whether minimizing $\mathcal{L}_{\beta}$ on $\mathcal{D}_{\mathrm{aug}}$ drives $\Delta^{\mathrm{eq}}_{F}(\eta)$ to zero, and at what rate.
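An empirical version of the defect (9) replaces the expectation over $x$ by a sample mean and, for a finite group, the Haar expectation by a uniform average over group elements. The sketch below assumes the predictive map is available as a plain function `F` and that the group is given as two aligned lists of callables (`group_x` acting on inputs, `group_y` acting on outputs); these names are illustrative and not taken from the paper's code.

```python
def equivariance_defect(F, xs, group_x, group_y):
    """Empirical defect: mean over samples x and group elements g of
    ||F(g x) - g F(x)||^2, with vectors represented as lists of floats.

    group_x[k] and group_y[k] must be the input and output actions of
    the same group element.
    """
    total = 0.0
    count = 0
    for x in xs:
        Fx = F(x)
        for act_x, act_y in zip(group_x, group_y):
            diff = [a - b for a, b in zip(F(act_x(x)), act_y(Fx))]
            total += sum(d * d for d in diff)
            count += 1
    return total / count
```

For example, with $G=C_2$ swapping the two coordinates of $\mathbb{R}^2$ on both input and output, a map that squares each coordinate has zero defect, while a map that scales only one coordinate does not.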
### 3.2 Closed exponential family under group actions
Neural networks can learn equivariance efficiently from augmented data only when the parameter space is closed under group transformations\[[29](https://arxiv.org/html/2606.26273#bib.bib18)\]\. Similarly, in the Bayesian setting, we restrict the variational family to those closed under group transformations\. As the following theorem shows, this imposes constraints on the exponential family\. We provide the proof in[appendixC](https://arxiv.org/html/2606.26273#A3)\.
###### Theorem 3\.2\(Closure under push\-forward\)\.
A regular minimal exponential family $\mathcal{Q}$ is closed under the push-forward operator $\mathcal{T}_{g}$ for all $g\in G$ if and only if for each $g$ there exist $M_{g}\in\mathbb{R}^{k\times k}$, $d_{g}\in\mathbb{R}^{k}$, $b_{g}\in\mathbb{R}^{k}$ and $c_{g}\in\mathbb{R}$ such that for all $\theta\in\Theta$,

$$T(\rho(g)^{-1}\theta)=M_{g}T(\theta)+d_{g}, \tag{10}$$

$$h(\rho(g)^{-1}\theta)\left|\det D(\rho(g)^{-1})(\theta)\right|=h(\theta)\exp(b_{g}^{\top}T(\theta)+c_{g}). \tag{11}$$

When these conditions hold, $\mathcal{T}_{g}\#q_{\eta}=q_{\eta_{g}}$ where $\eta_{g}=M_{g}^{\top}\eta+b_{g}$, and the transformation $\phi_{g}:\eta\mapsto M_{g}^{\top}\eta+b_{g}$ forms an affine action of $G$ on the space of natural parameters $H$.
A family $\mathcal{Q}$ is hence closed if and only if there exists an affine representation of the group acting on the ambient space $\mathbb{R}^{k}$ of the space of natural parameters that makes $T:\Theta\to\mathbb{R}^{k}$ equivariant in the sense of ([10](https://arxiv.org/html/2606.26273#S3.E10)), and with respect to which the base measure transforms according to ([11](https://arxiv.org/html/2606.26273#S3.E11)).
###### Example 3\.3\(Mean\-field Gaussian under permutation actions\)\.
Consider the mean-field Gaussian family $\mathcal{Q}=\{\mathcal{N}(\mu,\mathrm{diag}(\sigma^{2})):\mu\in\mathbb{R}^{d},\sigma^{2}\in\mathbb{R}^{d}_{>0}\}$ on $\Theta=\mathbb{R}^{d}$. This family has sufficient statistic $T(\theta)=(\theta,\,\theta\odot\theta)\in\mathbb{R}^{d}\oplus\mathbb{R}^{d}$, constant base measure $h(\theta)=(2\pi)^{-d/2}$, and natural parameter $\eta=(\eta_{1},\eta_{2})\in\mathbb{R}^{d}\oplus\mathbb{R}^{d}$ given by $(\eta_{1})_{i}=\mu_{i}/\sigma_{i}^{2}$ and $(\eta_{2})_{i}=-1/(2\sigma_{i}^{2})$.

Let $G$ act on $\Theta$ by a permutation representation $\rho(g)=R_{g}$. Then, both conditions of [Theorem 3.2](https://arxiv.org/html/2606.26273#S3.Thmtheorem2) are met: permutation matrices commute with the Hadamard product, so $T(R_{g}^{-1}\theta)=(R_{g}^{-1}\oplus R_{g}^{-1})T(\theta)$, giving $M_{g}=R_{g}^{-1}\oplus R_{g}^{-1}$ and $d_{g}=0$. Since $|\det R_{g}^{-1}|=1$ and $h$ is constant, ([11](https://arxiv.org/html/2606.26273#S3.E11)) is fulfilled by $b_{g}=c_{g}=0$. Using $(R_{g}^{-1})^{\top}=R_{g}$, the induced natural-parameter transformation is

$$\phi_{g}(\eta)=M_{g}^{\top}\eta=(R_{g}\oplus R_{g})\eta, \tag{12}$$

which simply amounts to permuting the entries of $\eta_{1}$ and $\eta_{2}$. Equivalently,

$$\mathcal{T}_{g}\#\mathcal{N}(\mu,\mathrm{diag}(\sigma^{2}))=\mathcal{N}(R_{g}\mu,\mathrm{diag}(R_{g}\sigma^{2})), \tag{13}$$

so the family is indeed closed.
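The example can be checked numerically with a small sketch: convert $(\mu,\sigma^2)$ to natural parameters, permute both blocks of $\eta$ as in (12), convert back, and compare with the permuted mean and variance from (13). The function names and the list-based encoding of a permutation (`perm[i]` is the position to which entry `i` is moved) are illustrative assumptions.

```python
def to_natural(mu, var):
    """Mean-field Gaussian: eta1_i = mu_i / var_i, eta2_i = -1 / (2 var_i)."""
    eta1 = [m / v for m, v in zip(mu, var)]
    eta2 = [-1.0 / (2.0 * v) for v in var]
    return eta1, eta2


def from_natural(eta1, eta2):
    """Inverse map: var_i = -1 / (2 eta2_i), mu_i = eta1_i * var_i."""
    var = [-1.0 / (2.0 * e2) for e2 in eta2]
    mu = [e1 * v for e1, v in zip(eta1, var)]
    return mu, var


def permute(perm, vec):
    """Apply the permutation matrix R_g: entry i moves to position perm[i]."""
    out = [None] * len(vec)
    for i, p in enumerate(perm):
        out[p] = vec[i]
    return out


def phi(perm, eta):
    """Natural-parameter action (12): permute both blocks of eta."""
    eta1, eta2 = eta
    return permute(perm, eta1), permute(perm, eta2)
```

Applying `from_natural` to `phi(perm, to_natural(mu, var))` reproduces `permute(perm, mu)` and `permute(perm, var)` up to floating-point rounding, which is the content of (13).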
### 3.3 Data augmentation induces equivariance
The representation $\phi_{g}$ fixes the set of symmetric parameters $H_{G}:=\{\eta\in H\,|\,\phi_{g}(\eta)=\eta,\,\forall\,g\in G\}$. Just as in the single network case, parameters in $H_{G}$ correspond to equivariant predictive maps.
###### Proposition 3\.4\.
For $\eta\in H_{G}$, $F_{\eta}$ is $G$-equivariant.
###### Proof\.
If $\eta\in H_{G}$, then $\mathcal{T}_{g}\#q_{\eta}=q_{\phi_{g}(\eta)}=q_{\eta}$. Hence, if $\theta\sim q_{\eta}$, then $\rho(g)\theta\sim q_{\eta}$ for every $g\in G$. This together with the compatibility assumption implies that $f(gx;\theta)=gf(x;\rho(g^{-1})\theta)\stackrel{\mathrm{d}}{=}gf(x;\theta)$, where the last equality is in distribution. Taking the expectation over $\theta$ yields $F_{\eta}(gx)=gF_{\eta}(x)$, which is the claim. ∎
Our main result states that, under the conditions presented below, $H_{G}$ is invariant under gradient descent on augmented data. Initializing $\eta$ there guarantees an equivariant $F_{\eta}$ throughout training.
###### Assumption 3\.5\(Invariant prior\)\.
The prior $p_{0}=q_{\eta_{0}}(\theta)$ for some $\eta_{0}$ is $G$-invariant, i.e., $\mathcal{T}_{g}\#p_{0}=p_{0}$ for all $g\in G$, or equivalently $\phi_{g}(\eta_{0})=\eta_{0}$.
###### Assumption 3\.6\(Volume\-preserving action\)\.
$|\det D(\rho(g)^{-1})(\theta)|=1$ for all $g\in G$ and $\theta\in\Theta$.
Both these assumptions are natural: an invariant prior matches the hypothesis of a $G$-symmetric task, and volume preservation is automatic for orthogonal actions, which are very common in practice.
###### Theorem 3\.7\(Equivariance of VI\)\.
[Assumptions 3.5](https://arxiv.org/html/2606.26273#S3.Thmtheorem5) and [3.6](https://arxiv.org/html/2606.26273#S3.Thmtheorem6) imply the following for variational inference using invariant likelihoods and closed exponential families:
1. (i) Loss invariance. The loss $\mathcal{L}_{\beta}$ is invariant under the natural parameter transformation: $$\mathcal{L}_{\beta}(\eta)=\mathcal{L}_{\beta}(\phi_{g}(\eta)),\quad\forall g\in G,\,\eta\in H. \tag{14}$$
2. (ii) Gradient equivariance and update commutativity. The gradient of the loss $\mathcal{L}_{\beta}$ satisfies $$\nabla_{\eta}\mathcal{L}_{\beta}(\phi_{g}(\eta))=M_{g}^{-1}\nabla_{\eta}\mathcal{L}_{\beta}(\eta),\quad\forall g\in G,\,\eta\in H. \tag{15}$$ Consequently, for $M_{g}$ orthogonal, gradient descent $\mathcal{U}(\eta)=\eta-\alpha\nabla_{\eta}\mathcal{L}_{\beta}(\eta)$ commutes with $\phi_{g}$: $$\mathcal{U}(\phi_{g}(\eta))=\phi_{g}(\mathcal{U}(\eta)). \tag{16}$$
3. (iii) Invariant subspace is preserved. For $M_{g}$ orthogonal, the equivariant subspace $H_{G}$ is preserved by gradient updates: if $\eta^{(0)}\in H_{G}$, then $\eta^{(t)}\in H_{G}$ for all $t\geq 0$.
4. (iv) Optimal solutions inherit symmetry. The set of optimal natural parameters $H^{\ast}=\arg\min_{\eta}\mathcal{L}_{\beta}(\eta)$ is closed under $\phi_{g}$ for all $g\in G$. If the optimum is unique, then $\phi_{g}(\eta^{\ast})=\eta^{\ast}$.
The proof is given in [appendix D](https://arxiv.org/html/2606.26273#A4). This formalizes the intuition behind data augmentation: by making the training objective symmetric, augmentation ensures that $H_{G}$ is invariant and, in the case of a unique minimum, even that the minimum lies in $H_{G}$.
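Items (i) to (iii) can be illustrated on a toy problem. The sketch below does not use an ELBO: it assumes a hand-picked two-parameter loss that is invariant under the swap $\phi(\eta_1,\eta_2)=(\eta_2,\eta_1)$, which plays the role of an orthogonal $\phi_g$ for $G=C_2$ with $b_g=0$. The loss, the step size and all function names are illustrative assumptions.

```python
def loss(eta):
    """Toy swap-invariant, non-convex objective standing in for L_beta."""
    a, b = eta
    return (a * a + b * b - 1.0) ** 2 + (a * b - 0.3) ** 2


def grad(eta):
    """Analytic gradient of the toy loss."""
    a, b = eta
    r = a * a + b * b - 1.0
    s = a * b - 0.3
    return [4.0 * r * a + 2.0 * s * b, 4.0 * r * b + 2.0 * s * a]


def phi(eta):
    """Group action on parameters: swap the two coordinates."""
    return [eta[1], eta[0]]


def step(eta, lr=0.05):
    """One gradient-descent update U(eta) = eta - lr * grad(eta)."""
    g = grad(eta)
    return [e - lr * gi for e, gi in zip(eta, g)]


def train(eta, steps=200, lr=0.05):
    """Run plain gradient descent for a fixed number of steps."""
    for _ in range(steps):
        eta = step(eta, lr)
    return eta
```

In this toy setting, `step(phi(eta))` agrees with `phi(step(eta))` for any `eta`, and a run started on the diagonal $\eta_1=\eta_2$ (the analogue of $H_G$) stays on the diagonal, mirroring (16) and item (iii).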
Note that Theorem[3\.7](https://arxiv.org/html/2606.26273#S3.Thmtheorem7)does not rely on augmented data directly, only on an invariant likelihood \(which is implied by data augmentation according to Proposition[3\.1](https://arxiv.org/html/2606.26273#S3.Thmtheorem1)\)\. Therefore, the theorem is still applicable when the symmetry is present in the data but unknown\. However in such a case, it is hard to guarantee that the prior is invariant\. The next theorem, which we prove in[appendixE](https://arxiv.org/html/2606.26273#A5), shows that as the dataset size increases, the effect of a non\-invariant prior vanishes\.
###### Theorem 3\.8\(Prior\-independent convergence\)\.
Assume [Assumption 3.6](https://arxiv.org/html/2606.26273#S3.Thmtheorem6), and that the likelihood is invariant. Let $R^{\ast}_{\mathrm{aug}}:=\inf_{\eta}R_{\mathrm{aug}}(\eta)$ and $\eta^{\ast}_{\beta}(p_{0})\in\arg\min_{\eta}\mathcal{L}_{\beta}(\eta;p_{0})$. We then have

$$R_{\mathrm{aug}}(\eta^{\ast}_{\beta}(p_{0}))\leq R^{\ast}_{\mathrm{aug}}+\beta\cdot C(p_{0}), \tag{17}$$

for any prior $p_{0}$, where $C(p_{0}):=\inf_{\eta^{\ast}\in H^{\ast}_{R}}D_{\mathrm{KL}}(q_{\eta^{\ast}}\|p_{0})$ and $H^{\ast}_{R}:=\arg\min_{\eta}R_{\mathrm{aug}}(\eta)$ is the set of risk-optimal natural parameters. As $\beta\to 0$, all minimizers achieve $R^{\ast}_{\mathrm{aug}}$ regardless of the prior.
Note that the quantity $C(p_{0})$ measures how well the prior aligns with the risk-optimal orbit. Since $\beta=1/N$, the prior-dependent term $\beta\cdot C(p_{0})$ becomes irrelevant with sufficient data. In contrast, $R_{\mathrm{aug}}$ is asymptotically constant; it will almost surely converge to the true likelihood $\mathbb{E}_{(x,y)\sim P_{\mathcal{X}}}\mathbb{E}_{g\sim\nu_{G}}\mathbb{E}_{q_{\eta}}[-\log p((gx,gy)\mid\theta)]$.
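The inequality (17) itself follows from a short comparison argument, sketched here under the assumption that the set $H^{\ast}_{R}$ of risk minimizers is non-empty (the full proof is in the appendix). For any $\eta^{\ast}\in H^{\ast}_{R}$, optimality of $\eta^{\ast}_{\beta}=\eta^{\ast}_{\beta}(p_0)$ for $\mathcal{L}_{\beta}$ gives

$$
\begin{aligned}
R_{\mathrm{aug}}(\eta^{\ast}_{\beta})+\beta D_{\mathrm{KL}}(q_{\eta^{\ast}_{\beta}}\|p_{0})
&=\mathcal{L}_{\beta}(\eta^{\ast}_{\beta})
\leq\mathcal{L}_{\beta}(\eta^{\ast})
=R^{\ast}_{\mathrm{aug}}+\beta D_{\mathrm{KL}}(q_{\eta^{\ast}}\|p_{0}),\\
R_{\mathrm{aug}}(\eta^{\ast}_{\beta})
&\leq R^{\ast}_{\mathrm{aug}}+\beta D_{\mathrm{KL}}(q_{\eta^{\ast}}\|p_{0})
\qquad\text{since } D_{\mathrm{KL}}(q_{\eta^{\ast}_{\beta}}\|p_{0})\geq 0 .
\end{aligned}
$$

Taking the infimum over $\eta^{\ast}\in H^{\ast}_{R}$ on the right-hand side yields $R_{\mathrm{aug}}(\eta^{\ast}_{\beta})\leq R^{\ast}_{\mathrm{aug}}+\beta\cdot C(p_{0})$.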
Next, we address the fact that the true predictive maps $F_{\eta}$ must in practice be approximated with finite sample means. This causes a Monte Carlo equivariance defect $\widehat{\Delta}^{\mathrm{eq}}_{F}(\eta)$ (defined formally in ([77](https://arxiv.org/html/2606.26273#A6.E77))), but that defect vanishes at the expected rate as the number of samples grows, as the following theorem shows (proof in [appendix F](https://arxiv.org/html/2606.26273#A6)). We verify it and [Theorem 3.8](https://arxiv.org/html/2606.26273#S3.Thmtheorem8) in [Section 4.1](https://arxiv.org/html/2606.26273#S4.SS1).
###### Theorem 3.9 (Complexity for $G_{\varepsilon}$-equivariance).
Suppose $f(x;\theta)$ is uniformly bounded in $x\in\mathcal{X}$ and $q_{\eta}$-a.s. in $\theta$, and that the $G$-action on $\mathcal{Y}$ is orthogonal. Then there exist constants $C_{\mathrm{data}}$ and $C_{\mathrm{weight}}$ such that, for any fixed $\eta$, with probability at least $1-\delta$ over the $N_{0}$ i.i.d. inputs $x_{i}\sim P_{\mathcal{X}}$ and the $T$ i.i.d. posterior samples $\theta^{(t)}\sim q_{\eta}$, the Monte Carlo equivariance defect satisfies

$$\widehat{\Delta}^{\mathrm{eq}}_{F}(\eta)\leq\Delta^{\mathrm{eq}}_{F}(\eta)+C_{\mathrm{data}}\sqrt{\frac{\log(2/\delta)}{N_{0}}}+C_{\mathrm{weight}}\sqrt{\frac{\log(2/\delta)}{T}}\,. \tag{18}$$

In other words, for any $\varepsilon_{0},\varepsilon_{\mathrm{data}},\varepsilon_{\mathrm{weight}}>0$, if $\Delta_{F}^{\mathrm{eq}}(\eta)<\varepsilon_{0}$, it suffices to choose

$$N_{0}\geq\frac{C_{\mathrm{data}}^{2}}{\varepsilon_{\mathrm{data}}^{2}}\log\left(\frac{2}{\delta}\right),\quad T\geq\frac{C_{\mathrm{weight}}^{2}}{\varepsilon_{\mathrm{weight}}^{2}}\log\left(\frac{2}{\delta}\right) \tag{19}$$

in order for $\widehat{\Delta}_{F}^{\mathrm{eq}}(\eta)<\varepsilon_{0}+\varepsilon_{\mathrm{data}}+\varepsilon_{\mathrm{weight}}$ with probability at least $1-\delta$.
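To make the scaling in (18) and (19) tangible, the following sketch evaluates both formulas for given constants. The theorem only asserts that $C_{\mathrm{data}}$ and $C_{\mathrm{weight}}$ exist, so any numerical values passed in are placeholders; the function names are hypothetical.

```python
import math


def required_samples(C, eps, delta):
    """Smallest integer n with n >= (C^2 / eps^2) * log(2 / delta), as in (19)."""
    return math.ceil((C / eps) ** 2 * math.log(2.0 / delta))


def defect_bound(delta_true, C_data, C_weight, N0, T, delta):
    """Right-hand side of (18) for given (assumed) constants."""
    log_term = math.log(2.0 / delta)
    return (delta_true
            + C_data * math.sqrt(log_term / N0)
            + C_weight * math.sqrt(log_term / T))
```

Halving a tolerance roughly quadruples the required number of inputs or posterior samples, while the dependence on the confidence level $\delta$ is only logarithmic.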
### 3.4 Symmetrization of the variational posterior
According to [Theorem 3.7](https://arxiv.org/html/2606.26273#S3.Thmtheorem7) (iii), the space $H_{G}$ of parameters corresponding to equivariant variational distributions is preserved during training. Hence, if we ensure that the parameters lie in $H_{G}$ at some point during training, we will arrive at an equivariant variational posterior at the end of it. Therefore, we propose two complementary mechanisms that place $\eta$ into $H_{G}$, one by projecting an existing posterior, one by constructing the posterior from a smaller base, see Figure [1](https://arxiv.org/html/2606.26273#S1.F1).
#### Orbit averaging\.
Given a posterior distribution $q_{\eta}$ at any point during training, we construct a $G$-invariant posterior by averaging its natural parameter over the orbit:

$$\tilde{\eta}=\frac{1}{|G|}\sum_{g\in G}\phi_{g}(\eta).\tag{20}$$

We can interpret this operation in two ways. First, $\tilde{\eta}$ is the natural parameter of the *geometric mean distribution* $\tilde{q}(\eta)\propto\left[\prod_{g\in G}\big(\mathcal{T}_{g}\#q_{\eta}\big)(\theta)\right]^{\frac{1}{|G|}}$ (which lies in $\mathcal{Q}$ by [Lemma G.1](https://arxiv.org/html/2606.26273#A7.Thmtheorem1) in [Appendix G](https://arxiv.org/html/2606.26273#A7)). Second, for orthogonal $\phi_{g}$, $\tilde{\eta}$ is the orthogonal projection of $\eta$ onto the invariant subspace $H_{G}$ under the Euclidean inner product; in particular, this shows that $\tilde{\eta}\in H_{G}$.
The mechanism specializes through the choice of $\rho_{\Theta}$. Let us exemplify this in the case of $C_{4}$ acting on convolutional layers. First, we can define a representation that rotates each individual convolutional channel; second, we can define a representation that additionally permutes channel blocks, inspired by group convolutional neural networks (GCNNs) from \[[8](https://arxiv.org/html/2606.26273#bib.bib14)\]. The corresponding averaging operations on the filters are depicted in Figure [2](https://arxiv.org/html/2606.26273#S3.F2). We refer to the former as *geometric averaging* and to the latter as *projection*. We derive both in detail in [Appendix H](https://arxiv.org/html/2606.26273#A8).
Figure 2: Two specializations of the orbit averaging ([20](https://arxiv.org/html/2606.26273#S3.E20)) on the second convolutional layer of a $C_{4}$-equivariant network. Both start from the same $4\times 4$ block of variational parameters $\eta$.
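To make the orbit average (20) concrete, here is a minimal NumPy sketch of the per-filter specialization, in which $C_{4}$ acts by rotating each convolutional filter by multiples of $90^{\circ}$. The array layout `(out_channels, in_channels, k, k)` and the function names are assumptions made for illustration; for a diagonal Gaussian one would apply the same average to each natural-parameter array separately. The channel-permuting variant is not covered here.

```python
import numpy as np


def rotate_filters(eta, k=1):
    """Rotate every k x k filter by k quarter-turns (spatial axes are the last two)."""
    return np.rot90(eta, k=k, axes=(-2, -1))


def orbit_average_c4(eta):
    """Average a parameter array over its C4 orbit, as in equation (20)."""
    return sum(rotate_filters(eta, k) for k in range(4)) / 4.0


rng = np.random.default_rng(0)
eta = rng.normal(size=(4, 4, 3, 3))  # hypothetical block of variational parameters
eta_avg = orbit_average_c4(eta)

# The average is a fixed point of the group action ...
print(np.allclose(rotate_filters(eta_avg), eta_avg))
# ... and averaging twice changes nothing, as expected of a projection.
print(np.allclose(orbit_average_c4(eta_avg), eta_avg))
```

Since a rotation only permutes array entries, the action is orthogonal, so the residual `eta - eta_avg` is orthogonal to `eta_avg` under the Euclidean inner product, in line with the projection interpretation given above.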
#### Orbit expansion\.
Instead of symmetrizing a general parameter of the intended size, we can also first train a small parameter and then, in a manner inspired by the weight-sharing structure of GCNNs, extend it to a larger equivariant one. Concretely, a base network with $1/|G|$ of the target channel width is trained in Stage 1, yielding a posterior $q_{\eta_{\mathrm{small}}}$. At the start of Stage 2, the network is expanded to full width by inserting rotated versions of each filter in $|G|\times|G|$ blocks (see [Figure 3](https://arxiv.org/html/2606.26273#S3.F3)). We refer to this strategy as *orbit expansion*. The resulting posterior $q_{\eta_{\mathrm{full}}}$ has a $G$-equivariant block structure at the parameter level. In [Appendix I](https://arxiv.org/html/2606.26273#A9), we compare three different expansion strategies.
Figure 3: Filter arrangement for orbit expansion under $C_{4}$, starting from a single base filter $\eta$.
## 4 Experiments
In our experiments, we consider the cyclic group $C_{4}$ of rotations by multiples of $90^{\circ}$ acting on FashionMNIST \[[35](https://arxiv.org/html/2606.26273#bib.bib13)\]. This setup allows us to augment exactly and enables extensive Monte Carlo sampling. Throughout our experiments, we use Gaussian variational distributions, which satisfy the constraints of Theorem [3.2](https://arxiv.org/html/2606.26273#S3.Thmtheorem2). In Appendix [J](https://arxiv.org/html/2606.26273#A10), we provide additional ablations for other variational distributions. Code is provided at [github.com/dmw1998/augment-BNNs](https://github.com/dmw1998/augment-BNNs).
### 4.1 Validation of Theorems
We start by verifying the predictions of [Theorems 3.8](https://arxiv.org/html/2606.26273#S3.Thmtheorem8) and [3.9](https://arxiv.org/html/2606.26273#S3.Thmtheorem9). We use a Bayesian neural network with two convolutional layers and one linear layer and vary the dataset size $N_{0}\in\{500,\,5000,\,20000,\,50000\}$. For [Theorem 3.8](https://arxiv.org/html/2606.26273#S3.Thmtheorem8), we compare an invariant isotropic Gaussian prior $p_{0}=\mathcal{N}(0,I)$ against a non-invariant prior with means drawn from a normal distribution. Posterior predictives are estimated with $T$ Monte Carlo samples. The results are presented in [Figure 4](https://arxiv.org/html/2606.26273#S4.F4).
Figure 4: Empirical validation of [Theorems 3.8](https://arxiv.org/html/2606.26273#S3.Thmtheorem8) and [3.9](https://arxiv.org/html/2606.26273#S3.Thmtheorem9). (a) The empirical equivariance defect decreases with $N_{0}$ under both invariant and random Gaussian priors, and the two curves converge. (b) $K$-fold standard deviation of the Monte Carlo estimate $\widehat{\Delta}^{\mathrm{eq}}_{F}(\eta;T)$ across $K=10$ independent runs at each $T$. The decay matches the $\mathcal{O}(1/\sqrt{T})$ rate predicted by [Theorem 3.9](https://arxiv.org/html/2606.26273#S3.Thmtheorem9) (dotted reference line, slope $-0.5$). (c) The training gap decreases with $N_{0}$ in close agreement with the $1/\sqrt{N_{0}}$ rate predicted by [Theorem 3.9](https://arxiv.org/html/2606.26273#S3.Thmtheorem9). The dashed line is a fit with empirical constant $C_{\mathrm{data}}\approx 1.95$. Error bars are $\pm 1$ s.d. across 5 seeds.
#### Priors.
The empirical defect $\widehat{\Delta}^{\mathrm{eq}}_{F}$ decreases monotonically with $N_{0}$ under both priors, from $0.057\pm 0.007$ to $0.033\pm 0.001$ (invariant) and from $0.088\pm 0.003$ to $0.035\pm 0.001$ (random); the two curves converge as $N_{0}$ grows, consistent with [Theorem 3.8](https://arxiv.org/html/2606.26273#S3.Thmtheorem8).
#### Monte Carlo complexity\.
[Theorem 3.9](https://arxiv.org/html/2606.26273#S3.Thmtheorem9) predicts that the gap between the $T$-sample Monte Carlo estimate $\widehat{\Delta}^{\mathrm{eq}}_{F}(\eta;T)$ and the true predictive map decays as $\mathcal{O}(1/\sqrt{T})$. We assess this directly by drawing $K=10$ independent $T$-sample estimates at each $T$ and tracking their standard deviation across the $K$ runs. The results are shown in [Figure 4](https://arxiv.org/html/2606.26273#S4.F4)(b). They are consistent with the theorem: a $\log$-$\log$ fit of the $K$-fold std against $T$ yields slopes of $-0.45$, $-0.66$, $-0.52$, $-0.51$ for $N_{0}\in\{500,\,5000,\,20000,\,50000\}$, respectively, all close to the predicted $-0.5$.
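The slope check described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: `kfold_std` and `loglog_slope` are hypothetical helper names, and the toy estimator (a mean of $T$ standard normal draws) merely stands in for a $T$-sample estimate of the equivariance defect.

```python
import numpy as np


def kfold_std(estimator, t_values, k=10, seed=0):
    """Standard deviation of k independent T-sample estimates, for each T."""
    rng = np.random.default_rng(seed)
    return np.array([
        np.std([estimator(t, rng) for _ in range(k)], ddof=1)
        for t in t_values
    ])


def loglog_slope(t_values, stds):
    """Least-squares slope of log(std) against log(T)."""
    slope, _intercept = np.polyfit(np.log(t_values), np.log(stds), deg=1)
    return slope


# Toy stand-in for the Monte Carlo estimate: mean of T standard normal draws.
def toy_estimator(t, rng):
    return rng.normal(size=t).mean()


t_values = np.array([16, 64, 256, 1024])
stds = kfold_std(toy_estimator, t_values)
print(loglog_slope(t_values, stds))

# Sanity check on exact 1/sqrt(T) data: the fitted slope is -0.5.
exact_slope = loglog_slope(t_values, 3.0 / np.sqrt(t_values))
```

With only $K=10$ runs per $T$ the fitted slope is itself noisy, which is one plausible reason for scatter around $-0.5$ in fits of this kind.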
#### Data complexity\.
In Figure [4](https://arxiv.org/html/2606.26273#S4.F4)(c), we use $T_{\max}=1024$ as a proxy for $T\to\infty$ and plot the training gap $\Delta^{\mathrm{eq}}_{F}(\eta)-\widehat{\Delta}^{\mathrm{eq}}_{F}(\eta)$ together with the $\mathcal{O}(1/\sqrt{N_{0}})$ envelope from [Theorem 3.9](https://arxiv.org/html/2606.26273#S3.Thmtheorem9), with $C_{\mathrm{data}}\approx 1.95$. The measured gap decreases with $N_{0}$ in close agreement with the predicted $1/\sqrt{N_{0}}$ behavior.
### 4.2 Symmetrization on image classification
Next, we empirically compare the symmetrization mechanisms from [Section 3.4](https://arxiv.org/html/2606.26273#S3.SS4).
#### Setup\.
We train a convolutional Bayesian neural network with two convolutional layers with 32 and 64 channels, followed by a classification layer, using a diagonal Gaussian variational family and a standard isotropic Gaussian prior. The training set consists of 5 000 randomly chosen FashionMNIST images, each augmented with the full $C_{4}$ orbit (producing 20 000 training samples).
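Full-orbit augmentation under $C_{4}$ can be written in a few lines. The sketch below assumes images stored as an array of shape `(N, H, W)` with square spatial dimensions and class labels that are unchanged by rotation; `augment_c4` is a hypothetical helper name, and random data stands in for FashionMNIST.

```python
import numpy as np


def augment_c4(images, labels):
    """Return all four 90-degree rotations of every image, with labels repeated."""
    rotated = [np.rot90(images, k=k, axes=(1, 2)) for k in range(4)]
    return np.concatenate(rotated, axis=0), np.tile(labels, 4)


rng = np.random.default_rng(0)
images = rng.normal(size=(5, 28, 28))   # stand-in for 28 x 28 grayscale images
labels = rng.integers(0, 10, size=5)

aug_images, aug_labels = augment_c4(images, labels)
print(aug_images.shape, aug_labels.shape)
```

The resulting dataset is closed under the group action: rotating every augmented image by another quarter-turn yields the same set of images, which is the property that the invariance of the augmented likelihood relies on.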
The parameter space of each convolutional layer decomposes along the channel dimension. We implement the orbit-averaging mechanism of [Section 3.4](https://arxiv.org/html/2606.26273#S3.SS4) under two natural choices of $\rho_{\Theta}$ introduced there: *geometric averaging* (per-filter $C_{4}$-symmetrization, channel arrangement unconstrained) and *projection* (averaging across filters and channel-block permutation). Both satisfy the closure conditions of [Theorem 3.2](https://arxiv.org/html/2606.26273#S3.Thmtheorem2), so the results of [Section 3](https://arxiv.org/html/2606.26273#S3) apply uniformly. *Orbit expansion* is implemented with the GCNN-expansion from Figure [3](https://arxiv.org/html/2606.26273#S3.F3). All methods share the same architecture, prior, optimizer, learning rate, epoch count, and number of MC samples; they differ only in when and how the posterior is symmetrized. We use SGD as the optimizer throughout the main text: its update rule is linear in the gradient and therefore commutes exactly with $\phi_{g}$ ([Theorem 3.7](https://arxiv.org/html/2606.26273#S3.Thmtheorem7) (ii)), so the equivariant subspace $H_{G}$ is preserved under training.
Each symmetrization mechanism is applied at epoch $n\in\{0,20,100\}$ to study the effect of trigger timing ([Section 3.4](https://arxiv.org/html/2606.26273#S3.SS4)). After the intervention, all methods continue training until epoch 500. As a baseline, we train a randomly initialized network on augmented data.
#### Metrics\.
We report: \(1\) classification accuracy \(%\), \(2\) orbit same prediction \(OSP\): the fraction of predictions that agree with the mode on the orbit, averaged over the test set, and \(3\) symmetric KL divergence between predictive distributions on original and rotated inputs\.
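The two equivariance metrics can be sketched as follows. This is one reading of the definitions above rather than the authors' evaluation code: the function names are hypothetical, the predictions are assumed to be arranged with one row per test image and one column per orbit element, and the symmetric KL is taken here as the sum of both directions (whether the paper sums or averages the two directions is not stated in this section).

```python
import numpy as np


def orbit_same_prediction(pred_classes):
    """Mean fraction of orbit predictions that agree with the orbit's modal class.

    pred_classes: non-negative integer array of shape (N, |G|).
    """
    n, group_size = pred_classes.shape
    agreement = np.empty(n)
    for i, row in enumerate(pred_classes):
        agreement[i] = np.bincount(row).max() / group_size
    return agreement.mean()


def symmetric_kl(p, q, eps=1e-12):
    """Mean over samples of KL(p || q) + KL(q || p) for class-probability rows."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    kl_pq = np.sum(p * np.log(p / q), axis=-1)
    kl_qp = np.sum(q * np.log(q / p), axis=-1)
    return np.mean(kl_pq + kl_qp)


# Two test images, four rotations each: the first orbit agrees fully,
# the second has one dissenting prediction.
preds = np.array([[1, 1, 1, 1],
                  [0, 0, 2, 0]])
osp = orbit_same_prediction(preds)  # (1.0 + 0.75) / 2

p = np.array([[0.7, 0.2, 0.1]])
q = np.array([[0.1, 0.2, 0.7]])
skl = symmetric_kl(p, q)
```

A perfectly invariant classifier attains an OSP of 1 and a symmetric KL of 0, so higher OSP and lower symmetric KL both indicate better equivariance.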
Table 1: Effect of symmetrization mechanism and timing on FashionMNIST ($N_{0}=5\,000$, $C_{4}$ augmentation, $10\,000$ test samples), all with SGD. Results are mean $\pm$ std over 5 seeds. Bold: best in each column. Shaded: best equivariance–accuracy trade-off. Full results across optimizers and trigger epochs are in [Table 4](https://arxiv.org/html/2606.26273#A11.T4).
#### Results\.
Our results are summarized in [Table 1](https://arxiv.org/html/2606.26273#S4.T1). First, note that none of our strategies yields perfect equivariance, as expected from the Monte Carlo residual quantified by [Theorem 3.9](https://arxiv.org/html/2606.26273#S3.Thmtheorem9), which is present even when $\eta\in H_{G}$. However, all strategies yield more equivariant models, and the orbit expansion model leads to better overall performance.
We note that the performance across all metrics and strategies generally decreases when the intervention is applied later. This is natural, since a later trigger time corresponds exactly to a shorter training time after re-initialization. OSP and Sym. KL of the geometric average are the only exceptions to this trend, but this decrease is from much better values compared to all other methods.
The orbit expansion yields the best results. We hypothesize that the reason for this is that the orbit expansion results in parameters with additional symmetries compared to the orbit averaging techniques, due to the restricted number of base filters (cf. Figures [3](https://arxiv.org/html/2606.26273#S3.F3) and [5](https://arxiv.org/html/2606.26273#A9.F5)). These additional symmetries seem to induce more stability. Exploring such effects further remains future work.
In [Appendix K](https://arxiv.org/html/2606.26273#A11), we repeat the comparison with AdamW, whose adaptive rescaling violates the update commutation ([16](https://arxiv.org/html/2606.26273#S3.E16)); there, an earlier trigger *hurts* equivariance, so $H_{G}$ is less stable. This is further evidence for the correctness of our theory.
## 5 Conclusion and limitations
In this paper, we theoretically analyzed the effect of data augmentation for variational inference of Bayesian neural networks. We showed that (under some weak conditions) when the underlying exponential family of variational distributions is closed under a symmetry and an invariant prior is used, a set $H_{G}$ of symmetric parameters corresponding to equivariant predictive maps becomes invariant under gradient training on augmented data. Based on this theorem, we proposed and tested several strategies to initialize the network in $H_{G}$. The orbit expansion strategy proved especially effective, boosting both the performance and the equivariance compared to a baseline.
Throughout our work, we have made a number of assumptions that limit the scope of our results\. In particular, we focused on exponential families\. Although this is a broad class of probability distributions, some notable distributions are not part of it, such as Gaussian mixtures\. Furthermore, we have restricted our analysis \(and correspondingly our experiments\) to finite groups to enable a straightforward definition of the \(finite\) augmented dataset\. An extension to continuous groups would require an augmentation approach based on sampling from the symmetry group\. Although this is outside the scope of this work, we do not see a conceptual obstacle to extending our results in this way \(see also the discussion in Appendix[A](https://arxiv.org/html/2606.26273#A1)\)\. A final limitation is the compatibility assumption \([2](https://arxiv.org/html/2606.26273#S3.E2)\)\. Given the flexibility of choice of intermediate representations, this restriction is however mild\.
## Acknowledgments
This work was partially supported by the Wallenberg AI, Autonomous Systems and Software Program \(WASP\) funded by the Knut and Alice Wallenberg Foundation\. The computations were enabled by resources provided by the National Academic Infrastructure for Supercomputing in Sweden \(NAISS\), partially funded by the Swedish Research Council through grant agreement no\. 2025/22\-1341\.
## References
- \[1\]\(2023\)Group invariant machine learning by fundamental domain projections\.InNeurIPS Workshop on Symmetry and Geometry in Neural Representations,pp\. 181–218\.Cited by:[§2](https://arxiv.org/html/2606.26273#S2.SS0.SSS0.Px1.p1.1)\.
- \[2\]C\. M\. Bishop\(2006\)Pattern recognition and machine learning\.Information Science and Statistics,Springer\.External Links:ISBN 978\-0\-387\-31073\-2Cited by:[§3\.1](https://arxiv.org/html/2606.26273#S3.SS1.SSS0.Px3.p1.7)\.
- \[3\]C\. Blundell, J\. Cornebise, K\. Kavukcuoglu, and D\. Wierstra\(2015\-06\)Weight Uncertainty in Neural Network\.InProceedings of the 32nd International Conference on Machine Learning,pp\. 1613–1622\.External Links:ISSN 1938\-7228Cited by:[§2](https://arxiv.org/html/2606.26273#S2.SS0.SSS0.Px3.p1.1),[§3\.1](https://arxiv.org/html/2606.26273#S3.SS1.SSS0.Px4.p1.7)\.
- \[4\]J\. Brehmer, S\. Behrends, P\. D\. Haan, and T\. Cohen\(2025\-04\)Does equivariance matter at scale?\.Transactions on Machine Learning Research\.External Links:2410\.23179,ISSN 2835\-8856Cited by:[§2](https://arxiv.org/html/2606.26273#S2.SS0.SSS0.Px2.p2.1)\.
- \[5\]M\. M\. Bronstein, J\. Bruna, Y\. LeCun, A\. Szlam, and P\. Vandergheynst\(2017\)Geometric deep learning: going beyond euclidean data\.IEEE Signal Processing Magazine34\(4\),pp\. 18–42\.Cited by:[§1](https://arxiv.org/html/2606.26273#S1.p1.1),[§2](https://arxiv.org/html/2606.26273#S2.SS0.SSS0.Px1.p1.1)\.
- \[6\]S\. Chen, E\. Dobriban, and J\. H\. Lee\(2020\)A group\-theoretic framework for data augmentation\.The Journal of Machine Learning Research21\(1\),pp\. 9885–9955\.Cited by:[§1](https://arxiv.org/html/2606.26273#S1.p1.1)\.
- \[7\]Z\. Chen and W\. Zhu\(2024\)On the implicit bias of linear equivariant steerable networks\.Advances in Neural Information Processing Systems36\.Cited by:[§2](https://arxiv.org/html/2606.26273#S2.SS0.SSS0.Px2.p1.1)\.
- \[8\]T\. Cohen and M\. Welling\(2016\-06\)Group Equivariant Convolutional Networks\.InProceedings of The 33rd International Conference on Machine Learning,pp\. 2990–2999\.External Links:1602\.07576Cited by:[§2](https://arxiv.org/html/2606.26273#S2.SS0.SSS0.Px1.p1.1),[§3\.4](https://arxiv.org/html/2606.26273#S3.SS4.SSS0.Px1.p2.2)\.
- \[9\]J\. L\. Doob\(1940\)Regularity properties of certain families of chance variables\.Transactions of the American Mathematical Society47\(3\),pp\. 455–486\.External Links:ISSN 00029947, 10886850Cited by:[Appendix F](https://arxiv.org/html/2606.26273#A6.SS0.SSS0.Px2.p4.1)\.
- \[10\]H\. Duan and G\. Montúfar\(2025\-06\)Understanding Learning Invariance in Deep Linear Networks\.arXiv\.External Links:2506\.13714Cited by:[§2](https://arxiv.org/html/2606.26273#S2.SS0.SSS0.Px2.p1.1)\.
- \[11\]B\. Elesedy and S\. Zaidi\(2021\)Provably strict generalisation benefit for equivariant models\.InProceedings of the 38th International Conference on Machine Learning,pp\. 2959–2969\.Cited by:[§2](https://arxiv.org/html/2606.26273#S2.SS0.SSS0.Px2.p1.1)\.
- \[12\]Y\. Gal\(2016\)Uncertainty in deep learning\.Ph\.D\. Thesis,University of Cambridge\.Cited by:[§2](https://arxiv.org/html/2606.26273#S2.SS0.SSS0.Px3.p1.1)\.
- \[13\]K\. V\. Gandikota, J\. Geiping, Z\. Lähner, A\. Czapliński, and M\. Moeller\(2021\-06\-18\)Training or architecture? How to incorporate invariance in neural networks\.arXiv:2106\.10044\.External Links:2106\.10044Cited by:[§2](https://arxiv.org/html/2606.26273#S2.SS0.SSS0.Px2.p2.1)\.
- \[14\]J\. Gerken, O\. Carlsson, H\. Linander, F\. Ohlsson, C\. Petersson, and D\. Persson\(2022\)Equivariance versus augmentation for spherical images\.InProceedings of the 39th International Conference on Machine Learning,pp\. 7404–7421\.Cited by:[§2](https://arxiv.org/html/2606.26273#S2.SS0.SSS0.Px2.p2.1)\.
- \[15\]J\. E\. Gerken, J\. Aronsson, O\. Carlsson, H\. Linander, F\. Ohlsson, C\. Petersson, and D\. Persson\(2023\-06\)Geometric deep learning and equivariant neural networks\.Artificial Intelligence Review\.External Links:2105\.13926,ISSN 1573\-7462,[Document](https://dx.doi.org/10.1007/s10462-023-10502-7)Cited by:[§2](https://arxiv.org/html/2606.26273#S2.SS0.SSS0.Px1.p1.1)\.
- \[16\]J\. E\. Gerken and P\. Kessel\(2024\-07\)Emergent Equivariance in Deep Ensembles\.InProceedings of the 41st International Conference on Machine Learning,pp\. 15438–15465\.External Links:2403\.03103Cited by:[§1](https://arxiv.org/html/2606.26273#S1.p2.1)\.
- \[17\]A\. Graves\(2011\)Practical Variational Inference for Neural Networks\.InAdvances in Neural Information Processing Systems,J\. Shawe\-Taylor, R\. Zemel, P\. Bartlett, F\. Pereira, and K\. Weinberger \(Eds\.\),Vol\.24\.Cited by:[§2](https://arxiv.org/html/2606.26273#S2.SS0.SSS0.Px3.p1.1),[§3\.1](https://arxiv.org/html/2606.26273#S3.SS1.SSS0.Px4.p1.7)\.
- \[18\]L\. V\. Jospin, H\. Laga, F\. Boussaid, W\. Buntine, and M\. Bennamoun\(2022\-05\)Hands\-On Bayesian Neural Networks—A Tutorial for Deep Learning Users\.IEEE Computational Intelligence Magazine17\(2\),pp\. 29–48\.External Links:ISSN 1556\-6048,[Document](https://dx.doi.org/10.1109/MCI.2022.3155327)Cited by:[§2](https://arxiv.org/html/2606.26273#S2.SS0.SSS0.Px3.p1.1)\.
- \[19\]N\. Keriven and G\. Peyré\(2019\)Universal invariant and equivariant graph neural networks\.Advances in neural information processing systems32\.Cited by:[§2](https://arxiv.org/html/2606.26273#S2.SS0.SSS0.Px1.p1.1)\.
- \[20\]D\. P\. Kingma and M\. Welling\(2022\)Auto\-encoding variational bayes\.External Links:1312\.6114Cited by:[§2](https://arxiv.org/html/2606.26273#S2.SS0.SSS0.Px3.p1.1)\.
- \[21\]R\. Kondor and S\. Trivedi\(2018\)On the generalization of equivariance and convolution in neural networks to the action of compact groups\.InInternational Conference on Machine Learning,pp\. 2747–2755\.Cited by:[§2](https://arxiv.org/html/2606.26273#S2.SS0.SSS0.Px1.p1.1)\.
- \[22\]H\. Lawrence, K\. Georgiev, A\. Dienes, and B\. T\. Kiani\(2022\)Implicit bias of linear equivariant networks\.InInternational Conference on Machine Learning,pp\. 12096–12125\.Cited by:[§2](https://arxiv.org/html/2606.26273#S2.SS0.SSS0.Px2.p1.1)\.
- \[23\]C\. Lyle, M\. van der Wilk, M\. Kwiatkowska, Y\. Gal, and B\. Bloem\-Reddy\(2020\)On the benefits of invariance in neural networks\.External Links:2005\.00178Cited by:[§2](https://arxiv.org/html/2606.26273#S2.SS0.SSS0.Px2.p1.1)\.
- \[24\]D\. J\. C\. MacKay\(1992\-05\)A Practical Bayesian Framework for Backpropagation Networks\.Neural Computation4\(3\),pp\. 448–472\.External Links:ISSN 0899\-7667,[Document](https://dx.doi.org/10.1162/neco.1992.4.3.448)Cited by:[§2](https://arxiv.org/html/2606.26273#S2.SS0.SSS0.Px3.p1.1)\.
- \[25\]C\. McDiarmid\(1989\)On the method of bounded differences\.InSurveys in Combinatorics, 1989: Invited Papers at the Twelfth British Combinatorial Conference,J\. Siemons \(Ed\.\),London Mathematical Society Lecture Note Series,pp\. 148–188\.Cited by:[Appendix F](https://arxiv.org/html/2606.26273#A6.SS0.SSS0.Px2.p4.1)\.
- \[26\]N\. Mourdoukoutas, M\. Federici, G\. Pantalos, M\. van der Wilk, and V\. Fortuin\(2021\-07\)A Bayesian Approach to Invariant Deep Neural Networks\.arXiv:2107\.09301 \[cs, stat\]\.External Links:2107\.09301Cited by:[§2](https://arxiv.org/html/2606.26273#S2.SS0.SSS0.Px3.p2.1)\.
- \[27\]O\. Nordenfors and A\. Flinth\(2025\)Data augmentation and regularization for learning group equivariance\.In2025 International Conference on Sampling Theory and Applications \(SampTA\),Vol\.,pp\. 1–5\.Cited by:[§2](https://arxiv.org/html/2606.26273#S2.SS0.SSS0.Px2.p1.1)\.
- \[28\]O\. Nordenfors and A\. Flinth\(2025\)Ensembles provably learn equivariance through data augmentation\.External Links:2410\.01452Cited by:[§1](https://arxiv.org/html/2606.26273#S1.p2.1)\.
- \[29\]O\. Nordenfors, F\. Ohlsson, and A\. Flinth\(2025\)Optimization dynamics of equivariant and augmented neural networks\.Transactions on Machine Learning Research\.Cited by:[Appendix H](https://arxiv.org/html/2606.26273#A8.p1.9),[1st item](https://arxiv.org/html/2606.26273#S1.I1.i1.p1.1),[§2](https://arxiv.org/html/2606.26273#S2.SS0.SSS0.Px2.p1.1),[§3\.1](https://arxiv.org/html/2606.26273#S3.SS1.SSS0.Px2.p1.23),[§3\.2](https://arxiv.org/html/2606.26273#S3.SS2.p1.1)\.
- \[30\]O\. Puny, M\. Atzmon, E\. J\. Smith, I\. Misra, A\. Grover, H\. Ben\-Hamu, and Y\. Lipman\(2022\)Frame averaging for invariant and equivariant network design\.InInternational Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2606.26273#S2.SS0.SSS0.Px1.p1.1)\.
- \[31\]T\. F\.A\. van der Ouderaa and M\. van der Wilk\(2022\-01–05 Aug\)Learning invariant weights in neural networks\.InProceedings of the Thirty\-Eighth Conference on Uncertainty in Artificial Intelligence,J\. Cussens and K\. Zhang \(Eds\.\),Proceedings of Machine Learning Research, Vol\.180,pp\. 1992–2001\.Cited by:[§2](https://arxiv.org/html/2606.26273#S2.SS0.SSS0.Px3.p2.1)\.
- \[32\]S\. Villar, D\. W\. Hogg, K\. Storey\-Fisher, W\. Yao, and B\. Blum\-Smith\(2021\)Scalars are universal: equivariant machine learning, structured like classical physics\.Advances in Neural Information Processing Systems34,pp\. 28848–28863\.Cited by:[§2](https://arxiv.org/html/2606.26273#S2.SS0.SSS0.Px1.p1.1)\.
- \[33\]R\. Wang, R\. Walters, and R\. Yu\(2022\)Approximately equivariant networks for imperfectly symmetric dynamics\.InInternational Conference on Machine Learning,pp\. 23078–23091\.Cited by:[§2](https://arxiv.org/html/2606.26273#S2.SS0.SSS0.Px1.p1.1)\.
- \[34\]Y\. Wang, A\. A\. A\. Elhag, N\. Jaitly, J\. M\. Susskind, and M\. Á\. Bautista\(2024\-07\)Swallowing the Bitter Pill: Simplified Scalable Conformer Generation\.InProceedings of the 41st International Conference on Machine Learning,pp\. 50400–50418\.External Links:2311\.17932Cited by:[§1](https://arxiv.org/html/2606.26273#S1.p1.1)\.
- \[35\]H\. Xiao, K\. Rasul, and R\. Vollgraf\(2017\)Fashion\-mnist: a novel image dataset for benchmarking machine learning algorithms\.External Links:1708\.07747Cited by:[§4](https://arxiv.org/html/2606.26273#S4.p1.2)\.
## Appendix A From discrete to continuous compact groups
It is often the case in the geometric deep learning literature that virtually all results concerning finite groups can be generalized to a compact, continuous group, simply by replacing the average over the group with an integral over the group with respect to the *Haar measure* $\nu_{G}$:

$$\frac{1}{|G|}\sum_{g\in G}\quad\longrightarrow\quad\int_{G}\mathrm{d}\nu_{G}(g).$$

The Haar measure is the unique $G$-invariant probability measure on $G$, meaning that if $g\sim\nu_{G}$, then also $hg\sim\nu_{G}$ for any $h\in G$. Note that for finite groups, the average over $G$ already is the Haar average.
In order for us to do this, there is one minor obstruction: the definition of the augmented likelihood. Note that for a finite group, this is well-defined:

$$p(\mathcal{D}_{\mathrm{aug}}\,|\,\theta)=\prod_{(x_{i},y_{i})\in\mathcal{D}_{\mathrm{aug}}}p((x_{i},y_{i})\,|\,\theta)=\prod_{g\in G}\prod_{(x_{i},y_{i})\in\mathcal{D}}p((gx_{i},gy_{i})\,|\,\theta).$$

For an infinite group, the product above (as well as the augmented dataset) becomes infinite and hard to interpret.
We can circumvent this by considering a sampled augmentation: we draw $S$ i.i.d. elements $g_{j}\in G$ (according to $\nu_{G}$) and form a randomly augmented dataset $\mathcal{D}_{S\text{-}\mathrm{aug}}=\{(g_{j}x_{i},g_{j}y_{i})\}_{i=1,j=1}^{N_{0},S}$. The likelihood of this dataset is

$$p(\mathcal{D}_{S\text{-}\mathrm{aug}}\,|\,\theta)=\prod_{(x_{i},y_{i})\in\mathcal{D}_{S\text{-}\mathrm{aug}}}p((x_{i},y_{i})\,|\,\theta)=\prod_{j=1}^{S}\prod_{(x_{i},y_{i})\in\mathcal{D}}p((g_{j}x_{i},g_{j}y_{i})\,|\,\theta).$$

Now note that we can maximize this by minimizing the negative log-likelihood, a rescaled version of which becomes

$$-\frac{1}{S}\log p(\mathcal{D}_{S\text{-}\mathrm{aug}}\,|\,\theta)=-\frac{1}{S}\sum_{j=1}^{S}\sum_{(x_{i},y_{i})\in\mathcal{D}}\log p((g_{j}x_{i},g_{j}y_{i})\,|\,\theta)\,.$$

As we increase the number of samples, this function converges towards

$$-\int_{G}\sum_{(x_{i},y_{i})\in\mathcal{D}}\log p((gx_{i},gy_{i})\,|\,\theta)\,\mathrm{d}\nu_{G}(g)\,.$$

If we define this function as $-\log p(\mathcal{D}_{\mathrm{aug}}\,|\,\theta)$ in the infinite case, all of our results go through, with proofs more or less verbatim.
It should be noted that in practice we will always need to work on a sampled augmentation space, so there will be a discretization error with respect to the group\. Also note that in this formulation, we draw the sample once and then train on the resulting dataset – we are not drawing new random group elements in each epoch, which often is done\. To analyze this latter stochastic version requires other tools than we have used here, and we leave it to future work\.
## Appendix B Proof of [Proposition 3.1](https://arxiv.org/html/2606.26273#S3.Thmtheorem1)
See [Proposition 3.1](https://arxiv.org/html/2606.26273#S3.Thmtheorem1).
###### Proof\.
The compatibility of $\rho_{\Theta}$ ([2](https://arxiv.org/html/2606.26273#S3.E2)) with the data representations implies that for any $g\in G$ and any input $x\in\mathcal{X}$,

$$f(gx;\theta)=gf(x;\rho(g)^{-1}\theta).\tag{21}$$

This in turn implies

$$\begin{align}
p((gx,gy)\,|\,\theta)&=p(gy\,|\,gx,\theta)\,p(gx)=p(gy\,|\,f(gx;\theta))\,p(gx)\tag{22}\\
&=p(gy\,|\,gf(x;\rho(g)^{-1}\theta))\,p(gx)=p(y\,|\,f(x;\rho(g)^{-1}\theta))\,p(gx),\tag{23}
\end{align}$$

or equivalently

$$p((x,y)\,|\,\rho(g)\theta)=p(g^{-1}y\,|\,f(g^{-1}x;\theta))\,p(x).\tag{24}$$

We can now make the calculation

$$\begin{align}
L_{\mathrm{aug}}(\rho(g)\theta)&=\prod_{(x,y)\in\mathcal{D}_{\mathrm{aug}}}p((x,y)\mid\rho(g)\theta)=\prod_{(x,y)\in\mathcal{D}_{\mathrm{aug}}}p(g^{-1}y\,|\,f(g^{-1}x;\theta))\,p(x)\tag{25}\\
&=\prod_{(x,y)\in\mathcal{D}_{\mathrm{aug}}}p(g^{-1}y\,|\,f(g^{-1}x;\theta))\cdot\prod_{(x,y)\in\mathcal{D}_{\mathrm{aug}}}p(x)\tag{26}\\
&\stackrel{*}{=}\prod_{(x,y)\in\mathcal{D}_{\mathrm{aug}}}p(y\,|\,f(x;\theta))\cdot\prod_{(x,y)\in\mathcal{D}_{\mathrm{aug}}}p(x)\tag{27}\\
&=\prod_{(x,y)\in\mathcal{D}_{\mathrm{aug}}}p((x,y)\,|\,\theta)=L_{\mathrm{aug}}(\theta),\tag{28}
\end{align}$$

where (\*) follows from the fact that $\mathcal{D}_{\mathrm{aug}}$ is closed under group transformations: $h\mathcal{D}_{\mathrm{aug}}=\mathcal{D}_{\mathrm{aug}}$ for any $h\in G$. ∎
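The invariance $L_{\mathrm{aug}}(\rho(g)\theta)=L_{\mathrm{aug}}(\theta)$ can be checked numerically on a toy model. The sketch below makes several assumptions that are not part of the proof: a linear model $f(x;W)=Wx$, a Gaussian likelihood (so the negative log-likelihood is a squared error up to a constant), and $C_{4}$ acting on inputs and outputs by cyclic coordinate shifts. In that setting the compatible parameter action is $\rho(h)W=Q_{h}WP_{h}^{-1}$, where $P_{h}$ and $Q_{h}$ denote the input and output actions. The $\theta$-independent factor $\prod p(x)$ is omitted.

```python
import numpy as np


def cyclic_shift_matrix(n, k):
    """Permutation matrix that cyclically shifts coordinates by k positions."""
    return np.roll(np.eye(n), shift=k, axis=0)


def augmented_nll(W, X, Y, group):
    """Sum over (P_g, Q_g) and samples i of 0.5 * ||Q_g y_i - W P_g x_i||^2."""
    total = 0.0
    for P, Q in group:
        residual = Y @ Q.T - X @ P.T @ W.T  # rows are Q y_i - W P x_i
        total += 0.5 * np.sum(residual ** 2)
    return total


rng = np.random.default_rng(0)
n = 4
# C4 acting by cyclic shifts on both inputs and outputs (an illustrative choice).
group = [(cyclic_shift_matrix(n, k), cyclic_shift_matrix(n, k)) for k in range(n)]
X = rng.normal(size=(8, n))
Y = rng.normal(size=(8, n))
W = rng.normal(size=(n, n))

P_h, Q_h = group[1]
W_transformed = Q_h @ W @ P_h.T  # rho(h) W, using P_h^{-1} = P_h^T

loss_aug = augmented_nll(W, X, Y, group)
loss_aug_transformed = augmented_nll(W_transformed, X, Y, group)
loss_plain = augmented_nll(W, X, Y, group[:1])
loss_plain_transformed = augmented_nll(W_transformed, X, Y, group[:1])
print(np.isclose(loss_aug, loss_aug_transformed))
```

Restricting the sum to the identity element, that is, dropping the augmentation, generally breaks the invariance, which illustrates why step (\*) needs $\mathcal{D}_{\mathrm{aug}}$ to be closed under the group action.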
## Appendix C Proof of [Theorem 3.2](https://arxiv.org/html/2606.26273#S3.Thmtheorem2)
See [Theorem 3.2](https://arxiv.org/html/2606.26273#S3.Thmtheorem2).
###### Proof\.
"$\Leftarrow$" This is a simple calculation:

$$\begin{align}
(\mathcal{T}_{g}\#q_{\eta})(\theta)&=h(\rho(g)^{-1}(\theta))\exp\big(\eta^{\top}T(\rho(g)^{-1}(\theta))-A(\eta)\big)\left|\det D(\rho(g)^{-1})(\theta)\right|\tag{29}\\
&=h(\theta)\exp\big(b_{g}^{\top}T(\theta)+c_{g}\big)\exp\big(\eta^{\top}(M_{g}T(\theta)+d_{g})-A(\eta)\big)\tag{30}\\
&=h(\theta)\exp\big((M_{g}^{\top}\eta+b_{g})^{\top}T(\theta)-(A(\eta)-d_{g}^{\top}\eta-c_{g})\big)\tag{31}\\
&=:h(\theta)\exp\big(\eta_{g}^{\top}T(\theta)-A_{g}(\eta)\big).\tag{32}
\end{align}$$

It remains to verify that $A_{g}(\eta)=A(\eta)-d_{g}^{\top}\eta-c_{g}$ is indeed the log-partition function $A$ evaluated at $\eta_{g}$.
Using $\eta_{g}=M_{g}^{\top}\eta+b_{g}$, the affine condition $M_{g}T(\theta)=T(\rho(g)^{-1}\theta)-d_{g}$, and the base-measure condition $h(\theta)\exp(b_{g}^{\top}T(\theta))=h(\rho(g)^{-1}\theta)\,|\det D(\rho(g)^{-1})(\theta)|\,e^{-c_{g}}$, we compute

$$\begin{align}
A(\eta_{g})&=\log\int h(\theta)\exp\big(\eta^{\top}M_{g}T(\theta)+b_{g}^{\top}T(\theta)\big)\,d\theta\tag{33}\\
&=\log\int h(\rho(g)^{-1}\theta)\,e^{-c_{g}}\,\big|\det D(\rho(g)^{-1})(\theta)\big|\exp\big(\eta^{\top}(T(\rho(g)^{-1}\theta)-d_{g})\big)\,d\theta\tag{34}\\
&=-c_{g}-\eta^{\top}d_{g}+\log\int h(\rho(g)^{-1}\theta)\exp\big(\eta^{\top}T(\rho(g)^{-1}\theta)\big)\big|\det D(\rho(g)^{-1})(\theta)\big|\,d\theta\tag{35}\\
&=-c_{g}-\eta^{\top}d_{g}+\log\int h(\theta^{\prime})\exp\big(\eta^{\top}T(\theta^{\prime})\big)\,d\theta^{\prime}=A(\eta)-d_{g}^{\top}\eta-c_{g},\tag{36}
\end{align}$$

where the second-to-last step substitutes $\theta^{\prime}=\rho(g)^{-1}\theta$, for which $|\det D(\rho(g)^{-1})(\theta)|\,d\theta=d\theta^{\prime}$. Hence $A_{g}(\eta)=A(\eta_{g})$ and $\mathcal{T}_{g}\#q_{\eta}=q_{\eta_{g}}\in\mathcal{Q}$.
“$\Rightarrow$” Suppose that there exists a mapping $\phi_g : H \to H$ such that for any $\eta \in H$ and $\theta \in \Theta$,

$$q_\eta(\rho(g)^{-1}(\theta)) \left|\det D(\rho(g)^{-1})(\theta)\right| = q_{\phi_g(\eta)}(\theta). \tag{37}$$
First, we take the logarithm on both sides and use the form of $q_\eta$:

\begin{align}
&\log h(\rho(g)^{-1}(\theta)) + \eta^\top T(\rho(g)^{-1}(\theta)) - A(\eta) + \log\left|\det D(\rho(g)^{-1})(\theta)\right| \notag \\
&\qquad = \phi_g(\eta)^\top T(\theta) - A(\phi_g(\eta)) + \log h(\theta) \tag{38} \\
\Rightarrow\quad &\eta^\top T(\rho(g)^{-1}(\theta)) + \log\big(h(\rho(g)^{-1}(\theta)) \left|\det D(\rho(g)^{-1})(\theta)\right|\big) + A(\phi_g(\eta)) - A(\eta) \notag \\
&\qquad = \phi_g(\eta)^\top T(\theta) + \log h(\theta)\,. \tag{39}
\end{align}
Take any $\eta_1, \eta_2 \in H$, and subtract equation ([39](https://arxiv.org/html/2606.26273#A3.E39)) for $\eta_2$ from that for $\eta_1$:

\begin{align}
&(\eta_1 - \eta_2)^\top T(\rho(g)^{-1}(\theta)) - (\phi_g(\eta_1) - \phi_g(\eta_2))^\top T(\theta) \notag \\
&\qquad = A(\phi_g(\eta_1)) - A(\eta_1) - A(\phi_g(\eta_2)) + A(\eta_2)\,. \tag{40}
\end{align}
Observe that the right-hand side is independent of $\theta$. Since the exponential family is minimal, the components of $T(\theta)$ are linearly independent. To be precise, the functions $\{1, T_1(\theta), \ldots, T_k(\theta)\}$ are linearly independent. Thus, both sides must be equal to some constant $C(\eta)$ independent of $\theta$.

Denote $u = \eta_1 - \eta_2$ and $v = \phi_g(\eta_1) - \phi_g(\eta_2)$. We have:

$$u^\top T(\rho(g)^{-1}(\theta)) - v^\top T(\theta) = C(\eta)\,. \tag{41}$$

Since $\eta_1, \eta_2$ are arbitrary, $u$ can take any value in $\mathbb{R}^k$. Thus, we can choose $u$ to be each standard basis vector in $\mathbb{R}^k$. For each $j = 1, 2, \ldots, k$, we then get

$$T_j(\rho(g)^{-1}(\theta)) = \sum_{i=1}^{k} v_i^{(j)} T_i(\theta) + C_j\,. \tag{42}$$

Denoting $M_g = [v^{(1)}, v^{(2)}, \ldots, v^{(k)}]^\top$ and $d_g = (C_1, C_2, \ldots, C_k)^\top$, we conclude:

$$T(\rho(g)^{-1}(\theta)) = M_g T(\theta) + d_g\,. \tag{43}$$
Next, we substitute this back into ([39](https://arxiv.org/html/2606.26273#A3.E39)):

\begin{align}
&\eta^\top (M_g T(\theta) + d_g) + \log\big(h(\rho(g)^{-1}(\theta)) \left|\det D(\rho(g)^{-1})(\theta)\right|\big) - A(\eta) \notag \\
&\qquad = \log h(\theta) + \phi_g(\eta)^\top T(\theta) - A(\phi_g(\eta)) \tag{44} \\
\Rightarrow\quad &(\eta^\top M_g - \phi_g(\eta)^\top) T(\theta) + \log\big(h(\rho(g)^{-1}(\theta)) \left|\det D(\rho(g)^{-1})(\theta)\right|\big) - \log h(\theta) + \eta^\top d_g \notag \\
&\qquad = A(\eta) - A(\phi_g(\eta))\,. \tag{45}
\end{align}
Again, since the right-hand side is independent of $\theta$ and only depends on $\eta$, we take any two $\theta_1, \theta_2 \in \Theta$ and subtract ([45](https://arxiv.org/html/2606.26273#A3.E45)) for $\theta_2$ from that for $\theta_1$ to obtain:

$$(\eta^\top M_g - \phi_g(\eta)^\top)(T(\theta_1) - T(\theta_2)) + \log\frac{h(\rho(g)^{-1}(\theta_1)) \left|\det D(\rho(g)^{-1})(\theta_1)\right|}{h(\rho(g)^{-1}(\theta_2)) \left|\det D(\rho(g)^{-1})(\theta_2)\right|} - \log\frac{h(\theta_1)}{h(\theta_2)} = 0. \tag{46}$$

Since the logarithmic terms do not depend on $\eta$, we can write this as

$$(\eta^\top M_g - \phi_g(\eta)^\top)(T(\theta_1) - T(\theta_2)) = B(\theta_1, \theta_2)\,, \tag{47}$$

for some numbers $B(\theta_1, \theta_2)$. Now, again, since the family is minimal, the components of $T(\theta)$ are linearly independent. This implies that we can choose a value for $\theta_1$ and $k$ different values for $\theta_2$ such that the matrix $(T(\theta_1) - T(\theta_2^{\ell}))_{\ell=1}^{k} \in \mathbb{R}^{k,k}$ becomes invertible. This yields a linear system of equations in the vector $\eta^\top M_g - \phi_g(\eta)^\top$, whose unique solution we denote by $-b_g(\theta)^\top$:

$$-b_g^\top(\theta) := \eta^\top M_g - \phi_g(\eta)^\top\,. \tag{48}$$

Since the right-hand side above is independent of $\theta$, the same must be true for $b_g$, so that we get

$$\phi_g(\eta) = M_g^\top \eta + b_g\,. \tag{49}$$
Now plug ([48](https://arxiv.org/html/2606.26273#A3.E48)) back into ([45](https://arxiv.org/html/2606.26273#A3.E45)) to obtain

$$-b_g^\top T(\theta) + \log\big(h(\rho(g)^{-1}(\theta)) \left|\det D(\rho(g)^{-1})(\theta)\right|\big) - \log h(\theta) = A(\eta) - A(\phi_g(\eta)) - \eta^\top d_g\,. \tag{50}$$

Since the right-hand side is independent of $\theta$, the left-hand side must also be a constant independent of $\theta$. Since the left-hand side is independent of $\eta$, this constant is also independent of $\eta$. Denoting the constant by $c_g$, we have:

\begin{align}
\log\big(h(\rho(g)^{-1}(\theta)) \left|\det D(\rho(g)^{-1})(\theta)\right|\big) &= \log h(\theta) + b_g^\top T(\theta) + c_g \notag \\
\Longrightarrow\quad h(\rho(g)^{-1}(\theta)) \left|\det D(\rho(g)^{-1})(\theta)\right| &= h(\theta) \exp\big(b_g^\top T(\theta) + c_g\big). \tag{51}
\end{align}
Finally, since both sides of ([50](https://arxiv.org/html/2606.26273#A3.E50)) equal the constant $c_g$ established above, the right-hand side of ([50](https://arxiv.org/html/2606.26273#A3.E50)) gives

$$A(\phi_g(\eta)) = A(\eta) - d_g^\top \eta - c_g. \tag{52}$$
It remains to show that $g \mapsto \phi_g$ is an affine action of $G$ on $H$. Each $\phi_g(\eta) = M_g^\top \eta + b_g$ is affine, and $M_g$ is invertible: by minimality the rows $\{v^{(j)}\}$ chosen above are linearly independent, so $M_g = [v^{(1)}, \dots, v^{(k)}]^\top$ is nonsingular.
For the action property, the identity $e \in G$ has $\rho(e) = \mathrm{id}$, so

$$\mathcal{T}_e \# q_\eta(\theta) = q_\eta(\rho(e)^{-1}\theta) \left|\det D(\rho(e)^{-1})(\theta)\right| = q_\eta(\theta), \tag{53}$$

giving $\phi_e = \mathrm{id}$. For $g_1, g_2 \in G$, the chain rule yields

\begin{align}
\mathcal{T}_{g_1 g_2} \# q_\eta(\theta)
&= q_\eta\big(\rho(g_2^{-1})\rho(g_1^{-1})\theta\big) \big|\det D(\rho(g_2^{-1}))(\rho(g_1^{-1})\theta)\big|\, \big|\det D(\rho(g_1^{-1}))(\theta)\big| \notag \\
&= q_{\phi_{g_2}(\eta)}\big(\rho(g_1^{-1})\theta\big) \big|\det D(\rho(g_1^{-1}))(\theta)\big| = q_{\phi_{g_1}(\phi_{g_2}(\eta))}(\theta). \tag{54}
\end{align}

Thus, we have $\phi_{g_1 g_2}(\eta) = \phi_{g_1}(\phi_{g_2}(\eta))$. Since $\eta$ is arbitrary, we conclude that $\phi_{g_1 g_2} = \phi_{g_1} \circ \phi_{g_2}$. Hence $g \mapsto \phi_g$ is an affine group action of $G$ on $H$, which completes the proof. ∎
## Appendix D Proof of [Theorem 3.7](https://arxiv.org/html/2606.26273#S3.Thmtheorem7)
See [Theorem 3.7](https://arxiv.org/html/2606.26273#S3.Thmtheorem7)
###### Proof\.
We prove the four claims in turn\.
#### \(i\) Loss invariance\.
[Assumption 3.5](https://arxiv.org/html/2606.26273#S3.Thmtheorem5) implies that $q_{\eta_0} = q_{\phi_g(\eta_0)}$. Consequently,

\begin{align}
D_{\mathrm{KL}}(q_{\phi_g(\eta)}(\theta) \,\|\, q_{\eta_0}(\theta))
&= D_{\mathrm{KL}}(q_{\phi_g(\eta)}(\theta) \,\|\, q_{\phi_g(\eta_0)}(\theta)) \tag{55} \\
&= D_{\mathrm{KL}}(q_\eta(\rho(g)^{-1}\theta) \,\|\, q_{\eta_0}(\rho(g)^{-1}\theta)) \tag{56} \\
&= \int q_\eta(\rho(g)^{-1}\theta) \log\frac{q_\eta(\rho(g)^{-1}\theta)}{q_{\eta_0}(\rho(g)^{-1}\theta)}\, d(\rho(g)^{-1}\theta) \tag{57} \\
&= \int q_\eta(\theta) \log\frac{q_\eta(\theta)}{q_{\eta_0}(\theta)}\, |\det D(\rho(g)^{-1})(\theta)|\, d\theta \tag{58} \\
&= \int q_\eta(\theta) \log\frac{q_\eta(\theta)}{q_{\eta_0}(\theta)}\, d\theta = D_{\mathrm{KL}}(q_\eta(\theta) \,\|\, q_{\eta_0}(\theta)), \tag{59}
\end{align}

where the step from (57) to (58) is the transformation formula and the step from (58) to (59) uses the volume-preserving condition $|\det D(\rho(g)^{-1})(\theta)| = 1$ ([Assumption 3.6](https://arxiv.org/html/2606.26273#S3.Thmtheorem6)).
We now use the invariance of the likelihood proven in [Proposition 3.1](https://arxiv.org/html/2606.26273#S3.Thmtheorem1) to get

\begin{align}
\mathbb{E}_{q_{\phi_g(\eta)}(\theta)}[\log p(\mathcal{D}_{\mathrm{aug}} \mid \theta)]
&= \int q_{\phi_g(\eta)}(\theta) \log p(\mathcal{D}_{\mathrm{aug}} \mid \theta)\, d\theta \tag{60} \\
&= \int q_\eta(\rho(g)^{-1}\theta) \log p(\mathcal{D}_{\mathrm{aug}} \mid \theta)\, d\theta \tag{61} \\
&= \int q_\eta(\theta) \log p(\mathcal{D}_{\mathrm{aug}} \mid \rho(g)(\theta))\, |\det D(\rho(g))(\theta)|\, d\theta \tag{62} \\
&= \int q_\eta(\theta) \log p(\mathcal{D}_{\mathrm{aug}} \mid \theta)\, d\theta = \mathbb{E}_{q_\eta(\theta)}[\log p(\mathcal{D}_{\mathrm{aug}} \mid \theta)]. \tag{63}
\end{align}

We again used the transformation formula and the volume-preserving condition. Combining the two terms, we get

$$\mathcal{L}_\beta(\phi_g(\eta)) = \tfrac{1}{N}\mathbb{E}_{q_\eta(\theta)}[-\log p(\mathcal{D}_{\mathrm{aug}} \mid \theta)] + \beta D_{\mathrm{KL}}(q_\eta(\theta) \,\|\, q_{\eta_0}(\theta)) = \mathcal{L}_\beta(\eta). \tag{64}$$
#### \(ii\) Gradient equivariance and update commutativity\.
Differentiating the identity $\mathcal{L}_\beta(\phi_g(\eta)) = \mathcal{L}_\beta(\eta)$ with respect to $\eta$ and using the chain rule together with $\phi_g(\eta) = M_g^\top \eta + b_g$ yields

$$M_g \nabla_\eta \mathcal{L}_\beta \big|_{\phi_g(\eta)} = \nabla_\eta \mathcal{L}_\beta(\eta), \tag{65}$$

so that

$$\nabla_\eta \mathcal{L}_\beta(\phi_g(\eta)) = M_g^{-1} \nabla_\eta \mathcal{L}_\beta(\eta). \tag{66}$$

When $M_g$ is orthogonal, i.e., $M_g^{-1} = M_g^\top$, we get

\begin{align}
\mathcal{U}(\phi_g(\eta))
&= \phi_g(\eta) - \alpha \nabla_\eta \mathcal{L}_\beta(\phi_g(\eta)) = M_g^\top \eta + b_g - \alpha M_g^{-1} \nabla_\eta \mathcal{L}_\beta(\eta) \tag{67} \\
&= M_g^\top \big(\eta - \alpha \nabla_\eta \mathcal{L}_\beta(\eta)\big) + b_g = \phi_g(\mathcal{U}(\eta)). \tag{68}
\end{align}
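The gradient identity (65) and the commutation of the update with $\phi_g$ can be checked numerically on a toy problem. The sketch below assumes a hypothetical permutation-invariant loss and a cyclic permutation matrix as $M_g$ (orthogonal, with $b_g = 0$); neither the loss nor the names come from the paper.

```python
import numpy as np

def loss(eta):
    """Toy loss, invariant under coordinate permutations (illustrative assumption)."""
    return np.sum((eta**2 - 1.0) ** 2)

def grad(eta):
    """Gradient of the toy loss."""
    return 4.0 * eta * (eta**2 - 1.0)

def update(eta, alpha=0.1):
    """One gradient-descent step U(eta) = eta - alpha * grad L(eta)."""
    return eta - alpha * grad(eta)

# Cyclic permutation matrix: orthogonal, so M^{-1} = M^T.
M = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])

def phi(eta):
    """Affine action phi_g(eta) = M^T eta + b_g with b_g = 0."""
    return M.T @ eta

eta = np.array([0.3, -1.2, 2.0])
lhs_65 = M @ grad(phi(eta))                    # left-hand side of (65)
commutes = np.allclose(update(phi(eta)), phi(update(eta)))
```

For this toy loss, `lhs_65` should equal `grad(eta)` and `commutes` should be `True`, matching (65) and (67)-(68).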
#### \(iii\) Invariant subspace is preserved\.
Let $\eta \in H_G$, i.e., $\phi_g(\eta) = \eta$ for all $g \in G$. By the update commutativity ([16](https://arxiv.org/html/2606.26273#S3.E16)) from (ii),

$$\phi_g(\mathcal{U}(\eta)) = \mathcal{U}(\phi_g(\eta)) = \mathcal{U}(\eta) \quad \text{for all } g \in G, \tag{69}$$

so $\mathcal{U}(\eta) \in H_G$. By induction on $t$, if $\eta^{(0)} \in H_G$ then $\eta^{(t)} \in H_G$ for all $t \geq 0$.
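The induction argument can be illustrated with plain gradient descent on a hypothetical two-parameter loss that is invariant under swapping its coordinates, so that $G = C_2$ acts by $\phi_g(\eta_1, \eta_2) = (\eta_2, \eta_1)$. The loss, step size, and function names are assumptions chosen only to make the sketch runnable.

```python
import numpy as np

def grad(eta):
    """Gradient of the swap-invariant toy loss (eta1*eta2 - 1)^2 + 0.1*||eta||^2."""
    return 2.0 * (eta[0] * eta[1] - 1.0) * eta[::-1] + 0.2 * eta

def swap(eta):
    """phi_g for the nontrivial element of C_2: exchange the two coordinates."""
    return np.asarray(eta, dtype=float)[::-1]

def run_gd(eta0, alpha=0.05, steps=200):
    """Iterate the update U(eta) = eta - alpha * grad(eta)."""
    eta = np.array(eta0, dtype=float)
    for _ in range(steps):
        eta = eta - alpha * grad(eta)
    return eta

eta_sym = run_gd([0.5, 0.5])     # initialization inside H_G
eta_a = run_gd([0.5, 0.2])       # generic initialization
eta_b = run_gd(swap([0.5, 0.2])) # the swapped initialization
```

Starting in $H_G$, the iterate `eta_sym` should remain fixed under `swap`; more generally, `eta_b` should equal `swap(eta_a)`, since every update step commutes with the action.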
#### \(iv\) Optimal solutions inherit symmetry\.
Let $\eta^\ast \in H^\ast = \arg\min_\eta \mathcal{L}_\beta(\eta)$. By (i),

$$\mathcal{L}_\beta(\phi_g(\eta^\ast)) = \mathcal{L}_\beta(\eta^\ast) = \min_\eta \mathcal{L}_\beta(\eta), \tag{70}$$

so $\phi_g(\eta^\ast) \in H^\ast$ for all $g \in G$; i.e., $H^\ast$ is closed under $\phi_g$. If the minimizer is unique, then $\phi_g(\eta^\ast) = \eta^\ast$ for all $g \in G$, meaning $\eta^\ast \in H_G$. ∎
## Appendix E Proof of [Theorem 3.8](https://arxiv.org/html/2606.26273#S3.Thmtheorem8)
See [Theorem 3.8](https://arxiv.org/html/2606.26273#S3.Thmtheorem8)
###### Proof\.
Note that the log-likelihood does not depend on the prior. Hence, using the same calculation as in the proof of Theorem [3.7](https://arxiv.org/html/2606.26273#S3.Thmtheorem7)(i), we show that $R_{\mathrm{aug}}$ is $G$-invariant on the natural parameter space:

$$R_{\mathrm{aug}}(\phi_g(\eta)) = R_{\mathrm{aug}}(\eta), \quad \forall g \in G,\ \eta \in H. \tag{71}$$

In particular, the set $H^\ast_R := \arg\min_\eta R_{\mathrm{aug}}(\eta)$ is closed under $\phi_g$ for all $g \in G$.
Fix any $\eta^\ast_R \in H^\ast_R$. By the optimality of $\eta^\ast_\beta(p_0)$ for $\mathcal{L}_\beta(\cdot\,; p_0)$,

$$R_{\mathrm{aug}}(\eta^\ast_\beta(p_0)) + \beta D_{\mathrm{KL}}(q_{\eta^\ast_\beta(p_0)} \,\|\, p_0) \leq R_{\mathrm{aug}}(\eta^\ast_R) + \beta D_{\mathrm{KL}}(q_{\eta^\ast_R} \,\|\, p_0). \tag{72}$$

Dropping the non-negative KL term on the left and using $R_{\mathrm{aug}}(\eta^\ast_R) = R^\ast_{\mathrm{aug}}$,

$$R_{\mathrm{aug}}(\eta^\ast_\beta(p_0)) \leq R^\ast_{\mathrm{aug}} + \beta D_{\mathrm{KL}}(q_{\eta^\ast_R} \,\|\, p_0). \tag{73}$$

Taking the infimum over $\eta^\ast_R \in H^\ast_R$ on the right yields the stated bound

$$R_{\mathrm{aug}}(\eta^\ast_\beta(p_0)) \leq R^\ast_{\mathrm{aug}} + \beta \cdot C(p_0). \tag{74}$$

Combined with $R_{\mathrm{aug}}(\eta^\ast_\beta(p_0)) \geq R^\ast_{\mathrm{aug}}$ (by the definition of $R^\ast_{\mathrm{aug}}$), this gives

$$R^\ast_{\mathrm{aug}} \leq R_{\mathrm{aug}}(\eta^\ast_\beta(p_0)) \leq R^\ast_{\mathrm{aug}} + \beta \cdot C(p_0). \tag{75}$$

For any fixed $p_0$ with $C(p_0) < \infty$, sending $\beta \to 0$ (equivalently $N \to \infty$, since $\beta = 1/N$) yields $R_{\mathrm{aug}}(\eta^\ast_\beta(p_0)) \to R^\ast_{\mathrm{aug}}$, independent of $p_0$. ∎
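To make the sandwich bound (75) concrete, here is a deliberately simplified scalar sketch. It assumes a hypothetical risk $R(\eta) = (\eta - 1)^2$ with $R^\ast = 0$, and a penalty $\tfrac12 \eta^2$, which is the KL divergence between unit-variance Gaussians with means $\eta$ and $0$. None of this is the paper's model; it only mirrors the structure of the argument.

```python
import numpy as np

def risk(eta):
    """Toy augmented risk with minimizer eta = 1 and minimum value 0 (illustrative)."""
    return (eta - 1.0) ** 2

def kl_to_prior(eta):
    """KL(N(eta, 1) || N(0, 1)) = eta^2 / 2."""
    return 0.5 * eta**2

def minimizer(beta):
    """Closed-form argmin of risk(eta) + beta * kl_to_prior(eta)."""
    return 2.0 / (2.0 + beta)

betas = np.array([1.0, 0.1, 0.01, 0.001])
gaps = risk(minimizer(betas))          # R(eta*_beta) - R*, with R* = 0
bounds = betas * kl_to_prior(1.0)      # beta * C(p0), evaluated at the risk minimizer
```

In this toy setting the gap equals $(\beta / (2 + \beta))^2$, which lies between $0$ and $\beta/2$ and shrinks as $\beta \to 0$, in line with (75).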
## Appendix F Proof of [Theorem 3.9](https://arxiv.org/html/2606.26273#S3.Thmtheorem9)
Before stating the proof, we recall the definitions of the empirical and Monte Carlo equivariance defects\.
###### Definition F.1 (Empirical and Monte Carlo equivariance defects).
Given a dataset $\{x_i\}_{i=1}^{N_0}$, the empirical equivariance defect is

$$\widetilde{\Delta}^{\mathrm{eq}}_F(\eta) := \frac{1}{N_0}\sum_{i=1}^{N_0} \frac{1}{|G|}\sum_{g \in G} \left\|F_\eta(g x_i) - g F_\eta(x_i)\right\|^2. \tag{76}$$

Its Monte Carlo approximation using $T$ posterior samples is

$$\widehat{\Delta}^{\mathrm{eq}}_F(\eta) := \frac{1}{N_0}\sum_{i=1}^{N_0} \frac{1}{|G|}\sum_{g \in G} \left\|\widehat{F}_\eta(g x_i) - g \widehat{F}_\eta(x_i)\right\|^2, \quad \widehat{F}_\eta(x) := \frac{1}{T}\sum_{t=1}^{T} f(x; \theta^{(t)}), \tag{77}$$

where $\{\theta^{(t)}\}_{t=1}^{T}$ are i.i.d. samples from $q_\eta(\theta)$.
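The Monte Carlo defect (77) is straightforward to compute once posterior samples are available. The sketch below is a minimal illustration under stated assumptions: the model $f(x; \theta) = \tanh(\theta x)$ on $\mathbb{R}^2$, the group $C_4$ acting by $90^\circ$ rotation matrices on both inputs and outputs, and all function names are hypothetical stand-ins, and the "posterior samples" are simply drawn at random.

```python
import numpy as np

def c4_matrices():
    """The four rotation matrices of C_4 acting on R^2."""
    R = np.array([[0.0, -1.0], [1.0, 0.0]])
    return [np.linalg.matrix_power(R, i) for i in range(4)]

def posterior_mean(x, thetas):
    """F_hat(x) = (1/T) sum_t f(x; theta_t) for the toy model f(x; theta) = tanh(theta @ x)."""
    return np.mean([np.tanh(theta @ x) for theta in thetas], axis=0)

def mc_equivariance_defect(xs, thetas, group):
    """Monte Carlo equivariance defect: average of ||F_hat(g x) - g F_hat(x)||^2 over data and group."""
    total = 0.0
    for x in xs:
        Fx = posterior_mean(x, thetas)
        for g in group:
            diff = posterior_mean(g @ x, thetas) - g @ Fx
            total += float(diff @ diff)
    return total / (len(xs) * len(group))

rng = np.random.default_rng(0)
xs = rng.normal(size=(5, 2))
group = c4_matrices()
generic = rng.normal(size=(8, 2, 2))                     # unconstrained weight samples
symmetric = [a * np.eye(2) for a in rng.normal(size=8)]  # weights commuting with C_4
defect_generic = mc_equivariance_defect(xs, generic, group)
defect_symmetric = mc_equivariance_defect(xs, symmetric, group)
```

Because $\tanh$ is odd and acts elementwise, it commutes with the signed permutation matrices of $C_4$, so the `symmetric` samples should give a defect of (numerically) zero, while generic samples should not.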
We now restate [Theorem 3.9](https://arxiv.org/html/2606.26273#S3.Thmtheorem9) and prove it thereafter.
See [Theorem 3.9](https://arxiv.org/html/2606.26273#S3.Thmtheorem9)
###### Proof\.
Suppose that there exists $M > 0$ such that $\|f(x; \theta)\| \leq M$ for all $x \in \mathcal{X}$ and $q_\eta$-a.s. $\theta$. Let us begin by noting that a simple application of Jensen’s inequality shows that for each $x$,

\begin{align}
\|F_\eta(x)\| &= \|\mathbb{E}_{q_\eta}[f(x; \theta)]\| \leq \mathbb{E}_{q_\eta}[\|f(x; \theta)\|] \leq M, \tag{78} \\
\|\widehat{F}_\eta(x)\| &= \frac{1}{T}\left\|\sum_{t=1}^{T} f(x; \theta^{(t)})\right\| \leq \frac{1}{T}\sum_{t=1}^{T} \left\|f(x; \theta^{(t)})\right\| \leq M. \tag{79}
\end{align}

Using the triangle inequality and orthogonality of the group action, this yields

$$\|F_\eta(gx) \pm g F_\eta(x)\| \leq 2M, \quad \|\widehat{F}_\eta(gx) \pm g \widehat{F}_\eta(x)\| \leq 2M. \tag{80}$$

Hence, the quantities we need to estimate are essentially sums of i.i.d. bounded terms, so standard concentration inequalities apply. We now carry out the details.
#### Step 1: Dataset\-level concentration\.
Define

$$Z_i := \frac{1}{|G|}\sum_{g \in G} \|F_\eta(g x_i) - g F_\eta(x_i)\|^2. \tag{81}$$

These terms are bounded, $Z_i \in [0, 4M^2]$, their common expected value is $\mathbb{E}[Z_i] = \Delta^{\mathrm{eq}}_F(\eta)$, and their mean is $\frac{1}{N_0}\sum_{i=1}^{N_0} Z_i = \widetilde{\Delta}^{\mathrm{eq}}_F(\eta)$. Applying Hoeffding’s inequality, we get for any $t > 0$,

$$\mathbb{P}\!\left(\widetilde{\Delta}^{\mathrm{eq}}_F(\eta) - \Delta^{\mathrm{eq}}_F(\eta) \geq t\right) \leq \exp\!\left(-\frac{2 N_0 t^2}{(4M^2)^2}\right) = \exp\!\left(-\frac{N_0 t^2}{8M^4}\right). \tag{82}$$

Setting the right-hand side to $\delta/2$ and solving for $t$, we get that with probability at least $1 - \delta/2$,

$$\widetilde{\Delta}^{\mathrm{eq}}_F(\eta) \leq \Delta^{\mathrm{eq}}_F(\eta) + C_{\mathrm{data}}\sqrt{\frac{\log(2/\delta)}{N_0}}, \tag{83}$$

where $C_{\mathrm{data}} := 2\sqrt{2}\, M^2$.
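The arithmetic behind $C_{\mathrm{data}}$ can be verified directly: plugging the deviation from (83) back into the tail bound (82) should return exactly $\delta/2$. The helper name and the numerical values below are illustrative assumptions.

```python
import math

def hoeffding_radius(M, n0, delta):
    """Deviation t solving exp(-n0 * t^2 / (8 M^4)) = delta / 2, i.e. the last term of (83)."""
    c_data = 2.0 * math.sqrt(2.0) * M**2
    return c_data * math.sqrt(math.log(2.0 / delta) / n0)

M, n0, delta = 1.0, 1000, 0.05
t = hoeffding_radius(M, n0, delta)
tail = math.exp(-n0 * t**2 / (8.0 * M**4))   # right-hand side of (82) at this t
```

Since the radius scales as $1/\sqrt{N_0}$, quadrupling the dataset size halves the deviation term.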
#### Step 2: Monte Carlo concentration\.
The estimator $\widehat{\Delta}^{\mathrm{eq}}_F(\eta)$ differs from $\widetilde{\Delta}^{\mathrm{eq}}_F(\eta)$ only through replacing the posterior expectation $F_\eta(x)$ by the empirical average $\widehat{F}_\eta(x)$. Note that, in contrast to Step 1, $\mathbb{E}[\widehat{\Delta}^{\mathrm{eq}}_F(\eta)]$ is not equal to $\widetilde{\Delta}^{\mathrm{eq}}_F(\eta)$; instead, there is a small bias term. We first control that bias, and then bound the deviation of $\widehat{\Delta}^{\mathrm{eq}}_F(\eta)$ from its mean via McDiarmid’s inequality.

*Bias.* Writing $D := F_\eta(gx) - g F_\eta(x)$ and $\widehat{D} := \widehat{F}_\eta(gx) - g \widehat{F}_\eta(x)$, we have

$$\mathbb{E}\!\left[\|\widehat{D}\|^2\right] - \|D\|^2 = \mathrm{tr}\!\left(\mathrm{Cov}(\widehat{D})\right) \leq \frac{4M^2}{T}, \tag{84}$$

since $\widehat{D}$ is an average of $T$ i.i.d. terms each bounded in norm by $2M$. Averaging over $G$ and the data samples yields

$$\mathbb{E}[\widehat{\Delta}^{\mathrm{eq}}_F(\eta)] - \widetilde{\Delta}^{\mathrm{eq}}_F(\eta) \leq \frac{C_{\mathrm{bias}}}{T} \leq \frac{C_{\mathrm{bias}}}{\sqrt{T}}, \tag{85}$$

with $C_{\mathrm{bias}} = 8M^2$.
*Bounded differences.* Replacing a single sample $\theta^{(s)}$ by $\theta'^{(s)}$ changes any $\widehat{F}_\eta(x)$ by at most $\frac{1}{T}\|f(x; \theta^{(s)}) - f(x; \theta'^{(s)})\| \leq 2M/T$. By the triangle inequality, the same replacement changes every $\widehat{F}_\eta(g x_i) - g \widehat{F}_\eta(x_i)$ by at most $4M/T$. Using the identity $\|a\|^2 - \|b\|^2 = \langle a - b, a + b \rangle$ together with the bounds $\|a\| = \|\widehat{F}_\eta(g x_i) - g \widehat{F}_\eta(x_i)\| \leq 2M$ and $\|b\| = \|\widehat{F}'_\eta(g x_i) - g \widehat{F}'_\eta(x_i)\| \leq 2M$ from ([80](https://arxiv.org/html/2606.26273#A6.E80)), we get

$$\left|\|\widehat{F}_\eta(g x_i) - g \widehat{F}_\eta(x_i)\|^2 - \|\widehat{F}'_\eta(g x_i) - g \widehat{F}'_\eta(x_i)\|^2\right| \leq \frac{4M}{T} \cdot 4M = \frac{16M^2}{T}. \tag{86}$$

Averaging over $i$ and $g$ preserves this bound, so $\widehat{\Delta}^{\mathrm{eq}}_F(\eta)$ satisfies the bounded-differences condition with constants $c_s = 16M^2/T$ for each $s = 1, \ldots, T$.
McDiarmid’s inequality [[25](https://arxiv.org/html/2606.26273#bib.bib12), [9](https://arxiv.org/html/2606.26273#bib.bib11)] states that any function $\Phi$ of independent random variables satisfies

$$\mathbb{P}\big(\mathbb{E}[\Phi] - \Phi \geq t\big) \leq \exp\!\left(-\frac{2t^2}{\sum_{s=1}^{T} c_s^2}\right), \tag{87}$$

where $c_s$ is the bounded-difference constant of $\Phi$ in its $s$-th argument, i.e., substituting the $s$-th argument of $\Phi$ with a different value changes $\Phi$ by at most $c_s$.

Substituting $\Phi = \widehat{\Delta}^{\mathrm{eq}}_F(\eta)$ and $c_s = 16M^2/T$, we have $\sum_{s=1}^{T} c_s^2 = T \cdot (16M^2/T)^2 = 256M^4/T$. Hence,

$$\mathbb{P}\!\left(\mathbb{E}[\widehat{\Delta}^{\mathrm{eq}}_F(\eta)] - \widehat{\Delta}^{\mathrm{eq}}_F(\eta) \geq t\right) \leq \exp\!\left(-\frac{T t^2}{128M^4}\right). \tag{88}$$

Setting the right-hand side to $\delta/2$ and solving for $t$ yields, with probability at least $1 - \delta/2$,

$$\mathbb{E}[\widehat{\Delta}^{\mathrm{eq}}_F(\eta)] - \widehat{\Delta}^{\mathrm{eq}}_F(\eta) \leq \frac{16M^2}{\sqrt{2T}}\sqrt{\log(2/\delta)} = C_{\mathrm{mc}}\sqrt{\frac{\log(2/\delta)}{T}}, \tag{89}$$

where $C_{\mathrm{mc}} = 16M^2/\sqrt{2} = 8\sqrt{2}\, M^2$.
#### Step 3: Combining deviation and bias\.
We now combine the above to get

\begin{align}
\widehat{\Delta}^{\mathrm{eq}}_F(\eta)
&= \Delta^{\mathrm{eq}}_F(\eta) + \underbrace{\big(\widetilde{\Delta}^{\mathrm{eq}}_F(\eta) - \Delta^{\mathrm{eq}}_F(\eta)\big)}_{(83)} + \underbrace{\big(\mathbb{E}[\widehat{\Delta}^{\mathrm{eq}}_F(\eta)] - \widetilde{\Delta}^{\mathrm{eq}}_F(\eta)\big)}_{(85)} + \underbrace{\big(\widehat{\Delta}^{\mathrm{eq}}_F(\eta) - \mathbb{E}[\widehat{\Delta}^{\mathrm{eq}}_F(\eta)]\big)}_{(89)} \tag{90} \\
&\leq \Delta^{\mathrm{eq}}_F(\eta) + C_{\mathrm{data}}\sqrt{\frac{\log(2/\delta)}{N_0}} + C_{\mathrm{bias}}\sqrt{\frac{1}{T}} + C_{\mathrm{mc}} \cdot \sqrt{\frac{\log(2/\delta)}{T}}, \tag{91}
\end{align}

which yields the first claim with $C_{\mathrm{weight}} = C_{\mathrm{mc}} + C_{\mathrm{bias}} \cdot \sqrt{1/\log(2)}$.
#### Step 4: Sample requirements\.
For the decomposition $\varepsilon = \varepsilon_0 + \varepsilon_{\mathrm{data}} + \varepsilon_{\mathrm{weight}} + \varepsilon_{\mathrm{bias}}$, requiring each term in the bound to be at most its allotted budget gives

$$C_{\mathrm{data}}\sqrt{\frac{\log(2/\delta)}{N_0}} \leq \varepsilon_{\mathrm{data}}, \quad C_{\mathrm{weight}}\sqrt{\frac{\log(2/\delta)}{T}} \leq \varepsilon_{\mathrm{weight}}. \tag{92}$$

Squaring and solving for $N_0$ and $T$ yields the stated requirements. ∎
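Solving (92) for $N_0$ and $T$ gives $N_0 \geq C_{\mathrm{data}}^2 \log(2/\delta) / \varepsilon_{\mathrm{data}}^2$ and $T \geq C_{\mathrm{weight}}^2 \log(2/\delta) / \varepsilon_{\mathrm{weight}}^2$. The following sketch evaluates these expressions with the constants from Steps 1 to 3; the function name and the example budgets are assumptions, and the resulting numbers reflect worst-case constants rather than a practical recommendation.

```python
import math

def sample_requirements(M, delta, eps_data, eps_weight):
    """Smallest integers N0 and T satisfying the two inequalities in (92)."""
    c_data = 2.0 * math.sqrt(2.0) * M**2
    c_mc = 8.0 * math.sqrt(2.0) * M**2
    c_bias = 8.0 * M**2
    c_weight = c_mc + c_bias / math.sqrt(math.log(2.0))
    log_term = math.log(2.0 / delta)
    n0 = math.ceil(c_data**2 * log_term / eps_data**2)
    n_samples = math.ceil(c_weight**2 * log_term / eps_weight**2)
    return n0, n_samples

n0, n_samples = sample_requirements(M=1.0, delta=0.05, eps_data=0.1, eps_weight=0.1)
```

Because $C_{\mathrm{weight}} > C_{\mathrm{data}}$, equal budgets require more posterior samples than data points under this bound.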
## Appendix G Geometric averaging stays in $\mathcal{Q}$
###### Lemma G.1 (Geometric mean invariance and membership).
Suppose $\mathcal{Q}$ is closed under all push-forwards $\mathcal{T}_g$ as in [Theorem 3.2](https://arxiv.org/html/2606.26273#S3.Thmtheorem2), i.e., for each $g \in G$ and any $\eta \in H$,

$$(\mathcal{T}_g \# q_\eta)(\theta) = h(\theta) \exp\big(\eta_g^\top T(\theta) - A_g(\eta)\big), \tag{93}$$

for some $\eta_g \in H$. Then the geometric mean

$$\tilde{q}(\theta) \propto \left[\prod_{g \in G} \big(\mathcal{T}_g \# q_\eta\big)(\theta)\right]^{\frac{1}{|G|}} \tag{94}$$

is invariant under the group action and can be written as an exponential-family member with the same base measure $h$ and sufficient statistic $T$:

$$\tilde{q}(\theta) = h(\theta) \exp\big(\tilde{\eta}^\top T(\theta) - \tilde{A}\big), \tag{95}$$

where

$$\tilde{\eta} = \frac{1}{|G|}\sum_{g \in G} \eta_g, \quad \tilde{A} = \frac{1}{|G|}\sum_{g \in G} A(\eta_g). \tag{96}$$
###### Proof\.
Using closure under push-forward, for each $g$,

$$(\mathcal{T}_g \# q_\eta)(\theta) = h(\theta) \exp\big(\eta_g^\top T(\theta) - A(\eta_g)\big). \tag{97}$$

Taking the $\frac{1}{|G|}$-power and multiplying over all $g \in G$ yields

$$\tilde{q}(\theta) = h(\theta) \exp\Big(\frac{1}{|G|}\sum_{g \in G} \eta_g^\top T(\theta) - \frac{1}{|G|}\sum_{g \in G} A(\eta_g)\Big), \tag{98}$$

which is of the standard exponential-family form with the stated $\tilde{\eta}$ and $\tilde{A}$. Invariance follows because for any $g_0 \in G$, the set $\{\mathcal{T}_g \# q\}_{g \in G}$ is permuted under $\mathcal{T}_{g_0}$ and the product over $G$ (with equal exponents) is unchanged. Hence $\tilde{q}$ is invariant and belongs to $\mathcal{Q}$. ∎
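A small numerical illustration of the lemma, under assumptions chosen for simplicity: a one-dimensional Gaussian family with $T(\theta) = (\theta, \theta^2)$ and $G = C_2$ acting by $\theta \mapsto -\theta$, so that $M_g = \mathrm{diag}(-1, 1)$ and $b_g = 0$. The sketch checks that the geometric mean over the orbit is proportional to the family member with the averaged natural parameter, consistent with the $\propto$ in (94), and that this member is symmetric. Function names are hypothetical.

```python
import numpy as np

def log_partition(eta):
    """A(eta) for the 1D Gaussian family with T(theta) = (theta, theta^2)."""
    eta1, eta2 = eta
    return -eta1**2 / (4.0 * eta2) - 0.5 * np.log(-2.0 * eta2)

def density(theta, eta):
    """Normalized density q_eta(theta) with h = 1/sqrt(2 pi)."""
    T = np.stack([theta, theta**2])
    return np.exp(eta @ T - log_partition(eta)) / np.sqrt(2.0 * np.pi)

mu, sigma = 0.8, 1.1
eta = np.array([mu / sigma**2, -0.5 / sigma**2])
flip = np.diag([-1.0, 1.0])            # M_g for rho(g) theta = -theta
orbit = [eta, flip.T @ eta]            # eta_g for the two elements of C_2
eta_tilde = np.mean(orbit, axis=0)     # averaged natural parameter, as in (96)

theta = np.linspace(-3.0, 3.0, 13)
geo_mean = np.sqrt(density(theta, orbit[0]) * density(theta, orbit[1]))
ratio = geo_mean / density(theta, eta_tilde)   # should not depend on theta
```

The first component of `eta_tilde` vanishes, so the averaged distribution is a centered Gaussian with the original variance.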
## Appendix H The two orbit averaging strategies
In this section, we show how geometric averaging and projection naturally arise as two instances of orbit averaging. First, let us present in detail the canonical strategy of defining parameter space representations discussed in the main paper. Recall that a classical neural network (e.g. an MLP) consists of layers $\theta_\ell$ of affine maps, whose domains and co-domains are vector spaces, say $\theta_\ell : \mathcal{V}_\ell \to \mathcal{V}_{\ell+1}$, with $\mathcal{V}_0 = \mathcal{X}$ and $\mathcal{V}_L = \mathcal{Y}$. One now introduces representations $\rho_\ell$ on each $\mathcal{V}_\ell$, with $\rho_0 = \rho_{\mathcal{X}}$ and $\rho_L = \rho_{\mathcal{Y}}$ to keep the construction compatible with the given input and output representations, and defines

$$\rho_\ell(g)(\theta_\ell) = \rho_{\mathcal{V}_{\ell+1}}(g) \circ \theta_\ell \circ \rho_{\mathcal{V}_\ell}(g)^{-1}. \tag{99}$$

If the activation functions are equivariant with respect to the intermediate representations, the above definition always yields a representation satisfying ([2](https://arxiv.org/html/2606.26273#S3.E2)); see [[29](https://arxiv.org/html/2606.26273#bib.bib18)].
For the networks that we are considering in Section [3.4](https://arxiv.org/html/2606.26273#S3.SS4), the input and intermediate spaces consist of *channels of images*. That is, for a space $\mathcal{I}$ of images, we set

$$\mathcal{V}_{\ell}=\mathcal{I}^{n_{\ell}}.\tag{100}$$

On the space $\mathcal{I}$, $C_{4}$ acts through a rotation operator $R:\mathcal{I}\to\mathcal{I}$. We can extend this to $\mathcal{V}_{\ell}$ in several ways. First, we can rotate channel-wise:

$$(\rho_{\mathcal{V}_{\ell}}^{\mathrm{ch}}(i)v)_{k}=R^{i}v_{k}.\tag{101}$$

Alternatively, we can additionally act via 'rolling' along the channel dimension. More concretely, one divides the channels into groups of four, and then defines

$$(\rho_{\mathcal{V}_{\ell}}^{\mathrm{gcnn}}(i)v)_{4n+k}=R^{i}v_{4n+(k+i\,\mathrm{mod}\,4)}.\tag{102}$$

The second representation is the one that leads to GCNNs.
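The two channel-space actions (101) and (102) can be sketched on arrays as follows. This is a minimal illustration, assuming features stored as a NumPy array of shape `(channels, s, s)` with the channel count divisible by four and $R$ realized by `np.rot90`; the function names are hypothetical.

```python
import numpy as np

def act_channelwise(v, i):
    """rho^ch(i): rotate every channel by i quarter turns; v has shape (channels, s, s)."""
    return np.rot90(v, k=i, axes=(1, 2))

def act_gcnn(v, i):
    """rho^gcnn(i): output channel 4n+k is R^i applied to input channel 4n+((k+i) mod 4)."""
    c, s, _ = v.shape
    blocks = v.reshape(c // 4, 4, s, s)
    # np.roll with shift -i gives rolled[n, k] = blocks[n, (k + i) % 4].
    rolled = np.roll(blocks, shift=-i, axis=1)
    return np.rot90(rolled, k=i, axes=(2, 3)).reshape(c, s, s)

rng = np.random.default_rng(0)
v = rng.normal(size=(8, 5, 5))   # two groups of four channels, 5x5 images
```

Both maps compose like the cyclic group $C_{4}$: applying `act_gcnn` with `i=1` and then `i=2` agrees with a single application with `i=3`, and four quarter turns return the input.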
Let us now apply ([99](https://arxiv.org/html/2606.26273#A8.E99)) to a convolutional operator,

$$(\theta_{\ell}v)_{k}=\sum_{j}\eta^{kj}*v_{j}.\tag{103}$$

Note that rotation in the image space and convolutions are compatible, in the following sense:

$$(R\eta)*(Rv)=R(\eta*v).\tag{104}$$
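The compatibility relation (104) can be checked numerically. The sketch below assumes square arrays, a full (zero-padded) discrete convolution written out by hand, and $R$ realized by `np.rot90`; these are illustrative choices, and the paper's networks may use a different padding or a cross-correlation convention.

```python
import numpy as np

def conv2d_full(f, g):
    """Full 2D convolution: out[x] = sum_y f[y] * g[x - y], zero outside the supports."""
    out = np.zeros((f.shape[0] + g.shape[0] - 1, f.shape[1] + g.shape[1] - 1))
    for a in range(f.shape[0]):
        for b in range(f.shape[1]):
            # Each entry of f contributes a shifted, scaled copy of g.
            out[a:a + g.shape[0], b:b + g.shape[1]] += f[a, b] * g
    return out

rng = np.random.default_rng(0)
eta = rng.normal(size=(3, 3))    # filter
v = rng.normal(size=(6, 6))      # image

lhs = conv2d_full(np.rot90(eta), np.rot90(v))   # (R eta) * (R v)
rhs = np.rot90(conv2d_full(eta, v))             # R (eta * v)
```

With these conventions the two sides agree up to floating-point error.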
#### Geometric average\.
For the channel-wise representation, we obtain:

\begin{align}
([\rho^{\mathrm{ch}}_{\ell}(i)\theta_{\ell}]v)_{k}&=\rho_{\mathcal{V}_{\ell+1}}^{\mathrm{ch}}(i)([\theta_{\ell}\circ\rho_{\mathcal{V}_{\ell}}^{\mathrm{ch}}(i)^{-1}]v)_{k}=R^{i}\sum_{j}\eta^{kj}*(\rho_{\mathcal{V}_{\ell}}^{\mathrm{ch}}(i)^{-1}v)_{j}\tag{105}\\
&=R^{i}\sum_{j}\eta^{kj}*(R^{-i}v_{j})=\sum_{j}(R^{i}\eta^{kj})*v_{j},\tag{106}
\end{align}

where the last equality uses (104). That is, $\rho^{\mathrm{ch}}$ acts via rotating each filter individually: $(\eta^{kj})_{k,j}\to(R^{i}\eta^{kj})_{k,j}$. Correspondingly, the orbit average averages all four rotations of each filter, yielding a filter-wise symmetrization.
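The filter-wise symmetrization described here amounts to averaging the four rotations of every filter. A minimal sketch, assuming a filter bank stored as an array of shape `(K, J, s, s)` and applied to a plain array of filter values (for instance, a parameter of the variational distribution); the function name is hypothetical:

```python
import numpy as np

def symmetrize_filterwise(bank):
    """Average the four rotations of every filter in a bank of shape (K, J, s, s)."""
    rotations = [np.rot90(bank, k=i, axes=(2, 3)) for i in range(4)]
    return np.mean(rotations, axis=0)

rng = np.random.default_rng(0)
bank = rng.normal(size=(2, 3, 5, 5))
sym = symmetrize_filterwise(bank)
```

Each filter in `sym` is unchanged by a quarter turn, and applying the symmetrization twice gives the same result as applying it once.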
#### Projection\.
For the GCNN representation, let us restrict the calculation to the filters between two four-blocks, i.e. $n=0$. We get

\begin{align}
([\rho^{\mathrm{gcnn}}_{\ell}(i)\theta_{\ell}]v)_{k}&=\rho_{\mathcal{V}_{\ell+1}}^{\mathrm{gcnn}}(i)([\theta_{\ell}\circ\rho_{\mathcal{V}_{\ell}}^{\mathrm{gcnn}}(i)^{-1}]v)_{k}=R^{i}\sum_{j}\eta^{(k+i\,\mathrm{mod}\,4),j}*(\rho_{\mathcal{V}_{\ell}}^{\mathrm{gcnn}}(i)^{-1}v)_{j}\tag{107}\\
&=R^{i}\sum_{j}\eta^{(k+i\,\mathrm{mod}\,4),j}*(R^{-i}v_{j-i\,\mathrm{mod}\,4})=\sum_{j}R^{i}\eta^{(k+i\,\mathrm{mod}\,4),(j+i\,\mathrm{mod}\,4)}*v_{j}.\tag{108}
\end{align}

Thus, $\rho^{\mathrm{gcnn}}_{\ell}$ acts through both rotating and rolling the filters along the diagonals of each four-by-four block. The resulting averaging operation calculates an average of rotated filters along each such diagonal, and then inserts it into the block in rotated versions, giving $4\times 4$ blocks as in Figure [9](https://arxiv.org/html/2606.26273#A11.F9).
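The action in (108) and the resulting orbit average can be sketched for a single four-by-four block of filters. The array layout `(4, 4, s, s)`, the use of `np.rot90` for $R$, and the function names are assumptions made for illustration.

```python
import numpy as np

def act_gcnn_filters(block, i):
    """Action from (108): entry (k, j) becomes R^i applied to entry ((k+i) mod 4, (j+i) mod 4)."""
    # Rolling both block axes by -i moves entry (k+i, j+i) to position (k, j).
    rolled = np.roll(block, shift=(-i, -i), axis=(0, 1))
    return np.rot90(rolled, k=i, axes=(2, 3))

def project_gcnn(block):
    """Orbit average of a (4, 4, s, s) block over the four group elements."""
    return np.mean([act_gcnn_filters(block, i) for i in range(4)], axis=0)

rng = np.random.default_rng(0)
block = rng.normal(size=(4, 4, 3, 3))
proj = project_gcnn(block)
```

The averaged block is fixed by the action, so moving one step along a diagonal of the block corresponds to a quarter-turn rotation of the filter, which is the "rotate and roll" structure described above.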
## Appendix I Expansion along orbit: filter arrangement ablation
Let us now discuss the orbit expansion strategy. The idea is to construct a set of equivariant filters, starting from a smaller set of generic filters. To keep the exposition light, we start with a single filter $\eta$ and expand it to a single four-by-four block; the construction naturally generalizes to filter banks.



Figure 5: Three filter arrangements for orbit expansion under $C_{4}$, starting from a single base filter $\eta$ trained in the width-$1/|G|$ Stage 1 network. All three share the same first row, and only differ in how that row is expanded to the other rows. Top: row-constant channel-wise expansion. Middle: row-rolled channel-wise expansion. Bottom: GCNN expansion.

Since there are several notions of equivariance, there are different ways to achieve the equivariance of the larger filter bank. In Appendix [H](https://arxiv.org/html/2606.26273#A8), we have already derived conditions for equivariance when both the input and output representations are equal to $\rho^{\mathrm{ch}}$ and $\rho^{\mathrm{gcnn}}$, respectively. What are the conditions when the representations are unequal? Let us say that the representation on the layer $\mathcal{V}_{1}$ is $\rho^{\mathrm{gcnn}}$ and the one on $\mathcal{V}_{2}$ is $\rho^{\mathrm{ch}}$. The lifted representation then becomes

\begin{align*}
([\rho_{1}(i)\theta_{1}]v)_{k}&=\rho_{\mathcal{V}_{2}}^{\mathrm{ch}}(i)([\theta_{1}\circ\rho_{\mathcal{V}_{1}}^{\mathrm{gcnn}}(i)^{-1}]v)_{k}=R^{i}\sum_{j}\eta^{kj}*(\rho_{\mathcal{V}_{1}}^{\mathrm{gcnn}}(i)^{-1}v)_{j}\\
&=R^{i}\sum_{j}\eta^{kj}*(R^{-i}v_{j-i})=\sum_{j}(R^{i}\eta^{k,j+i})*v_{j}.
\end{align*}

Consequently, a layer is equivariant if and only if

$$R^{i}\eta^{k,j+i}=\eta^{k,j}\text{ for all }i,j,k.\tag{109}$$

This means that within each row, the filters should be rotations of each other. In Figure [5](https://arxiv.org/html/2606.26273#A9.F5), we plot three ways to construct filters that fulfill this condition. We take the filter $\eta$ and rotate it to form the first row, and expand that structure into an entire block in three different ways:
- In the *row-constant channel-wise expansion*, we simply copy the first row to the others.
- In the *row-rolled channel-wise expansion*, we roll the filters along the row dimension to define the new rows.
- In the *GCNN expansion*, we roll the filters along the row dimension and rotate them to define the new rows.
All of these constructions yield valid equivariant structures, since there is freedom in choosing the representation on the second feature space. At the same time, only the third construction is equivariant in the '$\rho^{\mathrm{gcnn}}$-to-$\rho^{\mathrm{gcnn}}$ sense', so it is special. See Figure [6](https://arxiv.org/html/2606.26273#A9.F6) for a summary.
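Condition (109) is easy to test numerically. The sketch below builds a first row of rotated copies of a base filter and two of the expansions under one possible reading of the index and rotation conventions (row index $k$, column index $j$, $R$ realized by `np.rot90`, indices taken modulo 4). These conventions and all function names are assumptions; the paper's implementation may orient the rolls and rotations differently, and the GCNN expansion is deliberately not reproduced here.

```python
import numpy as np

def rot(f, i):
    """R^i on a single square filter, with i taken modulo 4."""
    return np.rot90(f, k=i % 4)

def first_row(eta):
    """First row eta^{0,j} = R^{-j} eta, which is what (109) forces for k = 0."""
    return [rot(eta, -j) for j in range(4)]

def expand_row_constant(eta):
    """Copy the first row into every row; result has shape (4, 4, s, s)."""
    row = np.stack(first_row(eta))
    return np.stack([row for _ in range(4)])

def expand_row_rolled(eta):
    """Row k is the first row rolled by k positions (assumed roll direction)."""
    row = first_row(eta)
    return np.stack([np.stack([row[(j - k) % 4] for j in range(4)]) for k in range(4)])

def satisfies_109(block):
    """Check R^i eta^{k, j+i} = eta^{k, j} for all i, j, k, with indices modulo 4."""
    return all(
        np.allclose(rot(block[k, (j + i) % 4], i), block[k, j])
        for i in range(4) for j in range(4) for k in range(4)
    )

rng = np.random.default_rng(0)
eta = rng.normal(size=(3, 3))
generic_block = rng.normal(size=(4, 4, 3, 3))
```

Under these conventions both expansions pass the check, while a generic block of unrelated filters does not.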
Table 2: Effect of filter arrangement and trigger timing in orbit expansion (FashionMNIST, $N_{0}=5\,000$, $C_{4}$ augmentation, SGD, 5 seeds), evaluated at the best-accuracy checkpoint. Bold: best in each column. Shaded: best equivariance-accuracy trade-off.

Top: $\mathcal{X}$ ($\rho^{\mathrm{ch}}$) → $\mathcal{V}_{1}$ ($\rho^{\mathrm{gcnn}}$) → $\mathcal{V}_{2}$ ($\rho^{\mathrm{ch}}$) → FC-layers → $\mathcal{Y}$ ($\rho^{\mathrm{triv}}$). Row-constant ✓, Row-rolled ✓, GCNN ✓.

Bottom: $\mathcal{X}$ ($\rho^{\mathrm{ch}}$) → $\mathcal{V}_{1}$ ($\rho^{\mathrm{gcnn}}$) → $\mathcal{V}_{2}$ ($\rho^{\mathrm{gcnn}}$) → FC-layers → $\mathcal{Y}$ ($\rho^{\mathrm{triv}}$). Row-constant ✗, Row-rolled ✗, GCNN ✓.

Figure 6: Two ways of choosing the intermediate representations acting on the filter banks. All of the orbit expansion strategies are equivariant with respect to the top choice, but only the GCNN strategy is equivariant with respect to the bottom choice.

Experiment. We evaluate all three variants with trigger epochs at 0, 20, and 100, using the same setup as [Section 4.2](https://arxiv.org/html/2606.26273#S4.SS2) (FashionMNIST, $N_{0}=5\,000$, $C_{4}$ augmentation, SGD, 5 seeds). [Table 2](https://arxiv.org/html/2606.26273#A9.T2) reports metrics at the best-accuracy checkpoint (the convention used throughout the paper).
The GCNN-expansion arrangement systematically outperformed the row-constant and row-rolled channel-wise expansions. This is not surprising given that it is the only arrangement that is equivariant when the $\rho^{\mathrm{gcnn}}$-representation acts on all layers, which is the most expressive action in this setting. As we have seen, more filters fulfill the weaker symmetry condition imposed by equivariance from $\rho^{\mathrm{gcnn}}$ to $\rho^{\mathrm{ch}}$, which intuitively leads to a weaker implicit bias.
One can also note that in the GCNN-mode, the outer rotation $R^{-i}$ and the inner shift index $(i-j)$ are oriented so that $i$ cancels in their composition, collapsing all rows to a single pattern. This leads to a lack of row-wise diversity, which seems to limit the effective degrees of freedom.
## Appendix J Variational family closure ablation
Gaussian and Laplace families satisfy the conditions in [Theorem 3.2](https://arxiv.org/html/2606.26273#S3.Thmtheorem2) for linear group representations $\rho$, whereas a signed log-normal family with random, fixed sign patterns does not. The reason is that the sign assignments break the required affine structure of the sufficient statistics.
We design a controlled experiment to test whether the algebraic distinction has observable consequences during augmented variational training\.
#### Setup\.
We train Bayesian CNNs on FashionMNIST ($N_{0}=20\,000$ training samples) with full $C_{4}$ orbit augmentation, using SGD for 100 epochs. Three variational families are compared: (i) Gaussian (mean-field, closed under $C_{4}$), (ii) Laplace (mean-field, closed under $C_{4}$), and (iii) signed log-normal with random fixed signs (not closed under $C_{4}$). All other hyper-parameters (architecture, learning rate, KL weights) are shared. Each configuration is run with 5 random seeds.
#### Metrics\.
We measure the accuracy and the symmetric KL divergence (Symm. KL) between the predictive distributions at $x$ and $gx$, evaluated on both train and test sets.
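A symmetric KL divergence between two categorical predictive distributions can be computed as in the sketch below. It assumes the unnormalized form $\mathrm{KL}(p\|q)+\mathrm{KL}(q\|p)$; the paper's exact normalization (for example a factor of $1/2$) and how it averages over group elements and samples are not specified in this section, so treat the function and the toy probabilities as hypothetical.

```python
import numpy as np

def symmetric_kl(p, q, eps=1e-12):
    """KL(p||q) + KL(q||p) along the last axis, for arrays of class probabilities."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    # KL(p||q) + KL(q||p) = sum (p - q) * (log p - log q)
    return np.sum((p - q) * (np.log(p) - np.log(q)), axis=-1)

# Toy predictive distributions at x and at the transformed input gx (made-up values).
p_x = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
p_gx = np.array([[0.6, 0.3, 0.1], [0.1, 0.8, 0.1]])

defect = symmetric_kl(p_x, p_gx).mean()   # average over the two inputs
```

The quantity is zero exactly when the two predictive distributions coincide, so lower values indicate a more equivariant (here, invariant) predictor.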
#### Results\.
[Table 3](https://arxiv.org/html/2606.26273#A10.T3) and [Figure 7](https://arxiv.org/html/2606.26273#A10.F7) summarize the findings.

Table 3: Effect of variational-family closure on equivariance and accuracy under $C_{4}$ augmentation (epoch 100, mean $\pm$ std over 5 seeds). Symmetric KL divergence measures the equivariance defect (lower = more equivariant).

Figure 7: Trajectories for three variational families under $C_{4}$ augmentation (5 seeds, shaded region $=\pm$ std). (a) Both closed families reach $\geq 87.7\%$ test accuracy; the non-closed signed log-normal plateaus at $\approx 85.7\%$. (b) Gaussian and Laplace converge to the same low train equivariance defect ($\approx 0.016$), confirming that closure enables the optimizer to find a near-equivariant posterior on training data, while the signed log-normal remains $>2\times$ higher. (c) The test equivariance defect of all families grows during training as posteriors concentrate, but the ordering Gaussian $<$ Laplace $<$ log-normal is maintained throughout.

Both closed families converge to the same low train symmetric KL divergence ($\approx 0.016$), despite differing in distributional shape and training-set accuracy (Gaussian $90.79\%$, Laplace $96.58\%$). The signed log-normal, which is not closed, converges to a $2.2\times$ higher symmetric KL divergence ($0.036$), with non-overlapping seed ranges. This confirms the theorem's prediction: closure enables the optimizer to move the posterior toward a (near-)equivariant solution within the family, whereas a non-closed family faces a structural barrier.
We further find that equivariance acts as an implicit regularizer. Both closed families generalize better than the non-closed one: the Gaussian generalization gap is $2.4$ pp (train $90.79\%$ vs. test $88.43\%$), roughly half the $4.6$ pp gap of the signed log-normal (train $90.31\%$ vs. test $85.67\%$) at comparable training accuracy. This is consistent with equivariance reducing the effective degrees of freedom in the posterior.
While Laplace's train equivariance matches Gaussian, its test equivariance is noticeably worse ($0.049$ vs. $0.023$). We attribute this to the heavier tails of the Laplace distribution. The posterior satisfies the closure condition on training data, but the tail behaviour introduces additional prediction variance on unseen inputs, partially offsetting the equivariance benefit. This suggests that closure under group action is necessary but not sufficient for strong test-time equivariance; the posterior's concentration properties also play a role.
## Appendix K Trigger timing and optimizer ablation
Table 4: Trigger-timing ablation, full results. Symmetrization mechanism $\times$ optimizer $\times$ trigger epoch on FashionMNIST ($N_{0}=5\,000$, $C_{4}$ augmentation, $10\,000$ test samples). For *Geometric avg.* and *Projection*, Epoch is the epoch at which the one-shot symmetrization is applied; for *Orbit expansion* it is the number of stage-1 epochs before expansion (stage-2 runs for the remaining $500-\text{Epoch}$ epochs). All values are mean $\pm$ std over 5 seeds, taken at the best test-accuracy checkpoint. Bold: best across the whole table per column. Shaded: best equivariance-accuracy trade-off.

We empirically study the trigger-timing trade-off discussed in [Section 3.4](https://arxiv.org/html/2606.26273#S3.SS4), and the role of the optimizer. The setup is identical to [Section 4.2](https://arxiv.org/html/2606.26273#S4.SS2) (FashionMNIST, $N_{0}=5000$, $C_{4}$ augmentation, 5 seeds; the orbit-expansion variant is the GCNN expansion of [Appendix I](https://arxiv.org/html/2606.26273#A9)), extended to the trigger epoch set $\{0,20,50,100,150\}$. After the trigger, all methods continue training without symmetry constraints up to epoch 500. We consider both SGD and AdamW.
We compare the equivariance properties of the best model (highest test accuracy) of each method in [Table 4](https://arxiv.org/html/2606.26273#A11.T4). Several aspects of the results deserve comment.
AdamW vs. SGD behavior. First, the results from the main paper for the SGD runs are confirmed: an earlier trigger timing generally leads to higher accuracy and equivariance (with the exception of the orbit expansion method, which however operates in a region of high equivariance). The AdamW runs behave differently: while their accuracies are still raised by the increased training time after re-initialization, the equivariance is generally hurt. This effect is better illustrated in Figure [8](https://arxiv.org/html/2606.26273#A11.F8). This confirms the relevance of our theory, since Theorem [3.7](https://arxiv.org/html/2606.26273#S3.Thmtheorem7)(ii) applies only to SGD. One should also note that although the AdamW runs seem to drift more from the equivariant positions they are put in by our strategies, they are still more equivariant than the baseline.
Drift away from $H_{G}$. In Figure [10](https://arxiv.org/html/2606.26273#A11.F10), we plot the evolution of the symmetric KL divergence for the example of orbit expansion. From this plot it is clear that the runs drift away from equivariance after the re-initialization. This effect persists across all runs and strategies, both for AdamW and SGD. Note that this does not contradict our theory. Indeed, our results are technically only valid for gradient descent applied to the loss associated with the *true predictive* maps. In the practical runs, we are using stochastic gradient descent applied to Monte Carlo approximations of the true predictive map. Importantly, the drift is a lot slower for SGD, further confirming the role of Theorem [3.7](https://arxiv.org/html/2606.26273#S3.Thmtheorem7)(ii).
Trigger timing. The way the trigger timing influences the drift is subtle. In [Figure 9](https://arxiv.org/html/2606.26273#A11.F9), we plot the accuracy and equivariance metrics for the geometric averaging and projection methods for two trigger timings. We see that both the immediately projected and the later projected runs drift, but that the earlier triggered runs drift less. We hypothesize that when the projection is applied later in training, it projects from a point further from the typical point that a random initialization yields. It is therefore likely that the result is a point far away from the optimal region, with large gradients as a consequence. This should intuitively make the discretization effects more severe, and the drift faster as a result. This is indeed what we observe in the figures: the orbit averaging methods affect the accuracy negatively to a point where it cannot recover over the remaining epochs, and they also drift away from equivariance quickly. We also speculate that this is one of the reasons for the performance difference between projection and geometric averaging: it is plausible that the more non-local nature of the projection yields even more unlikely points after projection. We leave a more complete examination of this phenomenon to future work.
Figure 8: Accuracy versus equivariance across methods, optimizers, and trigger timings (FashionMNIST, $N_{0}=5000$, $C_{4}$; best-accuracy checkpoint, mean over 5 seeds). The $x$-axis is test accuracy and the $y$-axis is symmetric KL divergence on a log scale, so the best region is the *lower right* (more accurate *and* more equivariant). Hue encodes the method: rose for geometric averaging, blue for projection, amber for orbit expansion. Within each method the marker shade runs light $\rightarrow$ dark for trigger epochs $\{0,20,50,100,150\}$ (legend). Marker shape and line style encode the optimizer: squares with solid lines for SGD, circles with dashed lines for AdamW. Grey markers are the no-symmetrization baselines (square: SGD, circle: AdamW). Orbit expansion under SGD sits in the lower-right corner, attaining the best accuracy and equivariance simultaneously; SGD matches or exceeds AdamW on accuracy throughout, and projection is the least equivariant of the three mechanisms.

Figure 9: Training trajectories under projection vs. geometric averaging with early vs. late triggers (FashionMNIST, $N_{0}=5000$, $C_{4}$, SGD; single seed, representative of the 5-seed runs in [Table 4](https://arxiv.org/html/2606.26273#A11.T4)). Left: test accuracy. At trigger 150 (dotted line) both methods cause a brief collapse and recover, but geometric averaging returns almost back to its trigger-0 level while projection does not. Right: symmetric KL divergence on a log scale. The post-trigger drift out of $H_{G}$ has a similar shape and end value for both trigger settings within each method, consistent with the equivariance metric being controlled mainly by the drift dynamics rather than by trigger timing.

Figure 10: Drift out of $H_{G}$ under SGD vs. AdamW for orbit expansion, at two expansion times (FashionMNIST, $N_{0}=5000$, $C_{4}$; mean over 5 seeds, shaded bands $\pm 1$ s.d.). Left: Stage-1 $=20$ epochs; Right: Stage-1 $=100$ epochs. Before the dotted line both optimizers train the same width-$1/|G|$ base network and their curves coincide. At expansion the symmetric KL of both runs drops to the same near-zero floor. During Stage 2 the curves separate: SGD drifts up slowly, AdamW faster and with larger seed variance. Curves trace the full trajectory to epoch 500; [Table 4](https://arxiv.org/html/2606.26273#A11.T4) values are at the best-accuracy checkpoint, which for SGD precedes the late drift and is therefore lower.