TASER: Task-Aware Stein Regularisation for Geometry-Driven Robustness

arXiv cs.LG Papers

Summary

Introduces TASER, a training-time regularization framework derived from Langevin Stein operators that encourages geometric compatibility between predictors and data density, improving adversarial robustness and stability on CIFAR-10 without significant clean accuracy degradation.

arXiv:2605.30601v1 Announce Type: new

# TASER: Task-Aware Stein Regularisation for Geometry-Driven Robustness
Source: [https://arxiv.org/html/2605.30601](https://arxiv.org/html/2605.30601)
Michał Kozyra, Department of Statistics, University of Oxford, United Kingdom, michal.kozyra@seh.ox.ac.uk

Gesine Reinert, Department of Statistics, University of Oxford, United Kingdom, reinert@stats.ox.ac.uk

###### Abstract

Modern deep networks remain fragile under distribution shift and adversarial perturbations, often due to excessive or poorly structured input sensitivity. We introduce TASER (Task-Aware Stein Regularisation), a training-time regularisation framework derived from Langevin Stein operators. By penalising pointwise Stein residuals under the training distribution, TASER encourages geometric compatibility between predictors and data density, inducing anisotropic, data-aware smoothness. We provide theoretical links between Stein regularisation and reduced first-order shift sensitivity, develop scalable implementation variants compatible with modern architectures, and demonstrate improved robustness and stability across regression and vision benchmarks. Across CIFAR-10 experiments, TASER consistently improves the adversarial robustness of established training methods without incurring statistically significant clean-accuracy degradation.

## 1 Introduction

Deep neural networks achieve strong in-distribution performance, yet remain fragile under distribution shift and adversarial perturbations (Hendrycks and Dietterich, [2019](https://arxiv.org/html/2605.30601#bib.bib41); Goodfellow et al., [2015](https://arxiv.org/html/2605.30601#bib.bib13); Madry et al., [2018](https://arxiv.org/html/2605.30601#bib.bib14)). A central failure mode underlying both phenomena is *misaligned input sensitivity*: the predictor exhibits large responses to perturbations that are small with respect to the data distribution, while potentially underreacting to semantically meaningful variations (Tsipras et al., [2019](https://arxiv.org/html/2605.30601#bib.bib28)). In adversarial settings, this manifests as the existence of directions in input space along which small perturbations induce large changes in model output (Goodfellow et al., [2015](https://arxiv.org/html/2605.30601#bib.bib13)). In distribution shift, it leads to degraded generalisation when test inputs deviate from the training distribution in structured ways (Hendrycks and Dietterich, [2019](https://arxiv.org/html/2605.30601#bib.bib41)).

A large body of work addresses this issue by regularising model sensitivity. Classical approaches include weight decay and spectral constraints (Loshchilov and Hutter, [2019](https://arxiv.org/html/2605.30601#bib.bib32); Miyato et al., [2018](https://arxiv.org/html/2605.30601#bib.bib33)), while more direct methods penalise gradients or enforce Lipschitz bounds (Jakubovitz and Giryes, [2018](https://arxiv.org/html/2605.30601#bib.bib34); Cisse et al., [2017](https://arxiv.org/html/2605.30601#bib.bib35)). Adversarial training further seeks robustness by optimising against worst-case perturbations (Madry et al., [2018](https://arxiv.org/html/2605.30601#bib.bib14); Zhang et al., [2019](https://arxiv.org/html/2605.30601#bib.bib21)). Despite their success, these approaches share a common limitation: they treat all directions in input space uniformly or according to a fixed norm constraint. In particular, they do not explicitly incorporate the *geometry of the training distribution*. As a result, they may suppress sensitivity in directions that are semantically meaningful while failing to adequately control sensitivity in directions that move inputs away from high-probability regions.

This work introduces a different approach: *regularising model behaviour with respect to the geometry of the data distribution*. Our starting point is Stein’s method, which provides operators that characterise a probability distribution through identities of the form $\mathbb{E}_{p}[\mathcal{T}_{p}f]=0$ (Stein, [1972](https://arxiv.org/html/2605.30601#bib.bib3); Ley et al., [2017](https://arxiv.org/html/2605.30601#bib.bib4)). For a distribution $p$ with score $s_{p}(x)=\nabla\log p(x)$, the Langevin Stein operator

$$\mathcal{L}_{p}f(x)=\Delta f(x)+s_{p}(x)^{\top}\nabla f(x) \tag{1}$$

encodes the local geometry of $p$ through a combination of curvature and directional derivative terms. Here $\Delta f(x)=\mathrm{tr}(\nabla^{2}f(x))$ is the Laplacian, $\nabla$ is the gradient, and $\top$ denotes the transpose. For each fixed function $f$ we call $r_{f}(x)=\mathcal{L}_{p}f(x)$ the (pointwise) *Stein residual* at $x$.

![Refer to caption](https://arxiv.org/html/2605.30601v1/x1.png)

Figure 1: Isotropic versus geometry-aware smoothness. Standard regularisers (left) enforce a uniform penalty on model sensitivity, treating all input directions equally. TASER (right) induces an anisotropic smoothness envelope aligned with the data manifold: sensitivity along the manifold is largely unconstrained, while sensitivity in the off-manifold direction aligned with the score field $\nabla\log p(x)$ is strongly penalised.

We propose TASER (Task-Aware Stein Regularisation), a training-time regularisation framework that penalises pointwise Stein residuals:

$$\mathcal{L}_{\mathrm{total}}(\theta)=\mathcal{L}_{\mathrm{task}}(\theta)+\lambda\,\mathbb{E}_{X\sim p}\big[(\mathcal{L}_{p}f_{\theta}(X))^{2}\big]. \tag{2}$$

Unlike conventional regularisers that act uniformly across input space, TASER imposes constraints that are explicitly shaped by the distribution $p$. In particular, the score-weighted term $s_{p}(x)^{\top}\nabla f(x)$ in $\mathcal{L}_{p}$ penalises sensitivity along directions in which the data density changes most rapidly, while the Laplacian term controls curvature globally. Together, they enforce a form of *geometry-aware smoothness* that aligns model sensitivity with the structure of the data.

This perspective is particularly natural in high-dimensional settings where data concentrate near lower-dimensional structures (Fefferman et al., [2016](https://arxiv.org/html/2605.30601#bib.bib31)). In such regimes, directions of steepest density change tend to be orthogonal to regions of high probability mass, and TASER suppresses sensitivity along these directions without requiring explicit manifold estimation. This provides a principled mechanism for reducing off-distribution sensitivity, which is a key driver of adversarial vulnerability (Fawzi et al., [2018](https://arxiv.org/html/2605.30601#bib.bib29); Gilmer et al., [2018](https://arxiv.org/html/2605.30601#bib.bib30)).

From a theoretical standpoint, TASER admits a direct robustness interpretation. The Stein residual governs the first-order response of the model under smooth perturbations of the data distribution. In particular, for exponential tilts of the form $q_{\varepsilon}(x)\propto p(x)e^{\varepsilon h(x)}$, the expectation of $\mathcal{L}_{p}f$ under $q_{\varepsilon}$ scales with the covariance between the Stein residual and the perturbation $h$. Minimising the variance of $\mathcal{L}_{p}f$ therefore directly bounds first-order sensitivity to a broad class of distributional shifts.
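The tilt claim can be made concrete with a short first-order expansion. The sketch below is a standard calculation, not taken from the paper's appendix: it assumes enough integrability to differentiate under the expectation, writes $g=\mathcal{L}_{p}f$ for brevity (shorthand introduced here), and uses the Stein identity $\mathbb{E}_{p}[g]=0$.

```latex
\begin{align*}
\mathbb{E}_{q_\varepsilon}[g]
  &= \frac{\mathbb{E}_p\big[g\,e^{\varepsilon h}\big]}{\mathbb{E}_p\big[e^{\varepsilon h}\big]}, \\
\frac{d}{d\varepsilon}\,\mathbb{E}_{q_\varepsilon}[g]\Big|_{\varepsilon=0}
  &= \mathbb{E}_p[g h]-\mathbb{E}_p[g]\,\mathbb{E}_p[h]
   = \mathrm{Cov}_p(g,h), \\
\mathbb{E}_{q_\varepsilon}[g]
  &= \varepsilon\,\mathrm{Cov}_p(g,h)+O(\varepsilon^{2}), \qquad
\big|\mathrm{Cov}_p(g,h)\big|\le\sqrt{\mathrm{Var}_p(g)\,\mathrm{Var}_p(h)} .
\end{align*}
```

The last inequality is Cauchy–Schwarz, which is why a small variance of the Stein residual limits the first-order response for every tilt direction $h$ of bounded variance.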

TASER is simple to implement and broadly applicable. It requires only access to input gradients and an estimate of the score field, which can be obtained from modern diffusion or score-matching models (Ho et al., [2020](https://arxiv.org/html/2605.30601#bib.bib9); Song et al., [2021](https://arxiv.org/html/2605.30601#bib.bib10)). The method is agnostic to architecture and task, and can be combined with existing training pipelines, including adversarial training.

#### Contributions\.

This work makes the following contributions:

- We introduce TASER, a Stein-operator-based regularisation framework that enforces geometry-aware constraints on model sensitivity.
- We show that TASER penalises directional derivatives aligned with the data distribution, providing a principled alternative to isotropic gradient regularisation.
- We establish a theoretical connection between Stein residual minimisation and reduced first-order sensitivity under distributional perturbations.
- We demonstrate that TASER provides a natural mechanism for improving adversarial robustness by suppressing sensitivity in directions that move inputs away from high-density regions.

More broadly, TASER reframes Stein operators as tools for*training*, and provides a bridge between generative modelling \(via score estimation\) and discriminative robustness\.

## 2 Related Work

#### Regularising model sensitivity\.

Controlling the sensitivity of neural networks with respect to their inputs is a central theme in improving robustness and generalisation. Classical approaches such as weight decay and spectral constraints (Loshchilov and Hutter, [2019](https://arxiv.org/html/2605.30601#bib.bib32); Miyato et al., [2018](https://arxiv.org/html/2605.30601#bib.bib33)) limit sensitivity indirectly through parameter norms, while more direct methods penalise input gradients, for example via Jacobian norm regularisation (Jakubovitz and Giryes, [2018](https://arxiv.org/html/2605.30601#bib.bib34); Cisse et al., [2017](https://arxiv.org/html/2605.30601#bib.bib35)). These techniques enforce smoothness of the predictor in the ambient input space and are typically agnostic to the underlying data distribution. As a result, they impose uniform constraints across all directions, without distinguishing between variations that are consistent with the data distribution and those that correspond to unlikely or off-distribution perturbations.

#### Adversarial training and robust optimisation\.

Adversarial training and robust optimisation methods address sensitivity by explicitly optimising model performance under worst-case perturbations within a prescribed norm ball (Madry et al., [2018](https://arxiv.org/html/2605.30601#bib.bib14); Goodfellow et al., [2015](https://arxiv.org/html/2605.30601#bib.bib13)). Extensions such as TRADES and related formulations further explore the trade-off between robustness and accuracy (Zhang et al., [2019](https://arxiv.org/html/2605.30601#bib.bib21)). While these approaches have demonstrated strong empirical robustness, they require solving a challenging inner maximisation problem and rely on a choice of perturbation set, most commonly defined in terms of $\ell_{p}$ norms. This dependence can limit generalisability, as robustness is often tied to the specific class of perturbations seen during training. Moreover, such formulations do not explicitly encode the geometry of the data distribution and may over-regularise directions that are not relevant for typical data variations.

#### Score\-based models and diffusion\.

Score-based and diffusion models provide scalable methods for estimating the score field $\nabla\log p(x)$ in high dimensions (Ho et al., [2020](https://arxiv.org/html/2605.30601#bib.bib9); Song et al., [2021](https://arxiv.org/html/2605.30601#bib.bib10)). These models have primarily been used for generative modelling, where the score defines a vector field that drives a stochastic process from noise toward the data distribution. Beyond generation, the score field encodes local geometric information about the data distribution, capturing directions of steepest density variation. This representation provides a natural bridge between generative modelling and geometric regularisation.

#### Synthetic data and diffusion\-based robustness\.

Gowal et al. ([2021](https://arxiv.org/html/2605.30601#bib.bib12)) and Nie et al. ([2022](https://arxiv.org/html/2605.30601#bib.bib11)) explore the use of generative models for improving robustness by augmenting training with synthetic data or by performing adversarial training in latent or generative spaces. These approaches leverage the learned data distribution to produce more realistic perturbations or to enrich the training set with diverse samples. More recent work based on diffusion models uses generative priors for adversarial purification or sample generation (Nie et al., [2022](https://arxiv.org/html/2605.30601#bib.bib11)).

#### Stein’s method in machine learning\.

Stein operators have been widely used in machine learning for goodness-of-fit testing, sample quality evaluation, and kernel-based discrepancy measures; for a survey see Liu et al. ([2026](https://arxiv.org/html/2605.30601#bib.bib2)). These methods exploit identities of the form $\mathbb{E}_{p}[\mathcal{T}_{p}f]=0$ to construct statistics that detect deviations from a target distribution. More recently, Stein-based quantities have been explored as diagnostic tools for detecting distribution shift and model misspecification (Kozyra and Reinert, [2026](https://arxiv.org/html/2605.30601#bib.bib1)). However, their use has largely remained in a post hoc setting, where the operator is evaluated after training rather than used to shape the training process itself.

#### Summary\.

Summarising, existing approaches to robustness either enforce uniform smoothness or optimise against predefined perturbation sets, while recent generative approaches rely on expensive and often problem\-specific pipelines\. Score\-based models provide a continuous, local representation of data geometry that is both expressive and scalable\. This work builds on these developments by using Stein operators as a training\-time regularisation mechanism, directly coupling model sensitivity to the geometry of the training distribution through the score field, while retaining a simple and modular integration with existing training methods\.

## 3 Methods: Task-Aware Stein Regularisation (TASER)

### 3.1 Problem setting and Stein formulation

Consider a supervised learning problem with input $x\in\mathbb{R}^{d}$ drawn from a distribution $p$ and target $y$. Let $f_{\theta}:\mathbb{R}^{d}\to\mathbb{R}^{m}$ denote a model. The objective is to learn $f_{\theta}$ such that it generalises beyond the training distribution and remains stable under structured perturbations of the input.

A central difficulty is that standard training objectives impose no constraint on how the predictor behaves relative to the geometry of the data distribution. In particular, the gradient $\nabla f_{\theta}(x)$ may align with directions in which the density $p(x)$ changes rapidly, leading to large output variations under perturbations that move inputs away from high-probability regions.

To address this, we introduce a regularisation principle based on Stein operators. Let $p$ be a distribution with differentiable (possibly unnormalised) density and score function $s_{p}(x)=\nabla\log p(x)$. For the Langevin Stein operator $\mathcal{L}_{p}f(x)=\Delta f(x)+s_{p}(x)^{\top}\nabla f(x)$ as in ([1](https://arxiv.org/html/2605.30601#S1.E1)), under standard regularity conditions, the Stein identity $\mathbb{E}_{X\sim p}[\mathcal{L}_{p}f(X)]=0$ holds (see Appendix [B](https://arxiv.org/html/2605.30601#A2)). TASER uses this identity as a training principle, penalising deviations from it at the sample level:

$$\mathcal{L}_{\mathrm{total}}(\theta)=\mathcal{L}_{\mathrm{task}}(\theta)+\lambda\,\mathbb{E}_{X\sim p}\big[(\mathcal{L}_{p}f_{\theta}(X))^{2}\big]. \tag{3}$$

Since $\mathbb{E}_{p}[\mathcal{L}_{p}f]=0$, the penalty corresponds to the variance under the training distribution of the (pointwise) Stein residual

$$r_{f}(x)=\mathcal{L}_{p}f(x)=\Delta f(x)+s_{p}(x)^{\top}\nabla f(x). \tag{4}$$

### 3.2 Structure of the Stein residual

The Stein residual $r_{f}$ in ([4](https://arxiv.org/html/2605.30601#S3.E4)) combines two complementary effects: curvature and directional sensitivity. The term $s_{p}(x)^{\top}\nabla f(x)$ measures how the predictor changes along directions in which the data density varies most rapidly. In high-dimensional settings where data concentrate near lower-dimensional structures, these directions tend to be orthogonal to regions of high probability mass. Decomposing $\nabla f$ into a tangential and a normal component,

$$\nabla f(x)=\Pi_{T}(x)\nabla f(x)+\Pi_{N}(x)\nabla f(x),$$

the score-weighted term predominantly probes the normal component $\Pi_{N}(x)\nabla f(x)$, corresponding to deviations from typical data configurations. Penalising this term therefore suppresses sensitivity in directions that move inputs away from the data distribution. See Figure [1](https://arxiv.org/html/2605.30601#S1.F1) for a schematic visualisation.

The Laplacian $\Delta f(x)$ plays a complementary role by controlling curvature. It prevents the predictor from compensating for large directional derivatives through oscillatory behaviour, and enforces smoothness across all directions.

An equivalent formulation is given by the divergence identity

$$\mathcal{L}_{p}f(x)=\frac{1}{p(x)}\nabla\cdot\big(p(x)\nabla f(x)\big);$$

the Stein residual measures the divergence of the density-weighted sensitivity field $p(x)\nabla f(x)$. Minimising this quantity enforces compatibility between the predictor and the geometry of $p$.
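The divergence identity follows from the product rule; the short derivation below spells it out, assuming $p$ is differentiable and strictly positive so that $\nabla p/p=\nabla\log p$ is well defined.

```latex
\begin{align*}
\nabla\cdot\big(p(x)\,\nabla f(x)\big)
  &= p(x)\,\Delta f(x)+\nabla p(x)^{\top}\nabla f(x), \\
\frac{1}{p(x)}\,\nabla\cdot\big(p(x)\,\nabla f(x)\big)
  &= \Delta f(x)+\frac{\nabla p(x)^{\top}}{p(x)}\,\nabla f(x)
   = \Delta f(x)+s_{p}(x)^{\top}\nabla f(x)
   = \mathcal{L}_{p}f(x).
\end{align*}
```

Integrating against $p$ gives $\mathbb{E}_{p}[\mathcal{L}_{p}f]=\int\nabla\cdot(p\,\nabla f)\,dx$, which vanishes whenever the boundary terms do; this is one way to see why the Stein identity holds.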

#### Practical variants

The full Stein operator provides the most faithful representation of this interaction, but can be computationally expensive. Two practical variants are therefore considered. First, the Laplacian term can be estimated efficiently using stochastic trace estimators such as Hutchinson’s method (Hutchinson, [1989](https://arxiv.org/html/2605.30601#bib.bib8)):

$$\Delta f(x)\approx\frac{1}{K}\sum_{k=1}^{K}v_{k}^{\top}\nabla^{2}f(x)\,v_{k},$$

where the $v_{k}$ are independent random probe vectors with $\mathbb{E}[v_{k}v_{k}^{\top}]=I$ (for example Rademacher or standard Gaussian vectors). Second, a first-order approximation can be obtained by omitting the Laplacian:

$$r^{(1)}_{f}(x)=s_{p}(x)^{\top}\nabla f(x). \tag{5}$$

This retains the core geometric effect, penalising sensitivity aligned with the score field, while significantly reducing computational cost.
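To illustrate the Hutchinson estimator, here is a minimal standard-library sketch. It is not the authors' implementation: the names `hutchinson_trace`, `HESSIAN`, and `hessian_vector_product` are hypothetical, the fixed 3×3 matrix merely stands in for $\nabla^{2}f(x)$ at a single point, and Rademacher probes are an assumed choice. In a real pipeline the Hessian-vector product would presumably come from automatic differentiation rather than an explicit matrix.

```python
import random


def hutchinson_trace(matvec, dim, num_probes, rng):
    """Estimate tr(A) as the average of v^T (A v) over Rademacher probes v."""
    total = 0.0
    for _ in range(num_probes):
        # Rademacher probe: independent entries, each -1 or +1 with equal probability.
        v = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
        av = matvec(v)
        total += sum(vi * avi for vi, avi in zip(v, av))
    return total / num_probes


# Hypothetical stand-in for the Hessian of f at one input x (symmetric).
HESSIAN = [
    [2.0, 1.0, 0.0],
    [1.0, 3.0, 0.5],
    [0.0, 0.5, -1.0],
]


def hessian_vector_product(v):
    """Multiply the stand-in Hessian by v without ever forming its trace."""
    return [sum(a * vj for a, vj in zip(row, v)) for row in HESSIAN]


rng = random.Random(0)
exact_laplacian = sum(HESSIAN[i][i] for i in range(3))
estimate = hutchinson_trace(hessian_vector_product, 3, 20000, rng)
```

The estimator only needs matrix-vector products, which is what makes it attractive when the full Hessian is too large to form. Its noise comes entirely from the off-diagonal entries, so for a diagonal matrix Rademacher probes recover the trace exactly.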

#### Approximate scores and centering

In practice, the score function $s_{p}(x)$ is replaced by an estimate $\tilde{s}(x)$, for example obtained from a diffusion model (Ho et al., [2020](https://arxiv.org/html/2605.30601#bib.bib9); Song et al., [2021](https://arxiv.org/html/2605.30601#bib.bib10)). The Stein identity then holds only approximately, introducing a bias in the residual. To account for this, we centre the residual:

$$\tilde{r}_{f}(x)=\mathcal{L}_{\tilde{p}}f(x)-D_{f},\qquad D_{f}\approx\mathbb{E}_{p}[\mathcal{L}_{\tilde{p}}f(X)].$$

The centering constant can be estimated globally using a calibration set or within each minibatch. This removes systematic bias and focuses the regulariser on variability of the Stein residual.

**Algorithm 1** Task-Aware Stein Regularisation (TASER)

1. **Input:** training set $\mathcal{D}$, model output (or probe) $f_{\theta}$, score $\tilde{s}(x)$, weight $\lambda$, number of Hutchinson probes $K$, boolean `detach_mean`
2. **for** each training step **do**
3. Sample minibatch $\{(x_{i},y_{i})\}_{i=1}^{B}$
4. Compute task loss $\mathcal{L}_{\mathrm{task}}$
5. **for** each $x_{i}$ **do**
6. Compute gradient $\nabla f_{\theta}(x_{i})$
7. Estimate $\Delta f_{\theta}(x_{i})$ using $K$ Hutchinson probes
8. $r_{i}\leftarrow\Delta f_{\theta}(x_{i})+\tilde{s}(x_{i})^{\top}\nabla f_{\theta}(x_{i})$
9. **end for**
10. $\bar{r}\leftarrow\frac{1}{B}\sum_{i}r_{i}$
11. **if** `detach_mean` **then**
12. $\bar{r}\leftarrow\mathrm{stopgrad}(\bar{r})$
13. **end if**
14. $\mathcal{R}\leftarrow\frac{1}{B}\sum_{i}(r_{i}-\bar{r})^{2}$
15. Update $\theta$ using $\mathcal{L}_{\mathrm{task}}+\lambda\mathcal{R}$
16. **end for**
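The penalty computed in steps 5 to 14 of Algorithm 1 can be sketched in one dimension as follows. This is only an illustration under strong simplifications: central finite differences replace automatic differentiation and the Hutchinson estimate, the `stopgrad` branch is omitted because there is no gradient tape here, and every name (`taser_batch_penalty`, `gaussian_score`, `linear_probe`, `constant_probe`) is hypothetical.

```python
def taser_batch_penalty(f, score, batch, step=1e-3):
    """Centred mean squared Stein residual over a minibatch of scalar inputs."""
    residuals = []
    for x in batch:
        # Central differences stand in for the input gradient and the Laplacian.
        grad = (f(x + step) - f(x - step)) / (2.0 * step)
        laplacian = (f(x + step) - 2.0 * f(x) + f(x - step)) / step**2
        residuals.append(laplacian + score(x) * grad)
    # Minibatch centring, then the variance-style penalty R of step 14.
    mean_residual = sum(residuals) / len(residuals)
    return sum((r - mean_residual) ** 2 for r in residuals) / len(residuals)


def gaussian_score(x):
    """Score of the standard normal density: d/dx log p(x) = -x."""
    return -x


def linear_probe(x):
    return x


def constant_probe(x):
    return 3.0


batch = [-1.0, 0.0, 1.0]
penalty_linear = taser_batch_penalty(linear_probe, gaussian_score, batch)
penalty_constant = taser_batch_penalty(constant_probe, gaussian_score, batch)
```

For the linear probe the residuals are approximately $-x_{i}$, so the penalty is close to the batch variance of the inputs, $2/3$ here; a constant probe has no sensitivity at all and incurs zero penalty.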

### 3.3 Choice of probe function

For vector-valued models $f_{\theta}:\mathbb{R}^{d}\to\mathbb{R}^{m}$, TASER is applied to a scalar *probe* function derived from the model. The probe should satisfy three properties: (i) it should not saturate around training data, since TASER relies on local sensitivity information; consequently, softmax probabilities are typically a poor choice; (ii) it should be relevant to robustness and decision boundaries; and (iii) it should remain numerically stable under automatic differentiation.

Empirically, we obtain the strongest results using a smooth logit-margin probe. For logits $z\in\mathbb{R}^{K}$ and label $y$, we define

$$m(x;y)=\operatorname{LSE}_{j\neq y}(z_{j})-z_{y},\qquad\operatorname{LSE}_{j\neq y}(z_{j})=\log\sum_{j\neq y}e^{z_{j}}.$$

This is a smooth approximation of the multiclass margin $\max_{j\neq y}z_{j}-z_{y}$, replacing the hard maximum with log-sum-exp. The TASER penalty is then applied to $m(x;y)$.
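A small standard-library sketch of the logit-margin probe is given below. The function names `logit_margin` and `hard_margin` are hypothetical, and the max-shift inside the log-sum-exp is a common numerical-stability device assumed here rather than something the text prescribes.

```python
import math


def logit_margin(logits, label):
    """Smooth margin: log-sum-exp over the wrong classes minus the true logit."""
    others = [z for j, z in enumerate(logits) if j != label]
    # Subtract the largest competing logit before exponentiating to avoid overflow.
    shift = max(others)
    lse = shift + math.log(sum(math.exp(z - shift) for z in others))
    return lse - logits[label]


def hard_margin(logits, label):
    """Non-smooth multiclass margin: largest wrong logit minus the true logit."""
    return max(z for j, z in enumerate(logits) if j != label) - logits[label]


logits = [2.0, 1.0, 0.0]
smooth = logit_margin(logits, 0)
hard = hard_margin(logits, 0)
```

Because log-sum-exp over $K-1$ terms lies between the maximum and the maximum plus $\log(K-1)$, the smooth probe always sits within $\log(K-1)$ above the hard margin while remaining differentiable everywhere.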

### 3.4 TASER fine-tuning

TASER can be applied either during training or as a post hoc fine-tuning step. In the latter case, given a pretrained model $f_{\theta_{0}}$, we optimise

$$\mathcal{L}=\mathcal{L}_{\mathrm{base}}(x,y)+\alpha\,\mathrm{KL}\!\left(f_{\theta_{0}}(x)\,\|\,f_{\theta}(x)\right)+\lambda(t)\,\mathcal{R}_{\mathrm{TASER}}(x). \tag{6}$$

The Stein penalty is always computed on clean inputs $x$, even when the base loss uses adversarial examples $x_{\mathrm{adv}}$. This ensures that the regularisation remains aligned with the score field of the training distribution. The regularisation weight $\lambda(t)$ is ramped during training, and the learning rate follows a warmup and cosine decay schedule.

In settings where the original training procedure is difficult to reproduce (e.g. due to reliance on custom techniques, complex data pipelines, or synthetic data augmentation), the base loss can be replaced with a standard, well-understood robust objective such as TRADES (Zhang et al., [2019](https://arxiv.org/html/2605.30601#bib.bib21)). This provides a practical and reproducible alternative while retaining compatibility with the TASER regularisation framework.
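The text states that $\lambda(t)$ is ramped and that the learning rate follows warmup and cosine decay, but does not give the exact shapes. The sketch below therefore makes its own assumptions: a linear ramp, a linear warmup, and decay to zero. All function and parameter names are hypothetical, and the loss terms are passed in as plain numbers to show only how objective (6) is assembled.

```python
import math


def ramped_weight(step, ramp_steps, max_weight):
    """Assumed linear ramp of the TASER weight from 0 up to max_weight."""
    return max_weight * min(1.0, step / ramp_steps)


def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr):
    """Assumed schedule: linear warmup to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


def fine_tune_loss(base_loss, kl_to_pretrained, taser_penalty, alpha, weight):
    """Sum the three terms of objective (6) for one training step."""
    return base_loss + alpha * kl_to_pretrained + weight * taser_penalty


# Example values at step 50 of a 100-step ramp (all numbers are made up).
weight_now = ramped_weight(50, 100, 0.5)
loss_now = fine_tune_loss(1.0, 0.5, 2.0, alpha=0.1, weight=weight_now)
```

Starting the weight at zero means early fine-tuning steps are governed by the base loss and the KL anchor to the pretrained model, with the Stein penalty phased in gradually.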

## 4 Theoretical Analysis

### 4.1 Weighted Sobolev formulation

The TASER penalty can be interpreted as a Sobolev-type regulariser induced by the Langevin Stein operator. To see this, let

$$\mathcal{L}_{p}f(x)=\Delta f(x)+s_{p}(x)^{\top}\nabla f(x),\qquad s_{p}(x)=\nabla\log p(x),$$

and assume that $p$ is smooth, strictly positive, and that boundary terms vanish under integration by parts. Then the following identity holds:

$$\mathbb{E}_{p}\!\left[(\mathcal{L}_{p}f(X))^{2}\right]=\mathbb{E}_{p}\!\left[\|\nabla^{2}f(X)\|_{F}^{2}\right]+\mathbb{E}_{p}\!\left[\nabla f(X)^{\top}H_{p}(X)\nabla f(X)\right], \tag{7}$$

where $H_{p}(x)=-\nabla^{2}\log p(x)$ and $\|\cdot\|_{F}$ is the Frobenius norm; see Appendix [B](https://arxiv.org/html/2605.30601#A2) for details. Thus, the TASER objective defines a distribution-dependent Sobolev quadratic form:

$$\|f\|_{H^{2}_{p,*}}^{2}:=\mathbb{E}_{p}\!\left[\|\nabla^{2}f(X)\|_{F}^{2}\right]+\mathbb{E}_{p}\!\left[\nabla f(X)^{\top}H_{p}(X)\nabla f(X)\right]. \tag{8}$$

The first term controls curvature of the predictor, while the second term controls gradients through a position-dependent metric determined by the curvature of the log-density.

This contrasts with standard Sobolev regularisation, which uses isotropic derivative penalties such as $\mathbb{E}_{p}[\|\nabla f(X)\|^{2}]$ and $\mathbb{E}_{p}[\|\nabla^{2}f(X)\|_{F}^{2}]$. In TASER, first-order sensitivity is weighted by $H_{p}(x)$ rather than by the identity matrix. Consequently, gradients are penalised anisotropically according to the local geometry of the data distribution.

In the strongly log-concave case, where $mI\preceq H_{p}(x)\preceq MI$ for all $x$, the TASER penalty is equivalent to a second-order Sobolev seminorm under $p$. In particular, for the standard Gaussian distribution, $p=\mathcal{N}(0,I)$, one has $H_{p}(x)=I$, and therefore

$$\mathbb{E}_{p}\!\left[(\mathcal{L}_{p}f)^{2}\right]=\mathbb{E}_{p}\!\left[\|\nabla^{2}f\|_{F}^{2}\right]+\mathbb{E}_{p}\!\left[\|\nabla f\|^{2}\right]. \tag{9}$$

In this case, TASER reduces exactly to a classical second-order Sobolev penalty. More generally, ([7](https://arxiv.org/html/2605.30601#S4.E7)) shows that TASER induces a Sobolev geometry adapted to the data distribution. The curvature term controls oscillatory behaviour, while the gradient term penalises sensitivity in directions where the log-density has high curvature. This provides an analytic counterpart to the geometric interpretation of TASER as a data-dependent smoothness regulariser.
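The Gaussian identity (9) can be checked numerically in one dimension. The sketch below picks $f(x)=\sin(x)$ as an arbitrary test function (a choice made here, not in the paper) and approximates expectations under $\mathcal{N}(0,1)$ with a trapezoid rule on a truncated grid; `gaussian_expectation` and the other names are hypothetical helpers.

```python
import math


def gaussian_expectation(g, half_width=10.0, num_points=4001):
    """Trapezoid-rule approximation of E[g(X)] for X ~ N(0, 1)."""
    h = 2.0 * half_width / (num_points - 1)
    total = 0.0
    for i in range(num_points):
        x = -half_width + i * h
        # Endpoints get half weight in the trapezoid rule.
        weight = 0.5 if i in (0, num_points - 1) else 1.0
        total += weight * g(x) * math.exp(-0.5 * x * x)
    return total * h / math.sqrt(2.0 * math.pi)


# Test function f(x) = sin(x): f'(x) = cos(x), f''(x) = -sin(x); score s_p(x) = -x.
def stein_residual(x):
    return -math.sin(x) - x * math.cos(x)


stein_mean = gaussian_expectation(stein_residual)
stein_sq = gaussian_expectation(lambda x: stein_residual(x) ** 2)
curvature = gaussian_expectation(lambda x: math.sin(x) ** 2)
gradient = gaussian_expectation(lambda x: math.cos(x) ** 2)
```

For this test function the right-hand side of (9) is $\mathbb{E}[\sin^{2}X+\cos^{2}X]=1$, so `stein_sq` should agree with `curvature + gradient` and with 1 up to quadrature error, while `stein_mean` should vanish in line with the Stein identity.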

### 4.2 Stability under distribution shift

We next interpret TASER as controlling the response of the predictor under shifted or adversarial input distributions. Let $q$ denote the distribution of perturbed inputs, for example the distribution induced by applying an attack mechanism to samples from $p$. Assume that $q$ is absolutely continuous with respect to $p$. Denote the $\chi^{2}$ divergence between $q$ and $p$ by

$$\chi^{2}(q\|p)=\mathbb{E}_{p}\!\left[\left(\frac{q(X)}{p(X)}-1\right)^{2}\right].$$

Letting $\mathcal{Q}_{\rho}=\left\{q:\chi^{2}(q\|p)\leq\rho^{2}\right\}$, we show in Appendix [B](https://arxiv.org/html/2605.30601#A2) that

$$\sup_{q\in\mathcal{Q}_{\rho}}\left|\mathbb{E}_{q}[\mathcal{L}_{p}f(X)]\right|\leq\rho\sqrt{\mathbb{E}_{p}\!\left[(\mathcal{L}_{p}f(X))^{2}\right]}. \tag{10}$$

Thus, minimising the TASER penalty controls the worst-case Stein response over a neighbourhood of the training distribution. The left-hand side of ([10](https://arxiv.org/html/2605.30601#S4.E10)) has a geometric interpretation through the identity

$$\mathbb{E}_{q}[\mathcal{L}_{p}f(X)]=-\mathbb{E}_{q}\left[\nabla f(X)^{\top}\nabla\log\frac{q(X)}{p(X)}\right], \tag{11}$$

(see Kozyra and Reinert, [2026](https://arxiv.org/html/2605.30601#bib.bib1)), which holds under the same regularity assumptions as ([11](https://arxiv.org/html/2605.30601#S4.E11)). The term $\nabla\log(q/p)$ describes the local direction of the distributional shift from $p$ to $q$. Therefore, the Stein functional measures the average alignment between the predictor sensitivity $\nabla f$ and the shift direction. Bound ([10](https://arxiv.org/html/2605.30601#S4.E10)) shows that TASER suppresses this alignment uniformly over all distributions in a $\chi^{2}$ neighbourhood of $p$.

It is important to note that even if $q$ is the distribution of adversarial examples, this does not constitute a formal bound on adversarial classification error. Instead, it controls a task-aware sensitivity functional associated with the attack-induced distribution. In this sense, TASER provides an attack-agnostic stability guarantee: no admissible shifted distribution can induce a large Stein response unless either the shift is far from the training distribution or the TASER penalty itself is large.
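Bound (10) is consistent with a one-line Cauchy–Schwarz argument, sketched here for orientation; the paper's own proof is in its appendix and may differ in detail. Write $g=\mathcal{L}_{p}f$ (shorthand introduced here) and use the Stein identity $\mathbb{E}_{p}[g]=0$.

```latex
\begin{align*}
\mathbb{E}_{q}[g]
  &= \mathbb{E}_{p}\!\left[\frac{q(X)}{p(X)}\,g(X)\right]
   = \mathbb{E}_{p}\!\left[\left(\frac{q(X)}{p(X)}-1\right)g(X)\right]
   && \text{since } \mathbb{E}_{p}[g]=0, \\
\big|\mathbb{E}_{q}[g]\big|
  &\le \sqrt{\mathbb{E}_{p}\!\left[\left(\frac{q(X)}{p(X)}-1\right)^{2}\right]}\;
       \sqrt{\mathbb{E}_{p}\big[g(X)^{2}\big]}
   = \sqrt{\chi^{2}(q\|p)}\;\sqrt{\mathbb{E}_{p}\big[g(X)^{2}\big]}
   \le \rho\,\sqrt{\mathbb{E}_{p}\big[g(X)^{2}\big]} .
\end{align*}
```

The first equality is where absolute continuity of $q$ with respect to $p$ is needed, and the final step uses $q\in\mathcal{Q}_{\rho}$.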

## 5 Experimental Results

### 5.1 Toy 1D regression: extrapolation under distribution shift

We illustrate Stein regularisation in a controlled one\-dimensional regression setting\. Inputs are sampled fromx∼𝒩​\(0,1\)x\\sim\\mathcal\{N\}\(0,1\)with targety=sin⁡\(x\)y=\\sin\(x\)\. A small fully\-connected network is trained with either standardℓ2\\ell\_\{2\}weight decay or TASER, which penalises the squared Stein residual

ℒp​f​\(x\)=f′′​\(x\)−x​f′​\(x\),\\mathcal\{L\}\_\{p\}f\(x\)=f^\{\\prime\\prime\}\(x\)\-xf^\{\\prime\}\(x\),forp=𝒩​\(0,1\)p=\\mathcal\{N\}\(0,1\)\. Figure[2](https://arxiv.org/html/2605.30601#S5.F2)shows predictions over a wide input range\. Both methods fit the training distribution well, with TASER matchingℓ2\\ell\_\{2\}performance acrossλ\\lambda\. However, their extrapolation differs considerably:ℓ2\\ell\_\{2\}regularisation produces unstable and often diverging behaviour, highly sensitive toλ\\lambda, whereas TASER yields smooth and consistent extrapolations across a wide range of regularisation strengths\.

![Refer to caption](https://arxiv.org/html/2605.30601v1/figures/forecasts.png)Figure 2:Forecasts outside the training distribution for 1D regression\. Models are trained onx∼𝒩​\(0,1\)x\\sim\\mathcal\{N\}\(0,1\)and evaluated on a wide input grid\. TASER regularisation yields substantially more stable extrapolation compared toℓ2\\ell\_\{2\}\.This behaviour is reflected in Table[1](https://arxiv.org/html/2605.30601#S5.T1), which reports mean squared error \(MSE\) on both in\-distribution samples and a wide grid\. Whileℓ2\\ell\_\{2\}regularisation achieves low in\-distribution error, its out\-of\-distribution performance is highly sensitive toλ\\lambdaand often degrades substantially\. In contrast, TASER maintains comparable in\-distribution performance while consistently improving out\-of\-distribution error and exhibiting significantly greater robustness to the choice of regularisation strength\.

Table 1: Test MSE under in-distribution (ID) and wide-range inputs (OOD). TASER matches $\ell_2$ performance in-distribution while significantly improving out-of-distribution behaviour and stability across regularisation strengths. Stein regularisation requires $\lambda \neq 0$, hence the missing entries.

These results highlight that Stein regularisation induces a qualitatively different inductive bias: rather than penalising parameters uniformly, it suppresses sensitivity in directions that are inconsistent with the data distribution, leading to improved extrapolation without sacrificing in-distribution performance.

### 5.2 TASER during training

#### Setup\.

We first evaluate TASER as a training-time regulariser in a controlled setting on CIFAR-10 using a ResNet-18 backbone (He et al., [2016](https://arxiv.org/html/2605.30601#bib.bib37)). We consider a representative set of standard and adversarial training methods: MART (Wang et al., [2020](https://arxiv.org/html/2605.30601#bib.bib22)), TRADES (Zhang et al., [2019](https://arxiv.org/html/2605.30601#bib.bib21)) and Adversarial Weight Perturbation (AWP) (Wu et al., [2020](https://arxiv.org/html/2605.30601#bib.bib25)).

For each method, we train two variants: \(i\) the baseline model using the original objective, and \(ii\) the same model trained with TASER added to the loss throughout training\. This allows us to isolate the effect of TASER as a distribution\-aware regulariser\.

Table 2: Effect of TASER on clean and robust accuracy on CIFAR-10 (ResNet-18). Accuracies and changes are in percentage points.

| Method | Clean acc.: No TASER | Clean acc.: +TASER | Clean acc.: $\Delta$ [95% CI] | Robust acc. avg.: No TASER | Robust acc. avg.: +TASER | Robust acc. avg.: $\Delta$ [95% CI] | Overhead |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Vanilla | 77.88 | 70.92 | −6.96 [−8.15, −5.75] | 3.00 | 19.25 | **+16.25** [14.40, 18.10] | ×2.45 |
| PGD | 66.03 | 65.64 | **−0.39** [−1.68, 0.94] | 24.25 | 33.35 | **+9.10** [6.35, 11.90] | ×1.25 |
| TRADES | 67.07 | 65.61 | **−1.46** [−2.78, −0.15] | 27.35 | 33.95 | **+6.60** [3.70, 9.40] | ×1.27 |
| MART | 64.07 | 63.31 | −0.76 [−2.09, 0.57] | 26.35 | 35.00 | **+8.65** [5.85, 11.50] | ×1.19 |
| TRADES + AWP | 67.90 | 68.87 | **+0.97** [−0.31, 2.24] | 34.05 | 36.20 | +2.15 [−0.80, 5.10] | ×1.21 |
| MART + AWP | 66.96 | 66.14 | **−0.82** [−2.14, 0.45] | 32.30 | 37.40 | **+5.10** [2.20, 8.05] | ×1.17 |
| Avg. | 68.32 | 66.75 | −1.57 [−2.10, −1.04] | 24.55 | 32.53 | **+7.98** [6.87, 9.09] | – |
| Avg. (excl. vanilla) | 66.41 | 65.91 | **−0.49** [−1.09, 0.09] | 28.86 | 35.18 | **+6.32** [5.06, 7.60] | – |

Robust accuracy is the average of AutoAttack and SPSA accuracy, with both attacks constructed using $\ell_\infty$ and $\epsilon = 8/255$. Bootstrap confidence intervals (95%) for performance deltas are shown in brackets. Results in bold indicate a statistically significant difference from zero for robust accuracy, and the lack of a statistically significant difference for clean accuracy. TASER consistently improves robust accuracy, while the average clean-accuracy degradation becomes statistically negligible when excluding the standard (non-adversarially trained) model.

#### Results.

Robustness is evaluated using AutoAttack (Croce and Hein, [2020](https://arxiv.org/html/2605.30601#bib.bib15)) and SPSA (Uesato et al., [2018](https://arxiv.org/html/2605.30601#bib.bib16)), both using $\ell_\infty$ and $\epsilon = 8/255$. A detailed accuracy breakdown between attacks can be found in Appendix [D](https://arxiv.org/html/2605.30601#A4). Across all training objectives, TASER improves average robust accuracy by $+7.98$ percentage points while incurring a $-1.57$ point change in clean accuracy on average. The gains are largest for the standard model ($+16.25$ robust points), but remain substantial for adversarially trained models: across PGD, TRADES, MART, and their AWP variants, TASER improves average robust accuracy by $+6.32$ points while changing clean accuracy by only $-0.49$ points on average, a clean-accuracy drop that is not statistically significant under bootstrapping.
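The text does not specify which bootstrap variant underlies the reported intervals. As an illustration only, the sketch below assumes a paired percentile bootstrap over test examples: per-example correctness indicators for the baseline and the TASER model are resampled together, and the interval is read off the empirical quantiles of the resampled accuracy difference. The function name `bootstrap_delta_ci` and the toy data are hypothetical and do not reproduce any number from Table 2.

```python
import random


def bootstrap_delta_ci(base_correct, taser_correct, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a paired accuracy difference, in percentage points."""
    assert len(base_correct) == len(taser_correct)
    n = len(base_correct)
    rng = random.Random(seed)
    # Per-example change in correctness: -1, 0, or +1.
    diffs = [t - b for b, t in zip(base_correct, taser_correct)]
    point = 100.0 * sum(diffs) / n
    stats = []
    for _ in range(n_boot):
        # Resample test examples with replacement, keeping each pair together.
        resampled = [diffs[rng.randrange(n)] for _ in range(n)]
        stats.append(100.0 * sum(resampled) / n)
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return point, lo, hi


# Synthetic toy indicators (1 = correct), purely for illustration.
base = [1] * 300 + [0] * 700
taser = [1] * 400 + [0] * 600
point, lo, hi = bootstrap_delta_ci(base, taser)
```

Under this convention, a delta would be called statistically different from zero when the interval `[lo, hi]` excludes zero.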

These results indicate that TASER acts as a complementary robustness mechanism rather than a replacement for adversarial training\. Its consistent gains across standard, PGD, TRADES, and TRADES\+AWP objectives suggest that the TASER regulariser captures directions of task\-relevant sensitivity that are not fully controlled by conventional adversarial training\.

#### Runtime\.

TASER introduces additional computational overhead due to derivative computations. The first-order variant requires an additional backward pass, while the full operator involves Hessian-vector products (implemented via Hutchinson estimators). The Overhead column of Table [2](https://arxiv.org/html/2605.30601#S5.T2) reports the resulting training-time overhead on CIFAR-10. Despite this overhead, TASER remains practical, and the additional incurred cost is considerably smaller than that of classical adversarial training.
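The Hutchinson estimator mentioned above approximates the Laplacian $\Delta f = \mathrm{tr}(\nabla^2 f)$ by averaging $v^\top \nabla^2 f(x)\, v$ over random probe vectors $v$. The following is a framework-free sketch under stated assumptions: gradients and Hessian-vector products are approximated by central finite differences (a practical implementation would use automatic differentiation instead), probes are Rademacher vectors, and all function names are hypothetical.

```python
import random


def grad_fd(f, x, h=1e-4):
    """Central finite-difference gradient of a scalar function f at a point x (list)."""
    g = []
    for i in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2.0 * h))
    return g


def hvp_fd(f, x, v, eps=1e-3):
    """Hessian-vector product H(x) v from a central difference of gradients."""
    gp = grad_fd(f, [xi + eps * vi for xi, vi in zip(x, v)])
    gm = grad_fd(f, [xi - eps * vi for xi, vi in zip(x, v)])
    return [(a - b) / (2.0 * eps) for a, b in zip(gp, gm)]


def hutchinson_laplacian(f, x, n_probes=64, seed=0):
    """Estimate tr(Hessian f(x)) as the mean of v^T H v over Rademacher probes v."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_probes):
        v = [rng.choice((-1.0, 1.0)) for _ in x]
        hv = hvp_fd(f, x, v)
        total += sum(vi * hi for vi, hi in zip(v, hv))
    return total / n_probes


def stein_residual(f, score, x, n_probes=64, seed=0):
    """Langevin Stein residual: Laplacian estimate plus score(x) . grad f(x)."""
    g = grad_fd(f, x)
    drift = sum(s * gi for s, gi in zip(score(x), g))
    return hutchinson_laplacian(f, x, n_probes, seed) + drift


def sum_of_squares(x):
    return sum(xi * xi for xi in x)


def std_normal_score(x):
    return [-xi for xi in x]  # gradient of log N(x; 0, I)


residual = stein_residual(sum_of_squares, std_normal_score, [1.0, 0.0, -1.0])
```

For a function with diagonal Hessian, such as `sum_of_squares`, every Rademacher probe returns the exact trace; with off-diagonal curvature the estimate is unbiased but noisy, so the number of probes trades accuracy against cost.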

## 6 Limitations and Discussion

#### Dependence on score quality\.

TASER relies on an estimate of the score function $\nabla \log p(x)$, which in practice is obtained from a separate model such as a diffusion or score-matching network. The effectiveness of the regulariser therefore depends on the quality of this estimate. Inaccurate scores introduce a bias term in the Stein operator, which can distort the intended geometric effect and lead to suboptimal regularisation. Data augmentation techniques typically used in diffusion frameworks might additionally contribute to the mismatch between the true score function and the model estimate. While centering and variance-based formulations mitigate global bias, they do not eliminate input-dependent errors. Improving score estimation, or building in robustness to score misspecification, remains an important direction for future work.

#### Computational overhead\.

TASER introduces additional computational cost due to gradient and, in the full formulation, second\-order derivative computations\. While the first\-order variant is relatively lightweight, the full Stein operator requires estimating the Laplacian via Hessian–vector products, which can be expensive in high dimensions\. In this work, we mitigate this overhead by applying TASER primarily during a fine\-tuning stage, but the cost may still be significant for large\-scale models or datasets\.

#### Interaction with adversarial training\.

TASER is designed to complement adversarial training, but its interaction with existing defence methods is not fully understood\. In particular, adversarial training optimises worst\-case behaviour within a predefined perturbation set, whereas TASER regularises sensitivity relative to the data distribution\. These objectives are not identical and may, in some regimes, compete or over\-regularise the model\. A more systematic study of how TASER interacts with different threat models and attack families would be valuable\.

#### Generality of robustness improvements\.

Although TASER improves robustness across a range of attacks in our experiments, it does not provide formal guarantees against worst\-case perturbations\. The method primarily targets sensitivity aligned with the score field of the training distribution, and may therefore be less effective against perturbations that exploit directions not well captured by this geometry\. As with other regularisation\-based approaches, empirical robustness should be interpreted in the context of the evaluation protocol\.

Furthermore, our evaluation is conducted on a limited set of datasets and architectures\. While we observe consistent gains on CIFAR\-10, this benchmark may not fully capture the diversity of real\-world data distributions\. In particular, the effectiveness of TASER depends on the quality of the learned score field, which itself varies across datasets\. Evaluating TASER on a broader range of domains, including higher\-resolution datasets, non\-natural data, and tasks with different structural properties, would provide a more complete picture of its generality\. We therefore view our results as evidence of a consistent trend rather than a definitive characterisation of robustness across all settings\.

#### Scope of the geometric assumption\.

The interpretation of TASER relies on the assumption that the score field reflects meaningful geometric structure of the data, such as concentration near a lower\-dimensional manifold\. While this assumption is often reasonable for high\-dimensional data, it may not hold uniformly across datasets or input regions\. In particular, the behaviour of the score field off the data manifold can be poorly understood, which may affect the reliability of the regulariser in those regions\.

#### Conclusions\.

Despite these limitations, TASER offers a simple and modular mechanism for incorporating data geometry into training\. Unlike adversarial training, it does not require solving an inner maximisation problem, and unlike generative\-data approaches, it does not rely on sampling or augmentation pipelines\. Its compatibility with existing methods and its interpretation as a task\-aware, distribution\-dependent regulariser suggest that it can serve as a useful complement to current robustness techniques\. Future work may explore improved score estimation, alternative Stein operators, and tighter connections between Stein\-based regularisation and formal robustness guarantees\.

We close by noting that, although TASER is not tailored to any particular application, care is advised when deploying it, particularly in critical areas such as healthcare.

## References

- Parseval networks: improving robustness to adversarial examples. In Proceedings of the 34th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 70, pp. 854–863. External Links: [Link](https://proceedings.mlr.press/v70/cisse17a.html) Cited by: [§1](https://arxiv.org/html/2605.30601#S1.p2.1), [§2](https://arxiv.org/html/2605.30601#S2.SS0.SSS0.Px1.p1.1).
- F. Croce and M. Hein (2020) Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 119, pp. 2206–2216. External Links: [Link](https://proceedings.mlr.press/v119/croce20b.html) Cited by: [§5.2](https://arxiv.org/html/2605.30601#S5.SS2.SSS0.Px2.p1.7).
- A. Fawzi, H. Fawzi, and P. Frossard (2018) Analysis of classifiers’ robustness to adversarial perturbations. Machine Learning 107, pp. 481–508. External Links: [Document](https://dx.doi.org/10.1007/s10994-017-5663-3) Cited by: [§1](https://arxiv.org/html/2605.30601#S1.p5.1).
- C. Fefferman, S. Mitter, and H. Narayanan (2016) Testing the manifold hypothesis. Journal of the American Mathematical Society 29(4), pp. 983–1049. External Links: [Document](https://dx.doi.org/10.1090/jams/852) Cited by: [§1](https://arxiv.org/html/2605.30601#S1.p5.1).
- J. Gilmer, L. Metz, F. Faghri, S. S. Schoenholz, M. Raghu, M. Wattenberg, and I. Goodfellow (2018) Adversarial spheres. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=SyUkxxZ0b) Cited by: [§1](https://arxiv.org/html/2605.30601#S1.p5.1).
- I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In International Conference on Learning Representations. External Links: [Link](https://arxiv.org/abs/1412.6572) Cited by: [§1](https://arxiv.org/html/2605.30601#S1.p1.1), [§2](https://arxiv.org/html/2605.30601#S2.SS0.SSS0.Px2.p1.1).
- J. Gorham and L. Mackey (2015) Measuring sample quality with Stein’s method. In Advances in Neural Information Processing Systems, Vol. 28. External Links: [Link](https://papers.nips.cc/paper/2015/hash/698d51a19d8a121ce581499d7b701668-Abstract.html) Cited by: [§A.4](https://arxiv.org/html/2605.30601#A1.SS4.p3.4).
- S. Gowal, S. Rebuffi, O. Wiles, F. Stimberg, D. A. Calian, and T. Mann (2021) Improving robustness using generated data. In Advances in Neural Information Processing Systems, Vol. 34, pp. 4218–4233. External Links: [Link](https://papers.nips.cc/paper/2021/hash/21ca6d0cf2f25c4dbb35d8dc0b679c3f-Abstract.html) Cited by: [§2](https://arxiv.org/html/2605.30601#S2.SS0.SSS0.Px4.p1.1).
- K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. External Links: [Document](https://dx.doi.org/10.1109/CVPR.2016.90) Cited by: [§5.2](https://arxiv.org/html/2605.30601#S5.SS2.SSS0.Px1.p1.1).
- D. Hendrycks and T. Dietterich (2019) Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=HJz6tiCqYm) Cited by: [§1](https://arxiv.org/html/2605.30601#S1.p1.1).
- J. Ho, A. Jain, and P. Abbeel (2020) Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, Vol. 33, pp. 6840–6851. External Links: [Link](https://papers.nips.cc/paper/2020/hash/4c5bcfec8584af0d967f1ab10179ca4b-Abstract.html) Cited by: [§1](https://arxiv.org/html/2605.30601#S1.p7.1), [§2](https://arxiv.org/html/2605.30601#S2.SS0.SSS0.Px3.p1.1), [§3.2](https://arxiv.org/html/2605.30601#S3.SS2.SSS0.Px2.p1.2).
- M. F. Hutchinson (1989) A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics - Simulation and Computation 18(3), pp. 1059–1076. External Links: [Document](https://dx.doi.org/10.1080/03610918908812806) Cited by: [§3.2](https://arxiv.org/html/2605.30601#S3.SS2.SSS0.Px1.p1.1).
- D. Jakubovitz and R. Giryes (2018) Improving DNN robustness to adversarial attacks using Jacobian regularization. In European Conference on Computer Vision, pp. 514–529. External Links: [Document](https://dx.doi.org/10.1007/978-3-030-01258-8%5F31) Cited by: [§1](https://arxiv.org/html/2605.30601#S1.p2.1), [§2](https://arxiv.org/html/2605.30601#S2.SS0.SSS0.Px1.p1.1).
- M. Kozyra and G. Reinert (2026) TASTE: Task-aware out-of-distribution detection via Stein operators. External Links: 2602.07640, [Link](https://arxiv.org/abs/2602.07640) Cited by: [Appendix B](https://arxiv.org/html/2605.30601#A2.p1.2), [Appendix B](https://arxiv.org/html/2605.30601#A2.p2.3), [§2](https://arxiv.org/html/2605.30601#S2.SS0.SSS0.Px5.p1.1), [§4.2](https://arxiv.org/html/2605.30601#S4.SS2.p1.15).
- C. Ley, G. Reinert, and Y. Swan (2017) Stein’s method for comparison of univariate distributions. Probability Surveys 14, pp. 1–52. External Links: [Document](https://dx.doi.org/10.1214/16-PS278) Cited by: [§1](https://arxiv.org/html/2605.30601#S1.p3.3).
- Q. Liu, J. D. Lee, and M. I. Jordan (2016) A kernelized Stein discrepancy for goodness-of-fit tests and model evaluation. In Proceedings of the 33rd International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 48, pp. 276–284. External Links: [Link](https://proceedings.mlr.press/v48/liub16.html) Cited by: [§A.4](https://arxiv.org/html/2605.30601#A1.SS4.p3.4).
- Q. Liu, L. Mackey, and C. Oates (2026) Probabilistic inference and learning with Stein’s method. arXiv preprint arXiv:2603.07467. Cited by: [§2](https://arxiv.org/html/2605.30601#S2.SS0.SSS0.Px5.p1.1).
- I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=Bkg6RiCqY7) Cited by: [§1](https://arxiv.org/html/2605.30601#S1.p2.1), [§2](https://arxiv.org/html/2605.30601#S2.SS0.SSS0.Px1.p1.1).
- A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=rJzIBfZAb) Cited by: [§1](https://arxiv.org/html/2605.30601#S1.p1.1), [§1](https://arxiv.org/html/2605.30601#S1.p2.1), [§2](https://arxiv.org/html/2605.30601#S2.SS0.SSS0.Px2.p1.1).
- T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=B1QRgziT-) Cited by: [§1](https://arxiv.org/html/2605.30601#S1.p2.1), [§2](https://arxiv.org/html/2605.30601#S2.SS0.SSS0.Px1.p1.1).
- W. Nie, B. Guo, Y. Huang, C. Xiao, A. Vahdat, and A. Anandkumar (2022) Diffusion models for adversarial purification. In Proceedings of the 39th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 162, pp. 16805–16827. External Links: [Link](https://proceedings.mlr.press/v162/nie22a.html) Cited by: [§2](https://arxiv.org/html/2605.30601#S2.SS0.SSS0.Px4.p1.1).
- Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole (2021) Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=PxTIG12RRHS) Cited by: [§1](https://arxiv.org/html/2605.30601#S1.p7.1), [§2](https://arxiv.org/html/2605.30601#S2.SS0.SSS0.Px3.p1.1), [§3.2](https://arxiv.org/html/2605.30601#S3.SS2.SSS0.Px2.p1.2).
- C. Stein (1972) A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 2, pp. 583–602. Cited by: [§1](https://arxiv.org/html/2605.30601#S1.p3.3).
- D. Tsipras, S. Santurkar, L. Engstrom, A. Turner, and A. Madry (2019) Robustness may be at odds with accuracy. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=SyxAb30cY7) Cited by: [§1](https://arxiv.org/html/2605.30601#S1.p1.1).
- J. Uesato, B. O’Donoghue, A. van den Oord, and P. Kohli (2018) Adversarial risk and the dangers of evaluating against weak attacks. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, pp. 5032–5041. External Links: [Link](https://proceedings.mlr.press/v80/uesato18a.html) Cited by: [§5.2](https://arxiv.org/html/2605.30601#S5.SS2.SSS0.Px2.p1.7).
- Y. Wang, D. Zou, J. Yi, J. Bailey, X. Ma, and Q. Gu (2020) Improving adversarial robustness requires revisiting misclassified examples. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=rklOg6EFwS) Cited by: [§5.2](https://arxiv.org/html/2605.30601#S5.SS2.SSS0.Px1.p1.1).
- D. Wu, S. Xia, and Y. Wang (2020) Adversarial weight perturbation helps robust generalization. In Advances in Neural Information Processing Systems, Vol. 33, pp. 2958–2969. External Links: [Link](https://papers.nips.cc/paper/2020/hash/1ef91c212e30e14bf125e9374262401f-Abstract.html) Cited by: [§C.2](https://arxiv.org/html/2605.30601#A3.SS2.SSS0.Px5.p1.1), [§5.2](https://arxiv.org/html/2605.30601#S5.SS2.SSS0.Px1.p1.1).
- H. Zhang, Y. Yu, J. Jiao, E. P. Xing, L. El Ghaoui, and M. I. Jordan (2019) Theoretically principled trade-off between robustness and accuracy. In Proceedings of the 36th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 97, pp. 7472–7482. External Links: [Link](https://proceedings.mlr.press/v97/zhang19p.html) Cited by: [§1](https://arxiv.org/html/2605.30601#S1.p2.1), [§2](https://arxiv.org/html/2605.30601#S2.SS0.SSS0.Px2.p1.1), [§3.4](https://arxiv.org/html/2605.30601#S3.SS4.p3.1), [§5.2](https://arxiv.org/html/2605.30601#S5.SS2.SSS0.Px1.p1.1).

Appendix


## Appendix A Background on Stein’s Method

This appendix gives a short introduction to Stein’s method and the operator viewpoint used in this paper\. The goal is to provide enough background for a reader unfamiliar with Stein’s method to understand why Stein operators provide distribution\-dependent identities, how these identities are used in statistics, and how they have been adapted in modern machine learning\.

### A.1 Stein identities

Stein’s method is a general framework for characterising probability distributions through expectation identities. Let $p$ be a target distribution on a space $\mathcal{X}$. A *Stein operator* for $p$ is an operator $\mathcal{T}_p$ acting on a class of test functions $\mathcal{F}_p$ such that

$$\mathbb{E}_{X\sim p}\!\left[\mathcal{T}_p f(X)\right] = 0 \qquad \text{for all } f \in \mathcal{F}_p. \tag{12}$$

The class $\mathcal{F}_p$ is usually called a Stein class. The defining property of a Stein operator is therefore that it produces functions with zero expectation under the target distribution.

A classical example is the standard normal distribution. If $Z \sim \mathcal{N}(0,1)$ and $f$ is sufficiently regular, then integration by parts gives

$$\mathbb{E}\!\left[f'(Z) - Z f(Z)\right] = 0. \tag{13}$$

Conversely, under appropriate conditions, if a random variable $W$ satisfies

$$\mathbb{E}\!\left[f'(W) - W f(W)\right] = 0$$

for a sufficiently rich class of functions $f$, then $W$ has the standard normal distribution. Thus the operator

$$\mathcal{T}_{\mathcal{N}} f(x) = f'(x) - x f(x)$$

characterises the standard normal distribution.
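As a quick sanity check of the Gaussian Stein identity (an illustration added here, not part of the original derivation), substituting monomials $f(x) = x^k$ recovers the familiar moment recursion of the standard normal distribution:

```latex
% Worked check of E[f'(Z) - Z f(Z)] = 0 for Z ~ N(0,1) with f(x) = x^k, k >= 1.
\begin{align*}
f(x) = x^{k}:\quad & \mathbb{E}\left[k Z^{k-1} - Z^{k+1}\right] = 0
  \;\Longrightarrow\; \mathbb{E}[Z^{k+1}] = k\,\mathbb{E}[Z^{k-1}],\\
k = 1:\quad & \mathbb{E}[Z^{2}] = 1,\\
k = 3:\quad & \mathbb{E}[Z^{4}] = 3\,\mathbb{E}[Z^{2}] = 3 .
\end{align*}
```

The case $k = 1$ returns the unit variance and $k = 3$ the fourth moment, both consistent with $\mathcal{N}(0,1)$.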

The same idea extends far beyond the Gaussian case. For a differentiable density $p$ on $\mathbb{R}^d$, one common first-order Stein operator is

$$\mathcal{A}_p \phi(x) = \nabla \cdot \phi(x) + \phi(x)^\top \nabla \log p(x), \tag{14}$$

where $\phi : \mathbb{R}^d \to \mathbb{R}^d$ is a vector-valued test function. Under appropriate boundary conditions,

$$\mathbb{E}_{X\sim p}\!\left[\mathcal{A}_p \phi(X)\right] = 0. \tag{15}$$

Indeed,

$$\mathcal{A}_p \phi(x) = \frac{1}{p(x)} \nabla \cdot \left(p(x)\phi(x)\right),$$

so that

$$\int p(x)\,\mathcal{A}_p \phi(x)\,dx = \int \nabla \cdot \left(p(x)\phi(x)\right) dx,$$

which vanishes when the boundary flux is zero.

### A.2 The Stein equation and distributional approximation

In classical probability and statistics, Stein’s method is often used to bound distances between probability distributions. Suppose $p$ is a target distribution and $q$ is another distribution. Given a test function $h$, one constructs a solution $f_h$ to the *Stein equation*

$$\mathcal{T}_p f_h(x) = h(x) - \mathbb{E}_{Z\sim p}[h(Z)]. \tag{16}$$

If $X \sim q$, then taking expectations gives

$$\mathbb{E}_q[h(X)] - \mathbb{E}_p[h(Z)] = \mathbb{E}_q[\mathcal{T}_p f_h(X)]. \tag{17}$$

Thus, a difference in expectations under $q$ and $p$ can be expressed as an expectation of a Stein operator under $q$.

This is the basic mechanism behind many Stein bounds. If one can control $\mathbb{E}_q[\mathcal{T}_p f_h(X)]$ uniformly over a class of test functions $h$, then one obtains a bound on a probability metric between $q$ and $p$. For example, by choosing different classes of $h$, one may obtain bounds in Wasserstein distance, Kolmogorov distance, total variation distance, or other integral probability metrics. Much of classical Stein theory is concerned with constructing Stein equations for specific target distributions and proving regularity estimates for their solutions.

This conventional use of Stein’s method differs from the use in TASER. In the classical setting, the test function $f_h$ is usually chosen by solving a Stein equation associated with a discrepancy of interest. In TASER, by contrast, the function is the predictor being trained. The Stein operator is therefore used not primarily to compare two distributions, but to impose a distribution-aware constraint on the predictor.

### A.3 The Langevin Stein operator

The main text uses the Langevin Stein operator. For a scalar function $f : \mathbb{R}^d \to \mathbb{R}$ and a differentiable density $p$, define

$$\mathcal{L}_p f(x) = \Delta f(x) + \nabla \log p(x)^\top \nabla f(x), \tag{18}$$

where $\Delta f = \mathrm{tr}(\nabla^2 f)$ is the Euclidean Laplacian, $\nabla$ is the gradient, and the superscript $\top$ denotes the transpose.

The operator ([18](https://arxiv.org/html/2605.30601#A1.E18)) also has a divergence form:

$$\mathcal{L}_p f(x) = \frac{1}{p(x)} \nabla \cdot \left(p(x) \nabla f(x)\right). \tag{19}$$

Consequently, under suitable regularity and boundary decay assumptions detailed in Appendix [B](https://arxiv.org/html/2605.30601#A2),

$$\mathbb{E}_{X\sim p}\!\left[\mathcal{L}_p f(X)\right] = 0. \tag{20}$$

The Langevin operator is also the infinitesimal generator of the overdamped Langevin diffusion

$$dX_t = \nabla \log p(X_t)\,dt + \sqrt{2}\,dW_t, \tag{21}$$

for which $p$ is an invariant distribution. In this interpretation, $\mathcal{L}_p f(x)$ is the instantaneous expected rate of change of $f(X_t)$ when the diffusion starts from $x$. The identity $\mathbb{E}_p[\mathcal{L}_p f] = 0$ says that, at stationarity, this expected instantaneous change averages to zero.

This generator viewpoint connects Stein identities to diffusion geometry: $\nabla \log p$ describes the drift toward high-density regions of the distribution, while $\Delta f$ captures isotropic second-order variation of the test function.
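The generator interpretation can be illustrated numerically. The sketch below is an assumption-laden toy, not part of the paper: it discretises the overdamped Langevin diffusion with the Euler-Maruyama scheme for $p = \mathcal{N}(0,1)$ in one dimension, then averages $\mathcal{L}_p f$ for $f(x) = x^2$ (so $\mathcal{L}_p f(x) = 2 - 2x^2$) along the trajectory. The step size, chain length, and function names are hypothetical choices.

```python
import math
import random


def simulate_langevin(score, x0=0.0, step=0.05, n_steps=200_000, burn_in=1_000, seed=0):
    """Euler-Maruyama discretisation of dX = score(X) dt + sqrt(2) dW in one dimension."""
    rng = random.Random(seed)
    x = x0
    noise_scale = math.sqrt(2.0 * step)
    samples = []
    for k in range(n_steps):
        x = x + step * score(x) + noise_scale * rng.gauss(0.0, 1.0)
        if k >= burn_in:
            samples.append(x)  # keep only post-burn-in states
    return samples


def gaussian_score(x):
    return -x  # d/dx log N(x; 0, 1)


def generator_of_square(x):
    return 2.0 - 2.0 * x * x  # L_p f for f(x) = x^2 and p = N(0, 1)


samples = simulate_langevin(gaussian_score)
mean_residual = sum(generator_of_square(x) for x in samples) / len(samples)
```

The average `mean_residual` should be close to zero, up to Monte Carlo error and the small bias introduced by the time discretisation, in line with the stationarity identity $\mathbb{E}_p[\mathcal{L}_p f] = 0$.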

### A.4 Stein discrepancies

A related and very influential modern viewpoint is to use Stein operators to define discrepancies between probability distributions. Let $p$ be the target density and $q$ a candidate distribution. For a Stein operator $\mathcal{T}_p$ and a function class $\mathcal{F}$, define

$$\mathcal{S}(q,p) = \sup_{f\in\mathcal{F}} \left|\mathbb{E}_{X\sim q}[\mathcal{T}_p f(X)]\right|. \tag{22}$$

Since $\mathbb{E}_p[\mathcal{T}_p f] = 0$ for all $f \in \mathcal{F}$, the quantity $\mathcal{S}(q,p)$ measures how strongly samples from $q$ violate Stein identities that hold under $p$.

Stein discrepancies are attractive because they often require only the score $\nabla \log p(x)$ rather than the normalised density $p(x)$. This is important in Bayesian statistics and probabilistic modelling, where the target density is often known only up to an unknown normalising constant. Since $\nabla \log p(x)$ is invariant to multiplication of $p$ by a constant, Langevin Stein operators can be computed even when the normalising constant is unavailable.

A particularly important example is the *Kernel Stein Discrepancy* (KSD) [Liu et al., [2016](https://arxiv.org/html/2605.30601#bib.bib6), Gorham and Mackey, [2015](https://arxiv.org/html/2605.30601#bib.bib7)]. KSD takes the Stein test functions to lie in a reproducing kernel Hilbert space (RKHS), which allows the supremum in ([22](https://arxiv.org/html/2605.30601#A1.E22)) to be computed in closed form. For the first-order Langevin Stein operator, the KSD can be written as

$$\mathrm{KSD}^2(q,p) = \mathbb{E}_{X,X'\sim q}\left[k_p(X,X')\right], \tag{23}$$

where $k_p$ is a Stein kernel obtained by applying the Stein operator to both arguments of a base kernel $k$. In one common form,

$$\begin{aligned}
k_p(x,x') &= s_p(x)^\top k(x,x')\, s_p(x') + s_p(x)^\top \nabla_{x'} k(x,x') \\
&\quad + s_p(x')^\top \nabla_x k(x,x') + \mathrm{tr}\!\left(\nabla_x \nabla_{x'} k(x,x')\right),
\end{aligned} \tag{24}$$

where $s_p(x) = \nabla \log p(x)$.

KSD has been used for goodness-of-fit testing, measuring sample quality, diagnosing Markov chain Monte Carlo, and variational inference. Its appeal is that it yields a computable discrepancy from samples of $q$ and score evaluations of $p$, without requiring samples from $p$ or knowledge of the normalising constant.
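To illustrate how the Stein kernel is assembled from a base kernel and the score, here is a one-dimensional sketch under stated assumptions: the base kernel is a Gaussian (RBF) kernel with a fixed length scale, the target is $p = \mathcal{N}(0,1)$, and the expectation over independent pairs is replaced by a V-statistic (an average over all sample pairs, diagonal included, which is slightly biased upward). Function names and sample sizes are hypothetical.

```python
import math
import random


def stein_kernel_1d(x, y, score, length_scale=1.0):
    """Stein kernel k_p(x, y) built from a Gaussian base kernel, in one dimension."""
    l2 = length_scale ** 2
    d = x - y
    k = math.exp(-d * d / (2.0 * l2))
    dk_dx = -d / l2 * k                          # derivative in the first argument
    dk_dy = d / l2 * k                           # derivative in the second argument
    d2k_dxdy = (1.0 / l2 - d * d / (l2 * l2)) * k  # mixed second derivative
    sx, sy = score(x), score(y)
    return sx * sy * k + sx * dk_dy + sy * dk_dx + d2k_dxdy


def ksd_squared(samples, score, length_scale=1.0):
    """V-statistic estimate of KSD^2: mean of k_p over all ordered sample pairs."""
    n = len(samples)
    total = 0.0
    for x in samples:
        for y in samples:
            total += stein_kernel_1d(x, y, score, length_scale)
    return total / (n * n)


def std_normal_score(x):
    return -x  # score of the target p = N(0, 1)


rng = random.Random(0)
matched = [rng.gauss(0.0, 1.0) for _ in range(200)]  # samples from p itself
shifted = [rng.gauss(2.0, 1.0) for _ in range(200)]  # samples from a shifted q
ksd_matched = ksd_squared(matched, std_normal_score)
ksd_shifted = ksd_squared(shifted, std_normal_score)
```

Only samples from $q$ and score evaluations of $p$ are needed; the estimate for the shifted sample should be markedly larger than for the matched one.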

### A.5 Langevin Stein operators in machine learning

Stein methods have entered machine learning through several routes. First, Stein discrepancies provide practical objectives and diagnostics for probabilistic modelling. They have been used to assess whether generated or sampled particles match a target distribution, to build goodness-of-fit tests, and to train approximate inference procedures.

Second, the score function $\nabla \log p(x)$ appearing in the Langevin Stein operator ([1](https://arxiv.org/html/2605.30601#S1.E1)) has become a central object in modern generative modelling. Score matching and diffusion models learn vector fields approximating the score of noisy data distributions. Since Langevin Stein operators are built from score functions, they provide a natural mathematical interface between score-based generative models and downstream learning objectives.

Third, Langevin Stein operators provide a way to incorporate distributional geometry into learning. The terms in $\mathcal{L}_p f = \Delta f + s_p^\top \nabla f$ combine curvature of the learned function with directional derivatives along the score field of the data distribution. Thus, unlike standard isotropic regularisers such as weight decay or gradient penalties, Stein-based regularisation can adapt to the geometry of the input distribution.

The present work follows this third direction. Rather than using Stein operators to construct a goodness-of-fit test or a discrepancy over a large function class, TASER applies the Langevin Stein operator directly to the predictor being trained. The resulting penalty encourages the predictor to have controlled Stein residuals under the training distribution. In this sense, TASER adapts Stein’s method from a tool for distribution comparison into a mechanism for distribution-aware regularisation.

## Appendix BProofs

For the theoretical derivation of the results in the paper, we first define the Stein classℱ​\(p\)\\mathcal\{F\}\(p\)for the Langevin Stein operatorℒp\\mathcal\{L\}\_\{p\}in \([1](https://arxiv.org/html/2605.30601#S1.E1)\),asfor exampleinKozyra and Reinert \[[2026](https://arxiv.org/html/2605.30601#bib.bib1)\]\.

###### Definition 1\(Stein class forℒp\\mathcal\{L\}\_\{p\}\)\.

Let $p$ be a continuously differentiable density on $\mathbb{R}^d$. A function $f:\mathbb{R}^d\to\mathbb{R}$ belongs to the Stein class of $p$ (for $\mathcal{L}_p$), denoted $f\in\mathcal{F}(p)$, if:

1. (S1) $f$ is twice continuously differentiable and $\Delta f$, $\nabla f$ are locally integrable with respect to Lebesgue measure.
2. (S2) The vector field $p(x)\nabla f(x)$ is integrable and its flux over spheres vanishes: $\lim_{R\to\infty}\int_{\partial B_R}p(x)\nabla f(x)\cdot n(x)\,dS(x)=0,$ where $B_R\subset\mathbb{R}^d$ is the Euclidean ball of radius $R$, and $n(x)$ is the outward unit normal.
3. (S3) $\mathcal{L}_p f$ is integrable under $p$.

In Kozyra and Reinert [[2026](https://arxiv.org/html/2605.30601#bib.bib1)] it is shown that if $p$ is continuously differentiable and $f\in\mathcal{F}(p)$, then the Stein identity ([20](https://arxiv.org/html/2605.30601#A1.E20)) holds, namely $\mathbb{E}_{X\sim p}[\mathcal{L}_p f(X)]=0.$
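The Stein identity can be checked numerically in a simple case. The sketch below is an illustration only: it assumes a one-dimensional standard Gaussian $p$ (so $s_p(x)=-x$) and the test function $f(x)=\cos x$, and it evaluates the expectation by trapezoidal quadrature. The helper names `stein_residual` and `expect_under_std_normal` are hypothetical and do not come from the paper.

```python
import math

def std_normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def stein_residual(x):
    """L_p f(x) = f''(x) + s_p(x) f'(x) for f = cos and p = N(0, 1)."""
    score = -x                                  # s_p(x) = d/dx log p(x)
    return -math.cos(x) + score * (-math.sin(x))

def expect_under_std_normal(g, lo=-10.0, hi=10.0, n=20001):
    """Trapezoidal approximation of E_p[g(X)] for p = N(0, 1)."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * h
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * g(x) * std_normal_pdf(x)
    return total * h

stein_mean = expect_under_std_normal(stein_residual)
print(f"E_p[L_p f(X)] is approximately {stein_mean:.2e}")
```

Here the two contributions $\mathbb{E}[-\cos X]$ and $\mathbb{E}[X\sin X]$ are individually non-zero and cancel, so the check is not a mere symmetry argument.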

### B.1 Proof of ([7](https://arxiv.org/html/2605.30601#S4.E7))

We clarify the assumptions for \([7](https://arxiv.org/html/2605.30601#S4.E7)\) in the following result\.

###### Proposition 1\.

Let $\mathcal{L}_p$ be as in ([1](https://arxiv.org/html/2605.30601#S1.E1)). Assume that $p$ is a twice continuously differentiable probability density, that $f\in\mathcal{F}(p)$, and that $\mathcal{L}_p f$ is integrable as well as differentiable. Let $H_p(x)=-\nabla^2\log p(x)$. Then

$$\mathbb{E}_p\!\left[(\mathcal{L}_p f(X))^2\right]=\mathbb{E}_p\!\left[\|\nabla^2 f(X)\|_F^2\right]+\mathbb{E}_p\!\left[\nabla f(X)^\top H_p(X)\nabla f(X)\right].$$

###### Proof\.

Recall that $\mathcal{L}_p f(x)=\Delta f(x)+s_p(x)^\top\nabla f(x)$, where $s_p(x)=\nabla\log p(x)$. From integration by parts,

$$\mathbb{E}\,(\mathcal{L}_p f(X))^2=-\mathbb{E}\,\nabla f(X)\cdot\nabla(\mathcal{L}_p f(X)).$$

Now,

$$\nabla(\mathcal{L}_p f)=\nabla(\Delta f)+\nabla(s_p^\top\nabla f),$$

and

$$\nabla(s_p^\top\nabla f)=(\nabla s_p)^\top\nabla f+(\nabla^2 f)\,s_p.$$

Hence

$$\nabla f\cdot\nabla(\mathcal{L}_p f)=\nabla f\cdot\nabla(\Delta f)+\langle\nabla f,(\nabla s_p)^\top\nabla f\rangle+\langle\nabla f,(\nabla^2 f)\,s_p\rangle.$$

We use the identity

$$\frac{1}{2}\Delta|\nabla f|^2=\langle\nabla f,\nabla(\Delta f)\rangle+\|\nabla^2 f\|_F^2$$

with $\|\cdot\|_F$ denoting the Frobenius norm, together with $\langle\nabla f,(\nabla^2 f)\,s_p\rangle=\frac{1}{2}s_p^\top\nabla|\nabla f|^2$ and the symmetry of $\nabla s_p=\nabla^2\log p$. The right-hand side then equals $\frac{1}{2}\mathcal{L}_p(|\nabla f|^2)-\|\nabla^2 f\|_F^2+\langle\nabla f,\nabla s_p\,\nabla f\rangle$. Taking expectations, the term $\mathbb{E}\,\mathcal{L}_p(|\nabla f|^2)(X)$ vanishes by the Stein identity (provided $|\nabla f|^2$ also satisfies it), and we obtain

$$\mathbb{E}\,(\mathcal{L}_p f(X))^2=\mathbb{E}\,\|\nabla^2 f(X)\|_F^2-\mathbb{E}\,\langle\nabla f(X),\nabla s_p(X)\nabla f(X)\rangle.$$

Since $\nabla s_p=-H_p$, re-writing the inner product gives the assertion. ∎
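As a numerical sanity check of Proposition 1, the following sketch compares both sides of the identity in one dimension. It assumes a Gaussian $p=\mathcal{N}(0,\sigma^2)$ with $\sigma=2$, for which $s_p(x)=-x/\sigma^2$ and $H_p=1/\sigma^2$, and the test function $f(x)=\sin x$. These choices and the helper names are illustrative assumptions, not part of the paper.

```python
import math

SIGMA = 2.0  # assumed standard deviation of p = N(0, SIGMA^2)

def pdf(x):
    return math.exp(-0.5 * (x / SIGMA) ** 2) / (SIGMA * math.sqrt(2.0 * math.pi))

def expect(g, half_width=12.0 * SIGMA, n=40001):
    """Trapezoidal approximation of E_p[g(X)]."""
    h = 2.0 * half_width / (n - 1)
    total = 0.0
    for i in range(n):
        x = -half_width + i * h
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * g(x) * pdf(x)
    return total * h

def residual(x):
    # f = sin: f' = cos, f'' = -sin, so L_p f = -sin(x) - (x / SIGMA^2) cos(x).
    return -math.sin(x) - (x / SIGMA ** 2) * math.cos(x)

lhs = expect(lambda x: residual(x) ** 2)                  # E[(L_p f)^2]
hessian_term = expect(lambda x: math.sin(x) ** 2)         # E[|f''|^2]
curvature_term = expect(lambda x: math.cos(x) ** 2 / SIGMA ** 2)  # E[f' H_p f']
rhs = hessian_term + curvature_term
print(f"lhs = {lhs:.6f}, rhs = {rhs:.6f}")
```

In this example the curvature term is scaled by $1/\sigma^2$, so directions in which the density is flatter (larger $\sigma$) contribute less gradient penalty.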

### B.2 Proof of ([10](https://arxiv.org/html/2605.30601#S4.E10))

Here we prove ([10](https://arxiv.org/html/2605.30601#S4.E10)) and detail the regularity assumptions used. Recall that we assume that $q$ is absolutely continuous with respect to $p$, and denote the likelihood ratio by

$$\ell(x)=\frac{q(x)}{p(x)}.$$

###### Proposition 2\.

Assume that $f\in\mathcal{F}(p)$; that $\mathcal{L}_p f\in L^1(q)$; that $\ell$ is differentiable and $p\nabla(f\,\ell)$ is integrable; that the boundary flux vanishes for the vector field $p\,\ell\,\nabla f$,

$$\lim_{R\to\infty}\int_{\partial B_R}p(x)\,\ell(x)\,\nabla f(x)\cdot n(x)\,dS(x)=0;$$

and that $\nabla f^\top\nabla\ell$ is integrable under Lebesgue measure. Then

$$\sup_{q\in\mathcal{Q}_\rho}\left|\mathbb{E}_q[\mathcal{L}_p f(X)]\right|\leq\rho\sqrt{\mathbb{E}_p\!\left[(\mathcal{L}_p f(X))^2\right]}.$$

###### Proof\.

Under the exact Stein identity, $\mathbb{E}_p[\mathcal{L}_p f(X)]=0$. Then,

$$\mathbb{E}_q[\mathcal{L}_p f(X)]=\mathbb{E}_p[\ell(X)\,\mathcal{L}_p f(X)]=\mathbb{E}_p[(\ell(X)-1)\,\mathcal{L}_p f(X)].$$

Applying Cauchy–Schwarz gives

$$\left|\mathbb{E}_q[\mathcal{L}_p f(X)]\right|\leq\sqrt{\chi^2(q\|p)}\,\sqrt{\mathbb{E}_p\!\left[(\mathcal{L}_p f(X))^2\right]},$$

where the first factor is the square root of the $\chi^2$ divergence between $q$ and $p$,

$$\chi^2(q\|p)=\mathbb{E}_p\!\left[\left(\frac{q(X)}{p(X)}-1\right)^2\right].$$

Using the definition $\mathcal{Q}_\rho=\left\{q:\chi^2(q\|p)\leq\rho^2\right\}$ gives the assertion. ∎
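The bound can be illustrated on a mean-shifted Gaussian. The sketch below assumes $p=\mathcal{N}(0,1)$, $q=\mathcal{N}(\mu,1)$ with $\mu=0.5$, and $f(x)=\sin x$; for this pair the $\chi^2$ divergence has the closed form $e^{\mu^2}-1$, which the code uses only as a cross-check. All names and parameter values are illustrative assumptions.

```python
import math

MU = 0.5  # assumed mean shift: q = N(MU, 1), p = N(0, 1)

def normal_pdf(x, mean=0.0):
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def integrate(g, lo=-12.0, hi=12.0, n=24001):
    """Trapezoidal rule on [lo, hi]."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * h
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * g(x)
    return total * h

def residual(x):
    # L_p f for f = sin under p = N(0, 1): f'' + s_p f' with s_p(x) = -x.
    return -math.sin(x) - x * math.cos(x)

# E_q[L_p f]: the Stein residual averaged under the shifted distribution q.
shifted_mean = integrate(lambda x: residual(x) * normal_pdf(x, MU))
# chi^2(q || p) = E_p[(q/p - 1)^2].
chi2 = integrate(lambda x: (normal_pdf(x, MU) / normal_pdf(x) - 1.0) ** 2 * normal_pdf(x))
# E_p[(L_p f)^2]: the quantity that the Stein penalty controls.
second_moment = integrate(lambda x: residual(x) ** 2 * normal_pdf(x))
bound = math.sqrt(chi2) * math.sqrt(second_moment)
print(f"|E_q[L_p f]| = {abs(shifted_mean):.4f} <= bound = {bound:.4f}")
```

The gap between the two sides reflects the slack in Cauchy–Schwarz; the bound is tight only when $\mathcal{L}_p f$ is proportional to $\ell-1$.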

## Appendix C Experimental Details

This appendix provides additional implementation details for the experiments in Section [5](https://arxiv.org/html/2605.30601#S5). Unless otherwise stated, all reported results use the same evaluation protocol within each dataset and architecture. Hyperparameter values that are varied in ablations or selected by validation are explicitly marked as placeholders.

### C.1 Datasets and architectures

#### CIFAR\-10\.

CIFAR-10 consists of $50{,}000$ training images and $10{,}000$ test images, with $10$ classes and spatial resolution $32\times 32$. We use the standard train/test split. Images are normalised using the dataset mean and standard deviation. Data augmentation follows the standard CIFAR protocol: random horizontal flips and random crops with padding. Unless otherwise stated, robustness is evaluated on the full CIFAR-10 test set.

For the main CIFAR-10 experiments we use ResNet-18. The architecture is adapted to CIFAR resolution by replacing the initial ImageNet-style convolution and max-pooling stem with a $3\times 3$ convolution of stride $1$ and no initial max-pooling. The final linear layer outputs $10$ logits. Unless otherwise specified, TASER is applied to the logits.

### C.2 Base training methods

We evaluate TASER on top of several base training procedures\. Each method is first trained without TASER, and the resulting checkpoint is subsequently fine\-tuned with TASER\. The reported hyperparameters are either taken from the original publications or follow the best community practices\.

#### Standard training\.

The standard baseline minimises cross-entropy with mild $\ell_2$ weight decay:

$$\mathcal{L}_{\mathrm{std}}(\theta)=\mathbb{E}_{(x,y)}\left[\mathrm{CE}(f_\theta(x),y)\right]+\lambda_{\mathrm{wd}}\|\theta\|_2^2.$$

We use weight decay $\lambda_{\mathrm{wd}}=0.0001$.

#### PGD adversarial training\.

For PGD adversarial training, adversarial examples are generated by projected gradient ascent on the cross-entropy loss,

$$x_{\mathrm{adv}}\approx\arg\max_{\|x'-x\|_\infty\leq\epsilon}\mathrm{CE}(f_\theta(x'),y),$$

and the model is updated using

$$\mathcal{L}_{\mathrm{PGD}}(\theta)=\mathbb{E}_{(x,y)}\left[\mathrm{CE}(f_\theta(x_{\mathrm{adv}}),y)\right].$$

During training we use $\epsilon=8/255$, `steps=10`, and `step_size=2/255`.
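The inner maximisation can be sketched on a toy model. The example below is a minimal illustration, assuming a binary logistic classifier in place of $f_\theta$ and a signed-gradient ascent step followed by projection onto the $\ell_\infty$ ball; it omits random starts and pixel-range clipping, which the text does not specify. The model, weights, and function names are hypothetical; only `eps`, `steps`, and `step_size` are taken from the text.

```python
import math

EPS, STEPS, STEP_SIZE = 8 / 255, 10, 2 / 255  # values quoted in the text

def logit(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def ce_loss(w, b, x, y):
    """Binary cross-entropy of a logistic model (toy stand-in for f_theta)."""
    z = logit(w, b, x)
    # log(1 + exp(z)) - y * z, written in a numerically stable form
    return max(z, 0.0) + math.log1p(math.exp(-abs(z))) - y * z

def ce_grad_x(w, b, x, y):
    """Gradient of the loss with respect to the input x."""
    p = 1.0 / (1.0 + math.exp(-logit(w, b, x)))
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def pgd_linf(w, b, x, y, eps=EPS, steps=STEPS, step_size=STEP_SIZE):
    x_adv = list(x)
    for _ in range(steps):
        g = ce_grad_x(w, b, x_adv, y)
        # signed ascent step on the loss
        x_adv = [xa + step_size * sign(gi) for xa, gi in zip(x_adv, g)]
        # projection back onto the l_inf ball of radius eps around x
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv

w, b = [1.5, -2.0, 0.5], 0.1       # hypothetical model parameters
x, y = [0.2, 0.4, 0.6], 1          # hypothetical clean input and label
x_adv = pgd_linf(w, b, x, y)
print(ce_loss(w, b, x, y), ce_loss(w, b, x_adv, y))
```

With ten steps of size $2/255$ and a budget of $8/255$, the iterate can reach the boundary of the ball after four steps, so the remaining steps only re-project.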

#### TRADES\.

TRADES optimises a trade-off between clean accuracy and local robustness:

$$\mathcal{L}_{\mathrm{TRADES}}(\theta)=\mathrm{CE}(f_\theta(x),y)+\beta\,\mathrm{KL}\!\left(f_\theta(x)\,\|\,f_\theta(x_{\mathrm{adv}})\right),$$

where $x_{\mathrm{adv}}$ is generated to maximise the KL divergence between clean and perturbed predictions. We use $\beta=6.0$, $\epsilon=8/255$, `steps=10`, and `step_size=2/255`.

#### MART\.

MART combines adversarial training with a misclassification-aware weighting of the loss. We use the standard MART objective

$$\mathcal{L}_{\mathrm{MART}}(\theta)=\mathcal{L}_{\mathrm{adv}}(\theta)+\lambda_{\mathrm{MART}}\mathcal{L}_{\mathrm{rob}}(\theta),$$

where $\mathcal{L}_{\mathrm{adv}}$ is the adversarial classification term and $\mathcal{L}_{\mathrm{rob}}$ is the MART robustness regulariser. We set $\lambda_{\mathrm{MART}}=5.0$ and use PGD with $\epsilon=8/255$, `steps=10`, and `step_size=2/255` as the inner adversary.

#### Adversarial Weight Perturbation \(AWP\)\.

For experiments using adversarial weight perturbation (AWP), we augment the base robust objective with an additional perturbation in parameter space [Wu et al., [2020](https://arxiv.org/html/2605.30601#bib.bib25)]. During training, the model weights are temporarily perturbed to maximise the robust loss, after which the perturbation is removed before the optimisation step. Concretely, AWP is implemented as a dual perturb/restore procedure around the robust loss computation:

$$\theta\rightarrow\theta+\delta_{\mathrm{awp}}\rightarrow\theta,$$

where the perturbation is applied only after a warmup phase. Unless otherwise stated, we use `awp_gamma=0.005`, `awp_rho=5e-3`, `awp_num_steps=1`, and `awp_start_epoch=10`.

### C.3 TASER training and fine-tuning protocols

TASER can be used either during end\-to\-end training or as a post hoc fine\-tuning stage\. In both cases, the regulariser is applied through the same Stein residual objective, but the optimisation setup and practical motivation differ\.

#### TASER during training\.

In the end-to-end setting, TASER is incorporated directly into the training objective from the beginning of optimisation. Given a base training loss $\mathcal{L}_{\mathrm{base}}$, we optimise

$$\mathcal{L}_{\mathrm{total}}(\theta)=\mathcal{L}_{\mathrm{base}}(\theta)+\lambda(t)\,\mathcal{R}_{\mathrm{TASER}}(\theta),\tag{25}$$

where $\lambda(t)$ is a scheduled regularisation coefficient. This setting treats TASER as a geometry-aware smoothness prior that shapes the learned representation throughout training. In practice, we find that gradually ramping $\lambda(t)$ from zero improves optimisation stability, particularly in the early stages of training when model gradients and score estimates are less stable.

This formulation is compatible with both standard and adversarial training objectives\. In particular, TASER can be combined directly with PGD adversarial training, TRADES, MART, AWP, or related robust optimisation schemes without modifying their underlying attack procedures\.

#### TASER fine\-tuning\.

In addition to end-to-end training, we consider a post-training fine-tuning setup motivated by practical deployment scenarios. Given a pretrained model $f_{\theta_0}$ trained using some base method with loss $\mathcal{L}_{\mathrm{base}}$, we optimise

$$\mathcal{L}_{\mathrm{total}}(\theta)=\mathcal{L}_{\mathrm{base}}(\theta)+\alpha\,\mathrm{KL}\!\left(f_{\theta_0}\,\|\,f_\theta\right)+\lambda(t)\,\mathcal{R}_{\mathrm{TASER}}(\theta).\tag{26}$$

Here the KL term acts as a teacher regulariser, stabilising optimisation by encouraging the fine-tuned model to remain close to the pretrained predictor. The regularisation coefficient $\lambda(t)$ is again scheduled throughout training.
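The structure of the fine-tuning objective (26) can be sketched for a single example as follows. This is a minimal illustration, assuming that the KL term is computed between the softmax outputs of the frozen teacher $f_{\theta_0}$ and the current model $f_\theta$, and treating the base loss and the TASER penalty as precomputed scalars. The function names and all numerical values, including `alpha`, are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def finetune_loss(base_loss, teacher_logits, student_logits,
                  taser_penalty, alpha, lam):
    """L_base + alpha * KL(f_theta0 || f_theta) + lambda(t) * R_TASER."""
    kl = kl_divergence(softmax(teacher_logits), softmax(student_logits))
    return base_loss + alpha * kl + lam * taser_penalty

teacher = [2.0, 0.5, -1.0]                # hypothetical teacher logits
student = [1.5, 0.7, -0.8]                # hypothetical fine-tuned logits
loss = finetune_loss(base_loss=0.9, teacher_logits=teacher,
                     student_logits=student, taser_penalty=0.3,
                     alpha=1.0, lam=0.5)
print(loss)
```

When the fine-tuned model reproduces the teacher's predictions exactly, the KL term vanishes and the objective reduces to the end-to-end form (25).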

This fine\-tuning configuration reflects a realistic setting in which a model has already been trained using a standard or robust objective, and TASER is added as an auxiliary robustness regulariser without retraining from scratch\. Since TASER depends only on model derivatives and a score estimate for the training distribution, it can be applied on top of existing checkpoints with minimal modification to the original training pipeline\.

#### Clean\-input application of TASER\.

When the base method generates adversarial examples $x_{\mathrm{adv}}$, the base loss is evaluated according to that method, but the TASER penalty is computed on the corresponding clean input $x$. Thus, for adversarially trained methods we use objectives of the schematic form

$$\mathcal{L}_{\mathrm{total}}=\mathcal{L}_{\mathrm{base}}(x_{\mathrm{adv}},y)+\lambda(t)\,\mathcal{R}_{\mathrm{TASER}}(x).$$

For TRADES, where the base loss contains both clean and adversarial terms, TASER is still applied to the clean input $x$. This choice keeps the Stein regulariser aligned with the score model of the training distribution, rather than applying the clean-data score field to off-manifold adversarial inputs.

#### Adversarial\-lite fine\-tuning\.

In some fine\-tuning experiments we optionally include a lightweight adversarial component in the base loss, generated using a small number of projected gradient steps\. This provides local worst\-case pressure without incurring the full cost of adversarial training\. In settings where the original robust training pipeline is difficult to reproduce—for example due to custom optimisation schemes, synthetic\-data augmentation, or large\-scale generative components—this setup provides a practical and reproducible alternative while retaining compatibility with TASER\.

### C.4 Score models

TASER requires an estimate $\tilde{s}(x)\approx\nabla\log p(x)$ of the score of the training input distribution. We use diffusion or denoising score models trained on the same training distribution as the classifier.

#### CIFAR\-10 score model\.

For CIFAR-10, we use `score_model=[PLACEHOLDER: e.g. DDPM/EDM checkpoint name]` evaluated at diffusion timestep/noise level `t=50`. If the score model predicts noise $\epsilon_\phi(x_t,t)$, we convert it to a score estimate using the standard diffusion relation

$$\tilde{s}(x_t,t)=-\frac{\epsilon_\phi(x_t,t)}{\sigma_t},$$

with the precise convention depending on the parameterisation of the diffusion checkpoint.
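The conversion can be sketched in a few lines. The example assumes the simplest convention, in which a noisy sample is $x_t=x_0+\sigma_t\,\epsilon$, so that the score of the Gaussian perturbation kernel is $-(x_t-x_0)/\sigma_t^2=-\epsilon/\sigma_t$. Other checkpoint parameterisations may require additional scaling. The function name and the numbers are hypothetical.

```python
def score_from_noise(eps_pred, sigma_t):
    """Convert a predicted noise vector into a score estimate: s = -eps / sigma_t."""
    if sigma_t <= 0.0:
        raise ValueError("sigma_t must be positive")
    return [-e / sigma_t for e in eps_pred]

# Consistency check against the Gaussian perturbation kernel N(x_t; x_0, sigma_t^2 I).
x0, sigma_t, eps = [0.1, 0.7], 0.5, [0.5, -1.0]
x_t = [a + sigma_t * e for a, e in zip(x0, eps)]
kernel_score = [-(xt - a) / sigma_t ** 2 for xt, a in zip(x_t, x0)]
score = score_from_noise(eps, sigma_t)
print(score, kernel_score)
```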

#### Score normalisation\.

To make regularisation strengths comparable across different choices of the diffusion timestep $t$, we optionally rescale the score field as $\tilde{s}_{\mathrm{norm}}(x)=c_t\,\tilde{s}(x)$. The scaling rule is as follows: for a given timestep $t$ we empirically estimate the standard deviation $\sigma_t$ of the score and set $c_t=\frac{\sigma_t}{\sigma_{50}}$.

Strictly speaking, this rescaling modifies the Stein operator and therefore breaks the exact Stein identity associated with the original distribution\. In practice, however, it substantially improves comparability of the TASER penalty across diffusion timesteps by standardising the magnitude of the score field\. Equivalently, this procedure can be interpreted as an approximate timestep\-dependent adaptation of the effective regularisation strength\.

### C.5 Optimisation and schedules

#### TASER during training\.

Models are trained for `E_base=200` epochs with batch size `B=256`. The optimiser is `optimizer=ADAM` with initial learning rate `eta0=0.001`, momentum or Adam parameters `momentum/betas=(0.9, 0.999)`, weight decay `lambda_wd=0.0001`, and learning-rate schedule `base_lr_schedule=cosine-decay`. The TASER regularisation coefficient $\lambda$ is set to 1.0 unless otherwise stated.

#### TASER fine\-tuning\.

TASER fine-tuning is run for `E_TASER=50` additional epochs. During this stage, the learning rate follows linear warmup followed by cosine decay. If $t$ denotes the fine-tuning step, $T$ the total number of fine-tuning steps, and $T_{\mathrm{warm}}$ the number of warmup steps, then

$$\eta(t)=\begin{cases}\eta_{\max}\,t/T_{\mathrm{warm}}, & t<T_{\mathrm{warm}},\\[3pt] \eta_{\min}+\frac{1}{2}(\eta_{\max}-\eta_{\min})\left[1+\cos\!\left(\pi\frac{t-T_{\mathrm{warm}}}{T-T_{\mathrm{warm}}}\right)\right], & t\geq T_{\mathrm{warm}}.\end{cases}$$

We use $\eta_{\max}=0.001$, $\eta_{\min}=0.0001$, and $T_{\mathrm{warm}}=0.1T$.
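The schedule translates directly into code. The sketch below uses the stated values of $\eta_{\max}$, $\eta_{\min}$, and the warmup fraction as defaults; the function name and the choice of $T=1000$ steps in the usage lines are illustrative assumptions.

```python
import math

def learning_rate(t, total_steps, eta_max=1e-3, eta_min=1e-4, warm_frac=0.1):
    """Linear warmup to eta_max, then cosine decay to eta_min."""
    t_warm = warm_frac * total_steps
    if t < t_warm:
        return eta_max * t / t_warm
    progress = (t - t_warm) / (total_steps - t_warm)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * progress))

T = 1000  # hypothetical number of fine-tuning steps
for step in (0, 50, 100, 550, 1000):
    print(step, learning_rate(step, T))
```

The two branches agree at $t=T_{\mathrm{warm}}$, where both give $\eta_{\max}$, so the schedule is continuous.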

The TASER regularisation coefficient is ramped from zero to its final value:

$$\lambda(t)=\lambda_{\max}\min\left\{1,\frac{t}{T_{\mathrm{warm}}}\right\}.$$

This ramp prevents the Stein penalty from dominating early fine-tuning dynamics before the optimiser has adapted to the additional derivative-based term.
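A minimal sketch of the ramp, and of how it enters the total loss, is given below. The default $\lambda_{\max}=1.0$ follows the value quoted for the regularisation coefficient; the function names and the scalar loss values in the usage lines are hypothetical.

```python
def taser_weight(t, t_warm, lam_max=1.0):
    """lambda(t) = lam_max * min(1, t / t_warm)."""
    return lam_max * min(1.0, t / t_warm)

def total_loss(base_loss, taser_penalty, t, t_warm, lam_max=1.0):
    """Base loss plus the ramped Stein penalty."""
    return base_loss + taser_weight(t, t_warm, lam_max) * taser_penalty

# At step 0 the penalty is switched off; after warmup it is fully active.
print(total_loss(0.8, 0.4, t=0, t_warm=100))
print(total_loss(0.8, 0.4, t=250, t_warm=100))
```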

### C.6 Adversarial evaluation

#### AutoAttack\.

For CIFAR-10, the main robustness metric is robust accuracy under AutoAttack with $\ell_\infty$ budget $\epsilon=8/255$. We use the standard version of the official AutoAttack implementation. As an additional diagnostic, we also use AutoAttack with $\ell_2$ budget $\epsilon=128/255$.

#### SPSA\.

As supplementary diagnostics, we evaluate query-based attacks. For SPSA we use an $\ell_\infty$ budget $\epsilon=8/255$ with `nb_iter=32` and `nb_sample=128`.

#### Evaluation subset\.

When query-based attacks are computationally expensive, we evaluate them on a fixed subset of the test set of size `N_eval=1000`. The subset is sampled once and shared across all methods.
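One simple way to obtain such a shared subset is to draw the indices once from a seeded generator. The sketch below is an assumption about how this could be done, not a description of the authors' code; the seed value and the function name are hypothetical.

```python
import random

def fixed_eval_subset(test_size=10_000, n_eval=1_000, seed=0):
    """Draw evaluation indices once, without replacement, from a seeded generator."""
    rng = random.Random(seed)             # local generator, independent of global state
    return sorted(rng.sample(range(test_size), n_eval))

subset = fixed_eval_subset()
print(len(subset), subset[:5])
```

Reusing the same seed for every method guarantees that all attacks are evaluated on identical test images.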

### C.7 Comment on licences.

All datasets, pretrained models, and benchmark checkpoints used in this work are publicly available and used in accordance with their respective licences and terms of use\. CIFAR\-10 and MNIST are used under their standard academic usage conditions, while ImageNet\-1K is used under the ImageNet non\-commercial research access policy\. Publicly released pretrained classifiers and diffusion checkpoints are used under their corresponding open\-source or research licences\.

### C.8 Computational cost.

All experiments were conducted on a single NVIDIA A10 GPU. Based on wall-clock timings, TASER introduces a moderate training overhead whose magnitude depends on the underlying training objective. For standard training, the overhead is approximately $\times 2.45$, while for adversarially trained models (PGD, TRADES, MART, and AWP variants) the overhead ranges between $\times 1.17$ and $\times 1.27$. Under the 200-epoch CIFAR-10 training schedule used in our experiments, this corresponds to an additional $\sim 0.9$–$1.2$ hours of training time for adversarially trained models. Across all six ResNet-18 CIFAR-10 experiments reported in Table [3](https://arxiv.org/html/2605.30601#A4.T3), the total training time increased from approximately $24.3$ GPU-hours to $31.0$ GPU-hours. These results indicate that TASER provides substantial robustness improvements while incurring a relatively modest computational overhead in practical settings.
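The aggregate figures can be related to the per-method overheads with a short calculation. The snippet below uses only the two totals quoted above; the variable names are illustrative.

```python
baseline_hours = 24.3   # total GPU-hours without TASER (from the text)
taser_hours = 31.0      # total GPU-hours with TASER (from the text)

extra_hours = taser_hours - baseline_hours
aggregate_ratio = taser_hours / baseline_hours
print(f"extra: {extra_hours:.1f} GPU-hours, aggregate overhead: x{aggregate_ratio:.2f}")
```

The aggregate ratio of roughly $\times 1.28$ sits just above the range reported for adversarially trained models, which is what one would expect when the six-run total also includes the standard-training run with its larger relative overhead.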

## Appendix D Additional Results

#### Per\-attack robustness breakdown\.

Table [3](https://arxiv.org/html/2605.30601#A4.T3) reports the disaggregated robust accuracy under AutoAttack and SPSA. The trends are consistent across attacks: adding TASER improves robustness for every training objective, with especially large gains for the standard model and smaller but still positive gains for adversarially trained models. This indicates that the improvement is not tied to a single evaluation attack, but reflects a broader increase in robustness across both first-order and gradient-estimation-based adversaries.

Table 3: Clean and robust accuracy on CIFAR-10 (ResNet-18), with runtime overhead from TASER.
