Rethinking Structural Anomaly Detection: From Decision Boundaries to Projection Operators
Summary
This paper rethinks structural anomaly detection by shifting from decision boundaries to projection operators onto the low-dimensional manifold of normal data, showing that projection-aligned methods outperform existing boundary-based and reconstruction-based approaches.
# From Decision Boundaries to Projection Operators
Source: [https://arxiv.org/html/2606.15280](https://arxiv.org/html/2606.15280)
## Rethinking Structural Anomaly Detection: From Decision Boundaries to Projection Operators
Alexander Bauer (1, 2); (1) Machine Learning Group, TU Berlin; (2) BIFOLD, Berlin, Germany; alexander.bauer@tu-berlin.de
###### Abstract
Most existing anomaly detection methods rely on estimating a probability density or learning an enclosing decision boundary, implicitly assuming that normal data occupies a region of non\-zero volume in the ambient space\. In contrast, structural anomaly detection considers data that lies near a low\-dimensional manifold, creating a mismatch between the inductive bias of existing methods and the structure of the data, often resulting in degraded performance\. To address this mismatch, we introduce a geometric perspective\. Specifically, we learn a projection operator onto the manifold of normal samples and define a sample as anomalous if it is altered by this projection\. This formulation naturally integrates the inductive bias of manifold\-supported data and reframes anomaly detection in terms of a projection residual, thereby resolving issues arising from modeling degenerate distributions\. Notably, it provides a unifying interpretation of reconstruction\-based methods by explaining their success and failure in terms of projection quality\. In particular, it explains the strong generalization ability of projection\-aligned models as a consequence of contraction behavior toward the manifold\. Moreover, by decoupling anomaly detection from probabilistic modeling, it reduces the tendency to misclassify rare but normal samples, a widely recognized limitation of existing approaches\. Empirically, we demonstrate that projection\-aligned methods achieve strong performance, outperforming boundary\-based methods while improving upon existing reconstruction\-based approaches\.
## 1 Introduction
Conceptually, anomaly detection \(AD\) aims to identify observations that deviate from a notion of normality\. In many practical scenarios, such as industrial inspection or medical imaging, anomalies arise as structural deviations in input images, including surface defects, shape deformations, or other irregularities disrupting normal visual patterns\.
The vast majority of existing AD methods Schölkopf et al. ([2001](https://arxiv.org/html/2606.15280#bib.bib1)); Tax and Duin ([2004](https://arxiv.org/html/2606.15280#bib.bib2)); Parzen ([1962](https://arxiv.org/html/2606.15280#bib.bib102)); Kim et al. ([2023](https://arxiv.org/html/2606.15280#bib.bib109)); Ruff et al. ([2018](https://arxiv.org/html/2606.15280#bib.bib3)); Tack et al. ([2020](https://arxiv.org/html/2606.15280#bib.bib111)); Ruff et al. ([2021](https://arxiv.org/html/2606.15280#bib.bib81)) define normality either through a probability density or, more generally, through a decision boundary enclosing normal samples. Crucially, this relies on an implicit assumption that normal data occupies a region of non-zero volume in the ambient space. However, this assumption is fundamentally misaligned with high-dimensional perceptual data such as images, where normal samples concentrate near a low-dimensional manifold. Since an embedded manifold has no interior with respect to the ambient space, any enclosing decision boundary necessarily includes unsupported regions, rendering it conceptually ill-posed. Employing boundary-based methods in this setting leads to fundamental practical issues, including unstable learning behavior, sensitivity to noise in the scoring function, and severe effects of the curse of dimensionality. Figure [1](https://arxiv.org/html/2606.15280#S1.F1) illustrates, using the example of a One-Class Support Vector Machine (OC-SVM) Schölkopf et al. ([2001](https://arxiv.org/html/2606.15280#bib.bib1)), how the mismatch between the volumetric assumption of boundary-based methods and the low-dimensional nature of manifold-supported data manifests geometrically. When combined with isotropic radial kernels (e.g., a Gaussian kernel), similarity is propagated equally in both tangent and normal directions of the data manifold.
Consequently, the kernel width induces an unavoidable trade\-off: large values admit larger off\-manifold regions, leading to false negatives, while small values require dense sampling to ensure manifold coverage and otherwise lead to false positives\. Although real\-world data exhibits a non\-zero volume through measurement error and finite\-resolution effects, these introduce only a thin neighborhood around the underlying manifold and do not alter its intrinsic structure in practice\.
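The kernel-width trade-off described above can be reproduced in a few lines. The following is a minimal sketch, not an OC-SVM: it assumes a simpler stand-in score (the mean Gaussian-kernel similarity to the training points, in the spirit of a Parzen estimate) and a toy manifold (the unit circle in the plane). The names `rbf_score` and `on_circle`, the sample count, and both kernel widths are illustrative choices that do not come from the paper.

```python
import numpy as np

def rbf_score(z, train, sigma):
    """Mean Gaussian-kernel similarity of z to the training points (higher = more normal)."""
    sq_dist = np.sum((train - z) ** 2, axis=1)
    return float(np.mean(np.exp(-sq_dist / (2.0 * sigma ** 2))))

def on_circle(angle):
    """Point on the unit circle at the given angle."""
    return np.array([np.cos(angle), np.sin(angle)])

# Normal data: 20 evenly spaced samples on the unit circle (a 1-D manifold in R^2).
n = 20
train = np.stack([on_circle(2.0 * np.pi * i / n) for i in range(n)])

off_manifold = np.array([0.0, 0.0])   # circle center: clearly not on the manifold
held_out = on_circle(np.pi / n)       # on the manifold, halfway between two samples

# Wide kernel: the off-manifold center scores higher than a valid point (false negative).
wide_off = rbf_score(off_manifold, train, sigma=2.0)
wide_on = rbf_score(held_out, train, sigma=2.0)

# Narrow kernel: a valid but unseen point scores far below every training point (false positive).
narrow_on = rbf_score(held_out, train, sigma=0.05)
narrow_train_min = min(rbf_score(p, train, sigma=0.05) for p in train)

print(wide_off > wide_on, narrow_on < 0.1 * narrow_train_min)
```

Because the kernel is isotropic, no single width fixes both failure cases in this toy setting: widening it to cover the gaps between samples also raises the score of the circle's interior.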
Figure 1: Illustration of the failure mode of an OC-SVM with an RBF kernel on manifold-supported data. Although normal samples concentrate near a low-dimensional manifold (blue curve), the isotropic similarity induced by the kernel in ambient space produces a volumetric inlier region. The effective radius of this region is controlled by the kernel width: decreasing it reduces coverage of the manifold and increases false positives (left), whereas increasing it expands the inlier region and increases false negatives (right). Note the change in classification of the star-shaped samples.

As the concept of an enclosing decision boundary becomes problematic for manifold-supported data, a natural alternative is to measure the abnormality of a sample via its distance to the manifold. However, this idea immediately encounters fundamental challenges: the manifold itself is unknown and must be inferred from finitely many samples, and computing distances requires solving a non-trivial optimization problem to identify the closest point on the manifold. In high-dimensional settings, both aspects are computationally and statistically intractable. Instead, we propose a different approach. The key observation is that, for natural images, the global shape of the data manifold is induced by strong spatial correlations between pixels, which drastically reduce the effective degrees of freedom. Specifically, these dependencies give rise to a form of *index-induced regularity*, rendering the value of a pixel largely predictable from its local neighborhood. This enables implicit modeling of the global geometry from individual samples by learning a mapping that corrects deviations from the manifold. When applied to anomalous data, this mapping acts as an approximate projection operator, mapping inputs toward the manifold without requiring explicit geometric representation.
In this way, global geometry becomes tied to local predictive structure shared across training samples, supporting the learning and generalization of a projection mapping\. The learned projection, in turn, induces an implicit characterization of the data manifold as the zero\-set of a reconstruction functional\. This provides a global description of normality without requiring explicit geometric estimation or costly optimization to identify nearest points on the manifold\.
To this end, we reinterpret and extend the family of correction-based methods Zavrtanik et al. ([2021b](https://arxiv.org/html/2606.15280#bib.bib8)); Li et al. ([2021](https://arxiv.org/html/2606.15280#bib.bib9)); Zavrtanik et al. ([2021a](https://arxiv.org/html/2606.15280#bib.bib33)); Pirnay and Chai ([2022](https://arxiv.org/html/2606.15280#bib.bib10)); Bauer et al. ([2024](https://arxiv.org/html/2606.15280#bib.bib96)) for *structural anomaly detection* (SAD) by viewing them as approximations of a learned projection operator onto the manifold of normal data. Given an input, the model produces a corrected version projected toward the manifold, while abnormality is quantified through the resulting projection residual. At the pixel level, the discrepancy between input and projection enables localization of anomalous regions. Importantly, this perspective provides a unified framework for understanding and improving existing approaches. First, it aligns the learning objective with the intrinsic geometry of the data, introducing an inductive bias consistent with manifold-supported distributions. Second, it offers a principled explanation for the limitations of decision-boundary methods and the success of reconstruction-based approaches in terms of projection quality. Third, by decoupling anomaly detection from probabilistic modeling, it reduces the tendency to misclassify rare but normal samples. Fourth, it provides a geometric interpretation of generalization through contractive behavior toward the manifold, where diverse perturbations map to consistent normal representations. Collectively, these insights provide a roadmap for improved algorithm design. In particular, they motivate geometry-aware regularization through Jacobian constraints, iterative refinement via fixed-point dynamics, and improved corruption strategies that encourage stable projection behavior and stronger generalization.
The remainder of the paper is organized as follows. Section [2](https://arxiv.org/html/2606.15280#S2) summarizes the training framework used to approximate the projection operator, including the inference procedure during prediction. Section [3](https://arxiv.org/html/2606.15280#S3) presents a formal analysis of this framework by highlighting the conservative projection as an optimal solution to SAD and discussing how such a mapping is approximated through training. We further examine how the projection-based perspective provides a geometric interpretation of model generalization and outline directions for future algorithmic improvements. Section [4](https://arxiv.org/html/2606.15280#S4) presents the experimental evaluation on two established industrial anomaly detection benchmarks and compares the presented framework with state-of-the-art approaches. Finally, Section [5](https://arxiv.org/html/2606.15280#S5) concludes the paper.
## 2 Methodology
Formally, we approximate the nonlinear projection operator using an autoencoder

$$f_{\boldsymbol{\theta}} : [0,1]^{h \times w \times 3} \rightarrow [0,1]^{h \times w \times 3},$$

where $\boldsymbol{\theta}$ denotes the learnable parameters. Both input and output correspond to RGB images of resolution $h \times w$. We make no architectural assumptions; in particular, the model is not required to contain a bottleneck. The term autoencoder is used in a general sense, referring simply to mappings with identical input and output dimensionality.
### 2.1 Training
The model is trained in a self-supervised fashion using input-output pairs of normal samples and their corrupted versions. Let $\boldsymbol{x} \in [0,1]^{h \times w \times 3}$ denote a normal (anomaly-free) image. A corrupted version $\hat{\boldsymbol{x}}$ is obtained by altering randomly selected regions, specified by a real-valued mask $\mathbf{M} \in [0,1]^{h \times w \times 3}$. Its complement is defined as $\bar{\mathbf{M}} := \mathbf{1} - \mathbf{M}$, where $\mathbf{1}$ denotes the tensor of ones.
We use a loss formulation established in prior work Bauer et al. ([2024](https://arxiv.org/html/2606.15280#bib.bib96)) and optimize it with respect to the model parameters $\boldsymbol{\theta}$:

$$\mathcal{L}(\hat{\boldsymbol{x}}, \boldsymbol{x}, \mathbf{M}; \boldsymbol{\theta}) := \frac{1-\lambda}{\|\bar{\mathbf{M}}\|_{1}} \|\bar{\mathbf{M}} \odot (f_{\boldsymbol{\theta}}(\hat{\boldsymbol{x}}) - \boldsymbol{x})\|_{2}^{2} + \frac{\lambda}{\|\mathbf{M}\|_{1}} \|\mathbf{M} \odot (f_{\boldsymbol{\theta}}(\hat{\boldsymbol{x}}) - \boldsymbol{x})\|_{2}^{2}, \tag{1}$$

where $\odot$ denotes elementwise tensor multiplication, $\|\cdot\|_{p}$ the $\ell^{p}$-norm, and $\lambda \in [0,1]$ balances the contribution of corrupted versus uncorrupted regions.
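To make the two terms of Eq. (1) concrete, the sketch below evaluates the loss with NumPy for a single image. It is only an illustration under stated assumptions: the function name `correction_loss`, the default `lam=0.5`, and the small constant `eps` (added here to avoid division by zero for an empty mask) are not taken from the paper, and a real training loop would use an automatic-differentiation framework instead.

```python
import numpy as np

def correction_loss(output, target, mask, lam=0.5, eps=1e-8):
    """Mask-weighted squared error following Eq. (1).

    output: model output f_theta(x_hat), shape (h, w, 3)
    target: clean image x, same shape
    mask:   real-valued corruption mask M in [0, 1], same shape
    lam:    weight lambda in [0, 1] placed on the corrupted region
    eps:    guard against empty masks (an addition for this sketch)
    """
    inv_mask = 1.0 - mask                      # complement M_bar = 1 - M
    diff = output - target
    # Squared l2 norm of each masked residual, normalized by the l1 mass of its mask.
    normal_term = np.sum((inv_mask * diff) ** 2) / (np.sum(np.abs(inv_mask)) + eps)
    corrupt_term = np.sum((mask * diff) ** 2) / (np.sum(np.abs(mask)) + eps)
    return (1.0 - lam) * normal_term + lam * corrupt_term

# Toy check: an 8x8 image whose 4x4 center is masked.
target = np.full((8, 8, 3), 0.25)
mask = np.zeros((8, 8, 3))
mask[2:6, 2:6, :] = 1.0

output_bad = target.copy()
output_bad[2:6, 2:6, :] += 0.5                 # error of 0.5 inside the mask only

loss_perfect = correction_loss(target, target, mask)
loss_bad = correction_loss(output_bad, target, mask, lam=0.5)
print(loss_perfect, loss_bad)
```

Normalizing each term by the mass of its own mask keeps the balance set by `lam` independent of how large the corrupted region happens to be.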
There are multiple ways of choosing a corruption pattern for generating the training data. An efficient and powerful augmentation technique proposed in Bauer et al. ([2024](https://arxiv.org/html/2606.15280#bib.bib96)) is to use an additional dataset (e.g., DTD Cimpoi et al. ([2014](https://arxiv.org/html/2606.15280#bib.bib119))) of background images $\mathcal{B}$ offering a variety of structural patterns not occurring in the normal data. Given a normal image $\boldsymbol{x} \in \mathcal{M}$, a corruption pattern $\boldsymbol{y} \in \mathcal{B}$, and a randomly selected smooth mask $\mathbf{M}$, a corrupted version is created according to $\hat{\boldsymbol{x}} = \mathbf{M} \odot \boldsymbol{y} + \bar{\mathbf{M}} \odot \boldsymbol{x}$.
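The blending rule above is a per-pixel convex combination, which the following sketch spells out. The mask generator is purely hypothetical: `smooth_mask` draws a disc with a sigmoid edge, whereas the paper does not specify how its smooth masks are sampled here. Random arrays stand in for a normal image and a background texture.

```python
import numpy as np

def smooth_mask(h, w, center, radius, softness=2.0):
    """Hypothetical smooth mask: a disc with a sigmoid edge, values in (0, 1)."""
    rows, cols = np.mgrid[0:h, 0:w]
    dist = np.sqrt((rows - center[0]) ** 2 + (cols - center[1]) ** 2)
    mask2d = 1.0 / (1.0 + np.exp((dist - radius) / softness))
    return np.repeat(mask2d[:, :, None], 3, axis=2)   # copy to all 3 channels

def corrupt(x, y, mask):
    """x_hat = M * y + (1 - M) * x, elementwise."""
    return mask * y + (1.0 - mask) * x

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(32, 32, 3))   # stand-in for a normal image
y = rng.uniform(0.0, 1.0, size=(32, 32, 3))   # stand-in for a texture from the background set
M = smooth_mask(32, 32, center=(16, 16), radius=6.0)
x_hat = corrupt(x, y, M)

print(x_hat.shape, float(M.max()), float(M.min()))
```

Since every mask value lies in [0, 1], the corrupted image stays inside the valid range [0, 1], and intermediate mask values produce partially transparent occlusions.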
### 2.2 Test-Time Detection
After training, anomaly localization is performed by comparing an input $\hat{\boldsymbol{x}}$ with its reconstruction $f_{\boldsymbol{\theta}}(\hat{\boldsymbol{x}})$. To this end, we define a pixel-wise discrepancy function

$$\Delta : [0,1]^{h \times w \times 3} \times [0,1]^{h \times w \times 3} \rightarrow [0,1]^{h \times w}.$$

Different choices for $\Delta$ exist in the literature, including Mean Squared Error (MSE), Structural Similarity Index Measure (SSIM) Bergmann et al. ([2019](https://arxiv.org/html/2606.15280#bib.bib19)), and Gradient Magnitude Similarity (GMS) Xue et al. ([2014](https://arxiv.org/html/2606.15280#bib.bib122)), each capturing distinct aspects of reconstruction quality. A binary segmentation mask for anomalous regions can be extracted by thresholding the pixel-wise map of anomaly scores. Independently of the specific choice of $\Delta$, applying spatial smoothing prior to thresholding improves stability.
For image-level detection, a global anomaly score is obtained by aggregating the values of the anomaly map, typically via summation. Alternative strategies, such as taking the maximum or restricting the aggregation to the largest responses, can reduce sensitivity to the spatial extent of anomalies. The overall inference pipeline is illustrated in Figure [2](https://arxiv.org/html/2606.15280#S2.F2).
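The test-time steps described in this subsection (discrepancy map, smoothing, thresholding, aggregation) can be sketched as follows. This is an illustration under assumptions rather than the paper's pipeline: it uses the MSE variant of $\Delta$, a plain 3x3 mean filter as the smoothing step, a hand-picked threshold of 0.1, and a constant image in place of a real model output. All function names are hypothetical.

```python
import numpy as np

def mse_map(x_hat, recon):
    """Pixel-wise discrepancy: squared error averaged over the 3 channels, in [0, 1]."""
    return np.mean((x_hat - recon) ** 2, axis=2)

def box_smooth(amap, k=3):
    """k x k mean filter with edge padding (a stand-in for any spatial smoothing)."""
    pad = k // 2
    padded = np.pad(amap, pad, mode="edge")
    out = np.zeros_like(amap)
    for dr in range(k):
        for dc in range(k):
            out += padded[dr:dr + amap.shape[0], dc:dc + amap.shape[1]]
    return out / (k * k)

def image_score(amap, mode="sum", top_k=10):
    """Aggregate an anomaly map into a single image-level score."""
    if mode == "sum":
        return float(amap.sum())
    if mode == "max":
        return float(amap.max())
    if mode == "topk":
        flat = np.sort(amap.ravel())[::-1]
        return float(flat[:top_k].mean())
    raise ValueError(f"unknown mode: {mode}")

# Toy input: the "reconstruction" is a flat gray image, the input has a bright 4x4 defect.
recon = np.full((16, 16, 3), 0.5)
x_hat = recon.copy()
x_hat[4:8, 4:8, :] = 1.0

amap = mse_map(x_hat, recon)
smoothed = box_smooth(amap)
segmentation = smoothed > 0.1          # binary mask of anomalous pixels

print(int(segmentation.sum()), image_score(amap, "sum"), image_score(smoothed, "max"))
```

Summation grows with the area of the defect, while the `max` and `topk` modes depend only on the strongest responses, which is the sensitivity difference mentioned above.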
Figure 2: Illustration of our anomaly detection process. Given an input $\hat{\boldsymbol{x}}$, a trained model first produces an output $f_{\boldsymbol{\theta}}(\hat{\boldsymbol{x}})$, which preserves normal regions and replaces irregularities with locally consistent patterns. Second, we compute a pixel-wise discrepancy map $\Delta(\hat{\boldsymbol{x}}, f_{\boldsymbol{\theta}}(\hat{\boldsymbol{x}})) \in [0,1]^{h \times w}$ between the input and the output to localize anomalous regions.

In essence, this conceptually simple and principled framework enables accurate and efficient structural anomaly detection, as demonstrated in the experimental section.
## 3 Formal Analysis: Inductive Biases, Approximation and Generalization
This section formalizes the projection-based perspective as a principled framework for structural anomaly detection (SAD). We first show that it introduces the missing inductive bias required for manifold-supported data, providing a foundation for both detection and localization. We then identify the conservative projection as an optimal solution to SAD and discuss how it is approximated through the training framework presented in Section [2](https://arxiv.org/html/2606.15280#S2). Building on this view, we provide a geometric interpretation of model generalization through contraction toward the data manifold and discuss future directions for algorithmic development, including iterative refinement through fixed-point dynamics and geometry-aware regularization.
### 3.1 Geometric Inductive Bias via a Projection Operator
We now formalize the inductive biases required for detecting and localizing structural deviations\.
###### Inductive Bias 1 (Fixed-Point Condition).
Let $\mathcal{M} \subset \mathbb{R}^{d}$ denote the set of normal data points. A reconstruction mapping $f : \mathbb{R}^{d} \to \mathbb{R}^{d}$ must satisfy the fixed-point condition with respect to $\mathcal{M}$:

$$f(\boldsymbol{x}) = \boldsymbol{x} \quad \Longleftrightarrow \quad \boldsymbol{x} \in \mathcal{M}. \tag{2}$$
This property ensures identity reconstruction exclusively on the data manifold while excluding trivial identity mappings in the ambient space, a behavior often violated by regularized autoencoders. It therefore constitutes a minimal requirement for anomaly detection: any mapping satisfying this condition is sufficient for the detection task, as off-manifold inputs are necessarily altered by $f$, i.e., $f(\boldsymbol{x}) \neq \boldsymbol{x}$. In particular, any projection operator $f$ with $\operatorname{Im}(f) = \mathcal{M}$ satisfies this property.
Vanilla and many regularized autoencoders are often interpreted as learning the structure of the underlying data manifold Zhang et al. ([2017](https://arxiv.org/html/2606.15280#bib.bib131)); Connor et al. ([2021](https://arxiv.org/html/2606.15280#bib.bib132)). However, this interpretation is imprecise, as such models focus on reconstructing normal samples, leaving their behavior in the ambient space largely unconstrained. Moreover, commonly proposed regularization strategies, including reduced bottleneck capacity or Jacobian penalties Alain and Bengio ([2014](https://arxiv.org/html/2606.15280#bib.bib112)), are not sufficient to restrict near-identity reconstruction to the data manifold, a phenomenon frequently observed in the visual domain Bhattad et al. ([2018](https://arxiv.org/html/2606.15280#bib.bib16)); Aguila et al. ([2025](https://arxiv.org/html/2606.15280#bib.bib133)). As a result, these models often provide only a weak separation signal between normal and anomalous samples.
In contrast, assuming the model accurately realizes the behavior prescribed by the inductive bias in ([2](https://arxiv.org/html/2606.15280#S3.E2)), the residual mapping $\Phi(\boldsymbol{x}) := f(\boldsymbol{x}) - \boldsymbol{x}$ fully captures the normal data manifold as

$$\mathcal{M} = \Phi^{-1}(\mathbf{0}). \tag{3}$$

If $\Phi$ is smooth and its Jacobian $D\Phi(\boldsymbol{x})$ has constant rank $d-k$ in a neighborhood of $\mathcal{M}$, then by the implicit function theorem, the level set $\Phi^{-1}(\mathbf{0})$ locally defines a $k$-dimensional embedded submanifold of $\mathbb{R}^{d}$. In this sense, the reconstruction mapping $f$ admits a natural interpretation as a projection operator that implicitly captures the geometry of the normal manifold as its fixed-point set, thereby separating it from the ambient space and yielding a principled criterion for distinguishing between normal and anomalous samples.
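As a sanity check of the rank condition, consider the simplest possible case, chosen here purely for illustration and not taken from the paper: a flat manifold, where $f$ is the orthogonal projector $P$ onto a $k$-dimensional linear subspace of $\mathbb{R}^{d}$.

```latex
\begin{align*}
f(\boldsymbol{x}) &= P\boldsymbol{x},
  \qquad P = P^{\top} = P^{2}, \qquad \operatorname{rank}(P) = k, \\
\Phi(\boldsymbol{x}) &= f(\boldsymbol{x}) - \boldsymbol{x} = (P - I)\,\boldsymbol{x}, \\
D\Phi(\boldsymbol{x}) &= P - I,
  \qquad \operatorname{rank}(P - I) = d - k \quad \text{for every } \boldsymbol{x} \in \mathbb{R}^{d}, \\
\Phi^{-1}(\mathbf{0}) &= \{\boldsymbol{x} : P\boldsymbol{x} = \boldsymbol{x}\} = \operatorname{Im}(P) = \mathcal{M}.
\end{align*}
```

Here the Jacobian has constant rank $d-k$ everywhere, and the zero set of $\Phi$ is exactly the $k$-dimensional subspace, in agreement with Eq. (3). For a curved manifold the same reasoning applies only locally, and the constant-rank assumption has to be verified separately.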
Notably, this perspective directly aligns with the core objective of anomaly detection, namely learning a notion of normality\. In contrast to common approaches that only approximate this objective through surrogate losses, the proposed formulation explicitly defines normality via Eq\. \([3](https://arxiv.org/html/2606.15280#S3.E3)\), while anomalous samples are implicitly characterized by their deviation from it\. Indeed, this focus on normality is directly reflected in the projection behavior of a corresponding model, where large regions of the ambient space map to similar points on the manifold, rendering variation in off\-manifold directions largely irrelevant\. Importantly, this significantly reduces the effective complexity of the learning problem, which is governed by the intrinsic dimensionality of the normal manifold rather than that of the ambient space, thereby mitigating the curse of dimensionality and improving generalization\.
Beyond separating normal and anomalous samples, domains such as images (or time series) additionally require localization of anomalous regions at the pixel level. For this purpose, we leverage the residual mapping $\Phi$, which provides a spatial discrepancy map enabling localization of anomalous regions, as introduced in Section [2](https://arxiv.org/html/2606.15280#S2).
Given an anomalous sample identified with a vector $\hat{\boldsymbol{x}} \in \mathbb{R}^{d}$, we define the anomalous region as a minimal set of pixels $S \subset \{1, \dots, d\}$ that must be corrected to produce a normal sample $\boldsymbol{x} \in \mathcal{M}$. That is, $S$ is assumed to be uniquely defined and $\hat{\boldsymbol{x}}_{\bar{S}} = \boldsymbol{x}_{\bar{S}}$, where $\bar{S} := \{1, \dots, d\} \setminus S$ denotes the complement of $S$, and $\hat{\boldsymbol{x}}_{S}$, $\boldsymbol{x}_{S}$ denote vector slices restricted to indices in $S$. It is worth noting that the uniqueness assumption does not generally hold, as discussed in Bauer et al. ([2024](https://arxiv.org/html/2606.15280#bib.bib96)). However, the resulting ambiguity arising from the finite resolution of digitized images is typically limited in extent and can often be neglected in practice. Formally, this ambiguity vanishes in the limit of infinite resolution.
Having established the formal definition of a *structural anomaly* as an off-manifold sample $\hat{\boldsymbol{x}} \notin \mathcal{M}$ with uniquely localized anomalous part $\hat{\boldsymbol{x}}_{S}$, we can now derive an optimal projection operator for the task of SAD, given as

$$\Pi_{\mathrm{con}}(\boldsymbol{x}) := \arg\min_{\boldsymbol{y} \in \mathcal{M}} \|\boldsymbol{x} - \boldsymbol{y}\|_{0}, \tag{4}$$

which we refer to as a conservative projection. Here, $\|\cdot\|_{0}$ denotes the so-called $\ell_{0}$ pseudo-norm, defined as the number of nonzero components of a vector, $\|\boldsymbol{x}\|_{0} = \#\{i : x_{i} \neq 0\}$. In this context, minimizing $\|\boldsymbol{x} - \boldsymbol{y}\|_{0}$ enforces a conservative correction by minimizing the support of the modification, i.e., the number of altered components required to project $\boldsymbol{x}$ onto $\mathcal{M}$.
Unlike orthogonal projection, which minimizes geometric distance, the conservative projection minimizes correction support. Since the objective $\|\boldsymbol{x} - \boldsymbol{y}\|_{0}$ depends on support cardinality, the resulting operator is inherently non-smooth. Under uniqueness assumptions on the correction support, it may nevertheless exhibit piecewise-continuous or piecewise-differentiable behavior. Specifically, within regions where the optimal support remains unchanged, the mapping reduces to a constrained correction problem on the manifold, while non-smoothness arises at transitions between distinct supports. Consequently, such behavior may be locally approximated by a differentiable autoencoder that learns a smooth relaxation of conservative correction.
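The difference between the two projections is easiest to see on a toy problem. The sketch below assumes a finite "manifold" of three vectors in $\mathbb{R}^{4}$, which is a deliberate simplification (the paper's $\mathcal{M}$ is continuous and unknown), and solves both the $\ell_{0}$ problem of Eq. (4) and the nearest-point ($\ell_{2}$) problem by brute force. The function names and the tolerance are illustrative.

```python
import numpy as np

def conservative_projection(x, manifold, tol=1e-9):
    """Manifold point that differs from x in the fewest components (l0 criterion, Eq. (4))."""
    l0 = np.sum(np.abs(manifold - x) > tol, axis=1)
    return manifold[np.argmin(l0)]

def nearest_point_projection(x, manifold):
    """Manifold point closest to x in Euclidean distance (l2 criterion)."""
    l2 = np.sum((manifold - x) ** 2, axis=1)
    return manifold[np.argmin(l2)]

# Toy finite "manifold" of normal samples (an assumption made for this sketch).
manifold = np.array([
    [0.2, 0.2, 0.2, 0.2],
    [0.5, 0.5, 0.5, 0.5],
    [0.8, 0.8, 0.8, 0.8],
])

# The first normal sample with its last component corrupted.
x_hat = np.array([0.2, 0.2, 0.2, 1.0])

p_con = conservative_projection(x_hat, manifold)
p_l2 = nearest_point_projection(x_hat, manifold)

# Support of each correction: indices where the projection differs from the input.
support_con = np.flatnonzero(np.abs(p_con - x_hat) > 1e-9).tolist()
support_l2 = np.flatnonzero(np.abs(p_l2 - x_hat) > 1e-9).tolist()
print(p_con, support_con)
print(p_l2, support_l2)
```

In this example the conservative projection changes only the corrupted component, whereas the Euclidean nearest point spreads the correction over all four components, which would blur any localization derived from the residual.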
Under uniqueness of the correction support, the conservative projection $\Pi_{\mathrm{con}}$ yields an optimal solution for SAD. First, it naturally satisfies Inductive Bias [1](https://arxiv.org/html/2606.15280#Thminductivebias1). Anomalous regions are then identified from the individual components of the discrepancy map $\Phi$, such that a pixel $i \in \{1, \dots, d\}$ is anomalous if and only if $|\Phi_{i}(\hat{\boldsymbol{x}})| > 0$. In practice, however, deviations from the idealized assumptions, together with reconstruction inaccuracies and numerical noise, motivate the use of thresholds $\tau, \nu > 0$: specifically, $\|\Phi(\hat{\boldsymbol{x}})\|_{2} > \tau$ for anomaly detection and $|\Phi_{i}(\hat{\boldsymbol{x}})| > \nu$ for pixel-wise localization.
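The two thresholded criteria can be written down directly. The sketch below is only an illustration: the reconstructions are hand-written vectors standing in for the output of a trained model, and the values of `tau` and `nu` are arbitrary choices, not values from the paper.

```python
import numpy as np

def detect_and_localize(x_hat, recon, tau, nu):
    """Apply the thresholded criteria to the residual Phi(x_hat) = f(x_hat) - x_hat."""
    phi = recon - x_hat
    is_anomalous = bool(np.linalg.norm(phi) > tau)   # detection: ||Phi||_2 > tau
    pixel_mask = np.abs(phi) > nu                    # localization: |Phi_i| > nu
    return is_anomalous, pixel_mask

tau, nu = 0.1, 0.05   # illustrative thresholds

# Anomalous input: last component corrupted; the reconstruction is slightly noisy elsewhere.
x_anom = np.array([0.2, 0.2, 0.2, 1.0])
recon_anom = np.array([0.21, 0.19, 0.2, 0.2])
flag_anom, mask_anom = detect_and_localize(x_anom, recon_anom, tau, nu)

# Normal input: the reconstruction deviates only by small noise.
x_norm = np.array([0.5, 0.5, 0.5, 0.5])
recon_norm = np.array([0.51, 0.5, 0.49, 0.5])
flag_norm, mask_norm = detect_and_localize(x_norm, recon_norm, tau, nu)

print(flag_anom, mask_anom.tolist())
print(flag_norm, mask_norm.tolist())
```

With thresholds at zero, the small reconstruction noise on the normal sample would already trigger a detection, which is why strictly positive $\tau$ and $\nu$ are needed in practice.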
The conservative projection further reveals an additional inductive bias for projection operators required for accurate localization: minimal correction support. While the fixed-point condition ensures that anomalous samples are altered by the reconstruction mapping $f$, thereby enabling detection, accurate localization additionally requires preservation of normal regions.
###### Inductive Bias 2 (Minimal Correction Support).
A reconstruction mapping $f : \mathbb{R}^{d} \rightarrow \mathbb{R}^{d}$ must preserve all normal components while modifying only the anomalous regions. Formally, for any input $\hat{\boldsymbol{x}}$ with anomalous support $S \subseteq \{1, \dots, d\}$,

$$f_{i}(\hat{\boldsymbol{x}}) \neq \hat{x}_{i} \quad \Longleftrightarrow \quad i \in S. \tag{5}$$
It is worth noting that the requirement in ([5](https://arxiv.org/html/2606.15280#S3.E5)) does not strictly imply $f(\hat{\boldsymbol{x}}) \in \mathcal{M}$. While preservation of normal regions is necessary, alteration of the corrupted region is both necessary and sufficient for accurate localization. In particular, Inductive Bias [2](https://arxiv.org/html/2606.15280#Thminductivebias2) implies Inductive Bias [1](https://arxiv.org/html/2606.15280#Thminductivebias1): a normal sample has empty anomalous support and is therefore left unchanged, whereas an anomalous sample is altered on its non-empty support. In our framework, these inductive biases are approximated by minimizing reconstruction loss on partially occluded images with varying opacity. Specifically, the objective in ([1](https://arxiv.org/html/2606.15280#S2.E1)) directly promotes preservation of normal regions while replacing corrupted regions with manifold-consistent patterns.
### 3.2 On the Approximation of the Conservative Projection Operator
The following analysis clarifies how our training framework encodes the inductive biases discussed in the previous section by approximating the conservative projection operator\.
For this purpose, we define the set-valued mappings $N \colon \mathbb{R}^{d} \rightarrow 2^{\{1, \dots, d\}}$ and $A \colon \mathbb{R}^{d} \rightarrow 2^{\{1, \dots, d\}}$, which extract the indices corresponding to the normal and anomalous regions of an input, respectively. As discussed previously, we assume that these mappings are well-defined and that the assignment to normal versus anomalous regions is unique. In particular, $N(\hat{\boldsymbol{x}}) = \{1, \dots, d\} \setminus A(\hat{\boldsymbol{x}})$ for all $\hat{\boldsymbol{x}} \in \mathbb{R}^{d}$. Furthermore, we assume that a regular conditional distribution $p(\boldsymbol{x}_{A} \mid \boldsymbol{x}_{N})$ exists, that $\mathbb{E}[\|\boldsymbol{x}\|^{2}] < \infty$, and that the distribution over corrupted samples $q(\hat{\boldsymbol{x}})$ is such that, for $q$-a.e. corrupted input $\hat{\boldsymbol{x}}$, the conditioning values $\hat{\boldsymbol{x}}_{N(\hat{\boldsymbol{x}})}$ lie in the support of the marginal distribution of $\boldsymbol{x}_{N(\hat{\boldsymbol{x}})}$ under $p$. Finally, all conditional expectations below are also conditioned on the mechanism by which the corruption process generates patterns on $A(\hat{\boldsymbol{x}})$. We omit this dependence from the notation for simplicity.
###### Theorem 1\.
Let $p(\boldsymbol{x})$ denote the distribution of normal data, let $q(\hat{\boldsymbol{x}})$ denote a distribution of corrupted samples, and let $k \in \mathbb{N}$ denote the dimension of the parameter vector $\boldsymbol{\theta}$. Consider the following unconstrained optimization problem:

$$\underset{\boldsymbol{\theta} \in \mathbb{R}^{k}}{\text{minimize}} \quad \mathbb{E}_{q(\hat{\boldsymbol{x}})}\!\left[\mathbb{E}_{p(\boldsymbol{x})}\!\left[\|f_{\boldsymbol{\theta}}(\hat{\boldsymbol{x}}) - \boldsymbol{x}\|^{2} \;\middle|\; \boldsymbol{x}_{N(\hat{\boldsymbol{x}})} = \hat{\boldsymbol{x}}_{N(\hat{\boldsymbol{x}})}\right]\right]. \tag{6}$$

Provided sufficient capacity of the hypothesis class $\{f_{\boldsymbol{\theta}}\}$, every minimizer $f^{*} := f_{\boldsymbol{\theta}^{*}}$ satisfies, for $q$-a.e. $\hat{\boldsymbol{x}}$,

$$f^{*}_{N(\hat{\boldsymbol{x}})}(\hat{\boldsymbol{x}}) = \hat{\boldsymbol{x}}_{N(\hat{\boldsymbol{x}})} \qquad \text{and} \qquad f^{*}_{A(\hat{\boldsymbol{x}})}(\hat{\boldsymbol{x}}) = \mathbb{E}_{p(\boldsymbol{x})}\!\left[\boldsymbol{x}_{A(\hat{\boldsymbol{x}})} \;\middle|\; \boldsymbol{x}_{N(\hat{\boldsymbol{x}})} = \hat{\boldsymbol{x}}_{N(\hat{\boldsymbol{x}})}\right]. \tag{7}$$
The above characterization describes the Bayes-optimal correction operator $f^{*}$. For a parameterized hypothesis class $\{f_{\boldsymbol{\theta}}\}$, the conclusion holds whenever this class is sufficiently expressive to realize the mapping in ([7](https://arxiv.org/html/2606.15280#S3.E7)). The training objective in ([1](https://arxiv.org/html/2606.15280#S2.E1)) can be interpreted as an approximation of the optimization problem in ([17](https://arxiv.org/html/2606.15280#A1.E17)), where the weighting parameter $\lambda$ controls the relative emphasis placed on the two defining properties of an optimal correction operator in ([7](https://arxiv.org/html/2606.15280#S3.E7)). As a result, a trained model tends to preserve normal regions while replacing corrupted regions with locally consistent patterns, thereby approximating the behavior of the conservative projection operator defined in ([4](https://arxiv.org/html/2606.15280#S3.E4)).
Theorem [1](https://arxiv.org/html/2606.15280#Thmtheorem1) highlights an important side effect of reconstruction-based methods. When correction of an anomalous region is ambiguous, the reconstruction loss drives the model to produce an average over all feasible restorations under the training distribution. Provided sufficient training data and model capacity, the approximation quality to the conservative projection is therefore primarily limited by reconstruction ambiguity within corrupted regions. Fortunately, in practice, this averaging effect is typically small due to sparse sampling of the training data. Moreover, for anomaly detection, perfect reconstruction is not strictly necessary, since deviation from the anomalous pattern is sufficient, as discussed in Section [3.1](https://arxiv.org/html/2606.15280#S3.SS1). We provide a formal proof of Theorem [1](https://arxiv.org/html/2606.15280#Thmtheorem1) in Supplementary Section [A.1](https://arxiv.org/html/2606.15280#A1.SS1).
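The averaging effect follows from the fact that the squared-error-optimal prediction is a conditional mean, as in the second identity of Eq. (7). The sketch below checks this numerically on a tiny hand-made dataset, which is an assumption made for illustration: each sample has one visible coordinate and one coordinate to be restored, and one visible value admits two equally likely restorations.

```python
import numpy as np

# Toy normal data: column 0 is the visible (normal) part, column 1 the part to restore.
data = np.array([
    [0.3, 0.0],
    [0.3, 1.0],
    [0.7, 0.4],
    [0.7, 0.4],
])

def bayes_fill(x_visible, data):
    """Conditional mean of the hidden part given the visible part."""
    match = np.isclose(data[:, 0], x_visible)
    return float(data[match, 1].mean())

def expected_sq_error(c, x_visible, data):
    """Mean squared error of predicting the constant c for the hidden part."""
    match = np.isclose(data[:, 0], x_visible)
    return float(np.mean((c - data[match, 1]) ** 2))

ambiguous_fill = bayes_fill(0.3, data)   # two feasible restorations: 0.0 and 1.0
unique_fill = bayes_fill(0.7, data)      # a single feasible restoration: 0.4

# Brute-force check: which constant prediction minimizes the squared error?
candidates = np.linspace(0.0, 1.0, 101)
errors = [expected_sq_error(c, 0.3, data) for c in candidates]
best = float(candidates[int(np.argmin(errors))])

print(ambiguous_fill, unique_fill, best)
```

In the ambiguous case the optimum lies between the two feasible restorations and matches neither of them, whereas in the unambiguous case the conditional mean coincides with the true restoration.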
Nevertheless, higher reconstruction fidelity in corrupted regions may be beneficial in practice, as it can sharpen the localization signal and improve visual interpretability\. One potential solution is to replace the autoencoder with a diffusion model conditioned on the input image\. This, however, introduces substantially higher computational cost\. Instead, we propose an iterative approach that repeatedly applies the autoencoder to the input, eventually converging to a fixed point on the manifold\. Convergence is guaranteed by the contractive behavior of the model toward the data manifold, according to
$$
\operatorname{dist}\bigl(f(\hat{\boldsymbol{x}}),\mathcal{M}\bigr)\leq\rho\cdot\operatorname{dist}\bigl(\hat{\boldsymbol{x}},\mathcal{M}\bigr)\tag{8}
$$

for $0<\rho<1$, where the contraction emerges from the specific form of our corruption process based on partial occlusions with varying opacity levels. We provide a formal theorem and additional details in Supplementary Section [A.2](https://arxiv.org/html/2606.15280#A1.SS2). Figure [3](https://arxiv.org/html/2606.15280#S3.F3) illustrates several examples of convergence toward a fixed point under iterative model application, demonstrating progressively improved reconstruction fidelity.
Figure 3: Illustration of improved reconstruction fidelity and varying convergence speeds to a fixed point under iterative application of our model, separately trained on different categories of the MVTec AD dataset. Each example shows (from left to right) the input together with reconstruction results obtained through repeated model application, converging toward a fixed point on the manifold.
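The iterative scheme and the contraction inequality (8) can be illustrated with a minimal sketch. The operator below is a hypothetical two-dimensional stand-in for a trained model, not the network used in the paper: the "manifold" is assumed to be the horizontal axis, and `toy_operator`, `dist_to_manifold`, and `iterate_to_fixed_point` are illustrative names introduced here.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def toy_operator(point: Point, rho: float = 0.5) -> Point:
    """Hypothetical stand-in for a trained model f (NOT the paper's network).

    The assumed manifold is the horizontal axis {(t, 0)}. The on-manifold
    coordinate is preserved and the off-manifold coordinate shrinks by the
    factor rho, so dist(f(p), M) = rho * dist(p, M), matching Eq. (8).
    """
    t, off = point
    return (t, rho * off)


def dist_to_manifold(point: Point) -> float:
    """Euclidean distance from a point to the horizontal axis."""
    return abs(point[1])


def iterate_to_fixed_point(
    f: Callable[[Point], Point],
    start: Point,
    tol: float = 1e-6,
    max_steps: int = 100,
) -> List[Point]:
    """Apply f repeatedly until successive iterates differ by less than tol."""
    trajectory = [start]
    for _ in range(max_steps):
        nxt = f(trajectory[-1])
        step = max(abs(a - b) for a, b in zip(nxt, trajectory[-1]))
        trajectory.append(nxt)
        if step < tol:
            break
    return trajectory


# An "anomalous" start point at distance 1 from the assumed manifold.
trajectory = iterate_to_fixed_point(toy_operator, start=(2.0, 1.0))
distances = [dist_to_manifold(p) for p in trajectory]
```

In this toy setting the distance to the manifold halves at every step while the on-manifold coordinate stays untouched, which mirrors the qualitative behavior shown in Figure 3. For a real model, the stopping tolerance and step budget would have to be chosen empirically.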
### 3.3 Generalization through Contraction toward the Data Manifold
The strong generalization ability of correction-based autoencoders can be understood as a consequence of an emergent contraction behavior toward the data manifold. Rather than learning separate responses to individual corruption patterns, the model acquires a correction behavior that is largely independent of the specific perturbation and instead restores normality by replacing irregularities with manifold-consistent content. In particular, this contraction provides a unified explanation for why the model generalizes surprisingly well from artificial corruptions introduced during training to naturally occurring anomalies present in real images, unlike supervised segmentation approaches, where the learned behavior is primarily driven by annotation-specific supervision.
More precisely, two complementary directions of generalization must be distinguished. First, *generalization along the manifold* refers to reconstructing unseen but valid samples. This behavior is well understood and can largely be attributed to the inductive bias of convolutional architectures, which are naturally aligned with the spatial structure of image data. Second, and more importantly, *generalization toward the manifold* refers to correcting arbitrary off-manifold perturbations. This second direction cannot be explained purely through interpolation between training samples, but instead requires understanding how a global correction behavior emerges from local supervision.
We argue that this contraction behavior arises through the interaction of three mutually reinforcing components. First, natural images exhibit strong *index-induced regularity*, meaning that valid image configurations are highly constrained by local context. Neighboring pixels contain predictive information about one another, making the content of corrupted or missing regions largely inferable from their surroundings. Second, convolutional architectures are inherently aligned with this regularity, as their learned features are local in nature and shared across spatial locations. Third, the training objective in ([1](https://arxiv.org/html/2606.15280#S2.E1)) promotes the inductive biases established previously by preserving normal regions while restoring corrupted regions through locally consistent content. Collectively, these three components induce a specific contraction behavior toward the manifold, biased toward preserving normal content while replacing irregularities with locally consistent structure. As a consequence, this learned contraction transfers across samples and perturbations, allowing the model to generalize beyond the specific corruptions encountered during training.
This contraction perspective also explains our empirical observation that overfitting, even after prolonged training over many epochs, is comparatively difficult in this framework\. While convolutional architectures and index\-induced regularity provide the conditions under which contraction can emerge, they are not sufficient on their own, as segmentation models operate under similar architectural and data constraints\. The decisive factor is therefore the training objective, as it selectively promotes the missing inductive biases introduced previously and thereby determines which solutions become preferential during optimization\. By encouraging preservation of normal content together with restoration toward manifold\-consistent structure, the objective favors shared correction behavior over sample\-specific memorization\. In contrast, segmentation objectives optimize agreement with annotation masks, which may contain ambiguity, noise, or policy\-dependent biases\. As a result, optimization can favor shortcut correlations tied to annotation statistics rather than to the intrinsic structure of the data, making overfitting substantially more likely\. This also aligns with our empirical observation that segmentation models may initially generalize well but increasingly overfit to annotation\-specific corruption patterns as training progresses\.
### 3.4 Outlook and Future Investigations
The primary goal of this paper is to motivate a new geometric perspective on structural anomaly detection\. We argued that learning a non\-linear projection operator onto the manifold of normal data provides a principled solution to the task by directly modeling normality through projection behavior\. Building on this conceptual shift, the proposed perspective opens several promising directions for future research, supported by tools from geometric analysis\.
First, as discussed previously, the projection-based perspective naturally motivates iterative application of the model in the spirit of fixed-point theory. In particular, this view explains the shortcomings of training with reconstruction losses and provides a way of iteratively refining reconstruction fidelity in corrupted regions. We provide more details in Supplementary Section [A.2](https://arxiv.org/html/2606.15280#A1.SS2).
Second, it suggests new forms of geometry-motivated regularization. Future methods may explicitly enforce geometric properties by shaping the form and stability of the learned projection mapping. Examples include Jacobian-based regularization to promote contraction in off-manifold directions, as well as constraints encouraging projection-specific properties such as identity on the manifold and idempotency. See Supplementary Section [A.3](https://arxiv.org/html/2606.15280#A1.SS3) for more details.
Third, it provides a new interpretation of model generalization as a consequence of how well the learned mapping respects the geometry of the data\. In our approach, this geometry is implicitly encoded through contraction toward the data manifold\. In particular, this reduces the effective complexity of the learning problem to the intrinsic dimensionality of the underlying manifold\. We believe this geometric perspective on generalization extends beyond anomaly detection and may provide a useful framework for understanding representation learning more broadly\.
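The geometric properties named above (contraction in off-manifold directions, identity on the manifold, idempotency) can each be turned into a measurable quantity. The following is a minimal diagnostic sketch under stated assumptions, not the regularizer of Supplementary Section A.3: `toy_map` is a hypothetical two-dimensional map whose assumed manifold is the first coordinate axis, and all function names are introduced here for illustration.

```python
import math
from typing import Callable, List

Vector = List[float]


def toy_map(x: Vector) -> Vector:
    """Hypothetical stand-in for a learned projection (not the paper's model).

    The assumed manifold is the first coordinate axis: the on-manifold
    coordinate is kept and the off-manifold coordinate is halved.
    """
    return [x[0], 0.5 * x[1]]


def numerical_jacobian(
    f: Callable[[Vector], Vector], x: Vector, eps: float = 1e-6
) -> List[List[float]]:
    """Central-difference Jacobian with J[i][j] = d f_i / d x_j."""
    n_out = len(f(x))
    jac = [[0.0] * len(x) for _ in range(n_out)]
    for j in range(len(x)):
        plus, minus = list(x), list(x)
        plus[j] += eps
        minus[j] -= eps
        f_plus, f_minus = f(plus), f(minus)
        for i in range(n_out):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2.0 * eps)
    return jac


def directional_gain(jac: List[List[float]], direction: Vector) -> float:
    """Norm of J @ direction for a unit direction; below 1 means contraction."""
    image = [sum(row[j] * direction[j] for j in range(len(direction))) for row in jac]
    return math.sqrt(sum(v * v for v in image))


def idempotency_residual(f: Callable[[Vector], Vector], x: Vector) -> float:
    """||f(f(x)) - f(x)||, which is zero for an exact projection."""
    fx = f(x)
    ffx = f(fx)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ffx, fx)))


def identity_residual(f: Callable[[Vector], Vector], x: Vector) -> float:
    """||f(x) - x||, which is zero when x is a fixed point of f."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f(x), x)))


J = numerical_jacobian(toy_map, [2.0, 1.0])
along_gain = directional_gain(J, [1.0, 0.0])  # variation along the manifold
off_gain = directional_gain(J, [0.0, 1.0])    # variation off the manifold
on_manifold_gap = identity_residual(toy_map, [2.0, 0.0])
off_manifold_gap = idempotency_residual(toy_map, [2.0, 1.0])
```

For this toy map the gain along the manifold is close to one and the off-manifold gain is close to one half, while the non-zero idempotency residual shows that a single application is not yet an exact projection. In a training setting, quantities of this kind could plausibly serve as penalty terms, which is an assumption on our part rather than a description of the method.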
## 4 Experiments
In this section, we evaluate the training framework introduced in Section [2](https://arxiv.org/html/2606.15280#S2). Based on our geometric motivation, we refer to the resulting model as a projecting autoencoder (PAE). Our goal is not to establish marginal improvements on saturated benchmarks, but to empirically support the central hypothesis of this work: reconstruction-based methods that more closely approximate the inductive biases identified in Section [3](https://arxiv.org/html/2606.15280#S3) exhibit stronger and more robust anomaly detection performance.
Datasets. We evaluate the performance of our method on two established benchmarks for industrial anomaly detection: MVTec AD Bergmann et al. ([2021](https://arxiv.org/html/2606.15280#bib.bib17)) and VisA Zou et al. ([2022](https://arxiv.org/html/2606.15280#bib.bib51)). Together, these datasets cover a broad range of industrial inspection scenarios, comprising 16,175 images across 27 categories.
Training setup. We apply category-specific augmentations such as random rotations and flips, while reserving 5% of the training data for validation. A separate model is trained per category using a modified U-Net Ronneberger et al. ([2015](https://arxiv.org/html/2606.15280#bib.bib113)) with dilated bottleneck convolutions and optimized with Adam Kingma and Ba ([2015](https://arxiv.org/html/2606.15280#bib.bib32)) at a learning rate of $10^{-4}$. We set $\lambda=0.5$ in Equation ([1](https://arxiv.org/html/2606.15280#S2.E1)) and evaluate using the SSIM metric. Texture categories are resized to $512\times 512$, whereas object categories are downscaled to $256\times 256$. Accordingly, object categories use a larger dilation kernel ($5$ instead of $3$). Several VisA categories contain substantial background clutter that is not annotated as anomalous. Where necessary, we apply foreground segmentation as a preprocessing step using a U-Net trained independently of the anomaly detection task. The resulting masks suppress irrelevant background regions prior to training, mitigating annotation noise that disproportionately affects projection-based methods.
Results and Conceptual Comparison. We report the Area Under the Receiver Operating Characteristic curve (AUROC) at the image level (I-AUROC) for anomaly detection and at the pixel level (P-AUROC) for anomaly segmentation in Table [1](https://arxiv.org/html/2606.15280#S4.T1). In addition, we report the Per-Region Overlap (P-PRO) and Average Precision (P-AP), both computed at the pixel level. The reported scores for competing methods are taken from previously published results. The best and second-best results are highlighted in red and blue, respectively. Table [1](https://arxiv.org/html/2606.15280#S4.T1) supports three observations consistent with our geometric interpretation. First, boundary-based approaches Reiss et al. ([2021](https://arxiv.org/html/2606.15280#bib.bib134)) built on pretrained ResNet backbones, including OC-SVM Schölkopf et al. ([2001](https://arxiv.org/html/2606.15280#bib.bib1)) and DeepSVDD (DSVDD) Ruff et al. ([2018](https://arxiv.org/html/2606.15280#bib.bib3)), are substantially outperformed by PAE. Second, among reconstruction-based methods, approaches that more closely approximate projection behavior, including RIAD Zavrtanik et al. ([2021b](https://arxiv.org/html/2606.15280#bib.bib8)), CutPaste Li et al. ([2021](https://arxiv.org/html/2606.15280#bib.bib9)), and DRAEM Zavrtanik et al. ([2021a](https://arxiv.org/html/2606.15280#bib.bib33)), achieve markedly stronger performance than the standard autoencoder AE$_{\text{SSIM}}$ Bergmann et al. ([2019](https://arxiv.org/html/2606.15280#bib.bib19)). However, these methods either rely on simplistic perturbations or deviate from a reconstruction objective aligned with the projection goal, resulting in a weaker approximation of the desired inductive biases.
Third, PAE remains competitive with engineered hybrid approaches, including SimpleNet Liu et al. ([2023](https://arxiv.org/html/2606.15280#bib.bib35)), PatchCore Roth et al. ([2022](https://arxiv.org/html/2606.15280#bib.bib36)), RealNet Zhang et al. ([2024](https://arxiv.org/html/2606.15280#bib.bib123)), PBAS Chen et al. ([2025](https://arxiv.org/html/2606.15280#bib.bib135)), PMSR Li et al. ([2026](https://arxiv.org/html/2606.15280#bib.bib130)), and GLASS Chen et al. ([2024](https://arxiv.org/html/2606.15280#bib.bib124)), which lack a clear geometric interpretation. While PAE performs on par with the strongest methods in terms of AUROC, it achieves higher AP scores. We note that the PRO metric is highly recall-oriented and therefore less trustworthy. This behavior is consistent with our interpretation of PAE as an approximation of a conservative projection operator that minimizes correction support, producing sharp and robust segmentation masks. Figure [5](https://arxiv.org/html/2606.15280#A1.F5) in the supplementary material illustrates the segmentation quality of anomalous regions achieved by PAE.
Table 1: Experimental results for image-level anomaly detection and pixel-level segmentation.
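For readers less familiar with the reported metric, AUROC can be read as the probability that a randomly chosen anomalous item receives a higher anomaly score than a randomly chosen normal one. The sketch below computes it directly from this pairwise definition. The scores and labels are hypothetical and not taken from Table 1, and the function name `auroc` is ours. The same routine applies at the image level (one score per image) or at the pixel level (one score per pixel).

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC via its pairwise definition.

    Returns the fraction of (anomalous, normal) pairs in which the anomalous
    item has the higher score; ties count one half.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one normal and one anomalous item")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical anomaly scores, e.g. one residual-based score per image;
# label 1 marks an anomalous item and label 0 a normal one.
example_scores = [0.10, 0.40, 0.35, 0.80]
example_labels = [0, 0, 1, 1]
example_auroc = auroc(example_scores, example_labels)  # 3 of 4 pairs ordered correctly
```

The quadratic pairwise loop is chosen for transparency. For large pixel-level evaluations a rank-based implementation would be preferable.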
## 5 Conclusion
We argued that classical decision\-boundary approaches to AD are fundamentally misaligned with manifold\-supported data, where normal samples occupy a low\-dimensional subspace embedded in the ambient space\. In high\-dimensional settings, modeling normality through enclosing boundaries or density estimation becomes increasingly ill\-posed and susceptible to the curse of dimensionality\. Instead, we introduced a geometric perspective based on learning a non\-linear projection operator onto the manifold of normal data\. We identified the inductive biases required for SAD, namely the fixed\-point condition and minimal correction support, and introduced the conservative projection operator as the corresponding optimal solution\.
Furthermore, the projection perspective explains the success of correction-based reconstruction methods through contraction toward the manifold. This contraction explains why models generalize from artificial corruptions introduced during training to naturally occurring anomalies in real images. Our training objective promotes preservation of normal content together with restoration toward locally consistent patterns. As a result, the learned mapping becomes increasingly determined by data geometry rather than by individual corruption patterns. Beyond enabling generalization, contraction toward the manifold effectively reduces the complexity of the learning problem to the intrinsic dimensionality of the manifold, thereby mitigating the curse of dimensionality. While the proposed formulation provides a principled framework for SAD, it relies on deviations in local predictive structure. In contrast, logical anomalies may fully preserve the appearance statistics of natural images and thus produce no projection residual, so they cannot be fully addressed by our framework.
The geometric perspective also opens several directions for future research\. Projection operators naturally connect AD to fixed\-point theory, enabling iterative refinement schemes and a dynamical interpretation of convergence toward the manifold\. Moreover, the geometric formulation motivates new regularization strategies, including Jacobian\-based constraints that encourage contraction in off\-manifold directions while preserving variation along the manifold, offering a foundation for future theoretical and algorithmic developments\.
## References
- Conditional diffusion models for guided anomaly detection in brain images using fluid-driven anomaly randomization. CoRR abs/2506.10233.
- G. Alain and Y. Bengio (2014) What regularized auto-encoders learn from the data-generating distribution. J. Mach. Learn. Res. 15(110), pp. 3563–3593.
- A. Bauer, S. Nakajima, and K. Müller (2024) Self-supervised autoencoders for visual anomaly detection. Mathematics 12(24), pp. 3988.
- P. Bergmann, K. Batzner, M. Fauser, D. Sattlegger, and C. Steger (2021) The MVTec anomaly detection dataset: A comprehensive real-world dataset for unsupervised anomaly detection. Int. J. Comput. Vis. 129(4), pp. 1038–1059.
- P. Bergmann, S. Löwe, M. Fauser, D. Sattlegger, and C. Steger (2019) Improving unsupervised defect segmentation by applying structural similarity to autoencoders. In Proc. Int. Conf. Comput. Vis. Theory Appl., pp. 372–380.
- A. Bhattad, J. Rock, and D. A. Forsyth (2018) Detecting anomalous faces with “no peeking” autoencoders. CoRR abs/1802.05798.
- Q. Chen, H. Luo, H. Gao, C. Lv, and Z. Zhang (2025) Progressive boundary guided anomaly synthesis for industrial anomaly detection. IEEE Trans. Circuits Syst. Video Technol. 35(2), pp. 1193–1208.
- Q. Chen, H. Luo, C. Lv, and Z. Zhang (2024) A unified anomaly synthesis strategy with gradient ascent for industrial anomaly detection and localization. In Proc. Eur. Conf. Comput. Vis., Lect. Notes Comput. Sci., Vol. 15125, pp. 37–54.
- M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi (2014) Describing textures in the wild. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit.
- M. Connor, G. Canal, and C. Rozell (2021) Variational autoencoder with learned latent structure. In Proc. Int. Conf. Artif. Intell. Stat., Proc. Mach. Learn. Res., Vol. 130, pp. 2359–2367.
- M. Kim, J. Kim, J. Yu, and J. K. Choi (2023) Active anomaly detection based on deep one-class classification. Pattern Recognit. Lett. 167, pp. 18–24.
- D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In Proc. Int. Conf. Learn. Represent.
- C. Li, K. Sohn, J. Yoon, and T. Pfister (2021) CutPaste: self-supervised learning for anomaly detection and localization. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 9664–9674.
- L. Li, C. Yan, D. Song, B. Wang, and C. Wang (2026) A prototype correction multi-scale feature reconstruction network for industrial anomaly detection. Pattern Recognit. 177, pp. 113331.
- Z. Liu, Y. Zhou, Y. Xu, and Z. Wang (2023) SimpleNet: A simple network for image anomaly detection and localization. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 20402–20411.
- E. Parzen (1962) On estimation of a probability density function and mode. Ann. Math. Stat. 33(3), pp. 1065–1076.
- J. Pirnay and K. Chai (2022) Inpainting transformer for anomaly detection. In Proc. Int. Conf. Image Anal. Process., Lect. Notes Comput. Sci., Vol. 13232, pp. 394–406.
- T. Reiss, N. Cohen, L. Bergman, and Y. Hoshen (2021) PANDA: adapting pretrained features for anomaly detection and segmentation. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 2806–2814.
- O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In Proc. Med. Image Comput. Comput.-Assist. Interv., Lect. Notes Comput. Sci., Vol. 9351, pp. 234–241.
- K. Roth, L. Pemula, J. Zepeda, B. Schölkopf, T. Brox, and P. V. Gehler (2022) Towards total recall in industrial anomaly detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 14298–14308.
- L. Ruff, N. Görnitz, L. Deecke, S. A. Siddiqui, R. A. Vandermeulen, A. Binder, E. Müller, and M. Kloft (2018) Deep one-class classification. In Proc. 35th Int. Conf. Mach. Learn., J. G. Dy and A. Krause (Eds.), Proc. Mach. Learn. Res., Vol. 80, pp. 4390–4399.
- L. Ruff, J. R. Kauffmann, R. A. Vandermeulen, W. Samek, M. Kloft, T. G. Dietterich, and K. Müller (2021) A unifying review of deep and shallow anomaly detection. Proc. IEEE 109(5), pp. 756–795.
- B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson (2001) Estimating the support of a high-dimensional distribution. Neural Comput. 13(7), pp. 1443–1471.
- J. Tack, S. Mo, J. Jeong, and J. Shin (2020) CSI: novelty detection via contrastive learning on distributionally shifted instances. In Adv. Neural Inf. Process. Syst.
- D. M. J. Tax and R. P. W. Duin (2004) Support vector data description. Mach. Learn. 54(1), pp. 45–66.
- W. Xue, L. Zhang, X. Mou, and A. C. Bovik (2014) Gradient magnitude similarity deviation: A highly efficient perceptual image quality index. IEEE Trans. Image Process. 23(2), pp. 684–695.
- V. Zavrtanik, M. Kristan, and D. Skocaj (2021a) DRÆM – A discriminatively trained reconstruction embedding for surface anomaly detection. In Proc. IEEE Int. Conf. Comput. Vis., pp. 8310–8319.
- V. Zavrtanik, M. Kristan, and D. Skocaj (2021b) Reconstruction by inpainting for visual anomaly detection. Pattern Recognit. 112, pp. 107706.
- D. Zhang, Y. Sun, B. Eriksson, and L. Balzano (2017) Deep unsupervised clustering using mixture of autoencoders. CoRR abs/1712.07788.
- X. Zhang, M. Xu, and X. Zhou (2024) RealNet: A feature selection network with realistic synthetic anomaly for anomaly detection. In Proc. IEEE Conf. Comput. Vis. Pattern Recognit., pp. 16699–16708.
- Y. Zou, J. Jeong, L. Pemula, D. Zhang, and O. Dabeer (2022) SPot-the-difference self-supervised pre-training for anomaly detection and segmentation. In Proc. Eur. Conf. Comput. Vis., Lect. Notes Comput. Sci., Vol. 13690, pp. 392–408.
## Appendix A Supplementary Material
This supplementary material provides additional theoretical analysis and proofs supporting the main paper. Section [A.1](https://arxiv.org/html/2606.15280#A1.SS1) contains a formal proof of Theorem [1](https://arxiv.org/html/2606.15280#Thmtheorem1). Section [A.2](https://arxiv.org/html/2606.15280#A1.SS2) discusses theoretical guarantees for convergence to a fixed point on the manifold under iterative model application. Section [A.3](https://arxiv.org/html/2606.15280#A1.SS3) provides additional discussion of Jacobian-based regularization. We note that the theorems and proofs presented here remain preliminary and may be refined in future revisions.
### A.1 Proof of Theorem [1](https://arxiv.org/html/2606.15280#Thmtheorem1)
We first prove a simpler claim\.
###### Lemma 1\.
Let $p(\boldsymbol{x})$ denote the probability distribution of normal data and $\hat{\boldsymbol{x}}\in\mathbb{R}^{d}$ a fixed anomalous sample with normal region $N(\hat{\boldsymbol{x}})$. Consider the unconstrained optimization problem

$$
\underset{\boldsymbol{\theta}\in\mathbb{R}^{d}}{\text{minimize}}\quad\mathbb{E}_{p(\boldsymbol{x})}\!\left[\|f_{\boldsymbol{\theta}}(\hat{\boldsymbol{x}})-\boldsymbol{x}\|^{2}\,\middle|\,\boldsymbol{x}_{N(\hat{\boldsymbol{x}})}=\hat{\boldsymbol{x}}_{N(\hat{\boldsymbol{x}})}\right]\tag{9}
$$

Provided sufficient capacity of the hypothesis class $\{f_{\boldsymbol{\theta}}\}$, every minimizer $f^{*}:=f_{\boldsymbol{\theta}^{*}}$ satisfies

$$
f^{*}_{N(\hat{\boldsymbol{x}})}(\hat{\boldsymbol{x}})=\hat{\boldsymbol{x}}_{N(\hat{\boldsymbol{x}})}\qquad\text{and}\qquad f^{*}_{A(\hat{\boldsymbol{x}})}(\hat{\boldsymbol{x}})=\mathbb{E}_{p(\boldsymbol{x})}\!\left[\boldsymbol{x}_{A(\hat{\boldsymbol{x}})}\,\middle|\,\boldsymbol{x}_{N(\hat{\boldsymbol{x}})}=\hat{\boldsymbol{x}}_{N(\hat{\boldsymbol{x}})}\right].\tag{10}
$$
###### Proof\.
We write $S:=A(\hat{\boldsymbol{x}})$ and $\bar{S}:=N(\hat{\boldsymbol{x}})$. Consider the following derivation:

$$
\begin{aligned}
&\mathbb{E}_{p(\boldsymbol{x})}\left[\|f(\hat{\boldsymbol{x}})-\boldsymbol{x}\|^{2}\,\middle|\,\boldsymbol{x}_{\bar{S}}=\hat{\boldsymbol{x}}_{\bar{S}}\right]\\
&=\int\|f(\hat{\boldsymbol{x}})-\boldsymbol{x}\|^{2}\,p(\boldsymbol{x}\mid\boldsymbol{x}_{\bar{S}}=\hat{\boldsymbol{x}}_{\bar{S}})\,\mathrm{d}\boldsymbol{x}\\
&=\int\left(\|f(\hat{\boldsymbol{x}})\|^{2}-2f(\hat{\boldsymbol{x}})^{\top}\boldsymbol{x}+\|\boldsymbol{x}\|^{2}\right)p(\boldsymbol{x}\mid\boldsymbol{x}_{\bar{S}}=\hat{\boldsymbol{x}}_{\bar{S}})\,\mathrm{d}\boldsymbol{x}\\
&=\|f(\hat{\boldsymbol{x}})\|^{2}\underbrace{\int p(\boldsymbol{x}\mid\boldsymbol{x}_{\bar{S}}=\hat{\boldsymbol{x}}_{\bar{S}})\,\mathrm{d}\boldsymbol{x}}_{=1}-2f(\hat{\boldsymbol{x}})^{\top}\underbrace{\int\boldsymbol{x}\,p(\boldsymbol{x}\mid\boldsymbol{x}_{\bar{S}}=\hat{\boldsymbol{x}}_{\bar{S}})\,\mathrm{d}\boldsymbol{x}}_{\mathbb{E}[\boldsymbol{x}\mid\boldsymbol{x}_{\bar{S}}=\hat{\boldsymbol{x}}_{\bar{S}}]}+\underbrace{\int\|\boldsymbol{x}\|^{2}\,p(\boldsymbol{x}\mid\boldsymbol{x}_{\bar{S}}=\hat{\boldsymbol{x}}_{\bar{S}})\,\mathrm{d}\boldsymbol{x}}_{\mathbb{E}[\|\boldsymbol{x}\|^{2}\mid\boldsymbol{x}_{\bar{S}}=\hat{\boldsymbol{x}}_{\bar{S}}]}\\
&=\|f(\hat{\boldsymbol{x}})\|^{2}-2f(\hat{\boldsymbol{x}})^{\top}\mathbb{E}[\boldsymbol{x}\mid\boldsymbol{x}_{\bar{S}}=\hat{\boldsymbol{x}}_{\bar{S}}]+\mathbb{E}[\|\boldsymbol{x}\|^{2}\mid\boldsymbol{x}_{\bar{S}}=\hat{\boldsymbol{x}}_{\bar{S}}].
\end{aligned}
$$

Note that $\boldsymbol{\mu}:=f(\hat{\boldsymbol{x}})$ in the above derivation is a constant with respect to the expectation. We now look for stationary points of the mapping $g\colon\boldsymbol{\mu}\mapsto\|\boldsymbol{\mu}\|^{2}-2\boldsymbol{\mu}^{\top}\mathbb{E}[\boldsymbol{x}\mid\boldsymbol{x}_{\bar{S}}=\hat{\boldsymbol{x}}_{\bar{S}}]+\mathbb{E}[\|\boldsymbol{x}\|^{2}\mid\boldsymbol{x}_{\bar{S}}=\hat{\boldsymbol{x}}_{\bar{S}}]$ by setting its gradient to zero and solving for $\boldsymbol{\mu}$. It follows:

$$
\nabla_{\boldsymbol{\mu}}g=2\boldsymbol{\mu}^{\top}-2\,\mathbb{E}[\boldsymbol{x}\mid\boldsymbol{x}_{\bar{S}}=\hat{\boldsymbol{x}}_{\bar{S}}]^{\top}=\boldsymbol{0}^{\top}.
$$

Since $g(\boldsymbol{\mu})$ is a strictly convex function (its Hessian is $2I$), this implies that $f^{*}(\hat{\boldsymbol{x}})=\mathbb{E}_{p(\boldsymbol{x})}[\boldsymbol{x}\mid\boldsymbol{x}_{\bar{S}}=\hat{\boldsymbol{x}}_{\bar{S}}]$ is the unique optimal output. Note that $f^{*}$ itself does not need to be unique. In particular, any minimizer $f^{*}$ satisfies $f^{*}_{S}(\hat{\boldsymbol{x}})=\mathbb{E}_{p(\boldsymbol{x})}[\boldsymbol{x}_{S}\mid\boldsymbol{x}_{\bar{S}}=\hat{\boldsymbol{x}}_{\bar{S}}]$ and $f^{*}_{\bar{S}}(\hat{\boldsymbol{x}})=\mathbb{E}_{p(\boldsymbol{x})}[\boldsymbol{x}_{\bar{S}}\mid\boldsymbol{x}_{\bar{S}}=\hat{\boldsymbol{x}}_{\bar{S}}]=\hat{\boldsymbol{x}}_{\bar{S}}$. ∎
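The statement of Lemma 1 can be checked numerically on a small example. The sketch below assumes a hypothetical discrete distribution over two-pixel "images", where the first pixel plays the role of the normal region $N(\hat{\boldsymbol{x}})$ and the second the role of the anomalous region $A(\hat{\boldsymbol{x}})$. All names and probabilities are illustrative and not taken from the paper.

```python
from typing import Dict, List, Tuple

Image = Tuple[float, float]  # (pixel in normal region N, pixel in anomalous region A)

# Hypothetical discrete distribution p(x) over two-pixel "images".
P: Dict[Image, float] = {
    (0.0, 0.0): 0.2,
    (0.0, 1.0): 0.2,
    (1.0, 0.2): 0.3,
    (1.0, 0.6): 0.3,
}


def conditional(observed_normal: float) -> List[Tuple[Image, float]]:
    """Samples and renormalized weights of p(x | x_N = observed_normal)."""
    matches = [(x, w) for x, w in P.items() if x[0] == observed_normal]
    total = sum(w for _, w in matches)
    return [(x, w / total) for x, w in matches]


def conditional_risk(output: Image, observed_normal: float) -> float:
    """E[ ||output - x||^2 | x_N = observed_normal ], the objective in Eq. (9)."""
    return sum(
        w * ((output[0] - x[0]) ** 2 + (output[1] - x[1]) ** 2)
        for x, w in conditional(observed_normal)
    )


def conditional_mean(observed_normal: float) -> Image:
    """Candidate minimizer from Eq. (10): the conditional mean of x."""
    cond = conditional(observed_normal)
    return (
        sum(w * x[0] for x, w in cond),
        sum(w * x[1] for x, w in cond),
    )


best = conditional_mean(1.0)
best_risk = conditional_risk(best, 1.0)
other_risks = [
    conditional_risk((1.0, 0.2), 1.0),  # copies one feasible restoration
    conditional_risk((0.9, 0.4), 1.0),  # alters the normal pixel
]
```

In this example the conditional mean keeps the observed normal pixel unchanged and fills the anomalous slot with the average of the two feasible restorations. Both alternative outputs incur a strictly larger conditional risk, in line with the uniqueness argument above.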
We now prove Theorem [1](https://arxiv.org/html/2606.15280#Thmtheorem1).
###### Proof\.
For each fixed $\hat{\boldsymbol{x}}$, define the conditional risk

$$
R_{\hat{\boldsymbol{x}}}(f):=\mathbb{E}_{p(\boldsymbol{x})}\!\left[\|f(\hat{\boldsymbol{x}})-\boldsymbol{x}\|^{2}\;\middle|\;\boldsymbol{x}_{N(\hat{\boldsymbol{x}})}=\hat{\boldsymbol{x}}_{N(\hat{\boldsymbol{x}})}\right].
$$

By Lemma [1](https://arxiv.org/html/2606.15280#Thmlemma1), every function $f^{*}$ satisfying ([10](https://arxiv.org/html/2606.15280#A1.E10)) minimizes $R_{\hat{\boldsymbol{x}}}(\cdot)$ for the fixed input $\hat{\boldsymbol{x}}$. Therefore, for any $f$ and every admissible $\hat{\boldsymbol{x}}$ we have

$$
R_{\hat{\boldsymbol{x}}}(f)\geq R_{\hat{\boldsymbol{x}}}(f^{*}).
$$

Taking the expectation with respect to $\hat{\boldsymbol{x}}\sim q$ yields

$$
\mathbb{E}_{q(\hat{\boldsymbol{x}})}[R_{\hat{\boldsymbol{x}}}(f)]\;\geq\;\mathbb{E}_{q(\hat{\boldsymbol{x}})}[R_{\hat{\boldsymbol{x}}}(f^{*})].\tag{11}
$$

Moreover, since for every admissible $\hat{\boldsymbol{x}}$ the optimal output value $f^{*}(\hat{\boldsymbol{x}})$ is unique, equality in ([11](https://arxiv.org/html/2606.15280#A1.E11)) can hold only if the output $f(\hat{\boldsymbol{x}})$ equals this unique value for $q$-almost every $\hat{\boldsymbol{x}}$. Hence, every minimizer $f^{*}$ of the optimization problem in Theorem [1](https://arxiv.org/html/2606.15280#Thmtheorem1) satisfies

$$
f^{*}_{N(\hat{\boldsymbol{x}})}(\hat{\boldsymbol{x}})=\hat{\boldsymbol{x}}_{N(\hat{\boldsymbol{x}})}\qquad\text{and}\qquad f^{*}_{A(\hat{\boldsymbol{x}})}(\hat{\boldsymbol{x}})=\mathbb{E}_{p(\boldsymbol{x})}\!\left[\boldsymbol{x}_{A(\hat{\boldsymbol{x}})}\;\middle|\;\boldsymbol{x}_{N(\hat{\boldsymbol{x}})}=\hat{\boldsymbol{x}}_{N(\hat{\boldsymbol{x}})}\right]
$$

for $q$-almost every $\hat{\boldsymbol{x}}$. ∎
As mentioned in the main body of the paper in Section [3.2](https://arxiv.org/html/2606.15280#S3.SS2), the conditional expectations are also conditioned, beyond the values on $N(\hat{\boldsymbol{x}})$, on the mechanism by which the corruption process $\mathcal{C}$ generates patterns on $A(\hat{\boldsymbol{x}})$. Therefore, the precise form of the solution in ([7](https://arxiv.org/html/2606.15280#S3.E7)) is

$$f^{*}_{A(\hat{\boldsymbol{x}})}(\hat{\boldsymbol{x}})=\mathbb{E}_{p(\boldsymbol{x})}\!\left[\boldsymbol{x}_{A(\hat{\boldsymbol{x}})}\;\middle|\;\boldsymbol{x}_{N(\hat{\boldsymbol{x}})}=\hat{\boldsymbol{x}}_{N(\hat{\boldsymbol{x}})},\;\mathcal{C}_{A(\hat{\boldsymbol{x}})}(\boldsymbol{x})=\hat{\boldsymbol{x}}_{A(\hat{\boldsymbol{x}})}\right]. \tag{12}$$
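The conditional-mean form of the minimizer can be checked numerically on a toy case. The sketch below assumes a finite, equally weighted set of compatible normal sources for one fixed corrupted input; the numbers and the helper names `risk` and `conditional_mean` are illustrative and do not come from the paper. It compares the conditional risk of the coordinate-wise mean against randomly perturbed candidates.

```python
import random

def risk(candidate, sources):
    """Empirical conditional risk: mean squared distance to the compatible sources."""
    total = 0.0
    for src in sources:
        total += sum((c - s) ** 2 for c, s in zip(candidate, src))
    return total / len(sources)

def conditional_mean(sources):
    """Coordinate-wise mean of the compatible sources."""
    n = len(sources)
    return [sum(col) / n for col in zip(*sources)]

# Hypothetical values of x on the corrupted region A for one fixed corrupted input.
sources = [[0.0, 1.0], [2.0, 3.0], [4.0, -1.0]]
f_star = conditional_mean(sources)

# Candidates obtained by perturbing the mean; none should achieve a lower risk.
rng = random.Random(0)
perturbed = [[v + rng.uniform(-1.0, 1.0) for v in f_star] for _ in range(100)]
best_other = min(risk(c, sources) for c in perturbed)
print(f_star, risk(f_star, sources), best_other)
```

For squared error the risk of any candidate equals the risk of the mean plus the squared distance to the mean, which is why no perturbed candidate can do better.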
### A.2 Fixed-Point Convergence under Contraction toward the Manifold
The following theorem establishes convergence to a fixed point on the manifold under repeated application of the model, assuming a contraction property toward the data manifold\.
As a first simplifying modeling assumption, we assume that for any given anomalous sample $\hat{\boldsymbol{x}}\in\mathbb{R}^{d}\setminus\mathcal{M}$, the anomalous index set $S:=A(\hat{\boldsymbol{x}})$ is uniquely identifiable, since ambiguous boundary cases occur with negligible probability in practice.
As an additional simplifying assumption, we assume that for a given anomalous input $\hat{\boldsymbol{x}}$, the anomaly index set $S$ remains unchanged (up until convergence) under iterative application of the model.
###### Theorem 2\.
Let $\mathcal{M}\subset\mathbb{R}^{d}$ be a closed manifold and let $f:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ be a non-linear operator satisfying:
1. (i) (Conservativity) For every input $\hat{\boldsymbol{x}}\in\mathbb{R}^{d}$ with anomaly index set $S$, the operator preserves the complementary coordinates corresponding to normal regions: $f_{\bar{S}}(\hat{\boldsymbol{x}})=\hat{\boldsymbol{x}}_{\bar{S}}$.
2. (ii) (Slice Contraction) There exists $\rho\in(0,1)$ such that for all $\boldsymbol{u}\in\mathbb{R}^{d}$, $\operatorname{dist}\bigl(f_{S}(\boldsymbol{u}),\mathcal{M}_{S}(\boldsymbol{u})\bigr)\leq\rho\cdot\operatorname{dist}\bigl(\boldsymbol{u}_{S},\mathcal{M}_{S}(\boldsymbol{u})\bigr)$, where the slice is $\mathcal{M}_{S}(\boldsymbol{u}):=\{\boldsymbol{x}_{S}\colon\boldsymbol{x}\in\mathcal{M},\ \boldsymbol{x}_{\bar{S}}=\boldsymbol{u}_{\bar{S}}\}$ and $\operatorname{dist}(z,A):=\inf_{y\in A}\|z-y\|_{2}$ for $z\in\mathbb{R}^{k}$, $A\subset\mathbb{R}^{k}$, $k=|S|$.
3. (iii) (Vanishing correction near the slice) There exists $C>0$ such that for all $\boldsymbol{u}\in\mathbb{R}^{d}$, $\|f_{S}(\boldsymbol{u})-\boldsymbol{u}_{S}\|_{2}\leq C\,\operatorname{dist}\bigl(\boldsymbol{u}_{S},\mathcal{M}_{S}(\boldsymbol{u})\bigr)$.
Then the following statements hold:
1. (a) A point $\boldsymbol{x}\in\mathbb{R}^{d}$ belongs to $\mathcal{M}$ if and only if it is a fixed point of $f$.
2. (b) For every anomalous sample $\hat{\boldsymbol{x}}$ with anomaly index set $S$, there exists $\boldsymbol{x}^{\ast}\in\mathcal{M}$ such that $\boldsymbol{x}^{\ast}_{\bar{S}}=\hat{\boldsymbol{x}}_{\bar{S}}$ and $\lim_{n\to\infty}f^{(n)}(\hat{\boldsymbol{x}})=\boldsymbol{x}^{\ast}$.
###### Proof\.
We prove \(a\) and \(b\)\.
(a) ($\Rightarrow$) Let $\boldsymbol{x}\in\mathcal{M}$. By convention, normal samples have an empty anomaly index set, i.e. $S=\emptyset$ and hence $\bar{S}=\{1,\dots,d\}$. Applying (i) yields

$$f(\boldsymbol{x})=f_{\bar{S}}(\boldsymbol{x})=\boldsymbol{x}_{\bar{S}}=\boldsymbol{x},$$

so $\boldsymbol{x}$ is a fixed point of $f$.
(a) ($\Leftarrow$) Let $\boldsymbol{x}\in\mathbb{R}^{d}$ be a fixed point, i.e. $f(\boldsymbol{x})=\boldsymbol{x}$, and let $S$ be its anomaly index set. If $S=\emptyset$, then $\boldsymbol{x}$ is normal by definition and hence $\boldsymbol{x}\in\mathcal{M}$.
Assume $S\neq\emptyset$. Using (ii) with $\boldsymbol{u}=\boldsymbol{x}$ and $f_{S}(\boldsymbol{x})=\boldsymbol{x}_{S}$ gives

$$\operatorname{dist}\bigl(\boldsymbol{x}_{S},\mathcal{M}_{S}(\boldsymbol{x})\bigr)=\operatorname{dist}\bigl(f_{S}(\boldsymbol{x}),\mathcal{M}_{S}(\boldsymbol{x})\bigr)\leq\rho\,\operatorname{dist}\bigl(\boldsymbol{x}_{S},\mathcal{M}_{S}(\boldsymbol{x})\bigr).$$

Since $\rho\in(0,1)$, we obtain

$$\operatorname{dist}(\boldsymbol{x}_{S},\mathcal{M}_{S}(\boldsymbol{x}))=0.$$

Because the distance is an infimum over $\mathcal{M}_{S}(\boldsymbol{x})$, there exists a sequence $(\boldsymbol{y}^{(n)})\subset\mathcal{M}$ satisfying

$$\boldsymbol{y}^{(n)}_{\bar{S}}=\boldsymbol{x}_{\bar{S}}\qquad\text{and}\qquad\boldsymbol{y}^{(n)}_{S}\to\boldsymbol{x}_{S}.$$

Hence $\boldsymbol{y}^{(n)}\to\boldsymbol{x}$ in $\mathbb{R}^{d}$. Moreover, the set

$$\{\boldsymbol{y}\in\mathcal{M}:\ \boldsymbol{y}_{\bar{S}}=\boldsymbol{x}_{\bar{S}}\}$$

is closed as the intersection of the closed set $\mathcal{M}$ with an affine subspace. Since each $\boldsymbol{y}^{(n)}$ belongs to this closed set and $\boldsymbol{y}^{(n)}\to\boldsymbol{x}$, the limit $\boldsymbol{x}$ also belongs to it. Therefore $\boldsymbol{x}\in\mathcal{M}$.
(b) Fix an anomalous sample $\hat{\boldsymbol{x}}$ with anomaly index set $S$ and define the iterates

$$\boldsymbol{x}^{(0)}:=\hat{\boldsymbol{x}},\qquad\boldsymbol{x}^{(n+1)}:=f(\boldsymbol{x}^{(n)}).$$

By (i), for all $n\geq 0$,

$$\boldsymbol{x}^{(n)}_{\bar{S}}=\hat{\boldsymbol{x}}_{\bar{S}}.$$

Consequently, the slice is constant along the iterates:

$$\mathcal{M}_{S}(\boldsymbol{x}^{(n)})=\{\boldsymbol{x}_{S}:\boldsymbol{x}\in\mathcal{M},\ \boldsymbol{x}_{\bar{S}}=\boldsymbol{x}^{(n)}_{\bar{S}}\}=\{\boldsymbol{x}_{S}:\boldsymbol{x}\in\mathcal{M},\ \boldsymbol{x}_{\bar{S}}=\hat{\boldsymbol{x}}_{\bar{S}}\}=:\mathcal{A}.$$

Define

$$\delta_{n}:=\operatorname{dist}(\boldsymbol{x}^{(n)}_{S},\mathcal{A}).$$

Applying (ii) with $\boldsymbol{u}=\boldsymbol{x}^{(n)}$ yields

$$\delta_{n+1}=\operatorname{dist}\bigl(\boldsymbol{x}^{(n+1)}_{S},\mathcal{A}\bigr)=\operatorname{dist}\bigl(f_{S}(\boldsymbol{x}^{(n)}),\mathcal{M}_{S}(\boldsymbol{x}^{(n)})\bigr)\leq\rho\,\operatorname{dist}\bigl(\boldsymbol{x}^{(n)}_{S},\mathcal{M}_{S}(\boldsymbol{x}^{(n)})\bigr)=\rho\,\delta_{n},$$

and hence

$$\delta_{n}\leq\rho^{n}\delta_{0}\qquad\text{for all }n.$$
We now show that $(\boldsymbol{x}^{(n)}_{S})$ is a Cauchy sequence. By (iii),

$$\|\boldsymbol{x}^{(n+1)}_{S}-\boldsymbol{x}^{(n)}_{S}\|_{2}=\|f_{S}(\boldsymbol{x}^{(n)})-\boldsymbol{x}^{(n)}_{S}\|_{2}\leq C\,\operatorname{dist}\bigl(\boldsymbol{x}^{(n)}_{S},\mathcal{M}_{S}(\boldsymbol{x}^{(n)})\bigr)=C\,\delta_{n}\leq C\,\rho^{n}\delta_{0}.$$

Therefore, for $m>n$,

$$\begin{aligned}
\|\boldsymbol{x}^{(m)}_{S}-\boldsymbol{x}^{(n)}_{S}\|_{2}
&\leq\sum_{t=n}^{m-1}\|\boldsymbol{x}^{(t+1)}_{S}-\boldsymbol{x}^{(t)}_{S}\|_{2}
\leq C\delta_{0}\sum_{t=n}^{m-1}\rho^{t}
=C\delta_{0}\left(\sum_{t=0}^{m-1}\rho^{t}-\sum_{t=0}^{n-1}\rho^{t}\right)\\
&=C\delta_{0}\left(\frac{1-\rho^{m}}{1-\rho}-\frac{1-\rho^{n}}{1-\rho}\right)
\leq\frac{C\delta_{0}}{1-\rho}\,\rho^{n}\xrightarrow[n\to\infty]{}0.
\end{aligned}$$

Thus $(\boldsymbol{x}^{(n)}_{S})$ converges to some limit $\boldsymbol{x}^{\ast}_{S}\in\mathbb{R}^{|S|}$.
Since $\delta_{n}\to 0$ and the distance to a fixed set is continuous, we have $\operatorname{dist}(\boldsymbol{x}^{\ast}_{S},\mathcal{A})=\lim_{n\to\infty}\delta_{n}=0$. Because $\mathcal{A}$ is closed, the limit satisfies $\boldsymbol{x}^{\ast}_{S}\in\mathcal{A}$.
Moreover, $\boldsymbol{x}^{(n)}_{\bar{S}}=\hat{\boldsymbol{x}}_{\bar{S}}$ for all $n$, hence $\boldsymbol{x}^{(n)}\to\boldsymbol{x}^{\ast}$ where $\boldsymbol{x}^{\ast}_{\bar{S}}=\hat{\boldsymbol{x}}_{\bar{S}}$ and $\boldsymbol{x}^{\ast}_{S}\in\mathcal{A}$. By definition of $\mathcal{A}$, this implies $\boldsymbol{x}^{\ast}\in\mathcal{M}$, which concludes the proof. ∎
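The convergence statement can be illustrated with a two-dimensional toy operator. The sketch below is not the paper's model: it assumes the manifold is the graph of the sine function, takes coordinate 0 as the normal region and coordinate 1 as the anomalous one, and uses a hand-built map `f` with an illustrative factor `RHO` so that conditions (i) to (iii) hold by construction with $C = 1-\rho$.

```python
import math

RHO = 0.5  # contraction factor rho in (0, 1); an illustrative choice

def g(a):
    """Toy manifold M = {(a, sin a)}: coordinate 1 is determined by coordinate 0."""
    return math.sin(a)

def f(u):
    """Toy operator: keeps the normal coordinate (index 0) unchanged and pulls
    the anomalous coordinate (index 1) toward the slice {sin(u[0])}."""
    a, b = u
    return (a, g(a) + RHO * (b - g(a)))

def slice_distance(u):
    """dist(u_S, M_S(u)); here the slice is the single point sin(u[0])."""
    return abs(u[1] - g(u[0]))

x_hat = (1.0, 3.0)  # anomalous sample, since 3.0 != sin(1.0)
x = x_hat
deltas = []
for _ in range(60):
    deltas.append(slice_distance(x))
    x = f(x)

x_star = (x_hat[0], g(x_hat[0]))  # candidate limit on the manifold
print(x, x_star, deltas[:3])
```

The recorded distances shrink geometrically, the normal coordinate never changes, and the candidate limit is a fixed point of the toy map, mirroring statements (a) and (b).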
The remaining question is under which conditions the model satisfies assumptions (ii) and (iii) of the theorem. In the following, we provide an intuition together with a practical heuristic used in our experiments, while leaving a rigorous formal investigation to future work. Specifically, we rely on out-of-distribution images as a source of anomalous patterns. The main intuition is that the resulting perturbations are predominantly orthogonal to the manifold of normal data.
#### Corruption model\.
Let $p(\boldsymbol{x})$ denote the distribution of normal data supported on a differentiable manifold $\mathcal{M}\subset\mathbb{R}^{d}$. A partially corrupted observation is generated as

$$\hat{\boldsymbol{x}}=\boldsymbol{x}+\alpha\,\boldsymbol{\delta},\qquad\boldsymbol{x}\sim p,\quad\alpha\sim\pi,\qquad\boldsymbol{\delta}_{N(\hat{\boldsymbol{x}})}=\mathbf{0}, \tag{13}$$

where $\pi$ is a transparency distribution on $(0,1]$.
We assume that the corruption vector is predominantly normal to the manifold. That is, there exists $\eta\in[0,1)$ such that, for typical pairs $(\boldsymbol{x},\boldsymbol{\delta})$,

$$\big\|\Pi_{T_{\boldsymbol{x}}\mathcal{M}}(\boldsymbol{\delta})\big\|\leq\eta\,\big\|\Pi_{N_{\boldsymbol{x}}\mathcal{M}}(\boldsymbol{\delta})\big\|, \tag{14}$$

where $\Pi_{T_{\boldsymbol{x}}\mathcal{M}}$ and $\Pi_{N_{\boldsymbol{x}}\mathcal{M}}$ denote the orthogonal projections onto the tangent and normal subspaces of the manifold at $\boldsymbol{x}$, respectively. In practice, we approximate this condition by setting $\boldsymbol{\delta}_{A(\hat{\boldsymbol{x}})}=\boldsymbol{y}_{A(\hat{\boldsymbol{x}})}-\boldsymbol{x}_{A(\hat{\boldsymbol{x}})}$ for some out-of-distribution image $\boldsymbol{y}\notin\mathcal{M}$ that is sufficiently distant from the manifold. Informally, this corresponds to using images that look noticeably different from normal examples. Furthermore, we assume that $\pi$ assigns nonzero probability to arbitrarily small values of $\alpha$.
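A minimal sketch of this corruption step on flat vectors is given below. The function name `corrupt`, the sample values, and the uniform choice for the transparency distribution are assumptions made for illustration; the text only requires that the transparency lies in $(0,1]$ and can be arbitrarily small.

```python
import random

def corrupt(x, y, anomaly_idx, alpha):
    """Blend an out-of-distribution sample y into x on the indices in anomaly_idx.

    Computes x_hat = x + alpha * delta, with delta = y - x on the corrupted
    region and delta = 0 on the normal region.
    """
    x_hat = list(x)
    for i in anomaly_idx:
        x_hat[i] = x[i] + alpha * (y[i] - x[i])
    return x_hat

rng = random.Random(0)
x = [0.2, 0.4, 0.6, 0.8]    # hypothetical normal sample
y = [5.0, -3.0, 7.0, 9.0]   # hypothetical out-of-distribution sample
A = [1, 2]                  # corrupted index set A(x_hat)

# Transparency in (0, 1]: 1 - U with U uniform on [0, 1) is an assumed choice of pi.
alpha = 1.0 - rng.random()
x_hat = corrupt(x, y, A, alpha)
print(alpha, x_hat)
```

With `alpha = 1.0` the corrupted region is fully replaced by the out-of-distribution content, while small values of `alpha` produce inputs close to the original normal sample.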
#### Compatible normal sources\.
For a partially corrupted input $\hat{\boldsymbol{x}}$ we define the set of normal samples compatible with $\hat{\boldsymbol{x}}$ as

$$\mathrm{CNS}(\hat{\boldsymbol{x}})=\Big\{\boldsymbol{x}\in\mathcal{M}\;:\;\boldsymbol{x}_{N(\hat{\boldsymbol{x}})}=\hat{\boldsymbol{x}}_{N(\hat{\boldsymbol{x}})}\ \text{and}\ \exists\,\alpha\in(0,1],\,\boldsymbol{\delta}\ \text{s.t.}\ \hat{\boldsymbol{x}}=\boldsymbol{x}+\alpha\boldsymbol{\delta}\Big\}. \tag{15}$$

We measure ambiguity on the corrupted region by the diameter

$$\mathrm{diam}\big(\mathrm{CNS}(\hat{\boldsymbol{x}})\big):=\sup_{\boldsymbol{x},\boldsymbol{x}^{\prime}\in\mathrm{CNS}(\hat{\boldsymbol{x}})}\|\boldsymbol{x}_{A(\hat{\boldsymbol{x}})}-\boldsymbol{x}^{\prime}_{A(\hat{\boldsymbol{x}})}\|. \tag{16}$$
We now show how our corruption process induces the contraction toward the manifold that ensures properties (ii) and (iii) in Theorem [2](https://arxiv.org/html/2606.15280#Thmtheorem2).
###### Proposition 1\.
Consider the following unconstrained optimization problem:

$$\underset{\boldsymbol{\theta}\in\mathbb{R}^{k}}{\text{minimize}}\quad\mathbb{E}_{q(\hat{\boldsymbol{x}})}\!\left[\mathbb{E}_{p(\boldsymbol{x})}\!\left[\|f_{\boldsymbol{\theta}}(\hat{\boldsymbol{x}})-\boldsymbol{x}\|^{2}\;\middle|\;\mathcal{C}(\boldsymbol{x})=\hat{\boldsymbol{x}}\right]\right], \tag{17}$$

under the corruption model ([13](https://arxiv.org/html/2606.15280#A1.E13))–([14](https://arxiv.org/html/2606.15280#A1.E14)) denoted by $\mathcal{C}$. Let $f^{*}$ denote a Bayes-optimal solution of the above optimization problem. Then there exists a constant $C\in(0,1)$ such that, for all admissible corrupted samples $\hat{\boldsymbol{x}}$,

$$\mathrm{diam}\big(\mathrm{CNS}(f^{*}(\hat{\boldsymbol{x}}))\big) < C\cdot\mathrm{diam}\big(\mathrm{CNS}(\hat{\boldsymbol{x}})\big). \tag{18}$$
###### Proof\.
The assumption about the perturbations with normal-dominant components implies that locally the set $\mathrm{CNS}(\hat{\boldsymbol{x}})$ is confined within a cone around the normal direction of $\mathcal{M}$ in the corrupted coordinates of $\boldsymbol{\delta}$. By Theorem [1](https://arxiv.org/html/2606.15280#Thmtheorem1), the Bayes-optimal mapping produces the conditional mean of the compatible normal sources, which lies near the center of this cone and therefore yields a *blurry* single-step completion. Because the training distribution contains transparent perturbations, i.e., arbitrarily small values of $\alpha$, the corrected sample $\hat{\boldsymbol{x}}^{+}:=f^{*}(\hat{\boldsymbol{x}})$ can only be generated from normal samples in a strictly smaller neighborhood of $\mathcal{M}$ than $\hat{\boldsymbol{x}}$.
The corresponding reduction factor is controlled by the aperture of the local cone of compatible normal sources. This aperture is determined by the maximal deviation from the normal-dominance condition in ([14](https://arxiv.org/html/2606.15280#A1.E14)). Hence, under a uniform normal-dominance margin, there exists a constant $C\in(0,1)$ such that equation ([18](https://arxiv.org/html/2606.15280#A1.E18)) holds for all admissible corrupted samples $\hat{\boldsymbol{x}}$. This idea is illustrated in Figure [4](https://arxiv.org/html/2606.15280#A1.F4). ∎
Figure 4: Illustration of the progressive reduction in the diameter of compatible normal sources under iterative application of the model. The manifold of normal data $\mathcal{M}$ is represented schematically by the horizontal black line. The leftmost triangle (blue) indicates the cone of compatible normal sources for the corrupted sample $\hat{\boldsymbol{x}}_{1}$, while the rightmost triangle (red) indicates the corresponding cone for $\hat{\boldsymbol{x}}_{2}$. The small triangle (lilac) illustrates that, as the corrected sample moves closer to $\mathcal{M}$, the set of compatible normal sources becomes more restricted, resulting in a smaller diameter.

Together, Theorem [1](https://arxiv.org/html/2606.15280#Thmtheorem1) and Proposition [1](https://arxiv.org/html/2606.15280#Thmproposition1) provide the geometric mechanism underlying assumptions (ii) and (iii) of Theorem [2](https://arxiv.org/html/2606.15280#Thmtheorem2). Theorem [1](https://arxiv.org/html/2606.15280#Thmtheorem1) states that a Bayes-optimal solution $f^{*}$ maps each corrupted input to the conditional mean of the compatible normal sources determined by the corruption process $\mathcal{C}$. Under the normal-dominance assumption, these compatible sources are confined to a cone around the normal direction of the manifold. Proposition [1](https://arxiv.org/html/2606.15280#Thmproposition1) shows that, after applying $f^{*}$, the diameter of this cone decreases by a uniform multiplicative factor.
Since the compatible normal sources are defined by fixing the normal coordinates and varying only the corrupted coordinates, the diameter of this set directly measures the remaining ambiguity along the corresponding manifold slice. Therefore, the multiplicative reduction of this diameter induces the slice-contraction property (ii). Moreover, because the Bayes-optimal correction is the conditional mean of the compatible normal sources, the magnitude of the remaining correction is controlled by the same cone diameter. As the cone diameter decreases, the correction magnitude vanishes near the corresponding manifold slice, yielding property (iii). Thus, the decreasing diameter of the compatible normal sources explains why iterative application of $f^{*}$ moves the sample progressively closer to $\mathcal{M}$ while producing corrections of decreasing magnitude.
#### Role of transparency and corruption geometry\.
Training with transparent corruptions, i.e., a continuum of perturbation magnitudes, exposes the model to inputs arbitrarily close to the normal-data manifold and therefore determines the local behavior of the learned mapping in a neighborhood of $\mathcal{M}$. However, transparency alone is not sufficient to guarantee contractive behavior. A contraction bias is obtained only if the corruption directions that appear close to the manifold are predominantly normal to $\mathcal{M}$. In this case, small-amplitude corruptions correspond to displacements away from the manifold, and minimizing the conditional reconstruction objective enforces corrections that reduce these displacements. Conversely, if small-amplitude corruptions are dominated by tangent components, they correspond to valid variations along the manifold and should not be removed.
Thus, transparency restricts the perturbation directions that influence the local Jacobian of the learned mapping: near $\mathcal{M}$, the model is exposed only to corruption patterns whose normal component dominates. Under this condition, the learned operator is biased toward contractive behavior in the corrupted coordinates.
### A.3 Regularization based on Jacobian Penalties
Geometrically, our autoencoder $f$ is designed to project corrupted inputs back onto the manifold of normal samples $\mathcal{M}$ while acting as the identity on $\mathcal{M}$. Together, these requirements suggest that $f$ should be idempotent, i.e. $f(f(\hat{\boldsymbol{x}}))=f(\hat{\boldsymbol{x}})$. Instead of adding an explicit penalty to the objective, we propose to constrain the Jacobian by adding the term $\|J_{f}(\hat{\boldsymbol{x}})-I_{d}\|_{F}$ to the training objective ([1](https://arxiv.org/html/2606.15280#S2.E1)), where $J_{f}(\hat{\boldsymbol{x}})$ and $I_{d}$ denote the Jacobian of the model and the identity mapping, respectively.
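To make the regularization term concrete, the sketch below evaluates the Frobenius norm of the Jacobian minus the identity for a small hand-written map. Everything here is an assumption for illustration: the toy map `f` stands in for the autoencoder, and the central-difference `jacobian` helper replaces the automatic differentiation that a training framework would normally provide.

```python
import math

def f(u, rho=0.5):
    """Toy map on R^2 (illustrative stand-in, not the paper's autoencoder)."""
    a, b = u
    return [a, math.sin(a) + rho * (b - math.sin(a))]

def jacobian(func, u, h=1e-5):
    """Central-difference Jacobian; J[i][j] approximates d func_i / d u_j."""
    d = len(u)
    cols = []
    for j in range(d):
        up = list(u)
        dn = list(u)
        up[j] += h
        dn[j] -= h
        f_up, f_dn = func(up), func(dn)
        cols.append([(p - m) / (2.0 * h) for p, m in zip(f_up, f_dn)])
    # cols[j][i] holds d func_i / d u_j; transpose into row-major J[i][j].
    return [[cols[j][i] for j in range(d)] for i in range(d)]

def jacobian_penalty(func, u):
    """Frobenius norm of J_func(u) - I_d."""
    J = jacobian(func, u)
    d = len(u)
    return math.sqrt(sum(
        (J[i][j] - (1.0 if i == j else 0.0)) ** 2
        for i in range(d) for j in range(d)
    ))

penalty = jacobian_penalty(f, [0.0, 2.0])
print(penalty)
```

For this toy map the analytic Jacobian at the chosen point has rows $(1, 0)$ and $(0.5, 0.5)$, so the penalty should be close to $\sqrt{0.5}$. The penalty vanishes only where the map behaves locally like the identity.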
To analyze how the regularizer promotes idempotency, we expand $f$ at $\boldsymbol{x}\in\mathcal{M}$ and evaluate it at $f(\hat{\boldsymbol{x}})$, where $\hat{\boldsymbol{x}}$ is a corrupted version of $\boldsymbol{x}$. We note that identity behavior is not enforced uniformly, but arises predominantly along tangent directions of the manifold due to its interaction with the reconstruction error, which in turn induces contraction toward the manifold. For the sake of argument, we consider the extreme case $J_{\boldsymbol{\theta}}(\boldsymbol{x})=I_{d}$, yielding

$$\begin{aligned}
f(f(\hat{\boldsymbol{x}}))&=f(\boldsymbol{x})+J_{\boldsymbol{\theta}}(\boldsymbol{x})\bigl(f(\hat{\boldsymbol{x}})-\boldsymbol{x}\bigr)+R_{2}\\
&=f(\hat{\boldsymbol{x}})+(f(\boldsymbol{x})-\boldsymbol{x})+R_{2},
\end{aligned}$$

where $R_{2}$ is the second-order Taylor remainder. If the Jacobian is $L$-Lipschitz along the segment from $\boldsymbol{x}$ to $f(\hat{\boldsymbol{x}})$, then $\|R_{2}\|\leq L\|f(\hat{\boldsymbol{x}})-\boldsymbol{x}\|^{2}$. Hence,

$$\underbrace{f(f(\hat{\boldsymbol{x}}))-f(\hat{\boldsymbol{x}})}_{\text{idempotency residual}}=\underbrace{f(\boldsymbol{x})-\boldsymbol{x}}_{\text{bias on }\mathcal{M}}+\mathcal{O}\!\big(\underbrace{\|f(\hat{\boldsymbol{x}})-\boldsymbol{x}\|^{2}}_{\text{reconstruction error}}\big). \tag{19}$$

Equation ([19](https://arxiv.org/html/2606.15280#A1.E19)) thus links approximate idempotency directly to reconstruction accuracy and to how well $f$ approximates the identity on the manifold.
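The remainder bound follows from the integral form of the first-order Taylor expansion. As a short derivation sketch, write $\boldsymbol{h}:=f(\hat{\boldsymbol{x}})-\boldsymbol{x}$ (a shorthand introduced here, not in the original) and assume $f$ is differentiable along the segment with an $L$-Lipschitz Jacobian in operator norm. This gives the constant $L/2$, which is consistent with the looser constant $L$ stated in the text:

```latex
\begin{aligned}
R_{2} &= f(\boldsymbol{x}+\boldsymbol{h}) - f(\boldsymbol{x}) - J_{\boldsymbol{\theta}}(\boldsymbol{x})\,\boldsymbol{h}
       = \int_{0}^{1} \bigl(J_{\boldsymbol{\theta}}(\boldsymbol{x}+t\boldsymbol{h}) - J_{\boldsymbol{\theta}}(\boldsymbol{x})\bigr)\,\boldsymbol{h}\;dt, \\
\|R_{2}\| &\le \int_{0}^{1} \bigl\|J_{\boldsymbol{\theta}}(\boldsymbol{x}+t\boldsymbol{h}) - J_{\boldsymbol{\theta}}(\boldsymbol{x})\bigr\|\,\|\boldsymbol{h}\|\;dt
       \le \int_{0}^{1} L\,t\,\|\boldsymbol{h}\|^{2}\;dt
       = \frac{L}{2}\,\|\boldsymbol{h}\|^{2}.
\end{aligned}
```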
More generally, projection operators onto $\mathcal{M}$ satisfy three closely related structural properties: (i) $\mathrm{Im}(f)=\mathcal{M}$, (ii) $f(\boldsymbol{x})=\boldsymbol{x}\Leftrightarrow\boldsymbol{x}\in\mathcal{M}$, and (iii) $f$ is idempotent, i.e. $f(f(\boldsymbol{x}))=f(\boldsymbol{x})$. Any two of these properties imply the third and therefore characterize a projection operator onto $\mathcal{M}$.
Our result in Equation \([19](https://arxiv.org/html/2606.15280#A1.E19)\) reveals that the regularization term couples these properties in the trained model: \(i\) is reflected through reconstruction quality, \(ii\) corresponds to the residual bias on the manifold, and \(iii\) is captured by the idempotency residual\. In other words, the identity\-anchored Jacobian regularization, together with a suitable corruption process that promotes contraction toward the data manifold, locally stabilizes the model toward behavior that approximates the defining properties of a projection operator\.
Figure 5: Representative examples of anomaly heatmaps on the MVTec AD and VisA datasets produced by our model. Each column shows (top to bottom) the input image, its reconstruction, the corresponding human-annotated anomaly mask for reference, and the predicted anomaly heatmap.