Diffuse to Detect: Generative Diffusion Models for Unsupervised IC Anomaly Detection

arXiv cs.LG 05/27/26, 04:00 AM Papers
diffusion-models anomaly-detection semiconductor unsupervised-learning generative-ai ic-testing
Summary
Proposes the first unsupervised anomaly detection framework for IC latent defect screening using a Diffusion Transformer, achieving state-of-the-art performance on industrial 16nm test data.
arXiv:2605.26468v1 Announce Type: new
# Diffuse to Detect: Generative Diffusion Models for Unsupervised IC Anomaly Detection
Source: [https://arxiv.org/html/2605.26468](https://arxiv.org/html/2605.26468)
###### Abstract

Latent defect screening is challenged by extremely low failure rates, high\-dimensional test data, and absence of labeled anomalies\. We propose the first unsupervised anomaly detection framework incorporating a Diffusion Transformer\. Raw test measurements are first compressed by an autoencoder, then reshaped into a structured token sequence enriched with sinusoidal and per\-device wafer\-position embeddings\. Anomaly scores are derived from the noise\-prediction error over mid\-range diffusion timesteps, enabling fast wafer\-scale screening without any labeled defects or manual feature engineering\. Our approach achieves state\-of\-the\-art performance on industrial 16nm IC test data under extreme class imbalance, offering interpretable failure localization through latent\-space reconstruction residuals\.

## I Introduction

Semiconductor manufacturing relies on extensive electrical testing to screen defective devices before they reach end customers. As process nodes shrink and device complexity grows, the volume and dimensionality of test data expand correspondingly: modern production test programs generate thousands of measurement features per device, spanning leakage currents, propagation delays, voltage margins, and application-specific stress responses. Anomaly detection on these parametric measurements identifies defective parts and is critical for catching latent defects.

Latent defects are defects that can pass all functional test criteria yet harbor physical imperfections that will manifest as field failures under operating stress, for example, resistive shorts or resistive opens\[WLS\]\. Such latent defects are typically the roadblocks for automotive products to reach Zero Defect quality, as well as the root cause of the Silent Data Corruption \(SDC\) issue\[SDC\], which has been one of the most difficult challenges facing high\-performance computing companies in recent years\. Screening them is fundamentally difficult for three compounding reasons\. First, latent defective devices are extremely rare\. In mature processes, failure rates are measured in parts per million: in our industrial dataset, fewer than 10 latent outliers appear among approximately 6,000 nominally passing devices\. This extreme class imbalance renders supervised learning approaches impractical—collecting sufficient labeled defect data is both costly and time\-consuming, and classifiers trained on such skewed distributions tend to collapse to the majority class\. Second, the test feature space is high\-dimensional as there are typically thousands of parametric measurements during production test, introducing the curse of dimensionality and making distance\- or density\-based anomaly detection unreliable without careful feature selection\. Third, the tolerance for screening error is extremely tight\. Semiconductor yield loss is directly tied to economic viability: missed defects increase field return rates and warranty costs, while over\-screening discards good dies and erodes yield\. An effective screening system must operate at a precisely controlled false positive rate while maximizing recall of the rare anomalous units—a balance that naïve thresholding or generic anomaly detectors routinely fail to achieve\.

The key observation that motivates our approach is that latent defects, despite passing functional tests, *deviate from the distribution of normal devices* in the space of parametric test measurements. A device with a gate oxide weak spot may pass timing tests under nominal conditions yet exhibit subtly elevated leakage currents or anomalous delay margins that collectively fall outside the learned manifold of healthy silicon. This distributional deviation provides the signal for unsupervised detection, without requiring any defect labels.

Generative diffusion models [ho2020ddpm] are well suited to exploit this signal. Trained exclusively on the abundant healthy-device population, a diffusion model learns the joint distribution of normal test responses across all measurement dimensions. At inference, a device is scored by how poorly the model can reconstruct its test measurements: anomalous devices, lying outside the training distribution, induce elevated reconstruction error. This paradigm requires no labeled anomalies, scales naturally to large datasets, and sidesteps the manual feature engineering that has historically bottlenecked semiconductor quality pipelines. A further practical advantage is *interpretability*: by examining the reconstruction residuals at each step of the reverse diffusion process, test engineers gain direct insight into which measurement regions deviate most strongly from normal behavior. This provides information for root-cause failure analysis that purely discriminative models cannot offer.

Prior diffusion-based anomaly detection has focused almost exclusively on image data [wyatt2022anodddpm,zhang2023diffusionad], where spatial patch structure naturally maps onto transformer token sequences. The closest tabular counterpart, TabDDPM [kotelnikov2023tabddpm], applies a ResNet-based denoising MLP directly in the raw feature space, treating the measurement vector as a flat, unordered input with no structural inductive bias. This design is adequate for low-dimensional tabular data but struggles in the regime of IC test data, where the raw feature space is both high-dimensional and semantically heterogeneous.

We address this gap with Diffuse to Detect, a fully unsupervised anomaly detection framework designed for high-dimensional IC test data. Our approach departs from TabDDPM in two fundamental ways. First, rather than diffusing in raw feature space, we learn a compact latent representation via an MLP encoder and perform diffusion in this lower-dimensional latent space. Second, we replace the flat ResNet denoiser with a *1D Diffusion Transformer* (DiT1D) [peebles2022dit] that treats the latent representation as a structured token sequence, enabling the self-attention mechanism to capture compound inter-feature correlations. We further incorporate a per-device die positional embedding derived from the physical wafer coordinates of each device, accounting for the systematic spatial variation in measurements that is characteristic of wafer-level test data.

The contributions of this paper are as follows:

- We introduce the first diffusion model-based framework for unsupervised anomaly detection on IC electrical test data, entirely eliminating the need for manual feature engineering.
- We introduce a two-level positional encoding scheme combining fixed sinusoidal token-position embeddings with a learned, per-device gated MLP die embedding, encoding both the semantic structure of the test flow and the spatial geometry of the wafer.
- We show that our approach outperforms established unsupervised baselines on industrial IC test datasets under the extreme class imbalance ($<0.12\%$ outlier rate) characteristic of mature semiconductor processes.

## II Related Work

### II-A Anomaly Detection in IC Testing

Anomaly detection in semiconductor manufacturing and IC testing has been studied extensively, yet the dominant paradigm remains heavily reliant on engineered features and supervised or semi\-supervised signal\. Classical approaches apply statistical process control \(SPC\) techniques, such as multivariate control charts and principal component analysis, to hand\-selected test parameters, flagging devices that fall outside learned normal operating bounds\[montgomery2009spc,gu2020outlier\]\. While interpretable and computationally lightweight, these methods require significant domain expertise to define relevant features and thresholds, and do not generalize across product generations or process changes\.

Machine learning methods have progressively supplemented rule\-based approaches\. Ensemble outlier detectors such as Isolation Forest\[liu2008iforest\]and one\-class support vector machines\[scholkopf2001ocsvm\]have been applied to parametric test vectors, offering improved sensitivity to multivariate anomaly patterns\. However, these methods scale poorly to the high\-dimensional feature spaces produced by modern test programs, and remain dependent on curated feature inputs\. Deep learning approaches, including autoencoders and variational autoencoders applied to equipment sensor data and electrical test outputs, have demonstrated improved representation learning without explicit feature engineering\[liao2020stalad\]\. Most notably, TRACE\-GPT\[kim2023tracegpt\]proposed a GPT\-based generative pre\-training framework adapted from natural language processing to model sequential semiconductor manufacturing sensor signals for unsupervised fault detection, demonstrating the promise of sequence\-aware generative models in this domain\. However, TRACE\-GPT operates on continuous equipment sensor time series rather than the structured tabular outputs of parametric test programs, and employs an autoregressive generation objective rather than the diffusion\-based anomaly scoring central to our approach\.

Wafer bin map \(WBM\) analysis represents a parallel line of work in which spatial die\-level pass/fail patterns are analyzed for defect fingerprinting\[nakazawa2018wbm\]\. Diffusion models have recently been applied to WBM data for unknown defect pattern detection\[moon2024wigdm\], leveraging reconstruction error on spatial image representations\. Our work is fundamentally distinct: we operate on the raw continuous\-valued parametric test measurements produced per device, not on spatial binary pass/fail maps, and we target the detection of subtle analog deviations in high\-dimensional feature space rather than spatial clustering of hard failures\.

Table I: Comparison of related anomaly detection methods.

| Method | Modality | Diffusion | Unsupervised | No Feat. Eng. | Token Localization |
| --- | --- | --- | --- | --- | --- |
| SPC [montgomery2009spc] | IC Test | ✗ | ✓ | ✗ | ✗ |
| Isolation Forest [liu2008iforest] | General | ✗ | ✓ | ✗ | ✗ |
| TRACE-GPT [kim2023tracegpt] | Sensor TS | ✗ | ✓ | ✓ | ✗ |
| AnoDDPM [wyatt2022anodddpm] | Image | ✓ | ✓ | ✓ | Pixel |
| TabDDPM [kotelnikov2023tabddpm] | Tabular | ✓ | ✓ | ✓ | ✗ |
| DTE [livernoche2024dte] | Tabular | ✗∗ | ✓ | ✓ | ✗ |
| ImDiffusion [chen2023imdiffusion] | Time Series | ✓ | ✓ | ✓ | ✗ |
| Diffuse to Detect (Ours) | IC Test | ✓ | ✓ | ✓ | Token |

∗ DTE estimates diffusion time via a surrogate network rather than running the generative denoising chain.

### II-B Diffusion Models for Anomaly Detection

Denoising diffusion probabilistic models \(DDPMs\)\[ho2020ddpm\]learn a data distribution by training a neural network to reverse a gradual Gaussian noising process\. The key property exploited for anomaly detection is that a model trained exclusively on normal data learns to reconstruct normal samples faithfully, while anomalous inputs yield elevated reconstruction error during the reverse process\.

AnoDDPM\[wyatt2022anodddpm\]pioneered this approach for medical image anomaly detection, introducing simplex noise to control the spatial scale of detectable anomalies\. Subsequent image\-domain work has explored conditional diffusion for reconstruction\[mousakhan2023ddad\], latent diffusion for scalability\[graham2023ldm3d\], and iterative reconstruction\-localization coupling\[fucka2024transfusion\]\. These methods achieve strong performance on visual benchmarks such as MVTec\[bergmann2019mvtec\]and VisA, but their architectural assumptions—2D spatial patches, convolutional or vision transformer backbones, pixel\-level anomaly maps—do not transfer to tabular test data\.

For tabular data, diffusion-based anomaly detection remains largely unexplored. TabDDPM [kotelnikov2023tabddpm] is the first MLP diffusion model with residual connections for tabular data generation, and it has been used for anomaly detection by scoring samples with their reconstruction loss. However, TabDDPM treats tabular rows as flat, unordered vectors, making no use of any structural or semantic ordering among features. The Diffusion Time Estimation (DTE) method [livernoche2024dte] similarly operates on unordered tabular feature vectors, replacing the full reverse diffusion chain with a lightweight network trained to predict the noise level of a corrupted input. This is an efficient density proxy but not a true generative diffusion model, and it offers no token-level localization capability. Because DTE does not denoise individual instances, it is poorly suited to root-cause failure analysis.

For multivariate time series, ImDiffusion [chen2023imdiffusion] demonstrated that a transformer-backbone DDPM using imputation-based reconstruction, rather than full sequence reconstruction, yields superior anomaly detection by leveraging neighboring context. This insight is related to our use of positional structure, though ImDiffusion targets temporally ordered sensor streams with fixed, uniform spacing, whereas our tokens encode systematic wafer-level variations.

### II-C Diffusion Transformers

The Diffusion Transformer (DiT) [peebles2022dit] replaced the convolutional U-Net backbone of standard DDPMs with a transformer operating on sequences of latent patches, demonstrating superior scalability and generation quality on image benchmarks. The transformer backbone brings two properties critical to our application: native support for variable-length token sequences with positional encoding, and per-token output resolution that enables token-level reconstruction scoring. While DiT was designed for image generation in a VAE latent space, its transformer denoising block is modality-agnostic. We adapt it to operate directly on latent test-program token sequences, replacing the image VAE and 2D patchification of the original DiT with a 1D sequential tokenization aligned with the test flow structure of IC data.

## III Methodology

![Refer to caption](https://arxiv.org/html/2605.26468v1/x1.png)

Figure 1: Overview of the proposed framework. A high-dimensional IC test measurement vector $\mathbf{x}\in\mathbb{R}^{F}$ is first reduced and tokenized into a latent sequence $\mathbf{Z}_{0}\in\mathbb{R}^{C\times L}$, enriched with two levels of positional encoding, and processed by a 1D Diffusion Transformer (DiT1D) trained on healthy-silicon measurements. At inference time, diffusion loss is used as the anomaly score for high-throughput production screening.

### III-A Problem Formulation

Let $\mathcal{D}=\{\mathbf{x}^{(i)}\}_{i=1}^{N}$ denote a dataset of $N$ IC devices, where each device is characterized by a high-dimensional parametric test measurement vector $\mathbf{x}^{(i)}\in\mathbb{R}^{F}$, with $F$ denoting the total number of test features (in our setting, $F\sim 10^{3}$). These features are produced by a structured sequence of $P$ test programs $\{\mathcal{T}_{1},\mathcal{T}_{2},\ldots,\mathcal{T}_{P}\}$, where program $\mathcal{T}_{p}$ generates a contiguous block of $f_{p}$ features, and $\sum_{p=1}^{P}f_{p}=F$.

We assume access to a training set $\mathcal{D}_{\text{train}}\subseteq\mathcal{D}$ comprising predominantly healthy (passing) devices, collected without anomaly labels. The goal is to learn the distribution of normal test responses $p_{\theta}(\mathbf{x})$ and, at inference time, assign an anomaly score $s(\mathbf{x})\in\mathbb{R}$ to each device such that anomalous devices receive significantly higher scores than healthy ones. An overview of the complete framework is illustrated in Fig. [1](https://arxiv.org/html/2605.26468#S3.F1).

### III-B Dimensionality Reduction and Tokenization

IC test measurement vectors are high-dimensional: each device is characterized by $F\sim 10^{3}$ features spanning parametric test programs. Feeding such vectors directly into a transformer would be computationally expensive and statistically ill-conditioned, as the feature space far exceeds the number of meaningful degrees of freedom in typical silicon responses. We therefore decouple the pipeline into two sequential steps: (i) dimensionality reduction to a compact representation, and (ii) tokenization of that representation into a sequence suitable for the 1D DiT. This process is illustrated in Fig. [2](https://arxiv.org/html/2605.26468#S3.F2).

#### Dimensionality Reduction

A two-layer MLP with a SiLU activation reduces $\mathbf{x}$ to a fixed-width representation:

$$\mathbf{h}=\text{LN}\!\left(\mathbf{W}_{2}\,\sigma\!\left(\mathbf{W}_{1}\mathbf{x}+\mathbf{b}_{1}\right)+\mathbf{b}_{2}\right),\quad\mathbf{h}\in\mathbb{R}^{D_{r}},\tag{1}$$

where $\sigma$ denotes the SiLU nonlinearity and $D_{r}=128\ll F$ is the reduced dimension. The parameter-free Layer Normalization $\text{LN}$ is applied after projection to enforce approximately unit-variance statistics, which stabilizes the subsequent Gaussian diffusion process [livernoche2024dte]. All input features are standardized to zero mean and unit variance using statistics computed over $\mathcal{D}_{\text{train}}$ prior to this step.
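As an illustration of Eq. (1), the following NumPy sketch applies a two-layer SiLU MLP followed by a parameter-free LayerNorm. The hidden width of 256, the random weights, and the helper names (`silu`, `layer_norm`, `reduce_features`) are assumptions made for the example; only $F\sim 10^{3}$ and $D_{r}=128$ come from the text.

```python
import numpy as np

def silu(a):
    # SiLU(a) = a * sigmoid(a)
    return a / (1.0 + np.exp(-a))

def layer_norm(a, eps=1e-5):
    # Parameter-free LayerNorm over the last axis (no learned scale or shift).
    mu = a.mean(axis=-1, keepdims=True)
    var = a.var(axis=-1, keepdims=True)
    return (a - mu) / np.sqrt(var + eps)

def reduce_features(x, W1, b1, W2, b2):
    # Eq. (1): h = LN(W2 * SiLU(W1 x + b1) + b2), applied row-wise to a batch.
    return layer_norm(silu(x @ W1.T + b1) @ W2.T + b2)

rng = np.random.default_rng(0)
F, hidden, D_r = 1000, 256, 128          # hidden width is an assumed value
W1 = rng.normal(0.0, F ** -0.5, (hidden, F))
b1 = np.zeros(hidden)
W2 = rng.normal(0.0, hidden ** -0.5, (D_r, hidden))
b2 = np.zeros(D_r)

x = rng.normal(size=(4, F))              # four standardized devices
h = reduce_features(x, W1, b1, W2, b2)   # shape (4, 128)
```

Each row of `h` has approximately zero mean and unit variance, which is the property the text relies on before Gaussian noise is added.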

![Refer to caption](https://arxiv.org/html/2605.26468v1/x2.png)

Figure 2: Dimensionality reduction and tokenization. The flat measurement vector $\mathbf{x}\in\mathbb{R}^{F}$ ($F\sim 10^{3}$) is projected by a two-layer MLP to $\mathbf{h}\in\mathbb{R}^{D_{r}}$ ($D_{r}=128$), then reshaped into a token sequence $\mathbf{Z}_{0}\in\mathbb{R}^{C\times L}$. This is analogous to patch embedding in vision transformers, where an image is partitioned into spatial patches before entering the transformer.

#### Tokenization

The reduced representation $\mathbf{h}$ is reshaped into a 2D token sequence that forms the input to the 1D DiT:

$$\mathbf{Z}_{0}=\text{reshape}(\mathbf{h};\;C,L)\in\mathbb{R}^{C\times L},\tag{2}$$

where $C$ is the number of latent channels and $L=\lceil D_{r}/C\rceil$ is the sequence length (zero-padded if $D_{r}$ is not divisible by $C$). This is directly analogous to patch embedding in vision transformers [dosovitskiy2020vit]: just as a ViT partitions an image into spatial patches to form a token sequence, we partition the reduced measurement representation into $L$ tokens along the sequence axis, each described by a $C$-dimensional feature vector. The resulting sequence $\mathbf{Z}_{0}$ exposes the local structure of the reduced representation to the self-attention mechanism of the DiT, enabling the model to capture dependencies across different regions of the latent space.
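The reshape in Eq. (2), including the zero-padding case, can be sketched as follows. The channel-major memory layout and the function name `tokenize` are assumptions; the text does not specify the element order used by the reshape.

```python
import numpy as np

def tokenize(h, C):
    # Eq. (2): reshape a length-D_r vector into C channels x L positions,
    # zero-padding the tail when D_r is not divisible by C.
    D_r = h.shape[-1]
    L = -(-D_r // C)                       # ceil(D_r / C)
    padded = np.zeros(C * L, dtype=h.dtype)
    padded[:D_r] = h
    # Assumption: channel-major layout; the order is not stated in the text.
    return padded.reshape(C, L)

h = np.arange(128, dtype=np.float32)
Z0 = tokenize(h, C=8)       # shape (8, 16), no padding needed
Z0_pad = tokenize(h, C=6)   # L = ceil(128 / 6) = 22, so 4 zeros are appended
```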

### III-C Two-Level Positional Encoding

The token sequence𝐙0\\mathbf\{Z\}\_\{0\}receives positional information at two distinct levels, encoding complementary structure in the data\. Fig\.[3](https://arxiv.org/html/2605.26468#S3.F3)illustrates both levels and how they are combined\.

#### Feature\-Level Sin\-Cos Encoding

To distinguish token positions within the latent sequence, we add a fixed sinusoidal positional embedding over the $L$ sequence positions:

$$\mathbf{Z}_{0}\leftarrow\mathbf{Z}_{0}+\mathbf{E}_{\text{feat}}\in\mathbb{R}^{C\times L},\tag{3}$$

where $\mathbf{E}_{\text{feat}}[\colon,\ell]$ is computed from the standard 1D sinusoidal formula over positions $\ell=0,\ldots,L-1$ [vaswani2017attention]. These embeddings provide a canonical and consistent ordering of the latent token positions that is shared across all devices.
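A minimal sketch of the fixed sinusoidal table used in Eq. (3) is shown below. The interleaved sin/cos layout and the base of 10000 follow the common Transformer convention and are assumptions here; `sincos_1d` is a hypothetical helper name.

```python
import numpy as np

def sincos_1d(num_pos, dim, base=10000.0):
    # Standard sinusoidal table of shape (num_pos, dim); dim must be even.
    assert dim % 2 == 0
    pos = np.arange(num_pos)[:, None]
    freq = base ** (-np.arange(0, dim, 2) / dim)   # dim / 2 frequencies
    table = np.empty((num_pos, dim))
    table[:, 0::2] = np.sin(pos * freq)
    table[:, 1::2] = np.cos(pos * freq)
    return table

C, L = 8, 16
E_feat = sincos_1d(L, C).T        # (C, L): one C-dimensional code per position
Z0 = np.zeros((C, L)) + E_feat    # Eq. (3) applied to an all-zero latent
```

Because the table depends only on the position index, the same `E_feat` is added to every device.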

![Refer to caption](https://arxiv.org/html/2605.26468v1/x3.png)

Figure 3: Two-level positional encoding. Feature-level: fixed sin-cos embeddings are added uniformly across all devices to encode token position within the latent sequence. Die-level: a gated MLP produces a per-instance embedding from the die's physical wafer coordinates $(d_{x},d_{y})$, capturing systematic spatial variation across the wafer.

#### Per\-Instance Die Positional Embedding

IC test data carries an important instance-level spatial covariate: the physical $(d_{x},d_{y})$ location of each die on the wafer. Measurements from dies at different wafer locations can exhibit systematic distributional shifts (e.g., edge versus center effects), which are not captured by the feature-level encoding.

We therefore add a per-sample *die positional embedding* $\mathbf{e}_{\text{die}}\in\mathbb{R}^{C\times L}$ derived from the die coordinates:

$$\mathbf{Z}_{0}\leftarrow\mathbf{Z}_{0}+\mathbf{e}_{\text{die}}(d_{x},d_{y}),\tag{4}$$

computed via a *gated MLP* over a coordinate feature vector that concatenates raw die coordinates with standard per-axis sinusoidal embeddings [vaswani2017attention]:

$$\mathbf{v}=\left[d_{x},\;d_{y},\;\text{sincos}(d_{x}),\;\text{sincos}(d_{y})\right]\in\mathbb{R}^{2+D_{s}},\tag{5}$$

where $D_{s}$ serves as the embedding dimension. The die positional embedding is then:

$$\mathbf{e}_{\text{die}}=\text{reshape}\!\left(\mathbf{W}_{\text{out}}\!\left(f_{\text{base}}(\mathbf{v})+\sigma(g)\cdot f_{\text{res}}(\mathbf{v})\right);\;C,L\right),\tag{6}$$

where $f_{\text{base}}$ is a linear projection, $f_{\text{res}}$ is a two-layer MLP residual branch, and $g$ is a learned scalar gate with zero initialization, ensuring the embedding starts as a pure linear function of the coordinates and acquires nonlinearity during training. Unlike the feature-level encoding, this embedding varies per device: two dies at different wafer locations receive different additive offsets to their token sequences.
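The gated construction of Eqs. (5) and (6) might look like the sketch below. Several choices are assumptions rather than details from the paper: $\sigma$ is taken to be the SiLU of Eq. (1) (so that $\sigma(0)=0$ and the residual branch is switched off at initialization), $D_{s}$ is split evenly between the two axes, the branch width is 64, and all weights are random placeholders.

```python
import numpy as np

def silu(a):
    return a / (1.0 + np.exp(-a))

def sincos_scalar(d, dim, base=10000.0):
    # Sinusoidal code of length dim for a single scalar coordinate.
    freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.concatenate([np.sin(d * freq), np.cos(d * freq)])

def die_embedding(dx, dy, p, C, L):
    half = p["D_s"] // 2
    # Eq. (5): raw coordinates plus per-axis sinusoidal features.
    v = np.concatenate([[dx, dy], sincos_scalar(dx, half), sincos_scalar(dy, half)])
    base = p["W_base"] @ v                       # linear branch f_base
    res = p["W_r2"] @ silu(p["W_r1"] @ v)        # two-layer residual branch f_res
    gate = silu(p["g"])                          # assumed gate nonlinearity
    return (p["W_out"] @ (base + gate * res)).reshape(C, L)   # Eq. (6)

rng = np.random.default_rng(0)
C, L, D_s, width = 8, 16, 32, 64                 # D_s and width are assumed values
p = {
    "D_s": D_s,
    "g": 0.0,                                    # zero-initialized scalar gate
    "W_base": rng.normal(size=(width, 2 + D_s)),
    "W_r1": rng.normal(size=(width, 2 + D_s)),
    "W_r2": rng.normal(size=(width, width)),
    "W_out": rng.normal(size=(C * L, width)),
}
e_die = die_embedding(3, 7, p, C, L)             # embedding for die (3, 7)
```

With `g = 0.0` the output depends only on the linear branch; moving `g` away from zero lets the nonlinear branch contribute.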

### III-D Forward Diffusion on Token Sequences

We adopt the Denoising Diffusion Probabilistic Model (DDPM) framework [ho2020ddpm]. Given a clean token sequence $\mathbf{Z}_{0}$, the forward process defines a Markov chain that progressively corrupts the sequence by adding Gaussian noise over $T$ timesteps:

$$q(\mathbf{Z}_{t}\mid\mathbf{Z}_{t-1})=\mathcal{N}\!\left(\mathbf{Z}_{t};\,\sqrt{1-\beta_{t}}\,\mathbf{Z}_{t-1},\;\beta_{t}\mathbf{I}\right),\tag{7}$$

where $\{\beta_{t}\}_{t=1}^{T}$ is a fixed variance schedule. Using the reparameterization $\bar{\alpha}_{t}=\prod_{s=1}^{t}(1-\beta_{s})$, any noisy sequence can be sampled in closed form:

$$\mathbf{Z}_{t}=\sqrt{\bar{\alpha}_{t}}\,\mathbf{Z}_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\boldsymbol{\epsilon},\quad\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}).\tag{8}$$

We use a cosine noise schedule as proposed in [nichol2021improved].
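The closed-form sampling of Eq. (8) under a cosine schedule can be sketched as below. The value $T=1000$ and the offset `s = 0.008` are assumptions (the section does not state them), and the clipping of $\beta_{t}$ used in some implementations is omitted for brevity.

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    # Cumulative signal level alpha_bar_t for t = 0..T under a cosine schedule.
    t = np.arange(T + 1)
    f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]

def q_sample(Z0, t, alpha_bar, rng):
    # Eq. (8): Z_t = sqrt(alpha_bar_t) * Z_0 + sqrt(1 - alpha_bar_t) * eps
    eps = rng.standard_normal(Z0.shape)
    Zt = np.sqrt(alpha_bar[t]) * Z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return Zt, eps

T = 1000                                   # assumed number of timesteps
alpha_bar = cosine_alpha_bar(T)
rng = np.random.default_rng(0)
Z0 = rng.standard_normal((8, 16))
Zt, eps = q_sample(Z0, 500, alpha_bar, rng)
```

Since `alpha_bar` falls monotonically from 1 toward 0, small `t` leaves the latent nearly intact while large `t` yields almost pure noise.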

### III-E Denoising Backbone: 1D Diffusion Transformer

The reverse process learns to denoise $\mathbf{Z}_{t}$ back toward $\mathbf{Z}_{0}$ by training a network $\boldsymbol{\epsilon}_{\theta}(\mathbf{Z}_{t},t)$ to predict the added noise. We instantiate this as a *1D Diffusion Transformer* (DiT1D), adapted from the DiT architecture [peebles2022dit] for sequential latent inputs. The architecture is shown in Fig. [4](https://arxiv.org/html/2605.26468#S3.F4).

#### 1D Patch Embedding

The 2D patch embedding of the original DiT is replaced by a `Conv1d`-based 1D patch embedding with patch size $p$, which partitions the $L$-length token sequence into $N_{p}$ non-overlapping patches and projects each to the transformer hidden dimension $d$:

$$\mathbf{U}=\text{PatchEmbed1D}(\mathbf{Z}_{t})+\mathbf{E}_{\text{patch}}\in\mathbb{R}^{N_{p}\times d},\tag{9}$$

where $N_{p}=\lceil L/p\rceil$ is the number of patches (equal to $L/p$ when $p$ divides $L$) and $\mathbf{E}_{\text{patch}}$ is a fixed 1D sin-cos positional embedding over patch positions.
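A strided convolution whose kernel size equals its stride is equivalent to a linear map applied to each non-overlapping patch, which the sketch below writes out directly in NumPy. It assumes $p$ divides $L$, omits the positional term $\mathbf{E}_{\text{patch}}$, and uses random placeholder weights; `patch_embed_1d` is a hypothetical name.

```python
import numpy as np

def patch_embed_1d(Z, W, b, p):
    # Z: (C, L); W: (d, C, p); b: (d,).
    # Equivalent to a 1D convolution with kernel_size = stride = p.
    C, L = Z.shape
    assert L % p == 0, "this sketch assumes p divides L"
    N_p = L // p
    patches = Z.reshape(C, N_p, p).transpose(1, 0, 2)     # (N_p, C, p)
    return np.einsum("ncp,dcp->nd", patches, W) + b       # (N_p, d)

rng = np.random.default_rng(0)
C, L, p, d = 8, 16, 4, 32                 # p and d are assumed values
Z_t = rng.standard_normal((C, L))
W = rng.standard_normal((d, C, p))
b = np.zeros(d)
U = patch_embed_1d(Z_t, W, b, p)          # shape (4, 32), before adding E_patch
```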

#### Transformer Blocks

Each transformer block applies multi-head self-attention followed by a position-wise feed-forward network, with adaptive layer normalization conditioning on the diffusion timestep $t$:

$$\mathbf{h}\leftarrow\mathbf{h}+\alpha_{1}\cdot\text{Attn}\!\left(\text{LN}_{\gamma_{1},\beta_{1}}(\mathbf{h})\right),\tag{10}$$

$$\mathbf{h}\leftarrow\mathbf{h}+\alpha_{2}\cdot\text{FFN}\!\left(\text{LN}_{\gamma_{2},\beta_{2}}(\mathbf{h})\right),\tag{11}$$

where $(\alpha_{1},\gamma_{1},\beta_{1},\alpha_{2},\gamma_{2},\beta_{2})$ are predicted by a small MLP from the sinusoidal embedding of $t$, following the adaLN-Zero initialization of [peebles2022dit].
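The effect of adaLN-Zero can be illustrated with a single-head NumPy sketch of Eqs. (10) and (11). This is a simplification under stated assumptions: the modulation MLP is reduced to one linear layer, the scale is applied as $(1+\gamma)$ following the DiT reference convention, attention is single-head, and the feed-forward network is a one-layer stand-in. In the code, `g` and `s` play the roles of $\gamma$ and $\beta$, and `a` the role of $\alpha$.

```python
import numpy as np

def layer_norm(a, eps=1e-5):
    mu = a.mean(axis=-1, keepdims=True)
    return (a - mu) / np.sqrt(a.var(axis=-1, keepdims=True) + eps)

def self_attention(h, Wq, Wk, Wv):
    # Single-head scaled dot-product attention over the patch tokens.
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    logits = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def dit_block(h, t_emb, W_mod, b_mod, attn, ffn):
    d = h.shape[-1]
    # Six modulation vectors predicted from the timestep embedding.
    a1, g1, s1, a2, g2, s2 = (W_mod @ t_emb + b_mod).reshape(6, d)
    h = h + a1 * attn(layer_norm(h) * (1 + g1) + s1)     # Eq. (10)
    h = h + a2 * ffn(layer_norm(h) * (1 + g2) + s2)      # Eq. (11)
    return h

rng = np.random.default_rng(0)
N_p, d = 4, 16
h = rng.standard_normal((N_p, d))
t_emb = rng.standard_normal(d)
Wq, Wk, Wv, Wf = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
attn = lambda u: self_attention(u, Wq, Wk, Wv)
ffn = lambda u: np.maximum(u @ Wf, 0.0)                  # stand-in for the FFN

W_mod = np.zeros((6 * d, d))                             # adaLN-Zero: zero weights
b_mod = np.zeros(6 * d)                                  # and zero biases
out_init = dit_block(h, t_emb, W_mod, b_mod, attn, ffn)
```

With the modulation layer zero-initialized, both residual gates are zero and the block starts as the identity map, so `out_init` equals `h`.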

#### Final Layer and Unpatchify

The final layer applies adaLN modulation and projects each patch token back to $p\cdot C$ dimensions. The output is unpatchified to recover the original shape $\mathbb{R}^{C\times L}$, yielding the predicted noise $\boldsymbol{\epsilon}_{\theta}(\mathbf{Z}_{t},t)$.

![Refer to caption](https://arxiv.org/html/2605.26468v1/x4.png)

Figure 4: Architecture of the 1D Diffusion Transformer (DiT1D). Left: the full forward pass of the diffusion model. The token sequence $\mathbf{Z}_{t}\in\mathbb{R}^{C\times L}$ is patchified by a `Conv1d` stem, enriched with patch-level sin-cos positional embeddings, passed through $D$ transformer blocks conditioned on the diffusion timestep $t$, and projected back to the original shape via the final layer and unpatchify. Right: detail of a single transformer block, showing timestep-conditioned scale-and-shift modulation applied to both the self-attention and feed-forward sub-layers.

### III-F Training Objective

The model is trained on $\mathcal{D}_{\text{train}}$ using the standard DDPM simplified objective [ho2020ddpm], which minimizes the expected mean squared error between predicted and actual noise:

$$\mathcal{L}=\mathbb{E}_{t,\,\mathbf{Z}_{0},\,\boldsymbol{\epsilon}}\left[\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\theta}(\mathbf{Z}_{t},t)\right\|^{2}\right],\tag{12}$$

where $t$ is sampled uniformly from $\{1,\ldots,T\}$ and $\mathbf{Z}_{t}$ is computed via Eq. ([8](https://arxiv.org/html/2605.26468#S3.E8)). By training exclusively on healthy-silicon devices, the model learns the joint distribution of normal parametric test responses. Anomalous devices, which lie outside this learned distribution, will induce elevated reconstruction errors during inference.

### III-G Anomaly Scoring

For high-throughput production screening, we score devices via the noise-prediction loss averaged over a fixed set of mid-range diffusion timesteps, as illustrated in Fig. [5](https://arxiv.org/html/2605.26468#S3.F5). Given the latent $\mathbf{Z}_{0}$ obtained by encoding a test device $\mathbf{x}$, we sample noisy versions at each evaluation timestep using the closed-form forward process (Eq. [8](https://arxiv.org/html/2605.26468#S3.E8)) and query the denoising network; the resulting average noise-prediction error is the anomaly score $S(\mathbf{x})$:

$$S(\mathbf{x})=\frac{1}{|\mathcal{T}_{\text{eval}}|}\sum_{t\in\mathcal{T}_{\text{eval}}}\left\|\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\theta}(\mathbf{Z}_{t},t)\right\|^{2},\tag{13}$$

where $\mathcal{T}_{\text{eval}}=\{t_{\text{start}},\,t_{\text{start}}+\Delta t,\,\ldots,\,t_{\text{end}}-\Delta t\}$ is a fixed uniform grid over mid-range timesteps from $t_{\text{start}}$ to $t_{\text{end}}$ with step $\Delta t$. This requires only $|\mathcal{T}_{\text{eval}}|$ forward passes through the network with no iterative sampling, making it efficient for large-scale wafer-level screening.
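Eq. (13) can be sketched with a toy denoiser standing in for the trained DiT1D. The stand-in is the exact optimal noise predictor when healthy latents follow $\mathcal{N}(\mathbf{0},\mathbf{I})$, namely $\sqrt{1-\bar{\alpha}_{t}}\,\mathbf{Z}_{t}$; this choice, the grid values (300, 700, 50), and $T=1000$ are all assumptions made for illustration.

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    t = np.arange(T + 1)
    f = np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]

def anomaly_score(Z0, eps_model, alpha_bar, t_start, t_end, dt, rng):
    # Eq. (13): mean squared noise-prediction error over the grid
    # t_start, t_start + dt, ..., t_end - dt (one forward pass per timestep).
    errs = []
    for t in range(t_start, t_end, dt):
        eps = rng.standard_normal(Z0.shape)
        Zt = np.sqrt(alpha_bar[t]) * Z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
        errs.append(np.sum((eps - eps_model(Zt, t)) ** 2))
    return float(np.mean(errs))

T = 1000                                     # assumed number of timesteps
alpha_bar = cosine_alpha_bar(T)

def toy_eps_model(Zt, t):
    # Optimal noise predictor if healthy latents are standard normal (toy stand-in).
    return np.sqrt(1.0 - alpha_bar[t]) * Zt

rng = np.random.default_rng(0)
healthy = rng.standard_normal((8, 16))       # in-distribution latent
shifted = healthy + 3.0                      # out-of-distribution latent
s_healthy = anomaly_score(healthy, toy_eps_model, alpha_bar, 300, 700, 50, rng)
s_shifted = anomaly_score(shifted, toy_eps_model, alpha_bar, 300, 700, 50, rng)
```

In this toy setting the shifted latent receives a clearly larger score than the healthy one, which is the ranking behavior the scoring rule is meant to produce.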

![Refer to caption](https://arxiv.org/html/2605.26468v1/x5.png)

Figure 5: Anomaly scoring in our method, which evaluates the noise-prediction error over a fixed set of mid-range timesteps via forward diffusion.

## IV Experiments

Table II: Dataset statistics after preprocessing.

| Dataset | # Features | # Samples | # Anomalies | Anomaly rate |
| --- | --- | --- | --- | --- |
| Dataset 1 | 1,158 | 6,255 | 7 | 0.22% |
| Dataset 2 | 3,034 | 69,009 | 41 | 0.12% |

### IV-A Dataset and preprocessing

Our experiments use data collected from a 16 nm automotive chip product across two release versions. Each dataset is a table of one row per (lot, wafer, die) test outcome, carrying a binary health label, spatial identifiers (`lot_key`, `wf_key`, `die_x`, `die_y`), and a large set of parametric measurements. Table [II](https://arxiv.org/html/2605.26468#S4.T2) summarizes the resulting dataset statistics after preprocessing.

#### Feature selection

We select a task-specific feature subset by matching column names to a regular expression (e.g. targeting parametric measurement columns), ensuring the model never observes excluded test programs. Columns for which the fraction of missing entries exceeds a fixed threshold $r_{\text{na}}$ are removed. Any device that retains a missing value in the remaining features is dropped, keeping only complete cases.
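The three filtering steps can be sketched as follows. The column names, the pattern `^param_`, and the threshold value 0.5 are hypothetical; the paper does not disclose its actual regular expression or $r_{\text{na}}$.

```python
import re
import numpy as np

def select_features(names, X, pattern, r_na):
    # 1) Keep columns whose name matches the regular expression.
    keep = [j for j, name in enumerate(names) if re.search(pattern, name)]
    names, X = [names[j] for j in keep], X[:, keep]
    # 2) Drop columns whose missing fraction exceeds r_na.
    col_ok = np.isnan(X).mean(axis=0) <= r_na
    names, X = [n for n, ok in zip(names, col_ok) if ok], X[:, col_ok]
    # 3) Keep only devices with no missing value left (complete cases).
    row_ok = ~np.isnan(X).any(axis=1)
    return names, X[row_ok], row_ok

nan = np.nan
names = ["param_leak", "param_delay", "param_vmin", "func_bin"]   # hypothetical
X = np.array([[1.0, 2.0, nan, 0.0],
              [1.1, nan, nan, 0.0],
              [0.9, 2.1, nan, 1.0],
              [1.2, 1.9, 3.0, 0.0]])
kept, X_clean, row_ok = select_features(names, X, r"^param_", r_na=0.5)
```

Here `param_vmin` is removed because 75% of its entries are missing, and the second device is dropped because it still lacks `param_delay`.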

#### Normalization

Rather than applying global standardization, we reduce wafer- and lot-level offset via a within-wafer $z$-score: for every feature column and each (`lot_key`, `wf_key`) group, we subtract the group mean and divide by the group standard deviation, with a small floor on the denominator to avoid degeneracy. Spatial identifiers `die_x` and `die_y` are retained in their raw integer grid for the die positional embedding and are never concatenated to the normalized feature vector.
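A minimal sketch of the within-wafer $z$-score is given below. The floor value `1e-6` and the synthetic data are assumptions; only the grouping by (`lot_key`, `wf_key`) and the floored denominator come from the text.

```python
import numpy as np

def within_wafer_zscore(X, lot_key, wf_key, floor=1e-6):
    # Standardize every feature separately inside each (lot, wafer) group.
    X = np.asarray(X, dtype=float)
    out = np.empty_like(X)
    groups = {}
    for i, key in enumerate(zip(lot_key, wf_key)):
        groups.setdefault(key, []).append(i)
    for idx in groups.values():
        block = X[idx]
        sd = np.maximum(block.std(axis=0), floor)   # floor avoids division by zero
        out[idx] = (block - block.mean(axis=0)) / sd
    return out

rng = np.random.default_rng(0)
lot_key = ["A"] * 100
wf_key = [1] * 50 + [2] * 50
X = rng.standard_normal((100, 3))
X[50:] += 10.0           # wafer 2 carries a large systematic offset
X[:, 2] = 1.0            # a constant column exercises the floor
Z = within_wafer_zscore(X, lot_key, wf_key)
```

After normalization the offset between the two wafers disappears, and the constant column maps to zeros instead of producing NaNs.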

#### Train/test split

We follow the standard unsupervised anomaly detection protocol. The training set contains only healthy (label-normal) devices. The test set pools the remaining healthy devices with all anomalous devices. Normal devices are split equally between training and test (50% each) using a fixed random seed, yielding the sizes reported in Table [II](https://arxiv.org/html/2605.26468#S4.T2). As shown, both datasets exhibit extreme class imbalance, with anomaly rates of 0.22% and 0.12% for Dataset 1 and Dataset 2 respectively, reflecting the rare-defect regime characteristic of mature automotive semiconductor processes.
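The split protocol can be sketched with the Dataset 1 counts from Table II (6,255 samples, 7 anomalies). The seed value and the function name are assumptions. Under this protocol the test-set anomaly rate works out to 7 / 3,131, about 0.22%, which is consistent with the rate listed in the table.

```python
import numpy as np

def split_unsupervised(labels, seed=0):
    # Train on half of the normal devices; test on the other half plus all anomalies.
    labels = np.asarray(labels)
    normal = np.flatnonzero(labels == 0)
    anomalous = np.flatnonzero(labels == 1)
    perm = np.random.default_rng(seed).permutation(normal)
    half = len(normal) // 2
    return perm[:half], np.concatenate([perm[half:], anomalous])

labels = np.zeros(6255, dtype=int)      # Dataset 1 size from Table II
labels[:7] = 1                          # 7 anomalous devices
train_idx, test_idx = split_unsupervised(labels)
test_rate = labels[test_idx].mean()     # 7 / 3131
```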

### IV-B Evaluation Metrics

Given the extreme class imbalance of our datasets, we report three complementary metrics\.

#### AUROC

(Area Under the Receiver Operating Characteristic Curve) measures the probability that a randomly chosen anomalous device receives a higher anomaly score than a randomly chosen normal device, ranging from 0 to 1 with 0.5 indicating random performance. While widely used, AUROC can be overly optimistic under severe class imbalance [davis2006relationship].

#### AUCPR

(Area Under the Precision-Recall Curve) is more informative under extreme imbalance, as it focuses on the detector's ability to retrieve anomalies at high precision. A random classifier achieves an AUCPR equal to the anomaly rate ($\approx 0.22\%$ for Dataset 1 and $\approx 0.12\%$ for Dataset 2), providing a meaningful chance-level baseline for comparison.
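Both ranking metrics can be computed directly from scores and labels. The sketch below uses the pairwise definition of AUROC given above and estimates AUCPR by average precision; which AUCPR estimator the authors used is not stated, so that choice is an assumption.

```python
def auroc(scores, labels):
    """Probability that an anomaly outscores a normal device (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean of the precision values measured at the rank of each anomaly."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 0, 0, 0]
roc = auroc(scores, labels)
ap = average_precision(scores, labels)
```

The pairwise loop is quadratic and is meant for illustration only; a rank-based formula gives the same AUROC on large test sets.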

#### Recall@95% Yield

counts how many confirmed anomalous devices are ranked above the decision threshold, which is set such that 95% of normal test devices pass (i.e. 5% are screened out). This yield-constrained operating point directly reflects production economics: tightening the threshold to increase recall comes at the cost of yield loss, and a detector that fails to surface any confirmed defect at this operating point provides no actionable value regardless of its AUROC or AUCPR.
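A small sketch of the yield-constrained operating point: the threshold is placed at the score of the highest-ranked normal device that must still pass, and anomalies strictly above it are counted. How ties at the threshold are handled is an assumption here.

```python
import math


def recalled_at_yield(scores, labels, yield_frac=0.95):
    """Count anomalies scored above the threshold that passes `yield_frac` of normals."""
    normal = sorted(s for s, y in zip(scores, labels) if y == 0)
    # Rounding guards against floating-point noise such as 7.000000000000001.
    n_pass = math.ceil(round(yield_frac * len(normal), 9))
    threshold = normal[n_pass - 1]  # largest normal score that still passes
    return sum(1 for s, y in zip(scores, labels) if y == 1 and s > threshold)


# Toy example: 100 normal devices and 3 anomalies.
scores = [float(i) for i in range(100)] + [97.5, 50.0, 120.0]
labels = [0] * 100 + [1] * 3
recalled = recalled_at_yield(scores, labels)
```

Here the threshold lands at 94.0, five normal devices are screened out, and two of the three anomalies are recalled.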

### IV-C Baselines

We compare against a comprehensive set of unsupervised anomaly detection baselines spanning classical, deep learning, and diffusion-based methods. Classical methods include Isolation Forest (IForest) [liu2008iforest], One-Class SVM (OCSVM) [scholkopf2001ocsvm], COPOD [li2020copod], ECOD [li2022ecod], Feature Bagging [lazarevic2005feature], HBOS [goldstein2012hbos], KNN [ramaswamy2000knn], LODA [pevny2016loda], LOF [breunig2000lof], MCD [hardin2004mcd], and PCA reconstruction error. Deep learning-based approaches include DAGMM [zong2018dagmm], DROCC [goyal2020drocc], GOAD [bergman2020goad], ICL [shenkar2022icl], PlanarFlow [rezende2015normalizing], GANomaly [akcay2019ganomaly], and SLAD [xu2023fascinating]. Diffusion-based baselines include TabDDPM [kotelnikov2023tabddpm] and two variants of Diffusion Time Estimation, DTE-IG and DTE-C [livernoche2024dte], which use inverse-gamma and categorical distributions respectively to estimate diffusion time as an anomaly score. All baselines are implemented using PyOD [zhao2019pyod] or their respective official codebases, with hyperparameters set to their published defaults.

### IV-D Implementation Details

The MLP encoder projects $F$-dimensional input features through a hidden layer of dimension 128 with SiLU activation to a bottleneck of $D_r = 128$, followed by parameter-free LayerNorm. The bottleneck is reshaped into a token sequence of shape $C \times L$ with $C = 4$ channels and $L = \lceil 128/4 \rceil = 32$ positions. The DiT1D denoiser uses a Conv1d patch embedding with patch size $p = 2$, yielding $\lceil 32/2 \rceil = 16$ transformer tokens, and applies $T = 3$ adaLN-Zero transformer blocks with hidden dimension $d = 256$ and 4 attention heads. A symmetric MLP decoder (hidden dim 128, SiLU) maps the bottleneck back to $F$ dimensions and is used only during the autoencoder pretraining phase. The die positional embedding uses the gated MLP variant with sinusoidal coordinate features of dimension 64 and a hidden size of 256. The total parameter count of the full model (encoder + DiT1D + decoder) is approximately 4M.
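The shape bookkeeping in this paragraph can be checked with a short sketch. The row-major fill order and the zero padding used when the bottleneck size is not divisible by the channel count are assumptions; the text only gives the ceiling formulas.

```python
import math


def to_token_grid(z, channels=4):
    """Reshape a flat bottleneck vector into `channels` rows of equal length."""
    length = math.ceil(len(z) / channels)
    # Zero-pad so the vector fills the grid exactly (assumed padding rule).
    padded = list(z) + [0.0] * (channels * length - len(z))
    return [padded[c * length:(c + 1) * length] for c in range(channels)]


def num_patch_tokens(length, patch):
    """Number of transformer tokens after patchifying a sequence of `length`."""
    return math.ceil(length / patch)


z = [float(i) for i in range(128)]  # stand-in for a D_r = 128 bottleneck
grid = to_token_grid(z, channels=4)
tokens = num_patch_tokens(len(grid[0]), patch=2)
```

With $D_r = 128$ and $C = 4$ no padding is needed, and the patch size of 2 gives the 16 tokens quoted above.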

Training proceeds in two phases, following a latent-diffusion-style schedule. In the first $K = 50$ epochs, only the MLP encoder and decoder are trained on the MSE reconstruction loss, with the DiT1D denoiser frozen. In the subsequent 200 epochs, the encoder is frozen and the DiT1D is trained on the DDPM simplified noise-prediction objective with $T = 1{,}000$ diffusion steps and a cosine noise schedule [nichol2021improved]. Both phases use the AdamW optimizer [loshchilov2017decoupled] with learning rate $10^{-4}$ and weight decay $5 \times 10^{-4}$. All experiments use a batch size of 2,048 and a fixed random seed of 42 for reproducibility.

Anomaly scores are computed via diffusion loss scoring (Eq. [13](https://arxiv.org/html/2605.26468#S3.E13)) over a uniform grid of timesteps $\mathcal{T}_{\text{eval}} = \{100, 150, \ldots, 550\}$, totaling 10 forward passes through the denoiser per device. The die positional embedding is applied at both training and inference time using the raw integer grid coordinates (`die_x`, `die_y`) recorded for each device. All experiments are run on a single NVIDIA A100 80GB GPU.
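The scoring loop can be sketched without a trained model. The code below assumes the cosine schedule of [nichol2021improved] with its usual offset `s = 0.008`, one fresh noise draw per timestep, and a per-dimension mean of the squared error; Eq. 13 is not reproduced in this section, so these details are assumptions. The denoiser is a hypothetical stand-in that is exact only when the clean latent is all zeros.

```python
import math
import random


def alpha_bar(t, T=1000, s=0.008):
    """Cumulative signal level of the cosine noise schedule at step t."""
    def f(u):
        return math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)


def diffusion_score(z0, denoiser, timesteps, rng):
    """Noise-prediction error averaged over dimensions and over the timestep grid."""
    total = 0.0
    for t in timesteps:
        a = alpha_bar(t)
        eps = [rng.gauss(0.0, 1.0) for _ in z0]
        # Forward noising: z_t = sqrt(a) * z_0 + sqrt(1 - a) * eps
        zt = [math.sqrt(a) * z + math.sqrt(1 - a) * e for z, e in zip(z0, eps)]
        pred = denoiser(zt, t)
        total += sum((e - p) ** 2 for e, p in zip(eps, pred)) / len(z0)
    return total / len(timesteps)


def toy_denoiser(zt, t):
    # Hypothetical stand-in: recovers the noise exactly only if z_0 == 0.
    return [z / math.sqrt(1 - alpha_bar(t)) for z in zt]


T_EVAL = list(range(100, 551, 50))  # {100, 150, ..., 550}
rng = random.Random(0)
score_typical = diffusion_score([0.0] * 8, toy_denoiser, T_EVAL, rng)
score_shifted = diffusion_score([3.0] * 8, toy_denoiser, T_EVAL, rng)
```

A latent that matches what the denoiser expects scores near zero, while a shifted latent produces a large noise-prediction error, which is the behavior the anomaly score relies on.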

Table III: Anomaly detection performance on IC parametric test Dataset 1. Recalled: number of confirmed anomalies detected at 95% yield. Best result per metric is in **bold**.

| Category | Method | AUROC ↑ | AUCPR ↑ | Recalled ↑ |
| --- | --- | --- | --- | --- |
| Classical | IForest [liu2008iforest] | 0.550 | 0.0033 | 0 |
| Classical | OCSVM [scholkopf2001ocsvm] | 0.631 | 0.0034 | 0 |
| Classical | COPOD [li2020copod] | 0.508 | 0.0027 | 0 |
| Classical | ECOD [li2022ecod] | 0.464 | 0.0024 | 0 |
| Classical | FeatureBagging [lazarevic2005feature] | 0.699 | 0.0053 | 1 |
| Classical | HBOS [goldstein2012hbos] | 0.474 | 0.0025 | 0 |
| Classical | KNN [ramaswamy2000knn] | 0.610 | 0.0036 | 0 |
| Classical | LODA [pevny2016loda] | 0.446 | 0.0025 | 0 |
| Classical | LOF [breunig2000lof] | 0.716 | 0.0056 | 1 |
| Classical | MCD [hardin2004mcd] | 0.489 | 0.0024 | 0 |
| Classical | PCA | 0.444 | 0.0023 | 0 |
| Deep Learning | DAGMM [zong2018dagmm] | 0.452 | 0.0024 | 0 |
| Deep Learning | DROCC [goyal2020drocc] | 0.670 | 0.0044 | 0 |
| Deep Learning | GOAD [bergman2020goad] | 0.601 | 0.0032 | 0 |
| Deep Learning | ICL [shenkar2022icl] | 0.673 | 0.0041 | 1 |
| Deep Learning | PlanarFlow [rezende2015normalizing] | 0.467 | 0.0026 | 0 |
| Deep Learning | GANomaly [akcay2019ganomaly] | 0.645 | 0.0059 | 1 |
| Deep Learning | SLAD [xu2023fascinating] | 0.552 | 0.0029 | 0 |
| Diffusion-based | TabDDPM [kotelnikov2023tabddpm] | 0.467 | 0.0025 | 0 |
| Diffusion-based | DTE-IG [livernoche2024dte] | 0.563 | 0.0062 | 1 |
| Diffusion-based | DTE-C [livernoche2024dte] | 0.578 | 0.0031 | 0 |
| Diffusion-based | DiT1D (Ours) | **0.771** | **0.0250** | **3** |

Table IV: Anomaly detection performance on Dataset 2. Recalled: number of confirmed anomalies detected at 95% yield. Best result per metric is in **bold**.

| Category | Method | AUROC ↑ | AUCPR ↑ | Recalled ↑ |
| --- | --- | --- | --- | --- |
| Classical | IForest [liu2008iforest] | 0.546 | 0.0026 | 4 |
| Classical | OCSVM [scholkopf2001ocsvm] | 0.595 | 0.0020 | 0 |
| Classical | COPOD [li2020copod] | 0.525 | 0.0020 | 4 |
| Classical | ECOD [li2022ecod] | 0.562 | 0.0023 | 4 |
| Classical | FeatureBagging [lazarevic2005feature] | 0.629 | 0.0026 | 3 |
| Classical | HBOS [goldstein2012hbos] | 0.538 | 0.0025 | 5 |
| Classical | KNN [ramaswamy2000knn] | 0.588 | 0.0026 | 5 |
| Classical | LODA [pevny2016loda] | 0.548 | 0.0021 | 6 |
| Classical | LOF [breunig2000lof] | 0.625 | 0.0024 | 3 |
| Classical | MCD [hardin2004mcd] | 0.530 | 0.0015 | 3 |
| Classical | PCA | 0.558 | 0.0021 | 4 |
| Deep Learning | DAGMM [zong2018dagmm] | 0.425 | 0.0011 | 2 |
| Deep Learning | DROCC [goyal2020drocc] | 0.431 | 0.0010 | 1 |
| Deep Learning | GOAD [bergman2020goad] | 0.399 | 0.0010 | 1 |
| Deep Learning | PlanarFlow [rezende2015normalizing] | 0.580 | 0.0018 | 5 |
| Deep Learning | GANomaly [akcay2019ganomaly] | 0.426 | 0.0011 | 2 |
| Deep Learning | SLAD [xu2023fascinating] | 0.463 | 0.0012 | 2 |
| Deep Learning | DIF | 0.415 | 0.0010 | 1 |
| Diffusion-based | TabDDPM [kotelnikov2023tabddpm] | 0.523 | 0.0015 | 2 |
| Diffusion-based | DTE-IG [livernoche2024dte] | 0.545 | **0.0055** | 5 |
| Diffusion-based | DTE-C [livernoche2024dte] | 0.503 | 0.0012 | 2 |
| Diffusion-based | DiT1D (Ours) | **0.639** | 0.0023 | **7** |

### IV-E Main Results

Tables [III](https://arxiv.org/html/2605.26468#S4.T3) and [IV](https://arxiv.org/html/2605.26468#S4.T4) report anomaly detection performance on Dataset 1 and Dataset 2 respectively. Across both datasets, DiT1D achieves the highest AUROC and Recall@95% Yield, demonstrating consistent gains over all classical, deep learning, and diffusion-based baselines.

#### Dataset 1

Most classical and deep learning methods fail to surface a single anomaly at the 95% yield operating point, and their AUROC values range from 0.44 to 0.72, with many close to chance. DiT1D achieves an AUROC of 0.771, ahead of the second-best method, LOF (0.716), and an AUCPR of 0.0250, about four times the second-best score of 0.0062 (DTE-IG). Critically, DiT1D recalls 3 out of 7 confirmed anomalies at the 95% yield threshold, while every competing method surfaces at most one. This gap in AUCPR and recall suggests that DiT1D concentrates its highest anomaly scores on true positives, rather than merely achieving a globally favorable ranking.

#### Dataset 2

Dataset 2 is substantially larger than Dataset 1. Classical methods become more competitive in terms of recall, with LODA recalling 6 anomalies and HBOS recalling 5, while most deep learning methods degrade significantly (AUROC 0.40–0.58). DiT1D achieves the highest AUROC (0.639) and is the only method to recall all 7 anomalies present in the test split at the 95% yield operating point, suggesting that the latent-space diffusion scoring is more robust to the increased feature dimensionality than both classical density estimators and raw-space deep learning baselines. While DTE-IG achieves the highest AUCPR (0.0055) on this dataset, its recall of 5 shows that a favorable precision-recall curve does not guarantee recovery of the rarest anomalies at a fixed yield constraint.

Across both datasets, vanilla TabDDPM trails DiT1D and DTE-IG on every metric, which suggests that diffusing in raw feature space without structural inductive bias is insufficient for high-dimensional IC test data. The advantage of DiT1D over TabDDPM and the DTE variants in AUROC and recall supports our two key design choices: learning a compact latent representation before diffusion, and using a transformer denoiser that captures inter-token correlations across the measurement sequence.

Table V: Ablation study on Dataset 1. Results are shown as the decrease (↓) from DiT1D (Ours). A larger decrease indicates a more important component.

| Method | ΔAUROC ↓ | ΔAUCPR ↓ | ΔRecalled ↓ |
| --- | --- | --- | --- |
| w/o die-level PE | $-0.234$ | $-0.018$ | $-1$ |
| w/o AutoEncoder | $-0.124$ | $-0.021$ | $-3$ |
| w/ TabDDPM | $-0.079$ | $-0.015$ | $-2$ |

### IV-F Ablation Study

We ablate the three principal design choices of DiT1D on Dataset 1, reporting performance as the decrease from the full model. Table [V](https://arxiv.org/html/2605.26468#S4.T5) summarizes the results.

#### Die\-level positional embedding

Removing the gated die PE causes the largest drop in AUROC ($-0.234$), confirming that wafer-spatial variation is a strong confounding factor in high-dimensional IC test data. Without this embedding, the model cannot distinguish systematic edge-versus-center measurement shifts from genuine distributional anomalies, inflating the false positive rate and suppressing recall.

#### Latent\-space autoencoder

Removing the MLP encoder and performing diffusion directly in the raw feature space leads to the largest drop in both AUCPR ($-0.021$) and Recalled ($-3$), reducing recall to zero at the 95% yield threshold. This confirms that compressing $F \sim 10^{3}$ features into a structured latent representation before diffusion is critical: raw-space diffusion is poorly conditioned in the high-dimensional IC test regime.

#### Transformer vs\. MLP denoiser

Replacing the DiT1D denoiser with a TabDDPM-style MLP backbone (while retaining the latent-space encoder) reduces AUROC by 0.079 and Recalled by 2. This isolates the contribution of the transformer's self-attention mechanism: capturing inter-token correlations across the latent sequence provides a consistent benefit over a position-agnostic MLP operating on the same compressed representation.
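Since the ablation results are reported as differences, it can help to convert them back to absolute values. The sketch below does that arithmetic from the Dataset 1 numbers quoted in this section; the implied absolute values inherit the rounding of the published deltas.

```python
# Full-model results on Dataset 1 and the reported ablation deltas.
FULL = {"auroc": 0.771, "aucpr": 0.025, "recalled": 3}
DELTAS = {
    "w/o die-level PE": {"auroc": -0.234, "aucpr": -0.018, "recalled": -1},
    "w/o AutoEncoder": {"auroc": -0.124, "aucpr": -0.021, "recalled": -3},
    "w/ TabDDPM": {"auroc": -0.079, "aucpr": -0.015, "recalled": -2},
}


def absolute_scores(full, deltas):
    """Add each delta to the full-model score, rounded to three decimals."""
    return {
        name: {k: round(full[k] + d[k], 3) for k in full}
        for name, d in deltas.items()
    }


implied = absolute_scores(FULL, DELTAS)
for name, vals in implied.items():
    print(f"{name:18s} AUROC={vals['auroc']:.3f} "
          f"AUCPR={vals['aucpr']:.3f} Recalled={vals['recalled']}")
```

The implied recall of zero for the variant without the autoencoder agrees with the statement above that recall drops to zero at the 95% yield threshold.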

### IV-G Effect of Model Depth

Table [VI](https://arxiv.org/html/2605.26468#S4.T6) reports performance as a function of the number of DiT1D transformer blocks. A single block (depth 1) achieves an AUROC of 0.700 but fails to recall any anomaly at the 95% yield threshold, suggesting insufficient capacity to model the joint distribution of healthy silicon. Performance peaks at depth 3 (AUROC 0.771, AUCPR 0.025, Recalled 3) and degrades with a fourth block (AUROC 0.754, AUCPR 0.009, Recalled 1), indicating mild overfitting of the denoiser to the training distribution. We therefore adopt depth 3 as our default configuration, which balances model capacity against the dataset scale of $N \sim 3{,}000$ healthy training devices.

Table VI: Effect of the number of TabDiT blocks on Dataset 1. Our chosen configuration (3 blocks) is highlighted in **bold**.

| # TabDiT Blocks | AUROC ↑ | AUCPR ↑ | Recalled ↑ |
| --- | --- | --- | --- |
| 1 block | 0.700 | 0.012 | 0 |
| 2 blocks | 0.769 | 0.013 | 2 |
| **3 blocks** | **0.771** | **0.025** | **3** |
| 4 blocks | 0.754 | 0.009 | 1 |

### IV-H Effect of Patch Size

The patch size $p$ of the Conv1d stem controls the granularity of the token sequence fed to the transformer: a smaller patch yields more tokens and finer-grained attention, at the cost of a longer sequence. Given a latent sequence of length $L = \lceil D_r / C \rceil = 32$ (with $D_r = 128$, $C = 4$), the number of transformer tokens is $\lceil L/p \rceil$.

Table [VII](https://arxiv.org/html/2605.26468#S4.T7) reports results for patch sizes $p \in \{2, 4, 8\}$, corresponding to 16, 8, and 4 tokens respectively. Performance degrades sharply as patch size increases: $p = 4$ reduces AUROC to 0.719 and Recalled to 2, while $p = 8$ collapses AUROC to 0.343 and fails to recall any anomaly. This trend indicates that coarser tokenization discards fine-grained latent structure that is informative for anomaly detection, and that the self-attention mechanism benefits from a longer sequence to model dependencies across the compressed measurement representation. We adopt $p = 2$ as the default.

Table VII: Effect of patch size on Dataset 1. Tokens denotes the number of transformer input tokens after Conv1d patchification of the $L = 32$ latent sequence. Our default ($p = 2$) is highlighted in **bold**.

| Patch size $p$ | Tokens | AUROC ↑ | AUCPR ↑ | Recalled ↑ |
| --- | --- | --- | --- | --- |
| **2** | **16** | **0.771** | **0.025** | **3** |
| 4 | 8 | 0.719 | 0.012 | 2 |
| 8 | 4 | 0.343 | 0.011 | 0 |

## V Conclusion and Discussion

We presented Diffuse to Detect, a fully unsupervised anomaly detection framework for high\-dimensional IC test data\. Raw parametric measurements are compressed by an autoencoder and reshaped into a structured latent token sequence, which a 1D Diffusion Transformer learns to denoise using healthy\-silicon data exclusively\. Two\-level positional encodings capture both the semantic ordering of the test flow and the spatial geometry of the wafer, while dual scoring modes support both high\-throughput production screening and expert failure analysis\. Experiments on an industrial dataset with extreme class imbalance demonstrate competitive detection performance against classical and deep learning baselines, without any labeled defects or manual feature engineering\. Latent\-space reconstruction residuals provide interpretable anomaly localization that translates directly into actionable root\-cause information for test engineers\.

Future work can explore how to apply our framework to root-cause failure analysis of anomalous parametric features or test programs. For instance, we can employ a DDPM-style reconstruction score [wyatt2022anodddpm]. Starting from the encoded latent $\mathbf{Z}_0$ without adding forward noise, we run $T_{\text{rec}}$ steps of the learned reverse process:

$$
\mathbf{Z}_{t-1} = \bm{\mu}_{\theta}(\mathbf{Z}_t, t) + \sqrt{\tilde{\beta}_t}\,\bm{\xi}, \quad \bm{\xi} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), \tag{14}
$$

yielding a reconstruction $\hat{\mathbf{Z}}_0$. The reconstructed latent is then decoded back to the original measurement space via the decoder $g_{\phi}$:

$$
\hat{\mathbf{x}} = g_{\phi}\!\left(\text{flatten}(\hat{\mathbf{Z}}_0)\right) \in \mathbb{R}^{F}, \tag{15}
$$

where flatten collapses the $C \times L$ latent tensor to $\mathbb{R}^{D_r}$ before decoding. The per-feature reconstruction residual is then computed directly in the original measurement space:

$$
r_f = \left(x_f - \hat{x}_f\right)^{2}, \quad f = 1, \ldots, F, \tag{16}
$$

yielding a residual vector $\mathbf{r} = [r_1, \ldots, r_F] \in \mathbb{R}^{F}$ that assigns a deviation score to every individual test measurement.

Since features are produced by a structured sequence of $P$ test programs $\{\mathcal{T}_1, \ldots, \mathcal{T}_P\}$, each generating a contiguous block of $f_p$ features, the per-feature residuals can be aggregated to a per-program anomaly score:

$$
s_p = \frac{1}{f_p} \sum_{f \in \mathcal{T}_p} r_f, \quad p = 1, \ldots, P, \tag{17}
$$

producing a compact anomaly profile $\mathbf{s} = [s_1, \ldots, s_P] \in \mathbb{R}^{P}$ indexed by test program. This profile can be visualized as a heatmap over the test flow, directly identifying which test programs deviate most strongly from the healthy-silicon distribution. A device with elevated $s_p$ for programs associated with leakage measurement, for example, points toward a specific failure mechanism without requiring any further analysis. Crucially, all quantities are computed in the native units of the original test measurements, making the output directly interpretable by test engineers without knowledge of the latent representation.
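Equations (16) and (17) amount to a squared difference followed by a block-wise mean. The sketch below assumes the reconstruction $\hat{\mathbf{x}}$ is already available and that each program's features occupy a contiguous block described by a hypothetical list `block_sizes`.

```python
def feature_residuals(x, x_hat):
    """Eq. (16): squared reconstruction error for every measurement."""
    return [(a - b) ** 2 for a, b in zip(x, x_hat)]


def program_scores(residuals, block_sizes):
    """Eq. (17): mean residual over each contiguous block of f_p features."""
    if sum(block_sizes) != len(residuals):
        raise ValueError("block sizes must cover every feature exactly once")
    scores, start = [], 0
    for f_p in block_sizes:
        block = residuals[start:start + f_p]
        scores.append(sum(block) / f_p)
        start += f_p
    return scores


# Toy device with 5 features produced by two programs (2 and 3 features).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
x_hat = [1.0, 2.0, 1.0, 4.0, 4.0]
r = feature_residuals(x, x_hat)
s = program_scores(r, block_sizes=[2, 3])
```

In this toy case the first program reconstructs perfectly and the second carries all of the deviation, which is the kind of per-program contrast the heatmap is meant to expose.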

## Acknowledgment

This work is supported by an NXP Long Term University \(LTU\) grant and the National Science Foundation under Grant No\. 1956313\.

## References