LithoDreamer: A Physics-Informed World Model for Multi-Stage Computational Lithography

arXiv cs.AI 06/26/26, 04:00 AM Papers
computational-lithography world-model physics-informed semiconductor machine-learning icml
Summary
LithoDreamer is the first physics-informed World Model framework for computational lithography, modeling the multi-stage lithography process as a decision-driven system. It achieves state-of-the-art performance in forward evolution and inverse planning for semiconductor manufacturing.
arXiv:2606.26713v1 Announce Type: new
# LithoDreamer: A Physics-Informed World Model for Multi-Stage Computational Lithography
Source: [https://arxiv.org/html/2606.26713](https://arxiv.org/html/2606.26713)
Yumeng Liu, Zimu Li, Jinyuan Deng, Qian Jin, Yucheng Cui, Yu Li, Xunzhao Yin, Qi Sun, Cheng Zhuo

###### Abstract

As semiconductor technology nodes scale, computational lithography is essential for ensuring yield and performance. However, lithography is a continuous physical process involving mask optimization, optical imaging, resist exposure, and development, which existing models fail to capture. To overcome this limitation, we present LithoDreamer, the first physics-informed World Model (WM) framework for computational lithography, which formulates the “Layout-Mask-Resist Image-After Development Image (ADI)” pipeline as a decision-driven multi-step evolution system. LithoDreamer captures feature changes between adjacent states to model stage-specific physics-informed latent spaces, in which it controls process intervention exploration and drives subsequent state transitions. To achieve interpretable intervention optimization without continuous supervision, we propose a contrastive variational optimization paradigm that contrasts the latent differences between intervention paths with variational evolution constraints, guiding the model to generate evolutions consistent with real lithography physics. Experiments show LithoDreamer achieves state-of-the-art performance in forward evolution and inverse planning. [Our lithography dataset is publicly available at GitHub](https://github.com/7jiangyq/lithodreamer.git).

Machine Learning, ICML

## 1 Introduction

As technology nodes continue to advance, lithography has become the core bottleneck restricting chip manufacturing yield. Computational lithography simulates the practical manufacturing process through analytical modeling of mask design, optical imaging, and photoresist reactions, and plays a key role in addressing imaging complexities and manufacturing constraints (Yang and Ren, [2023](https://arxiv.org/html/2606.26713#bib.bib27); Jin et al., [2025b](https://arxiv.org/html/2606.26713#bib.bib21); Jiang et al., [2026](https://arxiv.org/html/2606.26713#bib.bib31)). However, as shown in [Figure 1](https://arxiv.org/html/2606.26713#S1.F1), the lithography process involves multiple physical principles and is influenced by layout structures, mask geometries, and manufacturing process variations. These factors make modeling and optimizing the multi-stage physical evolution (i.e., “Layout-Mask-Resist Image-After Development Image (ADI)”) highly challenging.

![Refer to caption](https://arxiv.org/html/2606.26713v1/x1.png)

Figure 1: Comparison of the different processes: (a) typical commercial simulation workflow; (b) actual physical lithography manufacturing process; (c) the evolution workflow of LithoDreamer’s process intervention and lithography state.

Traditional Optical Proximity Correction (OPC) and Inverse Lithography Technology (ILT) methods (Chen et al., [2025](https://arxiv.org/html/2606.26713#bib.bib11); Yang et al., [2025](https://arxiv.org/html/2606.26713#bib.bib28); Jiang et al., [2024](https://arxiv.org/html/2606.26713#bib.bib10)) treat the lithography process as a goal-driven numerical optimization problem, gradually modifying the mask in an explicit variable space by repeatedly invoking physical simulators to meet imaging and process constraints. These methods rely on complex physical and optical computations whose cost increases rapidly with problem scale, which makes efficient optimization of large designs difficult. In recent years, machine learning methods (Jin et al., [2025a](https://arxiv.org/html/2606.26713#bib.bib17); Wang et al., [2025](https://arxiv.org/html/2606.26713#bib.bib18); Jiang et al., [2025](https://arxiv.org/html/2606.26713#bib.bib29); Jin et al., [2025c](https://arxiv.org/html/2606.26713#bib.bib14)) have been proposed, which fit large amounts of data to learn point-to-point mappings from layout and process conditions to mask, resist image, and ADI representations. However, such methods treat the lithography process as a static prediction and cannot explicitly model the continuous process interventions that drive lithography state evolution in real-world environments. Therefore, they struggle to address the need for continuous process adjustments in multi-stage optimization scenarios.

Recently, World Models (WMs) (Zhao et al., [2025](https://arxiv.org/html/2606.26713#bib.bib20); Bar et al., [2025](https://arxiv.org/html/2606.26713#bib.bib19); Alonso et al., [2024](https://arxiv.org/html/2606.26713#bib.bib23); Huang et al., [2025](https://arxiv.org/html/2606.26713#bib.bib25)) have been dedicated to learning long-term decision-making and planning in dynamic scenarios. By simulating the interaction between agents and the environment, WMs can capture the trajectories of state changes under continuous action interventions. This ability has proven effective in autonomous driving and embodied intelligence. Similarly, the lithography process requires a series of process interventions to adjust the lithographic environment and characterize how these interventions gradually affect the states of the mask, resist image, and ADI. These representations provided by WMs offer an efficient modeling framework for lithographic process planning and multi-stage state evolution.

However, directly applying existing WM frameworks to computational lithography still faces fundamental challenges\. First, lithography involves multiple stages with distinct physical properties, making it difficult to capture stage\-dependent dynamics within a single latent model\. Second, while process conditions such as dose, focus, source, and threshold are observable, the fine\-grained interventions that drive continuous pattern evolution are typically unobserved and lack direct supervision, hindering the learning of intervention\-conditioned dynamics\.

To overcome these challenges, we propose LithoDreamer, the first WM for computational lithography, designed to unify the multi-stage “Layout-Mask-Resist Image-ADI” pipeline into a physics-inspired continuous evolution system. Our contributions are as follows:

- We propose the first physics-informed lithography WM, LithoDreamer, which treats the lithography pipeline as a multi-stage physical evolution system, enabling lithography to be modeled as a causal and decision-driven process rather than a static forward prediction.
- We design a Space Prior Approximation (SPA) method that characterizes stage-specific physical latent spaces from statistical state variations, constraining interventions along physically consistent evolution directions.
- We introduce a contrastive variational optimization paradigm that applies variational inference and contrastive learning to jointly explore the optimal solutions for process intervention planning and state transitions without discrete actions.
- We construct a dataset of 280K paired samples. LithoDreamer achieves EPEs of 0.96/1.06 in forward evolution and 0.89/0.97 in inverse planning on in-domain (ID)/out-of-domain (OOD) data, demonstrating superior performance and generalization.

## 2 Preliminaries

### 2.1 Lithography and Dataset

We study optical lithography, where printed patterns are governed by process-dependent variations. The lithography process is parameterized by exposure dose (exposure energy), focus (defocus offset), threshold (photoresist development threshold), and source (illumination source distribution), which jointly define a multi-dimensional process condition. Under each process condition, pattern formation follows a structured multi-stage pipeline, i.e., “Layout-Mask-Resist Image-ADI”. Within each stage, the lithography state undergoes a continuous latent evolution driven by process-dependent physical effects, which we model as multi-step state evolutions toward stage-consistent outcomes.

Each data sample corresponds to a specific layout under a specific process-parameter configuration (i.e., a combination of dose, focus, threshold, and source). The dataset comprises 280K paired samples derived from industrial-grade commercial lithography data collected from a 55 nm manufacturing line, encompassing diverse layouts under varied configurations. For each sample, a local region is cropped from the global layout with a fixed physical field-of-view of 6000 nm × 6000 nm and rasterized into 512 × 512 pixels.
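For orientation, the stated field of view and raster size imply a pixel pitch of roughly 11.72 nm. The short sketch below only performs that arithmetic; the pitch is inferred from the two numbers in the text rather than quoted by the authors, and the helper names `px_to_nm` and `nm_to_px` are introduced here for illustration.

```python
# Field of view and raster size as stated in the dataset description.
FOV_NM = 6000.0   # physical side length of one cropped region, in nm
RASTER_PX = 512   # side length of the rasterized image, in pixels

# Implied pixel pitch (an inference from the two values above, not a quoted figure).
NM_PER_PX = FOV_NM / RASTER_PX


def px_to_nm(pixels: float) -> float:
    """Convert a length in pixels to nanometers under the implied pitch."""
    return pixels * NM_PER_PX


def nm_to_px(nanometers: float) -> float:
    """Convert a length in nanometers to pixels under the implied pitch."""
    return nanometers / NM_PER_PX


print(f"pixel pitch: {NM_PER_PX} nm/px")
print(f"1 nm corresponds to {nm_to_px(1.0):.4f} px")
```

Under this pitch, a 1 nm length is less than a tenth of a pixel, which is useful to keep in mind when reading nanometer-scale error figures against 512 × 512 rasters.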

![Refer to caption](https://arxiv.org/html/2606.26713v1/x2.png)

Figure 2: Overview of the LithoDreamer framework.

### 2.2 Related Work on Computational Lithography

Early computational lithography methods are mainly physics-based, with OPC and ILT as representative approaches. For example, Chen et al. ([2025](https://arxiv.org/html/2606.26713#bib.bib11)) combine optical imaging models with empirical rules to locally adjust mask geometries for proximity-effect compensation. L2O-ILT (Zhu et al., [2023](https://arxiv.org/html/2606.26713#bib.bib8)) formulates lithography as an inverse problem and optimizes mask structures in a continuous shape space. However, these methods rely on accurate physical modeling and computationally expensive simulations in industrial applications. Recent learning-based methods reduce such reliance by learning mappings between lithography representations. Unitho (Jin et al., [2025a](https://arxiv.org/html/2606.26713#bib.bib17)) uses Transformers to model multi-stage mappings among Layout, Mask, and Resist Image, while LMLitho (Wang et al., [2025](https://arxiv.org/html/2606.26713#bib.bib18)) introduces large models to better represent diverse process conditions. Nevertheless, existing methods mostly focus on isolated stages or single-step forward mappings, without explicitly modeling cross-stage state evolution under continuous process interventions. This limits their ability to capture lithography dynamics in a unified manner.

### 2.3 Applications of World Models

WMs learn latent environment dynamics from historical observations, enabling multi-step state prediction and planning. This paradigm has been widely used in embodied intelligence and autonomous driving to support long-horizon reasoning and efficient decision-making. For example, DriveDreamer-2 (Zhao et al., [2025](https://arxiv.org/html/2606.26713#bib.bib20)) combines LLMs with generative WMs for multi-view consistent autonomous driving modeling. NWM (Bar et al., [2025](https://arxiv.org/html/2606.26713#bib.bib19)) formulates world modeling as conditional video generation to synthesize future observations for path planning and trajectory evaluation. However, existing WMs struggle to adapt to the complex stage evolutions caused by continuous process interventions in computational lithography. We therefore design a WM framework for computational lithography that supports multi-stage, multi-step process modeling and state prediction in a unified manner.

## 3 Method

As shown in [Figure 2](https://arxiv.org/html/2606.26713#S2.F2), LithoDreamer consists of three components: stage-specific latent spaces for modeling lithography dynamics, the policy model for planning latent process interventions, and the transition model for predicting intervention-guided state evolution within and across stages.

### 3.1 Latent Spaces: Embedding of Physical Priors

The Layout-Mask, Mask-Resist Image, and Resist Image-ADI stages involve distinct process interventions that drive state transitions under stage-specific physical dynamics. To capture such evolution constraints, we propose Space Prior Approximation (SPA), which estimates a stage-specific basis matrix $\mathbf{B}^{(s)}$ to span the physics-informed latent space $\mathcal{S}^{(s)}$, thereby constraining process interventions along physically consistent and feasible evolution directions.

Formally, let a frozen encoder $E(\cdot)$ map the state $\mathbf{x}^{(s)}$ at stage $s$ to a latent representation $\mathbf{z}^{(s)}=E(\mathbf{x}^{(s)})\in\mathbb{R}^{d}$. For each adjacent-stage pair, we define the latent variation as:

$$\Delta\mathbf{z}_{i}^{(s)}=\mathbf{z}_{i}^{(s+1)}-\mathbf{z}_{i}^{(s)}, \tag{1}$$

where $i$ denotes the sample index. This vector represents the effective latent evolution direction governed by stage-specific physical dynamics. SPA estimates an orthonormal basis matrix $\mathbf{B}^{(s)}$ constructed from a set of $k$ basis vectors:

$$\mathbf{B}^{(s)}=[\mathbf{b}^{(s)}_{1},\ldots,\mathbf{b}^{(s)}_{k}]\in\mathbb{R}^{d\times k},\quad(\mathbf{B}^{(s)})^{\top}\mathbf{B}^{(s)}=\mathbf{I}_{k}, \tag{2}$$

where $k\ll d$, and $\mathbf{I}_{k}$ denotes the $k\times k$ identity matrix. Using these basis vectors, the stage-specific physics-informed latent space $\mathcal{S}^{(s)}$ is then defined as:

$$\mathcal{S}^{(s)}=\left\{\sum_{j=1}^{k}\alpha_{j}\,\mathbf{b}_{j}^{(s)}\;\middle|\;\alpha_{j}\in\mathbb{R}\right\}, \tag{3}$$

where $\alpha_{j}$ is the scalar coefficient associated with $\mathbf{b}_{j}^{(s)}$.

To identify $\mathbf{B}^{(s)}$, SPA requires each latent variation $\Delta\mathbf{z}_{i}^{(s)}$ to be well approximated by its projection onto $\mathcal{S}^{(s)}$. The orthogonal projection is:

$$\hat{\Delta\mathbf{z}}_{i}^{(s)}=\mathbf{B}^{(s)}(\mathbf{B}^{(s)})^{\top}\Delta\mathbf{z}_{i}^{(s)}. \tag{4}$$

Accordingly, $\mathbf{B}^{(s)}$ is estimated by minimizing the projection reconstruction error:

$$\min_{\mathbf{B}^{(s)}}\sum_{i=1}^{N_{s}}\left\|\Delta\mathbf{z}_{i}^{(s)}-\hat{\Delta\mathbf{z}}_{i}^{(s)}\right\|_{2}^{2},\quad\text{s.t.}\quad(\mathbf{B}^{(s)})^{\top}\mathbf{B}^{(s)}=\mathbf{I}_{k}, \tag{5}$$

where $N_{s}$ denotes the number of samples at stage $s$, set to 10k. Under the orthonormality constraint, $\mathbf{B}^{(s)}(\mathbf{B}^{(s)})^{\top}$ is a symmetric, idempotent projector, so the reconstruction error can be decomposed as:

$$\left\|\Delta\mathbf{z}_{i}^{(s)}-\hat{\Delta\mathbf{z}}_{i}^{(s)}\right\|_{2}^{2}=\left\|\Delta\mathbf{z}_{i}^{(s)}\right\|_{2}^{2}-\left\|\left(\mathbf{B}^{(s)}\right)^{\top}\Delta\mathbf{z}_{i}^{(s)}\right\|_{2}^{2}. \tag{6}$$

Since the first term is independent of $\mathbf{B}^{(s)}$, minimizing the reconstruction error is equivalent to maximizing the retained projection energy. Let $\tilde{\Delta\mathbf{z}}_{i}^{(s)}$ denote the mean-centered latent variation. We define the corresponding second-moment matrix as:

$$\mathbf{C}^{(s)}=\frac{1}{N_{s}}\sum_{i=1}^{N_{s}}\tilde{\Delta\mathbf{z}}_{i}^{(s)}\left(\tilde{\Delta\mathbf{z}}_{i}^{(s)}\right)^{\top}. \tag{7}$$

The above objective can then be written as maximizing $\operatorname{Tr}\!\left(\left(\mathbf{B}^{(s)}\right)^{\top}\mathbf{C}^{(s)}\mathbf{B}^{(s)}\right)$, whose maximizer is given by the top-$k$ eigenvectors of $\mathbf{C}^{(s)}$. These eigenvectors form $\mathbf{B}^{(s)}$ and span $\mathcal{S}^{(s)}$. Once estimated, $\mathbf{B}^{(s)}$ is fixed to constrain process interventions within physically consistent evolution directions.
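The SPA estimate described above amounts to a principal component analysis of the latent variations. The following is a minimal NumPy sketch under stated assumptions: the latents are synthetic toy data (not encoder outputs), the function name `estimate_spa_basis` is hypothetical, and the sizes `N`, `d`, `k` are arbitrary. It builds the basis from the top-$k$ eigenvectors of the second-moment matrix in Eq. (7) and numerically checks the orthonormality constraint and the decomposition in Eq. (6).

```python
import numpy as np


def estimate_spa_basis(z_curr: np.ndarray, z_next: np.ndarray, k: int) -> np.ndarray:
    """Return a (d, k) orthonormal basis built from the top-k eigenvectors of the
    second-moment matrix of mean-centered latent variations."""
    delta = z_next - z_curr                      # (N, d) latent variations, Eq. (1)
    delta_c = delta - delta.mean(axis=0)         # mean-centering
    cov = delta_c.T @ delta_c / delta.shape[0]   # (d, d) second-moment matrix, Eq. (7)
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
    top = np.argsort(eigvals)[::-1][:k]          # indices of the k largest eigenvalues
    return eigvecs[:, top]


rng = np.random.default_rng(0)
N, d, k = 1000, 16, 3

# Toy data: variations lie close to a random k-dimensional subspace plus small noise.
true_dirs = np.linalg.qr(rng.normal(size=(d, k)))[0]
z_s = rng.normal(size=(N, d))
z_s1 = z_s + rng.normal(size=(N, k)) @ true_dirs.T + 0.01 * rng.normal(size=(N, d))

B = estimate_spa_basis(z_s, z_s1, k)

# Centered variations and their projection residuals.
delta = (z_s1 - z_s) - (z_s1 - z_s).mean(axis=0)
residual = delta - delta @ B @ B.T
retained = 1.0 - (residual ** 2).sum() / (delta ** 2).sum()

# Per-sample check of Eq. (6): ||dz - B B^T dz||^2 = ||dz||^2 - ||B^T dz||^2.
lhs = (residual ** 2).sum(axis=1)
rhs = (delta ** 2).sum(axis=1) - ((delta @ B) ** 2).sum(axis=1)

print("orthonormal:", np.allclose(B.T @ B, np.eye(k)))
print("Eq. (6) holds:", np.allclose(lhs, rhs))
print("retained energy fraction:", float(retained))
```

Using `np.linalg.eigh` rather than a general eigensolver is a deliberate choice here, since the second-moment matrix is symmetric; a thin SVD of the centered variation matrix would give the same subspace.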

### 3.2 Policy Model: Planning of Process Interventions

Given the current latent state and process-window parameters at stage $s$, the policy model plans latent process interventions within the physics-informed latent space $\mathcal{S}^{(s)}$. Rather than predicting a deterministic intervention, we adopt a stochastic policy that samples interventions from $\mathcal{S}^{(s)}$, thereby enabling exploration and uncertainty representation.

#### 3.2.1 Distributional Modeling

Process interventions in computational lithography are underdetermined, as multiple intervention strategies may lead to similar outcomes under identical conditions. Accordingly, we model process interventions as random variables and learn their conditional distributions. At evolution step $t$ of stage $s$, the policy model $\pi_{\theta}$ takes the current latent state $\mathbf{z}^{(s)}_{t}$ and process-window parameters $\mathbf{p}$ as input. A Vision Transformer (ViT) extracts condition-dependent features, which are subsequently fed into a Multilayer Perceptron (MLP) to predict the parameters of the latent intervention coefficient distribution:

$$\mathbf{h}^{(s)}_{t}=\mathrm{ViT}_{\theta}\left(\mathbf{z}^{(s)}_{t},\mathbf{p}\right),\quad\left(\boldsymbol{\mu}^{(s)}_{t},\boldsymbol{\sigma}^{(s)}_{t}\right)=\mathrm{MLP}_{\theta}\left(\mathbf{h}^{(s)}_{t}\right), \tag{8}$$

where $\boldsymbol{\mu}^{(s)}_{t},\boldsymbol{\sigma}^{(s)}_{t}\in\mathbb{R}^{k}$ denote the mean and standard deviation, and $k$ is the dimensionality of the stage-specific latent space. The intervention coefficients are sampled using the reparameterization trick, which introduces randomness in a differentiable manner. We express the intervention coefficients as a deterministic function of the learned mean and standard deviation, along with random noise drawn from a standard Gaussian distribution:

$$\mathbf{c}^{(s)}_{t}=\boldsymbol{\mu}^{(s)}_{t}+\boldsymbol{\sigma}^{(s)}_{t}\odot\boldsymbol{\epsilon},\quad\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}), \tag{9}$$

where $\mathbf{c}^{(s)}_{t}\in\mathbb{R}^{k}$ denotes a sampled intervention coefficient vector at evolution step $t$ within stage $s$. With the reparameterization trick, the intervention coefficients $\mathbf{c}^{(s)}_{t}$ are differentiable with respect to the distribution parameters $\boldsymbol{\mu}^{(s)}_{t}$ and $\boldsymbol{\sigma}^{(s)}_{t}$, and hence with respect to the policy parameters $\theta$. This allows us to optimize the intervention distribution by computing gradients through backpropagation. Through this distributional formulation, the policy model induces a stochastic intervention process that generates diverse intervention candidates in the latent space. As the intervention distribution is conditioned on the evolving latent state, it can be adaptively updated during state transitions, which facilitates multi-step planning and exploration within the stage-specific feasible space.
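The sampling step in Eq. (9) can be illustrated with a few lines of NumPy. This is only a sketch: the mean and standard deviation below are placeholder constants standing in for the outputs of the policy MLP, the function name `sample_coefficients` is hypothetical, and finite differences are used in place of automatic differentiation to show that, for a fixed noise draw, the sample has pathwise derivatives of 1 with respect to the mean and $\boldsymbol{\epsilon}$ with respect to the standard deviation.

```python
import numpy as np


def sample_coefficients(mu: np.ndarray, sigma: np.ndarray, rng: np.random.Generator):
    """Draw c = mu + sigma * eps with eps ~ N(0, I), as in Eq. (9).
    The noise is returned as well so the pathwise derivatives can be checked."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps, eps


rng = np.random.default_rng(0)
k = 4
mu = np.array([0.5, -1.0, 0.0, 2.0])      # placeholder mean (assumed, not from the paper)
sigma = np.array([0.1, 0.2, 0.3, 0.4])    # placeholder standard deviation (assumed)

c, eps = sample_coefficients(mu, sigma, rng)

# With eps held fixed, c is affine in (mu, sigma): dc/dmu = 1 and dc/dsigma = eps.
h = 1e-6
dc_dmu = ((mu + h) + sigma * eps - c) / h
dc_dsigma = (mu + (sigma + h) * eps - c) / h
print("dc/dmu == 1:", np.allclose(dc_dmu, 1.0, atol=1e-4))
print("dc/dsigma == eps:", np.allclose(dc_dsigma, eps, atol=1e-4))

# Many draws should reproduce the target mean and standard deviation.
samples = mu + sigma * rng.standard_normal((100_000, k))
print("empirical mean:", samples.mean(axis=0))
print("empirical std:", samples.std(axis=0))
```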

#### 3.2.2 Differentiable Mapping

The sampled intervention coefficient $\mathbf{c}^{(s)}_{t}$ does not necessarily correspond to a physically admissible intervention direction. To ensure the feasibility of process interventions, we map the coefficients into the physics-informed latent space $\mathcal{S}^{(s)}$. The final latent process intervention is obtained through a differentiable linear mapping:

$$\mathbf{a}^{(s)}_{t}=\mathbf{B}^{(s)}\mathbf{c}^{(s)}_{t}, \tag{10}$$

where $\mathbf{a}^{(s)}_{t}$ denotes the intervention in the latent space that aligns with the stage-specific dynamics, ensuring that the intervention respects the underlying physical constraints. This formulation ensures that all sampled interventions remain within the stage-specific feasible space while preserving diversity in intervention sequences under physical constraints. Because the mapping is differentiable, the intervention process is fully compatible with backpropagation, allowing the policy model to be trained end-to-end through gradient-based optimization.
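A small numerical check makes the effect of the linear mapping in Eq. (10) concrete. In this sketch the basis is a random orthonormal matrix standing in for the SPA estimate, and the dimensions are arbitrary; both are assumptions for illustration. It confirms that the mapped intervention lies in the span of the basis, that the coefficients can be recovered by projection, and that the norm is preserved.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 12, 3

# Stand-in orthonormal basis (random here; in the text it is estimated by SPA).
B = np.linalg.qr(rng.standard_normal((d, k)))[0]

c = rng.standard_normal(k)   # sampled intervention coefficients
a = B @ c                    # Eq. (10): latent intervention

# a lies in span(B): projecting it onto the subspace leaves it unchanged.
in_span = np.allclose(B @ (B.T @ a), a)

# Since B^T B = I_k, the coefficients are recoverable and the norm is preserved.
recovered = B.T @ a
norms_match = np.isclose(np.linalg.norm(a), np.linalg.norm(c))

print("in span:", in_span)
print("coefficients recovered:", np.allclose(recovered, c))
print("norm preserved:", bool(norms_match))
```

Because the map is linear, its Jacobian with respect to the coefficients is the basis matrix itself, which is why gradients pass through this step unchanged in form.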

### 3.3 Transition Model: Evolution of Lithography States

The transition model is designed to capture how process interventions drive the continuous evolution of lithography states within and across stages under physical constraints. Starting from the current state, the model predicts the next state by jointly conditioning on the process-window parameters and specific interventions, thereby supporting stable multi-step state extrapolation and closed-loop planning.

At evolution step $t$ of stage $s$, $\mathbf{z}^{(s)}_{t}\in\mathbb{R}^{d}$ and $\mathbf{a}^{(s)}_{t}\in\mathcal{S}^{(s)}$ denote the latent state and process intervention, respectively. The transition model $f_{\phi}(\cdot)$ employs the Diffusion Transformer (DiT) as the backbone to model the main state transition process and output a shared intermediate representation for subsequent state updates. As lithography stages differ substantially in representation and value range, we employ stage-aware MLP heads after the DiT backbone. These select the corresponding mapping function based on the current stage to obtain the next state. The transition model can be represented as:

$$\mathbf{h}_{t+1}^{(s)}=\mathrm{DiT}_{\phi}\!\left(\mathbf{z}_{t}^{(s)},\mathbf{a}_{t}^{(s)},\mathbf{p}\right),\quad\mathbf{z}_{t+1}^{(s)}=\mathrm{MLP}_{\phi}^{(s)}\!\left(\mathbf{h}_{t+1}^{(s)}\right). \tag{11}$$

During training, since paired stage targets are available, we use a target-guided state gate to stabilize multi-step evolution learning. After generating the next state $\mathbf{z}_{t+1}^{(s)}$, we assess whether the evolution within the current stage has sufficiently progressed toward the stage target observed in data. Let $\mathbf{x}^{*(s)}$ denote the ground-truth state at stage $s$ and $\mathbf{z}^{*(s)}=E(\mathbf{x}^{*(s)})$ be its latent representation. We define a stage-alignment loss:

$$\mathcal{L}_{\mathrm{evo}}^{(t)}=\left\lVert\mathbf{z}_{t+1}^{(s)}-\mathbf{z}^{*(s)}\right\rVert_{2}^{2}. \tag{12}$$

We compute its gradient norm with respect to interventions:

$$g_{t}=\left\lVert\nabla_{\mathbf{a}_{t}^{(s)}}\mathcal{L}_{\mathrm{evo}}^{(t)}\right\rVert_{2}, \tag{13}$$

where $\nabla_{\mathbf{a}_{t}^{(s)}}$ denotes the gradient with respect to $\mathbf{a}_{t}^{(s)}$. When $g_{t}>\tau_{\nabla}$, the current intervention has not yet sufficiently driven the state evolution into a locally stable regime, and the predicted state is fed back to the policy model within the same stage to continue intervention updates and state prediction. When $g_{t}\leq\tau_{\nabla}$, the evolution is considered locally stable, and we accept $\mathbf{z}_{t+1}^{(s)}$ as the stage output and proceed to the next stage. In practice, we set $\tau_{\nabla}=5\times 10^{-3}$ and cap the number of within-stage evolution iterations at 10. This criterion requires no additional training objectives.
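The gating rule above can be sketched as a short loop. Everything except the threshold and the iteration cap is a toy assumption: `toy_transition` is an additive update standing in for $f_{\phi}$, `toy_policy` moves halfway toward the target (a real policy does not see the target), and the gradient of the stage-alignment loss is written analytically for this toy transition instead of being obtained by backpropagation.

```python
import numpy as np

TAU_GRAD = 5e-3   # gradient-norm threshold stated in the text
MAX_ITERS = 10    # cap on within-stage evolution iterations stated in the text


def toy_transition(z: np.ndarray, a: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the transition model: an additive latent update."""
    return z + a


def toy_policy(z: np.ndarray, z_target: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the policy: step halfway toward the target."""
    return 0.5 * (z_target - z)


def grad_norm_wrt_a(z: np.ndarray, a: np.ndarray, z_target: np.ndarray) -> float:
    """For the toy transition, L_evo = ||z + a - z*||^2, so dL/da = 2 (z + a - z*)."""
    return float(np.linalg.norm(2.0 * (z + a - z_target)))


def evolve_stage(z0: np.ndarray, z_target: np.ndarray):
    """Iterate policy and transition until the gate opens or the cap is reached.
    Returns (final state, number of iterations used, whether the gate opened)."""
    z = z0
    for step in range(1, MAX_ITERS + 1):
        a = toy_policy(z, z_target)
        g = grad_norm_wrt_a(z, a, z_target)
        z = toy_transition(z, a)
        if g <= TAU_GRAD:          # locally stable: accept as the stage output
            return z, step, True
    return z, MAX_ITERS, False     # cap reached without satisfying the gate


z_final, steps, converged = evolve_stage(np.array([1.0, 0.0, 0.0, 0.0]), np.zeros(4))
print("iterations:", steps, "gate opened:", converged)

z_far, steps_far, converged_far = evolve_stage(np.array([100.0, 0.0, 0.0, 0.0]), np.zeros(4))
print("iterations:", steps_far, "gate opened:", converged_far)
```

In this toy setting the residual halves at every iteration, so a start at unit distance passes the gate within the cap, while a start at distance 100 exhausts the 10 iterations first.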

### 3.4 Contrastive Variational Optimization Paradigm

In practical lithography datasets, only terminal states of each stage are observable, while stage-internal process interventions and intermediate states are unobserved. This renders explicit supervision of interventions infeasible. To address this, we propose the contrastive variational optimization paradigm that relies solely on terminal-stage supervision to jointly optimize process intervention distributions and state evolution, without requiring intermediate supervision.

#### 3.4.1 Variational Inference

For a given initial condition and terminal outcome, there typically exist multiple physically admissible intervention sequences that can produce similar results. This makes direct inference of a single deterministic intervention ill-posed. Therefore, we formulate process intervention learning as an implicit variational inference problem, where intervention sequences are treated as latent variables, and their posterior is inferred from terminal-state supervision. Specifically, the policy model $\pi_{\theta}$ parameterizes a tractable family of conditional distributions to approximate the implicit intervention posterior induced by terminal supervision and physical feasibility constraints. From a variational perspective, $\pi_{\theta}$ serves as a learnable approximation to this posterior. We define the stage-wise variational objective as:

$$\begin{aligned}\mathcal{L}^{(s)}_{\mathrm{var}}&=\mathbb{E}_{\{\mathbf{c}^{(s)}_{t}\}_{t=0}^{T-1}}\left[\mathcal{L}^{(s)}_{\mathrm{rec}}\left(D(\mathbf{z}^{(s)}_{T}),\mathbf{x}^{*(s)}\right)\right],\\ \mathbf{c}^{(s)}_{t}&\sim\pi_{\theta}\left(\cdot\mid\mathbf{z}^{(s)}_{t},\mathbf{p}\right),\quad\mathbf{a}^{(s)}_{t}=\mathbf{B}^{(s)}\mathbf{c}^{(s)}_{t},\end{aligned} \tag{14}$$

where $\mathbf{z}^{(s)}_{T}$ is obtained by rolling out the transition model for $T$ steps using the sampled interventions $\{\mathbf{a}^{(s)}_{t}\}_{t=0}^{T-1}$, and $D(\cdot)$ denotes the decoder. Minimizing $\mathcal{L}_{\mathrm{var}}^{(s)}$ serves as a variational surrogate that concentrates probability mass on intervention sequences that best explain the observed terminal state under the physical feasibility constraints. Unlike ELBO-based formulations that require an explicit likelihood, the intervention posterior here is implicitly specified by terminal-state supervision and the physics-informed intervention space, enabling stochastic yet physically consistent intervention inference without stage-internal supervision.
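The expectation in Eq. (14) is naturally estimated by Monte Carlo rollouts. The sketch below is a toy instance under loudly stated assumptions: the policy is a state-independent Gaussian, the transition is an additive update standing in for $f_{\phi}$, the decoder is the identity, and the reconstruction loss is a squared error. None of these choices come from the paper; the function names `rollout_loss` and `estimate_l_var` are hypothetical. It only shows how terminal-state supervision alone distinguishes a policy whose rollouts reach the target from one whose rollouts do not.

```python
import numpy as np


def rollout_loss(z0, target, B, mu, sigma, T, rng):
    """One sampled rollout: T reparameterized interventions, then terminal loss."""
    z = z0.copy()
    for _ in range(T):
        c = mu + sigma * rng.standard_normal(mu.shape)   # c_t ~ toy Gaussian policy
        a = B @ c                                        # a_t = B c_t
        z = z + a                                        # toy additive transition
    return float(np.sum((z - target) ** 2))              # toy L_rec, identity decoder


def estimate_l_var(z0, target, B, mu, sigma, T, n_samples, seed=0):
    """Monte Carlo estimate of the expected terminal loss over sampled rollouts."""
    rng = np.random.default_rng(seed)
    losses = [rollout_loss(z0, target, B, mu, sigma, T, rng) for _ in range(n_samples)]
    return float(np.mean(losses))


d, k, T = 6, 2, 4
B = np.eye(d)[:, :k]                       # toy basis: the first k coordinate axes
z0 = np.zeros(d)
target = np.array([1.0, -1.0, 0.0, 0.0, 0.0, 0.0])
sigma = np.full(k, 0.05)

# A policy whose mean steps sum to the target versus one that does not move on average.
loss_good = estimate_l_var(z0, target, B, np.array([0.25, -0.25]), sigma, T, n_samples=2000)
loss_bad = estimate_l_var(z0, target, B, np.zeros(k), sigma, T, n_samples=2000)

# For this toy setup the well-aimed policy has zero bias, so its expected loss
# reduces to the accumulated sampling variance T * sum(sigma^2).
analytic_good = T * float(np.sum(sigma ** 2))

print("estimated loss (well-aimed policy):", loss_good)
print("estimated loss (stationary policy):", loss_bad)
print("analytic value for the well-aimed policy:", analytic_good)
```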

#### 3.4.2 Contrastive Learning

To enforce stage-consistent interventions, we construct contrastive pairs by projecting the same coefficients onto physics-informed spaces of different stages. Given a sampled coefficient vector $\mathbf{c}_{t}^{(s)}\sim\pi_{\theta}(\,\cdot\mid\mathbf{z}_{t}^{(s)},\mathbf{p})$, we form a positive (stage-matched) intervention and a negative (stage-mismatched) intervention:

$$\mathbf{a}_{t,\mathrm{pos}}^{(s)}=\mathbf{B}^{(s)}\mathbf{c}_{t}^{(s)},\quad\mathbf{a}_{t,\mathrm{neg}}^{(s)}=\mathbf{B}^{(s+1)}\mathbf{c}_{t}^{(s)}. \tag{15}$$

Rolling out one step with the same transition model yields:

$$\begin{aligned}\mathbf{z}_{t+1,\mathrm{pos}}^{(s)}&=f_{\phi}\!\left(\mathbf{z}_{t}^{(s)},\,\mathbf{a}_{t,\mathrm{pos}}^{(s)},\,\mathbf{p}\right),\\ \mathbf{z}_{t+1,\mathrm{neg}}^{(s)}&=f_{\phi}\!\left(\mathbf{z}_{t}^{(s)},\,\mathbf{a}_{t,\mathrm{neg}}^{(s)},\,\mathbf{p}\right).\end{aligned} \tag{16}$$

Using the stage-$s$ terminal observation $\mathbf{x}^{\ast(s)}$ as supervision, we define a margin-based contrastive loss:

$$\mathcal{L}_{\mathrm{ctr}}^{(s)}=\max\!\Big(0,\;m+\mathcal{L}_{\mathrm{rec}}^{(s)}\big(\mathbf{x}_{t+1,\mathrm{pos}}^{(s)},\,\mathbf{x}^{\ast(s)}\big)-\mathcal{L}_{\mathrm{rec}}^{(s)}\big(\mathbf{x}_{t+1,\mathrm{neg}}^{(s)},\,\mathbf{x}^{\ast(s)}\big)\Big), \tag{17}$$

where $\mathbf{x}_{i}^{(s)}=D(\mathbf{z}_{i}^{(s)})$ is the image obtained from $\mathbf{z}_{i}^{(s)}$ by the decoder, and the margin $m$ is set to 0.1. This objective pulls the stage-matched evolution toward $\mathbf{x}^{\ast(s)}$ while pushing the stage-mismatched evolution away, enforcing stage-specific consistency without requiring intermediate-state supervision.
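The hinge structure of Eq. (17) can be seen in a toy example. The sketch assumes two disjoint coordinate subspaces as stand-ins for the stage bases, an additive transition, an identity decoder, and a squared-error reconstruction loss; these are illustrative choices, not the authors' components, and the function names are hypothetical. Only the margin value of 0.1 is taken from the text.

```python
import numpy as np

MARGIN = 0.1  # margin m stated in the text


def rec_loss(x: np.ndarray, x_star: np.ndarray) -> float:
    """Toy reconstruction loss: sum of squared differences."""
    return float(np.sum((x - x_star) ** 2))


def contrastive_loss(z, c, B_match, B_mismatch, x_star, margin=MARGIN) -> float:
    """Hinge form of Eq. (17) for one step, with a toy additive transition
    and an identity decoder."""
    a_pos = B_match @ c        # stage-matched intervention, Eq. (15)
    a_neg = B_mismatch @ c     # stage-mismatched intervention, Eq. (15)
    x_pos = z + a_pos
    x_neg = z + a_neg
    return max(0.0, margin + rec_loss(x_pos, x_star) - rec_loss(x_neg, x_star))


d = 6
eye = np.eye(d)
B_s, B_next = eye[:, 0:2], eye[:, 2:4]     # toy bases for stage s and stage s+1
z_t = np.zeros(d)
x_star = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])

# Coefficients that reach the target only through the stage-matched basis:
# the positive loss is far below the negative one, so the hinge is inactive.
loss_informative = contrastive_loss(z_t, np.array([1.0, 1.0]), B_s, B_next, x_star)

# Zero coefficients make both interventions identical, so the loss equals the margin.
loss_degenerate = contrastive_loss(z_t, np.zeros(2), B_s, B_next, x_star)

print("loss with informative coefficients:", loss_informative)
print("loss with zero coefficients:", loss_degenerate)
```

The second case shows why the margin matters: whenever the matched and mismatched rollouts are indistinguishable, the loss stays positive and keeps pushing the two apart.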

Table 1: Final evolution quality of Mask, Resist Image, and ADI on the ID lithography dataset. The values before and after “/” denote the results of the forward evolution and inverse planning tasks, respectively. Best values are shown in bold.

| Task | Method | Multi-Step | Multi-Stage | mPA (%) ↑ | mIoU (%) ↑ | F1 (%) ↑ | MSE (×10⁻³) ↓ | EPE_avg (nm) ↓ |
|---|---|---|---|---|---|---|---|---|
| Mask Evolution | LithoNet (Shao et al., [2020](https://arxiv.org/html/2606.26713#bib.bib4)) | × | × | 93.54/- | 85.70/- | 84.29/- | 57.92/- | 46.16/- |
| Mask Evolution | LMLitho (Wang et al., [2025](https://arxiv.org/html/2606.26713#bib.bib18)) | × | × | 96.27/- | 91.95/- | 95.99/- | 19.03/- | 7.12/- |
| Mask Evolution | Unitho (Jin et al., [2025a](https://arxiv.org/html/2606.26713#bib.bib17)) | × | ✓ | 95.71/- | 92.24/- | 96.42/- | 25.40/- | 8.69/- |
| Mask Evolution | DINO-WM (Zhou et al., [2024](https://arxiv.org/html/2606.26713#bib.bib9)) | ✓ | × | 93.47/94.83 | 90.04/92.88 | 93.00/93.46 | 26.51/25.91 | 21.96/18.33 |
| Mask Evolution | DriveDreamer-2 (Zhao et al., [2025](https://arxiv.org/html/2606.26713#bib.bib20)) | ✓ | × | 93.65/95.04 | 83.17/86.51 | 90.76/92.43 | 63.48/62.17 | 4.94/4.53 |
| Mask Evolution | NWM (Bar et al., [2025](https://arxiv.org/html/2606.26713#bib.bib19)) | ✓ | × | 91.81/92.77 | 85.95/87.62 | 90.75/92.22 | 29.65/28.07 | 26.36/23.90 |
| Mask Evolution | LithoDreamer (ours) | ✓ | ✓ | **98.69/99.04** | **96.91/97.61** | **99.66/99.69** | **13.74/13.07** | **1.58/1.32** |
| Resist Image Evolution | LithoNet (Shao et al., [2020](https://arxiv.org/html/2606.26713#bib.bib4)) | × | × | 91.41/- | 82.14/- | 74.36/- | 78.67/- | 33.31/- |
| Resist Image Evolution | LMLitho (Wang et al., [2025](https://arxiv.org/html/2606.26713#bib.bib18)) | × | × | 95.23/- | 96.35/- | 96.24/- | 18.66/- | 6.31/- |
| Resist Image Evolution | Unitho (Jin et al., [2025a](https://arxiv.org/html/2606.26713#bib.bib17)) | × | ✓ | 95.52/- | 91.98/- | 96.02/- | 25.16/- | 8.62/- |
| Resist Image Evolution | DINO-WM (Zhou et al., [2024](https://arxiv.org/html/2606.26713#bib.bib9)) | ✓ | × | 95.68/96.32 | 93.89/94.72 | 95.03/95.83 | 22.47/20.39 | 17.37/15.03 |
| Resist Image Evolution | DriveDreamer-2 (Zhao et al., [2025](https://arxiv.org/html/2606.26713#bib.bib20)) | ✓ | × | 92.59/93.75 | 78.74/79.21 | 87.21/88.82 | 74.06/71.50 | 7.72/6.97 |
| Resist Image Evolution | NWM (Bar et al., [2025](https://arxiv.org/html/2606.26713#bib.bib19)) | ✓ | × | 91.81/92.58 | 85.95/86.89 | 90.75/91.79 | 55.43/53.45 | 26.36/24.16 |
| Resist Image Evolution | LithoDreamer (ours) | ✓ | ✓ | **99.01/99.34** | **97.77/98.35** | **99.73/99.61** | **10.93/10.01** | **0.96/0.89** |
| ADI Evolution | LithoNet (Shao et al., [2020](https://arxiv.org/html/2606.26713#bib.bib4)) | × | × | 61.17/- | 8.91/- | 26.78/- | 247.82/- | 13.39/- |
| ADI Evolution | LMLitho (Wang et al., [2025](https://arxiv.org/html/2606.26713#bib.bib18)) | × | × | 52.56/- | 40.96/- | 32.14/- | 162.41/- | 5.54/- |
| ADI Evolution | Unitho (Jin et al., [2025a](https://arxiv.org/html/2606.26713#bib.bib17)) | × | ✓ | -/- | -/- | -/- | -/- | -/- |
| ADI Evolution | DINO-WM (Zhou et al., [2024](https://arxiv.org/html/2606.26713#bib.bib9)) | ✓ | × | 73.57/74.69 | 69.41/71.83 | 69.99/71.12 | 120.51/113.98 | 19.71/19.42 |
| ADI Evolution | DriveDreamer-2 (Zhao et al., [2025](https://arxiv.org/html/2606.26713#bib.bib20)) | ✓ | × | 74.93/75.73 | 14.58/56.78 | 24.74/26.48 | 30.48/28.12 | 5.58/5.06 |
| ADI Evolution | NWM (Bar et al., [2025](https://arxiv.org/html/2606.26713#bib.bib19)) | ✓ | × | 71.43/73.27 | 61.35/69.72 | 53.20/54.83 | 137.32/132.40 | 26.36/25.29 |
| ADI Evolution | LithoDreamer (ours) | ✓ | ✓ | **82.56/84.39** | **78.27/82.97** | **80.91/83.61** | **30.29/26.75** | **3.74/3.28** |

## 4 Experiments

We evaluate two tasks: forward evolution and inverse planning. In forward evolution, the model starts from layouts and process parameters and explores interventions to generate the Mask, Resist Image, and ADI, which assesses state evolution. In inverse planning, the model is given target ADIs (or target Resist Images for the 28 nm data (He et al., [2026](https://arxiv.org/html/2606.26713#bib.bib30)), which provide no ADI) and optimizes interventions to drive state evolution toward the targets, which tests goal-directed planning.

### 4.1 Experimental Settings

#### 4.1.1 Implementation Details

We use a Variational Autoencoder (VAE) (Pu et al., [2016](https://arxiv.org/html/2606.26713#bib.bib1)) and BERT (Devlin et al., [2019](https://arxiv.org/html/2606.26713#bib.bib3)) as the image and text encoders, respectively. The policy model adopts vit-base-patch16-224 (Dosovitskiy, [2020](https://arxiv.org/html/2606.26713#bib.bib5)), while the transition model uses DiT-B/2 (Peebles and Xie, [2023](https://arxiv.org/html/2606.26713#bib.bib7)). The model is trained in an end-to-end pipeline with ground-truth inputs at each stage to mitigate error accumulation. The objective combines the variational loss ([Equation 14](https://arxiv.org/html/2606.26713#S3.E14)) and the contrastive loss ([Equation 17](https://arxiv.org/html/2606.26713#S3.E17)), where the variational loss includes a composite reconstruction loss comprising Mean Squared Error (MSE) loss, Binary Cross-Entropy (BCE) loss, Dice loss, and Edge loss. We use AdamW with an initial learning rate of 1e-6, weight decay of 0.01, $\beta_{1}=0.9$, and $\beta_{2}=0.999$. Training is conducted on 8 NVIDIA A100 GPUs for 10 epochs with a batch size of 8 and a linear learning-rate decay schedule.
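As a rough illustration of the stated schedule, the sketch below computes a linearly decayed learning rate in plain Python. It assumes that the batch size of 8 is the global batch size, that there is no warm-up, and that the rate decays to zero at the end of training; none of these details are specified above, and the function name is hypothetical.

```python
def linear_decay_lr(step, total_steps, base_lr=1e-6, final_lr=0.0):
    """Linearly interpolate from base_lr to final_lr over total_steps.

    final_lr = 0.0 and the absence of warm-up are assumptions.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return base_lr + (final_lr - base_lr) * frac


TRAIN_SAMPLES = 280_000   # training-set size reported in Section 4.1.2
BATCH_SIZE = 8            # assumed to be the global batch size
EPOCHS = 10

steps_per_epoch = TRAIN_SAMPLES // BATCH_SIZE
total_steps = steps_per_epoch * EPOCHS

lr_start = linear_decay_lr(0, total_steps)
lr_mid = linear_decay_lr(total_steps // 2, total_steps)
lr_end = linear_decay_lr(total_steps, total_steps)
```

Under these assumptions training runs for 350,000 optimizer steps, and the learning rate at the midpoint is half the initial value.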

#### 4.1.2 Datasets

As described in [Section 2.1](https://arxiv.org/html/2606.26713#S2.SS1), we construct a large-scale 55 nm dataset covering the complete "Layout-Mask-Resist Image-ADI" pipeline. The dataset considers four process parameters: source type ([Figure 3](https://arxiv.org/html/2606.26713#S4.F3): Annular, Circular, Bull's Eye), resist threshold (0.09231251, 0.1236402, 0.1436665), focus (0 nm, 50 nm), and exposure dose (1.0×, 1.2×), resulting in 3 × 3 × 2 × 2 = 36 process configurations. The training set contains 280k samples, and the ID test set contains 20k samples, covering all configurations. We also construct two OOD test sets to evaluate generalization under unseen process parameters and different technology nodes. The first contains 3k samples from an unseen process setting at the same 55 nm node, namely Annular source, threshold 0.119340, focus 0 nm, and dose 1.0×. The second uses the public 28 nm LithoSim dataset (He et al., [2026](https://arxiv.org/html/2606.26713#bib.bib30)) for cross-node evaluation.
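The process grid described above can be enumerated in a few lines. The sketch below is illustrative only: the tuple layout and the `config_name` helper are assumptions that mirror the Source_Threshold_Focus_Dose naming used in the tables, not the dataset's actual file format.

```python
from itertools import product

SOURCES = ("Annular", "Circular", "Bull's Eye")
THRESHOLDS = (0.09231251, 0.1236402, 0.1436665)
FOCUS_NM = (0, 50)
DOSES = (1.0, 1.2)

# Cartesian product of the four process parameters.
configs = list(product(SOURCES, THRESHOLDS, FOCUS_NM, DOSES))

# Unseen setting used for the first OOD test set.
ood_config = ("Annular", 0.119340, 0, 1.0)


def config_name(cfg):
    """Format a configuration as Source_Threshold_Focus_Dose (hypothetical helper)."""
    source, threshold, focus, dose = cfg
    return f"{source}_{threshold}_{focus} nm_{dose}x"


num_configs = len(configs)
ood_is_unseen = ood_config not in configs
```

The OOD threshold 0.119340 lies between two training thresholds, so the first OOD set probes interpolation in the threshold axis while keeping the other three parameters at values seen during training.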

![Refer to caption](https://arxiv.org/html/2606.26713v1/x3.png)

Figure 3: Three types of light sources in the dataset, each with configurable parameters, such as radius.

Table 2: Comparison of Lithography Rule Check (LRC) violations between generated resist images and commercial simulation results on the ID dataset under the forward evolution task. The commercial tool is used as the reference; percentages in parentheses give the deviation from it. Compared models are LMLitho (Wang et al., [2025](https://arxiv.org/html/2606.26713#bib.bib18)), Unitho (Jin et al., [2025a](https://arxiv.org/html/2606.26713#bib.bib17)), and LithoDreamer (ours). The process condition is defined as Source\_Threshold\_Focus\_Dose.

| Process conditions\+ | Commercial tool #Pinch | Commercial tool #Bridge | Commercial tool #EPE | LMLitho #Pinch | LMLitho #Bridge | LMLitho #EPE | Unitho #Pinch | Unitho #Bridge | Unitho #EPE | LithoDreamer #Pinch | LithoDreamer #Bridge | LithoDreamer #EPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 1289 | 2784 | 12732 | 1202 (-6.7%) | 2731 (-1.9%) | 12509 (-1.8%) | 1219 (-5.4%) | 2629 (-5.6%) | 13009 (2.2%) | 1217 (-5.6%) | 2724 (-2.2%) | 12920 (1.5%) |
| B | 1335 | 3032 | 12134 | 1292 (-3.2%) | 2994 (-1.3%) | 11366 (-6.3%) | 1279 (-4.2%) | 2866 (-5.5%) | 13729 (13.1%) | 1334 (-0.1%) | 3095 (2.1%) | 13160 (8.5%) |
| C | 1443 | 2657 | 5980 | 1584 (9.8%) | 1765 (-33.6%) | 7009 (17.2%) | 2586 (79.2%) | 3767 (41.8%) | 12104 (102.4%) | 1577 (9.3%) | 2683 (1.0%) | 5427 (-9.2%) |
| D | 1441 | 2795 | 6096 | 2819 (95.6%) | 2811 (0.6%) | 11489 (88.5%) | 2808 (94.9%) | 2965 (6.1%) | 8764 (43.8%) | 1407 (-2.4%) | 3849 (37.7%) | 6621 (8.6%) |
| E | 1456 | 2808 | 7041 | 2438 (67.4%) | 2471 (-12.0%) | 9712 (37.9%) | 2460 (69.0%) | 2910 (3.6%) | 8650 (22.9%) | 1519 (4.3%) | 2761 (-1.7%) | 6037 (-14.3%) |
| Average | 1393 | 2815 | 8797 | 1867 (34.0%) | 2554 (-9.3%) | 10417 (18.4%) | 2070 (48.6%) | 3027 (7.5%) | 11251 (27.9%) | 1411 (1.3%) | 3022 (7.4%) | 8833 (0.4%) |

- \+ A: Bull's Eye\_0.09231251\_0 nm\_1.2×; B: Bull's Eye\_0.1436665\_0 nm\_1.2×; C: Annular\_0.09231251\_50 nm\_1.0×; D: Circular\_0.1436665\_0 nm\_1.0×; E: Circular\_0.09231251\_0 nm\_1.0×.

![Refer to caption](https://arxiv.org/html/2606.26713v1/x4.png)

Figure 4: Visualization of inverse planning on the ID dataset. Given the input layout and target ADI, LithoDreamer plans latent interventions and evolves the Mask, Resist Image, and ADI state to achieve the target pattern.

#### 4.1.3 Evaluation Metrics

We evaluate lithography tasks under ID and OOD settings using five metrics. Mean Pixel Accuracy (mPA) and Mean Intersection over Union (mIoU) assess pixel-level accuracy and region consistency, respectively. Edge F1-Score (F1) evaluates boundary quality by balancing edge precision and recall. Mean Squared Error (MSE) measures overall intensity deviation. Edge Placement Error (EPE) is a manufacturing-critical contour-level metric in lithography. Following the gauge-based calculation protocol in Appendix [B](https://arxiv.org/html/2606.26713#A2), we first convert the target and predicted patterns into polygonal contours, sample gauge points along the target contour, and measure the local displacement from the target contour to the predicted contour along the target normal direction. This metric directly reflects nanoscale contour placement fidelity and is therefore more manufacturing-relevant than generic image similarity metrics.
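The two region metrics can be illustrated with a short sketch. It assumes the common segmentation definitions (per-class accuracy and per-class IoU, each averaged over the two classes of a binarized pattern); the function names and toy data are illustrative. EPE is omitted because it depends on the gauge-based contour protocol of Appendix B.

```python
def confusion(pred, target, num_classes=2):
    """Confusion matrix m[true_class][predicted_class] over flat label lists."""
    m = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        m[t][p] += 1
    return m


def mean_pixel_accuracy(pred, target, num_classes=2):
    """Average of per-class recall, skipping classes absent from the target."""
    m = confusion(pred, target, num_classes)
    accs = [m[c][c] / sum(m[c]) for c in range(num_classes) if sum(m[c]) > 0]
    return sum(accs) / len(accs)


def mean_iou(pred, target, num_classes=2):
    """Average of per-class intersection over union."""
    m = confusion(pred, target, num_classes)
    ious = []
    for c in range(num_classes):
        tp = m[c][c]
        fn = sum(m[c]) - tp
        fp = sum(m[r][c] for r in range(num_classes)) - tp
        union = tp + fp + fn
        if union > 0:
            ious.append(tp / union)
    return sum(ious) / len(ious)


# Toy 6-pixel binary pattern (illustrative values only).
target = [0, 0, 1, 1, 1, 0]
pred = [0, 1, 1, 1, 0, 0]
mpa = mean_pixel_accuracy(pred, target)
miou = mean_iou(pred, target)
```

On this toy pattern two of six pixels are wrong, which gives a lower mIoU than mPA because every error counts against the union of both classes.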

### 4.2 Main Experiment Results

We compare with classical lithography models and general WMs. Unlike conventional lithography models, which are evaluated independently at each stage with ground-truth state inputs, LithoDreamer performs continuous prediction based on generated states from previous stages and therefore faces more pronounced cross-stage error accumulation.

Table 3: Final evolution quality of Mask, Resist Image, and ADI on two OOD lithography datasets. Values before and after "/" denote the results of the forward evolution and inverse planning tasks, respectively. Best values are in bold.

| Dataset | Method | Mask mPA (%) ↑ | Mask mIoU (%) ↑ | Mask EPE$_{avg}$ (nm) ↓ | Resist Image mPA (%) ↑ | Resist Image mIoU (%) ↑ | Resist Image EPE$_{avg}$ (nm) ↓ | ADI mPA (%) ↑ | ADI mIoU (%) ↑ | ADI EPE$_{avg}$ (nm) ↓ |
|---|---|---|---|---|---|---|---|---|---|---|
| 55 nm with F\+ | LithoNet (Shao et al., [2020](https://arxiv.org/html/2606.26713#bib.bib4)) | 91.43/- | 55.95/- | 8.19/- | 89.55/- | 46.67/- | 8.64/- | 72.24/- | 6.07/- | 13.45/- |
| 55 nm with F\+ | LMLitho (Wang et al., [2025](https://arxiv.org/html/2606.26713#bib.bib18)) | 90.69/- | 66.58/- | 28.00/- | 93.29/- | 66.44/- | 7.00/- | 47.45/- | 42.46/- | 75.61/- |
| 55 nm with F\+ | Unitho (Jin et al., [2025a](https://arxiv.org/html/2606.26713#bib.bib17)) | 91.01/- | 67.61/- | 10.54/- | 94.80/- | 71.84/- | 1.93/- | -/- | -/- | -/- |
| 55 nm with F\+ | DINO-WM (Zhou et al., [2024](https://arxiv.org/html/2606.26713#bib.bib9)) | 89.87/90.57 | 75.98/78.05 | 15.67/14.81 | 93.51/93.97 | 76.25/78.44 | 2.69/2.31 | 71.81/73.42 | 64.06/64.59 | 4.28/4.13 |
| 55 nm with F\+ | DriveDreamer-2 (Zhao et al., [2025](https://arxiv.org/html/2606.26713#bib.bib20)) | 88.44/91.17 | 54.72/63.00 | 6.26/5.72 | 94.27/94.91 | 78.21/80.30 | 5.22/4.92 | 68.11/70.04 | 9.04/50.49 | 5.49/5.12 |
| 55 nm with F\+ | NWM (Bar et al., [2025](https://arxiv.org/html/2606.26713#bib.bib19)) | 84.04/86.21 | 75.92/78.09 | 26.36/22.73 | 88.25/89.61 | 80.29/82.53 | 3.04/2.94 | 70.94/72.43 | 60.96/66.57 | 4.62/4.39 |
| 55 nm with F\+ | LithoDreamer (ours) | **96.92/97.33** | **77.52/82.79** | **1.77/1.51** | **98.27/98.64** | **82.91/84.07** | **1.06/0.97** | **78.89/85.73** | **76.04/84.66** | **4.05/3.21** |
| 28 nm | LithoNet (Shao et al., [2020](https://arxiv.org/html/2606.26713#bib.bib4)) | 93.06/- | 31.49/- | 23.49/- | 93.06/- | 31.49/- | 20.52/- | -/- | -/- | -/- |
| 28 nm | LMLitho (Wang et al., [2025](https://arxiv.org/html/2606.26713#bib.bib18)) | 82.49/- | 30.64/- | 4.63/- | 92.46/- | 29.08/- | 7.90/- | -/- | -/- | -/- |
| 28 nm | Unitho (Jin et al., [2025a](https://arxiv.org/html/2606.26713#bib.bib17)) | 81.20/- | 33.98/- | 4.78/- | 91.71/- | 30.59/- | 5.33/- | -/- | -/- | -/- |
| 28 nm | DINO-WM (Zhou et al., [2024](https://arxiv.org/html/2606.26713#bib.bib9)) | 73.90/75.12 | 62.37/64.99 | 5.85/5.42 | 75.81/77.00 | 61.32/63.74 | 5.97/5.03 | -/- | -/- | -/- |
| 28 nm | DriveDreamer-2 (Zhao et al., [2025](https://arxiv.org/html/2606.26713#bib.bib20)) | 86.30/87.27 | 18.81/30.08 | 15.03/14.26 | 80.75/83.95 | 20.39/36.15 | 19.38/18.46 | -/- | -/- | -/- |
| 28 nm | NWM (Bar et al., [2025](https://arxiv.org/html/2606.26713#bib.bib19)) | 71.24/72.35 | 58.92/60.79 | 20.81/19.62 | 78.94/79.68 | 60.70/62.76 | 29.69/28.53 | -/- | -/- | -/- |
| 28 nm | LithoDreamer (ours) | **94.92/95.74** | **89.75/92.63** | **1.82/1.73** | **96.99/97.68** | **81.16/82.99** | **1.37/1.28** | -/- | -/- | -/- |

- \+ F: Source\_Threshold\_Focus\_Dose: Annular\_0.119340\_0 nm\_1.0×. ADI data are not provided in the 28 nm dataset (He et al., [2026](https://arxiv.org/html/2606.26713#bib.bib30)).

#### 4.2.1 In-Domain Results

As shown in [Table 1](https://arxiv.org/html/2606.26713#S3.T1), LithoDreamer achieves the best forward evolution quality across Mask, Resist Image, and ADI on the ID dataset. It obtains EPE values of 1.58 nm, 0.96 nm, and 3.74 nm, respectively, substantially outperforming classical lithography models and general WMs (Bar et al., [2025](https://arxiv.org/html/2606.26713#bib.bib19)) and bringing EPE into the 1-4 nm range. It also yields consistent gains in region and boundary metrics, especially for ADI, where mIoU and F1 reach 78.27% and 80.91%. Unlike static predictors or stage-wise mapping methods such as Unitho (Jin et al., [2025a](https://arxiv.org/html/2606.26713#bib.bib17)), LithoDreamer explicitly models the continuous physical evolution of the "Layout-Mask-Resist Image-ADI" pipeline, enabling stable multi-step reasoning within stages and coherent cross-stage prediction. Further LRC validation in [Table 2](https://arxiv.org/html/2606.26713#S4.T2) shows that the generated Resist Images closely match commercial simulation results in Pinch, Bridge, and EPE violations, with average deviations of only 1.3%, 7.4%, and 0.4%, respectively. LMLitho (Wang et al., [2025](https://arxiv.org/html/2606.26713#bib.bib18)) and Unitho (Jin et al., [2025a](https://arxiv.org/html/2606.26713#bib.bib17)) show much larger average deviations in Pinch and EPE violations (34.0% and 18.4% for LMLitho, 48.6% and 27.9% for Unitho). LithoDreamer therefore better preserves fine-grained geometry and manufacturing-rule consistency, supporting its reliability for downstream lithography analysis.
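The average deviations quoted from Table 2 follow from the violation counts by a simple relative difference against the commercial tool. The sketch below reproduces that arithmetic from the "Average" row; the dictionary layout and function name are illustrative.

```python
def relative_deviation_pct(model_count, reference_count):
    """Signed deviation of a model's violation count from the reference, in percent."""
    return 100.0 * (model_count - reference_count) / reference_count


# "Average" row of Table 2: commercial tool as reference.
reference = {"pinch": 1393, "bridge": 2815, "epe": 8797}
models = {
    "LMLitho": {"pinch": 1867, "bridge": 2554, "epe": 10417},
    "Unitho": {"pinch": 2070, "bridge": 3027, "epe": 11251},
    "LithoDreamer": {"pinch": 1411, "bridge": 3022, "epe": 8833},
}

deviations = {
    name: {
        rule: round(relative_deviation_pct(count, reference[rule]), 1)
        for rule, count in counts.items()
    }
    for name, counts in models.items()
}
```

A positive value means the model reports more violations than the commercial tool and a negative value means fewer, so values near zero in either direction indicate close agreement.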

For the inverse planning task on the ID dataset, LithoDreamer shows strong goal-directed planning capability, achieving EPE values of 1.32 nm, 0.89 nm, and 3.28 nm on Mask, Resist Image, and ADI, respectively, while consistently outperforming WM baselines in mPA, mIoU, and F1. Furthermore, the evolution trajectory in [Figure 4](https://arxiv.org/html/2606.26713#S4.F4) shows that LithoDreamer does not simply fit the terminal image, but progressively corrects local contour deviations through the Mask and Resist Image stages, leading to continuous morphology convergence in ADI. These results demonstrate its ability to perform stable target-directed evolution through physically consistent intervention planning.

Table 4: Ablation of the SPA design.

| Method | Mask mPA (%) ↑ | Mask EPE$_{avg}$ (nm) ↓ | Resist Image mPA (%) ↑ | Resist Image EPE$_{avg}$ (nm) ↓ | ADI mPA (%) ↑ | ADI EPE$_{avg}$ (nm) ↓ |
|---|---|---|---|---|---|---|
| w/o SPA | 70.02 | 28.06 | 71.67 | 28.41 | 58.66 | 38.47 |
| Random $\mathcal{S}$ | 83.79 | 27.04 | 83.98 | 25.82 | 64.13 | 33.91 |
| Shared SPA | 86.17 | 24.53 | 86.92 | 18.14 | 67.26 | 26.47 |
| Global PCA | 88.01 | 18.74 | 90.37 | 9.10 | 70.58 | 16.46 |
| SPA ($k=2$) | 92.17 | 6.92 | 94.56 | 2.31 | 76.22 | 10.91 |
| SPA ($k=8$) | **98.69** | **1.58** | **99.01** | **0.96** | **82.56** | **3.74** |
| SPA ($k=16$) | 96.42 | 2.96 | 97.68 | 1.78 | 80.03 | 6.59 |

#### 4.2.2 Out-of-Domain Results

As shown in [Table 3](https://arxiv.org/html/2606.26713#S4.T3), under unseen process parameter configurations at the same 55 nm node, LithoDreamer demonstrates stable OOD generalization in the forward evolution task. Compared with the static predictor LMLitho (Wang et al., [2025](https://arxiv.org/html/2606.26713#bib.bib18)), LithoDreamer reduces the EPE by 26.23 nm, 5.94 nm, and 71.56 nm on Mask, Resist Image, and ADI, respectively, suggesting that it does not memorize specific process configurations but learns physically constrained evolution directions for stable extrapolation within the same node. On the cross-node LithoSim dataset (He et al., [2026](https://arxiv.org/html/2606.26713#bib.bib30)), LithoDreamer still achieves the highest mIoU and reduces the EPE by 4.03 nm and 4.60 nm on Mask and Resist Image compared with DINO-WM (Zhou et al., [2024](https://arxiv.org/html/2606.26713#bib.bib9)). This shows that LithoDreamer can capture the effective influence of process interventions on lithography state evolution across different nodes, enabling stable extrapolation to unseen nodes.
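The EPE reductions quoted above are plain differences between forward-evolution entries of Table 3. The following sketch recomputes them as a consistency check; the variable names are illustrative.

```python
# Forward-evolution EPE (nm) from Table 3, 55 nm unseen process setting.
lmlitho_55nm = {"Mask": 28.00, "Resist Image": 7.00, "ADI": 75.61}
ours_55nm = {"Mask": 1.77, "Resist Image": 1.06, "ADI": 4.05}

# Forward-evolution EPE (nm) from Table 3, 28 nm LithoSim (no ADI available).
dino_wm_28nm = {"Mask": 5.85, "Resist Image": 5.97}
ours_28nm = {"Mask": 1.82, "Resist Image": 1.37}


def epe_reduction(baseline, ours):
    """Per-stage EPE reduction in nm, rounded to the precision used in the tables."""
    return {stage: round(baseline[stage] - ours[stage], 2) for stage in ours}


reduction_vs_lmlitho = epe_reduction(lmlitho_55nm, ours_55nm)
reduction_vs_dino_wm = epe_reduction(dino_wm_28nm, ours_28nm)
```

The largest gap appears at ADI on the 55 nm OOD set, where the static predictor's error is more than an order of magnitude above LithoDreamer's.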

In the OOD inverse planning task, LithoDreamer further demonstrates robust target-driven generalization. For the unseen 55 nm process configuration, LithoDreamer achieves EPE values of 1.51 nm, 0.97 nm, and 3.21 nm on Mask, Resist Image, and ADI, respectively, while generally outperforming executable WM baselines in mPA and mIoU. On the cross-node 28 nm LithoSim dataset (He et al., [2026](https://arxiv.org/html/2606.26713#bib.bib30)), LithoDreamer keeps the EPE of Mask and Resist Image at 1.73 nm and 1.28 nm, respectively, while achieving the highest mIoU. These results show that, even under unseen process parameters and different technology nodes, LithoDreamer can drive state evolution toward the target through physically consistent latent intervention planning, rather than merely fitting patterns within the training distribution.

### 4.3 Ablation Results

We conduct ablation studies on the ID forward evolution task to validate the effectiveness of the proposed architecture and key optimization modules. Best values are highlighted.

#### 4.3.1 Design of the SPA

[Table 4](https://arxiv.org/html/2606.26713#S4.T4) reports the ablation results of the proposed SPA method. Removing SPA increases the ADI EPE from 3.74 nm to 38.47 nm, showing that unconstrained intervention-driven evolution is unstable. Random latent spaces and global PCA provide only limited gains, indicating that generic dimensionality reduction cannot capture stage-specific evolution directions. Moreover, sharing a single SPA across stages underperforms the stage-specific SPA, confirming that different lithography stages are governed by distinct physical dynamics. Sensitivity analysis further shows that a moderate latent space dimension ($k=8$) achieves the best trade-off between expressiveness and physical constraint. Overall, the results validate SPA as a critical, physics-informed constraint rather than a simple regularization or PCA-based approximation.

Table 5: Ablation of policy model design.

| Method | Mask mPA (%) ↑ | Mask EPE$_{avg}$ (nm) ↓ | Resist Image mPA (%) ↑ | Resist Image EPE$_{avg}$ (nm) ↓ | ADI mPA (%) ↑ | ADI EPE$_{avg}$ (nm) ↓ |
|---|---|---|---|---|---|---|
| w/o Contrastive | 93.83 | 6.10 | 88.71 | 6.97 | 78.27 | 7.89 |
| $\mathbf{c}^{(s)}_{t}=\boldsymbol{\mu}^{(s)}_{t}$ | 86.73 | 15.37 | 84.57 | 20.98 | 70.53 | 21.33 |
| w/o $\boldsymbol{\sigma}^{(s)}_{t}$ | 87.29 | 14.01 | 87.54 | 15.86 | 74.36 | 17.24 |
| w/o $\mathbf{p}$ | 95.34 | 8.91 | 96.32 | 11.26 | 79.13 | 10.75 |
| Ours | **98.69** | **1.58** | **99.01** | **0.96** | **82.56** | **3.74** |

Table 6: Ablation of our transition model design.

| Method | Mask mPA (%) ↑ | Mask EPE$_{avg}$ (nm) ↓ | Resist Image mPA (%) ↑ | Resist Image EPE$_{avg}$ (nm) ↓ | ADI mPA (%) ↑ | ADI EPE$_{avg}$ (nm) ↓ |
|---|---|---|---|---|---|---|
| w/o $\mathbf{a}^{(s)}_{t}$ | 90.87 | 6.32 | 85.66 | 7.80 | 73.58 | 9.02 |
| Shared Head | 95.61 | 4.38 | 86.59 | 5.02 | 66.50 | 26.86 |
| Ours | **98.69** | **1.58** | **99.01** | **0.96** | **82.56** | **3.74** |

#### 4.3.2 Design of the Policy Model

As shown in [Table 5](https://arxiv.org/html/2606.26713#S4.T5), the ablation study validates the policy model from three aspects: contrastive constraints, stochastic intervention modeling, and process-parameter conditioning. Removing contrastive learning increases the EPE by 4.52 nm, 6.01 nm, and 4.15 nm on Mask, Resist Image, and ADI compared with the full model (6.10 nm, 6.97 nm, and 7.89 nm versus 1.58 nm, 0.96 nm, and 3.74 nm). This indicates that terminal-state variational supervision alone cannot effectively distinguish stage-matched from stage-mismatched intervention directions, which weakens stage-consistent boundary evolution.

Stochastic intervention modeling is also critical. When the policy collapses to a deterministic intervention, $\mathbf{c}^{(s)}_{t}=\boldsymbol{\mu}^{(s)}_{t}$, the EPE increases to 15.37 nm, 20.98 nm, and 21.33 nm across the three stages, showing that a single point estimate cannot capture the underdetermined intervention-state evolution relationship. Removing uncertainty modeling (w/o $\boldsymbol{\sigma}^{(s)}_{t}$) improves over the deterministic variant, reducing the EPE to 14.01 nm, 15.86 nm, and 17.24 nm across the three stages, but it remains far inferior to the full model. This suggests that the learned stochasticity is not merely noise injection, but a key mechanism for exploring effective intervention candidates within the physically feasible space. Moreover, removing the process parameter $\mathbf{p}$ still preserves a certain level of prediction ability, but the EPE remains much higher at 8.91 nm, 11.26 nm, and 10.75 nm, compared with 1.58 nm, 0.96 nm, and 3.74 nm for the full model. This demonstrates that explicit process conditioning is critical for stage-consistent intervention decisions. Overall, LithoDreamer achieves the highest mPA and lowest EPE across all stages, showing that these designs jointly improve intervention planning and state prediction.
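The contrast between the deterministic and stochastic policies can be sketched as follows. This is only an illustration: it assumes the intervention code is drawn from a diagonal Gaussian with mean $\boldsymbol{\mu}$ and standard deviation $\boldsymbol{\sigma}$ via the usual reparameterization, and that the code has $k=8$ dimensions. The exact parameterization is defined in the method section, and the function name is hypothetical.

```python
import random


def sample_intervention(mu, sigma, rng, deterministic=False):
    """Draw an intervention code c = mu + sigma * eps with eps ~ N(0, 1).

    With deterministic=True the policy collapses to the point estimate c = mu,
    which corresponds to the ablated variant.
    """
    if deterministic:
        return list(mu)
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


K = 8                              # latent dimension (assumed to match k = 8)
mu = [0.1 * i for i in range(K)]   # illustrative policy mean
sigma = [0.05] * K                 # illustrative policy standard deviation

rng = random.Random(0)
c_point = sample_intervention(mu, sigma, rng, deterministic=True)
# Several stochastic candidates explore the neighborhood of the mean.
candidates = [sample_intervention(mu, sigma, rng) for _ in range(4)]
```

Sampling several candidates around the mean gives the planner alternative intervention directions to evaluate, whereas the deterministic variant always commits to a single one.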

#### 4.3.3 Design of the Transition Model Architecture

As shown in [Table 6](https://arxiv.org/html/2606.26713#S4.T6), the ablation results highlight the importance of both intervention-aware transition modeling and stage-specific output mappings. Removing the process intervention $\mathbf{a}^{(s)}$ consistently degrades performance across all stages, increasing the EPE to 6.32 nm, 7.80 nm, and 9.02 nm for Mask, Resist Image, and ADI, respectively. This indicates that, without intervention-conditioned inputs, the transition model cannot accurately capture the state evolution dynamics driven by process interventions. The Shared Head variant still maintains reasonable performance on Mask and Resist Image, but degrades substantially on ADI, with the EPE increasing to 26.86 nm and mPA dropping to 66.50%. This suggests that different lithography stages exhibit strong representational and distributional heterogeneity, making a single shared output head insufficient to handle binary masks, resist images, and ADI states simultaneously. LithoDreamer achieves the highest mPA and lowest EPE across all stages, validating the necessity of intervention-conditioned transition modeling and stage-aware output mappings for stable multi-stage evolution.

Table 7: Ablation of adaptive state transfer determination.

| Method | Mask mPA (%) ↑ | Mask EPE$_{avg}$ (nm) ↓ | Mask Step$_{avg}$ | Resist Image mPA (%) ↑ | Resist Image EPE$_{avg}$ (nm) ↓ | Resist Image Step$_{avg}$ | ADI mPA (%) ↑ | ADI EPE$_{avg}$ (nm) ↓ | ADI Step$_{avg}$ |
|---|---|---|---|---|---|---|---|---|---|
| step = 3 | 94.56 | 4.90 | 3 | 95.93 | 4.34 | 3 | 77.83 | 19.82 | 3 |
| step = 5 | 97.03 | 3.42 | 5 | 97.72 | 3.41 | 5 | 80.02 | 5.96 | 5 |
| w/o Gradient | 97.02 | 3.60 | 6 | 98.00 | 3.32 | 8 | 80.13 | 5.67 | 9 |
| Ours | **98.69** | **1.58** | 6 | **99.01** | **0.96** | 4 | **82.56** | **3.74** | 8 |

#### 4.3.4 Design of Transfer State Determination

[Table 7](https://arxiv.org/html/2606.26713#S4.T7) evaluates inter-stage transfer and intra-stage iteration strategies of the transition model. Using a fixed number of iterations leads to a clear stage-dependent mismatch: with step = 3, all stages are under-evolved, and the degradation is most severe at ADI (EPE 19.82 nm), indicating that late-stage dynamics are more nonlinear and error-sensitive. Increasing the fixed budget to step = 5 substantially improves performance (ADI EPE 5.96 nm), yet still underperforms adaptive iteration. The adaptive strategy achieves the best results across all stages while allocating different average iteration counts (6, 4, and 8 for Mask, Resist Image, and ADI), showing that different stages require different refinement depths, with ADI requiring the deepest refinement. Notably, w/o Gradient uses even more iterations on average (9 for ADI) but yields inferior accuracy (ADI EPE 5.67 nm vs. 3.74 nm), demonstrating that the improvement does not stem from more iterations alone. These results confirm that gradient-based convergence criteria, rather than fixed or heuristic iteration budgets, are essential for stable inter-stage transfer and accurate intra-stage evolution.
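The idea of spending a stage-dependent number of refinement steps can be sketched with a toy loop. The stopping rule here is an assumption made for illustration: it halts when the largest per-element state change between consecutive steps falls below a tolerance, standing in for the gradient-based criterion, whose exact form is not given in this section. The step functions are toy contractions, not the DiT transition model.

```python
def evolve_adaptive(state, step_fn, tol=1e-3, max_steps=20):
    """Iterate step_fn until the state change drops below tol or max_steps is hit.

    Returns the final state and the number of steps taken.
    """
    steps = 0
    while steps < max_steps:
        new_state = step_fn(state)
        steps += 1
        change = max(abs(a - b) for a, b in zip(new_state, state))
        state = new_state
        if change < tol:
            break
    return state, steps


def make_toy_stage(target, rate):
    """Toy transition that moves a fraction `rate` of the way to `target` per step."""
    return lambda s: [x + rate * (t - x) for x, t in zip(s, target)]


target = [1.0, 1.0]
start = [0.0, 0.0]

# A slowly converging stage needs more refinement steps than a fast one.
_, steps_slow = evolve_adaptive(start, make_toy_stage(target, rate=0.5))
_, steps_fast = evolve_adaptive(start, make_toy_stage(target, rate=0.9))
```

With a fixed budget, the slow stage would stop early while the fast stage would waste iterations, which is the mismatch the fixed-step rows of Table 7 exhibit.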

## 5 Conclusion

We propose LithoDreamer, a physics-informed world model that formulates computational lithography as a decision-aware multi-stage physical evolution process. By learning stage-specific latent spaces with contrastive variational optimization, LithoDreamer jointly models process interventions and state transitions without intermediate supervision. Extensive ID and OOD experiments show superior accuracy and generalization, especially in edge placement precision.

## Acknowledgements

This work was supported by the Yangtze River Delta Science and Technology Innovation Community Joint Fundamental Research Project \(Grant No\. 2024CSJZN0500\)\.

## Impact Statement

This paper presents work whose goal is to advance the field of machine learning\. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here\.

## References

- E. Alonso, A. Jelley, V. Micheli, A. Kanervisto, A. Storkey, T. Pearce, and F. Fleuret (2024) Diffusion for world modeling: visual details matter in atari. In Advances in Neural Information Processing Systems, Vol. 37, pp. 58757–58791.
- A. Bar, G. Zhou, D. Tran, T. Darrell, and Y. LeCun (2025) Navigation world models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 15791–15801.
- G. Chen, H. Yang, B. Yu, and H. Ren (2025) Intelligent opc engineer assistant for semiconductor manufacturing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 23144–23151.
- J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) Bert: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pp. 4171–4186.
- A. Dosovitskiy (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- H. He, Z. Wang, J. Wang, T. Wu, X. He, B. Yu, J. Yu, and H. Geng (2026) LithoSim: a large, holistic lithography simulation benchmark for ai-driven semiconductor manufacturing. Advances in Neural Information Processing Systems 38.
- Y. Huang, J. Zhang, S. Zou, X. Liu, R. Hu, and K. Xu (2025) LaDi-WM: a latent diffusion-based world model for predictive manipulation. arXiv preprint arXiv:2505.11528.
- Y. Jiang, Y. Hu, J. Deng, X. Qiu, Y. Cui, X. He, R. Li, Q. Sun, and C. Zhuo (2026) Circuit-think: a multimodal reasoning framework for automated circuit-to-netlist translation with trajectory-guided reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 40, pp. 5477–5484.
- Y. Jiang, Q. Jin, X. Lu, J. Deng, H. Geng, H. Wu, Q. Sun, and C. Zhuo (2025) Fabthink: a wafer analysis multimodal llm via chain-of-thought-driven retrieval augmentation. In 2025 IEEE/ACM International Conference On Computer Aided Design (ICCAD), pp. 1–9.
- Y. Jiang, X. Lu, Q. Jin, Q. Sun, H. Wu, and C. Zhuo (2024) Fabgpt: an efficient large multimodal model for complex wafer defect knowledge queries. In Proceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design, pp. 1–8.
- Q. Jin, Y. Liu, Y. Jiang, Q. Sun, and C. Zhuo (2025a) Unitho: a unified multi-task framework for computational lithography. pp. 1–8.
- Q. Jin, Q. Peng, Y. Liu, X. Qiu, and Q. Sun (2025b) Recent advances in computational lithography technology. Moore and More 2(1), pp. 1–18.
- S. Jin, S. Yu, B. Zhang, M. Sun, Y. Dong, and J. Xiao (2025c) Feature purification matters: suppressing outlier propagation for training-free open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 20291–20300.
- W. Peebles and S. Xie (2023) Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4195–4205.
- Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin (2016) Variational autoencoder for deep learning of images, labels and captions. Advances in neural information processing systems 29.
- H. Shao, C. Peng, J. Wu, C. Lin, S. Fang, P. Tsai, and Y. Liu (2020) From ic layout to die photograph: a cnn-based data-driven approach. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 40(5), pp. 957–970.
- Z. Wang, H. He, T. Wu, X. He, Q. Sun, C. Zhuo, B. Yu, J. Yu, and H. Geng (2025) LMLitho: a large vision model-driven lithography simulation framework. In 2025 IEEE/ACM International Conference On Computer Aided Design (ICCAD), pp. 1–9.
- H. Yang and H. Ren (2023) Enabling scalable AI computational lithography with physics-inspired models. In Proceedings of the 28th Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 715–720.
- Y. Yang, K. Liu, Y. Gao, C. Wang, and L. Cao (2025) Advancements and challenges in inverse lithography technology: a review of artificial intelligence-based approaches. Light: Science & Applications 14, pp. 250.
- G. Zhao, X. Wang, Z. Zhu, X. Chen, G. Huang, X. Bao, and X. Wang (2025) Drivedreamer-2: llm-enhanced world models for diverse driving video generation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 39, pp. 10412–10420.
- G. Zhou, H. Pan, Y. LeCun, and L. Pinto (2024) Dino-wm: world models on pre-trained visual features enable zero-shot planning. arXiv preprint arXiv:2411.04983.
- B. Zhu, S. Zheng, Z. Yu, G. Chen, Y. Ma, F. Yang, B. Yu, and M. D. Wong (2023) L2O-ilt: learning to optimize inverse lithography techniques. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43(3), pp. 944–955.

The appendix is divided into several sections, each providing additional experimental details, metric definitions, and qualitative analyses\.

- [A Detailed Experimental Inference Setups](https://arxiv.org/html/2606.26713#A1)
  - [A.1 Autonomous Forward Evolution](https://arxiv.org/html/2606.26713#A1.SS1)
  - [A.2 Target-Conditioned Inverse Planning](https://arxiv.org/html/2606.26713#A1.SS2)
- [B Principles of Lithography Metric Calculation](https://arxiv.org/html/2606.26713#A2)
  - [B.1 Gauge-based EPE](https://arxiv.org/html/2606.26713#A2.SS1)
  - [B.2 LRC-style Violation Counts](https://arxiv.org/html/2606.26713#A2.SS2)
- [C Detailed Light Source](https://arxiv.org/html/2606.26713#A3)
- [D Additional Experiments of Latent Spaces](https://arxiv.org/html/2606.26713#A4)
  - [D.1 Design of Adjacent State Pairs in SPA](https://arxiv.org/html/2606.26713#A4.SS1)
  - [D.2 Validation of the Latent Space Constructed by SPA](https://arxiv.org/html/2606.26713#A4.SS2)
- [E Visual Comparisons of Forward Evolution](https://arxiv.org/html/2606.26713#A5)

## Appendix A Detailed Experimental Inference Setups

LithoDreamer is evaluated under two inference modes: autonomous forward evolution and target-conditioned inverse planning. In both modes, the encoder, decoder, policy model, transition model, and latent spaces are fixed at test time. Unlike the target-guided state-gate criterion used during training, which relies on observed stage targets, inference uses a target-free latent-convergence criterion, because intermediate ground-truth states are unavailable at test time.

### A.1 Autonomous Forward Evolution

Given an input layout $\mathbf{x}^{(1)}$ and a process condition $\mathbf{p}$, forward evolution aims to autonomously explore process intervention sequences and evolve the lithography state through the multi-stage pipeline:

$$\mathrm{Layout}\rightarrow\mathrm{Mask}\rightarrow\mathrm{Resist\ Image}\rightarrow\mathrm{ADI}.$$

For each stage $s$, the current state at evolution step $t$ is encoded as $\mathbf{z}^{(s)}_{t}=E(\mathbf{x}^{(s)}_{t})$. The policy model generates stage-wise intervention coefficients conditioned on the current latent state and process condition:

$$\mathbf{c}^{(s)}_{t}\sim\pi_{\theta}(\cdot\mid\mathbf{z}^{(s)}_{t},\mathbf{p}),$$

which are mapped to the stage-specific feasible space by:

$$\mathbf{a}^{(s)}_{t}=\mathbf{B}^{(s)}\mathbf{c}^{(s)}_{t}.$$

The transition model then predicts the next latent state:

$$\mathbf{z}^{(s)}_{t+1}=f_{\phi}(\mathbf{z}^{(s)}_{t},\mathbf{a}^{(s)}_{t},\mathbf{p}).$$

Within each stage, the evolution continues until the relative latent-state change satisfies:

$$r^{(s)}_{t}=\frac{\left\|\mathbf{z}^{(s)}_{t+1}-\mathbf{z}^{(s)}_{t}\right\|_{2}}{\left\|\mathbf{z}^{(s)}_{t}\right\|_{2}+\delta}<\epsilon_{\mathrm{lat}},$$

or the maximum number of evolution steps $K_{s}$ is reached. We set $\epsilon_{\mathrm{lat}}=5\times 10^{-3}$, $K_{s}=10$, and $\delta=10^{-8}$ for numerical stability. The converged latent state is decoded as the output of the current stage and passed to the next stage. In this mode, process interventions are explored and generated online by the learned policy and are not optimized by test-time backpropagation.
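The per-stage loop described above can be sketched in a few lines of plain Python. This is only an illustration of the stopping rule, not the authors' implementation: `toy_policy`, `toy_transition`, `ATTRACTOR`, and `B_IDENTITY` are hypothetical stand-ins for the learned policy $\pi_{\theta}$, the transition model $f_{\phi}$, and the basis $\mathbf{B}^{(s)}$, and the policy is made deterministic so the run is reproducible. Only $\epsilon_{\mathrm{lat}}$, $K_{s}$, and $\delta$ are taken from the text.

```python
import math

EPS_LAT = 5e-3  # epsilon_lat, value stated in the text
K_S = 10        # maximum evolution steps per stage, value stated in the text
DELTA = 1e-8    # delta, value stated in the text


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


def relative_change(z_next, z):
    """r_t = ||z_{t+1} - z_t||_2 / (||z_t||_2 + delta)."""
    diff = [a - b for a, b in zip(z_next, z)]
    return l2_norm(diff) / (l2_norm(z) + DELTA)


def matvec(B, c):
    """a_t = B c_t: map coefficients into the stage-specific feasible space."""
    return [sum(b * x for b, x in zip(row, c)) for row in B]


def evolve_stage(z0, p, policy, transition, B, eps=EPS_LAT, max_steps=K_S):
    """Evolve one stage until the latent converges or max_steps is reached.

    Returns the final latent and the number of transition steps taken.
    """
    z = list(z0)
    steps = 0
    for _ in range(max_steps):
        c = policy(z, p)
        a = matvec(B, c)
        z_next = transition(z, a, p)
        r = relative_change(z_next, z)
        z, steps = z_next, steps + 1
        if r < eps:
            break
    return z, steps


# Hypothetical stand-ins for the learned policy and transition model.
ATTRACTOR = [1.0, -2.0]


def toy_policy(z, p):
    # Deterministic for reproducibility; the policy in the text is sampled.
    return [0.9 * (g - x) for g, x in zip(ATTRACTOR, z)]


def toy_transition(z, a, p):
    return [x + da for x, da in zip(z, a)]


B_IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
z_final, n_steps = evolve_stage([4.0, 4.0], None, toy_policy, toy_transition, B_IDENTITY)
print(n_steps, z_final)
```

With these toy dynamics the relative change drops below $\epsilon_{\mathrm{lat}}$ well before the step cap, so the loop exits through the convergence test. In the full pipeline the returned latent would be decoded and passed to the next stage.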

### A.2 Target-Conditioned Inverse Planning

Given an input layout $\mathbf{x}^{(1)}$, a process condition $\mathbf{p}$, and a target ADI $\mathbf{x}^{(\mathrm{ADI})}_{\mathrm{tar}}$, inverse planning optimizes a feasible multi-stage intervention sequence to drive the final state toward the target. (In the 28 nm dataset (He et al., [2026](https://arxiv.org/html/2606.26713#bib.bib30)), the given target is the Resist Image; the formulas below are written for a target ADI.) All network parameters and latent spaces remain frozen. The optimization variables are the stage-wise intervention coefficients $\{\mathbf{c}^{(s)}_{t}\}$, initialized from the explored policy prior. At each outer planning iteration, the coefficients are mapped into feasible interventions:

$$\mathbf{a}^{(s)}_{t}=\mathbf{B}^{(s)}\mathbf{c}^{(s)}_{t},$$

and a differentiable multi-stage evolution is performed using the same target-free stopping rule as forward evolution. After decoding the final ADI prediction $\hat{\mathbf{x}}^{(\mathrm{ADI})}$, we optimize the intervention coefficients with the terminal planning objective:

$$\mathcal{L}_{\mathrm{plan}}=\lambda_{\mathrm{rec}}\mathcal{L}_{\mathrm{rec}}\left(\hat{\mathbf{x}}^{(\mathrm{ADI})},\mathbf{x}^{(\mathrm{ADI})}_{\mathrm{tar}}\right)+\lambda_{\mathrm{edge}}\mathcal{L}_{\mathrm{edge}}\left(\hat{\mathbf{x}}^{(\mathrm{ADI})},\mathbf{x}^{(\mathrm{ADI})}_{\mathrm{tar}}\right)+\lambda_{\mathrm{prior}}\mathcal{L}_{\mathrm{prior}}+\lambda_{\mathrm{smooth}}\mathcal{L}_{\mathrm{smooth}},$$

where we set $\lambda_{\mathrm{rec}}=0.5$, $\lambda_{\mathrm{edge}}=0.5$, $\lambda_{\mathrm{prior}}=10^{-3}$, and $\lambda_{\mathrm{smooth}}=10^{-3}$. Here, $\mathcal{L}_{\mathrm{edge}}$ denotes a differentiable edge-alignment loss used for inverse-planning optimization. In our implementation, it is computed from Sobel edge maps:

$$\mathcal{L}_{\mathrm{edge}}=\left\|\mathcal{G}(\hat{\mathbf{x}}^{(\mathrm{ADI})})-\mathcal{G}(\mathbf{x}^{(\mathrm{ADI})}_{\mathrm{tar}})\right\|_{1},$$

where $\mathcal{G}(\cdot)$ denotes the differentiable Sobel edge-magnitude operator. This loss encourages contour alignment during optimization, while the official EPE values reported in all experiments are computed using the gauge-based EPE in Appendix [B](https://arxiv.org/html/2606.26713#A2). The prior and smoothness terms are defined as:

$$\mathcal{L}_{\mathrm{prior}}=\sum_{s,t}\left\|\mathbf{c}^{(s)}_{t}-\boldsymbol{\mu}^{(s)}_{t}\right\|_{2}^{2},\quad\mathcal{L}_{\mathrm{smooth}}=\sum_{s,t}\left\|\mathbf{c}^{(s)}_{t}-\mathbf{c}^{(s)}_{t-1}\right\|_{2}^{2}.$$

The planning objective is backpropagated only to $\{\mathbf{c}^{(s)}_{t}\}$, while all model parameters remain fixed. Inverse planning does not require a target Mask or target Resist Image; the optimization is driven only by the target ADI, and physical feasibility is enforced by the stage-specific latent spaces $\mathcal{S}^{(s)}$.
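To make the terms of the planning objective concrete, the sketch below evaluates $\mathcal{L}_{\mathrm{plan}}$ on small 2D lists in plain Python. It only computes the loss value and does not perform the gradient-based update of the coefficients. Several choices are assumptions, not details from the text: $\mathcal{L}_{\mathrm{rec}}$ is taken to be a mean absolute error, the Sobel operator is applied on interior pixels only, and the smoothness sum starts at the second step of each stage because no boundary convention is given. All function names are illustrative.

```python
import math

# Loss weights stated in the text.
LAMBDA_REC, LAMBDA_EDGE = 0.5, 0.5
LAMBDA_PRIOR, LAMBDA_SMOOTH = 1e-3, 1e-3

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def conv3(img, kernel, i, j):
    """Correlate a 3x3 kernel with the neighbourhood centred at (i, j)."""
    return sum(kernel[di][dj] * img[i - 1 + di][j - 1 + dj]
               for di in range(3) for dj in range(3))


def sobel_magnitude(img):
    """Sobel edge magnitude on interior pixels (ASSUMPTION: no padding)."""
    h, w = len(img), len(img[0])
    return [[math.hypot(conv3(img, SOBEL_X, i, j), conv3(img, SOBEL_Y, i, j))
             for j in range(1, w - 1)] for i in range(1, h - 1)]


def l1_distance(a, b):
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def edge_loss(pred, target):
    """L_edge = || G(pred) - G(target) ||_1."""
    return l1_distance(sobel_magnitude(pred), sobel_magnitude(target))


def rec_loss(pred, target):
    """ASSUMPTION: mean absolute error; the text does not define L_rec."""
    return l1_distance(pred, target) / (len(pred) * len(pred[0]))


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def prior_loss(coeffs, means):
    """Sum over stages s and steps t of ||c_t - mu_t||^2; coeffs[s][t] is a vector."""
    return sum(sq_dist(c, m) for cs, ms in zip(coeffs, means) for c, m in zip(cs, ms))


def smooth_loss(coeffs):
    """Sum of ||c_t - c_{t-1}||^2 (ASSUMPTION: starting at the second step)."""
    return sum(sq_dist(cs[t], cs[t - 1]) for cs in coeffs for t in range(1, len(cs)))


def plan_loss(pred, target, coeffs, means):
    return (LAMBDA_REC * rec_loss(pred, target)
            + LAMBDA_EDGE * edge_loss(pred, target)
            + LAMBDA_PRIOR * prior_loss(coeffs, means)
            + LAMBDA_SMOOTH * smooth_loss(coeffs))


# Toy 4x4 patterns: the predicted vertical edge sits one pixel left of the target.
target_img = [[0, 0, 1, 1] for _ in range(4)]
pred_img = [[0, 1, 1, 1] for _ in range(4)]
flat = [[[0.0], [0.0]]]  # one stage, two steps, 1-D coefficients
print(plan_loss(pred_img, target_img, flat, flat))
```

In an actual planner the same quantities would be built in an automatic-differentiation framework so that the gradient reaches only the coefficients. A small constant under the square root is usually added there, since the magnitude is not differentiable at zero gradient.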

![Refer to caption](https://arxiv.org/html/2606.26713v1/x5.png)

Figure 5: Schematic illustration of gauge-based EPE measurement. Local measurement gauges are placed on the target resist image contour, and edge displacement is evaluated along the corresponding contour-normal direction. The magnified view highlights how the measured offset captures local contour placement deviation between the generated and target resist image patterns.

## Appendix B Principles of Lithography Metric Calculation

Lithography evaluation cannot be fully characterized by generic image-similarity metrics, because nanometer-scale contour shifts and local topological failures may directly affect manufacturability. Therefore, in addition to pixel- and region-level metrics, we report EPE and LRC-style violation counts. All geometric distances are converted to nanometers using a fixed pixel-to-length scaling factor $\alpha$ shared across all methods and process conditions.

### B.1 Gauge-based EPE

As illustrated in [Figure 5](https://arxiv.org/html/2606.26713#A1.F5), EPE evaluates the local contour placement error between the target pattern and the generated pattern. The blue layout is shown as the design reference, while the green and orange contours denote the target resist image and generated resist image, respectively. EPE is measured by placing gauge points on the target contour and computing the displacement from the target contour to the generated contour along the local normal direction.

Let $I^{\mathrm{gt}}$ and $I^{\mathrm{pred}}$ denote the ground-truth and predicted patterns. We convert them into polygonal contours, yielding the target contour $\partial T$ and the predicted contour $\partial C$. Gauge points are uniformly sampled on $\partial T$:

$$V=\{p_{i}\}_{i=1}^{N}.$$

For each gauge point $p_{i}$, let $\mathbf{n}(p_{i})$ be the outward unit normal of the target contour. The corresponding normal sampling line is defined as:

$$\ell(p_{i})=\{p_{i}+t\,\mathbf{n}(p_{i})\mid t\in\mathbb{R}\}.$$

The matched point on the predicted contour is selected as the closest intersection between $\ell(p_{i})$ and $\partial C$:

$$q_{i}=\arg\min_{q\in\partial C\cap\ell(p_{i})}\|q-p_{i}\|_{2}.$$

If no valid intersection exists, the gauge point is excluded from aggregation. We denote the remaining valid gauge set as $V_{\mathrm{val}}$.

The signed local displacement at $p_{i}$ is computed as:

$$d(p_{i})=\left\langle q_{i}-p_{i},\mathbf{n}(p_{i})\right\rangle,$$

where positive and negative values indicate outward and inward contour bias, respectively. The signed local EPE in physical units is:

$$\widetilde{\mathrm{EPE}}(p_{i})=\alpha\,d(p_{i}).$$

The final reported EPE is the mean absolute local displacement:

$$\mathrm{EPE}_{\mathrm{avg}}=\frac{1}{|V_{\mathrm{val}}|}\sum_{p_{i}\in V_{\mathrm{val}}}\left|\widetilde{\mathrm{EPE}}(p_{i})\right|.$$

This gauge-based protocol is used as the official EPE calculation for all reported results. The differentiable $\mathcal{L}_{\mathrm{edge}}$ used in inverse planning is a Sobel-based edge-alignment loss that encourages contour consistency during optimization. It is not used as our EPE metric; all reported EPE values follow the gauge-based protocol defined here.
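The gauge-based protocol can be illustrated with a small pure-Python sketch on polygonal contours. It follows the definitions above (normal line, closest intersection, signed displacement, mean absolute value over valid gauges), but simplifies the sampling: as an assumption, one gauge is placed at the midpoint of each edge of a counter-clockwise target polygon instead of sampling uniformly along the contour. The function names and the value of `alpha` in the example are illustrative.

```python
import math


def cross(u, v):
    """2D cross product u_x v_y - u_y v_x."""
    return u[0] * v[1] - u[1] * v[0]


def edges(poly):
    """Consecutive vertex pairs of a closed polygon."""
    return [(poly[i], poly[(i + 1) % len(poly)]) for i in range(len(poly))]


def midpoint_gauges(target):
    """ASSUMPTION: one gauge per edge midpoint of a counter-clockwise polygon."""
    gauges = []
    for a, b in edges(target):
        ex, ey = b[0] - a[0], b[1] - a[1]
        length = math.hypot(ex, ey)
        p = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
        n = (ey / length, -ex / length)  # outward unit normal for CCW orientation
        gauges.append((p, n))
    return gauges


def signed_displacement(p, n, pred):
    """Signed offset t of the closest intersection of {p + t n} with pred.

    Because n is a unit vector, t equals <q - p, n>. Returns None when the
    normal line misses the predicted contour (gauge excluded from aggregation).
    """
    best = None
    for a, b in edges(pred):
        e = (b[0] - a[0], b[1] - a[1])
        denom = cross(n, e)
        if abs(denom) < 1e-12:  # normal line parallel to this edge
            continue
        ap = (a[0] - p[0], a[1] - p[1])
        t = cross(ap, e) / denom  # position along the normal line
        u = cross(ap, n) / denom  # position along the edge, valid in [0, 1]
        if 0.0 <= u <= 1.0 and (best is None or abs(t) < abs(best)):
            best = t
    return best


def epe_avg(target, pred, alpha):
    """Return (EPE_avg, signed local EPE values) over the valid gauges."""
    raw = [signed_displacement(p, n, pred) for p, n in midpoint_gauges(target)]
    signed = [alpha * d for d in raw if d is not None]
    if not signed:
        return None, signed
    return sum(abs(v) for v in signed) / len(signed), signed


# Toy contours in pixel units, counter-clockwise.
target_sq = [(0, 0), (10, 0), (10, 10), (0, 10)]
grown_sq = [(-1, -1), (11, -1), (11, 11), (-1, 11)]   # 1 px outward on every side
shifted_sq = [(1, 0), (11, 0), (11, 10), (1, 10)]      # shifted 1 px to the right
print(epe_avg(target_sq, grown_sq, alpha=2.0))
print(epe_avg(target_sq, shifted_sq, alpha=2.0))
```

In the shifted example the right-hand gauge reports an outward (positive) offset and the left-hand gauge an inward (negative) one, while the top and bottom gauges see no displacement, which is why the signed values are kept alongside the averaged magnitude.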

![Refer to caption](https://arxiv.org/html/2606.26713v1/x6.png)

Figure 6: Representative LRC violation categories used for manufacturability assessment. Pinch captures locally narrowed printed features, Bridge captures unintended connections or insufficient spacing between neighboring structures, and EPE captures excessive contour displacement beyond the allowed placement tolerance. Red markers indicate the detected violation regions.

### B.2 LRC-style Violation Counts

Beyond continuous EPE statistics, we further report LRC-style violation counts to assess manufacturing-rule consistency. As shown in [Figure 6](https://arxiv.org/html/2606.26713#A2.F6), we consider three representative lithography rule violations: Pinch, Bridge, and EPE. The blue contours denote the target design, the green hatched regions denote the generated resist image, and the red hatched regions indicate detected LRC violation markers.

A Pinch violation occurs when the local printed linewidth becomes smaller than the minimum allowable width. A Bridge violation occurs when two structures that should remain separated become unintentionally connected, or when their spacing falls below the allowed minimum. An EPE violation occurs when the magnitude of the signed local EPE exceeds the placement tolerance:

$$\left|\widetilde{\mathrm{EPE}}(p_{i})\right|>\tau_{\mathrm{EPE}},\quad p_{i}\in V_{\mathrm{val}}.$$

The verification flow returns violation markers for each category, and we report the corresponding marker counts:

$$N_{\mathrm{pinch}}=|\mathcal{M}_{\mathrm{pinch}}|,\quad N_{\mathrm{bridge}}=|\mathcal{M}_{\mathrm{bridge}}|,\quad N_{\mathrm{epe}}=|\mathcal{M}_{\mathrm{epe}}|.$$

All methods are evaluated using the same contour extraction, gauge sampling, verification deck, and marker reporting settings, ensuring that the reported EPE and LRC counts reflect lithography contour fidelity and manufacturability rather than measurement-configuration differences.
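As a rough illustration of the three rules, the following sketch counts violations from lists of already measured linewidths, spacings, and signed local EPE values. It is a simplification under stated assumptions: a real verification deck reports geometric markers, which may merge neighboring measurements, whereas this sketch counts one violation per offending measurement. The thresholds `w_min_nm`, `s_min_nm`, and `tau_epe_nm` and all numbers in the example are hypothetical; the text does not give their values.

```python
def lrc_counts(widths_nm, spacings_nm, signed_epe_nm, w_min_nm, s_min_nm, tau_epe_nm):
    """Count Pinch, Bridge, and EPE violations from per-location measurements.

    ASSUMPTION: one violation per offending measurement; a spacing of 0
    represents two structures that have merged.
    """
    return {
        # Pinch: printed linewidth below the minimum allowable width.
        "pinch": sum(1 for w in widths_nm if w < w_min_nm),
        # Bridge: structures connected or closer than the minimum spacing.
        "bridge": sum(1 for s in spacings_nm if s < s_min_nm),
        # EPE: |signed local EPE| strictly above the placement tolerance.
        "epe": sum(1 for e in signed_epe_nm if abs(e) > tau_epe_nm),
    }


# Hypothetical measurements and rule values, in nanometers.
counts = lrc_counts(
    widths_nm=[40.0, 28.0, 35.0],
    spacings_nm=[50.0, 0.0, 31.0],
    signed_epe_nm=[1.0, -3.5, 2.0, 4.1],
    w_min_nm=30.0,
    s_min_nm=32.0,
    tau_epe_nm=3.0,
)
print(counts)
```

The EPE rule uses the magnitude of the signed value, so inward and outward displacements of the same size are treated alike.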

## Appendix C Detailed Light Source

The illumination source is a key optical degree of freedom in computational lithography\. It defines the angular distribution of light incident on the mask and determines how different diffraction orders and spatial\-frequency components are transferred through the projection optics\. Therefore, changing the source shape can affect aerial\-image contrast, edge sharpness, process\-window behavior, and the dominant failure modes of the printed pattern\.

In our evaluation, we consider three representative source families, as shown in [Figure 3](https://arxiv.org/html/2606.26713#S4.F3): Circular, Annular, and Bull's Eye. These sources cover three typical illumination regimes: on-axis illumination, off-axis illumination, and mixed illumination.

Circular source. The circular source represents conventional on-axis illumination, where the illumination energy is distributed around the optical axis. This source provides relatively balanced imaging behavior and is effective for transferring low- and medium-spatial-frequency components. It is commonly suitable for isolated features or layouts without strong periodicity. However, for dense patterns or high-frequency structures, the limited angular diversity of circular illumination may reduce image contrast and increase sensitivity to edge-placement errors.

Annular source. The annular source concentrates illumination energy in an off-axis ring while suppressing the central illumination region. This off-axis configuration enhances the transfer of selected higher spatial-frequency components and is often beneficial for dense or periodic patterns. By changing the balance of diffraction orders captured by the projection optics, annular illumination can improve local image contrast and depth-of-focus behavior. Meanwhile, it may also introduce source-dependent contour shifts, line-end distortions, or different pinch/bridge tendencies compared with on-axis illumination.

Bull's Eye source. The Bull's Eye source combines a central circular component with an outer annular component. It therefore represents a mixed illumination condition that includes both on-axis and off-axis contributions. The central component helps preserve low-frequency information and isolated-feature fidelity, while the annular component enhances the transfer of higher-frequency information in dense regions. As a result, the Bull's Eye source provides a more balanced illumination regime and can produce contour behaviors distinct from either purely circular or purely annular illumination.

Together, these three source families provide representative coverage of illumination conditions commonly used in computational lithography\. Since the source distribution directly changes the optical transfer characteristics, different source types can lead to different mask corrections, resist\-image evolution, final contour morphology, and manufacturing\-rule violations\. Including these source families allows us to evaluate whether the proposed model can generalize across source\-dependent lithographic variations rather than being limited to a single fixed imaging condition\.
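The geometric difference between the three families can be shown with binary source maps on a normalized pupil grid. This is a minimal sketch, not the source definitions used in the experiments: the radii `sigma_core`, `sigma_in`, and `sigma_out`, the grid size, and the hard 0/1 intensity profile are all assumed values chosen only for illustration.

```python
import math


def source_map(kind, n=21, sigma_in=0.5, sigma_out=0.8, sigma_core=0.2):
    """Binary source map on an n x n grid over pupil coordinates [-1, 1]^2.

    ASSUMPTION: all radii are illustrative. 'circular' fills a disc of radius
    sigma_out, 'annular' fills the ring [sigma_in, sigma_out], and 'bullseye'
    adds a central disc of radius sigma_core to that ring.
    """
    grid = []
    for i in range(n):
        y = -1.0 + 2.0 * i / (n - 1)
        row = []
        for j in range(n):
            x = -1.0 + 2.0 * j / (n - 1)
            r = math.hypot(x, y)
            ring = sigma_in <= r <= sigma_out
            if kind == "circular":
                on = r <= sigma_out
            elif kind == "annular":
                on = ring
            elif kind == "bullseye":
                on = ring or r <= sigma_core
            else:
                raise ValueError(f"unknown source kind: {kind}")
            row.append(1 if on else 0)
        grid.append(row)
    return grid


def fill(grid):
    """Number of illuminated grid points."""
    return sum(map(sum, grid))


for kind in ("circular", "annular", "bullseye"):
    print(kind, fill(source_map(kind)))
```

The center pixel is lit for the circular and Bull's Eye maps but dark for the annular map, which mirrors the on-axis, off-axis, and mixed regimes described above.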

Table 8: Ablation of the number of sampled adjacent states for SPA basis estimation in the forward evolution task of the ID dataset. Best values are highlighted.

| Number | Mask mPA ↑ | Mask mIoU ↑ | Mask $\mathrm{EPE}_{\mathrm{avg}}$ (nm) ↓ | Resist Image mPA ↑ | Resist Image mIoU ↑ | Resist Image $\mathrm{EPE}_{\mathrm{avg}}$ (nm) ↓ | ADI mPA ↑ | ADI mIoU ↑ | ADI $\mathrm{EPE}_{\mathrm{avg}}$ (nm) ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1k | 90.17 | 86.76 | 3.47 | 92.78 | 87.64 | 2.73 | 70.35 | 67.18 | 8.07 |
| 5k | 95.01 | 89.38 | 2.02 | 96.26 | 95.06 | 7.41 | 78.93 | 74.36 | 5.61 |
| 10k | **98.69** | **96.91** | **1.58** | **99.01** | **97.77** | **0.96** | **82.56** | **78.27** | **3.74** |
| 15k | 98.65 | 96.78 | 1.73 | 98.97 | 97.32 | 0.99 | 82.07 | 78.09 | 3.92 |
| 20k | 97.18 | 94.13 | 1.96 | 98.73 | 96.81 | 1.27 | 82.00 | 77.64 | 4.64 |

## Appendix D Additional Experiments of Latent Spaces

This section provides additional experiments to further analyze the construction and transferability of the proposed stage-specific latent spaces. We first study how the number of adjacent state pairs affects SPA basis estimation, aiming to identify a suitable sample size for capturing physically meaningful evolution directions. We then validate whether the learned latent spaces can serve as transferable physical priors for general WMs by incorporating $\mathcal{S}^{(s)}$ into different world model baselines and evaluating their forward evolution and inverse planning performance.

### D.1 Design of Adjacent State Pairs in SPA

[Table 8](https://arxiv.org/html/2606.26713#A3.T8) evaluates the effect of the number of sampled adjacent states for SPA basis estimation on the ID forward evolution task. Increasing the sample size from 1k to 10k generally improves the final performance, although intermediate sample sizes may show stage-dependent fluctuations. In particular, the EPE is reduced by 1.89 nm, 1.77 nm, and 4.33 nm for Mask, Resist Image, and ADI, respectively, indicating that sufficient adjacent-state samples provide a more accurate estimate of physically admissible evolution directions.

However, further increasing the sample size beyond 10k does not bring additional benefits\. Compared with 15k and 20k, the 10k setting maintains higher mPA and mIoU while achieving lower EPE, especially for ADI, where the EPE is lower than the 20k setting by 0\.90 nm\. This suggests that overly large sample sets may introduce non\-local or heterogeneous variations, weakening the locality assumption of SPA\. Overall, 10k offers the best balance between covering local state variations and preserving physically meaningful evolution directions\.
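The EPE differences quoted in this subsection can be rechecked directly from the $\mathrm{EPE}_{\mathrm{avg}}$ columns of Table 8. The short script below copies those values and recomputes the reductions; the dictionary layout and helper name are just illustrative choices.

```python
# EPE_avg in nm per sample size, as (Mask, Resist Image, ADI), copied from Table 8.
EPE_NM = {
    "1k": (3.47, 2.73, 8.07),
    "5k": (2.02, 7.41, 5.61),
    "10k": (1.58, 0.96, 3.74),
    "15k": (1.73, 0.99, 3.92),
    "20k": (1.96, 1.27, 4.64),
}


def reduction(frm, to):
    """Per-stage EPE reduction when moving from setting `frm` to setting `to`."""
    return tuple(round(a - b, 2) for a, b in zip(EPE_NM[frm], EPE_NM[to]))


print(reduction("1k", "10k"))   # Mask, Resist Image, ADI gains from 1k to 10k
print(reduction("20k", "10k"))  # how much lower 10k is than 20k
print(min(EPE_NM, key=lambda k: EPE_NM[k][2]))  # setting with the lowest ADI EPE
```

The recomputed values agree with the text: 1.89, 1.77, and 4.33 nm from 1k to 10k, and a 0.90 nm ADI gap between the 20k and 10k settings.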

### D.2 Validation of the Latent Space Constructed by SPA

As shown in [Table 9](https://arxiv.org/html/2606.26713#A4.T9), introducing the stage-specific latent spaces $\mathcal{S}^{(s)}$ consistently improves the forward evolution performance of general WMs under both ID and OOD settings. For DINO-WM (Zhou et al., [2024](https://arxiv.org/html/2606.26713#bib.bib9)), $\mathcal{S}^{(s)}$ increases Mask mIoU by 3.00/6.59 points and reduces Mask EPE by 5.64/0.64 nm on ID/OOD data, showing stronger contour-level stability. For DriveDreamer-2 (Zhao et al., [2025](https://arxiv.org/html/2606.26713#bib.bib20)), the gain is most pronounced in ADI, where mIoU improves by 41.43/30.58 points and mPA increases by 5.35/8.43 points, indicating that stage-specific latent spaces effectively mitigate cross-stage error accumulation. NWM also benefits substantially. These results demonstrate that $\mathcal{S}^{(s)}$ provides transferable physical constraints for generic WMs, leading to more coherent multi-stage lithography evolution.

[Table 10](https://arxiv.org/html/2606.26713#A4.T10) further shows that $\mathcal{S}^{(s)}$ improves goal-directed inverse planning. For DINO-WM (Zhou et al., [2024](https://arxiv.org/html/2606.26713#bib.bib9)), adding $\mathcal{S}^{(s)}$ improves Mask mIoU by 2.74/8.72 points and reduces Mask EPE by 4.11/0.08 nm, suggesting more stable target-conditioned intervention search. These consistent gains indicate that stage-specific latent spaces constrain the inverse-planning trajectory toward physically feasible evolution directions, improving both target convergence and OOD robustness.

Table 9: Forward evolution performance of general WMs with and without stage-specific latent spaces $\mathcal{S}^{(s)}$. Values before and after "/" denote ID/OOD (55 nm with F+) results. Results after incorporating $\mathcal{S}^{(s)}$ are highlighted.

| Model | Mask mPA ↑ | Mask mIoU ↑ | Mask $\mathrm{EPE}_{\mathrm{avg}}$ ↓ | Resist Image mPA ↑ | Resist Image mIoU ↑ | Resist Image $\mathrm{EPE}_{\mathrm{avg}}$ ↓ | ADI mPA ↑ | ADI mIoU ↑ | ADI $\mathrm{EPE}_{\mathrm{avg}}$ ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DINO-WM (Zhou et al., [2024](https://arxiv.org/html/2606.26713#bib.bib9)) | 93.47/89.87 | 90.04/75.98 | 21.96/15.67 | 95.68/93.51 | 93.89/76.25 | 17.37/2.96 | 73.57/71.81 | 69.41/64.06 | 19.71/4.28 |
| DINO-WM + $\mathcal{S}^{(s)}$ | **94.78/91.61** | **93.04/82.57** | **16.32/15.03** | **95.80/93.92** | **94.19/78.46** | **16.90/2.71** | **74.33/72.73** | **72.61/66.10** | **18.26/4.22** |
| DriveDreamer-2 (Zhao et al., [2025](https://arxiv.org/html/2606.26713#bib.bib20)) | 93.65/88.44 | 83.17/54.72 | 4.94/6.26 | 92.59/94.27 | 78.74/78.21 | 7.72/5.22 | 74.93/68.11 | 14.58/9.04 | 5.58/5.49 |
| DriveDreamer-2 + $\mathcal{S}^{(s)}$ | **95.47/89.95** | **84.80/57.66** | **4.64/6.01** | **92.59/94.83** | **80.71/79.03** | **5.13/4.60** | **80.28/76.54** | **56.01/39.62** | **5.07/4.89** |
| NWM (Bar et al., [2025](https://arxiv.org/html/2606.26713#bib.bib19)) | 91.81/84.04 | 85.95/75.92 | 26.36/26.36 | 91.81/88.25 | 85.95/80.29 | 26.36/3.04 | 71.43/70.94 | 61.35/60.96 | 26.36/4.62 |
| NWM + $\mathcal{S}^{(s)}$ | **93.94/88.62** | **90.74/83.67** | **23.81/24.07** | **94.72/93.65** | **89.20/87.90** | **22.51/2.93** | **79.66/76.51** | **72.04/70.83** | **19.91/4.02** |

Table 10: Target-conditioned inverse planning performance of general WMs with and without stage-specific latent spaces $\mathcal{S}^{(s)}$. Values before and after "/" denote ID/OOD (55 nm with F+) results. Results after incorporating $\mathcal{S}^{(s)}$ are highlighted.

| Model | Mask mPA ↑ | Mask mIoU ↑ | Mask $\mathrm{EPE}_{\mathrm{avg}}$ ↓ | Resist Image mPA ↑ | Resist Image mIoU ↑ | Resist Image $\mathrm{EPE}_{\mathrm{avg}}$ ↓ | ADI mPA ↑ | ADI mIoU ↑ | ADI $\mathrm{EPE}_{\mathrm{avg}}$ ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DINO-WM (Zhou et al., [2024](https://arxiv.org/html/2606.26713#bib.bib9)) | 94.83/90.57 | 92.88/78.05 | 18.33/14.81 | 96.32/93.97 | 94.72/78.44 | 15.03/2.31 | 74.69/73.42 | 71.83/64.95 | 19.42/4.13 |
| DINO-WM + $\mathcal{S}^{(s)}$ | **95.91/93.28** | **95.62/86.77** | **14.22/14.73** | **96.96/94.77** | **95.08/80.07** | **13.68/2.00** | **76.64/75.87** | **74.59/68.74** | **17.96/3.99** |
| DriveDreamer-2 (Zhao et al., [2025](https://arxiv.org/html/2606.26713#bib.bib20)) | 95.04/91.17 | 86.51/63.00 | 4.53/5.72 | 93.75/94.91 | 79.21/80.30 | 6.97/4.92 | 75.73/70.04 | 56.78/50.49 | 5.06/5.12 |
| DriveDreamer-2 + $\mathcal{S}^{(s)}$ | **96.39/93.38** | **88.73/67.31** | **4.05/5.29** | **94.46/95.75** | **82.97/82.43** | **4.99/4.50** | **82.61/78.95** | **69.85/57.42** | **4.67/4.10** |
| NWM (Bar et al., [2025](https://arxiv.org/html/2606.26713#bib.bib19)) | 92.77/86.21 | 87.62/78.09 | 23.90/22.73 | 92.58/89.61 | 86.89/82.53 | 24.16/2.94 | 73.27/72.43 | 69.72/66.57 | 25.29/4.39 |
| NWM + $\mathcal{S}^{(s)}$ | **94.57/89.95** | **92.67/87.02** | **21.98/20.36** | **94.90/93.67** | **90.03/88.74** | **20.79/2.69** | **80.06/78.97** | **75.81/72.66** | **19.53/3.87** |

## Appendix E Visual Comparisons of Forward Evolution

As shown in [Figure 7](https://arxiv.org/html/2606.26713#A5.F7), we compare different methods on an OOD sample at the 55 nm node. Although most baselines roughly preserve the input layout, their errors accumulate across stages: LithoNet (Shao et al., [2020](https://arxiv.org/html/2606.26713#bib.bib4)) and LMLitho (Wang et al., [2025](https://arxiv.org/html/2606.26713#bib.bib18)) introduce severe artifacts in Mask and ADI, Unitho (Jin et al., [2025a](https://arxiv.org/html/2606.26713#bib.bib17)) produces distorted intermediate contours, and general WMs suffer from broken or unstable ADI morphologies. In contrast, LithoDreamer better maintains mask topology, resist-region continuity, and ADI contour fidelity, showing more coherent cross-stage evolution.

[Figure 8](https://arxiv.org/html/2606.26713#A5.F8) further evaluates LithoDreamer on curved and irregular OOD layouts. The results show that LithoDreamer can preserve the main geometric correspondence from Layout to Mask and Resist Image, while capturing process-induced contour rounding in ADI. [Figure 9](https://arxiv.org/html/2606.26713#A5.F9) presents contact-like patterns with isolated features, showing that LithoDreamer can reproduce local shape deformation and ADI boundary evolution under unseen process conditions. Nevertheless, slight local contour deviations remain in some ADI predictions, suggesting that fine-scale boundary refinement under highly complex layouts can be further improved.

![Refer to caption](https://arxiv.org/html/2606.26713v1/x7.png)

Figure 7: Qualitative comparison of forward evolution results on OOD samples at the 55 nm process node.

![Refer to caption](https://arxiv.org/html/2606.26713v1/x8.png)

Figure 8: Forward evolution results on curved and irregular OOD layouts.

![Refer to caption](https://arxiv.org/html/2606.26713v1/x9.png)

Figure 9: Forward evolution results on isolated contact-like OOD layouts.