NoiseRater: Meta-Learned Noise Valuation for Diffusion Model Training
Summary
This paper introduces NoiseRater, a meta-learning framework that assigns importance scores to individual noise samples during diffusion model training to improve efficiency and generation quality.
# NoiseRater: Meta-Learned Noise Valuation for Diffusion Model Training
Source: [https://arxiv.org/html/2605.08144](https://arxiv.org/html/2605.08144)
Fang Wu (Stanford University), Haokai Zhao (UNSW), Da Xing (UCL), Tinson XU (The University of Chicago), Hanqun Cao (CUHK), Yanchao Li (Nanjing University), Zeqi Zhou (Brown University), Xiangru Tang (Yale University), Hanchen Wang (Stanford University), Hongbin Lin (CUHK), Zehong Wang (University of Notre Dame), Kuan Pang (Stanford University), Xia Peng (UNC–Chapel Hill), Yinxi Li (University of Waterloo), Aaron Tu (UCB), Molei Tao (Georgia Technology), Li Erran Li (Amazon), Aditya Joshi (UNSW), Jure Leskovec (Stanford University), Yejin Choi (Stanford University)
###### Abstract
Diffusion models have achieved remarkable success across a wide range of generative tasks, yet their training paradigm largely treats injected noise as uniformly informative\. In this work, we challenge this assumption and introduce NoiseRater, a meta\-learning framework for instance\-level noise valuation in diffusion model training\. We propose a parametric noise rater that assigns importance scores to individual noise realizations conditioned on data and timestep, enabling adaptive reweighting of the training objective\. The rater is trained via bilevel optimization to improve downstream validation performance after inner\-loop diffusion updates\. To enable efficient deployment, we further design a decoupled two\-stage pipeline that transitions from soft weighting during meta\-training to hard noise selection during standard training\. Extensive experiments on FFHQ and ImageNet demonstrate that not all noise samples contribute equally, and that prioritizing informative noise improves both training efficiency and generation quality\. Our results establish noise valuation as a complementary and previously underexplored axis for improving diffusion model training\. Our code is available at:[https://anonymous\.4open\.science/r/NoiseRater\-DEB116](https://anonymous.4open.science/r/NoiseRater-DEB116)\.
## 1 Introduction
Diffusion models\[[42](https://arxiv.org/html/2605.08144#bib.bib28),[44](https://arxiv.org/html/2605.08144#bib.bib33),[17](https://arxiv.org/html/2605.08144#bib.bib26)\]have emerged as a dominant paradigm for generative modeling, achieving state\-of\-the\-art performance across image\[[34](https://arxiv.org/html/2605.08144#bib.bib25),[39](https://arxiv.org/html/2605.08144#bib.bib20)\], video\[[32](https://arxiv.org/html/2605.08144#bib.bib24)\], biology\[[51](https://arxiv.org/html/2605.08144#bib.bib19)\], and multimodal generation tasks\[[38](https://arxiv.org/html/2605.08144#bib.bib223),[52](https://arxiv.org/html/2605.08144#bib.bib27)\]\. A key factor behind their success is the iterative denoising process, which transforms random noise into structured data through a sequence of refinement steps\.
Recently, there has been growing interest in *test\-time compute optimization* for diffusion models \[[15](https://arxiv.org/html/2605.08144#bib.bib2),[37](https://arxiv.org/html/2605.08144#bib.bib15),[24](https://arxiv.org/html/2605.08144#bib.bib14),[53](https://arxiv.org/html/2605.08144#bib.bib13),[29](https://arxiv.org/html/2605.08144#bib.bib18),[45](https://arxiv.org/html/2605.08144#bib.bib12)\]\. In particular, noise has increasingly become a central object of control at inference time \[[11](https://arxiv.org/html/2605.08144#bib.bib1),[26](https://arxiv.org/html/2605.08144#bib.bib34)\]\. Techniques such as guidance scaling \[[2](https://arxiv.org/html/2605.08144#bib.bib9)\], adaptive step\-size schedules \[[8](https://arxiv.org/html/2605.08144#bib.bib11)\], and noise resampling strategies \[[1](https://arxiv.org/html/2605.08144#bib.bib10),[41](https://arxiv.org/html/2605.08144#bib.bib35)\] explicitly manipulate the noise trajectory to trade off between generation quality, diversity, and computational cost\. These methods demonstrate that the choice of noise \(its magnitude, structure, and evolution\) plays a critical role in shaping the final output\.
Existing approaches primarily treat noise as a *test\-time control variable*\. During training, however, noise is typically sampled from a fixed Gaussian distribution and incorporated into the objective in a largely uniform, sample\-agnostic manner\. While prior work has shown that different noise levels can contribute unequally to learning \[[49](https://arxiv.org/html/2605.08144#bib.bib224),[33](https://arxiv.org/html/2605.08144#bib.bib8),[35](https://arxiv.org/html/2605.08144#bib.bib31),[46](https://arxiv.org/html/2605.08144#bib.bib32),[13](https://arxiv.org/html/2605.08144#bib.bib30)\], most existing methods focus on timestep\-level reweighting or schedule design, leaving the role of *instance\-level* noise variation underexplored\.
This raises a fundamental question: *are all noise realizations equally useful for learning diffusion models?* More specifically, even at the same timestep, different noise instances may carry varying levels of learning signal\. Some may provide clear supervision for denoising, while others may be ambiguous, redundant, or less informative for optimization\. If this is the case, then uniformly treating noise during training may lead to suboptimal learning dynamics\.
In this work, we shift the focus from test\-time noise manipulation to *training\-time noise valuation*\. We propose to explicitly model the importance of individual noise samples and use this information to guide the training process\. Concretely, we introduce a *meta\-learned noise rater*, a parametric function that assigns a score to each noise instance conditioned on the data sample and timestep\. These scores are used to reweight the diffusion loss, allowing the model to focus on more informative noise while downweighting less useful ones\. To learn the noise rater, we formulate the problem as a bilevel optimization framework \[[30](https://arxiv.org/html/2605.08144#bib.bib16),[9](https://arxiv.org/html/2605.08144#bib.bib17)\]\. The diffusion model is trained using weighted noise samples in the inner loop, while the rater is optimized in the outer loop to minimize validation loss\. This meta\-learning setup enables the rater to capture the contribution of noise samples to generalization performance directly, rather than relying on heuristic criteria\.
Our approach is simple, flexible, and compatible with modern diffusion frameworks\[[43](https://arxiv.org/html/2605.08144#bib.bib36)\]\. By operating purely at the level of the training objective, it does not require modifying the model architecture or inference procedure\. The method overview is in Fig\.[1](https://arxiv.org/html/2605.08144#S1.F1)\.
Figure 1: Illustration of our meta\-learned noise valuation approach for diffusion training\. We first train a noise rater using bilevel optimization, and then guide diffusion training with the rater’s scores\.

We summarize our contributions as follows:
- We identify and formalize the problem of *training\-time noise valuation* in diffusion, an underexplored dimension complementary to timestep\-level noise design and test\-time control\.
- We propose a meta\-learned noise rater that adaptively weights noise samples during training via a bilevel optimization framework\.
- We introduce a decoupled two\-stage training pipeline that separates noise evaluation from model training, enabling efficient deployment of meta\-learned noise policies\.
- We demonstrate that not all noise realizations are equally useful, and that selectively emphasizing informative noise improves learning efficiency and model performance\.
## 2 Related Work
#### Noise Optimization at Inference Time\.
Recent work demonstrates that optimizing initial noise at inference time can significantly improve generation quality\. One line of research directly optimizes the noise latent through backpropagation across denoising steps\. These methods iteratively update the noise to maximize human\-preference rewards\[[48](https://arxiv.org/html/2605.08144#bib.bib209)\], satisfy specific motion constraints\[[22](https://arxiv.org/html/2605.08144#bib.bib210)\], or align with internal attention scores\[[12](https://arxiv.org/html/2605.08144#bib.bib211)\]\. While effective and training\-free, they incur massive computational overhead at inference due to repeated denoising loops\. To amortize this cost, other approaches train auxiliary models to directly predict optimal noise\. For instance, FIND\[[5](https://arxiv.org/html/2605.08144#bib.bib213)\]uses reinforcement learning to adjust the mean and variance of the initial Gaussian, while GoldenNoise\[[54](https://arxiv.org/html/2605.08144#bib.bib212)\]trains a network to predict high\-quality noise perturbations\. Crucially, all treat noise as a test\-time control variable\. In contrast, we explore the untapped potential of evaluating and weighting noise during training\.
#### Training\-Time Loss Reweighting and Noise Selection\.
Various loss reweighting and noise scheduling strategies improve diffusion training efficiency\. Most methods operate at the *timestep level*, adjusting the relative importance or sampling frequency of different noise magnitudes\. For loss reweighting, P2 \[[6](https://arxiv.org/html/2605.08144#bib.bib219)\] prioritizes highly corrupted timesteps to learn global concepts, Min\-SNR \[[14](https://arxiv.org/html/2605.08144#bib.bib214)\] balances inter\-timestep optimization conflicts, and Sun and Shi \[[46](https://arxiv.org/html/2605.08144#bib.bib32)\] dynamically adjust weights based on the variance of loss distributions across log\-SNR levels\. Concurrently, noise scheduling approaches optimize the distribution of sampled timesteps: Hang et al\. \[[13](https://arxiv.org/html/2605.08144#bib.bib30)\] introduce a fixed schedule focusing on the $\log \text{SNR} \approx 0$ region; Kim et al\. \[[25](https://arxiv.org/html/2605.08144#bib.bib215)\] use curriculum learning to progressively introduce harder timesteps; and Raya et al\. \[[35](https://arxiv.org/html/2605.08144#bib.bib31)\] propose a dynamic, information\-guided allocation based on entropy reduction\. Wang et al\. \[[49](https://arxiv.org/html/2605.08144#bib.bib224)\] theoretically analyze different total weighting, time, and noise schedules\. In addition, EDM \[[20](https://arxiv.org/html/2605.08144#bib.bib3)\] provides a unifying perspective on diffusion design, showing that choices of noise parameterization, preconditioning, and loss scaling implicitly define effective weighting across noise levels\. However, these methods treat all noise instances at a given timestep as equally informative\. Recent work has also begun to question more fundamental assumptions about noise in diffusion models\.
For example, Sun et al\. \[[47](https://arxiv.org/html/2605.08144#bib.bib5)\] show that explicit noise conditioning may not be strictly necessary, with models often maintaining competitive performance even without access to timestep information\. While such work investigates whether noise information is required at all, it does not address how the variability of noise *instances* influences learning\. In contrast, we assume standard conditioning and focus on a complementary question: *given noise, which specific realizations are most useful for training?* The closest attempt at instance\-level noise control is Immiscible Diffusion \[[27](https://arxiv.org/html/2605.08144#bib.bib217),[28](https://arxiv.org/html/2605.08144#bib.bib216)\], which uses optimal transport to assign specific noise vectors to data samples, minimizing their pre\-diffusion Euclidean distance\. While this accelerates training, the assignment relies on a static metric that ignores the model’s actual learning dynamics\. *Our approach moves beyond static assignments by introducing a dynamic, instance\-level noise valuation learned directly from the model’s generalization performance\.*
#### Meta\-Learning for Training Optimization\.
Meta\-learning techniques have been explored for adaptive data weighting\[[3](https://arxiv.org/html/2605.08144#bib.bib4)\]\. Early methods\[[36](https://arxiv.org/html/2605.08144#bib.bib220),[19](https://arxiv.org/html/2605.08144#bib.bib6)\]used online meta\-gradients to dynamically weight training examples\. To improve scalability and generalization, subsequent work parameterized these weighting mechanisms using neural networks trained via bilevel optimization, evolving from dynamic loss\-to\-weight mappings\[[40](https://arxiv.org/html/2605.08144#bib.bib221),[50](https://arxiv.org/html/2605.08144#bib.bib7)\]to large\-scale dataset curation models\[[4](https://arxiv.org/html/2605.08144#bib.bib196)\]\. We introduce this bilevel optimization paradigm to diffusion models for the first time\. Crucially, we shift the fundamental object of valuation: instead of rating the quality of training*data*, our meta\-network is designed to rate the learning utility of the injected*noise*instances\.
## 3 Preliminaries: Diffusion Models
Diffusion models define a generative process by learning to reverse a gradual corruption of data into noise\. In the standard formulation, a forward noising process incrementally perturbs a clean data sample $\mathbf{x}_0 \sim q(\mathbf{x}_0)$ into a sequence of increasingly noisy latent variables $\{\mathbf{x}_t\}_{t=1}^{T}$\.
#### Forward process\.
The forward diffusion process is a Markov chain that adds Gaussian noise at each timestep, $q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\big(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I}\big)$, $t = 1, \dots, T$, where $\{\beta_t\}_{t=1}^{T}$ is a predefined variance schedule controlling the noise magnitude at each step\. Intuitively, this process gradually destroys the structure in $\mathbf{x}_0$, eventually producing a nearly isotropic Gaussian distribution\.
A key property of this construction is that the marginal distribution at any timestep $t$ admits a closed\-form expression $q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I}\big)$, where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$\. This expression allows us to directly sample $\mathbf{x}_t$ from $\mathbf{x}_0$ without simulating the full Markov chain, which is crucial for efficient training\.
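As a rough illustration of the closed-form marginal, the sketch below draws $\mathbf{x}_t$ directly from $\mathbf{x}_0$ in NumPy. The linear $\beta$ schedule, its endpoints, the toy array shapes, and the helper names `make_alpha_bar` and `q_sample` are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products alpha_bar_t for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot; t is a 0-based index into alpha_bar."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((4, 8))  # toy "clean" batch, not real image latents
x_t, eps = q_sample(x0, t=500, alpha_bar=alpha_bar, rng=rng)
```

Because the same `eps` is returned alongside `x_t`, it can serve as the regression target for a noise-prediction network.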
#### Reverse process and training objective\.
The generative model is trained to invert this noising process\. Instead of directly parameterizing the reverse transition $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$, it is common to train a neural network $\epsilon_\theta(\mathbf{x}_t, t)$ to predict the noise that was added to produce $\mathbf{x}_t$\. Specifically, given a noisy sample $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, $\epsilon \sim \mathcal{N}(0, I)$, the model is trained using a simple mean squared error \(MSE\) objective:

$$\mathcal{L}_{\text{diff}}(\theta) = \mathbb{E}_{t, \mathbf{x}_0, \epsilon}\left[\left\|\epsilon - \epsilon_\theta(\mathbf{x}_t, t)\right\|^2\right]. \tag{1}$$

Training proceeds by sampling triples $(\mathbf{x}_0, t, \epsilon)$, constructing $\mathbf{x}_t$ using the closed\-form expression above, and minimizing the discrepancy between the true noise $\epsilon$ and the predicted noise $\epsilon_\theta(\mathbf{x}_t, t)$\.
#### Conditional training and classifier\-free guidance\.
In conditional diffusion models, the denoising network additionally takes as input a condition $c \sim p(c)$ \(e\.g\., text prompt\), and is trained to predict noise as $\epsilon_\theta(\mathbf{x}_t, t, c)$\. Following classifier\-free guidance \(CFG\) \[[18](https://arxiv.org/html/2605.08144#bib.bib29)\], we randomly drop the condition during training with probability $p_{\text{drop}}$, replacing $c$ with a null condition $\varnothing$\. This yields a unified model that supports both conditional and unconditional predictions as $\mathcal{L}_{\text{diff}}(\theta) = \mathbb{E}_{\mathbf{x}_0, t, \epsilon, c}\left[\|\epsilon - \epsilon_\theta(\mathbf{x}_t, t, \tilde{c})\|^2\right]$, where $\tilde{c} \in \{c, \varnothing\}$ denotes the possibly dropped condition\.
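The condition-dropping step can be sketched as follows. This is a minimal illustration that assumes integer class labels; the sentinel `NULL_CLASS`, the helper `drop_condition`, and the value `p_drop=0.1` are hypothetical choices, since the text does not state how the null condition is encoded or which drop probability is used.

```python
import numpy as np

NULL_CLASS = -1  # hypothetical id standing in for the null condition

def drop_condition(labels, p_drop, rng):
    """Replace each label with NULL_CLASS independently with probability p_drop."""
    mask = rng.random(labels.shape) < p_drop
    return np.where(mask, NULL_CLASS, labels), mask

rng = np.random.default_rng(0)
labels = rng.integers(0, 1000, size=10_000)  # toy class labels
c_tilde, dropped = drop_condition(labels, p_drop=0.1, rng=rng)  # p_drop is assumed
```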
#### From DDPM to DDIM\.
While the above formulation follows the standard DDPM\[[17](https://arxiv.org/html/2605.08144#bib.bib26)\], this work implements the DDIM framework\[[43](https://arxiv.org/html/2605.08144#bib.bib36)\]\. DDIM provides a deterministic \(or partially stochastic\) non\-Markovian sampling procedure that shares the same training objective but enables faster and more flexible generation\. Importantly, our modifications operate entirely at the level of the training loss and are therefore directly compatible with DDIM without altering its sampling dynamics\.
## 4 Method: Meta\-Learned Noise Rater
### 4\.1 Noise Weighting
A central observation in standard diffusion training is that all noise samples are treated equally when optimizing Eq\. [1](https://arxiv.org/html/2605.08144#S3.E1)\. However, not all noise realizations contribute equally to learning \[[13](https://arxiv.org/html/2605.08144#bib.bib30),[46](https://arxiv.org/html/2605.08144#bib.bib32),[35](https://arxiv.org/html/2605.08144#bib.bib31)\]: *some may be more informative, while others may be redundant or even detrimental*\. To address this, we introduce a parametric *noise rater* $\phi_\eta(\epsilon, t, \mathbf{x}_0, \tilde{c}): \mathcal{E} \times [0,1] \times \mathcal{X} \times \mathcal{C} \rightarrow \mathbb{R}^{+}$, which assigns a scalar score to each noise instance conditioned on both the input image $\mathbf{x}_0$ and the \(possibly dropped\) condition $\tilde{c}$\. This score reflects the relative importance of the corresponding training signal \[[40](https://arxiv.org/html/2605.08144#bib.bib221)\]\.
#### Grouped Noise Sampling\.
To ensure that the noise rater focuses on differences between noise realizations rather than image content, we construct each minibatch $B$ as groups of samples that share the same underlying clean image\. Specifically, for each $\mathbf{x}_0$, $\tilde{c}$ and $t$, we draw $K$ independent noises $\{\epsilon^{(k)}\}_{k=1}^{K}$ \(e\.g\., $K=8$\), producing a set of noisy inputs $\{\mathbf{x}_t^{(k)}\}_{k=1}^{K}$\.
We then apply a *group\-wise normalization* to these $K$ instances associated with the same $\mathbf{x}_0$\. The rater assigns scores $\{\phi_\eta(\epsilon^{(k)}, t, \mathbf{x}_0, \tilde{c})\}_{k=1}^{K}$, which are normalized via a softmax within the group:

$$w^{(k)}(\mathbf{x}_0) = \frac{\exp\big(\phi_\eta(\epsilon^{(k)}, t, \mathbf{x}_0, \tilde{c})\big)}{\sum_{k'=1}^{K} \exp\big(\phi_\eta(\epsilon^{(k')}, t, \mathbf{x}_0, \tilde{c})\big)}. \tag{2}$$

This normalization enforces that weights are assigned based on the relative importance of noise realizations under a fixed condition $(\mathbf{x}_0, \tilde{c}, t)$, preventing the rater from exploiting variations in image content or conditioning across the batch $B$\.
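A minimal NumPy sketch of the group-wise softmax is given below. It assumes the rater scores have already been arranged as a `(num_groups, K)` array; the function name `group_softmax` and the example scores are illustrative only.

```python
import numpy as np

def group_softmax(scores):
    """Softmax over the last axis; scores has shape (num_groups, K)."""
    shifted = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

scores = np.array([[0.0, 1.0, 2.0, 3.0],
                   [5.0, 5.0, 5.0, 5.0]])  # made-up rater scores, K = 4
weights = group_softmax(scores)
```

Adding a constant to every score in a group leaves the weights unchanged, which matches the point that only the relative ranking of noise realizations within a group matters.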
#### Rater\-guided Training Objective\.
The standard diffusion loss is modified into a group\-wise weighted objective:
$$\mathcal{L}_{\text{inner}}(\theta; \eta) = \mathbb{E}_{\mathbf{x}_0 \sim B}\left[\sum_{k=1}^{K} w^{(k)}(\mathbf{x}_0) \cdot \big\|\epsilon^{(k)} - \epsilon_\theta(\mathbf{x}_t^{(k)}, t, \tilde{c})\big\|^2\right], \tag{3}$$

where $\{(\epsilon^{(k)}, \mathbf{x}_t^{(k)})\}_{k=1}^{K}$ are independently sampled noise instances associated with $(\mathbf{x}_0, \tilde{c}, t)$, and $w^{(k)}(\mathbf{x}_0)$ are the corresponding group\-normalized weights from Eq\. \(2\), whose dependence on $\tilde{c}$ and $t$ is left implicit\. This formulation reweights training signals according to the relative importance of noise realizations conditioned on $(\mathbf{x}_0, \tilde{c}, t)$, emphasizing informative noise while avoiding spurious correlations with image content or conditioning\.
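The group-wise weighted objective can be sketched as below. This is an illustration under simplifying assumptions: noise is flattened to vectors of length `D`, `eps_pred` stands in for the output of the denoiser, and the name `inner_loss` is hypothetical.

```python
import numpy as np

def inner_loss(eps, eps_pred, weights):
    """Group-weighted squared error.

    eps, eps_pred: (num_groups, K, D) true and predicted noise.
    weights:       (num_groups, K), each row summing to 1.
    """
    per_instance = np.sum((eps - eps_pred) ** 2, axis=-1)  # (num_groups, K)
    per_group = np.sum(weights * per_instance, axis=-1)    # (num_groups,)
    return per_group.mean()

rng = np.random.default_rng(0)
eps = rng.standard_normal((3, 8, 16))
eps_pred = rng.standard_normal((3, 8, 16))  # placeholder for denoiser outputs
uniform = np.full((3, 8), 1.0 / 8)
loss_uniform = inner_loss(eps, eps_pred, uniform)
```

With uniform weights $1/K$ the expression reduces to the plain average of the per-instance losses, so the unweighted objective is recovered as a special case.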
### 4\.2 Bilevel Meta\-Objective
We frame the learning of the noise rater as a bilevel optimization problem \[[4](https://arxiv.org/html/2605.08144#bib.bib196)\], where the diffusion model $\epsilon_\theta$ and the rater $\phi_\eta$ are optimized at different levels\.
#### Inner optimization \(model training\)\.
For a fixed noise rater $\phi_\eta$, the diffusion model parameters $\theta$ are updated by performing one or more gradient steps on the weighted loss:

$$\theta^{(s+1)} \leftarrow \theta^{(s)} - \alpha \nabla_\theta \mathcal{L}_{\text{inner}}(\theta^{(s)}; \eta), \quad s = 0, \dots, S-1, \tag{4}$$

where $S \geq 1$ denotes the number of inner\-loop updates\. We denote the resulting parameters after $S$ steps as $\theta^{*}(\eta) \equiv \theta^{(S)}$\. This corresponds to standard diffusion training, except that each training signal associated with $\mathbf{x}_i$ is reweighted according to the learned importance scores $\{w^{(k)}(\mathbf{x}_i)\}_{k=1}^{K}$\.
#### Outer optimization \(rater training\)\.
The noise rater parameters $\eta$ are updated to improve the downstream performance of the diffusion model after the inner updates:

$$\eta \leftarrow \arg\min_{\eta}\; \mathcal{L}_{\text{val}}(\theta^{*}(\eta)), \tag{5}$$

where $\mathcal{L}_{\text{val}}$ is computed on held\-out data or timesteps\.
Conceptually, the rater $\phi_\eta$ learns to assign higher weights to noise samples that lead to better generalization of the diffusion model $\epsilon_\theta$ after several training updates\. This is closely related to meta\-learning and data valuation approaches \[[40](https://arxiv.org/html/2605.08144#bib.bib221),[4](https://arxiv.org/html/2605.08144#bib.bib196)\], but here the optimization operates over *noise realizations* rather than data points\.
#### Meta\-Gradient Computation\.
Optimizing the outer objective requires computing gradients through the inner optimization process\. Using the chain rule, the meta\-gradient takes the form:
$$\nabla_\eta \mathcal{L}_{\text{val}}(\theta^{*}(\eta)) = \frac{\partial \mathcal{L}_{\text{val}}}{\partial \theta} \cdot \frac{\partial \theta^{*}(\eta)}{\partial \eta}. \tag{6}$$

The term $\frac{\partial \theta^{*}(\eta)}{\partial \eta}$ captures how changes in the noise rater affect the inner diffusion model\. In practice, this is computed by differentiating through a finite number of gradient descent steps in the inner optimization\. Despite additional computational overhead, it enables the rater to directly optimize for downstream performance rather than relying on heuristic weighting schemes\.
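To illustrate differentiating through an inner step, the sketch below works out the meta-gradient by hand for a deliberately simple toy problem. Everything in it is an assumption for illustration: the rater is replaced by a free score vector `eta`, the per-instance diffusion loss by $\tfrac{1}{2}\|\theta - a_k\|^2$ with fixed targets $a_k$, the validation loss by $\tfrac{1}{2}\|\theta - v\|^2$, and only a single inner step ($S=1$) is unrolled. A real implementation would rely on automatic differentiation rather than this closed form.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def val_after_step(eta, theta, targets, v, lr):
    """Validation loss after one weighted inner step on toy quadratic losses."""
    w = softmax(eta)
    grads = theta - targets            # (K, D): grad of 0.5*||theta - a_k||^2
    theta_new = theta - lr * (w @ grads)
    return 0.5 * np.sum((theta_new - v) ** 2), theta_new, w, grads

def meta_gradient(eta, theta, targets, v, lr):
    """Chain rule through the single inner step (analytic, toy problem only)."""
    _, theta_new, w, grads = val_after_step(eta, theta, targets, v, lr)
    dval = theta_new - v               # dL_val / d theta_new
    mean_grad = w @ grads
    # d theta_new / d eta_j = -lr * w_j * (g_j - sum_k w_k g_k)
    return -lr * w * ((grads - mean_grad) @ dval)

rng = np.random.default_rng(0)
K, D = 4, 3
eta, theta = rng.standard_normal(K), rng.standard_normal(D)
targets, v = rng.standard_normal((K, D)), rng.standard_normal(D)
g = meta_gradient(eta, theta, targets, v, lr=0.1)
```

The entries of `g` sum to zero, reflecting that the softmax weights depend only on score differences within the group.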
### 4\.3 Post Meta\-Training: Noise Selection for Efficient Training
After the meta\-learning stage, we obtain a trained noise rater $\phi_\eta$ that captures the relative utility of noise instances for improving generalization\. We then transition to a standard diffusion training phase, where the rater is *fixed* and no longer updated\.
#### Top\-1 noise selection\.
In the post meta\-training stage, for each data sample $(\mathbf{x}_0, \tilde{c}, t)$, we sample a group of $K'$ candidate noise instances $\{\epsilon^{(k)}\}_{k=1}^{K'}$ and select the highest\-scoring one:

$$k^{*} = \arg\max_{1 \leq k \leq K'}\; \phi_\eta(\epsilon^{(k)}, t, \mathbf{x}_0, \tilde{c}). \tag{7}$$

The diffusion model is then trained using only the selected noise instance $\epsilon^{(k^{*})}$ for Eq\. [1](https://arxiv.org/html/2605.08144#S3.E1)\. This decoupled design separates *learning to evaluate noise* from *using noise for training*\. The meta\-learning stage leverages a differentiable soft weighting scheme to stably learn $\phi_\eta$, while the post meta\-training stage adopts a simple and efficient hard selection strategy\. This avoids the computational overhead of maintaining multiple noise instances per sample during training, while preserving the benefits of instance\-level noise valuation\.
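Top-1 selection with a frozen rater might look like the sketch below. The scoring function `toy_score` is a made-up stand-in for the trained rater (and ignores the condition $\tilde{c}$ for brevity), and `select_top1_noise` is a hypothetical helper name; only the argmax-over-candidates logic mirrors the text.

```python
import numpy as np

def select_top1_noise(x0, t, score_fn, num_candidates, rng):
    """Draw candidate noises for one sample and keep the highest-scoring one."""
    candidates = rng.standard_normal((num_candidates,) + x0.shape)
    scores = np.array([score_fn(eps, t, x0) for eps in candidates])
    best = int(np.argmax(scores))
    return candidates[best], best, scores

def toy_score(eps, t, x0):
    # Stand-in for a frozen rater; this particular rule is arbitrary.
    return -float(np.sum(eps * x0))

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)  # toy flattened latent
eps_star, k_star, scores = select_top1_noise(
    x0, t=0.5, score_fn=toy_score, num_candidates=4, rng=rng
)
```

Only `eps_star` is passed on to the denoiser, so the extra cost per training step is the `num_candidates` rater evaluations rather than additional denoiser forward passes.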
Algorithm 1: Meta\-Learning a Noise Rater for Diffusion Models

1. Inputs: Training dataset $\mathcal{D}_{\text{train}}$ and validation dataset $\mathcal{D}_{\text{val}}$; initialize noise rater parameters $\eta$ \(meta\-optimizer $M$\) and diffusion model parameters $\theta$ \(optimizer $D$\); number of outer steps $N$ and inner steps $S$; batch size $B$; group size $K$\.
2. for outer step $n = 1, \dots, N$ do
3. for inner step $s = 1, \dots, S$ do
4. Sample a grouped batch $\mathcal{B} = \{(\mathbf{x}_{0,j}, \tilde{c}_j, t_j, \{\epsilon_j^{(k)}\}_{k=1}^{K})\}_{j=1}^{B}$ from $\mathcal{D}_{\text{train}}$
5. For group $j$, evaluate the noise rater scores $s_j^{(k)} = \phi_\eta(\epsilon_j^{(k)}, t_j, \mathbf{x}_{0,j}, \tilde{c}_j)$ for $k = 1, \dots, K$
6. Convert scores into normalized weights $w_j^{(k)} = \exp(s_j^{(k)}) / \sum_{i=1}^{K} \exp(s_j^{(i)})$
7. Construct noisy inputs $\mathbf{x}_{t,j}^{(k)}$, and compute the weighted loss
8. $\mathcal{L}_{\mathcal{B}}(\theta) = \mathbb{E}_j\big[\sum_{k=1}^{K} w_j^{(k)} \|\epsilon_j^{(k)} - \epsilon_\theta(\mathbf{x}_{t,j}^{(k)}, t_j, \tilde{c}_j)\|^2\big]$
9. Update model parameters $\theta_s \leftarrow \theta_{s-1} - \alpha \nabla_\theta \mathcal{L}_{\mathcal{B}}(\theta)$ using optimizer $D$
10. end for
11. Sample a validation batch $\mathcal{B}_{\text{val}} \sim \mathcal{D}_{\text{val}}$
12. Compute validation loss $\mathcal{L}_{\text{val}}(\theta_S)$ and update $\eta \leftarrow \eta - \beta \nabla_\eta \mathcal{L}_{\text{val}}(\theta_S)$ using meta\-optimizer $M$
13. end for
14. return $\theta$
###### Theorem 4\.1 \(Meta\-gradient for noise\-weighted diffusion training\)
Let $\mathcal{L}_{\mathrm{inner}}(\theta; \eta)$ be the inner objective and $\ell^{(k)}(\theta) = \|\epsilon^{(k)} - \epsilon_\theta(\mathbf{x}_t^{(k)}, t, \tilde{c})\|^2$ be the diffusion loss for the $k$\-th noise instance\. Assume that $\ell^{(k)}(\theta)$ is twice continuously differentiable in $\theta$, and that $\mathcal{L}_{\mathrm{inner}}(\theta; \eta)$ is $\mu$\-strongly convex in $\theta$ for some $\mu > 0$\. Then the optimal inner solution $\theta^{*}(\eta)$ is locally unique and differentiable, and the outer objective $J(\eta) = \mathcal{L}_{\mathrm{val}}(\theta^{*}(\eta))$ has gradient

$$\nabla_\eta J(\eta) = -\Bigl(\nabla_\eta \nabla_\theta \mathcal{L}_{\mathrm{inner}}(\theta^{*}(\eta); \eta)\Bigr)^{\top} \Bigl(\nabla_\theta^{2} \mathcal{L}_{\mathrm{inner}}(\theta^{*}(\eta); \eta)\Bigr)^{-1} \nabla_\theta \mathcal{L}_{\mathrm{val}}(\theta^{*}(\eta)). \tag{8}$$
#### Interpretation\.
The meta\-gradient in Thm\. [4\.1](https://arxiv.org/html/2605.08144#S4.Thmtheorem1) reveals that the noise rater is updated to increase the weights of noise instances whose induced training gradients most effectively reduce the validation loss after the inner optimization\. In particular, each noise instance contributes through its gradient $\nabla_\theta \ell^{(k)}(\theta)$, and the meta\-update favors those whose influence, modulated by the curvature of the training objective, aligns with improving validation performance\.
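A small worked instance may help make the implicit meta-gradient concrete. The setting below is an assumption chosen so that the theorem's conditions hold trivially and is not taken from the paper: a single group, free scores $\eta \in \mathbb{R}^{K}$ with $w = \mathrm{softmax}(\eta)$, and the diffusion loss replaced by $\ell^{(k)}(\theta) = \tfrac{1}{2}\|\theta - a_k\|^2$ for fixed vectors $a_k$. Since the weights sum to one, the inner objective is $1$-strongly convex.

```latex
\begin{aligned}
\nabla_\theta \mathcal{L}_{\mathrm{inner}}(\theta;\eta)
  &= \theta - \sum_{k=1}^{K} w^{(k)} a_k,
  \qquad
  \nabla_\theta^{2} \mathcal{L}_{\mathrm{inner}} = I,
  \qquad
  \theta^{*}(\eta) = \sum_{k=1}^{K} w^{(k)} a_k, \\
\frac{\partial}{\partial \eta_j} \nabla_\theta \mathcal{L}_{\mathrm{inner}}
  &= -\sum_{k=1}^{K} \frac{\partial w^{(k)}}{\partial \eta_j}\, a_k
   = -\sum_{k=1}^{K} w^{(k)}\bigl(\delta_{kj} - w^{(j)}\bigr) a_k
   = -\,w^{(j)}\bigl(a_j - \theta^{*}(\eta)\bigr), \\
\frac{\partial J}{\partial \eta_j}
  &= w^{(j)} \bigl\langle a_j - \theta^{*}(\eta),\;
     \nabla_\theta \mathcal{L}_{\mathrm{val}}(\theta^{*}(\eta)) \bigr\rangle .
\end{aligned}
```

The last line agrees with differentiating $J(\eta) = \mathcal{L}_{\mathrm{val}}\bigl(\sum_k w^{(k)} a_k\bigr)$ directly. In this toy case, a gradient-descent step on $\eta$ raises the score of instance $j$ exactly when pulling $\theta^{*}$ toward $a_j$ lowers the validation loss.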
###### Theorem 4\.2 \(Gradient alignment principle for noise weighting\)
Consider a single inner update with learning rate $\alpha > 0$: $\theta^{+}(\eta) = \theta - \alpha \sum_{k=1}^{K} w^{(k)}(\eta)\, \nabla_\theta \ell^{(k)}(\theta)$, where $\ell^{(k)}(\theta)$ denotes the diffusion loss for the $k$\-th noise instance\.
Assume that the validation loss $\mathcal{L}_{\mathrm{val}}(\theta)$ has $L$\-Lipschitz continuous gradients\. Then

$$\mathcal{L}_{\mathrm{val}}(\theta^{+}(\eta)) \leq \mathcal{L}_{\mathrm{val}}(\theta) - \alpha \sum_{k=1}^{K} w^{(k)}(\eta) \left\langle \nabla_\theta \mathcal{L}_{\mathrm{val}}(\theta), \nabla_\theta \ell^{(k)}(\theta) \right\rangle + \mathcal{O}(\alpha^{2}). \tag{9}$$

Consequently, up to second\-order terms, minimizing the post\-update validation loss encourages assigning larger weights to noise instances $\epsilon^{(k)}$ whose induced training gradients are better aligned with the validation gradient\.
#### Interpretation\.
Thm\. [4\.2](https://arxiv.org/html/2605.08144#S4.Thmtheorem2) reveals that noise weighting follows a *gradient alignment principle*\. Each noise instance $\epsilon^{(k)}$ contributes in proportion to $\left\langle \nabla_\theta \mathcal{L}_{\mathrm{val}}(\theta), \nabla_\theta \ell^{(k)}(\theta) \right\rangle$, which measures how well its induced training gradient aligns with the validation objective\. Minimizing the post\-update validation loss, therefore, encourages assigning larger weights to noise instances whose gradients are better aligned with improving generalization, while downweighting conflicting or uninformative ones\. In effect, the noise rater learns a directionally optimized gradient estimator that amplifies useful learning signals and suppresses detrimental noise\.
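The alignment bound can be checked numerically on a toy problem. The sketch below assumes quadratic stand-ins for both losses, $\ell^{(k)}(\theta) = \tfrac{1}{2}\|\theta - a_k\|^2$ and $\mathcal{L}_{\mathrm{val}}(\theta) = \tfrac{1}{2}\|\theta - v\|^2$ (so $L = 1$), and uniform weights; none of these choices come from the paper. For this quadratic case the standard descent-lemma bound holds with equality.

```python
import numpy as np

def val_loss(theta, v):
    """Toy validation loss 0.5*||theta - v||^2; its gradient is 1-Lipschitz."""
    return 0.5 * np.sum((theta - v) ** 2)

rng = np.random.default_rng(0)
K, D, lr = 4, 5, 0.05
theta, v = rng.standard_normal(D), rng.standard_normal(D)
targets = rng.standard_normal((K, D))
w = np.full(K, 1.0 / K)                    # uniform weights as a baseline

val_grad = theta - v                       # gradient of the toy validation loss
instance_grads = theta - targets           # gradients of toy l_k, shape (K, D)
align = instance_grads @ val_grad          # <grad L_val, grad l_k> for each k

step = lr * (w @ instance_grads)           # theta - theta_plus
theta_plus = theta - step
first_order = val_loss(theta, v) - lr * (w @ align)
bound = first_order + 0.5 * np.sum(step ** 2)  # descent lemma with L = 1

# Putting all weight on the best-aligned instance maximizes the first-order decrease.
best_k = int(np.argmax(align))
```

Shifting weight toward instances with larger `align` values makes the first-order term more negative, which is the behavior the theorem attributes to the learned rater.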
## 5 Experiments
### 5\.1 Setups
#### Datasets and tasks\.
We evaluate three tasks on two datasets: \(1\) FFHQ \[[21](https://arxiv.org/html/2605.08144#bib.bib225)\] at $256\times256$ for unconditional generation; \(2\) ImageNet \[[7](https://arxiv.org/html/2605.08144#bib.bib226)\] at $256\times256$ for unconditional generation; and \(3\) ImageNet at the same resolution for class\-conditional generation, where we apply CFG \[[18](https://arxiv.org/html/2605.08144#bib.bib29)\] with scale $1.25$ at sampling time\.
#### Backbone\.
We use the Diffusion Transformer (DiT) [[31](https://arxiv.org/html/2605.08144#bib.bib218)] as our backbone, operating in the latent space of a pre-trained VAE following Rombach et al. [[39](https://arxiv.org/html/2605.08144#bib.bib20)]. Most experiments use DiT-S/2; the model size generalization study additionally uses DiT-B/2 and DiT-L/2.
#### Noise rater architecture\.
Following Stable Diffusion 3 [[10](https://arxiv.org/html/2605.08144#bib.bib222)], we adapt DiT [[31](https://arxiv.org/html/2605.08144#bib.bib218)] to incorporate joint attention between image latent representations and noise. The image latents and noise are patchified independently and processed as two distinct streams through a stack of DiT blocks. For modulation, the image category and diffusion timestep embeddings are integrated via Adaptive Layer Normalization (AdaLN). Within each DiT block, the two streams are concatenated for joint attention, while the remaining components strictly adhere to the original DiT-S/2 configuration. The output sequence from the noise stream is aggregated via mean pooling and projected through a linear layer to produce the scalar score. We then apply the *group-wise normalization* described in Section [4.1](https://arxiv.org/html/2605.08144#S4.SS1) to the output score. The architecture details are in Appx. [B.1](https://arxiv.org/html/2605.08144#A2.SS1) and Fig. [4](https://arxiv.org/html/2605.08144#A2.F4).
#### Training protocol\.
We adopt a three-stage protocol: (i) we first train DiT-S/2 with standard i.i.d. Gaussian noise for 400k steps, producing the diffusion checkpoint to which the rater is attached; (ii) we train the noise rater for 2k steps on top of this frozen checkpoint; (iii) we resume diffusion training for an additional 80k steps, using the trained rater to select noise from $k{=}4$ candidates at each step. The rater's outer-loop validation set is a 10% random split of the training set, to avoid dataset leakage. We adapt the Mixflow-MG [[23](https://arxiv.org/html/2605.08144#bib.bib228)] reparameterisation for scalable bilevel optimisation. Full hyperparameters for the diffusion model and the rater are provided in Appx. [B.2](https://arxiv.org/html/2605.08144#A2.SS2).
#### Evaluation\.
FID-50K [[16](https://arxiv.org/html/2605.08144#bib.bib227)] is computed against 50k reference images. The evaluation schedule varies by task: for FFHQ, we train for an additional 40k steps from the 200k checkpoint and report FID every 10k steps; for ImageNet, we train for an additional 80k steps from the 400k checkpoint and report FID every 20k steps. For class-conditional ImageNet, FID is computed on samples generated with CFG.
### 5.2 Noise Rater Improves Diffusion Training
#### Comparison with baselines
We compare our meta-learned noise rater against two training variants: (1) *Vanilla* is standard diffusion training with i.i.d. Gaussian noise; (2) *Naive* selects noise by a simple statistical criterion, for which we experiment with picking the noise of maximum or minimum norm among the $k$ candidates. Tab. [1](https://arxiv.org/html/2605.08144#S5.T1) shows that naive norm-based selection offers little improvement or even harms diffusion training. Our rater achieves a clear FID reduction, indicating that the gains come from what the rater has learned rather than from noise filtering per se.
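The *Naive* baseline can be sketched in a few lines. This is a minimal illustration, assuming flattened noise vectors and a hypothetical helper name `select_by_norm`; the paper's actual implementation may differ, and the batch size, $k$, and dimension below are arbitrary.

```python
import numpy as np

def select_by_norm(candidates, mode="max"):
    """Pick one noise per sample from k candidates by l2 norm.

    candidates: array of shape (batch, k, dim) holding i.i.d. Gaussian noise.
    Returns an array of shape (batch, dim) with the selected noise per sample.
    """
    norms = np.linalg.norm(candidates, axis=-1)  # (batch, k)
    if mode == "max":
        idx = norms.argmax(axis=1)
    else:
        idx = norms.argmin(axis=1)
    return candidates[np.arange(candidates.shape[0]), idx]

rng = np.random.default_rng(0)
cands = rng.standard_normal((8, 4, 16))  # batch=8, k=4, dim=16 (illustrative sizes)
picked_max = select_by_norm(cands, "max")
picked_min = select_by_norm(cands, "min")
print(picked_max.shape, picked_min.shape)
```

A learned rater would replace `norms` with its own per-candidate scores; the selection step itself stays the same.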
Table 1: Comparison of noise-selection strategies for DiT-S/2 on FFHQ and ImageNet ($256{\times}256$). FID-50K is reported every 20k steps; lower is better.

Table 2: Ablation of the rater's input conditions. We progressively remove the timestep embedding, image input, and class label. FID-50K is reported every 20k steps. All raters are trained for 2k steps.
#### Rater architecture ablation
We ablate the rater's inputs to validate the contribution of each conditioning signal: the noise, the image $x_0$, the diffusion timestep $t$, and the class label. We consider four stripped-down variants: (1) without the timestep embedding, (2) without the image input, (3) without both the image and class label, and (4) only the noise. As shown in Tab. [2](https://arxiv.org/html/2605.08144#S5.T2), removing the diffusion timestep causes the largest FID degradation, indicating that the optimal noises differ across diffusion steps. Removing the image or the category label also degrades performance, indicating that the optimal noises differ across images. The full rater achieves the best performance, justifying our design.
#### Effect of $k$.
We explore the impact of the number of candidate noises $k \in \{2, 4, 8\}$ from which the rater selects at each training step; $k = 1$ recovers standard diffusion training. A larger $k$ gives the rater a wider selection pool, but (i) increases per-step training cost, since all $k$ candidates must be scored, and (ii) narrows and biases the resulting noise distribution, causing it to deviate from the vanilla Gaussian prior. A small $k$ improves little over the baseline, while a large $k$ over-restricts the training distribution and inflates compute. As shown in Tab. [3](https://arxiv.org/html/2605.08144#S5.T3), $k = 4$ achieves the best quality-compute trade-off, with the largest FID improvement ($61.56 \rightarrow 59.95$, a 2.6% relative reduction) at a moderate $1.6\times$ training cost, whereas $k = 8$ yields a smaller gain despite requiring $2.1\times$ the baseline compute. This aligns with our observation that an overly narrow noise distribution harms training.
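The claim that a larger $k$ narrows and biases the selected-noise distribution can be illustrated with a stand-in score. The sketch below assumes a fixed linear score $\langle v, \epsilon \rangle$ in place of the learned rater (a deliberate simplification, not the paper's model) and measures how far the selected noise drifts from the zero-mean Gaussian prior along $v$ as $k$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, trials = 8, 20000        # illustrative sizes (assumptions)
v = np.zeros(dim)
v[0] = 1.0                    # hypothetical fixed scoring direction

def selected_mean_score(k):
    """Mean score of the top-scoring noise among k i.i.d. Gaussian candidates."""
    cands = rng.standard_normal((trials, k, dim))
    scores = cands @ v        # (trials, k): score of each candidate
    return scores.max(axis=1).mean()

means = {k: selected_mean_score(k) for k in (1, 2, 4, 8)}
for k, m in means.items():
    print(f"k={k}: mean selected score = {m:.3f}")
```

With $k = 1$ the selected noise is an unbiased Gaussian draw; as $k$ increases, the mean selected score moves steadily away from zero, which is the deviation from the prior described above.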
Table 3: Effect of the candidate pool size $k$ on DiT-S/2 with ImageNet $256{\times}256$, resuming from the 400k checkpoint. We also report the diffusion training cost (GPU-h per 20k steps).
#### Generalization to larger backbones\.
A natural concern is whether the rater is tightly coupled to the model it was trained on, in which case a new rater would be required for every backbone, an expensive proposition at scale. We test the opposite hypothesis: that the rater learns properties of the noise itself, not of the specific student, and therefore transfers across model sizes. Concretely, we take the rater trained on DiT-S/2 at the 400k checkpoint and apply it *as-is* (frozen, no retraining) to train DiT-B/2 and DiT-L/2 from their 200k and 100k checkpoints, respectively, for an additional 80k steps with $k{=}2$. (Footnote 1: The choice of 200k for DiT-B/2 and 100k for DiT-L/2 is based on matching the training loss level of the DiT-S/2 400k checkpoint on which the rater was trained. See Appx. [B.4](https://arxiv.org/html/2605.08144#A2.SS4) for details.) As shown in Tab. [4](https://arxiv.org/html/2605.08144#S5.T4), the small-model rater improves or matches FID on both larger backbones at every checkpoint: DiT-B/2 gains $0.85$ FID at +80k ($48.65 \rightarrow 47.80$), and DiT-L/2 matches the baseline at +80k while leading at every earlier checkpoint. This suggests that the rater captures a backbone-agnostic notion of noise quality, and that one cheap rater training run can be amortized across an entire family of diffusion models.
Table 4: Generalization to larger backbones. A rater trained on DiT-S/2 (at the 400k checkpoint) is applied off-the-shelf to train DiT-B/2 and DiT-L/2 on ImageNet $256{\times}256$, resuming from their 200k and 100k checkpoints, respectively. The rater is frozen and never retrained on the larger backbones.
#### Effect of training stage\.
A natural question is whether the rater is equally beneficial when applied at different stages of diffusion training. We test this by training a rater on top of DiT-S/2 checkpoints at 100k, 400k, and 500k steps, and then continuing diffusion training for an additional 80k steps with $k{=}2$. As shown in Tab. [5](https://arxiv.org/html/2605.08144#S5.T5), the rater consistently improves FID at every stage and every checkpoint, but the magnitude of the gain varies substantially: applied at 400k, the rater yields a 2.1%–2.8% relative FID reduction, whereas at 100k and 500k the gain shrinks to 0.4%–1.3% and 0.4%–0.7%, respectively. We attribute this gap to hyperparameter sensitivity. All three runs reuse the same training recipe tuned on the 400k checkpoint; we did not re-tune for the early or late training regimes, where the diffusion model's noise-to-signal characteristics differ noticeably from those at 400k. We expect stage-specific tuning to recover most of the gap, and we leave a thorough hyperparameter sweep across training stages to future work. Even without such tuning, the rater never hurts and delivers consistent improvements across the full training trajectory.
Table 5: Effect of applying the rater at different stages of DiT-S/2 training on ImageNet $256{\times}256$. We train a rater on top of the 100k, 400k, and 500k checkpoints, respectively, then continue diffusion training for an additional 80k steps. FID-50K is reported every 20k steps.
### 5.3 Interpreting the Rater Learning Pattern
To probe what the rater has actually learned, we analyze its scores along two axes (the diffusion model's training stage and the diffusion timestep) under two statistical lenses (correlation with noise norm and score variance). Concretely, we attach a rater to DiT-S/2 checkpoints at training steps $\{0, 20\text{k}, 40\text{k}, \dots, 600\text{k}\}$, train each rater for 500 steps using the hyperparameters in Tab. [B.2](https://arxiv.org/html/2605.08144#A2.SS2), and obtain 31 raters in total. For each rater, we draw 2k images and, for every image, sample 16 candidate noises; we then evaluate the rater on every $(\text{image}, \text{noise})$ pair at 11 normalized diffusion timesteps $t \in \{0.0, 0.1, \dots, 1.0\}$. For each (image, $t$) pair, we compute two statistics over the 16 candidate noises:
- Score–norm correlation. The Spearman correlation between the rater's scores and the $\ell_2$ norm of the noise. Because the noise norm is monotonic with the per-sample log-density under $\mathcal{N}(0, I)$, this metric indicates whether the rater's preferences effectively reduce to a rule of filtering typical versus atypical noise samples.
- Score standard deviation. The standard deviation of the rater's scores across the 16 candidates, measuring how discriminative the rater is, that is, whether it considers the candidates roughly interchangeable or strongly prefers some over others.
We aggregate these statistics along two complementary axes: (i) across the 31 diffusion training stages (pooling over all $t$) to examine how the rater's behavior varies with diffusion training progress, and (ii) across the 11 diffusion timesteps (pooling over all stages) to examine how the rater's behavior depends on the noise level. The distributions are plotted in Fig. [2](https://arxiv.org/html/2605.08144#S5.F2) and Fig. [3](https://arxiv.org/html/2605.08144#S5.F3).
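The two statistics can be computed for a single (image, $t$) pair as in the sketch below. It is an illustration under assumptions: the function names are hypothetical, the scores are synthetic rather than produced by a trained rater, Spearman correlation is implemented as the Pearson correlation of ranks (valid when there are no ties), and the population standard deviation is used because the text does not specify which estimator was applied.

```python
import numpy as np

def ranks(x):
    """Ranks 0..n-1 of a 1-D array (assumes no ties, as with continuous values)."""
    order = np.argsort(x)
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(len(x))
    return r

def spearman(a, b):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return np.corrcoef(ranks(a), ranks(b))[0, 1]

def rater_diagnostics(scores, noises):
    """scores: (n_cand,), noises: (n_cand, dim), both for one (image, t) pair."""
    norms = np.linalg.norm(noises, axis=1)
    return spearman(scores, norms), scores.std()

rng = np.random.default_rng(0)
noises = rng.standard_normal((16, 64))             # 16 candidates, dim=64 (illustrative)
norm_scores = np.linalg.norm(noises, axis=1) ** 2  # a pure norm filter
rand_scores = rng.standard_normal(16)              # scores unrelated to the norm

rho_norm, std_norm = rater_diagnostics(norm_scores, noises)
rho_rand, std_rand = rater_diagnostics(rand_scores, noises)
print("norm filter:", rho_norm, std_norm)
print("random     :", rho_rand, std_rand)
```

A rater that had collapsed to a norm filter would behave like `norm_scores`, with a correlation at or near 1, while a correlation near zero would indicate that the scores carry information beyond the norm.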
#### The rater does not collapse to a noise\-norm filter\.
Across mature diffusion training stages (Fig. [3](https://arxiv.org/html/2605.08144#S5.F3)a, $\geq 20$k), the score–norm Spearman correlation is centered tightly around zero, with the bulk of the distribution within $|\rho| < 0.1$. The single striking exception is at stage 0 (i.e., a randomly initialized DiT), where the correlation jumps to $\rho \approx 0.9$. This suggests that when the diffusion model is untrained, the inner-loop loss has no signal beyond the statistics of the noise itself, so the only thing the rater *can* learn is a function of the noise norm. As soon as the diffusion model has been trained for even 20k steps, this trivial solution disappears, and the rater learns something more complex.
#### The rater is most discriminative at intermediate noise levels\.
The score standard deviation in Fig. [2](https://arxiv.org/html/2605.08144#S5.F2)b is non-monotonic in $t$: it rises from $\sim 0.25$ at $t = 0$, peaks at $\sim 0.27$ around $t \in [0.6, 0.7]$, and falls back to $\sim 0.25$ at $t = 1$. The peak coincides with the noise-level regime that is empirically hardest for diffusion training (highest per-step loss), suggesting the rater finds the most signal precisely where the diffusion model is least confident: when $t$ is very small, the input is dominated by $x_0$ and all candidate noises behave similarly; when $t$ is very large, the input is nearly pure noise and any single candidate matters less.
#### Rater behavior stabilizes only after $\sim 60$k diffusion model training steps.
Fig. [3](https://arxiv.org/html/2605.08144#S5.F3)b reveals a pronounced transient in the rater's score variance across diffusion checkpoints. The y-axis reports the standard deviation of rater scores over the 16 candidate noises; each violin summarizes these per-(image, $t$) values aggregated across all images and timesteps within a given stage. Three regimes emerge. *(i) Stage 0:* the distribution is concentrated near zero, indicating that for an untrained DiT, the rater assigns nearly identical scores to all candidates and thus lacks discriminative ability. *(ii) Stages 20k–40k:* both the median (approximately $1.4$ and $0.8$, respectively) and the spread increase substantially, reflecting strong discrimination among candidates. *(iii) Stages 60k onward:* the distribution contracts to a narrow plateau around $0.25$, indicating stable but moderate discrimination. We interpret regime (ii) as reflecting large inner-loop gradients driven by the rapid convergence dynamics of an early-stage DiT, while regime (iii) arises once training stabilizes and the inner-loop gradients diminish.
Figure 2: Rater behavior across diffusion timesteps $t$, aggregated over all diffusion training stages. Panels: (a) Score–norm Spearman correlation; (b) Score standard deviation. (a) The Spearman rank correlation between rater scores and noise $\ell_2$ norm stays close to zero, indicating the rater does not reduce to a norm filter. (b) The score standard deviation peaks at intermediate $t$, where the rater finds the most signal among candidates.

Figure 3: Rater behavior across diffusion training stages, aggregated over all timesteps. Panels: (a) Score–norm Spearman correlation; (b) Score standard deviation. (a) At initialization (stage 0), the rater's scores are strongly correlated with the noise norm; this correlation rapidly disappears once diffusion training begins, with later stages exhibiting near-zero correlation. (b) The score standard deviation is near zero at stage 0, rises sharply between 20k and 40k steps, and stabilizes into a plateau from 60k onward.
## 6 Conclusion
We introduced NoiseRater, a meta\-learned approach for valuing noise during diffusion training\. By identifying and prioritizing informative noise instances, our method improves both training efficiency and generation quality\. This work highlights noise valuation as a new and effective direction for optimizing diffusion models\.
## References
- \[1\]D\. Ahn, J\. Kang, S\. Lee, J\. Min, M\. Kim, W\. Jang, H\. Cho, S\. Paul, S\. Kim, E\. Cha,et al\.\(2024\)A noise is worth diffusion guidance\.arXiv preprint arXiv:2412\.03895\.Cited by:[§1](https://arxiv.org/html/2605.08144#S1.p2.1)\.
- \[2\]A\. Bansal, H\. Chu, A\. Schwarzschild, S\. Sengupta, M\. Goldblum, J\. Geiping, and T\. Goldstein\(2023\)Universal guidance for diffusion models\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 843–852\.Cited by:[§1](https://arxiv.org/html/2605.08144#S1.p2.1)\.
- \[3\]S\. Bechtle, A\. Molchanov, Y\. Chebotar, E\. Grefenstette, L\. Righetti, G\. Sukhatme, and F\. Meier\(2021\)Meta learning via learned loss\.In2020 25th International Conference on Pattern Recognition \(ICPR\),pp\. 4161–4168\.Cited by:[§2](https://arxiv.org/html/2605.08144#S2.SS0.SSS0.Px3.p1.1)\.
- \[4\]D\. A\. Calian, G\. Farquhar, I\. Kemaev, L\. M\. Zintgraf, M\. Hessel, J\. Shar, J\. Oh, A\. György, T\. Schaul, J\. Dean,et al\.\(2025\)Datarater: meta\-learned dataset curation\.arXiv preprint arXiv:2505\.17895\.Cited by:[§2](https://arxiv.org/html/2605.08144#S2.SS0.SSS0.Px3.p1.1),[§4\.2](https://arxiv.org/html/2605.08144#S4.SS2.SSS0.Px2.p2.2),[§4\.2](https://arxiv.org/html/2605.08144#S4.SS2.p1.2)\.
- \[5\]C\. Chen, L\. Yang, X\. Yang, L\. Chen, G\. He, C\. Wang, and Y\. Li\(2024\)Find: fine\-tuning initial noise distribution with policy optimization for diffusion models\.InProceedings of the 32nd ACM International Conference on Multimedia,pp\. 6735–6744\.Cited by:[§2](https://arxiv.org/html/2605.08144#S2.SS0.SSS0.Px1.p1.1)\.
- \[6\]J\. Choi, J\. Lee, C\. Shin, S\. Kim, H\. Kim, and S\. Yoon\(2022\)Perception prioritized training of diffusion models\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 11472–11481\.Cited by:[§2](https://arxiv.org/html/2605.08144#S2.SS0.SSS0.Px2.p1.1)\.
- \[7\]J\. Deng, W\. Dong, R\. Socher, L\. Li, K\. Li, and L\. Fei\-Fei\(2009\)Imagenet: a large\-scale hierarchical image database\.In2009 IEEE conference on computer vision and pattern recognition,pp\. 248–255\.Cited by:[§5\.1](https://arxiv.org/html/2605.08144#S5.SS1.SSS0.Px1.p1.3)\.
- \[8\]N\. Elata, T\. Michaeli, and M\. Elad\(2024\)Adaptive compressed sensing with diffusion\-based posterior sampling\.InEuropean Conference on Computer Vision,pp\. 290–308\.Cited by:[§1](https://arxiv.org/html/2605.08144#S1.p2.1)\.
- \[9\]L\. Engstrom, A\. Ilyas, B\. Chen, A\. Feldmann, W\. Moses, and A\. Madry\(2025\)Optimizing ml training with metagradient descent\.arXiv preprint arXiv:2503\.13751\.Cited by:[§1](https://arxiv.org/html/2605.08144#S1.p5.1)\.
- \[10\]P\. Esser, S\. Kulal, A\. Blattmann, R\. Entezari, J\. Müller, H\. Saini, Y\. Levi, D\. Lorenz, A\. Sauer, F\. Boesel,et al\.\(2024\)Scaling rectified flow transformers for high\-resolution image synthesis\.InForty\-first international conference on machine learning,Cited by:[§5\.1](https://arxiv.org/html/2605.08144#S5.SS1.SSS0.Px3.p1.1)\.
- \[11\]L\. Eyring, V\. Pauline, S\. Bauer, Z\. Akata, and A\. DosovitskiyDDNO: discrete diffusion noise optimization\.Cited by:[§1](https://arxiv.org/html/2605.08144#S1.p2.1)\.
- \[12\]X\. Guo, J\. Liu, M\. Cui, J\. Li, H\. Yang, and D\. Huang\(2024\)Initno: boosting text\-to\-image diffusion models via initial noise optimization\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 9380–9389\.Cited by:[§2](https://arxiv.org/html/2605.08144#S2.SS0.SSS0.Px1.p1.1)\.
- \[13\]T\. Hang, S\. Gu, J\. Bao, F\. Wei, D\. Chen, X\. Geng, and B\. Guo\(2025\)Improved noise schedule for diffusion training\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 4796–4806\.Cited by:[§1](https://arxiv.org/html/2605.08144#S1.p3.1),[§2](https://arxiv.org/html/2605.08144#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2605.08144#S4.SS1.p1.3)\.
- \[14\]T\. Hang, S\. Gu, C\. Li, J\. Bao, D\. Chen, H\. Hu, X\. Geng, and B\. Guo\(2023\)Efficient diffusion training via min\-snr weighting strategy\.InProceedings of the IEEE/CVF international conference on computer vision,pp\. 7441–7451\.Cited by:[§2](https://arxiv.org/html/2605.08144#S2.SS0.SSS0.Px2.p1.1)\.
- \[15\]H\. He, J\. Liang, X\. Wang, P\. Wan, D\. Zhang, K\. Gai, and L\. Pan\(2025\)Scaling image and video generation via test\-time evolutionary search\.arXiv preprint arXiv:2505\.17618\.Cited by:[§1](https://arxiv.org/html/2605.08144#S1.p2.1)\.
- \[16\]M\. Heusel, H\. Ramsauer, T\. Unterthiner, B\. Nessler, and S\. Hochreiter\(2017\)Gans trained by a two time\-scale update rule converge to a local nash equilibrium\.Advances in neural information processing systems30\.Cited by:[§5\.1](https://arxiv.org/html/2605.08144#S5.SS1.SSS0.Px5.p1.7)\.
- \[17\]J\. Ho, A\. Jain, and P\. Abbeel\(2020\)Denoising diffusion probabilistic models\.Advances in neural information processing systems33,pp\. 6840–6851\.Cited by:[§1](https://arxiv.org/html/2605.08144#S1.p1.1),[§3](https://arxiv.org/html/2605.08144#S3.SS0.SSS0.Px4.p1.1)\.
- \[18\]J\. Ho and T\. Salimans\(2022\)Classifier\-free diffusion guidance\.arXiv preprint arXiv:2207\.12598\.Cited by:[§3](https://arxiv.org/html/2605.08144#S3.SS0.SSS0.Px3.p1.7),[§5\.1](https://arxiv.org/html/2605.08144#S5.SS1.SSS0.Px1.p1.3)\.
- \[19\]L\. Jiang, Z\. Zhou, T\. Leung, L\. Li, and L\. Fei\-Fei\(2018\)Mentornet: learning data\-driven curriculum for very deep neural networks on corrupted labels\.InInternational conference on machine learning,pp\. 2304–2313\.Cited by:[§2](https://arxiv.org/html/2605.08144#S2.SS0.SSS0.Px3.p1.1)\.
- \[20\]T\. Karras, M\. Aittala, T\. Aila, and S\. Laine\(2022\)Elucidating the design space of diffusion\-based generative models\.Advances in neural information processing systems35,pp\. 26565–26577\.Cited by:[§2](https://arxiv.org/html/2605.08144#S2.SS0.SSS0.Px2.p1.1)\.
- \[21\]T\. Karras, S\. Laine, and T\. Aila\(2019\)A style\-based generator architecture for generative adversarial networks\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 4401–4410\.Cited by:[§5\.1](https://arxiv.org/html/2605.08144#S5.SS1.SSS0.Px1.p1.3)\.
- \[22\]K\. Karunratanakul, K\. Preechakul, E\. Aksan, T\. Beeler, S\. Suwajanakorn, and S\. Tang\(2024\)Optimizing diffusion noise can serve as universal motion priors\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 1334–1345\.Cited by:[§2](https://arxiv.org/html/2605.08144#S2.SS0.SSS0.Px1.p1.1)\.
- \[23\]I\. Kemaev, D\. A\. Calian, L\. M\. Zintgraf, G\. Farquhar, and H\. Van Hasselt\(2025\)Scalable meta\-learning via mixed\-mode differentiation\.arXiv preprint arXiv:2505\.00793\.Cited by:[§5\.1](https://arxiv.org/html/2605.08144#S5.SS1.SSS0.Px4.p1.5)\.
- \[24\]J\. Kim, T\. Yoon, J\. Hwang, and M\. Sung\(2025\)Inference\-time scaling for flow models via stochastic generation and rollover budget forcing\.arXiv preprint arXiv:2503\.19385\.Cited by:[§1](https://arxiv.org/html/2605.08144#S1.p2.1)\.
- \[25\]J\. Kim, H\. Go, S\. Kwon, and H\. Kim\(2024\)Denoising task difficulty\-based curriculum for training diffusion models\.arXiv preprint arXiv:2403\.10348\.Cited by:[§2](https://arxiv.org/html/2605.08144#S2.SS0.SSS0.Px2.p1.1)\.
- \[26\]X\. Li, M\. Uehara, X\. Su, G\. Scalia, T\. Biancalani, A\. Regev, S\. Levine, and S\. Ji\(2025\)Dynamic search for inference\-time alignment in diffusion models\.arXiv preprint arXiv:2503\.02039\.Cited by:[§1](https://arxiv.org/html/2605.08144#S1.p2.1)\.
- \[27\]Y\. Li, H\. Jiang, A\. Kodaira, M\. Tomizuka, K\. Keutzer, and C\. Xu\(2024\)Immiscible diffusion: accelerating diffusion training with noise assignment\.Advances in neural information processing systems37,pp\. 90198–90225\.Cited by:[§2](https://arxiv.org/html/2605.08144#S2.SS0.SSS0.Px2.p1.1)\.
- \[28\]Y\. Li, F\. Liang, D\. Kondratyuk, M\. Tomizuka, K\. Keutzer, and C\. Xu\(2025\)Improved immiscible diffusion: accelerate diffusion training by reducing its miscibility\.arXiv preprint arXiv:2505\.18521\.Cited by:[§2](https://arxiv.org/html/2605.08144#S2.SS0.SSS0.Px2.p1.1)\.
- \[29\]N\. Ma, S\. Tong, H\. Jia, H\. Hu, Y\. Su, M\. Zhang, X\. Yang, Y\. Li, T\. Jaakkola, X\. Jia,et al\.\(2025\)Scaling inference time compute for diffusion models\.InProceedings of the Computer Vision and Pattern Recognition Conference,pp\. 2523–2534\.Cited by:[§1](https://arxiv.org/html/2605.08144#S1.p2.1)\.
- \[30\]D\. Maclaurin, D\. Duvenaud, and R\. Adams\(2015\)Gradient\-based hyperparameter optimization through reversible learning\.InInternational conference on machine learning,pp\. 2113–2122\.Cited by:[§1](https://arxiv.org/html/2605.08144#S1.p5.1)\.
- \[31\]W\. Peebles and S\. Xie\(2023\)Scalable diffusion models with transformers\.InProceedings of the IEEE/CVF international conference on computer vision,pp\. 4195–4205\.Cited by:[§B\.4](https://arxiv.org/html/2605.08144#A2.SS4.p3.3),[§5\.1](https://arxiv.org/html/2605.08144#S5.SS1.SSS0.Px2.p1.1),[§5\.1](https://arxiv.org/html/2605.08144#S5.SS1.SSS0.Px3.p1.1)\.
- \[32\]A\. Polyak, A\. Zohar, A\. Brown, A\. Tjandra, A\. Sinha, A\. Lee, A\. Vyas, B\. Shi, C\. Ma, C\. Chuang,et al\.\(2024\)Movie gen: a cast of media foundation models\.arXiv preprint arXiv:2410\.13720\.Cited by:[§1](https://arxiv.org/html/2605.08144#S1.p1.1)\.
- \[33\]Z\. Qi, L\. Bai, H\. Xiong, and Z\. Xie\(2024\)Not all noises are created equally: diffusion noise selection and optimization\.arXiv preprint arXiv:2407\.14041\.Cited by:[§1](https://arxiv.org/html/2605.08144#S1.p3.1)\.
- \[34\]A\. Ramesh, M\. Pavlov, G\. Goh, S\. Gray, C\. Voss, A\. Radford, M\. Chen, and I\. Sutskever\(2021\)Zero\-shot text\-to\-image generation\.InInternational conference on machine learning,pp\. 8821–8831\.Cited by:[§1](https://arxiv.org/html/2605.08144#S1.p1.1)\.
- \[35\]G\. Raya, B\. Nguyen, G\. Batzolis, Y\. Takida, D\. Stancevic, N\. Murata, C\. Lai, Y\. Mitsufuji, and L\. Ambrogioni\(2026\)Information\-guided noise allocation for efficient diffusion training\.arXiv preprint arXiv:2602\.18647\.Cited by:[§1](https://arxiv.org/html/2605.08144#S1.p3.1),[§2](https://arxiv.org/html/2605.08144#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2605.08144#S4.SS1.p1.3)\.
- \[36\]M\. Ren, W\. Zeng, B\. Yang, and R\. Urtasun\(2018\)Learning to reweight examples for robust deep learning\.InInternational conference on machine learning,pp\. 4334–4343\.Cited by:[§2](https://arxiv.org/html/2605.08144#S2.SS0.SSS0.Px3.p1.1)\.
- \[37\]Y\. Ren, W\. Gao, L\. Ying, G\. M\. Rotskoff, and J\. Han\(2025\)Driftlite: lightweight drift control for inference\-time scaling of diffusion models\.arXiv preprint arXiv:2509\.21655\.Cited by:[§1](https://arxiv.org/html/2605.08144#S1.p2.1)\.
- \[38\]K\. Rojas, Y\. Zhu, S\. Zhu, F\. X\. Ye, and M\. Tao\(2025\)Diffuse everything: multimodal diffusion models on arbitrary state spaces\.ICML\.Cited by:[§1](https://arxiv.org/html/2605.08144#S1.p1.1)\.
- \[39\]R\. Rombach, A\. Blattmann, D\. Lorenz, P\. Esser, and B\. Ommer\(2022\)High\-resolution image synthesis with latent diffusion models\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 10684–10695\.Cited by:[§B\.1](https://arxiv.org/html/2605.08144#A2.SS1.SSS0.Px1.p1.17),[§1](https://arxiv.org/html/2605.08144#S1.p1.1),[§5\.1](https://arxiv.org/html/2605.08144#S5.SS1.SSS0.Px2.p1.1)\.
- \[40\]J\. Shu, Q\. Xie, L\. Yi, Q\. Zhao, S\. Zhou, Z\. Xu, and D\. Meng\(2019\)Meta\-weight\-net: learning an explicit mapping for sample weighting\.Advances in neural information processing systems32\.Cited by:[§2](https://arxiv.org/html/2605.08144#S2.SS0.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2605.08144#S4.SS1.p1.3),[§4\.2](https://arxiv.org/html/2605.08144#S4.SS2.SSS0.Px2.p2.2)\.
- \[41\]R\. Singhal, Z\. Horvitz, R\. Teehan, M\. Ren, Z\. Yu, K\. McKeown, and R\. Ranganath\(2025\)A general framework for inference\-time scaling and steering of diffusion models\.arXiv preprint arXiv:2501\.06848\.Cited by:[§1](https://arxiv.org/html/2605.08144#S1.p2.1)\.
- \[42\]J\. Sohl\-Dickstein, E\. Weiss, N\. Maheswaranathan, and S\. Ganguli\(2015\)Deep unsupervised learning using nonequilibrium thermodynamics\.InInternational conference on machine learning,pp\. 2256–2265\.Cited by:[§1](https://arxiv.org/html/2605.08144#S1.p1.1)\.
- \[43\]J\. Song, C\. Meng, and S\. Ermon\(2020\)Denoising diffusion implicit models\.arXiv preprint arXiv:2010\.02502\.Cited by:[§1](https://arxiv.org/html/2605.08144#S1.p6.1),[§3](https://arxiv.org/html/2605.08144#S3.SS0.SSS0.Px4.p1.1)\.
- \[44\]Y\. Song, J\. Sohl\-Dickstein, D\. P\. Kingma, A\. Kumar, S\. Ermon, and B\. Poole\(2020\)Score\-based generative modeling through stochastic differential equations\.arXiv preprint arXiv:2011\.13456\.Cited by:[§1](https://arxiv.org/html/2605.08144#S1.p1.1)\.
- \[45\]A\. Stecklov, N\. E\. Rimawi\-Fine, and M\. Blanchette\(2025\)Inference\-time compute scaling for flow matching\.arXiv preprint arXiv:2510\.17786\.Cited by:[§1](https://arxiv.org/html/2605.08144#S1.p2.1)\.
- \[46\]N\. Sun and L\. Shi\(2026\)Variance\-aware adaptive weighting for diffusion model training\.arXiv preprint arXiv:2603\.10391\.Cited by:[§1](https://arxiv.org/html/2605.08144#S1.p3.1),[§2](https://arxiv.org/html/2605.08144#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2605.08144#S4.SS1.p1.3)\.
- \[47\]Q\. Sun, Z\. Jiang, H\. Zhao, and K\. He\(2025\)Is noise conditioning necessary for denoising generative models?\.arXiv preprint arXiv:2502\.13129\.Cited by:[§2](https://arxiv.org/html/2605.08144#S2.SS0.SSS0.Px2.p1.1)\.
- [48] Z. Tang, J. Peng, J. Tang, M. Hong, F. Wang, and T. Chang (2024) Tuning-free alignment of diffusion models with direct noise optimization. In ICML 2024 Workshop on Structured Probabilistic Inference & Generative Modeling. Cited by: [§2](https://arxiv.org/html/2605.08144#S2.SS0.SSS0.Px1.p1.1).
- \[49\]Y\. Wang, Y\. He, and M\. Tao\(2024\)Evaluating the design space of diffusion\-based generative models\.Advances in Neural Information Processing Systems37,pp\. 19307–19352\.Cited by:[§1](https://arxiv.org/html/2605.08144#S1.p3.1),[§2](https://arxiv.org/html/2605.08144#S2.SS0.SSS0.Px2.p1.1)\.
- \[50\]Z\. Wang, G\. Hu, and Q\. Hu\(2020\)Training noise\-robust deep neural networks via meta\-learning\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 4524–4533\.Cited by:[§2](https://arxiv.org/html/2605.08144#S2.SS0.SSS0.Px3.p1.1)\.
- \[51\]J\. L\. Watson, D\. Juergens, N\. R\. Bennett, B\. L\. Trippe, J\. Yim, H\. E\. Eisenach, W\. Ahern, A\. J\. Borst, R\. J\. Ragotte, L\. F\. Milles,et al\.\(2023\)De novo design of protein structure and function with rfdiffusion\.Nature620\(7976\),pp\. 1089–1100\.Cited by:[§1](https://arxiv.org/html/2605.08144#S1.p1.1)\.
- \[52\]L\. Yang, Y\. Tian, B\. Li, X\. Zhang, K\. Shen, Y\. Tong, and M\. Wang\(2025\)Mmada: multimodal large diffusion language models\.arXiv preprint arXiv:2505\.15809\.Cited by:[§1](https://arxiv.org/html/2605.08144#S1.p1.1)\.
- \[53\]J\. Yoon, H\. Cho, D\. Baek, Y\. Bengio, and S\. Ahn\(2025\)Monte carlo tree diffusion for system 2 planning\.arXiv preprint arXiv:2502\.07202\.Cited by:[§1](https://arxiv.org/html/2605.08144#S1.p2.1)\.
- \[54\]Z\. Zhou, S\. Shao, L\. Bai, S\. Zhang, Z\. Xu, B\. Han, and Z\. Xie\(2025\)Golden noise for diffusion models: a learning framework\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 17688–17697\.Cited by:[§2](https://arxiv.org/html/2605.08144#S2.SS0.SSS0.Px1.p1.1)\.
## Appendix A Mathematical Analysis
### A.1 Proof of Thm. [4.1](https://arxiv.org/html/2605.08144#S4.Thmtheorem1)
Define $F(\theta,\eta):=\nabla_{\theta}\mathcal{L}_{\mathrm{inner}}(\theta;\eta)$. At the inner optimum $\theta^{*}(\eta)$, the first-order optimality condition gives
$$F(\theta^{*}(\eta),\eta)=\nabla_{\theta}\mathcal{L}_{\mathrm{inner}}(\theta^{*}(\eta);\eta)=0.$$
Since $\mathcal{L}_{\mathrm{inner}}(\theta;\eta)$ is $\mu$-strongly convex in $\theta$, its Hessian $H(\eta):=\nabla_{\theta}^{2}\mathcal{L}_{\mathrm{inner}}(\theta^{*}(\eta);\eta)$ is positive definite and hence invertible. Therefore, by the implicit function theorem, $\theta^{*}(\eta)$ is locally unique and differentiable with respect to $\eta$. Differentiating the optimality condition $\nabla_{\theta}\mathcal{L}_{\mathrm{inner}}(\theta^{*}(\eta);\eta)=0$ with respect to $\eta$ yields
$$\nabla_{\theta}^{2}\mathcal{L}_{\mathrm{inner}}(\theta^{*}(\eta);\eta)\,\frac{\partial\theta^{*}(\eta)}{\partial\eta}+\nabla_{\eta}\nabla_{\theta}\mathcal{L}_{\mathrm{inner}}(\theta^{*}(\eta);\eta)=0.$$
Solving for $\partial\theta^{*}(\eta)/\partial\eta$, we obtain
$$\frac{\partial\theta^{*}(\eta)}{\partial\eta}=-\Bigl(\nabla_{\theta}^{2}\mathcal{L}_{\mathrm{inner}}(\theta^{*}(\eta);\eta)\Bigr)^{-1}\nabla_{\eta}\nabla_{\theta}\mathcal{L}_{\mathrm{inner}}(\theta^{*}(\eta);\eta).$$
Now define the outer objective $J(\eta)=\mathcal{L}_{\mathrm{val}}(\theta^{*}(\eta))$. By the chain rule,
$$\nabla_{\eta}J(\eta)=\left(\frac{\partial\theta^{*}(\eta)}{\partial\eta}\right)^{\top}\nabla_{\theta}\mathcal{L}_{\mathrm{val}}(\theta^{*}(\eta)).$$
Substituting the expression above gives
$$\nabla_{\eta}J(\eta)=-\Bigl(\nabla_{\eta}\nabla_{\theta}\mathcal{L}_{\mathrm{inner}}(\theta^{*}(\eta);\eta)\Bigr)^{\top}\Bigl(\nabla_{\theta}^{2}\mathcal{L}_{\mathrm{inner}}(\theta^{*}(\eta);\eta)\Bigr)^{-1}\nabla_{\theta}\mathcal{L}_{\mathrm{val}}(\theta^{*}(\eta)).$$
This proves the claimed meta-gradient formula.
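As a sanity check on the meta-gradient formula, the sketch below compares it against central finite differences on a toy problem. Everything in it is an illustrative assumption rather than the paper's setup: the inner loss is the quadratic $\tfrac{1}{2}\theta^{\top}A\theta-\theta^{\top}B\eta$ with $A$ symmetric positive definite (so it is strongly convex in $\theta$), and the validation loss is $\tfrac{1}{2}\|\theta-c\|^{2}$. The names `A`, `B`, `c`, `theta_star`, and `meta_grad` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 3                       # toy sizes for theta and eta (assumed)
M = rng.normal(size=(d, d))
A = M @ M.T + d * np.eye(d)       # SPD Hessian => strongly convex inner loss
B = rng.normal(size=(d, m))
c = rng.normal(size=d)

def theta_star(eta):
    # Inner optimum of 0.5 * theta^T A theta - theta^T B eta.
    return np.linalg.solve(A, B @ eta)

def J(eta):
    # Outer objective L_val(theta*(eta)) with L_val(theta) = 0.5 * ||theta - c||^2.
    return 0.5 * np.sum((theta_star(eta) - c) ** 2)

def meta_grad(eta):
    hess = A                      # nabla_theta^2 L_inner
    cross = -B                    # nabla_eta nabla_theta L_inner, shape (d, m)
    g_val = theta_star(eta) - c   # nabla_theta L_val at the inner optimum
    return -cross.T @ np.linalg.solve(hess, g_val)

def finite_diff(f, x, h=1e-6):
    # Central differences, one coordinate at a time.
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

eta0 = rng.normal(size=m)
analytic = meta_grad(eta0)
numeric = finite_diff(J, eta0)
max_abs_err = float(np.max(np.abs(analytic - numeric)))
```

Because this toy outer objective is quadratic in $\eta$, the two gradients should agree up to floating-point rounding.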
### A.2 Proof of Thm. [4.2](https://arxiv.org/html/2605.08144#S4.Thmtheorem2)
Let $g_{k}(\theta):=\nabla_{\theta}\ell^{(k)}(\theta)$ and $g_{\mathrm{val}}(\theta):=\nabla_{\theta}\mathcal{L}_{\mathrm{val}}(\theta)$. The single inner update is
$$\theta^{+}(\eta)=\theta-\alpha\sum_{k=1}^{K}w^{(k)}(\eta)\,g_{k}(\theta),$$
hence $\theta^{+}(\eta)-\theta=-\alpha\sum_{k=1}^{K}w^{(k)}(\eta)\,g_{k}(\theta)$. Since $\mathcal{L}_{\mathrm{val}}$ has $L$-Lipschitz continuous gradients, the descent lemma gives
$$\mathcal{L}_{\mathrm{val}}(\theta^{+})\leq\mathcal{L}_{\mathrm{val}}(\theta)+\left\langle g_{\mathrm{val}}(\theta),\theta^{+}-\theta\right\rangle+\frac{L}{2}\|\theta^{+}-\theta\|^{2}.$$
Substituting the update direction, the linear term becomes
$$\left\langle g_{\mathrm{val}}(\theta),\theta^{+}-\theta\right\rangle=-\alpha\sum_{k=1}^{K}w^{(k)}(\eta)\left\langle g_{\mathrm{val}}(\theta),g_{k}(\theta)\right\rangle,$$
and the quadratic term satisfies
$$\frac{L}{2}\|\theta^{+}-\theta\|^{2}=\frac{L\alpha^{2}}{2}\left\|\sum_{k=1}^{K}w^{(k)}(\eta)\,g_{k}(\theta)\right\|^{2}=\mathcal{O}(\alpha^{2}).$$
Therefore,
$$\mathcal{L}_{\mathrm{val}}(\theta^{+}(\eta))\leq\mathcal{L}_{\mathrm{val}}(\theta)-\alpha\sum_{k=1}^{K}w^{(k)}(\eta)\left\langle\nabla_{\theta}\mathcal{L}_{\mathrm{val}}(\theta),\nabla_{\theta}\ell^{(k)}(\theta)\right\rangle+\mathcal{O}(\alpha^{2}).$$
Thus, up to second-order terms in $\alpha$, decreasing the validation loss favors larger weights on noise instances whose training gradients have larger positive alignment with the validation gradient.
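The bound can be checked numerically on a small example. The sketch below assumes a quadratic validation loss $\tfrac{1}{2}\theta^{\top}Q\theta-b^{\top}\theta$ (so $L$ is the largest eigenvalue of $Q$) and uses random vectors as stand-ins for the per-noise gradients $g_{k}(\theta)$. All names (`Q`, `b`, `G`, `bound_terms`) are hypothetical, and the setup only illustrates the inequality, not the paper's training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, alpha = 5, 4, 1e-2                 # toy sizes and step size (assumed)

M = rng.normal(size=(d, d))
Q = M @ M.T + np.eye(d)                  # SPD matrix of the toy validation loss
b = rng.normal(size=d)
L = float(np.linalg.eigvalsh(Q).max())   # Lipschitz constant of the gradient

def val_loss(theta):
    return 0.5 * theta @ Q @ theta - b @ theta

theta = rng.normal(size=d)
g_val = Q @ theta - b                    # validation gradient at theta
G = rng.normal(size=(K, d))              # row k stands in for g_k(theta)

def bound_terms(w):
    """Return (actual loss after the step, descent-lemma upper bound, linear term)."""
    step = -alpha * (w @ G)
    actual = float(val_loss(theta + step))
    linear = -alpha * float(w @ (G @ g_val))
    quad = 0.5 * L * float(step @ step)
    return actual, float(val_loss(theta)) + linear + quad, linear

w_uniform = np.full(K, 1.0 / K)
w_aligned = np.eye(K)[np.argmax(G @ g_val)]   # all weight on the best-aligned gradient

actual_u, upper_u, linear_u = bound_terms(w_uniform)
actual_a, upper_a, linear_a = bound_terms(w_aligned)
```

Among weights on the simplex, the first-order term is minimized by placing all mass on the candidate with the largest inner product $\langle g_{\mathrm{val}},g_{k}\rangle$, which is why `linear_a` cannot exceed `linear_u`.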
## Appendix B Experimental Details
### B.1 Rater Architecture
Figure 4: Illustration of the noise rater's architecture, which stacks multiple dual-stream DiT blocks to incorporate several conditioning inputs.
#### Inputs and tokenization.
The rater takes as input a clean image latent $x_{0}$, a noise sample $\epsilon$, a diffusion timestep $t$, and a class label $c$. We operate in the latent space of the publicly available `stabilityai/sd-vae-ft-mse` VAE \[[39](https://arxiv.org/html/2605.08144#bib.bib20)\], which encodes a $256\times 256$ RGB image into a $4\times 32\times 32$ latent ($8\times$ spatial downsampling, $4$ channels); the noise sample $\epsilon$ is drawn from $\mathcal{N}(0,I)$ with the same shape. Both $x_{0}$ and $\epsilon$ are patchified by two independent `PatchEmbed` projections (Conv2d with kernel size = stride = $p=2$), producing two token sequences of length $T=(32/p)^{2}=256$, each in $\mathbb{R}^{T\times D}$ where $D=384$ is the hidden size. Both streams share a single frozen 2D sin-cos positional embedding (added before the first block) and additionally receive learnable per-stream type embeddings that mark their stream identity, so the joint attention can distinguish image tokens from noise tokens.
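The token shapes above can be reproduced with a few lines of NumPy. This is a minimal sketch, not the authors' `PatchEmbed`: it relies on the fact that a convolution with kernel size equal to stride acts as a linear projection of each flattened non-overlapping patch (bias omitted here), and the random matrices `W_img` and `W_noise` are hypothetical stand-ins for the two learned projections. Positional and type embeddings are left out.

```python
import numpy as np

C, H, W = 4, 32, 32      # latent shape stated in the text
p, D = 2, 384            # patch size and hidden size stated in the text

def patchify(latent, p):
    """Split a (C, H, W) array into (T, C*p*p) non-overlapping patches."""
    c, h, w = latent.shape
    x = latent.reshape(c, h // p, p, w // p, p)
    x = x.transpose(1, 3, 0, 2, 4)               # (h/p, w/p, c, p, p)
    return x.reshape((h // p) * (w // p), c * p * p)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(C, H, W))                  # stand-in for a clean image latent
eps = rng.normal(size=(C, H, W))                 # noise sample from N(0, I)

# Two independent projections, one per stream (random stand-ins for learned weights).
W_img = rng.normal(size=(C * p * p, D)) * 0.02
W_noise = rng.normal(size=(C * p * p, D)) * 0.02

img_tokens = patchify(x0, p) @ W_img             # (T, D)
noise_tokens = patchify(eps, p) @ W_noise        # (T, D)
T = img_tokens.shape[0]
```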
#### Conditioning\.
The conditioning vector $y\in\mathbb{R}^{D}$ is the sum of (i) a sinusoidal timestep embedding passed through a two-layer MLP and (ii) a learnable class embedding. The same $y$ is broadcast to every block.
#### Dual\-stream DiT block\.
Each block maintains two independent residual paths, one for the image stream and one for the noise stream. Each path has its own LayerNorm, MLP, and adaLN-Zero modulation parameters. Following DiT, each stream produces six modulation values per block; concretely, a single Linear layer projects $y$ to $12D$ outputs and chunks them into twelve $D$-dimensional modulation vectors (six per stream). Attention is the only sublayer shared across streams: at each block we form the concatenated sequence $[\hat{x}_{0}^{(\ell)};\hat{\epsilon}^{(\ell)}]\in\mathbb{R}^{2T\times D}$ from the modulated, normalized tokens of the two streams, run a single multi-head self-attention, and split the output back into per-stream halves before applying the per-stream gates and residuals. Joint attention lets noise tokens directly query image content (and vice versa) without requiring additional cross-attention parameters.
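A forward pass of one such block might look like the following NumPy sketch. It is written under several assumptions that the text does not fix: the DiT-style modulation `x * (1 + scale) + shift`, a tanh-approximated GELU in the MLPs, no biases, no nonlinearity before the modulation projection, and small random weights. The function and parameter names (`dual_stream_block`, `W_mod`, `Wqkv`, `Wo`, `mlp_img`, `mlp_noise`) are hypothetical.

```python
import numpy as np

T, D, heads = 256, 384, 6
rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

def modulate(x, shift, scale):
    return x * (1.0 + scale) + shift

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def self_attention(x, Wqkv, Wo, heads):
    n, d = x.shape
    hd = d // heads
    q, k, v = np.split(x @ Wqkv, 3, axis=-1)
    q, k, v = (a.reshape(n, heads, hd).transpose(1, 0, 2) for a in (q, k, v))
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(hd)
    logits -= logits.max(-1, keepdims=True)       # numerically stable softmax
    attn = np.exp(logits)
    attn /= attn.sum(-1, keepdims=True)
    return (attn @ v).transpose(1, 0, 2).reshape(n, d) @ Wo

def init(*shape):
    return rng.normal(size=shape) * 0.02

params = {
    "W_mod": init(D, 12 * D),                     # y -> 12 modulation vectors
    "Wqkv": init(D, 3 * D), "Wo": init(D, D),     # attention shared by both streams
    "mlp_img": (init(D, 4 * D), init(4 * D, D)),  # per-stream MLPs
    "mlp_noise": (init(D, 4 * D), init(4 * D, D)),
}

def dual_stream_block(x_img, x_noise, y, P):
    mods = np.split(y @ P["W_mod"], 12)           # twelve D-dimensional vectors
    sh_a_i, sc_a_i, g_a_i, sh_m_i, sc_m_i, g_m_i = mods[:6]   # image stream
    sh_a_n, sc_a_n, g_a_n, sh_m_n, sc_m_n, g_m_n = mods[6:]   # noise stream
    # Joint attention over the concatenated (2T, D) sequence.
    joint = np.concatenate([
        modulate(layer_norm(x_img), sh_a_i, sc_a_i),
        modulate(layer_norm(x_noise), sh_a_n, sc_a_n),
    ])
    a_img, a_noise = np.split(self_attention(joint, P["Wqkv"], P["Wo"], heads), 2)
    x_img = x_img + g_a_i * a_img                 # per-stream gate and residual
    x_noise = x_noise + g_a_n * a_noise
    # Per-stream MLPs with their own modulation and gates.
    W1, W2 = P["mlp_img"]
    x_img = x_img + g_m_i * (gelu(modulate(layer_norm(x_img), sh_m_i, sc_m_i) @ W1) @ W2)
    W1, W2 = P["mlp_noise"]
    x_noise = x_noise + g_m_n * (gelu(modulate(layer_norm(x_noise), sh_m_n, sc_m_n) @ W1) @ W2)
    return x_img, x_noise

x_img = rng.normal(size=(T, D))
x_noise = rng.normal(size=(T, D))
y = rng.normal(size=D)
out_img, out_noise = dual_stream_block(x_img, x_noise, y, params)
```

Because the two streams meet only inside the shared attention call, changing the image tokens changes the noise-stream output, which is the cross-stream interaction the paragraph describes.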
#### Output head\.
After the final block, the image-stream tokens are discarded; the noise-stream tokens are mean-pooled across the spatial dimension, layer-normalized, and projected by a single linear layer to a scalar score $w\in\mathbb{R}$. Within each minibatch the scores are post-processed via group-wise normalization (Section [4.1](https://arxiv.org/html/2605.08144#S4.SS1)) before being used to select among $k$ candidate noises.
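The head and the selection step can be sketched as follows. The exact form of the group-wise normalization is defined in Section 4.1 and is not restated here, so this example assumes a simple standardization (zero mean, unit variance) within each group of $k$ candidates; that choice, the argmax selection rule, and all names (`group_normalize`, `w_head`, `b_head`) are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
B, k, T, D = 3, 4, 256, 384     # B data points with k candidate noises each (toy sizes)

def layer_norm(x, eps=1e-6):
    mu = x.mean(-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(-1, keepdims=True) + eps)

# Stand-in for the noise-stream tokens after the final block.
noise_tokens = rng.normal(size=(B, k, T, D))
w_head = rng.normal(size=D) * 0.02      # non-zero stand-in for the linear head
b_head = 0.0

pooled = noise_tokens.mean(axis=2)                  # mean-pool over tokens: (B, k, D)
scores = layer_norm(pooled) @ w_head + b_head       # one scalar per candidate: (B, k)

def group_normalize(s, eps=1e-6):
    # ASSUMPTION: standardize scores within each group of k candidates.
    return (s - s.mean(-1, keepdims=True)) / (s.std(-1, keepdims=True) + eps)

normed = group_normalize(scores)
selected = normed.argmax(axis=-1)                   # chosen candidate per data point
```

Under this assumed normalization, a group whose scores are all identical maps to all zeros, so no candidate is preferred over another.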
#### Hyperparameters and initialization\.
We use the same per-block configuration as DiT-S/2: depth $=12$, hidden size $D=384$, $6$ attention heads, MLP ratio $=4$, patch size $p=2$, operating on $32\times 32$ latents (i.e., $T=256$ tokens per stream). The total parameter count is approximately $60$M, roughly twice that of DiT-S/2 because of the duplicated MLPs and modulation networks.
Unlike generative DiT models, we do *not* adopt the adaLN-Zero initialization (i.e., zero-initializing the adaLN projection and the output head); instead, all linear layers use Xavier initialization. We found this necessary because the rater is trained via bilevel optimization: each rater update requires running an inner loop of diffusion updates whose loss is reweighted by the rater's scores. Under adaLN-Zero, the rater's score head is zero-initialized, so all candidate noises receive identical (zero) scores at initialization. After group-wise normalization, this collapses the inner-loop loss weights to a constant, yielding zero gradient signal back to the rater and stalling training before it begins. Standard Xavier initialization breaks this symmetry by giving the rater a non-trivial starting function, and yields stable training across all settings reported in the main paper.
### B\.2Training Hyperparameters
Tab. [6](https://arxiv.org/html/2605.08144#A2.T6) lists the hyperparameters used for noise rater training. The rater is trained on top of a frozen DiT-S/2 checkpoint via a meta-learning objective: an inner loop takes $5$ gradient steps on the diffusion model with rater-selected noise, and an outer loop updates the rater to minimize the resulting diffusion loss on a held-out validation split.
Table 6: Hyperparameters for noise rater training.
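The inner/outer structure described above can be illustrated with a deliberately tiny toy problem. This is not the paper's training code: quadratic losses replace the diffusion loss, the "rater" is one logit per candidate with softmax soft weights, and the meta-gradient is estimated by central finite differences rather than by differentiating through the inner loop. Only the count of 5 inner steps comes from the text; every other value and name is a hypothetical choice.

```python
import numpy as np

# Toy stand-ins: K candidate "noises", each with training loss 0.5 * ||theta - a_k||^2.
A = np.array([[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0], [0.0, -2.0]])  # targets a_k
v = np.array([2.0, 0.2])          # target of the toy validation loss
theta0 = np.zeros(2)              # fixed starting checkpoint for every inner loop
alpha, beta = 0.1, 0.2            # inner and outer step sizes (toy values)
INNER_STEPS = 5                   # number of inner gradient steps stated in the text

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def val_loss_after_inner(eta):
    """Run the inner loop from theta0 with soft weights, then evaluate validation loss."""
    w = softmax(eta)
    theta = theta0.copy()
    for _ in range(INNER_STEPS):
        grads = theta - A                     # row k is the gradient of loss k
        theta = theta - alpha * (w @ grads)   # weighted inner update
    return 0.5 * float(np.sum((theta - v) ** 2))

def finite_diff_grad(f, x, h=1e-5):
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

eta = np.zeros(len(A))                        # rater parameters: one logit per candidate
loss_before = val_loss_after_inner(eta)
for _ in range(300):                          # outer loop: update the rater
    eta = eta - beta * finite_diff_grad(val_loss_after_inner, eta)
loss_after = val_loss_after_inner(eta)
weights = softmax(eta)
```

In this toy, the outer loop shifts weight toward the candidate whose training target lies closest to the validation target, which lowers the validation loss reached after the inner steps.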
### B\.3Computation Resources
#### Rater training\.
With the hyperparameters in Tab. [6](https://arxiv.org/html/2605.08144#A2.T6), training the noise rater takes approximately 4 hours of wall-clock time on 4 NVIDIA H100 GPUs (16 CPU cores). This is a one-time cost that is amortized across all subsequent diffusion runs.
#### Diffusion training\.
All diffusion training is conducted on 4 NVIDIA H100 GPUs.
(i) DiT-S/2 main runs. Pre-training DiT-S/2 from scratch to 400k steps takes about 22 hours. Each 80k-step continuation run takes 4.3, 5.6, 7.0, and 9.0 hours for $k=1,2,4,8$, respectively.
(ii) Larger-backbone generalization. On DiT-B/2, the 80k continuation takes 5.4 hours without the rater and 6.7 hours with the rater at $k=2$. On DiT-L/2, it takes 9.2 hours without the rater and 10.4 hours with the rater at $k=2$.
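The relative wall-clock overhead implied by these figures can be worked out directly. The short calculation below uses only the hours reported above; treating the $k=1$ run as the reference point for DiT-S/2 is an assumption made for illustration, since the text does not state that $k=1$ is the no-rater baseline.

```python
# Wall-clock hours for an 80k-step continuation, as reported in the text.
dit_s_hours = {1: 4.3, 2: 5.6, 4: 7.0, 8: 9.0}             # DiT-S/2, keyed by k
larger = {"DiT-B/2": (5.4, 6.7), "DiT-L/2": (9.2, 10.4)}   # (no rater, rater at k=2)

def overhead_pct(hours, reference_hours):
    """Extra wall-clock time relative to the reference run, in percent."""
    return 100.0 * (hours / reference_hours - 1.0)

# ASSUMPTION: k=1 is used as the reference for DiT-S/2.
s_overhead = {k: round(overhead_pct(h, dit_s_hours[1]), 1)
              for k, h in dit_s_hours.items()}
larger_overhead = {name: round(overhead_pct(with_rater, base), 1)
                   for name, (base, with_rater) in larger.items()}

print(s_overhead)       # overhead in percent for k = 1, 2, 4, 8
print(larger_overhead)  # overhead in percent at k = 2 on the larger backbones
```

On these numbers the $k=2$ overhead is roughly 30% on DiT-S/2, 24% on DiT-B/2, and 13% on DiT-L/2, so the relative cost of rating shrinks as the backbone grows.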
### B\.4Training Stage Matching for Generalization Across Model Sizes
In the generalization-to-larger-backbones experiment, the rater is trained on top of a DiT-S/2 checkpoint at 400k steps and then transferred to DiT-B/2 and DiT-L/2. Since these backbones converge at different rates, naively reusing the same step count would compare backbones at very different points in their training trajectories. To match the rater at a suitable training stage, we instead select target checkpoints for DiT-B/2 and DiT-L/2 that match the *absolute training loss* of DiT-S/2 at 400k steps.
Concretely, let $t$ denote the training step, and let $L_{S}(t)$, $L_{B}(t)$, and $L_{L}(t)$ denote the training loss curves of DiT-S/2, DiT-B/2, and DiT-L/2, respectively. We define the matched checkpoint for each larger backbone as
$$t^{\star}_{B}=\min\big\{\,t\,:\,\widetilde{L}_{B}(t^{\prime})\leq\widetilde{L}_{S}(400\text{k})\;\;\text{for all}\;\;t^{\prime}\in[t,t+W]\,\big\},\tag{10}$$
and analogously for $t^{\star}_{L}$, where $\widetilde{L}(\cdot)$ denotes the training loss smoothed with a centered rolling window of size $500$, and $W=1000$ logged steps is a stability window that prevents single-step dips from triggering a spurious match. The smoothing and stability window together make the selection robust to the noisy loss curves typical of diffusion training.
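The selection rule in Eq. (10) can be sketched in plain Python as below. This is one possible reading, not the authors' script: both windows are measured in logged points, the rolling mean is truncated at the ends of the curve (an assumed edge handling), and the function names are hypothetical. The usage example runs on a synthetic loss curve with an artificial one-point dip and much smaller windows than the 500 and 1000 used in the paper.

```python
def centered_rolling_mean(values, window):
    """Centered moving average; the window is truncated at the curve boundaries."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def matched_checkpoint(steps, losses, target, smooth_window=500, stability_window=1000):
    """First step t whose smoothed loss stays <= target on [t, t + W], else None."""
    smooth = centered_rolling_mean(losses, smooth_window)
    for i in range(len(steps) - stability_window):
        if all(s <= target for s in smooth[i:i + stability_window + 1]):
            return steps[i]
    return None

# Synthetic example: a linearly decreasing loss with one spurious dip at index 10.
steps = list(range(0, 100_000, 1_000))
losses = [100 - i for i in range(100)]
losses[10] = 0                       # single-point dip that should not trigger a match
t_star = matched_checkpoint(steps, losses, target=49.5,
                            smooth_window=5, stability_window=5)
print(t_star)
```

In this synthetic run the dip at index 10 is ignored, and the match falls at the first point where the smoothed curve drops below the target and stays there.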
Applying this procedure with $L_{S}(400\text{k})$ as the target yields $t^{\star}_{B}=211$k for DiT-B/2 and $t^{\star}_{L}=119$k for DiT-L/2. Intuitively, larger backbones reach the same loss level in fewer steps, which is consistent with prior observations on capacity-driven sample efficiency in diffusion training \[[31](https://arxiv.org/html/2605.08144#bib.bib218)\].