Class-frequency Guided Noise Schedule for Diffusion Models


Summary

This paper proposes a class-frequency guided noise schedule for diffusion models that assigns larger-scale noises to low-frequency classes to improve generation quality on imbalanced datasets, demonstrating substantial improvements over baselines.

arXiv:2606.27696v1 Announce Type: new

Cached at: 06/29/26, 05:25 AM

# Class-frequency Guided Noise Schedule for Diffusion Models
Source: [https://arxiv.org/html/2606.27696](https://arxiv.org/html/2606.27696)
Jiequan Cui [1], Beier Zhu, Qingshan Xu, Xiaojuan Qi, Bei Yu, Hanwang Zhang

[1] Hefei University of Technology

[2] University of Science and Technology of China

[3] The University of Hong Kong

[4] The Chinese University of Hong Kong

[5] Nanyang Technological University

###### Abstract

In this paper, we are the first to examine the correlations between class frequency and the multi-scale noise schedule within diffusion models. For score-based generative models, low-density regions often lead to inaccurately estimated scores, thereby compromising generation quality. Although the multi-scale noise schedule can alleviate this issue during the diffusion process, low-frequency classes still face the challenge of large low-density regions, resulting in less accurate estimated scores than for high-frequency classes. Furthermore, high-frequency classes tend to dominate the score space, so that most data points converge towards generating samples from these classes. Consequently, samples generated for low-frequency classes exhibit suboptimal quality and limited diversity. To address this challenge, we propose the Class-frequency Guided (CFRG) noise schedule, leveraging the insight that low-frequency classes should be endowed with larger-scale noise. To illustrate the effectiveness of our method, we conduct experiments on various tasks, including image generation, image classification, and text-to-image generation, using imbalanced datasets, i.e., CIFAR-100-LT and ImageNet-LT. By employing the CFRG noise schedule, we achieve substantial improvements over baselines, demonstrating the crucial role of frequency statistics in noise schedule design.

###### keywords:

Long\-tail, Diffusion models, Image generation

## 1 Introduction

Score-based generative models [song2020score,ho2020denoising,song2019generative] have garnered considerable attention across diverse domains, spanning image/video generation [rombach2022high,liu2024sora], audio synthesis [chen2020wavegrad], image editing [hertz2022prompt,brooks2023instructpix2pix,kawar2023imagic], and adversarial training [wang2023better]. Unlike GANs or likelihood-based models, these models concentrate on modeling the gradient of the log probability density function, i.e., the score function. A core ingredient in score-based models is the multi-scale noise schedule: multi-scale Gaussian noise is gradually added to the data until the signal is diffused into a normal distribution. Subsequently, the model is trained to map the normal distribution back to the data distribution. This process can be theoretically explained by Markov chains [sohl2015deep,ho2020denoising] or stochastic differential equations (SDEs) [song2020score].

The mechanism behind the multi-scale noise schedule has attracted considerable attention in the community. Several intriguing properties have been studied in recent work [song2020score,chen2023importance,hoogeboom2022blurring,kingma2021variational]. Chen et al. [chen2023importance] reveal the relationship between image size and the noise schedule. Kingma et al. [kingma2021variational] conclude that the diffusion loss is invariant to the shape of the signal-to-noise function $\mathrm{SNR}(t)$. Hoogeboom et al. [hoogeboom2022blurring] explore the Gaussian diffusion process with non-isotropic noise. In this paper, we are the first to examine how class frequency relates to the noise schedule in diffusion models.

![Refer to caption](https://arxiv.org/html/2606.27696v1/x1.png)\(a\)
![Refer to caption](https://arxiv.org/html/2606.27696v1/x2.png)\(b\)
![Refer to caption](https://arxiv.org/html/2606.27696v1/x3.png)\(c\)
![Refer to caption](https://arxiv.org/html/2606.27696v1/x4.png)\(d\)

Figure 1.1: Sample density visualization. (a) Clean sample density. (b) Density of noisy samples perturbed by the equal scale noise for all classes. (c) Density of noisy samples perturbed by our class-frequency guided (CFRG) noise schedule. Low-frequency classes are more likely to suffer from inaccurate score estimation because of their large low-density regions. (d) The quality of generated samples decreases as the class frequency decreases.

![Refer to caption](https://arxiv.org/html/2606.27696v1/x5.png)(a)
![Refer to caption](https://arxiv.org/html/2606.27696v1/x6.png)\(b\)
![Refer to caption](https://arxiv.org/html/2606.27696v1/x7.png)\(c\)
![Refer to caption](https://arxiv.org/html/2606.27696v1/x8.png)\(d\)

Figure 1.2: Data score $\nabla_x \log p(x)$ space visualization. (a) Ground-truth data scores. (b) Estimated scores with clean samples. (c) Estimated scores with samples perturbed by the equal scale noise for all classes. (d) Estimated scores with samples perturbed by our class-frequency guided (CFRG) noise schedule. The high-frequency class dominates the estimated score space in (b) and (c). With the CFRG noise schedule, we obtain a more balanced estimated score space for low-frequency classes in (d).

Our Motivation. For score-based generative models such as NCSN [song2019generative] and DDPM [ho2020denoising], the quality of generated samples depends heavily on the accuracy of the scores estimated by the learned score function. However, inaccuracies in score estimation can arise, particularly in low-density regions. Despite the integration of multi-scale noise schedules to reduce low-density regions in prominent generative models [ho2020denoising,song2020score], we observe that low-frequency classes continue to face this challenge.

We present a simplified illustration of a two-Gaussian mixture in Figures [1.1](https://arxiv.org/html/2606.27696#S1.F1) and [1.2](https://arxiv.org/html/2606.27696#S1.F2). In Figure [1.1(a)](https://arxiv.org/html/2606.27696#S1.F1.sf1), the low-frequency class exhibits smaller intra-class variance than the high-frequency class due to limited samples. Upon introduction of Gaussian noise, depicted in Figure [1.1(b)](https://arxiv.org/html/2606.27696#S1.F1.sf2), the resultant noisy samples populate low-density regions, where $p(x) \approx 0$, for both classes. However, notably larger low-density regions remain for the low-frequency class, leading to significantly more pronounced inaccuracies in estimated scores. An additional observation is the dominance of the high-frequency class in the estimated score space. As evidenced in Figures [1.2(b)](https://arxiv.org/html/2606.27696#S1.F2.sf2) and [1.2(c)](https://arxiv.org/html/2606.27696#S1.F2.sf3), most data points converge to generate samples from the high-frequency class, potentially hindering sample generation for the low-frequency class. Figure [1.1(d)](https://arxiv.org/html/2606.27696#S1.F1.sf4) is consistent with these findings: FID scores show a marked increase as class frequency decreases, underscoring substantially lower sample quality in low-frequency classes compared to their high-frequency counterparts.

Our Solution. To address the above challenges, we propose a Class-frequency Guided (CFRG) noise schedule: the noise scale should be inversely correlated to class frequency. By applying a relatively larger noise scale to low-frequency classes, we further reduce their low-density regions, as depicted in Figure [1.1(c)](https://arxiv.org/html/2606.27696#S1.F1.sf3). Additionally, our CFRG noise schedule fosters a more balanced distribution of estimated scores, as evidenced in Figure [1.2(d)](https://arxiv.org/html/2606.27696#S1.F2.sf4). To assess the effectiveness of our method, we conduct experiments on imbalanced datasets, specifically long-tailed CIFAR [krizhevsky2009learning] and ImageNet [imagenet]. In image generation tasks, we achieve FID scores of 5.14 and 2.33 on CIFAR-100-LT and ImageNet-LT, respectively, surpassing the DDPM baseline by 2.24 and 0.76, respectively. Additionally, in image classification, leveraging data generated by our CFRG models yields a notable improvement of 9.22% in top-1 accuracy on CIFAR-100-LT. Finally, we show through text-to-image generation that our method is applicable to vision-language diffusion models. Our key contributions are summarized as follows:

- We are the first to systematically investigate the relationships between the multi-scale noise schedule and class frequency. Two issues on low-frequency classes are identified: larger low-density regions and imbalanced estimated score space.
- To solve the challenges, we propose a class-frequency guided (CFRG) noise schedule for diffusion models: the noise scales should be inversely correlated to class frequency.
- We validate the effectiveness of our CFRG noise schedule on tasks including image generation, image classification, and text-to-image generation with imbalanced datasets, i.e., CIFAR-100-LT and ImageNet-LT.

## 2 Related Work

Score-based Generative Models. Inspired by non-equilibrium statistical physics, Nonequilibrium Thermodynamics (NET) [sohl2015deep] was the first to deploy a prescribed diffusion process with a Markov chain to gradually transform data into random noise, then reverse the process by training an inverse diffusion model. The Noise Conditional Score Network (NCSN) [song2019generative] proposed learning the data distribution by modeling the gradient of the log probability density function, known as the score function. Utilizing multi-scale Gaussian noise, the score function is learned through a score-matching objective, allowing new samples to be generated using annealed Langevin dynamics [parisi1981correlation] during inference. DDPM [ho2020denoising] demonstrated for the first time that diffusion models are capable of generating high-quality samples. It also showed the equivalence between diffusion models and denoising score matching across multiple noise levels during training, with annealed Langevin dynamics during sampling. Later, stochastic differential equations (SDEs) [song2020score] were introduced for score-based models, unifying previous approaches in score-based generative modeling and DDPM.

Learning on Imbalanced Data. In real-world scenarios, data often follows a long-tailed distribution, i.e., a few classes have abundant data while many classes possess only a few samples. Models trained on imbalanced data exhibit extremely poor accuracy on low-frequency classes. Re-sampling [byrd2019effect,buda2018systematic] and re-weighting [cui2019class] are two classical families of methods for tackling this problem, although they hurt representation learning. Subsequently, the classifier and representation learning were decoupled to keep generalizable representations [kang2019decoupling]. Methods such as [kang2019decoupling,wang2020long,cui2022reslt] achieve strong trade-offs between high- and low-frequency class performance. Recently, representation learning techniques [cui2021parametric,cui2023generalized,Cui_2024_CVPR,cui2024decoupled,11563882,cui2025generative,zhu2022balanced,du2024probabilistic] have also been developed for long-tailed recognition, setting new state-of-the-art results. In addition to long-tailed recognition, region rebalance [cui2022region] and the center collapse regularizer [zhong2023understanding] explore imbalanced learning in semantic segmentation. Label distribution smoothing (LDS) and feature distribution smoothing (FDS) [yang2021delving] investigate imbalanced regression. Class-balancing diffusion models (CBDM) [qin2023class] extend logit adjustment [menon2020long] to diffusion models for balanced generation. PoGDiff [wang2026pogdiff] focuses on text-to-image generation, rebalancing learning with respect to the density of the conditional textual feature space by borrowing statistical strength from neighboring conditions. Unlike existing works, we propose the class-frequency guided (CFRG) noise schedule and rebalance image generation learning with respect to the density of the noisy image space by reducing low sample density regions for less frequent classes.

## 3 Method

### 3.1 Influences of Class Frequency on Estimated Score $\nabla_x \log p(x)$

Sampling with Score Function. Score-based models [ho2020denoising,song2019generative,song2020score] learn the data distribution's probability density via a score function, i.e., the gradient of the log probability density function $\nabla_x \log p(x)$. With Langevin dynamics [parisi1981correlation], new samples can be generated iteratively from the learned score function as follows:

$$x_{i+1} = x_i + \eta \nabla_x \log p(x) + \sqrt{2\eta}\,\epsilon, \quad i = 0, 1, \ldots, K, \tag{3.1}$$

where $x_0 \sim \pi(x)$ is sampled from a prior distribution, $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, $\eta \to 0$ is the step size, and $K \to \infty$ is the number of steps for new sample generation. NCSN [song2019generative] extends the sampling process to annealed Langevin dynamics with a multi-scale noise schedule.
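As a minimal sketch of the iteration in Eq. (3.1), the snippet below runs Langevin dynamics on a toy one-dimensional Gaussian whose score is known in closed form, evaluating the score at the current iterate. The target parameters `MU` and `SIGMA`, the uniform prior, the step size, and the number of steps are all illustrative assumptions, and the analytic `gaussian_score` merely stands in for a learned score network.

```python
import math
import random


def langevin_sample(score_fn, x0, step_size, num_steps, rng):
    """Run x_{i+1} = x_i + eta * score(x_i) + sqrt(2 * eta) * eps for num_steps."""
    x = x0
    for _ in range(num_steps):
        eps = rng.gauss(0.0, 1.0)  # eps ~ N(0, 1)
        x = x + step_size * score_fn(x) + math.sqrt(2.0 * step_size) * eps
    return x


# Assumption: a 1-D Gaussian target N(MU, SIGMA^2) whose exact score,
# -(x - MU) / SIGMA^2, stands in for a learned score network.
MU, SIGMA = 2.0, 0.5


def gaussian_score(x):
    return -(x - MU) / SIGMA ** 2


rng = random.Random(0)
# Assumption: the prior pi(x) is uniform on [-5, 5].
samples = [
    langevin_sample(gaussian_score, rng.uniform(-5.0, 5.0), 2e-3, 1000, rng)
    for _ in range(500)
]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print(f"sample mean = {mean:.3f}, sample std = {std:.3f}")
```

With a small step size and enough steps, the sample statistics should approach those of the target, which is why an inaccurate score in any region the chain visits translates directly into degraded samples.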

DDPM [ho2020denoising] instead adopts a Markov chain for sampling with the score function at inference:

$$x_{t-1} = \frac{1}{\sqrt{1-\sigma_t}}\left(x_t + \sigma_t \nabla_{x_t} \log p(x_t)\right) + \beta_t \epsilon, \tag{3.2}$$

where $0 < \sigma_1 < \sigma_2 < \ldots < \sigma_T < 1$ is the multi-scale noise schedule in the diffusion process, the maximum time-step is $T = 1000$, $x_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is sampled from the prior distribution, $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, and $\beta_t$ is a function of $\sigma_{1:t}$. The derivation of Eq. ([3.2](https://arxiv.org/html/2606.27696#S3.E2)) is shown in Appendix [A.1](https://arxiv.org/html/2606.27696#A1.SS1).

Additionally, diffusion and reverse diffusion processes can be equivalently represented with forward and reverse stochastic differential equations (SDEs) [song2020score]. Specifically, sampling with the reverse SDE at inference is defined as:

$$dx = \left[f(x,t) - g(t)^2 \nabla_x \log p_t(x)\right]dt + g(t)\,d\overline{w}, \tag{3.3}$$

where $\{x(t)\}_{t=0}^{T}$ is a diffusion process with a continuous variable $t \in [0, T]$, $f(\cdot, t): \mathbb{R}^d \to \mathbb{R}^d$ is a vector-valued function called the drift coefficient of $x(t)$, $g(t)$ is a scalar function known as the diffusion coefficient of $x(t)$, $p_t(x)$ is the probability density of $x(t)$, and $\overline{w}$ is a standard Wiener process.

Low-density Regions Lead to Inaccurate Score Estimation. The quality of samples generated with Eq. ([3.1](https://arxiv.org/html/2606.27696#S3.E1)), Eq. ([3.2](https://arxiv.org/html/2606.27696#S3.E2)), and Eq. ([3.3](https://arxiv.org/html/2606.27696#S3.E3)) relies heavily on the learned score function $\nabla_x \log p(x)$. However, the estimated scores often prove inaccurate, particularly in the initial sampling stages. This phenomenon arises from low sample density regions of $p(x)$, which lead to insufficient training under the following objective:

$$\mathbb{E}_{p(x)} \left\| \nabla_x \log p(x) - s_\theta(x) \right\|_2^2, \tag{3.4}$$

where $s_\theta$ is a neural network for score estimation. Because the real data score is unavailable, the objective is usually implemented with score-matching techniques. The loss function in DDPM [ho2020denoising] is also equivalent to Eq. ([3.4](https://arxiv.org/html/2606.27696#S3.E4)), as shown in Appendix [A.2](https://arxiv.org/html/2606.27696#A1.SS2).

To reduce the low-density regions, multiple scales of noise perturbations are adopted in recent diffusion models [song2019generative,ho2020denoising,song2020score]. With an increasing noise schedule $0 < \sigma_1 < \sigma_2 < \ldots < \sigma_T < 1$, a clean image $x_0$ is diffused into $x_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ when $T \to \infty$, i.e., $x_t = \sqrt{1-\sigma_t}\,x_{t-1} + \sigma_t \epsilon$, where $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. Then the score function is trained on noisy samples at each scale $\sigma_t$:

$$\mathbb{E}_{p_{\sigma_t}(x_t)} \left\| \nabla_{x_t} \log p_{\sigma_t}(x_t) - s_\theta(x_t, t) \right\|_2^2. \tag{3.5}$$

Figures [1.1](https://arxiv.org/html/2606.27696#S1.F1) and [1.2](https://arxiv.org/html/2606.27696#S1.F2) show a toy example of a mixture of two Gaussians. As shown in Figures [1.1(a)](https://arxiv.org/html/2606.27696#S1.F1.sf1) and [1.1(b)](https://arxiv.org/html/2606.27696#S1.F1.sf2), the low-density regions are reduced after adding Gaussian noise to clean examples. Correspondingly, the estimated scores become much more accurate, as illustrated by Figures [1.2(a)](https://arxiv.org/html/2606.27696#S1.F2.sf1), [1.2(b)](https://arxiv.org/html/2606.27696#S1.F2.sf2), and [1.2(c)](https://arxiv.org/html/2606.27696#S1.F2.sf3).

The Effects of Class Frequency on Score-based Generative Models. The multi-scale noise schedule significantly contributes to accurate score estimation, which is essential for generating high-quality samples. In this paper, we investigate the influence of class frequency on score-based generative models and establish that class frequency is also a crucial factor in multi-scale noise schedule design. Our analysis reveals two primary observations:

- Equipped with the original multi-scale noise schedule (all classes are treated equally), low-frequency classes still encounter problems of large low-density regions. Consequently, their estimated scores $\nabla_x \log p(x)$ tend to be less accurate than those of high-frequency classes.
- When training on imbalanced data, the estimated score $\nabla_x \log p(x)$ space is dominated by high-frequency classes, impairing the generation quality of low-frequency classes.

In the toy example depicted in Figures [1.1](https://arxiv.org/html/2606.27696#S1.F1) and [1.2](https://arxiv.org/html/2606.27696#S1.F2), benefiting from abundant samples, the high-frequency class enjoys a larger intra-class variance, whereas the low-frequency class displays a smaller variance (see Figure [1.1(a)](https://arxiv.org/html/2606.27696#S1.F1.sf1)). With the Gaussian noise (see Figure [1.1(b)](https://arxiv.org/html/2606.27696#S1.F1.sf2)), low-density regions where $p(x) \approx 0$ are significantly reduced for both high- and low-frequency classes. However, high-density regions for the low-frequency class remain notably smaller than those for the high-frequency class, suggesting potentially greater inaccuracies of estimated scores for the former. Additionally, as shown in Figures [1.2(b)](https://arxiv.org/html/2606.27696#S1.F2.sf2) and [1.2(c)](https://arxiv.org/html/2606.27696#S1.F2.sf3), the estimated scores for the majority of data points converge towards generating samples from the high-frequency class. This dominance within the estimated score space poses a significant challenge for generating high-quality samples from the low-frequency class.

A Case Study. With the above analysis, we conclude that class frequency is a crucial factor in the noise schedule design for score function learning. Further, we confirm that the noise scale should be inversely correlated to class frequency with a case study of DDPM [ho2020denoising], which uses a fixed linear noise schedule,

$$\sigma_t = (\sigma_T - \sigma_1)\frac{t-1}{T-1} + \sigma_1. \tag{3.6}$$

With the default hyper-parameter $\sigma_1 = 10^{-4}$, we examine the effects of various values of $\sigma_T$ on the quality of generated samples across classes with different class frequencies. Experiments are conducted on the CIFAR-100-LT and ImageNet-LT datasets. The imbalance factors ($\frac{N_{max}}{N_{min}}$, where $N$ is the number of samples in a class) are 100 and 256, respectively. Classes are grouped into "Many", "Medium" and "Few" according to class frequencies. We report the per-group and overall FID for evaluation. Specifically, the per-group FID is calculated by averaging the FID of all classes in the group. The experimental results are summarized in Table [3.1](https://arxiv.org/html/2606.27696#S3.T1).

Results in Table [3.1](https://arxiv.org/html/2606.27696#S3.T1) reveal two interesting phenomena:

- A proper $\sigma_T$ reduces low-density regions, contributing to accurate score estimation. On the other hand, an overly large $\sigma_T$ can over-corrupt the data, making the reverse diffusion process hard to optimize. Thus, $\sigma_T$ should be carefully chosen to achieve a good overall FID score.
- The noise scale $\sigma_T$ should be inversely correlated to class frequency. For low-frequency classes, a relatively higher $\sigma_T$ can enlarge their high-density regions and thus potentially raise the accuracy of estimated scores.
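To give a rough feel for how strongly $\sigma_T$ controls corruption under the linear schedule of Eq. (3.6), the sketch below computes the terminal signal coefficient $\bar{\alpha}_T = \prod_{t=1}^{T}(1-\sigma_t)$ for a few values of $\sigma_T$. The candidate values of $\sigma_T$ are illustrative assumptions rather than the settings explored in the paper; $\sigma_1 = 10^{-4}$ and $T = 1000$ follow the text.

```python
import math


def terminal_alpha_bar(sigma_T, T=1000, sigma_1=1e-4):
    """Product of (1 - sigma_t) over t = 1..T for the linear schedule of Eq. (3.6)."""
    return math.prod(
        1.0 - ((sigma_T - sigma_1) * (t - 1) / (T - 1) + sigma_1)
        for t in range(1, T + 1)
    )


# Assumption: illustrative sigma_T values, not the ones used in the paper.
candidates = (0.01, 0.02, 0.04)
alpha_bars = [terminal_alpha_bar(s) for s in candidates]
for sigma_T, a in zip(candidates, alpha_bars):
    # sqrt(alpha_bar_T) is the factor by which the clean signal x_0 is scaled at t = T.
    print(f"sigma_T={sigma_T}: alpha_bar_T={a:.3e}, signal scale={math.sqrt(a):.3e}")
```

A larger $\sigma_T$ drives $\bar{\alpha}_T$ towards zero faster, so less of the clean signal survives at late time-steps, which is one way to read the trade-off described in the first observation.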

Table 3.1: Larger noise scale benefits low-frequency classes. Exploration of noise schedule effects regarding the class frequency on CIFAR-100 with an imbalance factor of 100, and ImageNet-LT with an imbalance factor of 256.

### 3.2 Class-frequency Guided Noise Schedule

The analysis in Sec. [3.1](https://arxiv.org/html/2606.27696#S3.SS1) highlights the importance of class frequency in multi-scale noise schedule design for score function learning and reveals that the noise scale should be inversely correlated to class frequency. Based on this insight, we are the first to consider class frequency in multi-scale noise schedule design and propose a Class-frequency Guided (CFRG) noise schedule for score-based generative models.

Compared to high-frequency classes, low-frequency classes are adversely affected by large low-density regions, leading to inaccurate score estimation. In the toy example of Figure [1.1](https://arxiv.org/html/2606.27696#S1.F1) and Figure [1.2](https://arxiv.org/html/2606.27696#S1.F2), we consider the high-frequency class $A \sim \mathcal{N}(\mathbf{\mu_A}, \mathbf{\sigma_A})$, the low-frequency class $B \sim \mathcal{N}(\mathbf{\mu_B}, \mathbf{\sigma_B})$, and $\mathbf{\sigma_A} \gg \mathbf{\sigma_B}$. With the fixed linear noise schedule in Eq. ([3.6](https://arxiv.org/html/2606.27696#S3.E6)), $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, where $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$, $\alpha_t = 1 - \sigma_t$, and $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. Then the following hold for classes $A$ and $B$, respectively:

$$x_t \sim \mathcal{N}\left(\sqrt{\bar{\alpha}_t}\,\mathbf{\mu_A},\ \sqrt{-\bar{\alpha}_t(1-\mathbf{\sigma_A}^2)+1}\right), \tag{3.7}$$

$$x_t \sim \mathcal{N}\left(\sqrt{\bar{\alpha}_t}\,\mathbf{\mu_B},\ \sqrt{-\bar{\alpha}_t(1-\mathbf{\sigma_B}^2)+1}\right), \tag{3.8}$$

where $\mathbf{\sigma_A} < \mathbf{1}$ and $\mathbf{\sigma_B} < \mathbf{1}$; pixel inputs are normalized to satisfy this constraint.

Since $\mathbf{\sigma_A} \gg \mathbf{\sigma_B}$, we know that $\sqrt{-\bar{\alpha}_t(1-\mathbf{\sigma_A}^2)+1} > \sqrt{-\bar{\alpha}_t(1-\mathbf{\sigma_B}^2)+1}$ for every time-step $t$, which means noisy samples of class $B$ always have smaller high-density regions than those of class $A$, and thus there are larger low-density regions for class $B$.
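The inequality above can be checked numerically with a small sketch. The class standard deviations `SIGMA_A` and `SIGMA_B` and the sampled values of $\bar{\alpha}_t$ are hypothetical; the function simply evaluates the standard deviation term that appears in Eqs. (3.7) and (3.8).

```python
import math


def noisy_std(alpha_bar_t, class_std):
    """Std of x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps when x_0 has std class_std."""
    return math.sqrt(1.0 - alpha_bar_t * (1.0 - class_std ** 2))


# Assumption: hypothetical class stds with SIGMA_A > SIGMA_B and both below 1.
SIGMA_A, SIGMA_B = 0.8, 0.2

for alpha_bar_t in (0.9, 0.5, 0.1):
    std_a = noisy_std(alpha_bar_t, SIGMA_A)
    std_b = noisy_std(alpha_bar_t, SIGMA_B)
    print(f"alpha_bar_t={alpha_bar_t}: std_A={std_a:.3f}, std_B={std_b:.3f}")
```

For a shared $\bar{\alpha}_t$ the gap only closes as $\bar{\alpha}_t \to 0$, where both standard deviations tend to 1, which motivates giving the low-frequency class its own, smaller $\bar{\alpha}_t$.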

$\mathbf{\sigma_A}$ and $\mathbf{\sigma_B}$ are constants determined by the training data. We thus schedule a class-wise $\bar{\alpha}_t$ and propose the Class-frequency Guided (CFRG) noise schedule to enlarge the high-density regions of low-frequency classes. In detail, for a specific class $i$,

$$\sigma_T^i = \frac{\sigma_T^{max} - \sigma_T^{min}}{F_{min} - F_{max}}(F_i - F_{max}) + \sigma_T^{min}, \tag{3.9}$$

$$\sigma_t^i = (\sigma_T^i - \sigma_1)\frac{t-1}{T-1} + \sigma_1, \tag{3.10}$$

where $F_{min}$ and $F_{max}$ are the lowest and highest class frequencies, respectively, $F_i$ is the frequency of class $i$, $\sigma_T^{min}$ and $\sigma_T^{max}$ are tunable hyper-parameters, and $\sigma_1$ is set to $10^{-4}$ following DDPM [ho2020denoising]. We also explore the effective number [cui2019class] for the calculation of $F$ values.
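A minimal sketch of Eqs. (3.9) and (3.10) follows. The class counts and the values chosen for $\sigma_T^{min}$ and $\sigma_T^{max}$ are hypothetical placeholders rather than the tuned settings, and the function names are introduced here only for illustration.

```python
def cfrg_sigma_T(freq, f_min, f_max, sigma_T_min, sigma_T_max):
    """Eq. (3.9): linear map from class frequency to the terminal noise scale."""
    # Note: requires f_min != f_max, i.e. a genuinely imbalanced dataset.
    slope = (sigma_T_max - sigma_T_min) / (f_min - f_max)
    return slope * (freq - f_max) + sigma_T_min


def cfrg_schedule(sigma_T_i, T=1000, sigma_1=1e-4):
    """Eq. (3.10): per-class linear schedule from sigma_1 up to sigma_T^i."""
    return [(sigma_T_i - sigma_1) * (t - 1) / (T - 1) + sigma_1 for t in range(1, T + 1)]


# Assumption: hypothetical class counts and hyper-parameters.
class_counts = {"many": 500, "medium": 50, "few": 5}
SIGMA_T_MIN, SIGMA_T_MAX = 0.02, 0.03

f_min, f_max = min(class_counts.values()), max(class_counts.values())
terminal = {
    name: cfrg_sigma_T(n, f_min, f_max, SIGMA_T_MIN, SIGMA_T_MAX)
    for name, n in class_counts.items()
}
schedules = {name: cfrg_schedule(s) for name, s in terminal.items()}

for name in class_counts:
    print(f"{name}: sigma_T = {terminal[name]:.5f}")
```

The most frequent class receives $\sigma_T^{min}$, the least frequent class receives $\sigma_T^{max}$, and every class shares the same starting value $\sigma_1$.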

With the constraint $F_A > F_B > F_C$ for any three classes $A \sim \mathcal{N}(\mathbf{\mu_A}, \mathbf{\sigma_A})$, $B \sim \mathcal{N}(\mathbf{\mu_B}, \mathbf{\sigma_B})$, and $C \sim \mathcal{N}(\mathbf{\mu_C}, \mathbf{\sigma_C})$ in the dataset, Eq. ([3.9](https://arxiv.org/html/2606.27696#S3.E9)) and Eq. ([3.10](https://arxiv.org/html/2606.27696#S3.E10)) guarantee that:

$$-\bar{\alpha}_t^A = -\prod_{i=1}^{t}(1-\sigma_i^A) < -\bar{\alpha}_t^B, \tag{3.11}$$

$$-\bar{\alpha}_t^B = -\prod_{i=1}^{t}(1-\sigma_i^B) < -\bar{\alpha}_t^C, \tag{3.12}$$

$$-\bar{\alpha}_t^C = -\prod_{i=1}^{t}(1-\sigma_i^C) < 0, \tag{3.13}$$

which implies that the lower the class frequency, the more compensation in terms of high-density regions is provided, thus benefiting the generation quality of low-frequency classes.
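The ordering in Eqs. (3.11) to (3.13) can be verified numerically with a short sketch. The three terminal noise scales below are hypothetical values consistent with $F_A > F_B > F_C$ (so that $\sigma_T^A < \sigma_T^B < \sigma_T^C$); the check starts at $t = 2$ because all classes share $\sigma_1$ and therefore coincide at $t = 1$.

```python
import math


def alpha_bar(sigma_T, t, T=1000, sigma_1=1e-4):
    """Cumulative product of (1 - sigma_s) for s = 1..t under Eq. (3.10)."""
    return math.prod(
        1.0 - ((sigma_T - sigma_1) * (s - 1) / (T - 1) + sigma_1)
        for s in range(1, t + 1)
    )


# Assumption: hypothetical per-class terminal scales with F_A > F_B > F_C.
SIGMA_T = {"A": 0.020, "B": 0.025, "C": 0.030}

for t in (2, 500, 1000):
    a, b, c = (alpha_bar(SIGMA_T[k], t) for k in "ABC")
    print(f"t={t}: alpha_bar_A={a:.6g}, alpha_bar_B={b:.6g}, alpha_bar_C={c:.6g}")
```

Because the noisy standard deviation in Eqs. (3.7) and (3.8) grows as $\bar{\alpha}_t$ shrinks, the rarer class ends up with the smaller $\bar{\alpha}_t$ and hence the larger spread at every such time-step.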

### 3.3 Analysis

Density Analysis. Eqs. ([3.1](https://arxiv.org/html/2606.27696#S3.E1)), ([3.2](https://arxiv.org/html/2606.27696#S3.E2)) and ([3.3](https://arxiv.org/html/2606.27696#S3.E3)) show that the quality of generation heavily depends on the accuracy of the learned score function. Furthermore, Eq. ([3.4](https://arxiv.org/html/2606.27696#S3.E4)) suggests that low sample density, particularly in low-frequency classes, can lead to undertrained score estimates, ultimately degrading generation quality.

In Table [3.2](https://arxiv.org/html/2606.27696#S3.T2), we quantify the low-density regions ($\leq \delta$) of imbalanced data under different noise schedules, highlighting that the CFRG noise schedule can improve model generation quality by reducing low-density regions. Please refer to Algorithm [1](https://arxiv.org/html/2606.27696#alg1) in the Appendix for more details.

Table 3.2: Density analysis. Quantity of low-density regions for the DDPM and CFRG noise schedules. Please refer to Algorithm [1](https://arxiv.org/html/2606.27696#alg1) in the supplementary file for more details.

Knowledge Transfer from High-frequency Classes to Low-frequency Ones. Training with a single low-frequency class suffers from extremely limited data, while training on the entire dataset enables knowledge transfer from high-frequency classes to low-frequency classes for common patterns in images.

Table 3.3: Knowledge transfer. Training with high- and low-frequency classes together can benefit the generation quality of low-frequency classes.

To illustrate the knowledge transfer between low-frequency and high-frequency classes, we train a baseline model on only the 50 least-frequent classes of CIFAR-100-LT. We report FID-All (FID on all classes) and FID-50 (FID on the 50 least-frequent classes), respectively. Table [3.3](https://arxiv.org/html/2606.27696#S3.T3) shows the empirical results. We observe that the DDPM model trained on the whole dataset achieves a much lower FID on the 50 least-frequent classes than the model trained only on the data of those 50 classes, indicating knowledge transfer from high-frequency classes to low-frequency classes.

## 4Experiments

In Section[4\.1](https://arxiv.org/html/2606.27696#S4.SS1), we conduct ablations on the effects of hyper\-parameters and designs in our CFRG noise schedule\. Comparisons in image generation are presented in Section[4\.2](https://arxiv.org/html/2606.27696#S4.SS2)\. We also discuss how generated samples benefit image classification tasks in Section[4\.3](https://arxiv.org/html/2606.27696#S4.SS3)\. The potential of our CFRG method on vision\-language diffusion models is confirmed in Appendix[4\.4](https://arxiv.org/html/2606.27696#S4.SS4)\. For more details on experimental settings, please refer to Appendix[B](https://arxiv.org/html/2606.27696#A2)\.

### 4\.1Ablation Experiments

Comparison with Re\-sampling and Re\-weighting\. Long\-tailed learning has been widely researched in classification and regression\. However, these works are hard to or even can’t transfer to image generation tasks\. We have also included resampling\-based methods for comparisons in Table[4\.5](https://arxiv.org/html/2606.27696#S4.T5)\. With class\-balanced resampling \(RS\) and SQRT resampling, the generation model even achieves worse performance than DDPM baseline\. This observation is consistent with findings in previous work\[qin2023class\]\. Here, we include empirical results of the re\-weighting method in Table[4\.1](https://arxiv.org/html/2606.27696#S4.T1)\.

Table 4.1: Comparisons with re-sample and re-weight methods on CIFAR-100-LT.

Ablation on Form of CFRG Noise Schedule. The class-frequency guided (CFRG) noise schedule in Eqs. ([3.9](https://arxiv.org/html/2606.27696#S3.E9)) and ([3.10](https://arxiv.org/html/2606.27696#S3.E10)) assigns $\sigma_T^i$ for class $i$ with a linear function of its class frequency $F_i$. Here we conduct ablations on another possible form:

$$\sigma_T^i = \frac{\sigma_T^{\max} - \sigma_T^{\min}}{C-1}(i-1) + \sigma_T^{\min}, \tag{4.1}$$

where classes are sorted by frequency so that class $i$ has the $i$-th largest frequency, and $C$ is the number of classes. Eq. ([4.1](https://arxiv.org/html/2606.27696#S4.E1)) considers only the rank of a class's frequency rather than its actual value.
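As an illustration of Eq. (4.1), the rank-based assignment can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `rank_based_sigma_T`, the tie-breaking rule, and the toy class counts are assumptions made for the example, while the values 0.02 and 0.04 are the CIFAR-100-LT settings for $\sigma_T^{\min}$ and $\sigma_T^{\max}$ reported in this section.

```python
def rank_based_sigma_T(class_counts, sigma_T_min, sigma_T_max):
    """Eq. (4.1): sigma_T for each class, linear in its frequency rank.

    Requires at least two classes. Hypothetical helper for illustration.
    """
    C = len(class_counts)
    # Rank 1 is the most frequent class; ties keep their original order (assumption).
    order = sorted(range(C), key=lambda c: -class_counts[c])
    step = (sigma_T_max - sigma_T_min) / (C - 1)
    sigma = [0.0] * C
    for rank, c in enumerate(order, start=1):
        sigma[c] = step * (rank - 1) + sigma_T_min
    return sigma


counts = [500, 5, 120, 40]  # toy class sizes, not taken from the paper
sigmas = rank_based_sigma_T(counts, sigma_T_min=0.02, sigma_T_max=0.04)
print(sigmas)
```

The most frequent class receives $\sigma_T^{\min}$ and the least frequent class receives $\sigma_T^{\max}$. Only the ordering of `counts` matters here, which is exactly the limitation of the rank-based form.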

Additionally, considering the similarities among samples, the effective frequency of class $i$ can be calculated with the following equation [cui2019class]:

$$F_i = \frac{1-\gamma^{N^i}}{1-\gamma}, \quad \gamma \in (0,1), \tag{4.2}$$

where $N^i$ is the number of samples in class $i$.

The function curves for $\sigma_T^i$ from Eqs. ([4.1](https://arxiv.org/html/2606.27696#S4.E1)) and ([3.9](https://arxiv.org/html/2606.27696#S3.E9)) with various values of $\gamma$ are shown in Figure [1(a)](https://arxiv.org/html/2606.27696#S4.F1.sf1). As $\gamma$ approaches 1.0, the effective frequencies approach the original class frequencies. On CIFAR-100-LT, with a proper $\gamma = 0.999$, the effective imbalance factor is reduced from 100 to 78.88 while still representing the intra-class variance well, thus achieving good performance. We summarize experimental results on CIFAR-100-LT in Table [4.3](https://arxiv.org/html/2606.27696#S4.T3). Compared to Eq. ([3.9](https://arxiv.org/html/2606.27696#S3.E9)), Eq. ([4.1](https://arxiv.org/html/2606.27696#S4.E1)) assigns $\sigma_T^i$ for class $i$ using only the rank of its class frequency. However, the rank cannot accurately reflect the intra-class variance. As shown in Table [4.3](https://arxiv.org/html/2606.27696#S4.T3), with Eq. ([3.9](https://arxiv.org/html/2606.27696#S3.E9)) and $\gamma = 0.999$, our model achieves 6.62 FID, outperforming the variant with Eq. ([4.1](https://arxiv.org/html/2606.27696#S4.E1)) by 0.68 and thus demonstrating the importance of the frequency statistics. Meanwhile, our model surpasses the DDPM baseline by 0.76 FID, showing the effectiveness of our CFRG noise schedule.
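The effect of $\gamma$ on the effective imbalance factor can be checked with a short calculation based on Eq. (4.2). The sketch below assumes that the most and least frequent CIFAR-100-LT classes contain 500 and 5 training images at imbalance factor 100; the function name `effective_frequency` is chosen for this example.

```python
def effective_frequency(n_samples, gamma):
    """Eq. (4.2): F_i = (1 - gamma**N_i) / (1 - gamma), with 0 < gamma < 1."""
    return (1.0 - gamma ** n_samples) / (1.0 - gamma)


# Assumed class-size extremes for CIFAR-100-LT at imbalance factor 100.
n_max, n_min = 500, 5
for gamma in (0.99, 0.999, 0.9999):
    ratio = effective_frequency(n_max, gamma) / effective_frequency(n_min, gamma)
    print(f"gamma={gamma}: effective imbalance factor = {ratio:.2f}")
```

Under these assumptions, $\gamma = 0.999$ gives a ratio of about 78.88, in line with the value quoted above, and the ratio moves toward 100 as $\gamma$ approaches 1.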

![Refer to caption](https://arxiv.org/html/2606.27696v1/x9.png)\(a\)
![Refer to caption](https://arxiv.org/html/2606.27696v1/x10.png)\(b\)
![Refer to caption](https://arxiv.org/html/2606.27696v1/x11.png)\(c\)

Figure 4.1: Ablation studies. (a) The curves of $\sigma_T$ with different forms of the CFRG noise schedule. (b) The effects of guidance scale $\omega$ with classifier-free guidance on ImageNet-LT. The image size $32\times 32$ is used. (c) The CFRG noise schedule significantly benefits low-frequency class performance.

Table 4.2: CFRG noise schedule design.

Table 4.3: Ablation study of $\sigma_T^{\max}$.

Table 4.4: Improvement analysis on ImageNet-LT. We report FIDs on “Many”, “Medium”, “Few”, and Overall classes. For the “Many”, “Medium” and “Few” groups, we calculate class-wise FIDs and average the FIDs of classes in the group.

Ablation on $\sigma_T^{\min}$ and $\sigma_T^{\max}$ in CFRG Noise Schedule. Following DDPM [ho2020denoising], we apply $\sigma_1 = 10^{-4}$ in all our experiments. As shown in Table [3.1](https://arxiv.org/html/2606.27696#S3.T1), $\sigma_T = 0.02$ and $\sigma_T = 0.01$ are the best choices for DDPM on CIFAR-100-LT and ImageNet-LT, respectively. In the CFRG noise schedule, we likewise use $\sigma_T^{\min} = 0.02$ and $\sigma_T^{\min} = 0.01$ for CIFAR-100-LT and ImageNet-LT, respectively. The ablation for $\sigma_T^{\max}$ is presented in Table [4.3](https://arxiv.org/html/2606.27696#S4.T3). A relatively larger $\sigma_t^i$ can reduce the low-density regions, as analyzed in Section [3.1](https://arxiv.org/html/2606.27696#S3.SS1), leading to more accurately estimated scores. However, too large a $\sigma_t^i$ can over-corrupt the data and move it far from the original distribution, making the reverse diffusion process hard to optimize. Thus, a reasonable $\sigma_T^{\max}$ should be adopted. Table [4.3](https://arxiv.org/html/2606.27696#S4.T3) illustrates that $\sigma_T^{\max} = 0.04$ and $\sigma_T^{\max} = 0.02$ are the best choices on CIFAR-100-LT and ImageNet-LT, respectively.

Improvements on Low-frequency Classes. Our findings show that the noise scale should be inversely related to class frequency in the multi-scale noise schedule of diffusion models. Empirically, models trained with our class-frequency guided (CFRG) noise schedule achieve significant overall improvements in both sample quality and diversity. Here, we illustrate that the CFRG noise schedule benefits the low-frequency classes. With experiments on CIFAR-100-LT, we measure the difference in class-wise FIDs between the DDPM model and our CFRG model. As shown in Figure [1(c)](https://arxiv.org/html/2606.27696#S4.F1.sf3), our model achieves much lower FIDs on low-frequency classes, indicating higher generation quality from the proposed CFRG noise schedule. We also analyze the improvements on the ImageNet-LT dataset, with results summarized in Table [4.4](https://arxiv.org/html/2606.27696#S4.T4). After applying our CFRG noise schedule, the “Medium” and “Few” classes achieve much better generation performance than with the DDPM baseline, which again implies that the proposed CFRG noise schedule improves generation quality for low-frequency classes and confirms our findings.

Ablation on $\omega$ of Classifier-Free Guidance. Classifier-free guidance [ho2022classifier] is an important technique for trading off sample quality against diversity. We observe that other training techniques or experimental settings can influence the choice of the guidance scale $\omega$. Thus, for fair comparisons, the best guidance scale $\omega$ is selected individually for each of the compared models. We ablate the effects of $\omega$ in the following settings on ImageNet-LT with an image size of $32\times 32$:

- •Comparison between the DDPM models and models trained with our CFRG noise schedule\.
- •Comparison between CBDM models and models trained with our CFRG noise schedule\.

Our experimental results are plotted in Figure [1(b)](https://arxiv.org/html/2606.27696#S4.F1.sf2). An interesting phenomenon is that CBDM models require a larger guidance scale $\omega$. The CBDM method uniformly samples random labels for the current batch. The ground-truth labels and the sampled labels are then both used in training through a weighted sum. As a result, the alignment between image features and class embeddings can be harmed, and a larger $\omega$ can compensate for the loss in generation quality. Moreover, we also conclude that a larger image size demands a larger guidance scale $\omega$: with a $64\times 64$ image size, the best $\omega$ for our models and DDPM models is around 0.8, versus around 0.3 for an image size of $32\times 32$. Nevertheless, with the proper $\omega$ values for all of the models, we achieve much better results with the proposed CFRG noise schedule.

Table 4.5: Evaluation of image generation task on CIFAR-100-LT and ImageNet-LT.

Table 4.6: Our CFRG models benefit image classification on imbalanced data, i.e., CIFAR-100-LT.

| Gen. Models | IF=100: FID (↓) | IF=100: Cls. Accuracy (↑) | IF=200: FID (↓) | IF=200: Cls. Accuracy (↑) |
| --- | --- | --- | --- | --- |
| - (Baseline) | - | 41.32 | - | 37.09 |
| DDPM | 7.38 | 46.58 | 8.25 | 42.79 |
| CFRG (Ours) | 6.62 | 48.28 | 7.46 | 44.87 |
| +CBDM & ADA | 5.14 | 50.54 (+9.22) | 5.89 | 46.26 (+9.17) |

### 4\.2Image Generation

To validate the effectiveness of the proposed class-frequency guided (CFRG) noise schedule on the image generation task, we build several baselines based on DDPM [ho2020denoising], including re-sampling [mahajan2018exploring], SQRT-resampling [mahajan2018exploring], augmentation-based methods [karras2020training,zhao2020differentiable], and the class-balanced diffusion model [qin2023class]. We apply the CFRG noise schedule to DDPM and conduct experiments on popular benchmarks that suffer from data imbalance, i.e., CIFAR-100-LT and ImageNet-LT.

Image Generation Comparisons on CIFAR-100-LT. The experimental results on CIFAR-100-LT are listed in Table [4.5](https://arxiv.org/html/2606.27696#S4.T5). We observe that the most classical method for dealing with data imbalance, i.e., data resampling, is not effective for image generation with diffusion models. Class-balanced resampling and SQRT resampling achieve 10.50 FID and 9.72 FID respectively, both inferior to the DDPM baseline (7.38 FID). The ADA method, based on data augmentation, significantly improves image generation performance, reducing FID from 7.38 to 6.16. Currently, CBDM combined with the ADA method achieves the best performance on both generation quality and diversity. To show the effectiveness of our CFRG noise schedule, we apply our method to the DDPM baseline without extra techniques, achieving 6.62 FID and outperforming the baseline by 0.76 FID. Plugging our CFRG noise schedule into the CBDM and ADA methods further boosts generation performance, surpassing the DDPM baseline by 2.24, 0.08, 0.06, and 0.05 in terms of FID, Recall, $F_s$, and $F_{1/s}$, respectively. These results demonstrate the effectiveness and flexibility of our proposed CFRG noise schedule for diffusion models.

Image Generation Comparisons on ImageNet-LT. To demonstrate the generality of our CFRG noise schedule, we evaluate our models on the more challenging ImageNet-LT dataset. We conduct experiments with both $32\times 32$ and $64\times 64$ image sizes. As shown in Table [4.5](https://arxiv.org/html/2606.27696#S4.T5), we achieve 2.62 FID and outperform the DDPM baseline by 0.47. Built on the CBDM method, our model obtains 2.33 FID, surpassing the DDPM baseline by 0.76. With an image size of $64\times 64$, the proposed CFRG noise schedule consistently achieves much better performance than the baselines.

### 4\.3Image Classification with Generated Images

To illustrate that the generative models can benefit downstream tasks, we evaluate the generated images by DDPM and our CFRG models on long\-tailed recognition with the CIFAR\-100\-LT dataset\.

Implementation Details. The experimental settings from previous work [cao2019learning,cui2022reslt] are deployed. We randomly crop a $32\times 32$ patch from the original image or its horizontal flip with 4 pixels padded on each side and normalize the pixel values into [0,1]. ResNet-32 is used as the backbone network for all experiments. The SGD optimizer with momentum 0.9 is adopted. We train all models for 200 epochs with the cross-entropy loss. The initial learning rate is set to 0.1 and the first five epochs are trained with linear warm-up. The learning rate decays by a factor of 0.1 at epochs 160 and 180. The batch size is 128 and the weight decay is 5e-4.
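The learning-rate schedule described above can be written as a small function. This is a sketch under assumptions the text does not settle: epochs are 0-indexed, the warm-up ramps linearly and reaches 0.1 at the fifth epoch, and the function name `learning_rate` is chosen for this example.

```python
def learning_rate(epoch, base_lr=0.1, warmup_epochs=5, milestones=(160, 180), decay=0.1):
    """Learning rate for a 0-indexed epoch: linear warm-up, then step decay."""
    if epoch < warmup_epochs:
        # Assumed warm-up shape: ramp linearly, reaching base_lr at the last warm-up epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    # Multiply by `decay` once for every milestone already reached.
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * decay ** passed


for epoch in (0, 4, 100, 160, 180, 199):
    print(epoch, learning_rate(epoch))
```

In a training loop, the returned value would be assigned to the optimizer's learning rate at the start of each epoch.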

Experimental Results Comparisons. The experimental results are summarized in Table [4.6](https://arxiv.org/html/2606.27696#S4.T6). We observe that data generated by diffusion models can significantly improve classification performance on imbalanced data. With cross-entropy loss, the ResNet-32 model achieves 41.32% and 37.09% top-1 accuracy on CIFAR-100-LT with imbalance factor (IF) 100 and 200, respectively. Under the same training settings, the model trained with data generated by DDPM achieves 46.58% and 42.79% top-1 accuracy, outperforming the baseline by 5.26% and 5.70%. Combined with the ADA and CBDM methods, our CFRG model generates high-quality images. With these images, the ResNet-32 model boosts the classification accuracy to 50.54% and 46.26%, surpassing the baseline by 9.22% and 9.17%, respectively.

Table 4.7: The potential of CFRG noise schedule on the Stable Diffusion model.

Table 4.8: Extension of CFRG to flow matching.

### 4\.4Text\-to\-Image Generation

The Stable Diffusion (SD) model [rombach2022high] applies the diffusion process in the latent feature space, largely reducing the computational cost. It is trained on large-scale image-text pairs, enabling text-to-image generation. Besides, the Stable Diffusion model can also secretly be an image classifier [li2023your]. As shown in previous work [Cui_2024_CVPR], its zero-shot accuracy on CIFAR-100 is extremely imbalanced, which implies that the large-scale training data may also suffer from data imbalance.

Despite the reduced computational cost with latent features, training the Stable Diffusion model on large\-scale vision\-language data still requires large amounts of GPUs\. To demonstrate the potential of our CFRG method on the Stable Diffusion model\[rombach2022high\], we conduct simulated experiments on text\-to\-image generation with CIFAR\-100\-LT\. The image encoder and text encoder are well\-aligned in CLIP models\[radford2021learning\]\. We thus use the features from the image encoder to represent their text embeddings\. Then the text embeddings are adopted as the textual conditional inputs for diffusion models\. At inference, we apply the template “A photo of \{class name\}” as text descriptions to generate 50,000 images\. The experimental results are summarized in Table[4\.7](https://arxiv.org/html/2606.27696#S4.T7)\. The FIDs for both DDPM and our CFRG models are much higher than models trained with accurate labels, which implies the importance of high\-quality data annotations\. Nevertheless, under this text\-to\-image setting, our CFRG model still achieves better generation quality, confirming its great potential for the Stable Diffusion model\.

### 4\.5Extension of CFRG to Flow Matching

The flow matching objective [lipman2022flow] still suffers from insufficient training in low-density regions ($q(x_1 \approx 0)$):

$$v_t' = x_1 - (1-\sigma_{min})\,x_0, \tag{4.3}$$

$$\mathcal{L}_{CFM} = \mathbb{E}_{t,\,q(x_1),\,p(x_0)} \left| v_t(\phi_t(x_0)) - v_t' \right|^2, \tag{4.4}$$

where $\phi_t(x_0) = (1-(1-\sigma_{min})t)\,x_0 + t\,x_1$.

To mitigate this issue, we adapt CFRG to flow matching and apply a class-wise $\sigma_{min}$: classes with higher frequency use a smaller $\sigma_{min}$. We follow the open-source code from Facebook Research to train models for 3000 epochs, replacing only the dataset with CIFAR-100-LT at an imbalance ratio of 100. All hyperparameters are left at their defaults. The experimental results are summarized in Table [4.8](https://arxiv.org/html/2606.27696#S4.T8). After applying the CFRG strategy, model performance improves, demonstrating that CFRG can also generalize to flow matching methods.
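To make the role of a class-wise $\sigma_{min}$ concrete, the sketch below evaluates the path $\phi_t(x_0)$ and the regression target $v_t'$ from Eqs. (4.3) and (4.4) for a single sample. It is an illustration only: the paper does not state how $\sigma_{min}$ is mapped from class frequency, so the two values in `sigma_min_by_class`, the class names, and the helper name `cfm_path_and_target` are hypothetical.

```python
import random


def cfm_path_and_target(x0, x1, t, sigma_min):
    """Return phi_t(x0) and the target v_t' for one (noise, data) pair.

    phi_t(x0) = (1 - (1 - sigma_min) * t) * x0 + t * x1
    v_t'      = x1 - (1 - sigma_min) * x0
    """
    scale = 1.0 - (1.0 - sigma_min) * t
    phi = [scale * a + t * b for a, b in zip(x0, x1)]
    target = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return phi, target


# Hypothetical class-wise values: the frequent class gets the smaller sigma_min.
sigma_min_by_class = {"frequent": 1e-4, "rare": 1e-2}

random.seed(0)
x1 = [0.3, -0.7, 1.2]                      # a toy data sample
x0 = [random.gauss(0.0, 1.0) for _ in x1]  # a noise sample
for name, s in sigma_min_by_class.items():
    phi, target = cfm_path_and_target(x0, x1, t=0.5, sigma_min=s)
    print(name, phi, target)
```

At $t = 1$ the path equals $\sigma_{min} x_0 + x_1$, so a larger $\sigma_{min}$ leaves more residual noise around each data point of a rare class.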

## 5Conclusion

In this paper, we investigate the relationships between class frequency and the multi\-scale noise schedule within diffusion models\. Based on the observation that the low\-frequency classes still suffer from large low\-density regions and the high\-frequency classes often dominate the estimated score space, a class\-frequency guided \(CFRG\) noise schedule is introduced, which constrains that the noise scales are inversely related to class frequency\. The effectiveness of our approach is confirmed via image generation, image classification, and text\-to\-image tasks\.

## Declarations

- •Funding\.This research was funded by the National Natural Science Foundation of China \(NSFC\)\.
- •Conflict of interest\.The authors declare that they have no conflict of interest\.
- •Data availability\.The datasets used in this work are all publicly available\.

## Appendix AProofs

### A\.1The Relationship Between Score Function and Predicted Noise in DDPM

Step\-1\. DDPM\[ho2020denoising\]models the forward diffusion process as a Markov chain:

$$q(x_t\mid x_0) = \mathcal{N}\left(x_t; \sqrt{\bar{\alpha}_t}\,x_0, (1-\bar{\alpha}_t)\mathbf{I}\right), \tag{A.1}$$

where $x_0 \sim q(x_0)$ is drawn from the data distribution, $x_t$ represents the noisy sample at time-step $t$, $\bar{\alpha}_t = \prod_{i=1}^{t}\alpha_i$, $\alpha_i = 1-\sigma_i$, and $\sigma_1, \sigma_2, \ldots, \sigma_T$ is the multi-scale noise schedule ([3.6](https://arxiv.org/html/2606.27696#S3.E6)).

Step-2. The reverse diffusion process can also be considered as a Markov chain. By Bayes' rule, the forward-process posterior is tractable when conditioned on $x_0$:

$$q(x_{t-1}\mid x_t, x_0) = \mathcal{N}\left(x_{t-1}; \tilde{\mu}_t(x_t,x_0), \tilde{\beta}_t\mathbf{I}\right), \tag{A.2}$$

$$
\begin{aligned}
\tilde{\mu}_t(x_t,x_0) &= \frac{\sqrt{\bar{\alpha}_{t-1}}\,\sigma_t}{1-\bar{\alpha}_t}\,x_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t, \\
\tilde{\beta}_t &= \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\sigma_t.
\end{aligned}
$$

With $x_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\left(x_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_t\right)$ by Eq. ([A.1](https://arxiv.org/html/2606.27696#A1.E1)),

$$q(x_{t-1}\mid x_t, x_0) = \mathcal{N}\left(x_{t-1}; \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_t\right), \tilde{\beta}_t\mathbf{I}\right), \tag{A.3}$$

where $\epsilon_t \sim \mathcal{N}(\mathbf{0},\mathbf{I})$.

Step-3. Applying Tweedie's formula to $x_t$ in Eq. ([A.1](https://arxiv.org/html/2606.27696#A1.E1)),

$$\sqrt{\bar{\alpha}_t}\,x_0 = x_t + (1-\bar{\alpha}_t)\,\nabla_{x_t}p(x_t). \tag{A.4}$$

Combining Eqs. ([A.1](https://arxiv.org/html/2606.27696#A1.E1)) and ([A.4](https://arxiv.org/html/2606.27696#A1.E4)), we derive the following equation,

$$
\begin{aligned}
\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon_t &= \sqrt{\bar{\alpha}_t}\,x_0 - (1-\bar{\alpha}_t)\,\nabla_{x_t}p(x_t) \\
\Rightarrow\quad \nabla_{x_t}p(x_t) &= \frac{-\epsilon_t}{\sqrt{1-\bar{\alpha}_t}},
\end{aligned} \tag{A.5}
$$

where $\epsilon_t \sim \mathcal{N}(\mathbf{0},\mathbf{I})$.

Combining Eq\. \([A\.5](https://arxiv.org/html/2606.27696#A1.E5)\) and Eq\. \([A\.3](https://arxiv.org/html/2606.27696#A1.E3)\), Eq\. \([3\.2](https://arxiv.org/html/2606.27696#S3.E2)\) is established for DDPM, implying the importance of accurate score estimation for sampling at inference\.
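The relation in Eq. (A.5) can be sanity-checked numerically on a scalar example. The sketch below rests on several assumptions that are not taken from the proof: $\sigma_t$ is linearly spaced between $\sigma_1 = 10^{-4}$ and $\sigma_T = 0.02$ (as in DDPM's default schedule), the data point is a single scalar, and the score is evaluated as the gradient of the log-density of the conditional Gaussian $q(x_t \mid x_0)$ from Eq. (A.1).

```python
import math
import random

T = 1000
sigma_1, sigma_T = 1e-4, 0.02  # endpoint values quoted in the paper
# Assumption: linear spacing between the endpoints, as in DDPM's default schedule.
sigmas = [sigma_1 + (sigma_T - sigma_1) * k / (T - 1) for k in range(T)]

alpha_bar = []
prod = 1.0
for s in sigmas:
    prod *= 1.0 - s         # alpha_i = 1 - sigma_i
    alpha_bar.append(prod)  # running product gives alpha_bar_t

random.seed(0)
t = 500                       # 0-based index into the schedule
x0 = 0.8                      # toy scalar data point
eps = random.gauss(0.0, 1.0)  # epsilon_t ~ N(0, 1)
a = alpha_bar[t]
x_t = math.sqrt(a) * x0 + math.sqrt(1.0 - a) * eps  # sample from Eq. (A.1)

# Gradient of log N(x_t; sqrt(a) * x0, 1 - a) with respect to x_t.
score = -(x_t - math.sqrt(a) * x0) / (1.0 - a)
# Right-hand side of Eq. (A.5).
score_from_noise = -eps / math.sqrt(1.0 - a)
print(score, score_from_noise)
```

The two printed values agree up to floating-point error, which is why a network that predicts $\epsilon_t$ can be read as a score estimator after rescaling by $-1/\sqrt{1-\bar{\alpha}_t}$.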

Table A.1: Experimental Configurations.

Table A.2: More results on image generation task with CIFAR-100-LT.

### A\.2Low\-density Regions Lead to Inaccurate Score Estimation for DDPM\.

The optimization of DDPM\[ho2020denoising\]is driven by a variational bound on the negative log\-likelihood\. Specifically, the objective is written as follows:

$$
\begin{aligned}
\mathcal{L}_T &= D_{KL}\left(q(x_T\mid x_0)\,\|\,p(x_T)\right), \\
\mathcal{L}_{t-1} &= \sum_{t>1} D_{KL}\left(q(x_{t-1}\mid x_t,x_0)\,\|\,p_\theta(x_{t-1}\mid x_t)\right), \\
\mathcal{L}_0 &= \log p_\theta(x_0\mid x_1), \\
\mathcal{L}_{all} &= \mathbb{E}_q\left[\mathcal{L}_T + \mathcal{L}_{t-1} + \mathcal{L}_0\right],
\end{aligned} \tag{A.6}
$$

where $p_\theta(x_{0:T})$ represents the reverse diffusion process and $p(x_T) \sim \mathcal{N}(\mathbf{0},\mathbf{I})$.

For $t>1$,

$$
\begin{aligned}
\mathcal{L}_{t-1} &= D_{KL}\left(q(x_{t-1}\mid x_t,x_0)\,\|\,p_\theta(x_{t-1}\mid x_t)\right) \\
&= \frac{1}{2\tilde{\beta}_t^2}\left\|\mu_{p_\theta} - \mu_q\right\|_2^2 \\
&= \frac{1}{2\tilde{\beta}_t^2}\left\|\frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t,t)\right) - \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_t\right)\right\|_2^2 \\
&= \frac{(1-\alpha_t)^2}{2\tilde{\beta}_t^2\,\alpha_t}\left\|\frac{-\epsilon_\theta(x_t,t)}{\sqrt{1-\bar{\alpha}_t}} - \nabla_{x_t}p(x_t)\right\|_2^2,
\end{aligned}
$$

where the learned score function is $s_\theta = \frac{-\epsilon_\theta(x_t,t)}{\sqrt{1-\bar{\alpha}_t}}$.

With Eq\. \([A\.2](https://arxiv.org/html/2606.27696#A1.Ex10)\) and Eq\. \([A\.6](https://arxiv.org/html/2606.27696#A1.E6)\), we conclude that the DDPM model suffers from insufficient training for low\-density regions and thus leads to more inaccurate score estimation for low\-frequency classes\.

Algorithm 1: Quantity of low-density regions.

Step 1: Assume each pixel $p \sim \mathcal{N}(\mu_c^p, \sigma_c^p)$ for each sample in class $c$.

Step 2: Generate 500 samples for each class, denoted by $\mathcal{D}_{bal}$.

Step 3: kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(CLIP($\mathcal{D}_{bal}$))

Step 4: Derive $\mathcal{D}_{cfrg}$ and $\mathcal{D}_{ddpm}$ with the CFRG and DDPM models, taking $\mathcal{D}_{bal}$ as inputs.

Step 5: log\_p\_cfrg = kde.score\_samples(CLIP($\mathcal{D}_{cfrg}$))

Step 6: log\_p\_ddpm = kde.score\_samples(CLIP($\mathcal{D}_{ddpm}$))

Step 7: low\_density\_cfrg = np.mean(log\_p\_cfrg $\leq \delta$)

Step 8: low\_density\_ddpm = np.mean(log\_p\_ddpm $\leq \delta$)
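The density-scoring and thresholding steps of Algorithm 1 can be illustrated with a dependency-free sketch. It is not the authors' code: the Gaussian kernel density estimate is written out by hand instead of calling a library, random 2-D points stand in for CLIP features of $\mathcal{D}_{bal}$ and of the model outputs, and the threshold `delta = -4.0` is a hypothetical value.

```python
import math
import random


def kde_log_density(train, queries, bandwidth=0.5):
    """Log-density of each query point under a Gaussian KDE fitted on `train`."""
    d = len(train[0])
    log_norm = -0.5 * d * math.log(2.0 * math.pi * bandwidth ** 2) - math.log(len(train))
    out = []
    for q in queries:
        exps = [-sum((a - b) ** 2 for a, b in zip(q, x)) / (2.0 * bandwidth ** 2) for x in train]
        m = max(exps)  # log-sum-exp trick for numerical stability
        out.append(log_norm + m + math.log(sum(math.exp(e - m) for e in exps)))
    return out


def low_density_fraction(log_p, delta):
    """Fraction of samples whose log-density is at most delta."""
    return sum(lp <= delta for lp in log_p) / len(log_p)


def cloud(n, spread):
    """Random 2-D points; a stand-in for extracted image features."""
    return [[random.gauss(0.0, spread), random.gauss(0.0, spread)] for _ in range(n)]


random.seed(0)
features_bal = cloud(200, 1.0)  # stand-in for CLIP features of D_bal
features_a = cloud(100, 1.0)    # stand-in for one model's outputs
features_b = cloud(100, 3.0)    # a more scattered stand-in for another model

delta = -4.0                    # hypothetical threshold
frac_a = low_density_fraction(kde_log_density(features_bal, features_a), delta)
frac_b = low_density_fraction(kde_log_density(features_bal, features_b), delta)
print(frac_a, frac_b)
```

A smaller fraction means fewer samples fall into regions that the reference density considers unlikely, which is the quantity compared between the CFRG and DDPM models.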

## Appendix BSettings

### B\.1Experimental Settings

Datasets. CIFAR-100 has 60,000 images in 100 categories: 50,000 for training and 10,000 for validation. To illustrate the importance of class frequency for noise schedule design in diffusion models, we use the long-tailed version of the CIFAR datasets with the same setting as in [cao2019learning,cui2019class]. The degree of data imbalance is measured by the imbalance factor $\frac{N_{max}}{N_{min}}$, where $N_{max}$ and $N_{min}$ are the largest and smallest class frequencies in the dataset. In addition, we also conduct experiments on a more complex data distribution, ImageNet-LT [liu2019large]. ImageNet-LT is a long-tailed version of the ImageNet dataset [russakovsky2015imagenet], built by sampling a subset following the Pareto distribution with a power value of 6. It contains 115.8K images from 1,000 categories, with class cardinality ranging from 5 to 1,280.
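For readers unfamiliar with how such long-tailed splits are typically built, the sketch below generates per-class training sizes with an exponentially decaying profile, a construction commonly used for long-tailed CIFAR. Whether it matches the exact subsampling of the cited works is an assumption, and the function name `long_tailed_counts` is chosen for this example.

```python
def long_tailed_counts(n_max=500, num_classes=100, imbalance_factor=100):
    """Per-class training-set sizes that decay exponentially with class index."""
    return [
        int(n_max * (1.0 / imbalance_factor) ** (i / (num_classes - 1)))
        for i in range(num_classes)
    ]


counts = long_tailed_counts()
print(counts[0], counts[-1], counts[0] / counts[-1])
```

With the default arguments the largest class keeps all 500 CIFAR-100 training images and the smallest keeps 5, so $N_{max}/N_{min} = 100$.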

Implementation Details for the Diffusion Model. Observing that the default hyper-parameters in DDPM also achieve the best results on CIFAR-100-LT, we adopt the same training configuration as the baseline model, i.e., DDPM [ho2020denoising]. We set $\sigma_1 = 10^{-4}$ and $\sigma_T = 0.02$ with $T = 1{,}000$ for the noise schedule. We optimize the network with an Adam optimizer whose learning rate is 0.0002 after 5,000 iterations of warmup. On ImageNet-LT, we select the best hyper-parameters for the baseline model and use $\sigma_T = 0.01$. Image sizes of $32\times 32$ and $64\times 64$ are used for training and evaluation. Considering that the size and semantic complexity of the datasets vary greatly, we choose an appropriate number of epochs for each dataset. Four Nvidia GeForce 3090 GPUs are used for training.

Evaluation Metrics. Models trained with our CFRG noise schedule and the corresponding baseline models are compared with respect to both generation diversity and fidelity via Frechet Inception Distance (FID) [heusel2017gans], Recall [kynkaanniemi2019improved], and $F_\beta$ [sajjadi2018assessing]. Recall and $F_\beta$ are measured using features from an Inception-v3 network pre-trained on ImageNet. Following the practice of CBDM, we take $K=5$ for Recall, $1/8$ and $8$ for the thresholds in $F_\beta$, and $20$ times the number of classes as the number of clusters for $F_\beta$ to capture the intra-class variance. In the evaluation, we use all the real images from the training sets of CIFAR-100 and ImageNet rather than the imbalanced data used for training. Classifier-free guidance [ho2022classifier] is applied to both our models and the baseline models. For fair comparisons, the guidance strength $\omega$ is carefully tuned for all of these models.

Experimental Settings. Our experimental settings are listed in Table [A.1](https://arxiv.org/html/2606.27696#A1.T1). Except on ImageNet-LT with the image size of $64\times 64$, we conduct experiments on 4 Nvidia GeForce 3090 GPUs. Our method is more efficient than CBDM, which requires an additional regularization batch for each training iteration. In contrast, our approach adds no computational cost over the DDPM baseline. On CIFAR-100-LT, it takes around 1 day for 300,000 training iterations of our method. Equipped with CBDM and ADA, around 3 days are required to complete the 500,000 training iterations. On ImageNet-LT with the image size of $32\times 32$, it takes around 3 days for 500,000 training iterations of our method. With the image size of $64\times 64$, 5 days are needed to finish 500,000 training iterations using 8 Nvidia GeForce 3090 GPUs. At inference, it takes 1 day to generate 50,000 $32\times 32$ images and 4 days to generate 50,000 $64\times 64$ images with 1 Nvidia GeForce 3090 GPU.

More Details on the Case Study. For CIFAR-100-LT, we divide the classes into three groups according to their class frequency: “Many” ($\geq 100$ images), “Medium” ($\geq 20$ images), and “Few” ($< 20$ images). On ImageNet-LT, the classes with “$\geq 500$ images”, “$\geq 50$ images” and “$< 50$ images” are categorized into the “Many”, “Medium”, and “Few” groups, respectively.
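The grouping rule can be expressed as a small helper. The sketch assumes that “Medium” means at least the medium threshold but below the “Many” threshold, which the text implies without stating; the names `frequency_group`, `CIFAR100_LT`, and `IMAGENET_LT` are chosen for this example.

```python
def frequency_group(n_images, many_min, medium_min):
    """Map a class's training-image count to "Many", "Medium", or "Few"."""
    if n_images >= many_min:
        return "Many"
    if n_images >= medium_min:
        return "Medium"
    return "Few"


# Thresholds quoted in the text for each dataset.
CIFAR100_LT = dict(many_min=100, medium_min=20)
IMAGENET_LT = dict(many_min=500, medium_min=50)

for n in (500, 60, 5):
    print(n, frequency_group(n, **CIFAR100_LT), frequency_group(n, **IMAGENET_LT))
```

Class-wise FIDs can then be averaged within each group.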

### B\.2More Experimental Results

We conduct experiments on CIFAR-100-LT with an imbalance factor of 200. The results are listed in Table [A.2](https://arxiv.org/html/2606.27696#A1.T2). With the proposed class-frequency guided (CFRG) noise schedule, we achieve 7.46 FID, surpassing DDPM by 0.79. Moreover, the CFRG noise schedule is orthogonal to previous methods such as ADA and CBDM. Equipped with the CBDM and ADA methods, we achieve 5.89 FID, significantly outperforming DDPM by 2.36.

### B\.3Density Analysis Details

The quantity of low-density regions in Table [3.2](https://arxiv.org/html/2606.27696#S3.T2) is measured by Algorithm [1](https://arxiv.org/html/2606.27696#alg1).

### B\.4Code

## References
