Catastrophic Compositional Generation: Why Vanilla Diffusion Models Fail to Extrapolate
Summary
This paper argues that vanilla conditional diffusion models fundamentally fail at compositional generation when the target distribution is out-of-distribution, due to score estimation error, and that inference-time corrections cannot fully compensate.
# Catastrophic Compositional Generation: Why Vanilla Diffusion Models Fail to Extrapolate
Source: [https://arxiv.org/html/2606.23920](https://arxiv.org/html/2606.23920)
Duncan Soiffer; Chandler Squires (Machine Learning Department, Carnegie Mellon University); Yuan Guan (Machine Learning Department, Carnegie Mellon University); Jason Hartford (Valence Labs; Department of Computer Science, University of Manchester); Pradeep Ravikumar (Machine Learning Department, Carnegie Mellon University)
###### Abstract
The task of *compositional generation* involves using a conditional generative model, trained only on a subset of the possible conditions, to produce samples from compositionally-defined target distributions such as a geometric combination of the source distributions. In this work, we argue that this task is often infeasible for vanilla conditional diffusion models: we conjecture that no inference-time technique can efficiently produce samples from the target distribution in certain well-motivated settings. This idea is supported by theory-guided generalization arguments and carefully-designed experiments on both synthetic and realistic data. In particular, while recent methods such as Feynman-Kac correction reduce *inference-time approximation error*, our results show that *score estimation error* has a more catastrophic effect on performance when the target distribution is out-of-distribution with respect to the sources, highlighting the need for a different approach to this task.
## 1 Introduction
The space of distributions that we would like to model is often far larger than the set of distributions to which we have access. We want models that can imagine arbitrary combinations of concepts (e.g., “a *living room* with a *white couch*, a *black chair*, two *paintings*, a *floor lamp* and nothing else”), but the data are only supported for some combinations of these concepts. This is particularly true when data are derived from experiments: in biology, for example, the space of possible experiments is combinatorially large (covering all possible combinations of molecules, gene knockouts, cell types, assays, etc.), but experimental budgets are finite. Hence, there is significant interest in building perturbation effect prediction models that can predict the outcomes of unseen experiments (Lotfollahi et al., [2023](https://arxiv.org/html/2606.23920#bib.bib51); Roohani et al., [2024](https://arxiv.org/html/2606.23920#bib.bib47); Wang et al., [2024](https://arxiv.org/html/2606.23920#bib.bib46); Adduri et al., [2025](https://arxiv.org/html/2606.23920#bib.bib59); Wenkel et al., [2026](https://arxiv.org/html/2606.23920#bib.bib45); Bunne et al., [2024](https://arxiv.org/html/2606.23920#bib.bib48); Noutahi et al., [2025](https://arxiv.org/html/2606.23920#bib.bib49)).
To extrapolate to new combinations, methods for *compositional generation* rely on strong inductive biases, e.g., assuming the existence of some latent space in which the effects have some additive structure. For example, if two biological perturbations $a$ and $a'$ are *causally separable*, it can be shown that the double perturbation distribution $P(x \mid \mathrm{do}(a), \mathrm{do}(a'))$ can be expressed as a geometric combination of the control distribution $P(x)$ and the single perturbation distributions $P(x \mid \mathrm{do}(a))$ and $P(x \mid \mathrm{do}(a'))$ (Wang et al., [2023](https://arxiv.org/html/2606.23920#bib.bib5); Xu et al., [2024](https://arxiv.org/html/2606.23920#bib.bib58)). In principle, such results imply significant reductions in the amount of training data necessary to succeed at these difficult generation tasks.
In practice, however, even if the compositional distribution $P(x \mid \mathrm{do}(a), \mathrm{do}(a'))$ can be expressed in terms of distributions from which we have training samples, this does not imply that we can efficiently sample from it: distribution-level identities do not necessarily translate into computationally and statistically efficient algorithms. In particular, we consider procedures for composing *conditional diffusion models*, given their practical relevance and powerful generative capabilities (Dhariwal and Nichol, [2021](https://arxiv.org/html/2606.23920#bib.bib40)). Notably, geometric combinations at the distribution level translate into linear combinations at the score level. Hence, a popular heuristic for compositional generation with such models is to linearly combine the scores (Liu et al., [2022](https://arxiv.org/html/2606.23920#bib.bib20)) during the denoising process. However, it is increasingly recognized that this naïve approach introduces a fundamental source of error, even assuming that the models recover the source scores perfectly: in general, adding noise to the distributions breaks the linear relationship between their score functions.
Recent particle-based algorithms (Skreta et al., [2025](https://arxiv.org/html/2606.23920#bib.bib24); Xie et al., [2026](https://arxiv.org/html/2606.23920#bib.bib26); Ren et al., [2026](https://arxiv.org/html/2606.23920#bib.bib25)) explicitly correct for this error. We focus on Feynman-Kac Correctors (FKC) (Skreta et al., [2025](https://arxiv.org/html/2606.23920#bib.bib24)), a flexible instantiation of this approach that tracks importance weights across the denoising trajectory and uses them for correction. Despite these recent advances, we highlight an issue that poses even more significant problems for composition. We find that in many settings of interest, the *score estimation error*, i.e., the error propagated from errors in the model’s learned score function, becomes the dominant factor, since the typical samples from the composed distribution are in low-density regions of the source distributions. Moreover, in these cases, we find that FKC reinforces these errors, further degrading sample quality.
##### Contributions
Building on prior work, we first introduce a *concrete formulation of the compositional generation task*, show that such compositions are *closely connected to causal representation learning*, and *formalize two sources of error* that make this task difficult ([Section 2](https://arxiv.org/html/2606.23920#S2) and [Appendix B](https://arxiv.org/html/2606.23920#A2)). Then, we provide *theoretical insights into how these errors behave* for different collections of source distributions ([Section 4](https://arxiv.org/html/2606.23920#S4)), focusing on illustrative, analytically tractable settings. Based on these insights, we perform careful experimentation to disentangle score estimation error from inference-time approximation error ([Section 5](https://arxiv.org/html/2606.23920#S5)). These experiments reveal that both sources of error lead to compositional failure within distinct, partially overlapping regimes, and that within the overlap, score estimation error often dominates attempts to correct for inference-time approximation error.
## 2 Setup: Geometrically weighted compositions
Mathematically, we consider a *perturbation space* $\mathcal{A}$ and an *outcome space* $\mathcal{X}$. For example, if a scientist can apply $K$ drugs at varying dosages, and measures cell images after such perturbations, we would have $\mathcal{A} = \mathbb{R}^{K}_{\geq 0}$ and we would take $\mathcal{X}$ to be the space of images. Each perturbation $a \in \mathcal{A}$ is associated with some ground truth distribution $P^{a} \in \mathcal{P}$ over outcomes, where $\mathcal{P}$ denotes the set of probability distributions on $\mathcal{X}$. In practice, we may only observe outcome data for a small subset $\mathcal{A}_{\text{o}} \subseteq \mathcal{A}$ of the overall perturbation space. For example, when experiments are costly, as in perturbational cell imaging, we may only have observations for the control condition and for individual drugs at fixed dosages (i.e., $\mathcal{A}_{\text{o}} = \{\mathbf{0}\} \cup \{\mathbf{e}_k : k \in [K]\}$, where $\mathbf{e}_k$ is the $k^{\text{th}}$ basis vector).
In such cases, we would like to *extrapolate* to an unseen perturbation $a^{*} \in \mathcal{A} \setminus \mathcal{A}_{\text{o}}$. Under a variety of well-motivated theoretical assumptions, the target distribution $P^{a^{*}}$ can often be identified from the observed distributions $(P^{a})_{a \in \mathcal{A}_{\text{o}}}$. In this work, we consider a form that is common in several settings, where the density of distribution $P^{a^{*}}$ is expressed as a geometric mixture of the source distribution densities. To ensure that our expression is well-defined, we assume throughout the paper that, for each $a \in \mathcal{A}$, the distribution $P^{a}$ is associated with a density over $\mathcal{X}$. With a typical abuse of notation, we also write this density as $P^{a}$. Further, we assume that these densities are positive, i.e., for all $a \in \mathcal{A}_{\text{o}}$ and $x \in \mathcal{X}$, we have $P^{a}(x) > 0$, ensuring that we avoid divisions by zero. (In [Appendix A](https://arxiv.org/html/2606.23920#A1), we give a more formal version of this assumption, and discuss the more formal interpretation of our equality expressions.)
###### Definition 1\.
Fix $\mathcal{A}_{\text{o}} \subseteq \mathcal{A}$. Given a collection of distributions $\mathbf{P} = (P^{a})_{a \in \mathcal{A}_{\text{o}}}$ and a function $w \colon \mathcal{A}_{\text{o}} \to \mathbb{R}$, we say that $w$ is a *valid weighting* if $f(x) = \prod_{a \in \mathcal{A}_{\text{o}}} P^{a}(x)^{w_a}$ is integrable, where $w_a$ is shorthand for $w(a)$. For a valid weighting $w$, we define the *weighted composition* of $\mathbf{P}$ as

$$\operatorname{Comp}_{w}(\mathbf{P})(x) := \frac{1}{Z^{w}} \prod_{a \in \mathcal{A}_{\text{o}}} P^{a}(x)^{w_a}, \quad \text{where} \ Z^{w} := \int_{\mathcal{X}} \left( \prod_{a \in \mathcal{A}_{\text{o}}} P^{a}(x)^{w_a} \right) \mathrm{d}x. \tag{1}$$

Alternatively, we write $\mathbf{P}^{w} := \operatorname{Comp}_{w}(\mathbf{P})$ as shorthand.
This geometric combination at the distribution level becomes a linear combination at the score level. Assuming $\log P^{a}$ is differentiable, the *Stein score functions* are $S^{a}(x) := \nabla_{x} \log P^{a}(x)$ for $a \in \mathcal{A}_{\text{o}}$, and $S^{w}(x) := \nabla_{x} \log \mathbf{P}^{w}(x)$. Then, for any valid weighting $w$, we have $S^{w}(x) = \sum_{a \in \mathcal{A}_{\text{o}}} w_a S^{a}(x)$, since the normalizing constant $Z^{w}$ does not depend on $x$.
##### Causal origins of weighted compositions
To better understand the importance of weighted compositions, it is useful to briefly discuss the conditions under which they naturally arise. As a canonical example, consider predicting the outcome of a double perturbation $a^{*} = \mathbf{e}_1 + \mathbf{e}_2$, given only observational data and single perturbations, i.e., $\mathcal{A}_{\text{o}} = \{\mathbf{0}, \mathbf{e}_1, \mathbf{e}_2\}$, and suppose the outcome space $\mathcal{X}$ can be decomposed into an upstream component $\mathcal{X}_1$ and a downstream component $\mathcal{X}_2$. Reflecting this structure, we can factorize $P^{\mathbf{0}}$ as $P^{\mathbf{0}}(x) = P^{\mathbf{0}}(x_1) P^{\mathbf{0}}(x_2 \mid x_1)$. In many systems, perturbations can be expected to have isolated, targeted effects: in causality, this can be expressed in terms of *mechanism changes* or *soft interventions* (Squires and Uhler, [2023](https://arxiv.org/html/2606.23920#bib.bib76)). In particular, suppose $\mathbf{e}_1$ affects the upstream component but not the downstream one, i.e., $P^{\mathbf{e}_1}(x) = P^{\mathbf{e}_1}(x_1) P^{\mathbf{0}}(x_2 \mid x_1)$, while $\mathbf{e}_2$ affects the downstream component but not the upstream one, i.e., $P^{\mathbf{e}_2}(x) = P^{\mathbf{0}}(x_1) P^{\mathbf{e}_2}(x_2 \mid x_1)$.
Further, perturbations may often be expected to have modular effects, known as the principle of *independent causal mechanisms* (Schölkopf et al., [2021](https://arxiv.org/html/2606.23920#bib.bib74)). Under this assumption, the double perturbation $a^{*}$ simply “combines” the effects of the single perturbations, so that $P^{a^{*}}(x) = P^{\mathbf{e}_1}(x_1) P^{\mathbf{e}_2}(x_2 \mid x_1)$. In this case, it is simple to show that $P^{a^{*}} = \operatorname{Comp}_{w_{\text{dbl}}}(\mathbf{P})$ for the valid weighting $w_{\text{dbl}}(\mathbf{e}_1) = w_{\text{dbl}}(\mathbf{e}_2) = 1$ and $w_{\text{dbl}}(\mathbf{0}) = -1$: the product $P^{\mathbf{e}_1}(x) P^{\mathbf{e}_2}(x) / P^{\mathbf{0}}(x)$ cancels the two control factors and leaves exactly $P^{\mathbf{e}_1}(x_1) P^{\mathbf{e}_2}(x_2 \mid x_1)$. More broadly, this reasoning can be extended to more than two perturbations, and $\mathcal{X}$ does not have to be directly decomposable into causal components: this decomposition may only hold in some unknown latent space. In such cases, the field of *causal representation learning* provides a more general justification for weighted compositions, as we describe in [Appendix B](https://arxiv.org/html/2606.23920#A2).
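The double-perturbation identity can be checked numerically on a toy example. The following is a minimal sketch, assuming finite upstream and downstream spaces with randomly drawn, strictly positive mechanisms; the table sizes, the helper `random_dist`, and the random seed are illustrative choices and not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_dist(shape, axis=None):
    """Random strictly positive table, normalized over `axis` (all entries if None)."""
    p = rng.uniform(0.1, 1.0, size=shape)
    return p / p.sum(axis=axis, keepdims=True)

n1, n2 = 3, 4  # hypothetical sizes of the upstream and downstream spaces

# Control mechanisms and their perturbed replacements.
p0_x1 = random_dist((n1,))                        # P^0(x1)
p0_x2_given_x1 = random_dist((n1, n2), axis=1)    # P^0(x2 | x1)
pe1_x1 = random_dist((n1,))                       # P^{e1}(x1)
pe2_x2_given_x1 = random_dist((n1, n2), axis=1)   # P^{e2}(x2 | x1)

# Joint tables indexed by (x1, x2).
P0 = p0_x1[:, None] * p0_x2_given_x1
Pe1 = pe1_x1[:, None] * p0_x2_given_x1            # e1 changes only the upstream mechanism
Pe2 = p0_x1[:, None] * pe2_x2_given_x1            # e2 changes only the downstream mechanism
P_target = pe1_x1[:, None] * pe2_x2_given_x1      # independent causal mechanisms

# Weighted composition with w(e1) = w(e2) = 1 and w(0) = -1.
unnormalized = Pe1 * Pe2 / P0
composed = unnormalized / unnormalized.sum()

max_abs_diff = np.abs(composed - P_target).max()
print(f"max |Comp_w(P) - P^(a*)| = {max_abs_diff:.2e}")
```

In this toy setting the unnormalized product already sums to one, so the normalizing constant plays no role; that is a feature of this particular weighting and factorization rather than of weighted compositions in general.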
##### The compositional generation task and its challenges
With these motivating examples in mind, we define a concrete version of the compositional generation task as follows:
###### Task (Compositional generation for weighted compositions).
As inputs, take a conditional diffusion model $\mathbf{P}_{\theta} \approx \mathbf{P}_{\star}$ from $\mathcal{A}_{\text{o}}$ to $\mathcal{X}$ and a valid weighting $w \colon \mathcal{A}_{\text{o}} \to \mathbb{R}$ for $\mathbf{P}_{\star}$. Using an efficient algorithm, produce samples from a distribution $\widetilde{P} \in \mathcal{P}$ such that $\widetilde{P} \approx \mathbf{P}_{\star}^{w}$.

More formally, let $d$ be some metric on the space of distributions $\mathcal{P}$. Then, our input assumption ($\mathbf{P}_{\theta} \approx \mathbf{P}_{\star}$) says that we are given $\mathbf{P}_{\theta}$ such that $d(P^{a}_{\theta}, P^{a}_{\star}) < \varepsilon_1$ for all $a \in \mathcal{A}_{\text{o}}$, and our output requirement ($\widetilde{P} \approx \mathbf{P}_{\star}^{w}$) says that we produce samples from some $\widetilde{P}$ such that $d(\widetilde{P}, \mathbf{P}^{w}_{\star}) < \varepsilon_2$.
This task faces two challenges. First, there is *inference-time approximation error*: for most model classes (e.g., diffusion models), we cannot efficiently sample from $\mathbf{P}_{\theta}^{w}$, only from some proxy $\widetilde{P} \approx \mathbf{P}^{w}_{\theta}$. Second, there is *score estimation error*: since we are using an estimated model $\mathbf{P}_{\theta} \neq \mathbf{P}_{\star}$, we generally have $\mathbf{P}_{\theta}^{w} \neq \mathbf{P}_{\star}^{w}$ for $\mathbf{P}_{\theta}^{w} := \operatorname{Comp}_{w}(\mathbf{P}_{\theta})$. In particular, given any metric $d$ on the set of probability distributions $\mathcal{P}$, the triangle inequality gives

$$d(\widetilde{P}, \mathbf{P}_{\star}^{w}) \leq \underbrace{d(\widetilde{P}, \mathbf{P}_{\theta}^{w})}_{\text{inference-time approximation error}} + \underbrace{d(\mathbf{P}_{\theta}^{w}, \mathbf{P}_{\star}^{w})}_{\text{score estimation error}}.$$

As we discuss in [Section 3.2](https://arxiv.org/html/2606.23920#S3.SS2), recent methods such as *Feynman-Kac correction* (Skreta et al., [2025](https://arxiv.org/html/2606.23920#bib.bib24)) provide elegant solutions to reduce the inference-time approximation error $d(\widetilde{P}, \mathbf{P}_{\theta}^{w})$ when composing diffusion models. Although reducing this error is important, we find that the score estimation error $d(\mathbf{P}_{\theta}^{w}, \mathbf{P}_{\star}^{w})$, often overlooked in previous works, can be much more important in certain settings. Specifically, we consider settings where typical samples from the weighted composition $\mathbf{P}^{w}$ lie in low-density regions of the source distributions. In these cases, we say that $\mathbf{P}^{w}$ is *out-of-distribution (OOD)*: since the estimated score functions were trained on $\mathbf{P} = (P^{a})_{a \in \mathcal{A}_{\text{o}}}$, but are being evaluated on $\mathbf{P}^{w}$, these cases lead to issues similar to those considered in the field of *OOD generalization* (David et al., [2010](https://arxiv.org/html/2606.23920#bib.bib83); Kpotufe and Martinet, [2018](https://arxiv.org/html/2606.23920#bib.bib84); Canatar et al., [2021](https://arxiv.org/html/2606.23920#bib.bib85)).
## 3 Technical background
In this section, we review relevant background on conditional generative models and compositional generation\. Here, our focus is on conditional diffusion models, though we expect similar results to apply to other forms of conditional generative models\.
### 3.1 Conditional diffusion models
##### Notation
We let $W_t$ denote a standard Wiener process, using the convention that $t = 0$ is data and $t = 1$ is noise. We use the terms “noising” and “denoising” rather than “forward” and “reverse” to avoid confusion arising from conflicting conventions between diffusion and flow-based models. All denoising SDEs and ODEs are written in the same time variable $t$, but integrated from $t = 1$ to $t = 0$, and the SDE uses the reverse-time Wiener process $\overline{W}_t$.
##### Noising
Given a drift coefficient schedule $(\mu_t)_t$ and a diffusion coefficient schedule $(\sigma_t)_t$, we assume a *noising SDE* of the form $\mathrm{d}x_t = u_t(x_t)\,\mathrm{d}t + \sigma_t\,\mathrm{d}W_t$, with $u_t(x) := \mu_t x$, which we refer to as the *noising drift function*. Given $x_0 \sim P^{a}$, the solution to this SDE is a stochastic process $(\mathbf{X}^{a}_t)_t$, and defines the *noising transition kernel* $\operatorname{Noise}_t \colon \mathcal{X} \to \mathcal{P}(\mathcal{X})$, where

$$\operatorname{Noise}_t(x_0) = \mathcal{N}(\alpha_t x_0; \gamma_t^{2} I), \quad \text{for} \quad \alpha_t := \exp\left( \int_0^t \mu_s\,\mathrm{d}s \right) \ \text{and} \ \gamma_t := \alpha_t \sqrt{\int_0^t \frac{\sigma_s^{2}}{\alpha_s^{2}}\,\mathrm{d}s}. \tag{2}$$

This kernel extends naturally to a function on distributions. With some abuse of notation, we write $\operatorname{Noise}_t \colon \mathcal{P}(\mathcal{X}) \to \mathcal{P}(\mathcal{X})$, where $\operatorname{Noise}_t \colon P \mapsto \int \operatorname{Noise}_t(x) P(x)\,\mathrm{d}x$, and we write $P_t^{a} := \operatorname{Noise}_t(P^{a})$ for the noised distribution. Put differently, $P_t^{a}$ is the marginal distribution of samples at time $t$ after drawing an unnoised sample from $P^{a}$. Then, each noised distribution $P^{a}_t$ has the associated Stein score function $S^{a}_t(x) := \nabla_{x} \log P^{a}_t(x)$. For any noising SDE, there exists a corresponding deterministic ODE with identical marginal distributions $P^{a}_t$ at all times $t \in [0, 1]$ (Song et al., [2021](https://arxiv.org/html/2606.23920#bib.bib35)). Hence, we define this *noising ODE* as $\mathrm{d}x_t = \tilde{u}_t(x_t)\,\mathrm{d}t$ for $\tilde{u}_t(x) := u_t(x) - \frac{1}{2}\sigma_t^{2} S^{a}_t(x)$.
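To make Equation (2) concrete, the sketch below evaluates $\alpha_t$ and $\gamma_t$ by trapezoidal integration and compares them with the closed form for a constant-rate schedule. The schedule $\mu_t = -\beta/2$, $\sigma_t^2 = \beta$ with $\beta = 5$ is an assumption chosen for illustration (it gives $\alpha_t^2 + \gamma_t^2 = 1$), and `noising_coefficients` is a hypothetical helper, not code from the paper.

```python
import numpy as np

def noising_coefficients(mu, sigma, t, num_grid=20001):
    """Numerically evaluate (alpha_t, gamma_t) from Equation (2) for schedules mu(s), sigma(s)."""
    s = np.linspace(0.0, t, num_grid)
    ds = s[1] - s[0]
    mu_vals = mu(s)
    # Cumulative trapezoid rule: log alpha_s = int_0^s mu_r dr on the grid.
    log_alpha = np.concatenate(([0.0], np.cumsum(0.5 * (mu_vals[1:] + mu_vals[:-1]) * ds)))
    alpha = np.exp(log_alpha)
    # Trapezoid rule for int_0^t sigma_s^2 / alpha_s^2 ds.
    integrand = sigma(s) ** 2 / alpha ** 2
    integral = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * ds)
    return alpha[-1], alpha[-1] * np.sqrt(integral)

beta = 5.0  # hypothetical constant rate, not taken from the paper
mu = lambda s: -0.5 * beta * np.ones_like(s)
sigma = lambda s: np.sqrt(beta) * np.ones_like(s)

t = 0.7
alpha_t, gamma_t = noising_coefficients(mu, sigma, t)

# Closed form for this schedule: alpha_t = exp(-beta t / 2), gamma_t^2 = 1 - exp(-beta t).
alpha_exact = np.exp(-0.5 * beta * t)
gamma_exact = np.sqrt(1.0 - np.exp(-beta * t))
print(alpha_t, alpha_exact, gamma_t, gamma_exact)
```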
##### Denoising
Diffusion models leverage the fact that the stochastic process $(\mathbf{X}^{a}_t)_t$ also solves the *denoising SDE*

$$\mathrm{d}x^{a}_t = v^{a}_t(x^{a}_t)\,\mathrm{d}t + \sigma_t\,\mathrm{d}\overline{W}_t, \quad \text{for} \quad v^{a}_t(x) := u_t(x) - \sigma_t^{2} S^{a}_t(x), \tag{3}$$

as given in Song et al. ([2021](https://arxiv.org/html/2606.23920#bib.bib35)). Alternatively, the *denoising ODE* is defined by running the noising ODE backward in time. In practice, the conditional score functions $S^{a}_t$ can be estimated by a variety of methods (Hyvärinen, [2005](https://arxiv.org/html/2606.23920#bib.bib55); Vincent, [2011](https://arxiv.org/html/2606.23920#bib.bib56); Ho et al., [2020](https://arxiv.org/html/2606.23920#bib.bib44); Song et al., [2021](https://arxiv.org/html/2606.23920#bib.bib35); Lipman et al., [2023](https://arxiv.org/html/2606.23920#bib.bib73); Albergo et al., [2025](https://arxiv.org/html/2606.23920#bib.bib11)), with the conditional score functions typically parameterized as conditional deep networks (Perez et al., [2018](https://arxiv.org/html/2606.23920#bib.bib54); Ho and Salimans, [2022](https://arxiv.org/html/2606.23920#bib.bib36)). We consider “vanilla” models in the sense that they are trained only to estimate scores (or some known transformations thereof) on observed conditions, using standard objectives and shared parameters, without architectural or training-time constraints tailored to compositional extrapolation. Our experiments use denoising diffusion (Ho et al., [2020](https://arxiv.org/html/2606.23920#bib.bib44)), but the results apply to other vanilla score estimators (up to differences in estimation error).
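As an illustration of Equation (3), the following sketch integrates the denoising SDE from $t = 1$ to $t = 0$ with Euler-Maruyama steps for a one-dimensional Gaussian source, where the noised score is known in closed form. It assumes the constant-rate schedule $\mu_t = -\beta/2$, $\sigma_t^2 = \beta$; the values of `beta`, `data_var`, the particle count, and the step count are illustrative, and the exact score stands in for a learned network.

```python
import numpy as np

beta = 5.0       # hypothetical schedule: mu_t = -beta / 2, sigma_t^2 = beta
data_var = 0.25  # source distribution P^a = N(0, data_var) in one dimension

def alpha(t):
    return np.exp(-0.5 * beta * t)

def gamma_sq(t):
    return 1.0 - np.exp(-beta * t)

def score(x, t):
    """Exact score of the noised Gaussian P_t^a = N(0, alpha_t^2 * data_var + gamma_t^2)."""
    return -x / (alpha(t) ** 2 * data_var + gamma_sq(t))

def denoise(num_particles=20000, num_steps=500, seed=0):
    rng = np.random.default_rng(seed)
    h = 1.0 / num_steps
    # Start from the exact marginal at t = 1.
    start_std = np.sqrt(alpha(1.0) ** 2 * data_var + gamma_sq(1.0))
    x = rng.normal(0.0, start_std, size=num_particles)
    for k in range(num_steps):
        t = 1.0 - k * h
        v = -0.5 * beta * x - beta * score(x, t)  # v_t = u_t - sigma_t^2 * S_t
        # Time runs backward, so the drift enters with step -h.
        x = x - h * v + np.sqrt(beta * h) * rng.normal(size=num_particles)
    return x

samples = denoise()
sample_var = samples.var()
print(f"sample variance {sample_var:.4f} vs data variance {data_var}")
```

The recovered variance should sit close to `data_var`, up to Monte Carlo noise and the first-order discretization bias of the Euler-Maruyama scheme.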
### 3.2 Composition methods
Now, we discuss methods for sampling from weighted compositions of conditional diffusion models. Noising $\mathbf{P}^{w} := \operatorname{Comp}_{w}(\mathbf{P})$ and taking its score, we obtain the *correctly-composed score function*

$$S^{w}_t(x) := \nabla_{x} \log \mathbf{P}^{w}_t(x), \quad \text{for} \ \mathbf{P}^{w}_t := \operatorname{Noise}_t(\mathbf{P}^{w}).$$

Ideally, we would like to run the denoising SDE in [Equation 3](https://arxiv.org/html/2606.23920#S3.E3) with $v^{w}_t(x) := u_t(x) - \sigma_t^{2} S^{w}_t(x)$ in place of $v^{a}_t(x)$. However, we do not know $S^{w}_t(x)$ or have a simple way to compute it. Instead, a simple heuristic used in many prior works (e.g., classifier-free guidance (Ho and Salimans, [2022](https://arxiv.org/html/2606.23920#bib.bib36)) and compositional visual generation (Liu et al., [2022](https://arxiv.org/html/2606.23920#bib.bib20))) substitutes this score with an additive approximation. In particular, the heuristic uses the *naïvely-composed score function*

$$S^{w,\text{nv}}_t(x) := \sum_{a \in \mathcal{A}_{\text{o}}} w_a S^{a}_t(x).$$

The corresponding *naïve denoising SDE* is thus given by

$$\mathrm{d}x_t^{w,\text{nvs}} = v^{w,\text{nv}}_t(x_t^{w,\text{nvs}})\,\mathrm{d}t + \sigma_t\,\mathrm{d}\overline{W}_t, \quad \text{where} \quad v^{w,\text{nv}}_t(x) := u_t(x) - \sigma_t^{2} S_t^{w,\text{nv}}(x), \tag{4}$$

and the corresponding *naïve denoising ODE* is

$$\mathrm{d}x^{w,\text{nvo}}_t = \tilde{v}^{w,\text{nv}}_t(x^{w,\text{nvo}}_t)\,\mathrm{d}t, \quad \text{where} \quad \tilde{v}^{w,\text{nv}}_t(x) := u_t(x) - \frac{1}{2}\sigma_t^{2} S_t^{w,\text{nv}}(x). \tag{5}$$
Notably, the definition of weighted composition ensures that $S_t^{w,\text{nv}}(x) = S_t^{w}(x)$ at $t = 0$, and the definition of the noising SDE ensures that, if $\sum_a w_a = 1$, then $S_t^{w,\text{nv}}(x) = S_t^{w}(x)$ at $t = 1$. However, it is not generically true that $S_t^{w,\text{nv}}(x) = S_t^{w}(x)$ for intermediate times $t \in (0, 1)$. Due to this discrepancy, naïve denoising methods can introduce substantial inference-time approximation error, as we illustrate in [Section 4.1](https://arxiv.org/html/2606.23920#S4.SS1).
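The gap between the naïve and correctly-composed scores can be seen directly for one-dimensional zero-mean Gaussians, where every score is linear in $x$ and only its slope matters. The sketch below assumes the constant-rate schedule $\alpha_t^2 = e^{-\beta t}$, $\gamma_t^2 = 1 - e^{-\beta t}$; the source variances, weights, and $\beta$ are illustrative values. The composed variance follows from Definition 1, because multiplying Gaussian densities raised to powers $w_a$ adds their weighted precisions.

```python
import numpy as np

beta = 5.0  # hypothetical constant-rate schedule

def alpha_sq(t):
    return np.exp(-beta * t)

def gamma_sq(t):
    return 1.0 - np.exp(-beta * t)

source_vars = np.array([0.5, 2.0])  # P^a = N(0, var_a), illustrative values
weights = np.array([0.5, 0.5])      # sums to one; weighted precision is positive

# Composed variance from Definition 1: precisions add with weights w_a.
composed_var = 1.0 / np.sum(weights / source_vars)

def naive_slope(t):
    """Slope c of the naively composed score S_t^{w,nv}(x) = c * x."""
    return -np.sum(weights / (alpha_sq(t) * source_vars + gamma_sq(t)))

def correct_slope(t):
    """Slope of the correctly composed score: noise the composition, then differentiate."""
    return -1.0 / (alpha_sq(t) * composed_var + gamma_sq(t))

for t in (0.0, 0.1, 0.5, 1.0):
    print(f"t={t:.1f}  naive={naive_slope(t):+.4f}  correct={correct_slope(t):+.4f}")
```

Under this schedule $\alpha_1$ is small but nonzero, so the two slopes agree only approximately at $t = 1$; exact agreement there requires the terminal distribution to be pure noise.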
In contrast to naïve methods, Feynman-Kac Correctors (Skreta et al., [2025](https://arxiv.org/html/2606.23920#bib.bib24)) explicitly account for this discrepancy through a particle-based Sequential Monte Carlo approach. By maintaining $K$ weighted particles updated according to a residual term $g_t(x)$, and resampling according to these weights, FKC tracks the distribution of a target along the entire denoising trajectory. We refer readers to [Appendix C](https://arxiv.org/html/2606.23920#A3) for further theoretical intuition on FKC.
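For readers unfamiliar with Sequential Monte Carlo, the sketch below shows a generic reweight-and-resample step of the kind such methods build on. It is not the FKC algorithm itself: the log-weight increment is passed in as an opaque array (in FKC it would be derived from the residual term $g_t$, whose form is not reproduced here), and the function name, the effective-sample-size threshold, and the use of multinomial resampling are all assumptions made for illustration.

```python
import numpy as np

def reweight_and_resample(particles, log_weights, log_weight_increment, rng, ess_threshold=0.5):
    """One generic SMC step: update log-weights, then resample if the weights degenerate.

    `log_weight_increment` is a placeholder for whatever per-particle correction
    the sampler prescribes over one time step.
    """
    log_weights = log_weights + log_weight_increment
    # Normalize in a numerically stable way.
    w = np.exp(log_weights - log_weights.max())
    w = w / w.sum()
    ess = 1.0 / np.sum(w ** 2)  # effective sample size
    if ess < ess_threshold * len(w):
        # Multinomial resampling: duplicate high-weight particles, drop low-weight ones.
        idx = rng.choice(len(w), size=len(w), p=w)
        particles = particles[idx]
        log_weights = np.zeros(len(w))  # weights are uniform after resampling
    else:
        log_weights = np.log(w)
    return particles, log_weights, ess

rng = np.random.default_rng(0)
particles = np.arange(5.0)
# A large increment on the last particle makes the weights collapse onto it.
new_particles, new_log_weights, ess = reweight_and_resample(
    particles, np.zeros(5), np.array([0.0, 0.0, 0.0, 0.0, 50.0]), rng
)
print(new_particles, ess)
```

The step makes one failure mode easy to picture: if the increments are computed from inaccurate scores, resampling concentrates particles according to those inaccurate weights.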
## 4 Theoretical insights
To obtain a more fine-grained view of approximation error, we begin by studying weighted compositions of multivariate Gaussians, which also form the basis for our first experiments in [Section 5](https://arxiv.org/html/2606.23920#S5).
### 4.1 Multivariate Gaussians
###### Theorem 1\.
Let $P^{a} = \mathcal{N}(0, \Sigma_a)$ for $a \in \mathcal{A}_{\text{o}}$. Then, $w$ is a valid weighting function if and only if $\sum_{a \in \mathcal{A}_{\text{o}}} w_a \Sigma_a^{-1} \succ 0$. In this case, $\mathbf{P}^{w}(x) = \mathcal{N}(0, \Sigma^{w})$ for

$$\Sigma^{w} = \left( \sum_{a \in \mathcal{A}_{\text{o}}} w_a \Sigma_a^{-1} \right)^{-1}. \tag{6}$$

The proof is given in [Section D.1](https://arxiv.org/html/2606.23920#A4.SS1). However, even for simple and analytically tractable Gaussians, the composition operator does not commute with noising, which leads the naïve denoising process to sample from the incorrect distribution:
###### Theorem 2\.
Let $P^{a} = \mathcal{N}(0, \Sigma_a)$ for $a \in \mathcal{A}_{\text{o}}$, and assume $\{\Sigma_a\}_{a \in \mathcal{A}_{\text{o}}}$ are mutually commutative and that $\sum_{a \in \mathcal{A}_{\text{o}}} w_a = 1$. Then, the naïve denoising ODE ([5](https://arxiv.org/html/2606.23920#S3.E5)) yields $\widetilde{P}^{w,\text{nvo}} = \mathcal{N}(0, \Sigma^{w,\text{nvo}})$, where

$$\Sigma^{w,\text{nvo}} = \prod_{a \in \mathcal{A}_{\text{o}}} \Sigma_a^{w_a}.$$

The proof is given in [Section D.2](https://arxiv.org/html/2606.23920#A4.SS2). In this case, as a result of this approximation error, the weighted sum of precisions in Equation (6) is effectively replaced by a weighted product of covariances after the denoising process. The two resulting covariances can in fact be *arbitrarily far from each other*, as a simple example demonstrates: take $|\mathcal{A}_{\text{o}}| = 2$, $w_1 = w_2 = \tfrac{1}{2}$, and $\Sigma_1 = \varepsilon I$, $\Sigma_2 = \varepsilon^{-1} I$ for $\varepsilon > 0$. Then $\Sigma^{w,\text{nvo}} = I$ and $\Sigma^{w} = \frac{2\varepsilon}{1 + \varepsilon^{2}}\, I$, so $\|\Sigma^{w,\text{nvo}}\|_{\mathrm{op}} / \|\Sigma^{w}\|_{\mathrm{op}} = (1 + \varepsilon^{2}) / (2\varepsilon) \to \infty$ as $\varepsilon \to 0^{+}$. This motivates the need for methods that reduce inference-time approximation error in weighted compositions. However, as we will see, there are also distributions of interest for which approximation error is not a concern.
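The example above can be reproduced in a few lines. This sketch implements Equation (6) directly and the Theorem 2 product only for diagonal covariances (which commute trivially), so that elementwise powers suffice; the function names and the choice of dimension are illustrative, not part of the paper.

```python
import numpy as np

def composed_cov(covs, weights):
    """Equation (6): inverse of the weighted sum of precisions, if the weighting is valid."""
    precision = sum(w * np.linalg.inv(S) for S, w in zip(covs, weights))
    if np.any(np.linalg.eigvalsh(precision) <= 0):
        raise ValueError("not a valid weighting: weighted precision is not positive definite")
    return np.linalg.inv(precision)

def naive_ode_cov_diag(covs, weights):
    """Theorem 2 restricted to diagonal covariances: elementwise product of Sigma_a^{w_a}."""
    diag = np.ones(covs[0].shape[0])
    for S, w in zip(covs, weights):
        diag = diag * np.diag(S) ** w
    return np.diag(diag)

def op_norm_ratio(eps, dim=2):
    """Ratio of operator norms for Sigma_1 = eps * I, Sigma_2 = I / eps, w_1 = w_2 = 1/2."""
    covs = [eps * np.eye(dim), np.eye(dim) / eps]
    weights = [0.5, 0.5]
    naive = naive_ode_cov_diag(covs, weights)
    correct = composed_cov(covs, weights)
    return np.linalg.norm(naive, 2) / np.linalg.norm(correct, 2)

for eps in (1.0, 0.1, 0.01):
    print(f"eps={eps:<5} ratio={op_norm_ratio(eps):.4f}  closed form={(1 + eps**2) / (2 * eps):.4f}")
```

The positive-definiteness check mirrors the validity condition of Theorem 1: weightings with negative entries can make the weighted precision indefinite, in which case no composed Gaussian exists.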
### 4.2 Base-composed distributions
As a special case of weighted compositions, we are interested in studying what we refer to as *base-composed distributions*: $\mathbf{P}^{w}$ for the valid weighting $w_a = 1 - \mathbb{1}_{a=0} \cdot |\mathcal{A}_{\text{o}}|$ for all $a \in \mathcal{A}_{\text{o}}$, with the corresponding distribution given by

$$\mathbf{P}^{w}(x) = \frac{1}{Z^{w}} P^{0}(x) \prod_{a \in \mathcal{A}_{\text{o}}} \frac{P^{a}(x)}{P^{0}(x)}. \tag{7}$$
Recall that, in causal representation learning, these compositions are motivated by a view of perturbations as interventions. As an example, the base case $P^{0}$ may represent a control trial while the other perturbation distributions model the effects of specific drugs.
#### 4.2.1 Factorized conditionals
In addition to being theoretically well-motivated, some base compositions have especially helpful properties. Of these, we focus on *Factorized Conditionals*, which can be interpreted as a causal model over disconnected components (see [Appendix B](https://arxiv.org/html/2606.23920#A2)).
###### Definition 2 (Bradley et al. ([2025](https://arxiv.org/html/2606.23920#bib.bib9)), modified).
A collection of distributions $(P^{0}, P^{a_1}, P^{a_2}, \dots, P^{a_k})$ over $\mathbb{R}^{n}$ are *Factorized Conditionals* if there exists a partition $M_0, M_1, \dots, M_k$ of $[n]$ such that

$$P^{a_i}(x) = P^{a_i}(x_{M_i})\, P^{0}(x_{M_i^{c}}) \qquad \text{and} \qquad P^{0}(x) = P^{0}(x_{M_0}) \prod_{i \in [k]} P^{0}(x_{M_i}), \tag{8}$$

where $M_i^{c}$ denotes the set complement of $M_i$.
Visually, this could correspond to placing objects which occur in disjoint regions of an image (e.g., couch and wall-mounted painting) against a shared background scene $P^{0}$ (e.g., empty room). If a set of distributions obeys these properties, then, assuming scores with zero error, running the naïve denoising SDE on the base-composed distribution correctly samples from ([7](https://arxiv.org/html/2606.23920#S4.E7)) at $t=0$. More generally, if ([8](https://arxiv.org/html/2606.23920#S4.E8)) holds in a latent feature space related to the input by an invertible orthogonal transform, the naïve denoising SDE still correctly samples the base composition ([7](https://arxiv.org/html/2606.23920#S4.E7)) (Bradley et al., [2025](https://arxiv.org/html/2606.23920#bib.bib9)). For instance, conditions could orthogonally represent the style (e.g. “realistic” or “impressionist”) of an image versus its content (e.g. “cat” or “dog”).
Furthermore, if a set of distributions are Factorized Conditionals, then as we prove in[Section˜D\.3](https://arxiv.org/html/2606.23920#A4.SS3), we can also make strong claims about how FKC behaves on their base composition:
###### Theorem 3\.
Assume the set of distributions $(P^{0},P^{a_1},P^{a_2},\dots,P^{a_k})$ are Factorized Conditionals. Then, for the base-composition on $(P^{0},P^{a_1},P^{a_2},\dots,P^{a_k})$, the Feynman-Kac Corrector log-weight drift vanishes:

$$g_t(x)=0\quad\text{for all}\quad x\in\mathbb{R}^{n},~t\in[0,1],$$

and the Feynman-Kac Corrector denoising process coincides exactly with the naïve denoising SDE.
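For zero-mean Gaussians, the role of factorization can be illustrated by comparing the composed noised scores against the score of the noised composition. The sketch below makes two assumptions that are not taken from the paper: a forward process of the form $x_t=\alpha x_0+\sigma\epsilon$, so that $\mathcal{N}(\mathbf{0},\Sigma)$ becomes $\mathcal{N}(\mathbf{0},\alpha^{2}\Sigma+\sigma^{2}I)$, and a naïve sampler that follows $s_t^{0}+\sum_i(s_t^{a_i}-s_t^{0})$. It uses the Gaussian settings of Section 5.1, its function names are hypothetical, and it does not compute the FKC drift $g_t$ itself.

```python
import numpy as np

def noised_precision(cov, alpha, sigma):
    """Precision of alpha * x0 + sigma * eps for x0 ~ N(0, cov), eps ~ N(0, I)."""
    return np.linalg.inv(alpha**2 * cov + sigma**2 * np.eye(cov.shape[0]))

def base_composed_cov(base, perts):
    """Covariance of P0 * prod_i (P^{a_i} / P0) for zero-mean Gaussians."""
    base_prec = np.linalg.inv(base)
    prec = base_prec + sum(np.linalg.inv(p) - base_prec for p in perts)
    return np.linalg.inv(prec)

def score_gap(base, perts, alpha, sigma):
    """Largest entry of |naive composed precision - precision of noised target|.

    A zero-mean Gaussian has score -precision @ x, so a zero gap means the
    composed noised scores equal the score of the noised composition.
    """
    base_prec = noised_precision(base, alpha, sigma)
    naive = base_prec + sum(noised_precision(p, alpha, sigma) - base_prec for p in perts)
    exact = noised_precision(base_composed_cov(base, perts), alpha, sigma)
    return float(np.max(np.abs(naive - exact)))

perts = [np.diag([10.0, 1.0]), np.diag([1.0, 10.0])]
alpha = sigma = np.sqrt(0.5)  # one arbitrary intermediate noise level (assumption)

gap_factorized = score_gap(10.0 * np.eye(2), perts, alpha, sigma)
gap_non_factorized = score_gap(20.0 * np.eye(2), perts, alpha, sigma)
print(f"factorized gap:     {gap_factorized:.2e}")
print(f"non-factorized gap: {gap_non_factorized:.2e}")
```

Under these assumptions the gap vanishes (up to floating-point error) for the factorized base $P^{0}=\mathcal{N}(\mathbf{0},10I)$ and is strictly positive for the non-factorized base $P^{0}=\mathcal{N}(\mathbf{0},20I)$, which is the inference-time approximation error that a corrector has to remove.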
## 5Experiments
The goal of our empirical evaluation is to disentangle the two highlighted sources of error in compositional generation: inference-time approximation error and score estimation error. We partition our tasks into In-Distribution (ID) settings, where the composed target $\mathbf{P}^{w}$ is well-represented within the training data sampled from the source distributions, and Out-of-Distribution (OOD) settings, where the composed target does not lie within the effective training support of any of the distributions. This affords us principled control over the score estimation error. To separate out the effects of approximation error, we further adjust distributions to either be Factorized Conditionals or Non-Factorized Conditionals.
In each experiment, we define two perturbation distributions $P^{a_1}$ and $P^{a_2}$ alongside a base distribution $P^{0}$. A conditional diffusion model $\mathbf{P}_{\theta}$ is trained to sample from these distributions (training details are provided in [Appendix F](https://arxiv.org/html/2606.23920#A6)), after which FKC is applied to this model to sample from the base-composed distribution ([7](https://arxiv.org/html/2606.23920#S4.E7)). We control the strength of the Feynman-Kac correction by altering the number of simulated particles ($K$), and we exert further control over the score estimation error by varying the number of training samples ($N$) or by using the analytic score functions. On synthetic experiments, we evaluate using the sliced Wasserstein-2 distance ($\mathrm{SW}_2$) and Maximum Mean Discrepancy ($\mathrm{MMD}^2$) compared to the ground truth base-composition ([Appendix F](https://arxiv.org/html/2606.23920#A6)). In the main text, we report $\mathrm{SW}_2$ using mean values, while [Appendix H](https://arxiv.org/html/2606.23920#A8) provides both metrics and reports standard deviation.
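For readers unfamiliar with the metric, a Monte Carlo estimate of $\mathrm{SW}_2$ between two equally sized sample sets can be sketched as follows. This is a NumPy-only illustration and not the evaluation code used for the reported numbers; the number of projections, the seeds, and the stand-in sample sets are arbitrary choices made for the example.

```python
import numpy as np

def sliced_w2(x, y, n_projections=200, seed=0):
    """Monte Carlo sliced Wasserstein-2 distance between equal-size samples."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample counts and dimensions")
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_projections, x.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # random unit directions
    # In 1D, the optimal coupling between equal-size samples matches sorted values.
    px = np.sort(x @ dirs.T, axis=0)
    py = np.sort(y @ dirs.T, axis=0)
    return float(np.sqrt(np.mean((px - py) ** 2)))

rng = np.random.default_rng(1)
target = rng.normal(size=(5000, 2))               # stand-in for target samples, N(0, I)
good = rng.normal(size=(5000, 2))                 # samples from the same distribution
bad = rng.normal(size=(5000, 2)) * np.sqrt(10.0)  # samples from N(0, 10 I)

sw_good = sliced_w2(target, good)
sw_bad = sliced_w2(target, bad)
print(f"matched: {sw_good:.3f}   mismatched: {sw_bad:.3f}")
```

Matched sample sets give a value close to zero, while confusing $\mathcal{N}(\mathbf{0},I)$ with $\mathcal{N}(\mathbf{0},10I)$ gives a value near $\sqrt{10}-1$, the one-dimensional Wasserstein-2 distance between the projected Gaussians.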
In addition to conditional models, we also evaluate compositions of “expert” models each trained only on a single condition\. Notably, we find that score estimation error worsens in these settings, suggesting that the weight\-sharing of the conditional diffusion model plays an important role in implicitly regularizing the models towards compositional accuracy \([Appendix˜E](https://arxiv.org/html/2606.23920#A5)\)\. The code for the experiments is provided at[https://github\.com/DSoiffer/compositional\-diffusion](https://github.com/DSoiffer/compositional-diffusion)\.
Figure 1: Feynman-Kac Correctors accurately estimate the composition with analytical scores, but fail with learned distributions due to out-of-distribution estimation error. Panels: (a) individual distributions; (b) composition. Learned and analytical distributions for $P^{0}=\mathcal{N}\left(\mathbf{0},\begin{bmatrix}1&0\\ 0&1\end{bmatrix}\right)$, $P^{a_1}=\mathcal{N}\left(\mathbf{0},\begin{bmatrix}10&0\\ 0&1\end{bmatrix}\right)$ and $P^{a_2}=\mathcal{N}\left(\mathbf{0},\begin{bmatrix}1&0\\ 0&10\end{bmatrix}\right)$, and the composition $P(x)\propto\frac{P^{a_1}(x)P^{a_2}(x)}{P^{0}(x)}$. Conditional diffusion models accurately learn the conditional distributions within high probability regions of the training data, but fail to compose correctly out-of-distribution even with FKC ($K=16$ particles).

### 5\.1Synthetic data: 2D Gaussian
Table 1: The relative magnitudes of inference-time approximation error and score estimation error are highly dependent on problem setting. FKC is applied with $K$ particles to a conditional diffusion model trained on $N$ samples, or to the analytic score functions. Each value is the $\mathrm{SW}_2$ between 5000 samples from the target distribution $\mathbf{P}^{w}$ and 5000 empirical samples, averaged over 30 training runs. Panels: (a) factorized conditionals + ID; (b) non-factorized conditionals + ID; (c) factorized conditionals + OOD; (d) non-factorized conditionals + OOD.
We begin with a two-dimensional Gaussian toy setting, where all key quantities have a closed form. Throughout, we take $P^{a_1}=\mathcal{N}\left(\mathbf{0},\begin{bmatrix}10&0\\ 0&1\end{bmatrix}\right)$ and $P^{a_2}=\mathcal{N}\left(\mathbf{0},\begin{bmatrix}1&0\\ 0&10\end{bmatrix}\right)$. The base distribution $P^{0}$ varies across four settings, described below. Results are presented in [Table 1](https://arxiv.org/html/2606.23920#S5.T1).
Factorized conditionals. For the in-distribution setting, we set $P^{0}=\mathcal{N}\left(\mathbf{0},\begin{bmatrix}10&0\\ 0&10\end{bmatrix}\right)$, yielding (by ([6](https://arxiv.org/html/2606.23920#S4.E6))) the target composition $\mathbf{P}^{w}=\mathcal{N}\left(\mathbf{0},\begin{bmatrix}1&0\\ 0&1\end{bmatrix}\right)$. Because this composed distribution lies within the effective training support of $P^{0}$, $P^{a_1}$, and $P^{a_2}$, naïve denoising is already highly accurate. Consequently, applying FKC yields no improvements, and actively degrades performance for undertrained models. For the OOD configuration, we set $P^{0}=\mathcal{N}\left(\mathbf{0},\begin{bmatrix}1&0\\ 0&1\end{bmatrix}\right)$, yielding $\mathbf{P}^{w}=\mathcal{N}\left(\mathbf{0},\begin{bmatrix}10&0\\ 0&10\end{bmatrix}\right)$. Although naïve denoising achieves a reasonable fit, out-of-distribution estimation error remains problematic. When FKC is applied, this error compounds drastically, and the samples quickly diverge from the correct distribution as the particle count increases, as shown in [Figure 1](https://arxiv.org/html/2606.23920#S5.F1).
Non-factorized conditionals. For the in-distribution setting, we set $P^{0}=\mathcal{N}\left(\mathbf{0},\begin{bmatrix}20&0\\ 0&20\end{bmatrix}\right)$, yielding $\mathbf{P}^{w}=\mathcal{N}\left(\mathbf{0},\frac{20}{21}I\right)$. Here, naïve denoising consistently yields a poor distributional fit, as expected given the approximation error. However, FKC successfully leverages well-trained models to correct this approximation error as the particle count increases. In the OOD setting, we set $P^{0}=\mathcal{N}\left(\mathbf{0},\begin{bmatrix}1.1&0\\ 0&1.1\end{bmatrix}\right)$, so $\mathbf{P}^{w}=\mathcal{N}\left(\mathbf{0},\frac{110}{21}I\right)$. In this regime, distributional fit remains poor across all non-analytical settings, and FKC strictly worsens performance when applied to learned models. It is only when the algorithm is supplied with the exact analytical scores that it achieves the expected theoretical improvements.
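The four target covariances quoted above can be reproduced from the Gaussian product rule: for zero-mean Gaussians, the composition $P^{a_1}P^{a_2}/P^{0}$ has precision $\Sigma_{a_1}^{-1}+\Sigma_{a_2}^{-1}-\Sigma_{0}^{-1}$ whenever this matrix is positive definite. The following is a minimal check with illustrative names, not the paper's code.

```python
import numpy as np

def base_composed_cov(base, perts):
    """Covariance of the zero-mean Gaussian proportional to prod_i P^{a_i} / P0^(k-1)."""
    prec = sum(np.linalg.inv(p) for p in perts) - (len(perts) - 1) * np.linalg.inv(base)
    if np.any(np.linalg.eigvalsh(prec) <= 0):
        raise ValueError("composition is not a normalizable Gaussian")
    return np.linalg.inv(prec)

perts = [np.diag([10.0, 1.0]), np.diag([1.0, 10.0])]
base_scales = {                  # P0 = scale * I in each setting
    "factorized, ID": 10.0,
    "factorized, OOD": 1.0,
    "non-factorized, ID": 20.0,
    "non-factorized, OOD": 1.1,
}
targets = {
    name: base_composed_cov(scale * np.eye(2), perts)
    for name, scale in base_scales.items()
}
for name, cov in targets.items():
    print(f"{name:20s} target variance per axis = {cov[0, 0]:.4f}")
```

The printed variances correspond to $1$, $10$, $20/21$ and $110/21$, matching the targets stated in the text.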
### 5\.2Synthetic data: mixture of Gaussians
Table 2: The expected trends also hold for Gaussian mixture models. The table follows the same conventions as [Table 1](https://arxiv.org/html/2606.23920#S5.T1). Panels: (a) in-distribution; (b) out-of-distribution.
In a slightly more complex synthetic setting, we consider two GMM experiments that both violate the factorized conditionals assumption. In both experiments, $\mathcal{X}=\mathbb{R}^{2}$ and we define $P^{0}$, $P^{a_1}$, and $P^{a_2}$ as mixtures of Gaussians, where each component has covariance matrix $\sigma I$. Because $\mathbf{P}^{w}$ is a ratio of Gaussian mixtures, its normalizing constant is analytically intractable and the resulting density is not itself a GMM. We therefore obtain ground-truth samples via rejection sampling for the in-distribution setting and importance sampling for the out-of-distribution setting ([Section F.2](https://arxiv.org/html/2606.23920#A6.SS2)). Results are provided in [Table 2](https://arxiv.org/html/2606.23920#S5.T2).
For the in-distribution setting, component means are arranged on a $2\times 3$ grid, $\mu_k\in\{-3,0,3\}\times\{-2.5,2.5\}$, and we set $\sigma=0.6$. The base $P^{0}$ assigns uniform weight to all six components. Conditions restrict to overlapping subsets: $P^{a_1}$ is uniform over $\{k_0,k_1,k_2,k_3\}$, and $P^{a_2}$ is uniform over $\{k_1,k_2,k_4,k_5\}$. This configuration is explicitly constructed to isolate approximation error: the unnormalized target density $P^{a_1}(x)P^{a_2}(x)/P^{0}(x)$ is confined to regions with high probability mass under the conditionals. By ensuring the composed target modes are well-represented in the training data, score estimation error is minimized. In this regime, FKC successfully corrects for the non-factorized nature of the composition, aligning the empirical distributions with the ground truth. Only for poorly learned models ($N=100$) and at large particle counts does further increasing the number of particles begin to harm performance.
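One way to draw ground-truth samples in this in-distribution setting is rejection sampling with $P^{a_1}$ as the proposal: since the components of $P^{a_2}$ are a subset of those of $P^{0}$, the ratio $P^{a_2}(x)/P^{0}(x)$ is bounded by $(1/4)/(1/6)=1.5$. The sketch below is an illustration under stated assumptions and is not necessarily the procedure of Section F.2; in particular, the assignment of the labels $k_0,\dots,k_5$ to grid positions is an assumption, since the text does not specify it.

```python
import numpy as np

SIGMA = 0.6  # each component has covariance SIGMA * I, as stated in the text
# Assumed labelling k0..k5 of the 2x3 grid (the text does not fix the order).
MEANS = np.array([[-3, -2.5], [0, -2.5], [3, -2.5],
                  [-3, 2.5], [0, 2.5], [3, 2.5]], dtype=float)
IDX_A1, IDX_A2, IDX_BASE = [0, 1, 2, 3], [1, 2, 4, 5], [0, 1, 2, 3, 4, 5]

def gmm_pdf(x, idx):
    """Density of the uniform mixture over components idx at the rows of x."""
    d2 = ((x[:, None, :] - MEANS[idx][None, :, :]) ** 2).sum(axis=-1)
    comp = np.exp(-d2 / (2 * SIGMA)) / (2 * np.pi * SIGMA)
    return comp.mean(axis=1)

def gmm_sample(n, idx, rng):
    """Draw n samples from the uniform mixture over components idx."""
    ks = rng.choice(idx, size=n)
    return MEANS[ks] + np.sqrt(SIGMA) * rng.normal(size=(n, 2))

def sample_target(n_proposals, rng):
    """Rejection sampling for P^{a1} P^{a2} / P0 using P^{a1} as the proposal."""
    x = gmm_sample(n_proposals, IDX_A1, rng)
    ratio = gmm_pdf(x, IDX_A2) / gmm_pdf(x, IDX_BASE)
    bound = len(IDX_BASE) / len(IDX_A2)  # sup of P^{a2} / P0 is at most 6/4
    accept = rng.uniform(size=n_proposals) < ratio / bound
    return x[accept]

rng = np.random.default_rng(0)
samples = sample_target(20000, rng)
frac_lower_row = float(np.mean(samples[:, 1] < 0))
print(len(samples), frac_lower_row)
```

With this labelling, the accepted samples concentrate around the two shared components $k_1$ and $k_2$ (the lower row at $x$-coordinates $0$ and $3$), which are high-density regions of both conditionals.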
For the out-of-distribution setting, we design the modes so the target distribution concentrates at locations that are low-probability under all conditionals. $P^{a_1}$ is a two-component GMM with means $\{(-6,1.5),(0,0)\}$, $P^{a_2}$ has means $\{(0,-9),(-1.5,6)\}$, and $P^{0}$ has all four means with uniform weight, all with $\sigma=1$. While small increases in $K$ yield initial improvements by correcting for approximation error, further scaling $K$ worsens results due to accumulating estimation error; increasing the number of particles is only consistently helpful with oracle scores.
### 5\.3Semi\-synthetic data: Objects in a room
Figure 2: Couches and paintings (approximately factorized) are largely combined accurately, while couches and tables (non-factorized) are not; FKC is only effective when the target composition is ID. Illustrative samples from base-composed distributions across four experimental settings, comparing naïve denoising ($K=1$ particles) against FKC ($K=16$). In (a) and (c), the target distribution is highly concentrated on otherwise empty rooms with exactly one couch *and* one painting; in (b) and (d), it is concentrated on rooms with exactly one couch *and* one coffee table. Underlying conditional mixtures are altered to render the target composition ID or OOD. (a) Naïve denoising generates couches *and* paintings correctly but occasionally omits one; $K=16$ causes both objects to appear by correcting for mild non-factorization-induced approximation error. (b) Naïve denoising frequently fails to generate both a couch *and* a table and often warps results; $K=16$ corrects for non-factorization, generating both. (c) Naïve samples occasionally contain both objects, but they frequently appear alone; with $K=16$ results are similar but tend to be more warped as OOD estimation error accumulates. (d) The most challenging setting; couches and tables never appear simultaneously and warping is severe.

To evaluate these dynamics on more realistic, higher-dimensional visual data, we create a semi-synthetic dataset consisting of realistic room images generated via a text-to-image model. The dataset is partitioned into discrete underlying classes: empty rooms, rooms containing a single object, and rooms containing exactly two objects. Each base and conditional distribution is a mixture of images from four underlying classes: “empty,” “couch,” a second object “$X$,” and “couch + $X$.” We alter object $X$ to either “framed painting” or “coffee table” to control the factorization of the distributions. When $X$ is a framed painting, the conditionals are approximately factorized, as couches and paintings typically occupy disjoint spatial regions in a room (floor versus wall). Conversely, when $X$ is a coffee table, factorization breaks, as both objects compete for overlapping spatial support on the floor.
We define the ID and OOD settings by adjusting the probability weights across the four underlying classes, represented as the vector $(P_{\text{empty}},P_{\text{couch}},P_{X},P_{\text{couch}+X})$. In the in-distribution setting, $P^{0}$ is a uniform mixture $(0.25,0.25,0.25,0.25)$, and the perturbation distributions are heavily biased toward the presence of their respective objects: $P^{a_1}$ is set to $(\alpha,0.5-\alpha,\alpha,0.5-\alpha)$ and $P^{a_2}$ is set to $(\alpha,\alpha,0.5-\alpha,0.5-\alpha)$, where we use a small smoothing factor $\alpha=0.01$ to ensure the composition remains well-defined. Under this construction, the composed target distribution places nearly all of its probability mass on the “couch + $X$” class, which is well-supported within each condition’s training data. In the OOD setting, the base distribution is defined as $(1.0-3\beta,\beta,\beta,\beta)$ with $\beta=0.05/3$, heavily skewing it toward the “empty” class. $P^{a_1}$ and $P^{a_2}$ are identically skewed toward the isolated “couch” and “$X$” classes, respectively. As a result, the composed target still concentrates almost entirely on the “couch + $X$” class, which is *not* well-supported within each condition’s training data.
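The in-distribution claim can be checked with a short calculation. The sketch below covers only the in-distribution weights and assumes that the four classes have disjoint support in image space, so that the composition $P^{a_1}P^{a_2}/P^{0}$ acts class by class on the mixture weights; this simplification is an assumption of the example rather than a statement from the paper, and the function name is illustrative.

```python
import numpy as np

CLASSES = ("empty", "couch", "X", "couch + X")

def composed_class_weights(p_a1, p_a2, p_0):
    """Class masses of P^{a1} P^{a2} / P0, assuming classes with disjoint support."""
    w = np.asarray(p_a1) * np.asarray(p_a2) / np.asarray(p_0)
    return w / w.sum()

alpha = 0.01  # smoothing factor from the text
p_0 = [0.25, 0.25, 0.25, 0.25]
p_a1 = [alpha, 0.5 - alpha, alpha, 0.5 - alpha]
p_a2 = [alpha, alpha, 0.5 - alpha, 0.5 - alpha]

id_weights = composed_class_weights(p_a1, p_a2, p_0)
for name, weight in zip(CLASSES, id_weights):
    print(f"{name:10s} {weight:.4f}")
```

Under this assumption, roughly 96% of the composed mass falls on the “couch + $X$” class, consistent with the statement that the in-distribution target is concentrated there.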
Illustrative samples from the resulting composed distributions are provided in Figure[2](https://arxiv.org/html/2606.23920#S5.F2)\. Additional samples are given in[Appendix˜G](https://arxiv.org/html/2606.23920#A7), and experimental details are given in[Section˜F\.3](https://arxiv.org/html/2606.23920#A6.SS3)\.
## 6Discussion
In this work, we investigated the limits of compositional generation, presenting a critical negative result for the field: vanilla conditional diffusion models fundamentally struggle to compose when the target distribution is out\-of\-distribution with respect to the source distributions\. To conclude, we highlight how this finding should inform future research, along with limitations of our setup\.
Key takeaways. Crucially, our findings pertain to “vanilla” diffusion models: networks trained with standard objectives and shared parameters, using strictly inference-time composition procedures. Hence, our results suggest that achieving better compositional generation will require interventions earlier in the pipeline, e.g. by training architectures that learn latent representations, including latent diffusion models (Rombach et al., [2022](https://arxiv.org/html/2606.23920#bib.bib81); Podell et al., [2024](https://arxiv.org/html/2606.23920#bib.bib82)) and, more aptly, models that are explicitly designed for compositional generation, such as object-centric or causal generative models (Wu et al., [2023](https://arxiv.org/html/2606.23920#bib.bib78); Jiang et al., [2023](https://arxiv.org/html/2606.23920#bib.bib79); Komanduri et al., [2024](https://arxiv.org/html/2606.23920#bib.bib80)), or by specialized post-training procedures that encourage the model to extrapolate from high-density regions to low-density ones.
Limitations. We reiterate that our experimental findings only pertain to “vanilla” (conditional) diffusion models. We expect, but have not verified, that they extend to other “vanilla” conditional generative models; this requires empirical validation. In our experiments, we only considered base-composed distributions: a more in-depth study across different weightings may provide deeper insights. Finally, because they focus only on a toy setting that is exactly analytically solvable, our theoretical results are highly specialized, leaving open fundamental questions about the effect of score estimation error in more general settings.
## Acknowledgements
This research was developed with funding from the Defense Advanced Research Projects Agency \(DARPA\) via HR0011\-25\-3\-0239, FA8750\-23\-2\-1015, ONR via N00014\-23\-1\-2368, and NSF via IIS\-1909816\. JH is supported by the Centre for AI Fundamentals and the UKRI GenAI Hub\.
## References
- A\. Adduri, D\. Gautam, B\. Bevilacqua, A\. Imran, R\. Shah, M\. Naghipourfar, N\. Teyssier, R\. Ilango, S\. Nagaraj, C\. Ricci\-Tam, C\. Carpenter, V\. Subramanyam, A\. Winters, M\. Dong, S\. Tirukkovalur, J\. Sullivan, B\. Plosky, B\. Eraslan, N\. D\. Youngblut, J\. Leskovec, L\. A\. Gilbert, S\. Konermann, P\. D\. Hsu, A\. Dobin, D\. P\. Burke, H\. Goodarzi, and Y\. H\. Roohani \(2025\)Predicting cellular responses to perturbation across diverse contexts with state\.InNeurIPS 2025 2nd Workshop on Multi\-modal Foundation Models and Large Language Models for Life Sciences,Cited by:[§1](https://arxiv.org/html/2606.23920#S1.p1.1)\.
- K\. Ahuja, D\. Mahajan, Y\. Wang, and Y\. Bengio \(2023\)Interventional causal representation learning\.InInternational Conference on Machine Learning,pp\. 372–407\.Cited by:[Appendix B](https://arxiv.org/html/2606.23920#A2.SS0.SSS0.Px1.p1.5)\.
- M\. Albergo, N\. M\. Boffi, and E\. Vanden\-Eijnden \(2025\)Stochastic interpolants: a unifying framework for flows and diffusions\.Journal of Machine Learning Research26\(209\),pp\. 1–80\.Cited by:[§3\.1](https://arxiv.org/html/2606.23920#S3.SS1.SSS0.Px3.p1.2)\.
- A\. Bradley, P\. Nakkiran, D\. Berthelot, J\. Thornton, and J\. M\. Susskind \(2025\)Mechanisms of projective composition of diffusion models\.InForty\-second International Conference on Machine Learning,Cited by:[Appendix B](https://arxiv.org/html/2606.23920#A2.SS0.SSS0.Px2.p2.1),[§4\.2\.1](https://arxiv.org/html/2606.23920#S4.SS2.SSS1.p2.2),[Definition 2](https://arxiv.org/html/2606.23920#Thmdefinition2)\.
- S\. Buchholz, G\. Rajendran, E\. Rosenfeld, B\. Aragam, B\. Schölkopf, and P\. Ravikumar \(2023\)Learning linear causal representations from interventions under general nonlinear mixing\.Advances in Neural Information Processing Systems36,pp\. 45419–45462\.Cited by:[Appendix B](https://arxiv.org/html/2606.23920#A2.SS0.SSS0.Px1.p1.5)\.
- C\. Bunne, Y\. Roohani, Y\. Rosen, A\. Gupta, X\. Zhang, M\. Roed, T\. Alexandrov, M\. AlQuraishi, P\. Brennan, D\. B\. Burkhardt, A\. Califano, J\. Cool, A\. F\. Dernburg, K\. Ewing, E\. B\. Fox, M\. Haury, A\. E\. Herr, E\. Horvitz, P\. D\. Hsu, V\. Jain, G\. R\. Johnson, T\. Kalil, D\. R\. Kelley, S\. O\. Kelley, A\. Kreshuk, T\. Mitchison, S\. Otte, J\. Shendure, N\. J\. Sofroniew, F\. Theis, C\. V\. Theodoris, S\. Upadhyayula, M\. Valer, B\. Wang, E\. Xing, S\. Yeung\-Levy, M\. Zitnik, T\. Karaletsos, A\. Regev, E\. Lundberg, J\. Leskovec, and S\. R\. Quake \(2024\)How to build the virtual cell with artificial intelligence: priorities and opportunities\.Cell187\(25\),pp\. 7045–7063\.External Links:ISSN 0092\-8674Cited by:[§1](https://arxiv.org/html/2606.23920#S1.p1.1)\.
- A\. Canatar, B\. Bordelon, and C\. Pehlevan \(2021\)Out\-of\-distribution generalization in kernel regression\.Advances in Neural Information Processing Systems34,pp\. 12600–12612\.Cited by:[§2](https://arxiv.org/html/2606.23920#S2.SS0.SSS0.Px2.p3.13)\.
- S\. B\. David, T\. Lu, T\. Luu, and D\. Pál \(2010\)Impossibility theorems for domain adaptation\.InProceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics,pp\. 129–136\.Cited by:[§2](https://arxiv.org/html/2606.23920#S2.SS0.SSS0.Px2.p3.13)\.
- P\. Dhariwal and A\. Nichol \(2021\)Diffusion models beat gans on image synthesis\.Advances in Neural Information Processing Systems34,pp\. 8780–8794\.Cited by:[§1](https://arxiv.org/html/2606.23920#S1.p3.2)\.
- R\. Douc and O\. Cappe \(2005\)Comparison of resampling schemes for particle filtering\.InISPA 2005\. Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis, 2005\.,pp\. 64–69\.External Links:[Document](https://dx.doi.org/10.1109/ISPA.2005.195385)Cited by:[Appendix C](https://arxiv.org/html/2606.23920#A3.p6.4)\.
- R\. Flamary, N\. Courty, A\. Gramfort, M\. Z\. Alaya, A\. Boisbunon, S\. Chambon, L\. Chapel, A\. Corenflos, K\. Fatras, N\. Fournier, L\. Gautheron, N\. T\.H\. Gayraud, H\. Janati, A\. Rakotomamonjy, I\. Redko, A\. Rolet, A\. Schutz, V\. Seguy, D\. J\. Sutherland, R\. Tavenard, A\. Tong, and T\. Vayer \(2021\)POT: python optimal transport\.Journal of Machine Learning Research22\(78\),pp\. 1–8\.Cited by:[§F\.1](https://arxiv.org/html/2606.23920#A6.SS1.SSS0.Px6.p1.3)\.
- R\. Flamary, C\. Vincent\-Cuaz, N\. Courty, A\. Gramfort, O\. Kachaiev, H\. Quang Tran, L\. David, C\. Bonet, N\. Cassereau, T\. Gnassounou, E\. Tanguy, J\. Delon, A\. Collas, S\. Mazelet, L\. Chapel, T\. Kerdoncuff, X\. Yu, M\. Feickert, P\. Krzakala, T\. Liu, and E\. Fernandes Montesuma \(2024\)POT python optimal transport \(version 0\.9\.5\)\.External Links:[Link](https://github.com/PythonOT/POT)Cited by:[§F\.1](https://arxiv.org/html/2606.23920#A6.SS1.SSS0.Px6.p1.3)\.
- J\. Ho, A\. Jain, and P\. Abbeel \(2020\)Denoising diffusion probabilistic models\.InAdvances in Neural Information Processing Systems,H\. Larochelle, M\. Ranzato, R\. Hadsell, M\.F\. Balcan, and H\. Lin \(Eds\.\),Vol\.33,pp\. 6840–6851\.Cited by:[§F\.1](https://arxiv.org/html/2606.23920#A6.SS1.SSS0.Px2.p1.1),[§F\.3](https://arxiv.org/html/2606.23920#A6.SS3.SSS0.Px3.p1.8),[§3\.1](https://arxiv.org/html/2606.23920#S3.SS1.SSS0.Px3.p1.2)\.
- J\. Ho and T\. Salimans \(2022\)Classifier\-free diffusion guidance\.arXiv preprint arXiv:2207\.12598\.Cited by:[§3\.1](https://arxiv.org/html/2606.23920#S3.SS1.SSS0.Px3.p1.2),[§3\.2](https://arxiv.org/html/2606.23920#S3.SS2.p1.4)\.
- A\. Hyvärinen \(2005\)Estimation of non\-normalized statistical models by score matching\.Journal of Machine Learning Research6,pp\. 695–709\.Cited by:[§3\.1](https://arxiv.org/html/2606.23920#S3.SS1.SSS0.Px3.p1.2)\.
- J\. Jiang, F\. Deng, G\. Singh, and S\. Ahn \(2023\)Object\-centric slot diffusion\.InAdvances in Neural Information Processing Systems,A\. Oh, T\. Naumann, A\. Globerson, K\. Saenko, M\. Hardt, and S\. Levine \(Eds\.\),Vol\.36,pp\. 8563–8601\.Cited by:[§6](https://arxiv.org/html/2606.23920#S6.p2.1)\.
- D\. P\. Kingma and J\. Ba \(2014\)Adam: a method for stochastic optimization\.arXiv preprint arXiv:1412\.6980\.Cited by:[§F\.1](https://arxiv.org/html/2606.23920#A6.SS1.SSS0.Px3.p1.7)\.
- A\. Komanduri, X\. Wu, Y\. Wu, and F\. Chen \(2024\)From identifiable causal representations to controllable counterfactual generation: a survey on causal generative modeling\.Transactions on Machine Learning Research\.External Links:ISSN 2835\-8856Cited by:[§6](https://arxiv.org/html/2606.23920#S6.p2.1)\.
- S\. Kpotufe and G\. Martinet \(2018\)Marginal singularity, and the benefits of labels in covariate\-shift\.InConference On Learning Theory,pp\. 1882–1886\.Cited by:[§2](https://arxiv.org/html/2606.23920#S2.SS0.SSS0.Px2.p3.13)\.
- B\. F\. Labs, S\. Batifol, A\. Blattmann, F\. Boesel, S\. Consul, C\. Diagne, T\. Dockhorn, J\. English, Z\. English, P\. Esser, S\. Kulal, K\. Lacey, Y\. Levi, C\. Li, D\. Lorenz, J\. Müller, D\. Podell, R\. Rombach, H\. Saini, A\. Sauer, and L\. Smith \(2025\)FLUX\.1 kontext: flow matching for in\-context image generation and editing in latent space\.External Links:2506\.15742Cited by:[§F\.3](https://arxiv.org/html/2606.23920#A6.SS3.SSS0.Px1.p1.1)\.
- B\. F\. Labs \(2024\)FLUX\.Note:[https://github\.com/black\-forest\-labs/flux](https://github.com/black-forest-labs/flux)Cited by:[§F\.3](https://arxiv.org/html/2606.23920#A6.SS3.SSS0.Px1.p1.1)\.
- S\. Lin, B\. Liu, J\. Li, and X\. Yang \(2024\)Common diffusion noise schedules and sample steps are flawed\.InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision,pp\. 5404–5411\.Cited by:[§F\.3](https://arxiv.org/html/2606.23920#A6.SS3.SSS0.Px3.p1.8)\.
- Y\. Lipman, R\. T\. Q\. Chen, H\. Ben\-Hamu, M\. Nickel, and M\. Le \(2023\)Flow matching for generative modeling\.InThe Eleventh International Conference on Learning Representations,Cited by:[§3\.1](https://arxiv.org/html/2606.23920#S3.SS1.SSS0.Px3.p1.2)\.
- N\. Liu, S\. Li, Y\. Du, A\. Torralba, and J\. B\. Tenenbaum \(2022\)Compositional visual generation with composable diffusion models\.InEuropean Conference on Computer Vision,pp\. 423–439\.Cited by:[§1](https://arxiv.org/html/2606.23920#S1.p3.2),[§3\.2](https://arxiv.org/html/2606.23920#S3.SS2.p1.4)\.
- I\. Loshchilov and F\. Hutter \(2019\)Decoupled weight decay regularization\.InInternational Conference on Learning Representations,Cited by:[§F\.3](https://arxiv.org/html/2606.23920#A6.SS3.SSS0.Px4.p1.6)\.
- M\. Lotfollahi, A\. K\. Susmelj, C\. De Donno, L\. Hetzel, Y\. Ji, I\. L\. Ibarra, S\. R\. Srivatsan, M\. Naghipourfar, R\. M\. Daza, B\. Martin, E\. L\. Aiden, J\. Shendure, J\. L\. McFaline\-Figueroa, P\. Boyeau, and F\. J\. Theis \(2023\)Predicting cellular responses to complex perturbations in high\-throughput screens\.Molecular Systems Biology19\(6\),pp\. e11517\.Cited by:[§1](https://arxiv.org/html/2606.23920#S1.p1.1)\.
- J\. Müller, R\. Schmier, L\. Ardizzone, C\. Rother, and U\. Köthe \(2021\)Learning robust models using the principle of independent causal mechanisms\.InDAGM German Conference on Pattern Recognition,pp\. 79–110\.Cited by:[Appendix B](https://arxiv.org/html/2606.23920#A2.SS0.SSS0.Px1.p2.10)\.
- A\. Q\. Nichol and P\. Dhariwal \(2021\)Improved denoising diffusion probabilistic models\.InInternational Conference on Machine Learning,pp\. 8162–8171\.Cited by:[§F\.3](https://arxiv.org/html/2606.23920#A6.SS3.SSS0.Px3.p1.8)\.
- E\. Noutahi, J\. Hartford, P\. Tossou, S\. Whitfield, A\. K\. Denton, C\. Wognum, K\. Ulicna, M\. Craig, J\. Hsu, M\. Cuccarese, E\. Bengio, D\. Beaini, C\. Gibson, D\. Cohen, and B\. Earnshaw \(2025\)Virtual cells: predict, explain, discover\.arXiv preprint arXiv:2505\.14613\.Cited by:[§1](https://arxiv.org/html/2606.23920#S1.p1.1)\.
- M\. Oquab, T\. Darcet, T\. Moutakanni, H\. V\. Vo, M\. Szafraniec, V\. Khalidov, P\. Fernandez, D\. HAZIZA, F\. Massa, A\. El\-Nouby, M\. Assran, N\. Ballas, W\. Galuba, R\. Howes, P\. Huang, S\. Li, I\. Misra, M\. Rabbat, V\. Sharma, G\. Synnaeve, H\. Xu, H\. Jegou, J\. Mairal, P\. Labatut, A\. Joulin, and P\. Bojanowski \(2024\)DINOv2: learning robust visual features without supervision\.Transactions on Machine Learning Research\.Note:Featured CertificationExternal Links:ISSN 2835\-8856Cited by:[§F\.3](https://arxiv.org/html/2606.23920#A6.SS3.SSS0.Px1.p4.1)\.
- E\. Perez, F\. Strub, H\. de Vries, V\. Dumoulin, and A\. Courville \(2018\)FiLM: visual reasoning with a general conditioning layer\.InProceedings of the Thirty\-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence,AAAI’18/IAAI’18/EAAI’18\.External Links:ISBN 978\-1\-57735\-800\-8Cited by:[§3\.1](https://arxiv.org/html/2606.23920#S3.SS1.SSS0.Px3.p1.2)\.
- D\. Podell, Z\. English, K\. Lacey, A\. Blattmann, T\. Dockhorn, J\. Müller, J\. Penna, and R\. Rombach \(2024\)SDXL: improving latent diffusion models for high\-resolution image synthesis\.InInternational Conference on Learning Representations,B\. Kim, Y\. Yue, S\. Chaudhuri, K\. Fragkiadaki, M\. Khan, and Y\. Sun \(Eds\.\),Vol\.2024,pp\. 1862–1874\.Cited by:[§6](https://arxiv.org/html/2606.23920#S6.p2.1)\.
- Y\. Ren, W\. Gao, L\. Ying, G\. M\. Rotskoff, and J\. Han \(2026\)DriftLite: lightweight drift control for inference\-time scaling of diffusion models\.InThe Fourteenth International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.23920#S1.p4.1)\.
- R\. Rombach, A\. Blattmann, D\. Lorenz, P\. Esser, and B\. Ommer \(2022\)High\-resolution image synthesis with latent diffusion models\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 10684–10695\.Cited by:[§6](https://arxiv.org/html/2606.23920#S6.p2.1)\.
- O\. Ronneberger, P\. Fischer, and T\. Brox \(2015\)U\-net: convolutional networks for biomedical image segmentation\.InInternational Conference on Medical image computing and computer\-assisted intervention,pp\. 234–241\.Cited by:[§F\.3](https://arxiv.org/html/2606.23920#A6.SS3.SSS0.Px2.p1.4)\.
- Y\. Roohani, K\. Huang, and J\. Leskovec \(2024\)Predicting transcriptional outcomes of novel multigene perturbations with gears\.Nature Biotechnology42,pp\. 927–935\.Cited by:[§1](https://arxiv.org/html/2606.23920#S1.p1.1)\.
- T\. Salimans and J\. Ho \(2022\)Progressive distillation for fast sampling of diffusion models\.InInternational Conference on Learning Representations,Cited by:[§F\.3](https://arxiv.org/html/2606.23920#A6.SS3.SSS0.Px3.p1.8)\.
- S\. Särkkä and A\. Solin \(2019\)Applied stochastic differential equations\.Institute of Mathematical Statistics Textbooks,Cambridge University Press\.Cited by:[§D\.2](https://arxiv.org/html/2606.23920#A4.SS2.SSS0.Px2.p1.1)\.
- B\. Schölkopf, F\. Locatello, S\. Bauer, N\. R\. Ke, N\. Kalchbrenner, A\. Goyal, and Y\. Bengio \(2021\)Toward causal representation learning\.Proceedings of the IEEE109\(5\),pp\. 612–634\.Cited by:[§2](https://arxiv.org/html/2606.23920#S2.SS0.SSS0.Px1.p2.6)\.
- M\. Skreta, T\. Akhound\-Sadegh, V\. Ohanesian, R\. Bondesan, A\. Aspuru\-Guzik, A\. Doucet, R\. Brekelmans, A\. Tong, and K\. Neklyudov \(2025\)Feynman\-kac correctors in diffusion: annealing, guidance, and product of experts\.InForty\-second International Conference on Machine Learning,Cited by:[Appendix C](https://arxiv.org/html/2606.23920#A3.p1.1),[§F\.1](https://arxiv.org/html/2606.23920#A6.SS1.SSS0.Px5.p1.4),[§1](https://arxiv.org/html/2606.23920#S1.p4.1),[§2](https://arxiv.org/html/2606.23920#S2.SS0.SSS0.Px2.p3.13),[§3\.2](https://arxiv.org/html/2606.23920#S3.SS2.p3.2)\.
- Y\. Song, J\. Sohl\-Dickstein, D\. P\. Kingma, A\. Kumar, S\. Ermon, and B\. Poole \(2021\)Score\-based generative modeling through stochastic differential equations\.InInternational Conference on Learning Representations,Cited by:[§F\.1](https://arxiv.org/html/2606.23920#A6.SS1.SSS0.Px4.p1.7),[§3\.1](https://arxiv.org/html/2606.23920#S3.SS1.SSS0.Px2.p1.19),[§3\.1](https://arxiv.org/html/2606.23920#S3.SS1.SSS0.Px3.p1.2)\.
- C\. Squires, A\. Seigal, S\. S\. Bhate, and C\. Uhler \(2023\)Linear causal disentanglement via interventions\.InInternational Conference on Machine Learning,pp\. 32540–32560\.Cited by:[Appendix B](https://arxiv.org/html/2606.23920#A2.SS0.SSS0.Px1.p1.5)\.
- C\. Squires and C\. Uhler \(2023\)Causal structure learning: a combinatorial perspective\.Foundations of Computational Mathematics23\(5\),pp\. 1781–1815\.Cited by:[§2](https://arxiv.org/html/2606.23920#S2.SS0.SSS0.Px1.p1.11)\.
- B\. Varici, E\. Acartürk, K\. Shanmugam, A\. Kumar, and A\. Tajer \(2025\)Score\-based causal representation learning: linear and general transformations\.Journal of Machine Learning Research26\(112\),pp\. 1–90\.Cited by:[Appendix B](https://arxiv.org/html/2606.23920#A2.SS0.SSS0.Px1.p1.5)\.
- B\. Varici, E\. Acartürk, K\. Shanmugam, and A\. Tajer \(2024\)General identifiability and achievability for causal representation learning\.InInternational Conference on Artificial Intelligence and Statistics,pp\. 2314–2322\.Cited by:[Appendix B](https://arxiv.org/html/2606.23920#A2.SS0.SSS0.Px1.p1.5)\.
- B\. Varıcı, E\. Acartürk, K\. Shanmugam, and A\. Tajer \(2024\)Linear causal representation learning from unknown multi\-node interventions\.Advances in Neural Information Processing Systems\.Cited by:[Appendix B](https://arxiv.org/html/2606.23920#A2.SS0.SSS0.Px1.p1.5)\.
- B\. Varıcı, C\. Squires, and P\. Ravikumar \(2026\)Causal representation learning\.Neurosymbolic AI: Foundations and Applications,pp\. 307–346\.Cited by:[Appendix B](https://arxiv.org/html/2606.23920#A2.p1.6)\.
- P\. Vincent \(2011\)A connection between score matching and denoising autoencoders\.Neural Computation23\(7\),pp\. 1661–1674\.Cited by:[§3\.1](https://arxiv.org/html/2606.23920#S3.SS1.SSS0.Px3.p1.2)\.
- P\. von Platen, S\. Patil, A\. Lozhkov, P\. Cuenca, N\. Lambert, K\. Rasul, M\. Davaadorj, D\. Nair, S\. Paul, W\. Berman, Y\. Xu, S\. Liu, and T\. Wolf \(2022\)Diffusers: state\-of\-the\-art diffusion models\.GitHub\.Note:[https://github\.com/huggingface/diffusers](https://github.com/huggingface/diffusers)Cited by:[§F\.3](https://arxiv.org/html/2606.23920#A6.SS3.SSS0.Px2.p1.4)\.
- G\. Wang, T\. Liu, J\. Zhao, Y\. Cheng, and H\. Zhao \(2024\)Modeling and predicting single\-cell multi\-gene perturbation responses with sclambda\.bioRxiv\.Cited by:[§1](https://arxiv.org/html/2606.23920#S1.p1.1)\.
- Z\. Wang, L\. Gui, J\. Negrea, and V\. Veitch \(2023\)Concept algebra for \(score\-based\) text\-controlled generative models\.Advances in Neural Information Processing Systems\.Cited by:[§1](https://arxiv.org/html/2606.23920#S1.p2.6)\.
- F\. Wenkel, W\. Tu, C\. Masschelein, H\. Shirzad, L\. Hodgson, I\. Bendidi, C\. Eastwood, S\. T\. Whitfield, C\. Russell, Y\. El Mesbahi, J\. Ding, M\. M\. Fay, B\. Earnshaw, E\. Noutahi, and A\. K\. Denton \(2026\)TxPert: using multiple knowledge graphs for prediction of transcriptomic perturbation effects\.Nature Biotechnology\.Cited by:[§1](https://arxiv.org/html/2606.23920#S1.p1.1)\.
- Z\. Wu, J\. Hu, W\. Lu, I\. Gilitschenski, and A\. Garg \(2023\)Slotdiffusion: object\-centric generative modeling with diffusion models\.Advances in Neural Information Processing Systems36,pp\. 50932–50958\.Cited by:[§6](https://arxiv.org/html/2606.23920#S6.p2.1)\.
- Y\. Xie, L\. Winkler, L\. Sun, S\. Lewis, A\. Foster, J\. Jimenez\-Luna, T\. Hempel, M\. Gastegger, Y\. Chen, I\. Zaporozhets, C\. Clementi, C\. M\. Bishop, and F\. Noe \(2026\)Enhanced diffusion sampling: efficient rare event sampling and free energy calculation with diffusion models\.InICML 2026 Workshop on Structured Probabilistic Inference & Generative Modeling,Cited by:[§1](https://arxiv.org/html/2606.23920#S1.p4.1)\.
- Z\. Xu, M\. Jain, A\. Denton, S\. Whitfield, A\. Didolkar, B\. Earnshaw, and J\. Hartford \(2024\)Automated discovery of pairwise interactions from unstructured data\.External Links:2409\.07594Cited by:[§1](https://arxiv.org/html/2606.23920#S1.p2.6)\.
- J\. Zhang, K\. Greenewald, C\. Squires, A\. Srivastava, K\. Shanmugam, and C\. Uhler \(2023\)Identifiability guarantees for causal disentanglement from soft interventions\.Advances in Neural Information Processing Systems36,pp\. 50254–50292\.Cited by:[Appendix B](https://arxiv.org/html/2606.23920#A2.SS0.SSS0.Px1.p1.5)\.
###### Contents of Appendix
1. [1 Introduction](https://arxiv.org/html/2606.23920#S1)
2. [2 Setup: Geometrically weighted compositions](https://arxiv.org/html/2606.23920#S2)
3. [3 Technical background](https://arxiv.org/html/2606.23920#S3)
    1. [3\.1 Conditional diffusion models](https://arxiv.org/html/2606.23920#S3.SS1)
    2. [3\.2 Composition methods](https://arxiv.org/html/2606.23920#S3.SS2)
4. [4 Theoretical insights](https://arxiv.org/html/2606.23920#S4)
    1. [4\.1 Multivariate Gaussians](https://arxiv.org/html/2606.23920#S4.SS1)
    2. [4\.2 Base\-composed distributions](https://arxiv.org/html/2606.23920#S4.SS2)
        1. [4\.2\.1 Factorized conditionals](https://arxiv.org/html/2606.23920#S4.SS2.SSS1)
5. [5 Experiments](https://arxiv.org/html/2606.23920#S5)
    1. [5\.1 Synthetic data: 2D Gaussian](https://arxiv.org/html/2606.23920#S5.SS1)
    2. [5\.2 Synthetic data: mixture of Gaussians](https://arxiv.org/html/2606.23920#S5.SS2)
    3. [5\.3 Semi\-synthetic data: Objects in a room](https://arxiv.org/html/2606.23920#S5.SS3)
6. [6 Discussion](https://arxiv.org/html/2606.23920#S6)
7. [References](https://arxiv.org/html/2606.23920#bib)
8. [A Formalization of density\-related assumptions](https://arxiv.org/html/2606.23920#A1)
9. [B Weighted composition: A causal representation learning perspective](https://arxiv.org/html/2606.23920#A2)
10. [C Feynman\-Kac correctors](https://arxiv.org/html/2606.23920#A3)
11. [D Proofs](https://arxiv.org/html/2606.23920#A4)
    1. [D\.1 Proof of Theorem 1](https://arxiv.org/html/2606.23920#A4.SS1)
    2. [D\.2 Proof of Theorem 2](https://arxiv.org/html/2606.23920#A4.SS2)
    3. [D\.3 Proof of Theorem 3 \(Feynman\-Kac correctors under factorized conditionals\)](https://arxiv.org/html/2606.23920#A4.SS3)
12. [E Separate diffusion models](https://arxiv.org/html/2606.23920#A5)
13. [F Experiment setup details](https://arxiv.org/html/2606.23920#A6)
    1. [F\.1 2D Gaussian](https://arxiv.org/html/2606.23920#A6.SS1)
    2. [F\.2 Mixture of Gaussians](https://arxiv.org/html/2606.23920#A6.SS2)
    3. [F\.3 Objects in a room](https://arxiv.org/html/2606.23920#A6.SS3)
14. [G Additional room images](https://arxiv.org/html/2606.23920#A7)
15. [H Tables](https://arxiv.org/html/2606.23920#A8)
    1. [H\.1 Two\-dimensional Gaussian \(conditional model\)](https://arxiv.org/html/2606.23920#A8.SS1)
    2. [H\.2 Two\-dimensional Gaussian \(separate models\)](https://arxiv.org/html/2606.23920#A8.SS2)
    3. [H\.3 Gaussian mixture models \(conditional model\)](https://arxiv.org/html/2606.23920#A8.SS3)
    4. [H\.4 Gaussian mixture models \(separate models\)](https://arxiv.org/html/2606.23920#A8.SS4)
## Appendix A Formalization of density\-related assumptions
Our density\-related assumptions in [Section 2](https://arxiv.org/html/2606.23920#S2) can be more formally stated as follows:
###### Assumption 1\.
There exists some base measure $\mu$ on $\mathcal{X}$, with $\textnormal{supp}(\mu) = \mathcal{X}$, such that $P^{a} \ll \mu$ and $\mu \ll P^{a}$ for all $a \in \mathcal{A}_{\text{o}}$, where $\ll$ denotes the absolute continuity relation.
First, by the Radon\-Nikodym theorem, the condition $P^{a} \ll \mu$ ensures that each distribution $P^{a}$ has a density with respect to $\mu$, and that these densities are uniquely defined $\mu$\-almost everywhere. Second, the converse condition $\mu \ll P^{a}$ implies that $P^{a}(x) > 0$ for $\mu$\-almost all $x \in \mathcal{X}$. Hence, our equality statements should be formally interpreted as holding over this *equivalence class* of densities.
## Appendix B Weighted composition: A causal representation learning perspective
In causal representation learning, one often considers*causal representation models*as an inductive bias for compositional generalization\. In particular, one can define a*causal representation model on𝒳\\mathcal\{X\}*as a tupleM=\(S,g\)M=\(S,g\), whereSSis astructural causal modelon𝒵=ℝd\\mathcal\{Z\}=\\mathbb\{R\}^\{d\}, andg:𝒵→𝒳g\\colon\\mathcal\{Z\}\\to\\mathcal\{X\}is a diffeomorphism onto its image \(seeVarıcıet al\.\[[2026](https://arxiv.org/html/2606.23920#bib.bib33), Definition 10\.7\]\)\. Most importantly for our purposes, the*latent\-space observational distribution*ofM=\(S,g\)M=\(S,g\)has the form
$$Q^{\text{obs}}(z) := \prod_{i=1}^{d} Q^{\text{obs}}(z_i \mid \textnormal{pa}_{\mathcal{G}}(z_i)),$$
where $\mathcal{G}$ is the *causal DAG* associated with the structural causal model $S$, and the *feature\-space observational distribution* of $M$ is
$$P^{\text{obs}}(x) = Q^{\text{obs}}(g^{-1}(x)) \cdot |\det J_{g^{-1}}(x)|.$$
Then, an *intervention* $I$ on $M$, with *targets* $T(I) \subseteq [d]$, is a collection of conditional distributions $\bigl(Q^{I}(z_i \mid \textnormal{pa}_{\mathcal{G}}(z_i))\bigr)_{i \in T(I)}$, giving the *latent\-space interventional distribution*
$$Q^{I}(z) := \prod_{i=1}^{d} Q^{\text{obs}}(z_i \mid \textnormal{pa}_{\mathcal{G}}(z_i))^{\mathbbm{1}_{i \notin T(I)}} \cdot Q^{I}(z_i \mid \textnormal{pa}_{\mathcal{G}}(z_i))^{\mathbbm{1}_{i \in T(I)}},$$
and the *feature\-space interventional distribution* $P^{I}(x) = Q^{I}(g^{-1}(x)) \cdot |\det J_{g^{-1}}(x)|$. From these definitions, we see that
$$\frac{Q^{I}(z)}{Q^{\text{obs}}(z)} = \prod_{i \in T(I)} \frac{Q^{I}(z_i \mid \textnormal{pa}_{\mathcal{G}}(z_i))}{Q^{\text{obs}}(z_i \mid \textnormal{pa}_{\mathcal{G}}(z_i))},$$
and, since the Jacobian term cancels out,
$$\frac{P^{I}(x)}{P^{\text{obs}}(x)} = \prod_{i \in T(I)} \frac{P^{I}(z_i \mid \textnormal{pa}_{\mathcal{G}}(z_i))}{P^{\text{obs}}(z_i \mid \textnormal{pa}_{\mathcal{G}}(z_i))}, \quad \text{where}\ z = g^{-1}(x). \tag{9}$$
##### Perturbations as unknown\-target interventions
In*interventional causal representation learning*, one typically assumes that each*single perturbation*a=𝐞ka=\{\\mathbf\{e\}\}\_\{k\}fork∈\[K\]k\\in\[K\]corresponds to an interventionIaI\_\{a\}with unknown targets\[Varıcıet al\.,[2024](https://arxiv.org/html/2606.23920#bib.bib4)\], though possibly with restrictions on the size of the intervention\[Ahujaet al\.,[2023](https://arxiv.org/html/2606.23920#bib.bib32), Squireset al\.,[2023](https://arxiv.org/html/2606.23920#bib.bib10), Buchholzet al\.,[2023](https://arxiv.org/html/2606.23920#bib.bib31), Zhanget al\.,[2023](https://arxiv.org/html/2606.23920#bib.bib30), Variciet al\.,[2024](https://arxiv.org/html/2606.23920#bib.bib28),[2025](https://arxiv.org/html/2606.23920#bib.bib29)\]\. In particular, to learn the modelMMfrom such data, one must also learn a mapρ:a↦Ia\\rho\\colon a\\mapsto I\_\{a\}, mapping each perturbation to its representation as an intervention\.
Representing *double perturbations* as vectors $\mathbf{e}_{kk'} := \mathbf{e}_{k} + \mathbf{e}_{k'}$ for $k \neq k'$, \(a form of\) the principle of *independent causal mechanisms \(ICM\)* states that the map $\rho$ has a certain modularity property \[Müller et al\., [2021](https://arxiv.org/html/2606.23920#bib.bib34)\]. In particular, if $\mathbf{e}_{k}$ and $\mathbf{e}_{k'}$ affect different causal mechanisms \(i\.e\., $T(I_{\mathbf{e}_{k}}) \cap T(I_{\mathbf{e}_{k'}}) = \varnothing$\), then the ICM principle encourages one to assume that $\rho(\mathbf{e}_{kk'}) = I_{\mathbf{e}_{k}} \cup I_{\mathbf{e}_{k'}}$, i\.e\., that the corresponding intervention $I_{\mathbf{e}_{kk'}}$ *combines* the interventions $I_{\mathbf{e}_{k}} = \bigl(Q^{I_{\mathbf{e}_{k}}}(z_i \mid \textnormal{pa}_{\mathcal{G}}(z_i))\bigr)_{i \in T(I_{\mathbf{e}_{k}})}$ and $I_{\mathbf{e}_{k'}} = \bigl(Q^{I_{\mathbf{e}_{k'}}}(z_i \mid \textnormal{pa}_{\mathcal{G}}(z_i))\bigr)_{i \in T(I_{\mathbf{e}_{k'}})}$, without any changes to the interventional mechanisms.
##### Weighted composition
For consistency with Section 2, let $P^{a} := P^{I_{a}}$. By [Equation 9](https://arxiv.org/html/2606.23920#A2.E9) and the principle of independent mechanisms, we obtain
$$\begin{aligned}
\frac{P^{\mathbf{e}_{kk'}}(x)}{P^{\text{obs}}(x)}
&= \prod_{i \in T(I_{\mathbf{e}_{kk'}})} \frac{P^{\mathbf{e}_{kk'}}(z_i \mid \textnormal{pa}_{\mathcal{G}}(z_i))}{P^{\text{obs}}(z_i \mid \textnormal{pa}_{\mathcal{G}}(z_i))} && \text{(Equation 9)} \\
&= \left(\prod_{i \in T(I_{\mathbf{e}_{k}})} \frac{P^{\mathbf{e}_{k}}(z_i \mid \textnormal{pa}_{\mathcal{G}}(z_i))}{P^{\text{obs}}(z_i \mid \textnormal{pa}_{\mathcal{G}}(z_i))}\right) \cdot \left(\prod_{i \in T(I_{\mathbf{e}_{k'}})} \frac{P^{\mathbf{e}_{k'}}(z_i \mid \textnormal{pa}_{\mathcal{G}}(z_i))}{P^{\text{obs}}(z_i \mid \textnormal{pa}_{\mathcal{G}}(z_i))}\right) && \text{(ICM)} \\
&= \frac{P^{\mathbf{e}_{k}}(x)}{P^{\text{obs}}(x)} \cdot \frac{P^{\mathbf{e}_{k'}}(x)}{P^{\text{obs}}(x)}, && \text{(Equation 9)}
\end{aligned}$$
where we have used $z = g^{-1}(x)$ throughout. Thus, rearranging and assuming $P^{\mathbf{0}} = P^{\text{obs}}$, we obtain the desired composition: $P^{\mathbf{e}_{kk'}}(x) = P^{\mathbf{e}_{k}}(x) \cdot P^{\mathbf{e}_{k'}}(x) \cdot (P^{\mathbf{0}}(x))^{-1}$.
This argument naturally extends to combinations of*several*perturbations with non\-overlapping intervention targets\.
Notably, such interventional distributions are generalizations of the*Factorized Conditionals*condition from\[Bradleyet al\.,[2025](https://arxiv.org/html/2606.23920#bib.bib9)\]: the assumption of mutually independent sets of variables corresponds to*disconnected*groups of nodes in a causal graph, giving their Equation \(7\) as a special case where each intervention targets one of these disconnected groups of nodes\.
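As a quick sanity check of the composition identity, the following sketch evaluates both sides on a toy model. The setup is an assumption made only for illustration: two independent latent coordinates (an empty causal graph), $g$ equal to the identity, and Gaussian mechanisms with arbitrarily chosen means and standard deviations.

```python
import numpy as np

def gauss_logpdf(x, mean, std):
    """Log-density of independent 1-D Gaussians, summed over coordinates."""
    x, mean, std = map(np.asarray, (x, mean, std))
    per_coord = -0.5 * ((x - mean) / std) ** 2 - np.log(std) - 0.5 * np.log(2 * np.pi)
    return np.sum(per_coord, axis=-1)

# Hypothetical toy model: empty causal graph, g = identity (illustrative values).
obs = dict(mean=[0.0, 0.0], std=[1.0, 1.0])    # P^0 = P^obs
e1 = dict(mean=[2.0, 0.0], std=[0.5, 1.0])     # intervention on coordinate 1 only
e2 = dict(mean=[0.0, -1.0], std=[1.0, 2.0])    # intervention on coordinate 2 only
e12 = dict(mean=[2.0, -1.0], std=[0.5, 2.0])   # ICM: both mechanisms replaced

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))

# log P^{e_12}(x) versus log P^{e_1}(x) + log P^{e_2}(x) - log P^0(x)
lhs = gauss_logpdf(x, **e12)
rhs = gauss_logpdf(x, **e1) + gauss_logpdf(x, **e2) - gauss_logpdf(x, **obs)
max_gap = float(np.max(np.abs(lhs - rhs)))
print(max_gap)
```

The two sides agree up to floating-point error because the intervention targets are disjoint; with overlapping targets the identity would not be expected to hold.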
## Appendix C Feynman\-Kac correctors
In this section, we present the theoretical intuition behind Feynman\-Kac Correctors \(FKC\)\[Skretaet al\.,[2025](https://arxiv.org/html/2606.23920#bib.bib24)\]\.
Although the naïve denoising SDE uses a*score*that matchesStwS^\{w\}\_\{t\}att=0t=0, its marginal att=0t=0is not𝐏w\{\\mathbf\{P\}\}^\{w\}: the particle position at the endpoint depends on integrated dynamics over the entire trajectory, andStw,nvS^\{w,\\textnormal\{nv\}\}\_\{t\}is not correct att\>0t\>0\. FKC addresses this inference\-time approximation error by maintainingKKweighted particles\{\(xt\(k\),ωt\(k\)\)\}k=1K\\\{\(x^\{\(k\)\}\_\{t\},\\omega^\{\(k\)\}\_\{t\}\)\\\}\_\{k=1\}^\{K\}, withωt\(k\)\\omega^\{\(k\)\}\_\{t\}the log importance weight of particlekk, such that the weighted empirical*distribution*tracks a target along the entire denoising trajectory\.
The natural target $\mathbf{P}^{w}_{t} = \operatorname{\textsc{Noise}}_{t}(\mathbf{P}^{w})$ has an intractable score $S^{w}_{t}$, so FKC instead tracks the weighted composition of the noised marginals, $\operatorname{\textsc{Comp}}_{w}(\{P^{a}_{t}\}_{a})$, whose score is the naïvely composed $S^{w,\textnormal{nv}}_{t}$. This proxy agrees with $\mathbf{P}^{w}_{t}$ at $t=0$ \(since $\operatorname{\textsc{Noise}}_{0}$ is the identity\) but diverges from it at intermediate $t$, since noising and composition do not commute. Sampling correctness still holds: at $t=0$ the proxy equals $\mathbf{P}^{w}$ in distribution, so the weighted particles approximate $\mathbf{P}^{w}$.
The weight update that achieves this follows from comparing how the proposal $\widetilde{P}^{w,\textnormal{nv}}_{t}$ and the proxy $\operatorname{\textsc{Comp}}_{w}(\{P^{a}_{t}\}_{a})$ evolve in time. Both satisfy Fokker\-Planck\-type equations involving the drift divergence, score, and Laplacian of the log density. Subtracting one equation from the other leaves a residual $g_t$ that the log\-weights along a particle trajectory must account for:
$$\text{d}\omega_{t}(x_{t}) = \bar{g}_{t}(x_{t})\,\text{d}t, \qquad \bar{g}_{t}(x) := g_{t}(x) - \mathbb{E}_{X_{t} \sim \operatorname{\textsc{Comp}}_{w}(\{P_{t}^{a}\}_{a})}[g_{t}(X_{t})], \tag{10}$$
where the centering enforces $\int \bar{g}_{t}(x)\,P^{w}_{t}(x)\,\text{d}x = 0$, preventing uniform divergence or collapse of all log\-weights simultaneously. For weighted compositions, $g_t$ is given in closed form by:
$$g_{t}(x) = \left(1 - \sum\nolimits_{a \in \mathcal{A}_{\text{o}}} w_{a}\right) \langle \nabla_{\textsf{x}}, u_{t}(x) \rangle + \frac{\sigma_{t}^{2}}{2} \left(\sum\nolimits_{a \in \mathcal{A}_{\text{o}}} w_{a}\,\|S^{a}_{t}(x)\|^{2} - \|S^{w,\textnormal{nv}}_{t}(x)\|^{2}\right). \tag{11}$$
To utilize these weightsωt\(k\)\\omega\_\{t\}^\{\(k\)\}to perform corrections, a Sequential Monte Carlo method is employed\. During denoising, resampling is performed at each step based on the incrementωt\(k\)=gt\(xt\(k\)\)\\omega\_\{t\}^\{\(k\)\}=g\_\{t\}\(x\_\{t\}^\{\(k\)\}\)via systematic resampling proportional toexp\(ωt\(k\)\)\\exp\\left\(\\omega\_\{t\}^\{\(k\)\}\\right\)\[Douc and Cappe,[2005](https://arxiv.org/html/2606.23920#bib.bib37)\]\. Note that whenK=1K=1, the ensemble consists of only a single particle, so resampling has no effect and running FKC is equivalent to running the naïve denoising SDE\.
We also note an additional convenience of base\-composed distributions \([7](https://arxiv.org/html/2606.23920#S4.E7)\): the divergence term in \([11](https://arxiv.org/html/2606.23920#A3.E11)\) cancels to $0$ and can be ignored even for non\-linear drifts. \(For linear drifts, it is constant and can be ignored regardless.\)
The FKC framework extends naturally to denoising ODEs, with the same $g_t$ as in the SDE case. However, the Sequential Monte Carlo resampling scheme described above is less well\-motivated for ODEs because their trajectories are deterministic. Because the reverse ODE lacks noise injection from the Wiener process, particle trajectories never stochastically diverge: when FKC resampling duplicates a high\-weight particle, the identical copies follow exactly the same deterministic trajectory for the remainder of the reverse process. Resampling therefore monotonically decreases the diversity of the particle swarm. Despite this limitation, if the continuity\-equation residual is large, FKC can still help route the probability mass to the correct target coordinates.
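The two ingredients of the correction step, the residual in \(11\) and systematic resampling, can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation of Skreta et al\.: the function names, array layout, and step size `dt` are hypothetical, and the divergence term of \(11\) is assumed to have been dropped (as is valid when the weights sum to one).

```python
import numpy as np

def fkc_residual(scores, weights, sigma_t):
    """Score part of g_t in (11); assumes the divergence term has been dropped.

    scores:  array (A, K, d), S_t^a evaluated at each of K particles.
    weights: array (A,), composition weights w_a.
    """
    scores = np.asarray(scores, dtype=float)
    w = np.asarray(weights, dtype=float)
    naive = np.einsum("a,akd->kd", w, scores)            # naively composed score
    weighted_sq = w @ np.sum(scores ** 2, axis=-1)       # sum_a w_a ||S_t^a||^2
    return 0.5 * sigma_t ** 2 * (weighted_sq - np.sum(naive ** 2, axis=-1))

def systematic_resample(log_w, rng):
    """Particle indices drawn with probabilities proportional to exp(log_w)."""
    log_w = np.asarray(log_w, dtype=float)
    p = np.exp(log_w - log_w.max())                      # stabilised weights
    cdf = np.cumsum(p / p.sum())
    K = len(p)
    positions = (rng.uniform() + np.arange(K)) / K       # one shared offset, K strata
    idx = np.searchsorted(cdf, positions, side="right")
    return np.minimum(idx, K - 1)                        # guard against round-off

rng = np.random.default_rng(0)
scores = rng.normal(size=(3, 4, 2))                      # 3 sources, K=4, d=2
g = fkc_residual(scores, weights=[-1.0, 1.0, 1.0], sigma_t=0.7)
dt = 0.01                                                # hypothetical step size
idx = systematic_resample(g * dt, rng)
```

With a single particle the resampler can only return index `0`, which matches the remark that FKC with $K=1$ reduces to the naïve denoising SDE.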
## Appendix D Proofs
### D\.1 Proof of [Theorem 1](https://arxiv.org/html/2606.23920#Thmtheorem1)
###### Proof\.
Letx∈ℝdx\\in\\mathbb\{R\}^\{d\}\. The probability density function for each zero\-mean multivariate GaussianPaP^\{a\}is given by:
Pa\(x\)=1\(2π\)d\|Σa\|exp\(−12x⊤Σa−1x\)P^\{a\}\(x\)=\\frac\{1\}\{\\sqrt\{\(2\\pi\)^\{d\}\|\\Sigma\_\{a\}\|\}\}\\exp\\left\(\-\\frac\{1\}\{2\}x^\{\\top\}\\Sigma\_\{a\}^\{\-1\}x\\right\)
We wish to evaluatef\(x\)f\(x\), the product of these densities raised to their respective weightswaw\_\{a\}:
$$\begin{aligned}
f(x) &= \prod_{a \in \mathcal{A}_{\text{o}}} P^{a}(x)^{w_{a}} \\
&= \prod_{a \in \mathcal{A}_{\text{o}}} \left((2\pi)^{-d/2} |\Sigma_{a}|^{-1/2} \exp\left(-\frac{1}{2} x^{\top} \Sigma_{a}^{-1} x\right)\right)^{w_{a}} \\
&= \left(\prod_{a \in \mathcal{A}_{\text{o}}} (2\pi)^{-d w_{a}/2} |\Sigma_{a}|^{-w_{a}/2}\right) \exp\left(-\frac{1}{2} \sum_{a \in \mathcal{A}_{\text{o}}} w_{a}\, x^{\top} \Sigma_{a}^{-1} x\right) \\
&= \left(\prod_{a \in \mathcal{A}_{\text{o}}} (2\pi)^{-d w_{a}/2} |\Sigma_{a}|^{-w_{a}/2}\right) \exp\left(-\frac{1}{2} x^{\top} \left(\sum_{a \in \mathcal{A}_{\text{o}}} w_{a} \Sigma_{a}^{-1}\right) x\right)
\end{aligned}$$
LetCCbe the scaling constantC=∏a∈𝒜o\(2π\)−dwa/2\|Σa\|−wa/2C=\\prod\_\{a\\in\\mathcal\{A\}\_\{\\text\{o\}\}\}\(2\\pi\)^\{\-dw\_\{a\}/2\}\|\\Sigma\_\{a\}\|^\{\-w\_\{a\}/2\}, and letM=∑a∈𝒜owaΣa−1M=\\sum\_\{a\\in\\mathcal\{A\}\_\{\\text\{o\}\}\}w\_\{a\}\\Sigma\_\{a\}^\{\-1\}:
f\(x\)=Cexp\(−12x⊤Mx\)f\(x\)=C\\exp\\left\(\-\\frac\{1\}\{2\}x^\{\\top\}Mx\\right\)We can recognizef\(x\)∝exp\(−12x⊤Mx\)f\(x\)\\propto\\exp\\left\(\-\\frac\{1\}\{2\}x^\{\\top\}Mx\\right\)as the kernel of a multivariate Gaussian distribution with a mean of0and precisionMM, which is well\-known to be integrable if and only ifM≻0M\\succ 0\(that is,MMis positive definite\)\. Therefore, assumingM≻0M\\succ 0, normalizingf\(x\)f\(x\)yieldsPw\(x\)=𝒩\(0,Σw\)P^\{w\}\(x\)=\\mathcal\{N\}\(0,\\Sigma^\{w\}\)for
Σw=\(∑a∈𝒜owaΣa−1\)−1\\Sigma^\{w\}=\\left\(\\sum\_\{a\\in\\mathcal\{A\}\_\{\\text\{o\}\}\}w\_\{a\}\\Sigma\_\{a\}^\{\-1\}\\right\)^\{\-1\}∎
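The closed form is easy to evaluate numerically. The sketch below computes $\Sigma^{w}$ from the precision matrix $M$ and rejects weight choices for which $M$ is not positive definite; the function name and the example covariances are illustrative assumptions.

```python
import numpy as np

def composed_gaussian_cov(covs, weights):
    """Covariance of the normalized product prod_a N(0, Sigma_a)^{w_a}.

    Raises ValueError when M = sum_a w_a Sigma_a^{-1} is not positive definite,
    in which case the weighted product is not integrable.
    """
    M = sum(w * np.linalg.inv(np.asarray(S, dtype=float))
            for S, w in zip(covs, weights))
    if np.min(np.linalg.eigvalsh(M)) <= 0:
        raise ValueError("precision matrix M is not positive definite")
    return np.linalg.inv(M)

# Illustrative example: a base Gaussian plus two sources, each widening one axis.
base = np.eye(2)
wide_x = np.diag([4.0, 1.0])
wide_y = np.diag([1.0, 4.0])
cov = composed_gaussian_cov([base, wide_x, wide_y], [-1.0, 1.0, 1.0])
print(cov)
```

Here $M = -I + \mathrm{diag}(1/4, 1) + \mathrm{diag}(1, 1/4) = \tfrac{1}{4} I$, so the composed covariance is $4I$: both axes are widened. Replacing both sources by $4I$ makes $M = -\tfrac{1}{2} I$, and the function raises, reflecting the integrability condition $M \succ 0$.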
### D\.2 Proof of [Theorem 2](https://arxiv.org/html/2606.23920#Thmtheorem2)
We begin by proving a lemma that the proof relies on.
###### Lemma 1\(Log\-derivative of a noised Gaussian covariance\)\.
ForPa=𝒩\(0,Σa\)P^\{a\}=\\mathcal\{N\}\(0,\\Sigma\_\{a\}\), the noising marginal isPta=𝒩\(0,Σta\)P^\{a\}\_\{t\}=\\mathcal\{N\}\(0,\\Sigma^\{a\}\_\{t\}\)withΣta=αt2Σa\+γt2I\\Sigma^\{a\}\_\{t\}=\\alpha\_\{t\}^\{2\}\\Sigma\_\{a\}\+\\gamma\_\{t\}^\{2\}I, and
ddtlnΣta=2μtI\+σt2\(Σta\)−1\.\\frac\{d\}\{dt\}\\ln\\Sigma^\{a\}\_\{t\}=2\\mu\_\{t\}I\+\\sigma\_\{t\}^\{2\}\(\\Sigma^\{a\}\_\{t\}\)^\{\-1\}\.
###### Proof\.
The first claim follows directly from the noising kernelNoiset\(x0\)=𝒩\(αtx0;γt2I\)\\operatorname\{\\textsc\{Noise\}\}\_\{t\}\(x\_\{0\}\)=\\mathcal\{N\}\(\\alpha\_\{t\}x\_\{0\};\\gamma\_\{t\}^\{2\}I\)\. For the second, bothΣta\\Sigma^\{a\}\_\{t\}and
Σ˙ta=2αtα˙tΣa\+2γtγ˙tI\\dot\{\\Sigma\}^\{a\}\_\{t\}=2\\alpha\_\{t\}\\dot\{\\alpha\}\_\{t\}\\Sigma\_\{a\}\+2\\gamma\_\{t\}\\dot\{\\gamma\}\_\{t\}Iare linear combinations ofΣa\\Sigma\_\{a\}andII, so they commute, givingddtlnΣta=Σ˙ta\(Σta\)−1\\frac\{d\}\{dt\}\\ln\\Sigma^\{a\}\_\{t\}=\\dot\{\\Sigma\}^\{a\}\_\{t\}\(\\Sigma^\{a\}\_\{t\}\)^\{\-1\}\. Usingμt=α˙t/αt\\mu\_\{t\}=\\dot\{\\alpha\}\_\{t\}/\\alpha\_\{t\}andσt2=2γtγ˙t−2μtγt2\\sigma\_\{t\}^\{2\}=2\\gamma\_\{t\}\\dot\{\\gamma\}\_\{t\}\-2\\mu\_\{t\}\\gamma\_\{t\}^\{2\},
Σ˙ta=2μt\(αt2Σa\+γt2I\)\+σt2I=2μtΣta\+σt2I\.\\dot\{\\Sigma\}^\{a\}\_\{t\}=2\\mu\_\{t\}\(\\alpha\_\{t\}^\{2\}\\Sigma\_\{a\}\+\\gamma\_\{t\}^\{2\}I\)\+\\sigma\_\{t\}^\{2\}I=2\\mu\_\{t\}\\Sigma^\{a\}\_\{t\}\+\\sigma\_\{t\}^\{2\}I\.Right\-multiplying by\(Σta\)−1\(\\Sigma^\{a\}\_\{t\}\)^\{\-1\}gives the result\. ∎
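The identity in Lemma 1 can be checked numerically in one dimension with a central finite difference. The cosine schedule below is an assumption chosen only because it satisfies the boundary conditions $\alpha_0 = 1$, $\gamma_0 = 0$, $\alpha_1 = 0$, $\gamma_1 = 1$; the quantities $\mu_t$ and $\sigma_t^2$ are computed from the relations stated in the proof.

```python
import numpy as np

# Hypothetical schedule (not taken from the paper's experiments).
alpha = lambda t: np.cos(0.5 * np.pi * t)
gamma = lambda t: np.sin(0.5 * np.pi * t)
d_alpha = lambda t: -0.5 * np.pi * np.sin(0.5 * np.pi * t)
d_gamma = lambda t: 0.5 * np.pi * np.cos(0.5 * np.pi * t)

def lemma1_gap(Sigma_a, t, h=1e-5):
    """Absolute gap between both sides of Lemma 1 for a scalar Sigma_a."""
    Sigma_t = lambda s: alpha(s) ** 2 * Sigma_a + gamma(s) ** 2
    mu = d_alpha(t) / alpha(t)                                   # mu_t
    sigma2 = 2 * gamma(t) * d_gamma(t) - 2 * mu * gamma(t) ** 2  # sigma_t^2
    # Left side: central finite difference of ln Sigma_t^a.
    lhs = (np.log(Sigma_t(t + h)) - np.log(Sigma_t(t - h))) / (2 * h)
    # Right side: 2 mu_t + sigma_t^2 / Sigma_t^a.
    rhs = 2 * mu + sigma2 / Sigma_t(t)
    return abs(lhs - rhs)

gaps = [lemma1_gap(Sigma_a=3.0, t=t) for t in (0.1, 0.5, 0.9)]
print(gaps)
```

The gaps should be at the level of finite-difference error. The check avoids $t = 1$, where $\mu_t$ is singular for this schedule.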
Now we begin the proof of [Theorem 2](https://arxiv.org/html/2606.23920#Thmtheorem2).
###### Proof\.
Since eachPaP^\{a\}is mean\-zero Gaussian,Sta\(x\)=−\(Σta\)−1xS^\{a\}\_\{t\}\(x\)=\-\(\\Sigma^\{a\}\_\{t\}\)^\{\-1\}x, so the naïvely composed score is linear inxx:
Stw,nv\(x\)=∑a∈𝒜owaSta\(x\)=−Jtw,nvx,Jtw,nv:=∑a∈𝒜owa\(Σta\)−1\.S^\{w,\\textnormal\{nv\}\}\_\{t\}\(x\)=\\sum\_\{a\\in\\mathcal\{A\}\_\{\\text\{o\}\}\}w\_\{a\}S^\{a\}\_\{t\}\(x\)=\-J^\{w,\\textnormal\{nv\}\}\_\{t\}\\,x,\\qquad J^\{w,\\textnormal\{nv\}\}\_\{t\}\\mathrel\{:=\}\\sum\_\{a\\in\\mathcal\{A\}\_\{\\text\{o\}\}\}w\_\{a\}\(\\Sigma^\{a\}\_\{t\}\)^\{\-1\}\.Combined withut\(x\)=μtxu\_\{t\}\(x\)=\\mu\_\{t\}x, the naïve denoising velocity field is also linear inxx:
v~tw,nv\(x\)=Atx,At:=μtI\+12σt2Jtw,nv,\\tilde\{v\}^\{w,\\textnormal\{nv\}\}\_\{t\}\(x\)=A\_\{t\}\\,x,\\qquad A\_\{t\}\\mathrel\{:=\}\\mu\_\{t\}I\+\\tfrac\{1\}\{2\}\\sigma\_\{t\}^\{2\}J^\{w,\\textnormal\{nv\}\}\_\{t\},withAtA\_\{t\}symmetric\. The naïve denoising ODE \([5](https://arxiv.org/html/2606.23920#S3.E5)\), integrated fromt=1t=1tot=0t=0, is therefore
dxtdt=Atxt,x1∼𝒩\(0,I\),\\frac\{dx\_\{t\}\}\{dt\}=A\_\{t\}x\_\{t\},\\qquad x\_\{1\}\\sim\\mathcal\{N\}\(0,I\),where the initial distribution reflects𝐏1w=𝒩\(0,I\)\{\\mathbf\{P\}\}^\{w\}\_\{1\}=\\mathcal\{N\}\(0,I\)\.
##### Commutativity\.
Note that every matrix appearing in this proof \($\Sigma^{a}_{t}$, $(\Sigma^{a}_{t})^{-1}$, $A_{t}$, $\Sigma_{t}$\) is built from $\{\Sigma_{a}\}_{a \in \mathcal{A}_{\text{o}}}$ and $I$ by sums, products, inverses, and \(for $\Sigma_{t}$\) matrix exponentials, where $\Sigma_{t}$ denotes the covariance of $x_{t}$ under the naïve denoising ODE. Since the $\Sigma_{a}$ pairwise commute by hypothesis, any two such matrices commute. In particular, $\Sigma_{t}$ commutes with $A_{t}$ and with $\dot{\Sigma}_{t}$, which justifies all matrix manipulations below, including the use of the identity $\frac{d}{dt} \ln \Sigma_{t} = \dot{\Sigma}_{t} \Sigma_{t}^{-1}$.
##### Covariance evolution\.
The covariance satisfiesΣ˙t=AtΣt\+ΣtAt⊤=2AtΣt\\dot\{\\Sigma\}\_\{t\}=A\_\{t\}\\Sigma\_\{t\}\+\\Sigma\_\{t\}A\_\{t\}^\{\\top\}=2A\_\{t\}\\Sigma\_\{t\}\[Särkkä and Solin,[2019](https://arxiv.org/html/2606.23920#bib.bib38), Equation \(6\.2\)\], so
ddtlnΣt=2At=2μtI\+σt2∑a∈𝒜owa\(Σta\)−1\.\\frac\{d\}\{dt\}\\ln\\Sigma\_\{t\}=2A\_\{t\}=2\\mu\_\{t\}I\+\\sigma\_\{t\}^\{2\}\\sum\_\{a\\in\\mathcal\{A\}\_\{\\text\{o\}\}\}w\_\{a\}\(\\Sigma^\{a\}\_\{t\}\)^\{\-1\}\.Using∑a∈𝒜owa=1\\sum\_\{a\\in\\mathcal\{A\}\_\{\\text\{o\}\}\}w\_\{a\}=1to absorb the2μtI2\\mu\_\{t\}Iterm into the sum, then applying Lemma[1](https://arxiv.org/html/2606.23920#Thmlemma1)termwise,
ddtlnΣt=∑a∈𝒜owa\[2μtI\+σt2\(Σta\)−1\]=∑a∈𝒜owaddtlnΣta\.\\frac\{d\}\{dt\}\\ln\\Sigma\_\{t\}=\\sum\_\{a\\in\\mathcal\{A\}\_\{\\text\{o\}\}\}w\_\{a\}\\bigl\[2\\mu\_\{t\}I\+\\sigma\_\{t\}^\{2\}\(\\Sigma^\{a\}\_\{t\}\)^\{\-1\}\\bigr\]=\\sum\_\{a\\in\\mathcal\{A\}\_\{\\text\{o\}\}\}w\_\{a\}\\,\\frac\{d\}\{dt\}\\ln\\Sigma^\{a\}\_\{t\}\.
##### Integration and boundary conditions\.
Integrating fromt=1t=1tot=0t=0,
lnΣ0−lnΣ1=∑a∈𝒜owa\(lnΣ0a−lnΣ1a\)\.\\ln\\Sigma\_\{0\}\-\\ln\\Sigma\_\{1\}=\\sum\_\{a\\in\\mathcal\{A\}\_\{\\text\{o\}\}\}w\_\{a\}\\bigl\(\\ln\\Sigma^\{a\}\_\{0\}\-\\ln\\Sigma^\{a\}\_\{1\}\\bigr\)\.The boundary valuesα0=1,γ0=0\\alpha\_\{0\}=1,\\gamma\_\{0\}=0giveΣ0a=Σa\\Sigma^\{a\}\_\{0\}=\\Sigma\_\{a\}, andα1=0,γ1=1\\alpha\_\{1\}=0,\\gamma\_\{1\}=1giveΣ1a=Σ1=I\\Sigma^\{a\}\_\{1\}=\\Sigma\_\{1\}=I, solnΣ1a=lnΣ1=0\\ln\\Sigma^\{a\}\_\{1\}=\\ln\\Sigma\_\{1\}=0\. Hence
lnΣ0=∑a∈𝒜owalnΣa,Σ0=exp\(∑a∈𝒜owalnΣa\)=∏a∈𝒜oΣawa,\\ln\\Sigma\_\{0\}=\\sum\_\{a\\in\\mathcal\{A\}\_\{\\text\{o\}\}\}w\_\{a\}\\ln\\Sigma\_\{a\},\\qquad\\Sigma\_\{0\}=\\exp\\\!\\Bigl\(\\sum\_\{a\\in\\mathcal\{A\}\_\{\\text\{o\}\}\}w\_\{a\}\\ln\\Sigma\_\{a\}\\Bigr\)=\\prod\_\{a\\in\\mathcal\{A\}\_\{\\text\{o\}\}\}\\Sigma\_\{a\}^\{w\_\{a\}\},where the last equality uses that for commuting symmetric positive\-definite matrices,exp\(lnA\+lnB\)=AB\\exp\(\\ln A\+\\ln B\)=AB\.
##### Mean\.
The meanmtm\_\{t\}satisfiesm˙t=Atmt\\dot\{m\}\_\{t\}=A\_\{t\}m\_\{t\}withm1=0m\_\{1\}=0, somt≡0m\_\{t\}\\equiv 0\. Thus,P~w,nvo=𝒩\(0,∏a∈𝒜oΣawa\)\\widetilde\{P\}^\{w,\\textnormal\{nvo\}\}=\\mathcal\{N\}\\bigl\(0,\\prod\_\{a\\in\\mathcal\{A\}\_\{\\text\{o\}\}\}\\Sigma\_\{a\}^\{w\_\{a\}\}\\bigr\)\. ∎
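To see how far the naïve ODE output can be from the target of Theorem 1, the sketch below evaluates both closed-form covariances. The helper names and the example matrices are illustrative assumptions; the example uses diagonal (hence commuting) covariances and weights summing to one, as the proof requires.

```python
import numpy as np

def sym_fn(S, fn):
    """Apply a scalar function to a symmetric matrix via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(np.asarray(S, dtype=float))
    return (vecs * fn(vals)) @ vecs.T

def naive_ode_cov(covs, weights):
    """exp(sum_a w_a ln Sigma_a); equals prod_a Sigma_a^{w_a} if the Sigma_a commute."""
    log_sum = sum(w * sym_fn(S, np.log) for S, w in zip(covs, weights))
    return sym_fn(log_sum, np.exp)

def target_cov(covs, weights):
    """(sum_a w_a Sigma_a^{-1})^{-1}, the covariance from Theorem 1 (assumes M > 0)."""
    return np.linalg.inv(sum(w * np.linalg.inv(S) for S, w in zip(covs, weights)))

covs = [np.diag([1.0, 1.0]), np.diag([4.0, 9.0])]   # diagonal, hence commuting
weights = [0.5, 0.5]                                # sums to one
naive = naive_ode_cov(covs, weights)                # weighted geometric mean
exact = target_cov(covs, weights)                   # weighted harmonic mean
print(np.diag(naive), np.diag(exact))
```

For this example the naïve ODE covariance is $\mathrm{diag}(2, 3)$, while the target covariance is $\mathrm{diag}(1.6, 1.8)$: the sampler returns a weighted geometric mean of the source covariances rather than the weighted harmonic mean that defines $P^{w}$.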
### D\.3 Proof of [Theorem 3](https://arxiv.org/html/2606.23920#Thmtheorem3) \(Feynman\-Kac correctors under factorized conditionals\)
We first prove a simple lemma that lets us ignore the dependence on the time $t$.
###### Lemma 2\(Coordinate\-wise noising preserves factorization\)\.
IfP0P\_\{0\}onℝn\\mathbb\{R\}^\{n\}factorizes over a partition\[n\]=⨆ℓBℓ\[n\]=\\bigsqcup\_\{\\ell\}B\_\{\\ell\}asP0\(x\)=∏ℓP0\(xBℓ\)P\_\{0\}\(x\)=\\prod\_\{\\ell\}P\_\{0\}\(x\_\{B\_\{\\ell\}\}\), and the noising SDE acts coordinate\-wise, then for everyt∈\[0,1\]t\\in\[0,1\]the noising marginalPtP\_\{t\}factorizes over the same partition:Pt\(x\)=∏ℓPt\(xBℓ\)P\_\{t\}\(x\)=\\prod\_\{\\ell\}P\_\{t\}\(x\_\{B\_\{\\ell\}\}\)\.
###### Proof\.
The coordinate\-wise noising kernel satisfies\(Noiset\(x0\)\)Bℓ=αt\(x0\)Bℓ\+γtZBℓ\(\\operatorname\{\\textsc\{Noise\}\}\_\{t\}\(x\_\{0\}\)\)\_\{B\_\{\\ell\}\}=\\alpha\_\{t\}\(x\_\{0\}\)\_\{B\_\{\\ell\}\}\+\\gamma\_\{t\}Z\_\{B\_\{\\ell\}\}withZBℓ∼𝒩\(0,I\|Bℓ\|\)Z\_\{B\_\{\\ell\}\}\\sim\\mathcal\{N\}\(0,I\_\{\|B\_\{\\ell\}\|\}\)independent acrossℓ\\ell\. IfX0∼P0X\_\{0\}\\sim P\_\{0\}has independent blocks by hypothesis, then the noised blocks\(Xt\)Bℓ\(X\_\{t\}\)\_\{B\_\{\\ell\}\}are functions of disjoint independent inputs and remain mutually independent, so the joint density ofXtX\_\{t\}factorizes block\-wise\. ∎
Now, we prove [Theorem 3](https://arxiv.org/html/2606.23920#Thmtheorem3).
###### Proof\.
For convenience, let𝒜o−:=𝒜o∖\{0\}\\mathcal\{A\}\_\{\\text\{o\}\}^\{\-\}\\mathrel\{:=\}\\mathcal\{A\}\_\{\\text\{o\}\}\\setminus\\\{0\\\}, and let the distributions\{P0\}∪\{Pa\}a∈𝒜o\\\{P^\{0\}\\\}\\cup\\\{P^\{a\}\\\}\_\{a\\in\\mathcal\{A\}\_\{\\text\{o\}\}\}onℝn\\mathbb\{R\}^\{n\}denote the Factorized Conditional distributions\.
Letk:=\|𝒜o−\|k\\mathrel\{:=\}\|\\mathcal\{A\}\_\{\\text\{o\}\}^\{\-\}\|\. The weights satisfyw0\+∑a∈𝒜o−wa=\(1−k\)\+k=1w\_\{0\}\+\\sum\_\{a\\in\\mathcal\{A\}\_\{\\text\{o\}\}^\{\-\}\}w\_\{a\}=\(1\-k\)\+k=1, so the divergence term in \([11](https://arxiv.org/html/2606.23920#A3.E11)\) vanishes\. It remains to show
‖w0St0\+∑a∈𝒜o−waSta‖2−\(w0‖St0‖2\+∑a∈𝒜o−wa‖Sta‖2\)=0\.\\Big\\\|w\_\{0\}S^\{0\}\_\{t\}\+\\sum\_\{a\\in\\mathcal\{A\}\_\{\\text\{o\}\}^\{\-\}\}w\_\{a\}S^\{a\}\_\{t\}\\Big\\\|^\{2\}\\;\-\\;\\Big\(w\_\{0\}\\\|S^\{0\}\_\{t\}\\\|^\{2\}\+\\sum\_\{a\\in\\mathcal\{A\}\_\{\\text\{o\}\}^\{\-\}\}w\_\{a\}\\\|S^\{a\}\_\{t\}\\\|^\{2\}\\Big\)\\;=\\;0\.\(12\)
##### Block decomposition of scores\.
The noising operator $\operatorname{\textsc{Noise}}_{t}$ ([2](https://arxiv.org/html/2606.23920#S3.E2)) acts coordinate-wise because $\mu_{t}$ and $\sigma_{t}$ are scalar, so by Lemma [2](https://arxiv.org/html/2606.23920#Thmlemma2), the factorization ([8](https://arxiv.org/html/2606.23920#S4.E8)) is preserved at every $t$. We drop the $t$ subscript hereafter.

Define the following vectors in $\mathbb{R}^{n}$, each supported on a single block of the partition:

$$\begin{aligned}
\tilde{u}_{0}\;&\colon\;\text{supported on }M_{0},\quad(\tilde{u}_{0})_{M_{0}}\;=\;\nabla_{x_{M_{0}}}\log P^{0}(x_{M_{0}});\\
\tilde{u}_{a}\;&\colon\;\text{supported on }M_{a},\quad(\tilde{u}_{a})_{M_{a}}\;=\;\nabla_{x_{M_{a}}}\log P^{0}(x_{M_{a}}),\quad a\in\mathcal{A}_{\text{o}}^{-};\\
\tilde{v}_{a}\;&\colon\;\text{supported on }M_{a},\quad(\tilde{v}_{a})_{M_{a}}\;=\;\nabla_{x_{M_{a}}}\log P^{a}(x_{M_{a}}),\quad a\in\mathcal{A}_{\text{o}}^{-}.
\end{aligned}$$

Differentiating the $P^{0}(x)=P^{0}(x|_{M_{0}})\prod_{a\in\mathcal{A}_{\text{o}}^{-}}P^{0}(x|_{M_{a}})$ term from ([8](https://arxiv.org/html/2606.23920#S4.E8)) block-wise yields

$$S^{0}\;=\;\tilde{u}_{0}+\sum_{a\in\mathcal{A}_{\text{o}}^{-}}\tilde{u}_{a}.\tag{13}$$

For each $a\in\mathcal{A}_{\text{o}}^{-}$, expanding $\log P^{a}(x)=\log P^{a}(x_{M_{a}})+\log P^{0}(x_{M_{0}})+\sum_{a^{\prime}\neq a}\log P^{0}(x_{M_{a^{\prime}}})$ via ([8](https://arxiv.org/html/2606.23920#S4.E8)) and differentiating gives

$$S^{a}\;=\;\tilde{v}_{a}+\tilde{u}_{0}+\sum_{a^{\prime}\in\mathcal{A}_{\text{o}}^{-}\setminus\{a\}}\tilde{u}_{a^{\prime}}.\tag{14}$$

The blocks $\{M_{0}\}\cup\{M_{a}\}_{a\in\mathcal{A}_{\text{o}}^{-}}$ are all mutually disjoint, so vectors supported on different blocks are orthogonal in $\mathbb{R}^{n}$. The vectors $\tilde{u}_{a}$ and $\tilde{v}_{a}$ share support $M_{a}$ and are not in general orthogonal, but $\tilde{u}_{a}$ never appears in $S^{a}$, so this non-orthogonality plays no role in the calculations below.
##### Composed score\.
Substituting ([13](https://arxiv.org/html/2606.23920#A4.E13)) and ([14](https://arxiv.org/html/2606.23920#A4.E14)) and collecting coefficients block-by-block,

$$\begin{aligned}
w_{0}S^{0}+\sum_{a\in\mathcal{A}_{\text{o}}^{-}}w_{a}S^{a}
&=(1-k)\left(\tilde{u}_{0}+\sum_{a\in\mathcal{A}_{\text{o}}^{-}}\tilde{u}_{a}\right)+\sum_{a\in\mathcal{A}_{\text{o}}^{-}}\left(\tilde{v}_{a}+\tilde{u}_{0}+\sum_{a^{\prime}\neq a}\tilde{u}_{a^{\prime}}\right)\\
&=\big[(1-k)+k\big]\,\tilde{u}_{0}+\sum_{a\in\mathcal{A}_{\text{o}}^{-}}\big[(1-k)+(k-1)\big]\,\tilde{u}_{a}+\sum_{a\in\mathcal{A}_{\text{o}}^{-}}\tilde{v}_{a}\\
&=\tilde{u}_{0}+\sum_{a\in\mathcal{A}_{\text{o}}^{-}}\tilde{v}_{a}.
\end{aligned}$$

The summands $\tilde{u}_{0},\{\tilde{v}_{a}\}_{a\in\mathcal{A}_{\text{o}}^{-}}$ live on the distinct blocks $M_{0},\{M_{a}\}_{a\in\mathcal{A}_{\text{o}}^{-}}$, so by orthogonality

$$\Big\|w_{0}S^{0}+\sum_{a\in\mathcal{A}_{\text{o}}^{-}}w_{a}S^{a}\Big\|^{2}\;=\;\|\tilde{u}_{0}\|^{2}+\sum_{a\in\mathcal{A}_{\text{o}}^{-}}\|\tilde{v}_{a}\|^{2}.\tag{15}$$
##### Sum of weighted squared norms\.
Applying orthogonality to ([13](https://arxiv.org/html/2606.23920#A4.E13)) and ([14](https://arxiv.org/html/2606.23920#A4.E14)) (using that $\tilde{u}_{a}$ does not appear in $S^{a}$),

$$\|S^{0}\|^{2}=\|\tilde{u}_{0}\|^{2}+\sum_{a\in\mathcal{A}_{\text{o}}^{-}}\|\tilde{u}_{a}\|^{2},\qquad\|S^{a}\|^{2}=\|\tilde{v}_{a}\|^{2}+\|\tilde{u}_{0}\|^{2}+\sum_{a^{\prime}\neq a}\|\tilde{u}_{a^{\prime}}\|^{2}.$$

We can then form the weighted combination and collect coefficients,

$$\begin{aligned}
w_{0}\|S^{0}\|^{2}+\sum_{a\in\mathcal{A}_{\text{o}}^{-}}w_{a}\|S^{a}\|^{2}
&=[(1-k)+k]\,\|\tilde{u}_{0}\|^{2}+\sum_{a\in\mathcal{A}_{\text{o}}^{-}}[(1-k)+(k-1)]\,\|\tilde{u}_{a}\|^{2}+\sum_{a\in\mathcal{A}_{\text{o}}^{-}}\|\tilde{v}_{a}\|^{2}\\
&=\|\tilde{u}_{0}\|^{2}+\sum_{a\in\mathcal{A}_{\text{o}}^{-}}\|\tilde{v}_{a}\|^{2}.
\end{aligned}$$

Comparing with ([15](https://arxiv.org/html/2606.23920#A4.E15)) gives ([12](https://arxiv.org/html/2606.23920#A4.E12)), and combined with the vanishing divergence term, $g_{t}(x)=0$ for all $x\in\mathbb{R}^{n}$ and $t\in[0,1]$. ∎
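The cancellation in the proof can be checked numerically. The following is a minimal sketch, not the authors' code: random Gaussian vectors stand in for the block scores $\tilde{u}_0,\tilde{u}_a,\tilde{v}_a$, the function name `check_identity` is hypothetical, and the weights are taken as $w_0=1-k$, $w_a=1$ as in the proof.

```python
import random


def check_identity(block_sizes, seed=0):
    """Return (lhs, rhs) of identity (12) for random block-supported vectors.

    block_sizes[0] is |M_0|; the remaining entries are |M_a| for a in A_o^-.
    """
    rng = random.Random(seed)
    k = len(block_sizes) - 1
    n = sum(block_sizes)
    # Index ranges of the mutually disjoint blocks M_0, M_1, ..., M_k.
    starts = [sum(block_sizes[:i]) for i in range(k + 1)]
    blocks = [range(s, s + m) for s, m in zip(starts, block_sizes)]

    def random_on(block):
        # Vector in R^n that is zero outside the given block.
        v = [0.0] * n
        for i in block:
            v[i] = rng.gauss(0.0, 1.0)
        return v

    def add(*vectors):
        return [sum(coords) for coords in zip(*vectors)]

    def sqnorm(v):
        return sum(c * c for c in v)

    u = [random_on(b) for b in blocks]               # u[0] = u~_0, u[a] = u~_a
    v = [None] + [random_on(b) for b in blocks[1:]]  # v[a] = v~_a

    s0 = add(*u)                                     # eq. (13)
    # eq. (14): S^a replaces u~_a by v~_a on block M_a.
    sa = [add(v[a], *(u[j] for j in range(k + 1) if j != a))
          for a in range(1, k + 1)]

    w0, wa = 1 - k, 1
    composed = add([w0 * c for c in s0], *[[wa * c for c in s] for s in sa])
    lhs = sqnorm(composed)
    rhs = w0 * sqnorm(s0) + sum(wa * sqnorm(s) for s in sa)
    return lhs, rhs
```

For any choice of block sizes the two returned values should agree up to floating-point rounding, which is the content of ([12](https://arxiv.org/html/2606.23920#A4.E12)).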
###### Corollary 3\.1\.
Under the hypotheses of Theorem [3](https://arxiv.org/html/2606.23920#Thmtheorem3), FKC applied to the base composition ([7](https://arxiv.org/html/2606.23920#S4.E7)) is exactly equivalent to running the naïve denoising SDE ([4](https://arxiv.org/html/2606.23920#S3.E4)).
###### Proof\.
By Theorem [3](https://arxiv.org/html/2606.23920#Thmtheorem3), $g_{t}\equiv 0$, hence $\bar{g}_{t}\equiv 0$. The log-weight update ([10](https://arxiv.org/html/2606.23920#A3.E10)) then gives $\text{d}\omega_{t}=0$, so all log-weights remain at their initial value of $0$ throughout the trajectory. Systematic resampling proportional to $\exp(\omega_{t}^{(k)})$ therefore operates on uniform weights and selects every particle exactly once at every step, leaving the ensemble unchanged. Consequently, the resulting marginal coincides exactly with that of the naïve denoising SDE. ∎
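The resampling step of this argument can be illustrated with a short sketch. This is a generic textbook implementation of systematic resampling written for illustration, not the sampler from the paper's repository; the function name and the optional `u` argument (a fixed offset, useful for deterministic checks) are assumptions.

```python
import math
import random


def systematic_resample(log_weights, u=None, rng=random):
    """Return ancestor indices chosen by systematic resampling.

    A single uniform offset u in [0, 1) places K evenly spaced pointers
    (i + u) / K on the cumulative distribution of normalized weights.
    """
    k = len(log_weights)
    top = max(log_weights)
    weights = [math.exp(lw - top) for lw in log_weights]  # stable exponentiation
    total = sum(weights)
    if u is None:
        u = rng.random()
    indices = []
    j, cumulative = 0, weights[0] / total
    for i in range(k):
        position = (i + u) / k
        # Advance to the first particle whose cumulative weight covers the pointer.
        while position > cumulative and j < k - 1:
            j += 1
            cumulative += weights[j] / total
        indices.append(j)
    return indices
```

With all log-weights equal to zero, each pointer falls in a different particle's interval, so the returned indices are the identity permutation and the ensemble is unchanged. With one dominant weight, every pointer lands on that particle.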
## Appendix E Separate diffusion models
Instead of training a single diffusion model and composing its conditional distributions, it is possible to use separate diffusion models for the conditions, and compose in the same fashion via their product\. This is desirable since, for instance, pre\-trained diffusion models may possess different capabilities, and composing existing models may allow for greater control or improved performance on particular tasks than individually using any one pre\-trained model\.
To consider this approach, we re-run all of our experimental settings, individually training a separate diffusion model for each conditional distribution. Full results from these experiments are in [H.2](https://arxiv.org/html/2606.23920#A8.SS2) and [H.4](https://arxiv.org/html/2606.23920#A8.SS4).
We observe that despite learning the underlying conditional distributions to the same level of in-distribution accuracy as conditional models, the compositions of these individual experts perform worse in practice than their conditional model counterparts. This effect is mild for 2D Gaussians ([Table 3](https://arxiv.org/html/2606.23920#A5.T3)), but is far more pronounced in image generation ([Figure 3](https://arxiv.org/html/2606.23920#A5.F3)), where couches often dominate outputs. Furthermore, in out-of-distribution image generation settings, high particle counts cause estimation error to accumulate so quickly that some samples degenerate entirely. The only exception, where separate models outperform a single conditional model, is the out-of-distribution Gaussian mixture setting. In this case, the separate models appear more resilient to performance degradation as the number of particles continues to increase past $4$. This evidence suggests that implicit regularization through the weight sharing of the conditional diffusion model may play an important role in mitigating out-of-distribution score estimation error and enabling effective weighted compositions.
Table 3: The expected trends also hold for individual “expert” models, but overall performance is slightly worse and OOD estimation error accumulates more quickly. The table follows the same conventions and setup as [Table 1](https://arxiv.org/html/2606.23920#S5.T1), but with individual models in place of a single conditional model. Panels: (a) Factorized conditionals + ID; (b) Non-factorized conditionals + ID; (c) Factorized conditionals + OOD; (d) Non-factorized conditionals + OOD.

Figure 3: Separate diffusion models compose less effectively than a single conditional diffusion model. Setup is identical to [Figure 2](https://arxiv.org/html/2606.23920#S5.F2), except composition is performed on separate models instead of on a single conditional model.
## Appendix F Experiment setup details
### F.1 2D Gaussian
In this section, we provide additional experimental details. All experimental code is additionally provided at [https://github.com/DSoiffer/compositional-diffusion](https://github.com/DSoiffer/compositional-diffusion). All code is run on GPUs, either NVIDIA L40S or A6000. The most expensive training code, which is for the image generation models, can be run on a single GPU in around 9 hours. For each table, training and sampling across all configurations and across 30 runs consumes roughly 7 GPU hours. GPU-based computation is the primary computational bottleneck; memory and CPU requirements are minimal in comparison, though about 48GB of GPU memory is recommended for best performance.
##### Architecture\.
For each distribution we train a noise-prediction network $\epsilon_{\theta}(t,x)$ implemented as a 4-layer MLP with hidden width 512 and SiLU activations, predicting $\epsilon\in\mathbb{R}^{2}$ from the concatenation $[t,x]$. In the conditional setting a single network shared across $\{P^{a_{1}},P^{a_{2}},P^{0}\}$ takes $[t,x,\mathrm{onehot}(c)]$ as input, where $c\in\{0,1,2\}$ identifies the source distribution. For separate models, $c$ is omitted. The score is recovered as $s(t,x)=-\epsilon_{\theta}(t,x)/\sigma(t)$. We note that these architectures are highly overparameterized for the particular learning task, consistent with many modern setups.
##### Training objective\.
We use denoising score matching in $\epsilon$-prediction form [Ho et al., [2020](https://arxiv.org/html/2606.23920#bib.bib44)].
##### Optimizer and schedule\.
Adam [Kingma and Ba, [2014](https://arxiv.org/html/2606.23920#bib.bib41)] with learning rate $2\times 10^{-4}$, default $(\beta_{1},\beta_{2})=(0.9,0.999)$, $\varepsilon=10^{-8}$, no weight decay, and no learning-rate schedule. Batch size is set to $512$. Training is run for $20{,}000$ iterations in the single-shot pipeline and $20{,}000$ iterations per cell in the sweep. A dataset of $N$ points is created before training begins and sampled with replacement during training.
##### Noising schedule\.
Variance-preserving (VP) SDE [Song et al., [2021](https://arxiv.org/html/2606.23920#bib.bib35)] with linear $\beta(\tau)=\beta_{\min}+\tau(\beta_{\max}-\beta_{\min})$, $\beta_{\min}=0.1$, $\beta_{\max}=20$, on the forward time $\tau\in[0,1]$. (Note that for implementation cleanliness, we use the opposite convention from the rest of our paper: $\tau=1-t$, so that $t=0$ is pure noise and $t=1$ is clean data.)
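As a concrete reading of this schedule, the sketch below evaluates the signal and noise scales of the VP SDE in closed form. It assumes the standard VP marginals $\alpha(\tau)=\exp\!\big(-\tfrac{1}{2}\int_{0}^{\tau}\beta(s)\,\mathrm{d}s\big)$ and $\sigma(\tau)=\sqrt{1-\alpha(\tau)^{2}}$; the function names are hypothetical and are not taken from the authors' repository.

```python
import math

BETA_MIN, BETA_MAX = 0.1, 20.0


def beta(tau):
    """Linear noise rate on forward time tau in [0, 1]."""
    return BETA_MIN + tau * (BETA_MAX - BETA_MIN)


def alpha_sigma(tau):
    """Signal scale alpha and noise scale sigma at forward time tau."""
    # Closed-form integral of the linear beta from 0 to tau.
    integral = BETA_MIN * tau + 0.5 * (BETA_MAX - BETA_MIN) * tau ** 2
    alpha = math.exp(-0.5 * integral)
    sigma = math.sqrt(1.0 - alpha ** 2)
    return alpha, sigma


def alpha_sigma_reverse(t):
    """Same scales indexed by t = 1 - tau (t = 0 noise, t = 1 clean data)."""
    return alpha_sigma(1.0 - t)
```

Under these assumptions $\alpha^{2}+\sigma^{2}=1$ at every time, and at $\tau=1$ the signal scale is $\exp(-5.025)$, so the terminal marginal is very close to a standard Gaussian.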
##### Sampling\.
We implement the Feynman–Kac corrector sampler [Skreta et al., [2025](https://arxiv.org/html/2606.23920#bib.bib24)], running $N$ independent swarms of $K$ particles in parallel. We perform systematic resampling within every swarm at every step, and at the final step draw one sample from each swarm’s weighted ensemble. We use $500$ integration steps. We report metrics on $N=5000$ output samples per configuration.
##### Sliced Wasserstein\-2\.
Sliced $\mathcal{W}_{2}$ is computed via the Python Optimal Transport package [Flamary et al., [2021](https://arxiv.org/html/2606.23920#bib.bib43), [2024](https://arxiv.org/html/2606.23920#bib.bib42)], with $p=2$ and $2000$ random projections per evaluation.
##### Maximum Mean Discrepancy\.
We report the unbiased $\mathrm{MMD}^{2}$ U-statistic with the Gaussian RBF kernel $k(x,y)=\exp\!\big(-\gamma\|x-y\|_{2}^{2}\big)$. The bandwidth $\gamma$ is set by the median heuristic on the empirical cross-pairwise Euclidean distances between the two sample sets, with $\gamma=1/(2\,\mathrm{med}^{2})$.
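A small pure-Python sketch of this estimator follows. It is an illustration of the unbiased U-statistic with the median-heuristic bandwidth described above, not the evaluation code used for the reported tables; the helper names and the toy sample sizes are assumptions, and it presumes the two sample sets are not identical (so the median distance is nonzero).

```python
import math
import random
import statistics


def sq_dist(x, y):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mmd2_unbiased(xs, ys):
    """Unbiased MMD^2 U-statistic with an RBF kernel and median-heuristic gamma."""
    m, n = len(xs), len(ys)
    # Median of cross-pairwise Euclidean distances between the two sets.
    med = statistics.median(math.sqrt(sq_dist(x, y)) for x in xs for y in ys)
    gamma = 1.0 / (2.0 * med ** 2)

    def k(x, y):
        return math.exp(-gamma * sq_dist(x, y))

    # Within-set sums exclude the diagonal terms i == j (unbiased estimator).
    k_xx = sum(k(xs[i], xs[j]) for i in range(m) for j in range(m) if i != j)
    k_yy = sum(k(ys[i], ys[j]) for i in range(n) for j in range(n) if i != j)
    k_xy = sum(k(x, y) for x in xs for y in ys)
    return k_xx / (m * (m - 1)) + k_yy / (n * (n - 1)) - 2.0 * k_xy / (m * n)


def gaussian_cloud(n, mean, rng):
    """n points from a 2D unit-variance Gaussian centered at (mean, mean)."""
    return [(rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)) for _ in range(n)]


rng = random.Random(0)
reference = gaussian_cloud(150, 0.0, rng)
mmd_same = mmd2_unbiased(gaussian_cloud(150, 0.0, rng), reference)
mmd_shifted = mmd2_unbiased(gaussian_cloud(150, 3.0, rng), reference)
```

Because the estimator is unbiased, `mmd_same` fluctuates around zero and can be slightly negative, while `mmd_shifted` is clearly positive.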
##### Reference samples\.
Because the target is a closed-form Gaussian, we draw ground-truth samples directly via its diagonal covariance. Sliced $\mathcal{W}_{2}$ and $\mathrm{MMD}^{2}$ both use $5000$ ground-truth samples per evaluation.
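The closed-form target can be reproduced with exact arithmetic. This sketch assumes the product composition $P^{w}\propto P^{a_{1}}P^{a_{2}}/P^{0}$ (weights $+1,+1,-1$) for zero-mean Gaussians with diagonal covariance, so that precisions combine coordinate-wise; the function name is illustrative and does not come from the authors' code.

```python
from fractions import Fraction


def composed_diag_variance(var_a1, var_a2, var_0):
    """Per-coordinate variance of P^w for zero-mean diagonal Gaussians.

    Each argument lists the diagonal variances of one source distribution.
    """
    out = []
    for v1, v2, v0 in zip(var_a1, var_a2, var_0):
        # Log-densities add with weights (+1, +1, -1), so precisions do too.
        precision = 1 / Fraction(v1) + 1 / Fraction(v2) - 1 / Fraction(v0)
        if precision <= 0:
            raise ValueError("composition is not a normalizable Gaussian")
        out.append(1 / precision)
    return out
```

With $P^{a_{1}}=\mathcal{N}(0,\mathrm{diag}(10,1))$ and $P^{a_{2}}=\mathcal{N}(0,\mathrm{diag}(1,10))$, choosing $P^{0}$ with variance $10$, $20$, $1$, or $1.1$ per coordinate gives target variances $1$, $20/21$, $10$, and $110/21$ respectively, matching the values listed in the table captions of Appendix H.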
##### Seeds\.
The base seed is $1$ and is applied at the start of each run. When $N$ runs are performed, run $r$ ($0\leq r<N$) uses seed $1+r$, which seeds every random component of the process. We average over $n_{\text{runs}}=30$ independent runs and report mean $\pm$ standard deviation.
##### Clipping\.
During the FKC denoising process, we introduce an optional $\texttt{g\_clip}$ parameter, which clips the $g_{t}(x)$ increment ([11](https://arxiv.org/html/2606.23920#A3.E11)) at every step to $[-\texttt{g\_clip},\texttt{g\_clip}]$. Clipping generally reduces score estimation error by preventing weights from concentrating on erroneously high values, but can prevent score approximation error from decreasing as quickly as it would otherwise, particularly for oracle or well-learned models. In practice, we find this approach to be an effective method of trading off score estimation error for score approximation error. For 2D Gaussian and Mixture of Gaussians experiments, we set $\texttt{g\_clip}=15.0$, and leave it unset for image generation.
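The clipping rule itself is a one-line operation. The sketch below is a minimal illustration in which `None` stands for the unset case; the function name is hypothetical, and how the clipped increment enters the log-weight update is left to ([10](https://arxiv.org/html/2606.23920#A3.E10)).

```python
def clip_increment(g, g_clip=None):
    """Clip a scalar increment g to [-g_clip, g_clip]; no-op when g_clip is None."""
    if g_clip is None:
        return g
    return max(-g_clip, min(g_clip, g))
```

For example, with `g_clip=15.0` an erroneous increment of `40.0` contributes only `15.0`, while increments already inside the interval pass through unchanged.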
### F.2 Mixture of Gaussians
Unless otherwise explicitly noted, all neural network architectures, noising schedules, and optimization hyperparameters for the GMM experiments are identical to those described for the 2D Gaussian setting in Appendix [F.1](https://arxiv.org/html/2606.23920#A6.SS1).
##### Rejection sampling \(In\-Distribution\)
To obtain exact samples from the target distribution $P^{w}(x)\propto\frac{P^{a_{1}}(x)P^{a_{2}}(x)}{P^{0}(x)}$, we utilize an exact rejection sampling scheme. We designate $P^{a_{1}}(x)$ as the proposal distribution, while $P^{a_{2}}(x)$ and $P^{0}(x)$ serve as the numerator ratio factor and the denominator, respectively. Mathematically, the modes of $P^{a_{2}}$ are a subset of the modes of $P^{0}$. To compute the rejection bound analytically, we structure their implementations to share the exact same underlying GMM components (identical means and isotropic variances), where $P^{a_{2}}$ assigns a mixture weight of zero to any mode outside its subset. Because $P^{a_{2}}(x)$ and $P^{0}(x)$ now differ only in their mixture weight arrays (denoted as $w_{a_{2}}$ and $w_{0}$), the tight upper bound $M=\sup_{x}\frac{P^{a_{2}}(x)}{P^{0}(x)}$ can be computed exactly as $M=\max_{k}\frac{w_{a_{2}}[k]}{w_{0}[k]}$. Candidate samples are drawn from the proposal $x\sim P^{a_{1}}(x)$ and are subsequently accepted with probability $\frac{P^{a_{2}}(x)}{M\cdot P^{0}(x)}$.
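The scheme can be sketched on a toy one-dimensional mixture. Everything concrete here is an assumption made for illustration (the component means, the shared standard deviation, the weight arrays, and the function names); only the structure, a proposal from $P^{a_{1}}$ and acceptance with probability $P^{a_{2}}(x)/(M\cdot P^{0}(x))$ with $M=\max_{k}w_{a_{2}}[k]/w_{0}[k]$, follows the description above.

```python
import math
import random

# Hypothetical shared components: identical means and one isotropic std.
MEANS = [-4.0, 0.0, 4.0]
STD = 0.5
W_0 = [1 / 3, 1 / 3, 1 / 3]   # base mixture weights
W_A1 = [0.5, 0.5, 0.0]        # proposal P^{a_1}
W_A2 = [0.0, 0.5, 0.5]        # P^{a_2}: zero weight outside its mode subset


def gmm_pdf(x, weights):
    """Density of the shared-component mixture with the given weights."""
    norm = STD * math.sqrt(2.0 * math.pi)
    return sum(w * math.exp(-0.5 * ((x - mu) / STD) ** 2) / norm
               for w, mu in zip(weights, MEANS))


def gmm_sample(weights, rng):
    """Draw one point: pick a component by weight, then add Gaussian noise."""
    mu = rng.choices(MEANS, weights=weights)[0]
    return rng.gauss(mu, STD)


# Rejection bound from the weight arrays alone.
M = max(wa / w0 for wa, w0 in zip(W_A2, W_0) if w0 > 0)


def rejection_sample(n, rng):
    """Draw n samples from the density proportional to P^{a_1} P^{a_2} / P^0."""
    accepted = []
    while len(accepted) < n:
        x = gmm_sample(W_A1, rng)
        accept_prob = gmm_pdf(x, W_A2) / (M * gmm_pdf(x, W_0))
        if rng.random() < accept_prob:
            accepted.append(x)
    return accepted
```

In this toy setup the only mode shared by $P^{a_{1}}$ and $P^{a_{2}}$ is the one at $0$, so accepted samples concentrate there and roughly half of the proposals are accepted.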
##### Importance Sampling \(Out\-of\-Distribution\)
In the out-of-distribution settings, the constituent GMMs do not share component means, rendering the analytical rejection bound intractable. For these cases, we employ importance sampling to obtain asymptotically exact samples. We compute the analytical product of all numerator GMMs to serve as the proposal distribution. We draw a heavily oversampled batch of candidate points from this proposal ($1000\times$ the target sample size) and assign each point a log-importance weight equal, up to an additive constant, to $-\sum_{j}\log p_{j}(x)$, where the $p_{j}$ are the denominator distributions. Finally, we perform systematic resampling based on these weights to extract $n$ unweighted samples approximating the target distribution.
### F.3 Objects in a room
For the sake of reproducibility, we provide full details here for our dataset construction and our model training and sampling methodologies.
##### Dataset construction\.
To create our dataset of $256\times 256$ images, we prompt the open-source text-to-image model FLUX.1-schnell [Labs, [2024](https://arxiv.org/html/2606.23920#bib.bib63), Labs et al., [2025](https://arxiv.org/html/2606.23920#bib.bib62)] with a set of custom prompts. The prompt for generating an empty room is:
> A photograph of an empty living room with plain white walls and wooden floors\. The room has a large window, and it is sunny outside\. The room is completely empty\. It contains no furniture, no decorations, no plants, and no other objects\. Completely undecorated\. Abandoned but clean\. The photo is wide angle, showing the entire room and how it is empty\.
Class\-conditioned prompts form variations on this, replacing the line
> The room is completely empty\.
as follows:
- couch: The room is completely empty, except for a (couch:1.4)
- coffee table: The room is completely empty, except for a (coffee table:1.4)
- framed painting: The room is completely empty, except for a (framed painting:1.4) on the wall.
- couch + coffee table: The room is completely empty, except for a (couch:1.4) and a (coffee table:1.4).
- couch + framed painting: The room is completely empty, except for a (couch:1.4) and a (framed painting:1.4) on the wall.

Note that the numbers inside this prompt are interpreted by FLUX.1-schnell as token weights and not as tokens, affording us more control over its prompt adherence.
After generating a set of candidate images, we manually label $1000$ images from each class as "accept" if they meet the prompt’s description and do not suffer from artifacting, and "reject" otherwise. We allow small incidental objects built into the room, such as radiators or small ceiling lights.
We embed the labeled images with DINOv2-large [Oquab et al., [2024](https://arxiv.org/html/2606.23920#bib.bib65)]. These embeddings are used to train a separate binary logistic regression classifier for each class to automatically accept or reject images. These classifiers are run on the remaining samples (generating more as necessary) until we achieve $2000$ accepted images per class, which then form our dataset.
##### Architecture\.
Class-conditional 2-D U-Net [Ronneberger et al., [2015](https://arxiv.org/html/2606.23920#bib.bib66)] (from the diffusers library [von Platen et al., [2022](https://arxiv.org/html/2606.23920#bib.bib64)]) with input/output channels $3$, sample size $256$, two residual blocks per level, channel multipliers $(128,256,256,512)$, down-blocks (Down, Down, AttnDown, Down), up-blocks (Up, AttnUp, Up, Up), and a learned class embedding indexed by the class (or condition) integer. The total number of class embeddings equals the number of training labels for each task, i.e. $4$.
##### Forward process\.
Discrete DDPM [Ho et al., [2020](https://arxiv.org/html/2606.23920#bib.bib44)] with $T=1000$ training timesteps and a linear $\beta$ schedule [Nichol and Dhariwal, [2021](https://arxiv.org/html/2606.23920#bib.bib67)] via diffusers.DDPMScheduler. We use the $v$-prediction parametrization [Salimans and Ho, [2022](https://arxiv.org/html/2606.23920#bib.bib69)], along with terminal-SNR rescaling [Lin et al., [2024](https://arxiv.org/html/2606.23920#bib.bib68)] to remove all residual signal at the terminal noise level. To obtain the scores needed for FKC sampling, we convert the $v$-prediction $v_{t}^{a}$ to a score via $S_{t}^{a}=-\epsilon^{a}_{t}/\sigma(t)$ with $\epsilon^{a}_{t}=\sigma(t)\,x_{t}+\alpha(t)\,v^{a}_{t}$. We use $100$ integration steps at inference time.
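The conversion from $v$-prediction to a score can be written out directly. The sketch below works on plain Python lists rather than image tensors and assumes the usual $v$-prediction definition $v=\alpha\,\epsilon-\sigma\,x_{0}$ with $x_{t}=\alpha\,x_{0}+\sigma\,\epsilon$ and $\alpha^{2}+\sigma^{2}=1$; the function name is hypothetical.

```python
def score_from_v(x_t, v, alpha, sigma):
    """Convert a v-prediction into a score estimate, coordinate by coordinate.

    Uses eps = sigma * x_t + alpha * v, then score = -eps / sigma.
    Requires sigma > 0.
    """
    eps = [sigma * x + alpha * vi for x, vi in zip(x_t, v)]
    return [-e / sigma for e in eps]
```

As a consistency check, substituting the definitions gives $\sigma x_{t}+\alpha v=(\sigma^{2}+\alpha^{2})\,\epsilon=\epsilon$, so a perfect $v$-prediction recovers the noise exactly and the returned value equals $-\epsilon/\sigma$.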
##### Optimizer and schedule\.
AdamW [Loshchilov and Hutter, [2019](https://arxiv.org/html/2606.23920#bib.bib70)] with learning rate $1\times 10^{-4}$, $(\beta_{1},\beta_{2})=(0.9,0.999)$, $\varepsilon=10^{-8}$, no weight decay. Cosine learning-rate schedule with linear warmup over $500$ steps into $\text{epochs}\times\text{steps\_per\_epoch}$ total updates. Mixed-precision training in bfloat16. An exponential moving average of the model weights with decay $0.9999$ is maintained throughout training; sampling uses the EMA copy.
##### Data pipeline and hyperparameters\.
Per-image transform: RandomHorizontalFlip, ToTensor, Normalize(mean, std), where $(\text{mean},\text{std})$ is the fixed $(0.5,0.5,0.5)$ pair. We balance classes during training according to the mixture probabilities of each condition. Batch size $16$, $50$ epochs.
## Appendix G Additional room images
In this section, we provide additional samples from the compositions of learned diffusion models. These images are shown in [Figure 4](https://arxiv.org/html/2606.23920#A7.F4).
Figure 4: Additional samples from composed conditional diffusion models. Setup and trends are the same as reported in [Figure 2](https://arxiv.org/html/2606.23920#S5.F2).
## Appendix H Tables
### H.1 Two-dimensional Gaussian (conditional model)
Table 4: Factorized Conditionals + In-Distribution Composition. $P^{a_{1}}=\mathcal{N}(0,\mathrm{diag}(10,1))$, $P^{a_{2}}=\mathcal{N}(0,\mathrm{diag}(1,10))$, $P^{0}=\mathcal{N}(0,10I)$, $\mathbf{P}^{w}=\mathcal{N}(0,I)$: $\mathrm{SW}_{2}$ (mean $\pm$ std over 30 runs)

Table 5: Factorized Conditionals + In-Distribution Composition. $P^{a_{1}}=\mathcal{N}(0,\mathrm{diag}(10,1))$, $P^{a_{2}}=\mathcal{N}(0,\mathrm{diag}(1,10))$, $P^{0}=\mathcal{N}(0,10I)$, $\mathbf{P}^{w}=\mathcal{N}(0,I)$: $\mathrm{MMD}^{2}$ (mean $\pm$ std over 30 runs)

Table 6: Non-Factorized Conditionals + In-Distribution Composition. $P^{a_{1}}=\mathcal{N}(0,\mathrm{diag}(10,1))$, $P^{a_{2}}=\mathcal{N}(0,\mathrm{diag}(1,10))$, $P^{0}=\mathcal{N}(0,20I)$, $\mathbf{P}^{w}=\mathcal{N}(0,\frac{20}{21}I)$: $\mathrm{SW}_{2}$ (mean $\pm$ std over 30 runs)

Table 7: Non-Factorized Conditionals + In-Distribution Composition. $P^{a_{1}}=\mathcal{N}(0,\mathrm{diag}(10,1))$, $P^{a_{2}}=\mathcal{N}(0,\mathrm{diag}(1,10))$, $P^{0}=\mathcal{N}(0,20I)$, $\mathbf{P}^{w}=\mathcal{N}(0,\frac{20}{21}I)$: $\mathrm{MMD}^{2}$ (mean $\pm$ std over 30 runs)

Table 8: Factorized Conditionals + Out-Of-Distribution Composition. $P^{a_{1}}=\mathcal{N}(0,\mathrm{diag}(10,1))$, $P^{a_{2}}=\mathcal{N}(0,\mathrm{diag}(1,10))$, $P^{0}=\mathcal{N}(0,I)$, $\mathbf{P}^{w}=\mathcal{N}(0,10I)$: $\mathrm{SW}_{2}$ (mean $\pm$ std over 30 runs)

Table 9: Factorized Conditionals + Out-Of-Distribution Composition. $P^{a_{1}}=\mathcal{N}(0,\mathrm{diag}(10,1))$, $P^{a_{2}}=\mathcal{N}(0,\mathrm{diag}(1,10))$, $P^{0}=\mathcal{N}(0,I)$, $\mathbf{P}^{w}=\mathcal{N}(0,10I)$: $\mathrm{MMD}^{2}$ (mean $\pm$ std over 30 runs)

Table 10: Non-Factorized Conditionals + Out-Of-Distribution Composition. $P^{a_{1}}=\mathcal{N}(0,\mathrm{diag}(10,1))$, $P^{a_{2}}=\mathcal{N}(0,\mathrm{diag}(1,10))$, $P^{0}=\mathcal{N}(0,1.1I)$, $\mathbf{P}^{w}=\mathcal{N}(0,\frac{110}{21}I)$: $\mathrm{SW}_{2}$ (mean $\pm$ std over 30 runs)

Table 11: Non-Factorized Conditionals + Out-Of-Distribution Composition. $P^{a_{1}}=\mathcal{N}(0,\mathrm{diag}(10,1))$, $P^{a_{2}}=\mathcal{N}(0,\mathrm{diag}(1,10))$, $P^{0}=\mathcal{N}(0,1.1I)$, $\mathbf{P}^{w}=\mathcal{N}(0,\frac{110}{21}I)$: $\mathrm{MMD}^{2}$ (mean $\pm$ std over 30 runs)
### H.2 Two-dimensional Gaussian (separate models)
Table 12: Factorized Conditionals + In-Distribution Composition. Separate models, $P^{a_{1}}=\mathcal{N}(0,\mathrm{diag}(10,1))$, $P^{a_{2}}=\mathcal{N}(0,\mathrm{diag}(1,10))$, $P^{0}=\mathcal{N}(0,10I)$, $\mathbf{P}^{w}=\mathcal{N}(0,I)$: $\mathrm{SW}_{2}$ (mean $\pm$ std over 30 runs)

Table 13: Factorized Conditionals + In-Distribution Composition. Separate models, $P^{a_{1}}=\mathcal{N}(0,\mathrm{diag}(10,1))$, $P^{a_{2}}=\mathcal{N}(0,\mathrm{diag}(1,10))$, $P^{0}=\mathcal{N}(0,10I)$, $\mathbf{P}^{w}=\mathcal{N}(0,I)$: $\mathrm{MMD}^{2}$ (mean $\pm$ std over 30 runs)

Table 14: Non-Factorized Conditionals + In-Distribution Composition. Separate models, $P^{a_{1}}=\mathcal{N}(0,\mathrm{diag}(10,1))$, $P^{a_{2}}=\mathcal{N}(0,\mathrm{diag}(1,10))$, $P^{0}=\mathcal{N}(0,20I)$, $\mathbf{P}^{w}=\mathcal{N}(0,\frac{20}{21}I)$: $\mathrm{SW}_{2}$ (mean $\pm$ std over 30 runs)

Table 15: Non-Factorized Conditionals + In-Distribution Composition. Separate models, $P^{a_{1}}=\mathcal{N}(0,\mathrm{diag}(10,1))$, $P^{a_{2}}=\mathcal{N}(0,\mathrm{diag}(1,10))$, $P^{0}=\mathcal{N}(0,20I)$, $\mathbf{P}^{w}=\mathcal{N}(0,\frac{20}{21}I)$: $\mathrm{MMD}^{2}$ (mean $\pm$ std over 30 runs)

Table 16: Factorized Conditionals + Out-Of-Distribution Composition. Separate models, $P^{a_{1}}=\mathcal{N}(0,\mathrm{diag}(10,1))$, $P^{a_{2}}=\mathcal{N}(0,\mathrm{diag}(1,10))$, $P^{0}=\mathcal{N}(0,I)$, $\mathbf{P}^{w}=\mathcal{N}(0,10I)$: $\mathrm{SW}_{2}$ (mean $\pm$ std over 30 runs)

Table 17: Factorized Conditionals + Out-Of-Distribution Composition. Separate models, $P^{a_{1}}=\mathcal{N}(0,\mathrm{diag}(10,1))$, $P^{a_{2}}=\mathcal{N}(0,\mathrm{diag}(1,10))$, $P^{0}=\mathcal{N}(0,I)$, $\mathbf{P}^{w}=\mathcal{N}(0,10I)$: $\mathrm{MMD}^{2}$ (mean $\pm$ std over 30 runs)

Table 18: Non-Factorized Conditionals + Out-Of-Distribution Composition. Separate models, $P^{a_{1}}=\mathcal{N}(0,\mathrm{diag}(10,1))$, $P^{a_{2}}=\mathcal{N}(0,\mathrm{diag}(1,10))$, $P^{0}=\mathcal{N}(0,1.1I)$, $\mathbf{P}^{w}=\mathcal{N}(0,\frac{110}{21}I)$: $\mathrm{SW}_{2}$ (mean $\pm$ std over 30 runs)

Table 19: Non-Factorized Conditionals + Out-Of-Distribution Composition. Separate models, $P^{a_{1}}=\mathcal{N}(0,\mathrm{diag}(10,1))$, $P^{a_{2}}=\mathcal{N}(0,\mathrm{diag}(1,10))$, $P^{0}=\mathcal{N}(0,1.1I)$, $\mathbf{P}^{w}=\mathcal{N}(0,\frac{110}{21}I)$: $\mathrm{MMD}^{2}$ (mean $\pm$ std over 30 runs)
### H.3 Gaussian mixture models (conditional model)
Table 20: GMM, In-Distribution ($\mathrm{SW}_{2}$ mean $\pm$ std over 30 runs)

Table 21: GMM, In-Distribution ($\mathrm{MMD}^{2}$ mean $\pm$ std over 30 runs)

Table 22: GMM, Out-of-Distribution ($\mathrm{SW}_{2}$ mean $\pm$ std over 30 runs)

Table 23: GMM, Out-of-Distribution ($\mathrm{MMD}^{2}$ mean $\pm$ std over 30 runs)
### H.4 Gaussian mixture models (separate models)
Table 24: GMM, In-Distribution. Separate models ($\mathrm{SW}_{2}$ mean $\pm$ std over 30 runs)

Table 25: GMM, In-Distribution. Separate models ($\mathrm{MMD}^{2}$ mean $\pm$ std over 30 runs)

Table 26: GMM, Out-of-Distribution. Separate models ($\mathrm{SW}_{2}$ mean $\pm$ std over 30 runs)

Table 27: GMM, Out-of-Distribution. Separate models ($\mathrm{MMD}^{2}$ mean $\pm$ std over 30 runs)